#### Exploratory Data Analysis (EDA) - Initial Data Visualization: Distribution and Relations

Dataset: 
- _xxx_clean.csv_
- _yyy_clean.csv_
- _zzz_clean.csv_

Author: Luis Sergio Pastrana Lemus  
Date: 2025-MM-DD

# Exploratory Data Analysis – Name XXX Dataset

## __1. Libraries__.

In [None]:
from pathlib import Path
import sys

# Define the project root dynamically: take the notebook's current working directory and move one level up
project_root = Path.cwd().parent

# Add the project root to sys.path if it is not already there, so that the src package is importable
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import the project's helper functions (wildcard import; importing explicit names would be more controlled)
from src import *


from IPython.display import display, HTML
import os
import pandas as pd
import numpy as np

## __2. Path to Data file__.

In [None]:
# Build the path to the data directory and load the dataset
data_file_path = project_root / "data" / "processed" / "clean"

df_xxx_clean = load_dataset_from_csv(data_file_path, "xxx_clean.csv", header='infer', parse_dates=['join_date'])

# data_file_path = project_root / "data" / "processed" / "feature"

# df_xxx_feature = load_dataset_from_csv(data_file_path, "xxx_feature.csv", sep=',', header='infer')
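
The loading step above relies on the project helper `load_dataset_from_csv`, whose implementation is not shown in this notebook. As a rough sketch only, a helper of that shape could be a thin wrapper around `pandas.read_csv`. The name `load_csv_sketch`, its error handling, and the demo file are assumptions made for illustration; the real helper may behave differently.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_csv_sketch(folder: Path, filename: str, **read_csv_kwargs) -> pd.DataFrame:
    """Hypothetical stand-in for load_dataset_from_csv; the real helper may differ."""
    file_path = Path(folder) / filename
    if not file_path.is_file():
        raise FileNotFoundError(f"Data file not found: {file_path}")
    # Extra keyword arguments (header, parse_dates, sep, ...) go straight to pandas
    return pd.read_csv(file_path, **read_csv_kwargs)


# Round-trip a tiny made-up file to show parse_dates in action
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "demo.csv").write_text("id,join_date\n1,2025-01-15\n2,2025-02-01\n")
    df_demo = load_csv_sketch(Path(tmp_dir), "demo.csv", header="infer", parse_dates=["join_date"])

print(df_demo.dtypes)
```

Passing `parse_dates` at load time means the date column arrives as a datetime dtype, so no separate conversion step is needed for it later.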

In [None]:
# Format notebook output
format_notebook()

## __3. Exploratory Data Analysis__.

### 3.0 Casting Data types.

In [None]:
# Call the dtype-casting functions from features.py and mark missing values consistently with pd.NA

# missing values to pd.NA
df_xxx_clean = replace_missing_values(df_xxx_clean, include=['column_name'])

# object to string
df_xxx_clean = cast_datatypes(df_xxx_clean, 'string', c_include=['column_name'])

# object to numeric
df_xxx_clean = cast_datatypes(df_xxx_clean, 'numeric', numeric_type='Float64', c_include=['column_name'])

# object to category
df_xxx_clean = cast_datatypes(df_xxx_clean, 'category', c_include=['category', 'column_name'])

# object to datetime
df_xxx_clean['date'] = pd.to_datetime(df_xxx_clean['column_name'], errors='coerce', utc=True)
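
For reference, the casting steps above can be approximated with plain pandas. This is a minimal sketch, assuming that `replace_missing_values` and `cast_datatypes` do roughly what their names suggest; the toy frame, its column names, and the list of placeholder tokens are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame; column names are placeholders, not from the real dataset
df_demo = pd.DataFrame({
    "name": ["ana", "N/A", "luis"],
    "amount": ["10.5", "", "3"],
    "segment": ["a", "b", "a"],
    "joined": ["2025-01-15", "not a date", "2025-03-01"],
})

# Placeholder tokens -> pd.NA (assumed token list), then object -> nullable string
df_demo["name"] = df_demo["name"].replace(["N/A", ""], pd.NA).astype("string")

# object -> nullable float; unparseable entries become <NA>
df_demo["amount"] = pd.to_numeric(df_demo["amount"], errors="coerce").astype("Float64")

# object -> category
df_demo["segment"] = df_demo["segment"].astype("category")

# object -> timezone-aware datetime; unparseable entries become NaT
df_demo["joined"] = pd.to_datetime(df_demo["joined"], errors="coerce", utc=True)

print(df_demo.dtypes)
```

Using the nullable `string` and `Float64` dtypes keeps missing values as `pd.NA` instead of mixing `None` and `NaN`, which makes later missing-value counts consistent across columns.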

### 3.2 Data Visualization: Distributions and Relationships.

#### 3.2.1 Covariance and Correlation Analysis.

##### 3.2.1.1 Covariance Matrix.

In [None]:
# Covariance for xxx
df_xxx_clean[['column_name_01', 'column_name_02']].cov()
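
To make the covariance output easier to read, the sketch below reproduces one entry of `DataFrame.cov()` by hand on synthetic data and shows that covariance depends on the units of the columns. The data and the column names `col_a` and `col_b` are invented for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; not taken from the real dataset
rng = np.random.default_rng(seed=0)
x = rng.normal(size=200)
df_demo = pd.DataFrame({
    "col_a": x,
    "col_b": 2.0 * x + rng.normal(scale=0.5, size=200),
})

# Sample covariance matrix (pandas normalizes by N - 1 by default)
cov_matrix = df_demo.cov()

# Manual computation of the off-diagonal entry
dev_a = df_demo["col_a"] - df_demo["col_a"].mean()
dev_b = df_demo["col_b"] - df_demo["col_b"].mean()
manual_cov = (dev_a * dev_b).sum() / (len(df_demo) - 1)

# Rescaling one column by 100 rescales its covariance by 100 as well
cov_scaled = df_demo.assign(col_b=df_demo["col_b"] * 100).cov()

print(cov_matrix)
print(manual_cov)
```

Because the magnitude changes with the scale of the columns, covariance is mainly useful for its sign; the correlation matrix in the next subsection is easier to compare across column pairs.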

##### 3.2.1.2 Correlation Matrix.

| Correlation Value     | Interpretation                |
| --------------------- | ----------------------------- |
| `+0.7` to `+1.0`      | Strong positive correlation   |
| `+0.3` to `+0.7`      | Moderate positive correlation |
| `0.0` to `+0.3`       | Weak positive correlation     |
| `0`                   | No correlation                |
| `-0.3` to `0`         | Weak negative correlation     |
| `-0.7` to `-0.3`      | Moderate negative correlation |
| `-1.0` to `-0.7`      | Strong negative correlation   |
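
The bands in the table can also be applied programmatically. The helper below is a small sketch; the name `interpret_correlation` is hypothetical, and assigning the boundary values (±0.3, ±0.7) to the stronger band is an assumption, since the table's ranges share their endpoints.

```python
def interpret_correlation(r: float) -> str:
    """Map a correlation coefficient to the labels used in the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    if r == 0:
        return "No correlation"
    direction = "positive" if r > 0 else "negative"
    magnitude = abs(r)
    # Boundary values go to the stronger band (assumption; the table leaves this open)
    if magnitude >= 0.7:
        strength = "Strong"
    elif magnitude >= 0.3:
        strength = "Moderate"
    else:
        strength = "Weak"
    return f"{strength} {direction} correlation"


for value in (0.85, -0.5, 0.1, 0.0):
    print(value, "->", interpret_correlation(value))
```

These thresholds are a rule of thumb; what counts as a strong correlation depends on the domain and the sample size.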


In [None]:
# Correlation for xxx
evaluate_correlation(df_xxx_clean, columns=['column_name_01', 'column_name_02', 'column_name_03'])

In [None]:
# Scatter matrix for xxx
plot_scatter_matrixpx(df_xxx_clean, columns=['column_name_01', 'column_name_02', 'column_name_03'])