# 3.0.0. EDA

### Methodology

As part of the data preparation and understanding phase, we ran an exploratory data analysis (EDA) on the training dataset using the `ydata_profiling` package (formerly `pandas_profiling`). The tool automates much of the EDA and generates a detailed report that includes:

- Statistics: Descriptive statistics that summarize each feature's central tendency, dispersion, and distribution shape.
- Correlations: Analysis of the relationships between features, identifying the most strongly correlated feature pairs and, where the target is part of the profiled data, the features most strongly correlated with it.
- Missing Values: Identification and visualization of missing-data patterns, which helps decide which preprocessing steps are necessary.
- Distributions: Visualizations of each feature's distribution and variance, exposing skewness and outliers that might influence model performance.
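
For orientation, the four report sections listed above roughly correspond to a handful of plain pandas calls. The following is a minimal sketch on a small hypothetical frame (`toy_df` and its columns are illustrative assumptions, not the project's data), and it is not a description of how the profiling package computes its report internally:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), standing in for the numeric training features.
toy_df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 100.0],      # one large outlier, long right tail
        "b": [2.0, 4.0, 6.0, 8.0, 200.0],      # exactly 2 * a
        "c": [5.0, np.nan, 3.0, np.nan, 1.0],  # 2 of 5 values missing
    }
)

# Statistics: count, mean, std, min, quartiles, and max per column.
summary = toy_df.describe()

# Correlations: pairwise Pearson correlation matrix between columns.
corr = toy_df.corr()

# Missing values: fraction of missing entries per column.
missing_fraction = toy_df.isna().mean()

# Distributions: sample skewness per column (positive means a long right tail).
skewness = toy_df.skew()

print(summary)
print(corr)
print(missing_fraction)
print(skewness)
```

The automated report adds plots and interaction views on top of these numbers, which is the main reason to prefer it for a frame with many columns.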

Note: `profile.to_file()` generates an HTML profiling report, but the resulting file is too large to commit to a GitHub repository, so the call is left commented out.

In [12]:
import yaml
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from pathlib import Path
from ydata_profiling import ProfileReport

### 1. Load Data 

In [13]:
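# Load the project configuration (feature lists, target name, data paths).
with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

numeric_features = config["features"]["numerical"]
features = numeric_features  # only the numerical features are profiled
target = config["main"]["target"]
# The training data path is resolved relative to the parent of the working directory.
data_train_path = Path.cwd().parent / config["main"]["data_train_path"]

# Load the pickled training frame and keep only the selected feature columns.
train_df = pd.read_pickle(data_train_path)[features]
train_df.shape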

(9479, 142)
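
The loading cell reads three keys from `config.yaml`: `features.numerical`, `main.target`, and `main.data_train_path`. As a sketch of the layout this implies, the snippet below shows an equivalent Python dictionary and a small check for missing keys. The values are placeholders, and `example_config`, `REQUIRED_KEYS`, and `missing_config_keys` are hypothetical names introduced for illustration only:

```python
# Hypothetical layout inferred from the keys the loading cell reads.
# The values are placeholders, not the project's real settings.
example_config = {
    "main": {
        "target": "target_column",
        "data_train_path": "data/train.pkl",
    },
    "features": {
        "numerical": ["feature_1", "feature_2"],
    },
}

# (section, key) pairs that the loading cell accesses.
REQUIRED_KEYS = [
    ("main", "target"),
    ("main", "data_train_path"),
    ("features", "numerical"),
]


def missing_config_keys(config):
    """Return the required (section, key) pairs that are absent from config."""
    missing = []
    for section, key in REQUIRED_KEYS:
        # Treat a missing or empty section as an empty mapping.
        section_values = config.get(section) or {}
        if key not in section_values:
            missing.append((section, key))
    return missing


print(missing_config_keys(example_config))
```

Running such a check right after `yaml.safe_load` would surface a misnamed key as a readable list instead of a `KeyError` partway through the cell.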

In [14]:
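# to_file() writes HTML (or JSON), so the report name uses an .html extension.
report_name = "application_train_report.html"
report_file = Path.cwd() / report_name

profile = ProfileReport(train_df, title="Pandas Profiling Report", explorative=True)
# Left commented out: the generated HTML is too large to commit to GitHub.
# profile.to_file(report_file)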
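
Since the full HTML report is too large to commit, one possible workaround is to generate a lighter report. The sketch below assumes two things that are not part of the original notebook: profiling a random sample of rows, which mainly cuts computation time, and the package's `minimal=True` mode, which switches off the more expensive computations. The helper names `sample_rows` and `write_minimal_report` are hypothetical, and there is no guarantee the result is small enough for a given repository:

```python
from pathlib import Path

import pandas as pd


def sample_rows(df: pd.DataFrame, max_rows: int = 2000, seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if it is small enough, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


def write_minimal_report(df: pd.DataFrame, out_path: Path) -> None:
    """Write a reduced HTML profile; requires ydata_profiling to be installed."""
    # Imported lazily so the sampling helper can be used without the package.
    from ydata_profiling import ProfileReport

    profile = ProfileReport(df, title="Pandas Profiling Report (minimal)", minimal=True)
    profile.to_file(out_path)
```

A call such as `write_minimal_report(sample_rows(train_df), Path.cwd() / "application_train_report_minimal.html")` would then produce the reduced file. The trade-off is that a minimal report omits some of the correlation and interaction views that the explorative report provides.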