# 3.0.0. EDA

### Methodology

As part of our data preparation and understanding phase, we conducted an extensive exploratory data analysis (EDA) on the training dataset using the `ydata_profiling` package (formerly `pandas_profiling`). This tool automates the EDA and generates a detailed report that includes:

- **Statistics:** Descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution.
- **Correlations:** Analysis of the pairwise relationships between variables, identifying which features are most strongly correlated with each other and with the target variable.
- **Missing Values:** Identification and visualization of missing data patterns, which informs the choice of preprocessing steps.
- **Distributions:** Visualizations of feature distributions and variance, revealing skewness and outliers that might influence model performance.
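
The four report sections listed above have rough plain-pandas counterparts, which can help when reading the profiling report. The following is a minimal sketch on synthetic data, not the report's own implementation; the frame `toy_df` and its column names (`income`, `age`, `target`) are hypothetical stand-ins for the training set.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
toy_df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=n),  # right-skewed feature
    "age": rng.integers(18, 70, size=n).astype(float),
})
toy_df["target"] = (toy_df["income"] > toy_df["income"].median()).astype(int)
toy_df.loc[toy_df.index[:20], "age"] = np.nan  # inject 10% missing values

# Statistics: count, mean, std, and quantiles per column.
summary = toy_df.describe().T

# Correlations: features ranked by absolute Pearson correlation with the target.
target_corr = (
    toy_df.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)

# Missing values: fraction of missing entries per column.
missing_ratio = toy_df.isna().mean()

# Distributions: sample skewness per column (positive means a long right tail).
skewness = toy_df.skew()

print(summary[["count", "mean", "std"]])
print(target_corr)
print(missing_ratio)
print(skewness)
```

These quick checks do not replace the report, but they make it easy to confirm an individual finding without regenerating the full profile.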


In [1]:
import yaml
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from pathlib import Path
from ydata_profiling import ProfileReport



### 1. Load Data 

In [2]:
with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)
    
numeric_features = config["features"]["numerical"]
features = numeric_features
target = config["main"]["target"]
data_train_path = Path.cwd().parent / config["main"]["data_train_path"]
train_validation_path = Path.cwd().parent / config["main"]["data_validation_path"]

train_df = pd.read_pickle(data_train_path)
train_df.shape

(9479, 291)
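
The loading cell reads four keys from `config.yaml`: `features.numerical`, `main.target`, `main.data_train_path`, and `main.data_validation_path`. A small check like the one below can surface a missing key before any data is loaded. This is a sketch under assumptions: the helper `missing_config_keys` and the values in `example_config` are hypothetical, and only the key names come from the cell above.

```python
# Keys the loading cell reads, grouped by top-level section.
REQUIRED_KEYS = {
    "features": ["numerical"],
    "main": ["target", "data_train_path", "data_validation_path"],
}


def missing_config_keys(config):
    """Return dotted names of required keys absent from a parsed config dict."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        section_values = config.get(section) or {}
        for key in keys:
            if key not in section_values:
                missing.append(f"{section}.{key}")
    return missing


# Hypothetical parsed config; the values are placeholders, not the project's.
example_config = {
    "features": {"numerical": ["feature_a", "feature_b"]},
    "main": {"target": "target", "data_train_path": "data/train.pkl"},
}

problems = missing_config_keys(example_config)
print(problems)  # ['main.data_validation_path']
```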

In [None]:
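report_name = "application_train_report.html"
report_file = Path.cwd() / report_name

# explorative=True selects the library's explorative configuration
profile = ProfileReport(train_df, title="Pandas Profiling Report", explorative=True)
# Write the HTML report to the path built above
profile.to_file(report_file)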


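
Profiling a frame with 291 columns can take a long time, because some analyses scale with the number of column pairs. One possible way to shorten the run is to profile only the features most correlated with the target and to use the library's `minimal=True` mode. The sketch below is an illustration, not the approach used in this notebook: the helpers `top_correlated_columns` and `build_light_report`, the demo frame, and the choice of `k` are all assumptions.

```python
import numpy as np
import pandas as pd


def top_correlated_columns(df, target, k):
    """Return the k numeric columns with the highest |Pearson r| to target, plus target."""
    corr = df.select_dtypes("number").corr()[target].drop(target).abs()
    return corr.nlargest(k).index.tolist() + [target]


def build_light_report(df, title):
    """Build a lighter profile; ydata_profiling is imported only when this is called."""
    from ydata_profiling import ProfileReport

    # minimal=True loads the library's minimal configuration,
    # which turns off the more expensive computations.
    return ProfileReport(df, title=title, minimal=True)


# Synthetic demo frame; column names are hypothetical.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    rng.normal(size=(500, 6)), columns=[f"f{i}" for i in range(6)]
)
demo_df["target"] = 2.0 * demo_df["f3"] + rng.normal(scale=0.1, size=500)

subset_cols = top_correlated_columns(demo_df, "target", k=3)
print(subset_cols)

# Usage (not executed here):
# build_light_report(demo_df[subset_cols], "Subset report").to_file("subset_report.html")
```

Selecting by correlation with the target discards features whose relationship to the target is non-linear, so the full report remains the reference when time allows.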