# Task 2 — Exploratory Data Analysis (EDA)

Goal: Explore the dataset to understand distributions, missing values, correlations, and outliers,
and document key insights relevant to credit risk modeling.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from src.utils.io import load_csv

sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

df = load_csv("../data/raw/data.csv")
df.head()


In [None]:
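# Basic shape, column names, and dtypes
# (a bare expression is only displayed when it is the last line of a cell, so print explicitly)
print(df.shape)
print(df.columns.tolist())
df.info()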


In [None]:
# Summary statistics (including categorical counts)
df.describe(include="all").T


In [None]:
# Missing values summary
missing = df.isna().sum().sort_values(ascending=False)
missing[missing > 0]


## Missing Value Strategy (Proposed)
- Numerical columns: median imputation (robust to outliers)
- Categorical columns: fill with "Unknown" or most-frequent category
- Datetime columns: parse with `errors="coerce"`, then decide whether to drop or impute the resulting `NaT` values
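
The proposed strategy could be sketched roughly as follows. This is a minimal illustration on a toy frame, not the project's actual preprocessing: the `impute_missing` helper and the `amount`, `channel`, and `start_time` columns are hypothetical names chosen for the example, and it uses the "Unknown" option rather than the most-frequent category.

```python
import numpy as np
import pandas as pd


def impute_missing(frame: pd.DataFrame, datetime_cols=()) -> pd.DataFrame:
    """Return a copy of `frame` with the proposed imputation rules applied."""
    out = frame.copy()

    # Datetime columns: unparseable values become NaT and are left for a later decision.
    for col in datetime_cols:
        out[col] = pd.to_datetime(out[col], errors="coerce")

    # Numerical columns: fill gaps with the column median.
    num_cols = out.select_dtypes(include=np.number).columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())

    # Categorical columns: fill gaps with an explicit "Unknown" label.
    cat_cols = out.select_dtypes(include="object").columns
    out[cat_cols] = out[cat_cols].fillna("Unknown")
    return out


# Toy data (hypothetical columns); object dtype is set explicitly for the text column.
toy = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 1000.0],
    "channel": pd.Series(["web", None, "app", "web"], dtype="object"),
    "start_time": ["2019-01-01", "not a date", "2019-01-03", None],
})
clean = impute_missing(toy, datetime_cols=["start_time"])
print(clean)
```

The median is used instead of the mean so that a single extreme value (1000.0 in the toy data) does not pull the imputed value upward.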


In [None]:
# Plot distributions for numeric columns
num_cols = df.select_dtypes(include=np.number).columns.tolist()
print(num_cols)
for col in num_cols:
    sns.histplot(df[col].dropna(), kde=True)
    plt.title(f"Distribution: {col}")
    plt.show()


In [None]:
# Boxplots to inspect outliers
for col in num_cols:
    sns.boxplot(x=df[col].dropna())
    plt.title(f"Outliers (Boxplot): {col}")
    plt.show()


In [None]:
# Categorical distributions
cat_cols = df.select_dtypes(include="object").columns.tolist()
print(cat_cols)
for col in cat_cols:
    # Compute the top-15 counts once and reuse them for both the table and the plot
    top_counts = df[col].value_counts(dropna=False).head(15)
    display(top_counts)
    top_counts.plot(kind="bar")
    plt.title(f"Top categories: {col}")
    plt.ylabel("Count")
    plt.show()


In [None]:
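# Correlation matrix (Pearson); fix the color scale to [-1, 1] so zero maps to the neutral color
corr = df[num_cols].corr()
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1, center=0, annot=False)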
plt.title("Correlation Matrix (Numerical Features)")
plt.show()


## Key Insights (Top 3–5)

1. **Skewness & transaction scale:** Numerical transaction variables are right-skewed: most transactions are small, with a few extreme values.
2. **Outliers:** Outliers appear in monetary fields; robust scaling or winsorization may be needed before modeling.
3. **Category concentration:** In some categorical features a few categories account for most records, which suggests opportunities for behavioral segmentation.
4. **Feature relationships:** Correlation analysis shows that some monetary features move together; feature selection may be needed to remove redundant ones.
5. **Missingness risk:** Missing values in identifiers or monetary fields should be handled carefully to avoid bias and leakage.
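
As a rough illustration of insights 1, 2, and 4, the sketch below measures skewness, applies quantile-based winsorization, and lists highly correlated feature pairs. It runs on synthetic data, not the project dataset: the `winsorize` and `correlated_pairs` helpers, the 1%/99% clipping quantiles, the 0.9 threshold, and the column names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def winsorize(series: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip values outside the given quantiles to the quantile boundaries."""
    lo, hi = series.quantile([lower, upper])
    return series.clip(lower=lo, upper=hi)


def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic, right-skewed "monetary" data (hypothetical columns).
rng = np.random.default_rng(0)
amount = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1_000))
toy = pd.DataFrame({
    "amount": amount,
    "value": amount * 1.02,             # nearly a copy of amount
    "noise": rng.normal(size=1_000),    # unrelated feature
})

skew_before = toy["amount"].skew()
skew_after = winsorize(toy["amount"]).skew()
high_corr = correlated_pairs(toy, threshold=0.9)

print(f"skew before: {skew_before:.2f}, after winsorizing: {skew_after:.2f}")
print(high_corr)
```

Clipping at quantiles keeps every row while limiting the influence of extreme values; the quantiles and the correlation threshold should be tuned against the real data.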


## Next steps and tests
- This notebook loads the raw CSV with the `load_csv` helper imported from `src.utils.io`.
- To run unit tests locally, activate your virtual environment and run `python -m pytest -q`.
- The insights are documented above; use this notebook as the canonical EDA deliverable.
