# 🔬 Loan Default Prediction: Exploratory Data Analysis

**Author:** Vasile-Marian Danci  
**Date:** 2026-03-01  

---

### 🎯 Objective

> Describe the goal of this analysis in one or two sentences.

---
## 📦 1 · Imports

Import all required packages here. Keep standard-library, third-party, and local imports separated.

In [None]:
import math
import warnings
from pathlib import Path

# Suppress library warnings to keep cell output readable (this also hides deprecation notices)
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

# Visualisation: swap with plotly / altair / any library you prefer
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid", palette="muted")
plt.rcParams["figure.figsize"] = (12, 6)

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)

---
## ⚙️ 2 · Configuration

Define all configurable parameters (paths, constants, column names) in one place so the notebook is easy to adapt across projects.

In [None]:
# ── Resolve project directory automatically ─────────────────────────────
# Works in VS Code, JupyterLab, and classic Jupyter Notebook.
_nb_path = globals().get("__vsc_ipynb_file__")  # VS Code injects this
if _nb_path:
    PROJECT_DIR = Path(_nb_path).resolve().parent.parent  # notebooks/ → project/
else:
    # Browser Jupyter sets CWD to the notebook's directory.
    _cwd = Path.cwd()
    PROJECT_DIR = next(
        (p for p in [_cwd, *_cwd.parents]
         if (p / "data").is_dir() and (p / "notebooks").is_dir()),
        _cwd,
    )

# ── Dataset ─────────────────────────────────────────────────────────────
# Just set the filename; the full path is resolved automatically.
DATA_FILE = "dataset.csv"  # TODO: replace with your dataset filename
DATA_PATH = PROJECT_DIR / "data" / DATA_FILE

# TODO: Set the name of the target column in your dataset
TARGET_COL = "target"

# Task type (drives conditional behaviour throughout the notebook):
#   "regression"      → histograms, scatter plots, correlation analysis
#   "classification"  → bar charts, box plots per class, class balance checks
TASK = "regression"  # or "classification"

print(f"📁 Project dir: {PROJECT_DIR}")
print(f"📄 Data path:   {DATA_PATH}  (exists: {DATA_PATH.exists()})")
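
The upward search used in the fallback branch above can be wrapped in a small helper, which makes it easy to check in isolation. This is a minimal sketch using only the standard library; the function name `find_project_root` and the `markers` parameter are illustrative assumptions, not part of the notebook.

```python
from pathlib import Path
import tempfile


def find_project_root(start, markers=("data", "notebooks")):
    """Return the nearest ancestor of `start` (or `start` itself) that contains
    every directory named in `markers`; fall back to `start` if none does."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if all((candidate / name).is_dir() for name in markers):
            return candidate
    return start


# Demo on a throwaway directory tree: <tmp>/project/{data,notebooks}
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve() / "project"
    (project / "data").mkdir(parents=True)
    (project / "notebooks").mkdir()
    # Starting from notebooks/, the helper should climb one level to project/
    root_found = find_project_root(project / "notebooks") == project

print(root_found)
```

Falling back to the starting directory mirrors the notebook's behaviour: a wrong root then shows up as `exists: False` in the printed data path rather than as an exception.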

---
## 📂 3 · Load Data

Load the raw dataset and take a first look at its shape, types, and sample rows.

In [None]:
df = pd.read_csv(DATA_PATH)
df.head()

In [None]:
# --- Data quality summary card ---
n_rows, n_cols = df.shape
dtypes_breakdown = df.dtypes.value_counts().to_dict()
total_missing = df.isnull().sum().sum()
total_cells = n_rows * n_cols
missing_pct = (total_missing / total_cells * 100)
n_duplicates = df.duplicated().sum()
mem_mb = df.memory_usage(deep=True).sum() / 1024**2

print("=" * 50)
print("  📋 DATA QUALITY SUMMARY")
print("=" * 50)
print(f"  Rows:            {n_rows:,}")
print(f"  Columns:         {n_cols:,}")
print(f"  Dtypes:          {dtypes_breakdown}")
print(f"  Missing values:  {total_missing:,} ({missing_pct:.2f}%)")
print(f"  Duplicate rows:  {n_duplicates:,}")
print(f"  Memory usage:    {mem_mb:.2f} MB")
print("=" * 50)
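
The summary card reports totals for the whole table. A per-column view is often a useful complement when deciding which columns need attention. The sketch below assumes pandas and NumPy are available; `column_profile` is a hypothetical helper name and the toy frame is illustrative data, not the loan dataset.

```python
import numpy as np
import pandas as pd


def column_profile(frame):
    """One row per column: dtype, missing count, missing share (%), distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isnull().sum(),
        "pct_missing": (frame.isnull().mean() * 100).round(2),
        "n_unique": frame.nunique(dropna=True),
    })


# Toy data for illustration only
toy = pd.DataFrame({
    "income": [1000.0, np.nan, 3000.0, 3000.0],
    "grade": ["A", "B", None, "A"],
})
profile = column_profile(toy)
print(profile)
```

In the notebook this would be called as `column_profile(df)` right after loading.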

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.describe(include="object")

---
## 🎯 4 · Target Variable

Separate features (`X`) and target (`y`) early.  
By convention in ML, **`X`** denotes the feature matrix and **`y`** the target vector. The notation comes from the statistical model $y = f(X) + \varepsilon$ and is used by scikit-learn, XGBoost, LightGBM, and most other ML libraries.

In [None]:
# Separate features (X) and target (y)
# By ML convention: X = features, y = target
X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL]

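print(f"Features shape: {X.shape}")
print(f"Target: '{TARGET_COL}'  |  dtype: {y.dtype}")
# Mean and median are only defined for numeric targets (string class labels would raise)
if pd.api.types.is_numeric_dtype(y):
    print(f"mean: {y.mean():.2f}  |  median: {y.median():.2f}")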

In [None]:
# Target distribution (conditional on TASK)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

if TASK == "regression":
    y.hist(bins=30, edgecolor="black", ax=axes[0])
    axes[0].set_title(f"{TARGET_COL} - Distribution")
    axes[0].set_xlabel(TARGET_COL)
    axes[0].set_ylabel("Count")
    axes[1].boxplot(y.dropna())  # vertical is the default orientation
    axes[1].set_title(f"{TARGET_COL} - Box Plot")
else:  # classification
    counts = y.value_counts().sort_index()
    counts.plot(kind="bar", edgecolor="black", ax=axes[0])
    axes[0].set_title(f"{TARGET_COL} - Class Distribution")
    axes[0].set_xlabel(TARGET_COL)
    axes[0].set_ylabel("Count")
    axes[0].tick_params(axis="x", rotation=0)
    # Class balance as percentage
    pct = (counts / counts.sum() * 100).round(1)
    pct.plot(kind="bar", edgecolor="black", ax=axes[1])
    axes[1].set_title(f"{TARGET_COL} - Class Balance (%)")
    axes[1].set_ylabel("%")
    axes[1].tick_params(axis="x", rotation=0)

plt.tight_layout()
plt.show()
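
For the classification branch, the class balance shown in the bar charts can also be reduced to a small table and a single ratio, which is easier to quote in the summary section. This is a sketch under the assumption that pandas is available; `class_balance` and `imbalance_ratio` are hypothetical helper names and the 9:1 toy target is made up for illustration.

```python
import pandas as pd


def class_balance(target):
    """Count and percentage share per class, largest class first."""
    counts = target.value_counts(dropna=False)
    return pd.DataFrame({
        "count": counts,
        "pct": (counts / counts.sum() * 100).round(1),
    })


def imbalance_ratio(target):
    """Majority-class count divided by minority-class count (1.0 means balanced)."""
    counts = target.value_counts()
    return float(counts.max() / counts.min())


# Toy target: nine negatives and one positive
toy_y = pd.Series([0] * 9 + [1])
balance = class_balance(toy_y)
ratio = imbalance_ratio(toy_y)
print(balance)
print(f"imbalance ratio: {ratio:.1f}")
```

In the notebook both helpers would take `y` directly. What ratio counts as problematic depends on the model and metric, so no threshold is hard-coded here.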

---
## 🕳️ 5 · Missing Values

In [None]:
missing = X.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
missing_pct = (missing / len(X) * 100).round(2)

if missing.empty:
    print("No missing values 🎉")
else:
    print(pd.DataFrame({"count": missing, "% of total": missing_pct}).to_string())
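
Once the missing-value table is available, a common follow-up is to flag columns whose missing share exceeds a cut-off. The sketch below assumes pandas and NumPy; the helper name `columns_above_missing` and the 0.6 default are illustrative assumptions, chosen to match the ">60% missing" example in the summary template, and the toy frame is not real data.

```python
import numpy as np
import pandas as pd


def columns_above_missing(frame, threshold=0.6):
    """Names of columns whose share of missing values is strictly above `threshold` (0 to 1)."""
    share = frame.isnull().mean()
    return share[share > threshold].index.tolist()


# Toy data for illustration only
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "some_missing": [1.0, np.nan, 3.0, 4.0],          # 25% missing
    "complete": [1, 2, 3, 4],
})
to_drop = columns_above_missing(toy, threshold=0.6)
reduced = toy.drop(columns=to_drop)
print(to_drop)
```

In the notebook the equivalent call would be `columns_above_missing(X)`, with the result reviewed by hand before anything is dropped.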

---
## 🧹 6 · Data Cleaning

Handle missing values, fix dtypes, remove duplicates, drop irrelevant columns.

In [None]:
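# Drop duplicate feature rows and keep the target aligned with the remaining rows
n_dup = X.duplicated().sum()
print(f"Duplicates found: {n_dup}")
X = X.drop_duplicates()
y = y.loc[X.index]  # otherwise X and y differ in length and scatter(X[col], y) fails later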

# TODO: Drop irrelevant columns (IDs, row numbers, etc.)
# X = X.drop(columns=["id"], errors="ignore")

# TODO: Handle missing values
# X["col"] = X["col"].fillna(X["col"].median())

# TODO: Fix data types
# X["col"] = X["col"].astype("category")
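
The commented-out `fillna` line above handles one column at a time. A simple table-wide variant is sketched below: median for numeric columns, most frequent value for everything else. It assumes pandas and NumPy; `impute_simple` is a hypothetical helper name and the toy frame is illustrative.

```python
import numpy as np
import pandas as pd


def impute_simple(frame):
    """Return a copy where numeric NaNs are filled with the column median and
    other NaNs with the column's most frequent value."""
    out = frame.copy()
    for col in out.columns:
        if not out[col].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:  # an all-missing column has no mode; leave it untouched
                out[col] = out[col].fillna(modes.iloc[0])
    return out


# Toy data for illustration only
toy = pd.DataFrame({
    "amount": [100.0, np.nan, 300.0, 500.0],
    "purpose": ["car", None, "car", "home"],
})
imputed = impute_simple(toy)
print(imputed)
```

This is adequate for exploration. For modelling, the fill values should be learned on the training split only, so the final imputation belongs in the training pipeline rather than in this notebook.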

---
## 📊 7 · Exploratory Data Analysis: Univariate

Distribution of individual features. Separate numeric from categorical columns using `select_dtypes`, a standard pandas pattern.

In [None]:
# Separate feature types (standard naming used across notebooks & train.py)
numeric_features = X.select_dtypes(include="number").columns  # also catches int32 / float32
categorical_features = X.select_dtypes(include=["object", "category"]).columns

print(f"Numeric features:     {len(numeric_features)}")
print(f"Categorical features: {len(categorical_features)}")

In [None]:
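# Numeric distributions (3 per row)
COLS_PER_ROW = 3
n_num = len(numeric_features)
# Separate name (n_rows already holds the dataset row count); at least one row so plt.subplots() does not fail
n_num_rows = max(1, math.ceil(n_num / COLS_PER_ROW))

fig, axes = plt.subplots(n_num_rows, COLS_PER_ROW, figsize=(5 * COLS_PER_ROW, 4 * n_num_rows))
axes = np.atleast_1d(axes).ravel()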

for i, col in enumerate(numeric_features):
    X[col].hist(bins=30, edgecolor="black", ax=axes[i])
    axes[i].set_title(col, fontsize=10)
    axes[i].tick_params(labelsize=8)

# Hide unused subplots
for j in range(n_num, len(axes)):
    axes[j].set_visible(False)

fig.suptitle("Numeric Feature Distributions", fontsize=14, y=1.01)
plt.tight_layout()
plt.show()

In [None]:
# Categorical value counts, 2 per row (skip high-cardinality columns)
HIGH_CARD_THRESHOLD = 20

plot_cats = [c for c in categorical_features if X[c].nunique() <= HIGH_CARD_THRESHOLD]
skipped = [c for c in categorical_features if X[c].nunique() > HIGH_CARD_THRESHOLD]
if skipped:
    print(f"⚠️  Skipping high-cardinality columns: {', '.join(skipped)}")

CAT_COLS_PER_ROW = 2
n_cats = len(plot_cats)
n_cat_rows = max(1, math.ceil(n_cats / CAT_COLS_PER_ROW))  # at least one row so plt.subplots() does not fail

fig, axes = plt.subplots(n_cat_rows, CAT_COLS_PER_ROW,
                         figsize=(7 * CAT_COLS_PER_ROW, 4 * n_cat_rows))
axes = np.array(axes).flatten()

for i, col in enumerate(plot_cats):
    counts = X[col].value_counts().head(15)
    counts.sort_values().plot(kind="barh", ax=axes[i])
    axes[i].set_title(col, fontsize=10)
    axes[i].set_xlabel("Count")
    axes[i].tick_params(labelsize=8)

for j in range(n_cats, len(axes)):
    axes[j].set_visible(False)

fig.suptitle("Categorical Feature Distributions", fontsize=14, y=1.01)
plt.tight_layout()
plt.show()

---
## 🔗 8 · Exploratory Data Analysis: Bivariate / Multivariate

In [None]:
# Correlation matrix (numeric features only)
corr = X[numeric_features].corr()

fig, ax = plt.subplots(figsize=(14, 10))
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=False, cmap="coolwarm",
            center=0, square=True, linewidths=0.5, ax=ax)
ax.set_title("Correlation Matrix")
plt.tight_layout()
plt.show()
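
With many features the heatmap gets hard to read, so it can help to list the strongly correlated pairs explicitly. The sketch below assumes pandas and NumPy; `high_corr_pairs` and its 0.9 default threshold are illustrative assumptions, and the toy frame is made up.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame, threshold=0.9):
    """List each pair of numeric columns once, keeping pairs with |Pearson r| >= threshold."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the strict upper triangle so each pair appears once and the diagonal is skipped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().reset_index()
    pairs.columns = ["feature_a", "feature_b", "r"]
    # NaN entries (masked cells, constant columns) compare as False and are filtered out here
    pairs = pairs[pairs["r"].abs() >= threshold]
    return pairs.sort_values("r", key=abs, ascending=False).reset_index(drop=True)


# Toy data: b is an exact multiple of a, c is only weakly related to either
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = high_corr_pairs(toy, threshold=0.9)
print(pairs)
```

In the notebook the call would be `high_corr_pairs(X[numeric_features])`. Pairs near |r| = 1 are candidates for the multicollinearity note in the summary.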

In [None]:
# Top correlations with target
target_corr = X[numeric_features].corrwith(y).sort_values(ascending=False)
print("Top positive correlations with target:")
print(target_corr.head(10).to_string())
print("\nTop negative correlations with target:")
print(target_corr.tail(5).to_string())
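
`corrwith` needs a numeric target, so a classification target stored as string labels has to be encoded first. One possible approach for the two-class case is sketched below: it maps the labels to 0/1 codes with `pd.factorize` before computing Pearson correlations. The helper name `target_correlations` and the toy data are assumptions for illustration.

```python
import pandas as pd


def target_correlations(features, target):
    """Pearson correlation of each numeric feature with the target.
    A non-numeric target with exactly two classes is mapped to 0/1 codes first."""
    if not pd.api.types.is_numeric_dtype(target):
        codes, classes = pd.factorize(target, sort=True)
        if len(classes) != 2:
            raise ValueError("expected a numeric target or a target with exactly two classes")
        target = pd.Series(codes, index=target.index)
    numeric = features.select_dtypes(include="number")
    return numeric.corrwith(target).sort_values(ascending=False)


# Toy data for illustration only
toy_X = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "z": [4.0, 3.0, 2.0, 1.0]})
toy_y = pd.Series(["no", "no", "yes", "yes"])
toy_corr = target_correlations(toy_X, toy_y)
print(toy_corr)
```

With `sort=True` the codes follow the sorted label order, so the sign of each correlation is relative to the alphabetically later class (`"yes"` in the toy example).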

In [None]:
# Target vs top numeric features (grid layout)
top_n = 5
top_features = target_corr.abs().sort_values(ascending=False).head(top_n).index.tolist()

BIV_COLS = 3
biv_rows = math.ceil(len(top_features) / BIV_COLS)
fig, axes = plt.subplots(biv_rows, BIV_COLS, figsize=(6 * BIV_COLS, 5 * biv_rows))
axes = np.array(axes).flatten()

for i, col in enumerate(top_features):
    if TASK == "regression":
        axes[i].scatter(X[col], y, alpha=0.3, edgecolors="k", linewidths=0.3)
        axes[i].set_xlabel(col)
        axes[i].set_ylabel(TARGET_COL)
        axes[i].set_title(f"{col} vs {TARGET_COL}")
    else:  # classification
        sns.stripplot(x=y, y=X[col], ax=axes[i], alpha=0.3, jitter=True)
        axes[i].set_title(f"{col} by {TARGET_COL} class")

for j in range(len(top_features), len(axes)):
    axes[j].set_visible(False)

fig.suptitle(f"Top {top_n} Features vs {TARGET_COL}", fontsize=14, y=1.01)
plt.tight_layout()
plt.show()

In [None]:
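# Target vs categorical features: mean target per category (2 per row)
# NOTE: the mean requires a numeric target (e.g. 0/1 labels for classification)
CAT_BIV_COLS = 2
cat_biv_rows = max(1, math.ceil(len(plot_cats) / CAT_BIV_COLS))  # at least one row for plt.subplots()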

fig, axes = plt.subplots(cat_biv_rows, CAT_BIV_COLS,
                         figsize=(7 * CAT_BIV_COLS, 4 * cat_biv_rows))
axes = np.array(axes).flatten()

for i, col in enumerate(plot_cats):
    means = df.groupby(col)[TARGET_COL].mean().sort_values(ascending=True)
    means.plot(kind="barh", ax=axes[i])
    axes[i].set_title(f"Mean {TARGET_COL} by {col}", fontsize=10)
    axes[i].set_xlabel(f"Mean {TARGET_COL}")
    axes[i].tick_params(labelsize=8)

for j in range(len(plot_cats), len(axes)):
    axes[j].set_visible(False)

fig.suptitle(f"Mean {TARGET_COL} by Category", fontsize=14, y=1.01)
plt.tight_layout()
plt.show()

In [None]:
import sys
print(sys.executable)

In [None]:
# Pairplot of the top_n numeric features most correlated with the target
pair_cols = top_features + [TARGET_COL]
pair_df = df[pair_cols].dropna()

g = sns.pairplot(pair_df, corner=True, plot_kws={"alpha": 0.3, "s": 10})
g.figure.suptitle(f"Pairplot - Top {top_n} Correlated Features", y=1.02)
plt.show()

---
## 🚨 9 · Outlier Detection

In [None]:
# IQR-based outlier summary
def outlier_report(dataframe, cols):
    records = []
    for col in cols:
        Q1 = dataframe[col].quantile(0.25)
        Q3 = dataframe[col].quantile(0.75)
        IQR = Q3 - Q1
        lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
        n_out = ((dataframe[col] < lower) | (dataframe[col] > upper)).sum()
        records.append({"feature": col, "n_outliers": n_out,
                        "% outliers": round(n_out / len(dataframe) * 100, 2)})
    return pd.DataFrame(records).sort_values("n_outliers", ascending=False)

outlier_report(X, numeric_features)
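
The report above only counts values outside the IQR fences. If capping turns out to be appropriate, one option is to clip values at the same fences instead of dropping rows. The sketch below assumes pandas; `iqr_bounds` and `clip_to_iqr` are hypothetical helper names, and `k=1.5` matches the multiplier used in `outlier_report`.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Lower and upper fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_to_iqr(values, k=1.5):
    """Cap values outside the fences at the fence; NaNs and row count are unchanged."""
    lower, upper = iqr_bounds(values, k)
    return values.clip(lower=lower, upper=upper)


# Toy series: Q1 = 2, Q3 = 4, IQR = 2, so the fences are -1 and 7
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
lower, upper = iqr_bounds(toy)
clipped = clip_to_iqr(toy)
print(lower, upper)
print(clipped.tolist())
```

Whether to clip, transform, or keep extreme values is a modelling decision. For loan data, large amounts may be legitimate observations rather than errors.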

---
## 💡 10 · EDA Summary

### Dataset Overview
- **Rows:** *...*
- **Columns:** *...*
- **Task:** *regression / classification*

### Data Quality Findings
- *e.g. X columns have >50% missing values*
- *e.g. N duplicate rows removed*
- *e.g. columns A, B have inconsistent dtypes*

### Target Variable Observations
- *e.g. right-skewed distribution → consider log transform*
- *e.g. class imbalance: 90/10 split → consider SMOTE or class weights*

### Key Feature Insights
- *e.g. Feature X has high cardinality (500+ unique values)*
- *e.g. Feature Y shows clear separation between classes*

### Correlations Worth Investigating
- *e.g. Feature A and B are highly correlated (r=0.95) → possible multicollinearity*
- *e.g. Feature C has the strongest relationship with the target*

### Recommended Preprocessing Steps
- [ ] *e.g. Drop columns with >60% missing*
- [ ] *e.g. Impute column X with median*
- [ ] *e.g. Log-transform target*
- [ ] *e.g. One-hot encode low-cardinality categoricals*
- [ ] *e.g. Move final pipeline to `src/train.py`*