# Exploratory Data Analysis - German Credit Dataset

Quick look at the data before building anything. The goal is to understand the class balance,
the feature distributions, and whether any obvious patterns stand out.

In [None]:
import sys
sys.path.insert(0, "..")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from src.data.loader import CreditDataLoader

sns.set_style("whitegrid")
%matplotlib inline

## Load the data

Using our loader that pulls from UCI and decodes the cryptic A11/A12 codes into readable labels.

In [None]:
loader = CreditDataLoader(data_dir="../data")
df = loader.load_with_labels()
print(f"Shape: {df.shape}")
df.head()

In [None]:
df.dtypes

In [None]:
df.describe()

## Target distribution

Let's see how balanced the classes are. target=1 means the person defaulted (bad credit).

In [None]:
print(df["target"].value_counts())
print(f"\nDefault rate: {df['target'].mean():.1%}")

fig, ax = plt.subplots(figsize=(6, 4))
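# sort_index() pins the bar order to 0, 1 so the tick labels below always match the bars
df["target"].value_counts().sort_index().plot(kind="bar", ax=ax, color=["steelblue", "salmon"])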
ax.set_xticklabels(["Good (0)", "Bad (1)"], rotation=0)
ax.set_ylabel("Count")
ax.set_title("Credit Risk Distribution")
plt.tight_layout()
plt.show()

Ok so it's a 70/30 split - not terrible, but definitely imbalanced. Worth keeping in mind
when training: stratified splits for sure, and possibly class weights.
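To make the stratified split / class weight idea concrete, here's a minimal sketch using scikit-learn. It runs on a synthetic 700/300 frame rather than the real data (the `demo` frame and its values are made up for illustration), so treat it as one possible setup and not the final training code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in for the real frame: 700 good (0) and 300 bad (1) rows.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "credit_amount": rng.integers(250, 18000, size=1000),
    "target": np.repeat([0, 1], [700, 300]),
})

X = demo.drop(columns="target")
y = demo["target"]

# stratify=y keeps the 70/30 ratio in both the train and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"train default rate: {y_train.mean():.3f}")
print(f"test default rate:  {y_test.mean():.3f}")

# "balanced" weights are n_samples / (n_classes * class_count),
# so the minority class gets the larger weight.
classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

The resulting dict can be passed as `class_weight=` to estimators that accept it (e.g. `LogisticRegression`), or `class_weight="balanced"` can be used directly.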

## Missing values

In [None]:
missing = df.isnull().sum()
print(missing[missing > 0] if missing.any() else "No missing values!")

Nice, no missing values. This is a curated benchmark dataset, so that's expected. Real-world credit data won't be this clean.

## Numerical feature distributions

Let's look at the main numerical features: credit amount, loan duration, and age.

In [None]:
num_cols = ["credit_amount", "duration_months", "age"]

fig, axes = plt.subplots(1, 3, figsize=(14, 4))
for i, col in enumerate(num_cols):
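    # density + common_norm=False normalizes each class separately, so the 70/30 imbalance doesn't hide the shapes
    sns.histplot(data=df, x=col, hue="target", kde=True, ax=axes[i], bins=30, stat="density", common_norm=False)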
    axes[i].set_title(col)
plt.tight_layout()
plt.show()

Some observations:
- Credit amount is right-skewed (lots of small loans, few big ones). Defaults seem slightly more common for larger amounts which makes sense.
- Duration: longer loans have higher default rates. Also makes sense - more time for things to go wrong.
- Age: younger applicants seem to default more. Interesting but not surprising.
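The observations above come from eyeballing histograms, so a quick way to back them up is to bin each feature and look at the default rate per bin. Below is a rough sketch on synthetic data (the `demo` frame, the helper `default_rate_by_quantile`, and the made-up relationship between duration and default are all assumptions for illustration); on the real frame you'd pass `df` instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "duration_months": rng.integers(4, 73, size=n),
    "age": rng.integers(19, 76, size=n),
})
# Synthetic target: default probability rises with duration (demo assumption only).
p_default = 0.15 + 0.4 * (demo["duration_months"] - 4) / 68
demo["target"] = (rng.random(n) < p_default).astype(int)


def default_rate_by_quantile(frame, col, q=4):
    """Default rate and row count per quantile bin of `col`."""
    bins = pd.qcut(frame[col], q=q, duplicates="drop")
    return frame.groupby(bins, observed=True)["target"].agg(rate="mean", n="count")


for col in ["duration_months", "age"]:
    print(default_rate_by_quantile(demo, col).round(3), "\n")
```

Keeping the row count next to each rate makes it easier to tell a real trend from noise in a thin bin.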

## Correlation heatmap

Just the numerical columns to see if anything jumps out.

In [None]:
numeric_df = df.select_dtypes(include=[np.number])

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(numeric_df.corr(), annot=True, fmt=".2f", cmap="coolwarm", center=0, ax=ax)
ax.set_title("Correlation Matrix")
plt.tight_layout()
plt.show()

Duration and credit amount are moderately correlated (around 0.6), which makes sense - bigger loans take longer to pay off.
Nothing else is strongly correlated, so multicollinearity shouldn't be a real concern.
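If a number is wanted for the multicollinearity claim, the variance inflation factor (VIF) is one option: regress each feature on the others and compute 1 / (1 - R²). The sketch below is a plain NumPy version run on synthetic data built to have a duration/amount correlation of about 0.6 (the `demo` frame and the `vif` helper are illustrative, not part of the project code). A common rule of thumb treats VIF above roughly 5 to 10 as a problem.

```python
import numpy as np
import pandas as pd


def vif(frame):
    """Variance inflation factor per column: 1 / (1 - R^2) from regressing it on the others."""
    X = frame.to_numpy(dtype=float)
    out = {}
    for j, name in enumerate(frame.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[name] = 1.0 / (1.0 - r2)
    return pd.Series(out, name="VIF")


# Synthetic data: duration and amount correlated at about 0.6, age independent.
rng = np.random.default_rng(0)
n = 1000
duration = rng.normal(21, 12, size=n)
demo = pd.DataFrame({
    "duration_months": duration,
    "credit_amount": 150 * duration + rng.normal(0, 2400, size=n),
    "age": rng.normal(35, 11, size=n),
})

vifs = vif(demo)
print(vifs.round(2))
```

With a pairwise correlation near 0.6 the VIF lands well below the usual warning thresholds, which is consistent with not worrying about it here.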

## Categorical features vs target

Let's cross-tab a few important ones to see default rates across categories.

In [None]:
# checking account status vs default
ct = pd.crosstab(df["checking_account_status"], df["target"], normalize="index")
ct.columns = ["Good", "Bad"]
print("Default rate by checking account status:")
print(ct.sort_values("Bad", ascending=False).round(3))
print()

In [None]:
# purpose of loan vs default
ct2 = pd.crosstab(df["purpose"], df["target"], normalize="index")
ct2.columns = ["Good", "Bad"]
print("Default rate by loan purpose:")
print(ct2.sort_values("Bad", ascending=False).round(3))

In [None]:
# quick bar chart for checking account
fig, ax = plt.subplots(figsize=(8, 4))
default_rates = df.groupby("checking_account_status")["target"].mean().sort_values(ascending=False)
default_rates.plot(kind="bar", ax=ax, color="salmon")
ax.set_ylabel("Default Rate")
ax.set_title("Default Rate by Checking Account Status")
ax.axhline(y=df["target"].mean(), color="black", linestyle="--", label="overall avg")
ax.legend()
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.show()

Checking account status looks like a really strong predictor. People with a negative balance (< 0 DM) have
way higher default rates. People with no checking account actually default less - maybe they're more
conservative with money? Or it's a data quirk.

For loan purpose, education and used cars seem riskier. Used cars makes sense since the asset depreciates fast.
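One caveat with per-category default rates: some categories are tiny, so their rates are noisy. A simple guard is to show the row count and a rough standard error next to each rate. The sketch below uses an invented toy frame (group names and counts are made up, and `default_rate_table` is a hypothetical helper); on the real data it would be called as `default_rate_table(df, "purpose")`.

```python
import numpy as np
import pandas as pd

# Toy data with invented counts: one big, one medium, one tiny category.
demo = pd.DataFrame({
    "purpose": ["big_group"] * 200 + ["mid_group"] * 50 + ["tiny_group"] * 10,
    "target": [1] * 76 + [0] * 124 + [1] * 22 + [0] * 28 + [1] * 1 + [0] * 9,
})


def default_rate_table(frame, col):
    """Default rate, row count and binomial standard error per category of `col`."""
    table = frame.groupby(col)["target"].agg(rate="mean", n="count")
    table["se"] = np.sqrt(table["rate"] * (1 - table["rate"]) / table["n"])
    return table.sort_values("rate", ascending=False)


table = default_rate_table(demo, "purpose")
print(table.round(3))
```

A category with a handful of rows gets a much wider error band, so its position in the ranking shouldn't be read too literally.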

## Takeaways

- **1000 samples, 20 features** - small dataset, need to be careful with overfitting
- **30% default rate** - imbalanced but not extreme. Should use stratified splits and maybe class weights
- **No missing values** - one less thing to deal with
- **Key predictors** (from first look): checking account status, duration, credit amount, age
- **Duration and credit amount are correlated** - could try dropping one or using PCA but probably fine for tree models
- Need to **encode the categorical features** - they're strings right now. Will handle in preprocessing.

Next step: build the preprocessing pipeline and train a baseline model.