# Credit Risk Prediction: Identifying High-Risk Borrowers

### Context & Objective
This dataset simulates consumer credit behavior. The goal is to identify which customers are most at risk of delinquency.
We will explore key characteristics, understand correlations, and build a simple predictive baseline model to assess risk.

**Data Source:**  
Give Me Some Credit (Kaggle).  
- 150,000 training observations  
- 10 core variables describing credit utilization, debt ratios, income, age, and delinquency history  
- Target: `SeriousDlqin2yrs` (1 = serious delinquency, i.e. 90+ days past due, within 2 years; 0 = otherwise). This notebook refers to that event as "default".

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)

DATA_DIR = os.path.abspath(os.path.join(os.getcwd(), "..", "data"))
TRAIN_PATH = os.path.join(DATA_DIR, "cs-training.csv")
DICT_PATH  = os.path.join(DATA_DIR, "Data Dictionary.xls")

print("Data dir:", DATA_DIR)
print("Training file exists:", os.path.exists(TRAIN_PATH))
print("Dictionary exists:", os.path.exists(DICT_PATH))

df_train = pd.read_csv(TRAIN_PATH)

# The data dictionary is optional; load it only if the file is present
data_dict = pd.read_excel(DICT_PATH) if os.path.exists(DICT_PATH) else None

In [None]:
# info() prints its summary directly and returns None, so it is not wrapped in display()
df_train.info()
display(df_train.describe().T)

if data_dict is not None:
    print("\nData dictionary preview:")
    display(data_dict)


In [None]:
def hist_kde(series, title, bins=50, xlim=None, logx=False):
    s = series.dropna()
    ax = sns.histplot(s, bins=bins, kde=True)
    if xlim:
        plt.xlim(*xlim)
    if logx:
        ax.set_xscale("log")
    plt.title(title)
    plt.show()

def hist_by_target(df, col, target="SeriousDlqin2yrs", bins=40, kde=True, stat="density", logx=False):
    ax = sns.histplot(data=df, x=col, hue=target, bins=bins, kde=kde, stat=stat, common_norm=False)
    if logx:
        ax.set_xscale("log")
    plt.title(f"{col} distribution by {target}")
    plt.show()

def box1(series, title, xlim=None, logx=False):
    ax = sns.boxplot(x=series.dropna())
    if xlim:
        plt.xlim(*xlim)
    if logx:
        ax.set_xscale("log")
    plt.title(title)
    plt.show()

def corr_heatmap(df, title="Correlation Matrix"):
    corr = df.corr(numeric_only=True)
    sns.heatmap(corr, cmap="coolwarm", center=0)
    plt.title(title)
    plt.show()


# Data Exploration

Key questions explored:
1. What is the default rate?
2. Are there invalid or unrealistic values?
3. Which variables are most skewed or correlated?

Findings:
- ~6.7% of customers defaulted.
- Some **invalid ages** (e.g. 0 years old) were detected.
- Variables like `DebtRatio`, `MonthlyIncome`, and `RevolvingUtilizationOfUnsecuredLines` showed **heavy skew**.
- The three delinquency count columns are **strongly correlated** with one another and carry overlapping information.

Representative plots:
- Target distribution  
- Feature histograms (age, utilization, income)  
- Correlation heatmap
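
The checks behind these findings can be sketched as follows. This is a minimal illustration on a small synthetic frame: only the column names come from the dataset, the values are made up, and `quality_report` is a hypothetical helper rather than part of the notebook.

```python
import numpy as np
import pandas as pd

def quality_report(df, target="SeriousDlqin2yrs", min_age=18, max_age=100):
    """Summarize default rate, invalid ages, missingness, and skew."""
    numeric = df.select_dtypes(include="number").drop(columns=[target])
    return {
        "default_rate": df[target].mean(),
        "invalid_age_rows": int((~df["age"].between(min_age, max_age)).sum()),
        "missing_share": df.isna().mean().to_dict(),
        "skew": numeric.skew().to_dict(),
    }

# Synthetic toy data (assumed values, for illustration only)
toy = pd.DataFrame({
    "SeriousDlqin2yrs": [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
    "age": [45, 0, 33, 27, 61, 52, 38, 70, 29, 41],
    "MonthlyIncome": [5400, 3000, np.nan, 2500, 8000, np.nan, 4200, 3900, 120000, 3100],
})

report = quality_report(toy)
print("Default rate:", report["default_rate"])
print("Rows with age outside [18, 100]:", report["invalid_age_rows"])
print("Share of missing MonthlyIncome:", report["missing_share"]["MonthlyIncome"])
print("Skew per feature:", report["skew"])
```

A large positive skew value, as the single extreme income produces here, is the kind of signal that motivates capping and log transforms later on.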


In [None]:
sns.countplot(x="SeriousDlqin2yrs", data=df_train)
plt.title("Distribution of Target Variable: SeriousDlqin2yrs")
plt.xlabel("Defaulted within 2 years (1 = Yes)")
plt.ylabel("Count")
plt.show()

default_rate = df_train["SeriousDlqin2yrs"].mean()
print(f"Default rate: {default_rate:.2%}")


In [None]:
num_cols = [
    "RevolvingUtilizationOfUnsecuredLines",
    "age",
    "DebtRatio",
    "MonthlyIncome",
    "NumberOfOpenCreditLinesAndLoans",
]

for c in num_cols:
    hist_kde(df_train[c], f"Distribution of {c}")

for c in ["RevolvingUtilizationOfUnsecuredLines", "DebtRatio", "MonthlyIncome"]:
    box1(df_train[c], f"Boxplot of {c}")


In [None]:

for c in ["age", "DebtRatio", "MonthlyIncome", "NumberOfOpenCreditLinesAndLoans"]:
    hist_by_target(df_train, c, target="SeriousDlqin2yrs", bins=40, kde=True, stat="density")


In [None]:
# Correlations (RAW)
corr_heatmap(df_train, "Correlation Matrix (raw)")


### Observations from EDA
- Younger consumers show higher default probability.
- High `DebtRatio` and `RevolvingUtilizationOfUnsecuredLines` are correlated with higher delinquency risk.
- Missing `MonthlyIncome` values may reflect lower-income borrowers or simple data gaps; the data alone cannot distinguish the two.

Together, these patterns suggest that financial burden and credit utilization are key drivers of repayment risk.


# Data Preparation

Steps performed:

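1. **Removed or Winsorized Outliers**
   - Excluded invalid ages (<18 or >100)
   - Capped `RevolvingUtilizationOfUnsecuredLines`, `DebtRatio`, and `MonthlyIncome` at their 99th percentile (top 1%)
   - Replaced the placeholder code 98 in the delinquency counts with 0

2. **Handled Missing Values**
   - Imputed `MonthlyIncome` with its median and `NumberOfDependents` with 0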

3. **Normalized Skewed Features**
   - Created log-transformed versions of:
     - `RevolvingUtilizationOfUnsecuredLines` → `log_RevolvingUtilization`
     - `DebtRatio` → `log_DebtRatio`
     - `MonthlyIncome` → `log_MonthlyIncome`

4. **Feature Engineering**
   - `TotalDelinquencies` = sum of all delinquency-related columns  
   - Categorical `age_group` for profiling (not used in model)

Result:  
The data is now cleaned, with the skewed features compressed onto comparable scales, and ready for modeling.


In [None]:
# Cleaning decisions driven by the plots and summary statistics above:
# - Drop the redundant index column
# - Remove invalid ages (outside 18-100, which includes the age == 0 rows)
# - Delinquency placeholder 98 -> NaN -> 0 (assume missing => treat as 0 for the baseline)
# - Impute missing MonthlyIncome with the median
# - Cap extreme right tails of utilization, debt ratio, and income at the 99th percentile
# - Create interpretable engineered features (log versions, age bands, total delinquencies)

df = df_train.copy()

if "Unnamed: 0" in df.columns:
    df = df.drop(columns=["Unnamed: 0"])

min_age, max_age = 18, 100
df = df[(df['age'] >= min_age) & (df['age'] <= max_age)]

delinq_cols = [
    "NumberOfTime30-59DaysPastDueNotWorse",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfTimes90DaysLate",
]
df[delinq_cols] = df[delinq_cols].replace(98, np.nan)
df[delinq_cols] = df[delinq_cols].fillna(0)

if df["MonthlyIncome"].isna().any():
    df["MonthlyIncome"] = df["MonthlyIncome"].fillna(df["MonthlyIncome"].median())

cap_cols = ["RevolvingUtilizationOfUnsecuredLines", "DebtRatio", "MonthlyIncome"]
for c in cap_cols:
    upper = df[c].quantile(0.99)
    df[c] = np.clip(df[c], 0, upper)

df["log_RevolvingUtilization"] = np.log1p(df["RevolvingUtilizationOfUnsecuredLines"])
df["log_DebtRatio"] = np.log1p(df["DebtRatio"])
df["log_MonthlyIncome"] = np.log1p(df["MonthlyIncome"])

df["age_group"] = pd.cut(
    df["age"],
    bins=[18, 30, 45, 60, 75, 110],
    labels=["18-30", "31-45", "46-60", "61-75", "75+"],
    include_lowest=True,
    right=True,
)

df["TotalDelinquencies"] = df[delinq_cols].sum(axis=1)


In [None]:
for c in ["log_RevolvingUtilization", "log_DebtRatio", "log_MonthlyIncome"]:
    hist_kde(df[c], f"Post-clean: {c}")

for c in ["log_DebtRatio", "log_MonthlyIncome", "TotalDelinquencies", "NumberOfOpenCreditLinesAndLoans"]:
    hist_by_target(df, c, "SeriousDlqin2yrs", bins=40, kde=True, stat="density")

corr_heatmap(df[[
    "SeriousDlqin2yrs",
    "log_RevolvingUtilization","log_DebtRatio","log_MonthlyIncome",
    "age","TotalDelinquencies","NumberOfOpenCreditLinesAndLoans",
    "NumberRealEstateLoansOrLines","NumberOfDependents"
]], "Correlation Matrix (post-clean, selected)")


# Baseline Model: Logistic Regression

Why Logistic Regression?
- Interpretable and transparent: each coefficient shows how a variable shifts the odds of default.
- Fast to train and easy to calibrate.
- Serves as a baseline for future model comparisons.

Evaluation Approach:
- 5-fold **Stratified Cross-Validation** on training data (since test labels are missing)
- Metric: **ROC-AUC** (robust to class imbalance)

Key configuration:
- `solver="liblinear"`, because it's efficient for smaller datasets.
- `class_weight="balanced"` which corrects for the strong class imbalance (few defaults).
- Features are **standardized** using `StandardScaler` for fair weighting.
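
To make the `class_weight="balanced"` setting concrete, the sketch below computes the weights scikit-learn derives for a toy label vector with roughly the imbalance reported earlier. The label vector is synthetic and only illustrates the formula `n_samples / (n_classes * count_per_class)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumed): 7 defaults in 100 rows
y_toy = np.array([0] * 93 + [1] * 7)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)
w0, w1 = weights
print(f"weight for class 0: {w0:.3f}")
print(f"weight for class 1: {w1:.3f}")

# Each class contributes the same total weight to the loss: count * weight
print("total weight per class:", 93 * w0, 7 * w1)
```

Because the minority class is up-weighted, the fitted probabilities tend to sit above the observed default rate. They are useful for ranking borrowers but should not be read as calibrated default probabilities without a further calibration step.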


In [None]:
# LOGISTIC REGRESSION
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve

# Select features
df["NumberOfDependents"] = df["NumberOfDependents"].fillna(0)

features = [
    "log_RevolvingUtilization",
    "log_DebtRatio",
    "log_MonthlyIncome",
    "age",
    "TotalDelinquencies",
    "NumberOfOpenCreditLinesAndLoans",
    "NumberRealEstateLoansOrLines",
    "NumberOfDependents",
]
target = "SeriousDlqin2yrs"

missing_cols = [c for c in features + [target] if c not in df.columns]
if missing_cols:
    raise ValueError(f"Your cleaned df is missing columns: {missing_cols}")

X = df[features]
y = df[target]

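# Cross-validate with scaling inside a pipeline so each fold fits its own scaler
# (fitting the scaler on all rows before CV leaks validation-fold statistics)
from sklearn.pipeline import make_pipeline

log_reg = LogisticRegression(
    solver="liblinear",
    class_weight="balanced",
    max_iter=1000,
    random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_model = make_pipeline(StandardScaler(), log_reg)
auc_cv = cross_val_score(cv_model, X, y, cv=cv, scoring="roc_auc")
print(f"5-fold CV ROC-AUC: mean={auc_cv.mean():.3f}, std={auc_cv.std():.3f}")

# Final fit on the full training set; scaler and log_reg are reused when scoring the test set
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
log_reg.fit(X_scaled, y)
train_prob = log_reg.predict_proba(X_scaled)[:, 1]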

# ROC + PR
fpr, tpr, _ = roc_curve(y, train_prob)
plt.figure()
plt.plot(fpr, tpr, label=f"AUC={roc_auc_score(y, train_prob):.3f}")
plt.plot([0,1],[0,1], linestyle="--")
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
plt.title("ROC Curve (Training)"); plt.legend(); plt.show()

prec, rec, _ = precision_recall_curve(y, train_prob)
plt.figure()
plt.plot(rec, prec)
plt.xlabel("Recall"); plt.ylabel("Precision")
plt.title("Precision–Recall Curve (Training)"); plt.show()

# Risk bands using percentile cutoffs of the training scores (50th/70th/80th/90th)
q = np.quantile(train_prob, [0.5, 0.7, 0.8, 0.9])
def band_label(p):
    if p <= q[0]: return "Low (<=50%)"
    if p <= q[1]: return "Med-Low (50–70%)"
    if p <= q[2]: return "Medium (70–80%)"
    if p <= q[3]: return "High (80–90%)"
    return "Very High (90–100%)"

train_bands = pd.Series([band_label(p) for p in train_prob], index=y.index, name="RiskBand")
calib = (pd.DataFrame({"RiskBand": train_bands, "y": y})
         .groupby("RiskBand")["y"].agg(N="count", DefaultRate="mean")
         .sort_values("DefaultRate"))
display(calib)

# Coefficients / importance
coef_df = (pd.DataFrame({
    "Feature": features,
    "Coefficient": log_reg.coef_[0]
})
.assign(AbsImpact=lambda d: d["Coefficient"].abs())
.sort_values("AbsImpact", ascending=False))

plt.figure(figsize=(8, 5))
ypos = np.arange(len(coef_df))
plt.barh(ypos, coef_df["Coefficient"])
plt.yticks(ypos, coef_df["Feature"])
plt.axvline(0, color="black", linewidth=1)
plt.title("Logistic Regression Coefficients")
plt.tight_layout()
plt.show()


# View of precision/recall at a few cutoffs (on training)
for thr in [0.05, 0.10, 0.20, 0.30]:
    preds = (train_prob >= thr).astype(int)
    tp = ((preds==1)&(y==1)).sum()
    fp = ((preds==1)&(y==0)).sum()
    fn = ((preds==0)&(y==1)).sum()
    precision = tp / (tp+fp) if (tp+fp)>0 else 0.0
    recall    = tp / (tp+fn) if (tp+fn)>0 else 0.0
    print(f"Threshold {thr:0.2f} -> precision={precision:0.3f}, recall={recall:0.3f}, positives={preds.sum()}")


## Model Evaluation

We evaluate the model using 5-fold cross-validation with ROC-AUC as the main metric.

- **ROC-AUC ≈ 0.85 (±0.003)** indicates strong discriminatory power.
- **ROC Curve** is well above the diagonal baseline.
- **Precision–Recall Curve** shows the tradeoff between identifying defaults and avoiding false positives.
- Threshold tuning demonstrates how operational decisions affect recall and precision.

### Model Performance Summary
The model achieves a mean ROC-AUC of 0.85, indicating strong separability between high- and low-risk consumers.
This suggests that even a simple model can capture meaningful risk signals in the dataset.


## Feature Importance & Interpretation

The logistic regression coefficients show how each feature affects default risk.

| Feature | Effect on Default Probability |
|----------|------------------------------|
| **TotalDelinquencies** | + Strongly increases risk |
| **log_RevolvingUtilization** | + Increases risk (maxed credit lines) |
| **age** | − Older borrowers less likely to default |
| **log_MonthlyIncome** | − Higher income reduces risk |
| **log_DebtRatio** | − Slightly reduces risk after scaling |
| **Open Credit Lines & Real Estate Loans** | + Moderate positive relationship |

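The direction of most effects is consistent with intuition: borrowers with more delinquencies or higher utilization are riskier. The small negative sign on `log_DebtRatio` is the exception, since the EDA associated high debt ratios with higher risk; it may reflect correlation with the other features rather than a genuinely protective effect.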
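
Since the features are standardized, each coefficient is the change in log-odds per one standard deviation of that feature, and exponentiating it gives an odds ratio. The sketch below shows the conversion with hypothetical coefficient values; they are chosen only to mirror the signs in the table and are not the notebook's fitted numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients on standardized features (illustrative values only)
coefs = pd.Series({
    "TotalDelinquencies": 0.80,
    "age": -0.40,
    "log_DebtRatio": -0.05,
})

# exp(coef) is the multiplicative change in the odds of default
# for a one-standard-deviation increase in the feature
odds_ratio = np.exp(coefs)
summary = pd.DataFrame({"coef": coefs, "odds_ratio": odds_ratio.round(3)})
print(summary)
```

An odds ratio above 1 raises the odds of default, one below 1 lowers them, and a value close to 1 (as for the small `log_DebtRatio` coefficient here) means the feature barely moves the prediction.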

## Risk Segmentation (Training Data)

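To check how well the model ranks borrowers, customers were grouped into **five risk bands** using percentile cutoffs of the predicted default probability (50th, 70th, 80th, and 90th percentiles). The bands are therefore unequal in size: the lowest band holds half of all customers and the top band the riskiest 10%.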

| Risk Band | Default Rate |
|------------|---------------|
| Low (≤50%) | ~1% |
| Med-Low (50–70%) | ~4% |
| Medium (70–80%) | ~7% |
| High (80–90%) | ~12% |
| Very High (90–100%) | ~35% |

### Interpretation of Risk Bands
Default rates rise sharply across risk bands:
- **Low risk (<=50%)**: ~1% default rate  
- **Very High risk (90–100%)**: ~35% default rate  

This confirms that the model meaningfully ranks consumers by likelihood of default, which is useful for portfolio segmentation or credit limit management.
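
The banding logic can also be written with `pd.qcut` and explicit percentile edges, which keeps the bands ordered and makes the cutoffs visible. The sketch below uses synthetic scores and outcomes (assumed, generated so that risk rises with the score); `assign_bands` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic scores and outcomes (assumed): outcome probability rises with the score
scores = rng.uniform(0, 1, size=5000)
outcomes = rng.binomial(1, 0.3 * scores**2)

LABELS = ["Low", "Med-Low", "Medium", "High", "Very High"]

def assign_bands(prob, edges=(0.0, 0.5, 0.7, 0.8, 0.9, 1.0)):
    """Cut scores at the given percentile edges; returns an ordered categorical."""
    return pd.qcut(prob, q=list(edges), labels=LABELS)

bands = assign_bands(pd.Series(scores))
table = (pd.DataFrame({"RiskBand": bands, "y": outcomes})
         .groupby("RiskBand", observed=False)["y"]
         .agg(N="count", DefaultRate="mean"))

# Lift: band default rate relative to the overall default rate
table["Lift"] = table["DefaultRate"] / outcomes.mean()
print(table)
```

A lift column makes the ranking quality easier to compare across datasets with different base rates than raw default rates alone.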



In [None]:
# Apply the same cleaning steps to the test set as were used for training:
# - Drop the redundant index column
# - Remove rows with ages outside 18-100
# - Delinquency placeholder 98 -> NaN -> 0
# - Impute missing MonthlyIncome with the median
# - Cap utilization, debt ratio, and income at the 99th percentile
# - Recreate the engineered features (log versions, age bands, total delinquencies)
# Note: the median and the caps below are recomputed on the test data itself.

test_path = os.path.join(DATA_DIR, "cs-test.csv")
df_test = pd.read_csv(test_path)

if "Unnamed: 0" in df_test.columns:
    df_test = df_test.drop(columns=["Unnamed: 0"])

df_test["NumberOfDependents"] = df_test["NumberOfDependents"].fillna(0)

min_age, max_age = 18, 100
df_test = df_test[(df_test['age'] >= min_age) & (df_test['age'] <= max_age)]

delinq_cols = [
    "NumberOfTime30-59DaysPastDueNotWorse",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfTimes90DaysLate",
]
df_test[delinq_cols] = df_test[delinq_cols].replace(98, np.nan)
df_test[delinq_cols] = df_test[delinq_cols].fillna(0)

if df_test["MonthlyIncome"].isna().any():
    df_test["MonthlyIncome"] = df_test["MonthlyIncome"].fillna(df_test["MonthlyIncome"].median())

cap_cols = ["RevolvingUtilizationOfUnsecuredLines", "DebtRatio", "MonthlyIncome"]
for c in cap_cols:
    upper = df_test[c].quantile(0.99)
    df_test[c] = np.clip(df_test[c], 0, upper)

df_test["log_RevolvingUtilization"] = np.log1p(df_test["RevolvingUtilizationOfUnsecuredLines"])
df_test["log_DebtRatio"] = np.log1p(df_test["DebtRatio"])
df_test["log_MonthlyIncome"] = np.log1p(df_test["MonthlyIncome"])

df_test["age_group"] = pd.cut(
    df_test["age"],
    bins=[18, 30, 45, 60, 75, 110],
    labels=["18-30", "31-45", "46-60", "61-75", "75+"],
    include_lowest=True,
    right=True,
)

df_test["TotalDelinquencies"] = df_test[delinq_cols].sum(axis=1)


## Scoring the Test Set

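We apply the trained model to the unlabeled test dataset (`cs-test.csv`), using the same cleaning steps and transformations as for the training data. The imputation median and the 99th-percentile caps are recomputed on the test file, while the scaler and the model are the ones fitted on the training data.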

Predicted default probabilities are right-skewed. Most customers are low-risk, with a long tail of high-risk borrowers.

The top high-risk cases show:
- Very high utilization  
- Frequent delinquencies  
- Low reported income
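
One way to keep train and test preprocessing strictly identical is to learn the imputation median and the caps once on the training data and reuse them when scoring. The following is a minimal sketch of that pattern under assumed toy values; `fit_cleaning_stats` and `apply_cleaning` are hypothetical helpers, not functions from the notebook.

```python
import numpy as np
import pandas as pd

CAP_COLS = ["RevolvingUtilizationOfUnsecuredLines", "DebtRatio", "MonthlyIncome"]

def fit_cleaning_stats(train, cap_cols=CAP_COLS, cap_q=0.99):
    """Learn the imputation median and capping thresholds from training data only."""
    income_median = train["MonthlyIncome"].median()
    filled = train.assign(MonthlyIncome=train["MonthlyIncome"].fillna(income_median))
    caps = {c: filled[c].quantile(cap_q) for c in cap_cols}
    return {"income_median": income_median, "caps": caps}

def apply_cleaning(df, stats):
    """Apply the learned statistics to any frame (train or test) without refitting."""
    out = df.copy()
    out["MonthlyIncome"] = out["MonthlyIncome"].fillna(stats["income_median"])
    for col, upper in stats["caps"].items():
        out[col] = out[col].clip(lower=0, upper=upper)
    return out

# Toy frames with assumed values, for illustration only
train_toy = pd.DataFrame({
    "RevolvingUtilizationOfUnsecuredLines": [0.1, 0.5, 0.9, 1.2],
    "DebtRatio": [0.2, 0.4, 0.6, 3.0],
    "MonthlyIncome": [3000.0, 5000.0, np.nan, 7000.0],
})
test_toy = pd.DataFrame({
    "RevolvingUtilizationOfUnsecuredLines": [0.3, 50.0],
    "DebtRatio": [0.1, 900.0],
    "MonthlyIncome": [np.nan, 1_000_000.0],
})

stats = fit_cleaning_stats(train_toy)
test_clean = apply_cleaning(test_toy, stats)
print(test_clean)
```

With this design the extreme test values are capped at thresholds the model saw during training, so a few outliers in a new scoring batch cannot shift the preprocessing.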


In [None]:
# Use the already trained model to predict on this cleaned test set

feature_cols = [
    "log_RevolvingUtilization", "log_DebtRatio", "log_MonthlyIncome", "age",
    "TotalDelinquencies", "NumberOfOpenCreditLinesAndLoans",
    "NumberRealEstateLoansOrLines", "NumberOfDependents"
]

X_test = df_test[feature_cols]
X_test_scaled = scaler.transform(X_test)

df_test["PredDefaultProb"] = log_reg.predict_proba(X_test_scaled)[:, 1]

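# Cut at the 50th/70th/80th/90th percentiles so the bands match their labels
# (a plain 5-way qcut would create equal 20% bins instead)
df_test["RiskBand"] = pd.qcut(df_test["PredDefaultProb"], q=[0, 0.5, 0.7, 0.8, 0.9, 1.0], labels=[
    "Low (<=50%)", "Med-Low (50–70%)", "Medium (70–80%)", "High (80–90%)", "Very High (90–100%)"
])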

print(df_test["PredDefaultProb"].describe())

sns.histplot(df_test["PredDefaultProb"], bins=50, kde=True, color="royalblue")
plt.title("Predicted Default Probability (Test Set)")
plt.xlabel("Predicted Default Probability")
plt.ylabel("Count")
plt.show()

risk_summary = df_test.groupby("RiskBand", observed=False).size().reset_index(name="N")
display(risk_summary)


## Summary and Interpretation

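- **Model performance:** Cross-validated AUC ≈ 0.85 → strong baseline.  
- **Feature insights:** Delinquencies and utilization dominate; income and age reduce risk.  
- **Ranking:** Risk bands show a monotonic increase in default rate on the training data.  
- **Consistency:** The test-set score distribution has a similar shape to training. Because the test labels are unavailable, generalization is supported by cross-validation rather than confirmed on the test set.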

### Takeaways
- Enables early warning for at-risk customers.
- Supports risk-based pricing or automated approvals.
- Provides a transparent, auditable baseline model.

## Recommendations & Next Steps
- **Underwriting:** Tighten approval criteria for customers with high debt ratios and revolving utilization.
- **Portfolio Strategy:** Adjust loan limits or pricing for each risk band to balance growth and loss.
- **Next Steps:** Test additional models (e.g., XGBoost) and include temporal repayment data for better calibration.

**Summary:**  
Our baseline model demonstrates clear, interpretable patterns in consumer risk.



## Business Context & Final Thoughts  

This model provides a clear, interpretable baseline for assessing credit risk, with an AUC around **0.85**.  
It captures realistic behavioral patterns. Customers with high utilization or repeated delinquencies are much more likely to default, while higher income and older age are generally protective.

From a business point of view:
- The model can help **set approval thresholds**, **adjust pricing**, and **prioritize collections**.  
- It’s simple enough to explain to non-technical stakeholders and can easily be integrated into existing decision rules.  
- The probability outputs can feed into portfolio monitoring and expected loss calculations.

If this were part of a real credit process, next steps would include:
- **Calibrating** predicted probabilities to match observed default rates.  
- **Defining cutoffs** for different risk tiers and mapping them to business actions.  
- **Testing** performance under different economic conditions.  
- **Tracking** model stability over time to catch any drift.
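
For the drift-tracking step, one commonly used measure in credit scoring is the Population Stability Index (PSI), which compares the score distribution of a reference sample with that of a newer one. The sketch below is one possible implementation on synthetic score samples; `population_stability_index` is a hypothetical helper and the beta-distributed scores are assumptions for illustration.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between two score samples, using quantile bins of the expected sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges taken from the reference distribution
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    exp_share = np.bincount(np.searchsorted(edges, expected), minlength=n_bins) / len(expected)
    act_share = np.bincount(np.searchsorted(edges, actual), minlength=n_bins) / len(actual)
    # Floor the shares to avoid log(0) in empty bins
    exp_share = np.clip(exp_share, eps, None)
    act_share = np.clip(act_share, eps, None)
    return float(np.sum((act_share - exp_share) * np.log(act_share / exp_share)))

rng = np.random.default_rng(42)
train_scores = rng.beta(2, 8, size=10_000)    # reference scores (synthetic)
same_scores = rng.beta(2, 8, size=10_000)     # new sample, same distribution
shifted_scores = rng.beta(3, 6, size=10_000)  # new sample, shifted toward higher risk

psi_same = population_stability_index(train_scores, same_scores)
psi_shift = population_stability_index(train_scores, shifted_scores)
print(f"PSI, same distribution:    {psi_same:.4f}")
print(f"PSI, shifted distribution: {psi_shift:.4f}")
```

A frequently quoted rule of thumb treats PSI below about 0.1 as stable and above about 0.25 as a significant shift, though suitable thresholds depend on the portfolio and should be agreed with the risk team.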

Overall, this analysis builds a solid foundation for data-driven credit decisions: it is simple, transparent, and practical.
