---
title: "Credit Risk Prediction: Analyzing Loan Applicant Risk Based on Demographic and Financial Attributes"
author: "Ayush Joshi, Stallon Pinto, & Zhanerke Zhumash"
format: 
  html:
    toc: true
    toc-depth: 3
    number-sections: true
    theme: cosmo
    fig-width: 8
    fig-height: 6
    fig-cap-location: bottom
    tbl-cap-location: top
  pdf:
    toc: true
    fig-pos: "H"
bibliography: references.bib
execute:
  echo: false
  warning: false
---

In [None]:
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Summary

This project develops a predictive model to assess the credit risk of loan applicants using the Statlog (German Credit Data) dataset. The primary objective is to determine whether an applicant is a good or bad credit risk based on demographic and financial attributes. We compared a majority-class baseline, a k-Nearest Neighbors model, and a Random Forest classifier, and selected the Random Forest because it had the highest accuracy and the fewest false negatives, which is crucial for reducing the risk of approving applicants with poor creditworthiness.

# Introduction

## Background
Credit risk assessment is a crucial process in banking and financial services. Lenders evaluate borrowers based on financial stability and past credit behavior to determine their likelihood of defaulting on a loan. With advances in machine learning, automated credit risk assessment has gained traction because it can efficiently analyze large datasets, identify risk patterns, and improve lending decisions [@basel2001]. This project explores whether a machine learning model can classify a loan applicant as a good or bad credit risk based on demographic and financial data.

## Research Question
Can we classify a loan applicant as a **good** or **bad** credit risk using a combination of **demographic, financial, and loan-specific attributes**?  

## Dataset: German Credit Data
The dataset used in this analysis is the **German Credit Dataset**, originally compiled by **Professor Dr. Hans Hofmann** from **Universität Hamburg** [@statlog]. It contains 1,000 instances with 20 attributes describing various aspects of a loan applicant's financial and personal profile. The dataset has both categorical and numerical attributes and provides a labeled classification of good vs. bad credit risk.

### Target Variable (Credit Standing)
- **0** → Good Credit Risk (low risk, likely to repay)
- **1** → Bad Credit Risk (high risk, potential default)

### Key Features
The dataset consists of three broad categories of features:

1. **Demographic Information:**
   - **Age** (numerical): The applicant's age in years.
   - **Employment Status** (categorical): Length of the applicant's current employment, grouped into ranges (e.g., unemployed, < 1 year, >= 7 years).
   - **Foreign Worker Status** (categorical): Whether the applicant is a foreign worker (Yes/No).
   - **Personal Status & Gender** (categorical): Applicant's marital status and gender.

2. **Financial Attributes:**
   - **Credit History** (categorical): Previous credit behavior (e.g., no previous credit, delayed payments, fully repaid).
   - **Status of Checking Account** (categorical): Information on the applicant's checking account balance.
   - **Savings Account/Bonds** (categorical): Level of savings held by the applicant.
   - **Credit Amount** (numerical): The total loan amount requested.
   - **Other Debtors/Guarantors** (categorical): Whether the applicant has co-applicants or guarantors.

3. **Loan & Payment Behavior:**
   - **Loan Purpose** (categorical): The purpose for which the loan is requested (e.g., car, education, business).
   - **Loan Duration (Months)** (numerical): The length of the loan term.
   - **Installment Rate** (numerical): Loan repayment amount as a percentage of disposable income.
   - **Existing Credits at Bank** (numerical): The number of current outstanding loans with the bank.
   - **Other Installment Plans** (categorical): Whether the applicant has other loans with banks or stores.
   - **Housing Status** (categorical): Whether the applicant owns, rents, or lives rent-free.

# Methods & Results

This section outlines the step-by-step methodology used to preprocess the dataset, perform exploratory data analysis (EDA), and build machine learning models for credit risk classification.

## Data Preprocessing

### Loading and Cleaning the Data
- The dataset was loaded from a **space-delimited text file** (`data/german.data`) into a pandas DataFrame.
- Column names were **added to the dataset** for readability, since the raw file has no header row.
- **Coded categorical values** (e.g., `A11`) were mapped to **interpretable labels** for improved understanding.


In [None]:
#| label: tbl-raw-data
#| tbl-cap: Raw German Credit Data (First 5 Rows)
file_path = "data/german.data"
df = pd.read_csv(file_path, sep=" ", header=None)
df.head().style.set_table_attributes('class="dataframe"')

In [None]:
column_names = [
"Checking_Acc_Status", "Duration (in months)", "Credit_History", "Purpose",
"Credit_Amount", "Savings_Acc", "Employment", "Installment_Rate",
"Personal_Status", "Other_Debtors", "Residence_Since", "Property",
"Age", "Other_Installment", "Housing", "Existing_Credits",
"Job", "Num_People_Maintained", "Telephone", "Foreign_Worker", "Credit Standing"
]

df.columns = column_names
mappings = {
"Checking_Acc_Status": {
"A11": "< 0 DM",
"A12": "0-200 DM",
"A13": ">= 200 DM or Salary Assigned",
"A14": "No Checking Account"
},
"Credit_History": {
"A30": "No Credit Taken / All Paid",
"A31": "All Paid (Same Bank)",
"A32": "Existing Credits Paid Duly",
"A33": "Past Delays in Payment",
"A34": "Critical Account / Other Existing Credits"
},
"Purpose": {
"A40": "New Car",
"A41": "Used Car",
"A42": "Furniture/Equipment",
"A43": "Radio/TV",
"A44": "Domestic Appliances",
"A45": "Repairs",
"A46": "Education",
"A47": "Vacation",
"A48": "Retraining",
"A49": "Business",
"A410": "Others"
},
"Savings_Acc": {
"A61": "< 100 DM",
"A62": "100-500 DM",
"A63": "500-1000 DM",
"A64": ">= 1000 DM",
"A65": "No Savings Account"
},
"Employment": {
"A71": "Unemployed",
"A72": "< 1 Year",
"A73": "1-4 Years",
"A74": "4-7 Years",
"A75": ">= 7 Years"
},
"Personal_Status": {
"A91": "Male: Divorced/Separated",
"A92": "Female: Divorced/Separated/Married",
"A93": "Male: Single",
"A94": "Male: Married/Widowed",
"A95": "Female: Single"
},
"Other_Debtors": {
"A101": "None",
"A102": "Co-applicant",
"A103": "Guarantor"
},
"Property": {
"A121": "Real Estate",
"A122": "Building Society Savings / Life Insurance",
"A123": "Car or Other Property",
"A124": "No Property"
},
"Other_Installment": {
"A141": "Bank",
"A142": "Stores",
"A143": "None"
},
"Housing": {
"A151": "Rent",
"A152": "Own",
"A153": "For Free"
},
"Job": {
"A171": "Unemployed / Unskilled (Non-Resident)",
"A172": "Unskilled (Resident)",
"A173": "Skilled Employee / Official",
"A174": "Management / Self-Employed / Highly Qualified"
},
"Telephone": {
"A191": "No Telephone",
"A192": "Yes, Registered"
},
"Foreign_Worker": {
"A201": "Yes",
"A202": "No"
}
}
# Replace the coded values in each categorical column with readable labels
for col, mapping in mappings.items():
    df[col] = df[col].map(mapping)

In [None]:
#| label: tbl-mapped-data
#| tbl-cap: German Credit Data with Mapped Values (First 5 Rows)
df.head().style.set_table_attributes('class="dataframe"')

In [None]:
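# One-hot encode categorical features; drop_first avoids a redundant dummy per feature
df_encoded = pd.get_dummies(df, drop_first=True).astype(int)
# Recode the target: 1 (good) -> 0, 2 (bad) -> 1
df_encoded['Credit Standing'] = df_encoded['Credit Standing'].map({1: 0, 2: 1})
# Defensive fallback: integer-code any column that is still of object dtype
for col in df_encoded.columns:
    if df_encoded[col].dtype == 'object':
        df_encoded[col] = df_encoded[col].astype('category').cat.codes
# Share of bad-credit applicants, in percent
bad_credit_percentage = df_encoded['Credit Standing'].mean() * 100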

## Exploratory Data Analysis

The dataset contains ```{python} len(df)``` instances with ```{python} len(df.columns) - 1``` attributes plus the credit standing label. Approximately ```{python} f"{bad_credit_percentage:.1f}"```% of the applicants are classified as bad credit risks.

::: {#fig-credit-distribution}
![](../results/eda/credit_standing_distribution.png){width=60%}

Distribution of credit standing (good vs. bad) in the dataset.
:::


As shown in @fig-credit-distribution, the dataset is imbalanced, with a majority of applicants having good credit standing. This imbalance is important to consider when building and evaluating our predictive models.
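
One way to account for this imbalance when splitting the data is a stratified split, which keeps the class ratio the same in the training and test sets. The sketch below is illustrative only: it uses synthetic labels with an assumed 70/30 good/bad ratio and a placeholder feature, not the credit data itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an assumed class balance: 700 good (0), 300 bad (1)
y_demo = np.array([0] * 700 + [1] * 300)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the bad-credit share equal in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

train_bad_share = y_tr.mean()
test_bad_share = y_te.mean()
print(f"bad share: train={train_bad_share:.3f}, test={test_bad_share:.3f}")
```

Without stratification, a random split of a small dataset can leave the test set with noticeably more or fewer bad-credit cases than the training set, which makes recall estimates less stable.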

### Correlation Analysis


In [None]:
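# Signed Pearson correlation of every encoded column with the target, largest first
correlations = df_encoded.corr()['Credit Standing'].sort_values(ascending=False)
# Positions 1-10: skip Credit Standing itself; only positive correlations are kept
top_correlations = correlations.iloc[1:11]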

@tbl-correlations shows the top features correlated with credit standing. The strongest correlations are with checking account status, loan duration, and savings account status.
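
Because the ranking is built from signed correlations sorted in descending order, a feature with a strong negative association would not appear in it. A possible alternative, shown here as a sketch on made-up data (the column names `pos_feature`, `neg_feature`, and `noise` are hypothetical), is to rank by absolute correlation while keeping the sign for interpretation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

# Toy frame: one positively related, one negatively related, one unrelated feature
toy = pd.DataFrame({
    "target": target,
    "pos_feature": target + rng.normal(0, 0.5, size=n),
    "neg_feature": -target + rng.normal(0, 0.5, size=n),
    "noise": rng.normal(0, 1, size=n),
})

# Signed correlations with the target, excluding the target itself
corr = toy.corr()["target"].drop("target")

# Order by strength (absolute value) but keep the signed values
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```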


In [None]:
#| label: tbl-correlations
#| tbl-cap: Top 10 Features Correlated with Credit Standing
pd.DataFrame({
'Feature': top_correlations.index,
'Correlation with Credit Standing': top_correlations.values
}).style.format({'Correlation with Credit Standing': '{:.3f}'}).set_table_attributes('class="dataframe"')

Here is the complete plot for the correlations:

::: {#fig-corr-analysis}
![](../results/eda/correlation_analysis.png){width=60%}

Correlation of each feature with credit standing.
:::

### **Feature Distributions**
Key numerical features were analyzed to observe differences between **good and bad credit applicants**.

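::: {#fig-feat-dist}
![](../results/eda/feature_distributions.png){width=60%}

Distributions of key numerical features (credit amount, age, and loan duration) by credit standing.
:::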


#### **Credit Amount Distribution**
_(See plot 1 in @fig-feat-dist)_

- **Good credit applicants** borrowed, on average, **~2,985 DM**.
- **Bad credit applicants** borrowed **~3,938 DM**, which is **~1,000 DM more**.
- The distribution is **right-skewed**, meaning a few applicants borrowed **significantly higher amounts**.

**Key Finding:**

- Higher loan amounts are **associated with a higher likelihood of bad credit standing**.

#### **Age Distribution**
_(See plot 2 in @fig-feat-dist)_

- **Good credit applicants** had an **average age of 36.22 years**.
- **Bad credit applicants** had an **average age of 33.96 years**.
- The age distribution is **slightly right-skewed**, meaning there are **fewer older borrowers**.

**Key Finding:**

- **Younger applicants** tend to have **worse credit standing**.
- Older applicants are slightly **less likely to default**.

#### **Loan Duration Distribution**
_(See plot 3 in @fig-feat-dist)_

- **Good credit applicants** had loans lasting **~19.21 months on average**.
- **Bad credit applicants** had loans lasting **~24.86 months on average**.

**Key Finding:**

- Applicants with **longer loan durations** have a **higher risk of default**.



## Model Performance Analysis

To evaluate the effectiveness of different classification models for predicting credit standing, we tested three models:

1. **Baseline Model (Majority Class Classifier)**
2. **K-Nearest Neighbors (KNN) - Optimized**
3. **Random Forest - Optimized**

Each model was assessed based on accuracy, precision, recall, F1-score, and false negative rate (FNR), considering the imbalance in the dataset.
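
For reference, the false negative rate is the share of truly bad-credit applicants that a model labels as good: FNR = FN / (FN + TP), which equals 1 minus recall for the bad-credit class. The sketch below works through this on a small set of made-up labels (not results from this analysis), treating 1 = bad credit as the positive class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels: 1 = bad credit (positive class), 0 = good credit
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# FNR: fraction of actual bad-credit cases that were predicted as good
fnr = fn / (fn + tp)
recall = recall_score(y_true, y_pred)
print(f"FN={fn}, TP={tp}, FNR={fnr:.2f}, recall={recall:.2f}")
```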


In [None]:
X = df_encoded.drop('Credit Standing', axis=1)
y = df_encoded['Credit Standing']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

def evaluate_model(model, X_test, y_test, model_name, scaled=False):
    """Compute test-set metrics and draw the confusion matrix for a fitted model."""
    # Models trained on standardized features must be scored on standardized features
    X_test_eval = X_test_scaled if scaled else X_test

    y_pred = model.predict(X_test_eval)
    y_pred_proba = model.predict_proba(X_test_eval)[:, 1] if hasattr(model, "predict_proba") else None

    # zero_division=0 avoids warnings when a model never predicts the positive class
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)

    cm = confusion_matrix(y_test, y_pred)
    fn = cm[1, 0]  # False negatives: bad credit predicted as good
    tp = cm[1, 1]  # True positives: bad credit correctly flagged
    fnr = fn / (fn + tp) if (fn + tp) > 0 else 0

    fig, ax = plt.subplots(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Good Credit', 'Bad Credit'],
                yticklabels=['Good Credit', 'Bad Credit'], ax=ax)
    ax.set_title(f'Confusion Matrix - {model_name}', fontsize=14, fontweight='bold')
    ax.set_ylabel('True Label', fontsize=12)
    ax.set_xlabel('Predicted Label', fontsize=12)
    fig.tight_layout()

    # Return the figure itself rather than the pyplot module
    return {
        'model_name': model_name,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'fnr': fnr,
        'confusion_matrix': fig
    }

### Baseline Model (Majority Class Classifier)


In [None]:
baseline_model = DummyClassifier(strategy='most_frequent')
baseline_model.fit(X_train, y_train)
baseline_metrics = evaluate_model(baseline_model, X_test, y_test, "Baseline (Majority Class)")

![Confusion Matrix for Baseline Model](results/models/baseline_confusion_matrix.png){#fig-baseline-cm}

**Key Takeaways:**

- The baseline model highlights the importance of **building a predictive model**, as it **completely fails to detect bad credit applicants**.
- **Misclassification Cost:** This model would **approve every bad credit applicant**, leading to financial losses.
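
The behavior described above can be reproduced on a toy example. This sketch uses synthetic labels with an assumed 70/30 class split and a constant placeholder feature; it is meant only to show why a majority-class predictor has zero recall for the minority class while still reaching an accuracy equal to the majority share.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 70 good (0) and 30 bad (1); the feature carries no information
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.zeros((len(y_toy), 1))

# Always predicts the most frequent class seen during fit (here: 0, good credit)
dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = dummy.predict(X_toy)

toy_accuracy = accuracy_score(y_toy, pred)
toy_recall = recall_score(y_toy, pred, zero_division=0)
print(f"accuracy={toy_accuracy:.2f}, bad-credit recall={toy_recall:.2f}")
```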

### K-Nearest Neighbors (Optimized)

- **Grid search (5-fold cross-validation, scored on F1) selected the following hyperparameters:**
  - **Distance Metric:** Manhattan
  - **Number of Neighbors (k):** 3
  - **Weighting:** Distance-based


In [None]:
param_grid = {
'n_neighbors': [3, 5, 7, 9, 11, 13, 15],
'weights': ['uniform', 'distance'],
'metric': ['euclidean', 'manhattan']
}
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='f1', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)
best_knn = grid_search.best_estimator_
knn_metrics = evaluate_model(best_knn, X_test, y_test, "KNN (Optimized)", scaled=True)

![Confusion Matrix for K-Nearest Neighbors Model](results/models/knn_confusion_matrix.png){#fig-knn-cm}

**Key Takeaways:**

- The KNN model shows **improvement over the baseline**.
- **Bad Credit Recall improved to ```{python} f"{knn_metrics['recall']*100:.1f}"```%**, meaning the model **detects more defaulters**.
- However, **it still struggles with false positives and false negatives**.
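
In the KNN cell, the scaler is fit on the full training set before cross-validation, so each validation fold has already contributed to the scaling statistics. One possible refinement, which is not what this analysis does, is to place the scaler and classifier in a scikit-learn `Pipeline` so the scaler is refit inside every fold. The sketch below illustrates the idea on synthetic data from `make_classification`; the smaller grid and the step names `scaler` and `knn` are choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced stand-in for the credit data
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42, stratify=y_syn
)

# The scaler is a pipeline step, so it is refit on each training fold during CV
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
grid = {
    "knn__n_neighbors": [3, 5, 7],
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(pipe, grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

# The fitted search scales and predicts in one call on unscaled test features
test_predictions = search.predict(X_te)
print(search.best_params_)
```

With this arrangement there is no separate scaled copy of the test set to keep track of, because the fitted pipeline applies the training-set scaling automatically at prediction time.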

### Random Forest (Optimized)

- **Grid search (5-fold cross-validation, scored on F1) selected the following hyperparameters:**
  - **Class Weight:** Balanced
  - **Max Depth:** 20
  - **Min Samples per Split:** 10
  - **Number of Estimators:** 50


In [None]:
param_grid_rf = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10],
'class_weight': [None, 'balanced']
}
rf = RandomForestClassifier(random_state=42)
grid_search_rf = GridSearchCV(rf, param_grid_rf, cv=5, scoring='f1', n_jobs=-1)
grid_search_rf.fit(X_train, y_train)
best_rf = grid_search_rf.best_estimator_
rf_metrics = evaluate_model(best_rf, X_test, y_test, "Random Forest (Optimized)")

![Confusion Matrix for Random Forest Model](results/models/randomforest_confusion_matrix.png){#fig-rf-cm}

![Feature Importance for Random Forest Model](results/models/feature_importance.png){#fig-feat-imp}

**Key Takeaways:**

- **Random Forest outperforms both Baseline and KNN models.**
- **Bad Credit Recall improved to ```{python} f"{rf_metrics['recall']*100:.1f}"```%**, meaning the model detects **more defaulters**.
- **Balanced performance** with a **good trade-off between precision and recall**.

The **Random Forest model** provides insight into which features contribute most to predicting **credit standing**, as shown in @fig-feat-imp.

**Feature Importance Takeaways:**

- **Credit Amount and Loan Duration** are the **two most important predictors** of credit risk.
- **Checking Account Status (No Account or Negative Balance)** strongly correlates with credit standing.
- **Younger applicants and those with poor credit history** tend to have worse credit standing.
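
A ranking like the one in the feature importance figure can be obtained from a fitted forest's `feature_importances_` attribute. The sketch below is a minimal illustration on synthetic data with hypothetical column names (`feature_0` to `feature_5`); it does not reproduce the importances reported in this analysis.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names
X_arr, y_arr = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X_df = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_df, y_arr)

# Impurity-based importances, one per column, normalized to sum to 1
importances = pd.Series(
    forest.feature_importances_, index=X_df.columns
).sort_values(ascending=False)
print(importances.head())
```

These impurity-based scores tend to favor continuous and high-cardinality features, so a numerical feature ranking above one-hot encoded categories should be read with that in mind.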

## Model Comparison

![Model comparison between all models](results/models/model_comparison.png){#fig-model-comp}

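```{python}
#| label: tbl-model-metrics
#| tbl-cap: "Performance Metrics for All Models"

# Assumed construction: models_comparison is not defined elsewhere in this document,
# so it is built here from the per-model metric dictionaries, indexed by model name
models_comparison = pd.DataFrame(
    [baseline_metrics, knn_metrics, rf_metrics]
).set_index('model_name')

metrics_table = models_comparison[['accuracy', 'precision', 'recall', 'f1', 'fnr']].reset_index()
metrics_table.style.format({
    'accuracy': '{:.3f}',
    'precision': '{:.3f}',
    'recall': '{:.3f}',
    'f1': '{:.3f}',
    'fnr': '{:.3f}'
}).set_table_attributes('class="dataframe"')
```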


# Discussion

In our analysis, we evaluated three models (a Dummy Classifier, a k-Nearest Neighbors (k-NN) model, and a Random Forest model) to predict credit risk using the Statlog (German Credit Data) dataset. Each model's performance was assessed using accuracy, precision, recall, F1 score, and false negative rate, as shown in @tbl-model-metrics and @fig-model-comp. We pay particular attention to the false negative rate because a false negative means approving an applicant with poor creditworthiness.

## Dummy Classifier Performance
- **Purpose**: The Dummy Classifier serves as a baseline, predicting the majority class (good credit) for every applicant without learning from the features.
- **Metrics**:
  - **Recall, Precision, and F1 Score**: Because the model never predicts the minority class, recall and F1 for bad credit risks are zero, and precision is undefined (reported as zero).
  - **Accuracy**: Accuracy simply equals the share of good-credit applicants in the test set, which is misleading on an imbalanced dataset because it reflects no predictive capability.

## k-Nearest Neighbors (k-NN) Performance
- **Purpose**: The k-NN model is a simple, instance-based learning algorithm that classifies data points based on the majority class of their nearest neighbors.
- **Metrics**:
  - **Recall**: k-NN can achieve better recall than the Dummy Classifier by considering the local structure of the data, but it may still struggle with imbalanced datasets.
  - **Precision**: Precision can vary depending on the choice of `k` and the distance metric, but it generally improves over the Dummy Classifier.
  - **F1 Score**: The F1 score for k-NN is typically higher than that of the Dummy Classifier, indicating a better balance between precision and recall.
  - **Accuracy**: k-NN often shows improved accuracy over the Dummy Classifier, but it may not match the performance of more sophisticated models like Random Forests.

## Random Forest Classifier Performance
- **Purpose**: The Random Forest model is an ensemble method that builds multiple decision trees to improve prediction accuracy and robustness [@kuhn2013].
- **Metrics**:
  - **Recall**: Random Forests excel in recall, particularly for the minority class, by effectively capturing complex patterns in the data.
  - **Precision**: It also achieves high precision, reducing false positives and ensuring reliable predictions.
  - **F1 Score**: The F1 score is significantly higher for Random Forests, reflecting its superior ability to balance precision and recall.
  - **Accuracy**: Random Forests typically achieve the highest accuracy among the models tested, providing a comprehensive and reliable classification.

## Comparison
- **Dummy Classifier**: Serves as a baseline with limited predictive power, primarily due to its simplistic approach.
- **k-NN**: Offers improvements over the Dummy Classifier by leveraging local data structures, but its performance is sensitive to parameter choices and data imbalance.
- **Random Forest**: Outperforms both the Dummy Classifier and k-NN by providing a robust, accurate, and balanced classification, making it the preferred choice for credit risk assessment.

These findings highlight the importance of selecting appropriate models for credit risk prediction [@lessmann2015]. While k-NN offers some improvements over a baseline, the Random Forest model's ability to handle complex datasets and provide reliable predictions makes it the most effective tool for this task.

# References
:::{#refs}
:::