# Credit Risk Assessment - Model Development

**Name:** Neetu
**Student ID:** IITP_AIML_2506115

This notebook implements four classification models:

- Logistic Regression (baseline)
- Decision Tree
- Random Forest
- XGBoost

Hyperparameter tuning is performed using 5-fold cross-validation.

# Section 1 - Notebook Overview

# Credit Risk Assessment - Model Development & Hyperparameter Tuning

## Overview

In this notebook, multiple classification models are developed to predict credit default risk.

The goal is to identify high-risk applicants (Bad=1) while prioritizing recall to minimize financial losses due to loan defaults.

Business priority: achieve at least 75% recall on default cases.
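To make the recall target concrete, the sketch below computes recall for the default class from a confusion matrix and compares it with the 75% threshold. The labels are toy values made up for illustration, and `RECALL_TARGET` and `meets_recall_target` are hypothetical names, not part of the notebook.

```python
from sklearn.metrics import confusion_matrix, recall_score

RECALL_TARGET = 0.75  # business threshold stated in the overview


def meets_recall_target(y_true, y_pred, target=RECALL_TARGET):
    """Return (recall, passed) for the default class (label 1)."""
    recall = recall_score(y_true, y_pred, pos_label=1)
    return recall, recall >= target


# Toy labels, purely illustrative: 4 actual defaults, 3 of them caught.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 1, 0]

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
recall_demo, passed_demo = meets_recall_target(y_true_demo, y_pred_demo)

print(f"TP={tp}, FN={fn}, recall={recall_demo:.2f}, meets target: {passed_demo}")
```

Recall is TP / (TP + FN), so every missed defaulter (a false negative) lowers it directly, which is why it is the primary metric here.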

In [None]:
# Install MLflow
!pip install mlflow
import mlflow
import mlflow.sklearn
mlflow.set_experiment("Credit_Risk_Model_Development")


# Section 2- Load Processed data
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd

# Load files from Data folder
X_train = pd.read_csv('/content/drive/MyDrive/credit-risk-ml-pipeline/Data/X_train.csv')
X_test = pd.read_csv('/content/drive/MyDrive/credit-risk-ml-pipeline/Data/X_test.csv')
y_train = pd.read_csv('/content/drive/MyDrive/credit-risk-ml-pipeline/Data/y_train.csv').values.ravel()
y_test = pd.read_csv('/content/drive/MyDrive/credit-risk-ml-pipeline/Data/y_test.csv').values.ravel()

print("Data loaded successfully ")

2026/02/25 18:20:05 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2026/02/25 18:20:05 INFO mlflow.store.db.utils: Updating database tables
2026/02/25 18:20:08 INFO mlflow.tracking.fluent: Experiment with name 'Credit_Risk_Model_Development' does not exist. Creating a new experiment.


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Data loaded successfully 


In [None]:
# Import libraries

import pandas as pd
import numpy as np

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

# Model selection
from sklearn.model_selection import cross_val_score, GridSearchCV

# Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Metrics
from sklearn.metrics import accuracy_score, recall_score

import joblib

## Logistic Regression Explanation

Logistic Regression is used as the baseline model because:

- It is simple and interpretable.
- Its coefficients show the direction and strength of each feature's effect.
- It performs well when the classes are close to linearly separable.

A Pipeline combines feature scaling and the classifier so that the same preprocessing is applied at training and prediction time, which keeps results reproducible.

This model establishes a performance benchmark for comparison.
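As a rough illustration of the points above, the sketch below fits a scaling-plus-classifier Pipeline on synthetic data and lists the coefficients by magnitude. The data from `make_classification` and the names `demo_pipe`, `X_demo`, and `coefs` are assumptions for the example; they stand in for the notebook's real features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=42
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The scaler is fitted on the training split only, inside the pipeline.
demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])
demo_pipe.fit(X_tr_demo, y_tr_demo)

# Because features are standardized, coefficient magnitudes are comparable.
coefs = demo_pipe.named_steps["clf"].coef_.ravel()
for idx in np.argsort(-np.abs(coefs)):
    print(f"feature_{idx}: {coefs[idx]:+.3f}")
```

A positive coefficient raises the predicted probability of class 1 as the feature increases; a negative one lowers it.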

In [None]:
# Section 3 - Logistic Regression (Pipeline required)

with mlflow.start_run(run_name="Logistic_Regression"):

    # Scale features and fit the classifier as a single pipeline
    lr_pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(random_state=42)),
    ])
    lr_pipe.fit(X_train, y_train)

    y_pred = lr_pipe.predict(X_test)

    recall = recall_score(y_test, y_pred, pos_label=1)
    accuracy = accuracy_score(y_test, y_pred)

    # Log parameters
    mlflow.log_param("model_type", "Logistic Regression")

    # Log metrics
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("accuracy", accuracy)

    # Log model
    mlflow.sklearn.log_model(lr_pipe, name="logistic_regression_model")

    print("LR Recall:", recall)



LR Recall: 0.8448275862068966


In [None]:
import os

# Define the directory path
models_dir = '/content/drive/MyDrive/credit-risk-ml-pipeline/Models/'

# Create the directory if it doesn't exist
os.makedirs(models_dir, exist_ok=True)

joblib.dump(lr_pipe, os.path.join(models_dir, 'logistic_regression.pkl'))

['/content/drive/MyDrive/credit-risk-ml-pipeline/Models/logistic_regression.pkl']

## Decision Tree Explanation

Decision Tree is a non-linear model that captures complex feature interactions.

Advantages:

- Easy to interpret.
- Handles non-linear relationships.
- No need for feature scaling.

However, it may overfit if not properly tuned.
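The overfitting risk can be seen in a small experiment. The sketch below, on synthetic data with some label noise (an assumption, not the credit dataset), compares an unconstrained tree with one limited to depth 3. The names `full_tree` and `shallow_tree` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% flipped labels so that memorizing noise is possible.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, flip_y=0.1, random_state=0
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr_demo, y_tr_demo)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    X_tr_demo, y_tr_demo
)

for name, tree in [("unconstrained", full_tree), ("max_depth=3", shallow_tree)]:
    train_recall = recall_score(y_tr_demo, tree.predict(X_tr_demo))
    test_recall = recall_score(y_te_demo, tree.predict(X_te_demo))
    print(f"{name}: depth={tree.get_depth()}, "
          f"train recall={train_recall:.3f}, test recall={test_recall:.3f}")
```

An unconstrained tree keeps splitting until it fits the training set perfectly, so a gap between its training and test scores is the usual sign that parameters such as max_depth or min_samples_leaf need tuning.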

In [None]:
# Section 4 - Decision Tree

dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

y_pred_dt = dt_model.predict(X_test)

print("Decision Tree  Accuracy :", accuracy_score(y_test, y_pred_dt))
print("Decision Tree  Recall:", recall_score(y_test, y_pred_dt, pos_label=1))

# 5-fold cross-validated recall on the training set
cv_dt = cross_val_score(dt_model, X_train, y_train, cv=5, scoring='recall')
print("Decision Tree CV Recall:", cv_dt.mean())

joblib.dump(dt_model, '/content/drive/MyDrive/credit-risk-ml-pipeline/Models/decision_tree.pkl')

Decision Tree  Accuracy : 1.0
Decision Tree  Recall: 1.0
Decision Tree CV Recall: 0.9740980573543017


['/content/drive/MyDrive/credit-risk-ml-pipeline/Models/decision_tree.pkl']

## Random Forest Explanation

Random Forest is an ensemble learning method that combines multiple decision trees.

Advantages:

- Reduces overfitting compared with a single tree.
- Improves generalization.
- Provides feature importance insights.

Since recall is critical for this business problem, hyperparameter tuning is performed to optimize recall performance.
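The feature importance point can be sketched as follows. The example fits a small forest on synthetic data and ranks features by their impurity-based importance; the dataset and the names `rf_demo` and `feature_names` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 8 features, only 3 of which carry signal.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=1
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features.
importances = rf_demo.feature_importances_
for idx in np.argsort(-importances):
    print(f"{feature_names[idx]}: {importances[idx]:.3f}")
```

These importances are computed from the training data, so they are best read as a rough ranking rather than an exact measure of each feature's effect.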


In [None]:
# Section 5 - Random Forest

# NOTE: best_rf is created by the tuning cell in Section 6; run that cell first.
with mlflow.start_run(run_name="Random_Forest_Tuned"):

    best_rf.fit(X_train, y_train)
    y_pred = best_rf.predict(X_test)

    recall = recall_score(y_test, y_pred, pos_label=1)
    accuracy = accuracy_score(y_test, y_pred)

    mlflow.log_param("n_estimators", best_rf.n_estimators)
    mlflow.log_param("max_depth", best_rf.max_depth)

    mlflow.log_metric("recall", recall)
    mlflow.log_metric("accuracy", accuracy)

    mlflow.sklearn.log_model(best_rf, name="random_forest_model")

    print("RF Recall:", recall)




RF Recall: 0.9827586206896551


## Hyperparameter Tuning Explanation

To improve model performance, GridSearchCV with 5-fold cross-validation is used.

Why use 5-fold cross-validation?

- Reduces variance in the performance estimate.
- Gives a more reliable indication of how the model generalizes.
- Uses multiple train-validation splits, so every training sample is used for validation once.

Scoring metric: recall

Since the business objective is minimizing loan defaults, recall for the "Bad" class is prioritized.

The hyperparameter combination with the highest cross-validated recall is selected.
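To show what 5-fold cross-validated recall means in practice, the sketch below computes it twice on synthetic data: once with an explicit loop over folds and once with `cross_val_score`. The dataset, the shuffled `StratifiedKFold` splitter, and the Logistic Regression model are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced stand-in for the training set.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.7, 0.3], random_state=7
)
model_demo = LogisticRegression(max_iter=1000)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

# Manual loop: fit on four folds, measure recall on the held-out fold.
fold_recalls = []
for train_idx, val_idx in cv_demo.split(X_demo, y_demo):
    fold_model = clone(model_demo).fit(X_demo[train_idx], y_demo[train_idx])
    preds = fold_model.predict(X_demo[val_idx])
    fold_recalls.append(recall_score(y_demo[val_idx], preds, pos_label=1))

# The same computation via the scikit-learn helper.
auto_recalls = cross_val_score(
    model_demo, X_demo, y_demo, cv=cv_demo, scoring="recall"
)

print("Manual fold recalls:", np.round(fold_recalls, 3))
print("Mean CV recall:", auto_recalls.mean())
```

GridSearchCV repeats this procedure for every hyperparameter combination and keeps the one with the best mean score. When cv is given as an integer for a classifier, scikit-learn uses stratified folds without shuffling.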

In [None]:
# Section 6 - Hyperparameter Tuning (Random Forest)

param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

grid_rf = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid_rf,
    cv=5,
    scoring='recall',
    n_jobs=-1
)

grid_rf.fit(X_train, y_train)

print("Best RF Parameters:", grid_rf.best_params_)
print("Best RF Recall(CV)", grid_rf.best_score_)

# best_estimator_ is refit on the full training set with the best parameters
best_rf = grid_rf.best_estimator_

joblib.dump(best_rf, '/content/drive/MyDrive/credit-risk-ml-pipeline/Models/random_forest.pkl')


Best RF Parameters: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
Best RF Recall(CV) 0.9397779833487512


['/content/drive/MyDrive/credit-risk-ml-pipeline/Models/random_forest.pkl']

## XGBoost Explanation

XGBoost is a gradient boosting algorithm known for:

- High predictive performance
- Handling complex patterns
- Regularization to reduce overfitting

It is often one of the strongest performers on structured, tabular datasets.

In [None]:
# Section 7 - XGBoost

# NOTE: best_xgb is created by the tuning cell in Section 8; run that cell first.
with mlflow.start_run(run_name="XGBoost_Tuned"):

    best_xgb.fit(X_train, y_train)
    y_pred = best_xgb.predict(X_test)

    recall = recall_score(y_test, y_pred, pos_label=1)
    accuracy = accuracy_score(y_test, y_pred)

    mlflow.log_param("learning_rate", best_xgb.learning_rate)
    mlflow.log_param("max_depth", best_xgb.max_depth)

    mlflow.log_metric("recall", recall)
    mlflow.log_metric("accuracy", accuracy)

    mlflow.sklearn.log_model(best_xgb, name="xgboost_model")

    print("XGB Recall:", recall)



XGB Recall: 1.0


In [None]:
!mlflow ui
!ls mlruns

Backend store URI not provided. Using sqlite:///mlflow.db
Registry store URI not provided. Using backend store URI.
[MLflow] Security middleware enabled with default settings (localhost-only). To allow connections from other hosts, use --host 0.0.0.0 and configure --allowed-hosts and --cors-allowed-origins.
INFO:     Uvicorn running on http://127.0.0.1:5000 (Press CTRL+C to quit)
INFO:     Started parent process [5441]
2026/02/25 18:27:30 INFO mlflow.server.jobs.utils: Starting huey consumer for job function invoke_scorer
2026/02/25 18:27:30 INFO mlflow.server.jobs.utils: Starting huey consumer for job function run_online_trace_scorer
2026/02/25 18:27:30 INFO mlflow.server.jobs.utils: Starting huey consumer for job function optimize_prompts
2026/02/25 18:27:30 INFO mlflow.server.jobs.utils: Starting huey consumer for job function run_online_session_scorer
2026/02/25 18:27:30 INFO mlflow.server.jobs.utils: Starting dedicated Huey consumer for perio

In [None]:
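# Section 8 - Hyperparameter Tuning (XGBoost)

from sklearn.metrics import f1_score, precision_score, roc_auc_score

# NOTE: grid_xgb is not defined anywhere else in this notebook. The search below
# is an assumed reconstruction that mirrors the Random Forest search in Section 6;
# the grid values are illustrative, not the original ones.
param_grid_xgb = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1]
}

grid_xgb = GridSearchCV(
    xgb.XGBClassifier(eval_metric='logloss', random_state=42),
    param_grid_xgb,
    cv=5,
    scoring='recall',
    n_jobs=-1
)

grid_xgb.fit(X_train, y_train)

with mlflow.start_run(run_name="XGBoost_Tuned"):

    best_xgb = grid_xgb.best_estimator_
    best_xgb.fit(X_train, y_train)

    y_pred = best_xgb.predict(X_test)
    y_proba = best_xgb.predict_proba(X_test)[:, 1]

    # Metrics
    recall = recall_score(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba)

    # Log best parameters
    mlflow.log_params(grid_xgb.best_params_)

    # Log metrics
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("f1_score", f1)
    mlflow.log_metric("auc", auc)

    # Log CV score
    mlflow.log_metric("cv_recall", grid_xgb.best_score_)

    # Log model
    mlflow.sklearn.log_model(best_xgb, name="xgboost_model")

    print("XGBoost Logged Successfully")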

## End summary

All four required models are implemented and evaluated.

Hyperparameter Tuning was conducted for:
- Random Forest
- XGBoost


Performance comparison and detailed evaluation will be conducted in the next notebook.

The final production model will be selected based on:
- Recall (primary metric)
- Overall F1-score
- Interpretability
- Deployment feasibility
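The two quantitative criteria can be applied mechanically. The sketch below ranks fitted models by test recall and uses F1 as the tie-breaker. The helper `rank_models`, the synthetic data, and the two demo models are hypothetical; the actual comparison belongs to the next notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split


def rank_models(models, X_eval, y_eval):
    """Return (name, recall, f1) tuples sorted by recall, then F1, best first."""
    rows = []
    for name, model in models.items():
        y_pred = model.predict(X_eval)
        rows.append((
            name,
            recall_score(y_eval, y_pred, pos_label=1),
            f1_score(y_eval, y_pred, pos_label=1),
        ))
    return sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)


# Synthetic stand-in for the processed credit data (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=3)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=3
)

demo_models = {
    "logistic_regression": LogisticRegression(max_iter=1000).fit(X_tr_demo, y_tr_demo),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=3).fit(
        X_tr_demo, y_tr_demo
    ),
}

ranking = rank_models(demo_models, X_te_demo, y_te_demo)
for name, recall_value, f1_value in ranking:
    print(f"{name}: recall={recall_value:.3f}, f1={f1_value:.3f}")
```

Interpretability and deployment feasibility are judgment calls and are left out of the ranking on purpose.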

### Model Performance Overview

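- Logistic Regression: Solid baseline; test recall of 0.845, above the 75% target.
- Decision Tree: Accuracy and recall of 1.0 on the test set and 0.974 cross-validated recall; scores this high should be checked for overfitting or data leakage.
- Random Forest (Tuned): Test recall of 0.983 and cross-validated recall of 0.940.
- XGBoost (Tuned): Test recall of 1.0; its cross-validated recall is logged to MLflow as cv_recall.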