### Module 7: Supervised Learning-1

#### Case Study–3

Domain – Banking/Loan

Focus – Lower NPA (Non-Performing Assets)

Business challenge/requirement

PeerLoanKart is an NBFC (Non-Banking Financial Company) that facilitates peer-to-peer loans.
It connects people who need money (borrowers) with people who have money (investors). As an investor, you want to lend to borrowers whose profiles indicate a high probability of paying you back.
As the ML expert, build a model that predicts whether a borrower will fully repay the loan.


Key issues

Keep NPAs low: PeerLoanKart wants to be very diligent about which borrowers it gives loans to

Data volume

- Approx. 9,578 records in the file loan_borowwer_data.csv


Fields in Data

• credit.policy: 1 if the customer meets the credit underwriting criteria of PeerLoanKart, and 0 otherwise

• purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other")

• int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by PeerLoanKart to be riskier are assigned higher interest rates

• installment: The monthly installments owed by the borrower if the loan is funded

• log.annual.inc: The natural log of the self-reported annual income of the borrower

• dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income)

• fico: The FICO credit score of the borrower

• days.with.cr.line: The number of days the borrower has had a credit line

• revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle)

• revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available)

• inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months

• delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years

• pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments)

• not.fully.paid: The output (target) field. A value of 1 means the borrower did not pay the loan back in full; 0 means the loan was fully paid
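
Two of the fields above are stored in transformed form: int.rate is a proportion and log.annual.inc is a natural log. The following is a minimal sketch of how those encodings could be reversed for reporting. The helper name `decode_fields` and the sample borrower values are hypothetical and are not part of the dataset or the original notebook.

```python
import math

def decode_fields(int_rate, log_annual_inc):
    """Convert the stored encodings back to human-readable units (hypothetical helper)."""
    return {
        # int.rate is stored as a proportion, so 0.11 means 11%
        "int_rate_pct": int_rate * 100.0,
        # log.annual.inc is the natural log of income, so exp() recovers the income
        "annual_income": math.exp(log_annual_inc),
    }

# Made-up borrower: 11% interest rate, self-reported annual income of 60,000
decoded = decode_fields(int_rate=0.11, log_annual_inc=math.log(60000))
print(decoded)
```

On a DataFrame the same idea would apply column-wise, for example with `np.exp(df['log.annual.inc'])`.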

Business benefits

Profits could increase by up to 20%, since NPAs fall when loans are disbursed only to good borrowers


In [5]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Load data
df = pd.read_csv("loan_borowwer_data.csv")

# Optional quick sanity checks
print("Shape:", df.shape)
print("Columns:", list(df.columns))
print("Target distribution:\n", df['not.fully.paid'].value_counts(normalize=True).round(3))


# Basic cleaning and missing values

# Separate numeric and categorical for imputations
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

# Impute numeric columns with median
# (assign the result back: chained fillna(..., inplace=True) is deprecated
#  and may not update df in recent pandas versions)
for col in numeric_cols:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].median())

# Impute categorical with mode (most frequent)
for col in categorical_cols:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].mode().iloc[0])

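# Encode categorical features
# 'purpose' is the main categorical feature described. If others exist, encode them too.
# Note: LabelEncoder assigns arbitrary integer codes, which a linear model reads as an
# ordering; one-hot encoding (e.g. pd.get_dummies) is a common alternative.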
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    label_encoders[col] = le

# Split features and target

X = df.drop(columns=['not.fully.paid'])
y = df['not.fully.paid']  # 1 = not fully paid, 0 = fully paid

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


# Baseline model: Logistic Regression

log_reg = LogisticRegression(max_iter=1000, n_jobs=-1)
log_reg.fit(X_train, y_train)
y_pred_lr = log_reg.predict(X_test)
y_proba_lr = log_reg.predict_proba(X_test)[:, 1]

print("\n=== Logistic Regression Results ===")
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print("ROC-AUC:", roc_auc_score(y_test, y_proba_lr))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))
print("Classification Report:\n", classification_report(y_test, y_pred_lr, digits=3))

# Stronger model: Random Forest

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    # min_samples_split=10,
    # min_samples_leaf=5,
    class_weight="balanced",  # helps with class imbalance
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
y_proba_rf = rf.predict_proba(X_test)[:, 1]

print("\n=== Random Forest Results ===")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("ROC-AUC:", roc_auc_score(y_test, y_proba_rf))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf, digits=3))

# Business-facing summary

def summary_metrics(name, y_true, y_pred, y_proba):
    acc = accuracy_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_proba)
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    print(f"\n{name} Summary:")
    print(f"- Accuracy: {acc:.3f}")
    print(f"- ROC-AUC: {auc:.3f}")

summary_metrics("Logistic Regression", y_test, y_pred_lr, y_proba_lr)
summary_metrics("Random Forest", y_test, y_pred_rf, y_proba_rf)

# Optional: Feature importance (Random Forest)

importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print("\nTop 10 important features (Random Forest):")
print(importances.head(10))
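
The cell above calls `predict`, which uses a 0.5 probability cutoff, and `summary_metrics` unpacks `tn, fp, fn, tp` without reporting them. Since the stated goal is to lower NPAs, one option is to lower the cutoff applied to `predict_proba` so that more likely defaulters are flagged. The following is a minimal sketch of that idea, not the notebook's method. The helper names and the toy labels and probabilities are made up for illustration.

```python
def confusion_at_threshold(y_true, y_proba, threshold):
    """Return (tn, fp, fn, tp) when class 1 is predicted for proba >= threshold."""
    tn = fp = fn = tp = 0
    for truth, proba in zip(y_true, y_proba):
        predicted = 1 if proba >= threshold else 0
        if truth == 1 and predicted == 1:
            tp += 1   # defaulter correctly flagged
        elif truth == 1:
            fn += 1   # defaulter missed (a potential NPA)
        elif predicted == 1:
            fp += 1   # good borrower wrongly flagged
        else:
            tn += 1   # good borrower correctly passed
    return tn, fp, fn, tp

def default_recall(y_true, y_proba, threshold):
    """Share of actual defaulters (class 1) that are flagged at this threshold."""
    tn, fp, fn, tp = confusion_at_threshold(y_true, y_proba, threshold)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy data, invented for illustration only (not taken from the loan dataset)
toy_y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
toy_proba = [0.05, 0.10, 0.20, 0.35, 0.15, 0.30, 0.45, 0.25, 0.60, 0.55]

for threshold in (0.5, 0.25):
    counts = confusion_at_threshold(toy_y, toy_proba, threshold)
    print(threshold, counts, default_recall(toy_y, toy_proba, threshold))
```

In the notebook, the same functions could be called with `y_test` and `y_proba_rf`. A lower cutoff trades more false positives for fewer missed defaulters, and the cutoff would normally be chosen on a separate validation split rather than on the test set.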

Shape: (9578, 14)
Columns: ['credit.policy', 'purpose', 'int.rate', 'installment', 'log.annual.inc', 'dti', 'fico', 'days.with.cr.line', 'revol.bal', 'revol.util', 'inq.last.6mths', 'delinq.2yrs', 'pub.rec', 'not.fully.paid']
Target distribution:
 not.fully.paid
0    0.84
1    0.16
Name: proportion, dtype: float64

=== Logistic Regression Results ===
Accuracy: 0.8392484342379958
ROC-AUC: 0.6543283606262007
Confusion Matrix:
 [[1601    8]
 [ 300    7]]
Classification Report:
               precision    recall  f1-score   support

           0      0.842     0.995     0.912      1609
           1      0.467     0.023     0.043       307

    accuracy                          0.839      1916
   macro avg      0.654     0.509     0.478      1916
weighted avg      0.782     0.839     0.773      1916
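
The accuracy figure above is easier to interpret next to a trivial baseline. The sketch below recomputes the key ratios directly from the Logistic Regression confusion matrix printed above and compares accuracy with the result of always predicting class 0. The variable names are illustrative; the four counts are copied from the output.

```python
# Confusion matrix copied from the Logistic Regression output above
# rows = actual class (0, 1), columns = predicted class (0, 1)
cm = [[1601, 8],
      [300, 7]]
(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tn + tp) / total            # share of correct predictions
majority_baseline = (tn + fp) / total   # accuracy of always predicting class 0
recall_default = tp / (tp + fn)         # share of actual defaulters caught
precision_default = tp / (tp + fp)      # share of flagged borrowers who defaulted

print(f"accuracy          : {accuracy:.3f}")
print(f"majority baseline : {majority_baseline:.3f}")
print(f"recall (class 1)  : {recall_default:.3f}")
print(f"precision (class 1): {precision_default:.3f}")
```

Because accuracy does not exceed the majority baseline and recall on class 1 is very low, accuracy alone says little about how well defaulters are identified, which is the metric that matters most for the NPA goal.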


=== Random Forest Results ===
Accuracy: 0.8392484342379958
ROC-AUC: 0.6669811301656197
Confusion Matrix:
 [[1606    3]
 [ 305    2]]
Classification Report:
               precision    recall  