Problem Statement

A financial institution wants to predict whether a customer will default on a loan before approving it. Early identification of risky customers helps reduce financial loss.

You are working as a Machine Learning Analyst and must build a classification model using the K-Nearest Neighbors (KNN) algorithm to predict loan default.

This case introduces:

- Mixed feature types
- Financial risk interpretation
- Class imbalance awareness

Age,AnnualIncome(lakhs),CreditScore(300-900),LoanAmount(lakhs),LoanTerm(years),EmploymentType,loan(1=default; 0=no default)
28,6.5,720,5,5,Salaried,0
45,12,680,10,10,Self-Employed,1
35,8,750,6,7,Salaried,0
50,15,640,12,15,Self-Employed,1
30,7,710,5,5,Salaried,0
42,10,660,9,10,Salaried,1
26,5.5,730,4,4,Salaried,0
48,14,650,11,12,Self-Employed,1
38,9,700,7,8,Salaried,0
55,16,620,13,15,Self-Employed,1

Interpretation

1. Identify high-risk customers.
2. What patterns lead to loan default?
3. How do credit score and income influence predictions?
4. Suggest banking policies based on model output.
5. Compare KNN with Decision Trees for this problem.
6. What happens if LoanAmount dominates distance calculation?
7. Should KNN be used in real-time loan approval systems?

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_auc_score

In [2]:
data = pd.DataFrame({
    "Age": [28,45,35,50,30,42,26,48,38,55],
    "AnnualIncome": [6.5,12,8,15,7,10,5.5,14,9,16],
    "CreditScore": [720,680,750,640,710,660,730,650,700,620],
    "LoanAmount": [5,10,6,12,5,9,4,11,7,13],
    "LoanTerm": [5,10,7,15,5,10,4,12,8,15],
    "EmploymentType": ["Salaried","Self-Employed","Salaried","Self-Employed","Salaried",
                       "Salaried","Salaried","Self-Employed","Salaried","Self-Employed"],
    "loan": [0,1,0,1,0,1,0,1,0,1]  # 1 = default, 0 = non-default
})
data

,Age,AnnualIncome,CreditScore,LoanAmount,LoanTerm,EmploymentType,loan
0,28,6.5,720,5,5,Salaried,0
1,45,12.0,680,10,10,Self-Employed,1
2,35,8.0,750,6,7,Salaried,0
3,50,15.0,640,12,15,Self-Employed,1
4,30,7.0,710,5,5,Salaried,0
5,42,10.0,660,9,10,Salaried,1
6,26,5.5,730,4,4,Salaried,0
7,48,14.0,650,11,12,Self-Employed,1
8,38,9.0,700,7,8,Salaried,0
9,55,16.0,620,13,15,Self-Employed,1


In [3]:
data.shape

(10, 7)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             10 non-null     int64  
 1   AnnualIncome    10 non-null     float64
 2   CreditScore     10 non-null     int64  
 3   LoanAmount      10 non-null     int64  
 4   LoanTerm        10 non-null     int64  
 5   EmploymentType  10 non-null     object 
 6   loan            10 non-null     int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 692.0+ bytes


In [5]:
display(data.describe(include="all"))

print("\nClass counts (loan):")
display(data["loan"].value_counts())

print("\nClass proportion:")
display(data["loan"].value_counts(normalize=True))

,Age,AnnualIncome,CreditScore,LoanAmount,LoanTerm,EmploymentType,loan
count,10.0,10.0,10.0,10.0,10.0,10,10.0
unique,,,,,,2,
top,,,,,,Salaried,
freq,,,,,,6,
mean,39.7,10.3,686.0,8.2,9.1,,0.5
std,9.922477,3.750556,42.739521,3.224903,4.012481,,0.527046
min,26.0,5.5,620.0,4.0,4.0,,0.0
25%,31.25,7.25,652.5,5.25,5.5,,0.0
50%,40.0,9.5,690.0,8.0,9.0,,0.5
75%,47.25,13.5,717.5,10.75,11.5,,1.0



Class counts (loan):


loan
0    5
1    5
Name: count, dtype: int64


Class proportion:


loan
0    0.5
1    0.5
Name: proportion, dtype: float64

In [6]:
X = data.drop(columns=["loan"])
y = data["loan"]

numeric_features = ["Age", "AnnualIncome", "CreditScore", "LoanAmount", "LoanTerm"]
categorical_features = ["EmploymentType"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(drop="first"), categorical_features)
    ],
    remainder="drop"
)
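Question 6 asks what happens when one feature dominates the distance calculation. The sketch below, using only the standard library and the numeric columns from the dataset, compares how much each feature contributes to the squared Euclidean distance between customers 0 and 5 before and after z-score scaling. The helper `distance_shares` is a hypothetical name introduced for illustration, and the scaling here uses full-sample statistics for simplicity, whereas the pipeline refits the scaler inside each cross-validation fold. In these raw units it is CreditScore, not LoanAmount, that has the widest range, but the same reasoning applies to whichever feature has the largest scale.

```python
from statistics import pstdev

# Numeric columns copied from the dataset above.
columns = {
    "Age": [28, 45, 35, 50, 30, 42, 26, 48, 38, 55],
    "AnnualIncome": [6.5, 12, 8, 15, 7, 10, 5.5, 14, 9, 16],
    "CreditScore": [720, 680, 750, 640, 710, 660, 730, 650, 700, 620],
    "LoanAmount": [5, 10, 6, 12, 5, 9, 4, 11, 7, 13],
    "LoanTerm": [5, 10, 7, 15, 5, 10, 4, 12, 8, 15],
}


def distance_shares(cols, i, j, standardize):
    """Fraction of the squared Euclidean distance between rows i and j
    that each feature contributes."""
    squared = {}
    for name, values in cols.items():
        diff = values[i] - values[j]
        if standardize:
            # Population standard deviation (ddof=0), as StandardScaler uses.
            diff = diff / pstdev(values)
        squared[name] = diff ** 2
    total = sum(squared.values())
    return {name: sq / total for name, sq in squared.items()}


raw_shares = distance_shares(columns, 0, 5, standardize=False)
scaled_shares = distance_shares(columns, 0, 5, standardize=True)

for name in columns:
    print(f"{name:13s} raw={raw_shares[name]:.3f}  scaled={scaled_shares[name]:.3f}")
```

Without scaling, one feature accounts for more than 90% of the distance for this pair, so the nearest neighbours are effectively chosen on that feature alone. After scaling, no single feature contributes more than about a quarter.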

In [7]:
knn = KNeighborsClassifier(
    n_neighbors=3,
    weights="distance",
    metric="minkowski"  # with the default p=2 this is Euclidean distance
)

model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("knn", knn)
])

In [8]:
loo = LeaveOneOut()

# predicted class
y_pred = cross_val_predict(model, X, y, cv=loo, method="predict")

# predicted probability of default (class 1)
y_proba = cross_val_predict(model, X, y, cv=loo, method="predict_proba")[:, 1]

print("LOOCV Accuracy:", accuracy_score(y, y_pred))
print("LOOCV ROC-AUC :", roc_auc_score(y, y_proba))

print("\nConfusion Matrix:")
print(confusion_matrix(y, y_pred))

print("\nClassification Report:")
print(classification_report(y, y_pred, digits=3))

LOOCV Accuracy: 1.0
LOOCV ROC-AUC : 1.0

Confusion Matrix:
[[5 0]
 [0 5]]

Classification Report:
              precision    recall  f1-score   support

           0      1.000     1.000     1.000         5
           1      1.000     1.000     1.000         5

    accuracy                          1.000        10
   macro avg      1.000     1.000     1.000        10
weighted avg      1.000     1.000     1.000        10
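The scores above rest on only 10 held-out predictions, one per customer. As a rough illustration of what `LeaveOneOut` does, the sketch below generates the same kind of splits in plain Python. `leave_one_out_splits` is a hypothetical helper written for this note, not part of scikit-learn.

```python
def leave_one_out_splits(n_samples):
    """Yield (train_indices, held_out_index) pairs, one per sample."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, held_out


splits = list(leave_one_out_splits(10))
print("folds:", len(splits))
print("first fold:", splits[0])
```

Each fold trains on 9 rows and tests on 1. Because the scaler and encoder sit inside the pipeline, they are refit on those 9 rows in every fold, so the held-out customer does not influence the preprocessing.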



In [9]:
results = data.copy()
results["pred_default_proba"] = y_proba
results["pred_label"] = y_pred

# High risk threshold (bank policy choice)
threshold = 0.60
results["high_risk_flag"] = (results["pred_default_proba"] >= threshold).astype(int)

# Sort by risk
results_sorted = results.sort_values("pred_default_proba", ascending=False)
results_sorted

,Age,AnnualIncome,CreditScore,LoanAmount,LoanTerm,EmploymentType,loan,pred_default_proba,pred_label,high_risk_flag
7,48,14.0,650,11,12,Self-Employed,1,1.0,1,1
3,50,15.0,640,12,15,Self-Employed,1,1.0,1,1
9,55,16.0,620,13,15,Self-Employed,1,1.0,1,1
1,45,12.0,680,10,10,Self-Employed,1,0.758308,1,1
5,42,10.0,660,9,10,Salaried,1,0.639388,1,1
8,38,9.0,700,7,8,Salaried,0,0.337107,0,0
0,28,6.5,720,5,5,Salaried,0,0.0,0,0
2,35,8.0,750,6,7,Salaried,0,0.0,0,0
4,30,7.0,710,5,5,Salaried,0,0.0,0,0
6,26,5.5,730,4,4,Salaried,0,0.0,0,0
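Fractional probabilities such as 0.758308 in the table above come from `weights="distance"`: each of the k neighbours votes with weight 1/distance, and the probability of default is the weighted share of neighbours labelled 1. A minimal sketch of that rule follows. The function name and the example distances are made up for illustration and are not taken from the fitted model.

```python
def weighted_default_probability(neighbours):
    """neighbours: list of (distance, label) pairs, label 1 = default.
    Returns the inverse-distance-weighted share of default neighbours."""
    # Neighbours at distance 0 take all the weight; ties among them are averaged.
    exact = [label for dist, label in neighbours if dist == 0]
    if exact:
        return sum(exact) / len(exact)
    weights = [(1.0 / dist, label) for dist, label in neighbours]
    total = sum(w for w, _ in weights)
    return sum(w for w, label in weights if label == 1) / total


# Hypothetical neighbours: two defaulters (one close, one far) and one non-defaulter.
example = [(1.0, 1), (2.0, 1), (2.0, 0)]
print(weighted_default_probability(example))  # 0.75
```

With uniform weights the same three neighbours would give 2/3. Distance weighting pulls the estimate toward the label of the closest neighbour.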


In [10]:
k_values = [1,3,5,7,9]
scores = []

for k in k_values:
    model_k = Pipeline(steps=[
        ("preprocess", preprocess),
        ("knn", KNeighborsClassifier(n_neighbors=k, weights="distance"))
    ])
    proba_k = cross_val_predict(model_k, X, y, cv=loo, method="predict_proba")[:, 1]
    auc = roc_auc_score(y, proba_k)
    scores.append((k, auc))

scores_df = pd.DataFrame(scores, columns=["k", "LOOCV_ROC_AUC"]).sort_values("LOOCV_ROC_AUC", ascending=False)
scores_df

,k,LOOCV_ROC_AUC
0,1,1.0
1,3,1.0
2,5,1.0
3,7,1.0
4,9,1.0
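One likely reason every k scores 1.0 is that, in this 10-row sample, each numeric feature separates the two classes on its own. The sketch below checks this with a single-threshold search per feature, which is the basic split a decision tree performs (relevant to question 5). It is a hand-rolled illustration, not scikit-learn's `DecisionTreeClassifier`, and `best_threshold` is a hypothetical helper.

```python
columns = {
    "Age": [28, 45, 35, 50, 30, 42, 26, 48, 38, 55],
    "AnnualIncome": [6.5, 12, 8, 15, 7, 10, 5.5, 14, 9, 16],
    "CreditScore": [720, 680, 750, 640, 710, 660, 730, 650, 700, 620],
    "LoanAmount": [5, 10, 6, 12, 5, 9, 4, 11, 7, 13],
    "LoanTerm": [5, 10, 7, 15, 5, 10, 4, 12, 8, 15],
}
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # 1 = default


def best_threshold(values, targets):
    """Return (threshold, rule, errors) for the one split on a feature that
    misclassifies the fewest rows. rule 'above' predicts default when
    value > threshold; rule 'below' predicts default when value <= threshold."""
    distinct = sorted(set(values))
    best = None
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2  # midpoint between adjacent values
        for rule in ("above", "below"):
            predictions = [int((v > threshold) == (rule == "above")) for v in values]
            errors = sum(p != y for p, y in zip(predictions, targets))
            if best is None or errors < best[2]:
                best = (threshold, rule, errors)
    return best


stumps = {name: best_threshold(vals, labels) for name, vals in columns.items()}
for name, (threshold, rule, errors) in stumps.items():
    print(f"{name:13s} default if {rule} {threshold}: {errors} errors")
```

Every feature yields a zero-error split here, so both KNN and a one-level tree classify this sample perfectly. That says more about the sample than about either algorithm, and a larger, noisier dataset would be needed to compare them meaningfully.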


In [11]:
# ===== Final "Model Summary" output cell (KNN has no closed-form equation) =====

FINAL_K = model.named_steps["knn"].n_neighbors
FINAL_WEIGHTS = model.named_steps["knn"].weights
FINAL_METRIC = model.named_steps["knn"].metric

THRESHOLD = 0.60  # we can change as per bank policy

print("===== FINAL KNN MODEL SUMMARY =====")
print(f"K (n_neighbors)   : {FINAL_K}")
print(f"Weights           : {FINAL_WEIGHTS}")
print(f"Distance metric   : {FINAL_METRIC}")
print("Preprocessing     : StandardScaler (numeric) + OneHotEncoder (EmploymentType)")
print(f"High-risk threshold (P(default) >= {THRESHOLD})")

# LOOCV evaluation (recompute here so this cell is self-contained)
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix, classification_report

loo = LeaveOneOut()

y_pred_final = cross_val_predict(model, X, y, cv=loo, method="predict")
y_proba_final = cross_val_predict(model, X, y, cv=loo, method="predict_proba")[:, 1]

print("\n---- LOOCV METRICS ----")
print("Accuracy:", accuracy_score(y, y_pred_final))
print("ROC-AUC :", roc_auc_score(y, y_proba_final))

print("\n----- CONFUSION MATRIX -----")
print(confusion_matrix(y, y_pred_final))

print("\n--- CLASSIFICATION REPORT ---")
print(classification_report(y, y_pred_final, digits=3))

# High-risk customers table
final_results = data.copy()
final_results["pred_default_proba"] = y_proba_final
final_results["pred_label"] = y_pred_final
# Column name must not carry a trailing space, or later lookups by "high_risk_flag" fail
final_results["high_risk_flag"] = (final_results["pred_default_proba"] >= THRESHOLD).astype(int)

print("\n===== HIGH-RISK CUSTOMERS (sorted) =====")
display(final_results.sort_values("pred_default_proba", ascending=False))

===== FINAL KNN MODEL SUMMARY =====
K (n_neighbors)   : 3
Weights           : distance
Distance metric   : minkowski
Preprocessing     : StandardScaler (numeric) + OneHotEncoder (EmploymentType)
High-risk threshold (P(default) >= 0.6)

---- LOOCV METRICS ----
Accuracy: 1.0
ROC-AUC : 1.0

----- CONFUSION MATRIX -----
[[5 0]
 [0 5]]

--- CLASSIFICATION REPORT ---
              precision    recall  f1-score   support

           0      1.000     1.000     1.000         5
           1      1.000     1.000     1.000         5

    accuracy                          1.000        10
   macro avg      1.000     1.000     1.000        10
weighted avg      1.000     1.000     1.000        10


===== HIGH-RISK CUSTOMERS (sorted) =====


,Age,AnnualIncome,CreditScore,LoanAmount,LoanTerm,EmploymentType,loan,pred_default_proba,pred_label,high_risk_flag
7,48,14.0,650,11,12,Self-Employed,1,1.0,1,1
3,50,15.0,640,12,15,Self-Employed,1,1.0,1,1
9,55,16.0,620,13,15,Self-Employed,1,1.0,1,1
1,45,12.0,680,10,10,Self-Employed,1,0.758308,1,1
5,42,10.0,660,9,10,Salaried,1,0.639388,1,1
8,38,9.0,700,7,8,Salaried,0,0.337107,0,0
0,28,6.5,720,5,5,Salaried,0,0.0,0,0
2,35,8.0,750,6,7,Salaried,0,0.0,0,0
4,30,7.0,710,5,5,Salaried,0,0.0,0,0
6,26,5.5,730,4,4,Salaried,0,0.0,0,0
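The 0.60 cut-off is a policy choice, so it helps to see how the flags move when it changes. The sketch below reuses the LOOCV probabilities and actual labels from the table above (listed in index order 0 to 9) and counts flagged customers, false alarms, and missed defaults at a few thresholds. `flag_counts` and the candidate thresholds are illustrative assumptions.

```python
# (predicted probability of default, actual label) for customers 0..9,
# copied from the table above.
scored = [
    (0.0, 0), (0.758308, 1), (0.0, 0), (1.0, 1), (0.0, 0),
    (0.639388, 1), (0.0, 0), (1.0, 1), (0.337107, 0), (1.0, 1),
]


def flag_counts(scored_rows, threshold):
    """Count flagged customers, false alarms, and missed defaults."""
    flagged = [(p, y) for p, y in scored_rows if p >= threshold]
    true_positives = sum(y for _, y in flagged)
    total_defaults = sum(y for _, y in scored_rows)
    return {
        "flagged": len(flagged),
        "false_alarms": len(flagged) - true_positives,
        "missed_defaults": total_defaults - true_positives,
    }


for t in (0.30, 0.60, 0.70, 0.80):
    print(t, flag_counts(scored, t))
```

On this sample, 0.60 flags exactly the five defaulters. Lowering the threshold to 0.30 adds one false alarm (customer 8), while raising it to 0.70 or 0.80 misses one or two actual defaults.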
