1. Problem Statement

A financial institution wants to predict whether a customer will default on a loan before approving it. Early identification of risky customers helps reduce financial loss.

You are working as a Machine Learning Analyst and must build a classification model using the K-Nearest Neighbors (KNN) algorithm to predict loan default.

This case introduces:

Mixed feature types

Financial risk interpretation

Class imbalance awareness

Age, Annual Income (lakhs), Credit Score (300-900), Loan Amount (lakhs), Loan Term (years), Employment Type, Default (1 = yes, 0 = no)

28, 6.5, 720, 5, 5, Salaried, 0
45, 12, 680, 10, 10, Self-Employed, 1
35, 8, 750, 6, 7, Salaried, 0
50, 15, 640, 12, 15, Self-Employed, 1
30, 7, 710, 5, 5, Salaried, 0
42, 10, 660, 9, 10, Salaried, 1
26, 5.5, 730, 4, 4, Salaried, 0
48, 14, 650, 11, 12, Self-Employed, 1
38, 9, 700, 7, 8, Salaried, 0
55, 16, 620, 13, 15, Self-Employed, 1



Interpretation

Identify high-risk customers.

What patterns lead to loan default?

How do credit scores and income influence predictions?

Suggest banking policies based on model output.

Compare KNN with Decision Trees for this problem.

What happens if Loan Amount dominates the distance calculation?

Should KNN be used in real-time loan approval systems?

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


In [3]:
data = {
    "Age": [28,45,35,50,30,42,26,48,38,55],
    "Income": [6.5,12,8,15,7,10,5.5,14,9,16],
    "CreditScore": [720,680,750,640,710,660,730,650,700,620],
    "LoanAmount": [5,10,6,12,5,9,4,11,7,13],
    "LoanTerm": [5,10,7,15,5,10,4,12,8,15],
    "EmploymentType": ["Salaried","Self-Employed","Salaried","Self-Employed",
                       "Salaried","Salaried","Salaried","Self-Employed",
                       "Salaried","Self-Employed"],
    "Default": [0,1,0,1,0,1,0,1,0,1]
}

df = pd.DataFrame(data)


In [4]:
# Encode the binary categorical feature as 0/1 so it can enter the distance calculation
df["EmploymentType"] = df["EmploymentType"].map({
    "Salaried": 0,
    "Self-Employed": 1
})


In [5]:
X = df.drop("Default", axis=1)
y = df["Default"]
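
The case lists class imbalance awareness as a topic, so it may be worth checking the class distribution before splitting. The sketch below copies the target column from the data cell above; the names `class_counts` and `class_share` are illustrative. In this ten-row sample the two classes turn out to be evenly balanced (5 and 5), so imbalance handling is not needed here, although real loan portfolios usually contain far fewer defaults than non-defaults.

```python
import pandas as pd

# Target column copied from the data cell above (1 = default, 0 = no default)
y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1, 0, 1], name="Default")

# Absolute and relative frequency of each class
class_counts = y.value_counts().sort_index()
class_share = y.value_counts(normalize=True).sort_index()

print(class_counts)
print(class_share)
```
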


In [6]:
# Hold out 30% of the rows (3 of 10); stratify=y keeps both classes present in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)


In [7]:
scaler = StandardScaler()

# Fit the scaler on the training rows only, then reuse it on the test rows to avoid leakage
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
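
To see why scaling matters for a distance-based model, the sketch below measures how much each numeric feature contributes to the squared Euclidean distance between the first two customers, before and after standardization. It is only an illustration: it standardizes all ten rows at once rather than the training split, and `squared_diff_share` is a helper name introduced here. With the units used in this dataset, the raw distance is dominated by CreditScore (roughly four fifths of the total) because it has the widest numeric range; Loan Amount would dominate in the same way if it were recorded in rupees instead of lakhs. After scaling, no single feature contributes more than about a third.

```python
import numpy as np

feature_names = ["Age", "Income", "CreditScore", "LoanAmount", "LoanTerm"]

# Numeric columns copied from the dataset above (EmploymentType left out for simplicity)
X = np.array([
    [28, 6.5, 720, 5, 5],
    [45, 12, 680, 10, 10],
    [35, 8, 750, 6, 7],
    [50, 15, 640, 12, 15],
    [30, 7, 710, 5, 5],
    [42, 10, 660, 9, 10],
    [26, 5.5, 730, 4, 4],
    [48, 14, 650, 11, 12],
    [38, 9, 700, 7, 8],
    [55, 16, 620, 13, 15],
], dtype=float)


def squared_diff_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()


# Contribution of each feature on the raw scale
raw_share = squared_diff_share(X[0], X[1])

# Standardize with the population standard deviation, as StandardScaler does
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_share = squared_diff_share(Z[0], Z[1])

for name, r, s in zip(feature_names, raw_share, scaled_share):
    print(f"{name:12s} raw share = {r:.2f}   scaled share = {s:.2f}")
```
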


In [8]:
# k = 3 neighbors, Euclidean distance measured on the standardized features
knn = KNeighborsClassifier(
    n_neighbors=3,
    metric='euclidean'
)

knn.fit(X_train_scaled, y_train)
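
The value `n_neighbors=3` is a choice rather than a given. One possible way to compare a few candidates on such a small sample is leave-one-out cross-validation, sketched below. This is not part of the original workflow: the pipeline, the candidate list `(1, 3, 5)`, and the name `loo_accuracy` are assumptions made for illustration. The scaler sits inside the pipeline so that it is refit on each training fold.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Columns: Age, Income, CreditScore, LoanAmount, LoanTerm, EmploymentType (0 = Salaried, 1 = Self-Employed)
X = np.array([
    [28, 6.5, 720, 5, 5, 0],
    [45, 12, 680, 10, 10, 1],
    [35, 8, 750, 6, 7, 0],
    [50, 15, 640, 12, 15, 1],
    [30, 7, 710, 5, 5, 0],
    [42, 10, 660, 9, 10, 0],
    [26, 5.5, 730, 4, 4, 0],
    [48, 14, 650, 11, 12, 1],
    [38, 9, 700, 7, 8, 0],
    [55, 16, 620, 13, 15, 1],
], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

loo_accuracy = {}
for k in (1, 3, 5):
    # Scaling happens inside the pipeline, so each fold is scaled using its own training rows
    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=k, metric="euclidean"),
    )
    # Each of the 10 rows is held out once; the mean of the 0/1 scores is the accuracy
    scores = cross_val_score(model, X, y, cv=LeaveOneOut())
    loo_accuracy[k] = scores.mean()

for k, acc in loo_accuracy.items():
    print(f"k={k}: leave-one-out accuracy = {acc:.2f}")
```

With only ten rows, differences between values of k should be read as a rough indication, not as evidence that one setting is better.
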


In [9]:
y_pred = knn.predict(X_test_scaled)


In [10]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 1.0

Confusion Matrix:
 [[2 0]
 [0 1]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         2
           1       1.00      1.00      1.00         1

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3
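
For the question on comparing KNN with Decision Trees, a small tree fitted to the same ten rows gives a feel for the difference in interpretability. The sketch below is an illustration under assumptions: `max_depth=2` and `random_state=42` are arbitrary choices, and the tree is fitted on all ten rows without a hold-out set. Because every numeric column separates the defaulters from the non-defaulters on its own in this sample, a single split is enough, and which column the tree picks among the tied candidates is arbitrary. The perfect score therefore says more about the toy data than about either model.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Same records as the data cell above; EmploymentType already encoded (0 = Salaried, 1 = Self-Employed)
df = pd.DataFrame({
    "Age": [28, 45, 35, 50, 30, 42, 26, 48, 38, 55],
    "Income": [6.5, 12, 8, 15, 7, 10, 5.5, 14, 9, 16],
    "CreditScore": [720, 680, 750, 640, 710, 660, 730, 650, 700, 620],
    "LoanAmount": [5, 10, 6, 12, 5, 9, 4, 11, 7, 13],
    "LoanTerm": [5, 10, 7, 15, 5, 10, 4, 12, 8, 15],
    "EmploymentType": [0, 1, 0, 1, 0, 0, 0, 1, 0, 1],
    "Default": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

X = df.drop(columns="Default")
y = df["Default"]

# Trees split on thresholds, so no feature scaling is needed
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X, y)

# Human-readable if/else rules, which KNN cannot provide
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
print("Tree depth:", tree.get_depth())
print("Training accuracy:", tree.score(X, y))
```
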



In [11]:
new_customer = pd.DataFrame({
    "Age": [40],
    "Income": [9],
    "CreditScore": [690],
    "LoanAmount": [8],
    "LoanTerm": [10],
    "EmploymentType": [0]  # Salaried
})

new_customer_scaled = scaler.transform(new_customer)

prediction = knn.predict(new_customer_scaled)

print("Loan Default Prediction:", "Yes" if prediction[0] == 1 else "No")


Loan Default Prediction: No
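
A bare "No" hides how close the decision was. The sketch below rebuilds the same pipeline in a self-contained form and then lists the three training customers that voted, together with the vote share returned by `predict_proba`. The variable names `neighbors` and `proba` are introduced here for illustration. With k = 3 the vote share can only be 0, 1/3, 2/3 or 1, and this applicant's credit score of 690 lies between the highest defaulter (680) and the lowest non-defaulter (700) in the data, so the result is best treated as a borderline case.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Same records and encoding as the cells above (0 = Salaried, 1 = Self-Employed)
df = pd.DataFrame({
    "Age": [28, 45, 35, 50, 30, 42, 26, 48, 38, 55],
    "Income": [6.5, 12, 8, 15, 7, 10, 5.5, 14, 9, 16],
    "CreditScore": [720, 680, 750, 640, 710, 660, 730, 650, 700, 620],
    "LoanAmount": [5, 10, 6, 12, 5, 9, 4, 11, 7, 13],
    "LoanTerm": [5, 10, 7, 15, 5, 10, 4, 12, 8, 15],
    "EmploymentType": [0, 1, 0, 1, 0, 0, 0, 1, 0, 1],
    "Default": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
X = df.drop(columns="Default")
y = df["Default"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(scaler.transform(X_train), y_train)

new_customer = pd.DataFrame({
    "Age": [40], "Income": [9], "CreditScore": [690],
    "LoanAmount": [8], "LoanTerm": [10], "EmploymentType": [0],
})
new_scaled = scaler.transform(new_customer)

# Positions (within the training set) and scaled distances of the 3 nearest neighbors
distances, indices = knn.kneighbors(new_scaled)
neighbors = X_train.iloc[indices[0]].assign(
    Default=y_train.iloc[indices[0]],
    Distance=distances[0],
)
print(neighbors)

# Share of neighbors voting for each class: columns follow knn.classes_ (0, then 1)
proba = knn.predict_proba(new_scaled)
print("Vote share [no default, default]:", proba[0])
```
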
