# Loan Default Prediction using KNN



**A financial institution wants to predict whether a customer will default on a loan before approving it. Early identification of risky customers helps reduce financial loss.**

**You are working as a Machine Learning Analyst and must build a classification model using the K-Nearest Neighbors (KNN) algorithm to predict loan default.**

**This case introduces:**

- **Mixed feature types**
- **Financial risk interpretation**
- **Class imbalance awareness**

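| Age | Annual Income (lakhs) | Credit Score (300-900) | Loan Amount (lakhs) | Loan Term (years) | Employment Type | Loan Default (1 = yes, 0 = no) |
|-----|-----------------------|------------------------|---------------------|-------------------|-----------------|--------------------------------|
| 28 | 6.5 | 720 | 5 | 5 | Salaried | 0 |
| 45 | 12 | 680 | 10 | 10 | Self-Employed | 1 |
| 35 | 8 | 750 | 6 | 7 | Salaried | 0 |
| 50 | 15 | 640 | 12 | 15 | Self-Employed | 1 |
| 30 | 7 | 710 | 5 | 5 | Salaried | 0 |
| 42 | 10 | 660 | 9 | 10 | Salaried | 1 |
| 26 | 5.5 | 730 | 4 | 4 | Salaried | 0 |
| 48 | 14 | 650 | 11 | 12 | Self-Employed | 1 |
| 38 | 9 | 700 | 7 | 8 | Salaried | 0 |
| 55 | 16 | 620 | 13 | 15 | Self-Employed | 1 |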

Import Required Libraries

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Create the Dataset

In [None]:
data = {
    "Age": [28, 45, 35, 50, 30, 42, 26, 48, 38, 55],
    "Annual_Income": [6.5, 12, 8, 15, 7, 10, 5.5, 14, 9, 16],
    "Credit_Score": [720, 680, 750, 640, 710, 660, 730, 650, 700, 620],
    "Loan_Amount": [5, 10, 6, 12, 5, 9, 4, 11, 7, 13],
    "Loan_Term": [5, 10, 7, 15, 5, 10, 4, 12, 8, 15],
    "Employment_Type": ["Salaried", "Self-Employed", "Salaried", "Self-Employed",
                        "Salaried", "Salaried", "Salaried", "Self-Employed",
                        "Salaried", "Self-Employed"],
    "Loan_Default": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
}

df = pd.DataFrame(data)
print(df)

   Age  Annual_Income  Credit_Score  Loan_Amount  Loan_Term Employment_Type  \
0   28            6.5           720            5          5        Salaried   
1   45           12.0           680           10         10   Self-Employed   
2   35            8.0           750            6          7        Salaried   
3   50           15.0           640           12         15   Self-Employed   
4   30            7.0           710            5          5        Salaried   
5   42           10.0           660            9         10        Salaried   
6   26            5.5           730            4          4        Salaried   
7   48           14.0           650           11         12   Self-Employed   
8   38            9.0           700            7          8        Salaried   
9   55           16.0           620           13         15   Self-Employed   

   Loan_Default  
0             0  
1             1  
2             0  
3             1  
4             0  
5             1  
6             0  
7             1  
8             0  
9             1  

In [None]:
df

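|   | Age | Annual_Income | Credit_Score | Loan_Amount | Loan_Term | Employment_Type | Loan_Default |
|---|-----|---------------|--------------|-------------|-----------|-----------------|--------------|
| 0 | 28 | 6.5 | 720 | 5 | 5 | Salaried | 0 |
| 1 | 45 | 12.0 | 680 | 10 | 10 | Self-Employed | 1 |
| 2 | 35 | 8.0 | 750 | 6 | 7 | Salaried | 0 |
| 3 | 50 | 15.0 | 640 | 12 | 15 | Self-Employed | 1 |
| 4 | 30 | 7.0 | 710 | 5 | 5 | Salaried | 0 |
| 5 | 42 | 10.0 | 660 | 9 | 10 | Salaried | 1 |
| 6 | 26 | 5.5 | 730 | 4 | 4 | Salaried | 0 |
| 7 | 48 | 14.0 | 650 | 11 | 12 | Self-Employed | 1 |
| 8 | 38 | 9.0 | 700 | 7 | 8 | Salaried | 0 |
| 9 | 55 | 16.0 | 620 | 13 | 15 | Self-Employed | 1 |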
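Since the case brief lists class imbalance awareness, it may be worth checking the target distribution before modelling. The sketch below rebuilds only the `Loan_Default` column from the data above; the names `class_counts` and `class_share` are introduced here for illustration. On this sample the two classes are balanced, with five rows each.

```python
import pandas as pd

# Target column copied from the dataset above (1 = default, 0 = no default)
loan_default = pd.Series([0, 1, 0, 1, 0, 1, 0, 1, 0, 1], name="Loan_Default")

# Absolute and relative frequency of each class, ordered by class label
class_counts = loan_default.value_counts().sort_index()
class_share = loan_default.value_counts(normalize=True).sort_index()

print("Counts per class:")
print(class_counts)
print("Share per class:")
print(class_share)
```

A strongly skewed share would be a signal to look beyond plain accuracy when evaluating the model.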


Encode Categorical Data

In [None]:
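# LabelEncoder assigns integer codes in alphabetical order:
# "Salaried" -> 0, "Self-Employed" -> 1
le = LabelEncoder()
df["Employment_Type"] = le.fit_transform(df["Employment_Type"])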

df

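|   | Age | Annual_Income | Credit_Score | Loan_Amount | Loan_Term | Employment_Type | Loan_Default |
|---|-----|---------------|--------------|-------------|-----------|-----------------|--------------|
| 0 | 28 | 6.5 | 720 | 5 | 5 | 0 | 0 |
| 1 | 45 | 12.0 | 680 | 10 | 10 | 1 | 1 |
| 2 | 35 | 8.0 | 750 | 6 | 7 | 0 | 0 |
| 3 | 50 | 15.0 | 640 | 12 | 15 | 1 | 1 |
| 4 | 30 | 7.0 | 710 | 5 | 5 | 0 | 0 |
| 5 | 42 | 10.0 | 660 | 9 | 10 | 0 | 1 |
| 6 | 26 | 5.5 | 730 | 4 | 4 | 0 | 0 |
| 7 | 48 | 14.0 | 650 | 11 | 12 | 1 | 1 |
| 8 | 38 | 9.0 | 700 | 7 | 8 | 0 | 0 |
| 9 | 55 | 16.0 | 620 | 13 | 15 | 1 | 1 |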
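To make the encoding explicit, the fitted encoder can be asked which integer it assigned to each category. This is a small sketch on a shortened copy of the `Employment_Type` values; the `mapping` dictionary is a name introduced here. Note that the scikit-learn documentation describes `LabelEncoder` as intended for target labels; for feature columns, `OrdinalEncoder` or `OneHotEncoder` are the usual alternatives, although with only two categories the result is the same 0/1 column.

```python
from sklearn.preprocessing import LabelEncoder

# Shortened copy of the Employment_Type column, for illustration only
employment = ["Salaried", "Self-Employed", "Salaried", "Self-Employed"]

le = LabelEncoder()
codes = le.fit_transform(employment)

# classes_ holds the categories in sorted order; their positions are the codes
mapping = {str(label): int(code)
           for label, code in zip(le.classes_, le.transform(le.classes_))}

print(mapping)
print(codes.tolist())
```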


Split Features and Target

In [None]:
X = df.drop("Loan_Default", axis=1)
y = df["Loan_Default"]

Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
print(len(y_test))

3
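With only three test rows, a purely random split can easily put an uneven class mix into the test set. One option, not used in the original notebook, is the `stratify` argument of `train_test_split`, which keeps class proportions as close as possible in both parts. A minimal sketch, using the real target values but placeholder features (row ids):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Target copied from the dataset; the single feature is a placeholder (row id)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y asks the splitter to preserve the class proportions of y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Train class counts:", np.bincount(y_train))
print("Test class counts :", np.bincount(y_test))
```

With 10 rows and a 30% test share the proportions cannot be exact, but both classes are guaranteed to appear in the test set.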


Feature Scaling

In [None]:
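scaler = StandardScaler()

# Fit the scaler on the training data only, then reuse its mean and
# standard deviation on the test data so no test information leaks in
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)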
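Scaling matters for KNN because distances are dominated by whichever feature has the largest numeric range. The sketch below compares the first two customers and measures how much of their squared Euclidean distance comes from `Credit_Score` before and after standardization. For simplicity it fits the scaler on all ten rows, which differs from the train-only fit above; `credit_score_share` is a helper introduced here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Numeric columns: Age, Annual_Income, Credit_Score, Loan_Amount, Loan_Term
X = np.array([
    [28, 6.5, 720, 5, 5],
    [45, 12, 680, 10, 10],
    [35, 8, 750, 6, 7],
    [50, 15, 640, 12, 15],
    [30, 7, 710, 5, 5],
    [42, 10, 660, 9, 10],
    [26, 5.5, 730, 4, 4],
    [48, 14, 650, 11, 12],
    [38, 9, 700, 7, 8],
    [55, 16, 620, 13, 15],
], dtype=float)


def credit_score_share(a, b):
    """Fraction of the squared Euclidean distance that comes from column 2."""
    sq_diff = (a - b) ** 2
    return sq_diff[2] / sq_diff.sum()


# Illustration only: scaler fitted on all rows rather than on a training split
X_scaled = StandardScaler().fit_transform(X)

share_raw = credit_score_share(X[0], X[1])
share_scaled = credit_score_share(X_scaled[0], X_scaled[1])

print(f"Credit_Score share of distance, raw   : {share_raw:.2f}")
print(f"Credit_Score share of distance, scaled: {share_scaled:.2f}")
```

On the raw values, credit score accounts for roughly 81% of the squared distance between these two customers; after standardization its share drops sharply, so the other features get a comparable say.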

Build KNN Classifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train_scaled, y_train)
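To show what the classifier does conceptually, here is a minimal from-scratch sketch of KNN prediction: compute the Euclidean distance to every training point, take the k closest, and return the majority label. It runs on a toy 2-D dataset invented for illustration, and `knn_predict` is a hypothetical helper, not scikit-learn's implementation.

```python
import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest points."""
    # Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the votes per class and return the most common label
    votes = np.bincount(y_train[nearest])
    return int(votes.argmax())


# Toy data: one cluster near the origin (class 0), one near (5, 5) (class 1)
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5])))  # 0
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5])))  # 1
```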

Make Predictions

In [None]:
y_pred = knn.predict(X_test_scaled)
print("Predicted values:", y_pred)
print("Actual values   :", y_test.values)

Predicted values: [0 1 0]
Actual values   : [0 1 1]
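For a lending decision, a vote share can be more informative than a hard 0/1 label. The sketch below is one possible way to score a new applicant: it wraps scaling and KNN in a pipeline, fits on all ten rows (an assumption made here to keep the example short), and calls `predict_proba`. The applicant's values are invented for illustration.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Dataset from above, with Employment_Type already encoded (Salaried = 0)
df = pd.DataFrame({
    "Age": [28, 45, 35, 50, 30, 42, 26, 48, 38, 55],
    "Annual_Income": [6.5, 12, 8, 15, 7, 10, 5.5, 14, 9, 16],
    "Credit_Score": [720, 680, 750, 640, 710, 660, 730, 650, 700, 620],
    "Loan_Amount": [5, 10, 6, 12, 5, 9, 4, 11, 7, 13],
    "Loan_Term": [5, 10, 7, 15, 5, 10, 4, 12, 8, 15],
    "Employment_Type": [0, 1, 0, 1, 0, 0, 0, 1, 0, 1],
    "Loan_Default": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
X = df.drop("Loan_Default", axis=1)
y = df["Loan_Default"]

# Scaling and KNN chained so the same transformation is applied at predict time
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)

# Hypothetical applicant (values invented for this example)
applicant = pd.DataFrame([{
    "Age": 40, "Annual_Income": 9.5, "Credit_Score": 670,
    "Loan_Amount": 8, "Loan_Term": 9, "Employment_Type": 0,
}])

# With uniform weights, each probability is the share of the 3 neighbours in that class
proba = model.predict_proba(applicant)[0]
classes = model.named_steps["kneighborsclassifier"].classes_
for label, p in zip(classes, proba):
    print(f"P(Loan_Default={label}) = {p:.2f}")
```

With k=3 the probabilities can only be 0, 1/3, 2/3 or 1, which is another reminder of how coarse the model is on data this small.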


Evaluate the Model

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.6666666666666666


In [None]:
print("Actual values:", y_test.values)
print("Predicted values:", y_pred)

correct = sum(y_test.values == y_pred)
total = len(y_test)

print("Correct predictions:", correct)
print("Total test samples:", total)
print("Calculated Accuracy:", correct / total)

Actual values: [0 1 1]
Predicted values: [0 1 0]
Correct predictions: 2
Total test samples: 3
Calculated Accuracy: 0.6666666666666666


In [None]:
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Confusion Matrix:
 [[1 0]
 [1 1]]
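The four cells of a binary confusion matrix can be unpacked by name, which makes the financial reading easier: a false negative here is a defaulter predicted as safe. A short sketch using the actual and predicted values printed above (the variable names are chosen for illustration):

```python
from sklearn.metrics import confusion_matrix

# Actual and predicted labels copied from the output above
y_true = [0, 1, 1]
y_hat = [0, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Precision and recall for the default class (label 1)
precision_default = tp / (tp + fp)
recall_default = tp / (tp + fn)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Precision (default) = {precision_default:.2f}")  # 1.00
print(f"Recall (default)    = {recall_default:.2f}")     # 0.50
```

A recall of 0.50 for the default class means one of the two actual defaulters in the test set was missed.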


In [None]:
print("Classification Report:\n")
print(classification_report(y_test, y_pred))

Classification Report:

              precision    recall  f1-score   support

           0       0.50      1.00      0.67         1
           1       1.00      0.50      0.67         2

    accuracy                           0.67         3
   macro avg       0.75      0.75      0.67         3
weighted avg       0.83      0.67      0.67         3




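MODEL CONCLUSION:

- Accuracy is 66.67% (2 of 3 test samples correct); with only 3 test samples, accuracy can only be 0%, 33%, 67% or 100%
- A single wrong prediction therefore shifts accuracy by about 33 percentage points
- The one error is a false negative: an actual defaulter predicted as non-default, the kind of mistake a lender most wants to avoid
- KNN is sensitive to dataset size: with 7 training rows, each prediction rests on a handful of neighbours
- Results are acceptable for demonstration only; larger datasets are required for real-world usage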


Let's try different K values

In [None]:
for k in range(1, 6):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    y_pred_k = knn.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred_k)
    print(f"K={k}, Accuracy={acc}")

K=1, Accuracy=0.6666666666666666
K=2, Accuracy=0.6666666666666666
K=3, Accuracy=0.6666666666666666
K=4, Accuracy=0.6666666666666666
K=5, Accuracy=0.6666666666666666
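The identical accuracies are not surprising: three test samples cannot distinguish between K values unless a prediction flips. One alternative for very small datasets, not used in the original notebook, is leave-one-out cross-validation, which tests on every row once. The sketch below assumes the encoded dataset from above and puts the scaler inside a pipeline so it is refit on each training fold.

```python
import pandas as pd
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Dataset from above, with Employment_Type already encoded (Salaried = 0)
df = pd.DataFrame({
    "Age": [28, 45, 35, 50, 30, 42, 26, 48, 38, 55],
    "Annual_Income": [6.5, 12, 8, 15, 7, 10, 5.5, 14, 9, 16],
    "Credit_Score": [720, 680, 750, 640, 710, 660, 730, 650, 700, 620],
    "Loan_Amount": [5, 10, 6, 12, 5, 9, 4, 11, 7, 13],
    "Loan_Term": [5, 10, 7, 15, 5, 10, 4, 12, 8, 15],
    "Employment_Type": [0, 1, 0, 1, 0, 0, 0, 1, 0, 1],
    "Loan_Default": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
X = df.drop("Loan_Default", axis=1)
y = df["Loan_Default"]

loo_accuracy = {}
for k in range(1, 6):
    # The pipeline refits the scaler inside every fold, avoiding leakage
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Each of the 10 folds holds out exactly one row, so each score is 0 or 1
    scores = cross_val_score(model, X, y, cv=LeaveOneOut())
    loo_accuracy[k] = scores.mean()
    print(f"K={k}, leave-one-out accuracy={loo_accuracy[k]:.2f}")
```

Even so, ten rows remain too few to pick a K with any confidence; the loop mainly shows the mechanics.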


In [None]:
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors=3)
cv_scores = cross_val_score(knn, X_train_scaled, y_train, cv=3)

print("Cross-validation scores:", cv_scores)
print("Average CV accuracy:", cv_scores.mean())

Cross-validation scores: [1. 1. 1.]
Average CV accuracy: 1.0


INTERPRETATION:

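For classifiers, cross_val_score uses stratified folds, so the number of
folds should not exceed the number of samples in the smallest class.

The training set has only 7 samples (4 non-default, 3 default), so cv=3 is the
largest value that still places at least one sample of each class in every fold.
Higher values would leave some folds without any default cases.

Each fold validates on just 2 or 3 samples, so the perfect score of 1.0 is weak
evidence; the held-out test accuracy of 0.67 is a reminder of that.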
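The fold composition can be inspected directly. The sketch below applies `StratifiedKFold` with 3 splits to training labels matching the class counts of the split above (4 non-default, 3 default; the order and the dummy feature matrix are illustrative) and prints the class counts in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Training labels with the same class counts as the 7-row training set
y_train = np.array([0, 1, 0, 1, 0, 1, 0])
# The splitter only needs the number of rows, so the features are dummies
X_dummy = np.zeros((len(y_train), 1))

skf = StratifiedKFold(n_splits=3)
fold_counts = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_dummy, y_train), start=1):
    # Count how many samples of each class landed in this validation fold
    counts = np.bincount(y_train[val_idx], minlength=2)
    fold_counts.append(counts.tolist())
    print(f"Fold {fold}: non-default={counts[0]}, default={counts[1]}")
```

With 3 folds every validation fold contains at least one sample of each class, which would no longer be possible with 4 or more folds.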

***