# Interpretations

### **Question 1: Identify high-risk customers.**

In [29]:
import pandas as pd

In [30]:
data = {
    "Age": [28, 45, 35, 50, 30, 42, 26, 48, 38, 55],
    "Annual_Income": [6.5, 12, 8, 15, 7, 10, 5.5, 14, 9, 16],
    "Credit_Score": [720, 680, 750, 640, 710, 660, 730, 650, 700, 620],
    "Loan_Amount": [5, 10, 6, 12, 5, 9, 4, 11, 7, 13],
    "Loan_Term": [5, 10, 7, 15, 5, 10, 4, 12, 8, 15],
    "Employment_Type": ["Salaried", "Self-Employed", "Salaried", "Self-Employed",
                        "Salaried", "Salaried", "Salaried", "Self-Employed",
                        "Salaried", "Self-Employed"],
    "Loan_Default": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
}

df = pd.DataFrame(data)
df

,Age,Annual_Income,Credit_Score,Loan_Amount,Loan_Term,Employment_Type,Loan_Default
0,28,6.5,720,5,5,Salaried,0
1,45,12.0,680,10,10,Self-Employed,1
2,35,8.0,750,6,7,Salaried,0
3,50,15.0,640,12,15,Self-Employed,1
4,30,7.0,710,5,5,Salaried,0
5,42,10.0,660,9,10,Salaried,1
6,26,5.5,730,4,4,Salaried,0
7,48,14.0,650,11,12,Self-Employed,1
8,38,9.0,700,7,8,Salaried,0
9,55,16.0,620,13,15,Self-Employed,1


In [31]:
# High-risk customers = those who actually defaulted
high_risk_customers = df[df["Loan_Default"] == 1]


print("High-Risk Customers:\n")
high_risk_customers

High-Risk Customers:



,Age,Annual_Income,Credit_Score,Loan_Amount,Loan_Term,Employment_Type,Loan_Default
1,45,12.0,680,10,10,Self-Employed,1
3,50,15.0,640,12,15,Self-Employed,1
5,42,10.0,660,9,10,Salaried,1
7,48,14.0,650,11,12,Self-Employed,1
9,55,16.0,620,13,15,Self-Employed,1


### **Question 2: What patterns lead to loan default?**

In [32]:
df.dtypes

Age,int64
Annual_Income,float64
Credit_Score,int64
Loan_Amount,int64
Loan_Term,int64
Employment_Type,object
Loan_Default,int64


In [33]:
group_means = df.groupby("Loan_Default").mean(numeric_only=True)
group_means

Loan_Default,Age,Annual_Income,Credit_Score,Loan_Amount,Loan_Term
0,31.4,7.2,722.0,5.4,5.8
1,48.0,13.4,650.0,11.0,12.4


In [34]:
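# Numeric feature interpretation
print("Mean values grouped by Loan Default:\n")
# Note: a notebook renders only the last expression of a cell, so this result is
# not shown below; the same table is displayed by `group_means` in the previous cell.
df.groupby("Loan_Default").mean(numeric_only=True)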

# Employment type risk analysis
print("\nDefault probability by Employment Type:\n")
df.groupby("Employment_Type")["Loan_Default"].mean()

Mean values grouped by Loan Default:


Default probability by Employment Type:



Employment_Type,Loan_Default
Salaried,0.166667
Self-Employed,1.0


### **Question 3: How do credit scores and income influence predictions?**

In [35]:
credit_risk = df.groupby("Loan_Default")["Credit_Score"].mean()
print("Average Credit Score by Default:\n")
credit_risk

Average Credit Score by Default:



Loan_Default,Credit_Score
0,722.0
1,650.0


In [36]:
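# Note: the two bands leave a gap (680-699); the one customer with a score of
# exactly 680, who defaulted, falls in neither group.
low_credit = df[df["Credit_Score"] < 680]["Loan_Default"].value_counts()
high_credit = df[df["Credit_Score"] >= 700]["Loan_Default"].value_counts()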

print("Low Credit Score Defaults:\n")
low_credit


Low Credit Score Defaults:



Loan_Default,count
1,4


In [37]:
print("\nHigh Credit Score Defaults:\n")
high_credit


High Credit Score Defaults:



Loan_Default,count
0,5


In [38]:
income_risk = df.groupby("Loan_Default")["Annual_Income"].mean()
print("Average Income by Default:\n")
income_risk

Average Income by Default:



Loan_Default,Annual_Income
0,7.2
1,13.4


### **Question 4: Suggest banking policies based on model output.**

In [39]:
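print("""
BANKING POLICY INSIGHTS (FROM THE ANALYSIS ABOVE):

1. Customers with low credit scores (below 680) defaulted in all 4 cases
2. Defaulters borrowed larger amounts on average (11.0 vs 5.4)
3. Self-employed customers defaulted more often (4 of 4 vs 1 of 6 salaried)
4. Higher income did not prevent default (13.4 vs 7.2 on average), so income must be evaluated with credit score
""")


BANKING POLICY INSIGHTS (FROM THE ANALYSIS ABOVE):

1. Customers with low credit scores (below 680) defaulted in all 4 cases
2. Defaulters borrowed larger amounts on average (11.0 vs 5.4)
3. Self-employed customers defaulted more often (4 of 4 vs 1 of 6 salaried)
4. Higher income did not prevent default (13.4 vs 7.2 on average), so income must be evaluated with credit score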
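
The insights above could be turned into a simple screening rule. The sketch below is only an illustration: the cut-offs `MIN_CREDIT_SCORE` and `MAX_LOAN_AMOUNT` and the helper `needs_manual_review` are hypothetical, chosen by eye from this 10-row sample rather than validated on real data.

```python
# Columns copied from the DataFrame defined earlier in this notebook.
credit_scores = [720, 680, 750, 640, 710, 660, 730, 650, 700, 620]
loan_amounts = [5, 10, 6, 12, 5, 9, 4, 11, 7, 13]
loan_default = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Hypothetical cut-offs picked from this tiny sample; not validated anywhere.
MIN_CREDIT_SCORE = 700
MAX_LOAN_AMOUNT = 8


def needs_manual_review(credit_score, loan_amount):
    """Flag an application when either illustrative rule is triggered."""
    return credit_score < MIN_CREDIT_SCORE or loan_amount > MAX_LOAN_AMOUNT


flags = [int(needs_manual_review(c, a)) for c, a in zip(credit_scores, loan_amounts)]
print("Flagged:  ", flags)
print("Defaulted:", loan_default)
```

On this sample the flags coincide with the recorded defaults, which is expected because the cut-offs were read off the same ten rows; it says nothing about how such a rule would behave on new applicants.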



### **Question 5: Compare KNN with Decision Trees for this problem.**

In [40]:
from sklearn.preprocessing import LabelEncoder

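# LabelEncoder assigns integer codes in sorted order of the labels:
# "Salaried" -> 0, "Self-Employed" -> 1
le = LabelEncoder()
df["Employment_Type"] = le.fit_transform(df["Employment_Type"])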

print(df["Employment_Type"].value_counts())

Employment_Type
0    6
1    4
Name: count, dtype: int64


In [41]:
X = df.drop("Loan_Default", axis=1)
y = df["Loan_Default"]

In [42]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

In [43]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

dt_pred = dt.predict(X_test)

print("Decision Tree Accuracy:", accuracy_score(y_test, dt_pred))

Decision Tree Accuracy: 1.0


In [44]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Employment_Type is already label-encoded, and the stratified train/test split
# from the earlier cell (test_size=0.3, random_state=42) is reused for both models.

# KNN (with scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
knn_pred = knn.predict(X_test_scaled)

# Decision Tree (no scaling)
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)

print("KNN Test Accuracy:", accuracy_score(y_test, knn_pred))
print("Decision Tree Test Accuracy:", accuracy_score(y_test, dt_pred))

KNN Test Accuracy: 1.0
Decision Tree Test Accuracy: 1.0


Both KNN and Decision Tree models achieved a test accuracy of 1.0 on the given dataset. With 10 rows and `test_size=0.3`, the test set holds only 3 samples, so a perfect score is weak evidence about either model. It occurs here because the classes are cleanly separable on features such as credit score and loan amount (and, largely, employment type), so both models classify every test sample correctly.
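
Because a 3-sample test set gives such a coarse estimate, one option is leave-one-out cross-validation, which holds out each of the 10 rows in turn. The following is a minimal sketch under that assumption, not part of the original analysis; the names `loan_data`, `models`, and `loo_scores` are introduced here for illustration, and the data is re-declared so the block runs on its own.

```python
import pandas as pd
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Same values as the notebook's DataFrame, with Employment_Type already
# encoded (Salaried -> 0, Self-Employed -> 1).
loan_data = pd.DataFrame({
    "Age": [28, 45, 35, 50, 30, 42, 26, 48, 38, 55],
    "Annual_Income": [6.5, 12, 8, 15, 7, 10, 5.5, 14, 9, 16],
    "Credit_Score": [720, 680, 750, 640, 710, 660, 730, 650, 700, 620],
    "Loan_Amount": [5, 10, 6, 12, 5, 9, 4, 11, 7, 13],
    "Loan_Term": [5, 10, 7, 15, 5, 10, 4, 12, 8, 15],
    "Employment_Type": [0, 1, 0, 1, 0, 0, 0, 1, 0, 1],
    "Loan_Default": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
X_all = loan_data.drop(columns="Loan_Default")
y_all = loan_data["Loan_Default"]

# The pipeline refits the scaler inside every fold, so the held-out row
# never influences the scaling statistics.
models = {
    "KNN (k=3, scaled)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

loo_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_all, y_all, cv=LeaveOneOut())
    loo_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.2f} over {len(scores)} held-out rows")
```

Even with this scheme, ten rows remain far too few to rank the two models reliably; the sketch only uses every row for testing once instead of three.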

### **Question 6: What happens if Loan Amount dominates distance calculation?**

In [45]:
print("Feature ranges:\n")
df.describe()

Feature ranges:



,Age,Annual_Income,Credit_Score,Loan_Amount,Loan_Term,Employment_Type,Loan_Default
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,39.7,10.3,686.0,8.2,9.1,0.4,0.5
std,9.922477,3.750556,42.739521,3.224903,4.012481,0.516398,0.527046
min,26.0,5.5,620.0,4.0,4.0,0.0,0.0
25%,31.25,7.25,652.5,5.25,5.5,0.0,0.0
50%,40.0,9.5,690.0,8.0,9.0,0.0,0.5
75%,47.25,13.5,717.5,10.75,11.5,1.0,1.0
max,55.0,16.0,750.0,13.0,15.0,1.0,1.0


In [46]:
from scipy.spatial.distance import euclidean

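# Pick the first two customers; keep them as a DataFrame so the scaler
# sees the same feature names it was fitted on (avoids a feature-name warning)
pair = X.iloc[[0, 1]]
cust1, cust2 = pair.to_numpy()

distance_unscaled = euclidean(cust1, cust2)

# Apply the scaler that was fitted on the training split
cust1_scaled, cust2_scaled = scaler.transform(pair)

distance_scaled = euclidean(cust1_scaled, cust2_scaled)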

print("Unscaled Distance:")
distance_unscaled


Unscaled Distance:




44.38749823993238

In [47]:
print("Scaled Distance  :")
distance_scaled

Scaled Distance  :


4.7409315475741

In [48]:
# KNN WITH scaling
knn_scaled = KNeighborsClassifier(n_neighbors=3)
knn_scaled.fit(X_train_scaled, y_train)

pred_scaled = knn_scaled.predict(X_test_scaled)

acc_scaled = accuracy_score(y_test, pred_scaled)
print("Accuracy WITH scaling   :", acc_scaled)

Accuracy WITH scaling   : 1.0


In [49]:
# KNN WITHOUT scaling
knn_no_scale = KNeighborsClassifier(n_neighbors=3)

knn_no_scale.fit(X_train, y_train)

pred_no_scale = knn_no_scale.predict(X_test)

In [51]:
print("Accuracy WITHOUT scaling:", accuracy_score(y_test, pred_no_scale))
print("Accuracy WITH scaling   :", acc_scaled)

Accuracy WITHOUT scaling: 1.0
Accuracy WITH scaling   : 1.0


Both the scaled and unscaled models reach the same accuracy here, but
the test set has only three samples, so this says little about
real-world data. KNN relies on distances, and without scaling the
feature with the largest numeric range dominates them. In this dataset
that feature is Credit_Score (620 to 750): for the first two customers
it contributes 1600 of the 1970.25 units of squared unscaled distance.



If Loan Amount dominated the distance calculation (for example, if it
were recorded in much larger numbers), the KNN model would effectively
choose neighbors by loan size alone and ignore other critical features
such as credit score and income. This results in biased neighbor
selection and unreliable predictions. Feature scaling prevents this
issue by putting all features on a comparable scale.
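
To make the dominance effect concrete, the sketch below breaks the squared Euclidean distance between the first two customers into per-feature terms, then repeats the calculation with Loan_Amount expressed in units 1,000 times smaller. That unit change (`UNIT_FACTOR`) is a hypothetical chosen for illustration; the source does not state the units of any column.

```python
import math

feature_names = ["Age", "Annual_Income", "Credit_Score",
                 "Loan_Amount", "Loan_Term", "Employment_Type"]
cust_a = [28, 6.5, 720, 5, 5, 0]     # first row of X
cust_b = [45, 12.0, 680, 10, 10, 1]  # second row of X


def squared_contributions(a, b):
    """Per-feature terms (a_i - b_i)^2 whose sum is the squared Euclidean distance."""
    return [(x - y) ** 2 for x, y in zip(a, b)]


def report(a, b):
    """Print each feature's share of the squared distance and return the terms."""
    terms = squared_contributions(a, b)
    total = sum(terms)
    for name, term in zip(feature_names, terms):
        print(f"{name:<16}{term:>16.2f}{term / total:>8.1%}")
    print(f"distance = {math.sqrt(total):.4f}\n")
    return terms, total


terms, total = report(cust_a, cust_b)

# Hypothetical: store Loan_Amount (index 3) in units 1,000 times smaller,
# which multiplies its numeric values by 1,000.
LOAN_IDX, UNIT_FACTOR = 3, 1_000
rescaled_a = cust_a[:LOAN_IDX] + [cust_a[LOAN_IDX] * UNIT_FACTOR] + cust_a[LOAN_IDX + 1:]
rescaled_b = cust_b[:LOAN_IDX] + [cust_b[LOAN_IDX] * UNIT_FACTOR] + cust_b[LOAN_IDX + 1:]
terms_rescaled, total_rescaled = report(rescaled_a, rescaled_b)
loan_share = terms_rescaled[LOAN_IDX] / total_rescaled
```

In the first report Credit_Score accounts for most of the distance; after the hypothetical unit change, Loan_Amount accounts for almost all of it, even though the customers themselves have not changed.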


### **Question 7: Should KNN be used in real-time loan approval systems?**

In [52]:
import time

start = time.time()
knn_scaled.predict(X_test_scaled)
end = time.time()

print("KNN prediction time:", end - start, "seconds")

KNN prediction time: 0.004214763641357422 seconds


With brute-force search, KNN computes the distance from each query to every training sample, so prediction time grows roughly linearly with the size of the training set.
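
The linear growth can be seen by counting distance computations in a hand-written brute-force KNN. This is a simplified sketch, not scikit-learn's implementation (which may use tree-based indexes depending on its `algorithm` setting); `brute_force_knn_predict` and the random training data are illustrative assumptions.

```python
import math
import random


def brute_force_knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    Returns (predicted_label, number_of_distance_computations).
    """
    # One distance per stored training row: this is the O(n) step.
    distances = [(math.dist(row, query), label) for row, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    top_labels = [label for _, label in distances[:k]]
    prediction = max(set(top_labels), key=top_labels.count)
    return prediction, len(distances)


random.seed(0)
for n_train in (7, 700, 70_000):
    train_X = [[random.random() for _ in range(6)] for _ in range(n_train)]
    train_y = [random.randint(0, 1) for _ in range(n_train)]
    _, n_distances = brute_force_knn_predict(train_X, train_y, [0.5] * 6)
    print(f"{n_train:>6} training rows -> {n_distances:>6} distance computations per query")
```

Each query touches every stored row, so a training set 10,000 times larger means 10,000 times as many distance computations per prediction.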

In [53]:
print("Number of training samples stored by KNN:", len(X_train))

Number of training samples stored by KNN: 7


KNN must keep the entire training dataset in memory at prediction time, which becomes costly for large banking systems.

KNN does not provide feature importances or explicit decision rules.
Each prediction is based only on its nearest neighbors, which makes an individual loan decision harder to explain.
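
For contrast, a fitted decision tree can be printed as explicit rules. The sketch below fits a tree on all ten rows (not the 7-row training split used above) and uses scikit-learn's `export_text` and `feature_importances_`; the names `loan_data`, `tree`, `rules`, and `importances` are introduced here for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Same values as the notebook's DataFrame, with Employment_Type already
# encoded (Salaried -> 0, Self-Employed -> 1).
loan_data = pd.DataFrame({
    "Age": [28, 45, 35, 50, 30, 42, 26, 48, 38, 55],
    "Annual_Income": [6.5, 12, 8, 15, 7, 10, 5.5, 14, 9, 16],
    "Credit_Score": [720, 680, 750, 640, 710, 660, 730, 650, 700, 620],
    "Loan_Amount": [5, 10, 6, 12, 5, 9, 4, 11, 7, 13],
    "Loan_Term": [5, 10, 7, 15, 5, 10, 4, 12, 8, 15],
    "Employment_Type": [0, 1, 0, 1, 0, 0, 0, 1, 0, 1],
    "Loan_Default": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
features = loan_data.drop(columns="Loan_Default")
target = loan_data["Loan_Default"]

tree = DecisionTreeClassifier(random_state=42).fit(features, target)

# Human-readable if/else rules learned by the tree.
rules = export_text(tree, feature_names=list(features.columns))
print(rules)

# Share of the impurity reduction attributed to each feature.
importances = dict(zip(features.columns, tree.feature_importances_))
print(importances)
```

Several features separate the two classes perfectly in this sample, so which one the tree splits on is somewhat arbitrary; the point is only that the decision path can be read and audited, which KNN does not offer.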

In [54]:
start = time.time()
dt.predict(X_test)
end = time.time()

print("Decision Tree prediction time:", end - start, "seconds")

Decision Tree prediction time: 0.0033593177795410156 seconds


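The two timings above are single measurements on three test rows and differ by less than a millisecond, so they are not conclusive on their own. The case for Decision Trees rests on how the cost scales: a tree prediction follows one root-to-leaf path, so it does not require comparing the query against every stored training sample. This makes Decision Trees more suitable for real-time use.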
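
A single `time.time()` measurement on a few rows is mostly overhead and noise. A steadier comparison repeats the call many times and takes the median, as in the sketch below; `median_runtime` is a hypothetical helper, and the workload timed here is a stand-in so the block runs on its own.

```python
import statistics
import time


def median_runtime(fn, repeats=200):
    """Return the median wall-clock time in seconds of calling fn() `repeats` times."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # high-resolution timer meant for benchmarking
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload. In the notebook one would instead pass, for example,
# lambda: knn_scaled.predict(X_test_scaled) or lambda: dt.predict(X_test).
demo_time = median_runtime(lambda: sum(range(1000)))
print(f"Median time per call: {demo_time:.6f} seconds")
```

The median is used rather than the mean so that an occasional slow call, for instance one interrupted by another process, does not distort the result.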

***