Step 7: Unsupervised Learning — Applicant Segmentation

This notebook applies unsupervised learning techniques to
discover hidden patterns in loan applicants.

The goal is to segment applicants into meaningful groups
without using loan approval labels.

1. IMPORTS

In [2]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

2. LOAD CLEANED DATA

In [3]:
df = pd.read_csv(
    r"E:/ALL Documents/LEVEL 6 Completed/Projects/Week 1 AI & ML & Linux/end-to-end-explainable-ai-system/data/processed/cleaned_data.csv"
)

3. REMOVE TARGET VARIABLE

In [4]:
X = df.drop(columns=["Loan_Status"])

4. ENCODING AND FEATURE SCALING

In [5]:
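# One-hot encode any remaining categorical columns.
X = pd.get_dummies(X, drop_first=True)

# Standardize to zero mean and unit variance so no feature dominates the distances.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)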
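One-hot encoding turns every distinct value of a text column into its own feature, so a column with nearly one value per row (an identifier, for instance) can swamp the feature matrix before scaling. The sketch below shows one way to check cardinality before encoding. The toy frame, its column names (`applicant_id`, `area`, `income`) and the 50% threshold are all hypothetical and are not taken from cleaned_data.csv.

```python
import pandas as pd

# Hypothetical toy data; these column names are illustrative only.
toy = pd.DataFrame({
    "applicant_id": ["A1", "A2", "A3", "A4", "A5", "A6"],
    "area": ["Urban", "Rural", "Urban", "Semiurban", "Rural", "Urban"],
    "income": [4200, 5100, 3900, 8800, 6100, 4700],
})

# Non-numeric columns are the ones get_dummies will expand.
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
cardinality = toy[text_cols].nunique()

# Assumed rule of thumb: more distinct values than half the rows looks like an ID.
id_like = cardinality[cardinality > 0.5 * len(toy)].index.tolist()

encoded_all = pd.get_dummies(toy, drop_first=True)
encoded_clean = pd.get_dummies(toy.drop(columns=id_like), drop_first=True)

print("distinct values per text column:", cardinality.to_dict())
print("features with ID column:", encoded_all.shape[1])
print("features without ID column:", encoded_clean.shape[1])
```

Comparing `X.shape[1]` before and after `pd.get_dummies` in the notebook gives the same information for the real data.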

5. ELBOW METHOD (CHOOSE K)

In [6]:
# Fit KMeans for k = 1..10 and record the inertia (within-cluster sum of squares).
inertia = []

for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X_scaled)
    inertia.append(km.inertia_)

inertia

[390503.99999999756,
 389411.6683466833,
 388122.48821031256,
 386901.1505849187,
 385909.8624209806,
 384612.0613322628,
 383647.4929904377,
 383046.3454717311,
 382213.4904825525,
 381344.3749054233]
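The inertia values above can be read more closely with a little arithmetic. After StandardScaler every non-constant column has unit variance, so the inertia at k=1 equals the number of rows times the number of non-constant features. The sketch below uses only the numbers printed in this notebook. The row count of 614 is taken from the cluster sizes reported in section 7.

```python
# Inertia values copied from the output above (k = 1 .. 10).
inertia = [
    390503.99999999756, 389411.6683466833, 388122.48821031256,
    386901.1505849187, 385909.8624209806, 384612.0613322628,
    383647.4929904377, 383046.3454717311, 382213.4904825525,
    381344.3749054233,
]

# Row count, taken from the cluster sizes in section 7 (299 + 217 + 98).
n_samples = 299 + 217 + 98

# With standardized data, inertia at k=1 = n_samples * n_features.
implied_features = round(inertia[0] / n_samples)

# Relative reductions, expressed as a fraction of the k=1 inertia.
total_drop = (inertia[0] - inertia[-1]) / inertia[0]
step_drops = [(a - b) / inertia[0] for a, b in zip(inertia, inertia[1:])]

print(f"implied feature count: {implied_features}")
print(f"total inertia reduction, k=1 to k=10: {total_drop:.2%}")
print(f"largest single-step reduction: {max(step_drops):.2%}")
```

The implied feature count is far larger than the 13 non-target columns shown in the cluster characteristics table, which would be consistent with a high-cardinality text column being expanded by `pd.get_dummies`. That is an inference from the numbers, not something the notebook confirms. The curve is also nearly flat, so it shows no clear elbow and does not by itself single out a value of k.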

6. TRAIN KMEANS MODEL

In [7]:
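# Fit the final model with k=3 and get one cluster label per applicant.
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Attach the labels to the original (unscaled) frame for profiling.
df["Cluster"] = clusters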
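A silhouette score is a common complementary check on the choice of k. It is not part of the original notebook. The sketch below runs on synthetic data from `make_blobs` (three well-separated groups chosen for illustration), not on the applicant data, and uses `n_init=10` as an assumed setting.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated 2-D blobs (illustrative only).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Silhouette needs at least two clusters, so start at k=2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k by silhouette:", best_k)
```

To apply the same loop to the applicant data, replace `X_demo` with `X_scaled`. Scores close to zero for every k would suggest the data has little cluster structure in the scaled feature space.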

7. CLUSTER DISTRIBUTION

In [8]:
df["Cluster"].value_counts()

Cluster
2    299
1    217
0     98
Name: count, dtype: int64

8. CLUSTER CHARACTERISTICS

In [9]:
df.groupby("Cluster").mean(numeric_only=True)

Cluster,ApplicantIncome,CoapplicantIncome,LoanAmount,Credit_History,Loan_Status,Gender_Male,Married_Yes,Dependents_1,Dependents_2,Dependents_3+,Education_Not Graduate,Self_Employed_Yes,Property_Area_Semiurban,Property_Area_Urban
0,9628.510204,1135.989796,196.632653,0.816327,0.622449,0.785714,0.469388,0.142857,0.153061,0.122449,0.193878,0.714286,0.295918,0.183673
1,5176.147465,1407.737327,147.815668,0.857143,0.746544,0.778802,0.700461,0.253456,0.142857,0.078341,0.184332,0.018433,0.903226,0.018433
2,4183.632107,1935.247224,127.578595,0.866221,0.665552,0.856187,0.67893,0.110368,0.183946,0.073579,0.250836,0.026756,0.026756,0.602007
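Cluster means are easier to read when set against the whole sample. The sketch below uses only values copied from the outputs above. It computes the size-weighted overall mean of a few columns and each cluster's ratio to it. It assumes Loan_Status is coded 0/1, so that its mean is a share.

```python
# Cluster sizes (section 7) and selected means (table above).
sizes = {0: 98, 1: 217, 2: 299}
means = {
    "ApplicantIncome": {0: 9628.510204, 1: 5176.147465, 2: 4183.632107},
    "LoanAmount": {0: 196.632653, 1: 147.815668, 2: 127.578595},
    "Credit_History": {0: 0.816327, 1: 0.857143, 2: 0.866221},
    "Loan_Status": {0: 0.622449, 1: 0.746544, 2: 0.665552},
}
total = sum(sizes.values())

overall = {}
ratios = {}
for feature, by_cluster in means.items():
    # Weight each cluster mean by its size to recover the sample-wide mean.
    overall[feature] = sum(by_cluster[c] * sizes[c] for c in sizes) / total
    ratios[feature] = {c: round(by_cluster[c] / overall[feature], 2) for c in sizes}
    print(f"{feature:16s} overall={overall[feature]:10.3f} ratio={ratios[feature]}")

# Cluster with the highest mean applicant income.
richest = max(sizes, key=lambda c: means["ApplicantIncome"][c])
print("highest-income cluster:", richest)
```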


9. SAVE CLUSTERED DATA

In [12]:
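# Save the applicant data together with the new "Cluster" column.
# Note: despite the file name, this CSV holds clustered data, not a model.
df.to_csv("E:/ALL Documents/LEVEL 6 Completed/Projects/Week 1 AI & ML & Linux/end-to-end-explainable-ai-system/model/random_forest_model.csv", index=True)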

10. CLUSTER INTERPRETATION

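- Cluster 0 (98 applicants): highest applicant income (about 9,629) and largest loan amount (about 197), mostly self-employed (71%), lowest credit history mean (0.82) and lowest Loan_Status mean (0.62) → higher risk group despite the high income
- Cluster 1 (217 applicants): mid-range applicant income (about 5,176), almost entirely semiurban (90%), credit history mean 0.86 and highest Loan_Status mean (0.75) → safer applicants
- Cluster 2 (299 applicants): lowest applicant income (about 4,184) but highest coapplicant income (about 1,935), smallest loan amount (about 128), mostly urban (60%), Loan_Status mean 0.67 → moderate risk profile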

These clusters help contextualize model decisions
and support recommendation strategies.

11. ETHICAL NOTE

- Clustering may unintentionally group sensitive demographics
- Clusters should be used for decision support, not discrimination
- Human oversight is required when using applicant segmentation
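
One simple way to act on the first point is to compare the demographic make-up of each cluster with the whole sample. The sketch below uses the Gender_Male and Married_Yes shares from the cluster characteristics table and the cluster sizes from section 7. Which attributes count as sensitive, and how large a gap warrants review, are assumptions left to the reader.

```python
# Cluster sizes and per-cluster shares copied from the notebook outputs.
sizes = {0: 98, 1: 217, 2: 299}
shares = {
    "Gender_Male": {0: 0.785714, 1: 0.778802, 2: 0.856187},
    "Married_Yes": {0: 0.469388, 1: 0.700461, 2: 0.678930},
}
total = sum(sizes.values())

counts = {}
gaps = {}
for attribute, by_cluster in shares.items():
    # Recover head counts from the rounded shares.
    counts[attribute] = {c: round(by_cluster[c] * sizes[c]) for c in sizes}
    overall = sum(counts[attribute].values()) / total
    # Gap, in percentage points, between each cluster and the whole sample.
    gaps[attribute] = {
        c: round(100 * (counts[attribute][c] / sizes[c] - overall), 1)
        for c in sizes
    }
    print(f"{attribute}: overall share={overall:.3f}, gap (pp)={gaps[attribute]}")
```

A large gap does not prove unfair treatment, but it flags clusters whose use in downstream decisions deserves a closer human look.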