# KNN Loan Default Prediction

Using the kNN approach discussed in class, predict the class label for the test example
X = (Home Owner = No, Marital Status = Married, Annual Income = $120K).
Assume k = 3 and that distance is measured with the L2 (Euclidean) norm.
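
For reference, the L2 norm between two numeric feature vectors is the usual Euclidean distance. The symbols below ($\mathbf{x}$ for a training row, $\mathbf{y}$ for the test example, $n$ for the number of numeric features after encoding) are notation introduced here, not taken from the assignment text:

```latex
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
```

The k = 3 training rows with the smallest $d$ then vote on the class label.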

In [1]:
import pandas as pd
import numpy as np

## Data

In [2]:
url = "./data/Loan_default.csv"

data = pd.read_csv(url)
data

Unnamed: 0,Tid,Home Owner,Marital Status,Annual Income,Defaulted Borrower
0,1,Yes,Single,125K,No
1,2,No,Married,100K,No
2,3,No,Single,70K,No
3,4,Yes,Married,120K,No
4,5,No,Divorced,95K,Yes
5,6,No,Married,60K,No
6,7,Yes,Divorced,220K,No
7,8,No,Single,85K,Yes
8,9,No,Married,75K,No
9,10,No,Single,90K,Yes


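## Preprocessing
- The dataset mixes categorical and numeric columns, so everything must be converted to numbers before distances can be computed
    - Convert the "Home Owner" column to binary (Yes = 1, No = 0)
    - The "Marital Status" column has no natural order, so use One-Hot Encoding
    - Strip the "K" suffix from the "Annual Income" column and convert it to a number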

In [None]:
# Work on a copy so the original `data` frame stays unchanged
newData = data.copy()
newData["Home Owner"] = np.where(newData["Home Owner"] == "Yes", 1, 0)
newData

Unnamed: 0,Tid,Home Owner,Marital Status,Annual Income,Defaulted Borrower
0,1,1,Single,125K,No
1,2,0,Married,100K,No
2,3,0,Single,70K,No
3,4,1,Married,120K,No
4,5,0,Divorced,95K,Yes
5,6,0,Married,60K,No
6,7,1,Divorced,220K,No
7,8,0,Single,85K,Yes
8,9,0,Married,75K,No
9,10,0,Single,90K,Yes


In [None]:
# One-hot Encoding
newData["Is_Single"] = np.where(newData["Marital Status"] == "Single", 1, 0)
newData["Is_Married"] = np.where(newData["Marital Status"] == "Married", 1, 0)
newData["Is_Divorced"] = np.where(newData["Marital Status"] == "Divorced", 1, 0)
newData

Unnamed: 0,Tid,Home Owner,Marital Status,Annual Income,Defaulted Borrower,Is_Single,Is_Married,Is_Divorced
0,1,1,Single,125K,No,1,0,0
1,2,0,Married,100K,No,0,1,0
2,3,0,Single,70K,No,1,0,0
3,4,1,Married,120K,No,0,1,0
4,5,0,Divorced,95K,Yes,0,0,1
5,6,0,Married,60K,No,0,1,0
6,7,1,Divorced,220K,No,0,0,1
7,8,0,Single,85K,Yes,1,0,0
8,9,0,Married,75K,No,0,1,0
9,10,0,Single,90K,Yes,1,0,0
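
The three indicator columns built by hand above could probably also be produced in one call with `pd.get_dummies`. The sketch below is only an illustration on a small toy frame (the frame `df` and the names `dummies` and `encoded` are made up here), and it assumes that 0/1 integers are wanted rather than booleans:

```python
import pandas as pd

# Toy frame with the same column name and categories as the notebook's data.
df = pd.DataFrame({"Marital Status": ["Single", "Married", "Divorced", "Married"]})

# prefix="Is" yields Is_Divorced, Is_Married, Is_Single (categories are sorted);
# dtype=int gives 0/1 values instead of the boolean default of recent pandas.
dummies = pd.get_dummies(df["Marital Status"], prefix="Is", dtype=int)

# Attach the indicator columns next to the original column.
encoded = pd.concat([df, dummies], axis=1)
print(encoded)
```

Note that the column order differs from the manual version (alphabetical instead of Single, Married, Divorced), which matters only if the columns are later accessed by position.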


In [23]:
newData["Annual Income"] = newData["Annual Income"].str.replace("K", "", case=False).astype(float)
newData

Unnamed: 0,Tid,Home Owner,Marital Status,Annual Income,Defaulted Borrower,Is_Single,Is_Married,Is_Divorced
0,1,1,Single,125.0,No,1,0,0
1,2,0,Married,100.0,No,0,1,0
2,3,0,Single,70.0,No,1,0,0
3,4,1,Married,120.0,No,0,1,0
4,5,0,Divorced,95.0,Yes,0,0,1
5,6,0,Married,60.0,No,0,1,0
6,7,1,Divorced,220.0,No,0,0,1
7,8,0,Single,85.0,Yes,1,0,0
8,9,0,Married,75.0,No,0,1,0
9,10,0,Single,90.0,Yes,1,0,0


## Data normalization

In KNN, if Annual Income is left at its raw value (e.g. 125) while Home Owner is 0 or 1, the income difference completely dominates the other features in the distance.

All features therefore need to be on the same range (typically 0 to 1), here using Min-Max Scaling: (x - min) / (max - min).
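
To make the effect of scaling concrete, the following sketch compares the 3 nearest neighbours of the test example with and without min-max scaling of income. It is a standalone illustration in plain Python: the tuples are copied from the encoded table above, and the helper names `nearest_tids` and `scale_income` are invented for this example.

```python
import math

# (home_owner, is_single, is_married, is_divorced, income_in_K) for Tid 1..10,
# copied from the encoded table above.
rows = [
    (1, 1, 0, 0, 125.0),
    (0, 0, 1, 0, 100.0),
    (0, 1, 0, 0, 70.0),
    (1, 0, 1, 0, 120.0),
    (0, 0, 0, 1, 95.0),
    (0, 0, 1, 0, 60.0),
    (1, 0, 0, 1, 220.0),
    (0, 1, 0, 0, 85.0),
    (0, 0, 1, 0, 75.0),
    (0, 1, 0, 0, 90.0),
]
# Test example: not a home owner, married, income 120K.
query = (0, 0, 1, 0, 120.0)


def nearest_tids(points, target, k=3):
    """Return the Tids (1-based) of the k points closest to target under L2."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], target))
    return [i + 1 for i in order[:k]]


def scale_income(point, lo, hi):
    """Min-max scale only the last coordinate (income); leave the rest as-is."""
    return point[:-1] + ((point[-1] - lo) / (hi - lo),)


lo = min(r[-1] for r in rows)
hi = max(r[-1] for r in rows)

raw_neighbors = nearest_tids(rows, query)
scaled_neighbors = nearest_tids(
    [scale_income(r, lo, hi) for r in rows], scale_income(query, lo, hi)
)
print("raw income:   ", raw_neighbors)
print("scaled income:", scaled_neighbors)
```

With raw income, the neighbour set is driven almost entirely by income differences (a 5K gap outweighs a mismatch in every categorical feature), whereas after scaling the categorical features carry comparable weight.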

In [27]:
min_income = newData["Annual Income"].min()
max_income = newData["Annual Income"].max()

newData["Annual Income Scaling"] = (newData["Annual Income"] - min_income) / (max_income - min_income)
newData

Unnamed: 0,Tid,Home Owner,Marital Status,Annual Income,Defaulted Borrower,Is_Single,Is_Married,Is_Divorced,Annual Income Scaling
0,1,1,Single,125.0,No,1,0,0,0.40625
1,2,0,Married,100.0,No,0,1,0,0.25
2,3,0,Single,70.0,No,1,0,0,0.0625
3,4,1,Married,120.0,No,0,1,0,0.375
4,5,0,Divorced,95.0,Yes,0,0,1,0.21875
5,6,0,Married,60.0,No,0,1,0,0.0
6,7,1,Divorced,220.0,No,0,0,1,1.0
7,8,0,Single,85.0,Yes,1,0,0,0.15625
8,9,0,Married,75.0,No,0,1,0,0.09375
9,10,0,Single,90.0,Yes,1,0,0,0.1875


## Distance computation

In [None]:
# Scale the query income with the same min/max as the training data
income_scaling = (120 - min_income) / (max_income - min_income)
X = {
    "home_owner": 0,
    "marital_status": (0, 1, 0),  # (Is_Single, Is_Married, Is_Divorced)
    "annual_income": income_scaling
}

# Euclidean (L2) distance from every row to the query X.
# np.power is used because np.pow only exists in NumPy 2.0 and later.
newData["Distance"] = np.sqrt(
    np.power(newData["Home Owner"] - X["home_owner"], 2) +
    np.power(newData["Is_Single"] - X["marital_status"][0], 2) +
    np.power(newData["Is_Married"] - X["marital_status"][1], 2) +
    np.power(newData["Is_Divorced"] - X["marital_status"][2], 2) +
    np.power(newData["Annual Income Scaling"] - X["annual_income"], 2)
)

newData.sort_values("Distance")

Unnamed: 0,Tid,Home Owner,Marital Status,Annual Income,Defaulted Borrower,Is_Single,Is_Married,Is_Divorced,Annual Income Scaling,Distance
1,2,0,Married,100.0,No,0,1,0,0.25,0.125
8,9,0,Married,75.0,No,0,1,0,0.09375,0.28125
5,6,0,Married,60.0,No,0,1,0,0.0,0.375
3,4,1,Married,120.0,No,0,1,0,0.375,1.0
4,5,0,Divorced,95.0,Yes,0,0,1,0.21875,1.422819
9,10,0,Single,90.0,Yes,1,0,0,0.1875,1.426589
7,8,0,Single,85.0,Yes,1,0,0,0.15625,1.431032
2,3,0,Single,70.0,No,1,0,0,0.0625,1.448329
0,1,1,Single,125.0,No,1,0,0,0.40625,1.732333
6,7,1,Divorced,220.0,No,0,0,1,1.0,1.841365


## Prediction
Since Y is Yes/No, this is a classification problem. The 3 nearest neighbors (Tid 2, 9 and 6) all have the label No, so the vote is 100% No and the predicted class is No.

In [33]:
newData.sort_values("Distance").head(3)

Unnamed: 0,Tid,Home Owner,Marital Status,Annual Income,Defaulted Borrower,Is_Single,Is_Married,Is_Divorced,Annual Income Scaling,Distance
1,2,0,Married,100.0,No,0,1,0,0.25,0.125
8,9,0,Married,75.0,No,0,1,0,0.09375,0.28125
5,6,0,Married,60.0,No,0,1,0,0.0,0.375
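
The final voting step can be sketched as a small majority vote over the neighbours' labels. This is a minimal illustration, not part of the original notebook: the list `neighbor_labels` is typed in by hand from the table above, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter

# Labels of the 3 nearest neighbours (Tid 2, 9, 6), copied from the table above.
neighbor_labels = ["No", "No", "No"]


def majority_vote(labels):
    """Return (winning label, share of votes) for a list of class labels."""
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes / len(labels)


prediction, share = majority_vote(neighbor_labels)
print(prediction, share)
```

With an odd k and two classes, as here, the vote cannot tie. For other settings a tie-breaking rule would have to be chosen.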
