# White Box - Supervised ML Project
__Name:__ Your Full Name

__Topic Name:__ Regression Topic or Classification Topic

## Introduction
Brief overview of the business challenge and dataset

### Problem Statement
Define the real-world problem the model aims to solve


### Objectives
List the key questions guiding your analysis and modeling:
- What features influence the target variable?
- Can feature engineering improve model performance?
- How do different model versions compare?


## Data Overview
__Load and inspect the dataset__
- Source and format
- .head(), .info(), .describe(),…


## Data Cleaning
__Handle missing values, outliers, and inconsistencies__
- Rename columns
- Fix data types
- Document assumptions


## Exploratory Data Analysis (EDA)
### Analysis
__Answer objectives using visual and statistical insights__
- Trends, relationships, anomalies
- Outlier and missing values treatment
- Univariate, bivariate, or multivariate analysis
    - Histograms, box plots, bar charts
    - Correlation matrix for continuous columns (required if applicable)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# read csv file
df = pd.read_csv("train.csv")

print(df.shape)
df.head()

(100000, 28)


  df = pd.read_csv("train.csv")


Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,...,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance,Credit_Score
0,0x1602,CUS_0xd40,January,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,...,_,809.98,26.82262,22 Years and 1 Months,No,49.574949,80.41529543900253,High_spent_Small_value_payments,312.49408867943663,Good
1,0x1603,CUS_0xd40,February,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,,3,...,Good,809.98,31.94496,,No,49.574949,118.28022162236736,Low_spent_Large_value_payments,284.62916249607184,Good
2,0x1604,CUS_0xd40,March,Aaron Maashoh,-500,821-00-0265,Scientist,19114.12,,3,...,Good,809.98,28.609352,22 Years and 3 Months,No,49.574949,81.699521264648,Low_spent_Medium_value_payments,331.2098628537912,Good
3,0x1605,CUS_0xd40,April,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,,3,...,Good,809.98,31.377862,22 Years and 4 Months,No,49.574949,199.4580743910713,Low_spent_Small_value_payments,223.45130972736783,Good
4,0x1606,CUS_0xd40,May,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,...,Good,809.98,24.797347,22 Years and 5 Months,No,49.574949,41.420153086217326,High_spent_Medium_value_payments,341.48923103222177,Good


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        100000 non-null  object 
 1   Customer_ID               100000 non-null  object 
 2   Month                     100000 non-null  object 
 3   Name                      90015 non-null   object 
 4   Age                       100000 non-null  object 
 5   SSN                       100000 non-null  object 
 6   Occupation                100000 non-null  object 
 7   Annual_Income             100000 non-null  object 
 8   Monthly_Inhand_Salary     84998 non-null   float64
 9   Num_Bank_Accounts         100000 non-null  int64  
 10  Num_Credit_Card           100000 non-null  int64  
 11  Interest_Rate             100000 non-null  int64  
 12  Num_of_Loan               100000 non-null  object 
 13  Type_of_Loan              88592 non-null   ob

In [3]:
df.describe()

Unnamed: 0,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Delay_from_due_date,Num_Credit_Inquiries,Credit_Utilization_Ratio,Total_EMI_per_month
count,84998.0,100000.0,100000.0,100000.0,100000.0,98035.0,100000.0,100000.0
mean,4194.17085,17.09128,22.47443,72.46604,21.06878,27.754251,32.285173,1403.118217
std,3183.686167,117.404834,129.05741,466.422621,14.860104,193.177339,5.116875,8306.04127
min,303.645417,-1.0,0.0,1.0,-5.0,0.0,20.0,0.0
25%,1625.568229,3.0,4.0,8.0,10.0,3.0,28.052567,30.30666
50%,3093.745,6.0,5.0,13.0,18.0,6.0,32.305784,69.249473
75%,5957.448333,7.0,7.0,20.0,28.0,9.0,36.496663,161.224249
max,15204.633333,1798.0,1499.0,5797.0,67.0,2597.0,50.0,82331.0


### Data Handling for Modeling
- __Transform, encode, and prepare features__
    - Categorical encoding
    - Feature scaling (especially for KNN)


In [4]:
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df = df[df['Age'] >= 0]

In [5]:
# remove nan, _ and ______.
df = df.replace('_', np.nan)
df = df.replace('_______', np.nan)
df = df.replace('!@9#%8', np.nan)
df = df.dropna()
df.head()

Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,...,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance,Credit_Score
6,0x1608,CUS_0xd40,July,Aaron Maashoh,23.0,821-00-0265,Scientist,19114.12,1824.843333,3,...,Good,809.98,22.537593,22 Years and 7 Months,No,49.574949,178.3440674122349,Low_spent_Small_value_payments,244.5653167062043,Good
9,0x160f,CUS_0x21b1,February,Rick Rothackerj,28.0,004-07-5839,Teacher,34847.84,3037.986667,2,...,Good,605.03,38.550848,26 Years and 8 Months,No,18.816215,40.39123782853101,High_spent_Large_value_payments,484.5912142650067,Good
12,0x1612,CUS_0x21b1,May,Rick Rothackerj,28.0,004-07-5839,Teacher,34847.84,3037.986667,2,...,Good,605.03,34.977895,26 Years and 11 Months,No,18.816215,130.11542024292334,Low_spent_Small_value_payments,444.8670318506144,Good
13,0x1613,CUS_0x21b1,June,Rick Rothackerj,28.0,004-07-5839,Teacher,34847.84,3037.986667,2,...,Good,605.03,33.38101,27 Years and 0 Months,No,18.816215,43.47719014435575,High_spent_Large_value_payments,481.505261949182,Good
15,0x1615,CUS_0x21b1,August,Rick Rothackerj,28.0,004-07-5839,Teacher,34847.84,3037.986667,2,...,Good,605.03,32.933856,27 Years and 2 Months,No,18.816215,218.90434353388733,Low_spent_Small_value_payments,356.07810855965045,Good


In [6]:
# We force these columns to numeric, coercing any remaining non-numeric values to NaN.
numeric_cols_to_convert = [
                           'Annual_Income', 
                           'Monthly_Inhand_Salary', 
                           'Num_Bank_Accounts',
                           'Num_Credit_Card',
                           'Interest_Rate',
                           'Num_of_Loan',
                           'Delay_from_due_date',
                           'Num_of_Delayed_Payment',
                           'Changed_Credit_Limit',
                           'Num_Credit_Inquiries',
                           'Credit_Mix',
                           'Outstanding_Debt',
                           'Credit_Utilization_Ratio',
                           'Total_EMI_per_month',
                           'Amount_invested_monthly',
                           'Monthly_Balance'
                           ]
for col in numeric_cols_to_convert:
    df[col] = pd.to_numeric(df[col], errors='coerce')

In [7]:
# convert the target to binary (0 and 1).
target_column = df.columns[-1]
# This assigns 0 to the first unique value encountered and 1 to the second.
df['Target'] = df[target_column].factorize()[0] 
df = df.drop(columns=[target_column]) # Drop the original string column
df

Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,...,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance,Target
6,0x1608,CUS_0xd40,July,Aaron Maashoh,23.0,821-00-0265,Scientist,19114.12,1824.843333,3,...,,809.98,22.537593,22 Years and 7 Months,No,49.574949,178.344067,Low_spent_Small_value_payments,244.565317,0
9,0x160f,CUS_0x21b1,February,Rick Rothackerj,28.0,004-07-5839,Teacher,34847.84,3037.986667,2,...,,605.03,38.550848,26 Years and 8 Months,No,18.816215,40.391238,High_spent_Large_value_payments,484.591214,0
12,0x1612,CUS_0x21b1,May,Rick Rothackerj,28.0,004-07-5839,Teacher,34847.84,3037.986667,2,...,,605.03,34.977895,26 Years and 11 Months,No,18.816215,130.115420,Low_spent_Small_value_payments,444.867032,0
13,0x1613,CUS_0x21b1,June,Rick Rothackerj,28.0,004-07-5839,Teacher,34847.84,3037.986667,2,...,,605.03,33.381010,27 Years and 0 Months,No,18.816215,43.477190,High_spent_Large_value_payments,481.505262,0
15,0x1615,CUS_0x21b1,August,Rick Rothackerj,28.0,004-07-5839,Teacher,34847.84,3037.986667,2,...,,605.03,32.933856,27 Years and 2 Months,No,18.816215,218.904344,Low_spent_Small_value_payments,356.078109,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99981,0x25fd3,CUS_0xaf61,June,Chris Wickhamm,50.0,133-16-7738,Writer,,3097.008333,1,...,,620.64,37.753013,30 Years and 2 Months,NM,84.205949,147.339908,Low_spent_Small_value_payments,368.154976,0
99984,0x25fda,CUS_0x8600,January,Sarah McBridec,28.0,031-35-0942,Architect,20002.88,1929.906667,10,...,,,22.895966,5 Years and 8 Months,Yes,60.964772,43.370670,High_spent_Large_value_payments,328.655224,2
99985,0x25fdb,CUS_0x8600,February,Sarah McBridec,28.0,031-35-0942,Architect,20002.88,1929.906667,10,...,,,39.772607,5 Years and 9 Months,Yes,12112.000000,148.275233,Low_spent_Small_value_payments,273.750662,2
99991,0x25fe1,CUS_0x8600,August,Sarah McBridec,29.0,031-35-0942,Architect,20002.88,1929.906667,10,...,,3571.70,37.140784,6 Years and 3 Months,Yes,60.964772,34.662906,High_spent_Large_value_payments,337.362988,1


In [8]:
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop first to avoid multicollinearity
occupation_encoded = encoder.fit_transform(df[['Occupation']])
occupation_df = pd.DataFrame(occupation_encoded, columns=encoder.get_feature_names_out(['Occupation']))
df = pd.concat([df.drop('Occupation', axis=1), occupation_df], axis=1)

In [9]:
occupation_encoded = encoder.fit_transform(df[['Type_of_Loan']])
occupation_df = pd.DataFrame(occupation_encoded, columns=encoder.get_feature_names_out(['Type_of_Loan']))
df = pd.concat([df.drop('Type_of_Loan', axis=1), occupation_df], axis=1)

In [10]:
occupation_encoded = encoder.fit_transform(df[['Payment_of_Min_Amount']])
occupation_df = pd.DataFrame(occupation_encoded, columns=encoder.get_feature_names_out(['Payment_of_Min_Amount']))
df = pd.concat([df.drop('Payment_of_Min_Amount', axis=1), occupation_df], axis=1)

In [11]:
occupation_encoded = encoder.fit_transform(df[['Payment_Behaviour']])
occupation_df = pd.DataFrame(occupation_encoded, columns=encoder.get_feature_names_out(['Payment_Behaviour']))
df = pd.concat([df.drop('Payment_Behaviour', axis=1), occupation_df], axis=1)

In [12]:
# Define the logical order: Bad < Standard < Good
credit_mapping = {'Bad': 0, 'Standard': 1, 'Good': 2}

# Apply mapping
df['Credit_Mix'] = df['Credit_Mix'].map(credit_mapping)

In [13]:
# Map month names to numbers (preserves natural order)
month_mapping = {
    'January': 1, 'February': 2, 'March': 3, 'April': 4,
    'May': 5, 'June': 6, 'July': 7, 'August': 8,
    'September': 9, 'October': 10, 'November': 11, 'December': 12
}

# Apply mapping
df['Month'] = df['Month'].map(credit_mapping)

In [14]:
occupation_encoded = encoder.fit_transform(df[['Credit_History_Age']])
occupation_df = pd.DataFrame(occupation_encoded, columns=encoder.get_feature_names_out(['Credit_History_Age']))
df = pd.concat([df.drop('Credit_History_Age', axis=1), occupation_df], axis=1)

In [15]:
df.head()

Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,...,Credit_History_Age_9 Years and 11 Months,Credit_History_Age_9 Years and 2 Months,Credit_History_Age_9 Years and 3 Months,Credit_History_Age_9 Years and 4 Months,Credit_History_Age_9 Years and 5 Months,Credit_History_Age_9 Years and 6 Months,Credit_History_Age_9 Years and 7 Months,Credit_History_Age_9 Years and 8 Months,Credit_History_Age_9 Years and 9 Months,Credit_History_Age_nan
6,0x1608,CUS_0xd40,,Aaron Maashoh,23.0,821-00-0265,19114.12,1824.843333,3.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0x160f,CUS_0x21b1,,Rick Rothackerj,28.0,004-07-5839,34847.84,3037.986667,2.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12,0x1612,CUS_0x21b1,,Rick Rothackerj,28.0,004-07-5839,34847.84,3037.986667,2.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,0x1613,CUS_0x21b1,,Rick Rothackerj,28.0,004-07-5839,34847.84,3037.986667,2.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15,0x1615,CUS_0x21b1,,Rick Rothackerj,28.0,004-07-5839,34847.84,3037.986667,2.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
df = df.drop(columns=['ID', 'Customer_ID', 'Name', 'SSN','Credit_Mix', 'Month'])
df = df.dropna()
df = df.dropna(subset=['Target'])
X = df.drop(columns=['Target'])
Y = df['Target']

In [17]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)

## Model Building
__Build and compare two models for your topic__
### Model 1: Linear / Logistic Regression
- Version 1: baseline
- Version 2: modified features or tuned parameters
- Checking if model assumptions were met
 ...


In [18]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [19]:
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, Y_train)

0,1,2
,penalty,'l2'
,dual,False
,tol,0.0001
,C,1.0
,fit_intercept,True
,intercept_scaling,1
,class_weight,
,random_state,42
,solver,'lbfgs'
,max_iter,100


In [21]:
# Make predictions
Y_pred = log_reg.predict(X_test_scaled)
Y_pred_proba = log_reg.predict_proba(X_test_scaled)

# Evaluate the model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print("Accuracy:", accuracy_score(Y_test, Y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(Y_test, Y_pred))

Accuracy: 0.5697167755991286

Confusion Matrix:
[[115 153  34]
 [109 622 235]
 [ 32 227 309]]


### Model 2: KNN Regressor / KNN Classifier
- Version 1: baseline
- Version 2: modified features or tuned parameters
- Checking if model assumptions were met 
 ...


In [25]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [26]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, Y_train)

0,1,2
,n_neighbors,5
,weights,'uniform'
,algorithm,'auto'
,leaf_size,30
,p,2
,metric,'minkowski'
,metric_params,
,n_jobs,


In [27]:
# Make predictions
Y_pred_knn = knn.predict(X_test_scaled)
Y_pred_proba_knn = knn.predict_proba(X_test_scaled)

In [28]:
print("Accuracy:", accuracy_score(Y_test, Y_pred_knn))
print("\nConfusion Matrix:")
print(confusion_matrix(Y_test, Y_pred_knn))

Accuracy: 0.46296296296296297

Confusion Matrix:
[[ 88 164  50]
 [192 576 198]
 [ 84 298 186]]


### Model Evaluation
__Use appropriate metrics based on task type and compare between the Models and their versions__

__For Regression:__
- R² Score
- Root Mean Squared Error (RMSE)
- Residual plots


__For Classification:__
- Accuracy
- Precision
- Recall
- F1 Score
- ROC-AUC _**(Optional)**_
- Confusion Matrix

_Include visualizations and interpretation for each metric._


### Model Comparison
__Compare performance across models and versions__

__Note:__ _Use Bullet Points or table_
- Which model performed best and why
- Impact of feature changes or tuning
- Generalization and overfitting observations


Based on the accuracy, the best model is logistic regression


## Final Model & Insights
__Summarize your final model and key takeaways__

__Note__: _Use Bullet Points_
- Final model choice
- Business implications
- Limitations and future improvements


## References & Appendix
__Cite tools, libraries, and sources used__