# Credit Score Classification

## Objective
The primary objective of this project is to develop a machine learning model that accurately classifies individuals into predefined credit score categories based on various features available in the dataset. The end goal is to create a tool that can help financial institutions assess the creditworthiness of their clients efficiently and reliably.

## Features Description

- **ID**: A unique identifier for each entry in the dataset.

- **Customer_ID**: A unique identifier for each customer. This is used to track individual credit records.

- **Month**: The transaction or record month. Useful for temporal analysis to track scores over time.

- **Name**: The name of the customer. Generally not used for modeling but may be used for descriptive purposes.

- **Age**: The age of the customer. Age could be a significant factor in determining credit risk.

- **SSN**: Social Security Number. Typically not used directly in modeling, but important for identification and verification.

- **Occupation**: The occupation of the customer. Different occupation types can have varying default risk associated with them.

- **Annual_Income**: The total yearly income of the customer. A key factor in determining an individual's ability to repay loans.

- **Monthly_Inhand_Salary**: The salary in-hand received monthly after deductions. Important for assessing cash flow and repayment capacity.

- **Num_Bank_Accounts**: Total number of bank accounts the customer has. Reflects financial behavior and stability.

- **Num_Credit_Card**: The number of credit cards held by the customer. Related to credit exposure and spending habits.

- **Interest_Rate**: The interest rate applicable on loans or credit cards. Influences the customer’s repayment burden.

- **Num_of_Loan**: The total number of loans the customer has. Shows the level of indebtedness.

- **Type_of_Loan**: The types of loans held by the customer (e.g., personal, auto, mortgage). Different loan types carry different risks.

- **Delay_from_due_date**: Average delay in days for payments past due date. Indicative of payment habits and reliability.

- **Num_of_Delayed_Payment**: Number of payments delayed beyond due dates. Impacts credit score directly.

- **Changed_Credit_Limit**: Changes in credit limits. Could indicate changes in perceived creditworthiness or attempts to manage debts.

- **Num_Credit_Inquiries**: Number of times the credit report has been pulled. A high number might indicate financial distress or credit shopping.

- **Credit_Mix**: The variety of credit types the customer has (e.g., cards, mortgages, installment loans). Diversity can be positive or negative.

- **Outstanding_Debt**: The total debt remaining unpaid. High debt indicates higher risk.

- **Credit_Utilization_Ratio**: The percentage of credit used out of the total available credit limit. Higher ratios suggest greater risk.

- **Credit_History_Age**: The age of the customer’s oldest credit account, usually in years and months. Longer histories are typically associated with better scores.

- **Payment_of_Min_Amount**: Whether the customer makes only the minimum monthly payment. Paying only the minimum can signal tight cash flow and a higher repayment risk.

- **Total_EMI_per_month**: The total Equated Monthly Installment paid across all loans. Determines monthly financial burden.

- **Amount_invested_monthly**: The amount the customer invests monthly. Suggests savings habits and financial health.

- **Payment_Behaviour**: General payment behavior and patterns. A qualitative measure of reliability.

- **Monthly_Balance**: The balance remaining after monthly expenses. Indicates cash flow and financial stability.
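Some of these features arrive as text rather than numbers. As one illustration, the sketch below converts a `Credit_History_Age` string of the form `X Years and Y Months` into a total month count. The helper name `history_age_to_months` and the regular expression are assumptions made for this example, not part of the original notebook.

```python
import re

import numpy as np
import pandas as pd

# Matches text such as "22 Years and 1 Months" (assumed format).
_HISTORY_PATTERN = re.compile(r"(\d+)\s+Years?\s+and\s+(\d+)\s+Months?")


def history_age_to_months(value):
    """Convert 'X Years and Y Months' to total months; NaN if it cannot be parsed."""
    if not isinstance(value, str):
        return np.nan  # missing entries stay missing
    match = _HISTORY_PATTERN.search(value)
    if match is None:
        return np.nan
    years, months = int(match.group(1)), int(match.group(2))
    return years * 12 + months


# Hypothetical sample values, including one missing entry.
example = pd.Series(["22 Years and 1 Months", None, "22 Years and 3 Months"])
example_months = example.map(history_age_to_months)
```

A single numeric column like this is usually easier for tree-based models to use than the raw string, although keeping years and months separate would also work.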


## Target
The target variable for this classification problem is the credit score category, which segments individuals according to their creditworthiness. The categories may include, but are not limited to:
- **Good**: Represents individuals who have a high credit score and are likely to meet their credit obligations.
- **Average**: Represents individuals with a medium-range credit score, indicating moderate creditworthiness.
- **Poor**: Represents individuals with a low credit score, suggesting a higher risk of default.

The model will be trained to assign each individual in the dataset to one of these categories based on the input features.

# 01. Import Necessary Libraries

- Default Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

- Additional Libraries

In [2]:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from xgboost import plot_importance
from sklearn.model_selection import cross_val_score, KFold, RandomizedSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

----------------
# 02. Load Data

In [3]:
train = pd.read_csv("Data/train.csv")


----------------
# 03. Exploratory Data Analysis (EDA)

## `i` Simple Analysis

In [4]:
train.shape

(100000, 28)

In [5]:
train.head()

| | ID | Customer_ID | Month | Name | Age | SSN | Occupation | Annual_Income | Monthly_Inhand_Salary | Num_Bank_Accounts | ... | Credit_Mix | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Payment_of_Min_Amount | Total_EMI_per_month | Amount_invested_monthly | Payment_Behaviour | Monthly_Balance | Credit_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0x1602 | CUS_0xd40 | January | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | 1824.843333 | 3 | ... | _ | 809.98 | 26.82262 | 22 Years and 1 Months | No | 49.574949 | 80.41529543900253 | High_spent_Small_value_payments | 312.49408867943663 | Good |
| 1 | 0x1603 | CUS_0xd40 | February | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | NaN | 3 | ... | Good | 809.98 | 31.94496 | NaN | No | 49.574949 | 118.28022162236736 | Low_spent_Large_value_payments | 284.62916249607184 | Good |
| 2 | 0x1604 | CUS_0xd40 | March | Aaron Maashoh | -500 | 821-00-0265 | Scientist | 19114.12 | NaN | 3 | ... | Good | 809.98 | 28.609352 | 22 Years and 3 Months | No | 49.574949 | 81.699521264648 | Low_spent_Medium_value_payments | 331.2098628537912 | Good |
| 3 | 0x1605 | CUS_0xd40 | April | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | NaN | 3 | ... | Good | 809.98 | 31.377862 | 22 Years and 4 Months | No | 49.574949 | 199.4580743910713 | Low_spent_Small_value_payments | 223.45130972736783 | Good |
| 4 | 0x1606 | CUS_0xd40 | May | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | 1824.843333 | 3 | ... | Good | 809.98 | 24.797347 | 22 Years and 5 Months | No | 49.574949 | 41.420153086217326 | High_spent_Medium_value_payments | 341.48923103222177 | Good |


In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        100000 non-null  object 
 1   Customer_ID               100000 non-null  object 
 2   Month                     100000 non-null  object 
 3   Name                      90015 non-null   object 
 4   Age                       100000 non-null  object 
 5   SSN                       100000 non-null  object 
 6   Occupation                100000 non-null  object 
 7   Annual_Income             100000 non-null  object 
 8   Monthly_Inhand_Salary     84998 non-null   float64
 9   Num_Bank_Accounts         100000 non-null  int64  
 10  Num_Credit_Card           100000 non-null  int64  
 11  Interest_Rate             100000 non-null  int64  
 12  Num_of_Loan               100000 non-null  object 
 13  Type_of_Loan              88592 non-null   ob

In [None]:
train.describe()

| | Monthly_Inhand_Salary | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Delay_from_due_date | Num_Credit_Inquiries | Credit_Utilization_Ratio | Total_EMI_per_month |
|---|---|---|---|---|---|---|---|---|
| count | 84998.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 98035.0 | 100000.0 | 100000.0 |
| mean | 4194.17085 | 17.09128 | 22.47443 | 72.46604 | 21.06878 | 27.754251 | 32.285173 | 1403.118217 |
| std | 3183.686167 | 117.404834 | 129.05741 | 466.422621 | 14.860104 | 193.177339 | 5.116875 | 8306.04127 |
| min | 303.645417 | -1.0 | 0.0 | 1.0 | -5.0 | 0.0 | 20.0 | 0.0 |
| 25% | 1625.568229 | 3.0 | 4.0 | 8.0 | 10.0 | 3.0 | 28.052567 | 30.30666 |
| 50% | 3093.745 | 6.0 | 5.0 | 13.0 | 18.0 | 6.0 | 32.305784 | 69.249473 |
| 75% | 5957.448333 | 7.0 | 7.0 | 20.0 | 28.0 | 9.0 | 36.496663 | 161.224249 |
| max | 15204.633333 | 1798.0 | 1499.0 | 5797.0 | 67.0 | 2597.0 | 50.0 | 82331.0 |


In [8]:
train.isnull().sum()

ID                              0
Customer_ID                     0
Month                           0
Name                         9985
Age                             0
SSN                             0
Occupation                      0
Annual_Income                   0
Monthly_Inhand_Salary       15002
Num_Bank_Accounts               0
Num_Credit_Card                 0
Interest_Rate                   0
Num_of_Loan                     0
Type_of_Loan                11408
Delay_from_due_date             0
Num_of_Delayed_Payment       7002
Changed_Credit_Limit            0
Num_Credit_Inquiries         1965
Credit_Mix                      0
Outstanding_Debt                0
Credit_Utilization_Ratio        0
Credit_History_Age           9030
Payment_of_Min_Amount           0
Total_EMI_per_month             0
Amount_invested_monthly      4479
Payment_Behaviour               0
Monthly_Balance              1200
Credit_Score                    0
dtype: int64

- **'Name'** Missing values: 9985 -> Ignore these, because the column will be dropped anyway.
- **'Monthly_Inhand_Salary'** Missing values: 15002 -> Group customers by occupation and fill the null values with each group's mean monthly salary.
- **'Type_of_Loan'** Missing values: 11408 -> Fill with the most common loan type; for customers aged 20 to 25, fill with 'Student Loan' after checking their occupation.
- **'Num_of_Delayed_Payment'** Missing values: 7002 -> 
- **'Num_Credit_Inquiries'** Missing values: 1965 -> 
- **'Credit_History_Age'** Missing values: 9030 -> 
- **'Amount_invested_monthly'** Missing values: 4479 -> 
- **'Monthly_Balance'** Missing values: 1200 -> 
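The group-wise fill proposed for `Monthly_Inhand_Salary` could look roughly like the sketch below. It runs on a small toy frame with made-up salaries (only the column names come from the dataset), and it uses `groupby(...).transform("mean")` so the group mean lines up with the original rows.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the dataset.
toy = pd.DataFrame({
    "Occupation": ["Scientist", "Scientist", "Teacher", "Teacher", "Teacher"],
    "Monthly_Inhand_Salary": [1800.0, np.nan, 2500.0, 3500.0, np.nan],
})

# Mean salary per occupation, broadcast back to every row of that occupation.
occupation_mean = toy.groupby("Occupation")["Monthly_Inhand_Salary"].transform("mean")

# Fill only the missing entries; observed salaries are left untouched.
toy["Monthly_Inhand_Salary"] = toy["Monthly_Inhand_Salary"].fillna(occupation_mean)
```

In the `train.head()` preview, the same customer shows an identical salary in January and May with gaps in between, so grouping by `Customer_ID` instead of `Occupation` may be worth comparing before settling on one approach.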

In [9]:
train.duplicated().sum()

0

Next, check for unknown symbols, unclean values, and malformed patterns so that they can be handled.

In [10]:
# Despite the name, int_cols holds every numeric column (int64 and float64).
int_cols = train.select_dtypes(include=['int64', 'float64']).columns

In [18]:
list_of_cols = [col for col in train.columns if col not in ['ID', 'Customer_ID', 'SSN'] + list(int_cols)]

for col in list_of_cols:
    if train[col].dtype == 'object':
        print(f"Column: {col}")
        print(f"Unique values: {train[col].unique()[:10]}")
        print(f"Missing values: {train[col].isnull().sum()}")
        print("-" * 50)

Column: Month
Unique values: ['January' 'February' 'March' 'April' 'May' 'June' 'July' 'August']
Missing values: 0
--------------------------------------------------
Column: Name
Unique values: ['Aaron Maashoh' nan 'Rick Rothackerj' 'Langep' 'Jasond' 'Deepaa' 'Np'
 'Nadiaq' 'Annk' 'Charlie Zhur']
Missing values: 9985
--------------------------------------------------
Column: Age
Unique values: ['23' '-500' '28_' '28' '34' '54' '55' '21' '31' '33']
Missing values: 0
--------------------------------------------------
Column: Occupation
Unique values: ['Scientist' '_______' 'Teacher' 'Engineer' 'Entrepreneur' 'Developer'
 'Lawyer' 'Media_Manager' 'Doctor' 'Journalist']
Missing values: 0
--------------------------------------------------
Column: Annual_Income
Unique values: ['19114.12' '34847.84' '34847.84_' '143162.64' '30689.89' '30689.89_'
 '35547.71_' '35547.71' '73928.46' '131313.4']
Missing values: 0
--------------------------------------------------
Column: Num_of_Loan
Unique values

In [12]:
check_cols = ['Monthly_Balance', 'Amount_invested_monthly', 'Outstanding_Debt']

In [13]:
for col in check_cols:
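    # A value counts as numeric if only digits remain after dropping one '.' and one '-'.
    # regex=False keeps '.' a literal dot (the default has differed across pandas versions).
    as_text = train[col].astype(str)
    is_numeric = as_text.str.replace('.', '', n=1, regex=False).str.replace('-', '', n=1, regex=False).str.isdigit()
    non_numeric = train.loc[~is_numeric, col].unique()[:10]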
    print(f"Column: {col}")
    print(f"Non-numeric values: {non_numeric}")
    print("-" * 50)

Column: Monthly_Balance
Non-numeric values: [nan '__-333333333333333333333333333__']
--------------------------------------------------
Column: Amount_invested_monthly
Non-numeric values: ['__10000__' nan]
--------------------------------------------------
Column: Outstanding_Debt
Non-numeric values: ['1328.93_' '1283.37_' '2797.17_' '3818.57_' '343.84_' '363.51_' '404.51_'
 '1755.81_' '2593.44_' '89.62_']
--------------------------------------------------


In [17]:
integer_columns = ['Age', 'Annual_Income', 'Num_of_Loan','Num_of_Delayed_Payment', 
                   'Changed_Credit_Limit','Outstanding_Debt', 'Amount_invested_monthly', 'Monthly_Balance']

for col in integer_columns:
    # Coerce conversion to numeric (invalid parsing becomes NaN)
    coerced = pd.to_numeric(train[col], errors='coerce')
    
    # Mask where original value is not numeric but not NaN in original column
    mask = coerced.isna() & train[col].notna()
    
    invalid_values = train.loc[mask, col].unique()
    
    print(f"\nColumn: {col}")
    print("Invalid values found:")
    print(invalid_values)
    print(f"Number of invalid entries: {mask.sum()}")
    print("-" * 50)



Column: Age
Invalid values found:
['28_' '34_' '30_' '24_' '33_' '35_' '31_' '40_' '37_' '54_' '21_' '20_'
 '43_' '38_' '18_' '2111_' '46_' '16_' '19_' '47_' '53_' '25_' '27_' '55_'
 '42_' '48_' '49_' '50_' '32_' '22_' '17_' '29_' '15_' '51_' '26_' '39_'
 '14_' '36_' '44_' '7670_' '45_' '23_' '41_' '52_' '733_' '5769_' '4383_'
 '56_' '2650_' '3307_' '6962_' '5589_' '6556_' '1447_' '8153_' '3834_'
 '6744_' '6471_' '7723_' '7640_' '6408_' '3502_' '7316_' '1102_' '8669_'
 '2463_' '6666_' '3055_' '1248_' '2220_' '2159_' '4583_' '3988_' '2155_'
 '6770_' '1843_' '1367_' '3742_' '2171_' '5109_' '3984_' '2474_' '5046_'
 '7715_' '2329_' '707_' '844_' '2756_' '2037_' '902_' '8523_' '3640_'
 '3998_' '3712_' '2097_' '8348_' '5373_' '3291_' '2994_' '3339_' '2812_'
 '3578_' '3564_' '1794_' '737_' '4301_' '2846_' '2373_' '1188_' '8207_'
 '5909_' '6381_' '8616_' '6799_' '1591_' '3775_' '6564_' '7122_' '4913_'
 '5697_' '3843_' '4445_' '6921_' '780_' '1070_' '5798_' '4808_']
Number of invalid entries: 

#### > Handling Underscore (`_`) Symbols in Columns

The dataset contains underscore (`_`) symbols in various columns. These symbols should be either removed or replaced with appropriate values based on the context of each column.

#### > Column-wise Cleaning Strategy

- **Age** : Remove `_`
- **Occupation** : Remove `_` or fill with `NULL`
- **Annual_Income** : Remove `_` or fill with `NULL` if there is no numeric value
- **Num_of_Loan** : Remove `_` or fill with `NULL` if there is no numeric value
- **Num_of_Delayed_Payment** : Remove `_` or fill with `NULL` if there is no numeric value
- **Changed_Credit_Limit** : Remove `_` or fill with `NULL` if there is no numeric value
- **Credit_Mix** : Fill `_` with `NULL`
- **Monthly_Balance** : Remove `_` or fill with `NULL` if there is no numeric value
- **Amount_invested_monthly** : Remove `_` or fill with `NULL` if there is no numeric value
- **Outstanding_Debt** : Remove `_` or fill with `NULL` if there is no numeric value
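One possible way to apply this strategy is sketched below. The helper names `clean_numeric_text` and `placeholder_to_nan` are hypothetical. The sketch also assumes that values wrapped in double underscores (such as `__10000__`) are placeholders rather than real amounts, so it maps them to `NaN` instead of stripping the underscores; that choice should be confirmed against the data.

```python
import numpy as np
import pandas as pd


def clean_numeric_text(series):
    """Drop trailing underscores and coerce to numbers; unparseable values become NaN."""
    def _clean(value):
        if not isinstance(value, str):
            return value  # keep NaN / None as they are
        if value.startswith("__"):
            return np.nan  # assumption: '__10000__'-style values are placeholders
        return value.rstrip("_")  # '28_' -> '28', '_' -> ''

    return pd.to_numeric(series.map(_clean), errors="coerce")


def placeholder_to_nan(series):
    """Replace values made up only of underscores (e.g. '_', '_______') with NaN."""
    is_placeholder = series.map(lambda v: isinstance(v, str) and set(v) == {"_"})
    return series.mask(is_placeholder)


# Sample values taken from the outputs shown earlier in this section.
numeric_examples = pd.Series(["28_", "-500", "1328.93_", "__10000__", None])
cleaned_numeric = clean_numeric_text(numeric_examples)

category_examples = pd.Series(["Scientist", "_______", "_"])
cleaned_category = placeholder_to_nan(category_examples)
```

`clean_numeric_text` fits the numeric-like columns in the list, while `placeholder_to_nan` fits `Occupation` and `Credit_Mix`, where an underscore-only value stands in for a missing category.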

#### > Additional EDA insights:

- **Type_of_Loan** : Contains long strings with multiple loan types. These should be converted into a list of individual loan types.

  **Ex**:  `"Auto Loan, and Student Loan"` → `['Auto Loan', 'Student Loan']`

- **Age** : Has illogical values, such as `-500` and four-digit entries like `7670_`.
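Both observations can be sketched in a few lines. The function names `split_loan_types` and `implausible_age_mask` are hypothetical, the 0 to 120 age bounds are an assumption chosen for illustration, and treating a missing `Type_of_Loan` as an empty list is one option among several.

```python
import re

import pandas as pd


def split_loan_types(value):
    """Split 'Auto Loan, and Student Loan' into ['Auto Loan', 'Student Loan']."""
    if not isinstance(value, str):
        return []  # assumption: a missing entry is treated as "no loans listed"
    # Separators: a comma (optionally followed by 'and'), or a bare ' and '.
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", value)
    return [part.strip() for part in parts if part.strip()]


def implausible_age_mask(age, low=0, high=120):
    """True where a numeric age lies outside [low, high]; NaN is not flagged."""
    return (age < low) | (age > high)


loans = split_loan_types("Auto Loan, and Student Loan")

# Hypothetical numeric ages, as they might look after underscore removal.
ages = pd.Series([23, -500, 7670, 28])
flagged = implausible_age_mask(ages)
```

Flagged ages could then be set to `NaN` and filled from the same customer's other monthly records, since the preview shows a customer's age staying constant across months apart from the bad entry.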

## `ii` Visual Analysis