## Main Objective
The goal of this project is to design a robust machine learning pipeline that:

1. **Classifies individuals** into predefined credit score categories (e.g., Bad, Standard, Good) based on financial and behavioral features.
2. **Predicts the exact credit score value** to provide a more granular and personalized analysis.

### Data Understanding
1. **ID**: Represents the unique identity of an entry.
2. **Customer_ID**: Represents the unique identity of a person.
3. **Month**: Represents the month of the year.
4. **Name**: Represents a person's name.
5. **Age**: Represents the person's age.
6. **SSN**: Represents the person's Social Security Number.
7. **Occupation**: Represents the person's occupation.
8. **Annual_Income**: Represents the person's annual income.
9. **Monthly_Inhand_Salary**: Represents the person's monthly net salary.
10. **Num_Bank_Accounts**: Represents the number of bank accounts the person has.
11. **Num_Credit_Card**: Represents the number of additional credit cards the person has.
12. **Interest_Rate**: Represents the credit card interest rate.
13. **Num_of_Loan**: Represents the number of loans taken from the bank.
14. **Type_of_Loan**: Represents the types of loans the person has taken.
15. **Delay_from_due_date**: Represents the average number of days delayed from the due date of payment.
16. **Num_of_Delayed_Payment**: Represents the number of delayed payments the person has made.
17. **Changed_Credit_Limit**: Represents the percentage change in the credit card limit.
18. **Num_Credit_Inquiries**: Represents the number of credit card inquiries.
19. **Credit_Mix**: Represents the classification of the credit mix.
20. **Outstanding_Debt**: Represents the remaining outstanding debt (in USD).
21. **Credit_Utilization_Ratio**: Represents the credit card utilization ratio.
22. **Credit_History_Age**: Represents the age of the person's credit history.
23. **Payment_of_Min_Amount**: Represents whether the person only pays the minimum amount due.
24. **Total_EMI_per_month**: Represents the monthly EMI payments (in USD).
25. **Amount_invested_monthly**: Represents the amount invested by the customer monthly (in USD).
26. **Payment_Behaviour**: Represents the customer's payment behavior.
27. **Monthly_Balance**: Represents the customer's monthly balance amount (in USD).
28. **Credit_Score**: Represents the credit score range (Bad, Standard, Good).

The dataset is available on Kaggle: <https://www.kaggle.com/datasets/parisrohan/credit-score-classification/data>
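The `Credit_Score` labels (Bad, Standard, Good) have a natural order, so one possible way to prepare them for modeling is an ordered categorical encoding. The sketch below uses a small made-up series rather than the real file, and the name `Credit_Score_Code` is an assumption introduced only for illustration.

```python
import pandas as pd

# Made-up labels standing in for the Credit_Score column.
labels = pd.Series(["Good", "Standard", "Bad", "Good"], name="Credit_Score")

# Declare the order explicitly so that Bad < Standard < Good.
ordered = pd.Categorical(labels, categories=["Bad", "Standard", "Good"], ordered=True)

# Integer codes follow the declared order: Bad=0, Standard=1, Good=2.
codes = pd.Series(ordered.codes, name="Credit_Score_Code")
print(codes.tolist())
```

Whether an ordinal code or a plain label is more suitable depends on the model chosen later.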

In [1]:
import pandas as pd

In [27]:
dataset = pd.read_csv("datasets/credit_scoring/train.csv", dtype={'Monthly_Balance': 'str'})

In [28]:
dataset.head(10)

|  | ID | Customer_ID | Month | Name | Age | SSN | Occupation | Annual_Income | Monthly_Inhand_Salary | Num_Bank_Accounts | ... | Credit_Mix | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Payment_of_Min_Amount | Total_EMI_per_month | Amount_invested_monthly | Payment_Behaviour | Monthly_Balance | Credit_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0x1602 | CUS_0xd40 | January | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | 1824.843333 | 3 | ... | _ | 809.98 | 26.82262 | 22 Years and 1 Months | No | 49.574949 | 80.41529543900253 | High_spent_Small_value_payments | 312.49408867943663 | Good |
| 1 | 0x1603 | CUS_0xd40 | February | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 |  | 3 | ... | Good | 809.98 | 31.94496 |  | No | 49.574949 | 118.28022162236736 | Low_spent_Large_value_payments | 284.62916249607184 | Good |
| 2 | 0x1604 | CUS_0xd40 | March | Aaron Maashoh | -500 | 821-00-0265 | Scientist | 19114.12 |  | 3 | ... | Good | 809.98 | 28.609352 | 22 Years and 3 Months | No | 49.574949 | 81.699521264648 | Low_spent_Medium_value_payments | 331.2098628537912 | Good |
| 3 | 0x1605 | CUS_0xd40 | April | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 |  | 3 | ... | Good | 809.98 | 31.377862 | 22 Years and 4 Months | No | 49.574949 | 199.4580743910713 | Low_spent_Small_value_payments | 223.45130972736783 | Good |
| 4 | 0x1606 | CUS_0xd40 | May | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | 1824.843333 | 3 | ... | Good | 809.98 | 24.797347 | 22 Years and 5 Months | No | 49.574949 | 41.420153086217326 | High_spent_Medium_value_payments | 341.48923103222177 | Good |
| 5 | 0x1607 | CUS_0xd40 | June | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 |  | 3 | ... | Good | 809.98 | 27.262259 | 22 Years and 6 Months | No | 49.574949 | 62.430172331195294 | !@9#%8 | 340.4792117872438 | Good |
| 6 | 0x1608 | CUS_0xd40 | July | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | 1824.843333 | 3 | ... | Good | 809.98 | 22.537593 | 22 Years and 7 Months | No | 49.574949 | 178.3440674122349 | Low_spent_Small_value_payments | 244.5653167062043 | Good |
| 7 | 0x1609 | CUS_0xd40 | August |  | 23 | #F%$D@*&8 | Scientist | 19114.12 | 1824.843333 | 3 | ... | Good | 809.98 | 23.933795 |  | No | 49.574949 | 24.785216509052056 | High_spent_Medium_value_payments | 358.12416760938714 | Standard |
| 8 | 0x160e | CUS_0x21b1 | January | Rick Rothackerj | 28_ | 004-07-5839 | _______ | 34847.84 | 3037.986667 | 2 | ... | Good | 605.03 | 24.464031 | 26 Years and 7 Months | No | 18.816215 | 104.291825168246 | Low_spent_Small_value_payments | 470.69062692529184 | Standard |
| 9 | 0x160f | CUS_0x21b1 | February | Rick Rothackerj | 28 | 004-07-5839 | Teacher | 34847.84 | 3037.986667 | 2 | ... | Good | 605.03 | 38.550848 | 26 Years and 8 Months | No | 18.816215 | 40.39123782853101 | High_spent_Large_value_payments | 484.5912142650067 | Good |
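The preview shows several kinds of dirty values: placeholder strings (`_`, `_______`, `!@9#%8`, `#F%$D@*&8`), a trailing underscore in `28_`, and an implausible age of `-500`. A minimal cleaning sketch on a made-up sample follows. The placeholder list only covers what is visible in the preview, and the accepted age range of 0 to 120 is an assumption, not something the dataset documents.

```python
import numpy as np
import pandas as pd

# Small sample reproducing the problem values visible in the preview.
sample = pd.DataFrame({
    "Age": ["23", "-500", "28_"],
    "Occupation": ["Scientist", "_______", "Teacher"],
    "Payment_Behaviour": ["High_spent_Small_value_payments", "!@9#%8",
                          "Low_spent_Small_value_payments"],
    "Credit_Mix": ["_", "Good", "Good"],
})

# Placeholder strings seen in the preview; the full file may contain others.
placeholders = ["_", "_______", "!@9#%8", "#F%$D@*&8"]
cleaned = sample.replace(placeholders, np.nan)

# Strip stray underscores from Age and convert to numbers.
age = pd.to_numeric(cleaned["Age"].str.strip("_"), errors="coerce")

# Keep only ages inside the assumed plausible range; others become missing.
cleaned["Age"] = age.where(age.between(0, 120))
print(cleaned)
```

Turning placeholders into missing values first means they can later be imputed with the same per-customer logic as genuine nulls.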


In [19]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        100000 non-null  object 
 1   Customer_ID               100000 non-null  object 
 2   Month                     100000 non-null  object 
 3   Name                      90015 non-null   object 
 4   Age                       100000 non-null  object 
 5   SSN                       100000 non-null  object 
 6   Occupation                100000 non-null  object 
 7   Annual_Income             100000 non-null  object 
 8   Monthly_Inhand_Salary     84998 non-null   float64
 9   Num_Bank_Accounts         100000 non-null  int64  
 10  Num_Credit_Card           100000 non-null  int64  
 11  Interest_Rate             100000 non-null  int64  
 12  Num_of_Loan               100000 non-null  object 
 13  Type_of_Loan              88592 non-null   ob
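Columns such as `Age`, `Annual_Income`, and `Num_of_Loan` are reported as `object` even though they describe numbers, which suggests that some entries are not clean numeric strings. One way to find the offending entries is sketched below. The helper name `non_numeric_values` is hypothetical, and the sample series is made up from the `Age` values in the preview.

```python
import pandas as pd

def non_numeric_values(series: pd.Series) -> pd.Series:
    """Return the distinct non-null entries that fail numeric conversion."""
    as_text = series.dropna().astype(str)
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    failed = pd.to_numeric(as_text, errors="coerce").isna()
    return as_text[failed].drop_duplicates()

# Made-up sample based on the Age values visible in the preview.
sample_age = pd.Series(["23", "-500", "28_", "23"], name="Age")
bad = non_numeric_values(sample_age)
print(bad.tolist())
```

Note that `-500` parses as a number, so it is not reported here; range checks have to catch values like that separately.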

In [18]:
dataset.isnull().sum()

ID                              0
Customer_ID                     0
Month                           0
Name                         9985
Age                             0
SSN                             0
Occupation                      0
Annual_Income                   0
Monthly_Inhand_Salary       15002
Num_Bank_Accounts               0
Num_Credit_Card                 0
Interest_Rate                   0
Num_of_Loan                     0
Type_of_Loan                11408
Delay_from_due_date             0
Num_of_Delayed_Payment       7002
Changed_Credit_Limit            0
Num_Credit_Inquiries         1965
Credit_Mix                      0
Outstanding_Debt                0
Credit_Utilization_Ratio        0
Credit_History_Age           9030
Payment_of_Min_Amount           0
Total_EMI_per_month             0
Amount_invested_monthly      4479
Payment_Behaviour               0
Monthly_Balance              1200
Credit_Score                    0
dtype: int64
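With 100000 rows, the counts above translate directly into shares: for example, 15002 missing `Monthly_Inhand_Salary` values is about 15% of the rows. A short sketch of how such percentages could be computed and ranked is shown here on a made-up frame; on the real data the same expression would be applied to `dataset`.

```python
import pandas as pd

# Made-up frame with a few missing entries.
toy = pd.DataFrame({
    "Name": ["A", None, "C", "D"],
    "Monthly_Inhand_Salary": [1824.84, None, None, 1824.84],
    "Age": [23, 23, 23, 23],
})

# isnull().mean() gives the fraction of missing rows per column.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that actually have missing values.
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```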

#### Dealing With Null Values

In [29]:
dataset_cpy = dataset.copy()

In [30]:
# Drop the 'Name' column: a person's name carries no information about financial behavior
# or creditworthiness, so it has no predictive value for credit score classification.
dataset_cpy = dataset_cpy.drop(columns=['Name'])

In [None]:
# 'Monthly_Inhand_Salary' appears to be constant per customer in the preview, so missing values
# are imputed within each Customer_ID group from that customer's own non-null values.
dataset_cpy['Monthly_Inhand_Salary'] = dataset_cpy.groupby('Customer_ID')['Monthly_Inhand_Salary'].transform(
    lambda x: x.ffill().bfill()
)
# ffill() carries the previous valid value forward and bfill() then covers any leading gaps,
# so a value stays missing only if the customer has no valid value at all.
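To make the behavior of the grouped `ffill().bfill()` concrete, here is a toy example with two made-up customers. It is only an illustration of the technique; the customer IDs and values are invented.

```python
import pandas as pd

# Customer A has one valid salary; customer B has none at all.
toy = pd.DataFrame({
    "Customer_ID": ["A", "A", "A", "B", "B"],
    "Monthly_Inhand_Salary": [None, 1824.84, None, None, None],
})

# Fill inside each customer group so values never leak between customers.
toy["filled"] = toy.groupby("Customer_ID")["Monthly_Inhand_Salary"].transform(
    lambda s: s.ffill().bfill()
)
print(toy)
```

Customer A's gaps are filled from A's own value, while customer B stays missing because there is nothing to copy from. Rows like B's would need a separate strategy, such as a global median.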

In [None]:
# 'Type_of_Loan' is treated as constant per customer, so missing values are imputed within each
# Customer_ID group from that customer's own non-null values.
dataset_cpy['Type_of_Loan'] = dataset_cpy.groupby('Customer_ID')['Type_of_Loan'].transform(
    lambda x: x.ffill().bfill().infer_objects()
)
# ffill().bfill() fills gaps from both directions; customers with no valid value at all stay missing.
# .infer_objects() asks pandas to re-infer the dtype of the filled object data; string values remain object.

In [55]:
# 'Num_Credit_Inquiries' is assumed to be constant per customer, so missing values are imputed within each Customer_ID group from that customer's nearest non-null value.
dataset_cpy['Num_Credit_Inquiries'] = dataset_cpy.groupby('Customer_ID')['Num_Credit_Inquiries'].transform(
    lambda x: x.ffill().bfill()
)
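The per-customer fills above rest on the assumption that a column holds a single value for each customer. A quick way to check that assumption is to count distinct non-null values per customer, sketched here on made-up data. On the real data the same `groupby(...).nunique()` call would be run on `dataset_cpy`.

```python
import pandas as pd

# Customer A is consistent; customer B has two different values.
toy = pd.DataFrame({
    "Customer_ID": ["A", "A", "A", "B", "B", "B"],
    "Num_Credit_Inquiries": [4.0, None, 4.0, 2.0, 3.0, None],
})

# nunique() ignores NaN by default, so only real values are counted.
distinct = toy.groupby("Customer_ID")["Num_Credit_Inquiries"].nunique()

# Customers with more than one distinct value violate the assumption.
varying = distinct[distinct > 1].index.tolist()
print(varying)
```

If many customers show up in `varying`, forward and backward filling is still usable, but it should be read as "nearest observed value" rather than "the customer's constant value".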

In [59]:
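def convert_to_months(value):
    """Convert a string such as '22 Years and 1 Months' to a total number of months."""
    if pd.isna(value):
        return value  # leave missing entries untouched instead of raising on .replace()
    years, months = map(int, value.replace('Years', '').replace('Months', '').split('and'))
    return years * 12 + months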
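`convert_to_months` works on one string at a time. For a whole column, a vectorized alternative based on `Series.str.extract` is sketched below. The regular expression assumes every non-null entry follows the "N Years and M Months" pattern seen in the preview, and the sample series is made up.

```python
import pandas as pd

# Sample values in the format shown in the preview, including a missing entry.
ages = pd.Series(["22 Years and 1 Months", None, "26 Years and 7 Months"])

# Named groups become the columns 'years' and 'months'; non-matching rows yield NaN.
parts = ages.str.extract(r"(?P<years>\d+)\s+Years\s+and\s+(?P<months>\d+)\s+Months")

years = pd.to_numeric(parts["years"])
months = pd.to_numeric(parts["months"])

# The nullable Int64 dtype keeps whole-number months while still allowing missing values.
total_months = (years * 12 + months).astype("Int64")
print(total_months.tolist())
```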

In [68]:
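# Sanity check on the first row: '22 Years and 1 Months' should give 22 * 12 + 1 months.
first_row = dataset_cpy.iloc[0]
history_months = convert_to_months(first_row['Credit_History_Age'])

In [69]:
print(history_months)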

265


In [58]:
dataset_cpy[['Customer_ID', 'Age', 'Month', 'Credit_History_Age']].head(32)

|  | Customer_ID | Age | Month | Credit_History_Age |
|---|---|---|---|---|
| 0 | CUS_0xd40 | 23 | January | 22 Years and 1 Months |
| 1 | CUS_0xd40 | 23 | February |  |
| 2 | CUS_0xd40 | -500 | March | 22 Years and 3 Months |
| 3 | CUS_0xd40 | 23 | April | 22 Years and 4 Months |
| 4 | CUS_0xd40 | 23 | May | 22 Years and 5 Months |
| 5 | CUS_0xd40 | 23 | June | 22 Years and 6 Months |
| 6 | CUS_0xd40 | 23 | July | 22 Years and 7 Months |
| 7 | CUS_0xd40 | 23 | August |  |
| 8 | CUS_0x21b1 | 28_ | January | 26 Years and 7 Months |
| 9 | CUS_0x21b1 | 28 | February | 26 Years and 8 Months |
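In the rows above, `Credit_History_Age` grows by one month per row within a customer, so a plain forward or backward fill would be off by a month. One possible approach is to subtract each row's position within its customer, fill the resulting constant baseline, and add the position back. The sketch below assumes that rows are in month order with no months skipped, that the column has already been converted to months, and that it is called `Credit_History_Months`, which is a hypothetical name.

```python
import pandas as pd

# Toy frame in months: 265 = 22 years 1 month, 319 = 26 years 7 months.
toy = pd.DataFrame({
    "Customer_ID": ["CUS_0xd40"] * 4 + ["CUS_0x21b1"] * 2,
    "Credit_History_Months": [265, None, 267, None, 319, 320],
})

# Row position within each customer (0, 1, 2, ...); assumes month order.
position = toy.groupby("Customer_ID").cumcount()

# Removing the position leaves a baseline that should be constant per customer.
baseline = toy["Credit_History_Months"] - position
baseline = baseline.groupby(toy["Customer_ID"]).transform(lambda s: s.ffill().bfill())

# Adding the position back restores the month-by-month increase, gaps included.
toy["Credit_History_Months"] = (baseline + position).astype("Int64")
print(toy)
```

Unlike linear interpolation, this also extends past the last observed row of a customer, which matters for a gap like the August row of `CUS_0xd40`.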
