# **Credit Score Classification Using Machine Learning**
This notebook demonstrates a complete pipeline for cleaning, preprocessing, and modeling a dataset for predicting credit scores. The steps include:

1. Importing necessary libraries
2. Loading and exploring the dataset
3. Cleaning and preprocessing data
4. Feature engineering and encoding
5. Handling imbalanced data with SMOTE
6. Training machine learning models using GridSearchCV
7. Comparing model performances
8. Making predictions on the test set
9. Saving predictions for submission

---


## **1. Importing Libraries**
We begin by importing all necessary libraries for data manipulation, visualization, preprocessing, and modeling. This includes:

- `pandas` and `numpy` for data manipulation
- `seaborn` and `matplotlib` for visualization
- Machine learning libraries like `sklearn`, `xgboost`, and `imblearn`
- `warnings` to suppress warnings for a cleaner output

---

In [255]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, RobustScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier  # used in the KNN grid search
from sklearn.impute import KNNImputer
from sklearn.naive_bayes import GaussianNB
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings("ignore")

## **2. Loading the Dataset**
We load the training and testing datasets and display basic information to understand the data structure.

Steps:
1. Load datasets using `pandas.read_csv()`
2. Display the shape of the datasets
3. Check for missing values
4. Identify data types in each column
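
The checks listed above can be combined into one overview table. The sketch below runs on a tiny in-memory CSV with made-up values (a stand-in for `train.csv`, not the real data), and the names `csv_text`, `sample`, and `overview` are illustrative only.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for train.csv (hypothetical values).
csv_text = """ID,Age,Base_Salary_PerMonth,Credit_Score
0x1,23,1824.8,Good
0x2,28_,,Standard
0x3,-500,3037.9,Poor
"""
sample = pd.read_csv(io.StringIO(csv_text))

# One row per column: its dtype, missing count, and missing percentage.
overview = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "missing": sample.isnull().sum(),
    "missing_pct": (sample.isnull().mean() * 100).round(2),
})
print(sample.shape)
print(overview)
```

Note that `Age` is read as text because of the stray `_`, which is the kind of issue the cleaning step has to handle.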

---

In [256]:
# Load the dataset
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Display basic information
print(f"Training data shape: {train_data.shape}")
print(f"Test data shape: {test_data.shape}")

# Check for missing values
print("Missing values in training data:")
print(train_data.isnull().sum())

# Display data types of columns
print("Data types in training data:")
print(train_data.dtypes)

Training data shape: (80000, 28)
Test data shape: (20000, 27)
Missing values in training data:
ID                              0
Customer_ID                     0
Month                           0
Name                         8029
Age                             0
Number                          0
Profession                      0
Income_Annual                   0
Base_Salary_PerMonth        12032
Total_Bank_Accounts             0
Total_Credit_Cards              0
Rate_Of_Interest                0
Total_Current_Loans             0
Loan_Type                    9157
Delay_from_due_date             0
Total_Delayed_Payments       5595
Credit_Limit                    0
Total_Credit_Enquiries       1549
Credit_Mix                      0
Current_Debt_Outstanding        0
Ratio_Credit_Utilization        0
Credit_History_Age           7240
Payment_of_Min_Amount           0
Per_Month_EMI                   0
Monthly_Investment           3605
Payment_Behaviour               0
Monthly_Balance      

In [257]:
train_data.columns

Index(['ID', 'Customer_ID', 'Month', 'Name', 'Age', 'Number', 'Profession',
       'Income_Annual', 'Base_Salary_PerMonth', 'Total_Bank_Accounts',
       'Total_Credit_Cards', 'Rate_Of_Interest', 'Total_Current_Loans',
       'Loan_Type', 'Delay_from_due_date', 'Total_Delayed_Payments',
       'Credit_Limit', 'Total_Credit_Enquiries', 'Credit_Mix',
       'Current_Debt_Outstanding', 'Ratio_Credit_Utilization',
       'Credit_History_Age', 'Payment_of_Min_Amount', 'Per_Month_EMI',
       'Monthly_Investment', 'Payment_Behaviour', 'Monthly_Balance',
       'Credit_Score'],
      dtype='object')

In [258]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80000 entries, 0 to 79999
Data columns (total 28 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ID                        80000 non-null  object 
 1   Customer_ID               80000 non-null  object 
 2   Month                     80000 non-null  object 
 3   Name                      71971 non-null  object 
 4   Age                       80000 non-null  object 
 5   Number                    80000 non-null  object 
 6   Profession                80000 non-null  object 
 7   Income_Annual             80000 non-null  object 
 8   Base_Salary_PerMonth      67968 non-null  float64
 9   Total_Bank_Accounts       80000 non-null  int64  
 10  Total_Credit_Cards        80000 non-null  int64  
 11  Rate_Of_Interest          80000 non-null  int64  
 12  Total_Current_Loans       80000 non-null  object 
 13  Loan_Type                 70843 non-null  object 
 14  Delay_

In [259]:
train_data.describe().T

|  | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base_Salary_PerMonth | 67968.0 | 4189.211406 | 3181.3711 | 303.645417 | 1623.664167 | 3086.683333 | 5950.863333 | 15204.633333 |
| Total_Bank_Accounts | 80000.0 | 17.046287 | 116.953761 | -1.0 | 4.0 | 6.0 | 7.0 | 1798.0 |
| Total_Credit_Cards | 80000.0 | 22.175438 | 128.083595 | 0.0 | 4.0 | 5.0 | 7.0 | 1499.0 |
| Rate_Of_Interest | 80000.0 | 72.26135 | 466.370837 | 1.0 | 8.0 | 14.0 | 20.0 | 5797.0 |
| Delay_from_due_date | 80000.0 | 21.081663 | 14.85521 | -5.0 | 10.0 | 18.0 | 28.0 | 67.0 |
| Total_Credit_Enquiries | 78451.0 | 28.153115 | 194.812201 | 0.0 | 3.0 | 6.0 | 9.0 | 2597.0 |
| Ratio_Credit_Utilization | 80000.0 | 32.273436 | 5.116887 | 20.0 | 28.052046 | 32.292625 | 36.482439 | 49.564519 |
| Per_Month_EMI | 80000.0 | 1414.789973 | 8323.122028 | 0.0 | 30.305498 | 68.839655 | 160.585877 | 82331.0 |


In [260]:
train_data.describe(exclude=np.number).T

|  | count | unique | top | freq |
| --- | --- | --- | --- | --- |
| ID | 80000 | 80000 | 0x522a | 1 |
| Customer_ID | 80000 | 12500 | CUS_0x4a7 | 8 |
| Month | 80000 | 8 | June | 10035 |
| Name | 71971 | 10139 | Stevex | 37 |
| Age | 80000 | 1466 | 28 | 2247 |
| Number | 80000 | 12501 | `#F%$D@*&8` | 4443 |
| Profession | 80000 | 16 | `_______` | 5691 |
| Income_Annual | 80000 | 17821 | 20867.67 | 13 |
| Total_Current_Loans | 80000 | 381 | 3 | 11543 |
| Loan_Type | 70843 | 6260 | Not Specified | 1105 |


## **3. Data Cleaning**
The raw data often contains inconsistencies, such as invalid entries, placeholder values, or irrelevant columns. 

### Steps:
1. **Convert Numeric Columns**: Remove invalid characters and convert columns to numeric.
2. **Drop Irrelevant Columns**: Remove columns like `Customer_ID`, `Name`, and `Number`.
3. **Replace Placeholder Values**: Replace placeholders such as `_______` or `_` with `NaN` for easier imputation.
4. **Convert `Credit_History_Age`**: Convert credit history from years and months to total months for uniformity.
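
As a minimal sketch of steps 1, 3, and 4, the example below applies them to a small hand-made frame. The values are invented for illustration, and the regex-based `str.extract` shown for `Credit_History_Age` is an alternative formulation, assumed here for brevity, rather than the exact helper used in the notebook.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "Age": ["23", "28_", "_", "41"],
    "Profession": ["Lawyer", "_______", "Doctor", "Writer"],
    "Credit_History_Age": ["24 Years and 1 Months", None,
                           "4 Years and 6 Months", "0 Years and 11 Months"],
})

# Step 1: keep only digits and the decimal point, then coerce to numeric.
# Entries with nothing left (such as "_") become NaN.
raw["Age"] = pd.to_numeric(
    raw["Age"].str.replace(r"[^0-9.]", "", regex=True), errors="coerce"
)

# Step 3: turn the placeholder string into a proper missing value.
raw["Profession"] = raw["Profession"].replace("_______", np.nan)

# Step 4: pull out years and months, then combine them into total months.
parts = raw["Credit_History_Age"].str.extract(r"(\d+) Years and (\d+) Months").astype(float)
raw["Credit_History_Age"] = parts[0] * 12 + parts[1]

print(raw)
```

One caveat of the character filter in step 1: it also removes minus signs, so a negative entry is kept as its absolute value.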

---


In [261]:
# Clean numeric columns
columns_to_clean = ['Age', 'Income_Annual', 'Current_Debt_Outstanding', 'Monthly_Investment', 
                    'Monthly_Balance', 'Total_Current_Loans', 'Total_Delayed_Payments', 'Credit_Limit']
for col in columns_to_clean:
    train_data[col] = pd.to_numeric(train_data[col].replace('[^0-9.]', '', regex=True), errors='coerce')
    test_data[col] = pd.to_numeric(test_data[col].replace('[^0-9.]', '', regex=True), errors='coerce')

In [262]:
train_data.describe().T

|  | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Age | 80000.0 | 119.498 | 686.0849 | 14.0 | 25.0 | 33.0 | 42.0 | 8698.0 |
| Income_Annual | 80000.0 | 174765.2 | 1422707.0 | 7005.93 | 19419.65 | 37482.37 | 72678.605 | 24198060.0 |
| Base_Salary_PerMonth | 67968.0 | 4189.211 | 3181.371 | 303.645417 | 1623.664167 | 3086.683333 | 5950.863333 | 15204.63 |
| Total_Bank_Accounts | 80000.0 | 17.04629 | 116.9538 | -1.0 | 4.0 | 6.0 | 7.0 | 1798.0 |
| Total_Credit_Cards | 80000.0 | 22.17544 | 128.0836 | 0.0 | 4.0 | 5.0 | 7.0 | 1499.0 |
| Rate_Of_Interest | 80000.0 | 72.26135 | 466.3708 | 1.0 | 8.0 | 14.0 | 20.0 | 5797.0 |
| Total_Current_Loans | 80000.0 | 11.04204 | 63.88997 | 0.0 | 2.0 | 3.0 | 6.0 | 1496.0 |
| Delay_from_due_date | 80000.0 | 21.08166 | 14.85521 | -5.0 | 10.0 | 18.0 | 28.0 | 67.0 |
| Total_Delayed_Payments | 74405.0 | 31.20578 | 228.9401 | 0.0 | 9.0 | 14.0 | 18.0 | 4397.0 |
| Credit_Limit | 78309.0 | 10.46665 | 6.666356 | 0.0 | 5.36 | 9.41 | 14.85 | 36.97 |


In [263]:
train_data.describe(exclude=np.number).T

|  | count | unique | top | freq |
| --- | --- | --- | --- | --- |
| ID | 80000 | 80000 | 0x522a | 1 |
| Customer_ID | 80000 | 12500 | CUS_0x4a7 | 8 |
| Month | 80000 | 8 | June | 10035 |
| Name | 71971 | 10139 | Stevex | 37 |
| Number | 80000 | 12501 | `#F%$D@*&8` | 4443 |
| Profession | 80000 | 16 | `_______` | 5691 |
| Loan_Type | 70843 | 6260 | Not Specified | 1105 |
| Credit_Mix | 80000 | 4 | Standard | 29214 |
| Credit_History_Age | 72760 | 404 | 18 Years and 3 Months | 361 |
| Payment_of_Min_Amount | 80000 | 3 | Yes | 41857 |


In [264]:
train_data.drop(columns=['Customer_ID', 'Name', 'Number'], inplace=True)
test_data.drop(columns=['Customer_ID', 'Name', 'Number'], inplace=True)

In [265]:
train_data['Month'].value_counts()

Month
June        10035
August      10025
July        10013
January     10013
May          9999
April        9987
February     9985
March        9943
Name: count, dtype: int64

In [266]:
train_data['Profession'].value_counts()

Profession
_______          5691
Lawyer           5273
Architect        5100
Mechanic         5095
Engineer         5067
Scientist        5044
Accountant       5007
Teacher          4990
Media_Manager    4959
Developer        4946
Journalist       4909
Entrepreneur     4857
Doctor           4843
Manager          4810
Musician         4717
Writer           4692
Name: count, dtype: int64

In [267]:
test_data['Profession'].value_counts()

Profession
_______          1371
Entrepreneur     1317
Lawyer           1302
Developer        1289
Engineer         1283
Media_Manager    1273
Accountant       1264
Architect        1255
Scientist        1255
Doctor           1244
Teacher          1225
Mechanic         1196
Musician         1194
Writer           1193
Journalist       1176
Manager          1163
Name: count, dtype: int64

In [268]:
# Replace placeholders with NaN
train_data['Profession'] = train_data['Profession'].replace('_______', np.nan)
test_data['Profession'] = test_data['Profession'].replace('_______', np.nan)
train_data['Credit_Mix'] = train_data['Credit_Mix'].replace('_', np.nan)
test_data['Credit_Mix'] = test_data['Credit_Mix'].replace('_', np.nan)

In [269]:
train_data

|  | ID | Month | Age | Profession | Income_Annual | Base_Salary_PerMonth | Total_Bank_Accounts | Total_Credit_Cards | Rate_Of_Interest | Total_Current_Loans | ... | Credit_Mix | Current_Debt_Outstanding | Ratio_Credit_Utilization | Credit_History_Age | Payment_of_Min_Amount | Per_Month_EMI | Monthly_Investment | Payment_Behaviour | Monthly_Balance | Credit_Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0x522a | May | 51 | Musician | 101583.480 | NaN | 5 | 7 | 10 | 4 | ... | Standard | 50.93 | 34.462154 | 24 Years and 1 Months | No | 190.811017 | 630.015789 | Low_spent_Large_value_payments | 314.002193 | Standard |
| 1 | 0x6091 | August | 23 | Writer | 101926.950 | 8635.912500 | 4 | 4 | 9 | 1 | ... | NaN | 1058.00 | 39.693812 | 20 Years and 5 Months | No | 70.587681 | 662.803927 | Low_spent_Medium_value_payments | 410.199642 | Standard |
| 2 | 0xcb5f | February | 49 | Writer | 158871.120 | NaN | 0 | 4 | 8 | 1 | ... | Good | 576.48 | 39.367225 | 19 Years and 0 Months | No | 86.905860 | 746.805985 | Low_spent_Medium_value_payments | 742.514154 | Standard |
| 3 | 0x17dbc | March | 40 | Doctor | 60379.280 | NaN | 5 | 6 | 18 | 3 | ... | Standard | 725.39 | 29.061701 | 17 Years and 1 Months | NM | 90.906385 | 166.418658 | High_spent_Medium_value_payments | 473.135623 | Standard |
| 4 | 0x225b3 | June | 17 | Accountant | 50050.830 | 4085.902500 | 9 | 10 | 20 | 5 | ... | Bad | 3419.10 | 30.386321 | 4 Years and 6 Months | Yes | 190.445060 | 56.789441 | High_spent_Large_value_payments | 401.355749 | Poor |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 79995 | 0x15619 | August | 55 | Doctor | 114597.040 | NaN | 7 | 6 | 4 | 4 | ... | Good | 926.18 | 26.436313 | 31 Years and 9 Months | No | 225.923762 | 327.619668 | High_spent_Medium_value_payments | 633.131903 | Poor |
| 79996 | 0x3c48 | July | 28 | Entrepreneur | 8227.855 | 656.654583 | 6 | 8 | 31 | 100 | ... | NaN | 2695.38 | 24.127401 | 11 Years and 8 Months | NM | 7352.000000 | 49.544158 | Low_spent_Medium_value_payments | 268.108435 | Poor |
| 79997 | 0x9589 | August | 46 | Writer | 35032.660 | 2853.388333 | 10 | 6 | 33 | 7 | ... | Bad | 1789.00 | 25.086176 | 11 Years and 5 Months | Yes | 150.500097 | 106.735679 | Low_spent_Small_value_payments | 318.103057 | Poor |
| 79998 | 0x74fe | May | 42 | Mechanic | 129680.280 | 10643.690000 | 8 | 3 | 5 | 2 | ... | NaN | 240.27 | 33.944094 | 20 Years and 5 Months | NM | 114.165609 | 567.179873 | High_spent_Small_value_payments | 643.023518 | Standard |


In [270]:
def Month_Converter(val):
    if pd.notnull(val):
        years = int(val.split(' ')[0])
        month = int(val.split(' ')[3])
        return (years * 12) + month
    else:
        return val

train_data['Credit_History_Age'] = train_data['Credit_History_Age'].apply(lambda x: Month_Converter(x)).astype(float)
test_data['Credit_History_Age'] = test_data['Credit_History_Age'].apply(lambda x: Month_Converter(x)).astype(float)

In [271]:
train_data['Payment_of_Min_Amount'].value_counts()

Payment_of_Min_Amount
Yes    41857
No     28509
NM      9634
Name: count, dtype: int64

In [272]:
train_data['Payment_Behaviour'].value_counts()

Payment_Behaviour
Low_spent_Small_value_payments      20470
High_spent_Medium_value_payments    14057
Low_spent_Medium_value_payments     11101
High_spent_Large_value_payments     10931
High_spent_Small_value_payments      9034
Low_spent_Large_value_payments       8309
!@9#%8                               6098
Name: count, dtype: int64

In [273]:
test_data['Payment_Behaviour'].value_counts()

Payment_Behaviour
Low_spent_Small_value_payments      5043
High_spent_Medium_value_payments    3483
High_spent_Large_value_payments     2790
Low_spent_Medium_value_payments     2760
High_spent_Small_value_payments     2306
Low_spent_Large_value_payments      2116
!@9#%8                              1502
Name: count, dtype: int64

## **4. Feature Engineering and Encoding**
To make the data suitable for machine learning models, we perform the following:

1. **Handle Categorical Features**:
   - Replace invalid entries in `Payment_Behaviour`.
   - Encode `Credit_Score` as numeric values (`Poor`: 0, `Standard`: 1, `Good`: 2).
2. **Impute Missing Values**:
   - Use the mode within groups for categorical columns like `Profession`.
3. **Binary Encoding for `Loan_Type`**:
   - Convert multi-label data into separate binary columns using `MultiLabelBinarizer`.
4. **Label Encoding**:
   - Convert categorical columns into numeric representations using `LabelEncoder`.
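
To make steps 2 and 3 concrete, here is a small sketch on invented rows. It assumes a `Customer_ID`-style grouping key (hypothetical in this example) and loan strings in the "A, B, and C" form.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "Customer_ID": ["A", "A", "A", "B", "B"],  # hypothetical grouping key
    "Profession": ["Lawyer", np.nan, "Lawyer", "Doctor", np.nan],
    "Loan_Type": ["Auto Loan, and Student Loan", "Auto Loan", np.nan,
                  "Student Loan", "Auto Loan, Payday Loan, and Student Loan"],
})

# Group-wise mode: each missing Profession takes its group's most frequent value.
group_mode = df.groupby("Customer_ID")["Profession"].transform(
    lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan
)
df["Profession"] = df["Profession"].fillna(group_mode)

# Multi-label binarization: split each string into a list of loan types,
# then create one 0/1 column per distinct type.
loan_lists = (
    df["Loan_Type"].fillna("Not Specified")
    .str.replace(r"\band ", "", regex=True)
    .str.split(", ")
)
mlb = MultiLabelBinarizer()
loan_flags = pd.DataFrame(mlb.fit_transform(loan_lists),
                          columns=mlb.classes_, index=df.index)

print(df["Profession"].tolist())
print(loan_flags)
```

The grouping key has to repeat across rows for the group mode to be useful. A key that is unique per row gives every missing value an empty group to draw from.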

---


In [274]:
train_data['Payment_Behaviour'] = train_data['Payment_Behaviour'].replace('!@9#%8', np.nan)
test_data['Payment_Behaviour'] = test_data['Payment_Behaviour'].replace('!@9#%8', np.nan)
# map() returns an integer column directly and avoids the downcasting behaviour of replace()
train_data['Credit_Score'] = train_data['Credit_Score'].map({'Poor': 0, 'Standard': 1, 'Good': 2})

In [275]:
null_value_percentages=(train_data.isna().sum()/train_data.shape[0])*100
null_value_percentages[null_value_percentages>0]

Profession                 7.11375
Base_Salary_PerMonth      15.04000
Loan_Type                 11.44625
Total_Delayed_Payments     6.99375
Credit_Limit               2.11375
Total_Credit_Enquiries     1.93625
Credit_Mix                20.13500
Credit_History_Age         9.05000
Monthly_Investment         4.50625
Payment_Behaviour          7.62250
Monthly_Balance            1.18750
dtype: float64

In [276]:
# Use the test set's own row count as the denominator
null_value_percentages_2=(test_data.isna().sum()/test_data.shape[0])*100
columns_with_nulls = null_value_percentages_2[null_value_percentages_2 > 0].index.to_list()
null_value_percentages_2[null_value_percentages_2>0]


Profession                 6.855
Base_Salary_PerMonth      14.850
Loan_Type                 11.255
Total_Delayed_Payments     7.035
Credit_Limit               2.000
Total_Credit_Enquiries     2.080
Credit_Mix                20.435
Credit_History_Age         8.950
Monthly_Investment         4.370
Payment_Behaviour          7.510
Monthly_Balance            1.250
dtype: float64

In [277]:
train_data.describe(exclude=np.number).T

|  | count | unique | top | freq |
| --- | --- | --- | --- | --- |
| ID | 80000 | 80000 | 0x522a | 1 |
| Month | 80000 | 8 | June | 10035 |
| Profession | 74309 | 15 | Lawyer | 5273 |
| Loan_Type | 70843 | 6260 | Not Specified | 1105 |
| Credit_Mix | 63892 | 3 | Standard | 29214 |
| Payment_of_Min_Amount | 80000 | 3 | Yes | 41857 |
| Payment_Behaviour | 73902 | 6 | Low_spent_Small_value_payments | 20470 |


In [None]:
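columns_to_impute_mode = ['Profession', 'Payment_Behaviour']

# 'ID' is unique per row, so a group mode over it can never fill a missing value.
# Group by customer instead. Customer_ID was dropped earlier, so reload it from
# the CSV files; row order has not changed since loading.
train_customer = pd.read_csv('train.csv', usecols=['Customer_ID'])['Customer_ID']
test_customer = pd.read_csv('test.csv', usecols=['Customer_ID'])['Customer_ID']

def fill_missing_with_group_mode(df, groupby, column):
    mode_per_group = df.groupby(groupby)[column].transform(lambda x: x.mode()[0] if not x.mode().empty else np.nan)
    df[column] = df[column].fillna(mode_per_group)  # assign back instead of chained inplace fillna

for col in columns_to_impute_mode:
    fill_missing_with_group_mode(train_data, train_customer, col)
    fill_missing_with_group_mode(test_data, test_customer, col)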

In [None]:
# Convert Loan_Type to binary columns
from sklearn.preprocessing import MultiLabelBinarizer
train_data['Loan_Type'] = train_data['Loan_Type'].fillna('Not Specified').str.replace(r'\band \b', '', regex=True).str.strip()
train_data['Loan_Type_List'] = train_data['Loan_Type'].str.split(', ')
mlb = MultiLabelBinarizer()
loan_type_encoded_train = mlb.fit_transform(train_data['Loan_Type_List'])
loan_type_df_train = pd.DataFrame(loan_type_encoded_train, columns=mlb.classes_, index=train_data.index)
train_data = pd.concat([train_data, loan_type_df_train], axis=1).drop(columns=['Loan_Type', 'Loan_Type_List'])

# Apply similar transformation to test data
test_data['Loan_Type'] = test_data['Loan_Type'].fillna('Not Specified').str.replace(r'\band \b', '', regex=True).str.strip()
test_data['Loan_Type_List'] = test_data['Loan_Type'].str.split(', ')
loan_type_encoded_test = mlb.transform(test_data['Loan_Type_List'])
loan_type_df_test = pd.DataFrame(loan_type_encoded_test, columns=mlb.classes_, index=test_data.index)
test_data = pd.concat([test_data, loan_type_df_test], axis=1).drop(columns=['Loan_Type', 'Loan_Type_List'])

# Encode other categorical features
categorical_columns = ['Month', 'Profession', 'Credit_Mix', 'Payment_of_Min_Amount', 'Payment_Behaviour']
label_encoders = {}
for column in categorical_columns:
    label_encoders[column] = LabelEncoder()
    train_data[column] = label_encoders[column].fit_transform(train_data[column])
    test_data[column] = label_encoders[column].transform(test_data[column])

## **5. Imputation for Missing Values**
Missing values can significantly impact model performance. To handle them:

1. Use `KNNImputer` on the columns from `Age` to `Monthly_Balance`, which fills each missing value with the mean of that feature over the 5 nearest rows.
2. Categorical columns such as `Profession` are imputed beforehand with the group mode (section 4).

This ensures the dataset is complete and consistent for machine learning models.
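
A toy example helps show what `KNNImputer` does. The array below is invented, and `n_neighbors=2` is chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.impute import KNNImputer

X_toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],   # second feature is missing
    [3.0, 30.0],
    [8.0, 80.0],
])

# Neighbours are found using the features that are present (here the first column).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_toy)

# The two nearest rows to [2.0, nan] are [1.0, 10.0] and [3.0, 30.0],
# so the gap is filled with the mean of 10 and 30.
print(X_filled[1])
```

Distances are computed on the raw feature values, so when the imputer runs before scaling, features with large magnitudes have the most influence on which rows count as neighbors.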

---


In [None]:
imputer = KNNImputer(n_neighbors=5)
columns_to_impute = train_data.loc[:, 'Age':'Monthly_Balance'].columns
train_data[columns_to_impute] = imputer.fit_transform(train_data[columns_to_impute])
test_data[columns_to_impute] = imputer.transform(test_data[columns_to_impute])

## **6. Scaling Numeric Features**
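To keep features with large ranges from dominating, we scale the features using `RobustScaler`. It centers each feature on its median and divides by its interquartile range (IQR), which makes it less sensitive to outliers than mean and standard deviation scaling.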
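
The following sketch checks `RobustScaler` against the manual formula `(x - median) / IQR` on a made-up column that contains one extreme value, similar to the implausible ages seen in the summary statistics.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

ages = np.array([[25.0], [33.0], [42.0], [28.0], [8698.0]])  # one extreme outlier

scaled = RobustScaler().fit_transform(ages)

# Manual computation of the same transformation.
median = np.median(ages)
q1, q3 = np.percentile(ages, [25, 75])
manual = (ages - median) / (q3 - q1)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The outlier is still large after scaling, but it does not distort the scale of the typical values, because the median and IQR barely move when one extreme point is added.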

---

In [None]:
# Define features and target. Dropping ID and the target keeps the binary
# loan-type columns, which were appended after Credit_Score during encoding.
X = train_data.drop(columns=['ID', 'Credit_Score'])
y = train_data['Credit_Score']

# Apply RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)

# Scale test data, using the same columns in the same order
X_test = test_data[X.columns]
X_test_scaled = scaler.transform(X_test)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

## **7. Comprehensive Model Evaluation and Comparison**

In this analysis, we train, tune, and evaluate several machine learning models to determine the best-performing classifier based on accuracy. The models considered include **Decision Tree**, **Random Forest**, **XGBoost**, **K-Nearest Neighbors (KNN)**, and **Gaussian Naive Bayes (GaussianNB)**.

Additionally, **SMOTE (Synthetic Minority Over-sampling Technique)** was used for handling class imbalance. This technique synthesizes new samples in the feature space to balance the class distribution before training the models. This step helps to avoid bias towards the majority class and improve model performance, especially in imbalanced datasets.
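
The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched in a few lines of NumPy. This is a simplified illustration and not `imblearn`'s implementation; the function name `smote_like_samples` and the toy points are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_like_samples(X_minority, n_new, k=2):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = X_minority[rng.integers(len(X_minority))]
        # Distances from the base point to every minority point.
        dists = np.linalg.norm(X_minority - base, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the base point itself
        neighbour = X_minority[rng.choice(neighbours)]
        lam = rng.random()                        # position along the segment
        synthetic.append(base + lam * (neighbour - base))
    return np.array(synthetic)

X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])
new_rows = smote_like_samples(X_min, n_new=5)
print(new_rows)
```

Every synthetic row lies on a line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.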

### **Models and Hyperparameters**

1. **Decision Tree**:
   - Parameters tuned:
     - `criterion`: Split-quality measure (Gini impurity or entropy).
     - `splitter`: Strategy for choosing the split at each node (`best` or `random`).
     - `max_depth`: Maximum depth of the tree.
     - `min_samples_split`: Minimum samples required to split a node.
     - `min_samples_leaf`: Minimum samples required to be a leaf node.

2. **Random Forest**:
   - Parameters tuned:
     - `n_estimators`: Number of trees.
     - `max_features`: Number of features considered when looking for the best split.
     - `max_depth`: Maximum depth of the trees.
     - `min_samples_split`: Minimum samples required to split a node.
     - `min_samples_leaf`: Minimum samples required to be a leaf node.
     - `bootstrap`: Whether to use bootstrap sampling.

3. **XGBoost**:
   - Parameters tuned:
     - `n_estimators`: Number of boosting rounds.
     - `max_depth`: Maximum depth of a tree.
     - `learning_rate`: Step size shrinkage to prevent overfitting.
     - `gamma`: Minimum loss reduction required for splitting.

4. **K-Nearest Neighbors (KNN)**:
   - Parameters tuned:
     - `n_neighbors`: Number of neighbors.
     - `weights`: Weight function (e.g., distance-based).
     - `metric`: Distance metric (e.g., Manhattan distance).

5. **Gaussian Naive Bayes (GaussianNB)**:
   - No hyperparameters are tuned; the model is evaluated with its default settings.

### **Evaluation Process**
- For each model except GaussianNB, hyperparameters are selected with **GridSearchCV** using stratified cross-validation, so each fold preserves the class proportions.
- **SMOTE** is applied to the training data before the grid search so that models do not become biased towards the majority class. Because resampling happens before the folds are split, synthetic samples can be interpolated from rows that end up in a validation fold, so the reported cross-validation accuracies are likely optimistic.
- Models are evaluated using **accuracy** as the scoring metric.
- GaussianNB is evaluated without tuning, using `cross_val_score` with the same accuracy metric.
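
The tuning procedure can be sketched on synthetic data. The dataset from `make_classification`, the grid values, and the variable names below are assumptions chosen for a quick run, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Small synthetic three-class problem (stand-in for the credit data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

# Stratified folds keep the class proportions the same in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, scoring="accuracy", cv=cv
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Passing the same `cv` object to every search keeps the folds identical across models, which makes their cross-validation accuracies directly comparable.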

### **Results**
- The **best parameters** for each model and their cross-validation accuracies are printed.
- Models are compared on cross-validation accuracy, and the highest-scoring one is carried forward to prediction.

### **Key Observations**
1. Hyperparameter tuning matters most for flexible models like Random Forest and XGBoost, whose accuracy depends strongly on settings such as tree depth, number of trees, and learning rate.
2. Simpler models like KNN and GaussianNB are quick to evaluate and make useful baselines. **GaussianNB** in particular is a reasonable choice when computational efficiency is required.
3. **SMOTE** balances the class distribution so that no class dominates training. Its effect on accuracy is not measured here, since no model is evaluated both with and without resampling.
4. The best model depends on the dataset's complexity and preprocessing. Tuned ensembles like XGBoost and Random Forest often come out ahead on tabular data, but simpler models may still hold up well depending on the problem.

### **Conclusion**
Combining SMOTE with hyperparameter tuning gives every model a balanced training set and a tuned configuration. How much each step helps depends on the dataset and on the parameter grids chosen, so the printed accuracies should be read as a comparison between models under this particular setup.


In [None]:
smote = SMOTE(random_state=42)  # fixed seed so the resampled data is reproducible
X_sm, y_sm = smote.fit_resample(X_scaled, y)

# Define models and parameters
skfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
tree_params = {"criterion": ["gini"], "splitter": ["best"], "max_depth": [15], "min_samples_split": [2], "min_samples_leaf": [5]}
rf_params = {'n_estimators': [200], 'max_features': ['sqrt'], 'max_depth': [10], 'min_samples_split': [2], 'min_samples_leaf': [4], 'bootstrap': [True]}
xgb_params = {'n_estimators': [200], 'max_depth': [15, 10], 'learning_rate': [0.5, 0.25], 'gamma': [0.03]}

# GridSearch for Decision Tree
tree_clf = DecisionTreeClassifier(random_state=42)
tree_cv = GridSearchCV(tree_clf, tree_params, scoring="accuracy", n_jobs=-1, verbose=1, cv=skfold)
tree_cv.fit(X_sm, y_sm)
print(f"Best parameters for Decision Tree: {tree_cv.best_params_}")
print("Best Cross-Validation Accuracy for Decision Tree:", tree_cv.best_score_)

# GridSearch for Random Forest
rf_clf = RandomForestClassifier(random_state=42)
rf_cv = GridSearchCV(rf_clf, rf_params, scoring="accuracy", cv=3, verbose=2, n_jobs=-1)
rf_cv.fit(X_sm, y_sm)
print(f"Best parameters for Random Forest: {rf_cv.best_params_}")
print("Best Cross-Validation Accuracy for Random Forest:", rf_cv.best_score_)

# GridSearch for XGBoost
xgb_clf = XGBClassifier(random_state=42)
xgb_cv = GridSearchCV(xgb_clf, xgb_params, scoring='accuracy', cv=3, verbose=2, n_jobs=-1)
xgb_cv.fit(X_sm, y_sm)
print(f"Best parameters for XGBoost: {xgb_cv.best_params_}")
print("Best Cross-Validation Accuracy for XGBoost:", xgb_cv.best_score_)


# KNN (fitted on the SMOTE-resampled data, like the models above)
knn_params = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance'], 'metric': ['euclidean', 'manhattan']}
knn_clf = KNeighborsClassifier()
knn_cv = GridSearchCV(knn_clf, knn_params, scoring='accuracy', cv=3, verbose=2, n_jobs=-1)
knn_cv.fit(X_sm, y_sm)
print(f"KNN - Best parameters: {knn_cv.best_params_}")
print(f"KNN - Best Cross-Validation Accuracy: {knn_cv.best_score_}")

# Naive Bayes (no tuning; evaluated on the same resampled data)
nb_clf = GaussianNB()
nb_scores = cross_val_score(nb_clf, X_sm, y_sm, cv=skfold, scoring='accuracy')
nb_accuracy = nb_scores.mean()
print(f"Naive Bayes - Accuracy: {nb_accuracy}")

# Compare model performances
models = {
    'Decision Tree': tree_cv.best_score_,
    'Random Forest': rf_cv.best_score_,
    'XGBoost': xgb_cv.best_score_,
    'KNN': knn_cv.best_score_,
    'Naive Bayes': nb_accuracy
}

best_model_name = max(models, key=models.get)
print(f"\nBest performing model: {best_model_name} with accuracy: {models[best_model_name]:.4f}")

## **8. Model Selection and Prediction Generation**

After evaluating the performance of various machine learning models, the **best model** is selected based on the highest accuracy. This model is then used to predict credit scores on the test dataset. Below is the process for selecting the best model and generating the predictions.

### **Model Selection**
1. Based on the previous evaluation, the best-performing model is chosen.
2. The model's `best_estimator_` is used to make predictions.

The following conditions are checked:
- If the best model is **Decision Tree**, the model is set to `tree_cv.best_estimator_`.
- If the best model is **Random Forest**, the model is set to `rf_cv.best_estimator_`.
- If the best model is **KNN**, the model is set to `knn_cv.best_estimator_`.
- If the best model is **Naive Bayes**, the model is set to `nb_clf`, which is fitted first because `cross_val_score` only fits internal copies.
- Otherwise, **XGBoost**'s best estimator is selected.
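
An alternative to a chain of conditions is to keep the candidate estimators in a dictionary that shares its keys with the scores, so the lookup cannot fall through to the wrong model because of a mismatched name. The sketch below uses synthetic data and two illustrative candidates; all names in it are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_classes=3,
    n_clusters_per_class=1, random_state=0
)

candidates = {
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "Naive Bayes": GaussianNB(),
}
# Scores are stored under the same keys as the estimators.
scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=3, scoring="accuracy").mean()
    for name, est in candidates.items()
}

best_name = max(scores, key=scores.get)
# cross_val_score fits clones only, so the chosen estimator is fitted here.
best_model = candidates[best_name].fit(X_demo, y_demo)
print(best_name, round(scores[best_name], 3))
```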

### **Prediction Process**
1. **Predictions** are made on the **test set** using the selected model.
2. **Categorical conversion** is applied to the numeric predictions, mapping them to classes: 'Poor', 'Standard', and 'Good'.
   - `0` is mapped to 'Poor'.
   - `1` is mapped to 'Standard'.
   - `2` is mapped to 'Good'.
   
### **Submission Creation**
1. A **submission DataFrame** is created containing the **ID** from the test dataset and the predicted credit scores.
2. The predictions are saved to a CSV file named **`credit_score_predictions.csv`**.
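
The label mapping and submission layout can be sketched with a few invented predictions. The IDs and the `numeric_predictions` array below are placeholders for the test IDs and the model output, and the CSV is written to an in-memory buffer instead of a file.

```python
import io
import numpy as np
import pandas as pd

ids = pd.Series(["0x1", "0x2", "0x3", "0x4"], name="ID")  # hypothetical IDs
numeric_predictions = np.array([0, 2, 1, 1])               # stand-in for model.predict(...)

# Map the numeric classes back to their original labels.
prediction_map = {0: "Poor", 1: "Standard", 2: "Good"}
submission = pd.DataFrame({
    "ID": ids,
    "Credit_Score": pd.Series(numeric_predictions).map(prediction_map),
})

# Write without the index so the file has exactly two columns.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```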

### **Final Output**
- **Predictions** are saved successfully in the file `credit_score_predictions.csv`, which can now be used for further analysis or evaluation.

In [None]:
if best_model_name == 'Decision Tree':
    best_model = tree_cv.best_estimator_
elif best_model_name == 'Random Forest':
    best_model = rf_cv.best_estimator_
elif best_model_name == 'KNN':
    best_model = knn_cv.best_estimator_
elif best_model_name == 'Naive Bayes':
    # cross_val_score only fits clones, so fit nb_clf on the resampled data before predicting
    best_model = nb_clf.fit(X_sm, y_sm)
else:
    best_model = xgb_cv.best_estimator_

# Make predictions on test set
predictions = best_model.predict(X_test_scaled)

# Convert numeric predictions back to categories
prediction_map = {0: 'Poor', 1: 'Standard', 2: 'Good'}
predictions_categorical = [prediction_map[pred] for pred in predictions]

# Create submission DataFrame
submission = pd.DataFrame({
    'ID': test_data['ID'],
    'Credit_Score': predictions_categorical
})

# Save predictions to CSV
submission.to_csv('credit_score_predictions.csv', index=False)
print("\nPredictions have been saved to 'credit_score_predictions.csv'")

## **9. Conclusion**
In this notebook, we:
1. Explored and cleaned the dataset, converting malformed numeric strings and replacing placeholder values with `NaN`.
2. Applied feature engineering, including encoding, imputation, and outlier-robust scaling.
3. Handled class imbalance using SMOTE.
4. Trained and tuned multiple machine learning models.
5. Predicted credit scores for the test dataset and saved the results.

---
