<div style="text-align: center;">
    <br>
    <img src="https://storage.googleapis.com/kaggle-datasets-images/2289007/3846912/ad5e128929f5ac26133b67a6110de7c0/dataset-cover.jpg?" style="display:block; margin:auto; width:75%; height:350px;">
</div><br><br> 

<div style="letter-spacing:normal; opacity:1.;">
<!--   https://xkcd.com/color/rgb/   -->
  <p style="text-align:center; background-color: lightsalmon; color: Jaguar; border-radius:10px; font-family:monospace; 
            line-height:1.4; font-size:32px; font-weight:bold; text-transform: uppercase; padding: 9px;">
            <strong>Finance Company Credit-Related Information</strong></p>  
  
  <p style="text-align:center; background-color:romance; color: Jaguar; border-radius:10px; font-family:monospace; 
            line-height:1.0; font-size:28px; font-weight:normal; text-transform: capitalize; padding: 5px;"
     >Machine Learning Module: Part 2: Credit Score Multi-Class Classification<br>Models: Logistic Regression, Random Forest, XGBoost</p>    
</div>

**About Dataset**

**Problem Statement**

You are working as a data scientist at a global finance company. Over the years, the company has collected basic bank details and a large amount of credit-related information. Management wants to build an intelligent system that sorts people into credit score brackets to reduce manual effort.

**Task**

Given a person’s credit-related information, build a machine learning model that can classify the credit score.

<h4>Table of Contents</h4>


01. Import Libraries
02. Reading the Cleaned Data from File
03. Multi-class Classification Data Pre-Processing
    01. Implement Logistic Regression Model
    02. Implement Random Forest Classifier Model
    03. Implement XGBoost Classifier Model
04. Feature Importance XGBoost
05. Final Model
06. Prepare Model Deployment


## For Detailed EDA: [credit-score-classification-data-cleaning-project](https://www.kaggle.com/code/clkmuhammed/credit-score-classification-data-cleaning-project)

## Clean Dataset: [Credit score classification](https://www.kaggle.com/datasets/clkmuhammed/creditscoreclassification)
## Original Dataset: [Credit score classification](https://www.kaggle.com/datasets/parisrohan/credit-score-classification)

# 01. Import Libraries 

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# The style parameters control properties like the color of the background and whether a grid is enabled by default.
sns.set_style("whitegrid", {'axes.grid' : False})
# sns.set_style("whitegrid")

# Environment settings: 
pd.set_option('display.float_format', lambda x: f'{x:.3f}')

# import warnings
# # Suppressing a warning 
# warnings.filterwarnings("ignore") 
# warnings.warn("this will not show")

import re
import time
import random
import tempfile
from tqdm.notebook import tqdm

import gc
gc.collect()

# 02. Reading the Cleaned Data from File

In [None]:
# we are using cleaned Data
df_origin_train = pd.read_csv('/kaggle/input/creditscoreclassification/train.csv')
df_train = df_origin_train.copy()

df_origin_test = pd.read_csv('/kaggle/input/creditscoreclassification/test.csv')
df_test = df_origin_test.copy()

df_train.shape, df_test.shape

In [None]:
df_train.head(8).T

In [None]:
df_train.info()

In [None]:
df_test.info()

In [None]:
# Drop the columns we do not use in the model (identifiers and the month label)
df_train.drop(columns=['ID', 'Customer_ID', 'Month', 'Name', 'SSN'], inplace=True)
df_test.drop(columns=['ID', 'Customer_ID', 'Month', 'Name', 'SSN'], inplace=True)

In [None]:
df_train.describe().T

In [None]:
df_test.describe().T

In [None]:
df_train.select_dtypes(include="object").describe().T

In [None]:
df_test.select_dtypes(include="object").describe().T

In [None]:
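# Correlation between numerical features
# numeric_only=True (pandas >= 1.5) skips object columns, which raise an error in pandas >= 2.0
corr = df_train.corr(numeric_only=True)
plt.figure(figsize=(18, 10))
sns.heatmap(
    corr,
    mask=np.triu(np.ones_like(corr, dtype=bool)),
    annot=True, vmin=-1, vmax=1, cmap="PiYG"
);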

# 03. Multi-class Classification Data Pre-Processing

## Import Libraries

In [None]:
# conda install -c anaconda scikit-learn
# sklearn library for machine learning algorithms, data preprocessing, and evaluation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.pipeline import Pipeline

# Supervised-Classifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# conda install -c conda-forge xgboost
from xgboost import XGBClassifier

# Supervised-Classifier-metrics
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, accuracy_score, log_loss
# The plot_* helpers were removed in scikit-learn 1.2; the *Display classes replace them.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import RocCurveDisplay, roc_curve, roc_auc_score, auc
from sklearn.metrics import PrecisionRecallDisplay, precision_recall_curve, average_precision_score

# Supervised-cross_validate-GridSearchCV
from sklearn.model_selection import cross_validate, cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

# import pickle

random_state = 42

## Train | Test Split

In [None]:
df_train.info()

In [None]:
df_train["Credit_Score"].value_counts(normalize=True).sort_index()

In [None]:
X      = df_train.drop(columns="Credit_Score")
y      = df_train['Credit_Score']
X_test = df_test

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=random_state)

X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, 

## Label Encoder

In [None]:
from sklearn.preprocessing import LabelEncoder

le      = LabelEncoder()
print(np.unique(y_train))
y_train = le.fit_transform(y_train)
y_val   = le.transform(y_val)

pd.DataFrame(y_train).value_counts().sort_index()

In [None]:
# Class 1 is our focus group; decode it to see the original label
le.inverse_transform([1])

## Dummy Operation

https://celik-muhammed.medium.com/how-to-converting-pandas-column-of-comma-separated-strings-into-dummy-variables-762c02282a6c
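
The core of the dummy operation is pandas' `Series.str.get_dummies`, which splits each cell on a separator and emits one indicator column per distinct token. A minimal sketch on toy data follows; the column name `Type_of_Loan`, its values, and the `ToL_` prefix are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy column: each cell holds a comma-separated list of loan types.
loans = pd.Series(["auto,home", "home", "student,auto"], name="Type_of_Loan")

# One indicator column per distinct token; a row gets 1 for every token it contains.
dummies = loans.str.get_dummies(sep=",").add_prefix("ToL_")
print(dummies)
```

The `GetDummies` transformer below wraps this call so that the set of output columns learned on the training data can be reapplied to new data.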

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class GetDummies(BaseEstimator, TransformerMixin): 
    def __init__(self, data_sep=',', col_name_sep='_'):
        """
        Transformer that creates dummy variables from categorical columns with a separator.
        Parameters:
            - data_sep (str): Separator used to split categorical values into multiple dummy variables.
            - col_name_sep (str): Separator used to separate the column name from the prefix in the output column names.
        """
        self.data_sep     = data_sep
        self.col_name_sep = col_name_sep
        
    # Return self nothing else to do here
    def fit(self, X, y  = None): 
        """
        Fit the transformer to the data.
        Parameters:
            - X (pandas.DataFrame): Input data with categorical columns.
            - y (array-like): Target variable (ignored).
        Returns:
            - self: Returns the transformer object.
        """
        object_cols       = X.select_dtypes(include="object").columns
        self.dummy_cols   = [col for col in object_cols if X[col].str.contains(self.data_sep, regex=True).any()]
        self.dummy_prefix = [''.join(map(lambda x: x[0], col.split(self.col_name_sep)))  if self.col_name_sep in col else col[:2]   for col in self.dummy_cols]
        
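        # Start from a copy and accumulate, so every multi-valued column keeps its
        # dummies and the method also works when no such column exists.
        dummy_X = X.copy()
        for col, pre in zip(self.dummy_cols, self.dummy_prefix):
            dummy_X = dummy_X.join(X[col].str.get_dummies(sep=self.data_sep).add_prefix(pre+self.col_name_sep))

        dummy_X = dummy_X.drop(columns=self.dummy_cols)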
        self.columns = dummy_X.columns
        return self
    
    # Transformer method we wrote for this transformer
    def transform(self, X, y = None):
        """
        Transform the input data by creating dummy variables.
        Parameters:
            - X (pandas.DataFrame): Input data with categorical columns.
            - y (array-like): Target variable (ignored).
        Returns:
            - X_transformed (pandas.DataFrame): Transformed data with dummy variables.
        """
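        # Accumulate dummies across all multi-valued columns instead of
        # overwriting the result on each iteration.
        X_transformed = X.copy()
        for col, pre in zip(self.dummy_cols, self.dummy_prefix):
            X_transformed = X_transformed.join(X[col].str.get_dummies(sep=self.data_sep).add_prefix(pre+self.col_name_sep))

        X_transformed = X_transformed.reindex(columns=self.columns, fill_value=0)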
        return X_transformed
        
    # to get feature names    
    def get_feature_names_out(self, input_features=None):
        """
        Get the names of the transformed features.
        Parameters:
            - input_features (array-like): Names of the input features (ignored).
        Returns:
            - output_features (list): Names of the transformed features.
        """
        return self.columns.tolist()

In [None]:
# check for columns containing commas
[col for col in X_train.select_dtypes('O').columns if X_train[col].str.contains(',', regex=True).any()]

In [None]:
dummy = GetDummies()

X_train_dummy = dummy.fit_transform(X_train)
X_val_dummy   = dummy.transform(X_val)

X_train_dummy.shape, X_val_dummy.shape

In [None]:
X_train_dummy.info()

## OneHotEncoder and LabelEncoder

- Nominal data represents categories without any inherent order or hierarchy. Each category is independent of others. One-hot encoding is commonly used for nominal data.

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat = X_train_dummy.select_dtypes(include="object").columns.tolist()   
print('OneHotEncoder:', cat)
ohe = OneHotEncoder(handle_unknown="ignore", sparse=False)  # on scikit-learn >= 1.2 this argument is named sparse_output

X_train_cat = pd.DataFrame(
    ohe.fit_transform(X_train_dummy[cat]), index = X_train_dummy.index, 
    columns = ohe.get_feature_names_out(cat)
)    
X_val_cat  = pd.DataFrame(
    ohe.transform(X_val_dummy[cat]), index = X_val_dummy.index, 
    columns = ohe.get_feature_names_out(cat)
)    
X_train_ohe = X_train_cat.join(X_train_dummy.select_dtypes("number"))
X_val_ohe   = X_val_cat.join(X_val_dummy.select_dtypes("number"))

X_train_ohe.shape, X_val_ohe.shape

In [None]:
X_train_ohe.columns

## Scale data

In [None]:
scaler = MinMaxScaler()

X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train_ohe), columns=X_train_ohe.columns)
X_val_scaled   = pd.DataFrame(scaler.transform(X_val_ohe), columns=X_val_ohe.columns)

## Define Model Evaluation Functions

In [None]:
from sklearn.metrics import confusion_matrix, classification_report 

def eval(model, X_train, X_val, y_train=y_train, y_val=y_val):
    print('VALIDATION')
    y_val_pred = model.predict(X_val)
    print(confusion_matrix(y_val, y_val_pred))
    print(classification_report(y_val, y_val_pred))
    print("-------------------------------------------------------")
    print('TRAIN')
    y_train_pred = model.predict(X_train)
    print(confusion_matrix(y_train, y_train_pred))
    print(classification_report(y_train, y_train_pred))

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.metrics import roc_auc_score, auc
from sklearn.metrics import make_scorer

# for multi-class
scoring = {
    'precision': make_scorer(precision_score, average=None, labels=[1]),
    'recall'   : make_scorer(recall_score, average=None, labels=[1]),
    'f1'       : make_scorer(f1_score, average=None, labels=[1]),
    'accuracy' : make_scorer(accuracy_score),
} 
# Identify people with low credit scores
# recall_1    = make_scorer(recall_score, average = None, labels=[1])
scoring['recall'] 
# log_loss_neg = make_scorer(log_loss, greater_is_better=False, needs_proba=True)
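
To make the per-class scorers concrete, here is a minimal sketch of what `labels=[1]` with `average=None` computes: the recall of class 1 alone, as opposed to an average over all classes. The label arrays are made-up toy data.

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical encoded labels for three classes: 0, 1, 2.
y_true = np.array([0, 1, 1, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 2, 1, 1, 2])

# Recall for class 1 only: of the true class-1 samples, how many were predicted as 1?
# average=None returns one value per requested label, so take the single element.
recall_class_1 = recall_score(y_true, y_pred, labels=[1], average=None)[0]

# Macro recall averages the per-class recalls, shown here for comparison.
recall_macro = recall_score(y_true, y_pred, average="macro")
print(recall_class_1, recall_macro)
```

Optimizing the class-1 recall during grid search favors models that miss fewer members of the focus group, possibly at the cost of the other classes.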

In [None]:
from sklearn.utils import class_weight

class_weights = dict(
    zip(np.unique(y_train),
        class_weight.compute_class_weight(
            class_weight = 'balanced',
            classes = np.unique(y_train), 
            y = y_train)
))
class_weights

In [None]:
from sklearn.utils import class_weight
sample_weight = class_weight.compute_sample_weight(class_weight='balanced', y=y_train)
pd.unique(sample_weight)

In [None]:
from collections import Counter

counter = Counter(y_train)                          
max_val = float(max(counter.values()))       
class_weights = {class_id : max_val/count for class_id, count in counter.items()}  
class_weights

In [None]:
pd.Series(y_train).value_counts(normalize=True).sort_index()  # pd.value_counts is deprecated in pandas >= 2.1

In [None]:
# Step 1: Compute normalized class proportions
class_proportions = pd.Series(y_train).value_counts(normalize=True)

# Step 2: Determine maximum class proportion
max_proportion = class_proportions.max()

# Step 3: Calculate class weights
class_weights = max_proportion / class_proportions
class_weights 
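
The weighting schemes computed above are closely related. A short NumPy sketch on a made-up imbalanced target shows that the `balanced` heuristic, `n_samples / (n_classes * count)`, and the ratio to the largest class are both proportional to `1 / count` and differ only by a constant factor.

```python
import numpy as np

# Hypothetical encoded target with imbalanced classes 0, 1, 2.
y_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
counts = np.bincount(y_toy)                      # samples per class: [2, 3, 5]

# "balanced" heuristic: n_samples / (n_classes * count).
balanced = len(y_toy) / (len(counts) * counts)

# Ratio to the largest class: the majority class gets weight 1.
ratio_to_max = counts.max() / counts

# Element-wise ratio of the two schemes; it is the same constant for every class.
scale = ratio_to_max / balanced
print(balanced, ratio_to_max, scale)
```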

# 01. Implement Logistic Regression Model

In [None]:
log_reg = LogisticRegression(
    class_weight = "balanced",
    random_state = random_state,
    max_iter     = 10000
)

In [None]:
%%time
log_reg.fit(X_train_scaled, y_train)

In [None]:
print("LOG MODEL")
eval(log_reg, X_train_scaled, X_val_scaled)

## With Best Parameters (GridsearchCV)

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

model = LogisticRegression(   
    class_weight = "balanced",
    random_state = random_state,
    max_iter     = 10000
)
param_grid = {
    "penalty"     : ["l1", "l2"],
    "C"           : np.linspace(0.01, 1, 2).round(3),
    "class_weight": ["balanced"],
    "solver"      : ["saga", "liblinear"]
}
grid_model_log = GridSearchCV(
    estimator=model,
    param_grid = param_grid, 
    scoring = scoring['recall'],
    error_score="raise",
    n_jobs=-1,
    cv=5
) 

In [None]:
%%time
grid_model_log.fit(X_train_scaled, y_train)

In [None]:
# Get the best hyperparameters
best_params_log = grid_model_log.best_params_

best_params_log, grid_model_log.best_score_

In [None]:
print("GRID LOG MODEL BALANCED")
eval(grid_model_log, X_train_scaled, X_val_scaled)

# 02. Implement Random Forest Classifier Model

## OrdinalEncoder

- Ordinal data represents categories with a specific order or hierarchy. Ordinal encoding is suitable for ordinal data.
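
One caveat worth illustrating: by default `OrdinalEncoder` assigns integers in sorted (alphabetical) order, which may not match the real ranking. The sketch below uses toy `Credit_Mix` values and assumes the ranking Bad < Standard < Good; both are illustrative assumptions, and `ranked_enc` is a hypothetical name. Tree-based models can usually split on either encoding, so this is optional for the models in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature; the ranking Bad < Standard < Good is an assumption.
toy = pd.DataFrame({"Credit_Mix": ["Good", "Bad", "Standard", "Good"]})

# Without `categories`, values are ordered alphabetically: Bad=0, Good=1, Standard=2.
alphabetical = OrdinalEncoder().fit_transform(toy).ravel()

# Passing the categories explicitly fixes the integer order: Bad=0, Standard=1, Good=2.
ranked_enc = OrdinalEncoder(
    categories=[["Bad", "Standard", "Good"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
ranked = ranked_enc.fit_transform(toy).ravel()
print(alphabetical, ranked)
```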

In [None]:
import sklearn; print(sklearn.__version__)

In [None]:
from sklearn.preprocessing import OrdinalEncoder

cat = X_train_dummy.select_dtypes(include="object").columns.to_list()    
print('OrdinalEncoder:', cat) 
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

X_train_cat = pd.DataFrame(
    enc.fit_transform(X_train_dummy[cat]), index = X_train_dummy.index, 
    columns = enc.feature_names_in_
)    
X_val_cat  = pd.DataFrame(
    enc.transform(X_val_dummy[cat]), index = X_val_dummy.index, 
    columns = enc.feature_names_in_
)    
X_train_enc = X_train_cat.join(X_train_dummy.select_dtypes("number"))
X_val_enc   = X_val_cat.join(X_val_dummy.select_dtypes("number"))

X_train_enc.shape, X_val_enc.shape

In [None]:
X_train_enc.columns

## RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(
    class_weight = 'balanced',
    random_state=random_state
)

In [None]:
%%time
rfc.fit(X_train_enc, y_train)

In [None]:
print("RF MODEL")
eval(rfc, X_train_enc, X_val_enc)

## With Best Parameters (GridsearchCV)

In [None]:
from sklearn.model_selection import GridSearchCV

model = RandomForestClassifier(
    class_weight = 'balanced',
    random_state=random_state
)
param_grid = {
    'class_weight': ['balanced'], # [None, 'balanced', 'balanced_subsample']
    'n_estimators': np.linspace(100, 200, 2, dtype=int),
    'criterion'   : ["gini", "entropy"],
    'max_depth'   : np.arange(2, 3, 1), 
    'min_impurity_decrease': [0],
    'oob_score'   : [True],  
#     'max_features': [None],
}
grid_model_rfc = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring = scoring['recall'],      
    error_score='raise',  
    n_jobs = -1,
#     refit=True,
    cv=5,
)

In [None]:
%%time
grid_model_rfc.fit(X_train_enc, y_train)

In [None]:
# Get the best hyperparameters
best_params_rfc = grid_model_rfc.best_params_

best_params_rfc, grid_model_rfc.best_score_

In [None]:
print("GRID RF MODEL BALANCED")
eval(grid_model_rfc, X_train_enc, X_val_enc)

# 03. Implement XGBoost Classifier Model (XGBoost with Scikit-learn API)

In [None]:
from xgboost import XGBClassifier

xgb = XGBClassifier(
    random_state=random_state,
)

In [None]:
%%time
xgb.fit(X_train_enc, y_train, 
    sample_weight=sample_weight
)
# Note: XGBoost's sample_weight takes one weight per instance, not one per class.
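
Because the weights are per instance, class weights have to be expanded to one value per training row. A minimal sketch on a made-up target shows that `compute_sample_weight` does exactly this expansion of the `balanced` class weights. Indexing `per_class` with the labels works here only because the toy labels are the integers 0..k-1.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

# Hypothetical encoded target with imbalanced classes 0, 1, 2.
y_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
classes = np.unique(y_toy)

# One weight per class ...
per_class = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# ... expanded to one weight per row by indexing with the encoded labels.
per_row_manual = per_class[y_toy]

# compute_sample_weight performs the same expansion in one call.
per_row = compute_sample_weight(class_weight="balanced", y=y_toy)
print(per_row)
```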

In [None]:
print("XGB MODEL")
eval(xgb, X_train_enc, X_val_enc)

## With Best Parameters (GridsearchCV)

In [None]:
from sklearn.model_selection import GridSearchCV

model = XGBClassifier(
    random_state=random_state
)
param_grid = {
    'n_estimators' : [100],
    'learning_rate': np.linspace(0.01, 0.3, 2).round(3),
    'max_depth'    : [5, 6],   
#     'reg_alpha'    : [0, 0.5, 1],  
#     'reg_lambda'   : [0, 0.5, 1], 
}
grid_model_xgb = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring = scoring['recall'], # 'neg_log_loss'      
    error_score='raise',     
    n_jobs = -1,
#     refit=True,
    cv=5,
)

In [None]:
%%time
grid_model_xgb.fit(X_train_enc, y_train, 
    sample_weight=sample_weight
)

In [None]:
# Get the best hyperparameters
best_params_xgb = grid_model_xgb.best_params_

best_params_xgb, grid_model_xgb.best_score_

In [None]:
print("GRID XGB MODEL BALANCED")
eval(grid_model_xgb, X_train_enc, X_val_enc)

# Compare The Models

In [None]:
# !pip install scikit-plot -q
import scikitplot as skplt

y_val_proba = grid_model_log.predict_proba(X_val_scaled)
skplt.metrics.plot_precision_recall(y_val, y_val_proba);
# skplt.metrics.plot_roc(y_test, y_prob_test)
plt.plot([0, 1],[1, 0], 'k--')
plt.show()

In [None]:
# !pip install scikit-plot -q
import scikitplot as skplt

y_val_proba = grid_model_rfc.predict_proba(X_val_enc)
skplt.metrics.plot_precision_recall(y_val, y_val_proba);
# skplt.metrics.plot_roc(y_test, y_prob_test)
plt.plot([0, 1],[1, 0], 'k--')
plt.show()

In [None]:
# !pip install scikit-plot -q
import scikitplot as skplt

y_val_proba = grid_model_xgb.predict_proba(X_val_enc)
skplt.metrics.plot_precision_recall(y_val, y_val_proba);
# skplt.metrics.plot_roc(y_test, y_prob_test)
plt.plot([0, 1],[1, 0], 'k--')
plt.show()

In [None]:
# from yellowbrick.classifier import PrecisionRecallCurve

# model      = grid_model_log
# visualizer = PrecisionRecallCurve(model, classes=le.classes_, per_class=True, micro=False)
# visualizer.fit(X_train_scaled, y_train)     # Fit the training data to the visualizer
# visualizer.score(X_val_scaled, y_val)       # Evaluate the model on the test data
# visualizer.show(); 

In [None]:
# from yellowbrick.classifier import precision_recall_curve, PrecisionRecallCurve

# # Create the visualizer, fit, score, and show it, take a long time
# viz = precision_recall_curve(grid_model_log, X_train_scaled, y_train, X_val_scaled, y_val)

# 04. Feature Importance XGBoost

In [None]:
# `xgb` is the XGBoost model fitted with default parameters
X_val_enc.columns.shape, xgb.feature_importances_.shape

In [None]:
plt.figure(figsize=(14,8))
plt.barh(X_val_enc.columns, xgb.feature_importances_);

## yellowbrick Feature Importances

In [None]:
from yellowbrick.model_selection import feature_importances, FeatureImportances

model = XGBClassifier(
    random_state=random_state, 
    **best_params_xgb
)

plt.subplots(figsize=(12, 9))
# Use the quick method and immediately show the figure
feature_importances(model, X_val_enc, y_val);

## Permutation Based Feature Importance (with scikit-learn)

In [None]:
from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(xgb, X_val_enc, y_val)
perm_importance['importances_mean']

In [None]:
sorted_idx = perm_importance.importances_mean.argsort()
plt.figure(figsize=(14,8))
plt.barh(X_val_enc.columns[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance");

# 05. Final Model

In [None]:
model = XGBClassifier(
    random_state=random_state, 
    **best_params_xgb
)
# select the top 9 features
viz = feature_importances(model, X_val_enc, y_val, relative=False, topn=9)

# get the names of the top 9 features
print(viz.features_)

In [None]:
# we selected 9 features for final model
df_final = df_train[viz.features_.tolist() + ['Credit_Score']]
df_final

In [None]:
df_final.info()

In [None]:
X = df_final.drop(columns='Credit_Score')
y = df_final['Credit_Score']

In [None]:
cat = X.select_dtypes(include="object").columns.to_list()     
print('OrdinalEncoder:', cat)  
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

X_enc      = X.copy()
X_enc[cat] = enc.fit_transform(X_enc[cat])

X_enc.shape

In [None]:
le = LabelEncoder()
y  = le.fit_transform(y)

In [None]:
sample_weight = class_weight.compute_sample_weight(class_weight='balanced', y=y)
sample_weight

### Best parameters: `{'learning_rate': 0.3, 'max_depth': 6, 'n_estimators': 100}`

In [None]:
final_model = XGBClassifier(
    random_state=random_state, 
    **best_params_xgb
)

In [None]:
%%time
final_model.fit(X_enc, y, 
    sample_weight=sample_weight
)

In [None]:
sns.set_style("whitegrid", {'axes.grid' : False})
from sklearn.metrics import ConfusionMatrixDisplay

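# Note: the final model is scored on the same data it was trained on, so these metrics are optimistic.
y_pred = final_model.predict(X_enc)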

print(classification_report(y, y_pred))
ConfusionMatrixDisplay.from_estimator(final_model, X_enc, y);

In [None]:
np.bincount(y_pred)

In [None]:
X.head()

In [None]:
X.describe()

In [None]:
X.describe(include='O')

In [None]:
mean_human = pd.concat([X.select_dtypes('number').mean().astype(int).to_frame().T, X.select_dtypes('object').mode()], axis=1)
mean_human.to_dict()

In [None]:
mean_human[cat] = enc.transform(mean_human[cat])
mean_human

In [None]:
predict = final_model.predict(mean_human)
predict, le.inverse_transform(predict)

## Predict Test Data

In [None]:
X_test_enc      = X_test[df_final.columns[:-1]].copy()
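# Reuse the encoder fitted on the training data; refitting on the test set could change the category-to-integer mapping.
X_test_enc[cat] = enc.transform(X_test_enc[cat])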

X_test_enc.shape

In [None]:
y_test_pred = final_model.predict(X_test_enc)
pd.Series(y_test_pred).value_counts().sort_index()

In [None]:
np.bincount(y_test_pred)

# 06. Prepare Model Deployment

In [None]:
%%writefile get_dummies.py

# Save your custom function in a Python script (.py file) and then import it to use it with pickle.load().
# This is a common approach to store and reuse custom functions in different scripts or projects.
from sklearn.base import BaseEstimator, TransformerMixin

class GetDummies(BaseEstimator, TransformerMixin): 
    def __init__(self, data_sep=',', col_name_sep='_'):
        """
        Transformer that creates dummy variables from categorical columns with a separator.
        Parameters:
            - data_sep (str): Separator used to split categorical values into multiple dummy variables.
            - col_name_sep (str): Separator used to separate the column name from the prefix in the output column names.
        """
        self.data_sep     = data_sep
        self.col_name_sep = col_name_sep
        
    # Return self nothing else to do here
    def fit(self, X, y  = None): 
        """
        Fit the transformer to the data.
        Parameters:
            - X (pandas.DataFrame): Input data with categorical columns.
            - y (array-like): Target variable (ignored).
        Returns:
            - self: Returns the transformer object.
        """
        object_cols       = X.select_dtypes(include="object").columns
        self.dummy_cols   = [col for col in object_cols if X[col].str.contains(self.data_sep, regex=True).any()]
        self.dummy_prefix = [''.join(map(lambda x: x[0], col.split(self.col_name_sep)))  if self.col_name_sep in col else col[:2]   for col in self.dummy_cols]
        
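        # Start from a copy and accumulate, so every multi-valued column keeps its
        # dummies and the method also works when no such column exists.
        dummy_X = X.copy()
        for col, pre in zip(self.dummy_cols, self.dummy_prefix):
            dummy_X = dummy_X.join(X[col].str.get_dummies(sep=self.data_sep).add_prefix(pre+self.col_name_sep))

        dummy_X = dummy_X.drop(columns=self.dummy_cols)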
        self.columns = dummy_X.columns
        return self
    
    # Transformer method we wrote for this transformer
    def transform(self, X, y = None):
        """
        Transform the input data by creating dummy variables.
        Parameters:
            - X (pandas.DataFrame): Input data with categorical columns.
            - y (array-like): Target variable (ignored).
        Returns:
            - X_transformed (pandas.DataFrame): Transformed data with dummy variables.
        """
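        # Accumulate dummies across all multi-valued columns instead of
        # overwriting the result on each iteration.
        X_transformed = X.copy()
        for col, pre in zip(self.dummy_cols, self.dummy_prefix):
            X_transformed = X_transformed.join(X[col].str.get_dummies(sep=self.data_sep).add_prefix(pre+self.col_name_sep))

        X_transformed = X_transformed.reindex(columns=self.columns, fill_value=0)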
        return X_transformed
        
    # to get feature names    
    def get_feature_names_out(self, input_features=None):
        """
        Get the names of the transformed features.
        Parameters:
            - input_features (array-like): Names of the input features (ignored).
        Returns:
            - output_features (list): Names of the transformed features.
        """
        return self.columns.tolist()

In [None]:
# check
from get_dummies import GetDummies

In [None]:
import pickle

# Context managers make sure each file is flushed and closed after writing.
for obj, path in [(enc,   "credit_score_multi_class_ord_encoder.pkl"),
                  (le,    "credit_score_multi_class_le.pkl"),
                  (dummy, "credit_score_multi_class_dummy.pkl")]:
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

## Let’s save the XGBoost model:
- https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRanker.save_model
- https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html#difference-between-saving-model-and-dumping-model
```py
model.save_model("model.json")
# or
model.save_model("model.ubj")
```

In [None]:
# Save the model for XGBoost
final_model.save_model('credit_score_multi_class_xgboost_model.json')

## Sanity Check

In [None]:
# sanity check
import pandas as pd
from xgboost import XGBClassifier
import pickle 

import sys
# Replace with the actual path to 'get_dummies.py'
sys.path.append('/kaggle/input/creditscoreclassification/')
# Now you can import the custom module and use its functions
from get_dummies import GetDummies

# Load the encoder from the file
loaded_enc   = pickle.load(open("/kaggle/input/creditscoreclassification/credit_score_multi_class_ord_encoder.pkl", "rb")) 
loaded_le    = pickle.load(open("/kaggle/input/creditscoreclassification/credit_score_multi_class_le.pkl", "rb"))
loaded_dummy = pickle.load(open("/kaggle/input/creditscoreclassification/credit_score_multi_class_dummy.pkl", "rb"))

# Load the model from the file
loaded_model = XGBClassifier()
loaded_model.load_model("/kaggle/input/creditscoreclassification/credit_score_multi_class_xgboost_model.json")
loaded_model

In [None]:
sample = pd.read_csv('/kaggle/input/creditscoreclassification/test.csv').head(2)
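# Use transform (not fit_transform) so the loaded transformer keeps the columns learned on the training data
loaded_dummy.transform(sample).T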

In [None]:
cat        = ['Credit_Mix']
mean_human = pd.DataFrame.from_dict(
    {
        'Total_EMI_per_month': {0: 107},
        'Num_Bank_Accounts': {0: 5},
        'Num_of_Delayed_Payment': {0: 13},
        'Delay_from_due_date': {0: 21},
        'Changed_Credit_Limit': {0: 10},
        'Num_Credit_Card': {0: 5},
        'Outstanding_Debt': {0: 1426},
        'Interest_Rate': {0: 14},
        'Credit_Mix': {0: 'Standard'}
    }
)
mean_human[cat] = loaded_enc.transform(mean_human[cat])
predict         = loaded_model.predict(mean_human)
predict, loaded_le.inverse_transform(predict)

In [None]:
from IPython.display import FileLink, FileLinks
ord_enc   = FileLink(r'credit_score_multi_class_ord_encoder.pkl', result_html_prefix="Click here to download: ")
l_enc     = FileLink(r'credit_score_multi_class_le.pkl', result_html_prefix="Click here to download: ")
dummy_enc = FileLink(r'credit_score_multi_class_dummy.pkl', result_html_prefix="Click here to download: ")
model     = FileLink(r'credit_score_multi_class_xgboost_model.json', result_html_prefix="Click here to download: ")

display(ord_enc, l_enc, dummy_enc, model)

# End of the Project