<a href="https://colab.research.google.com/github/SilasEmma/Competition-Participated/blob/main/credit_card_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Problem Statement**

Credit risk is a growing concern for commercial banks as the financial industry expands. One of the biggest threats these banks face is misjudging the default risk of their credit clients. The goal is to predict the probability of credit default based on the credit card owner's characteristics and payment history.

**Approach**

Task: supervised learning (binary classification problem)

Skills applied: data exploration, data cleaning, feature engineering/selection, model building, and testing/evaluation.

In [None]:
!pip install mlflow

In [None]:
# importing relevant libraries

# utils lib
import pickle
from statsmodels.stats.outliers_influence import variance_inflation_factor
from collections import Counter

# data manipulation
import pandas as pd
import numpy as np

# data viz
import seaborn as sns
#from pandas_profiling import ProfileReport
#from sweetviz import DataframeReport
from matplotlib import pyplot as plt

# data preprocessing 
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# data imbalance
from imblearn.over_sampling import ADASYN, SMOTE

# model performance and metrics
from sklearn import metrics
from sklearn.metrics import f1_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, roc_auc_score, precision_score
from sklearn.metrics import accuracy_score, plot_roc_curve
from yellowbrick.model_selection import learning_curve, validation_curve

# model tracking and registry
import mlflow
import bentoml
from mlflow import active_run, set_experiment, set_tag, log_artifact, log_params, log_metrics

# ml algorithm
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier

**Data Understanding**

In [None]:
# loading data
#from google.colab import drive
#drive.mount('/content/drive')
df = pd.read_csv('../input/uci-credit-card/UCI_Credit_Card.csv')
#df =pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Credit Card Project/UCI_Credit_Card.csv')

In [None]:
# Return the first five rows.
df.head()

In [None]:
# Print a concise summary of a DataFrame.
df.info()

In [None]:
# Count the missing values in each column
df.isnull().sum()

In [None]:
# Count how many rows fall into each class of the target column.
df['default.payment.next.month'].value_counts()

**The target classes are imbalanced**

In [None]:
sns.countplot(data=df, x='default.payment.next.month')
plt.show()

In [None]:
# Generate descriptive statistics
df.describe()

**Data Cleaning**

Data cleaning refers to identifying and correcting errors in the dataset that may negatively
impact a predictive model.

In [None]:
# Count the duplicate rows.
df.duplicated().sum()

In [None]:
# dropping duplicate rows
df.drop_duplicates(inplace=True)

In [None]:
# Return Series with number of distinct observations
df.nunique()

**Data Exploration**



In [None]:
# Generate a profile report from a Dataset
#pr = ProfileReport(df) ERROR

In [None]:
#pr.to_widgets()

In [None]:
sns.histplot(data=df, x='AGE', bins=30)
plt.show()

In [None]:
sns.histplot(data=df, x='LIMIT_BAL', bins=30)
plt.show()

In [None]:
# assigning independent variables
x = df.drop(['default.payment.next.month'], axis=1)
# assigning dependent variable
y = df['default.payment.next.month']

**Detecting and Handling Multicollinearity Using VIF (Variance Inflation Factor)**

In [None]:
# creating dataframe for vif
vif_data = pd.DataFrame()
vif_data['feature'] = x.columns

# calculating VIF for each feature
vif_data['VIF'] = [variance_inflation_factor(x.values, i) for i in range(len(x.columns))]

# printing VIF
print(vif_data)
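
The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on the other features. As a minimal sketch in plain Python, the two-feature case reduces to the squared Pearson correlation. The helper names and the numbers below are made up for illustration and are not taken from the dataset; a VIF above roughly 5 to 10 is a common rule of thumb for problematic collinearity.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_features(a, b):
    """VIF of feature `a` when `b` is the only other predictor.

    With a single other predictor (and an intercept), R^2 equals r^2.
    """
    r_squared = pearson_r(a, b) ** 2
    return 1.0 / (1.0 - r_squared)

# Hypothetical bill amounts for two consecutive months (strongly correlated).
bill_month_1 = [100.0, 220.0, 310.0, 405.0, 520.0]
bill_month_2 = [110.0, 200.0, 330.0, 390.0, 540.0]
# Hypothetical ages, only weakly related to the bill amounts.
age = [45.0, 23.0, 51.0, 30.0, 38.0]

vif_bills = vif_two_features(bill_month_1, bill_month_2)
vif_age = vif_two_features(bill_month_1, age)
print("VIF (bill vs. bill):", round(vif_bills, 2))
print("VIF (bill vs. age):", round(vif_age, 2))
```

The near-duplicate bill columns produce a large VIF, while the weakly related age column stays close to 1.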

In [None]:
# dropping some correlated features
x.drop(['BILL_AMT6','BILL_AMT5', 'BILL_AMT4', 'BILL_AMT3', 'BILL_AMT2', 'BILL_AMT1'], axis=1, inplace=True)

In [None]:
x.head()

In [None]:
# Split arrays or matrices into random train and test subsets
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=0.33, random_state=1)
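
Passing `stratify=y` keeps the class proportions roughly equal in the train and test subsets. The idea can be sketched in plain Python as follows; `stratified_split` is a hypothetical helper written for illustration, not the scikit-learn implementation, and the labels are made up.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed=0):
    """Split indices so each class contributes about `test_size` of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 78 non-default (0) and 22 default (1) clients.
labels = [0] * 78 + [1] * 22
train_idx, test_idx = stratified_split(labels, test_size=0.33, seed=1)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test:", dict(test_counts))
```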

In [None]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

**Data Transformation**

Data transformation is the process of changing the scale or distribution of variables.
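
Min-max scaling maps each value to (x - min) / (max - min). The following is a minimal sketch in plain Python, with made-up credit limits and hypothetical helper names, of learning the range on training values and reusing that same range for test values.

```python
def fit_min_max(values):
    """Learn the scaling range from the training values only."""
    return min(values), max(values)

def apply_min_max(values, low, high):
    """Scale values with a previously learned range."""
    span = high - low
    return [(v - low) / span for v in values]

# Hypothetical credit limits, not taken from the dataset.
train_limits = [10000.0, 50000.0, 200000.0, 500000.0]
test_limits = [20000.0, 260000.0, 600000.0]

low, high = fit_min_max(train_limits)
train_scaled = apply_min_max(train_limits, low, high)
test_scaled = apply_min_max(test_limits, low, high)

print(train_scaled)
print(test_scaled)
```

Test values outside the training range land outside [0, 1], which is expected when the range is learned from the training data alone.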

In [None]:
# Transform features by scaling each feature to a given range (0 to 1 by default).
mn = MinMaxScaler()

In [None]:
# specifying columns to transform
scale_col = ['LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

# Fit the scaler on the training data, then transform it.
x_train[scale_col] = mn.fit_transform(x_train[scale_col])
# Reuse the training-set scaling on the test data (no refitting, to avoid leakage).
x_test[scale_col] = mn.transform(x_test[scale_col])

**Imbalanced Datasets**

An imbalanced dataset is one in which the classes are unequally distributed in the training data.

The credit card default dataset is imbalanced, with about 23364 non-default records and
about 6636 default records.
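
With these counts, defaults make up about 22% of the records. SMOTE addresses this by creating synthetic minority samples: it picks a minority sample and one of its nearest minority neighbours, then interpolates between them. The core interpolation step can be sketched in plain Python as follows; `interpolate` is a hypothetical helper and the two points are made up, so this is an illustration of the idea rather than the imbalanced-learn implementation.

```python
import random

def interpolate(sample, neighbor, rng):
    """Create one synthetic point on the line segment between two minority samples."""
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

# Share of defaults, using the counts quoted in the text.
default_rate = 6636 / (23364 + 6636)

rng = random.Random(0)
minority_sample = [0.20, 0.50]    # hypothetical scaled features of a defaulter
minority_neighbor = [0.40, 0.10]  # hypothetical nearby defaulter
synthetic = interpolate(minority_sample, minority_neighbor, rng)

print("default rate:", round(default_rate, 4))
print("synthetic sample:", synthetic)
```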

In [None]:
# Class to perform over-sampling using SMOTE. 
sm = SMOTE()

In [None]:
# Resample the dataset.
x_train_sm, y_train_sm = sm.fit_resample(x_train, y_train)

In [None]:
print('Original class distribution:- {}'.format(Counter(y_train)))
print('Resampled class distribution:- {}'.format(Counter(y_train_sm)))

**Confusion Matrix Function**

This helper computes several evaluation metrics from the confusion matrix.

In [None]:
# function name : evaluate_model
# arguments : y_true, y_predicted, print_score
# returns accuracy, sensitivity, specificity, precision, YJS and F1 score;
# optionally prints them together with the confusion matrix

def evaluate_model(y_true, y_predicted, print_score=False):
    confusion = metrics.confusion_matrix(y_true, y_predicted)
    # Predicted     non_default    default
    # Actual
    # non_default       TN           FP
    # default           FN           TP

    TP = confusion[1,1] # true positive 
    TN = confusion[0,0] # true negatives
    FP = confusion[0,1] # false positives
    FN = confusion[1,0] # false negatives

    accuracy_sc = metrics.accuracy_score(y_true, y_predicted)
    sensitivity_score = TP / float(TP+FN) #TPR
    specificity_score = TN / float(TN+FP) #TNR
    precision_sc = precision_score(y_true, y_predicted)
    FPR = FP/float(FP+TN)
    F1_Score = 2 * (precision_sc* sensitivity_score) / (precision_sc + sensitivity_score)
    #YJS = sensitivity_score+specificity_score -1
    #YJS = sensitivity_score-FPR
    #YJS = precision_sc+accuracy_sc+specificity_score-2 # Min FP
    YJS = sensitivity_score + accuracy_sc + specificity_score - 2 # variant used here, aimed at minimising FN
    
    if print_score:
        print("Confusion Matrix :\n", confusion)
        print("Accuracy :", accuracy_sc)
        print("Sensitivity :", sensitivity_score)
        print("Specificity :", specificity_score)
        print("Precision :", precision_sc)
        print("FPR :", FPR) 
        print("YJS (TPR + Accuracy + TNR - 2) :", YJS)
        print("F1 Score :", F1_Score)
        
    return accuracy_sc, sensitivity_score, specificity_score, precision_sc,YJS,F1_Score
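
As a worked check of these formulas, the sketch below computes the same rates from made-up confusion-matrix counts in plain Python. It uses the standard Youden's J statistic (sensitivity + specificity - 1, equivalently TPR - FPR); `rates_from_counts` is a hypothetical helper and the counts are illustrative only.

```python
def rates_from_counts(tp, fn, tn, fp):
    """Derive common classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    youden_j = sensitivity + specificity - 1   # standard Youden's J (TPR - FPR)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "youden_j": youden_j,
    }

# Hypothetical counts: 50 actual defaults, 100 actual non-defaults.
scores = rates_from_counts(tp=40, fn=10, tn=80, fp=20)
for name, value in scores.items():
    print(name, round(value, 4))
```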

**Model Building**

In [None]:
# Pipeline of transforms with a final estimator.
pipe_logit = Pipeline([
    # Principal component analysis (PCA).
    ('selection', PCA()),
    # Implementing Logistic Regression Classifier.
    ('lr', LogisticRegression(C=0.1, penalty='elasticnet', solver='saga', l1_ratio=0.5, multi_class='ovr', random_state=10))
])
# Fit the pipeline (PCA followed by the classifier) on the resampled training data
pipe_logit.fit(x_train_sm, y_train_sm)

In [None]:
# Stratified K-Fold cross-validator: each fold preserves the class proportions.
stk_logit = StratifiedKFold(n_splits=8, shuffle=True, random_state=12)

# specifying parameter grid
param_logit = {
    'lr__max_iter' : [20000, 30000, 40000]
}

# Search over specified parameter values with successive halving.
gridcv_logit = HalvingGridSearchCV(estimator=pipe_logit, 
                             param_grid=param_logit, 
                             cv=stk_logit, random_state=10)
# Run fit with all sets of parameters.
gridcv_logit.fit(x_train_sm, y_train_sm)
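
Successive halving evaluates all candidates on a small budget of samples, keeps the best fraction, and repeats with a larger budget until one candidate remains. The sketch below illustrates that schedule in plain Python under simplifying assumptions: `halving_schedule` is a hypothetical helper, the sample count of 900 is made up, and the bookkeeping is simpler than what scikit-learn does internally. The factor of 3 matches the default of `HalvingGridSearchCV`.

```python
import math

def halving_schedule(n_candidates, max_resources, factor=3):
    """Return (candidates, samples) per round for a simplified successive-halving run."""
    # Count the rounds needed to shrink the candidate pool down to one.
    n_rounds = 1
    remaining = n_candidates
    while remaining > 1:
        remaining = math.ceil(remaining / factor)
        n_rounds += 1

    # Start small enough that the final round can use all available samples.
    min_resources = max(1, max_resources // factor ** (n_rounds - 1))

    schedule = []
    remaining = n_candidates
    for round_index in range(n_rounds):
        resources = min(max_resources, min_resources * factor ** round_index)
        schedule.append((remaining, resources))
        remaining = math.ceil(remaining / factor)  # keep roughly the best 1/factor
    return schedule

# Three candidate values, as in the grid above, and a hypothetical 900 training rows.
print(halving_schedule(3, 900))
print(halving_schedule(9, 900))
```

With only three candidates, the search needs just two rounds: all three on a third of the samples, then the winner on the full set.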

In [None]:
# cross-validated score of the best_estimator.
gridcv_logit.best_score_

In [None]:
# Predict on the held-out test set
y_pred_logit = gridcv_logit.predict(x_test)

In [None]:
# Compute confusion matrix to evaluate the accuracy of a classification.
Logistic_mod =list(evaluate_model(y_test, y_pred_logit, print_score=True))

In [None]:
# Plot Receiver operating characteristic (ROC) curve
plot_roc_curve(pipe_logit, x_train_sm, y_train_sm)
plt.show()
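
The area under the ROC curve (AUC) can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal plain-Python sketch of that interpretation follows; `auc_from_scores` is a hypothetical helper and the labels and scores are made up.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical true labels and predicted default scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_from_scores(labels, scores)
print("AUC:", auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```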

In [None]:
# Determines cross-validated training and test scores for different training set sizes.
#print(learning_curve(pipe_logit, x, y, cv=5, scoring='accuracy'))

In [None]:
# Determine training and test scores for varying parameter values.
#print(validation_curve(pipe_logit, x, y, cv=5, scoring='accuracy'))

**KNeighborsClassifier with Cross-validation and Hyper-parameter Tuning**

In [None]:
# Pipeline of transforms with a final estimator.
pipe_knn = Pipeline([
    # Principal component analysis (PCA).
    ('selection', PCA()),
    # Classifier implementing the k-nearest neighbors vote.
    ('knn', KNeighborsClassifier())
])
# Fit the pipeline (PCA followed by the classifier) on the resampled training data
pipe_knn.fit(x_train_sm, y_train_sm)

In [None]:
# Stratified K-Fold cross-validator: each fold preserves the class proportions.
stk_knn = StratifiedKFold(n_splits=8, shuffle=True, random_state=12)

# specifying parameter grid
param_knn = {
    'knn__n_neighbors' : [3, 5, 6]
}

# Search over specified parameter values with successive halving.
gridcv_knn = HalvingGridSearchCV(estimator=pipe_knn, 
                                 param_grid=param_knn, 
                                 cv=stk_knn, random_state=10)
# Run fit with all sets of parameters.
gridcv_knn.fit(x_train_sm, y_train_sm)

In [None]:
# cross-validated score of the best_estimator.
gridcv_knn.best_score_

In [None]:
# Predict on the held-out test set
y_pred_knn = gridcv_knn.predict(x_test)

In [None]:
# Compute confusion matrix to evaluate the accuracy of a classification.
Knn_mod = list(evaluate_model(y_test, y_pred_knn, print_score=True))

In [None]:
# Plot Receiver operating characteristic (ROC) curve
plot_roc_curve(pipe_knn, x_train_sm, y_train_sm)

**HistGradientBoostingClassifier with Cross-validation and Hyper-parameter Tuning**

In [None]:
# Pipeline of transforms with a final estimator.
pipe_hist = Pipeline([
    # Principal component analysis (PCA)
    ('selection', PCA()),
    # Histogram-based Gradient Boosting Classification Tree.
    ('hist', HistGradientBoostingClassifier(loss='binary_crossentropy', learning_rate=0.01, validation_fraction=0.2, random_state=1))
])
# Fit the pipeline (PCA followed by the classifier) on the resampled training data
pipe_hist.fit(x_train_sm, y_train_sm)

In [None]:
# Stratified K-Fold cross-validator: each fold preserves the class proportions.
stk_hist = StratifiedKFold(n_splits=8, shuffle=True, random_state=12)

# specifying parameter grid
param_hist = {
    'hist__max_iter' : [200, 300, 400]
}

# Search over specified parameter values with successive halving.
gridcv_hist = HalvingGridSearchCV(estimator=pipe_hist, 
                                  param_grid=param_hist, 
                                  cv=stk_hist, random_state=10)
# Run fit with all sets of parameters.
gridcv_hist.fit(x_train_sm, y_train_sm)

In [None]:
# cross-validated score of the best_estimator.
gridcv_hist.best_score_

In [None]:
# Predict on the held-out test set
y_pred_hist = gridcv_hist.predict(x_test)

In [None]:
# Compute confusion matrix to evaluate the accuracy of a classification.
Hist_Grad_boost = list(evaluate_model(y_test, y_pred_hist, print_score=True))

In [None]:
# Plot Receiver operating characteristic (ROC) curve
plot_roc_curve(pipe_hist, x_train_sm, y_train_sm)

**XGBClassifier with Cross-validation and Hyper-parameter Tuning**

In [None]:
# Pipeline of transforms with a final estimator.
pipe_xgb = Pipeline([
    # Principal component analysis (PCA)
    ('selection', PCA()),
    # Implementation of the scikit-learn API for XGBoost classification.
    ('xgb', XGBClassifier(booster='gbtree', max_depth=10, learning_rate=0.01, random_state=22))
])
# Fit the pipeline (PCA followed by the classifier) on the resampled training data
pipe_xgb.fit(x_train_sm, y_train_sm)

In [None]:
# Provides train/test indices to split data in train/test sets.
stk_xgb = StratifiedKFold(n_splits=8, shuffle=True, random_state=12)

# specifying parameter grid
param_xgb = {
    'xgb__n_estimators' : [200, 300, 400]
}

# Search over specified parameter values with successive halving.
gridcv_xgb = HalvingGridSearchCV(estimator=pipe_xgb, 
                                  param_grid=param_xgb, 
                                  cv=stk_xgb, random_state=10)
# Run fit with all sets of parameters.
gridcv_xgb.fit(x_train_sm, y_train_sm)

In [None]:
print("Best XGBClassifier Parameter: {}".format(gridcv_xgb.best_params_))

In [None]:
# cross-validated score of the best_estimator.
gridcv_xgb.best_score_

**Using Mlflow for Tracking our Project**

MLflow is an open-source tool for managing the life cycle of machine learning models. It has several main components:

Tracking: records the results and parameters of models so that they can be compared.
Projects: packages code so that it is reproducible.
Models: manages model versioning and helps put ML models into production as an endpoint.

In [None]:
# Set the tracking server URI.
mlflow.set_tracking_uri('file:///kaggle/working/mlruns')

In [None]:
# Set the given experiment as the active experiment.
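set_experiment("Credit Card Default Prediction.")
# Look up the experiment ID instead of hard-coding it
ml_experiment_id = mlflow.get_experiment_by_name("Credit Card Default Prediction.").experiment_id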

In [None]:
# Start a new MLflow run
with mlflow.start_run(experiment_id=ml_experiment_id):
    
    # Log a batch of params for the current run.
    log_params(gridcv_xgb.best_params_)
    
    # get prediction on x data
    y_pred_xgb = gridcv_xgb.predict(x_test)
    
    # calculating accuracy, precision and recall
    # Accuracy classification score.
    accuracy = accuracy_score(y_test, y_pred_xgb)
    # Compute the precision.
    precision  = precision_score(y_test, y_pred_xgb, average='weighted')
    # Compute the F1 score, also known as balanced F-score or F-measure.
    f1 = f1_score(y_test, y_pred_xgb, average='weighted')
    
    # collect the metrics under a name that does not shadow sklearn's `metrics` module
    run_metrics = {
        "accuracy":accuracy,
        "precision":precision,
        "f1 score": f1
    }
    
    # Log multiple metrics for the current run.
    log_metrics(run_metrics)
    
    # log a local file or directory as an artifact of the currently active run
    log_artifact(local_path='../input/uci-credit-card/UCI_Credit_Card.csv', artifact_path='default.payment.next.month')
    
    # Log a scikit-learn model as an MLflow artifact for the current run.
    mlflow.sklearn.log_model(gridcv_xgb, "gridcv_xgb_model")

In [None]:
y_pred_xgb = gridcv_xgb.predict(x_test)

In [None]:
# Compute confusion matrix to evaluate the accuracy of a classification.
#XGB_mod = list(evaluate_model(y_test, y_pred_xgb, print_score=True))

In [None]:
# Plot Receiver operating characteristic (ROC) curve
plot_roc_curve(pipe_xgb, x_train_sm, y_train_sm)

**Random Forest Classifier with Cross-validation and Hyper-parameter Tuning**

In [None]:
# Pipeline of transforms with a final estimator.
pipe_rfc = Pipeline([
    # Principal component analysis (PCA)
    ('selection', PCA()), 
    # A random forest is a meta estimator that fits a number of decision tree
    ('rf', RandomForestClassifier(max_depth=8, oob_score=True, n_jobs=-1, random_state=400))
])
# Fit the pipeline (PCA followed by the classifier) on the resampled training data
pipe_rfc.fit(x_train_sm, y_train_sm)

In [None]:
# Stratified K-Fold cross-validator: each fold preserves the class proportions.
stk_rfc = StratifiedKFold(n_splits=8, shuffle=True, random_state=12)

# specifying parameter grid
param_rfc = {
    'rf__n_estimators' : [200, 300, 400]
}

# Search over specified parameter values with successive halving.
gridcv_rfc = HalvingGridSearchCV(estimator=pipe_rfc, 
                                  param_grid=param_rfc, 
                                  cv=stk_rfc, random_state=10)
# Run fit with all sets of parameters.
gridcv_rfc.fit(x_train_sm, y_train_sm)

In [None]:
# cross-validated score of the best_estimator.
gridcv_rfc.best_score_

In [None]:
# Predict on the held-out test set
y_pred_rfc = gridcv_rfc.predict(x_test)

In [None]:
# Compute confusion matrix to evaluate the accuracy of a classification.
#rfc_mod = list(evaluate_model(y_test, y_pred_rfc, print_score=True))

In [None]:
# Plot Receiver operating characteristic (ROC) curve
plot_roc_curve(pipe_rfc, x_train_sm, y_train_sm)

**Support Vector Classifier with Cross-validation and Hyper-parameter Tuning**

In [None]:
# Pipeline of transforms with a final estimator.
pipe_svm = Pipeline([
    # Principal component analysis (PCA)
    ('selection', PCA()), 
    # C-Support Vector Classification.
    ('svm', SVC(kernel='rbf', random_state=2))
])
# Fit the pipeline (PCA followed by the classifier) on the resampled training data
pipe_svm.fit(x_train_sm, y_train_sm)

In [None]:
# Stratified K-Fold cross-validator: each fold preserves the class proportions.
stk_svm = StratifiedKFold(n_splits=8, shuffle=True, random_state=12)

# specifying parameter grid
param_svm = {
    'svm__C' : [2, 3, 4]
}

# Search over specified parameter values with successive halving.
gridcv_svm = HalvingGridSearchCV(estimator=pipe_svm, 
                                  param_grid=param_svm, 
                                  cv=stk_svm, random_state=10)
# Run fit with all sets of parameters.
gridcv_svm.fit(x_train_sm, y_train_sm)

In [None]:
# cross-validated score of the best_estimator.
gridcv_svm.best_score_

In [None]:
# Predict on the held-out test set
y_pred_svm = gridcv_svm.predict(x_test)

In [None]:
# Compute confusion matrix to evaluate the accuracy of a classification.
#svm =list(evaluate_model(y_test, y_pred_svm, print_score=True))

In [None]:
# Plot Receiver operating characteristic (ROC) curve
plot_roc_curve(pipe_svm, x_train_sm, y_train_sm)

**CatBoost Classifier with Cross-validation and Hyper-parameter Tuning**

In [None]:
# Pipeline of transforms with a final estimator.
pipe_cat = Pipeline([
    # Principal component analysis (PCA)
    ('selection', PCA()), 
    # CatBoost classifier with a scikit-learn-compatible API.
    ('cat', CatBoostClassifier(learning_rate=0.01, depth=8, max_bin=255, random_state=2))
])
# Fit the pipeline (PCA followed by the classifier) on the resampled training data
pipe_cat.fit(x_train_sm, y_train_sm)

In [None]:
# Stratified K-Fold cross-validator: each fold preserves the class proportions.
stk_cat = StratifiedKFold(n_splits=8, shuffle=True, random_state=12)

# specifying parameter grid
param_cat = {
    'cat__iterations' : [550, 600, 800]
}

# Search over specified parameter values with successive halving.
gridcv_cat = HalvingGridSearchCV(estimator=pipe_cat, 
                                  param_grid=param_cat, 
                                  cv=stk_cat, random_state=10)
# Run fit with all sets of parameters.
gridcv_cat.fit(x_train_sm, y_train_sm)

In [None]:
# cross-validated score of the best_estimator.
gridcv_cat.best_score_

In [None]:
# Predict on the held-out test set
y_pred_cat = gridcv_cat.predict(x_test)

In [None]:
# Compute confusion matrix to evaluate the accuracy of a classification.
#cat =list(evaluate_model(y_test, y_pred_cat, print_score=True))

In [None]:
# Plot Receiver operating characteristic (ROC) curve
plot_roc_curve(pipe_cat, x_train_sm, y_train_sm)

**Compare All Models Performance**

In [None]:
Metrics_name = ["Accuracy","Sensitivity","Specificity","Precision","YJS","F1-Score"]
zip2 = zip(Logistic_mod, Knn_mod, Hist_Grad_boost,) #XGB_mod, rfc_mod, svm, cat)
fnl=pd.DataFrame(zip2)
fnl.columns = ["LogisticRegression", "KNN", "HistGradBoost",] #"XGBooster", "Random Forest", "Support Vector Classifier","CatBoosterClassifier"]
fnl =fnl.transpose()
fnl.columns = Metrics_name
fnl= fnl.sort_values(by=['Accuracy'],ascending=False)
fnl


In [None]:
!mlflow ui