PROJECT

PROBLEM: CLASSIFICATION PROBLEM ON THE PREDICTION OF LOAN DEFAULT

AIM: To predict loan default and repayment behaviour using customers' demographic data and previous loan history

Install Libraries

In [None]:
!pip install statsmodels -q
!pip install imbalanced_learn -q
!pip install --upgrade xgboost lightgbm  -q
!pip install plotly -q

Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score, f1_score,roc_curve, roc_auc_score
from xgboost import XGBClassifier
import lightgbm as lgb
from lightgbm import LGBMClassifier
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings('ignore')
import plotly.express as px
import plotly.io as pio
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint, uniform
from sklearn.metrics import precision_recall_curve, average_precision_score
import joblib

Load data from csv

In [None]:
Performance_data = pd.read_csv('https://raw.githubusercontent.com/Oyeniran20/axia_cohort_8/refs/heads/main/trainperf.csv')
Demographic_data = pd.read_csv('https://raw.githubusercontent.com/Oyeniran20/axia_cohort_8/refs/heads/main/traindemographics.csv')
Previous_loan_data = pd.read_csv('https://raw.githubusercontent.com/Oyeniran20/axia_cohort_8/refs/heads/main/trainprevloans.csv')

Data understanding for Performance_data

In [None]:
Performance_data.head()

In [None]:
Performance_data.columns

In [None]:
Performance_data.shape

In [None]:
Performance_data.info()

Descriptive Statistic for Performance_data

In [None]:
Performance_data.describe().T


Check for missing values for Performance_data

In [None]:
Performance_data.isna().sum()

Check for the percentage(%) of missing values for Performance_data

In [None]:
Performance_data.isna().sum().sort_values(ascending=False)/len(Performance_data)*100
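The share of missing values in a table should be divided by the row count of that same table. A minimal sketch on a toy frame; the helper name `missing_percent` and the toy values are illustrative assumptions, not project data:

```python
import numpy as np
import pandas as pd

def missing_percent(frame: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first."""
    # isna().mean() divides each column's missing count by the frame's own row count
    return (frame.isna().mean() * 100).sort_values(ascending=False)

# Toy data: 4 of 5 'referredby' values are missing
toy = pd.DataFrame({
    "referredby": [np.nan, np.nan, np.nan, "abc", np.nan],
    "loanamount": [10000, 20000, 10000, 30000, 10000],
})
report = missing_percent(toy)
print(report)
```

Because the denominator comes from the frame itself, the same helper can be reused on each of the three tables without mixing up their lengths.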

Drop columns that have about 80% missing values and are not important to the analysis (Performance_data)

In [None]:
Performance_data.drop(columns=['referredby'], inplace = True)

Check for duplicates for Performance_data

In [None]:
Performance_data.duplicated().sum()

Data understanding for Demographic_data

In [None]:
Demographic_data.head()

In [None]:
Demographic_data.shape

In [None]:
Demographic_data.columns

In [None]:
Demographic_data.info()

In [None]:
Demographic_data.tail()

Descriptive statistic for Demographic_data

In [None]:
Demographic_data.describe().T

Check for missing values for Demographic_data

In [None]:
Demographic_data.isna().sum()


Check for the percentage(%) of missing values for Demographic_data

In [None]:
Demographic_data.isna().sum().sort_values(ascending=False)/len(Demographic_data)*100

Check for duplicates for Demographic_data

In [None]:
Demographic_data.duplicated().sum()

Drop duplicates for Demographic_data

In [None]:
Demographic_data.drop_duplicates(inplace = True)

Data understanding for Previous_loan_data

In [None]:
Previous_loan_data.head()

In [None]:
Previous_loan_data.shape

In [None]:
Previous_loan_data.columns

In [None]:
Previous_loan_data.info()

In [None]:
Previous_loan_data.describe().T

Check for missing values for Previous_loan_data

In [None]:
Previous_loan_data.isna().sum()

Check for the percentage(%) of missing values for Previous_loan_data

In [None]:
Previous_loan_data.isna().sum().sort_values(ascending=False)/len(Previous_loan_data)*100

Drop columns that have about 80% missing values and are not important to the analysis (Previous_loan_data)

In [None]:
Previous_loan_data.drop(columns=['referredby'], inplace = True)

Check for duplicates for Previous_loan_data

In [None]:
Previous_loan_data.duplicated().sum()

In [None]:
Previous_loan_data['customerid'].unique()

In [None]:
Previous_loan_data['systemloanid'].unique()

In [None]:
Previous_loan_data['systemloanid'].count()

Feature Engineering: Previous_loan_data contains several rows per customer. In order to get one row per customer, new columns are created and then aggregated by customerid. Thus:
First: convert the required date columns of Previous_loan_data to datetime.
Then: create the engineered features.

The goal of this feature engineering is to capture each customer's payment behaviour, so the features are engineered as shown below

In [None]:
Previous_loan_data['firstduedate'] = pd.to_datetime(Previous_loan_data['firstduedate'], errors = 'coerce')
Previous_loan_data['firstrepaiddate'] = pd.to_datetime(Previous_loan_data['firstrepaiddate'], errors = 'coerce')


In [None]:
Previous_loan_data['On_Time_Repayment'] = (Previous_loan_data['firstrepaiddate'] <= Previous_loan_data['firstduedate']).astype(int)
Previous_loan_data['Late_Time_Repayment'] = (Previous_loan_data['firstrepaiddate'] > Previous_loan_data['firstduedate']).astype(int)

In [None]:
Previous_loan_data_repayment = Previous_loan_data.groupby('customerid')[['On_Time_Repayment','Late_Time_Repayment']].sum().reset_index()
Previous_loan_data_repayment.head()

The engineered features above give, for each customer, the number of previous loans whose first repayment was on time and the number whose first repayment was late.
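To make the aggregation concrete, here is a small sketch on made-up loans for two hypothetical customers, "A" and "B". It follows the same comparison logic as the cells above. Note that a missing repayment date becomes `NaT`, and any comparison with `NaT` is False, so such a loan counts as neither on time nor late.

```python
import pandas as pd

# Made-up previous loans; customer B has one loan with no recorded repayment date
loans = pd.DataFrame({
    "customerid": ["A", "A", "A", "B", "B"],
    "firstduedate": ["2017-01-10", "2017-02-10", "2017-03-10", "2017-01-15", "2017-02-15"],
    "firstrepaiddate": ["2017-01-09", "2017-02-12", "2017-03-10", "2017-01-20", None],
})
for col in ["firstduedate", "firstrepaiddate"]:
    loans[col] = pd.to_datetime(loans[col], errors="coerce")

# 1 if repaid on or before the due date, 1 if repaid after it
loans["On_Time_Repayment"] = (loans["firstrepaiddate"] <= loans["firstduedate"]).astype(int)
loans["Late_Time_Repayment"] = (loans["firstrepaiddate"] > loans["firstduedate"]).astype(int)

# One row per customer with the counts of on-time and late repayments
per_customer = (
    loans.groupby("customerid")[["On_Time_Repayment", "Late_Time_Repayment"]]
    .sum()
    .reset_index()
)
print(per_customer)
```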

Merge the 3 datasets on the customerid column (Performance_data, Demographic_data and Previous_loan_data_repayment)

In [None]:
df_merge = Performance_data.merge(Demographic_data, on = 'customerid', how = 'left' )
df = df_merge.merge(Previous_loan_data_repayment, on = 'customerid', how = 'left' )
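A left merge can silently add rows if the right-hand table repeats a key, and it leaves missing values where a customer has no match. The sketch below, on toy frames with hypothetical values, shows two optional safeguards that `DataFrame.merge` offers: `validate` and `indicator`.

```python
import pandas as pd

# Toy stand-ins for the performance and demographic tables
perf = pd.DataFrame({"customerid": ["A", "B", "C"], "loanamount": [10000, 20000, 15000]})
demo = pd.DataFrame({"customerid": ["A", "B"], "bank_account_type": ["Savings", "Current"]})

merged = perf.merge(
    demo,
    on="customerid",
    how="left",
    validate="many_to_one",  # raises MergeError if demo repeats a customerid
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are the customers without demographic data, which is where the missing values handled later come from.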

Feature Engineering and Creation: convert the date columns to datetime, convert birthdate to age, and convert termdays to a category because it takes only a few discrete values, unlike the other numerical features, which are continuous

In [None]:
df['approveddate'] = pd.to_datetime(df['approveddate'], errors = 'coerce')
df['creationdate'] = pd.to_datetime(df['creationdate'], errors = 'coerce')
df['birthdate'] = pd.to_datetime(df['birthdate'], errors = 'coerce')

#convert birthdate to age 
today = pd.to_datetime('today')
df['age'] = df['birthdate'].apply(
        lambda bd: today.year - bd.year - ((today.month, today.day) < (bd.month, bd.day))
    )

#create interest column
df['interest'] = (df['totaldue'] - df['loanamount']).astype(int)

#convert termdays to categorical column 
df['termdays'] = df['termdays'].astype('category')

#extract numerical date parts (approveddate and creationdate are already datetime)
for col in ['approveddate', 'creationdate']:
    # Extract datetime features
    df[col + '_year'] = df[col].dt.year
    df[col + '_month'] = df[col].dt.month
    df[col + '_day'] = df[col].dt.day
    df[col + '_hour'] = df[col].dt.hour
    df[col + '_minute'] = df[col].dt.minute

#Drop columns
df.drop(columns = ['approveddate', 'creationdate', 'birthdate'], inplace = True)

Data understanding for the merged dataset(df)

In [None]:
df.head()

In [None]:
df.columns

Definition of features(columns) in df
- customerid - unique identifier for customers(borrowers)
- systemloanid - unique identifier for a particular loan
- loannumber - the number of the loan to be predicted
- loanamount - the amount of loan taken by customers
- interest - extra amount paid by customers
- totaldue - the sum of loanamount and interest
- termdays - loan term length in days
- good_bad_flag - repayment behaviour by customers(good = repaid/not defaulted, bad = defaulted)
- bank_account_type - the type of account(savings, current)
- longitude_gps - longitude of customer's location
- latitude_gps - latitude of customer's location
- bank_name_clients - customer's bank name
- bank_branch_clients - customer's bank branch
- employment_status_clients - customer's employment status
- level_of_education_clients - customer's educational level
- On_Time_Repayment - number of previous loans whose first repayment was on or before the due date
- Late_Time_Repayment - number of previous loans whose first repayment was after the due date
- approveddate_year, _month, _day, _hour, _minute - date and time parts of when the loan was approved
- creationdate_year, _month, _day, _hour, _minute - date and time parts of when the loan was created
- age - the age of the customer

In [None]:
df.shape

In [None]:
df.info()

From the above information, I observed that each column has the expected datatype

Descriptive statistic for df

In [None]:
df.describe().T.round(2)

Observation:
- There are missing values, as the counts are not the same across columns
- Outliers are present

Check for missing values

In [None]:
df.isna().sum()

Check for the percentage of missing values

In [None]:
df.isna().sum().sort_values(ascending = False) / len(df) * 100

Drop columns that have about 80% missing values and are not important to the analysis (df)

In [None]:
df.drop(columns=['level_of_education_clients', 'bank_branch_clients'], inplace = True)

Check for columns with missing values and fill them

In [None]:
df['employment_status_clients'].isna().sum()

In [None]:
df['employment_status_clients'] = df['employment_status_clients'].fillna('Unknown')

In [None]:
df['bank_account_type'].isna().sum()

In [None]:
df['age'].isna().sum()

In [None]:
df['age'] = df['age'].fillna(df['age'].median())

In [None]:
df['bank_account_type'] = df['bank_account_type'].fillna('NA')

In [None]:
df['longitude_gps'].isna().sum()

In [None]:
df['longitude_gps'] = df['longitude_gps'].fillna(df['longitude_gps'].mean())

In [None]:
df['latitude_gps'].isna().sum()

In [None]:
df['latitude_gps'] = df['latitude_gps'].fillna(df['latitude_gps'].mean())

In [None]:
df['bank_name_clients'].isna().sum()

In [None]:
df['bank_name_clients'] = df['bank_name_clients'].fillna('NA')

In [None]:
df['Late_Time_Repayment'] = df['Late_Time_Repayment'].fillna(0)

In [None]:
df['On_Time_Repayment'] = df['On_Time_Repayment'].fillna(0)

Check for Duplicates

In [None]:
df.duplicated().sum()

Observation: No duplicates found

Check Categorical Columns in order to see if there's need for Grouping or Encoding

In [None]:
#visualize cat column
cols = ['bank_account_type','employment_status_clients','termdays','bank_name_clients']
plt.figure(figsize = (18, 18))
for i, col in enumerate(cols, 1):
    plt.subplot(2,2,i)
    sns.countplot(x =col, data = df)
    plt.title(f'Count Plot of {col}')
plt.show()

Observation: The count plots above show the categories in each column. The most frequent category is 'savings' for bank_account_type, 'permanent' for employment_status_clients, '30 days' for termdays and 'GT Bank' for bank_name_clients.

In [None]:
#visualize cat column vs target column('good_bad_flag')
cols = ['bank_account_type','employment_status_clients','termdays','bank_name_clients']
plt.figure(figsize = (18, 18))
for i, col in enumerate(cols, 1):
    plt.subplot(2,2,i)
    sns.countplot(x =col, hue = 'good_bad_flag', data = df)
    plt.title(f'{col} vs good_bad_flag')
plt.show()

INSIGHT: From the above, (1) the count plots show the frequency and distribution of each column in relation to good_bad_flag. (2) The dataset is imbalanced, as it contains far more good flags than bad flags. (3) The plots show how the good and bad flags are distributed within each category.

Visualization of Numerical Columns to Check for Outliers using boxplot

In [None]:
#Numerical plots
num_cols = ['loannumber','loanamount', 'interest', 'totaldue','On_Time_Repayment','Late_Time_Repayment', 'age']
for col in num_cols:
    plt.figure(figsize = (8,4))
    #Histogram plot
    plt.subplot(1,2,1)
    sns.histplot(df[col], bins = 30, kde = True)
    plt.title(f'Histogram Distribution of {col}')
    #Box plot
    plt.subplot(1,2,2)
    sns.boxplot(df[col])
    plt.title(f'Boxplot Distribution of {col}')
    plt.show()

In [None]:
for col in num_cols:
    plt.figure(figsize = (8,4))
    #Boxplot
    plt.subplot(1,2,1)
    sns.boxplot(df[col])
    plt.title(f'Boxplot Distribution of {col}')
    #Violin plot
    plt.subplot(1,2,2)
    sns.violinplot(df[col])
    plt.title(f'Violinplot Distribution of {col}')
    plt.show()

Observation: all the numerical columns have outliers. The columns are not normally distributed; each is skewed to the left or to the right.

Calculate Skewness

In [None]:
num_cols_skewness = df[num_cols].skew().sort_values(ascending = False)
num_cols_skewness

Handle outliers by capping them (a Winsorization-style approach) at the IQR fences, Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, using the conventional factor of 1.5

In [None]:
for col in num_cols:
    #compute the fences for this column only
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    #cap values outside the fences
    df[col] = df[col].clip(lower = lower_bound, upper = upper_bound)
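A small worked example may help show what the capping does to a single column. The helper name `cap_iqr` and the toy amounts are assumptions made for illustration.

```python
import pandas as pd

def cap_iqr(series: pd.Series, factor: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - factor * iqr, upper=q3 + factor * iqr)

# Six typical values and one extreme value
amounts = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])
capped = cap_iqr(amounts)
print(capped.tolist())
```

Here Q1 = 11 and Q3 = 12.5, so the upper fence is 12.5 + 1.5 * 1.5 = 14.75. The value 100 is pulled down to 14.75 and the other values are unchanged. No rows are dropped, which is why the row count stays the same after capping.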


In [None]:
#Numerical plots
for col in num_cols:
    plt.figure(figsize = (8,4))
    #Histogram plot
    plt.subplot(1,2,1)
    sns.histplot(df[col], bins = 30, kde = True)
    plt.title(f'Histogram Distribution of {col}')
    #Box plot
    plt.subplot(1,2,2)
    sns.boxplot(df[col])
    plt.title(f'Boxplot Distribution of {col}')
    plt.show()

After capping, I observed that the extreme values were brought within the fences, which reduced the skewness and controlled the outliers.

Correlation Analysis: this is done to detect redundancy and multicollinearity (i.e., features that are too strongly correlated with each other). Since I will be using a Logistic Regression model, I visualized the numerical columns and found multicollinearity. To avoid it, I will drop one of each pair of columns that carry almost the same information.



In [None]:
plt.figure(figsize = (8,4))
heatmap = df[num_cols].corr()
sns.heatmap(data = heatmap, fmt = '.2f', annot = True, cmap = 'coolwarm')
plt.title('Correlation Matrix of Numerical Features', fontweight = 'bold')
plt.show()

Using the Variance Inflation Factor (VIF) to detect and confirm multicollinearity

In [None]:
#keep num_cols as a list of names and use a separate variable for the VIF input
vif_data = df[num_cols]
vif_data_const = add_constant(vif_data)
vif = pd.DataFrame()
vif['features'] = vif_data_const.columns
vif['VIF'] = [variance_inflation_factor(vif_data_const.values, i)
             for i in range(vif_data_const.shape[1])]
vif

From the above information, I observed that
- loannumber, loanamount, totaldue and On_Time_Repayment are highly correlated, while loanamount and totaldue show extremely high multicollinearity. To keep the Logistic Regression model stable, I will drop totaldue and keep loanamount. loanamount is the amount the customer actually received, while totaldue is the loan amount plus interest and other charges such as late repayment charges.
- totaldue has already been split into loanamount and interest, so it is safe to drop
- age has very low correlation with the other features (it is effectively independent of them)
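The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes it by hand with NumPy on synthetic data (the numbers are made up, not the project data) to show why keeping a column that is an exact sum of two others, as totaldue is of loanamount and interest, makes the VIF blow up.

```python
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    # A (numerically) perfect fit means the column is fully redundant
    return float("inf") if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
loanamount = rng.uniform(10_000, 60_000, size=500)
interest = 0.1 * loanamount + rng.normal(0, 2_000, size=500)
age = rng.uniform(21, 55, size=500)
totaldue = loanamount + interest                      # exact linear combination

with_total = np.column_stack([loanamount, interest, totaldue, age])
without_total = np.column_stack([loanamount, interest, age])

print("loanamount VIF with totaldue:   ", vif(with_total, 0))
print("loanamount VIF without totaldue:", vif(without_total, 0))
print("age VIF:                        ", vif(without_total, 2))
```

In this synthetic setup, dropping the summed column brings the VIF of loanamount back to a small value, while the independent age column stays close to 1.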

In [None]:
#drop totaldue
df.drop(columns = ['totaldue'], inplace = True)

Check whether the target column is balanced.
- check by counting the values
- check by visualizing with a bar chart

In [None]:
df.good_bad_flag.value_counts()

In [None]:
df.good_bad_flag.value_counts().plot(kind = 'bar')

Observation: The target column is not balanced; the split is roughly 20% to 80%.
To balance the target column, I will use the oversampling technique called SMOTE, because the dataset is small and I do not want the models to predict only the majority class.
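SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a simplified NumPy illustration of that idea, not imbalanced-learn's implementation; the function name `smote_like` and the toy points are assumptions.

```python
import numpy as np

def smote_like(X_min: np.ndarray, n_new: int, k: int = 3, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority points
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest neighbours of each point
    base = rng.integers(0, len(X_min), size=n_new)  # random starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                    # position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Five toy minority-class points in two dimensions
minority = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.8], [1.8, 1.5]])
synthetic = smote_like(minority, n_new=10)
print(synthetic.round(2))
```

Every synthetic point lies on a segment between two real minority points, so no values outside the observed range are invented. Resampling should only ever see training data, which is what placing SMOTE inside an imbalanced-learn pipeline achieves.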

Convert the datatype of the target column(good_bad_flag) from object(text) to integers(numbers)

In [None]:
df.good_bad_flag.unique()

In [None]:
df.good_bad_flag=(df.good_bad_flag == 'Good').astype(int)

In [None]:
df.good_bad_flag.unique()

DATA PREPARATION

Separation of columns into features and target. While separating the columns, I dropped some that are not needed for data preparation.
X = features and y = target

In [None]:
df.columns

In [None]:
df['latitude_gps']

In [None]:
pio.renderers.default = 'notebook'
fig = px.scatter_mapbox(
    df,
    lat = 'latitude_gps',
    lon = 'longitude_gps',
    hover_data = ['latitude_gps', 'longitude_gps'],
    zoom = 6,
    height = 600,
    width = 1000,
    center = {'lat': df['latitude_gps'].mean(), 'lon': df['longitude_gps'].mean()}
)
fig.update_layout(mapbox_style = 'open-street-map')
fig.show()

INSIGHTS
- The map shows that the customers are located in Nigeria
- Customers are highly concentrated in southern and central Nigeria
- Northern Nigeria has fewer customers
- Having analysed the customers' locations, the location columns are removed because they are not expected to influence the target column

In [None]:
df.drop(columns = ['longitude_gps','latitude_gps'], inplace = True)

In [None]:
#bin termdays into an ordered categorical column
#termdays was cast to category earlier, so cast it back to int before binning
bins = [0,30,60,90]
#ranges: 1-30, 31-60, 61-90
labels = ['short', 'medium', 'long']
#pd.cut with labels returns an ordered categorical (short < medium < long)
df['termdays_cat'] = pd.cut(df['termdays'].astype(int), bins = bins, labels = labels, right = True)

In [None]:
#Separate columns into features(X) and target column(y)
X = df.drop(columns = ['customerid','systemloanid','termdays','good_bad_flag'])
y = df['good_bad_flag']

DATA PREPROCESSING


In [None]:
#split into Test Data and Train Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42, stratify = y)

Separate into Categorical and Numerical Columns

In [None]:
num_cols = X.select_dtypes(include = np.number).columns.tolist()
cat_cols = X.select_dtypes(include = ['object', 'category']).columns.tolist()

Encoding the categorical columns and scaling the numerical columns: after checking cat_cols, I observed that encoding is needed. I will use OneHotEncoder because it converts categorical columns into numeric indicator features that models can use. The numerical columns will be scaled using StandardScaler.

In [None]:
num_pipeline = Pipeline(steps =[
    ('scaler', StandardScaler())
])
cat_pipeline = Pipeline(steps =[
    ('encoder', OneHotEncoder(sparse_output = False, handle_unknown = 'ignore'))
])

#Apply preprocessing
preprocessor = ColumnTransformer(transformers = [
    ('num',num_pipeline, num_cols),
    ('cat',cat_pipeline, cat_cols)
])

In [None]:
preprocessor
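The effect of `handle_unknown='ignore'` and of fitting the scaler on training data only can be seen on a toy example. This is a sketch with made-up rows; `sparse_threshold=0` is used here only to force a dense array so the output is easy to read.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training rows and one test row with a category never seen in training
train = pd.DataFrame({
    "loanamount": [10000.0, 20000.0, 30000.0],
    "bank_account_type": ["Savings", "Current", "Savings"],
})
test = pd.DataFrame({"loanamount": [20000.0], "bank_account_type": ["Other"]})

prep = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["loanamount"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["bank_account_type"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
train_t = prep.fit_transform(train)   # learns the mean, std and categories from train only
test_t = prep.transform(test)         # reuses them on the test row

print(train_t)
print(test_t)
```

The unseen category "Other" is encoded as all zeros in the one-hot columns instead of raising an error, and the test amount is scaled with the training mean and standard deviation.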


In [None]:
#INITIALIZE/SELECT Models
models = {
    'Decision Tree': DecisionTreeClassifier(random_state = 42),
    'Logistic Regression': LogisticRegression(random_state = 42),
    'Random Forest': RandomForestClassifier(random_state = 42),
    'XGBoost': XGBClassifier(eval_metric = 'logloss', random_state = 42),
    'LightGBM': LGBMClassifier(verbose = -1, random_state = 42),
    'Svm': SVC(kernel = 'rbf', C = 1.0, gamma = 'scale')
}

#MODEL EVALUATION using for loop
results = {}
fig, axes = plt.subplots(2, 3, figsize = (12, 4))
for (name, model), ax in zip(models.items(), axes.flatten()):

    pipeline = Pipeline(steps = [
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])

    #Train the Model
    pipeline.fit(X_train, y_train)

    #Predict on the Training and Test Sets
    train_pred = pipeline.predict(X_train)
    test_pred = pipeline.predict(X_test)

    #Evaluate the Predictions
    train_score = accuracy_score(y_train, train_pred)
    test_score = accuracy_score(y_test, test_pred)
    precision = precision_score(y_test, test_pred)
    recall = recall_score(y_test, test_pred)
    F1 = f1_score(y_test, test_pred)
    #Store the Result
    results[name] = {
        'Train Accuracy': train_score,
        'Test Accuracy':  test_score,
        'Precision Score': precision,
        'Recall Score': recall,
        'F1-Score': F1
    }

    #Plot Confusion Matrix
    cm = confusion_matrix(y_test, test_pred)
    disp = ConfusionMatrixDisplay(cm)
    disp.plot(ax = ax, cmap = 'Blues')
    ax.set_title(name)

    #Print the classification report for each model
    print(f'\n{name} Report:')
    print(classification_report(y_test, test_pred))
plt.tight_layout()
plt.show()

#Print Metrics
metrics_df = pd.DataFrame(results)
print('\nSummary of Model Selection:')
print(metrics_df)

OBSERVATION:
- Random Forest and Decision Tree overfit: their training accuracy is much higher than their test accuracy.
- The classification report shows that the model is much better at predicting class 1 than class 0. This is because class 1 has more samples than class 0 (the data is imbalanced).
- Decision Tree performed well on precision.
- Logistic Regression and SVM performed well on recall and F1, but SVM's very high recall most likely reflects predicting class 1 for almost every loan.

- When trained on imbalanced classes, the models tend to predict the majority class more often.
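To see why accuracy alone is misleading on imbalanced data, consider a "model" that always predicts the majority class. The sketch below uses a made-up 80/20 split, not the project's actual counts.

```python
import numpy as np

# Hypothetical labels: 80 good loans (1) and 20 bad loans (0)
y_true = np.array([1] * 80 + [0] * 20)
y_pred = np.ones_like(y_true)          # always predict "good"

accuracy = (y_pred == y_true).mean()
good_recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()
bad_recall = ((y_pred == 0) & (y_true == 0)).sum() / (y_true == 0).sum()

print(f"accuracy    = {accuracy:.2f}")
print(f"good recall = {good_recall:.2f}")
print(f"bad recall  = {bad_recall:.2f}")
```

The baseline reaches 80% accuracy and perfect recall on good loans while catching no defaulters at all, which is why the per-class figures in the classification report matter more here than the headline accuracy.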

Handling Imbalance

In [None]:
#Using class weights: mistakes on the minority class are penalised more heavily
models_cw = {
    'Decision Tree': DecisionTreeClassifier( class_weight = 'balanced', random_state = 42), 
    'Logistic Regression': LogisticRegression( class_weight = 'balanced', random_state = 42),
    'Random Forest': RandomForestClassifier( class_weight = 'balanced', random_state = 42),
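    #XGBoost has no class_weight parameter; scale_pos_weight = negatives / positives plays that role
    'XGBoost': XGBClassifier(eval_metric = 'logloss', scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum(), random_state = 42),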
    'LightGBM': LGBMClassifier( class_weight = 'balanced',verbose = -1, random_state = 42),
    'Svm': SVC(kernel = 'rbf', C = 1.0, gamma = 'scale', class_weight = 'balanced')
}

#MODEL EVALUATION using for loop
results = {}
fig, axes = plt.subplots(2, 3, figsize = (12, 4))
for (name, model), ax in zip(models_cw.items(), axes.flatten()):

    pipeline = Pipeline(steps = [
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])

    #Train the Model
    pipeline.fit(X_train, y_train)

    #Predict on the Training and Test Sets
    train_pred_weighted = pipeline.predict(X_train)
    test_pred_weighted = pipeline.predict(X_test)

    #Evaluate the Predictions
    train_score_weighted = accuracy_score(y_train, train_pred_weighted)
    test_score_weighted = accuracy_score(y_test, test_pred_weighted)
    precision_weighted = precision_score(y_test, test_pred_weighted)
    recall_weighted = recall_score(y_test, test_pred_weighted)
    F1_weighted = f1_score(y_test, test_pred_weighted)

    #Store the Result
    results[name] = {
        'Train Accuracy': train_score_weighted,
        'Test Accuracy':  test_score_weighted,
        'Precision Score': precision_weighted,
        'Recall Score': recall_weighted,
        'F1-Score': F1_weighted
    }

    #Plot Confusion Matrix
    cm = confusion_matrix(y_test, test_pred_weighted)
    disp = ConfusionMatrixDisplay(cm)
    disp.plot(ax = ax, cmap = 'Blues')
    ax.set_title(name)

    #Print the classification report for each model
    print(f'\n{name} Report:')
    print(classification_report(y_test, test_pred_weighted))
plt.tight_layout()
plt.show()

#Print Metrics
metrics_df = pd.DataFrame(results)
print('\nSummary of Model Selection:')
print(metrics_df)

With class weights used for balancing, Random Forest performs best on recall and Logistic Regression performs best on precision.
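For reference, scikit-learn's `class_weight='balanced'` assigns each class the weight n_samples / (n_classes * class_count). The sketch below works this out for a hypothetical 80/20 split and also shows the ratio that XGBoost's `scale_pos_weight` parameter is commonly set to (negatives divided by positives).

```python
import numpy as np

# Hypothetical labels: 80 good loans (1) and 20 bad loans (0)
y = np.array([1] * 80 + [0] * 20)

counts = np.bincount(y)                       # samples per class: [20, 80]
weights = len(y) / (len(counts) * counts)     # 'balanced' weight per class

# XGBoost analogue: number of negative samples / number of positive samples
scale_pos_weight = counts[0] / counts[1]

print("class counts:    ", counts)
print("balanced weights:", weights)
print("scale_pos_weight:", scale_pos_weight)
```

The minority class gets weight 2.5 and the majority class 0.625, so each class contributes the same total weight (50) to the loss.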

MODEL ON SMOTE-BALANCED DATA (RETRAINING THE MODELS ON RESAMPLED DATA): Next, I retrain the models on a balanced training set produced by the SMOTE oversampling technique. SMOTE sits inside the pipeline, so only the training data is resampled.

In [None]:
#using SMOTE
models_sm = {
    'Decision Tree': DecisionTreeClassifier(random_state = 42),
    'Logistic Regression': LogisticRegression(random_state = 42),
    'Random Forest': RandomForestClassifier(random_state = 42),
    'XGBoost': XGBClassifier(eval_metric = 'logloss', random_state = 42),
    'LightGBM': LGBMClassifier(verbose = -1, random_state = 42),
    'Svm': SVC(kernel = 'rbf', C = 1.0, gamma = 'scale')
}

#Re_Evaluate Model
results = {}
fig, axes = plt.subplots(2, 3, figsize = (12, 4))
for (name, model), ax in zip(models_sm.items(), axes.flatten()):

    pipeline = ImbPipeline([
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state = 42)),
        ('classifier', model)
    ])

    #Train the Model
    pipeline.fit(X_train, y_train)

    #Predict on the Training and Test Sets
    train_pred = pipeline.predict(X_train)
    test_pred = pipeline.predict(X_test)

    #Evaluate the Predictions
    train_score = accuracy_score(y_train, train_pred)
    test_score = accuracy_score(y_test, test_pred)
    precision = precision_score(y_test, test_pred)
    recall = recall_score(y_test, test_pred)
    F1 = f1_score(y_test, test_pred)

    #Store the Result
    results[name] = {
        'Train Accuracy': train_score,
        'Test Accuracy':  test_score,
        'Precision Score': precision,
        'Recall Score': recall,
        'F1-Score': F1
    }

    #Plot Confusion Matrix
    cm = confusion_matrix(y_test, test_pred)
    disp = ConfusionMatrixDisplay(cm)
    disp.plot(ax = ax, cmap = 'Blues')
    ax.set_title(name)

    #Print the classification report for each model
    print(f'\n{name} Report:')
    print(classification_report(y_test, test_pred))
plt.tight_layout()
plt.show()

#Print Metrics
metrics_df = pd.DataFrame(results)
print('\nSummary of Model Selection:')
print(metrics_df)

Observation:
- With a balanced training set, the models are less biased towards the majority class
- This improves the models' ability to detect defaulters
- Decision Tree and Random Forest still overfit (training accuracy is much higher than test accuracy)
- Random Forest achieves high recall on the positive class
- LightGBM gives the best balance: it does not overfit on training accuracy and it performs well on F1

In [None]:
# Sweep decision thresholds from 0.10 to 0.85 instead of relying on the default of 0.5
models_thres = {
    'Decision Tree': DecisionTreeClassifier(random_state = 42),
    'Logistic Regression': LogisticRegression(random_state = 42),
    'Random Forest': RandomForestClassifier(random_state = 42),
    'XGBoost': XGBClassifier(eval_metric = 'logloss', random_state = 42),
    'LightGBM': LGBMClassifier(verbose = -1, random_state = 42),
    'Svm': SVC(kernel = 'rbf', C = 1.0, gamma = 'scale', probability = True)
}

#Re_Evaluate each model over the range of thresholds
results = []
plt.figure(figsize = (8, 4))
for name, model in models_thres.items():
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])

    #Train the Model and get the predicted probability of class 1
    pipeline.fit(X_train, y_train)
    y_prob = pipeline.predict_proba(X_test)[:, 1]

    print(f'\n{name}')
    print('Threshold | Accuracy | Precision | Recall | F1')
    print('-' * 50)
    for t in np.arange(0.1, 0.9, 0.05):
        y_pred_thresh = (y_prob >= t).astype(int)
        test_score = accuracy_score(y_test, y_pred_thresh)
        precision = precision_score(y_test, y_pred_thresh, zero_division = 0)
        recall = recall_score(y_test, y_pred_thresh, zero_division = 0)
        F1 = f1_score(y_test, y_pred_thresh, zero_division = 0)

        #Store the Result for every model and threshold
        results.append({
            'Model': name,
            'Threshold': round(t, 2),
            'Test Accuracy':  test_score,
            'Precision Score': precision,
            'Recall Score': recall,
            'F1-Score': F1
        })

        print(f'{t:.2f}      |   {test_score:.3f}  | {precision: .3f}    | {recall: .3f} | {F1: .3f}')

    #Classification report at the default threshold of 0.5
    print(f'\n{name} Report (threshold = 0.5):')
    print(classification_report(y_test, (y_prob >= 0.5).astype(int)))

    #Precision-Recall curve for this model
    pr_precision, pr_recall, thresholds = precision_recall_curve(y_test, y_prob)
    ap_score = average_precision_score(y_test, y_prob)
    plt.plot(pr_recall, pr_precision, marker = '.', label = f'{name} (AP={ap_score:.3f})')

#Print Metrics
metrics_df = pd.DataFrame(results)
print('\nSummary of Model Selection:')
print(metrics_df)

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True)
plt.show()

INSIGHT: Adjusting the decision threshold alone, without resampling or class weights, was not enough: the models still misclassified loan defaulters
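Because class 1 means "good", a loan is flagged as a likely default when its predicted probability of being good falls below the threshold. The toy sketch below uses made-up probabilities for ten loans to show how moving the threshold changes the share of defaulters caught.

```python
import numpy as np

# Made-up true labels (1 = good, 0 = bad) and predicted probabilities of "good"
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 1, 0])
p_good = np.array([0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.55, 0.45, 0.30])

def bad_recall_at(threshold: float) -> float:
    """Share of true defaulters flagged as bad (predicted probability below threshold)."""
    y_pred = (p_good >= threshold).astype(int)
    caught = ((y_pred == 0) & (y_true == 0)).sum()
    return caught / (y_true == 0).sum()

def good_recall_at(threshold: float) -> float:
    """Share of true good loans still predicted as good."""
    y_pred = (p_good >= threshold).astype(int)
    kept = ((y_pred == 1) & (y_true == 1)).sum()
    return kept / (y_true == 1).sum()

for t in (0.5, 0.7):
    print(f"threshold {t}: bad recall = {bad_recall_at(t):.2f}, good recall = {good_recall_at(t):.2f}")
```

In this toy case, raising the threshold from 0.5 to 0.7 catches all three defaulters instead of one. On real data the gain usually comes at the cost of rejecting more good customers, so the threshold is a business trade-off.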

After trying the methods above for handling imbalance, SMOTE suited the models best and performed better than the others.
SMOTE was therefore chosen for further evaluation.

Hyperparameter Tuning: To reduce the overfitting and underfitting seen above, the hyperparameters are tuned as follows:

In [None]:
# Hyperparameter Tuning on all my Model for the balanced class with SMOTE
#Defining Model and Param
models = {
    'Decision Tree': DecisionTreeClassifier(),
    'Logistic Regression': LogisticRegression(max_iter=500),
    'Random Forest': RandomForestClassifier(),
    'XGBoost': XGBClassifier(eval_metric = 'logloss'),
    'LightGBM': LGBMClassifier(verbose = -1, random_state = 42),
    'SVC': SVC(probability = True)
}
param_grids = {
    "Logistic Regression": {
        'smote__sampling_strategy': [0.5, 1.0],
        "classifier__C": [ 0.1, 1, 10],
        "classifier__solver": ["liblinear", "lbfgs"]
    },
    "Decision Tree": {
        "classifier__max_depth": [None, 10],
        "smote__sampling_strategy": [0.5, 1.0],
        "classifier__min_samples_leaf": [1, 2, 4]
    },
    "Random Forest": {
        "classifier__n_estimators": [100, 200],
        'smote__sampling_strategy': [0.5, 1.0],
        "classifier__max_depth": [None, 10],
        "classifier__min_samples_split": [2, 5],
        "classifier__min_samples_leaf": [1, 2],
    },
    "LightGBM": {
        "classifier__n_estimators": [100, 200],
        'smote__sampling_strategy': [0.5, 1.0],
        "classifier__num_leaves": [31,63],
        "classifier__learning_rate": [0.01, 0.1]
    },
    "XGBoost": {
        "classifier__n_estimators": [100, 200],
        'smote__sampling_strategy': [0.5, 1.0],
        "classifier__max_depth": [3, 6],
        "classifier__learning_rate": [0.01, 0.1]
    },
    "SVC": {
        "classifier__C": [1, 10],
        'smote__sampling_strategy': [0.5, 1.0],
        "classifier__kernel": ['linear', 'rbf']
    }
}


#RandomizedSearchCV for each model
best_models = {}
results = []

for name, model in models.items():
    print(f'\nTraining {name}:\n')
    
    pipeline = ImbPipeline([
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state = 42)),
        ('classifier', model)
    ])

    search = RandomizedSearchCV(
        pipeline,
        param_distributions =param_grids[name],
        n_iter = 5,
        cv=3,
        scoring = "f1_weighted",
        refit=True,   # refit the best candidate (ranked by f1_weighted) on the full training set
        n_jobs=1,
        random_state = 42
    )
    search.fit(X_train, y_train)
    best_models[name] = search.best_estimator_
    y_pred = search.best_estimator_.predict(X_test)


    # Metrics (no trailing commas, so each value is a float rather than a 1-tuple)
    Accuracy = accuracy_score(y_test, y_pred)
    Precision = precision_score(y_test, y_pred, average="weighted")
    Recall = recall_score(y_test, y_pred, average="weighted")
    F1 = f1_score(y_test, y_pred, average="weighted")
        
    
    results.append({
        'Models': name,
        "Best Params": search.best_params_,
        "Best CV Score": search.best_score_,
        "Test Accuracy": Accuracy,
        "Test Precision": Precision,
        "Test Recall": Recall,
        "Test F1 Score": F1
    })

#Final Results
df_results = pd.DataFrame(results).T
print(df_results)
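With a single `scoring` string, the search ranks candidates by that one metric. If the intent is to track several metrics but select the winner by recall on the defaulter class, scikit-learn accepts a dict of scorers together with `refit` naming one of them. The sketch below uses synthetic data and a plain Logistic Regression; the scorer name `recall_bad` and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced data: about 25% class 0 (bad) and 75% class 1 (good)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.25, 0.75], random_state=42
)

scoring = {
    "f1_weighted": "f1_weighted",
    "recall_bad": make_scorer(recall_score, pos_label=0),  # recall on the defaulter class
}

search_demo = RandomizedSearchCV(
    LogisticRegression(max_iter=500),
    param_distributions={"C": [0.01, 0.1, 1, 10]},
    n_iter=3,
    cv=3,
    scoring=scoring,
    refit="recall_bad",   # the best candidate is chosen by this scorer
    random_state=42,
)
search_demo.fit(X_demo, y_demo)

print("best params:", search_demo.best_params_)
print("best CV recall on class 0:", round(search_demo.best_score_, 3))
```

Both metrics remain available in `cv_results_` (as `mean_test_f1_weighted` and `mean_test_recall_bad`), so the selection metric and the reported metric no longer have to be the same.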

OBSERVATION:
- After hyperparameter tuning, overfitting and underfitting were reduced and the metrics became more balanced.
- The best-performing models are Random Forest, XGBoost, Logistic Regression and LightGBM
- Best accuracy = Logistic Regression
- Best precision = LightGBM
- Best recall = Logistic Regression
- Best F1 = LightGBM

- From the above, Logistic Regression and LightGBM are the best models. To identify the actual best-performing model, cross-validation will be performed on both.

In [None]:
best_model_1 = best_models['Logistic Regression']
best_model_2 = best_models['LightGBM']

In [None]:
# Cross Validation on the two best Model
#Defining Model and Param
best_models = {
    "Logistic Regression":  best_model_1,
    'LightGBM': best_model_2
}
   
# Stratified 5-fold cross-validation (no further tuning in this cell)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = []

for name, model in best_models.items():
    print(f"\n Cross-Validating {name}:")

    scores = cross_validate(
        model,
        X, y,   # full dataset, before the train/test split
        cv = cv,
        scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted'],
        n_jobs = -1,
        return_train_score = False
    )

       # Metrics
    metrics = {
        'Models': name,
        "Accuracy Mean": scores['test_accuracy'].mean(),
        "Accuracy Std": scores['test_accuracy'].std(),
        "Precision Mean": scores['test_precision_weighted'].mean(),
        "Recall Mean": scores['test_recall_weighted'].mean(),
        "F1 Mean": scores['test_f1_weighted'].mean()
    }

    results.append(metrics) 
    
# Summary table
df_cv_results = pd.DataFrame(results).T
print("\nCross-Validation Results:\n")
print(df_cv_results)


INSIGHT:
- Logistic Regression has the higher mean recall and mean accuracy, which suggests it is better at predicting loan defaults.
- Logistic Regression has the lower accuracy standard deviation, which means it is more stable across folds.
- LightGBM has the higher mean F1.

CONCLUSION:
- The F1 score balances recall and precision: $F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$
- LightGBM is chosen as the best model for predicting loan default because it has the higher F1 and can also capture nonlinear relationships and feature interactions that Logistic Regression might miss.
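
The cross-validation above reports `f1_weighted`, whereas the formula describes the F1 of a single class. A small sketch of the difference, using made-up labels (an assumption for illustration, not the loan data):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels: six positives, two negatives (illustrative only)
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_hat  = [1, 1, 1, 1, 1, 0, 1, 0]

p = precision_score(y_true, y_hat)   # precision for class 1
r = recall_score(y_true, y_hat)      # recall for class 1
f1_formula = 2 * p * r / (p + r)     # the formula, applied to class 1

f1_binary = f1_score(y_true, y_hat)                        # F1 of class 1 only
f1_weighted = f1_score(y_true, y_hat, average="weighted")  # per-class F1, weighted by support

print(f"F1 from formula: {f1_formula:.3f}")
print(f"Binary F1:       {f1_binary:.3f}")
print(f"Weighted F1:     {f1_weighted:.3f}")
```

Weighted F1 averages the per-class scores by support, so it can rank models differently from the class 1 F1 that `f1_score` returns by default.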

In [None]:
results = {}
name = 'LightGBM'   # set explicitly instead of relying on the leftover loop variable
best_model = LGBMClassifier(verbose = -1, random_state = 42)

pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state = 42)),
    ('classifier', best_model)
])

#Train the Model
pipeline.fit(X_train, y_train)

#Predict on the Training and Test Sets
train_pred = pipeline.predict(X_train)
test_pred = pipeline.predict(X_test)

#Evaluate the Predictions
train_score = accuracy_score(y_train, train_pred)
test_score = accuracy_score(y_test, test_pred)
precision = precision_score(y_test, test_pred)
recall = recall_score(y_test, test_pred)
F1 = f1_score(y_test, test_pred)

#Store the Result
results[name] = {
    'Train Accuracy': train_score,
    'Test Accuracy':  test_score,
    'Precision Score': precision,
    'Recall Score': recall,
    'F1-Score': F1
}

#Plot Confusion Matrix
fig, ax = plt.subplots()
cm = confusion_matrix(y_test, test_pred)
disp = ConfusionMatrixDisplay(cm)
disp.plot(ax = ax, cmap = 'Blues')
ax.set_title(name)
plt.tight_layout()
plt.show()

#Print Metrics
print(f'\n{name} Report:')
print(classification_report(y_test, test_pred))
metrics_df = pd.DataFrame(results)
print('\nSummary of Model Selection:')
print(metrics_df)

INSIGHT
- TN = 74: class 0 predicted correctly
- FP = 212: class 0 misclassified as 1
- FN = 94: class 1 misclassified as 0
- TP = 931: class 1 predicted correctly
- From the classification report, the model identified 91% of actual defaulters (recall for class 1), with an overall test accuracy of 77%.
- Train accuracy (91%) is well above test accuracy (77%), which indicates overfitting. To reduce the overfitting, the model is tuned next.
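
The 91% and 77% figures can be recomputed directly from the four confusion-matrix counts listed above. A short worked check (the variable names are illustrative):

```python
# Counts read off the confusion matrix above
tn, fp, fn, tp = 74, 212, 94, 931

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # of predicted class 1, share that is correct
recall = tp / (tp + fn)             # of actual class 1, share that is found
specificity = tn / (tn + fp)        # of actual class 0, share that is found
f1 = 2 * precision * recall / (precision + recall)

for label, value in [("Accuracy", accuracy), ("Precision", precision),
                     ("Recall", recall), ("Specificity", specificity), ("F1", f1)]:
    print(f"{label}: {value:.3f}")
```

Recall and accuracy reproduce the quoted 91% and 77%; the specificity line shows how class 0 fares, which accuracy alone does not reveal.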

In [None]:
#hyperparameter Tuning of LightGBM
pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state = 42)),
    ('classifier', LGBMClassifier(objective = 'binary', verbose = -1, random_state = 42))
])

#Randomized search
param_dist = {
    "classifier__n_estimators": randint(100, 200),
    "classifier__max_depth": randint(3, 6),
    "classifier__learning_rate": uniform(0.01, 0.1),
    "classifier__num_leaves": randint(15, 50),
    "classifier__subsample": uniform(0.6, 0.4),
    "classifier__colsample_bytree": uniform(0.6, 0.4),
    "classifier__scale_pos_weight": [1,2,5]
}

random_search = RandomizedSearchCV(
    estimator = pipeline,
    param_distributions =param_dist,
    n_iter = 10,
    cv=3,
    scoring = "f1",
    refit=True,   # single metric: the candidate with the best CV F1 is refit
    n_jobs=-1,
    verbose = 1,
    random_state = 42
)

#fit
random_search.fit(X_train, y_train)

#Results
print('Best Parameters:', random_search.best_params_)
print('Best CV F1 Score:', random_search.best_score_)
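
The `param_dist` ranges follow SciPy's conventions, which are easy to misread: `uniform(loc, scale)` samples from `[loc, loc + scale]`, and `randint(low, high)` excludes `high`. A quick check of the ranges used above, with the sample size and seed chosen arbitrarily:

```python
from scipy.stats import randint, uniform

n = 2000  # arbitrary sample size for the check

lr = uniform(0.01, 0.1).rvs(size=n, random_state=42)    # learning_rate
sub = uniform(0.6, 0.4).rvs(size=n, random_state=42)    # subsample / colsample_bytree
depth = randint(3, 6).rvs(size=n, random_state=42)      # max_depth

print(f"learning_rate in [{lr.min():.3f}, {lr.max():.3f}]")   # bounded by 0.01 and 0.11
print(f"subsample in [{sub.min():.3f}, {sub.max():.3f}]")     # bounded by 0.6 and 1.0
print("max_depth values:", sorted(set(depth.tolist())))       # drawn from 3, 4, 5
```

So `max_depth` never reaches 6 and `learning_rate` can go slightly above 0.1 with these settings.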


In [None]:
#best tuned
best_model = random_search.best_estimator_
best_model

In [None]:
#Check probability based metrics

y_pred_proba = best_model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_pred_proba)
print('ROC-AUC:', roc_auc)

#plot
fpr, tpr, threshold = roc_curve(y_test, y_pred_proba)

plt.figure(figsize = (12, 4))
plt.plot(fpr, tpr, color = 'blue', label = f'roc curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color = 'grey', linestyle= '--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - LightGBM(Tuned)')
plt.legend(loc = 'lower right')
plt.show()

FEATURE IMPORTANCE

In [None]:
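lgbm_model = best_model.named_steps['classifier']
# Names after preprocessing (num__ and cat__ columns), matching what the classifier was trained on
feature_names = best_model.named_steps['preprocessor'].get_feature_names_out()

feature_importances = lgbm_model.feature_importances_
sort = np.argsort(feature_importances)[::-1]   # most important first

plt.figure(figsize = (12, 4))
plt.bar(range(len(feature_importances)), feature_importances[sort])
plt.xticks(range(len(feature_importances)), feature_names[sort], rotation = 90)
plt.title('Feature Importance - LightGBM(Tuned)')
plt.tight_layout()
plt.show()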

SAVE THE MODEL

In [None]:
#Save the model
joblib.dump(best_model, "loan_default_model.pkl")

In [None]:
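#save the test-set predictions to csv (the fitted pipeline itself has no to_csv method)
pd.DataFrame({'prediction': best_model.predict(X_test)}, index = X_test.index).to_csv('loan_prediction.csv')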

Test the Model

In [101]:
import pandas as pd
import numpy as np
import joblib

In [102]:
model = joblib.load("loan_default_model.pkl")

In [103]:
model

In [104]:
X.head()

Unnamed: 0,loannumber,loanamount,bank_account_type,bank_name_clients,employment_status_clients,On_Time_Repayment,Late_Time_Repayment,age,interest,approveddate_year,approveddate_month,approveddate_day,approveddate_hour,approveddate_minute,creationdate_year,creationdate_month,creationdate_day,creationdate_hour,creationdate_minute,termdays_cat
0,12.0,30000.0,Other,Diamond Bank,Permanent,7.0,4.0,48.0,4500.0,2017,7,25,8,22,2017,7,25,7,22,short
1,2.0,15000.0,Savings,GT Bank,Permanent,0.0,0.0,40.0,2250.0,2017,7,5,17,4,2017,7,5,16,4,short
2,7.0,20000.0,Other,EcoBank,Permanent,3.0,3.0,40.0,2250.0,2017,7,6,14,52,2017,7,6,13,52,short
3,3.0,10000.0,Savings,First Bank,Permanent,0.0,2.0,47.0,1500.0,2017,7,27,19,0,2017,7,27,18,0,short
4,9.0,35000.0,Other,GT Bank,Permanent,8.0,0.0,39.0,4000.0,2017,7,3,23,42,2017,7,3,22,42,short


In [105]:
X['termdays_cat'].unique()

['short', 'medium', 'long']
Categories (3, object): ['short' < 'medium' < 'long']

In [106]:
p = model.named_steps['preprocessor'].get_feature_names_out()
p

array(['num__loannumber', 'num__loanamount', 'num__On_Time_Repayment',
       'num__Late_Time_Repayment', 'num__age', 'num__interest',
       'num__approveddate_year', 'num__approveddate_month',
       'num__approveddate_day', 'num__approveddate_hour',
       'num__approveddate_minute', 'num__creationdate_year',
       'num__creationdate_month', 'num__creationdate_day',
       'num__creationdate_hour', 'num__creationdate_minute',
       'cat__bank_account_type_Current', 'cat__bank_account_type_NA',
       'cat__bank_account_type_Other', 'cat__bank_account_type_Savings',
       'cat__bank_name_clients_Access Bank',
       'cat__bank_name_clients_Diamond Bank',
       'cat__bank_name_clients_EcoBank', 'cat__bank_name_clients_FCMB',
       'cat__bank_name_clients_Fidelity Bank',
       'cat__bank_name_clients_First Bank',
       'cat__bank_name_clients_GT Bank',
       'cat__bank_name_clients_Heritage Bank',
       'cat__bank_name_clients_Keystone Bank',
       'cat__bank_name_clients_NA'