## Predicting Loan Default Risk
### Problem Statement:
Develop a robust machine learning pipeline to predict loan default risk, enabling better credit decisions and minimizing financial losses.
#### Target: good_bad_flag
- good - will pay back
- bad - will not pay back

In [2]:
#importing all necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Our data comprises demographic, financial and previous-loan records, each stored separately in one of 3 files. These are the links to those files:

In [None]:
data1 = 'https://raw.githubusercontent.com/Oyeniran20/axia_cohort_8/refs/heads/main/traindemographics.csv'
data2 = 'https://raw.githubusercontent.com/Oyeniran20/axia_cohort_8/refs/heads/main/trainperf.csv'
data3 = 'https://raw.githubusercontent.com/Oyeniran20/axia_cohort_8/refs/heads/main/trainprevloans.csv'

In [None]:
df1 = pd.read_csv(data1)

In [None]:
df2 = pd.read_csv(data2)

### Data understanding of first dataset (Demographics)

In [None]:
df1.head()

In [None]:
df1.info()

#### Observations:
- The dataset contains 9 columns and 4346 rows
- There are 2 numerical columns and 7 categorical columns
- The features bank_branch_clients, employment_status_clients and level_of_education_clients all contain missing values
- The following columns will be dropped as irrelevant: bank_branch_clients, longitude_gps, latitude_gps
- The birthdate column needs transformation

In [None]:
#Dropping irrelevant columns

df1 = df1.drop(columns=['longitude_gps','latitude_gps'],axis=1)

#### Checking and dealing with Duplicate values

In [None]:
# checking for duplicate values

df1.duplicated().sum()

In [None]:
#Dropping duplicates
df1 = df1.drop_duplicates()

In [None]:
df1.duplicated().sum()

In [None]:
df1.shape

#### Dealing with missing values

In [None]:
#Checking total of null values

df1.isna().sum()

In [None]:
#percentage of missing value in bank_branch
4283/4346 * 100

This column will be dropped due to high % of missing values

In [None]:
#percentage of missing values in employment_status_clients

648/4346 * 100

In [None]:
df1['employment_status_clients'] = df1['employment_status_clients'].fillna('unknown')

In [None]:
#percentage of missing values in level_of_education_clients

3748/4346 * 100

The level_of_education_clients column will be dropped due to its high percentage of missing values

In [None]:
df1 = df1.drop(columns=['level_of_education_clients','bank_branch_clients'],axis=1)
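The percentages above are typed in by hand from the `isna().sum()` output, with the original row count as the denominator. A minimal sketch of computing them directly from the frame is shown here on a toy frame; the 50% drop threshold and the `missing_percent` helper are illustrative assumptions, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

def missing_percent(frame: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, highest first."""
    return (frame.isna().mean() * 100).round(2).sort_values(ascending=False)

# Toy stand-in for the demographics table (values are made up)
toy = pd.DataFrame({
    "bank_branch_clients": [np.nan, np.nan, np.nan, "Lagos"],
    "employment_status_clients": ["Permanent", np.nan, "Student", "Permanent"],
    "bank_account_type": ["Savings", "Savings", "Other", "Current"],
})

report = missing_percent(toy)

# Assumed rule for illustration: drop columns with more than 50% missing
to_drop = report[report > 50].index.tolist()
print(report)
print(to_drop)
```

Because `isna().mean()` uses the current number of rows, the result stays correct after duplicates have been removed.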

#### Feature transformation

#### Creating new feature age from birthdate

The birthdate feature holds relevant information, but it is not useful in its current form, so we will transform it into an age.

In [None]:
df1['birthdate']

In [None]:
#getting rid of the zeros

df1['birthdate'] = df1['birthdate'].str.split(' ').str[0]

In [None]:
#Getting only the year value 

df1['birthdate'] = df1['birthdate'].str.split('-').str[0]

In [None]:
#Converting it from object to integer

df1['birthdate'] = df1['birthdate'].astype(int)

In [None]:
df1['birthdate']

In [None]:
#Current year = 2025 , so age = 2025 - birthdate

df1['age'] = 2025 - df1['birthdate']
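Subtracting the birth year from a fixed year ignores whether the birthday has already passed. As a hedged alternative, the sketch below parses the full date and computes age in completed years on an explicit reference date. The `age_in_years` helper, the reference date and the sample date strings are assumptions for illustration.

```python
import pandas as pd

def age_in_years(birthdates: pd.Series, as_of: str) -> pd.Series:
    """Age in completed years on the reference date `as_of`."""
    born = pd.to_datetime(birthdates, errors="coerce")
    ref = pd.Timestamp(as_of)
    # True where the birthday has already occurred in the reference year
    had_birthday = (born.dt.month < ref.month) | (
        (born.dt.month == ref.month) & (born.dt.day <= ref.day)
    )
    return ref.year - born.dt.year - (~had_birthday).astype(int)

# Sample values in an assumed "YYYY-MM-DD hh:mm:ss.ffffff" format
sample = pd.Series(["1990-06-15 00:00:00.000000", "2000-01-01 00:00:00.000000"])
ages = age_in_years(sample, as_of="2025-01-01")
print(ages.tolist())
```

Passing the reference date as a parameter also keeps the feature reproducible when the notebook is rerun in a later year.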

In [None]:
df1['age']

In [None]:
#Dropping the birthdate column since it is no longer needed

df1 = df1.drop('birthdate', axis=1)

In [None]:
df1

### Data understanding of second dataset (Performance)

In [None]:
df2.sample(3)

In [None]:
df2.info()

#### Observations:
- There are 10 columns and 4368 rows, with 4 numerical columns and 6 categorical columns
- referredby is the only feature with missing values
- The target column is good_bad_flag

In [None]:
#Percentage of missing values in referredby column

df2['referredby'].isna().sum() / 4368 * 100

Since the percentage is very high, the column will be dropped

In [None]:
# Dropping irrelevant columns

df2 = df2.drop(columns=['referredby'],axis=1)

In [None]:
# Checking for duplicate values

df2.duplicated().sum()

#### Creating loan_approval_min from approveddate and creationdate

In [None]:
df2['approveddate'] = df2['approveddate'].str.split('.').str[0]
df2['creationdate'] = df2['creationdate'].str.split('.').str[0]

In [None]:
df2.info()

In [None]:
df2['creationdate']

In [None]:
df2['approveddate'] = pd.to_datetime(df2['approveddate'])
df2['creationdate'] = pd.to_datetime(df2['creationdate'])

In [None]:
df2['loan_approval_speed'] = df2['approveddate'] - df2['creationdate']
df2['loan_approval_speed']

In [None]:
df2['loan_approval_min'] = (df2['loan_approval_speed'].dt.total_seconds() / 60).round(2)
df2['loan_approval_min']
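`pd.to_datetime` can parse timestamps that carry fractional seconds, so the approval time can also be computed without splitting the strings first. A small sketch on one made-up row, assuming the timestamps look like `YYYY-MM-DD hh:mm:ss.ffffff`:

```python
import pandas as pd

# One made-up loan; the timestamp format is an assumption about the raw data
toy = pd.DataFrame({
    "creationdate": ["2017-07-25 07:22:47.000000"],
    "approveddate": ["2017-07-25 08:22:56.000000"],
})

created = pd.to_datetime(toy["creationdate"])
approved = pd.to_datetime(toy["approveddate"])

# Timedelta -> seconds -> minutes, rounded to 2 decimals
toy["loan_approval_min"] = ((approved - created).dt.total_seconds() / 60).round(2)
print(toy["loan_approval_min"])
```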

In [None]:
df2.info()

### Data understanding of third dataset (Previous_records)

In [None]:
df3 = pd.read_csv(data3)

In [None]:
df3.head()

In [None]:
df3['customerid'].duplicated().sum()

#### Observations:
- There are 12 columns and 18183 rows, with only the referredby column having missing values
- It shares a common column, customerid, with the other 2 datasets
- customerid is meant to be a unique identifier, but some rows share the same customerid (one row per previous loan), so the customerid column contains duplicates

In [None]:
#Removing the time (0:00:00) from the following columns to make them more meaningful

for col in ['approveddate','creationdate','closeddate','firstduedate','firstrepaiddate']:
    df3[col] = df3[col].str.split(' ').str[0]


#### Creating a new column called overall_payment_status

Using firstduedate and firstrepaiddate, we create a column called overall_payment_status.
- firstduedate is the date on which the first payment was due
- firstrepaiddate is the date on which the customer actually made the first payment

Comparing firstrepaiddate with firstduedate tells us whether a customer paid EARLY, ONTIME or LATE.


In [None]:
# Convert the columns to date format for easy comparison

df3['firstrepaiddate']  = pd.to_datetime(df3['firstrepaiddate'] )
df3['firstduedate'] =  pd.to_datetime(df3['firstduedate'] )

In [None]:
#Function to compare dates and return payment status

def get_payment_status(row):
    if row['firstrepaiddate'] == row['firstduedate']:
        return 'ontime'
    elif row['firstrepaiddate'] < row['firstduedate']:
        return 'early'
    else:
        return 'late'

In [None]:
#Applying the function to each row in the DataFrame and creating the overall_payment_status column

df3['overall_payment_status'] = df3.apply(get_payment_status, axis=1)
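With the row-wise function above, a row with a missing date falls through to `'late'`, because comparisons against `NaT` evaluate to False. A vectorized sketch with `np.select` makes that case explicit. The `payment_status` helper and the `'unknown'` label are illustrative assumptions and not part of the original feature.

```python
import numpy as np
import pandas as pd

def payment_status(repaid: pd.Series, due: pd.Series) -> pd.Series:
    """Label each loan as early, ontime, late, or unknown (missing date)."""
    repaid = pd.to_datetime(repaid, errors="coerce")
    due = pd.to_datetime(due, errors="coerce")
    conditions = [
        repaid.isna() | due.isna(),  # checked first so NaT never becomes 'late'
        repaid < due,
        repaid == due,
    ]
    choices = ["unknown", "early", "ontime"]
    labels = np.select(conditions, choices, default="late")
    return pd.Series(labels, index=repaid.index)

# Made-up dates covering all four outcomes
repaid_dates = pd.Series(["2017-01-01", "2017-01-05", "2017-01-10", None])
due_dates = pd.Series(["2017-01-05"] * 4)
statuses = payment_status(repaid_dates, due_dates)
print(statuses.tolist())
```

If an `'unknown'` label is kept, the score mapping would need an entry for it as well.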

In [None]:
#Dropping irrelevant columns which will not be needed

df3 = df3.drop(columns=['systemloanid','approveddate','creationdate','loanamount','totaldue','termdays','referredby'], axis=1)

#### Creating payment_status_score 
The code below gives each overall payment status a score in order to later create the Repayment Behaviour Score

In [None]:
#Giving each paymentstatus a score e.g early-5

status_score = {
    'early': 5,
    'ontime': 4.5,
    'late': -5
}
df3['payment_status_score'] = df3['overall_payment_status'].map(status_score)

In [None]:
df3.head()

In [None]:
df3['late_df'] = (df3['payment_status_score'] == -5.0).astype(int)
df3['early_df'] = df3['payment_status_score'].isin([4.5,5.0]).astype(int)

In [None]:
early_payments_count = df3.groupby('customerid')['early_df'].sum().reset_index(name='early_payments')

In [None]:
late_payments_count = df3.groupby('customerid')['late_df'].sum().reset_index(name='late_payments')


In [None]:
payments_count = pd.merge(early_payments_count,late_payments_count, on='customerid',how='left')
payments_count

#### Creating repayment_score
This is the sum of payment_status_score over all rows with the same customerid

In [None]:
df3_repayment_score = df3.groupby('customerid')['payment_status_score'].sum().reset_index()
df3_repayment_score


### Merging payment_count and repayment_score together

In [None]:
df3_best = pd.merge(payments_count,df3_repayment_score,on ='customerid')
df3_best
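The three per-customer aggregates (early payments, late payments and the summed score) can also be produced in a single `groupby` with named aggregation, which avoids the intermediate merges. A minimal sketch on made-up scores; the `summarise_repayments` helper and the intermediate column names are assumptions.

```python
import pandas as pd

def summarise_repayments(prev_loans: pd.DataFrame) -> pd.DataFrame:
    """One row per customer: early/on-time count, late count, summed score."""
    scored = prev_loans.assign(
        is_late=(prev_loans["payment_status_score"] < 0).astype(int),
        is_early_or_ontime=(prev_loans["payment_status_score"] > 0).astype(int),
    )
    return scored.groupby("customerid", as_index=False).agg(
        early_payments=("is_early_or_ontime", "sum"),
        late_payments=("is_late", "sum"),
        payment_status_score=("payment_status_score", "sum"),
    )

# Made-up previous loans for two customers
toy = pd.DataFrame({
    "customerid": ["a", "a", "a", "b"],
    "payment_status_score": [5.0, -5.0, 4.5, -5.0],
})
summary = summarise_repayments(toy)
print(summary)
```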

### Merging dataset1 and dataset2

In [None]:
df4 = pd.merge(df1,df2, on ='customerid' )


#### Merging dataset4 and the engineered df3_best

In [None]:
df = pd.merge(df4,df3_best, on ='customerid')
df.head()
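Both merges use the default inner join, so customers missing from either table are dropped without warning. A hedged way to see how many rows are affected is an outer merge with `indicator=True`, shown here on toy frames. The frames and the `validate="one_to_one"` check are assumptions for illustration; the real tables may need a different `validate` value.

```python
import pandas as pd

# Toy stand-ins for two of the tables (values are made up)
demo = pd.DataFrame({"customerid": ["a", "b", "c"], "age": [30, 41, 27]})
perf = pd.DataFrame({"customerid": ["a", "b", "d"],
                     "good_bad_flag": ["Good", "Bad", "Good"]})

# indicator=True adds a `_merge` column: both / left_only / right_only
# validate raises MergeError if customerid is duplicated on either side
audit = pd.merge(demo, perf, on="customerid", how="outer",
                 indicator=True, validate="one_to_one")
counts = audit["_merge"].value_counts()
print(counts)
```

Only the rows marked `both` would survive an inner join.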


#### Creating custom Repayment-behaviour-score (RBS) from pastloans and payment_status_score
To calculate our custom Repayment-behaviour-score (RBS), we divide each customer's summed payment_status_score by their number of previous loans

In [None]:
# Number of previous loans = loannumber (count of loans including the current one) minus 1
df['pastloans'] = df['loannumber'] - 1

# Divide the summed payment_status_score by pastloans to get the RBS
df['repayment_behaviour_score'] = df['payment_status_score'] / df['pastloans'] 
df['repayment_behaviour_score'] = df['repayment_behaviour_score'].round(2) * 100
df.head()
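As a worked example of the RBS formula: a customer with a summed score of 10 over 2 previous loans gets 10 / 2 × 100 = 500. The sketch below also guards against a customer whose loannumber is 1, where pastloans would be 0 and the division would produce infinity. The helper name and the choice to return NaN in that case are assumptions, and the sketch rounds after scaling, which can differ slightly from the cell above at rounding boundaries.

```python
import pandas as pd

def repayment_behaviour_score(total_score: pd.Series, loannumber: pd.Series) -> pd.Series:
    """Average score per previous loan, scaled by 100; NaN if no previous loans."""
    past_loans = loannumber - 1
    past_loans = past_loans.where(past_loans > 0)  # 0 previous loans -> NaN
    return (total_score / past_loans * 100).round(0)

# Made-up customers: two with history, one first-time borrower
scores = pd.Series([10.0, -5.0, 5.0])
loan_numbers = pd.Series([3, 2, 1])
rbs = repayment_behaviour_score(scores, loan_numbers)
print(rbs.tolist())
```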

In [None]:
#Dropping the following columns
df = df.drop(columns=['pastloans','bank_name_clients', 'loan_approval_speed'],axis =1)

In [None]:
df.info()

In [None]:
df.to_csv('loan_defaulters.csv', index=False)


In [None]:
df = df.drop(columns=['customerid' ,'systemloanid', 'approveddate','creationdate'], axis=1)

In [None]:
df.info()

### Class imbalance

In [None]:
df['good_bad_flag'].value_counts()

In [None]:
sns.countplot(x =df['good_bad_flag'])

### Observation
The target is imbalanced: the good class has many more samples than the bad class
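The class counts can be turned into shares and a majority-to-minority ratio, which is easier to compare across datasets than the raw plot. A small sketch on made-up labels; the 8-to-2 split is illustrative and not the ratio in this dataset.

```python
import pandas as pd

# Made-up target column for illustration
flags = pd.Series(["Good"] * 8 + ["Bad"] * 2, name="good_bad_flag")

counts = flags.value_counts()
shares = flags.value_counts(normalize=True)

# How many majority-class rows there are per minority-class row
imbalance_ratio = counts.max() / counts.min()
print(shares)
print(imbalance_ratio)
```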

## Data Preprocessing

#### Encoding of target column

In [None]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder,StandardScaler
lb_encoder = LabelEncoder()

In [None]:
df['good_bad_flag'] = lb_encoder.fit_transform(df['good_bad_flag'] )

In [None]:
df['good_bad_flag']
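`LabelEncoder` assigns integer codes in sorted label order, so it is worth checking which class became 0 before reading any per-class metric. A minimal sketch, assuming the raw labels are spelled `Good` and `Bad`:

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label spellings; the encoder sorts them alphabetically
encoder = LabelEncoder()
encoded = encoder.fit_transform(["Good", "Bad", "Good", "Good"])

# classes_ lists the labels in code order: index 0 -> code 0, and so on
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

Under that assumption, class 0 is the bad (default) class, which is the class the precision, recall and F1 scores below are computed for.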

In [None]:
x = df.drop('good_bad_flag',axis=1)
y = df['good_bad_flag']

In [None]:
cat_cols = x.select_dtypes(include='object').columns.tolist()
num_cols = x.select_dtypes(include=np.number).columns.tolist()


In [None]:
num_cols

#### Splitting our Data

In [None]:
from sklearn.model_selection import train_test_split
# Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.05, random_state=42 , stratify=y)

### Creation of Pipeline

In [None]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
import joblib


preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(sparse_output=False, handle_unknown="ignore"), cat_cols)
    ]
)

# Build pipeline
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(class_weight="balanced"))
])

# Train pipeline
pipeline.fit(x_train, y_train)
# Save pipeline instead of raw model
joblib.dump(pipeline, "loan_pipeline.pkl")
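Saving the whole pipeline means the preprocessing travels with the model, so a deployed app can pass raw columns straight to `predict_proba`. The sketch below shows the round trip on a toy pipeline. The toy columns, values, labels and temporary file path are assumptions; the real pipeline would be loaded from `loan_pipeline.pkl`.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (made up) with one numeric and one categorical feature
toy = pd.DataFrame({
    "loanamount": [10000, 20000, 10000, 30000, 15000, 40000],
    "bank_account_type": ["Savings", "Other", "Savings", "Current", "Savings", "Other"],
})
labels = [1, 0, 1, 0, 1, 0]

toy_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["loanamount"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["bank_account_type"]),
    ])),
    ("model", LogisticRegression(class_weight="balanced")),
])
toy_pipeline.fit(toy, labels)

# Save and reload, as a deployment script would
path = os.path.join(tempfile.mkdtemp(), "toy_pipeline.pkl")
joblib.dump(toy_pipeline, path)
restored = joblib.load(path)

# Raw, unscaled input with a category not seen in training
new_applicant = pd.DataFrame({"loanamount": [12000], "bank_account_type": ["Unseen"]})
proba = restored.predict_proba(new_applicant)
print(proba)
```

`handle_unknown="ignore"` encodes the unseen category as all zeros instead of raising an error.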


#### Encoding and Scaling using pipeline

In [None]:
# 1. Transform raw features
x_train_transformed = preprocessor.fit_transform(x_train)
x_test_transformed = preprocessor.transform(x_test)

In [None]:
x_train

### Training Models

In [None]:
# pip install xgboost

In [None]:
# pip install lightgbm

In [None]:
# !pip show xgboost

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

#### Model Training & Evaluation

In [None]:
from sklearn.metrics import (accuracy_score,confusion_matrix,recall_score,precision_score,classification_report,f1_score,ConfusionMatrixDisplay)

In [None]:
models = {
    'Logistic Regression': LogisticRegression(class_weight='balanced'),
    'Decision Tree': DecisionTreeClassifier(random_state=42,class_weight='balanced'),
    'Random Forest': RandomForestClassifier(random_state=42,class_weight='balanced'),
    'XG Boost': XGBClassifier(random_state=42),
    'Light GBM': LGBMClassifier(random_state=42,is_unbalance=True)
}

In [None]:
results= {}
fig,axes = plt.subplots(1,5,figsize=(18,4))

for (name,model), ax in zip(models.items(),axes.flatten()):
    #training the model
    model.fit(x_train_transformed,y_train)

    #making predictions
    train_pred = model.predict(x_train_transformed)
    test_pred =model.predict(x_test_transformed)

    #evaluating model prediction
    #Accuracy:
    train_score = accuracy_score(y_train, train_pred)
    test_score = accuracy_score(y_test,test_pred)
    #Precision for class 0:
    precision = precision_score(y_test,test_pred, pos_label=0)
    #Recall for class 0:
    recall = recall_score(y_test,test_pred,pos_label=0)
    #F1 score for class 0
    f1 = f1_score(y_test,test_pred,pos_label=0)

    #storing evaluation to results dictionary
    results[name] = {
        'Train_Accuracy' : train_score,
        'Test_Accuracy' : test_score,
        'Precision_0' : precision,
        'Recall_0' : recall,
        'F1 Score_0': f1
        
    }
    #Confusion_matrix (sklearn expects y_true first, then y_pred)
    cm = confusion_matrix(y_test,test_pred)
    # cm2 = confusion_matrix(train_pred,y_train)
    
    disp = ConfusionMatrixDisplay(cm)
    # disp = ConfusionMatrixDisplay(cm2)
    disp.plot(ax=ax, cmap='Blues')
    ax.set_title(name)
plt.tight_layout()
plt.show()

metrics_df = pd.DataFrame(results)
print(metrics_df.round(3))



# good_bad_flag encoding (LabelEncoder sorts labels alphabetically):
# 0 = bad (will not pay back), 1 = good (will pay back)
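scikit-learn's `confusion_matrix(y_true, y_pred)` puts true labels on the rows and predicted labels on the columns, so swapping the arguments transposes the matrix and exchanges the precision and recall readings. A small worked example on made-up labels shows how the class-0 scores relate to the matrix cells:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 0 = bad (default), 1 = good
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted class

# Recall for class 0: correctly flagged defaulters / all true defaulters (row 0)
recall_bad = cm[0, 0] / cm[0].sum()
# Precision for class 0: correctly flagged defaulters / all predicted defaulters (column 0)
precision_bad = cm[0, 0] / cm[:, 0].sum()

print(cm)
print(recall_bad, recall_score(y_true, y_pred, pos_label=0))
print(precision_bad, precision_score(y_true, y_pred, pos_label=0))
```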

### Model improvement: balancing the classes in good_bad_flag with SMOTE

In [None]:
from imblearn.over_sampling import SMOTE
# from imblearn.over_sampling import ADASYN


In [None]:
smote = SMOTE(random_state=42)
x_train_resampled,y_train_resampled = smote.fit_resample(x_train_transformed,y_train)


In [None]:
y_train_resampled.value_counts()
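SMOTE balances the classes by creating synthetic minority samples rather than copying existing ones: each new point lies on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step with NumPy. It is not imblearn's implementation, and the points and helper name are made up.

```python
import numpy as np

def synthetic_sample(sample: np.ndarray, neighbour: np.ndarray,
                     rng: np.random.Generator) -> np.ndarray:
    """Pick a random point on the segment between a sample and its neighbour."""
    gap = rng.random()  # uniform in [0, 1)
    return sample + gap * (neighbour - sample)

rng = np.random.default_rng(42)
minority_point = np.array([1.0, 2.0])      # a minority-class sample
minority_neighbour = np.array([3.0, 6.0])  # one of its minority-class neighbours
new_point = synthetic_sample(minority_point, minority_neighbour, rng)
print(new_point)
```

Because the synthetic points are built from training rows, resampling is applied to the training split only and the test set keeps its original class balance, as in the cell above.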

In [None]:
results= {}
fig,axes = plt.subplots(1,5,figsize=(18,4))

for (name,model), ax in zip(models.items(),axes.flatten()):
    #training the model
    model.fit(x_train_resampled,y_train_resampled)

    #making predictions
    train_pred = model.predict(x_train_resampled)
    test_pred = model.predict(x_test_transformed)

    #evaluating model prediction
    #Accuracy:
    train_score = accuracy_score(y_train_resampled, train_pred)
    test_score = accuracy_score(y_test,test_pred)
    #Precision for 0:
    precision = precision_score(y_test,test_pred, pos_label=0)
    #Recall for 0:
    recall = recall_score(y_test,test_pred, pos_label=0)
    # F1 score for 0
    f1 = f1_score(y_test,test_pred, pos_label=0)
    

    #storing evaluation to results dictionary
    results[name] = {
        'Train_Accuracy' : train_score,
        'Test_Accuracy' : test_score,
        'Precision_0' : precision,
        'Recall_0' : recall,
        'F1 Score_0': f1
    }
    #Confusion_matrix (sklearn expects y_true first, then y_pred)
    cm = confusion_matrix(y_test,test_pred)
    disp = ConfusionMatrixDisplay(cm)
    disp.plot(ax=ax, cmap='Blues')
    ax.set_title(name)
plt.tight_layout()
plt.show()

metrics_df = pd.DataFrame(results)
print(metrics_df.round(3))



# good_bad_flag encoding (LabelEncoder sorts labels alphabetically):
# 0 = bad (will not pay back), 1 = good (will pay back)

In [None]:
y_test.value_counts()

## Conclusions

##### Model performance dropped after applying SMOTE 

### Logistic regression stands out:
Logistic Regression is the best-performing model for this task due to its high recall for defaulters (0.694). Despite producing more false positives, it significantly outperforms the other models at identifying actual defaulters, which aligns with the business objective of minimizing financial risk.
- As a result, Logistic Regression will be deployed

## Model deployment

In [None]:
# pip install streamlit
# Deployment was carried out in the app.py file in the folder