## -- Development of the Credit Risk Scoring Model --

Project Overview

This notebook documents the development of a credit risk scoring model. It covers data preprocessing, exploratory data analysis, and three machine learning classifiers (Logistic Regression, Random Forest, and XGBoost) that predict loan default status (`loan_status`, where 1 marks a default). The goal is to build models that assess default risk accurately and to compare how they perform.

#### 1. Import Necessary Libraries
This section imports all necessary Python libraries for data manipulation, visualization, preprocessing, and machine learning model development.

In [3]:
# Import necessary libraries
import pandas as pd # for data manipulation and analysis.
import numpy as np # for numerical operations.
import seaborn as sns # for data visualization.
import matplotlib.pyplot as plt # for data visualization.
from sklearn.preprocessing import StandardScaler, OneHotEncoder # for feature scaling and categorical encoding.

#### 2. Load Data
The credit risk dataset is loaded into a pandas DataFrame, and its initial structure is inspected. This step confirms successful data ingestion and provides a first look at the raw data.

In [4]:
# Read the credit risk dataset from a CSV file into a pandas DataFrame.
credit_risk = pd.read_csv("/home/ephraim/Projects/Credit-Risk-Scoring-Model/data/credit_risk_dataset.csv")
credit_risk.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4


In [5]:
credit_risk.shape

(32581, 12)
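
A quick column check right after loading can catch a wrong file or a renamed header before any cleaning starts. The sketch below is illustrative only: it reads a three-row inline sample copied from the preview above instead of the real CSV, and `EXPECTED_COLUMNS` and `check_columns` are hypothetical names that do not appear in the notebook.

```python
import io

import pandas as pd

# Inline stand-in for the CSV file, using the header and rows from the preview above.
SAMPLE_CSV = """person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
"""

# Hypothetical list of the columns that the later cells rely on.
EXPECTED_COLUMNS = [
    "person_age", "person_income", "person_home_ownership", "person_emp_length",
    "loan_intent", "loan_grade", "loan_amnt", "loan_int_rate", "loan_status",
    "loan_percent_income", "cb_person_default_on_file", "cb_person_cred_hist_length",
]


def check_columns(df, expected):
    """Return the expected column names that are missing from df."""
    return [col for col in expected if col not in df.columns]


sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
missing = check_columns(sample, EXPECTED_COLUMNS)
print(sample.shape, missing)
```

An empty `missing` list means every column used later in the notebook is present.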

#### 3. Data Prep & Descriptive Stats: Outliers
This section focuses on identifying and handling outliers in key numerical features, specifically person_age and person_emp_length. Outliers can disproportionately influence model training, so their removal or capping is crucial for model performance.

In [6]:
# The .describe() method generates descriptive statistics for the numerical columns in the DataFrame.
# This helps in identifying potential outliers and data errors.
credit_risk.describe()
# min age 20 makes sense, age 144 doesn't make sense --> data error --> delete probably
# person_income --> plot and see
# person_emp_length --> 123 years doesn't make sense --> delete
# loan status --> target value
# loan_percent_income: loan_amnt / person_income, stored as a ratio (e.g. 35000 / 59000 = 0.59), not multiplied by 100
# cb_person_cred_hist_length --> max 30 years

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_cred_hist_length
count,32581.0,32581.0,31686.0,32581.0,29465.0,32581.0,32581.0,32581.0
mean,27.7346,66074.85,4.789686,9589.371106,11.011695,0.218164,0.170203,5.804211
std,6.348078,61983.12,4.14263,6322.086646,3.240459,0.413006,0.106782,4.055001
min,20.0,4000.0,0.0,500.0,5.42,0.0,0.0,2.0
25%,23.0,38500.0,2.0,5000.0,7.9,0.0,0.09,3.0
50%,26.0,55000.0,4.0,8000.0,10.99,0.0,0.15,4.0
75%,30.0,79200.0,7.0,12200.0,13.47,0.0,0.23,8.0
max,144.0,6000000.0,123.0,35000.0,23.22,1.0,0.83,30.0


In [7]:
# Create copy to ensure that the original data remains unchanged during the data cleaning and preprocessing steps.
credit_risk_copy = credit_risk.copy()

In [8]:
# credit_risk.pivot_table(index='person_age', columns='loan_status', values = 'person_income',aggfunc='count').reset_index().sort_values(by='person_age',ascending=False)
## age greater than 70 --> no defaulter --> delete records with age > 70

In [9]:
# Keep only rows where 'person_age' is at most 70 and renumber the index.
cr_age_rmvd = credit_risk[credit_risk['person_age'] <= 70].reset_index(drop=True)
# Shows the DataFrame's shape before and after the removal
print(credit_risk.shape)
print(cr_age_rmvd.shape)
cr_age_rmvd.head()


(32581, 12)
(32568, 12)


Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4


In [10]:
# cr_age_rmvd.pivot_table(index='person_emp_length', columns='loan_status',values='person_income',aggfunc='count').reset_index().sort_values(by='person_emp_length', ascending=False)
## shows outliers for person_emp_length

In [11]:
# Keep records with an employment length of at most 47 years.
# Note: rows where person_emp_length is missing (NaN) also fail this comparison and are dropped.
person_emp_rmvd = cr_age_rmvd[cr_age_rmvd['person_emp_length'] <= 47].reset_index(drop=True)
# Confirm removal of records
print(cr_age_rmvd.shape)
print(person_emp_rmvd.shape)
print(person_emp_rmvd.shape)
person_emp_rmvd.head()

(32568, 12)
(31671, 12)


Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
1,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
2,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
3,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4
4,21,9900,OWN,2.0,VENTURE,A,2500,7.14,1,0.25,N,2
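
The row count above drops by 897, although only a couple of employment lengths are implausible. The reason is that a comparison against `NaN` evaluates to `False`, so rows with a missing `person_emp_length` are filtered out as well. A minimal sketch on a toy frame (the values are made up for illustration) shows the effect and one way to keep the missing rows if that is preferred:

```python
import numpy as np
import pandas as pd

# Toy column: one valid value, one outlier, one missing value, one boundary value.
toy = pd.DataFrame({"person_emp_length": [5.0, 123.0, np.nan, 47.0]})

# NaN <= 47 is False, so the missing row is dropped together with the outlier.
kept = toy[toy["person_emp_length"] <= 47]

# Keeping missing values requires an explicit isna() condition.
kept_with_nan = toy[(toy["person_emp_length"] <= 47) | toy["person_emp_length"].isna()]

print(len(kept), len(kept_with_nan))
```

Whether the missing rows should be dropped or imputed is a modelling decision. The notebook drops them.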


In [12]:
# updated overview of the dataframe, with outliers in employment length and age removed
person_emp_rmvd.describe()

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_cred_hist_length
count,31671.0,31671.0,31671.0,31671.0,28626.0,31671.0,31671.0,31671.0
mean,27.717754,66492.31,4.780714,9660.637492,11.04007,0.215497,0.169621,5.804395
std,6.159859,52774.13,4.028718,6334.716643,3.229507,0.411173,0.106275,4.048776
min,20.0,4000.0,0.0,500.0,5.42,0.0,0.0,2.0
25%,23.0,39366.0,2.0,5000.0,7.9,0.0,0.09,3.0
50%,26.0,56000.0,4.0,8000.0,10.99,0.0,0.15,4.0
75%,30.0,80000.0,7.0,12500.0,13.48,0.0,0.23,8.0
max,70.0,2039784.0,38.0,35000.0,23.22,1.0,0.83,30.0


#### 4. Data Prep & Descriptive Stats: Missing Values
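This section addresses the remaining missing values, which occur only in the loan_int_rate column (rows with a missing person_emp_length were already dropped by the employment-length filter above). The missing interest rates are imputed with the median so that the dataset is complete for model training.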



In [13]:
# count missing values to see what to do with them
person_emp_rmvd.isnull().sum()
# see loan_int_rate in descriptive statistics table above: mean and median are close to each other --> impute missing values with mean or median

person_age                       0
person_income                    0
person_home_ownership            0
person_emp_length                0
loan_intent                      0
loan_grade                       0
loan_amnt                        0
loan_int_rate                 3045
loan_status                      0
loan_percent_income              0
cb_person_default_on_file        0
cb_person_cred_hist_length       0
dtype: int64

In [14]:
# Create copy to ensure that the original data remains unchanged during the data cleaning and preprocessing steps.
cr_data = person_emp_rmvd.copy()
# fill missing values with median
cr_data.fillna({'loan_int_rate':cr_data['loan_int_rate'].median()},inplace=True)

In [15]:
# check for missing values
cr_data.isnull().sum()

person_age                    0
person_income                 0
person_home_ownership         0
person_emp_length             0
loan_intent                   0
loan_grade                    0
loan_amnt                     0
loan_int_rate                 0
loan_status                   0
loan_percent_income           0
cb_person_default_on_file     0
cb_person_cred_hist_length    0
dtype: int64

In [16]:
cr_data.describe()

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_cred_hist_length
count,31671.0,31671.0,31671.0,31671.0,31671.0,31671.0,31671.0,31671.0
mean,27.717754,66492.31,4.780714,9660.637492,11.035256,0.215497,0.169621,5.804395
std,6.159859,52774.13,4.028718,6334.716643,3.070364,0.411173,0.106275,4.048776
min,20.0,4000.0,0.0,500.0,5.42,0.0,0.0,2.0
25%,23.0,39366.0,2.0,5000.0,8.49,0.0,0.09,3.0
50%,26.0,56000.0,4.0,8000.0,10.99,0.0,0.15,4.0
75%,30.0,80000.0,7.0,12500.0,13.16,0.0,0.23,8.0
max,70.0,2039784.0,38.0,35000.0,23.22,1.0,0.83,30.0
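
The statistics above show the typical side effect of median imputation: the median of loan_int_rate stays at 10.99 while its standard deviation shrinks, because every filled value sits at the centre of the distribution. The sketch below reproduces this on a toy column with scikit-learn's `SimpleImputer`. It is one possible alternative to `fillna`, not the method used in this notebook, and the toy values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy interest rates with two missing entries.
rates = pd.DataFrame({"loan_int_rate": [5.42, 7.9, np.nan, 10.99, 13.47, np.nan, 23.22]})

# The imputer learns the median during fit and stores it in statistics_.
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(rates)  # returns a 2-D NumPy array

learned_median = imputer.statistics_[0]
std_before = rates["loan_int_rate"].std()
std_after = pd.Series(filled[:, 0]).std()

print(learned_median, int(np.isnan(filled).sum()))
print(std_before > std_after)
```

Because the fitted imputer keeps the learned median, the same value can later be applied to new data with `imputer.transform`.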


#### 5. Check for Class Imbalance and Remaining Variables
This section analyzes the distribution of the target variable (loan_status) to identify class imbalance. It also examines the distribution of other categorical features to understand their unique values and potential for encoding.

In [17]:
# See how many records of each target value exist
cr_data.groupby('loan_status').count()['person_age']

loan_status
0    24846
1     6825
Name: person_age, dtype: int64

In [18]:
# Calculate the share of defaulters among all records
6825/(6825+24846)

0.21549682675002368
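
The same share can be obtained without typing the counts by hand. The sketch below rebuilds a stand-in `loan_status` column from the two counts shown above and uses `value_counts(normalize=True)`; on the real data the call would be applied to `cr_data['loan_status']` directly.

```python
import numpy as np
import pandas as pd

# Stand-in target column built from the class counts above (24846 zeros, 6825 ones).
loan_status = pd.Series(np.repeat([0, 1], [24846, 6825]), name="loan_status")

# normalize=True returns relative frequencies instead of counts.
class_share = loan_status.value_counts(normalize=True)
print(class_share.round(4))
```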

In [19]:
cr_data.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
1,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
2,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
3,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4
4,21,9900,OWN,2.0,VENTURE,A,2500,7.14,1,0.25,N,2


In [20]:
cr_data.groupby('person_home_ownership').count()['loan_intent']

person_home_ownership
MORTGAGE    13088
OTHER         107
OWN          2410
RENT        16066
Name: loan_intent, dtype: int64

In [21]:
cr_data.groupby('loan_intent').count()['person_home_ownership']

loan_intent
DEBTCONSOLIDATION    5064
EDUCATION            6288
HOMEIMPROVEMENT      3510
MEDICAL              5891
PERSONAL             5365
VENTURE              5553
Name: person_home_ownership, dtype: int64

In [22]:
cr_data.groupby('loan_grade').count()['person_home_ownership']

loan_grade
A    10365
B    10181
C     6318
D     3555
E      952
F      236
G       64
Name: person_home_ownership, dtype: int64

In [23]:
# remove loan grade assuming information is not available at moment of credit scoring
cr_data_copy = cr_data.drop('loan_grade', axis=1)
display(cr_data_copy.shape)
display(cr_data_copy.head())

(31671, 11)

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,21,9600,OWN,5.0,EDUCATION,1000,11.14,0,0.1,N,2
1,25,9600,MORTGAGE,1.0,MEDICAL,5500,12.87,1,0.57,N,3
2,23,65500,RENT,4.0,MEDICAL,35000,15.23,1,0.53,N,2
3,24,54400,RENT,8.0,MEDICAL,35000,14.27,1,0.55,Y,4
4,21,9900,OWN,2.0,VENTURE,2500,7.14,1,0.25,N,2


#### 6. Categorical Feature Treatment
Here, categorical features are converted into a numerical format using one-hot encoding. This transformation is essential for machine learning algorithms, which typically require numerical input.

In [24]:
cr_data_cat_treated = cr_data_copy.copy()

In [25]:
## Verify the different values of different features
# cr_data_cat_treated.groupby('person_home_ownership').count()['person_age']
# cr_data_cat_treated.groupby('loan_intent').count()['person_age']
cr_data_cat_treated.groupby('cb_person_default_on_file').count()['person_age']

cb_person_default_on_file
N    26043
Y     5628
Name: person_age, dtype: int64

In [26]:
# One-hot encode the 'person_home_ownership' column (MORTGAGE is the dropped baseline, represented by all zeros)
person_home_ownership = pd.get_dummies(cr_data_cat_treated['person_home_ownership'], drop_first = True).astype(int)
print(person_home_ownership.head())
person_home_ownership.columns = ['OTHER','OWN','RENT']
print(person_home_ownership.head())

   OTHER  OWN  RENT
0      0    1     0
1      0    0     0
2      0    0     1
3      0    0     1
4      0    1     0
   OTHER  OWN  RENT
0      0    1     0
1      0    0     0
2      0    0     1
3      0    0     1
4      0    1     0


In [27]:
# One-hot encode the 'loan_intent' column (DEBTCONSOLIDATION is represented by all zeros)
loan_intent = pd.get_dummies(cr_data_cat_treated['loan_intent'], drop_first = True).astype(int)
print(loan_intent.head())
loan_intent.columns = ['EDUCATION', 'HOMEIMPROVEMENT', 'MEDICAL', 'PERSONAL', 'VENTURE']
loan_intent.head()

   EDUCATION  HOMEIMPROVEMENT  MEDICAL  PERSONAL  VENTURE
0          1                0        0         0        0
1          0                0        1         0        0
2          0                0        1         0        0
3          0                0        1         0        0
4          0                0        0         0        1


Unnamed: 0,EDUCATION,HOMEIMPROVEMENT,MEDICAL,PERSONAL,VENTURE
0,1,0,0,0,0
1,0,0,1,0,0
2,0,0,1,0,0
3,0,0,1,0,0
4,0,0,0,0,1


In [28]:
# Binary-encode 'cb_person_default_on_file' into a new column 'cb_person_default_on_file_binary' (Y = 1, N = 0)
cr_data_cat_treated['cb_person_default_on_file_binary'] = np.where(cr_data_cat_treated['cb_person_default_on_file']=='Y',1,0)
cr_data_cat_treated.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,cb_person_default_on_file_binary
0,21,9600,OWN,5.0,EDUCATION,1000,11.14,0,0.1,N,2,0
1,25,9600,MORTGAGE,1.0,MEDICAL,5500,12.87,1,0.57,N,3,0
2,23,65500,RENT,4.0,MEDICAL,35000,15.23,1,0.53,N,2,0
3,24,54400,RENT,8.0,MEDICAL,35000,14.27,1,0.55,Y,4,1
4,21,9900,OWN,2.0,VENTURE,2500,7.14,1,0.25,N,2,0
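
The three encoding steps above can also be written as a single `pd.get_dummies` call over several columns. This is a sketch on a made-up four-row frame, not the notebook's code. Note that it produces prefixed column names such as `person_home_ownership_OWN` rather than the bare `OWN` used later in the notebook.

```python
import pandas as pd

# Toy frame with the three categorical columns (values are illustrative).
toy = pd.DataFrame({
    "person_home_ownership": ["OWN", "MORTGAGE", "RENT", "OTHER"],
    "loan_intent": ["EDUCATION", "MEDICAL", "DEBTCONSOLIDATION", "VENTURE"],
    "cb_person_default_on_file": ["N", "N", "Y", "N"],
})

# drop_first=True removes the alphabetically first category of each column,
# which becomes the all-zeros baseline (MORTGAGE and DEBTCONSOLIDATION here).
encoded = pd.get_dummies(
    toy,
    columns=["person_home_ownership", "loan_intent"],
    drop_first=True,
    dtype=int,
)

# A two-level flag only needs a single 0/1 column.
encoded["cb_person_default_on_file"] = (encoded["cb_person_default_on_file"] == "Y").astype(int)

print(encoded.columns.tolist())
```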


#### 7. Standardize Numerical Features
Numerical features are standardized (scaled) to have a mean of 0 and a standard deviation of 1. This process helps prevent features with larger numerical ranges from dominating the learning process of certain algorithms.

In [29]:
# Drop the categorical features and the target so that only numerical features are scaled
data_to_scale = cr_data_cat_treated.drop(['person_home_ownership','loan_intent','loan_status', 'cb_person_default_on_file', 'cb_person_default_on_file_binary'], axis = 1)
data_to_scale.head()

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_cred_hist_length
0,21,9600,5.0,1000,11.14,0.1,2
1,25,9600,1.0,5500,12.87,0.57,3
2,23,65500,4.0,35000,15.23,0.53,2
3,24,54400,8.0,35000,14.27,0.55,4
4,21,9900,2.0,2500,7.14,0.25,2


In [30]:
# Scaler to be used
scaler = StandardScaler()
# Formula for Scaling: (x - mean of x) / std of x

In [31]:
# Verify features to be scaled
data_to_scale.columns

Index(['person_age', 'person_income', 'person_emp_length', 'loan_amnt',
       'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length'],
      dtype='object')

In [32]:
# Standardize the numerical features.
# StandardScaler is used to remove the mean and scale to unit variance.
# .fit_transform() calculates the mean and standard deviation from the data and then applies the scaling.
scaled_data = scaler.fit_transform(data_to_scale)
# Create a new DataFrame from the scaled data.
# The scaled data is a NumPy array, so it is converted back to a DataFrame.
scaled_df = pd.DataFrame(scaled_data, columns =['person_age', 'person_income', 'person_emp_length', 'loan_amnt',
      'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length'])
scaled_df.head()

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_cred_hist_length
0,-1.090587,-1.078051,0.054432,-1.367192,0.034115,-0.655113,-0.939656
1,-0.441211,-1.078051,-0.938456,-0.65681,0.597575,3.767461,-0.692664
2,-0.765899,-0.018803,-0.19379,4.000141,1.366226,3.391072,-0.939656
3,-0.603555,-0.229137,0.799097,4.000141,1.053554,3.579267,-0.445671
4,-1.090587,-1.072366,-0.690234,-1.130398,-1.268682,0.756347,-0.939656


In [33]:
# data is scaled
print(round(np.mean(scaled_df.person_age),2))
round(np.std(scaled_df.person_income),2)

-0.0


np.float64(1.0)
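
The check above uses `np.std` rather than the pandas `.std()` method for a reason: `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also the NumPy default, whereas pandas defaults to the sample standard deviation (`ddof=1`). A small sketch on five ages from the preview confirms that the scaler matches the formula `(x - mean of x) / std of x`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Five ages from the data preview, as a single-column 2-D array.
ages = np.array([[21.0], [25.0], [23.0], [24.0], [21.0]])

by_scaler = StandardScaler().fit_transform(ages)

# Manual z-score; np.std uses ddof=0 by default, like StandardScaler.
by_formula = (ages - ages.mean(axis=0)) / ages.std(axis=0)

matches = np.allclose(by_scaler, by_formula)
print(matches)
```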

In [34]:
scaled_df.shape

(31671, 7)

In [35]:
# Include categorical variables again
scaled_data_combined = pd.concat([scaled_df, person_home_ownership, loan_intent],axis=1)
print(scaled_data_combined.shape)
scaled_data_combined['cb_person_default_on_file']= cr_data_cat_treated['cb_person_default_on_file_binary']
scaled_data_combined['loan_status'] = cr_data_cat_treated['loan_status']
scaled_data_combined.head()

(31671, 15)


Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_cred_hist_length,OTHER,OWN,RENT,EDUCATION,HOMEIMPROVEMENT,MEDICAL,PERSONAL,VENTURE,cb_person_default_on_file,loan_status
0,-1.090587,-1.078051,0.054432,-1.367192,0.034115,-0.655113,-0.939656,0,1,0,1,0,0,0,0,0,0
1,-0.441211,-1.078051,-0.938456,-0.65681,0.597575,3.767461,-0.692664,0,0,0,0,0,1,0,0,0,1
2,-0.765899,-0.018803,-0.19379,4.000141,1.366226,3.391072,-0.939656,0,0,1,0,0,1,0,0,0,1
3,-0.603555,-0.229137,0.799097,4.000141,1.053554,3.579267,-0.445671,0,0,1,0,0,1,0,0,1,1
4,-1.090587,-1.072366,-0.690234,-1.130398,-1.268682,0.756347,-0.939656,0,1,0,0,0,0,0,1,0,1
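
In this notebook the scaler is fitted on the full dataset before any train/test split, so test rows contribute to the means and standard deviations. One common alternative, sketched here under assumptions, is to split first and fit the preprocessing on the training part only. The toy rows, the explicit category list, and the variable names are illustrative and not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with two numerical columns, one categorical column and the target.
toy = pd.DataFrame({
    "person_income": [9600, 9600, 65500, 54400, 9900, 78956, 12600, 153000],
    "loan_amnt": [1000, 5500, 35000, 35000, 2500, 35000, 1750, 24000],
    "person_home_ownership": ["OWN", "MORTGAGE", "RENT", "RENT", "OWN", "RENT", "OWN", "MORTGAGE"],
    "loan_status": [0, 1, 1, 1, 1, 1, 0, 1],
})
X = toy.drop(columns="loan_status")
y = toy["loan_status"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["person_income", "loan_amnt"]),
        # Listing the categories explicitly keeps the output columns stable even if
        # a category happens to be absent from the training split.
        ("cat", OneHotEncoder(categories=[["MORTGAGE", "OTHER", "OWN", "RENT"]], drop="first"),
         ["person_home_ownership"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Statistics are learned from the training rows only, then reused for the test rows.
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```

With this ordering the test rows are scaled with the training means and standard deviations, which mirrors how the model would see new applications.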


#### 8. Class Balancing (SMOTE)
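To address the identified class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) is applied. SMOTE generates synthetic samples for the minority class by interpolating between existing minority records, which balances the dataset and can improve model performance on the minority class. Note that SMOTE is applied here to the full dataset, before the train/test split in Section 9, so the test set also contains synthetic records and is balanced; the test metrics in Section 9 should be read with that in mind.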

In [36]:
# Class Imbalance
scaled_data_combined.groupby('loan_status').count()['EDUCATION']

loan_status
0    24846
1     6825
Name: EDUCATION, dtype: int64

In [37]:
# Class Imbalance
6825 / (6825+24846)

0.21549682675002368

In [39]:
# Over-sampling the minority class with SMOTE (Synthetic Minority Over-sampling Technique).
# Used to handle imbalanced datasets by creating synthetic data points.
from imblearn.over_sampling import SMOTE
smote = SMOTE()

In [40]:
# Separate features and target for over-sampling
target = scaled_data_combined['loan_status']
features = scaled_data_combined.drop('loan_status', axis = 1)

In [41]:
# Over-sample
balanced_features, balanced_targets = smote.fit_resample(features,target)

In [42]:
# Count of records before and after over-sampling
print(features.shape)
balanced_features.shape
# oversampling completed

(31671, 16)


(49692, 16)

In [43]:
# Count of targets of each class before over-sampling
scaled_data_combined.groupby('loan_status').size()

loan_status
0    24846
1     6825
dtype: int64

In [44]:
# Count of targets of each class after over-sampling
balanced_target_df = pd.DataFrame({'target':balanced_targets})
balanced_target_df.groupby('target').size()
# data is balanced

target
0    24846
1    24846
dtype: int64
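
To make the idea behind the over-sampling step concrete, the sketch below generates synthetic points by interpolating between a minority record and one of its nearest minority neighbours. It is a simplified illustration of the core SMOTE idea, not the `imblearn` implementation, and `smote_like_samples` is a hypothetical helper name.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    _, neighbour_idx = nn.kneighbors(minority)

    base = rng.integers(0, len(minority), size=n_new)   # random minority rows
    pick = rng.integers(1, k + 1, size=n_new)           # neighbour columns 1..k (0 is the row itself)
    neighbours = minority[neighbour_idx[base, pick]]
    gap = rng.random((n_new, 1))                        # interpolation weight in [0, 1)
    return minority[base] + gap * (neighbours - minority[base])


# Toy minority class: the four corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority, n_new=6)
print(synthetic.shape)
```

Because new points lie on the line between two existing records, 0/1 dummy columns such as OWN or RENT can receive fractional values. `imblearn` also provides `SMOTENC` for datasets that mix numerical and categorical features, which may be worth evaluating here.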

#### 9. Model Training and Evaluation
This section trains three classification models: Logistic Regression, Random Forest, and XGBoost. Each model is trained on 80% of the preprocessed and balanced data, and its performance is evaluated on the remaining 20% with a classification report (precision, recall, F1-score, and accuracy).

In [45]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier # a popular gradient boosting library.

In [46]:
# Split dataset in train and in test set
x_train, x_test, y_train, y_test = train_test_split(balanced_features, balanced_targets, test_size=0.20, random_state=42)
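
The split above is taken from the already balanced data. An alternative ordering, sketched here as an assumption rather than as the notebook's method, is to split the original data first with `stratify` and to over-sample only the training part. The test set then keeps the real class ratio and contains no synthetic records. The toy labels below use a 20% minority share for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 800 non-defaulters and 200 defaulters.
y = np.repeat([0, 1], [800, 200])
X = np.arange(1000).reshape(-1, 1)

# stratify=y keeps the class ratio identical in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

train_share = float(y_train.mean())
test_share = float(y_test.mean())
print(train_share, test_share)
```

Over-sampling would then be applied to `X_train` and `y_train` only, leaving `X_test` and `y_test` untouched for evaluation.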

##### 9.1 Logistic Regression
Implementation and evaluation of the Logistic Regression model. This provides a baseline understanding of linear separability and feature importance.

In [47]:
# Create logistic regression model
logit = LogisticRegression()

In [48]:
# Fit data to logistic regression model
logit.fit(x_train, y_train)

0,1,2
,penalty,'l2'
,dual,False
,tol,0.0001
,C,1.0
,fit_intercept,True
,intercept_scaling,1
,class_weight,
,random_state,
,solver,'lbfgs'
,max_iter,100


In [50]:
import pickle # for serializing Python objects, used here to save the trained models.
import os # for interacting with the operating system, e.g. building file paths.

# Full path where the fitted logistic regression model is saved.
save_path = r'/home/ephraim/Projects/Credit-Risk-Scoring-Model/models/logisticPDmodel.pkl'

# Save the model
with open(save_path, 'wb') as file:
    pickle.dump(logit, file)
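
A saved model is only useful if it can be loaded again and gives the same predictions. The sketch below shows a full save and load round trip on a toy model written to a temporary directory; the file name `toy_model.pkl` and the toy data are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny separable toy problem.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    # Save the fitted model ...
    with open(path, "wb") as file:
        pickle.dump(model, file)
    # ... and load it back.
    with open(path, "rb") as file:
        restored = pickle.load(file)

same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.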

In [51]:
# Calculate the model's accuracy score on the training data.
logit.score(x_train,y_train)

0.7819787185872764

In [52]:
# Predict the class labels for the test data.
logit_prediction = logit.predict(x_test)

In [53]:
# Print a classification report to evaluate the model's performance on the test data.
print(classification_report(y_test, logit_prediction))

              precision    recall  f1-score   support

           0       0.78      0.78      0.78      4995
           1       0.78      0.78      0.78      4944

    accuracy                           0.78      9939
   macro avg       0.78      0.78      0.78      9939
weighted avg       0.78      0.78      0.78      9939



In [54]:
# Print the coefficient values for each feature in the logistic regression model.
print(logit.coef_[0])

[-0.04660199  0.06712773 -0.03113517 -0.69687403  0.98032013  1.41589396
 -0.01972225 -0.73272247 -2.13306561  0.44888198 -1.2254961  -0.36454253
 -0.56414702 -1.01053503 -1.55618727  0.09715312]
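
Logistic regression coefficients are easier to read as odds ratios: `exp(coef)` is the factor by which the odds of default change when a standardized feature increases by one standard deviation, or, for a dummy column, when the category applies instead of the baseline. The sketch below uses four of the coefficients printed above; the selection and the column names of `summary` are illustrative.

```python
import numpy as np
import pandas as pd

# Four coefficients copied from the output above.
coefs = pd.Series({
    "loan_percent_income": 1.415894,
    "loan_int_rate": 0.98032,
    "person_income": 0.067128,
    "OWN": -2.133066,
})

summary = pd.DataFrame({
    "coef": coefs,
    "odds_ratio": np.exp(coefs),  # > 1 raises the odds of default, < 1 lowers them
    "abs_coef": coefs.abs(),      # magnitude, used for ranking
}).sort_values("abs_coef", ascending=False)

print(summary.round(3))
```

These values describe the model fitted on the SMOTE-balanced data, so they should be read as relative effects rather than as calibrated default odds.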


In [55]:
# Create a DataFrame to store feature names and their logistic regression coefficients.
features_imp_logit = pd.DataFrame({'features': balanced_features.columns, 'logit_imp':logit.coef_[0]})
# features_imp_logit.sort_values(by='logit_imp',ascending=False)
## coefficients close to 0 indicate less influential features; the sign gives the direction of the effect

##### 9.2 Random Forest
Implementation and evaluation of the Random Forest Classifier. This ensemble method typically offers strong predictive performance and feature importance insights.

In [56]:
# Create random forest model
rf = RandomForestClassifier()

In [57]:
# Fit data to model
rf.fit(x_train, y_train)

0,1,2
,n_estimators,100
,criterion,'gini'
,max_depth,
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,'sqrt'
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,True


In [58]:
# Full path where the fitted random forest model is saved.
save_path = r'/home/ephraim/Projects/Credit-Risk-Scoring-Model/models/RandomForestPDmodel.pkl'

# Save the model
with open(save_path, 'wb') as file:
    pickle.dump(rf, file)

In [59]:
# Accuracy of the random forest model on the training data
rf.score(x_train, y_train)

1.0
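
A training accuracy of 1.0 is typical for a random forest with fully grown trees and says little about performance on unseen data. Cross-validation gives a more realistic estimate. The sketch below uses synthetic data from `make_classification` as a stand-in, so the numbers it prints are not comparable to the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the credit risk dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

rf_toy = RandomForestClassifier(n_estimators=25, random_state=42)

# Accuracy on the same rows the model was trained on.
train_accuracy = rf_toy.fit(X, y).score(X, y)

# 5-fold cross-validation: each fold is scored on rows the model did not see.
cv_scores = cross_val_score(rf_toy, X, y, cv=5, scoring="accuracy")
print(train_accuracy, cv_scores.mean())
```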

In [60]:
# Predict targets with the features of the test data
rf_prediction = rf.predict(x_test)
rf_prediction

array([0, 0, 0, ..., 0, 0, 1], shape=(9939,))

In [61]:
# Print classification report to evaluate model performance
print(classification_report(y_test, rf_prediction))

              precision    recall  f1-score   support

           0       0.91      0.97      0.94      4995
           1       0.97      0.91      0.94      4944

    accuracy                           0.94      9939
   macro avg       0.94      0.94      0.94      9939
weighted avg       0.94      0.94      0.94      9939



In [62]:
# Evaluate feature importance
rf.feature_importances_

array([0.0602934 , 0.14580095, 0.07724074, 0.08127472, 0.208456  ,
       0.20651526, 0.06406075, 0.0004416 , 0.02140507, 0.04721246,
       0.01536798, 0.01576841, 0.00849298, 0.01258271, 0.01757348,
       0.0175135 ])

In [63]:
# Create a DataFrame to store feature names and their corresponding importance scores.
features_imp_rf = pd.DataFrame({'features': balanced_features.columns, 'rf_imp':rf.feature_importances_})
# features_imp_rf.sort_values(by='rf_imp', ascending=False)
## values around 0 are less important

##### 9.3 XGBoost
Implementation and evaluation of the XGBoost Classifier, a highly efficient gradient boosting algorithm known for its accuracy and speed.

In [64]:
# Create XGBoost Model
xgb_model = XGBClassifier(tree_method = 'exact')

In [65]:
# model.fit(x,y.values.ravel())
## Fit data to model
xgb_model.fit(x_train,y_train.values.ravel())

0,1,2
,objective,'binary:logistic'
,base_score,
,booster,
,callbacks,
,colsample_bylevel,
,colsample_bynode,
,colsample_bytree,
,device,
,early_stopping_rounds,
,enable_categorical,False


In [67]:
# Full path where the fitted XGBoost model is saved.
# The file name must differ from the Random Forest one, otherwise that pickle is overwritten.
save_path = r'/home/ephraim/Projects/Credit-Risk-Scoring-Model/models/XGBoostPDmodel.pkl'

# Save the model
with open(save_path, 'wb') as file:
    pickle.dump(xgb_model, file)

In [68]:
# Accuracy of the XGBoost model on the training data
xgb_model.score(x_train,y_train.values.ravel())

0.9665182501949539

In [69]:
# Predict targets with the features of the test data
xgb_prediction = xgb_model.predict(x_test)
xgb_prediction

array([0, 0, 0, ..., 0, 0, 1], shape=(9939,))

In [70]:
# Print classification report to evaluate model performance
print(classification_report(y_test, xgb_prediction))

              precision    recall  f1-score   support

           0       0.92      0.98      0.95      4995
           1       0.98      0.91      0.94      4944

    accuracy                           0.95      9939
   macro avg       0.95      0.95      0.95      9939
weighted avg       0.95      0.95      0.95      9939



In [71]:
# Evaluate feature importance
xgb_model.feature_importances_

array([0.0365541 , 0.03623047, 0.05745076, 0.01010816, 0.07080212,
       0.1362861 , 0.06876947, 0.00623982, 0.17073427, 0.1173819 ,
       0.06108671, 0.05706625, 0.01514678, 0.04897811, 0.0903582 ,
       0.01680682], dtype=float32)

In [72]:
# Create a DataFrame to store feature names and their corresponding importance scores.
features_imp_xgb = pd.DataFrame({'features': balanced_features.columns, 'xgb_imp': xgb_model.feature_importances_})
# features_imp_xgb.sort_values(by = 'xgb_imp', ascending = False)

In [73]:
# Combine all feature importances in one dataframe
features_imp = pd.concat([features_imp_logit, features_imp_rf, features_imp_xgb], axis=1)
features_imp

Unnamed: 0,features,logit_imp,features.1,rf_imp,features.2,xgb_imp
0,person_age,-0.046602,person_age,0.060293,person_age,0.036554
1,person_income,0.067128,person_income,0.145801,person_income,0.03623
2,person_emp_length,-0.031135,person_emp_length,0.077241,person_emp_length,0.057451
3,loan_amnt,-0.696874,loan_amnt,0.081275,loan_amnt,0.010108
4,loan_int_rate,0.98032,loan_int_rate,0.208456,loan_int_rate,0.070802
5,loan_percent_income,1.415894,loan_percent_income,0.206515,loan_percent_income,0.136286
6,cb_person_cred_hist_length,-0.019722,cb_person_cred_hist_length,0.064061,cb_person_cred_hist_length,0.068769
7,OTHER,-0.732722,OTHER,0.000442,OTHER,0.00624
8,OWN,-2.133066,OWN,0.021405,OWN,0.170734
9,RENT,0.448882,RENT,0.047212,RENT,0.117382
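
Concatenating the three frames side by side repeats the `features` column, as the table above shows. Merging on `features` keeps a single name column and also guards against a different row order. A sketch with three rows copied from the table:

```python
import pandas as pd

# Three rows per model, copied from the combined table above.
features_imp_logit = pd.DataFrame({
    "features": ["person_age", "loan_int_rate", "OWN"],
    "logit_imp": [-0.046602, 0.98032, -2.133066],
})
features_imp_rf = pd.DataFrame({
    "features": ["person_age", "loan_int_rate", "OWN"],
    "rf_imp": [0.060293, 0.208456, 0.021405],
})
features_imp_xgb = pd.DataFrame({
    "features": ["person_age", "loan_int_rate", "OWN"],
    "xgb_imp": [0.036554, 0.070802, 0.170734],
})

# Merge on the feature name instead of concatenating by position.
features_imp = (
    features_imp_logit
    .merge(features_imp_rf, on="features")
    .merge(features_imp_xgb, on="features")
)
print(features_imp)
```

The three columns are still on different scales: signed coefficients for logistic regression, and importances that sum to 1 for the tree models.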


#### 10. Further Exploration
This section collects the test-set predictions of all three models and joins them back onto the original (unscaled) dataset for further analysis. Only test rows that correspond to original records can be matched, since the synthetic SMOTE samples have no counterpart in the original data.

In [74]:
# Creates a DataFrame for the XGBoost, Random Forest, and Logistic Regression models' predictions.
xgb_prediction_df = pd.DataFrame({'test_indices_xgb':x_test.index, 'xgb_pred':xgb_prediction})
rf_prediction_df = pd.DataFrame({'test_indices_rf':x_test.index, 'rf_pred':rf_prediction})
logit_prediction_df = pd.DataFrame({'test_indices_logit':x_test.index, 'logit_pred':logit_prediction})

In [75]:
xgb_prediction_df.head()

Unnamed: 0,test_indices_xgb,xgb_pred
0,24808,0
1,9935,0
2,14054,0
3,147,0
4,4070,1


In [76]:
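# Left-joins the XGBoost predictions onto the original dataset: the original row index is matched against 'test_indices_xgb'.
# Rows that were not part of the test set get NaN in 'xgb_pred'.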
merged_with_orig = cr_data_copy.merge(xgb_prediction_df, left_index = True, right_on = 'test_indices_xgb', how = 'left')
merged_with_orig.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,test_indices_xgb,xgb_pred
,21,9600,OWN,5.0,EDUCATION,1000,11.14,0,0.1,N,2,0,
9831.0,25,9600,MORTGAGE,1.0,MEDICAL,5500,12.87,1,0.57,N,3,1,1.0
,23,65500,RENT,4.0,MEDICAL,35000,15.23,1,0.53,N,2,2,
,24,54400,RENT,8.0,MEDICAL,35000,14.27,1,0.55,Y,4,3,
442.0,21,9900,OWN,2.0,VENTURE,2500,7.14,1,0.25,N,2,4,1.0


In [77]:
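# Merges the previously created DataFrame with the Random Forest predictions.
# Caution: after the merge above, the index of merged_with_orig is no longer the original row index
# (it holds values such as 9831.0 or NaN), so left_index=True does not line up with 'test_indices_rf'.
# In the output below, 'rf_pred' is NaN even for rows that have an 'xgb_pred'.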
merged_with_rf = merged_with_orig.merge(rf_prediction_df, left_index = True, right_on = 'test_indices_rf', how = 'left')
merged_with_rf.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,test_indices_xgb,xgb_pred,test_indices_rf,rf_pred
,21,9600,OWN,5.0,EDUCATION,1000,11.14,0,0.1,N,2,0,,,
,25,9600,MORTGAGE,1.0,MEDICAL,5500,12.87,1,0.57,N,3,1,1.0,9831.0,
,23,65500,RENT,4.0,MEDICAL,35000,15.23,1,0.53,N,2,2,,,
,24,54400,RENT,8.0,MEDICAL,35000,14.27,1,0.55,Y,4,3,,,
,21,9900,OWN,2.0,VENTURE,2500,7.14,1,0.25,N,2,4,1.0,442.0,


In [78]:
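# Merges the Logistic Regression predictions in the same way, giving one DataFrame with a column per model.
# The index of merged_with_rf is again not the original row index, so few rows receive a 'logit_pred'.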
merged_with_final = merged_with_rf.merge(logit_prediction_df, left_index = True, right_on = 'test_indices_logit', how = 'left')
merged_with_final.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,test_indices_xgb,xgb_pred,test_indices_rf,rf_pred,test_indices_logit,logit_pred
,21,9600,OWN,5.0,EDUCATION,1000,11.14,0,0.1,N,2,0,,,,,
,25,9600,MORTGAGE,1.0,MEDICAL,5500,12.87,1,0.57,N,3,1,1.0,9831.0,,,
,23,65500,RENT,4.0,MEDICAL,35000,15.23,1,0.53,N,2,2,,,,,
,24,54400,RENT,8.0,MEDICAL,35000,14.27,1,0.55,Y,4,3,,,,,
,21,9900,OWN,2.0,VENTURE,2500,7.14,1,0.25,N,2,4,1.0,442.0,,,
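
The chained merges above change the DataFrame index at every step, which is why most prediction columns come back empty. One way to avoid this, sketched here on toy stand-ins, is to collect all predictions in a single DataFrame indexed by the test index and to join it once. The frames `original` and `predictions` below are hypothetical; in the notebook they would correspond to `cr_data_copy` and to the three prediction arrays indexed by `x_test.index`.

```python
import pandas as pd

# Toy stand-in for the original data: five rows with index 0..4.
original = pd.DataFrame({"loan_status": [0, 1, 1, 1, 1]})

# Toy stand-in for the test predictions: rows 1 and 4 are original records,
# row 7 stands for a synthetic sample with no original counterpart.
predictions = pd.DataFrame(
    {"xgb_pred": [0, 1, 1], "rf_pred": [0, 1, 0], "logit_pred": [1, 1, 1]},
    index=pd.Index([7, 1, 4]),
)

# A single inner join on the index keeps exactly the original rows that were in the
# test set, and every model's prediction stays aligned with its own row.
with_pred = original.join(predictions, how="inner").sort_index()
print(with_pred)
```

With this approach no helper index columns are needed and no `dropna` step is required afterwards.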


In [79]:
# Views dimensions of the final merged DataFrame
merged_with_final.shape
# same number of rows as the cleaned dataset (31671), now with six additional merge columns

(31671, 17)

In [80]:
# Removes rows with any missing value, keeping only rows that received a prediction from all three merges
# (273 rows remain, see the support in the classification report below).
merged_with_final.dropna(inplace=True)

In [81]:
# Drops the temporary index columns used for merging.
final_data_with_pred = merged_with_final.drop(['test_indices_xgb','test_indices_rf','test_indices_logit'],axis=1)
final_data_with_pred.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,xgb_pred,rf_pred,logit_pred
7782.0,24,78956,RENT,5.0,MEDICAL,35000,11.11,1,0.44,N,4,1.0,0.0,1.0
7968.0,25,12600,OWN,3.0,PERSONAL,1750,13.61,0,0.14,N,3,0.0,1.0,0.0
9251.0,22,153000,MORTGAGE,5.0,DEBTCONSOLIDATION,24000,15.62,1,0.16,Y,2,1.0,1.0,0.0
5662.0,22,16094,MORTGAGE,2.0,VENTURE,5500,7.14,1,0.34,N,3,1.0,0.0,1.0
143.0,21,18000,OWN,1.0,PERSONAL,2500,11.36,0,0.14,N,3,0.0,1.0,0.0


In [83]:
# Exports the final DataFrame to an Excel file, including all predictions.
import openpyxl
final_data_with_pred.to_excel(r"/home/ephraim/Projects/Credit-Risk-Scoring-Model/predictions/pd_prediction1.xlsx",index=False)

In [84]:
# Prints the classification report for the XGBoost model on the 273 original records that remain after the merges and dropna above.
print(classification_report(final_data_with_pred['loan_status'],final_data_with_pred['xgb_pred']))

              precision    recall  f1-score   support

           0       0.91      0.97      0.94       210
           1       0.86      0.70      0.77        63

    accuracy                           0.90       273
   macro avg       0.89      0.83      0.86       273
weighted avg       0.90      0.90      0.90       273



In [85]:
from sklearn.metrics import confusion_matrix

In [86]:
# Computes and displays the confusion matrix for the XGBoost model to show the counts of correct and incorrect predictions.
confusion_matrix(final_data_with_pred['loan_status'],final_data_with_pred['xgb_pred'])

array([[203,   7],
       [ 19,  44]])
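
The figures in the classification report for class 1 can be recomputed directly from this confusion matrix. In scikit-learn's layout, rows are the true classes and columns are the predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`. A short worked example:

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[203, 7],
               [19, 44]])
tn, fp, fn, tp = (int(v) for v in cm.ravel())

precision = tp / (tp + fp)        # share of predicted defaults that were real defaults
recall = tp / (tp + fn)           # share of real defaults that were detected
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(round(precision, 2), round(recall, 2), round(f1, 2), round(float(accuracy), 2))
```

The rounded values (0.86, 0.70, 0.77, and 0.90) match the report. The 19 missed defaults (false negatives) are usually the costlier error in credit scoring, so recall on class 1 deserves particular attention.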