## Credit Default Prediction Model.

### Introduction:

This logistic regression model aims to predict whether a customer will default on a loan, based on financial metrics from a sample dataset. Banks and other financial institutions can use it to reduce credit risk and improve the return on their loan portfolios.

### Table of Contents:

1. Loading the required libraries.
2. Loading the dataset.
3. Exploring the dataset.
4. Data Wrangling.
5. Designing the model.
6. Model accuracy.
7. Conclusion.

### Loading the required libraries.

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import scipy

### Loading the dataset

In [5]:
df = pd.read_csv(r"D:\Personal Projects 2\Practice Files July 2024\credit_score.csv")
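
Backslashes inside an ordinary Python string literal start escape sequences, which is why Windows paths are usually written as raw strings or built with `pathlib`. A minimal sketch, assuming a hypothetical `data` folder next to the notebook (not the author's actual location):

```python
from pathlib import Path

# Hypothetical folder; point this at wherever credit_score.csv lives.
data_dir = Path("data")
csv_path = data_dir / "credit_score.csv"

# A raw string keeps every backslash literal in a Windows-style path.
windows_style = r"D:\data\credit_score.csv"

print(csv_path.name)              # file name component of the path
print(windows_style.count("\\"))  # number of literal backslashes
```

`pd.read_csv` accepts either form, a plain string or a `Path` object.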



### Exploring the dataset

In [7]:
## checking the shape of the dataset

df.shape

(1000, 87)

In [9]:
## checking the datatypes in the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 87 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   CUST_ID                  1000 non-null   object 
 1   INCOME                   1000 non-null   int64  
 2   SAVINGS                  1000 non-null   int64  
 3   DEBT                     1000 non-null   int64  
 4   R_SAVINGS_INCOME         1000 non-null   float64
 5   R_DEBT_INCOME            1000 non-null   float64
 6   R_DEBT_SAVINGS           1000 non-null   float64
 7   T_CLOTHING_12            1000 non-null   int64  
 8   T_CLOTHING_6             1000 non-null   int64  
 9   R_CLOTHING               1000 non-null   float64
 10  R_CLOTHING_INCOME        1000 non-null   float64
 11  R_CLOTHING_SAVINGS       1000 non-null   float64
 12  R_CLOTHING_DEBT          1000 non-null   float64
 13  T_EDUCATION_12           1000 non-null   int64  
 14  T_EDUCATION_6            

In [11]:
## checking the last 7 rows of the dataset

df.tail(7)

| | CUST_ID | INCOME | SAVINGS | DEBT | R_SAVINGS_INCOME | R_DEBT_INCOME | R_DEBT_SAVINGS | T_CLOTHING_12 | T_CLOTHING_6 | R_CLOTHING | ... | R_EXPENDITURE_SAVINGS | R_EXPENDITURE_DEBT | CAT_GAMBLING | CAT_DEBT | CAT_CREDIT_CARD | CAT_MORTGAGE | CAT_SAVINGS_ACCOUNT | CAT_DEPENDENTS | CREDIT_SCORE | DEFAULT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 993 | CZPS645EDZ | 192987 | 791247 | 1370208 | 4.1 | 7.1 | 1.7317 | 8304 | 3428 | 0.4128 | ... | 0.2439 | 0.1408 | No | 1 | 1 | 0 | 1 | 1 | 587 | 0 |
| 994 | CZPWHY47MO | 18830 | 2354 | 393068 | 0.125 | 20.8746 | 166.9788 | 1282 | 592 | 0.4618 | ... | 9.9987 | 0.0599 | No | 1 | 0 | 0 | 1 | 0 | 411 | 0 |
| 995 | CZQHJC9HDH | 328892 | 1465066 | 5501471 | 4.4546 | 16.7273 | 3.7551 | 16701 | 10132 | 0.6067 | ... | 0.2041 | 0.0543 | High | 1 | 1 | 1 | 1 | 1 | 418 | 0 |
| 996 | CZRA4MLB0P | 81404 | 88805 | 680837 | 1.0909 | 8.3637 | 7.6667 | 5400 | 1936 | 0.3585 | ... | 0.8333 | 0.1087 | No | 1 | 0 | 0 | 1 | 0 | 589 | 1 |
| 997 | CZSOD1KVFX | 0 | 42428 | 30760 | 3.2379 | 8.1889 | 0.725 | 0 | 0 | 0.8779 | ... | 0.25 | 0.3448 | No | 1 | 0 | 0 | 1 | 0 | 499 | 0 |
| 998 | CZWC76UAUT | 36011 | 8002 | 604181 | 0.2222 | 16.7777 | 75.5037 | 1993 | 1271 | 0.6377 | ... | 5.0002 | 0.0662 | No | 1 | 1 | 0 | 1 | 0 | 507 | 0 |
| 999 | CZZV5B3SAL | 44266 | 309859 | 44266 | 6.9999 | 1.0 | 0.1429 | 1574 | 1264 | 0.803 | ... | 0.1587 | 1.1111 | No | 1 | 0 | 0 | 1 | 0 | 657 | 0 |


In [13]:
## investigating the summary statistics of the dataset

df.describe()

| | INCOME | SAVINGS | DEBT | R_SAVINGS_INCOME | R_DEBT_INCOME | R_DEBT_SAVINGS | T_CLOTHING_12 | T_CLOTHING_6 | R_CLOTHING | R_CLOTHING_INCOME | ... | R_EXPENDITURE_INCOME | R_EXPENDITURE_SAVINGS | R_EXPENDITURE_DEBT | CAT_DEBT | CAT_CREDIT_CARD | CAT_MORTGAGE | CAT_SAVINGS_ACCOUNT | CAT_DEPENDENTS | CREDIT_SCORE | DEFAULT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | ... | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 121610.019 | 413189.6 | 790718.0 | 4.063477 | 6.068449 | 5.867252 | 6822.401 | 3466.32 | 0.454848 | 0.055557 | ... | 0.943607 | 0.91334 | 0.605276 | 0.944 | 0.236 | 0.173 | 0.993 | 0.15 | 586.712 | 0.284 |
| std | 113716.699591 | 442916.0 | 981790.4 | 3.968097 | 5.847878 | 16.788356 | 7486.225932 | 5118.942977 | 0.236036 | 0.037568 | ... | 0.168989 | 1.625278 | 1.299382 | 0.230037 | 0.424835 | 0.378437 | 0.083414 | 0.35725 | 63.413882 | 0.451162 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0034 | ... | 0.6667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 300.0 | 0.0 |
| 25% | 30450.25 | 59719.75 | 53966.75 | 1.0 | 1.4545 | 0.2062 | 1084.5 | 319.5 | 0.26395 | 0.0297 | ... | 0.8333 | 0.1587 | 0.1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 554.75 | 0.0 |
| 50% | 85090.0 | 273850.5 | 395095.5 | 2.54545 | 4.91155 | 2.0 | 4494.0 | 1304.0 | 0.46885 | 0.0468 | ... | 0.9091 | 0.32795 | 0.1786 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 596.0 | 0.0 |
| 75% | 181217.5 | 622260.0 | 1193230.0 | 6.3071 | 8.587475 | 4.5096 | 10148.5 | 4555.5 | 0.6263 | 0.0694 | ... | 1.0 | 0.8333 | 0.5882 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 630.0 | 1.0 |
| max | 662094.0 | 2911863.0 | 5968620.0 | 16.1112 | 37.0006 | 292.8421 | 43255.0 | 39918.0 | 1.0583 | 0.2517 | ... | 2.0002 | 10.0099 | 10.0053 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 800.0 | 1.0 |


### Data Wrangling

In [15]:
## checking for any null values 

df.isnull().sum()

CUST_ID                0
INCOME                 0
SAVINGS                0
DEBT                   0
R_SAVINGS_INCOME       0
                      ..
CAT_MORTGAGE           0
CAT_SAVINGS_ACCOUNT    0
CAT_DEPENDENTS         0
CREDIT_SCORE           0
DEFAULT                0
Length: 87, dtype: int64

In [17]:
## checking for any duplicates

df.duplicated().sum()

0

In [19]:
## removing any unwanted columns 

data = df.drop(["CUST_ID", "CAT_GAMBLING"], axis = 1)

data.shape

(1000, 85)
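
Dropping `CAT_GAMBLING` discards whatever signal the column carries. One alternative, shown here only as a sketch on a toy frame, is to one-hot encode it with `pd.get_dummies`. The toy values `No` and `High` are the two levels visible in the `df.tail(7)` output; the real column may contain others.

```python
import pandas as pd

# Toy frame standing in for the real data (three rows, illustrative only).
toy = pd.DataFrame({
    "INCOME": [192987, 328892, 81404],
    "CAT_GAMBLING": ["No", "High", "No"],
})

# One indicator column per level; drop_first removes the redundant first level.
encoded = pd.get_dummies(toy, columns=["CAT_GAMBLING"], drop_first=True, dtype=int)
print(encoded)
```

The encoded frame is fully numeric, so it can be passed to `LogisticRegression` without dropping the column.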

### Designing the model.

In [23]:
## loading the model

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()

In [25]:
## Defining the independent variables (features) and the dependent variable (target)

X = data.drop(["DEFAULT"], axis = 1)  ## independent variables
y = data["DEFAULT"]  ## dependent variable

In [27]:
## Dividing the dataset into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75)
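
Without a fixed seed, `train_test_split` draws a different split on every run, so all downstream numbers can change between executions. Below is a sketch of a reproducible, stratified split on synthetic stand-in data. The seed `42` and the use of `stratify` are assumptions for illustration, not settings from the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 284 of them positive (28.4%, like DEFAULT).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 284 + [0] * 716)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Stratifying matters here because defaulters are the minority class; an unlucky random split could leave the test set with too few of them.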

In [29]:
## Fitting the model

log_reg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
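
The warning above points to two remedies: more iterations or scaled inputs. Here is a minimal sketch of the second, assuming synthetic data from `make_classification` in place of the credit dataset and an arbitrary `max_iter=1000`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (not the real dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
X_demo[:, :5] *= 100_000  # mimic columns on very different scales, like INCOME

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=0
)

# The scaler learns its statistics from the training data only,
# then the same transformation is applied before every prediction.
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_tr, y_tr)

test_accuracy = scaled_log_reg.score(X_te, y_te)
print(test_accuracy)
```

Wrapping both steps in a pipeline keeps test-set statistics out of the scaler, which a separate `fit_transform` on the full data would not.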


In [31]:
## Checking for the coefficients

coefficients = log_reg.coef_
coefficients

array([[ 6.47777843e-06, -1.94111535e-07,  8.56363019e-07,
        -3.39541081e-07,  2.81444896e-07,  8.10980133e-07,
        -1.62942530e-05,  2.86346573e-05, -3.86713099e-08,
        -4.10003997e-09, -1.75762138e-09, -7.95643433e-09,
        -1.14407222e-05, -6.45560826e-06, -4.04868959e-08,
         7.49186149e-10,  1.25200517e-08, -5.72404146e-10,
        -5.93814744e-05,  5.81465683e-05, -4.03841707e-08,
        -2.41253694e-08, -2.83844772e-08, -2.95046083e-08,
         7.02211048e-08,  2.69063904e-06, -5.33896884e-08,
        -2.53353029e-11,  6.12725443e-11, -1.65670735e-11,
         1.86983464e-05,  1.26625374e-05, -3.91022989e-08,
        -1.70749057e-09, -6.76484981e-11, -1.31401958e-09,
         1.82349780e-05, -1.51192917e-05, -4.03198207e-08,
        -1.01603837e-08,  3.59334024e-09, -2.39137705e-08,
        -1.05696767e-04, -2.26532703e-05, -2.55083477e-08,
        -6.67676093e-09,  5.66950513e-10, -1.54249678e-08,
        -2.05423000e-06, -1.04141930e-06, -4.07123859e-0

In [33]:
## Investigating the intercept

intercept = log_reg.intercept_
intercept

array([-8.04452043e-08])
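
A bare coefficient array is hard to read because it carries no column names. One way to pair each weight with its feature is sketched below on a small made-up frame. The data are random, and only the column names echo the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Random toy data; the column names are borrowed from the dataset for illustration.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["INCOME", "SAVINGS", "DEBT"])
y_demo = (X_demo["DEBT"] - X_demo["SAVINGS"] + rng.normal(size=200) > 0).astype(int)

demo_model = LogisticRegression().fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary target, so take row 0.
weights = pd.Series(demo_model.coef_[0], index=X_demo.columns)

# Order the features by the absolute size of their weight.
ranked = weights.reindex(weights.abs().sort_values(ascending=False).index)
print(ranked)
```

Weights are only comparable across features when the inputs share a similar scale, which is another reason to standardize before fitting.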

In [35]:
## predicting defaults for the test set

model = log_reg.predict(X_test)
model

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
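
`predict` returns hard 0/1 labels. For a binary `LogisticRegression` these correspond to a 0.5 cut-off on the probabilities from `predict_proba`, and a lender may prefer a different cut-off. A sketch on synthetic data follows; the 0.3 threshold is an arbitrary illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced stand-in data (roughly 70/30), not the credit dataset.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.7, 0.3], random_state=0
)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Column 1 of predict_proba is the estimated probability of class 1.
prob_default = demo_model.predict_proba(X_demo)[:, 1]

default_labels = demo_model.predict(X_demo)          # implicit 0.5 cut-off
lenient_labels = (prob_default >= 0.3).astype(int)   # flags at least as many rows

print(default_labels.sum(), lenient_labels.sum())
```

Lowering the threshold trades more false alarms for fewer missed defaulters, which is often the cheaper error for a lender.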

### Model Accuracy

In [37]:
## Calculating the confusion matrix

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, model)

array([[175,  12],
       [ 51,  12]], dtype=int64)
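
scikit-learn lays out a binary confusion matrix as `[[TN, FP], [FN, TP]]`, with rows for the actual class and columns for the predicted class. The usual summary metrics follow directly from those four counts. The sketch below plugs in the matrix printed above:

```python
# Counts copied from the confusion matrix output above.
tn, fp = 175, 12
fn, tp = 51, 12

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted defaults that are real
recall = tp / (tp + fn)        # share of real defaults that are caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# prints: accuracy=0.748 precision=0.500 recall=0.190
```

`accuracy_score` and `classification_report` from `sklearn.metrics` compute the same quantities directly from `y_test` and the predictions, which avoids copying numbers by hand.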

In [39]:
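## Calculating the accuracy of the model from the confusion matrix above

(175 + 12) / (175 + 12 + 51 + 12)

0.748

### Conclusion:

The model classifies 74.8% of the test customers correctly. This figure needs context: 187 of the 250 test customers did not default, so labelling every customer as a non-defaulter would also score 74.8%, and the model flags only 12 of the 63 customers who actually defaulted. It is therefore a starting point for predicting loan default rather than a reliable predictor in its current form.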