# Credit Risk Modelling - Default Risk Prediction

## Introduction

Credit risk is a broad topic with many different modelling approaches, as discussed in the slides. Here we use a Generalised Linear Model (GLM) with a binomial family to predict whether a client will default on next month's payment.

## The Data

[Data Source](https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset/data)

- **ID**: ID of each client
- **LIMIT_BAL**: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
- **SEX**: Gender (1=male, 2=female)
- **EDUCATION**: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
- **MARRIAGE**: Marital status (1=married, 2=single, 3=others)
- **AGE**: Age in years
- **PAY_0**: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- **PAY_2**: Repayment status in August, 2005 (scale same as above)
- **PAY_3**: Repayment status in July, 2005 (scale same as above)
- **PAY_4**: Repayment status in June, 2005 (scale same as above)
- **PAY_5**: Repayment status in May, 2005 (scale same as above)
- **PAY_6**: Repayment status in April, 2005 (scale same as above)
- **BILL_AMT1**: Amount of bill statement in September, 2005 (NT dollar)
- **BILL_AMT2**: Amount of bill statement in August, 2005 (NT dollar)
- **BILL_AMT3**: Amount of bill statement in July, 2005 (NT dollar)
- **BILL_AMT4**: Amount of bill statement in June, 2005 (NT dollar)
- **BILL_AMT5**: Amount of bill statement in May, 2005 (NT dollar)
- **BILL_AMT6**: Amount of bill statement in April, 2005 (NT dollar)
- **PAY_AMT1**: Amount of previous payment in September, 2005 (NT dollar)
- **PAY_AMT2**: Amount of previous payment in August, 2005 (NT dollar)
- **PAY_AMT3**: Amount of previous payment in July, 2005 (NT dollar)
- **PAY_AMT4**: Amount of previous payment in June, 2005 (NT dollar)
- **PAY_AMT5**: Amount of previous payment in May, 2005 (NT dollar)
- **PAY_AMT6**: Amount of previous payment in April, 2005 (NT dollar)
- **default.payment.next.month**: Default payment (1 = yes, 0 = no)


**Acknowledgements**

Any publications based on this dataset should acknowledge the following:

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

The original dataset can be found [here](http://archive.ics.uci.edu/ml) at the UCI Machine Learning Repository.


In [22]:
# Importing pandas for loading data and EDA

import pandas as pd

In [23]:
# Reading the Dataset

path_UCI = "/home/dark/VS-CodePythonProjects/DataScience-Club/Credit-Risk-Model/Data/Data_Easy/UCI_Credit_Card.csv"

df = pd.read_csv(path_UCI)

df.head()

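|   | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |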


In [24]:
# Checking Columns

df.columns

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default.payment.next.month'],
      dtype='object')

In [25]:
# Class Imbalance Check

df['default.payment.next.month'].value_counts()

default.payment.next.month
0    23364
1     6636
Name: count, dtype: int64
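The counts above imply that roughly 22% of clients default, so the classes are moderately imbalanced. As a quick sanity check, the sketch below derives the default rate and the accuracy of a trivial model that always predicts "no default", using only the two counts printed above. The variable names are illustrative.

```python
# Counts copied from the value_counts() output above.
n_no_default = 23364
n_default = 6636
n_total = n_no_default + n_default

# Share of clients who default next month (the positive class).
default_rate = n_default / n_total

# Accuracy of a trivial baseline that predicts "no default" for everyone.
majority_baseline_accuracy = n_no_default / n_total

print(f"default rate: {default_rate:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.4f}")
```

Any accuracy figure on this data is best read against that baseline rather than against 50%. On the DataFrame itself, `df['default.payment.next.month'].value_counts(normalize=True)` returns the same proportions directly.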

In [26]:
# Check For Null Values

df.isnull().sum()

ID                            0
LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default.payment.next.month    0
dtype: int64
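Missing values are not the only data-quality issue worth checking. The first rows printed earlier already show `PAY_0 = 0`, a code that the data dictionary above does not list. A minimal sketch of such a check follows. The helper `undocumented_codes` and the small `sample` dictionary (the five `EDUCATION` and `PAY_0` values from the `df.head()` output) are illustrative assumptions, not part of the original notebook.

```python
# Codes documented in the data dictionary above.
DOCUMENTED = {
    "EDUCATION": {1, 2, 3, 4, 5, 6},
    "PAY_0": {-1, 1, 2, 3, 4, 5, 6, 7, 8, 9},
}

def undocumented_codes(columns, documented):
    """Return, per column, the observed codes missing from the dictionary."""
    extra = {}
    for name, values in columns.items():
        unknown = set(values) - documented[name]
        if unknown:
            extra[name] = sorted(unknown)
    return extra

# Values taken from the five rows shown by df.head() above.
sample = {
    "EDUCATION": [2, 2, 2, 2, 2],
    "PAY_0": [2, -1, 0, 0, -1],
}

print(undocumented_codes(sample, DOCUMENTED))
```

On the full DataFrame the same idea applies with `set(df[name].unique())` for each coded column.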

## Fitting the GLM

In [27]:
# Import statsmodels

import statsmodels.api as sm

In [28]:
# Split into X (the features) and y (the binary target: 1 = default next month, 0 = no default)

X = df.drop(columns=['ID', 'default.payment.next.month'])
y = df['default.payment.next.month']

In [29]:
# Add a constant column to the features so that the model estimates an intercept

X = sm.add_constant(X)
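`sm.add_constant` prepends a column of ones (named `const` by default), so the first coefficient acts as the intercept of the linear predictor. The sketch below reproduces that step by hand and shows how a binomial GLM with a logit link turns the linear predictor into a probability. The helper names, the two feature values per row and all coefficients are made-up for illustration; they are not estimates from this model.

```python
import math

def with_constant(rows):
    """Prepend a 1.0 to every feature row (the intercept column)."""
    return [[1.0] + list(row) for row in rows]

def linear_predictor(row, coefs):
    """eta = b0 * 1 + b1 * x1 + ... for one row that includes the constant."""
    return sum(b * x for b, x in zip(coefs, row))

def inverse_logit(eta):
    """Map a linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical rows with two features (say AGE and PAY_0) and made-up coefficients.
features = [[24, 2], [57, -1]]
coefs = [-1.5, 0.01, 0.6]  # intercept first, then one coefficient per feature

for row in with_constant(features):
    eta = linear_predictor(row, coefs)
    print(row, round(eta, 3), round(inverse_logit(eta), 3))
```

Without the constant column the model would be forced through a predicted probability of 0.5 whenever all features are zero, which is rarely a sensible constraint.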

In [30]:
# Fitting the GLM

model = sm.GLM(y, X, family=sm.families.Binomial())
result = model.fit()

In [46]:
# Understanding the Results
print(result.summary())

                     Generalized Linear Model Regression Results                      
Dep. Variable:     default.payment.next.month   No. Observations:                30000
Model:                                    GLM   Df Residuals:                    29976
Model Family:                        Binomial   Df Model:                           23
Link Function:                          Logit   Scale:                          1.0000
Method:                                  IRLS   Log-Likelihood:                -13939.
Date:                        Tue, 19 Mar 2024   Deviance:                       27877.
Time:                                18:48:02   Pearson chi2:                 3.69e+04
No. Iterations:                             6   Pseudo R-squ. (CS):             0.1198
Covariance Type:                    nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------
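Several header figures in the summary can be reproduced by hand, which is a useful check on how to read them. The sketch below uses only numbers printed above (the observation count, the class counts and the rounded log-likelihood). It assumes the usual definitions for ungrouped 0/1 data: the deviance equals -2 times the log-likelihood, and the Cox-Snell pseudo R-squared is 1 - exp(2 (LL_null - LL_model) / n), where LL_null is the log-likelihood of an intercept-only model.

```python
import math

n_obs = 30000
n_default = 6636            # from the class counts above
n_features = 23             # 25 columns minus ID and the target
log_likelihood = -13939.0   # rounded value printed in the summary

# Residual degrees of freedom: observations minus (features + intercept).
df_residuals = n_obs - (n_features + 1)

# For ungrouped binary data the saturated log-likelihood is 0,
# so the deviance is -2 times the model log-likelihood.
deviance = -2.0 * log_likelihood

# Log-likelihood of an intercept-only model, which predicts the base rate.
p = n_default / n_obs
ll_null = n_default * math.log(p) + (n_obs - n_default) * math.log(1.0 - p)

# Cox-Snell pseudo R-squared.
pseudo_r2_cs = 1.0 - math.exp(2.0 * (ll_null - log_likelihood) / n_obs)

print(df_residuals, deviance, round(ll_null, 1), round(pseudo_r2_cs, 4))
```

The results agree with the printed `Df Residuals` (29976) and `Pseudo R-squ. (CS)` (0.1198). The deviance comes out one unit above the printed 27877 only because the log-likelihood is displayed rounded.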

In [47]:
# Predicted default probabilities for the training data (in-sample)
preds = result.predict(X)

In [48]:
# Classification threshold: probabilities above it are labelled as default (1)
val = 0.5

bin_out = (preds > val).astype(int)

In [49]:
bin_out.value_counts()

0    27775
1     2225
Name: count, dtype: int64
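With a 0.5 cut-off the model flags only 2,225 of 30,000 clients (about 7%) as defaulters, far fewer than the 6,636 who actually default. The cut-off is a choice, not a property of the model. The sketch below, on a handful of made-up probabilities, shows how the number of flagged clients changes with the threshold; `flag_defaults` is a hypothetical helper.

```python
def flag_defaults(probabilities, threshold):
    """Label a client 1 (default) when the predicted probability exceeds the threshold."""
    return [int(p > threshold) for p in probabilities]

# Made-up predicted probabilities for eight clients.
probs = [0.05, 0.12, 0.18, 0.25, 0.31, 0.44, 0.58, 0.83]

for threshold in (0.5, 0.3, 0.2):
    labels = flag_defaults(probs, threshold)
    print(f"threshold={threshold}: {sum(labels)} of {len(labels)} flagged")
```

Lowering the threshold typically raises recall at the cost of precision. In the notebook the equivalent change is a different value of `val`.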

In [50]:
# Per-class precision, recall and F1, evaluated on the same data used for fitting
from sklearn.metrics import classification_report

report = classification_report(y, bin_out)

print(report)

              precision    recall  f1-score   support

           0       0.82      0.97      0.89     23364
           1       0.72      0.24      0.36      6636

    accuracy                           0.81     30000
   macro avg       0.77      0.61      0.62     30000
weighted avg       0.80      0.81      0.77     30000
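The report can be tied back to confusion-matrix counts. The notebook prints the number of predicted defaulters (2,225) and the class totals, but not the number of true positives, so the sketch below assumes 1,600 true positives, a value chosen only because it is consistent with the rounded precision and recall above. Everything else follows by arithmetic.

```python
# Known from the outputs above.
actual_pos, actual_neg = 6636, 23364
predicted_pos = 2225

# Assumption: not printed in the notebook, chosen to match the rounded report.
true_pos = 1600

false_pos = predicted_pos - true_pos
false_neg = actual_pos - true_pos
true_neg = actual_neg - false_pos

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (true_pos + true_neg) / (actual_pos + actual_neg)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

An accuracy of 0.81 is only slightly above the 0.78 obtained by predicting "no default" for everyone, and a recall of 0.24 means roughly three in four defaulters are missed at this threshold.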



In [51]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score


In [52]:
# ROC-AUC is computed from the predicted probabilities, not the thresholded labels
roc_auc = roc_auc_score(y, preds)
print("ROC-AUC Score:", roc_auc)

ROC-AUC Score: 0.7242261339759195
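ROC-AUC is threshold-free: it equals the probability that a randomly chosen defaulter receives a higher predicted probability than a randomly chosen non-defaulter, with ties counting one half. A minimal pure-Python sketch of that pairwise definition follows. `pairwise_auc` and the toy data are illustrative, and the function is quadratic in the number of rows, so it is meant for understanding rather than for the full 30,000-row dataset. Note also that the score above is measured on the same rows the model was fitted on, so it is an in-sample figure.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy example: four clients, two of whom default.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which puts the model's 0.72 in context.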
