# Default of credit card clients

This notebook analyses the default of credit card clients with Microsoft's InterpretML package. We use the [Default of Credit Card Clients Dataset](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). This dataset was uploaded to [kaggle.com](https://www.kaggle.com) in 2016 and there is no copyright on it. It contains information on Taiwanese credit card clients from April 2005 to September 2005.

## Content of the dataset

Data set size: 30000 

### Variables:

ID: ID of each client (numbers the datapoints consecutively)

LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)

SEX: Gender (1=male, 2=female)

EDUCATION: (1=graduate school, 2=university, 3=high school, 0,4,5,6 = others)

MARRIAGE: Marital status (1=married, 2=single, 3=divorce, 0=others)

AGE: Age in years

PAY_0: Repayment status in September, 2005 (-2= no consumption, -1=pay duly, 0= the use of revolving credit, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)

PAY_2: Repayment status in August, 2005 (scale same as above)

PAY_3: Repayment status in July, 2005 (scale same as above)

PAY_4: Repayment status in June, 2005 (scale same as above)

PAY_5: Repayment status in May, 2005 (scale same as above)

PAY_6: Repayment status in April, 2005 (scale same as above)

BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)

BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)

BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)

BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)

BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)

BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)

PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)

PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)

PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)

PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)

PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)

PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)

default.payment.next.month: Default payment (1=yes, 0=no)

**(Note: The variable descriptions published with the dataset were incomplete. We adjusted them based on a post by a Kaggle user who contacted the responsible professor and asked for the missing explanations. You can find the post [here](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset/discussion/34608).)**

## Aim

We want to train a model to predict the default of credit card clients. To do so, we use a logistic regression and a decision tree classifier and explain the resulting predictions with LIME. At the end, we compare the results of the decision tree and the logistic regression.

In [1]:
import pandas as pd
import numpy as np

## Load and process data

First we load the dataset from a CSV file into a pandas DataFrame. To get an overview of the dataset, we look at the first rows, the summary statistics and the shape.


In [2]:
# read csv
df = pd.read_csv('UCI_Credit_Card.csv')

In [3]:
# Overview over dataset
pd.set_option("display.max_columns", None)
print(df.head())
print('-------------------------------------')
print(df.describe())
print('-------------------------------------')
print(df.shape)
print('-------------------------------------')
df.info()

   ID  LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  \
0   1    20000.0    2          2         1   24      2      2     -1     -1   
1   2   120000.0    2          2         2   26     -1      2      0      0   
2   3    90000.0    2          2         2   34      0      0      0      0   
3   4    50000.0    2          2         1   37      0      0      0      0   
4   5    50000.0    1          2         1   57     -1      0     -1      0   

   PAY_5  PAY_6  BILL_AMT1  BILL_AMT2  BILL_AMT3  BILL_AMT4  BILL_AMT5  \
0     -2     -2     3913.0     3102.0      689.0        0.0        0.0   
1      0      2     2682.0     1725.0     2682.0     3272.0     3455.0   
2      0      0    29239.0    14027.0    13559.0    14331.0    14948.0   
3      0      0    46990.0    48233.0    49291.0    28314.0    28959.0   
4      0      0     8617.0     5670.0    35835.0    20940.0    19146.0   

   BILL_AMT6  PAY_AMT1  PAY_AMT2  PAY_AMT3  PAY_AMT4  PAY_AMT5  PAY_AMT6  \
0   

## Null values

The dataset has no null values.


In [4]:
df.isna().sum()

ID                            0
LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default.payment.next.month    0
dtype: int64

## Distribution and visualization of categorical data

To get a better idea of the data, we look at the distributions of the categorical variables and of the target.

- The majority of clients pay their bills on time.
- There are still 6636 default payments out of 30000, which is about 22% (6598 of them remain after the cleaning steps later in this notebook).


- The majority of the clients are in their twenties or thirties.
- From the age of 30 onwards, the number of clients decreases with age.


- The dataset contains more female than male clients (SEX: 1=male, 2=female). In our opinion, it is very problematic to use a feature like sex for predicting default rates, because the resulting model can discriminate by sex; nevertheless, we keep it. Later we will see that it does not play a big role anyway.


- Most people in the dataset hold a university degree.
- Overall, the educational level of the clients is quite high.

- Most people in the dataset are single or married.
- Only a small minority is divorced.

In [41]:
for i in [2,3,4,6,7,8,9,10,23]: 
    print(df.iloc[:,i].value_counts())
    print('--------')

2    14024
1    10581
3     4873
Name: EDUCATION, dtype: int64
--------
2    15738
1    13425
3      315
Name: MARRIAGE, dtype: int64
--------
29    1584
27    1449
28    1388
30    1373
26    1236
31    1203
25    1169
34    1142
32    1137
33    1124
24    1113
35    1094
36    1088
37    1021
39     937
38     935
23     914
40     846
41     809
42     780
44     688
43     664
45     598
46     555
22     551
47     491
48     456
49     441
50     401
51     332
53     317
52     297
54     241
55     207
56     175
58     122
57     120
59      81
60      66
21      64
61      56
62      44
63      31
64      30
65      24
66      23
67      16
69      15
70      10
68       5
73       4
71       3
72       3
75       3
74       1
79       1
Name: AGE, dtype: int64
--------
 0    15420
-1     5955
 2     3903
-2     3691
 3      326
 4       97
 1       28
 5       25
 7       20
 6       12
 8        1
Name: PAY_2, dtype: int64
--------
 0    15461
-1     5829
-2     3996
 2   

In [6]:
from interpret import show
from interpret.data import ClassHistogram
hist = ClassHistogram().explain_data(
    df.iloc[:,0:23], df['default.payment.next.month'] , name = 'Histogram')
show(hist)


## Rename Columns
We rename the dependent variable to default_pay because it is shorter, and we rename the column PAY_0 to PAY_1 for consistency with the other PAY_X columns.


In [7]:
df.rename(columns={"default.payment.next.month" : "default_pay", "PAY_0" : "PAY_1"}, inplace=True)
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default_pay
0,1,20000.0,2,2,1,24,2,2,-1,-1,-2,-2,3913.0,3102.0,689.0,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,0,2,2682.0,1725.0,2682.0,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,0,0,29239.0,14027.0,13559.0,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,0,0,46990.0,48233.0,49291.0,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,0,0,8617.0,5670.0,35835.0,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


## Convert NT-Dollar to Euro

The bill statement amounts and the payments are measured in New Taiwan dollars (NT dollars). To get a better feeling for the amounts, we convert them to euros (exchange rate: 1 NT dollar ≈ 0.03 euros, as of 9 June, 18:11 UTC).

In [8]:
columns = ['LIMIT_BAL', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6'
           ,'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
for x in columns:
    df[x] = df[x]*0.03


df.describe()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default_pay
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,15000.5,5024.52968,1.603733,1.853133,1.551867,35.4855,-0.0167,-0.133767,-0.1662,-0.220667,-0.2662,-0.2911,1536.699927,1475.372255,1410.394644,1297.888469,1209.342029,1166.152812,169.907415,177.634905,156.770445,144.782306,143.981629,156.465077,0.2212
std,8660.398374,3892.429847,0.489129,0.790349,0.52197,9.217904,1.123802,1.197186,1.196868,1.169139,1.133187,1.149988,2209.075817,2135.213063,2080.481623,1929.985684,1823.914673,1786.623226,496.898411,691.226112,528.208844,469.984792,458.34917,533.323973,0.415062
min,1.0,300.0,1.0,0.0,0.0,21.0,-2.0,-2.0,-2.0,-2.0,-2.0,-2.0,-4967.4,-2093.31,-4717.92,-5100.0,-2440.02,-10188.09,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7500.75,1500.0,1.0,1.0,1.0,28.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,106.7625,89.5425,79.9875,69.8025,52.89,37.68,30.0,24.99,11.7,8.88,7.575,3.5325,0.0
50%,15000.5,4200.0,2.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,0.0,0.0,671.445,636.0,602.655,571.56,543.135,512.13,63.0,60.27,54.0,45.0,45.0,45.0,0.0
75%,22500.25,7200.0,2.0,2.0,2.0,41.0,0.0,0.0,0.0,0.0,0.0,0.0,2012.73,1920.1875,1804.9425,1635.18,1505.715,1475.9475,150.18,150.0,135.15,120.3975,120.945,120.0,0.0
max,30000.0,30000.0,2.0,6.0,3.0,79.0,8.0,8.0,8.0,8.0,8.0,8.0,28935.33,29517.93,49922.67,26747.58,27815.13,28849.92,26206.56,50527.77,26881.2,18630.0,12795.87,15859.98,1.0


## Change sex 2 to 0 (female clients)

According to the variable description (1=male, 2=female), the value 2 stands for female clients. We recode it from 2 to 0 so that SEX becomes a dummy variable (0=female, 1=male); it is common to use 0/1 to differentiate between two categories.


In [9]:
# Recode SEX value 2 as 0; assigning the result back avoids chained in-place modification
df['SEX'] = df['SEX'].replace(2, 0)
df['SEX'].value_counts()

0    18112
1    11888
Name: SEX, dtype: int64

## Delete rows with values other/unknown

Even though there are no null values in the dataset, the columns EDUCATION and MARRIAGE contain other/unknown values. These are relatively rare, so we decided to delete these rows, because they don't add value to our model and we can't interpret them.


In [10]:
# Collapse the other/unknown codes 0, 4, 5 and 6 into the single code 0
df['EDUCATION'] = df['EDUCATION'].replace([4, 5, 6], 0)
df['EDUCATION'].value_counts()

2    14030
1    10585
3     4917
0      468
Name: EDUCATION, dtype: int64

In [11]:
df = df[df.EDUCATION != 0]
df['EDUCATION'].value_counts()

2    14030
1    10585
3     4917
Name: EDUCATION, dtype: int64

In [12]:
df = df[df.MARRIAGE != 0]
df['MARRIAGE'].value_counts()

2    15738
1    13425
3      315
Name: MARRIAGE, dtype: int64

## Delete ID and rearrange index

We delete the ID because, from our point of view, it is just a consecutive numbering of the datapoints.

In [13]:
df.reset_index(inplace = True)
df.drop(['index','ID'], inplace=True, axis=1)
df.tail()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default_pay
29473,6600.0,1,3,1,39,0,0,0,0,0,0,5668.44,5784.45,6250.95,2640.12,937.11,479.4,255.0,600.0,150.09,91.41,150.0,30.0,0
29474,4500.0,1,3,2,43,-1,-1,-1,-1,0,0,50.49,54.84,105.06,269.37,155.7,0.0,55.11,105.78,269.94,3.87,0.0,0.0,0
29475,900.0,1,2,2,37,4,3,2,-1,0,0,106.95,100.68,82.74,626.34,617.46,580.71,0.0,0.0,660.0,126.0,60.0,93.0,1
29476,2400.0,1,3,1,41,1,-1,0,0,0,-1,-49.35,2351.37,2289.12,1583.22,355.65,1468.32,2577.0,102.27,35.34,57.78,1588.92,54.12,1
29477,1500.0,1,2,1,46,0,0,0,0,0,0,1437.87,1467.15,1492.92,1096.05,972.84,459.39,62.34,54.0,42.9,30.0,30.0,30.0,1


## Categorize Data

We now convert the ordinal and nominal columns to the pandas category dtype, so that we can turn them into dummy variables later.

In [14]:
df['MARRIAGE'] = df['MARRIAGE'].astype('category')
df['SEX'] = df['SEX'].astype('category')
df['EDUCATION'] = df['EDUCATION'].astype('category')

for i in [1,2,3,4,5,6]:
    df['PAY_'+str(i)] =  df['PAY_'+str(i)].astype('category')

df['default_pay'] = df['default_pay'].astype('bool')

# Show changes
df.dtypes

LIMIT_BAL       float64
SEX            category
EDUCATION      category
MARRIAGE       category
AGE               int64
PAY_1          category
PAY_2          category
PAY_3          category
PAY_4          category
PAY_5          category
PAY_6          category
BILL_AMT1       float64
BILL_AMT2       float64
BILL_AMT3       float64
BILL_AMT4       float64
BILL_AMT5       float64
BILL_AMT6       float64
PAY_AMT1        float64
PAY_AMT2        float64
PAY_AMT3        float64
PAY_AMT4        float64
PAY_AMT5        float64
PAY_AMT6        float64
default_pay        bool
dtype: object
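
As a small illustration of why the columns are converted to the category dtype: `pd.get_dummies` encodes category columns and leaves numeric columns untouched. The sketch below uses a hypothetical three-row frame (`toy` and its values are made up and not part of the notebook) together with the same `prefix_sep='.'` that is used later before the split.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    "LIMIT_BAL": [600.0, 1500.0, 3600.0],
    "PAY_1": [-1, 0, 2],
})

# Mark the repayment status as categorical, as done for PAY_1 ... PAY_6 above
toy["PAY_1"] = toy["PAY_1"].astype("category")

# Category columns become one indicator column per level; numeric columns are kept as they are
encoded = pd.get_dummies(toy, prefix_sep=".")
print(list(encoded.columns))
```

Each level of PAY_1 gets its own indicator column, so the models treat the repayment status as separate categories rather than as one numeric scale.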

## Overview

After all these changes, we want an overview to check whether everything worked as intended.

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29478 entries, 0 to 29477
Data columns (total 24 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   LIMIT_BAL    29478 non-null  float64 
 1   SEX          29478 non-null  category
 2   EDUCATION    29478 non-null  category
 3   MARRIAGE     29478 non-null  category
 4   AGE          29478 non-null  int64   
 5   PAY_1        29478 non-null  category
 6   PAY_2        29478 non-null  category
 7   PAY_3        29478 non-null  category
 8   PAY_4        29478 non-null  category
 9   PAY_5        29478 non-null  category
 10  PAY_6        29478 non-null  category
 11  BILL_AMT1    29478 non-null  float64 
 12  BILL_AMT2    29478 non-null  float64 
 13  BILL_AMT3    29478 non-null  float64 
 14  BILL_AMT4    29478 non-null  float64 
 15  BILL_AMT5    29478 non-null  float64 
 16  BILL_AMT6    29478 non-null  float64 
 17  PAY_AMT1     29478 non-null  float64 
 18  PAY_AMT2     29478 non-nul

## Correlation Matrix

To show the correlation between the numeric columns, we use a correlation matrix. The most interesting findings are:

* Among the numeric columns, LIMIT_BAL has by far the strongest correlation with default_pay (about -0.15)
* The BILL_AMTX columns are highly correlated among themselves (about 0.81 to 0.95), and the correlation declines the further apart the months are; the PAY_AMTX columns are also correlated among themselves, but much more weakly (about 0.15 to 0.29 in the rows shown)
* BILL_AMT1 is more strongly correlated with default_pay than BILL_AMT2, and so on
* PAY_AMT1 is more strongly correlated with default_pay than PAY_AMT2, and so on

In [16]:
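# numeric_only=True (pandas >= 1.5) skips the category columns; the boolean target is kept
cor_matrix = df.corr(numeric_only=True)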
cor_matrix

Unnamed: 0,LIMIT_BAL,AGE,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default_pay
LIMIT_BAL,1.0,0.144144,0.284742,0.277728,0.282649,0.294452,0.296579,0.290785,0.195954,0.178087,0.210895,0.203605,0.217565,0.219985,-0.153776
AGE,0.144144,1.0,0.054975,0.052723,0.052099,0.049722,0.048309,0.046623,0.025661,0.02256,0.029109,0.021757,0.021597,0.019072,0.014109
BILL_AMT1,0.284742,0.054975,1.0,0.951523,0.892086,0.861662,0.831895,0.805335,0.140552,0.098794,0.156212,0.157828,0.164642,0.17581,-0.019322
BILL_AMT2,0.277728,0.052723,0.951523,1.0,0.927723,0.893785,0.861924,0.834376,0.281052,0.10053,0.152403,0.146623,0.155024,0.170664,-0.013894
BILL_AMT3,0.282649,0.052099,0.892086,0.927723,1.0,0.925291,0.885919,0.855902,0.244535,0.31818,0.131743,0.142448,0.177232,0.179433,-0.013637
BILL_AMT4,0.294452,0.049722,0.861662,0.893785,0.925291,1.0,0.940543,0.902546,0.233271,0.207868,0.300782,0.128677,0.159737,0.174788,-0.009458
BILL_AMT5,0.296579,0.048309,0.831895,0.861924,0.885919,0.940543,1.0,0.947345,0.218636,0.181628,0.253169,0.293384,0.140731,0.161856,-0.006279
BILL_AMT6,0.290785,0.046623,0.805335,0.834376,0.855902,0.902546,0.947345,1.0,0.201863,0.173387,0.235196,0.250379,0.307183,0.115428,-0.005292
PAY_AMT1,0.195954,0.025661,0.140552,0.281052,0.244535,0.233271,0.218636,0.201863,1.0,0.286485,0.255142,0.200196,0.149753,0.18643,-0.074011
PAY_AMT2,0.178087,0.02256,0.098794,0.10053,0.31818,0.207868,0.181628,0.173387,0.286485,1.0,0.245461,0.179567,0.18256,0.157896,-0.058363
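
The findings about the target can be checked by ranking the features by the absolute value of their correlation with default_pay. The sketch below is an illustration only: it copies the last column of the rows shown above into a plain dictionary (the name `corr_with_default` is hypothetical); with the full matrix one could sort `cor_matrix['default_pay']` in the same way.

```python
# Correlation of each feature with default_pay, copied from the rows printed above
corr_with_default = {
    "LIMIT_BAL": -0.153776,
    "AGE": 0.014109,
    "BILL_AMT1": -0.019322,
    "BILL_AMT2": -0.013894,
    "BILL_AMT3": -0.013637,
    "BILL_AMT4": -0.009458,
    "BILL_AMT5": -0.006279,
    "BILL_AMT6": -0.005292,
    "PAY_AMT1": -0.074011,
    "PAY_AMT2": -0.058363,
}

# Sort by strength of the relationship, ignoring the sign
ranked = sorted(corr_with_default.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, r in ranked:
    print(f"{name:10s} {r:+.6f}")
```

LIMIT_BAL comes first, followed by PAY_AMT1 and PAY_AMT2. All of these correlations are weak in absolute terms, so none of the numeric columns is a strong linear predictor on its own.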


## Crosstab

To analyze the dependencies of the ordinal and nominal data, we use crosstabs instead of the correlation matrix.

The most interesting findings are:
* There is a clear difference in default rates between single (about 21%) and divorced (about 27%) clients
* Highly educated clients default less
* Female clients (SEX = 0) default less than male clients (SEX = 1)
* A longer payment delay results in a higher chance of default
* The default rate is higher for clients with a payment delay in September (PAY_1) than in August (PAY_2), and so on

In [17]:
# Crosstab marriage relative by row
crosstab_marriage = pd.crosstab(df['MARRIAGE'],
                                df['default_pay'],
                               normalize = 'index')
crosstab_marriage

default_pay     False     True 
MARRIAGE                       
1            0.762458  0.237542
2            0.788728  0.211272
3            0.733333  0.266667


In [18]:
# Crosstab education relative by row
crosstab_education = pd.crosstab(df['EDUCATION'],
                                df['default_pay'],
                               normalize = 'index')
crosstab_education

default_pay     False     True 
EDUCATION                      
1            0.807580  0.192420
2            0.762621  0.237379
3            0.746973  0.253027


In [19]:
# Crosstab sex relative by row
crosstab_sex = pd.crosstab(df['SEX'],
                                df['default_pay'],
                               normalize = 'index')
crosstab_sex

default_pay     False     True 
SEX                            
0            0.789524  0.210476
1            0.755895  0.244105


In [20]:
# crosstabs for PAY

for i in range(1,7):
    crosstab_pay = pd.crosstab(df['PAY_' + str(i)],
            df['default_pay'],
            normalize = 'index')
    print(crosstab_pay)
    print('---------------')

default_pay     False     True 
PAY_1                          
-2           0.865622  0.134378
-1           0.830303  0.169697
 0           0.870811  0.129189
 1           0.657166  0.342834
 2           0.303754  0.696246
 3           0.237500  0.762500
 4           0.315789  0.684211
 5           0.458333  0.541667
 6           0.454545  0.545455
 7           0.222222  0.777778
 8           0.421053  0.578947
---------------
default_pay     False     True 
PAY_2                          
-2           0.814143  0.185857
-1           0.837951  0.162049
 0           0.839494  0.160506
 1           0.821429  0.178571
 2           0.441455  0.558545
 3           0.383436  0.616564
 4           0.484536  0.515464
 5           0.400000  0.600000
 6           0.250000  0.750000
 7           0.400000  0.600000
 8           1.000000  0.000000
---------------
default_pay     False     True 
PAY_3                          
-2           0.811812  0.188188
-1           0.841825  0.158175
 0      

## Split independent and dependent variables 

In [21]:
# independent variables X
X = df.iloc[:,0:23]
print('X_Columns:\n', list(X.columns), '\n')

# dependent variable y
y = df.iloc[:,23]
print('y-Name: ',y.name)

X_Columns:
 ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6'] 

y-Name:  default_pay


## Sample

We look at a random sample to find anomalies.

In [22]:
# look at sample
sample = df.sample(20)
sample

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default_pay
10362,900.0,1,2,2,25,-2,-2,-2,-2,-2,-2,25.08,25.08,11.7,11.7,11.7,0.0,25.08,11.7,11.7,11.7,0.0,23.4,False
20758,11400.0,1,1,1,40,0,0,0,0,0,0,6952.77,7000.71,7088.76,7115.28,7091.37,7176.63,300.0,270.0,300.63,300.0,300.0,300.0,False
5960,10500.0,0,2,1,35,-2,-2,-1,-1,-2,-2,2501.37,789.15,805.74,1202.94,325.65,-0.06,793.08,807.63,1202.94,325.65,0.06,0.0,False
17874,9000.0,1,2,1,41,0,0,0,0,0,0,4890.99,4214.16,2887.02,1934.46,3016.74,2281.68,165.15,150.0,90.0,2850.0,120.0,105.0,False
20112,6000.0,1,1,2,24,-1,-1,-1,-1,0,0,275.01,6.0,5.94,544.62,410.31,60.54,6.0,5.94,544.68,0.0,34.5,0.0,False
11783,4200.0,0,2,1,51,-2,-2,-2,-2,-2,-2,146.55,19.47,-0.27,-0.27,95.31,-0.3,19.5,0.0,0.0,95.58,0.0,0.0,False
16485,2400.0,1,3,2,40,1,2,2,2,2,2,2431.59,2380.11,2480.94,2444.01,1474.5,1539.09,9.0,186.0,64.17,0.0,120.0,0.0,False
5898,600.0,0,1,2,25,-1,0,0,0,0,0,473.7,507.18,538.08,179.85,236.55,289.71,41.16,45.0,30.0,60.0,60.0,0.0,True
5146,1500.0,0,1,2,35,0,0,0,2,0,0,921.48,893.04,934.65,903.72,899.52,888.87,48.0,129.0,0.0,33.0,45.0,45.0,False
24529,4500.0,1,1,2,39,-2,-1,-1,-1,-2,-2,262.98,176.7,30.0,0.0,0.0,0.0,178.5,30.3,0.0,0.0,0.0,0.0,False


## Split data 

To split the dataset into train and test data, we use sklearn. Before that, we have to encode the categorical variables as dummy variables.
The dataset is quite imbalanced, so we tried to solve this issue by undersampling: we kept only 6598 randomly chosen clients who do not default (the same number as clients who default). With that change the TPR increased, but the TNR and the AUC decreased considerably. So we decided to stay with the original dataset.

In [23]:
from sklearn.model_selection import train_test_split

X_enc = pd.get_dummies(X, prefix_sep='.')
X_train, X_test, y_train, y_test = train_test_split(X_enc, y, test_size=0.2, random_state=2)
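
The undersampling experiment described above is not shown in the notebook. A minimal sketch of how it could look, assuming a hypothetical helper `undersample_majority` and a made-up toy frame (neither is part of the original analysis):

```python
import pandas as pd


def undersample_majority(frame, target, random_state=0):
    """Return a frame in which every class of `target` has as many rows as the smallest class."""
    n_minority = frame[target].value_counts().min()
    # Draw the same number of rows from each class
    parts = [
        group.sample(n=n_minority, random_state=random_state)
        for _, group in frame.groupby(target)
    ]
    # Shuffle so that the classes are not stored in blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)


# Made-up toy data: 3 defaults and 7 non-defaults
toy = pd.DataFrame({
    "LIMIT_BAL": range(10),
    "default_pay": [True] * 3 + [False] * 7,
})

balanced = undersample_majority(toy, "default_pay")
print(balanced["default_pay"].value_counts())
```

When balancing like this, one would usually undersample only the training part, so that the test set keeps the original class balance and the reported TPR, TNR and AUC stay comparable.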

## Visualize differences between train and test data

We want to see whether there are big differences between the distributions of the train and test data. Despite almost 30000 datapoints, there were relatively big differences in the distributions of a few columns. In our opinion, this is caused by too few datapoints, so we tried to use a split with small differences.

In [24]:
# Histograms
hist1 = ClassHistogram().explain_data(X_train, y_train, name = 'Train Data')
show(hist1)
hist2 = ClassHistogram().explain_data(X_test, y_test, name = 'Test Data')
show(hist2)
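
One way to keep at least the share of defaults identical in both parts is a stratified split. The sketch below is an assumption about how this could be done, not what the notebook does: it uses made-up data (`X_demo`, `y_demo`) and the `stratify` argument of `train_test_split`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with the same default share as the dataset (22%)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"AGE": rng.integers(21, 80, size=1000)})
y_demo = pd.Series([True] * 220 + [False] * 780, name="default_pay")

# stratify=y_demo keeps the class proportions the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

print("default share in train:", y_tr.mean())
print("default share in test: ", y_te.mean())
```

Stratifying only controls the distribution of the target; the feature distributions can still differ between train and test, so the histograms above remain a useful check.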

# Build Classification Tree Model

For our classification tree we use the model implemented in interpret. We tried different maximal depths of the tree; the best results were achieved with max_depth = 7.
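
The notebook does not show how the depths were compared. One possible way is a cross-validated grid search on the training data. The sketch below is only an illustration under stated assumptions: it uses scikit-learn's `DecisionTreeClassifier` as a stand-in for the interpret model, synthetic data from `make_classification`, and an arbitrary list of candidate depths.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (about 22% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.78, 0.22], random_state=0
)

# Candidate depths are an arbitrary choice for this sketch
depths = [3, 5, 7, 9]

# 5-fold cross-validation, scored by the area under the ROC curve
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": depths},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_demo, y_demo)

print("best max_depth:", search.best_params_["max_depth"])
```

Choosing the depth by cross-validation on the training data keeps the test set untouched for the final evaluation.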

In [25]:
from interpret.glassbox import LogisticRegression, ClassificationTree

tree = ClassificationTree(max_depth = 7)

In [26]:
tree.fit(X_train, y_train)


<interpret.glassbox.decisiontree.ClassificationTree at 0x7f4434a25250>

In [27]:
from interpret.perf import ROC,PR
tree_roc = ROC(tree.predict_proba).explain_perf(X_test, y_test, name='ROC of Classification Tree')
show(tree_roc)

In [28]:
from sklearn.metrics import classification_report
print(classification_report(y_test, tree.predict(X_test)))

              precision    recall  f1-score   support

       False       0.84      0.95      0.89      4599
        True       0.66      0.33      0.44      1297

    accuracy                           0.82      5896
   macro avg       0.75      0.64      0.67      5896
weighted avg       0.80      0.82      0.79      5896



## Confusion Matrix

You can see in the confusion matrix that our prediction for "no default" is quite good: only 225 clients who do not default are wrongly predicted to default. The prediction of default payments is not that good: only about every third default is detected.

In [29]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, tree.predict(X_test)).ravel()
print('                    Actual=True      Actual=False \n Predicted = True: '
      ,tp,'            ', fp ,'\n','Predicted = False:',fn,'            ', tn)

                    Actual=True      Actual=False 
 Predicted = True:  434              225 
 Predicted = False: 863              4374
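
The rates discussed in this notebook follow directly from the four counts of the confusion matrix. A short worked example with the numbers printed above (the helper name `rates` is hypothetical):

```python
def rates(tp, fp, fn, tn):
    """Derive common rates from the four cells of a binary confusion matrix."""
    return {
        "TPR": tp / (tp + fn),        # share of actual defaults that are detected (recall)
        "TNR": tn / (tn + fp),        # share of actual non-defaults that are recognized
        "precision": tp / (tp + fp),  # share of predicted defaults that are real defaults
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Counts of the classification tree on the test data
tree_rates = rates(tp=434, fp=225, fn=863, tn=4374)

for name, value in tree_rates.items():
    print(f"{name}: {value:.3f}")
```

The TPR of roughly one third is the "every third default" mentioned above and matches the recall of the True class in the classification report.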


## Accuracy

There is some overfitting in our model (the training accuracy is about one percentage point higher than the test accuracy), but we think it is small and therefore not noteworthy.

In [30]:
# Accuracy
from sklearn.metrics import accuracy_score, make_scorer
print('Training accuracy:', accuracy_score(y_train, tree.predict(X_train)))
print('Test accuracy:', accuracy_score(y_test, tree.predict(X_test)))

Training accuracy: 0.8273259265541515
Test accuracy: 0.8154681139755766


## LIME

We use LimeTabular to explain individual predictions. Our observation is that the most important variables are the PAY_X columns and that their importance decreases the further back the month lies (PAY_1 > PAY_2 ...). AGE, EDUCATION and SEX are rarely among the most important variables.

In [31]:
from interpret.blackbox import LimeTabular,PartialDependence
#Blackbox explainers need a predict function, and optionally a dataset
lime = LimeTabular(predict_fn=tree.predict_proba, data=X_train,random_state=1)

#Pick the instances to explain, optionally pass in labels if you have them
lime_local = lime.explain_local(X_test[0:20], y_test[0:20], name='LIME')

show(lime_local)

# Build Logistic Regression Model

For the logistic regression, we use the model implemented in interpret as well.
To use the model, we have to raise the maximum number of iterations to 5000, otherwise the model does not converge. Raising it even further has no influence on the results of the model.

In [32]:
logR = LogisticRegression(max_iter =5000)

In [33]:
logR.fit(X_train, y_train)

<interpret.glassbox.linear.LogisticRegression at 0x7f442c935ed0>

In [34]:
logR_roc = ROC(logR.predict_proba).explain_perf(X_test, y_test, name='Logistic Regression')
show(logR_roc)

In [35]:
print(classification_report(y_test, logR.predict(X_test)))

              precision    recall  f1-score   support

       False       0.84      0.95      0.89      4599
        True       0.67      0.35      0.46      1297

    accuracy                           0.82      5896
   macro avg       0.76      0.65      0.68      5896
weighted avg       0.80      0.82      0.80      5896



## Confusion Matrix

You can see in the confusion matrix that our prediction for "no default" is quite good: only 221 clients who do not default are wrongly predicted to default. The prediction of default payments is not that good: only about every third default is detected.

In [36]:
# Confusion Matrix
tn, fp, fn, tp = confusion_matrix(y_test, logR.predict(X_test)).ravel()
print('                    Actual=True      Actual=False \n Predicted = True: '
      ,tp,'            ', fp ,'\n','Predicted = False:',fn,'            ', tn)

                    Actual=True      Actual=False 
 Predicted = True:  456              221 
 Predicted = False: 841              4378


## Accuracy

The logistic regression model shows no overfitting: the test accuracy is even slightly higher than the training accuracy.

In [37]:
#Accuracy
print('Training accuracy:', accuracy_score(y_train, logR.predict(X_train)))
print('Test accuracy:', accuracy_score(y_test, logR.predict(X_test)))

Training accuracy: 0.8179967772029514
Test accuracy: 0.8198778833107191


## LIME

We use LimeTabular to explain individual predictions. Our observation is that the most important variables are the PAY_X columns and that their importance decreases the further back the month lies (PAY_1 > PAY_2 ...). AGE, EDUCATION and SEX are rarely among the most important variables.

In [38]:
#Blackbox explainers need a predict function, and optionally a dataset
lime = LimeTabular(predict_fn=logR.predict_proba, data=X_train, random_state=1)

#Pick the instances to explain, optionally pass in labels if you have them
lime_local = lime.explain_local(X_test[0:20], y_test[0:20], name='LIME')

show(lime_local, show_all = True)

# Conclusion

After fitting the models and analyzing the individual results, we compare the logistic regression with the classification tree. In our opinion, the results of both are very similar, but the logistic regression is a little better with respect to the AUC as well as the TPR and TNR. However, the difference is not significant: we built the models a few times with different train/test splits, and the results changed quite a bit. There are also most probably more factors that influence the default of clients. So for a final result, we would need more datapoints as well as more details about the clients.
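
The variation between different splits can be quantified with repeated cross-validation. The sketch below is an illustration under stated assumptions rather than a rerun of our analysis: it uses scikit-learn's `LogisticRegression` and `DecisionTreeClassifier` as stand-ins for the interpret models and synthetic data instead of the credit card dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (about 22% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.78, 0.22], random_state=0
)

# 5 folds repeated 3 times gives 15 different train/test splits
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "decision tree": DecisionTreeClassifier(max_depth=7, random_state=0),
}

summary = {}
for name, model in models.items():
    # One AUC value per split
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the difference between the mean AUC values is small compared with the standard deviations across splits, the two models cannot be ranked reliably on this data.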