In [1]:
#import the usual tools and read-in the edited data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
%matplotlib inline

In [2]:
df = pd.read_excel('Capstone Data.xls', index_col = 0)

In [3]:
pd.set_option("display.max_columns", 50)

In [4]:
df.head()

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,DEFAULT
1,20000,2,2,1,24,1,1,2,2,5,5,0,0,0,689,3102,3913,0,0,0,0,689,0,1
2,120000,2,2,2,26,5,3,3,3,5,2,3261,3455,3272,2682,1725,2682,2000,0,1000,1000,1000,0,1
3,90000,2,2,2,34,3,3,3,3,3,3,15549,14948,14331,13559,14027,29239,5000,1000,1000,1000,1500,1518,0
4,50000,2,2,1,37,3,3,3,3,3,3,29547,28959,28314,49291,48233,46990,1000,1069,1100,1200,2019,2000,0
5,50000,1,2,1,57,3,3,3,2,3,2,19131,19146,20940,35835,5670,8617,679,689,9000,10000,36681,2000,0


# Features

Before checking the explanatory variables for association with the explained variable ('DEFAULT'), I derived some new features from the existing data that conventional wisdom suggests may affect the explained variable. I also summarised some of the variables that were measured over several time periods, in order to reduce collinearity in the model and hopefully make interpretation more straightforward.

The payment amount made by an individual each month is recorded over 6 consecutive time periods. To condense this and get a better view of individual average behaviour, I created a feature for the mean monthly payment amount. To examine whether payments closer to the Default period affect Default differently from those further away, I also took averages for the first 3 months (period A) and the last 3 months (period B). Any significant difference or, perhaps more importantly, lack of difference may indicate that the likelihood of default can be spotted 4-6 months in advance.

In [5]:
df['PAY_AMT_AVG_FULL'] = (df['PAY_AMT1'] + df['PAY_AMT2'] + df['PAY_AMT3'] + df['PAY_AMT4'] + df['PAY_AMT5'] + df['PAY_AMT6']) / 6
df['PAY_AMT_AVG_A'] = (df['PAY_AMT1'] + df['PAY_AMT2'] + df['PAY_AMT3']) / 3
df['PAY_AMT_AVG_B'] = (df['PAY_AMT4'] + df['PAY_AMT5'] + df['PAY_AMT6']) / 3
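The same three averages can also be written as row-wise means. The sketch below runs on a made-up two-row frame and uses a hypothetical helper, `add_period_means`, that is not part of the original notebook. One difference to keep in mind: `mean(axis=1)` skips missing values by default, whereas the explicit sum divided by 6 would propagate them.

```python
import pandas as pd

# Made-up data: two customers, six monthly payment amounts each.
toy = pd.DataFrame(
    [[600, 0, 0, 0, 0, 0], [1, 2, 3, 4, 5, 6]],
    columns=[f"PAY_AMT{i}" for i in range(1, 7)],
)

def add_period_means(frame, prefix, out_prefix):
    """Add full-period, period A (months 1-3) and period B (months 4-6) means."""
    cols = [f"{prefix}{i}" for i in range(1, 7)]
    frame[f"{out_prefix}_AVG_FULL"] = frame[cols].mean(axis=1)
    frame[f"{out_prefix}_AVG_A"] = frame[cols[:3]].mean(axis=1)
    frame[f"{out_prefix}_AVG_B"] = frame[cols[3:]].mean(axis=1)
    return frame

toy = add_period_means(toy, "PAY_AMT", "PAY_AMT")
print(toy[["PAY_AMT_AVG_FULL", "PAY_AMT_AVG_A", "PAY_AMT_AVG_B"]])
```

Called with `prefix="BILL_AMT"`, or with `prefix="PAY_"` and `out_prefix="REPAY"`, the same helper would produce the column names used for the bill amounts and repayment scores.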

Using the same intuition as above, I repeated the process for the amount that an individual is billed each month over the 6 recorded periods.

In [6]:
df['BILL_AMT_AVG_FULL'] = (df['BILL_AMT1'] + df['BILL_AMT2'] + df['BILL_AMT3'] + df['BILL_AMT4'] + df['BILL_AMT5'] + df['BILL_AMT6']) / 6
df['BILL_AMT_AVG_A'] = (df['BILL_AMT1'] + df['BILL_AMT2'] + df['BILL_AMT3']) / 3
df['BILL_AMT_AVG_B'] = (df['BILL_AMT4'] + df['BILL_AMT5'] + df['BILL_AMT6']) / 3

I repeated the process again for the repayment score. As a recap, this score was inferred from the original data set: a higher score indicates that the individual has delayed payment for a greater number of months (delay being distinct from actual default), and a lower score indicates little or no delay in payment.

In [7]:
df['REPAY_AVG_FULL'] = (df['PAY_1'] + df['PAY_2'] + df['PAY_3'] + df['PAY_4'] + df['PAY_5'] + df['PAY_6']) / 6
df['REPAY_AVG_A'] = (df['PAY_1'] + df['PAY_2'] + df['PAY_3']) / 3
df['REPAY_AVG_B'] = (df['PAY_4'] + df['PAY_5'] + df['PAY_6']) / 3

Conventional wisdom would suggest that the bill amount requires further context in explaining the likelihood of Default, particularly in relation to the individual's ability to pay the bill. The Limit Amount on an individual's account is our best estimator here of their ability to pay (as judged by the credit card company), so it is interesting to look at the size of the bill with that in mind. I computed the ratio between the two for each month and took averages as with the previous explanatory variables above.

In [8]:
df['BILL_LIMIT_RAT_1'] = df['BILL_AMT1'] / df['LIMIT_BAL']
df['BILL_LIMIT_RAT_2'] = df['BILL_AMT2'] / df['LIMIT_BAL']
df['BILL_LIMIT_RAT_3'] = df['BILL_AMT3'] / df['LIMIT_BAL']
df['BILL_LIMIT_RAT_4'] = df['BILL_AMT4'] / df['LIMIT_BAL']
df['BILL_LIMIT_RAT_5'] = df['BILL_AMT5'] / df['LIMIT_BAL']
df['BILL_LIMIT_RAT_6'] = df['BILL_AMT6'] / df['LIMIT_BAL']

df['BILL_LIMIT_RAT_AVG_FULL'] = (df['BILL_LIMIT_RAT_1'] + df['BILL_LIMIT_RAT_2'] + df['BILL_LIMIT_RAT_3'] + df['BILL_LIMIT_RAT_4'] + df['BILL_LIMIT_RAT_5'] + df['BILL_LIMIT_RAT_6']) / 6
df['BILL_LIMIT_RAT_AVG_A'] = (df['BILL_LIMIT_RAT_1'] + df['BILL_LIMIT_RAT_2'] + df['BILL_LIMIT_RAT_3']) / 3
df['BILL_LIMIT_RAT_AVG_B'] = (df['BILL_LIMIT_RAT_4'] + df['BILL_LIMIT_RAT_5'] + df['BILL_LIMIT_RAT_6']) / 3
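As a compact alternative, the six ratios can be computed in one step by dividing the bill columns by the limit row by row. This is only a sketch on made-up numbers; the `toy`, `ratios` and `summary` names are illustrative and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd

bill_cols = [f"BILL_AMT{i}" for i in range(1, 7)]

# Made-up data: two customers with limits of 1000 and 2000.
toy = pd.DataFrame(
    [[100, 200, 300, 400, 500, 600], [0, 0, 0, 2000, 2000, 2000]],
    columns=bill_cols,
)
toy["LIMIT_BAL"] = [1000, 2000]

# axis=0 aligns the divisor on the row index, so each row is divided
# by that customer's own credit limit.
ratios = toy[bill_cols].div(toy["LIMIT_BAL"], axis=0)
ratios.columns = [f"BILL_LIMIT_RAT_{i}" for i in range(1, 7)]

summary = pd.DataFrame({
    "BILL_LIMIT_RAT_AVG_FULL": ratios.mean(axis=1),
    "BILL_LIMIT_RAT_AVG_A": ratios.iloc[:, :3].mean(axis=1),
    "BILL_LIMIT_RAT_AVG_B": ratios.iloc[:, 3:].mean(axis=1),
})
print(summary)
```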

Again, as an alternative to just looking at the absolute values in isolation, I wanted to get a basic idea of the trajectory of payments in the run-up to the default period and a look at an individual's payment pattern. I created a payment score based on how many consecutive times an individual's monthly payments decreased immediately prior to the default period, with a maximum score of 5 (payment decreases in the last month and every month prior) and a minimum score of 0 (payment does not decrease in the last month). This would also potentially be of use in identifying higher-risk behaviour in advance.

In [9]:
def pay_score(row):
    if (row['PAY_AMT6'] < row['PAY_AMT5']) & (row['PAY_AMT5'] < row['PAY_AMT4']) & (row['PAY_AMT4'] < row['PAY_AMT3']) & (row['PAY_AMT3'] < row['PAY_AMT2']) & (row['PAY_AMT2'] < row['PAY_AMT1']):
        return 5
    elif (row['PAY_AMT6'] < row['PAY_AMT5']) & (row['PAY_AMT5'] < row['PAY_AMT4']) & (row['PAY_AMT4'] < row['PAY_AMT3']) & (row['PAY_AMT3'] < row['PAY_AMT2']) & (row['PAY_AMT2'] >= row['PAY_AMT1']):
        return 4
    elif (row['PAY_AMT6'] < row['PAY_AMT5']) & (row['PAY_AMT5'] < row['PAY_AMT4']) & (row['PAY_AMT4'] < row['PAY_AMT3']) & (row['PAY_AMT3'] >= row['PAY_AMT2']):
        return 3
    elif (row['PAY_AMT6'] < row['PAY_AMT5']) & (row['PAY_AMT5'] < row['PAY_AMT4']) & (row['PAY_AMT4'] >= row['PAY_AMT3']):
        return 2
    elif (row['PAY_AMT6'] < row['PAY_AMT5']) & (row['PAY_AMT5'] >= row['PAY_AMT4']):
        return 1
    elif (row['PAY_AMT6'] >= row['PAY_AMT5']):
        return 0

In [10]:
df['PAY_SCORE'] = df.apply(pay_score, axis=1)

In [11]:
df['PAY_SCORE'].value_counts()

0    18024
1     9281
2     2219
3      370
4       72
5       34
Name: PAY_SCORE, dtype: int64

I repeated the above to create a bill score, with a highest score of 5 (bill increases in the last month and every month prior) and a lowest score of 0 (bill does not increase in the last month before the Default period).

In [12]:
def bill_score(row):
    if (row['BILL_AMT6'] > row['BILL_AMT5']) & (row['BILL_AMT5'] > row['BILL_AMT4']) & (row['BILL_AMT4'] > row['BILL_AMT3']) & (row['BILL_AMT3'] > row['BILL_AMT2']) & (row['BILL_AMT2'] > row['BILL_AMT1']):
        return 5
    elif (row['BILL_AMT6'] > row['BILL_AMT5']) & (row['BILL_AMT5'] > row['BILL_AMT4']) & (row['BILL_AMT4'] > row['BILL_AMT3']) & (row['BILL_AMT3'] > row['BILL_AMT2']) & (row['BILL_AMT2'] <= row['BILL_AMT1']):
        return 4
    elif (row['BILL_AMT6'] > row['BILL_AMT5']) & (row['BILL_AMT5'] > row['BILL_AMT4']) & (row['BILL_AMT4'] > row['BILL_AMT3']) & (row['BILL_AMT3'] <= row['BILL_AMT2']):
        return 3
    elif (row['BILL_AMT6'] > row['BILL_AMT5']) & (row['BILL_AMT5'] > row['BILL_AMT4']) & (row['BILL_AMT4'] <= row['BILL_AMT3']):
        return 2
    elif (row['BILL_AMT6'] > row['BILL_AMT5']) & (row['BILL_AMT5'] <= row['BILL_AMT4']):
        return 1
    elif (row['BILL_AMT6'] <= row['BILL_AMT5']): 
        return 0

In [13]:
df['BILL_SCORE'] = df.apply(bill_score, axis=1)

In [14]:
df['BILL_SCORE'].value_counts()

0    16850
1     6647
2     2545
5     1654
3     1378
4      926
Name: BILL_SCORE, dtype: int64

I also applied this to the repayment score to see whether a consecutively increasing payment delay in the months immediately prior to the default period was useful in explaining the dependent variable.

In [15]:
def repay_score(row):
    if (row['PAY_6'] > row['PAY_5']) & (row['PAY_5'] > row['PAY_4']) & (row['PAY_4'] > row['PAY_3']) & (row['PAY_3'] > row['PAY_2']) & (row['PAY_2'] > row['PAY_1']):
        return 5
    elif (row['PAY_6'] > row['PAY_5']) & (row['PAY_5'] > row['PAY_4']) & (row['PAY_4'] > row['PAY_3']) & (row['PAY_3'] > row['PAY_2']) & (row['PAY_2'] <= row['PAY_1']):
        return 4
    elif (row['PAY_6'] > row['PAY_5']) & (row['PAY_5'] > row['PAY_4']) & (row['PAY_4'] > row['PAY_3']) & (row['PAY_3'] <= row['PAY_2']):
        return 3
    elif (row['PAY_6'] > row['PAY_5']) & (row['PAY_5'] > row['PAY_4']) & (row['PAY_4'] <= row['PAY_3']):
        return 2
    elif (row['PAY_6'] > row['PAY_5']) & (row['PAY_5'] <= row['PAY_4']):
        return 1
    elif (row['PAY_6'] <= row['PAY_5']): 
        return 0

In [16]:
df['REPAY_SCORE'] = df.apply(repay_score, axis=1)

In [17]:
df['REPAY_SCORE'].value_counts()

0    26301
1     3509
2      101
3       39
5       36
4       14
Name: REPAY_SCORE, dtype: int64
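The three score functions above share one pattern: count how many consecutive month-to-month moves in a given direction occur immediately before the default period. A minimal sketch of that pattern follows, using a hypothetical helper `trailing_run` (not part of the original notebook) and assuming the values are passed in order from month 1 to month 6.

```python
def trailing_run(values, increasing):
    """Count consecutive strict rises (or falls) ending at the last value.

    `values` is ordered from the earliest month to the latest. The count
    stops at the first pair that does not move in the requested direction,
    so a tie breaks the run, as in the comparisons used above.
    """
    count = 0
    # Pair each month with the one before it, walking backwards in time:
    # (month 6, month 5), (month 5, month 4), ..., (month 2, month 1).
    for later, earlier in zip(values[::-1], values[-2::-1]):
        moved = later > earlier if increasing else later < earlier
        if not moved:
            break
        count += 1
    return count

# Payments for customer 1 in the table at the top: only the last month falls.
print(trailing_run([0, 0, 0, 0, 689, 0], increasing=False))
```

For rows without missing values, `df[pay_cols].apply(lambda r: trailing_run(r.tolist(), increasing=False), axis=1)` should match `pay_score`, where `pay_cols` is assumed to list `PAY_AMT1` to `PAY_AMT6` in order; `increasing=True` covers the bill and repayment scores.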

To smooth out some of the previously seen higher variance for very particular ages where there were few individuals, I split the age data into categorical groups. This would hopefully more clearly highlight any relationship between age and default, rather than relying on a total consistency of correlation across all individual years.

In [18]:
df.AGE.describe()

count    30000.000000
mean        35.485500
std          9.217904
min         21.000000
25%         28.000000
50%         34.000000
75%         41.000000
max         79.000000
Name: AGE, dtype: float64

In [19]:
bins = [20, 30, 40, 50, 60, 70, 80]
group_names = ['21-30', '31-40', '41-50', '51-60', '61-70', '71-80']

df['AGE_GROUP'] = pd.cut(df['AGE'], bins, labels = group_names)
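A small check of how these bins behave, on made-up ages: `pd.cut` treats each interval as open on the left and closed on the right by default, so 30 falls in '21-30' and 31 in '31-40'. Ages outside the outer edges become missing rather than raising an error.

```python
import pandas as pd

bins = [20, 30, 40, 50, 60, 70, 80]
group_names = ['21-30', '31-40', '41-50', '51-60', '61-70', '71-80']

# Made-up ages sitting on or next to the bin edges.
ages = pd.Series([21, 30, 31, 79])
groups = pd.cut(ages, bins, labels=group_names)
print(groups.tolist())

# Values at or below the lowest edge, or above the highest, are not binned.
outside = pd.cut(pd.Series([20, 81]), bins, labels=group_names)
print(outside.isna().tolist())
```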

As previously discussed, I also cut rows where the description was missing from the original data set and set particular columns to the correct data type. The loss of rows is relatively slight.

In [20]:
#drop unidentified marriage group
df = df[df['MARRIAGE'] != 0.0]
len(df)

29946

In [21]:
#drop unidentified education groups
df = df[(df['EDUCATION'] <= 3.0) & (df['EDUCATION'] >= 1.0)]
len(df)

29478
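The two filters can also be combined into a single boolean mask, which makes it easy to report how many rows are lost in total. The sketch below uses made-up category codes; `keep`, `filtered` and `dropped` are illustrative names only.

```python
import pandas as pd

# Made-up codes: MARRIAGE 0 and EDUCATION outside 1-3 are treated as unidentified.
toy = pd.DataFrame({
    "MARRIAGE": [0, 1, 2, 3, 1, 2, 1],
    "EDUCATION": [1, 0, 2, 3, 4, 5, 6],
})

# between() is inclusive at both ends by default.
keep = (toy["MARRIAGE"] != 0) & toy["EDUCATION"].between(1, 3)
filtered = toy[keep]
dropped = len(toy) - len(filtered)
print(len(filtered), dropped)
```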

In [22]:
for col in ['MARRIAGE', 'SEX', 'EDUCATION', 'AGE_GROUP', 'DEFAULT']:
    df[col] = df[col].astype('category')

In [23]:
df.to_excel('Capstone Data Final.xls')

In [24]:
df = pd.read_excel('Capstone Data Final.xls', index_col = 0)

# Exploratory Statistics

Due to the mixed nature of the explanatory variables and the explained variable itself, I needed two different tests of association: one for the continuous variables and one for the categorical variables.

### Continuous Explanatory Variables
The appropriate test statistic for association between a continuous explanatory variable and a dichotomous explained variable is the point-biserial correlation coefficient. I first examined the relevant variables with this test.
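For intuition, the point-biserial coefficient is simply the Pearson correlation computed with the dichotomous variable coded as 0/1. The sketch below checks this on made-up numbers and also against the textbook form $r_{pb} = \frac{M_1 - M_0}{s_n}\sqrt{pq}$, where $s_n$ is the population standard deviation of the continuous variable; all variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Made-up data: a continuous measurement and a 0/1 outcome.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 1, 0, 1, 1])

r_pb, p_pb = stats.pointbiserialr(y, x)

# Plain Pearson correlation on the same data.
r_pearson = np.corrcoef(x, y)[0, 1]

# Difference of group means, scaled by the population standard deviation
# (np.std uses ddof=0 by default) and the group proportions.
m1 = x[y == 1].mean()
m0 = x[y == 0].mean()
p1 = y.mean()
r_formula = (m1 - m0) / x.std() * np.sqrt(p1 * (1 - p1))

print(r_pb, r_pearson, r_formula)
```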

In [25]:
#creating dataframe of just the columns of continuous variables
df_cont = df[['PAY_AMT_AVG_FULL', 'PAY_AMT_AVG_A', 'PAY_AMT_AVG_B', 'BILL_AMT_AVG_FULL', 'BILL_AMT_AVG_A', 'BILL_AMT_AVG_B',
             'BILL_LIMIT_RAT_AVG_FULL', 'BILL_LIMIT_RAT_AVG_A', 'BILL_LIMIT_RAT_AVG_B', 'REPAY_AVG_FULL', 'REPAY_AVG_A', 'REPAY_AVG_B', 'PAY_SCORE', 'BILL_SCORE', 'REPAY_SCORE']]

In [26]:
cols = df_cont.columns.tolist()

In [27]:
for col in cols:
    result = scipy.stats.pointbiserialr(df_cont[col], df['DEFAULT'])
    print (col, ':',  result)

PAY_AMT_AVG_FULL : PointbiserialrResult(correlation=-0.1028140290304192, pvalue=4.3084250798492785e-70)
PAY_AMT_AVG_A : PointbiserialrResult(correlation=-0.083886744172108896, pvalue=3.4759414131253916e-47)
PAY_AMT_AVG_B : PointbiserialrResult(correlation=-0.086675426091355978, pvalue=2.8858218386948041e-50)
BILL_AMT_AVG_FULL : PointbiserialrResult(correlation=-0.012270581536249492, pvalue=0.035139641769359915)
BILL_AMT_AVG_A : PointbiserialrResult(correlation=-0.0072382171915603482, pvalue=0.21397672555775793)
BILL_AMT_AVG_B : PointbiserialrResult(correlation=-0.016089355643618214, pvalue=0.0057366078190698565)
BILL_LIMIT_RAT_AVG_FULL : PointbiserialrResult(correlation=0.11602927235982868, pvalue=7.0125645887504676e-89)
BILL_LIMIT_RAT_AVG_A : PointbiserialrResult(correlation=0.12497608588778034, pvalue=6.4399539751104277e-103)
BILL_LIMIT_RAT_AVG_B : PointbiserialrResult(correlation=0.099873028896928881, pvalue=3.1759686638043302e-66)
REPAY_AVG_FULL : PointbiserialrResult(correlation=0

In many of the instances above the p-value is significant despite a very low correlation coefficient, most likely due to the sheer size of the sample.
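To see how sample size drives this, the p-value of a correlation can be recovered from the coefficient and the number of rows via $t = r\sqrt{(n-2)/(1-r^2)}$ with $n-2$ degrees of freedom. The sketch below plugs in the BILL_AMT_AVG_B coefficient printed above and assumes $n = 29478$, the row count left after filtering; it should land at roughly the reported p-value of 0.0057 even though the coefficient is only about -0.016.

```python
import math
from scipy import stats

r = -0.016089355643618214  # BILL_AMT_AVG_B coefficient from the output above
n = 29478                  # assumed: rows remaining after the filtering step

# Convert the correlation into a t statistic with n - 2 degrees of freedom.
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

# Two-sided p-value from the Student t survival function.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(t_stat, p_value)
```

With a few hundred rows instead of tens of thousands, the same coefficient would be nowhere near significant, which is why the size of the coefficient matters more than the p-value here.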

The Payment Amounts follow conventional wisdom in having a negative coefficient, whereas the Bill Amounts are, perhaps surprisingly, also negative. As discussed above, this is possibly due to the lack of context in the Bill Amounts alone; indeed, the ratio of Bill Amount to Limit Amount is positively correlated with Default, as expected. The Bill Amount coefficients are also very close to zero, and the period A average is not significant at the 5% level, again probably for the reasons already outlined.

The Repayment Delay variables appear to be the most strongly correlated overall, with the later period (B) more strongly correlated than the earlier (A). The Repayment Delay Score is also positively correlated; as expected, if delays increase, likelihood of Default also does.

The Pay Score and Bill Score, though, are negatively correlated, which runs contrary to intuition, as the higher scores represent either payments getting lower or bills getting higher over time. The coefficients are so small, however, that there appears to be practically no relationship despite the p-values.

Some of the variables will be dropped accordingly.

In [28]:
#quick look at correlation between the variables
df_cont.corr()

,PAY_AMT_AVG_FULL,PAY_AMT_AVG_A,PAY_AMT_AVG_B,BILL_AMT_AVG_FULL,BILL_AMT_AVG_A,BILL_AMT_AVG_B,BILL_LIMIT_RAT_AVG_FULL,BILL_LIMIT_RAT_AVG_A,BILL_LIMIT_RAT_AVG_B,REPAY_AVG_FULL,REPAY_AVG_A,REPAY_AVG_B,PAY_SCORE,BILL_SCORE,REPAY_SCORE
PAY_AMT_AVG_FULL,1.0,0.779997,0.871202,0.34269,0.367464,0.306119,0.029855,0.050664,0.009488,-0.075987,-0.036067,-0.106941,0.072398,0.011251,-0.038141
PAY_AMT_AVG_A,0.779997,1.0,0.372324,0.279179,0.293926,0.254094,0.006446,0.020674,-0.00648,-0.070322,-0.050749,-0.081113,0.058224,0.060223,-0.04334
PAY_AMT_AVG_B,0.871202,0.372324,1.0,0.289232,0.314405,0.254671,0.039222,0.058921,0.019155,-0.05753,-0.013679,-0.094972,0.061697,-0.030558,-0.022567
BILL_AMT_AVG_FULL,0.34269,0.279179,0.289232,1.0,0.972689,0.979573,0.545919,0.52115,0.52847,0.281191,0.274628,0.250643,0.032593,0.138539,-0.079648
BILL_AMT_AVG_A,0.367464,0.293926,0.314405,0.972689,1.0,0.906145,0.528931,0.552895,0.470014,0.284666,0.290523,0.24089,0.057609,0.08288,-0.070579
BILL_AMT_AVG_B,0.306119,0.254094,0.254671,0.979573,0.906145,1.0,0.536568,0.470671,0.555814,0.265785,0.24875,0.248042,0.009483,0.180651,-0.083994
BILL_LIMIT_RAT_AVG_FULL,0.029855,0.006446,0.039222,0.545919,0.528931,0.536568,1.0,0.956253,0.966614,0.562218,0.523258,0.527694,0.039028,0.088545,-0.06359
BILL_LIMIT_RAT_AVG_A,0.050664,0.020674,0.058921,0.52115,0.552895,0.470671,0.956253,1.0,0.849369,0.567278,0.550611,0.509171,0.068933,0.014453,-0.04961
BILL_LIMIT_RAT_AVG_B,0.009488,-0.00648,0.019155,0.52847,0.470014,0.555814,0.966614,0.849369,1.0,0.517472,0.461781,0.506081,0.010036,0.147092,-0.071275
REPAY_AVG_FULL,-0.075987,-0.070322,-0.05753,0.281191,0.284666,0.265785,0.562218,0.567278,0.517472,1.0,0.936428,0.932711,0.019985,0.062795,0.086734


There are no greatly surprising results above. Where there is naturally high correlation and essentially duplication of data (e.g. between averages over 6 periods and those over 3), variables will be dropped accordingly to make way for those of more significance.
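One way to pick out such near-duplicate pairs programmatically is to scan the upper triangle of the correlation matrix for large absolute values. This is a sketch on made-up columns, and the 0.9 threshold is an arbitrary assumption rather than a value used in this analysis.

```python
import numpy as np
import pandas as pd

# Made-up data: b is an exact multiple of a, c is unrelated to both.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [1, -1, -1, 1],
})

corr = toy.corr().abs()

# Keep only entries above the diagonal so each pair appears once
# and self-correlations are excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Masked entries are NaN and never pass the comparison.
high = pairs[pairs > 0.9]
print(high)
```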

### Categorical Explanatory Variables
For the categorical variables a different test of association is required. I decided to use Pearson's chi-square test of independence, under the null hypothesis that the categorical explanatory variable and the dichotomous explained variable are independent of one another. The test is suitable for all grid sizes, and the expected counts are sufficiently large for it: only one cell, in the 71-80 age group, has an expected count below 5.

In [29]:
for col in ['MARRIAGE', 'SEX', 'EDUCATION', 'AGE_GROUP']:
    observed = pd.crosstab(df[col], df['DEFAULT'], margins = False)
    result = scipy.stats.chi2_contingency(observed)
    print (col, ':', result)

MARRIAGE : (32.142243959046255, 1.0480944138946852e-07, 2, array([[ 10420.10991248,   3004.89008752],
       [ 12215.39588846,   3522.60411154],
       [   244.49419906,     70.50580094]]))
SEX : (45.744079662370098, 1.3475601294287318e-11, 1, array([[  9084.31779632,   2619.68220368],
       [ 13795.68220368,   3978.31779632]]))
EDUCATION : (98.816227756796266, 3.4860112262384947e-22, 2, array([[  8212.67657236,   2368.32342764],
       [ 10885.03697673,   3138.96302327],
       [  3782.28645091,   1090.71354909]]))
AGE_GROUP : (36.866440293668418, 6.3702925695916702e-07, 5, array([[  8.41448131e+03,   2.42651869e+03],
       [  8.17076328e+03,   2.35623672e+03],
       [  4.56622023e+03,   1.31677977e+03],
       [  1.51974489e+03,   4.38255106e+02],
       [  1.97147703e+02,   5.68522966e+01],
       [  1.16425809e+01,   3.35741909e+00]]))


Although the chi-square test gives no indication of the direction or strength of the relationship between the explanatory and explained variables, the very small p-values suggest that we can reject the hypothesis of independence and accept that there is some degree of association. As such, all the categorical variables will be retained.
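The tuples printed above hold, in order, the test statistic, the p-value, the degrees of freedom and the table of expected counts. A minimal sketch of unpacking them and checking the smallest expected count, on a made-up 2x2 table, is given below; note that SciPy applies Yates' continuity correction by default when there is one degree of freedom.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up contingency table: two groups against a 0/1 outcome.
observed = pd.DataFrame(
    [[30, 10], [20, 40]],
    index=["group_a", "group_b"],
    columns=[0, 1],
)

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# A common rule of thumb asks for expected counts of at least 5.
min_expected = expected.min()
print(chi2, p_value, dof, min_expected)
```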

With the Payment and Bill amounts performing rather poorly in isolation, I attempted to see whether a ratio between the two would be helpful in explaining Default. Intuitively a ratio of 1:1 would suggest a good user that pays the whole bill. 

I looked at the amount for both variables totalled over the 6 periods to smooth out any unusual behaviour occurring in just a single month. Importantly, to avoid division by zero, it is necessary to drop any rows where the Total Bill Amount is zero.

In [30]:
#create a reduced dataframe of just the rows with a non-zero Total Bill
#(negative totals are kept; only exact zeros are removed)
df['TOTAL_BILL'] = df['BILL_AMT1'] + df['BILL_AMT2'] + df['BILL_AMT3'] + df['BILL_AMT4'] + df['BILL_AMT5'] + df['BILL_AMT6']
df_pos_bill = df[df['TOTAL_BILL'] != 0]
reduced_df = df_pos_bill.copy()

In [31]:
#for this dataframe also calculate Total Pay and ratio between that and Total Bill
reduced_df['TOTAL_PAY'] = reduced_df['PAY_AMT1'] + reduced_df['PAY_AMT2'] + reduced_df['PAY_AMT3'] + reduced_df['PAY_AMT4'] + reduced_df['PAY_AMT5'] + reduced_df['PAY_AMT6']
reduced_df['TOTAL_PAY_BILL'] = reduced_df['TOTAL_PAY'] / reduced_df['TOTAL_BILL']
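An alternative that keeps every row is to mark the zero denominators as missing before dividing, so the ratio is simply undefined for those customers. This is a sketch on made-up totals and not what the notebook does. It also shows that a negative total bill still yields a ratio, which may or may not be meaningful.

```python
import numpy as np
import pandas as pd

# Made-up totals: a normal bill, a zero bill and a negative (credit) bill.
toy = pd.DataFrame({
    "TOTAL_BILL": [1000, 0, -50],
    "TOTAL_PAY": [500, 200, 0],
})

# Replacing zeros with NaN makes the ratio NaN instead of infinite.
ratio = toy["TOTAL_PAY"] / toy["TOTAL_BILL"].replace(0, np.nan)
print(ratio)
```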

In [32]:
scipy.stats.pointbiserialr(reduced_df['TOTAL_PAY_BILL'], reduced_df['DEFAULT'])

PointbiserialrResult(correlation=-0.011549367757713662, pvalue=0.05071230917306397)

The correlation is almost zero and the p-value is of questionable significance. Perhaps just as importantly, using this reduced dataframe would mean the loss of 852 rows, which is far too high a sacrifice for this. The variable will not be used.

In addition to the above experiment with Payment/Bill, the other variables I will drop from further modelling are all the Bill Amounts (Full, A & B being so close to zero), the averaged Payment Amounts for periods A and B (the full 6 month period being a stronger correlation), the Bill Amount/Limit Amount Ratio averaged over 6 months (periods A & B being more informative separately), the Repayment Delay averaged over 6 months (periods A & B being more informative separately), and the Pay Score and the Bill Score (correlation coefficients so close to zero).