# EDA

## Purpose
The purpose of this notebook is to perform exploratory data analysis (EDA) on the training dataset and to clean and prepare it for classification model training. First, the data is examined for errors and null values, and corrections are inferred from the majority of the data. Second, dummy variables are created for binary categorical data. Last, the same cleaning and feature engineering are applied to the holdout (test) data. Both datasets are saved as new CSV files.

## Data Correction and Feature Engineering Summary
1. Education: merge all non 1, 2, or 3 values to 4
2. Marital Status: merge all non 1, 2, or 3 values to 3
    - Feature Engineer this as dummy variable of married (1=True 0=False)
3. History of Payment (`pay_status_<month>`): handle the undocumented values -2 and 0 so that the status codes form a complete ordinal sequence
4. Carry over bills per month
5. Carry over bill ratio per month


## Column IDs

- Unnamed: 0: most likely a client ID; not needed for this project
- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
- X2: Gender (1 = male; 2 = female).
- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
- X4: Marital status (1 = married; 2 = single; 3 = others).
- X5: Age (year).
- X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
    - -2 probably represents overpayment or bill at 0
- X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
- X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.
- Y: default payment next month; this is the **predictand** (target)
    - 1 = the client defaults on next month's payment
    - 0 = the client does not default

## Data Import and Clean

In [152]:
# below allows local import for custom functions in src folder
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

# custom tools
import src.cc_pred_tools as tl
    
import pandas as pd
pd.set_option('display.max_columns', 100) #set to show all columns

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

In [153]:
directory = os.path.join("..", "data", "training_data.csv")  # portable across Windows, macOS, and Linux
df = pd.read_csv(directory)

In [154]:
df.head()

Unnamed: 0.1,Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
0,28835,220000,2,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1
1,25329,200000,2,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0
2,18894,180000,2,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0
3,690,80000,1,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,0
4,6239,10000,1,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1


There are no null values in this dataset

#### Column Renaming

In [155]:
df = df.drop(columns="Unnamed: 0")  # drop the unnecessary ID column

In [156]:
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
0,220000,2,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1
1,200000,2,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0
2,180000,2,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0
3,80000,1,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,0
4,10000,1,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1


In [157]:
rename_list = ["max_credit", "gender", "education", "marital_status", "age",
               "pay_status_sep", "pay_status_aug", "pay_status_jul", "pay_status_jun", "pay_status_may", "pay_status_apr",
               "bill_sep", "bill_aug", "bill_jul", "bill_jun", "bill_may", "bill_apr",
               "payment_sep", "payment_aug", "payments_jul", "payment_jun", "payment_may", "payment_apr",
                "default"]
col_rename = dict(zip(df.columns,rename_list))

In [158]:
df = df.rename(columns=col_rename)

#### Null and Unique Values

- Check for null values and infer replacement values if necessary
- Check the unique values of each categorical column for errors

In [159]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22500 entries, 0 to 22499
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   max_credit      22500 non-null  object
 1   gender          22500 non-null  object
 2   education       22500 non-null  object
 3   marital_status  22500 non-null  object
 4   age             22500 non-null  object
 5   pay_status_sep  22500 non-null  object
 6   pay_status_aug  22500 non-null  object
 7   pay_status_jul  22500 non-null  object
 8   pay_status_jun  22500 non-null  object
 9   pay_status_may  22500 non-null  object
 10  pay_status_apr  22500 non-null  object
 11  bill_sep        22500 non-null  object
 12  bill_aug        22500 non-null  object
 13  bill_jul        22500 non-null  object
 14  bill_jun        22500 non-null  object
 15  bill_may        22500 non-null  object
 16  bill_apr        22500 non-null  object
 17  payment_sep     22500 non-null  object
 18  paymen

There are no null values in this dataset. However, every column has dtype `object`, and the unique values below show why: the values are integers stored as strings, and one row contains the original column names (for example `SEX` and `EDUCATION`).

In [160]:
category_col = ["gender", "education", "marital_status",
               "pay_status_sep", "pay_status_aug", "pay_status_jul", "pay_status_jun", "pay_status_may", "pay_status_apr",
               "default"]

In [161]:
for category in category_col:
    print(category, df[category].unique())

gender ['2' '1' 'SEX']
education ['1' '3' '2' '4' '6' '5' '0' 'EDUCATION']
marital_status ['2' '1' '3' '0' 'MARRIAGE']
pay_status_sep ['0' '-1' '-2' '2' '1' '3' '8' '4' '6' '5' '7' 'PAY_0']
pay_status_aug ['0' '-1' '-2' '2' '3' '4' '7' '5' '1' '8' '6' 'PAY_2']
pay_status_jul ['0' '-1' '-2' '2' '3' '4' '6' '5' '7' '8' '1' 'PAY_3']
pay_status_jun ['0' '-1' '-2' '2' '4' '5' '3' '7' '6' '8' '1' 'PAY_4']
pay_status_may ['0' '-1' '-2' '2' '4' '3' '7' '5' '6' '8' 'PAY_5']
pay_status_apr ['0' '-1' '-2' '2' '4' '3' '7' '6' '8' '5' 'PAY_6']
default ['1' '0' 'default payment next month']


Filtering on one of these header strings shows that they all come from a single row (index 18381).

In [162]:
string_row_df = df[df["gender"] == "SEX"]
string_row_df.head()

Unnamed: 0,max_credit,gender,education,marital_status,age,pay_status_sep,pay_status_aug,pay_status_jul,pay_status_jun,pay_status_may,pay_status_apr,bill_sep,bill_aug,bill_jul,bill_jun,bill_may,bill_apr,payment_sep,payment_aug,payments_jul,payment_jun,payment_may,payment_apr,default
18381,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month


This row can be dropped.

In [163]:
df = df.drop(index=string_row_df.index[0])
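The filter above relies on knowing that the stray row contains `SEX` in the gender column. A more general check, sketched here on a toy frame with made-up values, is to coerce every column to numeric and look for rows where parsing fails. This is an illustrative alternative, not the method used in this notebook.

```python
import pandas as pd

# Toy frame mimicking the problem: every column is read as strings
# because one row repeats the original header names.
toy = pd.DataFrame({
    "gender": ["2", "1", "SEX", "1"],
    "age": ["36", "29", "AGE", "27"],
})

# Coerce every column to numeric; cells that cannot be parsed become NaN.
numeric = toy.apply(pd.to_numeric, errors="coerce")

# A row is suspicious if any of its cells failed to parse.
bad_rows = numeric.isna().any(axis=1)
print(toy[bad_rows])

# Drop the offending rows and convert the rest to integers.
clean = toy[~bad_rows].astype("int64")
print(clean.dtypes)
```

This approach would also catch non-numeric cells that appear in only some columns, which a filter on a single known string would miss.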

Running the for loop for unique values again shows that the string values are cleaned up.

In [164]:
for category in category_col:
    print(category, sorted(df[category].unique()))

gender ['1', '2']
education ['0', '1', '2', '3', '4', '5', '6']
marital_status ['0', '1', '2', '3']
pay_status_sep ['-1', '-2', '0', '1', '2', '3', '4', '5', '6', '7', '8']
pay_status_aug ['-1', '-2', '0', '1', '2', '3', '4', '5', '6', '7', '8']
pay_status_jul ['-1', '-2', '0', '1', '2', '3', '4', '5', '6', '7', '8']
pay_status_jun ['-1', '-2', '0', '1', '2', '3', '4', '5', '6', '7', '8']
pay_status_may ['-1', '-2', '0', '2', '3', '4', '5', '6', '7', '8']
pay_status_apr ['-1', '-2', '0', '2', '3', '4', '5', '6', '7', '8']
default ['0', '1']


A few observed values fall outside the documented definitions:
- The payment status columns contain -2 and 0, which might mean that the previous month's balance was 0 or that the bill was overpaid.
- Education contains 0, 5, and 6, which might indicate no education or a level above graduate school.
- Marital status contains 0.

Note that the values are still strings at this point, so `sorted` orders them lexicographically (`'-1'` before `'-2'`). Before these values can be corrected, all columns must be converted to integers.

### Change all values to int64

In [165]:
df = df.astype("int64")
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22499 entries, 0 to 22499
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   max_credit      22499 non-null  int64
 1   gender          22499 non-null  int64
 2   education       22499 non-null  int64
 3   marital_status  22499 non-null  int64
 4   age             22499 non-null  int64
 5   pay_status_sep  22499 non-null  int64
 6   pay_status_aug  22499 non-null  int64
 7   pay_status_jul  22499 non-null  int64
 8   pay_status_jun  22499 non-null  int64
 9   pay_status_may  22499 non-null  int64
 10  pay_status_apr  22499 non-null  int64
 11  bill_sep        22499 non-null  int64
 12  bill_aug        22499 non-null  int64
 13  bill_jul        22499 non-null  int64
 14  bill_jun        22499 non-null  int64
 15  bill_may        22499 non-null  int64
 16  bill_apr        22499 non-null  int64
 17  payment_sep     22499 non-null  int64
 18  payment_aug     22499 non-

### Fixing Errors in Education
Changing 0, 5, 6 in education to 4

In [166]:
df.groupby("education").age.count()

education
0       11
1     7919
2    10516
3     3713
4       90
5      208
6       42
Name: age, dtype: int64

Education levels 1, 2, and 3 account for almost all of the data. Values 0, 5, and 6 are rare and undocumented, so they are merged into 4 ("others"), which is itself sparsely populated.

In [167]:
df.education = df.education.apply(tl.clean_education_col)

In [168]:
df.groupby("education").age.count()

education
1     7919
2    10516
3     3713
4      351
Name: age, dtype: int64
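The source of `tl.clean_education_col` is not shown in this notebook. Based on the counts before and after cleaning (11 + 90 + 208 + 42 = 351), it presumably behaves like the hypothetical stand-in below, which keeps 1, 2, and 3 and maps everything else to 4. The vectorized variant is an assumption-free pandas equivalent of the same rule.

```python
import pandas as pd


def clean_education_col(value: int) -> int:
    """Hypothetical stand-in for tl.clean_education_col: keep 1, 2, 3; map the rest to 4."""
    return value if value in (1, 2, 3) else 4


education = pd.Series([0, 1, 2, 3, 4, 5, 6])

# Element-wise version, mirroring the .apply call used in the notebook.
applied = education.apply(clean_education_col)

# Vectorized version: keep values where the condition holds, otherwise use 4.
vectorized = education.where(education.isin([1, 2, 3]), 4)

print(applied.tolist())
print(vectorized.tolist())
```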

### Errors in Marital Status
As with the education column, the unexpected value is rare (44 rows have a value of 0). Therefore, all values of 0 will be changed to 3 ("others").

In [169]:
df.groupby("marital_status").age.count()

marital_status
0       44
1    10195
2    12026
3      234
Name: age, dtype: int64

In [170]:
df.marital_status = df.marital_status.apply(tl.clean_marital_status)

In [171]:
df.groupby("marital_status").age.count()

marital_status
1    10195
2    12026
3      278
Name: age, dtype: int64

***

## Default Analysis

### Does Education Matter?

Is higher education (education value 1 or 2) associated with a statistically significant difference?

In [172]:
df.groupby(["education","default"]).gender.count()

education  default
1          0          6388
           1          1531
2          0          7998
           1          2518
3          0          2764
           1           949
4          0           321
           1            30
Name: gender, dtype: int64

In [173]:
college_default_rate = (1531+2518)/(6388+1531+7998+2518)
no_college_default_rate = (949+30)/(2764+949+321+30)

print("with college ed or higher: ", college_default_rate)
print("with no college ed: ", no_college_default_rate)

with college ed or higher:  0.21963656088961214
with no college ed:  0.24089566929133857
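The rates above are computed from hand-copied counts. As a sketch, the same numbers can be derived from the count table itself, which avoids transcription errors. The counts below are copied from the groupby output; with the real dataframe, `df.groupby("education")["default"].mean()` would give the per-level rates directly.

```python
import pandas as pd

# Counts copied from the groupby output above: (education, default) -> clients.
counts = pd.Series(
    {(1, 0): 6388, (1, 1): 1531, (2, 0): 7998, (2, 1): 2518,
     (3, 0): 2764, (3, 1): 949, (4, 0): 321, (4, 1): 30},
    name="clients",
)
counts.index.names = ["education", "default"]

# Rows: education level, columns: default flag.
table = counts.unstack("default")

# Default rate for each education level.
rates = table[1] / table.sum(axis=1)
print(rates)

# Pool levels 1-2 and 3-4 as in the cell above.
higher = table.loc[[1, 2]].sum()
lower = table.loc[[3, 4]].sum()
college_rate = higher[1] / higher.sum()
no_college_rate = lower[1] / lower.sum()
print(college_rate, no_college_rate)
```

The per-level rates are worth inspecting alongside the pooled ones, since pooling levels 3 and 4 can hide differences between them.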


In [174]:
ed_default = df[((df.education == 1) | (df.education == 2)) & (df.default == 1)]
ed_no_default = df[((df.education == 1) | (df.education == 2)) & (df.default == 0)]
no_ed_default = df[((df.education == 3) | (df.education == 4)) & (df.default == 1)]
no_ed_no_default = df[((df.education == 3) | (df.education == 4)) & (df.default == 0)]

In [175]:
print("Ed vs No Ed Default Pval", stats.ttest_ind(ed_default, no_ed_default, equal_var=False)[1])
print("Ed vs No Ed No-Default Pval", stats.ttest_ind(ed_no_default, no_ed_no_default, equal_var=False)[1])

Ed vs No Ed Default Pval [8.02678779e-27 2.61467357e-01 0.00000000e+00 2.21700550e-16
 2.27169549e-49 3.26315739e-03 1.97432389e-04 2.29775118e-03
 2.92919678e-04 4.31629067e-03 7.86366910e-02 5.76363082e-02
 2.06779111e-02 4.69951053e-03 2.31620589e-04 1.05570832e-04
 1.51643722e-04 6.16803944e-01 9.68832231e-01 1.68727651e-02
 1.96254404e-02 1.85339259e-03 1.28311136e-01            nan]
Ed vs No Ed No-Default Pval [2.37512457e-076 8.17047552e-001 0.00000000e+000 2.87364564e-026
 7.49474992e-141 1.20080800e-012 3.87662807e-012 5.60555886e-011
 9.20282243e-010 4.41578706e-007 1.50666271e-005 2.39859176e-001
 8.28025612e-002 1.79019674e-001 3.99758931e-003 3.72226919e-004
 2.38897092e-005 6.82023242e-004 6.91780320e-002 7.07222760e-004
 2.26351527e-006 6.64211868e-010 5.46900827e-006             nan]


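Most, but not all, columns differ significantly (alpha = 0.05) between the two education groups. Among defaulters, `gender`, `pay_status_apr`, `bill_sep`, `payment_sep`, `payment_aug`, and `payment_apr` have p-values above 0.05; among non-defaulters, `gender`, `bill_sep`, `bill_aug`, `bill_jul`, and `payment_aug` do. The final `nan` belongs to `default`, which is constant within each subset. Note that these t-tests compare feature means between the education groups; they do not directly test whether the default rate itself differs by education.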
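To test the question in the heading directly (does the default rate depend on higher education?), one option is a chi-square test of independence on the 2x2 table of counts. This is a sketch of an alternative analysis, not something the notebook runs; the counts are taken from the groupby output above.

```python
from scipy import stats

# 2x2 contingency table built from the counts printed above.
# Rows: higher education (1 or 2), no higher education (3 or 4).
# Columns: no default, default.
observed = [
    [6388 + 7998, 1531 + 2518],
    [2764 + 321, 949 + 30],
]

# chi2_contingency applies Yates' continuity correction to 2x2 tables by default.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4g}")
```

A p-value below the chosen alpha would suggest that default and higher education are not independent in this sample.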

***

## Feature Engineering

### Higher Education

Based on the analysis above, split education into a binary higher-education flag (1 for education values 1 or 2, 0 otherwise).

In [176]:
df["higher_education"] = df.education.apply(lambda x: 1 if x == 1 or x ==2 else 0)

### Married Feature
This column is created from the marital_status column: 1 where marital_status is 1 (married) and 0 for all other values.

In [177]:
df["married"] = df.marital_status.apply(lambda x: 1 if x ==1 else 0)

### Update gender column
Recode this column from 1 (male) / 2 (female) to 0 (male) / 1 (female).

In [178]:
df.gender = df.gender.apply(lambda x: 0 if x==1 else 1)

### Bill Carry Over
This feature is the amount of each month's bill that carries over, calculated as `bill_month` - `payment_month`.

In [179]:
bills = ["bill_sep", "bill_aug", "bill_jul", "bill_jun", "bill_may", "bill_apr"]
payments = ["payment_sep", "payment_aug", "payments_jul", "payment_jun", "payment_may", "payment_apr"]
carry_overs = ["carry_sep", "carry_aug", "carry_jul", "carry_jun", "carry_may", "carry_apr"]

for bill, payment, carry in zip(bills, payments, carry_overs):
    df[carry] = df[bill] - df[payment]

### Carry Over Ratio
This feature expresses each month's carry-over amount as a fraction of `max_credit`.

In [180]:
carry_ratios = ["carry_ratio_sep", "carry_ratio_aug", "carry_ratio_jul", "carry_ratio_jun", "carry_ration_may", "carry_ratio_apr"]

for carry, ratio in zip(carry_overs, carry_ratios):
    df[ratio] = df[carry] / df.max_credit
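As a worked example of the two features above, the sketch below reproduces the September values for the first training row (`max_credit` 220000, `bill_sep` 222598, `payment_sep` 10000), using a one-row toy frame.

```python
import pandas as pd

# First training row from the table above (September values only).
row = pd.DataFrame({"max_credit": [220000], "bill_sep": [222598], "payment_sep": [10000]})

# Carry-over: the part of the bill not covered by the payment.
row["carry_sep"] = row["bill_sep"] - row["payment_sep"]

# Carry-over ratio: carry-over as a fraction of the credit limit.
row["carry_ratio_sep"] = row["carry_sep"] / row["max_credit"]

print(row[["carry_sep", "carry_ratio_sep"]])
```

The ratio is not bounded to the range 0 to 1: it is negative when the payment exceeds the bill and can exceed 1 when the balance is above the credit limit.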

### Bill Change between months

1. Create new columns that show total amount of bill change between month to month
2. Create categorical columns for change in either positive or negative direction

In [181]:
changes = ["apr_may_bill_change", "may_jun_bill_change", "jun_jul_bill_change", "jul_aug_bill_change", "aug_sep_bill_change"]
bills = ["bill_apr","bill_may", "bill_jun", "bill_jul", "bill_aug", "bill_sep"]

In [182]:
for change, prev_month, next_month in zip(changes, bills[:-1], bills[1:]):
    df[change] = df[next_month] - df[prev_month] 

In [183]:
df["average_bill_change"] = (df["apr_may_bill_change"] + df["may_jun_bill_change"] + 
                             df["jun_jul_bill_change"] + df["jul_aug_bill_change"] + 
                             df["aug_sep_bill_change"]) / 5
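The month-to-month changes can also be sketched with `DataFrame.diff`, shown here on the bills of the first training row. The `direction` frame is a hypothetical illustration of the categorical positive/negative feature mentioned in the list above; the notebook itself does not create it.

```python
import numpy as np
import pandas as pd

# Bills for the first training row, ordered chronologically (April -> September).
bills_chrono = ["bill_apr", "bill_may", "bill_jun", "bill_jul", "bill_aug", "bill_sep"]
toy = pd.DataFrame([[184605, 181859, 221193, 217900, 222168, 222598]], columns=bills_chrono)

# Month-over-month differences along the columns; the first column has no predecessor.
changes = toy.diff(axis=1).iloc[:, 1:]
average_change = changes.mean(axis=1)

# Consecutive differences telescope, so the average equals (sep - apr) / 5.
shortcut = (toy["bill_sep"] - toy["bill_apr"]) / 5

# Hypothetical direction indicator: -1 for a decrease, 0 for no change, 1 for an increase.
direction = np.sign(changes).astype(int)

print(changes.iloc[0].tolist())
print(average_change[0], shortcut[0])
print(direction.iloc[0].tolist())
```

Because the differences telescope, `average_bill_change` depends only on the April and September bills, so it carries no information about the months in between.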

### Fully Paid Status
- -2 is taken to mean fully paid
- -1 is taken to mean paid on time

The new full-pay columns flag the months with a status of -2.

In [184]:
pay_status = ["pay_status_sep", "pay_status_aug", "pay_status_jul", "pay_status_jun",
       "pay_status_may", "pay_status_apr"]
fully_paid = ["full_pay_sep", "full_pay_aug", "full_pay_jul", "full_pay_jun",
       "full_pay__may", "full_pay_apr"]

In [185]:
for pay_stat, full_pay in zip(pay_status, fully_paid):
    df[full_pay] = df[pay_stat].apply(lambda x: 1 if x == -2 else 0)

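### Total Pay Delay and Habitual Delinquency

How many months of payments are delayed?
If the sum of the six monthly payment statuses is at least 6 (that is, the client averages at least one month of delay per month), the client is flagged as habitually delinquent (`habit_delay` = 1). Note that the negative status codes (-1 and -2) reduce this sum.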

In [186]:
df["pay_status_sum"] = (df.pay_status_sep + df.pay_status_aug + 
                        df.pay_status_jul + df.pay_status_jun + 
                        df.pay_status_may + df.pay_status_apr)
# make categorical
df["habit_delay"] = df["pay_status_sum"].apply(lambda x: 1 if x >=6 else 0)
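The sketch below applies the same rule to a toy frame with made-up status codes to show how negative codes interact with the threshold. The `months_delayed` column is a hypothetical alternative that counts months with a positive delay code; it is not part of this notebook.

```python
import pandas as pd

pay_status = ["pay_status_sep", "pay_status_aug", "pay_status_jul",
              "pay_status_jun", "pay_status_may", "pay_status_apr"]

toy = pd.DataFrame(
    [
        [0, 0, 0, 0, 0, 0],        # sum 0
        [-2, -2, -2, -2, -2, -2],  # sum -12
        [2, 2, 2, 2, -1, -1],      # sum 6: four delayed months, flagged
        [3, 3, 3, -2, -2, -2],     # sum 3: three delayed months, not flagged
    ],
    columns=pay_status,
)

# Same rule as the notebook: sum the six status codes and threshold at 6.
toy["pay_status_sum"] = toy[pay_status].sum(axis=1)
toy["habit_delay"] = (toy["pay_status_sum"] >= 6).astype(int)

# Hypothetical alternative: count the months with a positive delay code.
toy["months_delayed"] = (toy[pay_status] > 0).sum(axis=1)

print(toy[["pay_status_sum", "habit_delay", "months_delayed"]])
```

In the last toy row, three months with a three-month delay are offset by three months coded -2, so the client is not flagged even though half of the months were delayed.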

## Export Data

In [187]:
# reorganize the data for better visualization before export
columns = list(df.drop(columns="default").columns) #all columns but default
columns.append("default") #add default at the end
df = df[columns]

In [188]:
df.head(15)

Unnamed: 0,max_credit,gender,education,marital_status,age,pay_status_sep,pay_status_aug,pay_status_jul,pay_status_jun,pay_status_may,pay_status_apr,bill_sep,bill_aug,bill_jul,bill_jun,bill_may,bill_apr,payment_sep,payment_aug,payments_jul,payment_jun,payment_may,payment_apr,higher_education,married,carry_sep,carry_aug,carry_jul,carry_jun,carry_may,carry_apr,carry_ratio_sep,carry_ratio_aug,carry_ratio_jul,carry_ratio_jun,carry_ration_may,carry_ratio_apr,apr_may_bill_change,may_jun_bill_change,jun_jul_bill_change,jul_aug_bill_change,aug_sep_bill_change,average_bill_change,full_pay_sep,full_pay_aug,full_pay_jul,full_pay_jun,full_pay__may,full_pay_apr,pay_status_sum,habit_delay,default
0,220000,1,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1,0,212598,214150,207779,215187,170872,40826,0.966355,0.973409,0.94445,0.978123,0.776691,0.185573,-2746,39334,-3293,4268,430,7598.6,0,0,0,0,0,0,0,0,1
1,200000,1,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0.0,0,0,0,0,0,0,-6,0,0
2,180000,1,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0.0,1,1,1,1,1,1,-12,0,0
3,80000,0,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,1,0,49519,50172,46071,42334,40768,41027,0.618988,0.62715,0.575887,0.529175,0.5096,0.512837,-271,1626,3711,4279,-500,1769.0,0,0,0,0,0,0,0,0,0
4,10000,0,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1,0,6257,6895,4278,5144,2339,1697,0.6257,0.6895,0.4278,0.5144,0.2339,0.1697,-58,2805,-566,3117,262,1112.0,0,0,0,0,0,0,0,0,1
5,240000,0,2,1,42,0,0,-1,-1,-1,-1,66394,63650,7117,2898,5530,67010,1273,7117,2905,5530,46275,17446,1,1,65121,56533,4212,-2632,-40745,49564,0.271338,0.235554,0.01755,-0.010967,-0.169771,0.206517,-61480,-2632,4219,56533,2744,-123.2,0,0,0,0,0,0,-4,0,0
6,110000,1,2,2,23,0,0,0,0,0,0,111271,111532,107998,79211,77881,80628,6000,4016,3000,3000,4000,4000,1,0,105271,107516,104998,76211,73881,76628,0.957009,0.977418,0.954527,0.692827,0.671645,0.696618,-2747,1330,28787,3534,-261,6128.6,0,0,0,0,0,0,0,0,0
7,50000,0,2,2,31,2,0,0,0,-1,-1,49804,28662,29476,4011,1000,0,2000,1500,1000,1000,0,0,1,0,47804,27162,28476,3011,1000,0,0.95608,0.54324,0.56952,0.06022,0.02,0.0,1000,3011,25465,-814,21142,9960.8,0,0,0,0,0,0,0,0,0
8,180000,1,2,2,35,-2,-2,-2,-2,-2,-2,-117,2573,-77,-77,1823,227,2690,0,0,1900,230,0,1,0,-2807,2573,-77,-1977,1593,227,-0.015594,0.014294,-0.000428,-0.010983,0.00885,0.001261,1596,-1900,0,2650,-2690,-68.8,1,1,1,1,1,1,-12,0,0
9,50000,0,3,1,58,0,0,0,0,0,0,31236,30141,18683,19055,19462,19343,3088,1500,651,682,680,595,0,1,28148,28641,18032,18373,18782,18748,0.56296,0.57282,0.36064,0.36746,0.37564,0.37496,119,-407,-372,11458,1095,2378.6,0,0,0,0,0,0,0,0,0


In [189]:
df.to_csv(r"..\data\training_data_eda.csv")
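By default `to_csv` writes the index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. This is presumably where the `Unnamed: 0` columns seen at import come from. The toy round trip below, using in-memory buffers instead of files, shows the difference that `index=False` makes.

```python
import io

import pandas as pd

toy = pd.DataFrame({"max_credit": [220000, 200000], "default": [1, 0]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False omits the index, so the round trip preserves the column set.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Whether to pass `index=False` here depends on what downstream notebooks expect, since it changes the layout of the saved file.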

***
# Holdout Data

All cleaning and feature engineering will be performed on holdout data as well. 
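Because the same steps run twice, loop variables shared across cells (such as `bills`, which is defined in September-to-April order for the carry-over features and later redefined in April-to-September order) can silently change between the training and holdout passes. One way to guard against this, sketched below with a hypothetical helper `add_carry_over` and toy data, is to keep the column lists as constants and apply one function to both frames.

```python
import pandas as pd

# Column names as used in this notebook (note the existing spelling "payments_jul").
BILLS = ["bill_sep", "bill_aug", "bill_jul", "bill_jun", "bill_may", "bill_apr"]
PAYMENTS = ["payment_sep", "payment_aug", "payments_jul",
            "payment_jun", "payment_may", "payment_apr"]
CARRY_OVERS = ["carry_sep", "carry_aug", "carry_jul", "carry_jun", "carry_may", "carry_apr"]


def add_carry_over(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of `frame` with the carry-over columns added."""
    out = frame.copy()
    for bill, payment, carry in zip(BILLS, PAYMENTS, CARRY_OVERS):
        out[carry] = out[bill] - out[payment]
    return out


# Toy stand-ins for the training and holdout frames (values are made up).
toy_train = pd.DataFrame({**{name: [100, 250] for name in BILLS},
                          **{name: [40, 250] for name in PAYMENTS}})
toy_holdout = pd.DataFrame({**{name: [500] for name in BILLS},
                            **{name: [125] for name in PAYMENTS}})

# The same function, and therefore the same column pairing, is used for both frames.
train_out = add_carry_over(toy_train)
holdout_out = add_carry_over(toy_holdout)
print(train_out[CARRY_OVERS])
print(holdout_out[CARRY_OVERS])
```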

In [190]:
holdout_df = pd.read_csv(r"..\data\holdout_data.csv")

In [191]:
holdout_df.head()

Unnamed: 0.1,Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
0,5501,180000,2,2,1,44,0,0,0,0,0,0,161186,167080,170788,174764,162667,166953,10000,8000,7000,6000,7000,10000
1,28857,130000,2,2,1,48,-2,-2,-2,-2,-2,-2,0,1240,1487,1279,749,440,1240,1487,1279,749,440,849
2,11272,60000,2,1,1,43,-1,3,2,0,0,-1,495,330,495,330,165,340,0,330,0,0,340,0
3,8206,240000,1,1,1,42,0,0,0,0,0,0,72339,91045,91027,51508,51127,0,20000,2213,1030,1023,6790,10893
4,6362,100000,2,2,1,28,2,0,0,0,0,2,73073,74739,70844,63924,57326,59654,3500,3003,1910,2400,3300,0


In [192]:
holdout_df = holdout_df.drop(columns="Unnamed: 0")

The holdout data does not need to be converted to int64 because it was already imported as int64.

In [193]:
holdout_df = holdout_df.rename(columns=col_rename)

In [194]:
holdout_df.head()

Unnamed: 0,max_credit,gender,education,marital_status,age,pay_status_sep,pay_status_aug,pay_status_jul,pay_status_jun,pay_status_may,pay_status_apr,bill_sep,bill_aug,bill_jul,bill_jun,bill_may,bill_apr,payment_sep,payment_aug,payments_jul,payment_jun,payment_may,payment_apr
0,180000,2,2,1,44,0,0,0,0,0,0,161186,167080,170788,174764,162667,166953,10000,8000,7000,6000,7000,10000
1,130000,2,2,1,48,-2,-2,-2,-2,-2,-2,0,1240,1487,1279,749,440,1240,1487,1279,749,440,849
2,60000,2,1,1,43,-1,3,2,0,0,-1,495,330,495,330,165,340,0,330,0,0,340,0
3,240000,1,1,1,42,0,0,0,0,0,0,72339,91045,91027,51508,51127,0,20000,2213,1030,1023,6790,10893
4,100000,2,2,1,28,2,0,0,0,0,2,73073,74739,70844,63924,57326,59654,3500,3003,1910,2400,3300,0


## Data Clean

Apply the same cleaning steps as for the training data above.

### Education

In [195]:
holdout_df.education.unique()

array([2, 1, 3, 4, 5, 6, 0], dtype=int64)

In [196]:
holdout_df.education = holdout_df.education.apply(tl.clean_education_col)

In [197]:
holdout_df.education.unique()

array([2, 1, 3, 4], dtype=int64)

### Marital Status

In [198]:
holdout_df.marital_status.unique()

array([1, 2, 3, 0], dtype=int64)

In [199]:
holdout_df.marital_status = holdout_df.marital_status.apply(tl.clean_marital_status)

In [200]:
holdout_df.marital_status.unique()

array([1, 2, 3], dtype=int64)

## Apply Feature Engineering

### Higher Education

In [201]:
holdout_df["higher_education"] = holdout_df.education.apply(lambda x: 1 if x == 1 or x ==2 else 0)

### Marital Status

In [202]:
holdout_df["married"] = holdout_df.marital_status.apply(lambda x: 1 if x ==1 else 0)

### Gender

In [203]:
holdout_df.gender = holdout_df.gender.apply(lambda x: 0 if x==1 else 1)

### Bill Carry Over

In [204]:
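# `bills` was redefined in April-to-September order for the bill-change features; restore the order that matches `payments`
bills = ["bill_sep", "bill_aug", "bill_jul", "bill_jun", "bill_may", "bill_apr"]
for bill, payment, carry in zip(bills, payments, carry_overs):
    holdout_df[carry] = holdout_df[bill] - holdout_df[payment]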

### Bill Carry Over Ratio

In [205]:
for carry, ratio in zip(carry_overs, carry_ratios):
    holdout_df[ratio] = holdout_df[carry] / holdout_df.max_credit

### Bill Change between months

In [206]:
changes = ["apr_may_bill_change", "may_jun_bill_change", "jun_jul_bill_change", "jul_aug_bill_change", "aug_sep_bill_change"]
bills = ["bill_apr","bill_may", "bill_jun", "bill_jul", "bill_aug", "bill_sep"]

In [207]:
for change, prev_month, next_month in zip(changes, bills[:-1], bills[1:]):
    holdout_df[change] = holdout_df[next_month] - holdout_df[prev_month] 

In [208]:
holdout_df["average_bill_change"] = (holdout_df["apr_may_bill_change"] + holdout_df["may_jun_bill_change"] + 
                                     holdout_df["jun_jul_bill_change"] + holdout_df["jul_aug_bill_change"] + 
                                     holdout_df["aug_sep_bill_change"]) / 5

### Fully Paid Status

In [209]:
pay_status = ["pay_status_sep", "pay_status_aug", "pay_status_jul", "pay_status_jun",
       "pay_status_may", "pay_status_apr"]
fully_paid = ["full_pay_sep", "full_pay_aug", "full_pay_jul", "full_pay_jun",
       "full_pay__may", "full_pay_apr"]

In [215]:
for pay_stat, full_pay in zip(pay_status, fully_paid):
    holdout_df[full_pay] = holdout_df[pay_stat].apply(lambda x: 1 if x == -2 else 0)

### Total Pay Delay and Habitual Delinquency

How many months of payments are delayed?

In [216]:
holdout_df["pay_status_sum"] = (holdout_df.pay_status_sep + holdout_df.pay_status_aug + 
                                holdout_df.pay_status_jul + holdout_df.pay_status_jun + 
                                holdout_df.pay_status_may + holdout_df.pay_status_apr)
holdout_df["habit_delay"] = holdout_df["pay_status_sum"].apply(lambda x: 1 if x >=6 else 0)

### Export Finished Dataframe

In [218]:
# ensure all feature engineering is applied
# if nothing is printed all feature engineering has been applied
for column in  df.drop(columns="default").columns:
    if column not in holdout_df.columns:
        print(f"{column} not found in holdout_df")
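The loop above only reports training columns that are missing from the holdout frame. As a sketch, `Index.difference` can check both directions at once; the toy frames and their column names below are illustrative only.

```python
import pandas as pd

# Toy frames with only column labels; the real frames have many more columns.
train = pd.DataFrame(columns=["max_credit", "married", "habit_delay", "default"])
holdout = pd.DataFrame(columns=["max_credit", "married"])

# Feature columns expected in the holdout data (everything except the target).
expected_cols = train.columns.drop("default")

# Columns the holdout frame lacks, and columns it has that training does not.
missing = expected_cols.difference(holdout.columns)
extra = holdout.columns.difference(expected_cols)

print("missing from holdout:", list(missing))
print("unexpected in holdout:", list(extra))
```

Once both lists are empty, `holdout[expected_cols]` would also put the holdout columns in the same order as the training features.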

In [219]:
holdout_df.to_csv(r"..\data\holdout_data_eda.csv")