# EDA

## Purpose
The purpose of this notebook is to perform exploratory data analysis (EDA) on the training dataset and to clean and prepare it for classification model training. First, the data is examined for errors and null values, and corrections are inferred from the majority of the data. Second, dummy variables are created for binary categorical data. Last, the same cleaning and feature engineering are applied to the holdout (test) data. Both datasets are saved as new CSV files.

## Data Correction and Feature Engineering Summary
1. Education: merge all values other than 1, 2, or 3 into 4 ("others")
2. Marital Status: merge all values other than 1, 2, or 3 into 3 ("others")
    - Feature engineer this as a dummy variable `married` (1 = True, 0 = False)
3. History of Payment (pay_status_month): handle the undocumented values -2 and 0 by mapping every value at or below 0 (including -1, pay duly) to 0, which completes the ordinal sequence
4. Carry-over bill per month
5. Carry-over bill ratio per month


## Column IDs

- Unnamed: 0: Most likely a client ID. Not needed for this project.
- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
- X2: Gender (1 = male; 2 = female).
- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
- X4: Marital status (1 = married; 2 = single; 3 = others).
- X5: Age (year).
- X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
    - -2 is not part of the documented scale; it probably represents an overpayment or a bill of 0
- X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
- X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.
- Y: Default payment next month. This is the **predictand**.
    - 1 = the client defaults on next month's payment
    - 0 = the client does not default

## Data Import and Clean

In [33]:
# below allows local import for custom functions in src folder
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import src.cc_pred_tools as tl
    
import pandas as pd
pd.set_option('display.max_columns', 100) #set to show all columns

import matplotlib
import seaborn as sns
%matplotlib inline

In [2]:
directory = r"..\data\training_data.csv"  # Windows-style path; change the separators when working on macOS or Linux
df = pd.read_csv(directory)
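
The backslash path above only resolves on Windows. A minimal sketch of a portable alternative, assuming the same `../data/training_data.csv` layout, builds the path with `pathlib` (the name `data_path` is illustrative and not part of the notebook):

```python
from pathlib import Path

# Build the path from its parts so the separator matches the current OS.
data_path = Path("..") / "data" / "training_data.csv"

# pandas accepts Path objects directly, e.g. pd.read_csv(data_path).
print(data_path.as_posix())  # ../data/training_data.csv
```

The same pattern would apply to the holdout file and to the output CSVs.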

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
0,28835,220000,2,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1
1,25329,200000,2,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0
2,18894,180000,2,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0
3,690,80000,1,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,0
4,6239,10000,1,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1


There are no null values in this dataset

#### Column Renaming

In [4]:
df = df.drop(columns="Unnamed: 0") # drop unnecessary column

In [5]:
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
0,220000,2,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1
1,200000,2,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0
2,180000,2,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0
3,80000,1,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,0
4,10000,1,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1


In [6]:
rename_list = ["max_credit", "gender", "education", "marital_status", "age",
               "pay_status_sep", "pay_status_aug", "pay_status_jul", "pay_status_jun", "pay_status_may", "pay_status_apr",
               "bill_sep", "bill_aug", "bill_jul", "bill_jun", "bill_may", "bill_apr",
               "payment_sep", "payment_aug", "payments_jul", "payment_jun", "payment_may", "payment_apr",
                "default"]
col_rename = dict(zip(df.columns,rename_list))

In [7]:
df = df.rename(columns=col_rename)

#### Null and Unique Values

- Check for null values and infer replacement values if necessary
- Check the unique values of the categorical columns for errors

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22500 entries, 0 to 22499
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   max_credit      22500 non-null  object
 1   gender          22500 non-null  object
 2   education       22500 non-null  object
 3   marital_status  22500 non-null  object
 4   age             22500 non-null  object
 5   pay_status_sep  22500 non-null  object
 6   pay_status_aug  22500 non-null  object
 7   pay_status_jul  22500 non-null  object
 8   pay_status_jun  22500 non-null  object
 9   pay_status_may  22500 non-null  object
 10  pay_status_apr  22500 non-null  object
 11  bill_sep        22500 non-null  object
 12  bill_aug        22500 non-null  object
 13  bill_jul        22500 non-null  object
 14  bill_jun        22500 non-null  object
 15  bill_may        22500 non-null  object
 16  bill_apr        22500 non-null  object
 17  payment_sep     22500 non-null  object
 18  paymen

There are no null values in this dataset. However, the columns are stored as `object`, and checking the unique values shows that the data consists of integers in string form along with stray column names from a repeated header.

In [9]:
category_col = ["gender", "education", "marital_status",
               "pay_status_sep", "pay_status_aug", "pay_status_jul", "pay_status_jun", "pay_status_may", "pay_status_apr",
               "default"]

In [10]:
for category in category_col:
    print(category, df[category].unique())

gender ['2' '1' 'SEX']
education ['1' '3' '2' '4' '6' '5' '0' 'EDUCATION']
marital_status ['2' '1' '3' '0' 'MARRIAGE']
pay_status_sep ['0' '-1' '-2' '2' '1' '3' '8' '4' '6' '5' '7' 'PAY_0']
pay_status_aug ['0' '-1' '-2' '2' '3' '4' '7' '5' '1' '8' '6' 'PAY_2']
pay_status_jul ['0' '-1' '-2' '2' '3' '4' '6' '5' '7' '8' '1' 'PAY_3']
pay_status_jun ['0' '-1' '-2' '2' '4' '5' '3' '7' '6' '8' '1' 'PAY_4']
pay_status_may ['0' '-1' '-2' '2' '4' '3' '7' '5' '6' '8' 'PAY_5']
pay_status_apr ['0' '-1' '-2' '2' '4' '3' '7' '6' '8' '5' 'PAY_6']
default ['1' '0' 'default payment next month']


Filtering for the rows that contain these header strings returns a single row.

In [11]:
string_row_df = df[df["gender"] == "SEX"]
string_row_df.head()

Unnamed: 0,max_credit,gender,education,marital_status,age,pay_status_sep,pay_status_aug,pay_status_jul,pay_status_jun,pay_status_may,pay_status_apr,bill_sep,bill_aug,bill_jul,bill_jun,bill_may,bill_apr,payment_sep,payment_aug,payments_jul,payment_jun,payment_may,payment_apr,default
18381,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month


This row can be dropped.

In [12]:
df = df.drop(index=string_row_df.index[0])

Running the for loop for unique values again shows that the string values are cleaned up.

In [13]:
for category in category_col:
    print(category, sorted(df[category].unique()))

gender ['1', '2']
education ['0', '1', '2', '3', '4', '5', '6']
marital_status ['0', '1', '2', '3']
pay_status_sep ['-1', '-2', '0', '1', '2', '3', '4', '5', '6', '7', '8']
pay_status_aug ['-1', '-2', '0', '1', '2', '3', '4', '5', '6', '7', '8']
pay_status_jul ['-1', '-2', '0', '1', '2', '3', '4', '5', '6', '7', '8']
pay_status_jun ['-1', '-2', '0', '1', '2', '3', '4', '5', '6', '7', '8']
pay_status_may ['-1', '-2', '0', '2', '3', '4', '5', '6', '7', '8']
pay_status_apr ['-1', '-2', '0', '2', '3', '4', '5', '6', '7', '8']
default ['0', '1']
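
The filter above works because the stray row is known to contain "SEX" in the gender column. As a hedged alternative that does not depend on knowing the header text, rows with any non-numeric cell can be located with `pd.to_numeric(..., errors="coerce")`. The toy frame below is made up for illustration and only mimics the problem.

```python
import pandas as pd

# Toy frame (illustrative values) with a header row repeated inside the data.
toy = pd.DataFrame({
    "gender": ["2", "1", "SEX", "2"],
    "education": ["1", "3", "EDUCATION", "2"],
})

# Coerce every column to numeric; cells that cannot be parsed become NaN.
numeric = toy.apply(pd.to_numeric, errors="coerce")

# Rows containing at least one NaN were not fully numeric in the source.
bad_rows = numeric.index[numeric.isna().any(axis=1)]
print(bad_rows.tolist())  # [2]

# Drop the offending rows and cast the remainder to int64.
clean = numeric.drop(index=bad_rows).astype("int64")
print(clean["gender"].tolist())  # [2, 1, 2]
```

This approach would also flag genuinely missing values, so the flagged rows are worth inspecting before dropping them.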


A few values fall outside the documented definitions:
- The payment status columns contain -2 and 0, which might mean that the previous month's balance was 0 or that it was overpaid
- Education contains 0, 5, and 6, which might indicate no education or a level above a master's degree
- Marital status contains 0, which is also undefined

To filter out these errors, all values must first be converted to integers.

### Change all values to int64

In [14]:
df = df.astype("int64")
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22499 entries, 0 to 22499
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   max_credit      22499 non-null  int64
 1   gender          22499 non-null  int64
 2   education       22499 non-null  int64
 3   marital_status  22499 non-null  int64
 4   age             22499 non-null  int64
 5   pay_status_sep  22499 non-null  int64
 6   pay_status_aug  22499 non-null  int64
 7   pay_status_jul  22499 non-null  int64
 8   pay_status_jun  22499 non-null  int64
 9   pay_status_may  22499 non-null  int64
 10  pay_status_apr  22499 non-null  int64
 11  bill_sep        22499 non-null  int64
 12  bill_aug        22499 non-null  int64
 13  bill_jul        22499 non-null  int64
 14  bill_jun        22499 non-null  int64
 15  bill_may        22499 non-null  int64
 16  bill_apr        22499 non-null  int64
 17  payment_sep     22499 non-null  int64
 18  payment_aug     22499 non-

### Fixing Errors in Education
Changing 0, 5, 6 in education to 4

In [15]:
df.groupby("education").age.count()

education
0       11
1     7919
2    10516
3     3713
4       90
5      208
6       42
Name: age, dtype: int64

Education levels 1, 2, and 3 each have a substantial number of observations. The values 0, 5, and 6 are rare and undefined, so they are merged into 4 ("others"), which is itself sparsely populated.

In [16]:
df.education = df.education.apply(tl.clean_education_col)

In [17]:
df.groupby("education").age.count()

education
1     7919
2    10516
3     3713
4      351
Name: age, dtype: int64
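
`tl.clean_education_col` lives in the `src` folder and is not shown in this notebook. Judging from the before and after counts above, it presumably behaves like the sketch below. The function name and body here are assumptions for illustration, not the project's actual code.

```python
def clean_education_col_sketch(value):
    """Hypothetical stand-in for tl.clean_education_col.

    Keeps the documented codes 1, 2 and 3 and maps everything else
    (0, 4, 5, 6) to 4 ("others").
    """
    return value if value in (1, 2, 3) else 4


# Counts taken from the first groupby output above.
before = {0: 11, 1: 7919, 2: 10516, 3: 3713, 4: 90, 5: 208, 6: 42}

# Re-aggregate the counts under the cleaned codes.
after = {}
for code, count in before.items():
    new_code = clean_education_col_sketch(code)
    after[new_code] = after.get(new_code, 0) + count

print(after[4])  # 351
```

The merged count for code 4 (11 + 90 + 208 + 42 = 351) matches the second groupby output, which is consistent with this reading of the helper.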

### Errors in Marital Status
As with the education column, the unexpected value is rare (44 rows with a value of 0). Therefore, all values of 0 will be changed to 3 ("others").

In [18]:
df.groupby("marital_status").age.count()

marital_status
0       44
1    10195
2    12026
3      234
Name: age, dtype: int64

In [19]:
df.marital_status = df.marital_status.apply(tl.clean_marital_status)

In [20]:
df.groupby("marital_status").age.count()

marital_status
1    10195
2    12026
3      278
Name: age, dtype: int64
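
`tl.clean_marital_status` is also defined outside this notebook. The sketch below shows one plausible implementation, plus a vectorized equivalent using `Series.where`. Both are assumptions based on the counts above (44 + 234 = 278), and the sample values are made up.

```python
import pandas as pd


def clean_marital_status_sketch(value):
    """Hypothetical stand-in for tl.clean_marital_status: keep 1 and 2, send the rest to 3."""
    return value if value in (1, 2) else 3


status = pd.Series([0, 1, 2, 3, 0, 2])

# Element-wise version, mirroring the .apply() call used in the notebook.
cleaned = status.apply(clean_marital_status_sketch)
print(cleaned.tolist())  # [3, 1, 2, 3, 3, 2]

# Vectorized version: keep values where the condition holds, otherwise use 3.
vectorized = status.where(status.isin([1, 2]), 3)
print(vectorized.tolist())  # [3, 1, 2, 3, 3, 2]
```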

## Feature Engineering

### Married Feature
This column is derived from the marital_status column: it is 1 where marital_status is 1 (married) and 0 for all other values.

In [27]:
df["married"] = df.marital_status.apply(lambda x: 1 if x ==1 else 0)

In [28]:
df.head()

Unnamed: 0,max_credit,gender,education,marital_status,age,pay_status_sep,pay_status_aug,pay_status_jul,pay_status_jun,pay_status_may,pay_status_apr,bill_sep,bill_aug,bill_jul,bill_jun,bill_may,bill_apr,payment_sep,payment_aug,payments_jul,payment_jun,payment_may,payment_apr,default,married
0,220000,1,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1,0
1,200000,1,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0,0
2,180000,1,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,80000,0,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,0,0
4,10000,0,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1,0
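
For reference, the same dummy variable can be built without a lambda. This is a small sketch on made-up values, not a change to the notebook's method:

```python
import pandas as pd

marital_status = pd.Series([2, 1, 3, 1])

# eq(1) yields a boolean Series; casting to int64 gives the 0/1 dummy.
married = marital_status.eq(1).astype("int64")
print(married.tolist())  # [0, 1, 0, 1]
```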


### Update gender column
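Recode this column from 1 (male) and 2 (female) to 0 (male) and 1 (female). The lambda assumes the original 1/2 coding and is not idempotent: running the cell again on already recoded data swaps the labels.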

In [29]:
df.gender = df.gender.apply(lambda x: 0 if x==1 else 1)

In [30]:
df.head()

Unnamed: 0,max_credit,gender,education,marital_status,age,pay_status_sep,pay_status_aug,pay_status_jul,pay_status_jun,pay_status_may,pay_status_apr,bill_sep,bill_aug,bill_jul,bill_jun,bill_may,bill_apr,payment_sep,payment_aug,payments_jul,payment_jun,payment_may,payment_apr,default,married
0,220000,0,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1,0
1,200000,0,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0,0
2,180000,0,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,80000,1,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,0,0
4,10000,1,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1,0
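
Because `lambda x: 0 if x==1 else 1` gives a different result each time it is applied, re-executing the cell silently inverts the coding. One hedged way to guard against that is an explicit mapping from the raw codes that fails loudly on anything else. The helper name `recode_gender` and the constant `RAW_TO_BINARY` are hypothetical.

```python
import pandas as pd

RAW_TO_BINARY = {1: 0, 2: 1}  # raw code -> 0 (male) / 1 (female)


def recode_gender(raw):
    """Map raw gender codes to 0/1; raise if any value is not a raw code."""
    recoded = raw.map(RAW_TO_BINARY)  # unmapped values become NaN
    if recoded.isna().any():
        raise ValueError("gender contains values other than 1 or 2")
    return recoded.astype("int64")


raw = pd.Series([2, 2, 2, 1, 1])
once = recode_gender(raw)
print(once.tolist())  # [1, 1, 1, 0, 0]

try:
    recode_gender(once)  # a second application is rejected instead of flipping
except ValueError as err:
    print("refused:", err)
```

With this guard, a repeated run stops with an error rather than producing a column whose meaning differs between the training and holdout files.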


### Bill Carry Over
This feature is the amount of each month's bill that carries over, calculated as `bill_month` - `payment_month`.

In [31]:
bills = ["bill_sep", "bill_aug", "bill_jul", "bill_jun", "bill_may", "bill_apr"]
payments = ["payment_sep", "payment_aug", "payments_jul", "payment_jun", "payment_may", "payment_apr"]
carry_overs = ["carry_sep", "carry_aug", "carry_jul", "carry_jun", "carry_may", "carry_apr"]

for bill, payment, carry in zip(bills, payments, carry_overs):
    df[carry] = df[bill] - df[payment]

In [34]:
df.head()

Unnamed: 0,max_credit,gender,education,marital_status,age,pay_status_sep,pay_status_aug,pay_status_jul,pay_status_jun,pay_status_may,pay_status_apr,bill_sep,bill_aug,bill_jul,bill_jun,bill_may,bill_apr,payment_sep,payment_aug,payments_jul,payment_jun,payment_may,payment_apr,default,married,carry_sep,carry_aug,carry_jul,carry_jun,carry_may,carry_apr
0,220000,0,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1,0,212598,214150,207779,215187,170872,40826
1,200000,0,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0,0,0,0,0,0,0,0
2,180000,0,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,80000,1,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,0,0,49519,50172,46071,42334,40768,41027
4,10000,1,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1,0,6257,6895,4278,5144,2339,1697
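
As a worked check of the formula, the sketch below recomputes the September and August carry-over for the first and last rows of the output above. It subtracts whole blocks of columns at once instead of looping; the frame and variable names are illustrative.

```python
import pandas as pd

# First and last rows of the head() output above (September and August only).
sample = pd.DataFrame({
    "bill_sep": [222598, 8257],
    "bill_aug": [222168, 7995],
    "payment_sep": [10000, 2000],
    "payment_aug": [8018, 1100],
})

bill_cols = ["bill_sep", "bill_aug"]
payment_cols = ["payment_sep", "payment_aug"]
carry_cols = ["carry_sep", "carry_aug"]

# Subtract on the underlying arrays so columns pair up by position, not by name.
carry = pd.DataFrame(
    sample[bill_cols].to_numpy() - sample[payment_cols].to_numpy(),
    columns=carry_cols,
    index=sample.index,
)
print(carry.values.tolist())  # [[212598, 214150], [6257, 6895]]
```

The results agree with the `carry_sep` and `carry_aug` values shown in the table.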


### Carry Over Ratio
This feature expresses each month's carry-over bill as a fraction of max_credit.

In [35]:
carry_ratios = ["carry_ratio_sep", "carry_ratio_aug", "carry_ratio_jul", "carry_ratio_jun", "carry_ration_may", "carry_ratio_apr"]

for carry, ratio in zip(carry_overs, carry_ratios):
    df[ratio] = df[carry] / df.max_credit

In [38]:
df.head()

Unnamed: 0,max_credit,gender,education,marital_status,age,pay_status_sep,pay_status_aug,pay_status_jul,pay_status_jun,pay_status_may,pay_status_apr,bill_sep,bill_aug,bill_jul,bill_jun,bill_may,bill_apr,payment_sep,payment_aug,payments_jul,payment_jun,payment_may,payment_apr,default,married,carry_sep,carry_aug,carry_jul,carry_jun,carry_may,carry_apr,carry_ratio_sep,carry_ratio_aug,carry_ratio_jul,carry_ratio_jun,carry_ration_may,carry_ratio_apr
0,220000,0,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1,0,212598,214150,207779,215187,170872,40826,0.966355,0.973409,0.94445,0.978123,0.776691,0.185573
1,200000,0,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,180000,0,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,80000,1,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,0,0,49519,50172,46071,42334,40768,41027,0.618988,0.62715,0.575887,0.529175,0.5096,0.512837
4,10000,1,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1,0,6257,6895,4278,5144,2339,1697,0.6257,0.6895,0.4278,0.5144,0.2339,0.1697
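
A short worked example of the ratio, using the September values of the first and last rows above (the frame name `sample` is illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    "max_credit": [220000, 10000],
    "carry_sep": [212598, 6257],
})

# Carry-over as a fraction of the credit limit.
sample["carry_ratio_sep"] = sample["carry_sep"] / sample["max_credit"]
print(sample["carry_ratio_sep"].round(6).tolist())  # [0.966355, 0.6257]
```

A value near 1 means the unpaid balance is close to the credit limit.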


In [41]:
df.groupby("default").age.count()

default
0    17471
1     5028
Name: age, dtype: int64

### Fix pay_status

Convert negative values to 0 so that the payment status forms a proper ordinal sequence starting at 0.

In [71]:
pay_status = ['pay_status_sep', 'pay_status_aug', 'pay_status_jul', 'pay_status_jun',
       'pay_status_may', 'pay_status_apr']

In [72]:
for pay_stat in pay_status:
    # apply() returns a new Series, so assign the result back to the column
    df[pay_stat] = df[pay_stat].apply(lambda x: 0 if x <= 0 else x)
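
`Series.apply` returns a new Series and leaves the original column untouched, so the result only takes effect when it is assigned back. A vectorized sketch of the same transformation, on made-up values, uses `clip(lower=0)`:

```python
import pandas as pd

pay = pd.DataFrame({
    "pay_status_sep": [-2, -1, 0, 2],
    "pay_status_aug": [0, 3, -1, 8],
})
cols = ["pay_status_sep", "pay_status_aug"]

# clip(lower=0) raises every negative value to 0 and leaves the rest unchanged.
pay[cols] = pay[cols].clip(lower=0)

print(pay["pay_status_sep"].tolist())  # [0, 0, 0, 2]
print(pay["pay_status_aug"].tolist())  # [0, 3, 0, 8]
```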

In [73]:
new_columns = ['default', 'max_credit', 'gender', 'education', 'marital_status', 'married','age',
       'pay_status_sep', 'pay_status_aug', 'pay_status_jul', 'pay_status_jun',
       'pay_status_may', 'pay_status_apr', 'bill_sep', 'bill_aug', 'bill_jul',
       'bill_jun', 'bill_may', 'bill_apr', 'payment_sep', 'payment_aug',
       'payments_jul', 'payment_jun', 'payment_may', 'payment_apr', 
        'carry_sep', 'carry_aug', 'carry_jul', 'carry_jun',
       'carry_may', 'carry_apr', 'carry_ratio_sep', 'carry_ratio_aug',
       'carry_ratio_jul', 'carry_ratio_jun', 'carry_ration_may',
       'carry_ratio_apr']

In [74]:
df = df[new_columns]

In [75]:
df.to_csv(r"..\data\training_data_eda.csv")
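
By default `to_csv` also writes the index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below demonstrates this with an in-memory buffer and made-up rows. Whether to pass `index=False` here depends on what the downstream notebooks expect, so treat it as an option rather than a required change.

```python
import io

import pandas as pd

frame = pd.DataFrame({"default": [1, 0], "max_credit": [220000, 200000]})

# Round-trip through CSV with and without the index column.
with_index = pd.read_csv(io.StringIO(frame.to_csv()))
without_index = pd.read_csv(io.StringIO(frame.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'default', 'max_credit']
print(list(without_index.columns))  # ['default', 'max_credit']
```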

***
# Holdout Data

All cleaning and feature engineering will be performed on holdout data as well. 
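
One way to keep the two datasets in step is to put the shared steps in a single function and call it on both frames. The sketch below is an assumption about how that could look, not code from this project. It covers only the `married` dummy and the pay_status fix, which has no corresponding cell in the holdout section below. The helper name `engineer_features` and the toy frames are hypothetical.

```python
import pandas as pd

PAY_STATUS_COLS = ["pay_status_sep", "pay_status_aug", "pay_status_jul",
                   "pay_status_jun", "pay_status_may", "pay_status_apr"]


def engineer_features(frame):
    """Hypothetical helper: apply shared feature steps to an already renamed frame."""
    out = frame.copy()  # leave the caller's frame untouched
    out["married"] = out["marital_status"].eq(1).astype("int64")
    # Only touch the pay_status columns that are present in this frame.
    present = [col for col in PAY_STATUS_COLS if col in out.columns]
    out[present] = out[present].clip(lower=0)
    return out


# Toy frames standing in for the training and holdout data.
train_toy = pd.DataFrame({"marital_status": [1, 2], "pay_status_sep": [-2, 3]})
holdout_toy = pd.DataFrame({"marital_status": [3, 1], "pay_status_sep": [-1, 0]})

train_out = engineer_features(train_toy)
holdout_out = engineer_features(holdout_toy)

print(train_out["pay_status_sep"].tolist())  # [0, 3]
print(holdout_out["married"].tolist())       # [0, 1]
```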

In [48]:
holdout_df = pd.read_csv(r"..\data\holdout_data.csv")

In [49]:
holdout_df.head()

Unnamed: 0.1,Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
0,5501,180000,2,2,1,44,0,0,0,0,0,0,161186,167080,170788,174764,162667,166953,10000,8000,7000,6000,7000,10000
1,28857,130000,2,2,1,48,-2,-2,-2,-2,-2,-2,0,1240,1487,1279,749,440,1240,1487,1279,749,440,849
2,11272,60000,2,1,1,43,-1,3,2,0,0,-1,495,330,495,330,165,340,0,330,0,0,340,0
3,8206,240000,1,1,1,42,0,0,0,0,0,0,72339,91045,91027,51508,51127,0,20000,2213,1030,1023,6790,10893
4,6362,100000,2,2,1,28,2,0,0,0,0,2,73073,74739,70844,63924,57326,59654,3500,3003,1910,2400,3300,0


In [50]:
holdout_df = holdout_df.drop(columns="Unnamed: 0")

In [51]:
holdout_df.info() #no need to update to int64!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7501 entries, 0 to 7500
Data columns (total 23 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   X1      7501 non-null   int64
 1   X2      7501 non-null   int64
 2   X3      7501 non-null   int64
 3   X4      7501 non-null   int64
 4   X5      7501 non-null   int64
 5   X6      7501 non-null   int64
 6   X7      7501 non-null   int64
 7   X8      7501 non-null   int64
 8   X9      7501 non-null   int64
 9   X10     7501 non-null   int64
 10  X11     7501 non-null   int64
 11  X12     7501 non-null   int64
 12  X13     7501 non-null   int64
 13  X14     7501 non-null   int64
 14  X15     7501 non-null   int64
 15  X16     7501 non-null   int64
 16  X17     7501 non-null   int64
 17  X18     7501 non-null   int64
 18  X19     7501 non-null   int64
 19  X20     7501 non-null   int64
 20  X21     7501 non-null   int64
 21  X22     7501 non-null   int64
 22  X23     7501 non-null   int64
dtypes: int64(23)


In [52]:
holdout_df = holdout_df.rename(columns=col_rename)

In [53]:
holdout_df.head()

Unnamed: 0,max_credit,gender,education,marital_status,age,pay_status_sep,pay_status_aug,pay_status_jul,pay_status_jun,pay_status_may,pay_status_apr,bill_sep,bill_aug,bill_jul,bill_jun,bill_may,bill_apr,payment_sep,payment_aug,payments_jul,payment_jun,payment_may,payment_apr
0,180000,2,2,1,44,0,0,0,0,0,0,161186,167080,170788,174764,162667,166953,10000,8000,7000,6000,7000,10000
1,130000,2,2,1,48,-2,-2,-2,-2,-2,-2,0,1240,1487,1279,749,440,1240,1487,1279,749,440,849
2,60000,2,1,1,43,-1,3,2,0,0,-1,495,330,495,330,165,340,0,330,0,0,340,0
3,240000,1,1,1,42,0,0,0,0,0,0,72339,91045,91027,51508,51127,0,20000,2213,1030,1023,6790,10893
4,100000,2,2,1,28,2,0,0,0,0,2,73073,74739,70844,63924,57326,59654,3500,3003,1910,2400,3300,0


### Education

In [54]:
holdout_df.education.unique()

array([2, 1, 3, 4, 5, 6, 0], dtype=int64)

In [55]:
holdout_df.education = holdout_df.education.apply(tl.clean_education_col)

In [56]:
holdout_df.education.unique()

array([2, 1, 3, 4], dtype=int64)

### Marital Status

In [57]:
holdout_df.marital_status.unique()

array([1, 2, 3, 0], dtype=int64)

In [58]:
holdout_df.marital_status = holdout_df.marital_status.apply(tl.clean_marital_status)

In [59]:
holdout_df.marital_status.unique()

array([1, 2, 3], dtype=int64)

In [60]:
holdout_df["married"] = holdout_df.marital_status.apply(lambda x: 1 if x ==1 else 0)

### Gender

In [61]:
holdout_df.gender = holdout_df.gender.apply(lambda x: 0 if x==1 else 1)

In [62]:
holdout_df.head()

Unnamed: 0,max_credit,gender,education,marital_status,age,pay_status_sep,pay_status_aug,pay_status_jul,pay_status_jun,pay_status_may,pay_status_apr,bill_sep,bill_aug,bill_jul,bill_jun,bill_may,bill_apr,payment_sep,payment_aug,payments_jul,payment_jun,payment_may,payment_apr,married
0,180000,1,2,1,44,0,0,0,0,0,0,161186,167080,170788,174764,162667,166953,10000,8000,7000,6000,7000,10000,1
1,130000,1,2,1,48,-2,-2,-2,-2,-2,-2,0,1240,1487,1279,749,440,1240,1487,1279,749,440,849,1
2,60000,1,1,1,43,-1,3,2,0,0,-1,495,330,495,330,165,340,0,330,0,0,340,0,1
3,240000,0,1,1,42,0,0,0,0,0,0,72339,91045,91027,51508,51127,0,20000,2213,1030,1023,6790,10893,1
4,100000,1,2,1,28,2,0,0,0,0,2,73073,74739,70844,63924,57326,59654,3500,3003,1910,2400,3300,0,1


### Bill Carry Over

In [65]:
for bill, payment, carry in zip(bills, payments, carry_overs):
    holdout_df[carry] = holdout_df[bill] - holdout_df[payment]

### Bill Carry Over Ratio

In [67]:
for carry, ratio in zip(carry_overs, carry_ratios):
    holdout_df[ratio] = holdout_df[carry] / holdout_df.max_credit

In [68]:
holdout_df.head()

Unnamed: 0,max_credit,gender,education,marital_status,age,pay_status_sep,pay_status_aug,pay_status_jul,pay_status_jun,pay_status_may,pay_status_apr,bill_sep,bill_aug,bill_jul,bill_jun,bill_may,bill_apr,payment_sep,payment_aug,payments_jul,payment_jun,payment_may,payment_apr,married,carry_sep,carry_aug,carry_jul,carry_jun,carry_may,carry_apr,carry_ratio_sep,carry_ratio_aug,carry_ratio_jul,carry_ratio_jun,carry_ration_may,carry_ratio_apr
0,180000,1,2,1,44,0,0,0,0,0,0,161186,167080,170788,174764,162667,166953,10000,8000,7000,6000,7000,10000,1,151186,159080,163788,168764,155667,156953,0.839922,0.883778,0.909933,0.937578,0.864817,0.871961
1,130000,1,2,1,48,-2,-2,-2,-2,-2,-2,0,1240,1487,1279,749,440,1240,1487,1279,749,440,849,1,-1240,-247,208,530,309,-409,-0.009538,-0.0019,0.0016,0.004077,0.002377,-0.003146
2,60000,1,1,1,43,-1,3,2,0,0,-1,495,330,495,330,165,340,0,330,0,0,340,0,1,495,0,495,330,-175,340,0.00825,0.0,0.00825,0.0055,-0.002917,0.005667
3,240000,0,1,1,42,0,0,0,0,0,0,72339,91045,91027,51508,51127,0,20000,2213,1030,1023,6790,10893,1,52339,88832,89997,50485,44337,-10893,0.218079,0.370133,0.374987,0.210354,0.184737,-0.045387
4,100000,1,2,1,28,2,0,0,0,0,2,73073,74739,70844,63924,57326,59654,3500,3003,1910,2400,3300,0,1,69573,71736,68934,61524,54026,59654,0.69573,0.71736,0.68934,0.61524,0.54026,0.59654


In [69]:
holdout_df.to_csv(r"..\data\holdout_data_eda.csv")