# EDA

## Purpose

## Check Values
## Rename Column
## Save Methods in src

## Column IDs

- Unnamed: 0: Most likely a client ID. Not needed for this project.
- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
- X2: Gender (1 = male; 2 = female).
- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
- X4: Marital status (1 = married; 2 = single; 3 = others).
- X5: Age (year).
- X6-X11: History of past payment, tracked as monthly payment records from April to September 2005. X6 = the repayment status in September 2005; X7 = the repayment status in August 2005; ...; X11 = the repayment status in April 2005. The measurement scale for the repayment status is:
    - -1 = pay duly
    - 1 = payment delay for one month
    - 2 = payment delay for two months
    - ...
    - 8 = payment delay for eight months
    - 9 = payment delay for nine months and above
    - -2 is not part of the documented scale; it probably represents an overpayment or a bill of 0
- X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
- X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.
- Y: Default payment next month. This is the **predictand** (the target variable).
    - 1 = the client defaults on next month's payment
    - 0 = the client does not default
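
The coded categories listed above can be turned into readable labels for inspection. The following is a minimal sketch on a small made-up sample; the dictionary names and the sample values are illustrative assumptions and are not part of the project code.

```python
import pandas as pd

# Label maps transcribed from the column descriptions above.
# The names GENDER_LABELS etc. are hypothetical, chosen for this sketch.
GENDER_LABELS = {1: "male", 2: "female"}
EDUCATION_LABELS = {1: "graduate school", 2: "university",
                    3: "high school", 4: "others"}
MARITAL_LABELS = {1: "married", 2: "single", 3: "others"}

# Tiny made-up sample that uses the raw column names X2-X4.
sample = pd.DataFrame({"X2": [2, 1], "X3": [1, 3], "X4": [2, 1]})

# Series.map looks each code up in the dictionary;
# codes that are missing from the dictionary become NaN.
decoded = pd.DataFrame({
    "gender": sample["X2"].map(GENDER_LABELS),
    "education": sample["X3"].map(EDUCATION_LABELS),
    "marital_status": sample["X4"].map(MARITAL_LABELS),
})
print(decoded)
```

Because unmapped codes come back as NaN, a decode like this also doubles as a quick way to spot values that the documentation does not describe.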

### Data Import

In [24]:
# below allows local import for custom functions in src folder
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import src.cc_pred_tools as tl
    
import pandas as pd
pd.set_option('display.max_columns', 30) #set to show all columns

In [2]:
directory = os.path.join("..", "data", "training_data.csv")  # os.path.join keeps the path portable across Windows, macOS, and Linux
df = pd.read_csv(directory)

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
0,28835,220000,2,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1
1,25329,200000,2,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0
2,18894,180000,2,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0
3,690,80000,1,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,0
4,6239,10000,1,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1


There are no null values in this dataset

#### Column Renaming

In [5]:
df = df.drop(columns="Unnamed: 0") # drop unnecessary columns

In [6]:
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
0,220000,2,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1
1,200000,2,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0
2,180000,2,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0
3,80000,1,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,0
4,10000,1,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1


In [19]:
# new column names, listed in the same order as the columns X1..X23, Y
rename_list = ["max_credit", "gender", "education", "marital_status", "age",
               "pay_status_sep", "pay_status_aug", "pay_status_jul", "pay_status_jun", "pay_status_may", "pay_status_apr",
               "bill_sep", "bill_aug", "bill_jul", "bill_jun", "bill_may", "bill_apr",
               "payment_sep", "payment_aug", "payments_jul", "payment_jun", "payment_may", "payment_apr",
               "default"]
# pair each existing column with its new name by position
col_rename = dict(zip(df.columns, rename_list))

In [20]:
df = df.rename(columns=col_rename)
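
The rename above pairs old and new names purely by position, and `zip` silently stops at the shorter of its inputs, so a missing or extra name would go unnoticed. A minimal sketch of a guarded version follows; the helper name `rename_by_position` and the three-column demo frame are assumptions made for illustration and do not exist in `src`.

```python
import pandas as pd

def rename_by_position(frame, new_names):
    """Rename columns by position, refusing to run on a length mismatch."""
    if len(frame.columns) != len(new_names):
        raise ValueError(
            f"expected {len(frame.columns)} names, got {len(new_names)}"
        )
    return frame.rename(columns=dict(zip(frame.columns, new_names)))

# Made-up frame with three of the raw column names.
demo = pd.DataFrame({"X1": [220000], "X2": [2], "Y": [1]})
renamed = rename_by_position(demo, ["max_credit", "gender", "default"])
print(list(renamed.columns))
```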

#### Null and Unique Values

- Check for null values and impute replacements if necessary
- Check the unique values of each categorical column for errors

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22500 entries, 0 to 22499
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   max_credit      22500 non-null  object
 1   gender          22500 non-null  object
 2   education       22500 non-null  object
 3   marital_status  22500 non-null  object
 4   age             22500 non-null  object
 5   pay_status_sep  22500 non-null  object
 6   pay_status_aug  22500 non-null  object
 7   pay_status_jul  22500 non-null  object
 8   pay_status_jun  22500 non-null  object
 9   pay_status_may  22500 non-null  object
 10  pay_status_apr  22500 non-null  object
 11  bill_sep        22500 non-null  object
 12  bill_aug        22500 non-null  object
 13  bill_jul        22500 non-null  object
 14  bill_jun        22500 non-null  object
 15  bill_may        22500 non-null  object
 16  bill_apr        22500 non-null  object
 17  payment_sep     22500 non-null  object
 18  paymen

There are no null values in this dataset. However, every column has dtype `object`, so all values will need to be converted to integers or floats.

In [33]:
category_col = ["gender", "education", "marital_status",
               "pay_status_sep", "pay_status_aug", "pay_status_jul", "pay_status_jun", "pay_status_may", "pay_status_apr",
               "default"]

In [34]:
for category in category_col:
    print(category, df[category].unique())

gender ['2' '1' 'SEX']
education ['1' '3' '2' '4' '6' '5' '0' 'EDUCATION']
marital_status ['2' '1' '3' '0' 'MARRIAGE']
pay_status_sep ['0' '-1' '-2' '2' '1' '3' '8' '4' '6' '5' '7' 'PAY_0']
pay_status_aug ['0' '-1' '-2' '2' '3' '4' '7' '5' '1' '8' '6' 'PAY_2']
pay_status_jul ['0' '-1' '-2' '2' '3' '4' '6' '5' '7' '8' '1' 'PAY_3']
pay_status_jun ['0' '-1' '-2' '2' '4' '5' '3' '7' '6' '8' '1' 'PAY_4']
pay_status_may ['0' '-1' '-2' '2' '4' '3' '7' '5' '6' '8' 'PAY_5']
pay_status_apr ['0' '-1' '-2' '2' '4' '3' '7' '6' '8' '5' 'PAY_6']
default ['1' '0' 'default payment next month']


There are non-numeric strings among the categorical values. They look like leftover column names from a header row that ended up inside the data. Running the cell below suggests that they all come from a single row.

In [41]:
string_row_df = df[df["gender"] == "SEX"]
string_row_df.head()

Unnamed: 0,max_credit,gender,education,marital_status,age,pay_status_sep,pay_status_aug,pay_status_jul,pay_status_jun,pay_status_may,pay_status_apr,bill_sep,bill_aug,bill_jul,bill_jun,bill_may,bill_apr,payment_sep,payment_aug,payments_jul,payment_jun,payment_may,payment_apr,default
18381,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month


In [46]:
df = df.drop(index=string_row_df.index)  # drop every row matched by the filter (a single row here)
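
The filter above depends on knowing that the stray row contains the literal string "SEX". A more general check, sketched here on a made-up frame, coerces every column to numeric and flags any row where the coercion fails. The frame and variable names are illustrative assumptions. Note that genuine nulls would be flagged as well, which is acceptable here only because the data has none.

```python
import pandas as pd

# Made-up frame that mimics string-typed data with one stray header row.
demo = pd.DataFrame({
    "gender": ["2", "1", "SEX", "1"],
    "age": ["36", "29", "AGE", "27"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
as_numbers = demo.apply(pd.to_numeric, errors="coerce")

# A row is suspicious if at least one of its cells failed to parse.
bad_rows = as_numbers.isna().any(axis=1)
print(demo[bad_rows])

# Keep only the rows that parsed cleanly.
cleaned = demo[~bad_rows]
```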

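Rerunning the unique-value loop, this time with sorted output, confirms that the stray strings are gone. The values are still strings, so they sort lexicographically ('-1' comes before '-2').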

In [51]:
for category in category_col:
    print(category, sorted(df[category].unique()))

gender ['1', '2']
education ['0', '1', '2', '3', '4', '5', '6']
marital_status ['0', '1', '2', '3']
pay_status_sep ['-1', '-2', '0', '1', '2', '3', '4', '5', '6', '7', '8']
pay_status_aug ['-1', '-2', '0', '1', '2', '3', '4', '5', '6', '7', '8']
pay_status_jul ['-1', '-2', '0', '1', '2', '3', '4', '5', '6', '7', '8']
pay_status_jun ['-1', '-2', '0', '1', '2', '3', '4', '5', '6', '7', '8']
pay_status_may ['-1', '-2', '0', '2', '3', '4', '5', '6', '7', '8']
pay_status_apr ['-1', '-2', '0', '2', '3', '4', '5', '6', '7', '8']
default ['0', '1']
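
With the stray row removed, the columns can be converted to numbers and compared against the documented codes. The output above includes education codes 0, 5, and 6 and marital status code 0, none of which appear in the column descriptions. The sketch below shows one way to surface such codes on a made-up frame; the `documented` dictionary is an assumption transcribed from the descriptions, not project code.

```python
import pandas as pd

# Made-up string-typed frame, standing in for the cleaned data.
demo = pd.DataFrame({
    "education": ["1", "3", "0", "6"],
    "marital_status": ["2", "1", "3", "0"],
})

# Convert every column from strings to numbers.
demo = demo.apply(pd.to_numeric)

# Codes listed in the column descriptions (hypothetical helper dict).
documented = {
    "education": {1, 2, 3, 4},
    "marital_status": {1, 2, 3},
}

# For each column, collect the codes that the descriptions do not mention.
undocumented = {
    col: sorted(set(demo[col].unique().tolist()) - allowed)
    for col, allowed in documented.items()
}
print(undocumented)
```

How to treat the undocumented codes (for example, folding them into "others") is a modeling decision that this sketch leaves open.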
