# Exploratory Data Analysis for the Taiwan Default Credit data set 

## Imports 

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## Reading the data 

In [2]:
default_credit_df = pd.read_csv('../data/raw/default_credit_card_clients.csv')

default_credit_df = default_credit_df.set_index('ID').rename(
    columns = {'default payment next month': 'default_payment_next_month'})

-----

## Summary of the data set

This project aims to build a classification model, using a machine learning algorithm, to predict which of Taiwan's credit card clients are likely to default on their accounts. The data set contains Taiwanese credit card data from April to September 2005. It is sourced from the UCI Machine Learning Repository and can be found [here](https://archive-beta.ics.uci.edu/ml/datasets/default+of+credit+card+clients). The data set consists of 23 feature columns and one target column.

There are 30,000 observations in the data set and 23 features. No observation has missing values. The table below shows the number of observations in each class of the target column.

|Default Payment next month = 0|Default Payment next month = 1|
|------|------|
|23364|6636|

Table 1. Counts of observations for each class in the target column.

In [8]:
# class count 
default_credit_df['default_payment_next_month'].value_counts()

0    23364
1     6636
Name: default_payment_next_month, dtype: int64
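
The claim that no observation has missing values can be checked directly. The following is a minimal sketch, assuming a small stand-in frame: `toy_df` and the helper `count_missing` are hypothetical names introduced for illustration, and the values are not taken from the real data. In the notebook, the same calls would be applied to `default_credit_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for default_credit_df; the values are illustrative only.
toy_df = pd.DataFrame(
    {
        "LIMIT_BAL": [20000, 120000, 90000, 50000],
        "AGE": [24, 26, 34, 37],
        "default_payment_next_month": [1, 1, 0, 0],
    }
)


def count_missing(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values in each column (hypothetical helper)."""
    return df.isna().sum()


# Missing values per column, and the number of rows with at least one missing value.
missing_per_column = count_missing(toy_df)
rows_with_missing = int(toy_df.isna().any(axis=1).sum())

print(missing_per_column)
print("Rows with at least one missing value:", rows_with_missing)
```

A total of zero in both results is consistent with the `Non-Null Count` column reported by `info()`.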

In [9]:
# default_credit_df information
default_credit_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 1 to 30000
Data columns (total 24 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   LIMIT_BAL                   30000 non-null  int64
 1   SEX                         30000 non-null  int64
 2   EDUCATION                   30000 non-null  int64
 3   MARRIAGE                    30000 non-null  int64
 4   AGE                         30000 non-null  int64
 5   PAY_0                       30000 non-null  int64
 6   PAY_2                       30000 non-null  int64
 7   PAY_3                       30000 non-null  int64
 8   PAY_4                       30000 non-null  int64
 9   PAY_5                       30000 non-null  int64
 10  PAY_6                       30000 non-null  int64
 11  BILL_AMT1                   30000 non-null  int64
 12  BILL_AMT2                   30000 non-null  int64
 13  BILL_AMT3                   30000 non-null  int64
 14  BILL_A

In [10]:
# default_credit_df column summary
default_credit_df.describe()

| |LIMIT_BAL|SEX|EDUCATION|MARRIAGE|AGE|PAY_0|PAY_2|PAY_3|PAY_4|PAY_5|...|BILL_AMT4|BILL_AMT5|BILL_AMT6|PAY_AMT1|PAY_AMT2|PAY_AMT3|PAY_AMT4|PAY_AMT5|PAY_AMT6|default_payment_next_month|
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
|count|30000.0|30000.0|30000.0|30000.0|30000.0|30000.0|30000.0|30000.0|30000.0|30000.0|...|30000.0|30000.0|30000.0|30000.0|30000.0|30000.0|30000.0|30000.0|30000.0|30000.0|
|mean|167484.322667|1.603733|1.853133|1.551867|35.4855|-0.0167|-0.133767|-0.1662|-0.220667|-0.2662|...|43262.948967|40311.400967|38871.7604|5663.5805|5921.163|5225.6815|4826.076867|4799.387633|5215.502567|0.2212|
|std|129747.661567|0.489129|0.790349|0.52197|9.217904|1.123802|1.197186|1.196868|1.169139|1.133187|...|64332.856134|60797.15577|59554.107537|16563.280354|23040.87|17606.96147|15666.159744|15278.305679|17777.465775|0.415062|
|min|10000.0|1.0|0.0|0.0|21.0|-2.0|-2.0|-2.0|-2.0|-2.0|...|-170000.0|-81334.0|-339603.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|
|25%|50000.0|1.0|1.0|1.0|28.0|-1.0|-1.0|-1.0|-1.0|-1.0|...|2326.75|1763.0|1256.0|1000.0|833.0|390.0|296.0|252.5|117.75|0.0|
|50%|140000.0|2.0|2.0|2.0|34.0|0.0|0.0|0.0|0.0|0.0|...|19052.0|18104.5|17071.0|2100.0|2009.0|1800.0|1500.0|1500.0|1500.0|0.0|
|75%|240000.0|2.0|2.0|2.0|41.0|0.0|0.0|0.0|0.0|0.0|...|54506.0|50190.5|49198.25|5006.0|5000.0|4505.0|4013.25|4031.5|4000.0|0.0|
|max|1000000.0|2.0|6.0|3.0|79.0|8.0|8.0|8.0|8.0|8.0|...|891586.0|927171.0|961664.0|873552.0|1684259.0|896040.0|621000.0|426529.0|528666.0|1.0|


## Partition the data set into training and test sets

Before proceeding further, we split the data set into training and test sets: $80\%$ of the observations go to the training set and $20\%$ to the test set. The `default_of_credit_card_clients` data set has $30{,}000$ observations, so the test set should contain enough examples to give a reliable estimate of the model's performance: the training set will have $24{,}000$ observations and the test set $6{,}000$.

Throughout the analysis, `random_state=123` is used so that the results are reproducible.

Note that no EDA is performed on the test set, because it stands in for unseen data and is reserved for estimating how well the model generalizes.

In [3]:
# splitting the dataset into train and test sets
train_df, test_df = train_test_split(default_credit_df, test_size=0.2, random_state=123)

In [4]:
# printing the number of observations for train and test sets
print('The number of observations for train set: ', train_df['default_payment_next_month'].shape[0])
print('The number of observations for test set: ', test_df['default_payment_next_month'].shape[0])

The number of observations for train set:  24000
The number of observations for test set:  6000
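
The split above does not pass a `stratify` argument, so the class proportions in the two sets can differ slightly from those of the full data. As a hedged illustration only, the sketch below shows how the same `train_test_split` call behaves with `stratify` on a toy frame. `toy_df`, `toy_train`, and `toy_test` are hypothetical names, the toy data merely mimics a 78/22 class ratio, and stratification is an alternative that the notebook itself does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 78/22 class ratio; this is not the real data.
n_rows = 1000
toy_df = pd.DataFrame(
    {
        "feature": np.arange(n_rows),
        "default_payment_next_month": [1] * 220 + [0] * 780,
    }
)

# Same test_size and random_state as in the notebook, plus stratify
# (assumption: stratify is NOT part of the original analysis).
toy_train, toy_test = train_test_split(
    toy_df,
    test_size=0.2,
    random_state=123,
    stratify=toy_df["default_payment_next_month"],
)

# Sizes follow from test_size: 20% of the rows go to the test set.
print("Train rows:", len(toy_train), "Test rows:", len(toy_test))

# With stratification, the share of class 1 is preserved in both sets.
print("Train share of class 1:", toy_train["default_payment_next_month"].mean())
print("Test share of class 1:", toy_test["default_payment_next_month"].mean())
```

Whether stratification is worthwhile here is a judgment call: with 30,000 observations, a purely random split is already likely to keep the class proportions close to those of the full data.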


In [5]:
# percentage of zeros and ones in default column train set
train_percent_defaults = train_df['default_payment_next_month'].value_counts(normalize=True) * 100
train_percent_defaults.name = 'Default Count Percent'

# count of observations where default is one or zero in the train set
train_yes_default = len(train_df[train_df['default_payment_next_month'] == 1])
train_no_default = len(train_df[train_df['default_payment_next_month'] == 0])

In [6]:
# convert to a dataframe and make column names readable
default_percent_df = pd.DataFrame(train_percent_defaults)
default_percent_df = default_percent_df.rename(index = {0: 'No (0)', 1: 'Yes (1)'})

# make a dictionary of count values for each class
count_dic_no = {"Count": train_no_default}

count_dic_yes = {"Count": train_yes_default}

# make a list from classes default payment counts 
list_default = [count_dic_no, count_dic_yes]

# convert to a dataframe
default_count = pd.DataFrame(list_default, index = ['No (0)', 'Yes (1)'])

# join two dataframes
df_default = default_percent_df.join(default_count)
df_default

| |Default Count Percent|Count|
|------|------|------|
|No (0)|77.783333|18668|
|Yes (1)|22.216667|5332|


Both the counts and the percentages show an imbalance between the `No (0)` and `Yes (1)` classes: roughly 78% versus 22%. We need to take this into account later in the analysis; however, the imbalance does not seem severe enough to start with over- or under-sampling. We will therefore begin without any such adjustment. If the confusion matrix or other indicators during tuning show that the model makes many more mistakes on class `1` (the minority class), we will adjust the model accordingly.
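
To make the planned check concrete, here is a minimal sketch of reading per-class error rates from a confusion matrix. The arrays `y_true` and `y_pred` are hypothetical labels and predictions invented for illustration; they are not results from any model in this project.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true labels and predictions, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0])

# Rows are true classes and columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Recall per class: the fraction of each true class that is predicted correctly.
recall_no_default = recall_score(y_true, y_pred, pos_label=0)
recall_default = recall_score(y_true, y_pred, pos_label=1)

print(cm)
print("Recall for class 0:", recall_no_default)
print("Recall for class 1:", recall_default)
```

In this made-up example the overall accuracy is 70%, yet only one of the three true defaults is caught, which is the kind of gap between classes that would prompt an adjustment. Possible adjustments include the `class_weight` parameter that many scikit-learn classifiers accept, or resampling the training set; which one fits best here is left open.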

## Exploratory analysis on the training data set