## Data Preprocessing and Training - American Express Default 

- Objective: create a cleaned development dataset by preprocessing the existing data for use in the modeling stage.

- Process: apply a series of operations to 'amex_clean_data.csv'. These may include creating dummy variables for categorical data, applying SMOTE to balance the classes, and standardizing and/or scaling the data to make it suitable for a classification model.


In [11]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [2]:
#loading data to a dataframe
train_df = pd.read_csv('/Users/camilods16/Documents/Project-2-AmEx-Credit-Card-Default-/Project-2-AmEx-Credit-Card-Default-/data/processed/amex_train.csv', sep=',')

In [3]:
# separating dependent variable from predictors to encode categorical data
y = train_df['credit_card_default']
train_df.drop('credit_card_default', axis=1, inplace=True)

In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45528 entries, 0 to 45527
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               45528 non-null  int64  
 1   age                      45528 non-null  int64  
 2   gender                   45528 non-null  object 
 3   owns_car                 45528 non-null  object 
 4   owns_house               45528 non-null  object 
 5   no_of_children           45528 non-null  float64
 6   net_yearly_income        45528 non-null  float64
 7   no_of_days_employed      45528 non-null  float64
 8   occupation_type          45528 non-null  object 
 9   total_family_members     45528 non-null  float64
 10  migrant_worker           45528 non-null  float64
 11  yearly_debt_payments     45433 non-null  float64
 12  credit_limit             45528 non-null  float64
 13  credit_limit_used_pctg   45528 non-null  int64  
 14  credit_score          

In [5]:
train_df.drop('Unnamed: 0', axis=1, inplace=True)
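
The `Unnamed: 0` column is typically what pandas produces when a CSV was written with its row index and then read back without `index_col`. Whether that is how this file was produced is an assumption; the sketch below, on a made-up two-row frame, only illustrates the mechanism and the `index=False` option that avoids it.

```python
import io
import pandas as pd

# Toy frame (hypothetical values, not taken from the AmEx data)
df = pd.DataFrame({"age": [46, 29], "gender": ["F", "M"]})

# Writing with the default index=True stores the row index as an unnamed first column
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# Writing with index=False leaves the index out, so nothing needs dropping later
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))
print(list(without_index.columns))
```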

In [6]:
train_df.head(5)

,age,gender,owns_car,owns_house,no_of_children,net_yearly_income,no_of_days_employed,occupation_type,total_family_members,migrant_worker,yearly_debt_payments,credit_limit,credit_limit_used_pctg,credit_score,prev_defaults,default_in_last_6months,customer_id
0,46,F,N,Y,0.0,107934.04,612.0,Unknown,1.0,1.0,33070.28,18690.93,73,544.0,2,1,CST_115179
1,29,M,N,Y,0.0,109862.62,2771.0,Laborers,2.0,0.0,15329.53,37745.19,52,857.0,0,0,CST_121920
2,37,M,N,Y,0.0,230153.17,204.0,Laborers,2.0,0.0,48416.6,41598.36,43,650.0,0,0,CST_109330
3,39,F,N,Y,0.0,122325.82,11941.0,Core staff,2.0,0.0,22574.36,32627.76,20,754.0,0,0,CST_128288
4,46,M,Y,Y,0.0,387286.0,1459.0,Core staff,1.0,0.0,38282.95,52950.64,75,927.0,0,0,CST_151355


### Categorical Data

Encoding categorical data for classification algorithms. Excluding customer_id, the dataset has only four object-type columns (gender, owns_car, owns_house and occupation_type). I will apply the pandas get_dummies function, since none of these variables is ordinal and it is the most practical option for this project.
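
As a small illustration of what `pd.get_dummies` produces, the sketch below encodes a toy frame whose values are made up for the example. `dtype=int` is passed so the output is 0/1 regardless of pandas version (newer versions default to booleans). The `drop_first=True` variant is shown only as an option, not something this notebook uses: it drops one level per variable, which removes the redundant column for binary features such as `gender`.

```python
import pandas as pd

# Toy data (hypothetical rows, not the AmEx dataset)
toy = pd.DataFrame({
    "gender": ["F", "M", "M"],
    "owns_car": ["N", "N", "Y"],
})

# One column per category level, as in the notebook
full = pd.get_dummies(toy, columns=["gender", "owns_car"],
                      prefix=["gender", "car"], dtype=int)

# Optional variant: drop the first level of each variable
reduced = pd.get_dummies(toy, columns=["gender", "owns_car"],
                         prefix=["gender", "car"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```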

In [7]:
# encoding categorical data with pd.get_dummies
train_df = pd.get_dummies(train_df, columns=['gender', 'owns_car', 'owns_house', 'occupation_type'], prefix=['gender', 'car', 'house', 'occup'])

In [8]:
# checking that the encoding was done properly
train_df.tail(10)

,age,no_of_children,net_yearly_income,no_of_days_employed,total_family_members,migrant_worker,yearly_debt_payments,credit_limit,credit_limit_used_pctg,credit_score,...,occup_Low-skill Laborers,occup_Managers,occup_Medicine staff,occup_Private service staff,occup_Realty agents,occup_Sales staff,occup_Secretaries,occup_Security staff,occup_Unknown,occup_Waiters/barmen staff
45518,54,2.0,252941.68,2266.0,3.0,0.0,37513.38,45775.43,71,677.0,...,0,0,0,0,0,0,0,0,0,0
45519,41,2.0,293494.24,794.0,4.0,0.0,40402.37,83883.37,97,665.0,...,0,0,0,0,0,0,0,0,0,0
45520,47,0.0,291628.76,1677.0,1.0,0.0,15627.73,42980.82,85,915.0,...,0,0,0,0,0,0,0,0,0,0
45521,48,0.0,89435.47,365249.0,2.0,0.0,31233.88,21850.77,36,879.0,...,0,0,0,0,0,0,0,0,1,0
45522,54,1.0,138001.12,161.0,3.0,0.0,16609.04,18565.63,71,893.0,...,0,0,0,0,0,0,0,0,1,0
45523,55,2.0,96207.57,117.0,4.0,0.0,11229.54,29663.83,82,907.0,...,0,0,0,0,0,0,0,0,1,0
45524,31,0.0,383476.74,966.0,2.0,1.0,43369.91,139947.16,32,679.0,...,0,0,0,0,0,0,0,0,0,0
45525,27,0.0,260052.18,1420.0,2.0,0.0,22707.51,83961.83,46,727.0,...,0,0,0,0,0,0,0,0,0,0
45526,32,0.0,157363.04,2457.0,2.0,0.0,20150.1,25538.72,92,805.0,...,0,0,0,0,0,0,0,0,0,0
45527,38,1.0,316896.28,1210.0,3.0,0.0,34603.78,36630.76,26,682.0,...,0,0,0,0,0,0,0,0,1,0


In [9]:
# checking column names
train_df.columns

Index(['age', 'no_of_children', 'net_yearly_income', 'no_of_days_employed',
       'total_family_members', 'migrant_worker', 'yearly_debt_payments',
       'credit_limit', 'credit_limit_used_pctg', 'credit_score',
       'prev_defaults', 'default_in_last_6months', 'customer_id', 'gender_F',
       'gender_M', 'car_N', 'car_Y', 'house_N', 'house_Y', 'occup_Accountants',
       'occup_Cleaning staff', 'occup_Cooking staff', 'occup_Core staff',
       'occup_Drivers', 'occup_HR staff', 'occup_High skill tech staff',
       'occup_IT staff', 'occup_Laborers', 'occup_Low-skill Laborers',
       'occup_Managers', 'occup_Medicine staff', 'occup_Private service staff',
       'occup_Realty agents', 'occup_Sales staff', 'occup_Secretaries',
       'occup_Security staff', 'occup_Unknown', 'occup_Waiters/barmen staff'],
      dtype='object')

In [10]:
#changing one occupation type column's name
train_df.rename(columns={'occup_Waiters/barmen staff':'occup_waiters_barmen staff'}, inplace=True)

### Normalizing Dataset

Numerical variables will be normalized with the MinMaxScaler class from the scikit-learn library. Most of the variables are not normally distributed, which makes MinMaxScaler, a method that does not assume normality, the most appropriate choice for normalizing the data.
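
MinMaxScaler rescales each column independently with x' = (x - min) / (max - min). The sketch below checks that on a handful of ages. The minimum of 23 and maximum of 55 are not stated in the notebook; they are inferred from the scaled output further down (ages 46, 29 and 37 map to 0.71875, 0.1875 and 0.4375), so treat them as an assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Ages from the first rows of the data, plus the assumed column min (23) and max (55)
ages = np.array([[23.0], [29.0], [37.0], [46.0], [55.0]])

scaled = MinMaxScaler().fit_transform(ages)

# Same result by hand: (x - min) / (max - min)
manual = (ages - ages.min()) / (ages.max() - ages.min())

print(scaled.ravel())
```

Because the result depends on each column's extremes, a single very large value compresses everything else toward 0, which is consistent with the small scaled values of `net_yearly_income` and `no_of_days_employed` in the output.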

In [13]:
customer_id = train_df['customer_id']
train_df.drop('customer_id', axis=1, inplace=True)

In [16]:
# making a MinMaxScaler object
scaler = MinMaxScaler()
# fitting the data and transforming
scaled = scaler.fit_transform(train_df)
scaled


array([[7.18750000e-01, 0.00000000e+00, 5.73881709e-04, ...,
        0.00000000e+00, 1.00000000e+00, 0.00000000e+00],
       [1.87500000e-01, 0.00000000e+00, 5.87585643e-04, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [4.37500000e-01, 0.00000000e+00, 1.44233570e-03, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [1.25000000e-01, 0.00000000e+00, 1.65478947e-03, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [2.81250000e-01, 0.00000000e+00, 9.25109968e-04, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [4.68750000e-01, 1.11111111e-01, 2.05870729e-03, ...,
        0.00000000e+00, 1.00000000e+00, 0.00000000e+00]])

In [18]:
# creating a dataframe from scaled data, reusing the column names of train_df
scaled_df = pd.DataFrame(scaled, columns=train_df.columns)

In [19]:
scaled_df.head(10)

,age,no_of_children,net_yearly_income,no_of_days_employed,total_family_members,migrant_worker,yearly_debt_payments,credit_limit,credit_limit_used_pctg,credit_score,...,occup_Low-skill Laborers,occup_Managers,occup_Medicine staff,occup_Private service staff,occup_Realty agents,occup_Sales staff,occup_Secretaries,occup_Security staff,occup_Unknown,occup_waiters_barmen staff
0,0.71875,0.0,0.000574,0.00167,0.0,1.0,0.094615,0.000472,0.737374,0.097996,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.1875,0.0,0.000588,0.007581,0.111111,0.0,0.040175,0.001084,0.525253,0.7951,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.4375,0.0,0.001442,0.000553,0.111111,0.0,0.141708,0.001208,0.434343,0.334076,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.5,0.0,0.000676,0.032687,0.111111,0.0,0.062407,0.00092,0.20202,0.565702,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.71875,0.0,0.002559,0.003989,0.0,0.0,0.110611,0.001573,0.757576,0.951002,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.71875,0.0,0.001603,0.007929,0.111111,1.0,0.106818,0.001164,0.191919,0.973274,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.46875,0.111111,0.001671,0.015165,0.222222,0.0,0.149143,0.001199,0.424242,0.518931,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.71875,0.111111,0.001521,0.003959,0.222222,0.0,0.08522,0.000906,0.919192,0.904232,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.53125,0.0,0.0013,0.031619,0.111111,0.0,0.059177,0.001961,0.141414,0.63029,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.5,0.222222,0.001279,0.007636,0.333333,0.0,0.022314,0.000785,0.141414,0.36971,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Train Test Split 

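The dataset is split into training and test sets so that the model can be evaluated on data it has not seen. Note that the MinMaxScaler above was fit on the full dataset before this split, so the test rows did influence the scaling; fitting the scaler on the training set only would avoid that form of data leakage. Then, I will save the scaled data to a CSV file to be used in the next step of the project, Modeling.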
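
A common way to keep test-set information out of preprocessing is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the arrays are made up, and `stratify=y` is an optional addition not used in this notebook that keeps the class ratio similar across both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and the binary target (not the AmEx data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)

# Split first, with the same test_size and random_state as the notebook
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=246, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both splits
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering, scaled test values can fall slightly outside [0, 1], since the test rows were not used to compute the minimum and maximum.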

In [21]:
# making train and test sets
# every column of scaled_df is used as a feature
features = list(scaled_df.columns)
X = scaled_df[features]

In [22]:
# train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=246)

In [24]:
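# saving the preprocessed (scaled) features
# index=False avoids writing the row index as an extra 'Unnamed: 0' column
path = '/Users/camilods16/Documents/Project-2-AmEx-Credit-Card-Default-/Project-2-AmEx-Credit-Card-Default-/data/processed/processed_data.csv'
scaled_df.to_csv(path, index=False)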