## Data Preprocessing and Training - American Express Default 

- Objective: create a cleaned development dataset by preprocessing the existing data to then be used in the modeling stage.

- Process: perform different operations on the 'amex_clean_data.csv'. These operations may include: creating dummy variables for binary categorical data, SMOTE to balance classes, standarizing and/or scaling the data to make it fit for a classification model. 


In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

In [3]:
#loading data to a dataframe
train_df = pd.read_csv('/Users/camilods16/Documents/Project-2-AmEx-Credit-Card-Default-/Project-2-AmEx-Credit-Card-Default-/data/processed/amex_train.csv', sep=',')

In [4]:
train_df.columns

Index(['Unnamed: 0', 'age', 'gender', 'owns_car', 'owns_house',
       'no_of_children', 'net_yearly_income', 'no_of_days_employed',
       'occupation_type', 'total_family_members', 'migrant_worker',
       'yearly_debt_payments', 'credit_limit', 'credit_limit_used_pctg',
       'credit_score', 'prev_defaults', 'default_in_last_6months',
       'credit_card_default', 'customer_id'],
      dtype='object')

In [5]:
train_df.drop(columns='Unnamed: 0', axis=1, inplace=True)

### Train Test Split 

The dataset is split on train and test sets to prevent data leakage. Then, I will save the data to a csv file to be used in the next step of the project - Modeling. 

In [6]:
# separating dependent variable from predictors to encode categorical data
y = train_df.credit_card_default
customer_id = train_df[['customer_id']]
train_df.drop(columns=['credit_card_default', 'customer_id'], axis=1, inplace=True)

In [7]:
# checking shape
y.shape

(45528,)

In [8]:
# making train and test sets
features = ['age', 'gender', 'owns_car', 'owns_house',
       'no_of_children', 'net_yearly_income', 'no_of_days_employed',
       'occupation_type', 'total_family_members', 'migrant_worker',
       'yearly_debt_payments', 'credit_limit', 'credit_limit_used_pctg',
       'credit_score', 'prev_defaults', 'default_in_last_6months']
X = train_df[features]

In [9]:
# train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=246)

In [10]:
y_train.shape

(31869,)

In [11]:
X_train.shape

(31869, 16)

In [12]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31869 entries, 7786 to 19364
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   age                      31869 non-null  int64  
 1   gender                   31869 non-null  object 
 2   owns_car                 31869 non-null  object 
 3   owns_house               31869 non-null  object 
 4   no_of_children           31869 non-null  float64
 5   net_yearly_income        31869 non-null  float64
 6   no_of_days_employed      31869 non-null  float64
 7   occupation_type          31869 non-null  object 
 8   total_family_members     31869 non-null  float64
 9   migrant_worker           31869 non-null  float64
 10  yearly_debt_payments     31809 non-null  float64
 11  credit_limit             31869 non-null  float64
 12  credit_limit_used_pctg   31869 non-null  int64  
 13  credit_score             31862 non-null  float64
 14  prev_defaults      

In [13]:
X_train.head(5)

Unnamed: 0,age,gender,owns_car,owns_house,no_of_children,net_yearly_income,no_of_days_employed,occupation_type,total_family_members,migrant_worker,yearly_debt_payments,credit_limit,credit_limit_used_pctg,credit_score,prev_defaults,default_in_last_6months
7786,54,F,N,Y,0.0,152620.28,365241.0,Unknown,1.0,0.0,17174.95,36475.15,28,657.0,0,0
4510,45,F,N,Y,0.0,195000.95,365247.0,Unknown,1.0,0.0,24914.37,37424.95,88,733.0,0,0
27835,49,F,N,Y,1.0,116471.27,2136.0,Private service staff,3.0,1.0,41638.64,17632.45,6,718.0,0,0
42119,46,M,N,N,1.0,274753.63,986.0,Security staff,3.0,0.0,21530.34,47433.1,56,740.0,0,0
17853,36,M,Y,N,2.0,234553.41,7326.0,Laborers,4.0,0.0,67125.98,42020.02,7,803.0,0,0


In [14]:
# replacing values
X_train['gender'] = X_train['gender'].replace(['F', 'M'], [0, 1])
X_train['owns_car'] = X_train['owns_car'].replace(['N', 'Y'], [0, 1])
X_train['owns_house'] = X_train['owns_house'].replace(['N', 'Y'], [0, 1])
# replacing values
X_test['gender'] = X_test['gender'].replace(['F', 'M'], [0, 1])
X_test['owns_car'] = X_test['owns_car'].replace(['N', 'Y'], [0, 1])
X_test['owns_house'] = X_test['owns_house'].replace(['N', 'Y'], [0, 1])

In [15]:
# checkin data quality
X_train.head(5)

Unnamed: 0,age,gender,owns_car,owns_house,no_of_children,net_yearly_income,no_of_days_employed,occupation_type,total_family_members,migrant_worker,yearly_debt_payments,credit_limit,credit_limit_used_pctg,credit_score,prev_defaults,default_in_last_6months
7786,54,0,0,1,0.0,152620.28,365241.0,Unknown,1.0,0.0,17174.95,36475.15,28,657.0,0,0
4510,45,0,0,1,0.0,195000.95,365247.0,Unknown,1.0,0.0,24914.37,37424.95,88,733.0,0,0
27835,49,0,0,1,1.0,116471.27,2136.0,Private service staff,3.0,1.0,41638.64,17632.45,6,718.0,0,0
42119,46,1,0,0,1.0,274753.63,986.0,Security staff,3.0,0.0,21530.34,47433.1,56,740.0,0,0
17853,36,1,1,0,2.0,234553.41,7326.0,Laborers,4.0,0.0,67125.98,42020.02,7,803.0,0,0


#### Categorical Data: 

Encoding categorical data for classification algorithms. The dataset only have 5 object type excluding customer id, I will apply the get_dummies pandas method since there are no ordinal variables and it is more practical for this project. 

In [16]:
# encoding data with get dummies
X_train = pd.get_dummies(X_train, columns=['occupation_type'], prefix=['ot'])
X_test = pd.get_dummies(X_test, columns=['occupation_type'], prefix=['ot'])

In [17]:
#checking data quality
X_train.tail(5)

Unnamed: 0,age,gender,owns_car,owns_house,no_of_children,net_yearly_income,no_of_days_employed,total_family_members,migrant_worker,yearly_debt_payments,...,ot_Low-skill Laborers,ot_Managers,ot_Medicine staff,ot_Private service staff,ot_Realty agents,ot_Sales staff,ot_Secretaries,ot_Security staff,ot_Unknown,ot_Waiters/barmen staff
14569,47,1,1,0,1.0,210172.75,205.0,3.0,0.0,42188.68,...,0,0,0,0,0,0,0,0,0,0
29550,24,0,0,1,0.0,160639.27,1310.0,2.0,0.0,46059.86,...,0,0,0,0,0,0,0,0,0,0
26244,40,0,1,1,1.0,342904.0,2089.0,3.0,0.0,39287.83,...,0,0,0,0,0,0,0,0,0,0
14494,49,0,0,1,0.0,191144.45,1518.0,2.0,0.0,43713.86,...,0,0,0,0,0,0,0,0,1,0
19364,33,0,0,1,0.0,113435.59,1414.0,2.0,0.0,13064.27,...,0,0,0,0,0,0,0,0,0,0


In [18]:
#checking data quality
X_test.head(5)

Unnamed: 0,age,gender,owns_car,owns_house,no_of_children,net_yearly_income,no_of_days_employed,total_family_members,migrant_worker,yearly_debt_payments,...,ot_Low-skill Laborers,ot_Managers,ot_Medicine staff,ot_Private service staff,ot_Realty agents,ot_Sales staff,ot_Secretaries,ot_Security staff,ot_Unknown,ot_Waiters/barmen staff
28172,40,0,1,1,1.0,180097.34,1228.0,3.0,1.0,23711.41,...,0,0,0,0,0,1,0,0,0,0
15797,49,1,0,1,1.0,300847.13,1339.0,3.0,0.0,51920.12,...,0,0,0,0,0,0,0,0,0,0
33000,47,0,0,1,0.0,175328.41,1568.0,2.0,0.0,34157.04,...,0,0,0,0,0,1,0,0,0,0
15310,32,1,1,1,2.0,136905.24,365241.0,4.0,0.0,31362.68,...,0,0,0,0,0,0,0,0,1,0
285,30,0,0,0,0.0,79336.23,365247.0,1.0,0.0,19200.23,...,0,0,0,0,0,0,0,0,1,0


In [19]:
# checking data quality 
X_train.shape

(31869, 34)

In [20]:
# checking data quality 
X_test.shape

(13659, 34)

In [21]:
# checking column names
X_train.columns

Index(['age', 'gender', 'owns_car', 'owns_house', 'no_of_children',
       'net_yearly_income', 'no_of_days_employed', 'total_family_members',
       'migrant_worker', 'yearly_debt_payments', 'credit_limit',
       'credit_limit_used_pctg', 'credit_score', 'prev_defaults',
       'default_in_last_6months', 'ot_Accountants', 'ot_Cleaning staff',
       'ot_Cooking staff', 'ot_Core staff', 'ot_Drivers', 'ot_HR staff',
       'ot_High skill tech staff', 'ot_IT staff', 'ot_Laborers',
       'ot_Low-skill Laborers', 'ot_Managers', 'ot_Medicine staff',
       'ot_Private service staff', 'ot_Realty agents', 'ot_Sales staff',
       'ot_Secretaries', 'ot_Security staff', 'ot_Unknown',
       'ot_Waiters/barmen staff'],
      dtype='object')

In [22]:
#changing one occupation type column's name
X_train = X_train.rename(columns={'ot_Waiters/barmen staff':'ot_waiters_barmen staff'})
X_test = X_test.rename(columns={'ot_Waiters/barmen staff':'ot_waiters_barmen staff'})

In [23]:
X_train.columns

Index(['age', 'gender', 'owns_car', 'owns_house', 'no_of_children',
       'net_yearly_income', 'no_of_days_employed', 'total_family_members',
       'migrant_worker', 'yearly_debt_payments', 'credit_limit',
       'credit_limit_used_pctg', 'credit_score', 'prev_defaults',
       'default_in_last_6months', 'ot_Accountants', 'ot_Cleaning staff',
       'ot_Cooking staff', 'ot_Core staff', 'ot_Drivers', 'ot_HR staff',
       'ot_High skill tech staff', 'ot_IT staff', 'ot_Laborers',
       'ot_Low-skill Laborers', 'ot_Managers', 'ot_Medicine staff',
       'ot_Private service staff', 'ot_Realty agents', 'ot_Sales staff',
       'ot_Secretaries', 'ot_Security staff', 'ot_Unknown',
       'ot_waiters_barmen staff'],
      dtype='object')

### Normalizing Dataset

Numerical variables will be normalized with MinMaxScaler method from the scikit-learn library. The data are mostly not normally distributed which indicates MinMaxScaler as the most appropiate method for normalizing the data. 

In [24]:
# making a MinMaxScaler object
scaler = MinMaxScaler()
# fitting the data and transforming
scaled = scaler.fit_transform(X_train)
scaled

array([[0.96875, 0.     , 0.     , ..., 0.     , 1.     , 0.     ],
       [0.6875 , 0.     , 0.     , ..., 0.     , 1.     , 0.     ],
       [0.8125 , 0.     , 0.     , ..., 0.     , 0.     , 0.     ],
       ...,
       [0.53125, 0.     , 1.     , ..., 0.     , 0.     , 0.     ],
       [0.8125 , 0.     , 0.     , ..., 0.     , 1.     , 0.     ],
       [0.3125 , 0.     , 0.     , ..., 0.     , 0.     , 0.     ]])

In [25]:
# creating a dataframe from scaled data
scaled_df = pd.DataFrame(scaled, columns=['age', 'gender', 'owns_car', 'owns_house', 'no_of_children',
       'net_yearly_income', 'no_of_days_employed', 'total_family_members',
       'migrant_worker', 'yearly_debt_payments', 'credit_limit',
       'credit_limit_used_pctg', 'credit_score', 'prev_defaults',
       'default_in_last_6months', 'ot_Accountants', 'ot_Cleaning staff',
       'ot_Cooking staff', 'ot_Core staff', 'ot_Drivers', 'ot_HR staff',
       'ot_High skill tech staff', 'ot_IT staff', 'ot_Laborers',
       'ot_Low-skill Laborers', 'ot_Managers', 'ot_Medicine staff',
       'ot_Private service staff', 'ot_Realty agents', 'ot_Sales staff',
       'ot_Secretaries', 'ot_Security staff', 'ot_Unknown',
       'ot_waiters_barmen staff'])

In [26]:
scaled_df.head(10)

Unnamed: 0,age,gender,owns_car,owns_house,no_of_children,net_yearly_income,no_of_days_employed,total_family_members,migrant_worker,yearly_debt_payments,...,ot_Low-skill Laborers,ot_Managers,ot_Medicine staff,ot_Private service staff,ot_Realty agents,ot_Sales staff,ot_Secretaries,ot_Security staff,ot_Unknown,ot_waiters_barmen staff
0,0.96875,0.0,0.0,1.0,0.0,0.028468,0.99997,0.0,0.0,0.05392,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.6875,0.0,0.0,1.0,0.0,0.038086,0.999986,0.0,0.0,0.081857,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.8125,0.0,0.0,1.0,0.111111,0.020265,0.005843,0.222222,1.0,0.142226,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.71875,1.0,0.0,0.0,0.111111,0.056184,0.002694,0.222222,0.0,0.069641,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.40625,1.0,1.0,0.0,0.222222,0.047061,0.020052,0.333333,0.0,0.234227,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.71875,0.0,0.0,1.0,0.111111,0.050416,0.009424,0.222222,0.0,0.118107,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.3125,0.0,0.0,1.0,0.0,0.026088,0.999967,0.111111,0.0,0.078448,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
7,0.25,0.0,0.0,1.0,0.111111,0.032521,0.004397,0.111111,0.0,0.050584,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
8,0.875,1.0,1.0,1.0,0.111111,0.045911,0.000972,0.222222,1.0,0.051455,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
9,0.125,0.0,0.0,1.0,0.222222,0.026967,0.002031,0.222222,0.0,0.036474,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [27]:
scaled_df.shape

(31869, 34)

In [28]:
# fitting the test set
scaled_2 = scaler.fit_transform(X_test)
# transforming to df
scaled_2_df = pd.DataFrame(scaled_2, columns=['age', 'gender', 'owns_car', 'owns_house', 'no_of_children',
       'net_yearly_income', 'no_of_days_employed', 'total_family_members',
       'migrant_worker', 'yearly_debt_payments', 'credit_limit',
       'credit_limit_used_pctg', 'credit_score', 'prev_defaults',
       'default_in_last_6months', 'ot_Accountants', 'ot_Cleaning staff',
       'ot_Cooking staff', 'ot_Core staff', 'ot_Drivers', 'ot_HR staff',
       'ot_High skill tech staff', 'ot_IT staff', 'ot_Laborers',
       'ot_Low-skill Laborers', 'ot_Managers', 'ot_Medicine staff',
       'ot_Private service staff', 'ot_Realty agents', 'ot_Sales staff',
       'ot_Secretaries', 'ot_Security staff', 'ot_Unknown',
       'ot_waiters_barmen staff'])
scaled_2_df.head(5)


Unnamed: 0,age,gender,owns_car,owns_house,no_of_children,net_yearly_income,no_of_days_employed,total_family_members,migrant_worker,yearly_debt_payments,...,ot_Low-skill Laborers,ot_Managers,ot_Medicine staff,ot_Private service staff,ot_Realty agents,ot_Sales staff,ot_Secretaries,ot_Security staff,ot_Unknown,ot_waiters_barmen staff
0,0.53125,0.0,1.0,1.0,0.166667,0.001065,0.003329,0.285714,1.0,0.064419,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.8125,1.0,0.0,1.0,0.166667,0.001923,0.003633,0.285714,0.0,0.151118,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.75,0.0,0.0,1.0,0.0,0.001031,0.00426,0.142857,0.0,0.096523,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.28125,1.0,1.0,1.0,0.333333,0.000758,0.99997,0.428571,0.0,0.087935,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.21875,0.0,0.0,0.0,0.0,0.000349,0.999986,0.0,0.0,0.050553,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [29]:
# saving preprocessed data sets
path = '/Users/camilods16/Documents/Project-2-AmEx-Credit-Card-Default-/Project-2-AmEx-Credit-Card-Default-/data/processed/X_train_scaled.csv'
path2 = '/Users/camilods16/Documents/Project-2-AmEx-Credit-Card-Default-/Project-2-AmEx-Credit-Card-Default-/data/processed/y_train.csv'
path3 = '/Users/camilods16/Documents/Project-2-AmEx-Credit-Card-Default-/Project-2-AmEx-Credit-Card-Default-/data/processed/X_test.csv'
path4 = '/Users/camilods16/Documents/Project-2-AmEx-Credit-Card-Default-/Project-2-AmEx-Credit-Card-Default-/data/processed/y_test.csv'
scaled_df.to_csv(path)
y_train.to_csv(path2)
X_test.to_csv(path3)
y_test.to_csv(path4)