# 4 Preprocessing and Training<a id='4_Preprocessing_and_Training'></a>

## 4.1 Contents<a id='4.1_Contents'></a>
* [4 Preprocessing and Training](#4_Preprocessing_and_Training)
  * [4.1 Contents](#4.1_Contents)
  * [4.2 Introduction](#4.2_Introduction)
  * [4.3 Imports](#4.3_Imports)
  * [4.4 Load Data](#4.4_Load_Data)
  * [4.5 Data Cleaning](#4.5_Data_Cleaning)
      * [4.5.1 Drop columns from hold out set](#4.5.1_Drop_columns_from_hold_out_set)
      * [4.5.2 Missing Values](#4.5.2_Missing_Values)
      * [4.5.3 Fill Remaining Missing Values](#4.5.3_Fill_Remaining_Missing_Values)
      * [4.5.4 Outlier Removal](#4.5.4_Outlier_Removal)
  * [4.6 Split Dataset](#4.6_Split_Dataset)
  * [4.7 Preprocessing](#4.7_Preprocessing)
      * [4.7.1 Categorical Variable Cleaning](#4.7.1_Cadegorical_Variable_Cleaning)
      * [4.7.2 One-hot Encoding](#4.7.2_One-hot_Encoding)
      * [4.7.3 Train Test Split](#4.7.3_Train_Test_Split)
      * [4.7.4 Scale the Data](#4.7.4_Scale_the_Data)
  * [4.8 Save the Dataset](#4.8_Save_the_Dataset)

## 4.2 Introduction<a id='4.2_Introduction'></a>

In this notebook, I first finish cleaning the data: dropping columns from the hold-out set, filling the remaining missing values, and removing outliers. Next, I separate the features from the target and preprocess the categorical features. Finally, I split the training data into training and test sets and scale the features to prepare for machine learning.

## 4.3 Imports<a id='4.3_Imports'></a>

In [1]:
# import the necessary packages
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from numpy import savetxt

## 4.4 Load Data<a id='4.4_Load_Data'></a>

In the previous notebook I saved the data as 'lc_step3.csv'. The hold-out dataset was saved in the data wrangling notebook as 'LC_2016_2017_cleaned.csv'. I will load this data now. 

In [2]:
df = pd.read_csv('lc_step3.csv', index_col = 'id')

In [3]:
df_hold = pd.read_csv('../data/LC_2016_2017_cleaned.csv', index_col='id')

## 4.5 Data Cleaning<a id='4.5_Data_Cleaning'></a>

Before preprocessing, I need to do some final data cleaning. I dropped a number of columns during EDA, so I will now drop the same columns from the hold-out set. I also still need to handle the remaining missing values and outliers.

### 4.5.1 Drop columns from hold out set<a id='4.5.1_Drop_columns_from_hold_out_set'></a>

In [4]:
# In the last notebook, I dropped a number of columns from the training set. Here I will drop those same columns
# from the hold-out set.
drop_col = ['installment', 'total_pymnt_inv', 'out_prncp_inv', 'recoveries', 'collection_recovery_fee', 'loan_status', 'emp_length', 'term(months)', 'grade']
df_hold.drop(drop_col, axis=1, inplace=True)
df_hold.shape

(759335, 34)
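A quick way to confirm that the hold-out set ends up with the same columns as the training set is to compare the two column sets directly. The sketch below uses small made-up frames (`train` and `hold` are illustrative names, not the real data) and only shows the idea.

```python
import pandas as pd

# Toy stand-ins for the training and hold-out frames (hypothetical data)
train = pd.DataFrame({"loan_amnt": [1000], "dti": [12.5], "defaults": [0]})
hold = pd.DataFrame({"loan_amnt": [2000], "dti": [20.1], "grade": ["B"], "defaults": [1]})

# Columns present in one frame but not in the other
only_in_hold = sorted(set(hold.columns) - set(train.columns))
only_in_train = sorted(set(train.columns) - set(hold.columns))
print(only_in_hold, only_in_train)

# Drop the extras so both frames share the same columns
hold = hold.drop(columns=only_in_hold)
columns_match = set(hold.columns) == set(train.columns)
```

If `only_in_train` were not empty, the hold-out set would be missing a feature and dropping columns alone would not be enough.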

### 4.5.2 Missing Values<a id='4.5.2_Missing_Values'></a>

In [5]:
# training set missing values
missing = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='%', ascending=False).head(6)

| column | count | % |
| --- | --- | --- |
| next_pymnt_d | 251160 | 28.375429 |
| total_rev_hi_lim | 68203 | 7.705404 |
| tot_cur_bal | 68203 | 7.705404 |
| last_pymnt_d | 17654 | 1.994505 |
| last_credit_pull_d | 53 | 0.005988 |
| out_prncp | 0 | 0.0 |


In [6]:
missing_holdout = pd.concat([df_hold.isnull().sum(), 100 * df_hold.isnull().mean()], axis=1)
missing_holdout.columns=['count', '%']
missing_holdout.sort_values(by='%', ascending=False).head()

| column | count | % |
| --- | --- | --- |
| next_pymnt_d | 167915 | 22.113428 |
| last_pymnt_d | 948 | 0.124846 |
| dti | 355 | 0.046751 |
| last_credit_pull_d | 18 | 0.00237 |
| total_pymnt | 0 | 0.0 |
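Since the same summary is built for both the training and hold-out sets, it could be wrapped in a small helper. This is only a sketch: `missing_summary` is a hypothetical name, and the toy frame stands in for the real data.

```python
import numpy as np
import pandas as pd

def missing_summary(frame, n=5):
    """Return the n columns with the highest share of missing values."""
    summary = pd.concat([frame.isnull().sum(), 100 * frame.isnull().mean()], axis=1)
    summary.columns = ["count", "%"]
    return summary.sort_values(by="%", ascending=False).head(n)

# Toy frame with two missing dti values out of four rows (hypothetical data)
toy = pd.DataFrame({"dti": [10.0, np.nan, 30.0, np.nan], "loan_amnt": [1, 2, 3, 4]})
result = missing_summary(toy)
print(result)
```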


### 4.5.3 Fill Remaining Missing Values<a id='4.5.3_Fill_Remaining_Missing_Values'></a>

In [7]:
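# fill missing values from a related column; assigning the result back avoids chained in-place fills
df['last_pymnt_d'] = df['last_pymnt_d'].fillna(df['issue_d'])
df['last_credit_pull_d'] = df['last_credit_pull_d'].fillna(df['issue_d'])
# next_pymnt_d falls back to last_pymnt_d, which was filled above
df['next_pymnt_d'] = df['next_pymnt_d'].fillna(df['last_pymnt_d'])
df['tot_cur_bal'] = df['tot_cur_bal'].fillna(df['loan_amnt'])
df['total_rev_hi_lim'] = df['total_rev_hi_lim'].fillna(df['loan_amnt'])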

In [8]:
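# same date fills for the hold-out set; last_pymnt_d must be filled before next_pymnt_d uses it
df_hold['last_pymnt_d'] = df_hold['last_pymnt_d'].fillna(df_hold['issue_d'])
df_hold['next_pymnt_d'] = df_hold['next_pymnt_d'].fillna(df_hold['last_pymnt_d'])
df_hold['last_credit_pull_d'] = df_hold['last_credit_pull_d'].fillna(df_hold['issue_d'])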

In [9]:
# drop the hold-out rows that still have missing values (the 355 rows with no dti)
df_hold.dropna(inplace=True)
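The order of the fills matters, because `next_pymnt_d` falls back to `last_pymnt_d`, which itself falls back to `issue_d`. The toy example below (made-up rows, not the real data) is a small sketch of that chain.

```python
import numpy as np
import pandas as pd

# Three toy loans with progressively more missing dates (hypothetical data)
toy = pd.DataFrame({
    "issue_d": ["Dec-2015", "Nov-2015", "Oct-2015"],
    "last_pymnt_d": ["Jan-2016", np.nan, np.nan],
    "next_pymnt_d": ["Feb-2016", "Jan-2016", np.nan],
})

# Fill last_pymnt_d first so that it has no gaps when next_pymnt_d falls back to it
toy["last_pymnt_d"] = toy["last_pymnt_d"].fillna(toy["issue_d"])
toy["next_pymnt_d"] = toy["next_pymnt_d"].fillna(toy["last_pymnt_d"])
print(toy)
```

If the two fills were swapped, the third row would keep a missing `next_pymnt_d`, because `last_pymnt_d` would still be empty at that point.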

In [10]:
df.isnull().sum()

loan_amnt                     0
int_rate                      0
sub_grade                     0
home_ownership                0
annual_inc                    0
verification_status           0
issue_d                       0
purpose                       0
addr_state                    0
dti                           0
delinq_2yrs                   0
earliest_cr_line              0
inq_last_6mths                0
open_acc                      0
pub_rec                       0
revol_bal                     0
revol_util                    0
total_acc                     0
initial_list_status           0
out_prncp                     0
total_pymnt                   0
total_rec_prncp               0
total_rec_int                 0
total_rec_late_fee            0
last_pymnt_d                  0
last_pymnt_amnt               0
next_pymnt_d                  0
last_credit_pull_d            0
collections_12_mths_ex_med    0
acc_now_delinq                0
tot_coll_amt                  0
tot_cur_

In [11]:
df_hold.isnull().sum()

loan_amnt                     0
int_rate                      0
sub_grade                     0
home_ownership                0
annual_inc                    0
verification_status           0
issue_d                       0
purpose                       0
addr_state                    0
dti                           0
delinq_2yrs                   0
earliest_cr_line              0
inq_last_6mths                0
open_acc                      0
pub_rec                       0
revol_bal                     0
revol_util                    0
total_acc                     0
initial_list_status           0
out_prncp                     0
total_pymnt                   0
total_rec_prncp               0
total_rec_int                 0
total_rec_late_fee            0
last_pymnt_d                  0
last_pymnt_amnt               0
next_pymnt_d                  0
last_credit_pull_d            0
collections_12_mths_ex_med    0
acc_now_delinq                0
tot_coll_amt                  0
tot_cur_

### 4.5.4 Outlier Removal<a id='4.5.4_Outlier_Removal'></a>

In [12]:
df.describe().T

| column | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| loan_amnt | 885132.0 | 14744.637608 | 8432.881175 | 500.0 | 8000.0 | 13000.0 | 20000.0 | 35000.0 |
| int_rate | 885132.0 | 13.249008 | 4.381665 | 5.32 | 9.99 | 12.99 | 16.2 | 28.99 |
| annual_inc | 885132.0 | 74556.736985 | 48070.236877 | 0.0 | 45000.0 | 64851.0 | 90000.0 | 999999.0 |
| dti | 885132.0 | 18.170001 | 17.206697 | 0.0 | 11.92 | 17.67 | 23.96 | 9999.0 |
| delinq_2yrs | 885132.0 | 0.314791 | 0.862853 | 0.0 | 0.0 | 0.0 | 0.0 | 39.0 |
| inq_last_6mths | 885132.0 | 0.693743 | 0.99706 | 0.0 | 0.0 | 0.0 | 1.0 | 33.0 |
| open_acc | 885132.0 | 11.55128 | 5.318696 | 0.0 | 8.0 | 11.0 | 14.0 | 90.0 |
| pub_rec | 885132.0 | 0.195635 | 0.582583 | 0.0 | 0.0 | 0.0 | 0.0 | 86.0 |
| revol_bal | 885132.0 | 16898.222807 | 22132.904973 | 0.0 | 6443.0 | 11870.0 | 20819.0 | 2904836.0 |
| revol_util | 885132.0 | 55.085171 | 23.816724 | 0.0 | 37.7 | 56.0 | 73.6 | 892.3 |


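The 'dti' column is a ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income.

The summary statistics above (median 17.67) suggest that dti is expressed as a percentage. Under this definition, a maximum value of 100 makes sense: anything higher would mean the borrower's monthly debt payments exceed their monthly income.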
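To make the definition concrete, here is a small worked example. It assumes dti is stored as a percentage, which is my reading of the summary statistics rather than something the data dictionary states here; `debt_to_income` and the dollar amounts are made up for illustration.

```python
def debt_to_income(monthly_debt, monthly_income):
    """Return debt-to-income as a percentage of monthly income."""
    return 100 * monthly_debt / monthly_income

# A borrower paying 900 a month on debt with 5000 a month of income
typical = debt_to_income(monthly_debt=900, monthly_income=5000)

# A borrower whose debt payments are larger than their income
extreme = debt_to_income(monthly_debt=1650, monthly_income=1500)

print(typical, extreme)
```

The first borrower lands close to the median in the table above, while the second one ends up above 100.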

In [13]:
# defining a function to show the rows that appear to be outliers for a given column based on standard deviation
# input a column name and the number of std away from the mean (default is 3)
def std_outlier_detection(col, cut=3):
    mean = df[col].mean()
    std = df[col].std()
    cut_off = cut * std
    lower, upper = mean - cut_off, mean + cut_off
    return df[~((df[col]<upper) & (df[col]>lower))]

In [14]:
# defining a function to remove the rows that appear to be outliers for a given column
# input a DataFrame, a column name and the number of std away from the mean (default is 3)
# returns the filtered DataFrame instead of rebinding df inside the function
def outlier_removal(data, col, cut=3):
    mean = data[col].mean()
    std = data[col].std()
    cut_off = cut * std
    lower, upper = mean - cut_off, mean + cut_off
    return data[(data[col] < upper) & (data[col] > lower)]

In [15]:
# defining a function to remove the rows that are at or above an upper limit for a given column
# input a DataFrame, a column name and the upper limit; returns the filtered DataFrame
def outlier_removal_defined(data, col, upper):
    return data[data[col] < upper]

In [16]:
# how many rows are outliers for dti?
std_outlier_detection('dti')

| id | loan_amnt | int_rate | sub_grade | home_ownership | annual_inc | verification_status | issue_d | purpose | addr_state | dti | ... | last_pymnt_d | last_pymnt_amnt | next_pymnt_d | last_credit_pull_d | collections_12_mths_ex_med | acc_now_delinq | tot_coll_amt | tot_cur_bal | total_rev_hi_lim | defaults |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 68393780 | 25000.0 | 20.5 | E4 | MORTGAGE | 19800.0 | Verified | Dec-2015 | debt_consolidation | AL | 83.64 | ... | Dec-2015 | 0.0 | Jan-2016 | Jan-2016 | 0.0 | 0.0 | 2151.0 | 80314.0 | 30200.0 | 0 |
| 67405134 | 6550.0 | 17.86 | D5 | MORTGAGE | 1770.0 | Not Verified | Dec-2015 | credit_card | TX | 1092.52 | ... | Dec-2015 | 0.0 | Feb-2016 | Jan-2016 | 0.0 | 0.0 | 0.0 | 192764.0 | 34400.0 | 0 |
| 67565514 | 15400.0 | 14.48 | C5 | MORTGAGE | 35000.0 | Verified | Dec-2015 | debt_consolidation | NY | 83.4 | ... | Dec-2015 | 0.0 | Jan-2016 | Jan-2016 | 0.0 | 0.0 | 0.0 | 105326.0 | 7000.0 | 0 |
| 67555887 | 15000.0 | 16.55 | D2 | MORTGAGE | 20000.0 | Source Verified | Dec-2015 | debt_consolidation | TX | 137.4 | ... | Dec-2015 | 0.0 | Feb-2016 | Jan-2016 | 0.0 | 0.0 | 0.0 | 143727.0 | 31600.0 | 0 |
| 67575819 | 28000.0 | 19.99 | E4 | OWN | 22246.8 | Verified | Dec-2015 | debt_consolidation | CA | 89.83 | ... | Dec-2015 | 0.0 | Feb-2016 | Jan-2016 | 0.0 | 0.0 | 0.0 | 62324.0 | 3800.0 | 0 |
| 67495417 | 28000.0 | 18.2 | E1 | RENT | 17000.0 | Source Verified | Dec-2015 | debt_consolidation | NV | 136.97 | ... | Dec-2015 | 0.0 | Jan-2016 | Jan-2016 | 0.0 | 0.0 | 0.0 | 102192.0 | 30500.0 | 0 |
| 66415476 | 20000.0 | 16.99 | D3 | MORTGAGE | 0.0 | Not Verified | Dec-2015 | debt_consolidation | TX | 9999.0 | ... | Jan-2016 | 712.96 | Feb-2016 | Jan-2016 | 0.0 | 0.0 | 0.0 | 8715.0 | 8100.0 | 0 |
| 66495833 | 20000.0 | 18.2 | E1 | RENT | 27000.0 | Source Verified | Dec-2015 | debt_consolidation | MD | 100.09 | ... | Jan-2016 | 510.05 | Feb-2016 | Jan-2016 | 0.0 | 0.0 | 0.0 | 118127.0 | 31200.0 | 0 |
| 66593126 | 16000.0 | 17.57 | D4 | MORTGAGE | 30000.0 | Not Verified | Dec-2015 | debt_consolidation | GA | 90.0 | ... | Jan-2016 | 402.57 | Feb-2016 | Jan-2016 | 0.0 | 0.0 | 0.0 | 234904.0 | 36700.0 | 0 |
| 65662159 | 14675.0 | 18.2 | E1 | RENT | 18240.0 | Verified | Nov-2015 | debt_consolidation | ME | 72.3 | ... | Jan-2016 | 411.34 | Jan-2016 | Jan-2016 | 0.0 | 0.0 | 0.0 | 41414.0 | 23500.0 | 0 |


There are only a few dti outliers, so I am just going to drop those rows. A standard-deviation cutoff is not a good fit for dti, though, because it would also drop rows whose values do make sense (for example, 72.3 or 83.4). Instead, I am going to choose a fixed upper bound of 100.

In [17]:
# dropping rows where dti >= 100 (keeping only dti < 100)
df = df[df['dti']<100]
df.shape

(885121, 34)
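The difference between the two approaches can be seen on a small synthetic column. The values below are made up for illustration (they are not drawn from the real data), and the sketch only compares how many rows each rule would flag.

```python
import pandas as pd

# 208 ordinary dti values between 5 and 30, plus three high ones (hypothetical data)
values = [i % 26 + 5 for i in range(208)] + [72.3, 83.4, 137.4]
toy = pd.DataFrame({"dti": values})

# Rule 1: flag anything more than 3 standard deviations above the mean
mean, std = toy["dti"].mean(), toy["dti"].std()
upper = mean + 3 * std
flagged_by_std = toy[toy["dti"] >= upper]

# Rule 2: flag only values at or above the fixed bound of 100
flagged_by_bound = toy[toy["dti"] >= 100]

print(round(upper, 1), len(flagged_by_std), len(flagged_by_bound))
```

On this toy column the standard-deviation rule flags all three high values, including the two below 100, while the fixed bound flags only the one that is not plausible as a percentage.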

## 4.6 Split Dataset<a id='4.6_Split_Dataset'></a>

df currently contains both the features and the target (the 'defaults' column) for machine learning. I need to separate these before moving forward.

In [18]:
# Split the dataset into features and target
X = df.drop('defaults', axis=1)
y = df['defaults']

In [27]:
X_hold = df_hold.drop('defaults', axis=1)
y_hold = df_hold['defaults']

## 4.7 Preprocessing<a id='4.7_Preprocessing'></a>

### 4.7.1 Categorical Variable Cleaning<a id='4.7.1_Cadegorical_Variable_Cleaning'></a>

The categorical columns are currently dtype object. 

In [19]:
X.select_dtypes('object')

| id | sub_grade | home_ownership | verification_status | issue_d | purpose | addr_state | earliest_cr_line | initial_list_status | last_pymnt_d | next_pymnt_d | last_credit_pull_d |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1077501 | B2 | RENT | Verified | Dec-2011 | credit_card | AZ | Jan-1985 | f | Jan-2015 | Jan-2015 | Jan-2016 |
| 1077430 | C4 | RENT | Source Verified | Dec-2011 | car | GA | Apr-1999 | f | Apr-2013 | Apr-2013 | Sep-2013 |
| 1077175 | C5 | RENT | Not Verified | Dec-2011 | small_business | IL | Nov-2001 | f | Jun-2014 | Jun-2014 | Jan-2016 |
| 1076863 | C1 | RENT | Source Verified | Dec-2011 | other | CA | Feb-1996 | f | Jan-2015 | Jan-2015 | Jan-2015 |
| 1075358 | B5 | RENT | Source Verified | Dec-2011 | other | OR | Jan-1996 | f | Jan-2016 | Feb-2016 | Jan-2016 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36371250 | B5 | RENT | Verified | Jan-2015 | debt_consolidation | CA | Sep-2004 | f | Jan-2016 | Feb-2016 | Jan-2016 |
| 36441262 | B5 | MORTGAGE | Verified | Jan-2015 | home_improvement | NJ | Mar-1974 | f | Jan-2016 | Feb-2016 | Jan-2016 |
| 36271333 | D2 | RENT | Verified | Jan-2015 | debt_consolidation | TN | Sep-2003 | w | Jan-2016 | Feb-2016 | Jan-2016 |
| 36490806 | E3 | RENT | Source Verified | Jan-2015 | debt_consolidation | MA | Oct-2003 | w | Jan-2016 | Feb-2016 | Jan-2016 |


In [20]:
# break date columns into relevant info and make sure they have dtype of category
X['issue_d_year'] = X['issue_d'].str.slice(4, 8).astype('category')
X['earliest_cr_line_year'] = X['earliest_cr_line'].str.slice(4, 8).astype('category')
X['last_pymnt_d_month'] = X['last_pymnt_d'].str.slice(0, 3).astype('category')
X['last_pymnt_d_year'] = X['last_pymnt_d'].str.slice(4, 8).astype('category')
X['next_pymnt_d_month'] = X['next_pymnt_d'].str.slice(0, 3).astype('category')
X['next_pymnt_d_year'] = X['next_pymnt_d'].str.slice(4, 8).astype('category')
X['last_credit_pull_d_year'] = X['last_credit_pull_d'].str.slice(4, 8).astype('category')

In [21]:
# drop the original date columns
X.drop(['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d'], axis=1, inplace=True)
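Slicing by character position works because every date has the form 'Mon-YYYY'. An alternative that does not depend on fixed positions is to split on the hyphen. The snippet below is a sketch on a toy column, and the new column names are only illustrative.

```python
import pandas as pd

# Two toy dates in the same 'Mon-YYYY' format as the loan data (hypothetical rows)
toy = pd.DataFrame({"issue_d": ["Dec-2011", "Jan-2015"]})

# Split on the hyphen: column 0 holds the month, column 1 holds the year
parts = toy["issue_d"].str.split("-", expand=True)
toy["issue_d_month"] = parts[0].astype("category")
toy["issue_d_year"] = parts[1].astype("category")
print(toy)
```

Both approaches keep the year as a string category rather than a number, so one-hot encoding treats each year as its own level.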

In [22]:
# change the remaining object columns in the feature matrix to category
for col in ['sub_grade', 'home_ownership', 'verification_status', 'purpose', 'addr_state', 'initial_list_status']:
    X[col] = X[col].astype('category')

### 4.7.2 One-hot Encoding<a id='4.7.2_One-hot_Encoding'></a>

In [23]:
X = pd.get_dummies(X)
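One thing to keep in mind for later is that the hold-out features will need the same encoded columns as the training features before a model can score them. A possible way to line them up, sketched here on toy frames with made-up values, is to encode the hold-out set and then reindex it to the training columns. This is an assumption about how the step could be done, not something this notebook does.

```python
import pandas as pd

# Toy training and hold-out features (hypothetical data)
train = pd.DataFrame({"loan_amnt": [1000, 2000, 3000],
                      "purpose": ["car", "other", "credit_card"]})
hold = pd.DataFrame({"loan_amnt": [1500, 2500],
                     "purpose": ["other", "wedding"]})

train_enc = pd.get_dummies(train)
hold_enc = pd.get_dummies(hold)

# Keep exactly the training columns: unseen levels are dropped, missing ones become 0
hold_enc = hold_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(hold_enc.columns))
```

Here 'wedding' never appears in the training data, so its dummy column is dropped, while 'car' and 'credit_card' are added to the hold-out frame as all-zero columns.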

### 4.7.3 Train Test Split<a id='4.7.3_Train_Test_Split'></a>

Here I am going to split my dataset into training and test sets, with 30% of the rows held out for testing. Since only about 5% of loans default, I will set the stratify parameter so that both sets keep roughly the same share of defaults.

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100, stratify=y)
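To see what stratifying does, here is a small check on synthetic labels with the same 5% positive rate. The arrays are made up, and the sketch only confirms that both splits keep the class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 toy rows with a 5% positive rate (hypothetical data)
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 50 + [0] * 950)

# Same settings as the real split, applied to the toy arrays
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the share of positives in the test set would vary from one random split to the next, which matters more when the positive class is this rare.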

### 4.7.4 Scale the Data<a id='4.7.4_Scale_the_Data'></a>

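For ML, I plan on using trees and logistic regression. Neither model assumes that the features are normally distributed, so I am going to use a standard scaler, which rescales each feature to zero mean and unit variance. The scaler is fit on the training set only and then applied to both sets, so the test set does not influence the scaling.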

In [25]:
scale = StandardScaler()
scale.fit(X_train)
X_train_scaled = scale.transform(X_train)
X_test_scaled = scale.transform(X_test)
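The effect of fitting on the training set only can be checked on a tiny example. The frames below are made up, and the sketch just shows that the training columns end up with mean 0 and standard deviation 1, while the test rows are scaled with the training statistics.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training and test features (hypothetical data)
train = pd.DataFrame({"loan_amnt": [1000.0, 2000.0, 3000.0], "dti": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"loan_amnt": [4000.0], "dti": [20.0]})

scaler = StandardScaler().fit(train)      # statistics come from the training rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)      # reuses the training mean and std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

The test row's dti equals the training mean, so it scales to 0, while its loan amount is above every training value and scales to well over 1.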

## 4.8 Save the Dataset<a id='4.8_Save_the_Dataset'></a>

In [26]:
# Save the data for use in the next modeling notebook
savetxt('X_train.csv', X_train_scaled, delimiter=',')
savetxt('X_test.csv', X_test_scaled, delimiter=',')
savetxt('y_train.csv', y_train, delimiter=',')
savetxt('y_test.csv', y_test, delimiter=',')
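Note that `savetxt` writes only the numeric values, so the column names and loan ids are not stored in these files. If the next notebook needs them, one option is to wrap the scaled array in a DataFrame before saving. This is a sketch on toy values, with an in-memory buffer standing in for the file path.

```python
import io
import numpy as np
import pandas as pd

# Toy scaled array plus the labels it came from (hypothetical data)
columns = ["loan_amnt", "dti"]
index = pd.Index([101, 102], name="id")
scaled = np.array([[-1.0, 0.5], [1.0, -0.5]])

# Wrap the array so the column names and ids are written along with the values
buffer = io.StringIO()  # stands in for a path such as 'X_train.csv'
pd.DataFrame(scaled, columns=columns, index=index).to_csv(buffer)

# Read it back to confirm the labels survive the round trip
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="id")
print(restored)
```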

In [28]:
X_hold.to_csv('X_holdout.csv')
y_hold.to_csv('y_holdout.csv')