# 4 Preprocessing and Training<a id='4_Preprocessing_and_Training'></a>

## 4.1 Contents<a id='4.1_Contents'></a>
* [4 Preprocessing and Training](#4_Preprocessing_and_Training)
  * [4.1 Contents](#4.1_Contents)
  * [4.2 Introduction](#4.2_Introduction)
  * [4.3 Imports](#4.3_Imports)
  * [4.4 Load Data](#4.4_Load_Data)
  * [4.5 Data Cleaning](#4.5_Data_Cleaning)
      * [4.5.1 Drop columns from hold out set](#4.5.1_Drop_columns_from_hold_out_set)
      * [4.5.2 Missing Values](#4.5.2_Missing_Values)
  * [4.6 Split Dataset](#4.6_Split_Dataset)
  * [4.7 Preprocessing](#4.7_Preprocessing)
      * [4.7.1 Categorical Variable Cleaning](#4.7.1_Cadegorical_Variable_Cleaning)
      * [4.7.2 Label Encoding](#4.7.2_Label_Encoding)
      * [4.7.3 Train Test Split](#4.7.3_Train_Test_Split)
      * [4.7.4 Scale the Data](#4.7.4_Scale_the_Data)
      * [4.7.5 Balancing the Data](#4.7.5_Balancing_the_Data)
  * [4.8 Save the Dataset](#4.8_Save_the_Dataset)

## 4.2 Introduction<a id='4.2_Introduction'></a>

In this notebook, I first do some final data cleaning. Next, I preprocess the categorical features, split the training data into training and test sets, and scale the features. Finally, I balance the classes in the training set to prepare for machine learning.

## 4.3 Imports<a id='4.3_Imports'></a>

In [1]:
# import the necessary packages
import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from numpy import savetxt, where
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler



## 4.4 Load Data<a id='4.4_Load_Data'></a>

In the previous notebook I saved the data as 'lc_step3.csv'. The hold-out dataset was saved in the data wrangling notebook as 'LC_2016_2017_cleaned.csv'. I will load this data now. 

In [2]:
df = pd.read_csv('lc_step3.csv', index_col = 'id')

In [3]:
df_hold = pd.read_csv('LC_2016_2017_cleaned.csv', index_col='id')

## 4.5 Data Cleaning<a id='4.5_Data_Cleaning'></a>

Before preprocessing, I need to do some final data cleaning. I dropped a number of columns during EDA, and now I am going to drop those same columns from the hold-out set. I will also check both datasets for missing values.

### 4.5.1 Drop columns from hold out set<a id='4.5.1_Drop_columns_from_hold_out_set'></a>

In [4]:
list(set(df_hold.columns).difference(set(df.columns)))

['emp_length', 'loan_status', 'grade', 'term(months)']

In [5]:
# In the last notebook, I dropped a number of columns from the training set. Here I will drop those same columns
# from the hold-out set.
drop_col = ['grade', 'emp_length', 'loan_status', 'term(months)']
df_hold.drop(drop_col, axis=1, inplace=True)
df_hold.shape

(758965, 27)
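
The column list above is typed out by hand. As a minimal sketch, the same alignment could be derived from the two frames directly. The small `train` and `hold` frames below are made-up stand-ins for `df` and `df_hold`, not the real loan data.

```python
import pandas as pd

# Toy stand-ins for df and df_hold (hypothetical data, not the real loans)
train = pd.DataFrame({"loan_amnt": [1000.0], "defaults": [0]})
hold = pd.DataFrame({"loan_amnt": [2000.0], "grade": ["B"], "defaults": [1]})

# Columns present in the hold-out frame but absent from the training frame
extra_cols = sorted(set(hold.columns) - set(train.columns))

# Drop them and reorder so both frames share the same column order
hold_aligned = hold.drop(columns=extra_cols)[train.columns]

print(extra_cols)
print(list(hold_aligned.columns) == list(train.columns))
```

Deriving the list this way avoids the two frames drifting apart if more columns are dropped from the training set later.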

### 4.5.2 Missing Values<a id='4.5.2_Missing_Values'></a>

A quick check to make sure there are no missing values in the data.

In [6]:
# training set missing values
missing = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='%', ascending=False).head()

| | count | % |
| --- | --- | --- |
| loan_amnt | 0 | 0.0 |
| mths_since_last_record | 0 | 0.0 |
| total_rev_hi_lim | 0 | 0.0 |
| tot_coll_amt | 0 | 0.0 |
| acc_now_delinq | 0 | 0.0 |


In [7]:
missing_holdout = pd.concat([df_hold.isnull().sum(), 100 * df_hold.isnull().mean()], axis=1)
missing_holdout.columns=['count', '%']
missing_holdout.sort_values(by='%', ascending=False).head()

| | count | % |
| --- | --- | --- |
| loan_amnt | 0 | 0.0 |
| mths_since_last_record | 0 | 0.0 |
| total_rev_hi_lim | 0 | 0.0 |
| tot_coll_amt | 0 | 0.0 |
| acc_now_delinq | 0 | 0.0 |


Good, there is no missing data.
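
The same summary is built twice above, once per dataset. A small helper could wrap it, sketched here under the assumption that only the count and percentage per column are needed. The name `missing_summary` and the `toy` frame are hypothetical.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    summary = pd.DataFrame({
        "count": frame.isnull().sum(),
        "%": 100 * frame.isnull().mean(),
    })
    return summary.sort_values(by="%", ascending=False)


# Made-up frame with one missing value in column "a"
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [1, 2, 3, 4]})
report = missing_summary(toy)
print(report)
```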

## 4.6 Split Dataset<a id='4.6_Split_Dataset'></a>

`df` currently contains both the features and the target (the 'defaults' column) for machine learning. I need to separate these before moving forward.

In [8]:
# Split the dataset into features and target
X = df.drop('defaults', axis=1)
y = df['defaults']

In [9]:
X_hold = df_hold.drop('defaults', axis=1)
y_hold = df_hold['defaults']

## 4.7 Preprocessing<a id='4.7_Preprocessing'></a>

### 4.7.1 Categorical Variable Cleaning<a id='4.7.1_Cadegorical_Variable_Cleaning'></a>

The categorical columns are currently dtype object. 

In [10]:
X.select_dtypes('object')

| id | sub_grade | home_ownership | verification_status | purpose | addr_state | earliest_cr_line | last_credit_pull_d |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10149342 | B2 | OWN | Verified | debt_consolidation | MI | Oct-1986 | Dec-2015 |
| 10159584 | C1 | RENT | Not Verified | debt_consolidation | CA | Jan-2007 | Jan-2016 |
| 10159498 | A2 | MORTGAGE | Not Verified | debt_consolidation | CT | Mar-1994 | Jan-2016 |
| 10139658 | B5 | RENT | Source Verified | debt_consolidation | NM | Oct-1998 | Jan-2016 |
| 10159548 | A5 | MORTGAGE | Not Verified | debt_consolidation | FL | Mar-1998 | Jan-2016 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 36371250 | B5 | RENT | Verified | debt_consolidation | CA | Sep-2004 | Jan-2016 |
| 36441262 | B5 | MORTGAGE | Verified | home_improvement | NJ | Mar-1974 | Jan-2016 |
| 36271333 | D2 | RENT | Verified | debt_consolidation | TN | Sep-2003 | Jan-2016 |
| 36490806 | E3 | RENT | Source Verified | debt_consolidation | MA | Oct-2003 | Jan-2016 |


In [11]:
# Extracting year from the Earliest Credit Line and Last Credit Pulled columns
X['earliest_cr_line_year'] = X['earliest_cr_line'].str.slice(4, 8).astype('category')
X['last_credit_pull_d_year'] = X['last_credit_pull_d'].str.slice(4, 8).astype('category')

In [12]:
X_hold['earliest_cr_line_year'] = X_hold['earliest_cr_line'].str.slice(4, 8).astype('category')
X_hold['last_credit_pull_d_year'] = X_hold['last_credit_pull_d'].str.slice(4, 8).astype('category')
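
The slice works because every value has the fixed `Mon-YYYY` layout, with the year in character positions 4 to 7. As a sketch of an alternative, the strings could be parsed as dates instead, assuming the same layout and English month abbreviations. The `sample` series is made-up data.

```python
import pandas as pd

# Made-up values in the same "Mon-YYYY" layout as the date columns
sample = pd.Series(["Oct-1986", "Jan-2007", "Mar-1994"])

# Fixed-position slice, as in the cells above (result is a string)
year_by_slice = sample.str.slice(4, 8)

# Parse as a date; a value in an unexpected layout raises an error
# instead of being silently mis-sliced (result is an integer)
year_by_parse = pd.to_datetime(sample, format="%b-%Y").dt.year

print(year_by_slice.tolist())
print(year_by_parse.tolist())
```

The parsed version yields integer years, which could be used as a numeric feature directly rather than being label encoded.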

In [13]:
# drop the original date columns
X.drop(['earliest_cr_line', 'last_credit_pull_d'], axis=1, inplace=True)

In [14]:
X_hold.drop(['earliest_cr_line', 'last_credit_pull_d'], axis=1, inplace=True)

In [15]:
# change the object columns to category
for col in ['sub_grade', 'home_ownership', 'verification_status', 'purpose', 'addr_state']:
    X[col] = X[col].astype('category')

In [16]:
for col in ['sub_grade', 'home_ownership', 'verification_status', 'purpose', 'addr_state']:
    X_hold[col] = X_hold[col].astype('category')

### 4.7.2 Label Encoding<a id='4.7.2_Label_Encoding'></a>

In [17]:
categories = ['sub_grade', 'home_ownership', 'verification_status', 'purpose','addr_state', 'earliest_cr_line_year', 'last_credit_pull_d_year']

In [18]:
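# LabelEncoder maps the sorted unique values of a column to 0..n-1;
# the same encoder object is refit for each column in turn
label_encoder = preprocessing.LabelEncoder()
for col in categories:
    X[col] = label_encoder.fit_transform(X[col])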

In [19]:
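# A fresh encoder is fit on the hold-out set, so its integer codes match the
# training codes only when both sets contain the same categories in a column
label_encoder = preprocessing.LabelEncoder()
for col in categories:
    X_hold[col] = label_encoder.fit_transform(X_hold[col])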
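
The cells above fit a separate encoder on the hold-out set, so the integer codes line up with the training codes only if both sets contain exactly the same categories in each column. One way to guarantee a shared mapping is sketched below. It swaps in scikit-learn's `OrdinalEncoder` (not what this notebook uses), fits it on the training data only, and sends unseen hold-out categories to -1. The `train` and `hold` frames are made-up data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up data; "OTHER" appears only in the hold-out frame
train = pd.DataFrame({"home_ownership": ["OWN", "RENT", "MORTGAGE", "RENT"]})
hold = pd.DataFrame({"home_ownership": ["RENT", "OWN", "OTHER"]})

# Fit on the training data only; unseen hold-out categories become -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train[["home_ownership"]])
hold_codes = encoder.transform(hold[["home_ownership"]])

print(encoder.categories_[0].tolist())
print(train_codes.ravel().tolist())
print(hold_codes.ravel().tolist())
```

With this approach "RENT" receives the same code in both frames, and a category the model never saw is flagged explicitly rather than shifting the other codes.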

In [20]:
X.head()

| id | loan_amnt | int_rate | installment | sub_grade | home_ownership | annual_inc | verification_status | purpose | addr_state | dti | delinq_2yrs | inq_last_6mths | mths_since_last_delinq | mths_since_last_record | open_acc | pub_rec | revol_bal | revol_util | total_acc | collections_12_mths_ex_med | mths_since_last_major_derog | acc_now_delinq | tot_coll_amt | total_rev_hi_lim | earliest_cr_line_year | last_credit_pull_d_year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10149342 | 27050.0 | 10.99 | 885.46 | 6 | 2 | 55000.0 | 2 | 2 | 22 | 22.87 | 0.0 | 0.0 | 300.0 | 300.0 | 14.0 | 0.0 | 36638.0 | 61.2 | 27.0 | 0.0 | 300.0 | 0.0 | 0.0 | 59900.0 | 40 | 3 |
| 10159584 | 9750.0 | 13.98 | 333.14 | 10 | 3 | 26000.0 | 0 | 2 | 4 | 25.12 | 0.0 | 0.0 | 300.0 | 300.0 | 12.0 | 0.0 | 7967.0 | 52.8 | 28.0 | 0.0 | 300.0 | 0.0 | 0.0 | 15100.0 | 61 | 4 |
| 10159498 | 12000.0 | 6.62 | 368.45 | 1 | 0 | 105000.0 | 0 | 2 | 6 | 14.05 | 0.0 | 1.0 | 43.0 | 300.0 | 12.0 | 0.0 | 13168.0 | 21.6 | 22.0 | 0.0 | 300.0 | 0.0 | 0.0 | 61100.0 | 48 | 4 |
| 10139658 | 12000.0 | 13.53 | 407.4 | 9 | 3 | 40000.0 | 1 | 2 | 32 | 16.94 | 0.0 | 0.0 | 53.0 | 33.0 | 7.0 | 2.0 | 5572.0 | 68.8 | 32.0 | 0.0 | 53.0 | 0.0 | 15386.0 | 8100.0 | 52 | 4 |
| 10159548 | 15000.0 | 8.9 | 476.3 | 4 | 0 | 63000.0 | 0 | 2 | 9 | 16.51 | 0.0 | 0.0 | 34.0 | 300.0 | 8.0 | 0.0 | 11431.0 | 74.2 | 29.0 | 0.0 | 34.0 | 0.0 | 1514.0 | 15400.0 | 52 | 4 |


In [None]:
X.to_csv('features.csv')
y.to_csv('labels.csv')

### 4.7.3 Train Test Split<a id='4.7.3_Train_Test_Split'></a>

Here I split the dataset into training and test sets. Only the label-encoded features are split in this notebook; no one-hot encoded version is created here. The test set is 30% of the data. Since only about 5% of loans default, I set the `stratify` parameter so that both splits keep the same proportion of defaults.

In [21]:
X, X_test, y, y_test = train_test_split(X, y, test_size=0.3, random_state=100, stratify=y)
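
To illustrate what `stratify` does, here is a small sketch on synthetic labels with a 5% positive rate. The arrays `X_toy` and `y_toy` are made up. The split parameters mirror the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 50 positives out of 1000 rows (made-up data)
y_toy = np.array([1] * 50 + [0] * 950)
X_toy = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100, stratify=y_toy
)

# With stratification both splits keep the 5% positive rate
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the positive rate in a split of a rare class can drift away from the overall rate by chance.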

### 4.7.4 Scale the Data<a id='4.7.4_Scale_the_Data'></a>

For ML, I plan on using tree-based models and logistic regression. Neither of these models assumes that the features are normally distributed, so I use a min-max scaler (`MinMaxScaler`), which rescales each feature to the range [0, 1]. The scaler is fit on the training split only and then applied to the test split.

In [22]:
scale = MinMaxScaler()
X = scale.fit_transform(X)
X_test = scale.transform(X_test)
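
The fit-on-train, transform-on-test pattern can be seen on a toy example. This is a sketch with made-up single-feature arrays. Note that a test value outside the training range lands outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10 and max=30
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Fitting the scaler on the training split alone keeps information about the test split from leaking into the preprocessing step.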

### 4.7.5 Balancing the Data<a id='4.7.5_Balancing_the_Data'></a>

Since the dataset only contains approximately 5% defaulted loans, the training data needs to be balanced. I first use SMOTE to oversample the minority class up to 10% of the size of the majority class, and then use RandomUnderSampler to undersample the majority class until the minority-to-majority ratio is 0.5.

In [23]:
y = y.to_numpy().ravel()

In [24]:
# summarize class distribution
counter = Counter(y)
print(counter)

Counter({0: 546317, 1: 25496})


In [25]:
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)

In [26]:
# transform the dataset
X, y = pipeline.fit_resample(X, y)

# summarize the new class distribution
counter = Counter(y)
print(counter)



Counter({0: 109262, 1: 54631})
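
The resampled counts follow from the two `sampling_strategy` ratios. The arithmetic is sketched below, assuming fractional counts are truncated, which reproduces the numbers printed above.

```python
# Class counts in the training split before resampling (from the Counter output)
majority, minority = 546317, 25496

# SMOTE with sampling_strategy=0.1: grow the minority class until
# minority / majority reaches 0.1 (assumed to truncate to an integer)
minority_after_smote = int(0.1 * majority)

# RandomUnderSampler with sampling_strategy=0.5: shrink the majority class
# until minority / majority reaches 0.5
majority_after_under = int(minority_after_smote / 0.5)

print(majority_after_under, minority_after_smote)
```

The final training set therefore has one default for every two non-defaults, instead of roughly one for every twenty-one.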




## 4.8 Save the Dataset<a id='4.8_Save_the_Dataset'></a>

In [27]:
# Save the data for use in the next modeling notebook
savetxt('X_train.csv', X, delimiter=',')
savetxt('y_train.csv', y, delimiter=',')
savetxt('X_test.csv', X_test, delimiter=',')
savetxt('y_test.csv', y_test, delimiter=',')
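
Files written with `savetxt` contain values only, with no header row or index column. A sketch of the round trip with `numpy.loadtxt`, using made-up arrays and a temporary directory rather than the real files:

```python
import os
import tempfile

import numpy as np

# Made-up arrays standing in for the scaled features and the labels
X_toy = np.array([[0.0, 0.5], [1.0, 0.25]])
y_toy = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X_toy.csv")
    y_path = os.path.join(tmp, "y_toy.csv")
    np.savetxt(x_path, X_toy, delimiter=",")
    np.savetxt(y_path, y_toy, delimiter=",")

    # Values come back as floats; labels are cast back to integers
    X_back = np.loadtxt(x_path, delimiter=",")
    y_back = np.loadtxt(y_path, delimiter=",").astype(int)

print(X_back.shape, y_back.tolist())
```

Because the column names are not stored, the feature order has to be tracked separately if it is needed in the modeling notebook.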

In [28]:
! zip train_test.csv.zip *.csv

  adding: X_test.csv (deflated 85%)
  adding: X_train.csv (deflated 82%)
  adding: y_test.csv (deflated 99%)
  adding: y_train.csv (deflated 100%)


In [None]:
X_hold.to_csv('X_holdout.csv')
y_hold.to_csv('y_holdout.csv')