##### Lending Club Data Project

# 1 Data Cleansing

In this notebook, we prepare the data for EDA and modelling. 
Since the data science life cycle is iterative, it sometimes becomes clear only at a *later* stage that further changes to the data preprocessing are necessary. Therefore, the sequence of data manipulations in this notebook may refer to insights from notebooks 2 and 3. 

__Content__ <br>
1.1 Variable Descriptions <br>
1.2 Data Import <br>
1.3 Deletion of variables not needed and variable recoding <br>
1.4 Train-Test-Split <br>
1.5 Missing Values in Training Data <br>
1.6 Rescaling of Training Data <br>
1.7 Preprocessing of Test Data <br>

In [1]:
reset -fs

## 1.1 Variable Descriptions

| variable                   | description                                                                                                                                                                                              |
|----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| acc_now_delinq             | The number of accounts on which the borrower is now delinquent.                                                                                                                                          |
| addr_state                 | state provided by the borrower in the loan application                                                                                                                                                   |
| annual_inc                 | self-reported annual income provided by the borrower during registration                                                                                                                                 |
| application_type           | indicates whether the loan is an individual application or a joint application with two co-borrowers                                                                                                     |
| chargeoff_within_12_mths   | number of charge-offs within 12 months.                                                                                                                                                                  |
| collection_recovery_fee    | post charge off collection fee                                                                                                                                                                           |
| collections_12_mths_ex_med | Number of collections in 12 months excluding medical collections                                                                                                                                         |
| delinq_amnt                | amount the borrower is now delinquent                                                                                                                                                                    |
| delinq_2yrs                | delinquency in the borrower's credit file for the past 2 years                                                                                                                                           |
| dti                        | A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income. |
| earliest_cr_line           | The month the borrower's earliest reported credit line was opened                                                                                                                                        |
| emp_length                 | Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.                                                                        |
| emp_title                  | The job title supplied by the Borrower when applying for the loan.*                                                                                                                                      |
| funded_amnt                | The total amount committed to that loan at that point in time.                                                                                                                                           |
| funded_amnt_inv            | The total amount committed by investors for that loan at that point in time.                                                                                                                             |
| grade                      | LC assigned loan grade                                                                                                                                                                                   |
| home_ownership             | The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.                                                                                      |
| id                         | A unique LC assigned ID for the loan listing.                                                                                                                                                            |
| initial_list_status        | The initial listing status of the loan. Possible values are – W, F                                                                                                                                       |
| inq_last_6mths             | The number of inquiries in past 6 months (excluding auto and mortgage inquiries)                                                                                                                         |
| installment                | The monthly payment owed by the borrower if the loan originates.                                                                                                                                         |
| int_rate                   | Interest Rate on the loan                                                                                                                                                                                |
| issue_d                    | The month which the loan was funded                                                                                                                                                                      |
| last_credit_pull_d         | The most recent month LC pulled credit for this loan                                                                                                                                                     |
| last_pymnt_amnt            | Last total payment amount received                                                                                                                                                                       |
| last_pymnt_d               | Last month payment was received                                                                                                                                                                          |
| loan_amnt                  | The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.                             |
| loan_status                | Current status of the loan <br> * Fully Paid: Principal amount of loans that have been fully paid <br> * Current: Principal amount of loans that are in "current" or "grace period" status <br> * Late: Principal amount of loans that are 16+ days late but have not charged off <br> * Charged off (net): Total amount charged off net of any funds subsequently recovered. <br> Principal and interest payments received prior to charge off and recoveries made after charge off are not included here; they are included in the "Principal Payments Received" or "Interest Payments Received" columns. As a result, the fully paid, current, late, and charged off columns do not add up to 100% for the "% of Issued dollars" view.                                                                                                                                                                            |
| member_id                  | A unique LC assigned Id for the borrower member.                                                                                                                                                         |
| open_acc                   | The number of open credit lines in the borrower's credit file.                                                                                                                                           |
| out_prncp                  | Remaining outstanding principal for total amount funded                                                                                                                                                  |
| out_prncp_inv              | Remaining outstanding principal for portion of total amount funded by investors                                                                                                                          |
| policy_code                | policy_code=1: publicly available; policy_code=2: new products not publicly available                                                                                                                    |
| pub_rec                    | Number of derogatory public records                                                                                                                                                                      |
| pub_rec_bankruptcies       | Number of derogatory public records of bankruptcies                                                                                                                                                                      |
| purpose                    | A category provided by the borrower for the loan request.                                                                                                                                                |
| pymnt_plan                 | Indicates if a payment plan has been put in place for the loan                                                                                                                                           |
| recoveries                 | post charge off gross recovery                                                                                                                                                                           |
| revol_bal                  | Total credit revolving balance                                                                                                                                                                           |
| revol_util                 | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.                                                                               |
| sub_grade                  | LC assigned loan subgrade                                                                                                                                                                                |
| tax_liens                  | Legal claim against the assets of an individual or business who fails to pay taxes owed to the government                                                                                                |
| term                       | The number of payments on the loan. Values are in months and can be either 36 or 60.                                                                                                                     |
| title                      | The loan title provided by the borrower                                                                                                                                                                  |
| total_acc                  | The total number of credit lines currently in the borrower's credit file                                                                                                                                 |
| total_pymnt                | Payments received to date for total amount funded                                                                                                                                                        |
| total_pymnt_inv            | Payments received to date for portion of total amount funded by investors                                                                                                                                |
| total_rec_int              | Interest received to date                                                                                                                                                                                |
| total_rec_late_fee         | Late fees received to date                                                                                                                                                                               |
| total_rec_prncp            | Principal received to date                                                                                                                                                                               |
| verification_status        | Indicates if the borrower's income was verified by LC, not verified, or if the income source was verified                                                                                       |
| zip_code                   | The first 3 numbers of the zip code provided by the borrower in the loan application.                                                                                                                    |

## 1.2 Data Import

In [2]:
#Standard imports 
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
import pandas as pd

from scipy import stats
import seaborn as sns
import statsmodels.api as sms
import statsmodels.formula.api as smf

from sklearn.model_selection import train_test_split 

In [3]:
#load data
loans_total = pd.read_csv("data/loans_2007.csv" )
pd.set_option("display.max_columns", None)
loans_total.head()



Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [4]:
loans_total.shape

(42538, 52)

## 1.3 Deletion of variables not needed and variable recoding

In the following, a large number of variables are dropped:
* Some variables contain information that an investor cannot know when deciding whom to support with a loan (e.g., measures related to delayed payback).
* Some variables are redundant. 
* Some variables have no predictive value for the target or are not really connected to it.
* Some variables do not vary (e.g., application type).

In [5]:
loans_total.drop(['id', 'member_id', 'delinq_amnt','chargeoff_within_12_mths', 'acc_now_delinq','last_pymnt_d','last_pymnt_amnt','total_rec_prncp', 
                  'total_rec_late_fee', 'total_rec_int', 'funded_amnt_inv', "pymnt_plan", "initial_list_status", "application_type", 
                  'recoveries', 'sub_grade', 'emp_title', 'last_credit_pull_d', 'zip_code', 'policy_code', 'pub_rec_bankruptcies', 
                  'tax_liens', 'earliest_cr_line', 'title'  ], axis=1, inplace=True)

Checking if the object variables' datatypes need to be changed:

In [6]:
loans_total.select_dtypes('object').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42538 entries, 0 to 42537
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   term                 42535 non-null  object
 1   int_rate             42535 non-null  object
 2   grade                42535 non-null  object
 3   emp_length           41423 non-null  object
 4   home_ownership       42535 non-null  object
 5   verification_status  42535 non-null  object
 6   issue_d              42535 non-null  object
 7   loan_status          42535 non-null  object
 8   purpose              42535 non-null  object
 9   addr_state           42535 non-null  object
 10  revol_util           42445 non-null  object
dtypes: object(11)
memory usage: 3.6+ MB


These changes will be executed:
* change to float:
    * revol_util
    * int_rate
* change to date:
    * issue_d

In [7]:
loans_total.int_rate = loans_total.int_rate.str.replace("%","")
loans_total.int_rate = loans_total.int_rate.astype(float)
loans_total.int_rate = (loans_total.int_rate)/100
loans_total.revol_util = loans_total.revol_util.str.replace("%","")
loans_total.revol_util = loans_total.revol_util.astype(float)
loans_total.revol_util = (loans_total.revol_util)/100
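The percent-string conversion in the cell above can be tried out on a toy Series. This is only a sketch: the values and the names `rates` and `rates_float` are made up for illustration, and `str.rstrip` is used here as one possible alternative to `str.replace`.

```python
import pandas as pd

# Toy values for illustration (hypothetical, not read from the loan file)
rates = pd.Series(["10.65%", "15.27%", None])

# Strip the trailing percent sign, cast to float, rescale to a proportion.
# Missing entries stay missing (NaN) throughout.
rates_float = rates.str.rstrip("%").astype(float) / 100

print(rates_float)
```

Chaining the three steps avoids reassigning the column three times; the result should be the same as in the step-by-step version.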

Grade is recoded to numerical values. Since we assume the grades are equidistant, we treat the variable as pseudo-metric.

In [8]:
loans_total.grade = loans_total.grade.replace({'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7})

In [9]:
#Convert grade to float (astype returns a new Series, so the result is assigned back)
loans_total.grade = loans_total.grade.astype('float')
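As an illustration of the grade recoding, the letter-to-number mapping can also be built programmatically. The snippet below is a sketch on toy data; `grades`, `grade_codes` and `grades_num` are hypothetical names. Note that `map` (used here) turns any value missing from the dictionary into NaN, whereas `replace` leaves unmatched values untouched.

```python
import pandas as pd

# Toy grades for illustration, including one missing value
grades = pd.Series(["A", "C", "G", None])

# Build {'A': 1, 'B': 2, ..., 'G': 7} instead of typing each pair
grade_codes = {letter: rank for rank, letter in enumerate("ABCDEFG", start=1)}

# map() returns a new Series; values not in the dict (here: None) become NaN
grades_num = grades.map(grade_codes).astype(float)

print(grades_num)
```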

The issue_d variable is converted to year-format and numeric.

In [10]:
from datetime import datetime
loans_total.issue_d = pd.to_datetime(loans_total.issue_d).dt.year

In [11]:
#change datatype of issue_d (astype returns a new Series, so the result is assigned back)
loans_total.issue_d = loans_total.issue_d.astype('float')
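The year extraction can be sketched on a few toy strings in the same `Mon-YYYY` layout as `issue_d`. The explicit `format="%b-%Y"` is an assumption added here to make the parsing unambiguous (it relies on English month abbreviations); the notebook cell itself lets pandas infer the format.

```python
import pandas as pd

# Toy strings in the same layout as issue_d, plus one missing value
issue = pd.Series(["Dec-2011", "Jan-2009", None])

# Parse month-year strings and keep only the year.
# Because of the missing value (NaT), the resulting years are floats.
issue_year = pd.to_datetime(issue, format="%b-%Y").dt.year

print(issue_year)
```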

The years 2007 and 2008 are dropped since the EDA (see second notebook) showed that these years differ strongly from the others. This is not surprising since Lending Club was founded in 2007, so the first two years might have been influenced by teething problems (like difficulties with finding investors). Therefore, these two years are not representative. 

In [12]:
loans_total.drop(loans_total.loc[loans_total['issue_d']== 2007.0].index, inplace=True)
loans_total.drop(loans_total.loc[loans_total['issue_d']==2008.0].index, inplace=True)

Let's check whether the frequencies of the target variable are balanced. 

In [13]:
loans_total.loan_status.value_counts()

Fully Paid                                             31615
Charged Off                                             5342
Does not meet the credit policy. Status:Fully Paid      1167
Current                                                  961
Does not meet the credit policy. Status:Charged Off      399
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64

The data is very imbalanced: the majority of borrowers fully pay back their loans (which is good news for investors but difficult to handle in data analysis).
For predicting whether borrowers will pay back or not, only the 'Fully Paid' and 'Charged Off' values are relevant. We drop the other statuses, which is acceptable since they make up only a small proportion of the data. 
Additionally, we recode the target variable to numeric (0 = charged off, 1 = fully paid).

In [14]:
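# Intended to drop the three missing values in the target variable before the train test split.
# Note: dropna() returns a new Series and the result is not assigned, so these rows stay in loans_total for now;
# they are recoded to an impossible value below and dropped in sections 1.5 (train) and 1.7 (test).
loans_total.loan_status.dropna();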

In [15]:
loans_total.drop(loans_total.loc[loans_total['loan_status']=='Current'].index, inplace=True)
loans_total.drop(loans_total.loc[loans_total['loan_status']=='In Grace Period'].index, inplace=True)
loans_total.drop(loans_total.loc[loans_total['loan_status']=='Default'].index, inplace=True)
loans_total.drop(loans_total.loc[loans_total['loan_status']=='Late (31-120 days)'].index, inplace=True)
loans_total.drop(loans_total.loc[loans_total['loan_status']=='Late (16-30 days)'].index, inplace=True)
loans_total.drop(loans_total.loc[loans_total['loan_status']=='Does not meet the credit policy. Status:Fully Paid'].index, inplace=True)
loans_total.drop(loans_total.loc[loans_total['loan_status']=='Does not meet the credit policy. Status:Charged Off'].index, inplace=True)

In [16]:
#Recode only the target column instead of searching the whole dataframe
loans_total.loan_status = loans_total.loan_status.replace({'Charged Off': 0, 'Fully Paid': 1})
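The filtering and recoding of the target can be written more compactly with `isin` and `map`. The following is a sketch on toy data with hypothetical names (`status`, `keep`, `binary`). Unlike the cells above, `isin` would also discard rows whose status is missing.

```python
import pandas as pd

# Toy target column for illustration
status = pd.DataFrame(
    {"loan_status": ["Fully Paid", "Current", "Charged Off", "Default", "Fully Paid"]}
)

# Keep only the two outcomes relevant for the prediction task
keep = status["loan_status"].isin(["Fully Paid", "Charged Off"])
binary = status.loc[keep].copy()

# Recode to 0 = charged off, 1 = fully paid
binary["loan_status"] = binary["loan_status"].map({"Charged Off": 0, "Fully Paid": 1})

print(binary)
```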

The term variable is changed to numeric. To do so, the substring 'months' is removed.

In [17]:
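loans_total.term = loans_total.term.str.replace('months', '')
#Note: astype returns a new Series; the result is not assigned here, so term remains an object column (see section 1.6).
loans_total.term.astype('float');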

Since the train-test split (see section 1.4) has difficulties handling missing data, missing values are temporarily recoded to impossible values. Later, they are changed back to NaN.

In [18]:
#Replace numerical nans by -9999 as impossible value
num_cols = ['loan_status', 'term', 'loan_amnt', 'funded_amnt', 'int_rate', 'grade',
       'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths',
       'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc']
loans_total.loc[:, num_cols] = loans_total.loc[:, num_cols].fillna(-9999)

In [19]:
#Replace string nans by '?' as impossible value
loans_total.loc[:, ['emp_length', 'home_ownership', 'verification_status', 'purpose', 'addr_state']] = loans_total.loc[:, ['emp_length', 'home_ownership', 'verification_status', 'purpose', 'addr_state']].fillna("?");

## 1.4 Train Test Split

The dataset is split into train and test sets. Any further data manipulations are executed for train and test data separately. 

Since frequencies are not balanced, the train-test-split will be stratified. 

In [20]:
#Create train and test datasets
y = loans_total.loan_status
X = loans_total[['loan_amnt', 'funded_amnt',  'term', 'int_rate',
       'installment', 'grade', 'emp_length', 'home_ownership', 'annual_inc',
       'verification_status', 'issue_d', 'purpose',
       'addr_state', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc',
       'pub_rec', 'revol_bal', 'revol_util', 'total_acc']]
#Train-test-split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .33, random_state = 42, stratify = y)
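To illustrate what `stratify` does, the sketch below splits a toy target with roughly the same imbalance as the value counts above (about 85% fully paid). All names (`X_toy`, `y_toy`, `X_tr`, ...) and the 80/20 split are hypothetical choices for the illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 85 positive and 15 negative cases
y_toy = pd.Series([1] * 85 + [0] * 15)
X_toy = pd.DataFrame({"feature": np.arange(100)})

# stratify=y_toy keeps the class proportions (almost) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of the positive class in each part
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the share of charged-off loans could differ between train and test data purely by chance.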

In [21]:
#Save test data in separate file 
test_data = pd.concat([X_test, y_test], axis = 1)
test_data.reset_index(inplace = True)
test_data.drop('index', inplace = True, axis = 1)
test_data.to_csv('test_data.csv')

In [22]:
#Save train data in separate file 
train_data = pd.concat([X_train, y_train], axis = 1)
train_data.reset_index(inplace = True)
train_data.drop('index', inplace = True, axis = 1)
train_data.to_csv('train_data.csv')

## 1.5 Missing values in training data

In [23]:
#Re-establish missing data which were temporarily replaced by impossible values for the train test split. 
train_data.replace('?', np.nan, inplace = True)
train_data.replace(-9999, np.nan, inplace = True)

Rows with missing data are dropped for variables with fewer than 100 missing values.

In [24]:
train_data.isna().sum()

loan_amnt                2
funded_amnt              2
term                     2
int_rate                 2
installment              2
grade                    2
emp_length             689
home_ownership           2
annual_inc               2
verification_status      2
issue_d                  2
purpose                  2
addr_state               2
dti                      2
delinq_2yrs              2
inq_last_6mths           2
open_acc                 2
pub_rec                  2
revol_bal                2
revol_util              34
total_acc                2
loan_status              2
dtype: int64

In [25]:
train_data.dropna(subset = ['loan_amnt', 'funded_amnt',  'term', 'int_rate',
       'installment', 'grade', 'home_ownership', 'annual_inc',
       'verification_status', 'issue_d', 'purpose', 'addr_state', 'dti',
       'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc',  'loan_status'], inplace = True)

In [26]:
train_data.isna().sum()

loan_amnt                0
funded_amnt              0
term                     0
int_rate                 0
installment              0
grade                    0
emp_length             685
home_ownership           0
annual_inc               0
verification_status      0
issue_d                  0
purpose                  0
addr_state               0
dti                      0
delinq_2yrs              0
inq_last_6mths           0
open_acc                 0
pub_rec                  0
revol_bal                0
revol_util               0
total_acc                0
loan_status              0
dtype: int64

In [27]:
train_data[['emp_length']].describe().round(2)

Unnamed: 0,emp_length
count,24044
unique,11
top,10+ years
freq,5453


The mode ('10+ years', 5453 of 24044 non-missing values) is not frequent enough to justify replacing missing values by the mode. Therefore, missing values are replaced by a new value 'unknown'.

In [28]:
train_data.emp_length.fillna('unknown', inplace = True)
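The share of the mode among the non-missing values can be computed directly, which makes the decision above easier to reproduce. This is a sketch on toy data; `emp`, `mode_share` and `emp_filled` are hypothetical names.

```python
import pandas as pd

# Toy employment lengths for illustration, with two missing values
emp = pd.Series(["10+ years", "2 years", None, "10+ years", "< 1 year", None])

# Relative frequency of the most common category (missing values are ignored)
mode_share = emp.value_counts(normalize=True).iloc[0]

# Treat missingness as its own category instead of imputing the mode
emp_filled = emp.fillna("unknown")

print(mode_share)
print(emp_filled.value_counts())
```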

## 1.5 Outlier detection and handling

#### Numerical Data

Values above the 99.5th percentile of a numerical variable are treated as outliers, and the corresponding rows are removed. This is a very liberal criterion, but we want to maximise the generalization of the model.

In [29]:
#exclude outliers in numerical data
for var in train_data.select_dtypes(include = 'number').columns:
    train_data.drop(train_data.loc[train_data[var]> train_data[var].quantile(q = .995)].index, inplace = True);
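For a single column, the percentile criterion can be sketched as follows. The data and the names `toy`, `cutoff` and `trimmed` are hypothetical. The loop above differs in that it processes the columns one after another, so each cutoff is computed on the rows that survived the previous columns.

```python
import numpy as np
import pandas as pd

# Toy column with the values 1, 2, ..., 1000
toy = pd.DataFrame({"annual_inc": np.arange(1, 1001, dtype=float)})

# 99.5th percentile of the column (linear interpolation by default)
cutoff = toy["annual_inc"].quantile(0.995)

# Keep only rows at or below the cutoff
trimmed = toy[toy["annual_inc"] <= cutoff]

print(cutoff, len(toy) - len(trimmed))
```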

#### Categorical Data

Even though some categories are rare, no rows are excluded based on the categorical variables: rare categories are unusual only in their frequency, not in their *value*.

In [30]:
#Save data
train_data.to_csv('train_data.csv')

## 1.6 Rescale train data

Data are rescaled using the RobustScaler.
Its centering and scaling statistics are based on percentiles (by default the median and the interquartile range) and are therefore not influenced by a small number of very large marginal outliers. Consequently, the transformed values can span a wider range than they would with, for example, min-max scaling, but the ranges of the individual features are approximately similar.
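With its default settings, RobustScaler transforms each feature as (x - median) / IQR, where IQR is the difference between the 75th and the 25th percentile. The sketch below checks this on a toy column containing one extreme value; the array `x` and the names `scaled` and `manual` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy feature with one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Scaling done by scikit-learn (default quantile range: 25th to 75th percentile)
scaled = RobustScaler().fit_transform(x)

# The same transformation written out by hand
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(np.allclose(scaled, manual))
```

The extreme value barely affects the median and the interquartile range, so the remaining values keep a sensible scale.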

In [31]:
#Separate numerical and non-numerical variables since only the numerical ones can be rescaled.
X_train_num = train_data.copy().select_dtypes('number')

In [32]:
#The term variable is still an object column: the astype('float') call in section 1.3 returned a new Series that was never assigned back.
#Therefore, it is added by hand to the numerical variables.
X_train_num['term'] = train_data.term.copy()

In [33]:
#Drop loan status as the dependent variable
X_train_num.drop('loan_status', inplace = True, axis = 1)

In [34]:
#Categorical variables 
X_train_cat = pd.get_dummies(train_data[['emp_length', 'home_ownership',
       'verification_status',  'purpose', 'addr_state']], drop_first = True)

In [35]:
! pip install sklearn-pandas



In [36]:
#Rescale data
from sklearn.preprocessing import RobustScaler
from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper([(X_train_num.columns, RobustScaler())])
scaled_features = mapper.fit_transform(X_train_num.copy(), 4)
X_r = pd.DataFrame(scaled_features, index=X_train_num.index, columns=X_train_num.columns)

In [37]:
X_r.head()

Unnamed: 0,loan_amnt,funded_amnt,int_rate,installment,grade,annual_inc,issue_d,dti,delinq_2yrs,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,term
0,-0.403061,-0.367347,-0.891304,-0.361449,-0.5,-0.7899,0.0,-0.4,0.0,-1.0,-1.0,0.0,-0.546581,-0.123932,-0.733333,0.0
1,-0.709184,-0.673469,0.894928,-0.663813,1.5,-1.1799,0.0,-0.436538,0.0,0.0,-1.0,0.0,-0.478905,1.057692,-1.066667,0.0
2,-0.158163,-0.122449,-1.050725,-0.089012,-0.5,-0.1999,0.0,-0.566346,0.0,-1.0,0.0,0.0,-0.121236,-0.732906,-0.266667,0.0
3,0.005102,0.040816,-0.782609,0.124841,-0.5,-0.6498,-1.0,-1.291346,0.0,-1.0,-0.166667,0.0,-0.672444,-1.051282,-0.466667,0.0
4,-0.505102,-0.469388,0.456522,-0.412665,0.5,-1.0299,-1.0,-0.535577,0.0,0.0,-0.833333,0.0,-0.393115,0.950855,-0.666667,0.0


In [38]:
#Join all predictive variables again
X_train = pd.concat([X_r, X_train_cat], axis = 1)

In [39]:
X_train.to_csv('train_data_rescaled.csv')

In [40]:
y_train = train_data.loan_status
y_train.to_csv('y_train.csv')

## 1.7 Preprocessing of test data for later analysis 

Note that although this section is part of the first notebook for practical reasons, the test data hasn't been touched before model evaluation. 

Before we apply our best model to the test data, we need to prepare the data analogously to the training data: 
* Handle missing values
* Make sure only our selected features are in the dataframe
* Rescale the data using the train set scaler. 
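The last point, rescaling the test data with the scaler fitted on the training data, can be sketched with plain scikit-learn as shown below. The toy frames `train_num` and `test_num` are hypothetical; the notebook itself wraps the scaler in a `DataFrameMapper`.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy numerical data for illustration
train_num = pd.DataFrame({"loan_amnt": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]})
test_num = pd.DataFrame({"loan_amnt": [3000.0, 6000.0]})

# Learn median and interquartile range from the training data only
scaler = RobustScaler().fit(train_num)

# Apply the training statistics to the test data (no refitting)
test_scaled = pd.DataFrame(
    scaler.transform(test_num), index=test_num.index, columns=test_num.columns
)

print(test_scaled)
```

Fitting the scaler on the test data instead would leak information about the test distribution into the preprocessing.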

In [41]:
#Re-establish missing data which were temporarily replaced by impossible values for the train test split. 
test_data.replace('?', np.nan, inplace = True)
test_data.replace(-9999, np.nan, inplace = True)

In [42]:
#Rows with missing data are dropped for variables with fewer than 100 missing values.
test_data.isna().sum()

loan_amnt                1
funded_amnt              1
term                     1
int_rate                 1
installment              1
grade                    1
emp_length             350
home_ownership           1
annual_inc               1
verification_status      1
issue_d                  1
purpose                  1
addr_state               1
dti                      1
delinq_2yrs              1
inq_last_6mths           1
open_acc                 1
pub_rec                  1
revol_bal                1
revol_util              11
total_acc                1
loan_status              1
dtype: int64

In [43]:
test_data.dropna(subset = ['loan_amnt', 'funded_amnt', 'term', 'int_rate', 'installment', 'grade',
       'home_ownership', 'annual_inc', 'verification_status',
       'issue_d', 'purpose', 'addr_state', 'dti', 'delinq_2yrs',
       'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util',
       'total_acc', 'loan_status'], inplace = True);

In [44]:
#Replace missing values in emp_length
test_data.emp_length.fillna('unknown', inplace = True)

In [45]:
test_data.to_csv('test_data.csv')

In [46]:
y_test = pd.DataFrame(test_data.loan_status)
X_test = pd.DataFrame(test_data)
X_test.drop('loan_status', axis = 1, inplace = True)
y_test.loan_status.value_counts()

1.0    10425
0.0     1761
Name: loan_status, dtype: int64

In [47]:
#Divide numerical and categorical data
X_test_num = X_test.copy().select_dtypes('number')
X_test_num['term'] = X_test.term.copy()

X_test_cat = pd.get_dummies(X_test[['emp_length', 'home_ownership',
       'verification_status',  'purpose', 'addr_state']], drop_first = True)

In [48]:
from sklearn.preprocessing import RobustScaler
from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper([(X_test_num.columns, RobustScaler())])
mapper.fit(X_train_num.copy(), 4)
scaled_test = mapper.transform(X_test_num.copy())
X_t = pd.DataFrame(scaled_test, index=X_test_num.index, columns=X_test_num.columns)

In [49]:
X_test = pd.concat([X_t, X_test_cat], axis = 1)
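Because the dummy variables are created separately for train and test data, a category that occurs in only one of the two sets would lead to different columns. Whether this happens for the loan data is not shown in this notebook. A possible safeguard, sketched here on toy data with hypothetical names (`train_cat`, `test_cat`, `test_aligned`), is to reindex the test dummies to the training columns.

```python
import pandas as pd

# Toy categorical data: 'wedding' occurs only in the training data
train_cat = pd.get_dummies(
    pd.DataFrame({"purpose": ["car", "other", "wedding"]}), drop_first=True
)
test_cat = pd.get_dummies(
    pd.DataFrame({"purpose": ["car", "other"]}), drop_first=True
)

# Give the test dummies exactly the training columns;
# columns missing in the test data are filled with 0
test_aligned = test_cat.reindex(columns=train_cat.columns, fill_value=0)

print(test_aligned.columns.tolist())
```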

In [50]:
X_test_10 = X_test[['funded_amnt', 'int_rate', 'installment', 'grade', 'annual_inc', 'dti',
       'inq_last_6mths', 'open_acc', 'revol_util', 'total_acc']].copy()

In [51]:
X_test_10.head()

Unnamed: 0,funded_amnt,int_rate,installment,grade,annual_inc,dti,inq_last_6mths,open_acc,revol_util,total_acc
0,0.857143,-0.347826,1.162058,0.0,-0.1749,0.778846,0.0,1.5,-0.559829,0.466667
1,0.244898,0.755435,0.552085,1.5,-0.1998,-0.201923,1.0,0.833333,0.380342,-0.2
2,-0.163265,0.557971,-0.342435,1.0,-0.4955,0.848077,1.0,0.833333,-0.288462,1.866667
3,-0.214286,0.641304,-0.067068,1.0,-0.5748,1.075,0.0,-0.333333,-0.232906,-0.933333
4,0.0,0.028986,0.154576,0.0,0.4251,0.905769,0.0,0.666667,-0.185897,1.8


In [52]:
X_test_10.to_csv('x_test_10.csv')

In [53]:
y_test.to_csv('y_test.csv')