# Capstone Project 1
# Lending Club Loan Status Analysis
## Part 4: Modelling Loan Status: Machine Learning Approach

Data Source: Kaggle Dataset -- Lending Club Loan Data  
URL: https://www.kaggle.com/wendykan/lending-club-loan-data  
Analyst: Eugene Wen

In [1]:
# Load the cleaned dataset from previous steps.
%run ./py/FE.py
%matplotlib inline

In [2]:
# Quickly check the dataframe loaded in.
loan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887379 entries, 0 to 887378
Data columns (total 33 columns):
funded_amnt                    887379 non-null float64
term                           887379 non-null object
int_rate                       887379 non-null float64
annual_inc                     887379 non-null float64
verification_status            887379 non-null object
purpose                        887379 non-null object
addr_state                     887379 non-null object
dti                            887379 non-null float64
delinq_2yrs                    887379 non-null float64
inq_last_6mths                 887379 non-null float64
mths_since_last_delinq         887379 non-null float64
mths_since_last_record         887379 non-null float64
open_acc                       887379 non-null float64
pub_rec                        887379 non-null float64
revol_bal                      887379 non-null float64
revol_util                     887379 non-null float64
total_acc    

### Training and Test Set Preparation
As we have seen in the previous section, the target (loan_status_simple) is not balaced and including a level Issued that needs to be removed from the target. This level, however, could be used as new data for prediction. 

For convenience, we will take the following steps to prepare the training, test and prediction sets:  
1. Use custom functions to conduct standardization and dummy coding on the dataframe.
2. Split status = Issued as predict set. Undersample status = Good observations to 10% (yield 81149 rows) and append to status = Bad to form train_test_set.
3. Split the train_test_set into X_train, y_train, X_test and y_test for further modeling.

In [3]:
import pandas as pd
import numpy as np
# Define std_scaler that takes a dataframe and standardize all numerical columns.
# Return updated dataframe.
def std_scaler(df):
    cols = df.select_dtypes(include=["float64"]).columns.tolist()
    for col in cols: 
        df[col] = (df[col] - np.mean(df[col]))/np.std(df[col])
    return df
    
# Define dummy_encoder that takes a dataframe and generate (k-1) dummy columns  for all categorical columns.
# Return updated dataframe.
def dummy_encoder(df):
    cols = df.select_dtypes(include=["object"]).columns.tolist()
    for col in cols:
        dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
        df = pd.concat([df, dummies], axis=1)
        df.drop(col, axis = 1, inplace = True)
    return df

In [4]:
# Create a copy of loan dataframe
loan_copy = loan.copy()

In [5]:
# Process the dataframe using custom functions
loan_copy = std_scaler(loan_copy)
loan_copy = dummy_encoder(loan_copy)

In [16]:
# Undersampling the large category by 10%
loan_pred = loan_copy.loc[loan_copy.loan_status_simple_Issued == 1, ]
loan_pred.drop(["loan_status_simple_Good", "loan_status_simple_Issued"], axis = 1, inplace=True)

loan_bad = loan_copy.loc[(loan_copy.loan_status_simple_Good == 0) & (loan_copy.loan_status_simple_Issued == 0), ]
loan_bad["isBad"] = np.abs(loan_bad.loan_status_simple_Good - 1)
loan_bad.drop(["loan_status_simple_Issued", "loan_status_simple_Good"], axis = 1, inplace=True)

loan_good = loan_copy.loc[loan_copy.loan_status_simple_Good == 1, ].sample(frac=0.1)
loan_good["isBad"] = np.abs(loan_good.loan_status_simple_Good - 1)
loan_good.drop(["loan_status_simple_Issued", "loan_status_simple_Good"], axis = 1, inplace=True)

loan_train_test = pd.concat([loan_good, loan_bad], axis=0)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [17]:
# For model debugging
loan_train_test = loan_train_test.sample(frac=0.1)

In [18]:
# Split the dataframe
from sklearn.model_selection import train_test_split
all_features = loan_train_test.columns.drop("isBad").tolist()
X_train, X_test, y_train, y_test = train_test_split(loan_train_test[all_features], loan_train_test["isBad"], test_size = 0.3, random_state = 1234) 

### First Model


In [19]:
# Extra Tree Classifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

etc = ExtraTreesClassifier()
etc.fit(X_train, y_train)
predictions = etc.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(accuracy)

0.795199641095


In [20]:
# Feature Importance
feature_importance = pd.Series(etc.feature_importances_, index=all_features).sort_values(ascending=False)
print("Feature importance based on Extra Tree Classifier:")
print("**************************************************")
print(feature_importance)

Feature importance based on Extra Tree Classifier:
**************************************************
out_prncp                              0.107397
last_pymnt_amnt                        0.082758
recoveries                             0.064841
int_rate                               0.060296
total_pymnt                            0.042443
funded_amnt                            0.029370
total_rec_int                          0.027463
revol_util                             0.026363
cr_hist_yr                             0.025050
dti                                    0.024758
tot_cur_bal                            0.023915
revol_bal                              0.023335
total_rec_late_fee                     0.023234
annual_inc                             0.022484
open_acc                               0.021879
emp_length_yr                          0.021855
total_acc                              0.021526
initial_list_status_w                  0.021250
inq_last_6mths                    

In [11]:
opt_features = feature_importance3[feature_importance3 > 0.05].index.tolist()

0.831090174966


### Model Tuning and Pipeline: Choose the best performing model

In [23]:
# Select and Tune Different Algorithms: LR, RF, SVM, Naive Bayesian Classifier, XGBoost
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
#from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

In [32]:
# Define model selection function
def choose_model(df, feature_list, target_df):
    X = df[feature_list]
    y = target_df
    
    dict_list = [
        {
            "name": "LogisticRegression",
            "estimator": LogisticRegression(),
            "hyperparameters":
                {
                    "solver": ["newton-cg", "lbfgs", "liblinear"]
                }
        },

        
#        {
#            "name": "RandomForestClassifier",
#            "estimator": RandomForestClassifier(),
#            "hyperparameters":
#                {
#                    "n_estimators": [4,6,9],
#                    "criterion":["entropy", "gini"],
#                    "max_depth": [2,5,10],
#                    "max_features":["log2","sqrt"],
#                    "min_samples_leaf":[1,5,8],
#                    "min_samples_split":[2,3,5]
#                }
#        },
        
        {
            "name": "SupportVectorMachine",
            "estimator": SVC(),
            "hyperparameters":
                {
                    "kernel": ["linear"],
                    "C": [1, 2]
                }
        },
        
        {
            "name": "NaiveBayesian",
            "estimator": GaussianNB(),
            "hyperparameters": None
        }]
    for dict in dict_list:
        print(dict["name"])
        grid = GridSearchCV(dict["estimator"], param_grid=dict["hyperparameters"], cv=10)
        grid.fit(X, y)
        dict["best_params"] = grid.best_params_
        dict["best_score"] = grid.best_score_
        dict["best_estimator"] = grid.best_estimator_
        print(grid.best_params_)
        print(grid.best_score_)
    
    return dict_list

In [33]:
selected_MLmodel = choose_model(X_train, opt_features, y_train)

LogisticRegression
{'solver': 'newton-cg'}
0.781923076923
SupportVectorMachine
{'C': 1, 'kernel': 'linear'}
0.784903846154
NaiveBayesian


TypeError: 'NoneType' object is not iterable

It seems that the best model is random forest.

In [34]:
# Score x_test dataset
rfc = RandomForestClassifier()
rfc.fit(X_train[opt_features], y_train)
predictions_opt = rfc.predict(X_test[opt_features])
accuracy_opt = accuracy_score(y_test, predictions_opt)
print(accuracy_opt)

0.831314490803


In [47]:
# Evaluate model performance using ROC
from sklearn.metrics import roc_curve, auc

y_score = rfc.predict_proba(X_test[opt_features])[:,1]
fpr, tpr, th = roc_curve(y_test, y_score)
#roc_auc = auc(fpr, tpr)

ValueError: Data is not binary and pos_label is not specified

In [46]:
y_score

array([ 0. ,  1. ,  0.1, ...,  1. ,  0.6,  0. ])