## Classification Functions for Homeworks 2 & 3

This notebook contains code to create the classification models needed for homeworks 2 and 3. It also has code to visualize the performance of the models created.

Please read __ALL__ the comments in the code and the headings. This notebook is NOT intended to be run as a script from top to bottom, although there are some code cells that need to be run first.
- The general utility libraries need to be loaded first
- Then you need to execute the load data and engineer features code cells
- Finally, execute the create X and y from the features code cells

You can use the grid search implementations for each algorithm listed below with the given RFECV implementation, or choose your own feature selection implementation. __OR__ you can implement a pipeline for each of the algorithms using the example pipelines listed in this notebook.

The description box above each algorithm contains a link to sklearn's documentation on that algorithm. Please use that for parameter tuning.

Please note that some algorithms may be sensitive to scale. Therefore, for those algorithms, you may need to use a scaler (such as the StandardScaler or MinMaxScaler) either before feature selection or within your pipeline.
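
As a minimal sketch of the scaling advice above, the block below places a StandardScaler inside a pipeline so that it is re-fit on each cross-validation training fold. The synthetic data from make_classification and the names X_demo, y_demo, and scaled_lr are illustrative assumptions, not part of the homework data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the LendingClub data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_demo[:, 0] *= 10_000  # give one feature a much larger scale than the others

# The scaler lives inside the pipeline, so each CV fold fits it on training rows only
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(scaled_lr, X_demo, y_demo, cv=5, scoring="roc_auc")
print("mean ROC AUC: %.3f" % scores.mean())
```

Keeping the scaler in the pipeline avoids fitting it on rows that later serve as validation data.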

In [None]:
# Load general utilities
# ----------------------
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.axes as ax
import datetime
import numpy as np
import pickle
import time
import seaborn as sns

### PREP AND PREPROCESSING SECTION

###  Load the data and engineer features

In [None]:
# This is the code you can use to open your pickle file
# Read the data and features from the pickle
data, discrete_features, continuous_features, ret_cols = pickle.load( open( "Data/clean_data.pickle", "rb" ) )

In [None]:
# Create the outcome
data["default"] = data.loan_status.isin(["Charged Off", "Default"])

In [None]:
# Create a feature for the length of a person's credit history at the
# time the loan is issued
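# np.timedelta64(1, 'M') is rejected by recent pandas versions, so convert days to
# months using the average month length (365.2425 / 12 days)
data['cr_hist'] = (data.issue_d - data.earliest_cr_line).dt.days / (365.2425 / 12)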
continuous_features.append('cr_hist')

#### If you want to use a smaller sample of the data due to time constraints, use the following code

In [None]:
# this code randomly samples 25% of the rows
# change the frac parameter if you want a different % to sample
# replace=False ensures we won't select the same row twice
data = data.sample(frac=.25, replace=False).copy()
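
Because the sample is random, results will differ from run to run. One possible way to make it repeatable is to pass random_state, sketched here on a toy DataFrame (the frame demo and the seed 42 are assumptions for illustration).

```python
import pandas as pd

# Toy frame standing in for the loan data (assumption)
demo = pd.DataFrame({"loan_amnt": range(100)})

# The same seed yields the same 25% of rows every time
sample_a = demo.sample(frac=0.25, replace=False, random_state=42)
sample_b = demo.sample(frac=0.25, replace=False, random_state=42)
print(len(sample_a), sample_a.index.equals(sample_b.index))
```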

### Create X and y from the features
These next few steps are needed if you are not going to use the pipelines below. __If you are using the pipelines, you can skip down to that section.__

In [None]:
from sklearn.preprocessing import MinMaxScaler

def minMaxScaleContinuous(continuousList):
    return pd.DataFrame(MinMaxScaler().fit_transform(data[continuousList])
                             ,columns=list(data[continuousList].columns)
                             ,index = data[continuousList].index)

def createDiscreteDummies(discreteList):
    return pd.get_dummies(data[discreteList], dummy_na = True, prefix_sep = "::", drop_first = False)

#### VERY IMPORTANT STEP
You need to define which features to use in the modeling. The homework will direct you either to use all the features from the data ingestion and cleaning process or to remove some features because they are defined by LendingClub.

In [None]:
# define the discrete features you want to use in modeling.
# if you want to use all the discrete features, just set discrete_features_touse = discrete_features
discrete_features_touse =['purpose', 'term', 'verification_status', 'emp_length', 'home_ownership']

# define the continuous features to use in modeling
# if you want to use all the continuous features, just set the continuous_features_touse = continuous_features
continuous_features_touse = ['loan_amnt', 'funded_amnt','installment','annual_inc','dti','revol_bal','delinq_2yrs','open_acc',
 'pub_rec','fico_range_high','fico_range_low','revol_util','cr_hist']

In [None]:
#discrete_features_touse=discrete_features
#continuous_features_touse = continuous_features

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Create dummies for categorical features and concatenate with continuous features for X or predictor dataframe

# Use this line of code if you do not want to scale the continuous features
#X_continuous = data[continuous_features_touse]

# use this line if you want to scale the continuous features using the MinMaxScaler in the function defined above
X_continuous = minMaxScaleContinuous(continuous_features_touse)

# create numeric dummy features for the discrete features to be used in modeling
X_discrete = createDiscreteDummies(discrete_features_touse)

#concatenate the continuous and discrete features into one dataframe
X = pd.concat([X_continuous, X_discrete], axis = 1)

# this is the target variable 
target_col = 'default'
y=data[target_col]

# create a test and train split of the transformed data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=.4)

print("Population:\n",y.value_counts())
print("Train:\n", y_train.value_counts())
print("Test:\n", y_test.value_counts())

### RFE with Cross Validation using Decision Tree

This is an example of RFECV using a decision tree. You could use another tree-based algorithm to select features.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html

I would encourage you to experiment with other tree-based algorithms such as ExtraTreesClassifier or Random Forest. The ExtraTreesClassifier fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
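
A minimal sketch of the suggestion above, swapping ExtraTreesClassifier into RFECV. It runs on synthetic data; X_demo, y_demo, selector, and the small n_estimators and cv values are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in data (assumption: not the LendingClub data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=4, random_state=0)

# RFECV needs an estimator exposing feature_importances_ or coef_
et = ExtraTreesClassifier(n_estimators=50, random_state=0)
selector = RFECV(estimator=et, min_features_to_select=4, cv=3, scoring="roc_auc")
selector.fit(X_demo, y_demo)

print("features kept:", selector.n_features_)
print("mask:", selector.support_)
```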

In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFECV

dt = DecisionTreeClassifier(max_depth=30, criterion="entropy")

rfecv = RFECV(estimator=dt, min_features_to_select = 4, cv=10, n_jobs=-1)
rfecv.fit(X_train, y_train)

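# Plot number of features vs. cross-validation scores
# (grid_scores_ was removed in scikit-learn 1.2; newer versions expose cv_results_)
cv_scores = (rfecv.cv_results_["mean_test_score"] if hasattr(rfecv, "cv_results_")
             else rfecv.grid_scores_)
# RFECV starts scoring at min_features_to_select (4 here), not at 1
n_features_tried = range(4, 4 + len(cv_scores))
plt.figure(figsize=(10,5))
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (accuracy)")
plt.plot(n_features_tried, cv_scores)
plt.title("Optimal number of features : %d" % rfecv.n_features_)
plt.show()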

In [None]:
# create new training and test datasets with the selected features only

rfecv_selected_train = pd.DataFrame(rfecv.transform(X_train)
                                    ,columns=list(X_train.columns[rfecv.get_support()])
                                    ,index=X_train.index)

rfecv_selected_test = pd.DataFrame(rfecv.transform(X_test)
                                    ,columns=list(X_test.columns[rfecv.get_support()])
                                    ,index=X_test.index)

print("train:", rfecv_selected_train.shape)
print("test:", rfecv_selected_test.shape)

In [None]:
# look at the ranking of each feature after using RFECV
pd.Series(rfecv.ranking_, index=X_train.columns).sort_values(ascending=True).head(10)

### Important functions to save a model to pickle and load it later
Training of these models takes time. It is advisable to save the model as a pickle after you've trained it to your satisfaction so you can use it later for comparison without having to re-train it.

The code after the function definitions provides an __example__ of how to use them.

In [None]:
import joblib
# save the model to disk
def saveModel(filename, model):
    joblib.dump(model, filename)
 
 
# load the model from disk
def loadModel(filename):
    return joblib.load(filename)

In [None]:
# save the model to disk
saveModel('dt_model', dt_model)

In [None]:
# load the model from disk
svc_model = loadModel('svc_model')
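
The round trip through joblib can be checked with a small sketch like the one below. It trains a throwaway decision tree on synthetic data and writes to a temporary directory; the file name demo_model.joblib and all variable names are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Throwaway model on synthetic data (assumption: for illustration only)
X_demo, y_demo = make_classification(n_samples=200, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Save to a temporary location, then load it back
path = os.path.join(tempfile.mkdtemp(), "demo_model.joblib")  # hypothetical file name
joblib.dump(model, path)
restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = (restored.predict(X_demo) == model.predict(X_demo)).all()
print(same)
```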

### Decision Tree GridsearchCV
This is an example grid search with cross validation using a Decision Tree Classifier. Please note the following:
- It does not use the selected features from the RFE above.
- To use selected features: Replace X_train with rfecv_selected_train and Replace X_test with rfecv_selected_test
- You can adjust these parameters or add others
- The scoring method can be changed

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

parameters = {'criterion'   : ["gini", "entropy"],
              'max_depth'   : [3]
             }

print("Parameter grid:\n{}".format(parameters),'\n')

grid =  GridSearchCV(DecisionTreeClassifier(), parameters, cv=10, return_train_score=True, scoring='roc_auc', n_jobs=-1)

# perform grid search cv on training data.  The CV algorithm divides this into training and validation
dt_model = grid.fit(X_train, y_train)

print('best params ',dt_model.best_params_,'\n')
print('best estimator ',dt_model.best_estimator_,'\n')
print('best validation score ', dt_model.best_score_,'\n')
print('scoring method ', dt_model.scorer_)

# score() uses the grid search scoring method (roc_auc here), not accuracy
print("Test set ROC AUC score: {:.7f}".format(dt_model.score(X_test, y_test)))

saveModel('dt_model', dt_model)
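
Since the grid searches in this notebook set return_train_score=True, the fitted object's cv_results_ holds both training and validation scores. A possible way to inspect them, sketched on synthetic data with hypothetical names (demo_grid, results), is to load them into a DataFrame and look at the gap between the two.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the LendingClub data)
X_demo, y_demo = make_classification(n_samples=300, random_state=0)

demo_grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                         {"max_depth": [2, 10]}, cv=3,
                         scoring="roc_auc", return_train_score=True)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination; a large gap suggests over-fitting
results = pd.DataFrame(demo_grid.cv_results_)[
    ["param_max_depth", "mean_train_score", "mean_test_score"]]
results["gap"] = results["mean_train_score"] - results["mean_test_score"]
print(results)
```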

### Logistic Regression GridsearchCV
This is an example grid search with cross validation using logistic regression. Please note the following:
- It does not use the selected features from the RFE above.
- To use selected features: Replace X_train with rfecv_selected_train and Replace X_test with rfecv_selected_test
- You can adjust these parameters or add others
- The scoring method can be changed

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

''' These are just example parameter settings. You can change these parameters or add others.
    The grid search uses a scoring method of roc_auc. You can change that to another scoring method.
'''

parameters = {'penalty': ["l1"],
              'C'      : [0.1],
              'solver' : ['liblinear']
             }

print("Parameter grid:\n{}".format(parameters),'\n')

grid =  GridSearchCV(LogisticRegression(), parameters, cv=10, return_train_score=True, scoring='roc_auc', n_jobs=-1)

# perform grid search cv on training data.  The CV algorithm divides this into training and validation
lr_model = grid.fit(X_train, y_train)

print('best params ',lr_model.best_params_,'\n')
print('best estimator ',lr_model.best_estimator_,'\n')
print('best validation score ', lr_model.best_score_,'\n')
print('scoring method ', lr_model.scorer_)

# score() uses the grid search scoring method (roc_auc here), not accuracy
print("Test set ROC AUC score: {:.7f}".format(lr_model.score(X_test, y_test)))

saveModel('lr_model', lr_model)

### SVM GridsearchCV
This is an example grid search with cross validation using support vector machine classifier. Please note the following:
- It does not use the selected features from the RFE above.
- To use selected features: Replace X_train with rfecv_selected_train and Replace X_test with rfecv_selected_test
- You can adjust these parameters or add others
- The scoring method can be changed

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

''' These are just example parameter settings. You can change these parameters or add others.
    The grid search uses a scoring method of roc_auc. You can change that to another scoring method.
'''

parameters = {'C': [0.1,1.0],
              'max_iter': [100]
             }

print("Parameter grid:\n{}".format(parameters),'\n')

grid =  GridSearchCV(LinearSVC(), parameters, cv=10, return_train_score=True, scoring='roc_auc', n_jobs=-1)

# perform grid search cv on training data.  The CV algorithm divides this into training and validation
svc_model = grid.fit(X_train, y_train)

print('best params ',svc_model.best_params_,'\n')
print('best estimator ',svc_model.best_estimator_,'\n')
print('best validation score ', svc_model.best_score_,'\n')
print('scoring method ', svc_model.scorer_)

# score() uses the grid search scoring method (roc_auc here), not accuracy
print("Test set ROC AUC score: {:.7f}".format(svc_model.score(X_test, y_test)))

saveModel('svc_model', svc_model)

### Random Forest GridsearchCV
This is an example grid search with cross validation using random forest classifier. Please note the following:
- It does not use the selected features from the RFE above.
- To use selected features: Replace X_train with rfecv_selected_train and Replace X_test with rfecv_selected_test
- You can adjust these parameters or add others
- The scoring method can be changed

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

''' These are just example parameter settings. You can change these parameters or add others.
    The grid search uses a scoring method of roc_auc. You can change that to another scoring method.
'''

parameters = {'criterion' : ["gini", "entropy"],
              'n_estimators': [50]
             }

print("Parameter grid:\n{}".format(parameters),'\n')

grid =  GridSearchCV(RandomForestClassifier(), parameters, cv=10, return_train_score=True, scoring='roc_auc', n_jobs=-1)

# perform grid search cv on training data.  The CV algorithm divides this into training and validation
rf_model = grid.fit(X_train, y_train)

print('best params ',rf_model.best_params_,'\n')
print('best estimator ',rf_model.best_estimator_,'\n')
print('best validation score ', rf_model.best_score_,'\n')
print('scoring method ', rf_model.scorer_)

# score() uses the grid search scoring method (roc_auc here), not accuracy
print("Test set ROC AUC score: {:.7f}".format(rf_model.score(X_test, y_test)))

saveModel('rf_model', rf_model)

### GradientBoosting GridsearchCV
This is an example grid search with cross validation using gradient boosting classifier. Please note the following:
- It does not use the selected features from the RFE above.
- To use selected features: Replace X_train with rfecv_selected_train and Replace X_test with rfecv_selected_test
- You can adjust these parameters or add others
- The scoring method can be changed

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

''' These are just example parameter settings. You can change these parameters or add others.
    The grid search uses a scoring method of roc_auc. You can change that to another scoring method.
'''

# note: scikit-learn 1.1 renamed the "deviance" loss to "log_loss" and 1.3 removed the old name
parameters = {"loss" : ["deviance", "exponential"],
              "learning_rate" :   [.05, .2],
              "n_estimators": [50,100,500]
             }

print("Parameter grid:\n{}".format(parameters),'\n')

grid =  GridSearchCV(GradientBoostingClassifier(), parameters, cv=5, return_train_score=True, scoring='roc_auc', n_jobs=-1)

# perform grid search cv on training data.  The CV algorithm divides this into training and validation
gbc_model = grid.fit(X_train, y_train)

print('best params ',gbc_model.best_params_,'\n')
print('best estimator ',gbc_model.best_estimator_,'\n')
print('best validation score ', gbc_model.best_score_,'\n')
print('scoring method ', gbc_model.scorer_)

# score() uses the grid search scoring method (roc_auc here), not accuracy
print("Test set ROC AUC score: {:.7f}".format(gbc_model.score(X_test, y_test)))

saveModel('gbc_model', gbc_model)

### Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, fbeta_score, classification_report

'''Function to print model accuracy information'''

def printAccuracyInfo(model, X_test, y_test):
    print(y_test.value_counts())
    # Make predictions against the test set
    pred = model.predict(X_test)

    # Show the confusion matrix
    print("confusion matrix:")
    print(confusion_matrix(y_test, pred))

    # Find the accuracy scores of the predictions against the true classes
    print("accuracy: %0.3f" % accuracy_score(y_test, pred))
    print("recall: %0.3f" % recall_score(y_test, pred, pos_label=True))
    print("precision: %0.3f" % precision_score(y_test, pred, pos_label=True))
    print("f-measure: %0.3f" % fbeta_score(y_test, pred, beta=1, pos_label=True))
    print(classification_report(y_test,pred))

In [None]:
# example use of printAccuracyInfo using gbc_model and X_test
# note: if you trained model using rfecv_selected_train, you need to call the function with rfecv_selected_test for X_test
printAccuracyInfo(gbc_model, X_test, y_test)
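
To see how the precision and recall printed above follow from the confusion matrix, here is a small worked example. The labels in y_true and y_pred are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up outcomes: True means default (assumption: illustrative values only)
y_true = np.array([False, False, False, False, True, True, True, False])
y_pred = np.array([False, True, False, False, True, False, True, False])

# For binary labels the matrix unravels as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted defaults, how many were real
recall = tp / (tp + fn)     # of the real defaults, how many were caught

print(tn, fp, fn, tp)
print(round(precision, 3), round(recall, 3))
```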

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import plot_confusion_matrix

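'''Function to print confusion matrix for a model
   plot_confusion_matrix requires scikit-learn 0.22 or later; you may need to run this to update:
         !pip install -U scikit-learn --user
   Note: plot_confusion_matrix was removed in scikit-learn 1.2
   (ConfusionMatrixDisplay.from_estimator replaces it there).
'''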

def plotConfusionMatrix (negative_label, positive_label, model, X_test, y_test):
    titles_options = [("Confusion matrix, without normalization", None,'d'),
                      ("Normalized confusion matrix", 'true','.3g')]
    for title, normalize,val_frmt in titles_options:
        disp = plot_confusion_matrix(model, X_test, y_test,
                                     display_labels=[negative_label,positive_label],
                                     cmap=plt.cm.Blues,
                                     values_format=val_frmt,
                                     normalize=normalize)
        disp.ax_.set_title(title)
        disp.ax_.set_xlabel('Predicted')
        disp.ax_.set_ylabel('Actual')

        print(title)
        print(disp.confusion_matrix)

    plt.show()

In [None]:
# example use of plotConfusionMatrix using gbc_model and X_test
# note: if you trained model using rfecv_selected_train, you need to call the function with rfecv_selected_test for X_test
plotConfusionMatrix('No Default', 'Default', gbc_model, X_test, y_test)

### PIPELINE SECTION

#### VERY IMPORTANT STEP
You need to define which features to use in the modeling. The homework will direct you either to use all the features from the data ingestion and cleaning process or to remove some features because they are defined by LendingClub.

In [None]:
# define the discrete features you want to use in modeling.
# if you want to use all the discrete features, just set discrete_features_touse = discrete_features
discrete_features_touse =['purpose', 'term', 'verification_status', 'emp_length', 'home_ownership']

# define the continuous features to use in modeling
# if you want to use all the continuous features, just set the continuous_features_touse = continuous_features
continuous_features_touse = ['loan_amnt', 'funded_amnt','installment','annual_inc','dti','revol_bal','delinq_2yrs','open_acc',
 'pub_rec','fico_range_high','fico_range_low','revol_util','cr_hist']

In [None]:
from sklearn.model_selection import train_test_split

X_continuous = data[continuous_features_touse]

X_discrete = pd.get_dummies(data[discrete_features_touse], dummy_na = True, prefix_sep = ":", drop_first = False)

X = pd.concat([X_continuous, X_discrete], axis = 1)

target_col = 'default'
y=data[target_col]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=.4)

print("Population:\n",y.value_counts())
print("Train:\n", y_train.value_counts())
print("Test:\n", y_test.value_counts())

### Pipeline Using Select Percentile and Random Forest Classifier
The result of this code is a random forest classifier (rf_model) with the best percentile features selected.

The rf_model can be used to make predictions with rf_model.predict(X_test) and can be passed to the accuracy visualization functions in this notebook.

This pipeline can be adapted and used to create another classifier that utilizes SelectPercentile feature selection.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile

pipe = make_pipeline(
    SelectPercentile(),
    RandomForestClassifier())
pipe.steps

In [None]:
from sklearn.model_selection import GridSearchCV

''' These are just example parameter settings. You can change these parameters or add others.
    The grid search uses a scoring method of roc_auc. You can change that to another scoring method.
'''

params = {'selectpercentile__percentile': [10, 15, 20, 50],
          'randomforestclassifier__max_depth': [5, 7, 9],
          'randomforestclassifier__criterion': ['entropy', 'gini']}

grid = GridSearchCV(pipe, param_grid=params, cv=10, scoring='roc_auc', n_jobs=-2)
grid.fit(X_train, y_train)

# scoring is roc_auc, so both scores below are ROC AUC values, not accuracy
print("best cross-validation ROC AUC:", grid.best_score_)
print("test set ROC AUC: ", grid.score(X_test, y_test))
print("best parameters: ", grid.best_params_)


In [None]:
rf_model=grid.best_estimator_
rf_model.steps

saveModel('rf_model', rf_model)
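
Once a pipeline like this is fitted, the names of the features that SelectPercentile kept can be recovered from its support mask. The following is a rough sketch on synthetic data; the column names f0 to f9, demo_pipe, and the fixed percentile of 30 are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f%d" % i for i in range(10)])

demo_pipe = make_pipeline(SelectPercentile(percentile=30),
                          RandomForestClassifier(n_estimators=20, random_state=0))
demo_pipe.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
mask = demo_pipe.named_steps["selectpercentile"].get_support()
kept = list(X_demo.columns[mask])
print(kept)
```

With the notebook's own objects, the same lookup would presumably be applied to rf_model and X_train.columns.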

### Pipeline Using Random Forest RFECV and SVM Classifier with StandardScaler
The result of this code is an SVM classifier (svc_model) with the best scaling and features selected using random forest.

The svc_model can be used to make predictions with svc_model.predict(X_test) and can be passed to the accuracy visualization functions in this notebook.

This pipeline can be adapted and used to create another classifier that needs scaling and utilizes RFECV feature selection.

You can also replace the StandardScaler with another scaler such as MinMaxScaler. To do that, replace StandardScaler() with MinMaxScaler() in the pipeline and rename the matching grid search parameters (the standardscaler prefix and its options) accordingly.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
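
A rough sketch of that swap is shown below. It assumes a toy two-column dataset and leaves out the RFECV step to keep the run short; the point is only that the grid search key changes from the standardscaler prefix to minmaxscaler, and that the tuned option (feature_range here) must be one MinMaxScaler actually has.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Toy data with two continuous columns (assumption: illustrative values only)
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({"loan_amnt": rng.uniform(1000, 35000, 200),
                       "dti": rng.uniform(0, 40, 200)})
y_demo = (X_demo["dti"] + rng.normal(0, 5, 200)) > 20

demo_pipe = make_pipeline(
    make_column_transformer((MinMaxScaler(), ["loan_amnt", "dti"])),
    LinearSVC(max_iter=5000))

# Parameter keys follow <step>__<transformer>__<option>
demo_params = {"columntransformer__minmaxscaler__feature_range": [(0, 1), (-1, 1)],
               "linearsvc__C": [0.1, 1.0]}

demo_grid = GridSearchCV(demo_pipe, demo_params, cv=3, scoring="roc_auc")
demo_grid.fit(X_demo, y_demo)
print(demo_grid.best_params_)
```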

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.svm import LinearSVC

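# remainder='passthrough' keeps the dummy-coded discrete columns (unscaled);
# the default remainder='drop' would silently discard them
preprocess = make_column_transformer(
            (StandardScaler(), continuous_features_touse),
            remainder='passthrough')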

pipe = make_pipeline(
    preprocess,
    RFECV(estimator=RandomForestClassifier(random_state=1, n_estimators=100)),
    LinearSVC())

pipe.steps

In [None]:
from sklearn.model_selection import GridSearchCV

''' These are just example parameter settings. You can change these parameters or add others.
    The grid search uses a scoring method of roc_auc. You can change that to another scoring method.
'''

params = {'columntransformer__standardscaler__with_mean': [True, False],
          'rfecv__estimator__criterion': ['gini', 'entropy'],
          'linearsvc__C': [0.1,1.0],
          'linearsvc__max_iter': [100]}

grid = GridSearchCV(pipe, param_grid=params, cv=10, scoring='roc_auc', n_jobs=-1)
grid.fit(X_train, y_train)

# scoring is roc_auc, so both scores below are ROC AUC values, not accuracy
print("best cross-validation ROC AUC:", grid.best_score_)
print("test set ROC AUC: ", grid.score(X_test, y_test))
print("best parameters: ", grid.best_params_)

In [None]:
svc_model=grid.best_estimator_
svc_model.steps

saveModel('svc_model', svc_model)

### VISUALIZE PERFORMANCE SECTION

### ROC Curve

#### Function to create ROC curve for one model
__NOTE:__ This will not work with LinearSVC models, as they do not have a predict_proba function (SVC provides one only when created with probability=True).

In [None]:
def plot_roc_curve_1 (model, model_name, X_test, y_test):
    from sklearn.metrics import roc_curve
    from sklearn.metrics import roc_auc_score
    from matplotlib import pyplot

    # generate a no skill prediction (majority class)
    ns_probs = [0 for _ in range(len(y_test))]

    # predict probabilities
    cf_probs = model.predict_proba(X_test)

    # keep probabilities for the positive outcome only
    cf_probs = cf_probs[:, 1]

    # calculate scores
    ns_auc = roc_auc_score(y_test, ns_probs)
    cf_auc = roc_auc_score(y_test, cf_probs)

    # summarize scores
    print('No Skill: ROC AUC=%.3f' % (ns_auc))
    print(model_name,': ROC AUC=%.3f' % (cf_auc))

    # calculate roc curves
    ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
    cf_fpr, cf_tpr, _ = roc_curve(y_test, cf_probs)

    # plot the roc curve for the model
    pyplot.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
    pyplot.plot(cf_fpr, cf_tpr, marker='.', label=model_name)

    # axis labels
    pyplot.xlabel('False Positive Rate')
    pyplot.ylabel('True Positive Rate')

    # show the legend
    pyplot.legend()

    # show the plot
    pyplot.show()

In [None]:
## example ROC plot for random forest model using X_test.
## if you want to use selected features, pass rfecv_selected_test instead of X_test
plot_roc_curve_1(rf_model, 'Random Forest', X_test, y_test)
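
For models without predict_proba, such as LinearSVC, one possible workaround is to feed the output of decision_function to roc_curve and roc_auc_score, which accept any score that ranks the positive class higher. This is a sketch on synthetic data, with hypothetical names throughout, rather than a drop-in replacement for the function above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption: not the LendingClub data)
X_demo, y_demo = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

svm = LinearSVC(max_iter=5000).fit(X_tr, y_tr)

# Signed distance to the separating hyperplane; larger means more likely positive
svm_scores = svm.decision_function(X_te)

fpr, tpr, _ = roc_curve(y_te, svm_scores)
svm_auc = roc_auc_score(y_te, svm_scores)
print("LinearSVC ROC AUC: %.3f" % svm_auc)
```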

#### Functions to plot multiple ROC curves for multiple models
__NOTE:__ This will not work with LinearSVC models, as they do not have a predict_proba function (SVC provides one only when created with probability=True).

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot

def plot_roc_curve(model, model_name, X_test, y_test):
    '''
    Plot ROC curve.
    
    INPUTS:
    - model object
    - model name
    - X_test
    - y_test
    ''' 
    # predict probabilities
    cf_probs = model.predict_proba(X_test)

    # keep probabilities for the positive outcome only
    cf_probs = cf_probs[:, 1]

    # calculate scores
    cf_auc = round(roc_auc_score(y_test, cf_probs),3)

    # calculate roc curve
    cf_fpr, cf_tpr, _ = roc_curve(y_test, cf_probs)

    # plot the roc curve for the model
    pyplot.plot(cf_fpr, cf_tpr, marker='.', label= '{}, ROC AUC {}'.format(model_name, cf_auc))
    
def plot_roc_curves(classifiers, X_test, y_test):
        
    # Plot roc curve for each classifier
    plt.figure(figsize=(10, 10))
    for name, classifier in classifiers.items():
        plot_roc_curve(classifier, name, X_test, y_test)
            
    # generate a no skill prediction (majority class)
    ns_probs = [0 for _ in range(len(y_test))]
    ns_auc = round(roc_auc_score(y_test, ns_probs),3)
    ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
    
    # plot the roc curve for no skill model
    pyplot.plot(ns_fpr, ns_tpr, linestyle='--', label= '{}, ROC AUC {}'.format('No Skill', ns_auc))

    # axis labels
    pyplot.xlabel('False Positive Rate')
    pyplot.ylabel('True Positive Rate')

    # show the legend
    pyplot.legend()

    # show the plot
    pyplot.show()

In [None]:
# First, create a dictionary of classifiers to plot
classifiers = {
    "RF"  : rf_model, 
    "GBC" : gbc_model,
    "DT"  : dt_model,
    "LR"  : lr_model
}

## example ROC plot for a dictionary of models using X_test.
## if you want to use selected features, pass rfecv_selected_test instead of X_test
plot_roc_curves(classifiers, X_test, y_test)

#### Function to plot ROC curve for a single model using Sklearn new functionality

In [None]:
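# Imported under an alias so it does not overwrite the plot_roc_curve helper defined above.
# Removed in scikit-learn 1.2 (RocCurveDisplay.from_estimator replaces it there).
from sklearn.metrics import plot_roc_curve as sk_plot_roc_curve
import matplotlib.pyplot as plt

def plotNewROCCurve(model, X_test, y_test):
    disp = sk_plot_roc_curve(model, X_test, y_test)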
    disp.figure_.suptitle("ROC Curve")
    disp.ax_.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
            label='Chance', alpha=.8)
    plt.show()

In [None]:
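## example new ROC plot for SVM model using X_test.
## if you want to use selected features, pass rfecv_selected_test instead of X_test
## svc_model here is the GridSearchCV object from the SVM grid search; if svc_model is
## the pipeline from the PIPELINE SECTION, pass svc_model itself (it has no best_estimator_)
plotNewROCCurve(svc_model.best_estimator_, X_test, y_test)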

### Precision Recall Curve

#### Function to plot Precision Recall Curve for a single model using Sklearn new functionality

In [None]:
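from sklearn.metrics import precision_recall_curve
# Imported under an alias so it is not shadowed by the plot_precision_recall_curve
# helper defined later in this notebook. Removed in scikit-learn 1.2
# (PrecisionRecallDisplay.from_estimator replaces it there).
from sklearn.metrics import plot_precision_recall_curve as sk_plot_precision_recall_curve
import matplotlib.pyplot as plt

def plotNewPrecisionRecall(model, X_test, y_test):
    disp = sk_plot_precision_recall_curve(model, X_test, y_test)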

In [None]:
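## example new Precision Recall plot for SVC model using X_test.
## if you want to use selected features, pass rfecv_selected_test instead of X_test
## svc_model here is the GridSearchCV object from the SVM grid search; if svc_model is
## the pipeline from the PIPELINE SECTION, pass svc_model itself (it has no best_estimator_)
plotNewPrecisionRecall(svc_model.best_estimator_, X_test, y_test)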

#### Functions to plot multiple Precision Recall curves for multiple models
__NOTE:__ This will not work with LinearSVC models, as they do not have a predict_proba function (SVC provides one only when created with probability=True).

In [None]:
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
from matplotlib import pyplot

def plot_precision_recall_curve(model, model_name, X_test, y_test):
    '''
    Plot Precision Recall curve.
    
    INPUTS:
    - model object
    - model name
    - X_test
    - y_test
    ''' 
    # predict probabilities
    cf_probs = model.predict_proba(X_test)

    # keep probabilities for the positive outcome only
    cf_probs = cf_probs[:, 1]

    # predict class values (used for the F1 score)
    yhat = model.predict(X_test)
    cf_precision, cf_recall, _ = precision_recall_curve(y_test,cf_probs)
    cf_f1, cf_auc = f1_score(y_test, yhat), auc(cf_recall, cf_precision)
    cf_f1 = round(cf_f1,3)
    cf_auc = round(cf_auc,3)

    # plot the precision-recall curve for the model
    pyplot.plot(cf_recall, cf_precision, marker='.', label = '{}, f1 {}, auc {}'.format(model_name, cf_f1, cf_auc))
    
def plot_precision_recall_curves(classifiers, X_test, y_test):        
    # Plot precision-recall curve for each classifier
    plt.figure(figsize=(10, 10))
    no_skill = len(y_test[y_test==1]) / len(y_test)
    pyplot.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
    
    for name, classifier in classifiers.items():
        plot_precision_recall_curve(classifier, name, X_test, y_test)

    # axis labels
    pyplot.xlabel('Recall')
    pyplot.ylabel('Precision')

    # show the legend
    pyplot.legend()

    # show the plot
    pyplot.show()

#### Rank Correlation Using Spearman's Correlation

Use the following function to find the rank correlation between the LendingClub grades and the probability of default predicted by your model.

You can call the function with those values using your test dataset.

In [None]:
from scipy.stats import spearmanr

def getCorrelation (grades, scores):
    coef, p = spearmanr(grades, scores)
    print('Spearmans correlation coefficient: %.3f' % coef)

    alpha = 0.05
    if p > alpha:
        print('Samples are uncorrelated (fail to reject H0) p=%.3f' % p)
    else:
        print('Samples are correlated (reject H0) p=%.3f' % p)
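
The sketch below shows one way the inputs for this comparison might be prepared. Letter grades are ordinal, so they are mapped to integers before ranking. The column names grade and p_default and all the values are made up for illustration; with a real model the scores would presumably come from something like predict_proba on the test set.

```python
import pandas as pd
from scipy.stats import spearmanr

# Made-up grades and predicted default probabilities (assumption: illustrative only)
demo = pd.DataFrame({
    "grade": list("ABCDEFGABC"),
    "p_default": [0.03, 0.08, 0.12, 0.20, 0.25, 0.31, 0.40, 0.05, 0.06, 0.15],
})

# Map A..G to 1..7 so that a worse grade gets a larger number
grade_rank = demo["grade"].map({g: i for i, g in enumerate("ABCDEFG", start=1)})

# A coefficient near +1 means the model orders loans much like the grades do
coef, p = spearmanr(grade_rank, demo["p_default"])
print("coef=%.3f p=%.4f" % (coef, p))
```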