In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import sklearn

In [2]:
inv_data = pd.read_csv("investor_data_2.csv")
inv_data.head(3)

Unnamed: 0,investor,commit,deal_size,invite,rating,int_rate,covenants,total_fees,fee_share,prior_tier,invite_tier,tier_change,fee_percent,invite_percent
0,Goldman Sachs,Commit,300,40,2,Market,2,30,0.0,Participant,Bookrunner,Promoted,0.0,0.133333
1,Deutsche Bank,Decline,1200,140,2,Market,2,115,20.1,Bookrunner,Participant,Demoted,0.174783,0.116667
2,Bank of America,Commit,900,130,3,Market,2,98,24.4,Bookrunner,Bookrunner,,0.24898,0.144444


In [15]:
inv_data = inv_data.drop(["invite_tier", "fee_share", "invite"], axis=1)

In [17]:
processed_inv_data = pd.get_dummies(inv_data)
processed_inv_data.shape

(7233, 21)

In [18]:
processed_inv_data = processed_inv_data.drop("commit_Commit", axis=1)

In [20]:
target = processed_inv_data.commit_Decline
inputs = processed_inv_data.drop("commit_Decline", axis=1)

Processing and splitting of data into target and inputs is complete. The next step is to split the data into a training and test set and build the model pipeline.

In [21]:
from sklearn.model_selection import train_test_split
# Hold out 20% for testing; stratify on the target so both splits keep the same Commit/Decline ratio
split_list = train_test_split(inputs, target, test_size=0.2, random_state=1, stratify=target)

In [25]:
# train_test_split returns [inputs_train, inputs_test, target_train, target_test]
input_train, input_test, target_train, target_test = split_list
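The effect of `stratify` in the split above can be checked on a small example. This is a minimal sketch on synthetic data (the `X_demo` and `y_demo` arrays are hypothetical stand-ins, not the investor data), showing that both splits keep the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 20% positive class.
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 200 + [0] * 800)

# The four returned arrays can be unpacked directly.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# With stratify, the positive rate in each split should match the full data.
train_rate = y_train.mean()
test_rate = y_test.mean()
print(len(y_train), len(y_test), train_rate, test_rate)
```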

In [29]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [31]:
pipelines = {
    'l1' : make_pipeline(StandardScaler(), LogisticRegression(penalty='l1', random_state=1, solver='liblinear')),
    'l2' : make_pipeline(StandardScaler(), LogisticRegression(penalty='l2', random_state=1, solver='liblinear')),
    'rf' : make_pipeline(StandardScaler(), RandomForestClassifier(random_state=1)),
    'gb' : make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=1))
}

In [33]:
l1_hyperparameters = {
    'logisticregression__C' : [0.1, 1, 10]
}
l2_hyperparameters = {
    'logisticregression__C' : [0.1, 1, 10]
}
rf_hyperparameters = {
    'randomforestclassifier__n_estimators' : [100, 200], 
    # 'sqrt' matches the old 'auto' setting for classifiers; 'auto' was removed in scikit-learn 1.3
    'randomforestclassifier__max_features' : ['sqrt', 0.3, 0.6]
}
gb_hyperparameters = {
    'gradientboostingclassifier__n_estimators' : [100, 200], 
    'gradientboostingclassifier__learning_rate' : [0.05, 0.1, 0.2],
    'gradientboostingclassifier__max_depth' : [1,3,5]
}

hyperparameters = {
    'l1' : l1_hyperparameters,
    'l2' : l2_hyperparameters,
    'rf' : rf_hyperparameters,
    'gb' : gb_hyperparameters,
}
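The grid keys above follow the `<step name>__<parameter>` convention, where `make_pipeline` derives each step name from the lowercased class name. A short sketch to confirm the names before running a grid search (`demo_pipeline` is an illustrative name, not part of the notebook):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

demo_pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in demo_pipeline.steps]
print(step_names)

# A valid grid key must appear among the pipeline's parameters.
has_c_param = 'logisticregression__C' in demo_pipeline.get_params()
print(has_c_param)
```

A misspelled key is rejected when the search is fitted, so checking `get_params()` first is a cheap way to catch typos.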

Create the untrained grid-search models, then train and cross-validate them.

In [34]:
from sklearn.model_selection import GridSearchCV

In [38]:
models = {}

for key in pipelines:
    # Wrap each pipeline in a 5-fold cross-validated grid search over its hyperparameters
    models[key] = GridSearchCV(pipelines[key], hyperparameters[key], cv=5)


In [40]:
for model in models:
    models[model].fit(input_train, target_train)
    print(model, " is trained and tuned")

l1  is trained and tuned
l2  is trained and tuned
rf  is trained and tuned
gb  is trained and tuned
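Once a `GridSearchCV` object is fitted, the winning hyperparameters and the mean cross-validated score are available as attributes. A minimal sketch on synthetic data (`X_demo`, `y_demo`, and `demo_search` are assumptions for illustration; the same attributes exist on each entry of `models`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the investor data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

demo_search = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear')),
    {'logisticregression__C': [0.1, 1, 10]},
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# Best hyperparameter combination and its mean cross-validated accuracy.
best_params = demo_search.best_params_
best_score = demo_search.best_score_
print(best_params, round(best_score, 4))
```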


AUROC - for classification models, the area under the ROC curve is often a more useful metric than accuracy. With imbalanced classes, such as credit card fraud data where 99% of transactions are valid, a model can predict "valid" every time and reach about 99% accuracy while misclassifying every fraudulent transaction.
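The imbalance problem described above can be reproduced on synthetic labels. A minimal sketch, assuming a hypothetical sample of 1,000 transactions with 1% fraud and a model that always predicts "valid", scored with classification accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 valid transactions (0) and 10 fraudulent ones (1).
y_true = np.array([0] * 990 + [1] * 10)

# A degenerate model that predicts "valid" for every transaction.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)    # share of correct predictions
fraud_recall = recall_score(y_true, y_pred)  # share of fraud cases caught
print(accuracy, fraud_recall)
```

The score looks excellent while no fraud is detected, which is why a threshold-independent metric such as AUROC is preferred here.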

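The ROC curve is built from confusion-matrix counts (True Negative, False Negative, False Positive, True Positive): each probability threshold gives one confusion matrix, and hence one (false positive rate, true positive rate) point on the curve.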

AUROC is also written AUC ROC: the area under the receiver operating characteristic curve.

In [42]:
from sklearn.metrics import confusion_matrix

pred = models['l1'].predict(input_test)
print(confusion_matrix(target_test, pred))

[[1124   22]
 [  23  278]]


In the matrix above, rows are actual classes and columns are predicted classes, with Commit (0) first and Decline (1) second. The model correctly predicted 1124 of the 1146 actual commits (1124 + 22) and 278 of the 301 actual declines (23 + 278).
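The counts reported above can be turned into rates directly. A small sketch that re-enters the printed matrix by hand (rather than recomputing it from the model) and derives specificity and sensitivity, with Decline as the positive class:

```python
import numpy as np

# Confusion matrix as printed above: rows = actual, columns = predicted.
cm = np.array([[1124, 22],
               [23, 278]])

# For a binary problem, ravel() yields the counts in this order.
tn, fp, fn, tp = cm.ravel()

specificity = tn / (tn + fp)  # true negative rate: commits correctly predicted
sensitivity = tp / (tp + fn)  # true positive rate: declines correctly predicted
print(round(specificity, 4), round(sensitivity, 4))
```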

The choice of probability threshold trades off the true positive rate against the false positive rate (predicting Decline is the positive class). As the threshold increases, fewer cases are predicted positive, so the true negative rate increases: the model identifies more of the actual negative class. However, the false negative rate also increases, because more actual positives are now predicted as negative.

In the ideal situation of perfect separability (i.e. the predicted-score distributions of the negative and positive classes do not overlap), the AUC is 1. A value of 0.5 is equivalent to a coin flip.
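The two reference values can be checked with `roc_auc_score`. A minimal sketch on hand-made scores (all arrays are hypothetical): perfectly separated scores give an AUC of 1, and scores that carry no information (here, a constant) give 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])

# Every positive scores higher than every negative: perfect separability.
separated_scores = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])

# The same score for every case: no ability to rank positives above negatives.
constant_scores = np.full(6, 0.5)

auc_separated = roc_auc_score(y_true, separated_scores)
auc_constant = roc_auc_score(y_true, constant_scores)
print(auc_separated, auc_constant)
```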

The probability threshold depends on the goals of your model (whether specificity or sensitivity is more important).
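The threshold trade-off can be illustrated by sweeping a cut-off over predicted probabilities. This is a sketch on hypothetical scores (the `y_true` and `scores` arrays and the `rates` helper are made up for illustration); raising the threshold raises both the true negative rate and the false negative rate.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class (Decline).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.55, 0.70, 0.40, 0.60, 0.80, 0.95])

def rates(threshold):
    """Return (true negative rate, false negative rate) at a given threshold."""
    pred = (scores >= threshold).astype(int)
    tn = np.sum((pred == 0) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    tnr = tn / np.sum(y_true == 0)
    fnr = fn / np.sum(y_true == 1)
    return tnr, fnr

for t in (0.3, 0.5, 0.75):
    tnr, fnr = rates(t)
    print(f"threshold={t}: TNR={tnr:.2f}, FNR={fnr:.2f}")
```

A higher threshold favours specificity; a lower one favours sensitivity.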

In [49]:
from sklearn.metrics import roc_curve, auc
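# pred holds hard 0/1 labels (threshold 0.5), so this ROC curve has a single operating point
# and its area equals (TPR + TNR) / 2; pass predict_proba scores to trace the full curve
fpr, tpr, thresholds = roc_curve(target_test, pred)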
print(" l1 \n","AUROC = ", round(auc(fpr,tpr), 4))

 l1 
 AUROC =  0.9522


In [53]:
for model in models:
    # predict() returns hard 0/1 labels, so each AUROC below rests on a single operating point
    pred = models[model].predict(input_test)
    fpr, tpr, thresholds = roc_curve(target_test, pred)
    print(model, "\t AUROC = ", round(auc(fpr,tpr), 4))

l1 	 AUROC =  0.9522
l2 	 AUROC =  0.9518
rf 	 AUROC =  0.9616
gb 	 AUROC =  0.9683


The gradient boosting classifier has the highest AUROC on the test set (0.9683), followed by the random forest (0.9616).
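The AUROC values above are computed from hard class labels, which collapses the ROC curve to one operating point. A common alternative is to score the predicted probabilities instead. The sketch below is not the notebook's method: it uses synthetic data and a plain logistic regression (`X_demo`, `y_demo`, and `clf` are assumptions) to show both calculations side by side. On the fitted grid searches, the equivalent call would be `models[model].predict_proba(input_test)[:, 1]`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the investor data.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)        # 0/1 labels at the 0.5 threshold
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# AUROC from hard labels reduces to balanced accuracy, (TPR + TNR) / 2.
auc_from_labels = roc_auc_score(y_te, hard_labels)
bal_acc = balanced_accuracy_score(y_te, hard_labels)

# AUROC from probabilities uses every threshold and measures ranking quality.
auc_from_scores = roc_auc_score(y_te, proba)
print(round(auc_from_labels, 4), round(bal_acc, 4), round(auc_from_scores, 4))
```

The two numbers answer different questions, so a model ranking based on probability scores may differ from the one shown above.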