<h1>Modeling the combined DataFrame</h1>

<h2>Activity</h2> <br />
Split the data into train and test sets, then train LogisticRegression, RandomForestClassifier, DecisionTreeClassifier and GradientBoostingClassifier models on the train set. <br />
Compare their accuracy, recall and precision, tune the hyperparameters of the best classifier, and use it on the test set.

In [5]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.tree import DecisionTreeClassifier

%matplotlib inline

In [7]:
combined_df = pd.read_csv('normalized/combined_df.csv')
combined_df.head()

Unnamed: 0,offer_label,person,time,total_amount,successful,duration,difficulty,web,mobile,email,social,bogo,discount,informational,age,income,member_year,gender_label
0,4.0,78afa995795e4d85b5d9ceeca43f5fef,0,37.67,1,168,0.25,1,1,1,0,1,0,0,75,100000.0,2017,1
1,8.0,78afa995795e4d85b5d9ceeca43f5fef,168,49.39,0,72,0.0,0,1,1,1,0,0,1,75,100000.0,2017,1
2,1.0,78afa995795e4d85b5d9ceeca43f5fef,408,48.28,1,168,0.5,0,1,1,1,1,0,0,75,100000.0,2017,1
3,9.0,78afa995795e4d85b5d9ceeca43f5fef,504,48.28,1,120,0.25,1,1,1,1,1,0,0,75,100000.0,2017,1
4,10.0,e2127556f4f64592b11af22de27a7932,0,0.0,0,168,0.5,1,1,1,0,0,1,0,68,70000.0,2018,2


In [20]:
#The person ID can be removed: demographic information is already in the dataset and
#the 'person' column adds no value when training the model
combined_df = combined_df.drop(columns = ['person'])
combined_df.head()

Unnamed: 0,offer_label,time,total_amount,successful,duration,difficulty,web,mobile,email,social,bogo,discount,informational,age,income,member_year,gender_label
0,4.0,0,37.67,1,168,0.25,1,1,1,0,1,0,0,75,100000.0,2017,1
1,8.0,168,49.39,0,72,0.0,0,1,1,1,0,0,1,75,100000.0,2017,1
2,1.0,408,48.28,1,168,0.5,0,1,1,1,1,0,0,75,100000.0,2017,1
3,9.0,504,48.28,1,120,0.25,1,1,1,1,1,0,0,75,100000.0,2017,1
4,10.0,0,0.0,0,168,0.5,1,1,1,0,0,1,0,68,70000.0,2018,2


<h2>Split into train and test sets</h2>

Allocating roughly 2/3 of the dataset to train and 1/3 to test (`test_size = 0.33`)

In [27]:
#features 
data_X = combined_df.drop(columns = ['successful'])

#target to be predicted
data_y = np.array(combined_df['successful'])

#split into train and test sets
train_X, test_X, train_y, test_y = train_test_split(data_X.values, data_y, test_size = 0.33, random_state = 1)


**Also create a validation set**

Allocating 10% of the test set to validation

In [64]:
test_X, val_X, test_y, val_y = train_test_split(test_X, test_y, test_size = 0.1, random_state = 1)
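The sizes produced by the two splits can be checked with a little arithmetic. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up to the next whole sample, and it takes the held-out row counts from the `support` columns of the reports further down (2195 validation rows, 19751 test rows). `split_sizes` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, held_out_fraction):
    """Return (kept, held_out) sizes, rounding the held-out part up (assumed rounding rule)."""
    n_held_out = math.ceil(held_out_fraction * n_samples)
    return n_samples - n_held_out, n_held_out

# Rows held out by the first split = test rows + validation rows reported later
first_split_held_out = 19751 + 2195

# Second split: 10% of those rows become the validation set
n_test, n_val = split_sizes(first_split_held_out, 0.1)
print(n_test, n_val)

# Approximate share of the full dataset that ends up in each part
train_share = 1 - 0.33
test_share = 0.33 * 0.9
val_share = 0.33 * 0.1
print(round(train_share, 3), round(test_share, 3), round(val_share, 3))
```

Under these assumptions the computed sizes agree with the supports printed in the reports, and the validation set amounts to only about 3% of the full dataset.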

<h3>Defining Evaluation metrics</h3>

In [65]:
def evaluation_metrics(cls_name, y_true, y_pred):
    '''
    Evaluation metrics for ML models
    Input: classifier name (string), true target labels (array), predicted target labels (array)
    Output: accuracy, classification_report (precision, recall, f1-score) and a one-row
            DataFrame holding the accuracy and F1 score, indexed by the classifier name
    '''
    
    accuracy = accuracy_score(y_true, y_pred)
    cls_report = classification_report(y_true, y_pred)
    score_f1 = f1_score(y_true, y_pred)
    
    
    return accuracy,cls_report, pd.DataFrame({'accuracy': accuracy, 'f1_score':score_f1},
                                            index = [cls_name])
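
To make the reported numbers easier to read, here is a minimal sketch of how accuracy, precision, recall and F1 follow from the counts of true and false positives and negatives. The labels are made up for illustration, and `binary_metrics` is a hypothetical helper; the notebook itself relies on the scikit-learn functions imported above.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1).

    Assumes at least one true positive, so no denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up labels, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

F1 is the harmonic mean of precision and recall, so it only stays high when both are high.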

<h2>Logistic Regression (Benchmark)</h2> <br />
This will be the benchmark for all other models

In [66]:
#Instantiate and fit model
lr_clf = LogisticRegression(solver='liblinear', random_state = 1)

lr_clf.fit(train_X, train_y)
pred_y = lr_clf.predict(val_X)

In [67]:
lr_accuracy, lr_cls_report, lr_stats_df = evaluation_metrics('Logistic Regression',val_y, pred_y)
print ('Logistic Regression accuracy: ', lr_accuracy)
print ('Logistic Regression classification report : \n',lr_cls_report)

Logistic Regression accuracy:  0.8747152619589977
Logistic Regression classification report : 
               precision    recall  f1-score   support

           0       0.87      0.91      0.89      1216
           1       0.88      0.84      0.86       979

    accuracy                           0.87      2195
   macro avg       0.87      0.87      0.87      2195
weighted avg       0.87      0.87      0.87      2195



<h2>Gradient Boosting</h2>

In [68]:
#Instantiate and fit model
gb_clf = GradientBoostingClassifier(random_state = 1)

gb_clf.fit(train_X, train_y)
pred_y = gb_clf.predict(val_X)

In [69]:
gb_accuracy, gb_cls_report, gb_stats_df = evaluation_metrics('GradientBoosting',val_y, pred_y)
print ('GradientBoosting Classifier accuracy: ', gb_accuracy)
print ('GradientBoosting Classifier classification report : \n',gb_cls_report)

GradientBoosting Classifier accuracy:  0.9129840546697039
GradientBoosting Classifier classification report : 
               precision    recall  f1-score   support

           0       0.95      0.89      0.92      1216
           1       0.87      0.94      0.91       979

    accuracy                           0.91      2195
   macro avg       0.91      0.92      0.91      2195
weighted avg       0.92      0.91      0.91      2195



<h2>Decision Tree</h2>

In [70]:
#Instantiate and fit model
dt_clf = DecisionTreeClassifier(random_state = 1)

dt_clf.fit(train_X,train_y)
pred_y = dt_clf.predict(val_X)

In [51]:
#TO PLOT DECISION TREE

# from sklearn import tree
# fig = plt.figure(figsize=(25,20))
# _ = tree.plot_tree(dt_clf,
#                   feature_names = data_X.columns.values,
#                   class_names = 'successful',
#                   filled = True)

In [71]:
dt_accuracy, dt_cls_report, dt_stats_df = evaluation_metrics('Decision Tree',val_y, pred_y)
print ('DecisionTree Classifier accuracy: ', dt_accuracy)
print ('DecisionTree Classifier classification report : \n',dt_cls_report)

DecisionTree Classifier accuracy:  0.882004555808656
DecisionTree Classifier classification report : 
               precision    recall  f1-score   support

           0       0.90      0.89      0.89      1216
           1       0.87      0.87      0.87       979

    accuracy                           0.88      2195
   macro avg       0.88      0.88      0.88      2195
weighted avg       0.88      0.88      0.88      2195



<h2>Random Forest</h2>

In [72]:
#Instantiate and fit model
rf_clf =RandomForestClassifier(random_state = 1)

rf_clf.fit(train_X,train_y)
pred_y = rf_clf.predict(val_X)

In [73]:
rf_accuracy, rf_cls_report, rf_stats_df = evaluation_metrics('Random Forest', val_y, pred_y)
print ('RandomForest Classifier accuracy: ', rf_accuracy)
print ('RandomForest Classifier classification report : \n',rf_cls_report)

RandomForest Classifier accuracy:  0.9129840546697039
RandomForest Classifier classification report : 
               precision    recall  f1-score   support

           0       0.94      0.90      0.92      1216
           1       0.89      0.92      0.90       979

    accuracy                           0.91      2195
   macro avg       0.91      0.91      0.91      2195
weighted avg       0.91      0.91      0.91      2195



<h3>Comparing statistics</h3>

In [74]:
pd.concat([lr_stats_df, gb_stats_df, dt_stats_df, rf_stats_df])

Unnamed: 0,accuracy,f1_score
Logistic Regression,0.874715,0.856247
GradientBoosting,0.912984,0.90605
Decision Tree,0.882005,0.868193
Random Forest,0.912984,0.904548


All the models perform well on the validation set. <br />
**Select the best model for the test set:** The Random Forest and GradientBoosting classifiers perform almost identically: their accuracy is the same (0.913), and the _F1 score_ of the GradientBoosting model is higher by only about 0.0015. Hence, either model can be selected for predicting the test set labels. <br />
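
One rough way to judge whether the gap between the two leading models matters is to compare it with the sampling noise of a score measured on 2195 validation rows. The sketch below uses the binomial standard error of the accuracy as a yardstick. Applying it to an F1 gap is only an approximation, and all numbers are copied from the comparison table above.

```python
import math

n_val = 2195           # validation rows (total support in the reports)
accuracy = 0.912984    # accuracy shared by both models
f1_gb = 0.90605        # GradientBoosting F1
f1_rf = 0.904548       # Random Forest F1

# Difference between the two F1 scores
f1_gap = f1_gb - f1_rf

# Binomial standard error of an accuracy estimate on n_val samples
std_error = math.sqrt(accuracy * (1 - accuracy) / n_val)

print(f"F1 gap: {f1_gap:.4f}, accuracy standard error: {std_error:.4f}")
```

The gap is several times smaller than the standard error, which supports treating the two models as equivalent on this validation set.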

<h2>Tuning the hyperparameters of the best model to optimize performance</h2>

I have selected the RandomForestClassifier for two reasons:
- Its performance on the validation set is good.
- The decision trees within the Random Forest can be exported and used to understand the rationale behind its predictions. This will help optimize business decisions for Starbucks.
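
As a sketch of what exporting a tree could look like, the example below fits a small forest on synthetic data and prints the rules of its first tree with `sklearn.tree.export_text`. The data, the feature names and the small `n_estimators` and `max_depth` values are placeholders chosen to keep the output short. They are not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic stand-in data; the feature names are placeholders
X, y = make_classification(n_samples=200, n_features=4, random_state=1)
feature_names = ["f0", "f1", "f2", "f3"]

forest = RandomForestClassifier(n_estimators=5, max_depth=2, random_state=1)
forest.fit(X, y)

# Each fitted member of the forest is a DecisionTreeClassifier
first_tree = forest.estimators_[0]
rules = export_text(first_tree, feature_names=feature_names)
print(rules)
```

With the notebook's data, `feature_names` would presumably be `list(data_X.columns)`, and a depth limit would keep the printed rules readable.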

In [76]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, truncnorm, randint

In [78]:
#hyperparameter distributions
model_params = {
    # randomly sample an integer number of estimators from 10 to 199 (the upper bound of randint is exclusive)
    'n_estimators': randint(10,200),
    # truncated normal for max_features with loc 0.25 and scale 0.1; a and b are given in units
    # of scale relative to loc, so values fall between 0.25 and 0.35
    'max_features': truncnorm(a=0, b=1, loc=0.25, scale=0.1),
    # uniform distribution from 0.01 to 0.209 (loc 0.01, scale 0.199)
    'min_samples_split': uniform(0.01, 0.199)
}

#initialize and fit
clf = RandomForestClassifier(random_state = 1)

# this will fit 100 sampled candidates on each of 5 cross-validation folds (500 fits),
# then refit the best candidate on the full train set
rf_clf = RandomizedSearchCV(clf, model_params, n_iter=100, cv=5, random_state=1)
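
The ranges that these `scipy.stats` objects actually cover can be checked by sampling from them. In `scipy`, `randint(low, high)` excludes `high`, `truncnorm` takes its bounds `a` and `b` in units of `scale` relative to `loc`, and `uniform(loc, scale)` covers `[loc, loc + scale]`. The sketch below draws samples and records the observed minimum and maximum; the sample size and seed are arbitrary choices.

```python
from scipy.stats import randint, truncnorm, uniform

# Same distributions as in the search space above
distributions = {
    "n_estimators": randint(10, 200),
    "max_features": truncnorm(a=0, b=1, loc=0.25, scale=0.1),
    "min_samples_split": uniform(0.01, 0.199),
}

# Draw many samples from each distribution and record the observed range
observed = {}
for name, dist in distributions.items():
    samples = dist.rvs(size=10000, random_state=1)
    observed[name] = (samples.min(), samples.max())
    print(name, observed[name])
```

The observed ranges stay within 10 to 199 for `n_estimators`, 0.25 to 0.35 for `max_features`, and 0.01 to 0.209 for `min_samples_split`.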

In [79]:
#train to find best model
best_rf_clf = rf_clf.fit(train_X, train_y)

In [82]:
from pprint import pprint

#see the best set of hyperparameters
print ("Best set of hyperparameters:\n")
pprint(best_rf_clf.best_estimator_.get_params())

Best set of hyperparameters:

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 0.34082380566905934,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 0.012776363022143807,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 98,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 1,
 'verbose': 0,
 'warm_start': False}
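
`get_params()` lists every constructor argument, including the defaults that were never searched. A shorter view is available through the `best_params_` and `best_score_` attributes of a fitted `RandomizedSearchCV`. The sketch below shows them on synthetic data with a deliberately tiny search; the data and the search range are placeholders, not the notebook's settings.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data
X, y = make_classification(n_samples=200, n_features=6, random_state=1)

# Tiny search: 3 sampled candidates, 3-fold cross-validation
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    {"n_estimators": randint(5, 20)},
    n_iter=3,
    cv=3,
    random_state=1,
)
search.fit(X, y)

# Only the searched hyperparameters, and the mean cross-validated score of the winner
print(search.best_params_)
print(round(search.best_score_, 3))
```

In this notebook the equivalent calls would be `best_rf_clf.best_params_` and `best_rf_clf.best_score_`.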


**Predicting for Test set**

In [87]:
test_pred_y = best_rf_clf.predict(test_X)
best_rf_accuracy, best_rf_cls_report, best_rf_stats_df = evaluation_metrics('Random Forest', test_y, test_pred_y)
print ('RandomForest Classifier accuracy: ', best_rf_accuracy)
print ('RandomForest Classifier classification report : \n', best_rf_cls_report)
best_rf_stats_df

RandomForest Classifier accuracy:  0.9117006733836261
RandomForest Classifier classification report : 
               precision    recall  f1-score   support

           0       0.94      0.89      0.91     10433
           1       0.88      0.94      0.91      9318

    accuracy                           0.91     19751
   macro avg       0.91      0.91      0.91     19751
weighted avg       0.91      0.91      0.91     19751



Unnamed: 0,accuracy,f1_score
Random Forest,0.911701,0.909223


<h1>Conclusion</h1>

The Random Forest performed well on unseen test data. After tuning, the model reaches an F1 score of 0.909 on the test set, slightly above the 0.905 that the untuned model scored on the validation set; because these are different splits, the increase is indicative rather than a like-for-like comparison. <br />
The model outperforms the Logistic Regression model, which served as the benchmark (0.87 accuracy on the validation set). It predicts the success label of an offer with 91% accuracy on both the validation and test sets. An extremely high accuracy on the training data alone could be a sign of overfitting, whereas consistent accuracy on unseen data suggests that this model generalizes, which makes it a reasonable candidate for real-world application.
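
One way to back up the overfitting argument is to compare accuracy on the training data with accuracy on held-out data: a large gap points to overfitting, a small one to good generalization. The sketch below shows the comparison on synthetic data; the dataset and the hyperparameter values are placeholders, not the tuned values found above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split the same way as in the notebook
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

model = RandomForestClassifier(
    n_estimators=50, min_samples_split=0.05, random_state=1
)
model.fit(X_train, y_train)

# Accuracy on data the model has seen versus data it has not
train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))
gap = train_accuracy - test_accuracy
print(round(train_accuracy, 3), round(test_accuracy, 3), round(gap, 3))
```

In this notebook the same check would use `best_rf_clf.predict(train_X)` against `train_y` alongside the test set scores reported above.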