### **Predictive Modeling**
I am building three models to predict the effectiveness of each offer, depending on the offer attributes and the demographics of the customers.

Models:

*   Logistic Regression
*   Random Forest Classifier
*   XGBoost

In [1]:
# mount Google Drive in Google Colab
# this step is unnecessary when running in a local Jupyter Notebook
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install --upgrade scikit-learn

Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/d9/3a/eb8d7bbe28f4787d140bb9df685b7d5bf6115c0e2a969def4027144e98b6/scikit_learn-0.23.1-cp36-cp36m-manylinux1_x86_64.whl (6.8MB)
Collecting threadpoolctl>=2.0.0
  Downloading https://files.pythonhosted.org/packages/f7/12/ec3f2e203afa394a149911729357aa48affc59c20e2c1c8297a60f33f133/threadpoolctl-2.1.0-py3-none-any.whl
Installing collected packages: threadpoolctl, scikit-learn
  Found existing installation: scikit-learn 0.22.2.post1
    Uninstalling scikit-learn-0.22.2.post1:
      Successfully uninstalled scikit-learn-0.22.2.post1
Successfully installed scikit-learn-0.23.1 threadpoolctl-2.1.0


In [22]:
import pandas as pd
import numpy as np
import sklearn 

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import confusion_matrix

from xgboost.sklearn import XGBClassifier

In [4]:
# import the dataframe built in ETL .ipynb
main_df = pd.read_csv('/content/drive/My Drive/starbucks_model_df.csv')

# drop the unnecessary columns
main_df = main_df.drop(['customer_id', 'time', 'email', 'offer_type'], axis = 1)
main_df.columns

Index(['offer_id', 'difficulty', 'duration', 'reward', 'web', 'mobile',
       'social', 'bogo', 'informational', 'discount', 'gender', 'age_bin',
       'income_bin', 'membership_since', 'total_amount', 'cust_action'],
      dtype='object')

In [5]:
# define which columns should be one-hot encoded and which should be scaled
cate_col = ['offer_id', 'gender', 'age_bin', 'income_bin', 'membership_since']
num_col  = ['difficulty', 'duration', 'reward', 'total_amount']

In [6]:
# define X and y
X = main_df.drop('cust_action', axis = 1)
y = main_df.cust_action

# split the data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2)

### **Evaluate naive predictor performance**


*   A naive predictor assumes that every offer was successful, i.e. it predicts 1 for every row. It is scored here on the training labels.

In [7]:
naive_pred_accuracy = accuracy_score(y_train, np.ones(len(y_train)))
naive_pred_f1= f1_score(y_train, np.ones(len(y_train)))

print("Naive predictor accuracy: %.3f" % (naive_pred_accuracy))
print("Naive predictor f1-score: %.3f" % (naive_pred_f1))

Naive predictor accuracy: 0.301
Naive predictor f1-score: 0.463
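
The two baseline numbers are linked. A predictor that always outputs 1 never misses a positive (recall 1), and its precision and accuracy both equal the share of positives. A minimal plain-Python sketch of that relationship follows; the helper name `all_positive_baseline` is hypothetical, and the positive rate is the unrounded naive accuracy reported in the comparison table later in this notebook.

```python
def all_positive_baseline(positive_rate):
    """Accuracy and F1 of a classifier that predicts 1 for every row."""
    precision = positive_rate   # every row is predicted positive
    recall = 1.0                # no positive is ever missed
    accuracy = positive_rate    # correct exactly on the positive rows
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


naive_acc, naive_f1 = all_positive_baseline(0.301222)
print("accuracy: %.3f, F1: %.3f" % (naive_acc, naive_f1))
```

This is why the F1-score of the baseline is noticeably higher than its accuracy even though it learns nothing from the features.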


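### **Feature Engineering**

*   One-hot encode the categorical columns with `OneHotEncoder`
*   Standardize the numerical columns with `StandardScaler`
*   The remaining binary columns (`web`, `mobile`, `social`, `bogo`, `informational`, `discount`) pass through unchanged
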
In [8]:
# instantiate OneHotEncoder and StandardScaler
ohe = OneHotEncoder()
scaler = StandardScaler()

ct = make_column_transformer(
    (ohe, cate_col),
    (scaler, num_col),
    remainder = 'passthrough')
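
To make the two transformations concrete, here is a small plain-Python sketch of what they compute. It is not the scikit-learn implementation: the helper names `standard_scale` and `one_hot` are hypothetical and the toy values are not taken from the dataset. It assumes the scaler divides by the population standard deviation and that the encoder orders the categories alphabetically.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """One 0/1 column per distinct category, in sorted category order."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# toy values, not taken from the dataset
scaled = standard_scale([5, 10, 20, 5])
categories, encoded = one_hot(["F", "M", "O", "M"])

print(scaled)
print(categories, encoded)
```

After scaling, the column has mean 0 and unit variance, which keeps the regularization in logistic regression from being dominated by columns with large raw values such as `total_amount`.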

### **Construct Logistic Regression model**


*   Tune the hyperparameters with a randomized search

*   The results suggest that logistic regression gives better accuracy and F1-score than the naive predictor.

    *   Accuracy

      *   Naive predictor: 0.301
      *   Logistic regression:  0.792
    *   F1-score
      *   Naive predictor: 0.463
      *   Logistic regression: 0.573

In [9]:
# hyperparameter search space

params_lr = {}
params_lr['logisticregression__penalty'] = ['l1','l2']
params_lr['logisticregression__C'] = [0.1, 1, 10]
params_lr

{'logisticregression__C': [0.1, 1, 10],
 'logisticregression__penalty': ['l1', 'l2']}
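
This search space is small: two penalties times three values of `C` gives six combinations. When every parameter is given as a list, `RandomizedSearchCV` samples without replacement, and its default `n_iter` is 10, so the search should end up evaluating all six combinations. The sketch below mimics that behaviour in plain Python under those assumptions; it is an illustration, not the library's code.

```python
from itertools import product
import random

search_space = {
    "logisticregression__penalty": ["l1", "l2"],
    "logisticregression__C": [0.1, 1, 10],
}

# enumerate every combination in the grid
names = sorted(search_space)
grid = [dict(zip(names, combo))
        for combo in product(*(search_space[n] for n in names))]

# draw candidates without replacement, capped at the grid size
n_iter = 10
rng = random.Random(0)
candidates = rng.sample(grid, min(n_iter, len(grid)))

print(len(grid), len(candidates))
```

With a grid this small the randomized search behaves like an exhaustive grid search, which is consistent with the six rows in the results table of the next cells.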

In [10]:
# build a pipeline 
# set up randomized search cross validation

lr = LogisticRegression(solver = 'liblinear', random_state = 1)
pipe_lr = make_pipeline(ct, lr)
rand_lr = RandomizedSearchCV(pipe_lr, params_lr, cv = 5, scoring = 'accuracy')
%time rand_lr.fit(X_train, y_train)



CPU times: user 38.5 s, sys: 2.96 s, total: 41.5 s
Wall time: 38.2 s


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('columntransformer',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('onehotencoder',
                                                                               OneHotEncoder(),
                                                                               ['offer_id',
                                                                                'gender',
                                                                                'age_bin',
                                                                                'income_bin',
                                                                                'membership_since']),
                                                                              ('standardscaler',
                                                             

In [11]:
# print the results of the top estimators
results = pd.DataFrame(rand_lr.cv_results_)
results.sort_values('rank_test_score')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__penalty,param_logisticregression__C,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,1.414199,0.518633,0.022761,0.000505,l1,0.1,"{'logisticregression__penalty': 'l1', 'logisti...",0.788722,0.795771,0.790132,0.796053,0.794549,0.793045,0.00303,1
2,4.412165,0.503704,0.022941,0.000269,l1,1.0,"{'logisticregression__penalty': 'l1', 'logisti...",0.787688,0.796335,0.789474,0.795771,0.795019,0.792857,0.003562,2
3,0.346516,0.019615,0.02281,0.000447,l2,1.0,"{'logisticregression__penalty': 'l2', 'logisti...",0.787406,0.796429,0.789286,0.795865,0.795113,0.79282,0.003724,3
4,0.206354,0.002219,0.022689,0.00049,l1,10.0,"{'logisticregression__penalty': 'l1', 'logisti...",0.787312,0.796335,0.789286,0.795771,0.795113,0.792763,0.003718,4
5,0.347444,0.039625,0.022748,0.000705,l2,10.0,"{'logisticregression__penalty': 'l2', 'logisti...",0.787312,0.796335,0.789286,0.795771,0.795113,0.792763,0.003718,4
1,0.304824,0.018882,0.022592,0.000178,l2,0.1,"{'logisticregression__penalty': 'l2', 'logisti...",0.787594,0.796053,0.789286,0.795865,0.794361,0.792632,0.003513,6
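
Each row in this table summarizes five validation folds. As a quick check of how `mean_test_score` and `std_test_score` relate to the split columns, the sketch below recomputes them for the top-ranked row, assuming the standard deviation is the population one.

```python
from statistics import mean, pstdev

# split0..split4 test scores of the top-ranked row above (l1, C = 0.1)
split_scores = [0.788722, 0.795771, 0.790132, 0.796053, 0.794549]

mean_test_score = mean(split_scores)
std_test_score = pstdev(split_scores)  # population standard deviation

print("mean: %.6f, std: %.5f" % (mean_test_score, std_test_score))
```

The spread across folds (about 0.003) is larger than the gap between the six candidates, so the ranking among them should not be over-interpreted.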


In [12]:
# predict on the test set with the best estimator
y_pred_lr = rand_lr.predict(X_test)
lr_accuracy = accuracy_score(y_test, y_pred_lr)
lr_f1 = f1_score(y_test, y_pred_lr, average='binary')

print('accuracy score of logistic regression: ', lr_accuracy )
print('F1 score of logistic regression: ', lr_f1)

accuracy score of logistic regression:  0.7915194346289752
F1 score of logistic regression:  0.5727930981358804


### **Construct Random Forest Classifier**

*   Tune the hyperparameters with a randomized search
*   The results suggest that the random forest classifier gives better accuracy and F1-score than the naive predictor.
    * Accuracy
      * Naive predictor: 0.301
      * Random Forest Classifier: 0.878
    * F1-score
      * Naive predictor: 0.463
      * Random Forest Classifier: 0.803

In [13]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# create the dictionary of the parameters
params_rfc = {'randomforestclassifier__n_estimators': n_estimators,
              'randomforestclassifier__max_features': max_features,
              'randomforestclassifier__max_depth': max_depth,
              'randomforestclassifier__min_samples_split': min_samples_split,
              'randomforestclassifier__min_samples_leaf': min_samples_leaf,
              'randomforestclassifier__bootstrap': bootstrap}
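
It helps to know how large this search space is compared with the number of candidates that get sampled. The sketch below counts the combinations in plain Python; the value lists are rewritten with `range` so that NumPy is not needed, and `n_iter = 6` is the setting used in the next cell.

```python
from math import prod

# same value lists as in the cell above, written without NumPy
n_estimators = list(range(200, 2001, 200))      # 10 values: 200, 400, ..., 2000
max_features = ["auto", "sqrt"]
max_depth = list(range(10, 111, 10)) + [None]   # 11 depths plus None
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

value_lists = (n_estimators, max_features, max_depth,
               min_samples_split, min_samples_leaf, bootstrap)
grid_size = prod(len(values) for values in value_lists)

n_iter = 6
print(grid_size, n_iter / grid_size)
```

Six candidates cover only a small fraction of the grid, so the reported random forest is the best of the sampled settings rather than the best setting in the whole space.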


In [14]:
# build the pipeline
rfc = RandomForestClassifier()
pipe_rfc = make_pipeline(ct, rfc)
rand_rfc = RandomizedSearchCV(pipe_rfc, params_rfc, n_iter = 6, cv = 5, verbose = 2, random_state = 1, n_jobs = 3)
%time rand_rfc.fit(X_train, y_train);

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  30 out of  30 | elapsed: 16.4min finished


CPU times: user 48.4 s, sys: 556 ms, total: 48.9 s
Wall time: 17min 12s


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('columntransformer',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('onehotencoder',
                                                                               OneHotEncoder(),
                                                                               ['offer_id',
                                                                                'gender',
                                                                                'age_bin',
                                                                                'income_bin',
                                                                                'membership_since']),
                                                                              ('standardscaler',
                                                             

In [15]:
# print the results of the top estimators

results = pd.DataFrame(rand_rfc.cv_results_)
results.sort_values('rank_test_score')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_randomforestclassifier__n_estimators,param_randomforestclassifier__min_samples_split,param_randomforestclassifier__min_samples_leaf,param_randomforestclassifier__max_features,param_randomforestclassifier__max_depth,param_randomforestclassifier__bootstrap,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
3,91.829873,2.238988,2.754379,0.146875,800,2,4,auto,40,False,"{'randomforestclassifier__n_estimators': 800, ...",0.869925,0.869173,0.868045,0.871805,0.875846,0.870959,0.002733,1
0,94.893753,1.431579,3.953045,0.079983,1200,10,2,auto,20,True,"{'randomforestclassifier__n_estimators': 1200,...",0.868233,0.869173,0.868515,0.871617,0.876598,0.870827,0.003122,2
1,25.286658,0.711111,0.861729,0.070979,200,10,1,auto,110,False,"{'randomforestclassifier__n_estimators': 200, ...",0.869737,0.869079,0.867763,0.870019,0.876222,0.870564,0.002934,3
4,164.764823,2.61219,5.470978,0.154742,1200,5,1,auto,50,False,"{'randomforestclassifier__n_estimators': 1200,...",0.871805,0.868797,0.86438,0.867293,0.872838,0.869023,0.003063,4
2,119.782003,4.206669,5.364707,0.196133,1200,2,1,auto,60,True,"{'randomforestclassifier__n_estimators': 1200,...",0.867575,0.867669,0.86532,0.865977,0.869831,0.867274,0.001567,5
5,58.904289,15.426622,2.04387,0.560503,1000,10,2,sqrt,10,True,"{'randomforestclassifier__n_estimators': 1000,...",0.858177,0.86344,0.859868,0.865883,0.867857,0.863045,0.00361,6


In [16]:
# predict on the test set with the best estimator

y_pred_rfc = rand_rfc.predict(X_test)
rfc_accuracy = accuracy_score(y_test, y_pred_rfc)
rfc_f1 = f1_score(y_test, y_pred_rfc, average='binary')

print('accuracy score of random forest classification: ', rfc_accuracy)
print('F1 score of random forest classification: ', rfc_f1)

accuracy score of random forest classification:  0.878355010901436
F1 score of random forest classification:  0.8027791321306679


### **Construct XGBoost**
* Tune the hyperparameters with a randomized search
* The results suggest that XGBoost gives better accuracy and F1-score than the naive predictor.
    * Accuracy
      * Naive predictor: 0.301
      * XGBoost: 0.877
    * F1-score
      * Naive predictor: 0.463
      * XGBoost: 0.797




In [17]:
# number of trees 
n_estimators = list(range(10, 1000, 100))

# maximum number of levels in tree
# control over-fitting as higher depth will allow model 
# to learn relations very specific to a particular sample
max_depth = list(range(3, 13, 2)) 

# fraction of observations randomly sampled for each tree.
# lower values make the algorithm more conservative and
# prevent overfitting, but too-small values might lead to under-fitting.
subsamples = list(np.arange(0.5, 1, 0.1)) 

# fraction of features (columns) randomly sampled for each tree
colsample_bytree = list(np.arange(0.5, 0.9, 0.1)) 

# defines the minimum sum of weights of all observations required in a child.
# control over-fitting
min_child_weight = [9, 12, 15]

# define learning rate
learning_rate = list(np.arange(0.1, 1, 0.1)) 

# specify the minimum loss reduction required to make a split.
gamma = list(np.arange(0.1, 1, 0.1)) 

# create the dictionary of the parameters
params_xgb = {'xgbclassifier__n_estimators': n_estimators,
             'xgbclassifier__max_depth': max_depth,
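             # NOTE: XGBoost names this parameter `subsample`, so this key probably
             # does not tune row subsampling; it is kept as run to match the output
             'xgbclassifier__subsamples': subsamples,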
             'xgbclassifier__colsample_bytree': colsample_bytree,
             'xgbclassifier__min_child_weight': min_child_weight,
             'xgbclassifier__learning_rate': learning_rate,
             'xgbclassifier__gamma': gamma}


In [18]:
# create the pipeline

xgb = XGBClassifier()
pipe_xgb = make_pipeline(ct, xgb)
rand_xgb = RandomizedSearchCV(pipe_xgb, params_xgb, n_iter = 6, cv = 5, verbose = 2, random_state = 42)
# fit the random search model
%time rand_xgb.fit(X_train, y_train);

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] xgbclassifier__subsamples=0.7999999999999999, xgbclassifier__n_estimators=110, xgbclassifier__min_child_weight=9, xgbclassifier__max_depth=9, xgbclassifier__learning_rate=0.1, xgbclassifier__gamma=0.1, xgbclassifier__colsample_bytree=0.7 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  xgbclassifier__subsamples=0.7999999999999999, xgbclassifier__n_estimators=110, xgbclassifier__min_child_weight=9, xgbclassifier__max_depth=9, xgbclassifier__learning_rate=0.1, xgbclassifier__gamma=0.1, xgbclassifier__colsample_bytree=0.7, total=  15.0s

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   15.0s remaining:    0.0s


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed: 21.9min finished


CPU times: user 22min 44s, sys: 1.12 s, total: 22min 45s
Wall time: 22min 46s


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('columntransformer',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('onehotencoder',
                                                                               OneHotEncoder(),
                                                                               ['offer_id',
                                                                                'gender',
                                                                                'age_bin',
                                                                                'income_bin',
                                                                                'membership_since']),
                                                                              ('standardscaler',
                                                             

In [19]:
# print the results of the top estimators
results = pd.DataFrame(rand_xgb.cv_results_)
results.sort_values('rank_test_score')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_xgbclassifier__subsamples,param_xgbclassifier__n_estimators,param_xgbclassifier__min_child_weight,param_xgbclassifier__max_depth,param_xgbclassifier__learning_rate,param_xgbclassifier__gamma,param_xgbclassifier__colsample_bytree,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
1,42.533662,0.302615,0.303841,0.003287,0.7,310,9,11,0.7,0.4,0.7,"{'xgbclassifier__subsamples': 0.7, 'xgbclassif...",0.872838,0.87312,0.87406,0.880263,0.876598,0.875376,0.00278,1
2,83.183457,1.568517,0.641075,0.016044,0.7,610,12,11,0.5,0.2,0.7,"{'xgbclassifier__subsamples': 0.7, 'xgbclassif...",0.873308,0.872744,0.874718,0.877914,0.87594,0.874925,0.001864,2
4,62.689007,0.670043,0.313672,0.022257,0.9,510,9,11,0.7,0.9,0.6,{'xgbclassifier__subsamples': 0.89999999999999...,0.875094,0.869925,0.87359,0.875376,0.875282,0.873853,0.002069,3
0,14.783141,0.046057,0.100253,0.001873,0.8,110,9,9,0.1,0.1,0.7,{'xgbclassifier__subsamples': 0.79999999999999...,0.866729,0.869737,0.870019,0.875188,0.874718,0.871278,0.003218,4
3,45.78165,1.081995,0.227875,0.005409,0.9,810,9,5,0.4,0.7,0.6,{'xgbclassifier__subsamples': 0.89999999999999...,0.868233,0.866917,0.871147,0.873214,0.872556,0.870414,0.002448,5
5,11.867162,0.138164,0.090058,0.003809,0.8,310,9,3,0.4,0.8,0.6,{'xgbclassifier__subsamples': 0.79999999999999...,0.863816,0.868703,0.867669,0.869455,0.872274,0.868383,0.002748,6


In [20]:
# predict on the test set with the best estimator

y_pred_xgb = rand_xgb.predict(X_test)
xgb_accuracy = accuracy_score(y_test, y_pred_xgb)
xgb_f1 = f1_score(y_test, y_pred_xgb, average='binary')

print('accuracy score of xgboost: ', xgb_accuracy)
print('F1 score of xgboost: ', xgb_f1)

accuracy score of xgboost:  0.8771520938275318
F1 score of xgboost:  0.7971698113207547


### **Compare the models**

In [21]:
score_metrics = {'model': ['naive predictor', 'logistic regression', 'random forest classifier', 'xgboost'],
                 'accuracy': [naive_pred_accuracy, lr_accuracy, rfc_accuracy, xgb_accuracy],
                 'F1 score': [naive_pred_f1, lr_f1, rfc_f1, xgb_f1]}

score_metrics_df = pd.DataFrame(score_metrics, columns= ['model', 'accuracy', 'F1 score'])
score_metrics_df

| | model | accuracy | F1 score |
|---|---|---|---|
| 0 | naive predictor | 0.301222 | 0.462983 |
| 1 | logistic regression | 0.791519 | 0.572793 |
| 2 | random forest classifier | 0.878355 | 0.802779 |
| 3 | xgboost | 0.877152 | 0.79717 |


The random forest classifier scores best on both metrics, narrowly ahead of XGBoost (0.878 vs 0.877 accuracy, 0.803 vs 0.797 F1). Let's print out its confusion matrix.

In [31]:
print('confusion matrix of random forest classifier:\n', confusion_matrix(y_test, y_pred_rfc))

confusion matrix of random forest classifier:
 [[8390  944]
 [ 674 3293]]
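
The confusion matrix also shows where the reported scores come from. Assuming the usual scikit-learn layout (rows are true labels, columns are predicted labels, with label 0 first), the sketch below recomputes accuracy, precision, recall and F1 from the four counts.

```python
# confusion matrix printed above: rows are true labels, columns are predictions
tn, fp = 8390, 944
fn, tp = 674, 3293

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted successes that were real
recall = tp / (tp + fn)      # share of real successes that were found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy: %.3f, precision: %.3f, recall: %.3f, F1: %.3f"
      % (accuracy, precision, recall, f1))
```

The accuracy and F1 values agree with the ones printed for the random forest earlier. The model's recall is somewhat higher than its precision, so it misses fewer successful offers than it falsely predicts.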


### **Conclusion**

Building a decent model of a marketing campaign is not easy, because of factors such as unpredictable human behaviour and economic cycles.

In this project, I tried to build a more convincing model by combining business heuristics with a machine learning approach. I hope it offers some insight for future marketing campaigns.

Notes on this project:

*   I removed the samples with age 118 because they are also missing many other features. I strongly believe that age 118 is a placeholder for a missing value (NaN). It was a good lesson in working with real business data.

*   Female customers are more concentrated in the higher income range (75k and above), while male customers are more concentrated in the lower income range (below 75k).


Finally, I am impressed that the random forest classifier predicts the test set with 88% accuracy and an 80% F1 score.

