# Report

The goal is to identify which features are important and influence the buying intent of customers. There are three feature sets, and two models are trained on each set. The tree-based model (gradient boosted trees) is used to determine feature importance. The model with the highest balanced accuracy score will be selected. The neural network model serves as a reference for how good the tree-based model is. Ideally, the two models should perform about equally well.


## Model selection

The hyperparameters of each model are optimized using Bayesian optimization with 5-fold stratified cross-validation. Each model is trained on three different transformations of the original dataset. Since the dataset is imbalanced, we use balanced accuracy as the metric.
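
To illustrate why balanced accuracy suits an imbalanced dataset, here is a minimal pure-Python sketch. The helper name `balanced_accuracy` and the toy labels are assumptions made up for illustration; the notebook itself uses scikit-learn's `balanced_accuracy_score`, which computes the same quantity (the mean of per-class recall).

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (illustrative re-implementation)."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of the samples that truly belong to this class
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy, made-up labels: 8 negatives and 2 positives (imbalanced)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_negative = [0] * 10  # a classifier that never predicts the minority class

plain_accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
balanced = balanced_accuracy(y_true, always_negative)
print(plain_accuracy, balanced)  # 0.8 0.5
```

Plain accuracy rewards the trivial classifier for the size of the majority class, while balanced accuracy drops to 0.5, the value expected from an uninformative binary classifier.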

### Optimization and selection for gradient boosted trees

In [1]:

import warnings
warnings.filterwarnings('ignore')

In [2]:
!pip install lightgbm





In [3]:
!pip install hyperopt





In [4]:
# imports
import lightgbm as lgb #light gradient boosted tree
from sklearn.model_selection import train_test_split, StratifiedKFold # train and test split
from sklearn.metrics import balanced_accuracy_score,precision_score# metrics
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK# optimization
import numpy as np
import pandas as pd # reading data
import warnings
warnings.simplefilter("ignore")

In [5]:
# read data
df_encoded = pd.read_csv('online_shoppers_intention_encoded.csv', index_col=False)
df_scaled = pd.read_csv('online_shoppers_intention_encoded_scaled.csv', index_col=False)
df_transformed = pd.read_csv('online_shoppers_intention_encoded_scaled_transformed.csv', index_col=False)

In [6]:
# separate the label; the same 'Revenue' column is also dropped from the other two feature sets
label = df_encoded.pop('Revenue').astype('int')
_,_ = df_scaled.pop('Revenue'),df_transformed.pop('Revenue')

In [7]:
#split object for CV 
skf = StratifiedKFold(n_splits=5, random_state=None, shuffle=True)
# Hyperparameter search space for gradient boosted trees
# the key must be 'learning_rate' so that LGBMClassifier picks it up
lgb_space = {'learning_rate':hp.loguniform('learning_rate',-6.9,-2.3),'num_leaves':hp.quniform('num_leaves',15,255,1),
         'max_depth':hp.choice('max_depth',[-1,9,12]),'colsample_bytree': hp.uniform('colsample_bytree', 0.3, 1.0)}
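
The bounds passed to `hp.loguniform` are in log space: the sampled value is the exponential of a uniform draw between the two bounds. The sketch below uses only the standard library to show what the learning-rate bounds of `lgb_space` translate to; the variable names and the sampling loop are illustrative assumptions, not hyperopt's own code.

```python
import math
import random

low, high = -6.9, -2.3  # learning-rate bounds from lgb_space (log space)
lr_min, lr_max = math.exp(low), math.exp(high)
print(lr_min, lr_max)  # roughly 0.001 and 0.1

# Mimic a log-uniform draw: uniform in log space, then exponentiate
rng = random.Random(0)
samples = [math.exp(rng.uniform(low, high)) for _ in range(1000)]

# About half of the draws fall below 0.01, the geometric midpoint of the range
share_below_001 = sum(s < 0.01 for s in samples) / len(samples)
print(share_below_001)
```

Sampling on a log scale spends as many trials between 0.001 and 0.01 as between 0.01 and 0.1, which is usually what is wanted for a learning rate.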

In [8]:
def lgb_optimizer(params):
    '''Objective function for hyperopt: cross-validates an LGBM classifier.
    params: dict of hyperparameters sampled from the search space
    returns a dict with the negative mean validation score as 'loss' and
    the negative mean training score as 'true_loss'
    '''
    params = dict(params)  # copy so the caller's dict is not modified
    # hp.quniform samples floats, but LightGBM expects integers here
    if 'num_leaves' in params:
        params['num_leaves']=int(params['num_leaves'])
    if 'max_depth' in params:
        params['max_depth']=int(params['max_depth'])
    val_score=[]
    true_scores=[]
    for train_index, test_index in skf.split(X_train, y_train):
        X_tr, X_val = X_train.iloc[train_index], X_train.iloc[test_index]
        y_tr, y_val = y_train.iloc[train_index], y_train.iloc[test_index]
        clf = lgb.LGBMClassifier(n_estimators=2000,**params)
        # stop adding trees once the validation log loss has not improved for 200 rounds
        # (callback form; LightGBM 4.x no longer accepts early_stopping_rounds in fit)
        clf.fit(X_tr,y_tr,eval_set=[(X_val,y_val)],eval_metric='logloss',
                callbacks=[lgb.early_stopping(200,verbose=False)])
        y_pred = clf.predict(X_val)
        y_tr_pred=clf.predict(X_tr)
        score=balanced_accuracy_score(y_val,y_pred)
        true_score=balanced_accuracy_score(y_tr,y_tr_pred)
        val_score.append(score)
        true_scores.append(true_score)
    mean,std =np.mean(val_score),np.std(val_score)
    true_mean=np.mean(true_scores)
    print("mean: {}, Std: {}".format(mean,std))
    # hyperopt minimizes the loss, so the score is negated
    return {'loss':-mean,'status': STATUS_OK,'true_loss':-true_mean}
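
The cross-validation loop above relies on stratified folds, which keep the class ratio roughly constant in every fold. The sketch below shows the idea in pure Python; the helper `stratified_folds` and the toy labels are assumptions made up for illustration, and this is a simplified scheme rather than the exact algorithm behind scikit-learn's `StratifiedKFold`.

```python
import random
from collections import Counter


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign sample indices to folds so class ratios stay similar across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds
        for pos, i in enumerate(indices):
            folds[pos % n_splits].append(i)
    return folds


labels = [1] * 20 + [0] * 80  # toy, made-up labels with 20% positives
folds = stratified_folds(labels)
for fold in folds:
    print(Counter(labels[i] for i in fold))  # 16 negatives and 4 positives per fold
```

Without stratification, a fold could end up with very few positive samples, which would make the per-fold balanced accuracy noisy.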

In [9]:
#split into train and test and call the fmin(optimizer) function for encoded dataset
X_train,X_test,y_train,y_test = train_test_split(df_encoded,label,test_size=0.2,random_state=42)
trials_encoded = Trials()
best_e= fmin(lgb_optimizer,lgb_space, algo=tpe.suggest, max_evals=20, trials=trials_encoded)

mean: 0.779801200445387, Std: 0.015378072232663909                                                                     
mean: 0.7737058846317746, Std: 0.013829952496060574                                                                    
mean: 0.7739169123334954, Std: 0.013047148989631644                                                                    
mean: 0.7443563883164647, Std: 0.01709156664462584                                                                     
mean: 0.770928183223507, Std: 0.010748851779302045                                                                     
mean: 0.7695350117407708, Std: 0.01202891358575693                                                                     
mean: 0.775500404272416, Std: 0.017958117750630293                                                                     
mean: 0.776527349934954, Std: 0.012509161521363357                                                                     
mean: 0.7731379287505795, Std: 0.0067900

In [10]:
#for scaled dataset
X_train,X_test,y_train,y_test = train_test_split(df_scaled,label,test_size=0.2,random_state=42)
trials_scaled = Trials()
best_scaled = fmin(lgb_optimizer, lgb_space, algo=tpe.suggest, max_evals=20, trials=trials_scaled)

mean: 0.7692049080045937, Std: 0.011876834499231817                                                                    
mean: 0.7653190502909887, Std: 0.020089213375784665                                                                    
mean: 0.7772322273223169, Std: 0.013845146511467876                                                                    
mean: 0.788156955172625, Std: 0.013054956347022348                                                                     
mean: 0.7527307758156367, Std: 0.008792092461853668                                                                    
mean: 0.7658385434695946, Std: 0.005584254936410054                                                                    
mean: 0.7775213914469203, Std: 0.011446310628948594                                                                    
mean: 0.7598946454837086, Std: 0.01574105291905567                                                                     
mean: 0.7294897107359007, Std: 0.0124431

In [None]:
#for transformed dataset
X_train,X_test,y_train,y_test = train_test_split(df_transformed,label,test_size=0.2,random_state=42)
trials_transformed = Trials()
best_transformed = fmin(lgb_optimizer, lgb_space, algo=tpe.suggest, max_evals=20, trials=trials_transformed)

mean: 0.7559017899208076, Std: 0.018882095745168123                                                                    
mean: 0.7668276612309626, Std: 0.011003723453735226                                                                    
mean: 0.7795804689560024, Std: 0.008331011736322908                                                                    
mean: 0.770613900472755, Std: 0.009989144471381222                                                                     
mean: 0.7710507396248202, Std: 0.012526496267674297                                                                    
mean: 0.7717355408565076, Std: 0.022578656628019948                                                                    
mean: 0.7575867123717356, Std: 0.01666954514540512                                                                     
mean: 0.7672287469554601, Std: 0.015779766422753253                                                                    
mean: 0.7787505795404416, Std: 0.0126237

In [None]:
def model_trainer(Data,label,params,classifier='lgb',test_size=0.2):
    '''Train a model on a train/test split and score it with balanced accuracy.
    parameters:
    Data := pandas DataFrame or numpy array of features
    label := pandas Series or numpy array of labels for the features
    params := dict of hyperparameters as returned by fmin
    classifier := str, 'lgb' or 'keras'
    test_size := float in (0,1), fraction of the data held out for testing
    returns:
    trained model, train_score, test_score (balanced accuracy)
    '''
    params = dict(params)  # copy so that repeated calls do not re-map the same dict
    X_train,X_test,y_train,y_test = train_test_split(Data,label,test_size=test_size,random_state=42)
    if classifier=='lgb':
        if 'num_leaves' in params:
            params['num_leaves']=int(params['num_leaves'])
        if 'max_depth' in params:
            # fmin returns the index of the hp.choice option, so map it back to the value
            max_depth=[-1,9,12]
            params['max_depth']=max_depth[int(params['max_depth'])]
        clf = lgb.LGBMClassifier(n_estimators=5000,**params)
        # note: the test split doubles as the early-stopping set, so the test score may be optimistic
        clf.fit(X_train,y_train,eval_set=[(X_test,y_test)],eval_metric='auc',
                callbacks=[lgb.early_stopping(200,verbose=False)])
        train_pred=clf.predict(X_train)
        test_pred=clf.predict(X_test)
    elif classifier=='keras':
        opt=['adam','sgd']
        if 'optimizer' in params:
            # fmin returns the index of the hp.choice option here as well
            params['optimizer'] =opt[params['optimizer']]
        clf = model(feature_size=X_train.shape[-1],**params)
        clf.fit(X_train,y_train,epochs=20,batch_size=128,verbose=0)
        # round the sigmoid outputs to 0/1 and flatten them to 1-D
        train_pred=np.round(clf.predict(X_train)).ravel()
        test_pred=np.round(clf.predict(X_test)).ravel()
    # balanced_accuracy_score expects (y_true, y_pred)
    train_score = balanced_accuracy_score(y_train,train_pred)
    test_score= balanced_accuracy_score(y_test,test_pred)
    return clf,train_score,test_score
        
    

In [None]:
print(best_e)
print(best_scaled)
print(best_transformed)
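
When reading the printed dictionaries, note that for `hp.choice` parameters `fmin` reports the index of the chosen option rather than the option itself. The sketch below shows one way to map such indices back to values; the helper `decode_choices` and the example dictionary are hypothetical and made up for illustration. hyperopt also provides `space_eval(space, best)`, which should give the decoded values for the whole search space.

```python
def decode_choices(best, choices):
    """Return a copy of `best` with hp.choice indices replaced by their option values."""
    decoded = dict(best)
    for name, options in choices.items():
        if name in decoded:
            decoded[name] = options[int(decoded[name])]
    return decoded


# Hypothetical fmin result, made up for illustration
example_best = {'num_leaves': 31.0, 'max_depth': 2, 'colsample_bytree': 0.8}
decoded = decode_choices(example_best, {'max_depth': [-1, 9, 12]})
print(decoded['max_depth'])  # 12, the option at index 2
```

The helper returns a copy, so the dictionary returned by `fmin` stays untouched and can be decoded again safely.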

In [None]:
clf_encoded,lgb_train_encoded,lgb_test_encoded=model_trainer(df_encoded,label,best_e)
clf_scaled,lgb_train_scaled,lgb_test_scaled=model_trainer(df_scaled,label,best_scaled)
clf_transformed,lgb_train_transformed,lgb_test_transformed=model_trainer(df_transformed,label,best_transformed)

In [None]:
print("results")
print("| Dataset Type | Train Score | Test Score |")
print("| Encoded      | {:.4f}      | {:.4f}     |".format(lgb_train_encoded,lgb_test_encoded))
print("| Scaled       | {:.4f}      | {:.4f}     |".format(lgb_train_scaled,lgb_test_scaled))
print("| Transformed  | {:.4f}      | {:.4f}     |".format(lgb_train_transformed,lgb_test_transformed))

### Optimization and selection for neural network

In [None]:
!pip install tensorflow

In [None]:
!pip install keras

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam, SGD
from keras.callbacks import EarlyStopping

In [None]:
skf = StratifiedKFold(n_splits=5, random_state=None, shuffle=True)

In [None]:
def model(feature_size,lr=0.1,optimizer='adam',hl_size=128):
    '''Build and compile a feed-forward Keras binary classifier.
    feature_size := int, number of input features
    lr := float, learning rate
    optimizer := str, 'adam' or 'sgd'
    hl_size := int, number of units in each hidden layer
    '''
    if optimizer=='adam':
        opt =Adam(learning_rate=lr)
    elif optimizer=='sgd':
        opt = SGD(learning_rate=lr)
    else:
        raise ValueError("optimizer must be 'adam' or 'sgd'")
    hl_size=int(hl_size)
    net =Sequential()
    net.add(Dense(hl_size,activation='relu',input_shape=(feature_size,)))
    net.add(Dropout(0.25))
    net.add(Dense(hl_size,activation='relu'))
    net.add(Dense(1,activation='sigmoid'))
    net.compile(optimizer=opt,loss='binary_crossentropy',metrics=['acc'])
    return net
    

In [None]:
def keras_optimizer(params):
    '''Objective function for hyperopt: cross-validates the Keras model.
    params: dict of hyperparameters sampled from keras_space
    returns a dict with the negative mean validation score as 'loss' and
    the negative mean training score as 'true_loss'
    '''
    val_score=[]
    true_scores=[]
    for train_index, test_index in skf.split(X_train, y_train):
        X_tr, X_val = X_train.iloc[train_index], X_train.iloc[test_index]
        y_tr, y_val = y_train.iloc[train_index], y_train.iloc[test_index]
        nn_model = model(X_train.shape[1],**params)
        # stop training once validation accuracy has not improved for 3 epochs
        cb = EarlyStopping(monitor='val_acc',min_delta=0.001,patience=3)
        nn_model.fit(X_tr,y_tr,validation_data=(X_val,y_val),batch_size=128,callbacks=[cb],epochs=50,verbose=0)
        # round the sigmoid outputs to 0/1 and flatten them to 1-D
        cv_pred = np.round(nn_model.predict(X_val)).ravel()
        score=balanced_accuracy_score(y_val,cv_pred)
        y_tr_pred = np.round(nn_model.predict(X_tr)).ravel()
        true_score=balanced_accuracy_score(y_tr,y_tr_pred)
        val_score.append(score)
        true_scores.append(true_score)
    mean,std =np.mean(val_score),np.std(val_score)
    true_mean=np.mean(true_scores)
    print("mean: {}, Std: {}".format(mean,std))
    # hyperopt minimizes the loss, so the score is negated
    return {'loss':-mean,'status': STATUS_OK,'true_loss':-true_mean}

In [None]:
keras_space = {'lr':hp.loguniform('lr',-10,-2.3),'optimizer':hp.choice('optimizer',['adam','sgd'])}

In [None]:
X_train,X_test,y_train,y_test = train_test_split(df_encoded,label,test_size=0.2,random_state=42)
best_nn_encoded=fmin(keras_optimizer,keras_space,algo=tpe.suggest,max_evals=10)

In [None]:
X_train,X_test,y_train,y_test = train_test_split(df_scaled,label,test_size=0.2,random_state=42)
best_nn_scaled=fmin(keras_optimizer,keras_space,algo=tpe.suggest,max_evals=10)

In [None]:
X_train,X_test,y_train,y_test = train_test_split(df_transformed,label,test_size=0.2,random_state=42)
best_nn_transformed=fmin(keras_optimizer,keras_space,algo=tpe.suggest,max_evals=10)

In [None]:
print(best_nn_encoded)
print(best_nn_scaled)
print(best_nn_transformed)

In [None]:
_,nn_train_encoded,nn_test_encoded=model_trainer(df_encoded,label,best_nn_encoded,'keras')
_,nn_train_scaled,nn_test_scaled=model_trainer(df_scaled,label,best_nn_scaled,'keras')
_,nn_train_transformed,nn_test_transformed=model_trainer(df_transformed,label,best_nn_transformed,'keras')

In [None]:
print("|              |        Neural Network            |       Boosted Trees       |")
print("| Dataset Type | Train Score     | Test Score     |Train Score   | Test Score |")
print("| Encoded      | {:.4f}          | {:.4f}         | {:.4f}       | {:.4f}     |".format(nn_train_encoded,nn_test_encoded,lgb_train_encoded,lgb_test_encoded))
print("| Scaled       | {:.4f}          | {:.4f}         | {:.4f}       | {:.4f}     |".format(nn_train_scaled,nn_test_scaled,lgb_train_scaled,lgb_test_scaled))
print("| Transformed  | {:.4f}          | {:.4f}         | {:.4f}       | {:.4f}     |".format(nn_train_transformed,nn_test_transformed,lgb_train_transformed,lgb_test_transformed))

The best results for both the tree-based model and the neural network are obtained on the feature set that is neither scaled nor transformed. The two models perform comparably, with the neural network scoring about 1 percent higher than the gradient boosted trees.

In [None]:
# Save the model
# (joblib is imported directly; sklearn.externals.joblib was removed from recent scikit-learn releases)
import joblib
joblib.dump(clf_encoded, 'lgb_best.pkl')

In [None]:
#feature importance for the best tree based model
%matplotlib inline
lgb.plot_importance(clf_encoded,max_num_features=10)
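
The ranking shown in the plot can also be extracted as a plain list, which is easier to reuse in later analysis. The sketch below is a minimal illustration; the helper `top_features`, the placeholder feature names, and the importance values are all made up and do not come from the fitted model.

```python
def top_features(names, importances, k=10):
    """Return the k (name, importance) pairs with the highest importance."""
    pairs = sorted(zip(names, importances), key=lambda p: p[1], reverse=True)
    return pairs[:k]


# Placeholder names and made-up importance values, for illustration only
names = ['feature_a', 'feature_b', 'feature_c', 'feature_d']
importances = [12, 40, 0, 25]
top2 = top_features(names, importances, k=2)
print(top2)  # [('feature_b', 40), ('feature_d', 25)]
```

With the fitted classifier, a call along the lines of `top_features(df_encoded.columns, clf_encoded.feature_importances_)` should give the same ordering as the plot, assuming both use the same importance type.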

## Conclusion

The top 10 features affecting buying intent were identified. These features can be isolated further depending on the application. For example, the exit rate would be a good measure of how well personalized web pages are working for users. A simple A/B test can be carried out with and without personalization, monitoring exit rates along with the other features. A change in these features would indicate a change in buying intent and tell us whether the test was successful.
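
One way such an A/B comparison could be evaluated is a permutation test on the difference in mean exit rate between the two groups. The sketch below is only an illustration under stated assumptions: the helper `permutation_p_value` is hypothetical, the exit rates are made-up numbers rather than results from this dataset, and other tests may be more appropriate in practice.

```python
import random


def permutation_p_value(control, treatment, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(treatment) / len(treatment) - sum(control) / len(control))
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group membership at random and recompute the difference
        rng.shuffle(pooled)
        a, b = pooled[:n_control], pooled[n_control:]
        diff = abs(sum(b) / len(b) - sum(a) / len(a))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


# Made-up exit rates per session, for illustration only
control = [0.20, 0.18, 0.22, 0.25, 0.19, 0.21, 0.23, 0.24]
personalized = [0.12, 0.15, 0.10, 0.14, 0.13, 0.11, 0.16, 0.12]
p_value = permutation_p_value(control, personalized)
print(p_value)
```

A small p-value suggests the difference in exit rates is unlikely to be due to chance alone; whether that translates into a change in buying intent would still need to be checked against revenue.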