Here we will build a gradient-boosted trees model to classify customers into their corresponding classes.<br>
<br>
Recall that we used averages over all of a customer's orders to calculate that customer's Food%, Fresh%, etc. A customer may naturally move from one class to another over time. For example, the birth of a baby may make a customer new_parents, and years later, once the child has grown up, the customer becomes normal again. A customer may also become more health-conscious with age. An average over all orders reacts only slowly to such changes.<br>
<br>
To address this problem, we can update the customer's data with the current average after every order and have the model predict the class again. A better idea is to use an exponential moving average, a concept borrowed from the technical analysis of stock markets, with a fixed number of orders as the look-back period. An exponential moving average gives exponentially more weight to recent data and less weight to older data, thereby capturing the current trend in the customer's orders.<br>
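
As a rough illustration of the idea above, the sketch below computes an exponential moving average over a made-up sequence of per-order Baby% values and compares it with the plain average over all orders. The order history, the 5-order look-back, and the smoothing factor `alpha = 2 / (lookback + 1)` (the convention common in technical analysis) are assumptions for illustration, not part of this notebook's data.

```python
def exponential_moving_average(values, lookback):
    """Return the running EMA of `values` using alpha = 2 / (lookback + 1)."""
    alpha = 2.0 / (lookback + 1)
    ema = []
    for x in values:
        if not ema:
            ema.append(float(x))  # seed the EMA with the first observation
        else:
            ema.append(alpha * x + (1 - alpha) * ema[-1])
    return ema


# Hypothetical Baby% per order for one customer who becomes a new parent at order 7.
baby_pct = [0, 0, 0, 0, 0, 0, 40, 45, 50, 55]

plain_mean = sum(baby_pct) / len(baby_pct)  # average over all orders
ema_latest = exponential_moving_average(baby_pct, lookback=5)[-1]

print(f"plain average: {plain_mean:.1f}")
print(f"EMA (5-order look-back): {ema_latest:.1f}")
```

On this made-up history the plain average is 19.0 while the EMA is about 40, much closer to the customer's recent orders. With pandas, `Series.ewm(span=5, adjust=False).mean()` computes the same recursion.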

Here we will just create a basic model with the data we have.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import scipy as sc
import xgboost as xgb
import itertools

In [14]:
from xgboost.sklearn import XGBClassifier

In [18]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('customer_segmentation.csv')

In [3]:
data.head()

,customer,order,total_items,discount%,weekday,hour,Food%,Fresh%,Drinks%,Home%,Beauty%,Health%,Baby%,Pets%,num_orders,labels,class
0,0,0,44.666667,14.11,4,13,14.07,73.203333,4.356667,6.2,2.176667,0.0,0.0,0.0,3.0,1,fresh_regulars
1,1,3,31.15,17.849,1,12,17.762,52.909,17.761,3.2075,2.3145,4.352,1.695,0.0,20.0,4,loyals
2,2,23,26.0,2.97,6,23,24.1,22.29,38.69,14.92,0.0,0.0,0.0,0.0,1.0,8,grocery_regulars
3,3,24,27.782609,4.102174,1,10,23.825652,51.28087,8.22087,14.773478,0.0,0.0,1.898696,0.0,23.0,4,loyals
4,4,47,17.103448,4.373103,3,9,24.841379,51.082414,10.291034,13.035172,0.683793,0.0,0.065517,0.0,29.0,4,loyals


In [4]:
data.describe()

,customer,order,total_items,discount%,weekday,hour,Food%,Fresh%,Drinks%,Home%,Beauty%,Health%,Baby%,Pets%,num_orders,labels
count,9354.0,9354.0,9354.0,9354.0,9354.0,9354.0,9354.0,9354.0,9354.0,9354.0,9354.0,9354.0,9354.0,9354.0,9354.0,9354.0
mean,5022.473808,15025.143789,32.022679,11.857907,3.657473,15.258071,25.88656,15.173542,23.717923,15.517726,6.083896,1.280169,11.03604,1.144381,3.085311,4.904319
std,2945.899928,8825.170543,18.724271,19.372177,2.181161,5.709821,24.018227,19.856395,21.745537,18.024529,11.766312,5.089555,23.515242,6.224596,3.24771,2.906824
min,0.0,0.0,4.25,-31.82,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,2457.25,7307.75,19.44697,2.56,2.0,11.0,9.60875,0.0,7.88025,2.125833,0.0,0.0,0.0,0.0,1.0,2.0
50%,4951.5,14777.5,28.763889,5.75,3.0,16.0,20.9,4.867,18.9715,10.50125,2.280417,0.0,0.0,0.0,2.0,5.0
75%,7573.75,22804.75,40.0,12.3825,6.0,20.0,33.8025,26.487083,33.567292,21.72,7.35,0.0,7.09725,0.0,4.0,8.0
max,10237.0,29997.0,147.5,100.0,7.0,23.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,52.0,9.0


In [11]:
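# Target: the numeric class label.
y = data['labels'].values
# Features: everything except the two label columns and the order-time columns.
# Note that the identifier-like columns 'customer' and 'order' stay in X.
X = data.drop(['weekday', 'hour','labels', 'class'], axis=1).values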

In [7]:
from sklearn.model_selection import StratifiedKFold

In [8]:
skf = StratifiedKFold(n_splits = 5, shuffle=True, random_state=42)

In [9]:
xgbc = XGBClassifier()

In [56]:
cv_scores = []
for train_index, test_index in skf.split(X,y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    xgbc.fit(X_train, y_train)
    score = xgbc.score(X_test, y_test)
    print(score)
    cv_scores.append(score)

0.9797441364605544
0.9770544290288153
0.9705724986623863
0.9689507494646681
0.9716122121049813
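
The five fold accuracies printed above are easier to compare with later results once they are reduced to a mean and a spread. The sketch below does that with the standard library, copying the printed values by hand; in the notebook itself the same numbers are held in `cv_scores`.

```python
import statistics

# Fold accuracies copied from the output above (the notebook keeps them in cv_scores).
fold_scores = [
    0.9797441364605544,
    0.9770544290288153,
    0.9705724986623863,
    0.9689507494646681,
    0.9716122121049813,
]

mean_score = statistics.mean(fold_scores)
std_score = statistics.stdev(fold_scores)  # sample standard deviation across folds

print(f"mean accuracy: {mean_score:.4f} +/- {std_score:.4f}")
```

This gives a mean accuracy of roughly 0.974 with a fold-to-fold spread of about 0.005, which is a useful yardstick for judging any later gain.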


An accuracy of about 0.97 across the five folds is not a bad score. Let's try to improve it by tuning the hyperparameters.

In [13]:
from sklearn.model_selection import GridSearchCV

In [44]:
params={
    'max_depth':[6,7],
    'learning_rate':[0.05],
    'n_estimators':[500],
    'objective':['multi:softprob'],
    'gamma':[0],
    'max_delta_step':[1],
    'subsample':[0.9,0.8],
    'colsample_bytree':[1.0],
    'colsample_bylevel':[1.0],
    'min_child_weight':[1.0]
}
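
To see how many candidate models this grid produces, the sketch below expands it with `itertools.product`. It only illustrates how an exhaustive grid is enumerated (the grid is copied from the cell above); `GridSearchCV` performs this expansion internally.

```python
import itertools

# Same grid as in the cell above.
params = {
    'max_depth': [6, 7],
    'learning_rate': [0.05],
    'n_estimators': [500],
    'objective': ['multi:softprob'],
    'gamma': [0],
    'max_delta_step': [1],
    'subsample': [0.9, 0.8],
    'colsample_bytree': [1.0],
    'colsample_bylevel': [1.0],
    'min_child_weight': [1.0],
}

names = list(params)
# One candidate per element of the Cartesian product of all value lists.
candidates = [dict(zip(names, combo)) for combo in itertools.product(*params.values())]

print(len(candidates))
for candidate in candidates:
    print(candidate['max_depth'], candidate['subsample'])
```

Only `max_depth` and `subsample` take more than one value, so the grid contains 2 x 2 = 4 candidates, each of which is fitted once per cross-validation fold.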

In [45]:
grid_search_xgb = GridSearchCV(estimator=XGBClassifier(), param_grid=params, n_jobs=-1)

In [46]:
grid_search_xgb.fit(X,y)

GridSearchCV(cv=None, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'colsample_bylevel': [1.0], 'colsample_bytree': [1.0], 'gamma': [0], 'subsample': [0.9, 0.8], 'min_child_weight': [1.0], 'max_delta_step': [1], 'objective': ['multi:softprob'], 'n_estimators': [500], 'learning_rate': [0.05], 'max_depth': [6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [47]:
print(grid_search_xgb.best_params_)

{'colsample_bylevel': 1.0, 'colsample_bytree': 1.0, 'gamma': 0, 'subsample': 0.9, 'min_child_weight': 1.0, 'max_delta_step': 1, 'objective': 'multi:softprob', 'n_estimators': 500, 'learning_rate': 0.05, 'max_depth': 6}


In [48]:
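# Append this search's results to those of earlier searches, or start the table on the first run.
new_results = pd.DataFrame(grid_search_xgb.cv_results_)
results = pd.concat([results, new_results], axis=0) if 'results' in globals() else new_results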

In [49]:
results[results['mean_test_score']==results['mean_test_score'].max()].T

,21,0,1
mean_fit_time,30.7451,28.9921,28.5919
mean_score_time,1.22534,2.06268,1.80435
mean_test_score,0.972525,0.972525,0.972525
mean_train_score,1,1,1
param_colsample_bylevel,1,1,1
param_colsample_bytree,1,1,1
param_gamma,0,0,0
param_learning_rate,0.05,0.05,0.05
param_max_delta_step,1,1,1
param_max_depth,6,6,6


In [50]:
selected_xgbc = XGBClassifier(learning_rate=0.05, max_depth=6, n_estimators=500, subsample=0.9)

In [51]:
from sklearn.model_selection import train_test_split

In [52]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1)

In [53]:
selected_xgbc.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.05, max_delta_step=0,
       max_depth=6, min_child_weight=1, missing=None, n_estimators=500,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=0.9)

In [54]:
selected_xgbc.score(X_train, y_train)

1.0

In [55]:
selected_xgbc.score(X_test, y_test)

0.9850427350427351

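That's an improvement over the cross-validation scores of the default model, although it comes from a single random 10% hold-out split, so the gain should be read as indicative rather than conclusive. This model can now be used to predict the classes of customers.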
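
As a rough check on how precisely a single hold-out split pins down the accuracy, the sketch below computes a normal-approximation 95% interval for the test score. The test-set size of 936 is an assumption derived from the 9354 rows reported by `data.describe()` and `test_size=0.1` (rounded up); the interval formula is a standard approximation, not something computed in this notebook.

```python
import math

n_total = 9354                     # row count from data.describe() above
n_test = math.ceil(0.1 * n_total)  # assumed hold-out size for test_size=0.1
accuracy = 0.9850427350427351      # hold-out score printed above

# Standard error of a proportion, then a normal-approximation 95% interval.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
low, high = accuracy - 1.96 * std_err, accuracy + 1.96 * std_err

print(f"n_test = {n_test}, errors = {round((1 - accuracy) * n_test)}")
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from about 0.977 to 0.993, which overlaps the upper range of the cross-validation fold scores, so a repeated or stratified split would give a firmer comparison.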