# Prediction of customers' travel pattern

# Content

- Loading data.
- Model Building.
- Dealing with the imbalanced class problem.
- Choosing most relevant features.
- Using train-test split and also K-fold approaches for model evaluation, performance and validation.

# 1)-Importing key modules

In [1]:
import warnings
warnings.filterwarnings('ignore')
# For processing
import pandas as pd
import numpy as np
import scipy
import datetime as dt
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt
from pprint import pprint
%matplotlib inline

In [2]:
# For modeling building and validation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [4]:
# for evaluation

from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

# 2)-Loading and processing data

Loading from notebook2

In [6]:
df = pd.read_csv('selected_feature.csv')
df.shape

(45805, 515)

In [7]:
df.head()

Unnamed: 0,event_type,distance,num_family,len_jour,ts_hour,origin_ADB,origin_ADL,origin_AER,origin_AGP,origin_AKL,...,dest_YEG,dest_YMQ,dest_YOW,dest_YTO,dest_YUL,dest_YVR,dest_YWG,dest_YYC,dest_YYZ,dest_ZRH
0,0,5834.154716,7,6.0,11,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,6525.926149,4,21.0,20,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,469.781624,2,3.0,23,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1498.817537,1,3.0,15,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,2921.339028,4,6.0,22,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
df_model=df.copy()

In [9]:
df_model.head()

Unnamed: 0,event_type,distance,num_family,len_jour,ts_hour,origin_ADB,origin_ADL,origin_AER,origin_AGP,origin_AKL,...,dest_YEG,dest_YMQ,dest_YOW,dest_YTO,dest_YUL,dest_YVR,dest_YWG,dest_YYC,dest_YYZ,dest_ZRH
0,0,5834.154716,7,6.0,11,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,6525.926149,4,21.0,20,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,469.781624,2,3.0,23,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1498.817537,1,3.0,15,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,2921.339028,4,6.0,22,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Splitting data into dependent and independent features

In [10]:
target=df_model["event_type"]
features=df_model.drop(['event_type'], axis=1)

In [11]:
print(target.shape)
print(features.shape)

(45805,)
(45805, 514)


In [12]:
features.head(2)

Unnamed: 0,distance,num_family,len_jour,ts_hour,origin_ADB,origin_ADL,origin_AER,origin_AGP,origin_AKL,origin_ALA,...,dest_YEG,dest_YMQ,dest_YOW,dest_YTO,dest_YUL,dest_YVR,dest_YWG,dest_YYC,dest_YYZ,dest_ZRH
0,5834.154716,7,6.0,11,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,6525.926149,4,21.0,20,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
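
The `Unnamed: 0` column in the output above looks like a saved row index that was read back as an ordinary column, which would make it one of the 514 features. A minimal sketch, assuming that is what happened, of keeping it out of the feature matrix. The tiny inline CSV is made up for illustration:

```python
import io
import pandas as pd

# Hypothetical three-row CSV that mimics a file saved together with its index.
csv_text = """,event_type,distance,num_family
0,0,5834.15,7
1,1,6525.93,4
2,1,469.78,2
"""

# Default read: the unnamed first column becomes a column called "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats that column as the row index instead of a feature.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

features_raw = raw.drop(columns=["event_type"])
features_clean = clean.drop(columns=["event_type"])
print(list(features_raw.columns))
print(list(features_clean.columns))
```

If the column really is a row index, leaving it in gives the model a feature that only encodes row order.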


### normalize data

We standardize the features (zero mean, unit variance) to deal with the outlier issue seen in the last notebook.

In [13]:
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(features)
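
The cell above fits the scaler on all rows before the split, so statistics from the test rows influence the training features. One common alternative, sketched here on made-up data (`X_demo` and `y_demo` are illustrative names, not the notebook's data), fits the scaler on the training split only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for the real feature matrix.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Learn mean and standard deviation from the training rows only...
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
# ...and reuse those statistics for the test rows.
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))
print(X_tr_scaled.std(axis=0).round(6))
```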

# 3)-Model Building

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=0)

In [15]:
print(X_train.shape)
print(X_test.shape)

(32063, 514)
(13742, 514)


In [16]:
print(y_train.shape)
print(y_test.shape)

(32063,)
(13742,)


In [17]:
X_train

array([[-0.35177902,  1.0152115 , -0.61244991, ..., -0.03371257,
        -0.03135908, -0.08173844],
       [-0.50221793,  0.15374868, -0.15216897, ..., -0.03371257,
        -0.03135908, -0.08173844],
       [-0.69762774, -0.70771413, -0.21792339, ..., -0.03371257,
        -0.03135908, -0.08173844],
       ...,
       [ 1.98983893,  1.0152115 ,  3.53007847, ..., -0.03371257,
        -0.03135908, -0.08173844],
       [-0.50391064, -0.70771413,  0.76839289, ..., -0.03371257,
        -0.03135908, -0.08173844],
       [-0.73094009,  0.15374868, -0.34943223, ..., -0.03371257,
        -0.03135908, -0.08173844]])

In [18]:
# Logistic classifier
logreg = LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)
predictions_LR = logreg.predict(X_test)

In [19]:
predictions_LR[:5]

array([0, 0, 0, 0, 0])

In [20]:
print(accuracy_score(y_test,predictions_LR))

0.9598311745015282


In [21]:
print(roc_auc_score(y_test, predictions_LR))

0.4998931390669304


In [22]:
print(classification_report(y_test,predictions_LR))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98     13217
           1       0.03      0.00      0.00       525

    accuracy                           0.96     13742
   macro avg       0.50      0.50      0.49     13742
weighted avg       0.93      0.96      0.94     13742



We can clearly see the imbalanced-class problem. For booking events (class 1) the precision is only 0.03, whereas for search events (class 0) it is 0.96. We need to solve this, because we want a model that performs more precisely on booking instances than on search instances.

The accuracy looks very good, but it has little value in our case. It also shows that accuracy is not always a good evaluation metric for model performance.
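
To make the point about accuracy concrete, a short sketch that compares the printed accuracy with a trivial baseline that always predicts "search". The class counts are taken from the classification report above; the variable names are illustrative:

```python
# Test-set class counts from the classification report above.
n_search = 13217
n_booking = 525
model_accuracy = 0.9598311745015282  # accuracy printed by the notebook

# A "model" that always predicts search (class 0) is right on every search row
# and wrong on every booking row.
baseline_accuracy = n_search / (n_search + n_booking)

print(f"always-search baseline: {baseline_accuracy:.4f}")
print(f"logistic regression:    {model_accuracy:.4f}")
```

The baseline scores slightly higher than the fitted model, which is why accuracy alone says little here.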

### Confusion matrix

In [23]:
C = confusion_matrix(y_test, predictions_LR)

In [24]:
C

array([[13189,    28],
       [  524,     1]])

This model predicts only 1 booking instance correctly, whereas 524 booking instances are incorrectly predicted as search instances. This is really bad for our case: we need a model that predicts booking instances correctly. Even though the accuracy is 96%, we fail to get the results we need.

### Precision matrix

In [25]:
P =(C/C.sum(axis=0))
P

array([[0.96178808, 0.96551724],
       [0.03821192, 0.03448276]])

### Recall Matrix

In [26]:
A =(((C.T)/(C.sum(axis=1))).T)
A

array([[0.99788152, 0.00211848],
       [0.99809524, 0.00190476]])
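
As a cross-check, the booking-class precision and recall can be read straight off the confusion matrix. A minimal sketch using the matrix printed above (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix from the notebook output: rows = true class, columns = predicted class.
C = np.array([[13189, 28],
              [524, 1]])

tn, fp, fn, tp = C.ravel()

# Precision: of everything predicted as booking, how much really was a booking.
precision_booking = tp / (tp + fp)
# Recall: of all real bookings, how many the model found.
recall_booking = tp / (tp + fn)

print(f"precision (booking): {precision_booking:.5f}")
print(f"recall (booking):    {recall_booking:.5f}")
```

These two numbers are the bottom-right entries of the precision and recall matrices above.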

# 4)- Solution to imbalanced Class

### Solution1: Under-sampling method

In [27]:
booking = df_model[df_model['event_type']==1]

search = df_model[df_model['event_type']==0]

In [28]:
target=df_model["event_type"]
features=df_model.drop(['event_type'], axis=1)
X = StandardScaler().fit_transform(features)
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=0)

In [29]:
from imblearn.under_sampling import NearMiss
# Under-sampling the majority class to handle the imbalance
nm = NearMiss()
X_under, y_under = nm.fit_resample(X_train, y_train)

In [30]:
print(X_under.shape,y_under.shape)

(2566, 514) (2566,)


In [31]:
# checking classes
print(y_train.value_counts())

0    30780
1     1283
Name: event_type, dtype: int64


In [32]:
print(y_under.value_counts())

1    1283
0    1283
Name: event_type, dtype: int64


In [33]:
print(y_test.value_counts())

0    13217
1      525
Name: event_type, dtype: int64


In [34]:
y_test.shape

(13742,)

In [35]:
# Logistic classifier
logreg = LogisticRegression()
logreg.fit(X_under, y_under)
predictions_LR_under = logreg.predict(X_test)

In [36]:
print(accuracy_score(predictions_LR_under,y_test))

0.22165623635569787


In [37]:
print(classification_report(predictions_LR_under,y_test))

              precision    recall  f1-score   support

           0       0.20      0.95      0.33      2793
           1       0.74      0.04      0.07     10949

    accuracy                           0.22     13742
   macro avg       0.47      0.49      0.20     13742
weighted avg       0.63      0.22      0.12     13742



In [38]:
print(roc_auc_score(predictions_LR_under,y_test))

0.49341759863955387


In [39]:
print(recall_score(predictions_LR_under,y_test))

0.03552835875422413


In [40]:
print(f1_score(predictions_LR_under,y_test))

0.06780547324385568


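We get better results for booking events than with our first model. Note that the metric calls above pass the predictions first, so the precision and recall columns are swapped relative to scikit-learn's `(y_true, y_pred)` convention: the 0.74 printed as precision for class 1 is the recall on bookings, and the 0.04 printed as recall is the precision. Let's see what other solutions we have in the bag.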
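
scikit-learn's metric functions expect the true labels first, `(y_true, y_pred)`. A small sketch on made-up labels of what changes when the two arguments are swapped:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: two real positives, four predicted positives.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 1, 1, 1, 1, 0]

# Documented order: true labels first, predictions second.
precision_ok = precision_score(y_true, y_pred)
recall_ok = recall_score(y_true, y_pred)

# Swapped order: the predictions are treated as if they were the truth.
precision_swapped = precision_score(y_pred, y_true)
recall_swapped = recall_score(y_pred, y_true)

print(precision_ok, recall_ok)
print(precision_swapped, recall_swapped)
```

Swapping the arguments exchanges precision and recall, while accuracy and f1 for the positive class stay the same.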

### Solution 2: Over-Sampling

Over-sampling adds artificial samples to our training data. Note that we do not over-sample the test data: that would be a mistake, because the model should learn only from the training data and be evaluated on untouched test data.
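
A minimal sketch of that rule on made-up data. It uses plain random over-sampling with NumPy as a simpler stand-in for SMOTETomek (which synthesizes new points rather than duplicating rows); `random_oversample` is a hypothetical helper, not a library function:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]

# Made-up imbalanced data: roughly 5% positives.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 4))
y_demo = (rng.random(1000) < 0.05).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Only the training split is resampled; the test split keeps its natural ratio.
X_tr_over, y_tr_over = random_oversample(X_tr, y_tr)

print(np.bincount(y_tr_over))
print(np.bincount(y_te))
```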

In [41]:
from imblearn.combine import SMOTETomek

In [42]:
# Over-sampling the training data to handle the imbalance
smk = SMOTETomek(random_state=42)
X_over, y_over = smk.fit_resample(X_train, y_train)

In [43]:
X_over.shape,y_over.shape

((60868, 514), (60868,))

In [44]:
X_test.shape, y_test.shape

((13742, 514), (13742,))

In [45]:
logreg = LogisticRegression(C=1e5)
logreg.fit(X_over, y_over)
predictions_LR = logreg.predict(X_test)

In [46]:
print(accuracy_score(predictions_LR,y_test))

0.5406782127783437


In [47]:
print(roc_auc_score(predictions_LR,y_test))

0.5092671954659488


In [48]:
print(classification_report(predictions_LR,y_test))

              precision    recall  f1-score   support

           0       0.54      0.97      0.69      7339
           1       0.59      0.05      0.09      6403

    accuracy                           0.54     13742
   macro avg       0.56      0.51      0.39     13742
weighted avg       0.56      0.54      0.41     13742



We can see that balancing the classes gives more balanced precision results. Recall and f1 are still not very impressive, though.

### Solution 3: Using class_weight option as "balanced" 


Here the model itself reweights the classes instead of resampling the data.

In [49]:
from sklearn.utils import class_weight

In [50]:
class_weights = class_weight.compute_class_weight('balanced',
                                                 classes=np.unique(y_train),
                                                 y=y_train)
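
For reference, scikit-learn documents the "balanced" heuristic as `n_samples / (n_classes * np.bincount(y))`. A short sketch applying that formula to the training class counts printed earlier (30780 search, 1283 booking):

```python
import numpy as np

# Training-set class counts printed earlier in the notebook.
counts = np.array([30780, 1283])  # [search, booking]

n_samples = counts.sum()
n_classes = len(counts)

# "balanced" heuristic: the rarer a class, the larger its weight.
weights = n_samples / (n_classes * counts)

print(weights.round(4))
# Each class ends up with the same total weight.
print(weights * counts)
```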

In [51]:
logreg = LogisticRegression(class_weight="balanced")

In [52]:
logreg.fit(X_train, y_train)
predictions_LR = logreg.predict(X_test)

In [53]:
print(accuracy_score(predictions_LR,y_test))

0.5813564255566875


In [54]:
print(roc_auc_score(predictions_LR,y_test))

0.5119320419309547


In [55]:
print(classification_report(predictions_LR,y_test))

              precision    recall  f1-score   support

           0       0.58      0.97      0.73      7908
           1       0.58      0.05      0.10      5834

    accuracy                           0.58     13742
   macro avg       0.58      0.51      0.41     13742
weighted avg       0.58      0.58      0.46     13742



We have three solutions. Over-sampling with SMOTE and the model's class-balance option give almost the same results. With the model's class-balance option, the imbalance does not seem like a big issue. However, I am going to choose the under-sampling approach.

### Why did we choose under sampling?

There are three reasons:

- 1)- Our task is not to build a model that predicts both the search and booking classes accurately. Our task is to estimate conversion likelihood. The idea is to create a model that predicts booking events very well; if it is less good at predicting search events, we do not mind.

- 2)- As per my first interaction, our organization has enough data, so losing some samples is not a big deal. I consider this a first prototype round: in the initial stage we work with less data, and later we scale the project up. With more data, this code should carry over to the scaled dataset. The main concern is to deal with the imbalanced classes and still get good results for booking events, which is why we can build our model on Solution 1. I have included the other solutions to show that I am aware of them.

- 3)- If we apply more intensive ML or DL models, over-sampling will take a lot of computing. If that is valuable, I would not mind the high computation cost, but in this assignment we can get pretty good results with under-sampling as well.

# 4)- Choosing features

Three features vs. six features

In [56]:
y=df_model["event_type"]
features=df_model[["distance","num_family","len_jour","ts_hour"]]
X = StandardScaler().fit_transform(features)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
nm = NearMiss()
X_under, y_under = nm.fit_resample(X_train, y_train)

### 4.1)- Using statsmodels for Logistic Regression

In [57]:
import statsmodels.api as sm
logistic= sm.Logit(y_train,sm.add_constant(X_train)).fit()
logistic.summary2()

Optimization terminated successfully.
         Current function value: 0.167024
         Iterations 8


0,1,2,3
Model:,Logit,Pseudo R-squared:,0.006
Dependent Variable:,event_type,AIC:,10720.5885
Date:,2020-03-28 23:26,BIC:,10762.4658
No. Observations:,32063,Log-Likelihood:,-5355.3
Df Model:,4,LL-Null:,-5386.3
Df Residuals:,32058,LLR p-value:,1.0759e-12
Converged:,1.0000,Scale:,1.0
No. Iterations:,8.0000,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
const,-3.2045,0.0293,-109.2954,0.0000,-3.2620,-3.1471
x1,-0.2063,0.0360,-5.7304,0.0000,-0.2769,-0.1357
x2,0.0515,0.0270,1.9099,0.0561,-0.0013,0.1044
x3,-0.0825,0.0398,-2.0720,0.0383,-0.1605,-0.0045
x4,0.0052,0.0286,0.1824,0.8553,-0.0508,0.0613


- ts_hour (x4) is not significant (p = 0.855), so we can remove it from the equation.
- num_family (x2) is marginally significant (p = 0.056), so we shall keep it for the analysis.
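
The summary labels the columns `x1` to `x4` because a NumPy array was passed in. A small sketch that maps them back to the feature names in the order used when building `features`, and applies the cut-offs discussed above. The 0.10 "marginal" threshold is an assumption made for illustration:

```python
# Feature order used when building `features`, so x1..x4 follow this order.
feature_names = ["distance", "num_family", "len_jour", "ts_hour"]
# P>|z| values copied from the summary table above.
p_values = [0.0000, 0.0561, 0.0383, 0.8553]

alpha = 0.05     # conventional significance level
marginal = 0.10  # looser cut-off, assumed here for "marginally significant"

labels = {}
for name, p in zip(feature_names, p_values):
    if p < alpha:
        labels[name] = "significant"
    elif p < marginal:
        labels[name] = "marginal"
    else:
        labels[name] = "not significant"

for name, label in labels.items():
    print(f"{name}: {label}")
```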

### 4.2)- Using sklearn for Logistic Regression

In [58]:
logreg = LogisticRegression()
logreg.fit(X_under, y_under)
predictions_LR = logreg.predict(X_test)

In [59]:
print(accuracy_score(predictions_LR,y_test))

0.24465143356134478


In [60]:
print(classification_report(predictions_LR,y_test))

              precision    recall  f1-score   support

           0       0.23      0.94      0.37      3195
           1       0.66      0.03      0.06     10547

    accuracy                           0.24     13742
   macro avg       0.44      0.49      0.22     13742
weighted avg       0.56      0.24      0.13     13742



The printed precision for booking events (0.66) is lower than in our six-feature model (0.74). This shows that the categorical variables (origin and destination) improved performance, so we cannot leave them aside.

### 4.3)- Applying with removal of insignificant variable


Using statsmodels

In [61]:
y=df_model["event_type"]
features=df_model[["distance","num_family","len_jour"]]
X = StandardScaler().fit_transform(features)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
nm = NearMiss()
X_under, y_under = nm.fit_resample(X_train, y_train)

In [62]:
logistic= sm.Logit(y_train,sm.add_constant(X_train)).fit()
logistic.summary2()

Optimization terminated successfully.
         Current function value: 0.167025
         Iterations 8


0,1,2,3
Model:,Logit,Pseudo R-squared:,0.006
Dependent Variable:,event_type,AIC:,10718.6218
Date:,2020-03-28 23:27,BIC:,10752.1236
No. Observations:,32063,Log-Likelihood:,-5355.3
Df Model:,3,LL-Null:,-5386.3
Df Residuals:,32059,LLR p-value:,2.1806e-13
Converged:,1.0000,Scale:,1.0
No. Iterations:,8.0000,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
const,-3.2045,0.0293,-109.2960,0.0000,-3.2620,-3.1470
x1,-0.2063,0.0360,-5.7297,0.0000,-0.2768,-0.1357
x2,0.0515,0.0270,1.9097,0.0562,-0.0014,0.1044
x3,-0.0826,0.0398,-2.0751,0.0380,-0.1606,-0.0046


### 4.4)- Using sklearn

In [63]:
logreg = LogisticRegression()
logreg.fit(X_under, y_under)
predictions_LR = logreg.predict(X_test)

In [64]:
print(accuracy_score(predictions_LR,y_test))

0.24043079609954882


In [65]:
print(classification_report(predictions_LR,y_test))

              precision    recall  f1-score   support

           0       0.22      0.94      0.36      3141
           1       0.66      0.03      0.06     10601

    accuracy                           0.24     13742
   macro avg       0.44      0.49      0.21     13742
weighted avg       0.56      0.24      0.13     13742



The results are not identical but are very similar: removing one variable has not made much impact.

Still, we will drop this three-feature approach and continue with the five-feature model. I am taking the liberty NOT to confine my model to three features as directed in the assignment. This is the only time I am deviating from the assignment's instructions.

# 5)- Update Dataset

with five features in total

In [66]:
df.head()

Unnamed: 0,event_type,distance,num_family,len_jour,ts_hour,origin_ADB,origin_ADL,origin_AER,origin_AGP,origin_AKL,...,dest_YEG,dest_YMQ,dest_YOW,dest_YTO,dest_YUL,dest_YVR,dest_YWG,dest_YYC,dest_YYZ,dest_ZRH
0,0,5834.154716,7,6.0,11,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,6525.926149,4,21.0,20,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,469.781624,2,3.0,23,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1498.817537,1,3.0,15,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,2921.339028,4,6.0,22,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [67]:
df.drop(['ts_hour'], axis=1, inplace=True)

In [68]:
df.to_csv('updated_feature.csv',index=False)

# 6)- K-Fold 

Using class-balanced training data from NearMiss under-sampling in each fold

In [71]:
from sklearn import metrics
df_model = pd.read_csv('updated_feature.csv')
y=df_model["event_type"]
features=df_model.drop(['event_type'], axis=1)
X = StandardScaler().fit_transform(features)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [72]:
kf = KFold(n_splits=5)

for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    # Split features and labels for this fold
    X_train = X[train_index]
    y_train = y.iloc[train_index]
    X_test = X[test_index]
    y_test = y.iloc[test_index]
    # Under-sample the training fold only; the test fold keeps its original class ratio
    nm = NearMiss()
    X_under, y_under = nm.fit_resample(X_train, y_train)
    model = LogisticRegression()
    model.fit(X_under, y_under)
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_pred, y_test)}')
    print(f'roc-auc: {roc_auc_score(y_pred, y_test)}')
    print(f'recall: {recall_score(y_pred, y_test)}')
    print(f'precision: {metrics.precision_score(y_pred, y_test)}')

For fold 1:
Accuracy: 0.1945202488811265
f-score: 0.31326198231735686
roc-auc: 0.5125469273921022
recall: 0.18611080393674664
precision: 0.9888366627497063
For fold 2:
Accuracy: 0.2025979696539679
f-score: 0.020908725371934056
roc-auc: 0.49891357156652105
recall: 0.010597826086956521
precision: 0.7722772277227723
For fold 3:
Accuracy: 0.18404104355419715
f-score: 0.0
roc-auc: 0.5
recall: 0.0
precision: 0.0
For fold 4:
Accuracy: 0.19211876432703853
f-score: 0.00027016074564365796
roc-auc: 0.5000675493109971
recall: 0.00013509862199405565
precision: 1.0
For fold 5:
Accuracy: 0.19069970527235017
f-score: 0.0008086253369272238
roc-auc: 0.4999157324208789
recall: 0.0004045307443365696
precision: 0.75


### cross validation done wrong

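- Thanks to the tutorial at https://www.youtube.com/watch?v=DQC_YE3I5ig, I have learned the right and the wrong way to apply K-fold together with resampling. The snippet below shows the wrong way: SMOTE is applied before the folds are created, so synthetic samples derived from the training rows end up in the validation folds.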

from imblearn.over_sampling import SMOTE
from sklearn.pipeline import make_pipeline
from sklearn.metrics import precision_score

kf = KFold(n_splits=5, shuffle=True, random_state=42)
accuracy = []
precision = []
recall = []
f1 = []
# Wrong: SMOTE is applied before the folds are created
X, y = SMOTE().fit_resample(X_train, y_train)
for train, test in kf.split(X, y):
    # `classifier` is a placeholder for any estimator class
    pipeline = make_pipeline(classifier(random_state=42))
    model = pipeline.fit(X[train], y[train])
    prediction = model.predict(X[test])
    accuracy.append(pipeline.score(X[test], y[test]))
    precision.append(precision_score(y[test], prediction))
    recall.append(recall_score(y[test], prediction))
    f1.append(f1_score(y[test], prediction))

print("done wrong mean of scores 5-fold:")
print("accuracy: {}".format(np.mean(accuracy)))
print("precision: {}".format(np.mean(precision)))
print("recall: {}".format(np.mean(recall)))
print("f1: {}".format(np.mean(f1)))
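
For contrast with the snippet above, a minimal sketch of resampling inside each fold, so that every validation fold keeps its original class ratio. It runs on made-up data, uses `StratifiedKFold` so each fold contains bookings, and uses plain random under-sampling as a simpler stand-in for NearMiss; `random_undersample` is a hypothetical helper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedKFold

def random_undersample(X, y, seed=0):
    """Keep an equally sized random subset of rows from every class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

# Made-up imbalanced data (roughly 5% positives) driven by the first feature.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(2000, 5))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=2000) > 1.8).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
precisions, recalls, test_rates = [], [], []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # Balance the training fold only.
    X_bal, y_bal = random_undersample(X_demo[train_idx], y_demo[train_idx])
    model = LogisticRegression().fit(X_bal, y_bal)
    y_pred = model.predict(X_demo[test_idx])
    # True labels first, predictions second.
    precisions.append(precision_score(y_demo[test_idx], y_pred, zero_division=0))
    recalls.append(recall_score(y_demo[test_idx], y_pred, zero_division=0))
    test_rates.append(y_demo[test_idx].mean())

print("mean precision:", np.mean(precisions))
print("mean recall:", np.mean(recalls))
print("positive rate per test fold:", np.round(test_rates, 3))
```

Because the validation folds are never resampled, the scores reflect the class ratio the model would face in practice.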

**Finally,** we get a precision score of 0.75 in the last fold. These results are also very close to our earlier six-feature model. But should logistic regression be the only model? We shall try more ML and DL models in the next notebook to look for better performance. To summarize:

- Our dataset will have five features, i.e. 3 numeric and 2 categorical.
- We shall use the under-sampling method, for the reasons given above.
- We shall try and compare the results of the logistic regression classifier with other models.
- Our evaluation metric will be precision.

**END OF NOTEBOOK3**