# Prediction of customers' travel pattern

- https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/

- https://towardsdatascience.com/predicting-hotel-bookings-with-user-search-parameters-8c570ab24805

# 1)-Importing key modules

In [1]:
import warnings
warnings.filterwarnings('ignore')
# For processing
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scipy
import datetime as dt
from datetime import datetime
import seaborn as sns
plt.rcParams["figure.figsize"] = (16, 10)
plt.rcParams["xtick.labelsize"] = 10
plt.figure(figsize=(16,10)) # creates a figure 16 inches wide and 10 inches high
from pprint import pprint
%matplotlib inline
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
# For model building and tuning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
#from sklearn.preprocessing import Imputer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [3]:
# For deep learning, if time permits

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.utils import to_categorical

Using TensorFlow backend.


In [4]:
# for evaluation

from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [5]:
from datetime import date
import datetime as dt

# 2)-Loading data

In [6]:
df = pd.read_csv('model_data.csv')
df.shape

(45805, 8)

In [7]:
df.columns

Index(['event_type', 'origin', 'destination', 'distance', 'num_family',
       'ts_datetime', 'len_jour', 'ts_hour'],
      dtype='object')

In [8]:
df.head()

Unnamed: 0,event_type,origin,destination,distance,num_family,ts_datetime,len_jour,ts_hour
0,search,PAR,NYC,5834.154716,7,2017-04-27 11:06:51,6.0,11
1,book,FRA,WAS,6525.926149,4,2017-04-27 20:15:27,21.0,20
2,book,BER,CGN,469.781624,2,2017-04-27 23:03:43,3.0,23
3,book,BER,BCN,1498.817537,1,2017-04-27 15:17:50,3.0,15
4,book,DEL,BKK,2921.339028,4,2017-04-27 22:51:57,6.0,22


### a. creating 3-feature dataset

In [9]:
df_three_feat = df[["event_type","distance","num_family","len_jour"]].copy()  # copy so the later column assignment does not write to a view of df

In [10]:
df_three_feat.head(3)

Unnamed: 0,event_type,distance,num_family,len_jour
0,search,5834.154716,7,6.0
1,book,6525.926149,4,21.0
2,book,469.781624,2,3.0


In [11]:
df_three_feat['event_type'] = df_three_feat.event_type.map({'search':0, 'book':1})
df_three_feat.head(3)

Unnamed: 0,event_type,distance,num_family,len_jour
0,0,5834.154716,7,6.0
1,1,6525.926149,4,21.0
2,1,469.781624,2,3.0


### b. Creating all feature dataset

In [12]:
df_all = pd.read_csv('all_features.csv')
df_all.shape

(45805, 518)

In [13]:
df_all.head(3)

Unnamed: 0,event_type,origin,destination,distance,num_family,ts_datetime,len_jour,ts_hour,origin_ADB,origin_ADL,...,dest_YEG,dest_YMQ,dest_YOW,dest_YTO,dest_YUL,dest_YVR,dest_YWG,dest_YYC,dest_YYZ,dest_ZRH
0,0,PAR,NYC,5834.154716,7,2017-04-27 11:06:51,6.0,11,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,FRA,WAS,6525.926149,4,2017-04-27 20:15:27,21.0,20,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,BER,CGN,469.781624,2,2017-04-27 23:03:43,3.0,23,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
df_all_feat=df_all.drop(['origin','destination','ts_datetime'], axis=1)

In [15]:
df_all_feat.head(3)

Unnamed: 0,event_type,distance,num_family,len_jour,ts_hour,origin_ADB,origin_ADL,origin_AER,origin_AGP,origin_AKL,...,dest_YEG,dest_YMQ,dest_YOW,dest_YTO,dest_YUL,dest_YVR,dest_YWG,dest_YYC,dest_YYZ,dest_ZRH
0,0,5834.154716,7,6.0,11,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,6525.926149,4,21.0,20,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,469.781624,2,3.0,23,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
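
The `origin_*` and `dest_*` columns above are one-hot indicators for the airport codes. How `all_features.csv` was actually produced is not shown in this notebook; the sketch below is only one plausible way to get columns of that shape, using `pd.get_dummies` on a tiny made-up frame (the `raw` frame and its values are assumptions for illustration).

```python
import pandas as pd

# Tiny stand-in for the raw data; the real frame has 45,805 rows.
raw = pd.DataFrame({
    "event_type": ["search", "book", "book"],
    "origin": ["PAR", "FRA", "BER"],
    "destination": ["NYC", "WAS", "CGN"],
    "distance": [5834.15, 6525.93, 469.78],
})

# One indicator column per airport code, prefixed like the columns above.
dummies = pd.get_dummies(
    raw[["origin", "destination"]], prefix=["origin", "dest"], dtype=int
)

# Keep the original columns next to the indicators.
encoded = pd.concat([raw, dummies], axis=1)
print(encoded.columns.tolist())
```

The string columns (`origin`, `destination`) can then be dropped before modeling, as done in the cell above.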


Both datasets are now aligned, so let's turn to our problem of imbalanced classes.

In [16]:
df_three_feat.event_type.value_counts()

0    43997
1     1808
Name: event_type, dtype: int64

In [17]:
df_all_feat.event_type.value_counts(normalize=True)

0    0.960528
1    0.039472
Name: event_type, dtype: float64

This is the problem: only 3.9% of our data belongs to the booking class.
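
To see why this matters, here is a small sketch using the class counts printed above. It shows that a "model" which always predicts `search` already reaches about 96% accuracy while finding no bookings at all, so accuracy alone tells us very little here.

```python
# Class counts copied from the value_counts() output above.
n_search, n_book = 43997, 1808
total = n_search + n_book

# A "model" that always predicts the majority class (search = 0).
baseline_accuracy = n_search / total   # fraction of rows it gets right
booking_recall = 0 / n_book            # it never identifies a booking

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"booking recall:    {booking_recall:.4f}")
```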

# 3)- Solution1: Under-sampling method

In [18]:
booking_indices = df_three_feat[df_three_feat.event_type == 1].index
random_indices = np.random.choice(booking_indices, len(df_three_feat.loc[df_three_feat.event_type == 1]), replace=False)
booking_sample = df_three_feat.loc[random_indices]

In [19]:
booking_sample

Unnamed: 0,event_type,distance,num_family,len_jour
5557,1,8371.777081,1,7.0
3461,1,415.631777,1,0.0
785,1,2640.191537,1,8.0
6140,1,469.781624,1,6.0
8570,1,1334.259311,2,11.0
...,...,...,...,...
6262,1,614.303469,3,3.0
5548,1,1451.119217,1,4.0
2776,1,2531.250114,2,5.0
3566,1,1734.921234,1,1.0


In [20]:
not_booking = df_three_feat[df_three_feat.event_type == 0].index
random_indices = np.random.choice(not_booking, sum(df_three_feat['event_type']), replace=False)
not_booking_sample = df_three_feat.loc[random_indices]

In [21]:
not_booking_sample

Unnamed: 0,event_type,distance,num_family,len_jour
32184,0,2432.590282,2,28.0
18168,0,9058.618604,1,43.0
45177,0,5570.249097,1,29.0
20220,0,483.075576,4,6.0
41252,0,8722.933267,1,6.0
...,...,...,...,...
17925,0,1653.474736,1,4.0
35204,0,614.303469,4,2.0
32677,0,1183.187222,3,8.0
38633,0,1451.119217,1,2.0


In [22]:
imb_three_feat = pd.concat([not_booking_sample, booking_sample], axis=0)

In [23]:
print("Percentage of search clicks: ", len(imb_three_feat[imb_three_feat.event_type == 0])/len(imb_three_feat))
print("Percentage of booking clicks: ", len(imb_three_feat[imb_three_feat.event_type == 1])/len(imb_three_feat))
print("Total number of records in resampled data: ", len(imb_three_feat))

Percentage of search clicks:  0.5
Percentage of booking clicks:  0.5
Total number of records in resampled data:  3616
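
The manual steps above can also be written as one small helper. This is a sketch, not the code used for the results in this notebook: the function name `undersample` and the toy frame are made up, and the fixed `random_state` is an addition so that the sample is the same on every run (the cells above draw a different sample each time).

```python
import pandas as pd

def undersample(df, label_col, random_state=0):
    """Randomly drop rows so every class has as many rows as the rarest one."""
    n_min = int(df[label_col].value_counts().min())
    # Draw n_min rows per class; the fixed seed makes the draw repeatable.
    balanced = df.groupby(label_col, group_keys=False).sample(
        n=n_min, random_state=random_state
    )
    # Shuffle so the classes are not stored in blocks.
    return balanced.sample(frac=1, random_state=random_state)

# Toy data: 95 searches and 5 bookings.
toy = pd.DataFrame({"event_type": [0] * 95 + [1] * 5, "distance": range(100)})
balanced = undersample(toy, "event_type")
print(balanced["event_type"].value_counts())
```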


In [24]:
#save dataset
imb_three_feat.to_csv('undersample_my_feat.csv',index=False)

## Alternative way of handling imbalanced classes: using imblearn

In [25]:
df_imbalance = df[["event_type","distance","num_family","len_jour"]].copy()  # copy to avoid assigning into a view of df
df_imbalance['event_type'] = df_imbalance.event_type.map({'search':0, 'book':1})

In [26]:
df_imbalance.head(2)

Unnamed: 0,event_type,distance,num_family,len_jour
0,0,5834.154716,7,6.0
1,1,6525.926149,4,21.0


In [27]:
booking = df_imbalance[df_imbalance['event_type']==1]

search = df_imbalance[df_imbalance['event_type']==0]

In [28]:
print(booking.shape,search.shape)

(1808, 4) (43997, 4)


In [29]:
X=df_imbalance[['distance','num_family','len_jour']]
y=df_imbalance['event_type']

In [30]:
print(X.shape, y.shape)

(45805, 3) (45805,)


In [31]:
from imblearn.under_sampling import NearMiss
# Under-sample the majority class with NearMiss
nm = NearMiss()
X_under, y_under = nm.fit_resample(X, y)  # fit_sample was renamed to fit_resample in newer imblearn releases
X_under.shape,y_under.shape

((3616, 3), (3616,))

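This gives the same balanced size (3,616 rows) as the manual approach above, although NearMiss picks the majority rows by their distance to the minority class rather than at random.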
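
For intuition, here is a rough NumPy sketch of the NearMiss-1 idea: keep the majority rows whose mean distance to their `k` closest minority rows is smallest. This is not imblearn's implementation; the function name `near_miss_1` and the toy points are made up, and the full pairwise distance array only makes sense for small inputs like this one.

```python
import numpy as np

def near_miss_1(X, y, minority=1, k=3):
    """Keep majority rows with the smallest mean distance to their k nearest minority rows."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    is_min = y == minority
    X_min, X_maj = X[is_min], X[~is_min]
    # Pairwise Euclidean distances, shape (n_majority, n_minority).
    dists = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    # Mean distance from each majority row to its k closest minority rows.
    mean_k = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    # Keep as many majority rows as there are minority rows.
    keep = np.argsort(mean_k)[: len(X_min)]
    X_out = np.vstack([X_maj[keep], X_min])
    y_out = np.concatenate([y[~is_min][keep], y[is_min]])
    return X_out, y_out

# Three minority points near the origin, six majority points at mixed distances.
X_toy = np.array([[0, 0], [1, 0], [0, 1],
                  [0.5, 0.5], [10, 10], [0.2, 0.2], [20, 20], [1, 1], [30, 30]])
y_toy = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0])
X_bal, y_bal = near_miss_1(X_toy, y_toy)
print(X_bal.shape, np.bincount(y_bal))
```

Because the kept majority rows are the ones closest to the bookings, the resulting dataset can look quite different from a random under-sample of the same size.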

# 4)- Model Building

### 4.1)-Separate features

In [32]:
imb_three_feat.head(3)

Unnamed: 0,event_type,distance,num_family,len_jour
32184,0,2432.590282,2,28.0
18168,0,9058.618604,1,43.0
45177,0,5570.249097,1,29.0


In [33]:
imb_three_feat.event_type.value_counts()

1    1808
0    1808
Name: event_type, dtype: int64

In [34]:
target=imb_three_feat["event_type"]

In [35]:
features=imb_three_feat[["distance","num_family","len_jour"]]

In [36]:
print(target.shape)
print(features.shape)

(3616,)
(3616, 3)


### 4.2)-Normalize data

In [37]:
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(features)
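
Note that the cell above fits the scaler on all rows before the split, so the test rows influence the mean and variance used for scaling. A possible alternative, sketched below on synthetic data (the arrays and the noise settings are made up), is to put the scaler and the classifier in one pipeline so that scaling statistics come from the training rows only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three columns on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1000.0, 1.0, 10.0]
y = (X[:, 0] + rng.normal(scale=500.0, size=200) > 0).astype(int)

# Split first, then let the pipeline fit the scaler on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# The pipeline re-applies the training-set scaling to the test rows.
print(model.score(X_test, y_test))
```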

### 4.3)-train_test_split

In [38]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=0)

In [39]:
print(X_train.shape)
print(X_test.shape)

(2531, 3)
(1085, 3)


In [40]:
print(y_train.shape)
print(y_test.shape)

(2531,)
(1085,)
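
A plain random split does not guarantee that both classes appear in equal shares in the train and test sets. As an optional variation (not what was run above), `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. The toy arrays below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy balanced data: 50 rows per class.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 50 + [1] * 50)

# stratify=y preserves the 50/50 class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(np.bincount(y_train), np.bincount(y_test))
```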


In [41]:
X_train

array([[ 0.95178903,  0.1236728 , -0.08447997],
       [-0.30404627, -0.73467813, -0.3693094 ],
       [-0.86110937, -0.73467813,  8.88764715],
       ...,
       [-0.70337398, -0.73467813, -0.44051676],
       [-0.17340259,  0.98202373,  2.40777756],
       [-0.72351418, -0.73467813, -0.3693094 ]])

In [42]:
# Logistic classifier
logreg = LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)
predictions_LR = logreg.predict(X_test)

In [43]:
predictions_LR[:5]

array([1, 1, 1, 1, 1])

In [44]:
print(accuracy_score(y_test,predictions_LR))

0.5493087557603686


In [45]:
print(recall_score(y_test,predictions_LR))

0.8003731343283582


In [46]:
print(classification_report(y_test,predictions_LR))

              precision    recall  f1-score   support

           0       0.61      0.30      0.41       549
           1       0.53      0.80      0.64       536

    accuracy                           0.55      1085
   macro avg       0.57      0.55      0.52      1085
weighted avg       0.57      0.55      0.52      1085



And here we are: we have lower accuracy (54.93%) than our previous model (96.17%).

But we now get useful results for the booking class: recall for class 1 is 0.80.
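
The numbers in the classification report come from the four cells of the confusion matrix. The sketch below uses ten made-up labels (not our test set) to show how `confusion_matrix`, which we imported earlier, relates to precision and recall for the booking class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 4 searches (0) and 6 bookings (1).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)        # share of real bookings that were found
precision = tp / (tp + fp)     # share of predicted bookings that were real

print(tn, fp, fn, tp)
print(recall, recall_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
```

A model that predicts "book" very often, like the one in this toy example, tends to get high recall at the cost of precision.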

# 5)-Model with ALL_Features

In [47]:
df_all_feat.head(2)

Unnamed: 0,event_type,distance,num_family,len_jour,ts_hour,origin_ADB,origin_ADL,origin_AER,origin_AGP,origin_AKL,...,dest_YEG,dest_YMQ,dest_YOW,dest_YTO,dest_YUL,dest_YVR,dest_YWG,dest_YYC,dest_YYZ,dest_ZRH
0,0,5834.154716,7,6.0,11,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,6525.926149,4,21.0,20,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [48]:
booking_indices = df_all_feat[df_all_feat.event_type == 1].index
random_indices = np.random.choice(booking_indices, len(df_all_feat.loc[df_all_feat.event_type == 1]), replace=False)
booking_sample = df_all_feat.loc[random_indices]

In [49]:
booking_sample

Unnamed: 0,event_type,distance,num_family,len_jour,ts_hour,origin_ADB,origin_ADL,origin_AER,origin_AGP,origin_AKL,...,dest_YEG,dest_YMQ,dest_YOW,dest_YTO,dest_YUL,dest_YVR,dest_YWG,dest_YYC,dest_YYZ,dest_ZRH
6570,1,2024.995685,2,9.0,22,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5743,1,4569.053013,2,12.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1096,1,670.324963,2,0.0,14,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7249,1,614.303469,2,2.0,15,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
704,1,1801.713318,2,8.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9087,1,16954.354092,1,0.0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5395,1,1885.305826,1,0.0,16,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3734,1,2199.490725,2,7.0,15,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5767,1,1011.416675,1,0.0,13,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [50]:
search = df_all_feat[df_all_feat.event_type == 0].index
random_indices = np.random.choice(search, sum(df_all_feat['event_type']), replace=False)
search_sample = df_all_feat.loc[random_indices]

In [51]:
search_sample

Unnamed: 0,event_type,distance,num_family,len_jour,ts_hour,origin_ADB,origin_ADL,origin_AER,origin_AGP,origin_AKL,...,dest_YEG,dest_YMQ,dest_YOW,dest_YTO,dest_YUL,dest_YVR,dest_YWG,dest_YYC,dest_YYZ,dest_ZRH
25398,0,9295.936700,2,21.0,20,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
31237,0,1451.119217,7,7.0,12,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36677,0,1263.700579,2,4.0,22,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9976,0,2252.048902,1,19.0,18,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
280,0,1074.597942,1,4.0,21,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26171,0,9444.808919,1,18.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36594,0,1962.497137,1,13.0,16,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
22563,0,794.280490,2,14.0,16,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
22027,0,614.303469,1,0.0,22,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [52]:
df_imb_all_feat = pd.concat([search_sample, booking_sample], axis=0)

In [53]:
df_imb_all_feat.shape

(3616, 515)

In [54]:
print("Percentage of search clicks: ", len(df_imb_all_feat[df_imb_all_feat.event_type == 0])/len(df_imb_all_feat))
print("Percentage of booking clicks: ", len(df_imb_all_feat[df_imb_all_feat.event_type == 1])/len(df_imb_all_feat))
print("Total number of records in resampled data: ", len(df_imb_all_feat))

Percentage of search clicks:  0.5
Percentage of booking clicks:  0.5
Total number of records in resampled data:  3616


In [55]:
#save dataset
df_imb_all_feat.to_csv('undersample_all_feat.csv',index=False)

### 5.a)- Model Building for all feature dataset

In [56]:
target=df_imb_all_feat["event_type"]
features=df_imb_all_feat.drop(['event_type'], axis=1)

In [57]:
print(target.shape)
print(features.shape)

(3616,)
(3616, 514)


In [58]:
X = StandardScaler().fit_transform(features)

In [59]:
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=0)

In [60]:
print(X_train.shape)
print(X_test.shape)

(2531, 514)
(1085, 514)


In [61]:
print(y_train.shape)
print(y_test.shape)

(2531,)
(1085,)


In [62]:
# Logistic classifier
logreg = LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)
predictions_LR = logreg.predict(X_test)

### 5b. Evaluate Model


https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/

In [63]:
print(accuracy_score(y_test,predictions_LR))

0.5797235023041475


In [64]:
print(recall_score(y_test,predictions_LR))

0.5970149253731343


In [65]:
print(classification_report(y_test,predictions_LR))

              precision    recall  f1-score   support

           0       0.59      0.56      0.58       549
           1       0.57      0.60      0.58       536

    accuracy                           0.58      1085
   macro avg       0.58      0.58      0.58      1085
weighted avg       0.58      0.58      0.58      1085



**Results of two models**
- Model with three selected features:
    
Accuracy: 55% <br>
Precision: For 0 -> 61% , For 1 -> 53% <br>
Recall: For 0 -> 30% , For 1 -> 80% <br>
F1-score: For 0 -> 41% , For 1 -> 64% <br>


- Model with all features: see the classification report above (accuracy 58%, F1-score 58% for both classes).

**Which is the better of the two models using the under-sampling method?**

Without going into too much detail, I'll use the F1-score as the metric to judge performance. I'll discuss what the evaluation metrics mean for us in the next notebook. For now, <br>
- F1-score = (2 * Precision * Recall) / (Precision + Recall) <br>

It combines precision and recall into one number. The 3-feature model has the higher F1-score on the booking class (0.64 vs. 0.58), although the all-feature model is more balanced across the two classes (0.58 for both). So, I'll use the 3-feature model if we select "under-sampling" as our solution.
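
As a quick check of the formula, the sketch below recomputes the booking-class F1-score for both models from the rounded precision and recall values in the two classification reports. The helper name `f1` is made up; small differences from the reports are possible because the inputs are already rounded to two decimals.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Booking class (1), values read from the two classification reports.
three_feat_booking = f1(0.53, 0.80)
all_feat_booking = f1(0.57, 0.60)

print(round(three_feat_booking, 2))
print(round(all_feat_booking, 2))
```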

# 6)- Solution 2:

# 7)- Interpreting Evaluation part

**END OF NOTEBOOK3**