# Classification Methods: Predicting Hotel Cancellations

In this example, several classification algorithms are used to predict hotel booking cancellations, using the datasets provided by Antonio, Almeida and Nunes (2019). Attributions are provided below.

#### Attributions

The code below uses the [scikit-learn](https://github.com/scikit-learn/scikit-learn/blob/main/COPYING) library (Copyright (c) 2007-2023 The scikit-learn developers), which is provided under the BSD 3-Clause License.

Modifications have been made where appropriate for conducting analysis on the data specific to this example.

The copyright and permission notices are made available below:

Copyright (c) 2007-2023 The scikit-learn developers

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

### Python version

In [1]:
from platform import python_version
print(python_version())

3.8.10


## 1. Data Loading and Processing

### Import Libraries

In [2]:
import csv
import imblearn
import matplotlib.pyplot as plt
import numpy as np
from numpy.random import seed
seed(1)

import os
import pandas as pd
import random
import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from collections import Counter

### Import Data From AWS S3 to SageMaker

In [3]:
# import boto3
# import botocore
# from sagemaker import get_execution_role

# role = get_execution_role()

# bucket = 'enterbucketname'
# data_key_train = 'H1.csv'
# data_location_train = 's3://{}/{}'.format(bucket, data_key_train)

# train_df = pd.read_csv(data_location_train)

### Import Data From Azure Blob Storage to Azure Machine Learning Studio

In [4]:
# from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient

# #download csv file from Azure blob

# sas_url = "enter url here"
# blob_client = BlobClient.from_blob_url(sas_url)
# downloaded_blob = blob_client.download_blob()

In [5]:
# from io import StringIO
# blob_data = blob_client.download_blob()
# train_df = pd.read_csv(StringIO(blob_data.content_as_text()))
# print(train_df)

### Import Data Through CSV

In [6]:
train_df = pd.read_csv('H1.csv')
# Display the bookings ordered by arrival year and week number
train_df.sort_values(['ArrivalDateYear','ArrivalDateWeekNumber'], ascending=True)

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate
0,0,342,2015,July,27,1,0,0,2,0,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
1,0,737,2015,July,27,1,0,0,2,0,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
2,0,7,2015,July,27,1,0,1,1,0,...,No Deposit,,,0,Transient,75.00,0,0,Check-Out,2015-07-02
3,0,13,2015,July,27,1,0,1,1,0,...,No Deposit,304,,0,Transient,75.00,0,0,Check-Out,2015-07-02
4,0,14,2015,July,27,1,0,2,2,0,...,No Deposit,240,,0,Transient,98.00,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40055,0,212,2017,August,35,31,2,8,2,1,...,No Deposit,143,,0,Transient,89.75,0,0,Check-Out,2017-09-10
40056,0,169,2017,August,35,30,2,9,2,0,...,No Deposit,250,,0,Transient-Party,202.27,0,1,Check-Out,2017-09-10
40057,0,204,2017,August,35,29,4,10,2,0,...,No Deposit,250,,0,Transient,153.57,0,3,Check-Out,2017-09-12
40058,0,211,2017,August,35,31,4,10,2,0,...,No Deposit,40,,0,Contract,112.80,0,1,Check-Out,2017-09-14


In [7]:
IsCanceled = train_df['IsCanceled']
y = IsCanceled

### Numerical Variables

In [8]:
leadtime = train_df['LeadTime']
arrivaldateyear = train_df['ArrivalDateYear']
arrivaldateweekno = train_df['ArrivalDateWeekNumber']
arrivaldatedayofmonth = train_df['ArrivalDateDayOfMonth']
staysweekendnights = train_df['StaysInWeekendNights']
staysweeknights = train_df['StaysInWeekNights']
adults = train_df['Adults']
children = train_df['Children']
babies = train_df['Babies']
previouscancellations = train_df['PreviousCancellations']
previousbookingsnotcanceled = train_df['PreviousBookingsNotCanceled']
bookingchanges = train_df['BookingChanges']
dayswaitinglist = train_df['DaysInWaitingList']
adr = train_df['ADR']
rcps = train_df['RequiredCarParkingSpaces']
totalsqr = train_df['TotalOfSpecialRequests']

### Categorical Variables

In [9]:
arrivaldatemonth = train_df.ArrivalDateMonth.astype("category").cat.codes
arrivaldatemonthcat=pd.Series(arrivaldatemonth)
mealcat=train_df.Meal.astype("category").cat.codes
mealcat=pd.Series(mealcat)
countrycat=train_df.Country.astype("category").cat.codes
countrycat=pd.Series(countrycat)
marketsegmentcat=train_df.MarketSegment.astype("category").cat.codes
marketsegmentcat=pd.Series(marketsegmentcat)
distributionchannelcat=train_df.DistributionChannel.astype("category").cat.codes
distributionchannelcat=pd.Series(distributionchannelcat)
reservedroomtypecat=train_df.ReservedRoomType.astype("category").cat.codes
reservedroomtypecat=pd.Series(reservedroomtypecat)
assignedroomtypecat=train_df.AssignedRoomType.astype("category").cat.codes
assignedroomtypecat=pd.Series(assignedroomtypecat)
deposittypecat=train_df.DepositType.astype("category").cat.codes
deposittypecat=pd.Series(deposittypecat)
customertypecat=train_df.CustomerType.astype("category").cat.codes
customertypecat=pd.Series(customertypecat)
reservationstatuscat=train_df.ReservationStatus.astype("category").cat.codes
reservationstatuscat=pd.Series(reservationstatuscat)
isrepeatedguestcat = train_df.IsRepeatedGuest.astype("category").cat.codes
isrepeatedguestcat=pd.Series(isrepeatedguestcat)
agentcat = train_df.Agent.astype("category").cat.codes
agentcat=pd.Series(agentcat)
companycat = train_df.Company.astype("category").cat.codes
companycat=pd.Series(companycat)

In [10]:
# Stack all candidate features column-wise; this order fixes the index of each feature importance
x = np.column_stack((
    leadtime, arrivaldateyear, arrivaldatemonthcat, arrivaldateweekno, arrivaldatedayofmonth,
    staysweekendnights, staysweeknights, adults, children, babies, mealcat, countrycat,
    marketsegmentcat, distributionchannelcat, isrepeatedguestcat, previouscancellations,
    previousbookingsnotcanceled, reservedroomtypecat, assignedroomtypecat, bookingchanges,
    deposittypecat, dayswaitinglist, customertypecat, adr, rcps, totalsqr, reservationstatuscat))
# Prepend a constant column (index 0)
x = sm.add_constant(x, prepend=True)

## 2. Feature Selection

### Wrapper-Based: Forward Search

In [11]:
# from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
# from sklearn.metrics import roc_auc_score
# from mlxtend.feature_selection import SequentialFeatureSelector

# forward_feature_selector = SequentialFeatureSelector(RandomForestClassifier(n_jobs=-1),
#            k_features=6,
#            forward=True,
#            verbose=2,
#            scoring='roc_auc',
#            cv=4)

In [12]:
# fselector = forward_feature_selector.fit(x, y)

In [13]:
# fselector.k_feature_names_

### Wrapper-Based: Backward Search

In [14]:
# from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
# from sklearn.metrics import roc_auc_score 

# backward_feature_selector = SequentialFeatureSelector(RandomForestClassifier(n_jobs=-1),
#            k_features=6,
#            forward=False,
#            verbose=2,
#            scoring='roc_auc',
#            cv=4)

In [15]:
# bselector = backward_feature_selector.fit(x, y)

In [16]:
# bselector.k_feature_names_

### Extra Trees Classifier

In [17]:
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(x, y)
print(model.feature_importances_)

[0.00000000e+00 2.41288705e-02 6.54290762e-03 3.56552004e-03
 4.69576062e-03 3.47427522e-03 4.05667428e-03 4.86925873e-03
 2.53797514e-03 2.90658184e-03 3.51521069e-04 2.81228056e-03
 3.98090524e-02 1.76395497e-02 5.72618836e-03 4.67231162e-03
 1.06281516e-02 1.18152913e-03 4.53164843e-03 7.05720850e-03
 4.01953363e-03 4.33681743e-02 5.47423587e-04 1.24294822e-02
 7.31621484e-03 2.21889104e-02 7.26745746e-03 7.51675538e-01]


In [18]:
ext=pd.DataFrame(model.feature_importances_,columns=["extratrees"])
ext.sort_values(['extratrees'], ascending=True)

Unnamed: 0,extratrees
0,0.0
10,0.000352
22,0.000547
17,0.001182
8,0.002538
11,0.002812
9,0.002907
5,0.003474
3,0.003566
20,0.00402
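
The importances are reported in column order, so index 0 is the constant added by `sm.add_constant` and indices 1 to 27 follow the order of the arrays passed to `np.column_stack`. Attaching names makes the ranking easier to read. The following is a minimal sketch using a handful of the values printed above; the `importances` Series is an illustrative construct and not part of the original notebook.

```python
import pandas as pd

# A few importances copied from the ExtraTreesClassifier output above,
# labelled according to the column order used to build x
importances = pd.Series({
    "const": 0.00000000,                     # index 0
    "leadtime": 2.41288705e-02,              # index 1
    "countrycat": 3.98090524e-02,            # index 12
    "marketsegmentcat": 1.76395497e-02,      # index 13
    "deposittypecat": 4.33681743e-02,        # index 21
    "reservationstatuscat": 7.51675538e-01,  # index 27
})

# Rank from most to least important
ranked = importances.sort_values(ascending=False)
print(ranked)
```

`reservationstatuscat` dominates because `ReservationStatus` records the final outcome of the booking (for example `Canceled` or `Check-Out`), so it would not be known at prediction time. It is left out of the selected features that follow.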


### Selected Features

In [19]:
x1 = np.column_stack((leadtime,countrycat,marketsegmentcat,deposittypecat,customertypecat,rcps,arrivaldateweekno))
x1 = sm.add_constant(x1, prepend=True)

In [20]:
x1_train, x1_val, y1_train, y1_val = train_test_split(x1, y, random_state=0)

## 3. Model Training and Validation

### SVM

In [21]:
# https://stackoverflow.com/questions/52896387/svc-with-class-weight-in-scikit-learn

from sklearn import svm
clf = svm.SVC(gamma='scale', 
            class_weight='balanced')
clf.fit(x1_train, y1_train)  
prclf = clf.predict(x1_val)
prclf

array([1, 1, 0, ..., 0, 1, 1])

In [22]:
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y1_val,prclf))
print(classification_report(y1_val,prclf))

[[5959 1307]
 [1073 1676]]
              precision    recall  f1-score   support

           0       0.85      0.82      0.83      7266
           1       0.56      0.61      0.58      2749

    accuracy                           0.76     10015
   macro avg       0.70      0.71      0.71     10015
weighted avg       0.77      0.76      0.77     10015
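
The figures in the classification report can be derived directly from the confusion matrix, where rows are the true classes and columns are the predicted classes. A short worked example for class 1 (cancellations), using the counts printed above; the variable names are illustrative:

```python
# Confusion matrix from the SVM validation output above:
# rows are true classes (0, 1), columns are predicted classes (0, 1)
tn, fp = 5959, 1307
fn, tp = 1073, 1676

precision_1 = tp / (tp + fp)   # share of predicted cancellations that were real
recall_1 = tp / (tp + fn)      # share of real cancellations that were caught
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.56 recall=0.61 f1=0.58 accuracy=0.76
```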



In [23]:
from sklearn.model_selection import cross_validate
cv_results = cross_validate(clf, x1_train, y1_train, cv=5)
cv_results

{'fit_time': array([17.09103298, 17.61097264, 18.97073817, 16.93097425, 17.5366137 ]),
 'score_time': array([3.26191711, 3.19363189, 3.62116814, 3.13388181, 3.63702679]),
 'test_score': array([0.75852887, 0.76718256, 0.76035946, 0.7538692 , 0.76052588])}
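
The dictionary returned by `cross_validate` is easier to compare across models once the fold scores are reduced to a mean and a spread. A minimal sketch using the five `test_score` values printed above (the name `svm_scores` is introduced here for illustration):

```python
import statistics

# Fold accuracies copied from the SVM cross-validation output above
svm_scores = [0.75852887, 0.76718256, 0.76035946, 0.7538692, 0.76052588]

mean_acc = statistics.mean(svm_scores)
std_acc = statistics.stdev(svm_scores)  # sample standard deviation across folds

print(f"CV accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

The mean fold accuracy is close to the 0.76 accuracy reported on the validation split, which suggests the SVM's validation result is not an artefact of one particular split.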

### XGBoost

In [24]:
import xgboost as xgb
xgb_model = xgb.XGBClassifier(learning_rate=0.001,
                              max_depth=1,
                              n_estimators=100,
                              scale_pos_weight=3)
xgb_model.fit(x1_train, y1_train)

print("Accuracy on training set: {:.3f}".format(xgb_model.score(x1_train, y1_train)))
print("Accuracy on validation set: {:.3f}".format(xgb_model.score(x1_val, y1_val)))

Accuracy on training set: 0.579
Accuracy on validation set: 0.571
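
`scale_pos_weight` controls how heavily the positive class (cancellations) is weighted during training. A common heuristic, described in the XGBoost documentation, is the ratio of negative to positive training examples. A minimal sketch using the training class counts that the notebook prints before oversampling (21,672 non-cancellations and 8,373 cancellations); `neg`, `pos` and `ratio` are illustrative names:

```python
from collections import Counter

# Training class counts as printed by Counter(y1_train) before SMOTE
counts = Counter({0: 21672, 1: 8373})

neg, pos = counts[0], counts[1]
ratio = neg / pos  # heuristic starting value for scale_pos_weight

print(f"negatives / positives = {ratio:.2f}")
```

The notebook's value of 3 sits slightly above this ratio, which would tend to favour recall on the cancellation class over precision.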


In [25]:
xgb_predict=xgb_model.predict(x1_val)
xgb_predict

array([1, 1, 1, ..., 0, 1, 1])

In [26]:
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y1_val,xgb_predict))
print(classification_report(y1_val,xgb_predict))

[[3159 4107]
 [ 194 2555]]
              precision    recall  f1-score   support

           0       0.94      0.43      0.59      7266
           1       0.38      0.93      0.54      2749

    accuracy                           0.57     10015
   macro avg       0.66      0.68      0.57     10015
weighted avg       0.79      0.57      0.58     10015



In [27]:
from sklearn.model_selection import cross_validate
cv_results = cross_validate(xgb_model, x1_train, y1_train, cv=5)
cv_results

{'fit_time': array([0.13618445, 0.12962818, 0.12848425, 0.12932062, 0.14404106]),
 'score_time': array([0.00341749, 0.00360179, 0.00538206, 0.00379848, 0.00389934]),
 'test_score': array([0.57547013, 0.58745216, 0.58029622, 0.57779997, 0.57580296])}

In [28]:
from sklearn.model_selection import cross_validate
cv_results = cross_validate(xgb_model, x1_train, y1_train, scoring="recall", cv=5)
cv_results

{'fit_time': array([0.13802314, 0.12978101, 0.14137053, 0.1485393 , 0.13879609]),
 'score_time': array([0.00703359, 0.00715184, 0.00730824, 0.00713611, 0.00779438]),
 'test_score': array([0.92358209, 0.93373134, 0.92298507, 0.92532855, 0.92353644])}

### Oversampling for Naive Bayes and KNN Models

In [29]:
counter = Counter(y1_train)
print(counter)

Counter({0: 21672, 1: 8373})


In [30]:
oversample = SMOTE()
x1_train, y1_train = oversample.fit_resample(x1_train, y1_train)

In [31]:
counter = Counter(y1_train)
print(counter)

Counter({1: 21672, 0: 21672})
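
SMOTE balances the classes by synthesising new minority-class rows rather than duplicating existing ones: each synthetic row lies on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step with NumPy on toy data. It is a simplified illustration, not imblearn's implementation, and `interpolate` is a hypothetical helper.

```python
import numpy as np

def interpolate(sample, neighbour, rng):
    """Return a synthetic point on the segment between two minority samples."""
    gap = rng.uniform(0.0, 1.0)            # random position along the segment
    return sample + gap * (neighbour - sample)

rng = np.random.default_rng(0)
sample = np.array([10.0, 2.0])             # toy minority-class row
neighbour = np.array([14.0, 4.0])          # one of its nearest minority neighbours

synthetic = interpolate(sample, neighbour, rng)
print(synthetic)
```

Because the features here include integer category codes, interpolated rows can carry fractional codes that correspond to no real category. imblearn also provides `SMOTENC` for data that mixes numerical and categorical columns, which may be worth considering.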


### Naive Bayes

In [32]:
gnb = GaussianNB()
gnb

In [33]:
y_pred = gnb.fit(x1_train, y1_train).predict(x1_val)
y_pred

array([1, 1, 1, ..., 1, 1, 1])

In [34]:
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y1_val,y_pred))
print(classification_report(y1_val,y_pred))

[[1841 5425]
 [  48 2701]]
              precision    recall  f1-score   support

           0       0.97      0.25      0.40      7266
           1       0.33      0.98      0.50      2749

    accuracy                           0.45     10015
   macro avg       0.65      0.62      0.45     10015
weighted avg       0.80      0.45      0.43     10015



### KNN and SMOTE Oversampling

In [35]:
# Only x1_train is rescaled here; x1_val and the H2 features used later keep their original scale
x1_train = MinMaxScaler().fit_transform(x1_train)

In [36]:
x1_train

array([[0.        , 0.15332429, 0.768     , ..., 0.66666667, 0.        ,
        0.61538462],
       [0.        , 0.16010855, 0.768     , ..., 0.66666667, 0.        ,
        0.69230769],
       [0.        , 0.45590231, 0.368     , ..., 1.        , 0.        ,
        0.71153846],
       ...,
       [0.        , 0.01823239, 0.768     , ..., 0.66666667, 0.        ,
        0.65143341],
       [0.        , 0.36833502, 0.72508805, ..., 0.66666667, 0.        ,
        0.19183214],
       [0.        , 0.15599515, 0.768     , ..., 0.66666667, 0.        ,
        0.74939283]])
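
A scaler learns its minimum and maximum from the data it is fitted on, and the same fitted transformation then has to be applied to any data the model later predicts on. One way to keep the two in step is to wrap the scaler and the classifier in a single scikit-learn pipeline. A minimal sketch on synthetic data (the arrays and the name `knn_pipeline` are illustrative, not the notebook's variables):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: two features on very different scales
X_train = np.column_stack((rng.uniform(0, 500, 200), rng.uniform(0, 1, 200)))
y_train = (X_train[:, 0] > 250).astype(int)
X_val = np.column_stack((rng.uniform(0, 500, 50), rng.uniform(0, 1, 50)))
y_val = (X_val[:, 0] > 250).astype(int)

# The pipeline fits the scaler on the training data only and reuses
# the fitted scaler whenever predict() or score() is called
knn_pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=10))
knn_pipeline.fit(X_train, y_train)

print("Validation accuracy: {:.2f}".format(knn_pipeline.score(X_val, y_val)))
```

With this arrangement, validation and test features never need to be scaled by hand before calling `predict`.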

In [37]:
knn = KNeighborsClassifier(n_neighbors=10)
knn=knn.fit(x1_train, y1_train)
pred = knn.predict(x1_val)
pred
print("Training set score: {:.2f}".format(knn.score(x1_train, y1_train)))
print("Validation set score: {:.2f}".format(knn.score(x1_val, y1_val)))

Training set score: 0.87
Validation set score: 0.69


In [38]:
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y1_val,pred))
print(classification_report(y1_val,pred))

[[5930 1336]
 [1775  974]]
              precision    recall  f1-score   support

           0       0.77      0.82      0.79      7266
           1       0.42      0.35      0.39      2749

    accuracy                           0.69     10015
   macro avg       0.60      0.59      0.59     10015
weighted avg       0.67      0.69      0.68     10015



In [39]:
from sklearn.model_selection import cross_validate
cv_results = cross_validate(knn, x1_train, y1_train, cv=5)
cv_results

{'fit_time': array([0.04082441, 0.03587604, 0.03674221, 0.03540993, 0.0360055 ]),
 'score_time': array([0.49706864, 0.47440219, 0.48210144, 0.46027851, 0.48159075]),
 'test_score': array([0.81082016, 0.82397047, 0.85269351, 0.84796401, 0.85140748])}

## 4. Test Data

### Load test data

#### Import Data from AWS S3 to SageMaker

In [40]:
# data_key_test = 'H2.csv'
# data_location_test = 's3://{}/{}'.format(bucket, data_key_test)

# h2data = pd.read_csv(data_location_test)

#### Import Data From Azure Blob Storage to Azure Machine Learning Studio

In [41]:
# sas_url = "enter url here"
# blob_client = BlobClient.from_blob_url(sas_url)
# downloaded_blob = blob_client.download_blob()

# from io import StringIO
# blob_data = blob_client.download_blob()
# h2data = pd.read_csv(StringIO(blob_data.content_as_text()))
# print(h2data)

#### Import CSV

In [42]:
h2data = pd.read_csv('H2.csv')
h2data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate
0,0,6,2015,July,27,1,0,2,1,0.0,...,No Deposit,6,,0,Transient,0.0,0,0,Check-Out,2015-07-03
1,1,88,2015,July,27,1,0,4,2,0.0,...,No Deposit,9,,0,Transient,76.5,0,1,Canceled,2015-07-01
2,1,65,2015,July,27,1,0,4,1,0.0,...,No Deposit,9,,0,Transient,68.0,0,1,Canceled,2015-04-30
3,1,92,2015,July,27,1,2,4,2,0.0,...,No Deposit,9,,0,Transient,76.5,0,2,Canceled,2015-06-23
4,1,100,2015,July,27,2,0,2,2,0.0,...,No Deposit,9,,0,Transient,76.5,0,1,Canceled,2015-04-02


In [43]:
type(h2data)

pandas.core.frame.DataFrame

In [44]:
t_leadtime = h2data['LeadTime'] #1
t_arrivaldateyear = h2data['ArrivalDateYear']
t_arrivaldateweekno = h2data['ArrivalDateWeekNumber']
t_arrivaldatedayofmonth = h2data['ArrivalDateDayOfMonth']
t_staysweekendnights = h2data['StaysInWeekendNights'] #2
t_staysweeknights = h2data['StaysInWeekNights'] #3
t_adults = h2data['Adults'] #4
t_children = h2data['Children'] #5
t_babies = h2data['Babies'] #6
t_previouscancellations = h2data['PreviousCancellations'] #12
t_previousbookingsnotcanceled = h2data['PreviousBookingsNotCanceled'] #13
t_bookingchanges = h2data['BookingChanges'] #16
t_dayswaitinglist = h2data['DaysInWaitingList'] #20
t_adr = h2data['ADR'] #22
t_rcps = h2data['RequiredCarParkingSpaces'] #23
t_totalsqr = h2data['TotalOfSpecialRequests'] #24

In [45]:
t_arrivaldatemonth = h2data.ArrivalDateMonth.astype("category").cat.codes
t_arrivaldatemonthcat = pd.Series(t_arrivaldatemonth)
t_mealcat=h2data.Meal.astype("category").cat.codes
t_mealcat=pd.Series(t_mealcat)
t_countrycat=h2data.Country.astype("category").cat.codes
t_countrycat=pd.Series(t_countrycat)
t_marketsegmentcat=h2data.MarketSegment.astype("category").cat.codes
t_marketsegmentcat=pd.Series(t_marketsegmentcat)
t_distributionchannelcat=h2data.DistributionChannel.astype("category").cat.codes
t_distributionchannelcat=pd.Series(t_distributionchannelcat)
t_reservedroomtypecat=h2data.ReservedRoomType.astype("category").cat.codes
t_reservedroomtypecat=pd.Series(t_reservedroomtypecat)
t_assignedroomtypecat=h2data.AssignedRoomType.astype("category").cat.codes
t_assignedroomtypecat=pd.Series(t_assignedroomtypecat)
t_deposittypecat=h2data.DepositType.astype("category").cat.codes
t_deposittypecat=pd.Series(t_deposittypecat)
t_customertypecat=h2data.CustomerType.astype("category").cat.codes
t_customertypecat=pd.Series(t_customertypecat)
t_reservationstatuscat=h2data.ReservationStatus.astype("category").cat.codes
t_reservationstatuscat=pd.Series(t_reservationstatuscat)
t_isrepeatedguestcat = h2data.IsRepeatedGuest.astype("category").cat.codes
t_isrepeatedguestcat=pd.Series(t_isrepeatedguestcat)
t_agentcat = h2data.Agent.astype("category").cat.codes
t_agentcat=pd.Series(t_agentcat)
t_companycat = h2data.Company.astype("category").cat.codes
t_companycat=pd.Series(t_companycat)
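
`astype("category").cat.codes` assigns integers according to the sorted labels present in that particular column, so encoding H1 and H2 separately can map the same label to different integers whenever the two files contain different label sets. The following is a minimal sketch of one way to keep the codes aligned, using toy data and a hypothetical helper `encode_with`: the category list is learned from the training column and reused for the test column, with unseen labels mapped to -1.

```python
import pandas as pd

def encode_with(categories, column):
    """Encode a column with a fixed category list; unseen labels become -1."""
    codes = pd.Categorical(column, categories=categories).codes
    return pd.Series(codes, index=column.index)

# Toy stand-ins for a training column and a test column
train_col = pd.Series(["ESP", "GBR", "PRT", "PRT"])
test_col = pd.Series(["GBR", "PRT", "USA"])

# Learn the category order once, from the training data
train_categories = train_col.astype("category").cat.categories

train_codes = encode_with(train_categories, train_col)
test_codes = encode_with(train_categories, test_col)

print(train_codes.tolist())  # [0, 1, 2, 2]
print(test_codes.tolist())   # [1, 2, -1]

# Encoding the test column on its own yields different integers
independent_codes = test_col.astype("category").cat.codes
print(independent_codes.tolist())  # [0, 1, 2]
```

In the toy example, `GBR` is 1 under the shared encoding but 0 when the test column is encoded on its own, which is the kind of mismatch a model trained on H1 cannot detect.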

In [46]:
# Build the H2 feature matrix (a) with the same columns, in the same order, as x1
a = np.column_stack((t_leadtime,t_countrycat,t_marketsegmentcat,t_deposittypecat,t_customertypecat,t_rcps,t_arrivaldateweekno))
a = sm.add_constant(a, prepend=True)
# H2 labels (b) as a NumPy array
b = h2data['IsCanceled'].values

### SVM

In [47]:
prh2 = clf.predict(a)
prh2

array([0, 1, 1, ..., 0, 0, 0])

In [48]:
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(b,prh2))
print(classification_report(b,prh2))

[[34581 11647]
 [11247 21855]]
              precision    recall  f1-score   support

           0       0.75      0.75      0.75     46228
           1       0.65      0.66      0.66     33102

    accuracy                           0.71     79330
   macro avg       0.70      0.70      0.70     79330
weighted avg       0.71      0.71      0.71     79330



### KNN

In [49]:
prh3 = knn.predict(a)
prh3

array([0, 0, 0, ..., 1, 1, 0])

In [50]:
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(b,prh3))
print(classification_report(b,prh3))

[[35920 10308]
 [19594 13508]]
              precision    recall  f1-score   support

           0       0.65      0.78      0.71     46228
           1       0.57      0.41      0.47     33102

    accuracy                           0.62     79330
   macro avg       0.61      0.59      0.59     79330
weighted avg       0.61      0.62      0.61     79330



### XGBoost

In [51]:
prh4 = xgb_model.predict(a)
prh4

array([0, 1, 1, ..., 1, 1, 1])

In [52]:
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(b,prh4))
print(classification_report(b,prh4))

[[12650 33578]
 [ 1972 31130]]
              precision    recall  f1-score   support

           0       0.87      0.27      0.42     46228
           1       0.48      0.94      0.64     33102

    accuracy                           0.55     79330
   macro avg       0.67      0.61      0.53     79330
weighted avg       0.70      0.55      0.51     79330

