# Linear SVM: Predicting Customer ADR

In this example, a linear support vector machine regressor (`LinearSVR` from the sklearn library) is used to predict customer ADR (average daily rate), using the hotel cancellation datasets provided by Antonio, Almeida and Nunes (2019). Attributions are provided below.

#### Attributions

The below examples use the [scikit-learn](https://github.com/scikit-learn/scikit-learn) package which is provided by The scikit-learn developers (Copyright (c) 2007-2020), provided under the BSD 3-Clause License. Modifications have been made where appropriate for conducting analysis on the dataset specific to this example.

The copyright and permission notices are included below:
    
Copyright (c) 2007-2020 The scikit-learn developers.
All rights reserved.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL COPYRIGHT HOLDER BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The original datasets for hotel cancellations, as well as the relevant research, are available from the original authors here:

* [Antonio, Almeida, Nunes, 2019. Hotel booking demand datasets](https://www.sciencedirect.com/science/article/pii/S2352340918315191)

The below work and findings are not endorsed by the original authors in any way.

### Import Libraries and Data

In [1]:
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

from numpy.random import seed
seed(1)

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

In [2]:
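train_df = pd.read_csv('H1full.csv')
# Display the bookings ordered by arrival year and week number
# (sort_values returns a sorted copy; train_df itself is unchanged)
train_df.sort_values(['ArrivalDateYear','ArrivalDateWeekNumber'], ascending=True)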

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate
0,0,342,2015,July,27,1,0,0,2,0,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
1,0,737,2015,July,27,1,0,0,2,0,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
2,0,7,2015,July,27,1,0,1,1,0,...,No Deposit,,,0,Transient,75.00,0,0,Check-Out,2015-07-02
3,0,13,2015,July,27,1,0,1,1,0,...,No Deposit,304,,0,Transient,75.00,0,0,Check-Out,2015-07-02
4,0,14,2015,July,27,1,0,2,2,0,...,No Deposit,240,,0,Transient,98.00,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40055,0,212,2017,August,35,31,2,8,2,1,...,No Deposit,143,,0,Transient,89.75,0,0,Check-Out,2017-09-10
40056,0,169,2017,August,35,30,2,9,2,0,...,No Deposit,250,,0,Transient-Party,202.27,0,1,Check-Out,2017-09-10
40057,0,204,2017,August,35,29,4,10,2,0,...,No Deposit,250,,0,Transient,153.57,0,3,Check-Out,2017-09-12
40058,0,211,2017,August,35,31,4,10,2,0,...,No Deposit,40,,0,Contract,112.80,0,1,Check-Out,2017-09-14


In [3]:
# Interval variables
leadtime = train_df['LeadTime']
arrivaldateyear = train_df['ArrivalDateYear']
arrivaldateweekno = train_df['ArrivalDateWeekNumber']
arrivaldatedayofmonth = train_df['ArrivalDateDayOfMonth']
staysweekendnights = train_df['StaysInWeekendNights']
staysweeknights = train_df['StaysInWeekNights']
adults = train_df['Adults']
children = train_df['Children']
babies = train_df['Babies']
isrepeatedguest = train_df['IsRepeatedGuest'] 
previouscancellations = train_df['PreviousCancellations']
previousbookingsnotcanceled = train_df['PreviousBookingsNotCanceled']
bookingchanges = train_df['BookingChanges']
agent = train_df['Agent']
company = train_df['Company']
dayswaitinglist = train_df['DaysInWaitingList']
adr = train_df['ADR']
rcps = train_df['RequiredCarParkingSpaces']
totalsqr = train_df['TotalOfSpecialRequests']

In [4]:
y1 = np.array(adr)

In [5]:
# Categorical variables
IsCanceled = train_df['IsCanceled']
arrivaldatemonth = train_df.ArrivalDateMonth.astype("category").cat.codes
arrivaldatemonthcat=pd.Series(arrivaldatemonth)
mealcat=train_df.Meal.astype("category").cat.codes
mealcat=pd.Series(mealcat)
countrycat=train_df.Country.astype("category").cat.codes
countrycat=pd.Series(countrycat)
marketsegmentcat=train_df.MarketSegment.astype("category").cat.codes
marketsegmentcat=pd.Series(marketsegmentcat)
distributionchannelcat=train_df.DistributionChannel.astype("category").cat.codes
distributionchannelcat=pd.Series(distributionchannelcat)
reservedroomtypecat=train_df.ReservedRoomType.astype("category").cat.codes
reservedroomtypecat=pd.Series(reservedroomtypecat)
assignedroomtypecat=train_df.AssignedRoomType.astype("category").cat.codes
assignedroomtypecat=pd.Series(assignedroomtypecat)
deposittypecat=train_df.DepositType.astype("category").cat.codes
deposittypecat=pd.Series(deposittypecat)
customertypecat=train_df.CustomerType.astype("category").cat.codes
customertypecat=pd.Series(customertypecat)
reservationstatuscat=train_df.ReservationStatus.astype("category").cat.codes
reservationstatuscat=pd.Series(reservationstatuscat)
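
The cell above turns each text column into integer codes with `.astype("category").cat.codes`. As a minimal sketch of what that produces, the snippet below applies the same call to a small made-up column (the values are hypothetical stand-ins, not rows from `H1full.csv`). The codes simply follow the sorted order of the distinct labels, so a linear model will treat them as ordinary numbers.

```python
import pandas as pd

# Hypothetical toy column standing in for a field such as DepositType
deposit = pd.Series(["No Deposit", "Refundable", "No Deposit", "Non Refund"])

# astype("category") collects the distinct labels in sorted order;
# .cat.codes replaces each label with its position in that order
as_cat = deposit.astype("category")
codes = as_cat.cat.codes

# Recover the code -> label mapping for inspection
mapping = dict(enumerate(as_cat.cat.categories))
print(codes.tolist())
print(mapping)
```

Inspecting `mapping` in this way is a quick check on which integer each label received before the codes are stacked into the feature matrix.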

In [6]:
x1 = np.column_stack((IsCanceled,countrycat,marketsegmentcat,deposittypecat,customertypecat,rcps,arrivaldateweekno))
x1 = sm.add_constant(x1, prepend=True)

In [7]:
from sklearn.svm import LinearSVR

In [8]:
# Hold out a validation set (train_test_split defaults to a 75/25 split)
X_train, X_val, y_train, y_val = train_test_split(x1, y1)

### LinearSVR

In [9]:
svm_reg_0 = LinearSVR(epsilon=0)
svm_reg_05 = LinearSVR(epsilon=0.5)
svm_reg_15 = LinearSVR(epsilon=1.5)

svm_reg_0.fit(X_train, y_train)
svm_reg_05.fit(X_train, y_train)
svm_reg_15.fit(X_train, y_train)



LinearSVR(C=1.0, dual=True, epsilon=1.5, fit_intercept=True,
          intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
          random_state=None, tol=0.0001, verbose=0)
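
The three models above differ only in `epsilon`. As a rough illustration of what that parameter controls, the sketch below computes the epsilon-insensitive loss, which is the default `loss='epsilon_insensitive'` shown in the output above: residuals smaller than `epsilon` cost nothing, and larger ones are penalised linearly. The function name `epsilon_insensitive_loss` and the toy values are illustrative assumptions, not part of scikit-learn or the hotel data.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss max(0, |y - f(x)| - epsilon)."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)

# Hypothetical ADR values and predictions
y_true = np.array([100.0, 100.0, 100.0])
y_pred = np.array([100.4, 101.0, 103.0])

# With epsilon=0 every residual is penalised in full
loss_0 = epsilon_insensitive_loss(y_true, y_pred, 0.0)

# With epsilon=1.5 residuals of 0.4 and 1.0 fall inside the tube and cost nothing
loss_15 = epsilon_insensitive_loss(y_true, y_pred, 1.5)

print(loss_0)
print(loss_15)
```

When typical residuals are much larger than `epsilon`, most points lie outside the tube whichever of these values is chosen, so the fitted models may end up behaving similarly.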

In [10]:
predictions0 = svm_reg_0.predict(X_val)
predictions05 = svm_reg_05.predict(X_val)
predictions15 = svm_reg_15.predict(X_val)

In [11]:
predictions0

array([90.64325105, 97.89490467, 74.4480945 , ..., 87.35795296,
       50.77953399, 59.49167397])

In [12]:
predictions05

array([ 95.5812616 , 103.59180519,  77.48034877, ...,  91.46449917,
        53.20650446,  62.54728851])

In [13]:
predictions15

array([91.39436898, 99.19095222, 75.25668322, ..., 89.12092372,
       51.61460491, 60.70795814])

### Performance Against Validation Data

In [14]:
mean_absolute_error(y_val, predictions0)

43.15095947350899

In [15]:
mean_absolute_error(y_val, predictions05)

43.103803965305325

In [16]:
mean_absolute_error(y_val, predictions15)

43.11027109035095

In [17]:
# Root mean squared error (RMSE) for epsilon=0
math.sqrt(mean_squared_error(y_val, predictions0))

61.785171692909145

In [18]:
# Root mean squared error (RMSE) for epsilon=0.5
math.sqrt(mean_squared_error(y_val, predictions05))

60.55948187889743

In [19]:
# Root mean squared error (RMSE) for epsilon=1.5
math.sqrt(mean_squared_error(y_val, predictions15))

61.4056074324364

In [20]:
np.mean(y_val)

94.62558861707438

In [21]:
np.mean(predictions05)

80.36807609287723
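
The cells above compute the same handful of figures (MAE, RMSE, and the means of the actual and predicted values) one call at a time. A small helper such as the one sketched below could gather them in a single step. It uses plain NumPy so that the formulas are visible; the name `regression_summary` and the toy inputs are assumptions made for illustration only.

```python
import numpy as np

def regression_summary(y_true, y_pred):
    """Return MAE, RMSE and the mean of actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    return {
        "mae": float(np.mean(np.abs(errors))),        # mean absolute error
        "rmse": float(np.sqrt(np.mean(errors ** 2))),  # root mean squared error
        "mean_actual": float(np.mean(y_true)),
        "mean_predicted": float(np.mean(y_pred)),
    }

# Hypothetical values; in the notebook this could be called as
# regression_summary(y_val, predictions05)
summary = regression_summary([100.0, 50.0, 80.0], [90.0, 60.0, 80.0])
print(summary)
```

Comparing `mean_actual` with `mean_predicted` alongside the error metrics makes it easier to see whether a model is systematically under- or over-predicting.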

### Performance Against Test Data

In [22]:
h2data = pd.read_csv('H2full.csv')
# Preview the first five rows of the test dataset
h2data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate
0,0,6,2015,July,27,1,0,2,1,0.0,...,No Deposit,6,,0,Transient,0.0,0,0,Check-Out,2015-07-03
1,1,88,2015,July,27,1,0,4,2,0.0,...,No Deposit,9,,0,Transient,76.5,0,1,Canceled,2015-07-01
2,1,65,2015,July,27,1,0,4,1,0.0,...,No Deposit,9,,0,Transient,68.0,0,1,Canceled,2015-04-30
3,1,92,2015,July,27,1,2,4,2,0.0,...,No Deposit,9,,0,Transient,76.5,0,2,Canceled,2015-06-23
4,1,100,2015,July,27,2,0,2,2,0.0,...,No Deposit,9,,0,Transient,76.5,0,1,Canceled,2015-04-02


In [23]:
t_leadtime = h2data['LeadTime'] #1
t_arrivaldateyear = h2data['ArrivalDateYear']
t_arrivaldateweekno = h2data['ArrivalDateWeekNumber']
t_arrivaldatedayofmonth = h2data['ArrivalDateDayOfMonth']
t_staysweekendnights = h2data['StaysInWeekendNights'] #2
t_staysweeknights = h2data['StaysInWeekNights'] #3
t_adults = h2data['Adults'] #4
t_children = h2data['Children'] #5
t_babies = h2data['Babies'] #6
t_isrepeatedguest = h2data['IsRepeatedGuest'] #11
t_previouscancellations = h2data['PreviousCancellations'] #12
t_previousbookingsnotcanceled = h2data['PreviousBookingsNotCanceled'] #13
t_bookingchanges = h2data['BookingChanges'] #16
t_agent = h2data['Agent'] #18
t_company = h2data['Company'] #19
t_dayswaitinglist = h2data['DaysInWaitingList'] #20
t_adr = h2data['ADR'] #22
t_rcps = h2data['RequiredCarParkingSpaces'] #23
t_totalsqr = h2data['TotalOfSpecialRequests'] #24

In [24]:
# Categorical variables
t_IsCanceled = h2data['IsCanceled']
t_arrivaldatemonth = h2data.ArrivalDateMonth.astype("category").cat.codes
t_arrivaldatemonthcat = pd.Series(t_arrivaldatemonth)
t_mealcat=h2data.Meal.astype("category").cat.codes
t_mealcat=pd.Series(t_mealcat)
t_countrycat=h2data.Country.astype("category").cat.codes
t_countrycat=pd.Series(t_countrycat)
t_marketsegmentcat=h2data.MarketSegment.astype("category").cat.codes
t_marketsegmentcat=pd.Series(t_marketsegmentcat)
t_distributionchannelcat=h2data.DistributionChannel.astype("category").cat.codes
t_distributionchannelcat=pd.Series(t_distributionchannelcat)
t_reservedroomtypecat=h2data.ReservedRoomType.astype("category").cat.codes
t_reservedroomtypecat=pd.Series(t_reservedroomtypecat)
t_assignedroomtypecat=h2data.AssignedRoomType.astype("category").cat.codes
t_assignedroomtypecat=pd.Series(t_assignedroomtypecat)
t_deposittypecat=h2data.DepositType.astype("category").cat.codes
t_deposittypecat=pd.Series(t_deposittypecat)
t_customertypecat=h2data.CustomerType.astype("category").cat.codes
t_customertypecat=pd.Series(t_customertypecat)
t_reservationstatuscat=h2data.ReservationStatus.astype("category").cat.codes
t_reservationstatuscat=pd.Series(t_reservationstatuscat)
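
Note that the cell above derives the H2 category codes independently of H1, so the same label (for example a country code) may be mapped to a different integer in each dataset. One possible way to keep the codes aligned, sketched here with hypothetical toy values, is to reuse the training categories through `pd.CategoricalDtype`; labels that never appeared in training then receive the code -1.

```python
import pandas as pd

# Hypothetical country columns standing in for H1 (training) and H2 (test)
h1_country = pd.Series(["PRT", "GBR", "ESP", "PRT"])
h2_country = pd.Series(["FRA", "PRT"])

# Encoded independently, "PRT" is 2 in H1 but 1 in H2
h1_codes_sep = h1_country.astype("category").cat.codes
h2_codes_sep = h2_country.astype("category").cat.codes

# Reuse the categories learned from the training column
train_dtype = pd.CategoricalDtype(categories=h1_country.astype("category").cat.categories)
h2_codes_aligned = h2_country.astype(train_dtype).cat.codes

print(h1_codes_sep.tolist())
print(h2_codes_sep.tolist())
print(h2_codes_aligned.tolist())
```

With the shared dtype, "PRT" keeps the code it had in training, while the unseen label "FRA" is flagged as -1 and can be handled explicitly.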

In [25]:
atest = np.column_stack((t_IsCanceled,t_countrycat,t_marketsegmentcat,t_deposittypecat,t_customertypecat,t_rcps,t_arrivaldateweekno))
atest = sm.add_constant(atest, prepend=True)
# Actual ADR values for the test set, as a NumPy array
btest = t_adr.values

In [26]:
bpred = svm_reg_05.predict(atest)
bpred

array([ 81.7431138 , 107.46098525, 107.46098525, ...,  94.50144931,
        94.202052  ,  94.50144931])

In [27]:
mean_absolute_error(btest, bpred)

30.332614341027753

In [28]:
print('mse (sklearn): ', mean_squared_error(btest,bpred))

mse (sklearn):  2097.3478156922206


In [29]:
math.sqrt(mean_squared_error(btest, bpred))

45.79681010389501

In [30]:
np.mean(btest)

105.30446539770578

In [31]:
np.mean(bpred)

88.63405620085129

In [32]:
bpred[1:100]

array([107.46098525, 107.46098525, 107.46098525, 107.46098525,
       107.46098525,  70.95457297, 107.46098525, 107.46098525,
       107.46098525,  70.95457297,  70.95457297,  70.95457297,
        70.95457297,  88.95748126,  70.95457297,  88.95748126,
        70.95457297,  70.95457297,  88.95748126,  88.95748126,
        99.74602209,  70.95457297,  70.95457297, 107.46098525,
        89.45807696, 107.46098525, 107.46098525, 107.46098525,
        70.95457297, 107.86151243, 107.86151243, 107.86151243,
       107.86151243, 107.86151243, 107.86151243, 107.86151243,
       107.86151243, 107.86151243, 107.86151243, 107.86151243,
       107.86151243, 107.86151243, 104.78793476,  82.14364098,
       104.78793476, 104.78793476,  69.28669661, 107.86151243,
       104.78793476,  81.64304527,  81.64304527, 107.86151243,
        89.85860414, 107.86151243, 107.86151243, 107.86151243,
        71.35510015, 107.86151243,  82.14364098, 107.86151243,
        89.85860414, 107.86151243, 107.86151243, 107.86

In [33]:
np.min(bpred)

31.95214932095261

In [34]:
np.max(bpred)

157.30035077181913

In [35]:
np.mean(bpred)

88.63405620085129

In [36]:
np.min(btest)

0.0

In [37]:
np.max(btest)

5400.0

In [38]:
np.mean(btest)

105.30446539770578
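
The actual test values range from 0 to 5400 against a mean of roughly 105, which suggests at least one extreme observation. As a hedged illustration of how a single such value can affect the metrics used in this notebook, the toy example below (made-up numbers, not the H2 data) compares MAE and RMSE when one of ten targets is far from the rest.

```python
import numpy as np

# Hypothetical targets: nine ordinary values and one extreme value
y_true = np.array([100.0] * 9 + [5400.0])
# Hypothetical predictions that are exact except for the extreme value
y_pred = np.full(10, 100.0)

errors = y_true - y_pred
mae = float(np.mean(np.abs(errors)))
rmse = float(np.sqrt(np.mean(errors ** 2)))
median_abs_error = float(np.median(np.abs(errors)))

print(mae, rmse, median_abs_error)
```

Because RMSE squares each error, the single extreme value dominates it far more than it dominates MAE, while the median absolute error is unaffected. Checking for such values in `btest` may be worthwhile before reading too much into the RMSE reported above.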