### Linear Regression Analysis of ADR
Using the transformed dataset, I split the data into three subsets by arrival year (2015, 2016 and 2017) and ran a separate regression analysis on each subset.

In [1]:
import pandas as pd
import numpy as np
import calendar
from datetime import datetime
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [2]:
hotel_data1 = pd.read_csv('hotel_data1.csv')
hotel_data1.head()

Unnamed: 0,hotel,arrival_date,overnight,adults,children,babies,meal,reserved_room,booking_changes,deposit_type,agent,company,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date,length_of_stay
0,0,2015-07-01,0,2,0,0,1,1,3,0,0,0,0.0,0,0,Check-Out,2015-07-01,0
1,0,2015-07-01,0,2,0,0,1,1,4,0,0,0,0.0,0,0,Check-Out,2015-07-01,0
2,0,2015-07-01,1,1,0,0,1,0,0,0,0,0,75.0,0,0,Check-Out,2015-07-02,1
3,0,2015-07-01,1,1,0,0,1,1,0,0,1,0,75.0,0,0,Check-Out,2015-07-02,1
4,0,2015-07-01,1,2,0,0,1,1,0,0,1,0,98.0,0,1,Check-Out,2015-07-03,2


In [3]:
hotel_data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75166 entries, 0 to 75165
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   hotel                        75166 non-null  int64  
 1   arrival_date                 75166 non-null  object 
 2   overnight                    75166 non-null  int64  
 3   adults                       75166 non-null  int64  
 4   children                     75166 non-null  int64  
 5   babies                       75166 non-null  int64  
 6   meal                         75166 non-null  int64  
 7   reserved_room                75166 non-null  int64  
 8   booking_changes              75166 non-null  int64  
 9   deposit_type                 75166 non-null  int64  
 10  agent                        75166 non-null  int64  
 11  company                      75166 non-null  int64  
 12  adr                          75166 non-null  float64
 13  required_car_par

In [4]:
# convert columns to the right data type
hotel_data1['arrival_date'] = pd.to_datetime(hotel_data1['arrival_date'])
hotel_data1['reservation_status_date'] = pd.to_datetime(hotel_data1['reservation_status_date'])

In [5]:
# split the data into years
hotel_data1['arrival_date'] = hotel_data1['arrival_date'].dt.year
data_2015 = hotel_data1.query('arrival_date == 2015')
data_2016 = hotel_data1.query('arrival_date == 2016')
data_2017 = hotel_data1.query('arrival_date == 2017')
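
The cell above replaces `arrival_date` with its year, so the column is constant within each yearly subset; this is why the correlation tables further down show `NaN` for `arrival_date`. A minimal sketch of an alternative that keeps the full date, assuming a hypothetical helper column `arrival_year` and a small made-up frame standing in for `hotel_data1`:

```python
import pandas as pd

# Toy stand-in for hotel_data1 (values are illustrative, not taken from the dataset)
toy = pd.DataFrame({
    "arrival_date": pd.to_datetime(
        ["2015-07-01", "2016-03-15", "2016-08-20", "2017-01-05"]
    ),
    "adr": [75.0, 98.0, 120.5, 88.0],
})

# Keep the full date and store the year in a separate (hypothetical) column
toy["arrival_year"] = toy["arrival_date"].dt.year

# One sub-frame per year, keyed by the year
by_year = {year: frame for year, frame in toy.groupby("arrival_year")}

print({year: len(frame) for year, frame in by_year.items()})
```

With this layout, `by_year[2015]` plays the role of `data_2015`, and the original dates remain available for later analysis.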

In [7]:
# linear correlation of quantitative variables
# numeric_only=True skips text and datetime columns (needed in pandas >= 2.0)
data_2015.corr(numeric_only=True)

Unnamed: 0,hotel,arrival_date,overnight,adults,children,babies,meal,reserved_room,booking_changes,deposit_type,agent,company,adr,required_car_parking_spaces,total_of_special_requests,length_of_stay
hotel,1.0,,0.04279,-0.137026,-0.064888,-0.035137,0.117045,0.051116,-0.049279,-0.010457,0.122277,-0.065809,-0.020905,-0.266454,-0.128454,-0.303576
arrival_date,,,,,,,,,,,,,,,,
overnight,0.04279,,1.0,0.018744,-0.00519,5.1e-05,-0.011093,0.084418,-0.011846,0.003239,0.031146,-0.013755,0.209064,0.025358,-0.01448,0.138171
adults,-0.137026,,0.018744,1.0,0.070431,0.033301,-0.031044,-0.000768,-0.0887,0.003908,0.15872,-0.262707,0.272796,0.08711,0.204644,0.154802
children,-0.064888,,-0.00519,0.070431,1.0,0.030258,0.018969,0.027292,0.065728,-0.00673,-0.025751,-0.044868,0.25163,0.093192,0.13313,0.02792
babies,-0.035137,,5.1e-05,0.033301,0.030258,1.0,0.010153,-0.013616,0.037564,-0.002964,-0.009869,-0.017843,0.033758,0.030769,0.13713,0.021012
meal,0.117045,,-0.011093,-0.031044,0.018969,0.010153,1.0,-0.030428,0.006878,0.003217,0.004576,0.030216,-0.055995,0.032073,0.070297,-0.024862
reserved_room,0.051116,,0.084418,-0.000768,0.027292,-0.013616,-0.030428,1.0,-0.05197,0.002534,0.051045,-0.053133,0.068868,-0.022749,0.016443,0.148449
booking_changes,-0.049279,,-0.011846,-0.0887,0.065728,0.037564,0.006878,-0.05197,1.0,-0.001115,-0.075057,0.126145,-0.010096,0.039807,-0.009057,0.142171
deposit_type,-0.010457,,0.003239,0.003908,-0.00673,-0.002964,0.003217,0.002534,-0.001115,1.0,-0.035444,0.062129,0.004414,-0.002127,-0.021035,-0.006431


`adr` appears to be correlated with `overnight`, `adults`, `children`, `company` and `total_of_special_requests`.
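
The predictors can also be picked from the correlation matrix programmatically instead of by eye. The sketch below applies the 0.15 cutoff used in the next cell to a small made-up frame; the data values and the name `selected` are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy data (illustrative only): adr is built from adults and children
toy = pd.DataFrame({
    "adults":          [1, 2, 2, 3, 1, 2, 3, 2],
    "children":        [0, 0, 1, 2, 0, 1, 0, 2],
    "booking_changes": [0, 1, 0, 1, 1, 0, 1, 0],
    "company":         [1, 0, 0, 0, 1, 0, 0, 0],
})
toy["adr"] = 40 + 20 * toy["adults"] + 30 * toy["children"]

# Correlation of every column with adr, excluding adr itself
corr_with_adr = toy.corr()["adr"].drop("adr")

# Keep columns whose absolute correlation exceeds the cutoff,
# so strongly negative correlations are retained as well
threshold = 0.15
selected = corr_with_adr[corr_with_adr.abs() > threshold].index.tolist()

print(corr_with_adr.round(3))
print(selected)
```

Taking the absolute value matters here: a variable such as `company`, which is negatively correlated with `adr`, would be dropped by a plain `> 0.15` comparison.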

In [8]:
# define independent and dependent variables. independent variables are those whose absolute correlation with adr is greater than 0.15
y = data_2015['adr']
x1 = data_2015[['overnight', 'adults', 'children', 'total_of_special_requests', 'company']]
x = sm.add_constant(x1)

In [9]:
# model 1 - OLS on the selected quantitative predictors, including binary ones
results = sm.OLS(y,x).fit()
results.summary()

0,1,2,3
Dep. Variable:,adr,R-squared:,0.199
Model:,OLS,Adj. R-squared:,0.199
Method:,Least Squares,F-statistic:,688.2
Date:,"Tue, 07 Feb 2023",Prob (F-statistic):,0.0
Time:,16:34:06,Log-Likelihood:,-70809.0
No. Observations:,13854,AIC:,141600.0
Df Residuals:,13848,BIC:,141700.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-35.8430,3.495,-10.256,0.000,-42.693,-28.993
overnight,88.5504,3.265,27.124,0.000,82.151,94.950
adults,18.5105,0.746,24.817,0.000,17.049,19.973
children,28.7805,1.012,28.451,0.000,26.798,30.763
total_of_special_requests,6.2071,0.445,13.958,0.000,5.335,7.079
company,-20.7290,1.338,-15.497,0.000,-23.351,-18.107

0,1,2,3
Omnibus:,1915.411,Durbin-Watson:,0.767
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4133.805
Skew:,0.834,Prob(JB):,0.0
Kurtosis:,5.092,Cond. No.,32.8


The p-values for all the variables are below 0.05, indicating that every coefficient is significantly different from 0. However, the R-squared value is low (0.199): the model explains only about 20% of the variation in `adr`.

In [10]:
# linear correlation of quantitative variables
data_2016.corr(numeric_only=True)

Unnamed: 0,hotel,arrival_date,overnight,adults,children,babies,meal,reserved_room,booking_changes,deposit_type,agent,company,adr,required_car_parking_spaces,total_of_special_requests,length_of_stay
hotel,1.0,,0.036359,0.014487,0.010708,-0.033059,0.137524,0.140331,-0.052161,-0.076711,0.159478,-0.041852,0.210413,-0.224239,0.104617,-0.203083
arrival_date,,,,,,,,,,,,,,,,
overnight,0.036359,,1.0,0.047937,0.007756,0.002268,-0.006103,0.090382,0.008178,0.007723,0.030353,-0.014864,0.217622,0.030849,0.028713,0.137157
adults,0.014487,,0.047937,1.0,0.02766,0.028735,-0.016639,0.046985,-0.058083,-0.020042,0.246565,-0.31562,0.322929,0.033288,0.221732,0.151224
children,0.010708,,0.007756,0.02766,1.0,0.019674,0.015286,-0.008462,0.034471,-0.015567,0.051437,-0.065169,0.31521,0.082233,0.097235,0.010739
babies,-0.033059,,0.002268,0.028735,0.019674,1.0,-0.011308,-0.005681,0.086312,-0.006573,-0.009929,-0.014124,0.041228,0.03064,0.089007,0.011219
meal,0.137524,,-0.006103,-0.016639,0.015286,-0.011308,1.0,-0.026848,-0.024402,0.00786,0.023292,0.011297,0.016404,0.018936,0.06911,-0.04914
reserved_room,0.140331,,0.090382,0.046985,-0.008462,-0.005681,-0.026848,1.0,-0.063061,-0.001762,0.09626,-0.088399,0.153259,-0.025904,0.04259,0.113516
booking_changes,-0.052161,,0.008178,-0.058083,0.034471,0.086312,-0.024402,-0.063061,1.0,0.03677,-0.084839,0.076017,0.002696,0.039855,-0.024432,0.119907
deposit_type,-0.076711,,0.007723,-0.020042,-0.015567,-0.006573,0.00786,-0.001762,0.03677,1.0,-0.091853,0.121825,-0.04017,-0.002405,-0.060674,-0.007937


In [11]:
# model 2 
y = data_2016['adr']
x2 = data_2016[['hotel', 'overnight', 'adults', 'children', 'babies', 'reserved_room', 'agent',
                    'company','total_of_special_requests']]
x = sm.add_constant(x2)

In [12]:
results = sm.OLS(y,x).fit()
results.summary()

0,1,2,3
Dep. Variable:,adr,R-squared:,0.296
Model:,OLS,Adj. R-squared:,0.296
Method:,Least Squares,F-statistic:,1698.0
Date:,"Tue, 07 Feb 2023",Prob (F-statistic):,0.0
Time:,16:34:07,Log-Likelihood:,-184800.0
No. Observations:,36370,AIC:,369600.0
Df Residuals:,36360,BIC:,369700.0
Df Model:,9,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-53.4387,2.163,-24.702,0.000,-57.679,-49.198
hotel,16.8326,0.434,38.762,0.000,15.981,17.684
overnight,82.8740,1.984,41.770,0.000,78.985,86.763
adults,23.8458,0.425,56.152,0.000,23.013,24.678
children,35.0294,0.528,66.390,0.000,33.995,36.064
babies,10.5314,1.817,5.795,0.000,6.970,14.093
reserved_room,11.0403,0.521,21.185,0.000,10.019,12.062
agent,-1.6754,0.717,-2.338,0.019,-3.080,-0.271
company,-10.0728,1.024,-9.835,0.000,-12.080,-8.065

0,1,2,3
Omnibus:,8913.732,Durbin-Watson:,0.943
Prob(Omnibus):,0.0,Jarque-Bera (JB):,30706.493
Skew:,1.22,Prob(JB):,0.0
Kurtosis:,6.782,Cond. No.,39.1


In the 2016 data, the R-squared value (0.296) is higher than in 2015 (0.199), although this may partly reflect the larger number of independent variables correlated with `adr` that were included in the model.
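
Adding predictors can never lower R-squared, so the adjusted R-squared reported next to it in each summary is the fairer figure when models differ in size. A minimal sketch of the standard adjustment, fed with the R-squared values, observation counts and predictor counts from the two summaries above; the helper name `adjusted_r2` is hypothetical:

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values read off the 2015 and 2016 OLS summaries
adj_2015 = adjusted_r2(0.199, n_obs=13854, n_predictors=5)
adj_2016 = adjusted_r2(0.296, n_obs=36370, n_predictors=9)

print(round(adj_2015, 4), round(adj_2016, 4))
```

With tens of thousands of observations the penalty for a handful of extra predictors is tiny, which is why the adjusted value matches the plain R-squared to three decimal places in both summaries.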

In [13]:
# linear correlation of quantitative variables
data_2017.corr(numeric_only=True)

Unnamed: 0,hotel,arrival_date,overnight,adults,children,babies,meal,reserved_room,booking_changes,deposit_type,agent,company,adr,required_car_parking_spaces,total_of_special_requests,length_of_stay
hotel,1.0,,0.01182,0.032271,-0.012983,-0.042546,0.15104,0.167948,-0.052114,-0.020631,0.1545,-0.072233,0.136416,-0.24954,0.027329,-0.22219
arrival_date,,,,,,,,,,,,,,,,
overnight,0.01182,,1.0,0.071101,0.012701,0.006945,0.025633,0.074197,-0.011849,0.001391,0.049664,-0.002503,0.153414,0.009164,0.03531,0.097563
adults,0.032271,,0.071101,1.0,0.048903,0.02842,0.063298,0.1153,-0.080757,0.005692,0.282459,-0.304863,0.376207,0.022105,0.173223,0.146924
children,-0.012983,,0.012701,0.048903,1.0,0.033077,0.014482,0.029508,0.055639,0.004771,0.055318,-0.065874,0.349891,0.05818,0.066058,0.021483
babies,-0.042546,,0.006945,0.02842,0.033077,1.0,-0.007172,-0.016343,0.101824,-0.001802,0.010972,-0.02677,0.032768,0.044456,0.084183,0.027725
meal,0.15104,,0.025633,0.063298,0.014482,-0.007172,1.0,0.078038,-0.031469,0.00218,0.096354,-0.019097,0.047907,0.008442,0.068909,-0.043955
reserved_room,0.167948,,0.074197,0.1153,0.029508,-0.016343,0.078038,1.0,-0.059905,-0.005372,0.144692,-0.127164,0.181377,-0.058434,0.044752,0.11025
booking_changes,-0.052114,,-0.011849,-0.080757,0.055639,0.101824,-0.031469,-0.059905,1.0,-0.005068,-0.076488,0.060255,0.024699,0.049137,0.015763,0.098429
deposit_type,-0.020631,,0.001391,0.005692,0.004771,-0.001802,0.00218,-0.005372,-0.005068,1.0,-0.044565,0.059245,-0.016404,0.001375,-0.012929,-0.018053


In [18]:
# model 3 - predictors match the summary below (Df Model: 4)
y = data_2017['adr']
x2 = data_2017[['overnight', 'adults', 'children', 'total_of_special_requests']]
x = sm.add_constant(x2)

In [19]:
results = sm.OLS(y,x).fit()
results.summary()

0,1,2,3
Dep. Variable:,adr,R-squared:,0.283
Model:,OLS,Adj. R-squared:,0.283
Method:,Least Squares,F-statistic:,2461.0
Date:,"Tue, 07 Feb 2023",Prob (F-statistic):,0.0
Time:,16:49:48,Log-Likelihood:,-130330.0
No. Observations:,24942,AIC:,260700.0
Df Residuals:,24937,BIC:,260700.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-50.6827,3.972,-12.761,0.000,-58.467,-42.898
overnight,88.5079,3.922,22.564,0.000,80.820,96.196
adults,33.9999,0.563,60.340,0.000,32.896,35.104
children,41.2310,0.685,60.199,0.000,39.888,42.573
total_of_special_requests,7.6872,0.326,23.585,0.000,7.048,8.326

0,1,2,3
Omnibus:,2457.412,Durbin-Watson:,0.978
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5029.888
Skew:,0.641,Prob(JB):,0.0
Kurtosis:,4.788,Cond. No.,49.1


In the 2017 data, the R-squared value (0.283) is higher than in 2015 (0.199) but slightly lower than in 2016 (0.296), so the variation in `adr` is better explained than in 2015 only. The adjusted R-squared equals the R-squared to three decimal places and every coefficient has a p-value below 0.05, so there is no indication of non-significant variables in the model.