# Hotel Rate Regression Modeling

---

**Rate Regression for Forecasting**

In addition to forecasting reservation cancellations, hotels rely heavily on forecasting rates and occupancy.

Using the ADR (*Average Daily Rate*) feature from the dataset, I will train predictive models that could be used to forecast future ADR based on the given features.

The end result will be one or more models that use features known *a priori* as exogenous inputs to predict ADR in dollars. After training the models, I will review their performance, determine which models are most appropriate for each use case and stakeholder, and then provide three recommendations to help maximize ADR.

I will start with linear regression models to obtain the coefficients for ease-of-interpretation. Additionally, I will test a few tree-based models to compare performance; however, their results will require more effort to interpret for actionable insights.
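
As a minimal sketch of why linear models are easy to interpret, the example below fits a `LinearRegression` on synthetic data and reads the coefficients back as dollars of ADR per unit of each feature. The column names (`lead_time_days`, `is_weekend`) and the effect sizes are made up for illustration and are not taken from the hotel dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data; the column names are hypothetical
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    'lead_time_days': rng.integers(0, 200, size=500),
    'is_weekend': rng.integers(0, 2, size=500),
})

# Simulated ADR built from known effects plus noise
y_demo = (80
          + 0.05 * X_demo['lead_time_days']
          + 25 * X_demo['is_weekend']
          + rng.normal(0, 5, size=500))

lr_demo = LinearRegression().fit(X_demo, y_demo)

# Each coefficient is the expected change in ADR for a one-unit change
# in that feature, holding the other feature fixed
coefs = pd.Series(lr_demo.coef_, index=X_demo.columns)
print(coefs)
```

Note that once features are standardized (as in the preprocessing pipeline later in this notebook), coefficients are per standard deviation rather than per raw unit.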

---

In [1]:
## Automatically reload imported modules before executing code
%load_ext autoreload
%autoreload 2

In [2]:
## Enabling access to custom functions in separate directory

# Import necessary modules
import os
import sys

# Construct the absolute path to the 'src' directory
src_path = os.path.abspath(os.path.join('..', 'src'))

# Append the path to 'sys.path'
if src_path not in sys.path:
    sys.path.append(src_path)

import db_utils, eda

In [3]:
## Data Handling
import pandas as pd
import numpy as np

import holidays

## Visualizations
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

## Custom-made Functions
# from src import eda

In [4]:
## Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 50)
%matplotlib inline

In [5]:
## SKLearn and Modeling Tools

from sklearn import metrics
from sklearn import set_config
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression, Ridge#, RANSACRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

set_config(transform_output='pandas')

In [54]:
backup_data_path = '../data/data_condensed_with_uuid.parquet'

data = pd.read_parquet(backup_data_path)

filepath = '../data/engineered_data_dates.parquet'

df_dates = pd.read_parquet(filepath)

data = data.merge(right = df_dates, how = 'left', on = 'UUID')

drop_feats = ['UUID','LeadTime', 'ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateWeekNumber',
              'ArrivalDateDayOfMonth', 'StaysInWeekendNights', 'StaysInWeekNights',
              'ReservationStatusDate_x', 'ReservationStatusDate_y', 'ArrivalDate',
              'DepartureDate', 'BookingDate']

df_data = data.drop(columns = drop_feats)

In [55]:
df_data = pd.read_parquet('../data/data_prepped.parquet')

In [56]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 35 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   IsCanceled                       119390 non-null  int64  
 1   Adults                           119390 non-null  int64  
 2   Children                         119386 non-null  float64
 3   Babies                           119390 non-null  int64  
 4   Meal                             119390 non-null  object 
 5   Country                          118902 non-null  object 
 6   MarketSegment                    119390 non-null  object 
 7   DistributionChannel              119390 non-null  object 
 8   IsRepeatedGuest                  119390 non-null  int64  
 9   PreviousCancellations            119390 non-null  int64  
 10  PreviousBookingsNotCanceled      119390 non-null  int64  
 11  ReservedRoomType                 119390 non-null  object 
 12  As

# Train/Test Split

In [59]:
## Splitting data into features and target variables.
target = 'ADR'

X = df_data.drop(columns = [target]).copy()
y = df_data[target].copy()

In [61]:
## Splitting into train/test sets (75/25); no stratification, since the target is continuous
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size = .75,
                                                    test_size = .25,
                                                    random_state = 42)

# Pipeline and Modeling

In [62]:
## Specifying numeric columns for preprocessing
num_cols = X_train.select_dtypes('number').columns.to_list()

## Specifying categorical columns for preprocessing
cat_cols = X_train.select_dtypes(include='object').columns.to_list()

In [63]:
## Creating ColumnTransformer and sub-transformers for imputation and encoding

### --- Creating column pipelines --- ###

cat_pipe = Pipeline(steps=[('cat_imputer', SimpleImputer(add_indicator=True,
                                                         strategy='most_frequent')),
                           ('ohe', OneHotEncoder(drop = 'if_binary',
                                                 handle_unknown='ignore',
                                                 sparse_output=False))])

num_pipe = Pipeline(steps=[('num_imputer', SimpleImputer(add_indicator=True,strategy='median')),
                          ('scaler',StandardScaler())])

### --- Instantiating the ColumnTransformer --- ###
preprocessor = ColumnTransformer(
    transformers=[('num', num_pipe, num_cols),
                  ('cat', cat_pipe, cat_cols)])

preprocessor

In [64]:
## Transform via the ColumnTransformer preprocessor and create new dataframe

preprocessor.fit(X_train)

X_train_df = preprocessor.transform(X_train)

X_test_tf_df = preprocessor.transform(X_test)

# display(X_train_df.head(5),X_test_tf_df.head(5))



In [65]:
hgbr = HistGradientBoostingRegressor(random_state=42)
hgbr.fit(X_train_df, y_train)

# ttr_hgbr = TransformedTargetRegressor(hgbr, func=np.log, inverse_func=np.exp)
# ttr_hgbr.fit(X_train_df, y_train)

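## NOTE: cross_val_score clones and refits the estimator on each fold, so both
## values below are mean 5-fold CV R^2 scores (computed within the training set
## and within the test set, respectively), not scores from the `hgbr` fitted above.
train_scores = cross_val_score(hgbr, X_train_df, y_train, cv=5, n_jobs=-1)
test_scores = cross_val_score(hgbr, X_test_tf_df, y_test, cv=5, n_jobs=-1)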

print(f'The training score is: {round(train_scores.mean(), 2)}.\nThe test score is: {round(test_scores.mean(), 2)}.')

The training score is: 0.7.
The test score is: 0.79.


In [66]:
y_pred = hgbr.predict(X_test_tf_df)

mape = metrics.mean_absolute_percentage_error(y_test, y_pred).round(2)

rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred)).round(2)

print(f'The MAPE score is: {mape}.\nThe RMSE is: {rmse}.')

The MAPE score is: 2892820705959083.0.
The RMSE is: 17.76.
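
The enormous MAPE is most likely caused by rows where the true ADR is zero: scikit-learn's MAPE divides each absolute error by `max(|y_true|, eps)`, so a single zero-valued target contributes a huge term. The sketch below reproduces the effect on four made-up values and shows two possible workarounds: computing MAPE on non-zero targets only, or reporting MAE instead. Which option suits this project is a judgment call.

```python
import numpy as np
from sklearn import metrics

# Made-up values; the first true rate is zero (e.g., a complimentary stay)
y_true_demo = np.array([0.0, 100.0, 120.0, 80.0])
y_pred_demo = np.array([5.0, 110.0, 114.0, 84.0])

# MAPE over all rows: the zero target dominates the mean
mape_all = metrics.mean_absolute_percentage_error(y_true_demo, y_pred_demo)

# Option 1: restrict MAPE to rows with a strictly positive target
mask = y_true_demo > 0
mape_nonzero = metrics.mean_absolute_percentage_error(y_true_demo[mask],
                                                      y_pred_demo[mask])

# Option 2: report MAE, which is in dollars and well-defined at zero
mae = metrics.mean_absolute_error(y_true_demo, y_pred_demo)

print(f'MAPE (all rows):     {mape_all:.3g}')
print(f'MAPE (non-zero ADR): {mape_nonzero:.4f}')
print(f'MAE:                 {mae:.2f}')
```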


In [None]:
# hgbc = TransformedTargetRegressor(regressor=HistGradientBoostingRegressor(random_state=42),
#                                 func=np.log, inverse_func=np.exp)
# hgbc.fit(X_train_df, y_train)

# print(f'The training score is: {hgbc.score(X = X_train_df, y = y_train):.4f}.',
#       f'\nThe test score is: {hgbc.score(X = X_test_tf_df, y = y_test):.4f}.')

In [None]:
## `ttr_hgbr` is only defined in commented-out code above, so use the fitted `hgbr`
perm_imp = permutation_importance(hgbr, X_test_tf_df, y_test, scoring = 'r2', n_jobs = -1, random_state = 42)

## Label each importance with its transformed feature name
perm_imp_results = pd.Series(data = perm_imp.importances_mean, index = X_test_tf_df.columns).sort_values(ascending = False)
perm_imp_results.head()

In [None]:
rfr = RandomForestRegressor(n_jobs = -1, random_state = 42)

rfr.fit(X_train_df, y_train)

## Built-in round() works whether score() returns a Python float or a NumPy float
round(rfr.score(X = X_test_tf_df, y = y_test), 4)

In [None]:
# lr = LinearRegression()

# lr.fit(X_train_df, y_train)

# lr.score(X = X_test_tf_df, y = y_test).round(4)

In [None]:
# rr = RANSACRegressor()

# rr.fit(X_train_df, y_train)

# rr.score(X = X_test_tf_df, y = y_test).round(4)