# Hotel Rate Regression Modeling

---

**Rate Regression for Forecasting**

In addition to predicting/forecasting reservation cancellations, hotels rely heavily on forecasting rates and occupancy.

Using the ADR (*Average Daily Rate*) feature from the dataset, I will train predictive models that could be used to forecast future ADR based on the given features.

The end result will be one (or more) models that would use specific features known *a priori* as the exogenous features to predict the ADR in dollars. After training the models, I will review their performances; determine which models are most appropriate (based on use cases and stakeholders); then provide three recommendations to help maximize ADR.

I will start with linear regression models to obtain the coefficients for ease-of-interpretation. Additionally, I will test a few tree-based models to compare performance; however, their results will require more effort to interpret for actionable insights.
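
As a minimal sketch of why linear coefficients are easy to interpret, the snippet below fits a `Ridge` model on a small synthetic dataset and reads the coefficients back by feature name. The feature names (`lead_time`, `is_weekend`), the synthetic ADR formula, and `alpha=1.0` are illustrative assumptions, not values taken from the hotel data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500

# Hypothetical features; the real dataset's columns may differ
X_demo = pd.DataFrame({
    'lead_time': rng.integers(0, 200, size=n),
    'is_weekend': rng.integers(0, 2, size=n),
})

# Synthetic ADR: lower for long lead times, higher on weekends, plus noise
y_demo = (100 - 0.1 * X_demo['lead_time']
          + 20 * X_demo['is_weekend']
          + rng.normal(0, 5, size=n))

ridge_demo = Pipeline(steps=[('scaler', StandardScaler()),
                             ('ridge', Ridge(alpha=1.0))])
ridge_demo.fit(X_demo, y_demo)

# Each coefficient is the change in ADR per one standard deviation of the feature
coefs_demo = pd.Series(ridge_demo.named_steps['ridge'].coef_, index=X_demo.columns)
print(coefs_demo.sort_values())
```

Because the features are standardized, the coefficients are comparable in magnitude, which is the kind of direct reading a tree-based model does not offer without extra tools such as permutation importance.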

---

In [None]:
## Automatically reload edited modules before executing code
%load_ext autoreload
%autoreload 2

In [None]:
## Enabling access to custom functions in separate directory

# Import necessary modules
import os
import sys

# Construct the absolute path to the 'src' directory
src_path = os.path.abspath(os.path.join('..', 'src'))

# Append the path to 'sys.path'
if src_path not in sys.path:
    sys.path.append(src_path)

import db_utils, eda

In [None]:
## Data Handling
import pandas as pd
import numpy as np

import holidays

## Visualizations
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

## Custom-made Functions
# from src import eda

In [None]:
## Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 50)
%matplotlib inline

In [None]:
## SKLearn and Modeling Tools

from sklearn import metrics
from sklearn import set_config
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression, Ridge#, RANSACRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

set_config(transform_output='pandas')

In [None]:
# data1 = pd.read_parquet('../data/Datasets_for_Arrival_Date/Feature_Engineering/H1_T_Date_Features.parquet')
# data1['HotelNumber'] = 1

# data2 = pd.read_parquet('../data/Datasets_for_Arrival_Date/Feature_Engineering/H2_T_Date_Features.parquet')
# data2['HotelNumber'] = 2

# df_data = pd.concat([data1, data2], axis = 0).reset_index(drop=True)
# df_data.head()

In [None]:
# backup_data_path = '../data/data_condensed_with_uuid.parquet'

# data = pd.read_parquet(backup_data_path)

# filepath = '../data/engineered_data_dates.parquet'

# df_dates = pd.read_parquet(filepath)

# data = data.merge(right = df_dates, how = 'left', on = 'UUID')

# drop_feats = ['UUID','LeadTime', 'ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateWeekNumber',
#               'ArrivalDateDayOfMonth', 'StaysInWeekendNights', 'StaysInWeekNights',
#               'ReservationStatusDate_x', 'ReservationStatusDate_y', 'ArrivalDate',
#               'DepartureDate', 'BookingDate']

# data1 = data.drop(columns = drop_feats)

In [None]:
filepath = '../data/source/H2.parquet'

data1 = pd.read_parquet(filepath)

In [None]:
data1.info()

# Location, Location, Location!

---

**Additional Data About Locations and Types**

Two details that significantly affect pricing are location (rural vs. urban) and hotel class (e.g., select-service vs. premium vs. luxury). The original dataset does not include these details, but the authors of the academic paper from which I sourced the data state the following about the hotels:

> *Both hotels are located in Portugal: H1 at the resort region of Algarve and H2 at the city of Lisbon.*

Modeling each hotel separately mirrors a real-life revenue forecasting scenario: when predicting or setting rates, hotel managers consider each hotel on its own. Two models create slightly more work, but I expect this approach to improve model performance.

Using this additional information, I can perform further feature engineering based on each hotel's location and related historical data, such as weather and temperatures. Furthermore, I could use the location as a new feature and concatenate the datasets for an overall interpretation.
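
The idea of concatenating the datasets with location as a feature could look like the sketch below. The two small frames are placeholders standing in for the H1 and H2 datasets, and the `Location` column name and the `Adults` column are assumptions made for illustration.

```python
import pandas as pd

# Tiny placeholder frames standing in for the two hotel datasets
h1_demo = pd.DataFrame({'ADR': [80.0, 120.0], 'Adults': [2, 2]})
h2_demo = pd.DataFrame({'ADR': [95.0, 110.0, 130.0], 'Adults': [1, 2, 2]})

# Label each hotel with the location given in the source article
h1_demo['Location'] = 'Algarve'
h2_demo['Location'] = 'Lisbon'

# Stack the rows and rebuild a clean 0..n-1 index
combined_demo = pd.concat([h1_demo, h2_demo], axis=0, ignore_index=True)

print(combined_demo.groupby('Location')['ADR'].mean())
```

With the location stored as a categorical column, a single model can learn a location effect through one-hot encoding, while the per-hotel models remain available for comparison.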

---

## Regression: Hotel 1

---

I will start with the first of the two separate hotel datasets.

This hotel is located in Algarve, Portugal, and as the location is described as a *"resort region,"* I will assume it is a resort-style hotel.



---

In [None]:
# data1.head()

In [None]:
# ## Keeping zero-dollar ADR just in case it's useful
# data1 = data1[data1['ADR'] > 0]
# data1

In [None]:
# drop_cols = ['IsCanceled', 'ReservationStatus', 'ReservationStatusDate', 'HotelNumber']
# data1 = data1.drop(columns = drop_cols).reset_index(drop=True)
# data1

In [None]:
# data1['ADR'].describe()

# Drop Negative ADR Values

I need to apply a logarithmic transformation to the target feature to reduce its skew. Because the logarithm is undefined for zero and negative numbers, I need to ensure that all of my target values are strictly greater than zero.
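
A minimal sketch of this step on synthetic data follows: rows whose target is zero or negative are dropped, then the regressor is wrapped in `TransformedTargetRegressor` so that it trains on `log(ADR)` and predicts in dollars. The synthetic frame, the single feature `x`, and the choice of `LinearRegression` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic, right-skewed target standing in for ADR
demo = pd.DataFrame({'x': rng.uniform(0, 1, size=200)})
demo['ADR'] = np.exp(4 + demo['x'] + rng.normal(0, 0.1, size=200))

# Inject two invalid rows to show what the filter removes
demo.loc[0, 'ADR'] = 0.0
demo.loc[1, 'ADR'] = -5.0

# Keep only rows where the logarithm is defined
demo_pos = demo[demo['ADR'] > 0].reset_index(drop=True)

# Fit on log(ADR); predictions are mapped back with exp
ttr_demo = TransformedTargetRegressor(regressor=LinearRegression(),
                                      func=np.log, inverse_func=np.exp)
ttr_demo.fit(demo_pos[['x']], demo_pos['ADR'])

preds_demo = ttr_demo.predict(demo_pos[['x']])
print(preds_demo[:3])
```

If zero-dollar rates need to be kept, `np.log1p` with `np.expm1` as the inverse is one alternative, since `log1p(0)` is defined; negative values would still have to be removed.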

In [None]:
# data1['ADR'].describe()

In [None]:
# data1 = data1[data1['ADR'] > 0].reset_index(drop=True)
# data1

In [None]:
# data1['ADR'].describe()

# Outliers in Target Feature

In [None]:
# from scipy.stats import zscore

# z_scores = zscore(data1['ADR'])
# abs_z_scores = np.abs(z_scores)
# filtered_entries = (abs_z_scores < 3)  # Threshold of 3
# cleaned_data = data1[filtered_entries].reset_index(drop=True)
# cleaned_data

In [None]:
# data1[(abs_z_scores >= 3)]

# Train/Test Split

In [None]:
## Splitting data into features and target variables.
target = 'ADR'

X = data1.drop(columns = [target]).copy()
y = data1[target].copy()

In [None]:
## Splitting into train/test sets (75/25); no stratification, since the target is continuous
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25, 
                                                    random_state = 42)

# Inspecting Target Feature

In [None]:
# y_train.describe()

In [None]:
# y_train.plot(kind ='kde');

In [None]:
# y_train.apply(np.log10).plot(kind = 'kde');

# Pipeline and Modeling

In [None]:
## Specifying numeric columns for preprocessing
num_cols = X_train.select_dtypes('number').columns.to_list()

## Specifying categorical columns for preprocessing
cat_cols = X_train.select_dtypes(include='object').columns.to_list()

In [None]:
## Creating ColumnTransformer and sub-transformers for imputation and encoding

### --- Creating column pipelines --- ###

cat_pipe = Pipeline(steps=[('cat_imputer', SimpleImputer(add_indicator=True,
                                                         strategy='most_frequent')),
                           ('ohe', OneHotEncoder(drop = 'if_binary',
                                                 handle_unknown='ignore',
                                                 sparse_output=False))])

num_pipe = Pipeline(steps=[('num_imputer', SimpleImputer(add_indicator=True,strategy='median')),
                          ('scaler',StandardScaler())])

### --- Instantiating the ColumnTransformer --- ###
preprocessor = ColumnTransformer(
    transformers=[('num', num_pipe, num_cols),
                  ('cat', cat_pipe, cat_cols)])

preprocessor

In [None]:
## Transform via the ColumnTransformer preprocessor and create new dataframe

preprocessor.fit(X_train)

X_train_df = preprocessor.transform(X_train)

X_test_tf_df = preprocessor.transform(X_test)

# display(X_train_df.head(5),X_test_tf_df.head(5))

In [None]:
hgbr = HistGradientBoostingRegressor(l2_regularization = .01, random_state=42)
hgbr.fit(X_train_df, y_train)

# ttr_hgbr = TransformedTargetRegressor(hgbr, func=np.log, inverse_func=np.exp)
# ttr_hgbr.fit(X_train_df, y_train)

In [None]:
## Cross-validated R^2 on the training data only
train_scores = cross_val_score(hgbr, X_train_df, y_train, cv=5, n_jobs=-1)

## Held-out R^2 from the model fit above; the test set is scored, never refit
test_score = hgbr.score(X_test_tf_df, y_test)

print(f'The mean CV training score is: {train_scores.mean():.2f}.\nThe test score is: {test_score:.2f}.')

In [None]:
y_pred = hgbr.predict(X_test_tf_df)

## Note: MAPE is unstable when y_test contains zero or near-zero ADR values
mape = round(metrics.mean_absolute_percentage_error(y_test, y_pred), 2)

rmse = round(np.sqrt(metrics.mean_squared_error(y_test, y_pred)), 2)

print(f'The MAPE score is: {mape}.\nThe RMSE is: {rmse}.')

In [None]:
# hgbc = TransformedTargetRegressor(regressor=HistGradientBoostingRegressor(random_state=42),
#                                 func=np.log, inverse_func=np.exp)
# hgbc.fit(X_train_df, y_train)

# print(f'The training score is: {hgbc.score(X = X_train_df, y = y_train):.4f}.',
#       f'\nThe test score is: {hgbc.score(X = X_test_tf_df, y = y_test):.4f}.')

In [None]:
## Permutation importance on the held-out set, using the fitted hgbr model
## (ttr_hgbr is only defined in commented-out code above)
perm_imp = permutation_importance(hgbr, X_test_tf_df, y_test, scoring = 'r2', n_jobs = -1, random_state = 42)

## Index by the transformed column names so the results are readable
perm_imp_results = pd.Series(data = perm_imp.importances_mean, index = X_test_tf_df.columns).sort_values(ascending = False)
perm_imp_results.head()

In [None]:
rfr = RandomForestRegressor(n_jobs = -1, random_state = 42)

rfr.fit(X_train_df, y_train)

round(rfr.score(X = X_test_tf_df, y = y_test), 4)

In [None]:
# lr = LinearRegression()

# lr.fit(X_train_df, y_train)

# lr.score(X = X_test_tf_df, y = y_test).round(4)

In [None]:
# rr = RANSACRegressor()

# rr.fit(X_train_df, y_train)

# rr.score(X = X_test_tf_df, y = y_test).round(4)