In [None]:
import pandas as pd
import numpy as np
import plotnine as pln
import warnings
import seaborn as sns
import matplotlib.pyplot as plt

from utils import merge_small_groups, replace_rare_values

Dataset downloaded from: https://www.kaggle.com/jessemostipak/hotel-booking-demand

Under the same address, the description of columns is also available.

In [None]:
bookings = pd.read_csv('../data/hotel_bookings.csv')
bookings.loc[:, 'y'] = (bookings['stays_in_week_nights'] + bookings['stays_in_weekend_nights']) >= 7
# we drop the columns that were used to create our dependent variable so that we do not accidentally include them in training
bookings = bookings.drop(columns=['stays_in_week_nights', 'stays_in_weekend_nights'])

In [None]:
bookings.shape

In [None]:
# we may need months to be sorted or we may not - better safe than sorry
month_cat = pd.CategoricalDtype(['January', 'February', 'March', 'April', 'May', 'June', 'July', 
                                 'August', 'September', 'October', 'November', 'December'],
                                ordered=True)
# plain column assignment so that the ordered categorical dtype is actually kept
bookings['arrival_date_month'] = bookings['arrival_date_month'].astype(month_cat)

## Some EDA, data munging and cleaning

In [None]:
# let's start with the plot of the dependent variable (y)
# y is discrete (True / False), so a bar chart is used rather than a binned histogram
(pln.ggplot(bookings)
 + pln.geom_bar(pln.aes(x='factor(y)'))
 + pln.labs(x='# nights >= 7', y='', title='At least 7 nights spent in a hotel?')
 + pln.theme_bw()
).draw();

In [None]:
round(bookings['y'].value_counts().max() / bookings['y'].value_counts().sum(), 3)

The distribution is imbalanced. An imbalance of roughly 1:9 is considerable, but, at least initially, it does not call for any special cross-validation scheme, so we proceed normally. This is all the more reasonable because we will model with the LightGBM algorithm, where proper hyperparameter tuning can alleviate imbalance-related issues. On top of that, we will train against the ROC AUC metric, which is, if not imbalance-proof, at least fairly robust to it.
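
To illustrate why ROC AUC is a safer choice than accuracy under this kind of imbalance, here is a minimal sketch on synthetic scores (not the booking data). The helper `rank_auc` and the toy score distributions are assumptions made for illustration only; the modeling part relies on the `roc_auc` scorer from scikit-learn.

```python
import numpy as np

def rank_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outranks a random negative."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true]
    neg = scores[~y_true]
    # compare every positive with every negative; ties count as half a win
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (pos.size * neg.size)

rng = np.random.default_rng(90)
# synthetic 1:9 imbalance: 100 positives and 900 negatives
y = np.r_[np.ones(100, dtype=bool), np.zeros(900, dtype=bool)]
informative = np.r_[rng.normal(1.0, 1.0, 100), rng.normal(0.0, 1.0, 900)]
constant = np.zeros(1000)  # a "model" that always predicts the majority class

accuracy_of_constant = np.mean((constant > 0.5) == y)  # looks good purely because of the imbalance
auc_of_constant = rank_auc(y, constant)                # all ties, so no better than chance
auc_of_informative = rank_auc(y, informative)          # rewards the actual ranking ability
```

The majority-class "model" reaches 90% accuracy but only 0.5 ROC AUC, while the informative scores are rewarded for ranking positives above negatives regardless of the class ratio.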

#### Draw heatmap with respect to month and day of the month

Before we encode the categorical values, let us use the original ones to draw a heatmap of the percentage of long stays that started on a given day of a given month.

We clearly see that there are periods during the year that favor long stays - especially the holiday period (July, August) and the days right before Christmas. Conversely, the end of November and the beginning of December are rather 'dead' spells. Additionally, we see that there has been at least one leap year in the period we are looking at (2016) :-) We may infer from the graph that the two features used to plot it - `arrival_date_day_of_month` and `arrival_date_month` - will contribute significantly to the predictions, as y varies along both of them.

In [None]:
heatmap_data = pd.crosstab(columns=bookings['arrival_date_day_of_month'], index=bookings['arrival_date_month'],
                           values=bookings['y'], aggfunc=lambda x: x.sum() / x.shape[0])
plt.figure(figsize=(25, 10))
sns.heatmap(heatmap_data, annot=True, fmt=".1%")
plt.title('Percentage of long stays (>=7 nights) with respect to month and day of month')
plt.xlabel('Day of month')
plt.ylabel('Month')
plt.show()
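
The aggregation behind the heatmap is just the mean of a boolean column within each (month, day) cell. The toy frame below is made up for illustration and shows the same `pd.crosstab` call on a handful of rows, using the built-in `'mean'` aggregation in place of the lambda.

```python
import pandas as pd

# made-up bookings: month and day of arrival, plus the long-stay flag
toy = pd.DataFrame({
    'month': ['July', 'July', 'July', 'December', 'December'],
    'day':   [1, 1, 2, 1, 1],
    'y':     [True, False, True, False, False],
})

# the mean of a boolean column is the share of True values in each cell
share = pd.crosstab(index=toy['month'], columns=toy['day'],
                    values=toy['y'], aggfunc='mean')
print(share)
```

Cells without any observation (here: December, day 2) come out as NaN, which is why dates that do not exist in the calendar appear as empty cells in the heatmap.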

---

In [None]:
# reservation_status_date seems to be a technical column and we should not (probably) be using it in the training
# all we need is the status and not exactly when it was set
bookings = bookings.drop(columns=['reservation_status_date'])

In [None]:
# we need to handle categorical columns separately - they cannot be used as-is in the modeling part
# well, technically they could, as LightGBM supports them, but we could, theoretically, be using other algorithms
categorical_columns = bookings.select_dtypes(include=['object', 'category']).columns

In [None]:
# let's plot a barplot for every categorical variable to find out their distributions
for cat_col in categorical_columns:
    n_cats = bookings[cat_col].unique().shape[0]
    g = (pln.ggplot(pln.aes(x=cat_col), data=bookings)
     + pln.geom_bar()
     + pln.geom_text(pln.aes(label='stat(count)'), stat='count', nudge_y=0.125, va='bottom')
     + pln.labs(y='', x=cat_col.upper(), title=f'Distribution of the variable {cat_col.upper()}')
     + pln.theme_bw()
    )
    if n_cats > 5:
        g = g + pln.theme(axis_text_x=pln.element_text(angle=90))
    g.draw();

At this point, we can observe two problems we need to deal with:
- There are a number of small classes that will not generalize well if included in training. We will set an arbitrary threshold of 200 - all the classes with a count below that value will be dropped from the dataset. There are only a few such classes, so the overall number of observations will not change significantly.
- We cannot, however, apply this procedure to the `COUNTRY` variable, as it consists mostly of small classes, since there are numerous countries in the dataset. If we removed them, we would lose quite a significant number of observations (~3500). We will therefore merge all countries with fewer than 200 observations into one group called `OTH` (other) and treat them as one. We will lose some of the signal contained in this variable (if any), but, on the other hand, the findings will be more generalizable and we will avoid fitting too tightly to small groups that are potentially less informative (or outright misleading).
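
The two helpers used below, `merge_small_groups` and `replace_rare_values`, come from the local `utils` module, which is not shown in this notebook. The sketch below is only a guess at how such helpers could work: the `_sketch` function names, the `threshold` and `other_label` parameters and the toy data are all assumptions, not the actual implementation.

```python
import numpy as np
import pandas as pd

def merge_small_groups_sketch(df, column, threshold=200, other_label='OTH'):
    """Relabel categories of `column` seen fewer than `threshold` times as `other_label`."""
    counts = df[column].value_counts()
    small = counts[counts < threshold].index
    out = df.copy()
    out[column] = out[column].where(~out[column].isin(small), other_label)
    return out

def replace_rare_values_sketch(df, columns, threshold=200):
    """Set categories seen fewer than `threshold` times to NaN in each of `columns`."""
    out = df.copy()
    for col in columns:
        counts = out[col].value_counts()
        rare = counts[counts < threshold].index
        out[col] = out[col].where(~out[col].isin(rare), np.nan)
    return out

# toy data with a lowered threshold so that the effect is visible
toy = pd.DataFrame({'country': ['PRT'] * 5 + ['ESP'] * 2 + ['FRA'],
                    'meal': ['BB'] * 6 + ['FB'] * 2})
merged = merge_small_groups_sketch(toy, 'country', threshold=3)   # ESP and FRA become OTH
cleaned = replace_rare_values_sketch(merged, ['meal'], threshold=3).dropna()  # FB rows are dropped
```

Merging keeps every row and only coarsens the labels, whereas replacing with NaN followed by `dropna()` removes rows. That is the trade-off described in the two points above.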

#### Cleaning the categorical variables

In [None]:
# first, let's merge small groups into the OTH (other) group for the `country` column
bookings = merge_small_groups(bookings, 'country')

In [None]:
# plot for verification
(pln.ggplot(pln.aes(x='country'), data=bookings)
 + pln.geom_bar()
 + pln.labs(y='', x='Country', title=f'Distribution of the variable COUNTRY after merging small groups')
 + pln.theme_bw()
 + pln.theme(axis_text_x=pln.element_text(angle=90))
).draw();

Let's now drop the infrequent groups in all the categorical columns - we will replace these values with NANs and drop any row with at least one NAN value. But before we do so we should check whether NANs already exist in the dataset.

In [None]:
bookings.isna().sum()

Two columns have a significant number of NANs - `agent` and `company`. We will drop them from the dataset, because imputing them is difficult, if not impossible: the lack of an agent's or company's ID may indicate a more general issue. For instance, the missing IDs may not be present in the reservation system at all. Besides, these are ID columns, which by their nature cannot be imputed properly unless we know that there is a specific pattern associated with them.

In [None]:
bookings = bookings.drop(columns=['agent', 'company'])

In [None]:
# now we replace 'rare' classes with NANs
bookings = replace_rare_values(bookings, categorical_columns)

In [None]:
# we check once again whether the replacing worked and it seems so
bookings.isna().sum()

In [None]:
# now we can remove all the rows with NANs
n_observations_before = bookings.shape[0]
bookings = bookings.dropna()
print(f'{n_observations_before - bookings.shape[0]} rows were removed.')
del n_observations_before

We remove only 856 rows, which, given the volume of data, is not a great issue.

#### Encode categorical / string variables

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
all_encoded_columns = []
for col in categorical_columns:
    ohe = OneHotEncoder(drop='first')  # we drop one of the classes to alleviate the burden of collinearity
    # keep the original index: after dropna() it is no longer a contiguous 0..n-1 range,
    # so a default RangeIndex would misalign (and silently lose) rows in the merge below
    encoded_columns = pd.DataFrame(ohe.fit_transform(bookings[[col]].values).toarray(),
                                   columns=[f'{col}_{x}' for x in ohe.categories_[0][1:]],
                                   index=bookings.index)
    all_encoded_columns.append(encoded_columns)

all_encoded_columns = pd.concat(all_encoded_columns, axis=1)
# replace categorical columns with their encoded counterparts
bookings = bookings.merge(all_encoded_columns, left_index=True, right_index=True).drop(columns=categorical_columns)
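
A detail worth keeping in mind for the index-based merge above: once rows have been removed with `dropna()`, the index has gaps, so a frame built from a bare array (with a fresh 0..n-1 index) no longer lines up with it. The toy example below is a sketch with made-up data; the boolean comparison merely stands in for the encoder output.

```python
import pandas as pd

# after dropna() the index is [0, 2, 3], not [0, 1, 2]
df = pd.DataFrame({'colour': ['red', None, 'blue', 'red']}).dropna()

# stand-in for the dense array returned by an encoder: 1.0 where the colour is red
encoded_values = (df['colour'] == 'red').to_numpy(dtype=float).reshape(-1, 1)

misaligned = pd.DataFrame(encoded_values, columns=['colour_red'])               # index 0, 1, 2
aligned = pd.DataFrame(encoded_values, columns=['colour_red'], index=df.index)  # index 0, 2, 3

lost_rows = df.merge(misaligned, left_index=True, right_index=True)  # inner join on 0 and 2 only
kept_rows = df.merge(aligned, left_index=True, right_index=True)     # all three rows survive
```

With the fresh index, one row disappears and the `blue` row at index 2 picks up the value that belongs to another row. Passing `index=df.index` avoids both problems.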

#### Draw clustermap

In [None]:
# white color to distinguish the dependent variable (y) from the independent variables
cols = pd.Series(np.where(bookings.columns=='y', '#FFFFFF', '#000000'), bookings.columns)
sns.clustermap(bookings.corr(), figsize=(24, 24), row_colors=cols, col_colors=cols)
plt.show()

From the clustermap we can conclude the following:
1. The overall correlation structure is sparse - there aren't many groups of highly correlated variables. Some stand out, like `reserved_room_type` and `assigned_room_type`, or `market_segment` and `distribution_channel_Direct`, which, judging by their names, should not be surprising.
2. Strong negative correlations (close to -1) were introduced by eliminating the infrequent classes in one of the previous steps. In some cases we were left with three classes, one of which was dropped during encoding. If the dropped class happened to be the least frequent one, the two remaining columns are almost perfect complements of each other, hence the high negative correlation.
3. The dependent variable is only weakly correlated with the other features - the one that seems to exhibit the highest correlation is `hotel_Resort Hotel`.
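
The mechanism from point 2 can be reproduced on a toy column. This is a sketch with made-up class counts; `pd.get_dummies(drop_first=True)` stands in for `OneHotEncoder(drop='first')`, as both drop the first category in sorted order.

```python
import pandas as pd

# three classes, one of them rare; the rare one happens to come first alphabetically
values = pd.Series(['A'] * 5 + ['B'] * 500 + ['C'] * 495)

# dropping the first class leaves indicator columns for 'B' and 'C' only
dummies = pd.get_dummies(values, drop_first=True).astype(float)

# almost every row is either B or C, so the two indicators nearly mirror each other
corr_bc = dummies['B'].corr(dummies['C'])
print(round(corr_bc, 3))
```

The closer the dropped class is to empty, the closer the correlation between the two remaining indicators gets to -1.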

## Modeling

In [None]:
from lightgbm import LGBMClassifier
from skopt import BayesSearchCV
from skopt.space import Real, Integer
from sklearn.model_selection import StratifiedKFold, train_test_split

In [None]:
rskf = StratifiedKFold(n_splits=5, shuffle=True, random_state=90)

In [None]:
lgbm_classifier = LGBMClassifier(n_estimators=250, n_jobs=3, random_state=90)
lgbm_classifier = BayesSearchCV(lgbm_classifier, 
                                {'learning_rate': Real(0.01, 0.35, 'uniform'),
                                 'reg_alpha': Real(1e-6, 1, 'log-uniform'), 
                                 'reg_lambda': Real(1e-6, 1, 'log-uniform'), 
                                 'subsample': Real(0.5, 1, 'uniform'), 
                                 'colsample_bytree': Real(0.5, 1, 'uniform'),
                                 'max_depth': Integer(10, 30)
                                },
                                scoring='roc_auc',
                                n_iter=30,
                                n_jobs=2,
                                n_points=1,
                                cv=rskf,
                                refit=True,
                                random_state=90)

In [None]:
# let's leave 2500 observations aside as the test set
train_X, test_X, train_y, test_y = train_test_split(bookings.drop(columns=['y']), bookings['y'], test_size=2500)

In [None]:
lgbm_classifier.fit(train_X, train_y)

In [None]:
print(f"validation score:    {np.round(lgbm_classifier.best_score_, 3)}")
print(f"test score:          {np.round(lgbm_classifier.score(test_X, test_y), 3)}")

Even though the results on both the validation and test sets are decent, to say the least, it would be beneficial to validate the model with a proper, n-times repeated cross-validation (for instance, via `sklearn.model_selection.cross_validate`). We didn't include it here for computational reasons. It would also allow us to determine the dispersion of the classifier's performance with respect to the repeated and randomized selection of the train set.
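
A repeated cross-validation of this kind could look roughly as follows. This is a sketch under stated assumptions: the data is synthetic (`make_classification`), and a `LogisticRegression` stands in for the tuned LightGBM model to keep the example light. Neither is part of the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# synthetic, imbalanced stand-in for the bookings data (roughly 9:1)
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=90)

# 5 stratified folds repeated 3 times with different shuffles -> 15 out-of-fold scores
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=90)
results = cross_validate(LogisticRegression(max_iter=1000), X, y, scoring='roc_auc', cv=cv)

scores = results['test_score']
print(f"ROC AUC: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The standard deviation of the per-fold scores is the dispersion mentioned above. For the real model, the estimator passed to `cross_validate` would be the LightGBM classifier with the tuned hyperparameters.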

In [None]:
# let's also have a look at the importances from the model - they will give us some intuition about the features that influenced the model the most
# the plotted metric is the split count (LightGBM's default `importance_type='split'`) - the number of times a given feature was used in building the trees
lgbm_importances = pd.DataFrame({'importance': lgbm_classifier.best_estimator_.feature_importances_,
                                 'feature': train_X.columns}).sort_values('importance', ascending=False)

# in order to declutter the graph we will plot only the top 10 most important (most used) features
(pln.ggplot(pln.aes(x='feature', y='importance'), data=lgbm_importances.head(10))
 + pln.geom_bar(stat='identity')
 + pln.geom_text(pln.aes(label='importance'), va='center', ha='left')
 + pln.labs(x='Feature', y='Importance (# splits)', title='Top-10 most important features used for classification')
 + pln.coord_flip()
 + pln.theme_bw()
 + pln.theme(figure_size=(12, 9))
).draw();

Interestingly enough, among the 10 most important features we observe `arrival_date_week_number` and `arrival_date_day_of_month` - both of these were directly (`arrival_date_day_of_month`) or indirectly (`arrival_date_week_number`) shown in the heatmap above, where we hypothesised that they may be of importance down the line. They are not, however, the most important features. Those are `adr` and `lead_time`.

#### Shapley explanations

Finally, we would like to have a closer look at the features used by the model that contribute to its performance. Unlike linear models, tree-based ones only indicate which features are the most important, but do not show the "direction" of their impact (whether high values are associated with class 0 or 1, for instance).
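
To give some intuition for what a Shapley value is, the sketch below computes exact values for a made-up three-feature "model" by brute force over all feature orderings. The names `exact_shapley` and `toy_value` and all the numbers are assumptions for illustration; `shap.TreeExplainer`, used below, relies on a far more efficient algorithm tailored to tree ensembles.

```python
from itertools import permutations

def exact_shapley(value_fn, n_features):
    """Average marginal contribution of each feature over all feature orderings."""
    totals = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        present = set()
        for feature in order:
            before = value_fn(frozenset(present))
            present.add(feature)
            after = value_fn(frozenset(present))
            totals[feature] += after - before
    return [t / len(orderings) for t in totals]

def toy_value(known):
    """Made-up model output given the set of features whose values are known."""
    score = -3.0                  # baseline output when nothing is known
    if 0 in known:
        score += 2.0              # feature 0 pushes the output towards class 1
    if 1 in known:
        score -= 1.0              # feature 1 pushes the output towards class 0
    if 0 in known and 2 in known:
        score += 1.0              # interaction effect, split evenly between 0 and 2
    return score

phi = exact_shapley(toy_value, 3)
print(phi)
```

The sign of each value gives the direction of a feature's contribution, and the values add up to the difference between the full output and the baseline. This is the same additive decomposition that the force plots below visualize.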

In [None]:
import shap
shap.initjs()

In [None]:
lgbm_explainer = shap.TreeExplainer(lgbm_classifier.best_estimator_)
shap_values = lgbm_explainer.shap_values(test_X)

In [None]:
# explanations for the tenth observation in the test set
# the graph shows contributions towards class 1 - a stay in a hotel of >= 7 days
# the higher the value, the more certain the model is that the observation comes from class 1
# this particular observation is more likely to come from class 0 (as its output value of -4.96 is lower than the "neutral" base value of -3.432)
obs = 10
shap.force_plot(lgbm_explainer.expected_value[1], shap_values[1][obs], test_X.iloc[obs, :], matplotlib=False)

In [None]:
# let's stack explanations for a subsample of the test observations
# subsample of size 200 is taken for computational reasons
shap.force_plot(lgbm_explainer.expected_value[1], shap_values[1][:200], test_X.iloc[:200, :])

Setting the X axis of the graph above to `sample order by output value` helps us visualize the impact of the variables across the output's domain (the lower the output, the more certain the model is that the object belongs to class 0 - fewer than 7 nights). To the left, the most prevalent features are `hotel_Resort Hotel` and a (relatively) long `lead_time` of more than 40 days, which would suggest that longer stays are planned well in advance. Intuitively, this seems plausible. More to the right, where shorter stays are predicted, City Hotels (`hotel_Resort Hotel = 0`) and short `lead_time`s are more prevalent. Finally, across the whole domain of outputs, we see a considerable impact of the `adr` variable. On the Kaggle site we read: _Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights_. Since the number of staying nights (strongly correlated with our dependent variable) is a component of the `adr` calculation, we hypothesize that this variable may be a manifestation of an undesired data leak, and we may want to remove it from the dataset. We would need to consult a domain expert on that.

Finally, we may have a look at individual variables - in order to do so, we put a variable on the X axis and its effects on the Y axis. This depicts the impact of a given feature across the domain of its values. By doing so, we clearly observe that higher values of `lead_time` definitely contribute towards the prediction of shorter stays (class 0). The impact of `adr` is non-linear - it contributes towards class 1 in the middle of its domain, and at the extremes it pushes predictions towards class 0. We may also have a look at the variables we deemed important earlier - `arrival_date_week_number` (which effectively replaces `arrival_date_month`, as the latter was encoded into a series of binary columns) and `arrival_date_day_of_month`. In the first case, long stays are associated either with low numbers, which means the beginning of the year (which we did not spot in the heatmap), or with weeks 25-35, which cover the holidays (observed earlier). When it comes to the second variable, `arrival_date_day_of_month`, low values (beginnings of the months) contribute to longer stays, whereas higher values (months' ends) lean more towards class 0 (shorter stays). Most likely, though, the interplay between these two variables matters more than either of them in isolation, but the graph above doesn't let us prove that easily - other techniques would have to be used.

## Summary

In the course of the research we've accomplished the following:
- Cleaning the data of observations and variables that could potentially spoil the analysis (both on the merits and for technical reasons).
- Drawing a number of graphs to get the feeling of the data.
- Training the LGBM classifier with very satisfying, at least for a layman, efficacy. It was performed while adhering to good machine learning practices - cross-validation (although limited, for computational reasons) and hyperparameter tuning.
- Explaining the predictions at both the model level and the level of individual predictions.