# Predicting Hotel Booking Cancellations

In this notebook, we will build a machine learning model to predict whether or not a customer cancelled a hotel booking.

We will use a dataset on hotel bookings from the article ["Hotel booking demand datasets"](https://www.sciencedirect.com/science/article/pii/S2352340918315191), published in the Elsevier journal [Data in Brief](https://www.sciencedirect.com/journal/data-in-brief). The abstract of the article states:

> This data article describes two datasets with hotel demand data. One of the hotels (H1) is a resort hotel and the other is a city hotel (H2). Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled.

For convenience, the two datasets have been combined into a single CSV file `hotel_bookings.csv`. Let us start by importing all the functions needed to import, visualize, and model the data.

In [ ]:
# Data imports
import pandas as pd
import numpy as np

# Visualization imports
import matplotlib.pyplot as plt
import plotly.express as px
plt.rcParams['figure.figsize'] = [8, 4]

# ML imports
from sklearn.model_selection import train_test_split, KFold, cross_validate, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

## 0. Get the Data

Let us get the data and explore it.

In [ ]:
# Load hotel_bookings.csv file as pandas dataframe
hotel_bookings = pd.read_csv('../data/raw/hotel_bookings.csv')

# Review first few rows of data
hotel_bookings.head()
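Beyond looking at the first rows, two quick checks are often useful before modeling: how many values are missing per column, and how balanced the target is. The sketch below runs both checks on a small made-up frame (`toy_bookings` and its values are hypothetical stand-ins, not rows from `hotel_bookings.csv`); the same calls should apply to `hotel_bookings` once it is loaded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics a few columns of the bookings data
toy_bookings = pd.DataFrame({
    'is_cancelled': [0, 1, 0, 0, 1],
    'children': [0.0, np.nan, 2.0, 0.0, 1.0],
    'meal': ['BB', 'HB', 'BB', None, 'BB'],
})

# Number of missing values in each column
missing_counts = toy_bookings.isna().sum()
print(missing_counts)

# Share of bookings that were cancelled (mean of a 0/1 target)
cancel_rate = toy_bookings['is_cancelled'].mean()
print(f"Cancellation rate: {cancel_rate:.2f}")
```

The cancellation rate matters later on: it determines the accuracy that a trivial "always predict the majority class" model would reach.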

Let us look at the number of bookings by month.

In [ ]:
# Group bookings by months and visualize as bar charts using plotly.express
bookings_by_month = hotel_bookings['arrival_date_month'].value_counts().sort_index()
fig = px.bar(bookings_by_month, x=bookings_by_month.index, y=bookings_by_month.values, title='Bookings by Month')
fig.show()
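Note that `sort_index()` on month names sorts them alphabetically (April, August, December, ...), not chronologically. One possible way to get calendar order is to reindex the counts against an explicit list of month names. The sketch below illustrates this on a small made-up series (`months` is a hypothetical stand-in for `hotel_bookings['arrival_date_month']`).

```python
import pandas as pd

# Hypothetical stand-in for hotel_bookings['arrival_date_month']
months = pd.Series(['July', 'August', 'July', 'January', 'December', 'August', 'July'])

# Explicit calendar order for the 12 month names
month_order = [
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December',
]

# Count bookings per month, then reindex into calendar order;
# months that do not occur in the data get a count of 0
bookings_by_month_ordered = months.value_counts().reindex(month_order, fill_value=0)
print(bookings_by_month_ordered)
```

The reordered series can be passed to `px.bar` in the same way as `bookings_by_month`.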

### <center>Data Dictionary</center>

| Variable                        | Class       | Description                                                                                      |
|:--------------------------------|:------------|:-------------------------------------------------------------------------------------------------|
| adr                             | Numeric     | Average daily rate                                                                               |
| adults                          | Integer     | Number of adults                                                                                 |
| agent                           | Categorical | The id of the travel agency                                                                      |
| arrival_date_day_of_month       | Integer     | Day of the month of the arrival date                                                             |
| arrival_date_month              | Categorical | Month of arrival date with 12 categories: “January” to “December”                                |
| arrival_date_week_number        | Integer     | Week number of the arrival date                                                                  |
| arrival_date_year               | Integer     | Year of arrival date                                                                             |
| assigned_room_type              | Categorical | The code for type of room assigned                                                               |
| babies                          | Integer     | Number of babies                                                                                 |
| booking_changes                 | Integer     | The number of changes made to the booking                                                        |
| children                        | Integer     | Number of children                                                                               |
| company                         | Categorical | The id of the company making the booking                                                         |
| country                         | Categorical | The country of origin in ISO 3155-3:2013 format                                                  |
| customer_type                   | Categorical | The type of booking: Contract / Group / Transient / Transient-Party                              |
| days_in_waiting_list            | Integer     | The number of days the booking was in the waiting list                                           |
| deposit_type                    | Categorical | The type of deposit: No Deposit / Non Refund / Refundable                                        |
| distribution_channel            | Categorical | The booking distribution channel: TA / TO etc.                                                   |
| is_cancelled                    | Categorical | A boolean indicating if the booking was cancelled (1) or not (0)                                 |
| is_repeated_guest               | Categorical | A boolean indicating if it was a repeated guest (1) or not (0)                                   |
| lead_time                       | Integer     | The number of days between the booking date and arrival date                                     |
| market_segment                  | Categorical | A designation for the market segment: TA, TO                                                     |
| meal                            | Categorical | The type of meal booked: Bed & Breakfast (BB), Half Board (HB), and Full Board (FB)              |
| previous_bookings_not_cancelled | Integer     | The number of previous bookings not cancelled by the customer prior to the current booking       |
| previous_cancellations          | Integer     | The number of previous bookings that were cancelled by the customer prior to the current booking |
| required_car_parking_spaces     | Integer     | The number of car parking spaces required by the customer                                        |
| reservation_status              | Categorical | The last status of the reservation: Canceled / Check-Out / No-Show                               |
| reservation_status_date         | Date        | The date at which the last status was set                                                        |
| reserved_room_type              | Categorical | The code of room type reserved                                                                   |
| stays_in_weekend_nights         | Integer     | The number of weekend nights stayed or booked to stay                                            |
| stays_in_week_nights            | Integer     | The number of week nights stayed or booked to stay                                               |
| total_of_special_requests       | Integer     | The number of special requests made by the customer                                              |

Our objective is to build a model that predicts whether or not a user cancelled a hotel booking.

## 1. Split the Data into Training and Test Sets

Let us start by defining a split to divide the data into training and test sets. The basic idea is to train the model on one portion of the data and test its performance on another portion that the model has not seen. Evaluating on unseen data lets us detect __overfitting__, which would go unnoticed if we scored the model on the data it was trained on. We will use four-fold cross-validation with shuffling.

In [ ]:
X = hotel_bookings.drop(columns=['is_cancelled'])
y = hotel_bookings['is_cancelled']
split = KFold(n_splits=4, shuffle=True, random_state=42)
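To make the four-fold split concrete, the following sketch applies the same `KFold` settings to eight made-up samples (`X_toy` is a hypothetical stand-in for the booking rows) and prints the row indices that fall into the training and test parts of each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Eight toy samples standing in for the booking rows (hypothetical data)
X_toy = np.arange(8).reshape(-1, 1)

kfold = KFold(n_splits=4, shuffle=True, random_state=42)

fold_sizes = []
test_indices = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_toy), start=1):
    # Each fold holds out a different quarter of the rows for testing
    fold_sizes.append((len(train_idx), len(test_idx)))
    test_indices.extend(test_idx.tolist())
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
```

Every sample appears in the test part of exactly one fold, so each row contributes to the evaluation once and to training three times.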

## 2. Choose a Class of Models and Hyperparameters

The next step is to choose a class of models and specify hyperparameters.

In [ ]:
# Define a list of models to experiment with
models = [
    ('Random Forest', RandomForestClassifier()),
    ('Decision Tree', DecisionTreeClassifier())
]

## 3. Preprocess the Data

The next step is to set up a pipeline to preprocess the features. We need to impute missing values in the numerical features with the column mean, and one-hot encode all categorical features.

In [ ]:
# Preprocess numerical features:
features_num = [
    'lead_time', 'adr', 'adults', 'children', 'babies',
    'previous_bookings_not_cancelled', 'previous_cancellations',
    'required_car_parking_spaces', 'stays_in_weekend_nights',
    'stays_in_week_nights', 'total_of_special_requests'
]

# Preprocess categorical features:
features_cat = [
    'arrival_date_month', 'meal', 'customer_type', 'distribution_channel',
    'deposit_type', 'is_repeated_guest', 'market_segment'
]

# Create a preprocessing pipeline
preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='mean'), features_num),
    ('cat', OneHotEncoder(handle_unknown='ignore'), features_cat)
])
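To see what such a preprocessor produces, here is a minimal sketch on a three-row made-up frame (`toy`, `toy_preprocessor`, and the `ignored` column are hypothetical). It uses the same transformers as above, plus `sparse_threshold=0`, which is added here only so that the toy output is a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numerical column with a gap, one categorical
# column, and one column that is not listed in any transformer
toy = pd.DataFrame({
    'lead_time': [10.0, np.nan, 30.0],
    'meal': ['BB', 'HB', 'BB'],
    'ignored': ['x', 'y', 'z'],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='mean'), ['lead_time']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['meal']),
    ],
    sparse_threshold=0,  # assumption for this demo: always return a dense array
)

transformed = toy_preprocessor.fit_transform(toy)
print(transformed)
```

The missing `lead_time` is replaced by the mean of the observed values, `meal` becomes one column per category, and `ignored` is dropped because `ColumnTransformer` discards unlisted columns by default (`remainder='drop'`). The same default is why columns such as `reservation_status` never reach the models in this notebook even though they remain in `X`.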

## 4. Fit the Models and Evaluate Performance

Finally, we fit each chosen model on the training folds and use 4-fold cross-validation to evaluate its performance.

In [ ]:
for name, model in models:
    # Compose data preprocessing and model into a single pipeline
    steps = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ])

    # Compute cross-validation accuracy for each model
    cv_results = cross_val_score(steps, X, y, cv=split, scoring='accuracy')

    # Outputs rounded to 4 decimal places:
    min_score = np.min(cv_results).round(4)
    max_score = np.max(cv_results).round(4)
    mean_score = np.mean(cv_results).round(4)
    std_dev = np.std(cv_results).round(4)
    print(f"[{name}] Cross Validation Accuracy Score: {mean_score} +/- {std_dev} (std) min: {min_score}, max: {max_score}")
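An accuracy score is easier to judge against a trivial baseline. One common choice, sketched below on synthetic data, is scikit-learn's `DummyClassifier` with `strategy='most_frequent'`, which always predicts the majority class. The arrays `X_toy` and `y_toy` and the 70/30 class split are made up for illustration and say nothing about the actual bookings data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: 200 samples, 70% class 0 and 30% class 1
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([0] * 140 + [1] * 60)

toy_split = KFold(n_splits=4, shuffle=True, random_state=42)

# Baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = cross_val_score(baseline, X_toy, y_toy, cv=toy_split, scoring='accuracy')

print(f"[Baseline] Cross Validation Accuracy Score: {baseline_scores.mean():.4f}")
```

A model is only informative to the extent that it beats this number. On the real data, the same baseline could be evaluated by passing `X`, `y`, and `split` to `cross_val_score` in place of the toy inputs.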

## 5. Interpret Results

Please provide an interpretation of the results here, along with a final conclusion about the relevance of the models.

In [ ]:
# This cell is intentionally left blank.