In [1]:
from IPython.display import Image

image_url = "https://i1.wp.com/maphappy.org/wp-content/uploads/2014/09/LgTEc8xSy2NndQoeCU9RUYgMq_meq7dlEbZAfMcadUg-1-e1438274302149.jpg?resize=800%2C444&ssl=1"
Image(url=image_url)

## Predicting Hotel Cancellations
Booking cancellations can significantly affect demand management strategies in the hospitality sector. The internet sees over 140 million bookings annually, with a significant proportion of hotel bookings being made through popular travel websites.

To overcome the problems caused by booking cancellations, hotels implement rigid cancellation policies, inventory management, and overbooking strategies, which can also have a negative influence on revenue and reputation.

Once a reservation has been canceled, there is almost nothing left to be done, which is a problem for both hotels and hotel technology companies. Predicting which reservations are likely to be canceled, and acting to prevent those cancellations, can therefore generate additional revenue for both.



## Motivation

Imagine if there was a way to predict which guests are likely to cancel their hotel bookings. Using Machine Learning with Python, this is possible. By predicting cancellations, hotels can generate additional revenue, improve forecasting accuracy, and reduce uncertainty in business management decisions.

For those who want to follow a structured approach while working on a machine learning project, this analysis provides a comprehensive guide. It covers the entire process of solving a real-world machine learning project, from understanding the business problem to deploying the model on the cloud.

# 1. Description of the project

- Understanding the Business Problem
- Data Collection and Understanding
- Data Exploration
- Data Preparation
- Modeling
- Model Deployment

## 1.1 Understanding the Business Problem
The goal of this project is to predict which guests are likely to cancel their hotel bookings, using machine learning with Python. Predicting the reservations that might be canceled, and preventing those cancellations, can create additional revenue, improve forecasts, and reduce uncertainty in business management decisions.

## 1.2 Data Collection and Understanding

The business has provided us with their bookings data in a file called `hotel_bookings.csv`, which contains the following:

| Column     | Description              |
|------------|--------------------------|
| `Booking_ID` | Unique identifier of the booking. |
| `no_of_adults` | The number of adults. |
| `no_of_children` | The number of children. |
| `no_of_weekend_nights` | Number of weekend nights (Saturday or Sunday). |
| `no_of_week_nights` | Number of week nights (Monday to Friday). |
| `type_of_meal_plan` | Type of meal plan included in the booking. |
| `required_car_parking_space` | Whether a car parking space is required. |
| `room_type_reserved` | The type of room reserved. |
| `lead_time` | Number of days before the arrival date the booking was made. |
| `arrival_year` | Year of arrival. |
| `arrival_month` | Month of arrival. |
| `arrival_date` | Date of the month for arrival. |
| `market_segment_type` | How the booking was made. |
| `repeated_guest` | Whether the guest has previously stayed at the hotel. |
| `no_of_previous_cancellations` | Number of previous cancellations. |
| `no_of_previous_bookings_not_canceled` | Number of previous bookings that were not canceled. |
| `avg_price_per_room` | Average price per day of the booking. |
| `no_of_special_requests` | Count of special requests made as part of the booking. |
| `booking_status` | Whether the booking was canceled or not (`Canceled` or `Not_Canceled`). |

Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset

## 1.3. Data Exploration
In this step, we apply Exploratory Data Analysis (EDA) to find out which features contribute most to predicting cancellations. We analyze the data with pandas and visualize it with Matplotlib and Seaborn. It is always good practice to understand the data first and gather as many insights from it as possible.
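As a minimal sketch of the kind of question EDA can answer here, the snippet below computes the cancellation rate per market segment and the median lead time by booking outcome. The `toy` DataFrame and the helper column `is_canceled` are invented for illustration; only the column names `market_segment_type`, `lead_time`, and `booking_status` come from the data dictionary above.

```python
import pandas as pd

# Toy stand-in for the bookings table; the values are invented for illustration.
toy = pd.DataFrame({
    "market_segment_type": ["Online", "Online", "Offline", "Offline", "Corporate", "Online"],
    "lead_time": [5, 211, 48, 3, 10, 120],
    "booking_status": ["Not_Canceled", "Canceled", "Canceled",
                       "Not_Canceled", "Not_Canceled", "Canceled"],
})

# Hypothetical helper column: 1 when the booking was canceled, 0 otherwise.
toy["is_canceled"] = toy["booking_status"].eq("Canceled").astype(int)

# Share of canceled bookings per market segment (mean of a 0/1 column).
cancel_rate_by_segment = (
    toy.groupby("market_segment_type")["is_canceled"]
    .mean()
    .sort_values(ascending=False)
)

# Median lead time for canceled (1) versus non-canceled (0) bookings.
median_lead_time = toy.groupby("is_canceled")["lead_time"].median()

print(cancel_rate_by_segment)
print(median_lead_time)
```

On the real data, the same two `groupby` calls would run on `hotel_bookings` once the target has been encoded.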

In [2]:
# import necessary libraries
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy import stats
import statsmodels.api as sm
from sklearn import datasets, linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, roc_auc_score
from matplotlib import pyplot

pd.set_option('display.max_columns', 999)

In [3]:
# load the data and print the first few rows
hotel_booking_ori = pd.read_csv('data/hotel_bookings.csv')
hotel_booking_ori.head()

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,INN00001,,,,,,,,,,,,,,,,,,Not_Canceled
1,INN00002,2.0,0.0,2.0,3.0,Not Selected,0.0,Room_Type 1,5.0,2018.0,11.0,6.0,Online,0.0,0.0,0.0,106.68,1.0,Not_Canceled
2,INN00003,1.0,0.0,2.0,1.0,Meal Plan 1,0.0,Room_Type 1,1.0,2018.0,2.0,28.0,Online,0.0,0.0,0.0,60.0,0.0,Canceled
3,INN00004,2.0,0.0,0.0,2.0,Meal Plan 1,0.0,Room_Type 1,211.0,2018.0,5.0,20.0,Online,0.0,0.0,0.0,100.0,0.0,Canceled
4,INN00005,2.0,0.0,1.0,1.0,Not Selected,0.0,Room_Type 1,48.0,2018.0,4.0,11.0,Online,0.0,0.0,0.0,94.5,0.0,Canceled


In [4]:
# check the brief info on columns, rows and data types
hotel_booking_ori.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36275 non-null  object 
 1   no_of_adults                          35862 non-null  float64
 2   no_of_children                        35951 non-null  float64
 3   no_of_weekend_nights                  35908 non-null  float64
 4   no_of_week_nights                     35468 non-null  float64
 5   type_of_meal_plan                     35749 non-null  object 
 6   required_car_parking_space            33683 non-null  float64
 7   room_type_reserved                    35104 non-null  object 
 8   lead_time                             35803 non-null  float64
 9   arrival_year                          35897 non-null  float64
 10  arrival_month                         35771 non-null  float64
 11  arrival_date   

## 1.4 Data Cleaning

In [5]:
#make a copy of the original data
hotel_bookings = hotel_booking_ori.copy()


In [6]:
# check for duplicates
duplicate_rows = hotel_bookings[hotel_bookings.duplicated()]
print("number of duplicate rows: ", duplicate_rows.shape[0])

number of duplicate rows:  0
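Because every row carries a `Booking_ID`, a full-row duplicate check can only flag rows whose identifier is repeated as well. As a hedged sketch on invented toy data, the snippet below contrasts three checks: full-row duplicates, repeated identifiers, and rows that are identical once the identifier is ignored. The variable names are hypothetical and this is a complementary check, not part of the original analysis.

```python
import pandas as pd

# Toy data, invented for illustration: one repeated ID and two rows
# with identical content but different IDs.
toy = pd.DataFrame({
    "Booking_ID": ["INN00001", "INN00002", "INN00003", "INN00003"],
    "no_of_adults": [2, 2, 1, 1],
    "lead_time": [5, 5, 48, 49],
})

# Rows identical in every column (the default behaviour of duplicated()).
full_row_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier Booking_ID, whatever the other columns hold.
id_dupes = int(toy.duplicated(subset=["Booking_ID"]).sum())

# Rows that repeat an earlier row once the identifier is ignored.
content_dupes = int(toy.drop(columns="Booking_ID").duplicated().sum())

print(full_row_dupes, id_dupes, content_dupes)
```

Whether content-only duplicates are genuine repeats or simply similar bookings is a judgment call that depends on the data.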


**Conclusion:** The missing-value counts computed in the next cell show that several columns contain missing values. The `required_car_parking_space` column has the highest share, with 7.15% of its values missing. It is followed by `market_segment_type` (4.17%), `room_type_reserved` (3.23%), `arrival_date` (2.70%), and `no_of_week_nights` (2.22%).

In [7]:
# check for missing values
print(pd.DataFrame({'Num of NULL in each column':hotel_bookings.isnull().sum(),
                   'NULL percentage':round(hotel_bookings.isnull().mean() * 100, 2)}))

                                      Num of NULL in each column  \
Booking_ID                                                     0   
no_of_adults                                                 413   
no_of_children                                               324   
no_of_weekend_nights                                         367   
no_of_week_nights                                            807   
type_of_meal_plan                                            526   
required_car_parking_space                                  2592   
room_type_reserved                                          1171   
lead_time                                                    472   
arrival_year                                                 378   
arrival_month                                                504   
arrival_date                                                 981   
market_segment_type                                         1512   
repeated_guest                                  

Next, we will need to clean the data by handling the missing values and addressing any inconsistencies in the data formatting.

For the columns with missing values, we can either drop the affected rows or impute the missing values. The right choice depends on the nature and quantity of the missing data and on the analytical objectives. In this case, we will impute, choosing a fill value for each column as follows:

- For the `type_of_meal_plan` column, we can replace the "null" values with NaN values to make them consistent with the other missing values.

- For the `room_type_reserved` column, we can impute the missing values with the mode (most frequent value) of the column, since the column is categorical.

- For the `required_car_parking_space` column, we can impute the missing values with 0, since the missing values likely indicate that the customers did not request a car parking space.

- For the `arrival_date` column, we can impute the missing values with the median value of the column, since the column is numerical.

- For the `no_of_week_nights` column, we can impute the missing values with 0, since the column is numerical and we assume that the customer did not stay on any week night.

- For the `no_of_weekend_nights` column, we can impute the missing values with 0, since the column is numerical and we assume that the customer did not stay on the weekend.
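Before committing to imputation, it can help to quantify what dropping rows would cost. The sketch below uses an invented toy DataFrame (not the real bookings data) to compare the per-column missing share with the share of rows that a blanket `dropna()` would discard. The names `toy`, `missing_pct`, and `share_lost` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; each column has exactly one gap.
toy = pd.DataFrame({
    "no_of_adults": [2, np.nan, 1, 2, 2],
    "required_car_parking_space": [0, 0, np.nan, 1, 0],
    "room_type_reserved": ["Room_Type 1", "Room_Type 1", None,
                           "Room_Type 4", "Room_Type 1"],
    "lead_time": [5, 1, 211, np.nan, 48],
})

# Per-column share of missing values, in percent.
missing_pct = toy.isna().mean().mul(100).round(2)

# Number of rows with at least one missing value.
rows_with_any_missing = int(toy.isna().any(axis=1).sum())

# Share of rows that dropna() would remove.
share_lost = rows_with_any_missing / len(toy)

print(missing_pct)
print(f"dropna() would discard {share_lost:.0%} of the rows")
```

When gaps are spread across many columns, as in this toy example, the share of rows lost can be much larger than any single column's missing percentage, which is one argument for imputing instead of dropping.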

In [8]:
# Note: results are assigned back to the column instead of using inplace=True
# on a column selection, which newer pandas versions warn about.

# Replace the literal string 'null' in the type_of_meal_plan column with NaN
hotel_bookings['type_of_meal_plan'] = hotel_bookings['type_of_meal_plan'].replace('null', np.nan)

# Impute missing values in the room_type_reserved column with the mode
room_type_mode = hotel_bookings['room_type_reserved'].mode().iloc[0]
hotel_bookings['room_type_reserved'] = hotel_bookings['room_type_reserved'].fillna(room_type_mode)

# Impute missing values in the required_car_parking_space column with 0
hotel_bookings['required_car_parking_space'] = hotel_bookings['required_car_parking_space'].fillna(0)

# Impute missing values in the arrival_date column with the median
arrival_date_median = hotel_bookings['arrival_date'].median()
hotel_bookings['arrival_date'] = hotel_bookings['arrival_date'].fillna(arrival_date_median)

# Impute missing values in the no_of_weekend_nights column with 0
hotel_bookings['no_of_weekend_nights'] = hotel_bookings['no_of_weekend_nights'].fillna(0)

# Impute missing values in the no_of_week_nights column with 0
hotel_bookings['no_of_week_nights'] = hotel_bookings['no_of_week_nights'].fillna(0)
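The per-column fills above can also be expressed as a single `DataFrame.fillna` call that takes a dictionary mapping column names to fill values. The sketch below shows this on an invented toy DataFrame; `toy`, `fill_values`, and `toy_filled` are hypothetical names, and the approach is an alternative style, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "room_type_reserved": ["Room_Type 1", None, "Room_Type 1", "Room_Type 4"],
    "required_car_parking_space": [0.0, np.nan, 1.0, 0.0],
    "arrival_date": [6.0, 28.0, np.nan, 20.0],
    "no_of_week_nights": [3.0, np.nan, 2.0, 1.0],
})

# One fill value per column: mode for the categorical column,
# median for arrival_date, and 0 for the remaining two.
fill_values = {
    "room_type_reserved": toy["room_type_reserved"].mode().iloc[0],
    "required_car_parking_space": 0,
    "arrival_date": toy["arrival_date"].median(),
    "no_of_week_nights": 0,
}

# fillna returns a new DataFrame; the original toy frame is left unchanged.
toy_filled = toy.fillna(value=fill_values)

print(toy_filled)
```

Keeping the fill values in one dictionary makes the imputation policy easy to review and to reuse on new data.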

For the remaining columns with missing values, we can impute them based on the nature and quantity of the missing data and the analytical objectives.

- For example, we can impute the missing values in the `no_of_adults` column with the median, since this column is numerical and it is reasonable to assume that the median values would be representative of the missing values.

- For the `arrival_year` and `arrival_month` columns, we can impute the missing values with the mode, since these columns take a small set of discrete values and the mode would be representative of the missing values.

- For the `market_segment_type` column, we can impute the missing values with a new category called "unknown", since we do not have enough information to infer the missing values.

- For the `repeated_guest`, `no_of_previous_cancellations`, `no_of_previous_bookings_not_canceled`, and `no_of_special_requests` columns, we can impute the missing values with 0, since the missing values likely indicate that the customers did not have any previous bookings, cancellations, or special requests.

- For the `no_of_children` column, we can also impute the missing values with 0: if a customer did not provide any information on the number of children they were bringing to the hotel, we assume that they did not bring any.
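To illustrate why the median is a reasonable fill value for a skewed numeric column, and the mode for a discrete one, here is a small sketch on invented numbers. The series `lead_time` and `arrival_month` below are toy data and do not reflect the real distributions.

```python
import numpy as np
import pandas as pd

# Toy, right-skewed numeric column with two gaps (values invented).
lead_time = pd.Series([1, 5, 10, 48, 211, 443, np.nan, np.nan])

# The mean is pulled upward by the two large values; the median is not.
mean_value = lead_time.mean()
median_value = lead_time.median()

# Toy discrete column with one gap (values invented).
arrival_month = pd.Series([10, 10, 9, 8, 10, np.nan])

# mode() returns a Series (there can be ties); take the first entry.
mode_value = arrival_month.mode().iloc[0]

lead_time_filled = lead_time.fillna(median_value)
arrival_month_filled = arrival_month.fillna(mode_value)

print(f"mean={mean_value:.2f}, median={median_value}, mode={mode_value}")
```

Both `mean()` and `median()` skip NaN by default, so the fill values are computed from the observed entries only.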

In [10]:
# Impute missing values in the no_of_adults column with the median
no_of_adult_median = hotel_bookings['no_of_adults'].median()
hotel_bookings['no_of_adults'] = hotel_bookings['no_of_adults'].fillna(no_of_adult_median)

# Impute missing values in the arrival_year column with the mode
arrival_year_mode = hotel_bookings['arrival_year'].mode().iloc[0]
hotel_bookings['arrival_year'] = hotel_bookings['arrival_year'].fillna(arrival_year_mode)

# Impute missing values in the arrival_month column with the mode
arrival_month_mode = hotel_bookings['arrival_month'].mode().iloc[0]
hotel_bookings['arrival_month'] = hotel_bookings['arrival_month'].fillna(arrival_month_mode)

# Impute missing values in the market_segment_type column with "unknown"
hotel_bookings['market_segment_type'] = hotel_bookings['market_segment_type'].fillna('unknown')

# Impute missing values in the repeated_guest, no_of_previous_cancellations,
# no_of_previous_bookings_not_canceled, no_of_special_requests,
# and no_of_children columns with 0
zero_fill_cols = ['repeated_guest', 'no_of_previous_bookings_not_canceled',
                  'no_of_previous_cancellations', 'no_of_special_requests',
                  'no_of_children']
hotel_bookings[zero_fill_cols] = hotel_bookings[zero_fill_cols].fillna(0)



- For the `avg_price_per_room` column, we can impute the missing values with the median, since this column is numerical and the median value would be representative of the missing values.

- For the `lead_time` column, we can impute the missing values with the median, since this column is numerical and the median value would be representative of the missing values.

- For the `type_of_meal_plan` column, we can impute the missing values (including the former "null" strings, which are now NaN) with a new category called "unknown", since we do not have enough information to infer the missing values.

In [12]:
# Impute missing values in the avg_price_per_room column with the median
avg_price_median = hotel_bookings['avg_price_per_room'].median()
hotel_bookings['avg_price_per_room'] = hotel_bookings['avg_price_per_room'].fillna(avg_price_median)

# Impute missing values in the lead_time column with the median
lead_time_median = hotel_bookings['lead_time'].median()
hotel_bookings['lead_time'] = hotel_bookings['lead_time'].fillna(lead_time_median)

# Impute missing values in the type_of_meal_plan column with "unknown"
hotel_bookings['type_of_meal_plan'] = hotel_bookings['type_of_meal_plan'].fillna('unknown')

In [13]:
# Encode the booking_status column: "Not_Canceled" -> 0, "Canceled" -> 1
# (the keys must match the labels in the data exactly, with a single "l")
hotel_bookings['booking_status'] = hotel_bookings['booking_status'].map(
    {'Not_Canceled': 0, 'Canceled': 1})
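When encoding a target column, a mapping key that does not match the label spelling in the data is easy to miss, because `replace` simply leaves unmatched values untouched. The sketch below shows one way to guard against this on an invented toy series: `Series.map` turns any unmapped label into NaN, which can then be checked explicitly. The names `status`, `mapping`, and `bad_mapping` are hypothetical.

```python
import pandas as pd

# Toy labels, spelled as in the head() output above (single "l").
status = pd.Series(["Not_Canceled", "Canceled", "Canceled", "Not_Canceled"])

mapping = {"Not_Canceled": 0, "Canceled": 1}

# map() yields NaN for every label that is missing from the mapping.
encoded = status.map(mapping)

# Collect any labels that were not covered by the mapping.
unmapped = status[encoded.isna()].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unmapped labels: {unmapped}")

# Safe to convert to integers now that no NaN remains.
encoded = encoded.astype(int)

# A mapping with a different spelling (double "l") matches nothing at all.
bad_mapping = {"Not_Cancelled": 0, "Cancelled": 1}
bad_encoded = status.map(bad_mapping)

print(encoded.tolist())
print(bad_encoded.isna().all())
```

Running a quick `value_counts()` on the column before and after encoding is another simple way to confirm that the labels were converted.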

In [14]:
# verify that all null values have been taken care of
print(pd.DataFrame({"# of NULL in each columns:": hotel_bookings.isnull().sum(), 
                    '%NaN': round(hotel_bookings.isnull().mean() * 100, 2)}))

                                      # of NULL in each columns:  %NaN
Booking_ID                                                     0   0.0
no_of_adults                                                   0   0.0
no_of_children                                                 0   0.0
no_of_weekend_nights                                           0   0.0
no_of_week_nights                                              0   0.0
type_of_meal_plan                                              0   0.0
required_car_parking_space                                     0   0.0
room_type_reserved                                             0   0.0
lead_time                                                      0   0.0
arrival_year                                                   0   0.0
arrival_month                                                  0   0.0
arrival_date                                                   0   0.0
market_segment_type                                            0   0.0
repeat