# **Data Information**

***Idea***
---

Online hotel reservation channels have dramatically changed booking possibilities and customer behavior. A significant number of hotel reservations are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is less desirable, and possibly revenue-diminishing, for hotels.

***Column Detail***
---

* *Features*
  * **Booking_ID** $:$ *unique identifier of each booking*
  * **no_of_adults** $:$ *Number of adults*
  * **no_of_children** $:$ *Number of Children*
  * **no_of_weekend_nights** $:$ *Number of **weekend nights (Saturday or Sunday)** the guest stayed or booked to stay at the hotel*
  * **no_of_week_nights** $:$ *Number of **week nights (Monday to Friday)** the guest stayed or booked to stay at the hotel*
  * **type_of_meal_plan** $:$ *Type of meal plan booked by the customer:*
    * *Meal Plan 1*
    * *Meal Plan 2*
    * *Meal Plan 3*
    * *Not Selected*
  * **required_car_parking_space** $:$ *Does the customer require a car parking space?*
    * *0 : No*
    * *1 : Yes*
  * **room_type_reserved** $:$ *Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.*
    * *Room Type 1*
    * *Room Type 2*
    * *Room Type 3*
    * *Room Type 4*
    * *Room Type 5*
    * *Room Type 6*
    * *Room Type 7*
  * **lead_time** $:$ *Number of days between the date of booking and the arrival date*
  * **arrival_year** $:$ *Year of arrival date*
  * **arrival_month** $:$ *Month of arrival date*
  * **arrival_date** $:$ *Date of the month*
  * **market_segment_type** $:$ *Market segment designation.*
  * **repeated_guest** $:$ *Is the customer a repeated guest?*
    * *0 : No*
    * *1 : Yes*
  * **no_of_previous_cancellations** $:$ *Number of previous bookings that were canceled by the customer prior to the current booking*
  * **no_of_previous_bookings_not_canceled** $:$ *Number of previous bookings not canceled by the customer prior to the current booking*
  * **avg_price_per_room** $:$ *Average price per day of the reservation; prices of the rooms are dynamic. (in euros)*
  * **no_of_special_requests** $:$ *Total number of special requests made by the customer (e.g. high floor, view from the room, etc)*


* *Target*
  * **booking_status** $:$ *Flag indicating if the booking was canceled or not.*
    * *Canceled*
    * *Not_Canceled*

# **Imports**

Below are all the **modules** used in the **notebook**📔.

In [None]:
# Data 
import numpy as np
import pandas as pd
from datetime import datetime

# Data Visualization
import plotly.express as px
import matplotlib.pyplot as plt

# Data Splitting
from sklearn.model_selection import train_test_split

# Machine Learning Model
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Deep Learning Model
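from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

# Optimizer (imported from the same tensorflow.keras package as the layers above)
from tensorflow.keras.optimizers import Adam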

# Metrics
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report, confusion_matrix

# **Data Loading**

In order to **analyse the data**, we need to **load the data** first.

In [None]:
# Specify the data file path
path = '/kaggle/input/hotel-reservations-classification-dataset/Hotel Reservations.csv'

# Load the data 
data = pd.read_csv(path)
org_data = data.copy() # Will be used later

# Have a Quick look
data.head(10)

The data seems clean and ready to use. Let's confirm.

In [None]:
data.info()

The **good news** is that the **majority of the columns** are **numerical**, so there is little conversion to worry about.

In [None]:
data.isnull().sum()

Awesome, the data does **not contain** any **null values**.

In [None]:
data.describe()

Although it is **not clear** here, **some columns** seem to be **statistically correlated**. We will take an in-depth look during **data visualization**.

# **Data Processing**

Although we **do not require** any **high-level processing**, we still need to convert our **categorical values** into **numerical values**. We could use **scikit-learn** for this, but as it is a **small task**, let's do it by hand.

In [None]:
# Collect the original values.
org_target_vals = data.booking_status.unique()
org_target_vals

In [None]:
# Define a mapping which is responsible for converting the target values from categorical to numerical.
target_mapping = {name:index for index, name  in enumerate(org_target_vals)}

# Apply the mapping to the target column.
data.booking_status = data.booking_status.map(target_mapping)

# Let's check for our confirmation.
print("{:20} : {}".format("Target Mapping", target_mapping))
print("{:20} : {}".format("New Unique Values", data.booking_status.unique()))
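
Because `unique()` returns values in order of first appearance, the mapping built above depends on which label happens to come first in the file. The following is a minimal sketch of a fixed mapping instead, assuming the two labels are spelled `Canceled` and `Not_Canceled`; the toy Series and the name `explicit_mapping` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for data.booking_status (illustrative values only)
status = pd.Series(["Not_Canceled", "Canceled", "Not_Canceled", "Canceled"])

# Fixed mapping: a cancellation becomes the positive class (1),
# regardless of the order in which the labels appear in the data
explicit_mapping = {"Not_Canceled": 0, "Canceled": 1}
encoded = status.map(explicit_mapping)

# A label missing from the mapping would become NaN, so check for that
assert encoded.notna().all()
print(encoded.tolist())
```

With a fixed mapping, precision and recall always refer to the same class, which makes the scores easier to compare between runs.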

We **do not require** the **Booking ID**, so we can **drop it**.

In [None]:
data.drop(columns=['Booking_ID'], inplace=True)

In [None]:
cat_cols = [col for col in data.columns if data[col].dtype == "object"]
cat_cols

Let's check the **unique values** present in each **column**.

In [None]:
# Collect the unique values of each categorical column
cat_uniques = [data[col].unique() for col in cat_cols]

# Map each unique value to an integer code, one dictionary per column
cat_unique_mappings = [
    {value: index for index, value in enumerate(uniques)}
    for uniques in cat_uniques
]
cat_unique_mappings

In [None]:
for col, mapping in zip(cat_cols, cat_unique_mappings):
    data[col] = data[col].map(mapping)

That's all we need to do to convert all the **categorical values to numerical values** in one go (otherwise we would have had to select each categorical column and repeat the operation we performed on the booking status). Let's check it.

In [None]:
data.info()

As you can see, all the **values have been converted to numerical values**. Now we can move on to **data visualization and model building**.
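
The hand-written mapping can also be expressed with `pandas.factorize`, which assigns integer codes in order of first appearance, just like `enumerate(unique())`. This is only a sketch on a toy frame; the column values and the name `mappings` are assumptions for illustration.

```python
import pandas as pd

# Toy frame with two text columns (illustrative values only)
toy = pd.DataFrame({
    "type_of_meal_plan": ["Meal Plan 1", "Not Selected", "Meal Plan 1", "Meal Plan 2"],
    "market_segment_type": ["Online", "Offline", "Online", "Corporate"],
})

mappings = {}
for col in ["type_of_meal_plan", "market_segment_type"]:
    # codes: integer code per row; uniques: the labels in order of first appearance
    codes, uniques = pd.factorize(toy[col])
    mappings[col] = dict(zip(uniques, range(len(uniques))))
    toy[col] = codes

print(toy)
print(mappings)
```

Keeping the `mappings` dictionary around makes it possible to translate the integer codes back to the original labels later.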

# **Exploratory Data Analysis - Feature Columns**

Let's try to gain some **insights** into the data by **visualising it**. First, let's understand the relation of each **feature column** to the **target column**.

In [None]:
vals = org_data.booking_status.value_counts()
names = vals.index
fig = px.pie(values=vals, names=names)
fig.show()

There is **one major problem** with the data: it is **imbalanced**. **67%** of the **data** belongs to the **not canceled** class, which can **bias** a model towards the **not canceled class**.
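
To put numbers on the imbalance, the sketch below computes the class shares and the common "balanced" class weights, `n_samples / (n_classes * class_count)`. The labels are a toy Series built to mimic a 67/33 split; they are not the real data, and the notebook itself does not apply class weights.

```python
import pandas as pd

# Toy labels with roughly the 67/33 split seen in the pie chart (illustrative only)
labels = pd.Series(["Not_Canceled"] * 67 + ["Canceled"] * 33)

counts = labels.value_counts()
shares = counts / counts.sum()

# "Balanced" weights: the rarer class gets the larger weight
weights = len(labels) / (len(counts) * counts)

print(shares.round(2).to_dict())
print(weights.round(3).to_dict())
```

Weights like these could be passed to a model that supports class weighting, or the imbalance could simply be kept in mind when reading precision and recall.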

In [None]:
fig = px.histogram(data.no_of_adults, color=org_data.booking_status, barmode='group')
fig.show()

In [None]:
fig = px.violin(x=data.no_of_adults, color=org_data.booking_status)
fig.show()

* Insights
  * We can say that the **majority** of the **bookings** are for **2 adults**.
  * These customers are **less likely** to **cancel** the **booking**.
  * Bookings with **0 and 4 adults** could be **outliers** because they lie **far from the bulk** of the **distribution**.

In [None]:
fig = px.histogram(data.no_of_children, color=org_data.booking_status, barmode='group')
fig.show()

In [None]:
fig = px.violin(x=data.no_of_children, color=org_data.booking_status)
fig.show()

* Insights
  * Most customers do **not** bring **children**.
  * This could **suggest that the majority** of the customers are **not travelling as families**, although the data does not record this directly.
  * The values **3, 9 and 10 are outliers** because, again, they lie **far from the bulk of the distribution**.

In [None]:
fig = px.histogram(x=data.no_of_weekend_nights, color=org_data.booking_status, barmode='group')
fig.show()

In [None]:
fig = px.box(data.no_of_weekend_nights, color=org_data.booking_status)
fig.show()

In [None]:
fig = px.violin(x=data.no_of_weekend_nights, color=org_data.booking_status)
fig.show()

* Insights
  * The majority of the customers did **not stay** for **weekend nights**.
  * But those who do stay are **less likely to cancel the booking**.
  * Again, the data contains many **outliers**, that is, values that lie far from the bulk of the distribution.

In [None]:
fig = px.histogram(data.no_of_week_nights, color=org_data.booking_status, barmode='group')
fig.show()

In [None]:
fig = px.violin(x=data.no_of_week_nights, color=org_data.booking_status)
fig.show()

* Insights
  * The **good news** is that people who stay a **full week** are **less likely** to **cancel the booking**.
  * Among these customers, the **majority** stay for at least **two days**.
  * Again, the data contains many **outliers**.
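
The outliers mentioned in the insights above are judged by eye from the plots. One common way to make that judgement explicit is the 1.5 × IQR rule; the sketch below applies it to a made-up array of night counts, not to the real column.

```python
import numpy as np

# Made-up night counts, with one value far from the rest
nights = np.array([0, 1, 1, 2, 2, 2, 3, 3, 4, 5, 15])

# Quartiles and interquartile range
q1, q3 = np.percentile(nights, [25, 75])
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = nights[(nights < lower) | (nights > upper)]

print(lower, upper, outliers)
```

Whether flagged rows should be removed is a separate decision; tree-based models are usually not very sensitive to them.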

In [None]:
vals = org_data.type_of_meal_plan.value_counts()
names = vals.index
fig = px.pie(values=vals, names=names)
fig.show()

In [None]:
fig = px.histogram(org_data.type_of_meal_plan, color=org_data.booking_status, barmode='group')
fig.show()

In [None]:
fig = px.violin(x=org_data.type_of_meal_plan, color=org_data.booking_status)
fig.show()

There is **one unique insight**.

* Insights
  * Those who **did not cancel** their **bookings** are more likely to have chosen **Meal Plan 1**.

In [None]:
org_data.required_car_parking_space = org_data.required_car_parking_space.map({0:"No",1:"Yes"}) 
vals = org_data.required_car_parking_space.value_counts()
names = vals.index
fig = px.pie(values=vals, names=names)
fig.show()

**Almost all customers did not require any parking space.**

In [None]:
vals = org_data.room_type_reserved.value_counts()
names = vals.index
fig = px.pie(values=vals, names=names)
fig.show()

In [None]:
fig = px.histogram(org_data.room_type_reserved, color=org_data.booking_status, barmode='group')
fig.show()

In [None]:
fig = px.violin(x=org_data.room_type_reserved, color=org_data.booking_status)
fig.show()

* **Insights**
  * Almost **77%** of the **customers booked** **Room Type 1**, followed by **Room Type 4 and 6**.

In [None]:
fig = px.histogram(org_data.lead_time, color=org_data.booking_status, barmode='group')
fig.show()

In [None]:
fig = px.violin(x=org_data.lead_time, color=org_data.booking_status)
fig.show()

* **Insights**
  * This shows that bookings with a **short lead time were mostly not canceled**. As the lead time grows, the **number of people** who **canceled the booking increases** relative to the **number of people who did not cancel the booking**.
  * From the **violin plot**, we can see that the **Not_Canceled bookings** are **concentrated** towards the **lower end** and the **Canceled bookings** are **concentrated** towards the **upper end**.

In [None]:
vals = org_data.arrival_year.value_counts()
names = vals.index
fig = px.pie(values=vals, names=names)
fig.show()

* **82%** of the **data belongs to 2018** and **18%** of the **data belongs to 2017**.

In [None]:
month_mapping = {
    1:'Jan',
    2:'Feb',
    3:'March',
    4:'April',
    5:'May',
    6:'June',
    7:'July',
    8:'August',
    9:'September',
    10:'October',
    11:'November',
    12:'December',
}
org_data.arrival_month = org_data.arrival_month.map(month_mapping)
vals = org_data.arrival_month.value_counts()
names = vals.index
fig = px.pie(values=vals, names=names)
fig.show()

In [None]:
fig = px.histogram(org_data.arrival_month, color=org_data.booking_status, barmode='relative')
fig.show()

In [None]:
fig = px.violin(x=org_data.arrival_month, color=org_data.booking_status)
fig.show()

* **14%** of the **total bookings** arrive in **October**, followed by **September and December**.

In [None]:
vals = org_data.market_segment_type.value_counts()
names = vals.index
fig = px.pie(values=vals, names=names)
fig.show()

* **64%** of the bookings are **online bookings**.
* **29%** are **offline bookings**.

In [None]:
fig = px.histogram(org_data.market_segment_type, color=org_data.booking_status, barmode='group')
fig.show()

In [None]:
fig = px.violin(x=org_data.market_segment_type, color=org_data.booking_status)
fig.show()

* Although I expected the **number of canceled bookings** to be **higher** for **online bookings**, it is actually **lower** than the number of **not canceled bookings**.

In [None]:
mapping = {0:"No",1:"Yes"}
org_data.repeated_guest = org_data.repeated_guest.map(mapping)

vals = org_data.repeated_guest.value_counts()
names = vals.index

fig = px.pie(values=vals, names=names)
fig.show()

* We can say that almost none of the customers are **repeated guests**.

In [None]:
fig = px.histogram(org_data.no_of_previous_cancellations, barmode='group')
fig.show()

* There appears to be **no strong relation** between **previous cancellations** and **future cancellations**, although this plot alone does not split the counts by booking status.

In [None]:
vals = org_data.no_of_special_requests.value_counts()
names = vals.index

fig = px.pie(values=vals, names=names)
fig.show()

In [None]:
fig = px.histogram(org_data.no_of_special_requests, color=org_data.booking_status, barmode='group')
fig.show()

* **54%** of customers **did not** make **any special request**.

# **Exploratory Data Analysis - Feature Correlation**

In [None]:
corr = np.round(data.corr(), 2)

fig = px.imshow(corr, text_auto=True, height=1000, color_continuous_scale='earth')
fig.show()

It's **surprising** to see that there is **no strong positive or negative correlation** among the feature columns. But we should keep in mind that this correlation coefficient only measures **linear relationships**, so there may still be a **quadratic or other** **non-linear** relation in the data.
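
A small example of why a near-zero correlation does not rule out a relationship: below, `y` is fully determined by `x`, yet the Pearson coefficient is essentially zero because the dependence is quadratic and symmetric around zero. The arrays are synthetic and only serve as an illustration.

```python
import numpy as np

# x is symmetric around zero, y depends on x exactly (but not linearly)
x = np.linspace(-1, 1, 201)
y = x ** 2

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

print(f"Pearson r = {r:.6f}")  # very close to zero
```

Tree-based models can still pick up this kind of relation, which is one reason a flat correlation heatmap does not mean the features are useless.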

# **Model Functions**

In [None]:
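def performance(y_pred, y_true, store:list, name:str):
    """Compute precision, recall and F1, append them to `store` under `name`,
    and return them with the confusion matrix and the classification report."""
    p_score = precision_score(y_true, y_pred)
    r_score = recall_score(y_true, y_pred)
    f1__score = f1_score(y_true, y_pred)
    # Keep the scores so that the models can be compared later
    store.append((name, [p_score, r_score, f1__score]))
    return p_score, r_score, f1__score, confusion_matrix(y_true, y_pred), classification_report(y_true, y_pred)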
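
For reference, here is how the three scores follow from the counts of true positives, false positives and false negatives. This is a hand-worked sketch on made-up labels where class 1 is treated as the positive class; it does not use the notebook's data.

```python
import numpy as np

# Made-up ground truth and predictions
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # predicted 1, actually 1
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted 1, actually 0
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # predicted 0, actually 1

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tp, fp, fn, precision, recall, f1)
```

Note that all three scores depend on which class is encoded as 1, so the target mapping decides what "precision" means here.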

# **Data Splitting**

Before training our **machine learning or deep learning models**, we need to first **split the data** into **training and testing subparts**.

In [None]:
y = data.pop('booking_status')
x = data

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, shuffle=True)
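
Since the classes are imbalanced, one option is to stratify the split so that the training and test sets keep the same class ratio, and to fix `random_state` so the split is reproducible. The sketch below uses synthetic arrays (`X_toy`, `y_toy`) rather than the notebook's `x` and `y`; it illustrates the parameters and is not the split used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels with a 67/33 class split (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([0] * 670 + [1] * 330)

# stratify keeps the class ratio in both parts; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())  # share of class 1 in each part
```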

# **ML Models**

In [None]:
model_performances = []

In [None]:
# Decision tree classifier 
dtc = DecisionTreeClassifier(max_depth=15, min_samples_leaf=5)
dtc.fit(X_train, y_train)

# Performance
dtc_pred = dtc.predict(X_test)
p_score, r_score, f1_score_, cm, report = performance(dtc_pred, y_test, store=model_performances, name="DecisionTree")

# show
print("{:18} : {}".format("Precision Score", p_score))
print("{:18} : {}".format("Recall Score", r_score))
print("{:18} : {}".format("F1 Score", f1_score_))
print("{:20} : {}".format("Confusion Matrix", f"\n{cm}"))
print("{:20} : {}".format("Classification Report", f"\n{report}"))

In [None]:
# Random Forest Classifier 
rfc = RandomForestClassifier(n_estimators=500, max_depth=None, min_samples_leaf=5, warm_start=True, bootstrap=True)
rfc.fit(X_train, y_train)

# Performance
pred = rfc.predict(X_test)
p_score, r_score, f1_score_, cm, report = performance(pred, y_test, store=model_performances, name="RandomForest")

# show
print("{:18} : {}".format("Precision Score", p_score))
print("{:18} : {}".format("Recall Score", r_score))
print("{:18} : {}".format("F1 Score", f1_score_))
print("{:20} : {}".format("Confusion Matrix", f"\n{cm}"))
print("{:20} : {}".format("Classification Report", f"\n{report}"))

In [None]:
# XGBoost classifier
xgb = XGBClassifier(n_estimators=100, max_depth=20)
xgb.fit(X_train, y_train)

# Performance
pred = xgb.predict(X_test)
p_score, r_score, f1_score_, cm, report = performance(pred, y_test, store=model_performances, name="XGBoost")

# show
print("{:18} : {}".format("Precision Score", p_score))
print("{:18} : {}".format("Recall Score", r_score))
print("{:18} : {}".format("F1 Score", f1_score_))
print("{:20} : {}".format("Confusion Matrix", f"\n{cm}"))
print("{:20} : {}".format("Classification Report", f"\n{report}"))

# **Model Comparison**

In [None]:
plt.figure(figsize=(25,8))
plt.style.use('bmh')
names = [model_performances[0][0], model_performances[1][0], model_performances[2][0]]

# Precision Scores
plt.subplot(1,3,1)
p_scores = [model_performances[0][1][0], model_performances[1][1][0], model_performances[2][1][0]]
plt.bar(x=names, height=p_scores)
plt.title("Precision Score", fontsize=15)

# Recall Scores
plt.subplot(1,3,2)
r_scores = [model_performances[0][1][1], model_performances[1][1][1], model_performances[2][1][1]]
plt.bar(x=names, height=r_scores)
plt.title("Recall Score", fontsize=15)

# F1 Scores
plt.subplot(1,3,3)
f1__scores = [model_performances[0][1][2], model_performances[1][1][2], model_performances[2][1][2]]
plt.bar(x=names, height=f1__scores)
plt.title("F1 Score", fontsize=15)

# Show
plt.show()

Although the **performance can differ between runs**, what I found is that **XGBoost** performs the **best overall**. The **precision** of **XGBoost** is **lower than that of random forest**, but it is **higher in terms of both Recall and F1 Score**.
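
One way to reduce the run-to-run variation mentioned above is to score each model with cross-validation instead of a single split. The sketch below does this for a decision tree on synthetic data from `make_classification`; the data, the 5 folds and the fixed seeds are assumptions for illustration, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem (stand-in for the hotel data)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, weights=[0.67, 0.33], random_state=0
)

# Five stratified folds with a fixed shuffle for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = DecisionTreeClassifier(max_depth=15, min_samples_leaf=5, random_state=0)

# One F1 score per fold
scores = cross_val_score(clf, X_syn, y_syn, cv=cv, scoring="f1")

print(scores.round(3))
print(scores.mean().round(3), scores.std().round(3))
```

Reporting the mean and standard deviation across folds gives a fairer basis for comparing the models than one train/test split.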

# **Dense Model**

In [None]:
model = Sequential([
    Dense(100, activation='relu', kernel_initializer='he_normal'),
    Dense(100, activation='relu', kernel_initializer='he_normal'),
    Dropout(0.2),
    Dense(1, activation='sigmoid', kernel_initializer='glorot_normal'),
])
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=1e-3), metrics=['accuracy'])

cbs = [ModelCheckpoint("DenseModel.h5", save_best_only=True), EarlyStopping(patience=3, restore_best_weights=True)]
model.fit(X_train, y_train, validation_split=0.1, callbacks=cbs, epochs=100)

In [None]:
# Performance
pred = model.predict(X_test)
pred = np.ravel(np.round(pred))
p_score, r_score, f1_score_, cm, report = performance(pred, y_test, store=model_performances, name="DenseModel")

# show
print("{:18} : {}".format("Precision Score", p_score))
print("{:18} : {}".format("Recall Score", r_score))
print("{:18} : {}".format("F1 Score", f1_score_))
print("{:20} : {}".format("Confusion Matrix", f"\n{cm}"))
print("{:20} : {}".format("Classification Report", f"\n{report}"))

This was expected because, for this kind of tabular dataset, **classical machine learning methods** often **outperform deep learning methods**. You can perform a **hyperparameter search** and get a **better deep learning model**, but the **time you invest** in it will probably **not be worth the performance gain**. In the same amount of time, the **machine learning methods yield a much better result**.
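
One inexpensive thing to try before a full hyperparameter search is feature scaling, since dense networks tend to train more smoothly when the inputs share a similar scale. The notebook feeds unscaled features to the dense model; whether scaling would improve its scores has not been tested here. The sketch uses synthetic arrays and scikit-learn's `StandardScaler`, fitted on the training part only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales,
# e.g. a lead time in days and a 0/1 flag (illustrative only)
rng = np.random.default_rng(0)
X_train_toy = np.column_stack([rng.uniform(0, 400, 800), rng.integers(0, 2, 800)])
X_test_toy = np.column_stack([rng.uniform(0, 400, 200), rng.integers(0, 2, 200)])

# Fit on the training data only, then apply the same transform to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)

print(X_train_scaled.mean(axis=0).round(3))
print(X_train_scaled.std(axis=0).round(3))
```

The scaled arrays would then replace `X_train` and `X_test` in `model.fit` and `model.predict`; the tree-based models do not need this step.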

**Thank You**

---
**DeepNets**