# Practical task - modelling: Hotel cancellations
# Introduction

This data set contains information on 119,390 hotel bookings between July 2015 and August 2017. Each observation represents a hotel booking.

Data for two hotels is given. Both hotels are located in Portugal: the Resort Hotel is in the Algarve region and the City Hotel is in Lisbon. A variety of categorical and numeric features are provided, including whether the booking was cancelled.

Hotel management would find it useful to be able to predict whether a booking is likely to be cancelled.

# Importing libraries and data

## Importing the libraries

In [None]:
# pandas for data analysis
import pandas as pd

# seaborn for visualisation
import seaborn as sns
sns.set_context("talk")

# seaborn has some unhelpful warnings at the moment
import warnings
warnings.filterwarnings("ignore", module="seaborn")

# Import functions from sklearn for building the model, training-testing split, visualising the model and metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Function to draw the model
# Note: it reads the global input_features list, which is defined later in the modelling section
def plot_decision_tree(tree_model):
    fig, ax = plt.subplots(figsize=(20,8))
    plot_tree(tree_model,
        filled=True,
        impurity=False,
        feature_names=input_features,
        class_names=["No","Yes"],
        proportion=True,
        ax=ax)
    plt.show()

## Importing the data

In [None]:
hotel_data = pd.read_csv('../input/hotel-bookings/bookings_2025.csv')
hotel_data

# Preparing the data and exploratory data analysis





## Phase 1 - Cleaning up dataset

In [None]:
# Check that the dataset loaded correctly by inspecting its structure
hotel_data.info()
# info() shows the number of rows and columns, plus each column's data type and non-null count

In [None]:
# Check for (and remove) NULL values ("NaN" in output)

# Check for NULL values
hotel_data.isnull().sum()
# This part counts how many missing values there are in each column to identify where any potential cleaning is needed

In [None]:
# Printing which columns have NULL values
for column in hotel_data.columns:
    if hotel_data[column].isnull().any():
        print(column + "\n")

# Printing entire row for said columns to analyse what is null:
for column in hotel_data.columns:
    if hotel_data[column].isnull().any():
        print(hotel_data[hotel_data[column].isnull()][column])

In [None]:
# Only the children column has missing values, so a missing count is treated as 0 children
hotel_data["children"] = hotel_data["children"].fillna(0)
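
As a minimal sketch of the same cleaning step, the toy frame below (made-up values, not the hotel data) shows the missing count before and after filling. The final cast to `int` is an optional extra assumed here, since a count of children is naturally a whole number.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hotel_data; the values are invented for illustration
toy = pd.DataFrame({"adults": [2, 1, 2], "children": [0.0, np.nan, 1.0]})

# Count missing values per column before cleaning
missing_before = toy.isnull().sum()

# Replace missing child counts with 0, then store them as whole numbers
toy["children"] = toy["children"].fillna(0).astype(int)

# Count again to confirm nothing is missing
missing_after = toy.isnull().sum()
print(missing_before["children"], missing_after["children"])
```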

In [None]:
# Re-check for missing values after cleaning to confirm the dataset is ready for analysis
hotel_data.isnull().sum()
if hotel_data.isnull().sum().sum() == 0:
    print("No missing values found") # Move onto next section
else:
    print("Missing values still exist (Name of value: {})".format(hotel_data.isnull().sum().idxmax()))

In [None]:
# Check for duplicate values

# Note to moderation team: skip this section, I wasn't thinking about how hotel data would probably have duplicated values
# No data in this section is changed, but it does print any instances of columns having duplicated values

# Check for duplicated values in rows
row_duplicates = hotel_data.duplicated().sum()
print(f"Number of duplicated rows: {row_duplicates}\n")

# Check for duplicated values within each column
print("Checking for duplicated values within each column:")
for column in hotel_data.columns:
    col_duplicates = hotel_data[column].duplicated().sum()
    if col_duplicates > 0:
        print(f"Column {column} has {col_duplicates} duplicated values.")

# On reflection: with over 100k repeated values in each column, the repetition is expected, because many bookings share the same hotel, month, meal type and so on
# Regardless, if in the future the duplicated values need to be removed for whatever reason, the code for doing so is "hotel_data.drop_duplicates(inplace=True)".
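
The difference between duplicated rows and repeated values within a column can be seen on a small made-up frame. This is only a sketch. The use of `nunique()` as a per-column summary is a suggestion, not something the original analysis does.

```python
import pandas as pd

# Toy bookings; the second row repeats the first exactly
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 10, 45, 30],
})

# Rows that match an earlier row in every column
duplicate_rows = int(toy.duplicated().sum())

# Repeated values inside one column: normal for a categorical feature
repeated_hotel_values = int(toy["hotel"].duplicated().sum())

# The number of distinct values is usually the more useful column summary
unique_hotels = toy["hotel"].nunique()

print(duplicate_rows, repeated_hotel_values, unique_hotels)
```

Only the first figure would point to bookings that might have been recorded twice.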

In [None]:
# Checking to see if dataset clean-up was successful

# Inspect the structure again to confirm that no missing values remain
hotel_data.info()
# No discrepancies seen

## Phase 2 - Adding new features

In [None]:
# I'll be adding new features to the dataset through the use of pre-existing columns solely to make the process easier.

# Create stay_length as the total nights stayed (weekend + week nights)
hotel_data["stay_length"] = hotel_data["stays_in_weekend_nights"] + hotel_data["stays_in_week_nights"]

# Create total_guests as the total number of people in the booking
hotel_data["total_guests"] = hotel_data["adults"] + hotel_data["children"] + hotel_data["babies"]

# Create a simple indicator is_family (1 if more than one guest, else 0) to capture group/family bookings
hotel_data["is_family"] = (hotel_data["total_guests"] > 1).astype(int)
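
The derived columns can be checked by hand on a few invented bookings. Note that `is_family` as defined above is also 1 for two adults travelling without children. The `has_minors` column below is a hypothetical stricter alternative, shown only for comparison and not part of the original analysis.

```python
import pandas as pd

# Three invented bookings: a solo traveller, a family of three, and a couple
toy = pd.DataFrame({
    "stays_in_weekend_nights": [0, 2, 1],
    "stays_in_week_nights": [1, 5, 0],
    "adults": [1, 2, 2],
    "children": [0, 1, 0],
    "babies": [0, 0, 0],
})

# Same definitions as in the notebook
toy["stay_length"] = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]
toy["total_guests"] = toy["adults"] + toy["children"] + toy["babies"]
toy["is_family"] = (toy["total_guests"] > 1).astype(int)

# Hypothetical alternative: flag only bookings that include children or babies
toy["has_minors"] = ((toy["children"] + toy["babies"]) > 0).astype(int)

print(toy[["stay_length", "total_guests", "is_family", "has_minors"]])
```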

In [None]:
# Making sure new features got added by printing their columns
print(hotel_data.columns[-1])
print(hotel_data.columns[-2])
print(hotel_data.columns[-3])

## Phase 3 - Non-visual analysis

In [None]:
# Get summary statistics for a key numerical feature (I've chosen lead_time here) to understand its distribution (count, mean, quartiles)
hotel_data["lead_time"].describe().round(2)

In [None]:
# Now, we compare the mean values of selected features between non-cancelled (0) and cancelled (1) bookings
features = ["lead_time", "stay_length", "total_guests", "total_of_special_requests", "adr"]
hotel_data.groupby("is_canceled")[features].mean().round(2)

In [None]:
# This part now creates a two-way table of hotel type vs cancellation counts to see how cancellation varies by hotel
pd.crosstab(hotel_data["hotel"], hotel_data["is_canceled"])

In [None]:
# Lastly for this part, we create a two-way table normalised by rows (proportions) to compare cancellation rates within each hotel type
pd.crosstab(hotel_data["hotel"], hotel_data["is_canceled"], normalize="index").round(3)
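
To make the row normalisation concrete, here is a small sketch on invented data. Each row of the normalised table sums to 1, and the column for `1` is simply the cancellation rate. A `groupby` mean gives the same rate, because `is_canceled` is coded 0/1.

```python
import pandas as pd

# Invented data: 2 of 4 city bookings and 1 of 4 resort bookings are cancelled
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "is_canceled": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Proportions within each hotel (each row sums to 1)
rates = pd.crosstab(toy["hotel"], toy["is_canceled"], normalize="index")

# The mean of a 0/1 column is the proportion of 1s, i.e. the cancellation rate
rate_by_hotel = toy.groupby("hotel")["is_canceled"].mean()

print(rates)
print(rate_by_hotel)
```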

## Phase 4 - Visual exploration of key relationships

In [None]:
# Box plot of lead_time by cancellation status to compare distributions (longer lead times may cancel more often)
sns.catplot(kind="box", x="is_canceled", y="lead_time", data=hotel_data, aspect=1)

In [None]:
# Box plot of adr (average daily rate) by cancellation status to see if price level relates to cancellations
sns.catplot(kind="box", x="is_canceled", y="adr", data=hotel_data, aspect=1)

### Histogram

In [None]:
# Stacked histogram of stay_length by cancellation to compare how stay length differs for cancelled vs not
sns.displot(data=hotel_data, x="stay_length", hue="is_canceled", multiple="stack", aspect=1)

### Bar Chart

In [None]:
# Proportion bar chart: show proportion of cancellations within each hotel type for a quick visual comparison
sns.displot(data=hotel_data, x="hotel", hue="is_canceled", stat="proportion", multiple="fill", aspect=1)

### Visualise key relationships

# Building a model

## Phase 1 - Defining input features and target

In [None]:
# Select the features (columns) we want to use as inputs for the model
input_features = ["lead_time", "stay_length", "total_guests", "total_of_special_requests", "adr"]
# These were chosen based on earlier EDA: lead_time, stay_length, total_guests, total_of_special_requests, adr

In [None]:
# X will contain the input features (independent variables)
X = hotel_data[input_features]

In [None]:
# y will contain the target variable (dependent variable) we want to predict: whether the booking was cancelled
y = hotel_data["is_canceled"]

## Phase 2 - Split into training and test sets

In [None]:
# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=1)
# random_state fixes the random shuffle so the same split is produced on every run (reproducibility, not bias removal)
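
A short sketch of what `random_state` does, using invented arrays rather than the hotel data: two calls with the same seed return identical splits. The `stratify=y` argument is an optional extra, not used in the original split, that keeps the proportion of each class the same in the training and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 100 rows, 30% of them in class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 70 + [1] * 30)

# Two splits with the same seed
a_train, a_test, ya_train, ya_test = train_test_split(X, y, train_size=0.8, random_state=1)
b_train, b_test, yb_train, yb_test = train_test_split(X, y, train_size=0.8, random_state=1)
same_split = np.array_equal(a_test, b_test)

# Optional: stratify keeps the 70/30 class balance in both parts
s_train, s_test, ys_train, ys_test = train_test_split(
    X, y, train_size=0.8, random_state=1, stratify=y
)

print(same_split, len(a_test), ys_test.mean())
```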

## Phase 3 - Train a decision tree model

In [None]:
# Create a Decision Tree Classifier with a maximum depth of 4
tree_model = DecisionTreeClassifier(max_depth=4, random_state=1)
# max_depth limits how deep the tree can grow, preventing it from becoming too complex

In [None]:
# Fit (train) the model using the training data
tree_model.fit(X_train, y_train)
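
The effect of `max_depth` can be sketched on synthetic data (generated with `make_classification`, so the numbers say nothing about hotel bookings). A tree capped at depth 4 has at most 16 leaves, while an uncapped tree keeps splitting until it fits the training data perfectly.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=500, n_features=5, random_state=1)

# A tree limited to depth 4, as in the notebook
shallow = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)

# A tree with no depth limit, for comparison
unlimited = DecisionTreeClassifier(random_state=1).fit(X, y)

print("shallow:", shallow.get_depth(), "levels,", shallow.get_n_leaves(), "leaves")
print("unlimited:", unlimited.get_depth(), "levels,", unlimited.get_n_leaves(), "leaves")
```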

In [None]:
# Visualise the trained decision tree using the helper function defined earlier
plot_decision_tree(tree_model)
# (Large image)

## Phase 4 - Making predictions & evaluating

In [None]:
# Use the trained model to predict cancellation outcomes on the test set
y_pred = tree_model.predict(X_test)

In [None]:
# Calculate and print accuracy
print("Accuracy:", round(100*accuracy_score(y_test, y_pred),1), "%")
# Accuracy is the proportion of all predictions (cancelled and not cancelled) that are correct

In [None]:
# Calculate and print precision
print("Precision:", round(100*precision_score(y_test, y_pred, zero_division=0),1), "%")
# "Of the bookings predicted as cancelled, how many were actually cancelled?" is what this code is estimating.

In [None]:
# Calculate and print recall
print("Recall:", round(100*recall_score(y_test, y_pred),1), "%")
# The code answers the question "Of all the bookings that were actually cancelled, how many did the model correctly identify?"
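
The three metrics can be worked through by hand on ten invented bookings, counting true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). The labels below are made up for illustration. The sklearn calls are included only to confirm the arithmetic.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Invented labels: 1 = cancelled, 0 = not cancelled
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cancelled, predicted cancelled
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not cancelled, predicted cancelled
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cancelled, predicted not cancelled
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not cancelled, predicted not cancelled

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted cancellations that were real
recall = tp / (tp + fn)              # share of real cancellations that were found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```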

In [None]:
# Create and display a confusion matrix (cross-tabulation of actual vs predicted outcomes)
print(pd.crosstab(y_test, y_pred, rownames=["Actual"], colnames=["Predicted"], margins=True))
# 0 means the booking was not cancelled and 1 means it was cancelled
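
As a sketch of how to read this table, the invented example below replaces the 0/1 codes with text labels before cross-tabulating. The label text is a choice made here, not part of the dataset. Rows are actual outcomes and columns are predictions, so the "Cancelled" row splits the real cancellations into those the model caught and those it missed.

```python
import pandas as pd

# Text labels for the 0/1 codes (wording chosen for this example)
labels = {0: "Not cancelled", 1: "Cancelled"}

# Invented outcomes for eight bookings
actual = pd.Series([1, 1, 1, 0, 0, 0, 0, 0], name="Actual").map(labels)
predicted = pd.Series([1, 0, 0, 0, 0, 0, 0, 1], name="Predicted").map(labels)

# Rows: actual outcome, columns: predicted outcome, with totals in "All"
table = pd.crosstab(actual, predicted, margins=True)
print(table)
```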

## Phase 5 - Comparison between models

In [None]:
# Train another Decision Tree with a different maximum depth (e.g. 6) to compare performance
tree_model2 = DecisionTreeClassifier(max_depth=6, random_state=1)
tree_model2.fit(X_train, y_train)
y_pred2 = tree_model2.predict(X_test)

In [None]:
# Print evaluation metrics for the second model
print("Accuracy (depth=6):", round(100*accuracy_score(y_test, y_pred2),1), "%")
print("Precision (depth=6):", round(100*precision_score(y_test, y_pred2, zero_division=0),1), "%")
print("Recall (depth=6):", round(100*recall_score(y_test, y_pred2),1), "%")
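
One way to extend this comparison, sketched here on synthetic data rather than the hotel bookings, is to loop over several depths and record accuracy on both the training and test sets. A training score that keeps rising while the test score stalls or falls is the usual sign of overfitting. The depths tried are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y
X, y = make_classification(n_samples=1000, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=1)

# Train one tree per depth; None means no depth limit
results = {}
for depth in (2, 4, 6, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
    results[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))

# Compare training and test accuracy side by side
for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```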

# Conclusion

In this project, the aim was to investigate hotel booking data and build a model to predict whether a booking would be cancelled.

**Data preparation and exploration:**
- The dataset was cleaned by checking for missing values and replacing missing `children` counts with 0.
- The dataset was also checked for duplicate values, but none were removed because repeated values are expected in booking data.
- Three new features were created (`stay_length`, `total_guests`, and `is_family`) to capture more meaningful patterns in the data.
- Exploratory analysis showed that cancelled bookings generally had longer lead times, higher average daily rates (ADR), and fewer special requests.

**Model building and evaluation:**
- A Decision Tree model was trained using features identified during the EDA.
- The model achieved an accuracy of around *72.3%*, with a precision of *66.7%* and recall of *49.3%*.
- Comparing trees of different depths showed that deeper trees improved recall but risked overfitting, while shallower trees were simpler but less accurate.
- The confusion matrix showed that the model was better at predicting non-cancellations than cancellations, which is important context for hotel managers.

**Conclusions in context:**
- The analysis suggests that lead time and special requests are strong indicators of cancellation risk.
- Hotels could use this insight to adjust policies, for example by requiring deposits for long-lead bookings or offering incentives for guests with fewer special requests to reduce cancellations.
- While the model provides useful predictions, it does not account for external factors such as economic conditions, travel restrictions, or customer behaviour outside the dataset.
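
The claim about which features matter most could be checked against the fitted tree's `feature_importances_` attribute. The sketch below uses synthetic data and placeholder feature names, so its numbers carry no meaning for the hotel bookings. In the notebook, the same two lines would be applied to `tree_model` and `input_features`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Placeholder names and synthetic data, for illustration only
feature_names = ["f0", "f1", "f2", "f3", "f4"]
X, y = make_classification(n_samples=500, n_features=5, random_state=1)

model = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)

# Importances sum to 1; larger values mean the feature drove more of the splits
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```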