# **Exploratory Data Analysis of Hotel Booking Data**

This project performs a comprehensive exploratory data analysis (EDA) on a real-world hotel booking dataset. The goal is to uncover key insights into booking trends and customer behavior, with a specific focus on identifying the factors that influence booking cancellations.

The initial steps in this analysis will include:
- **Data Inspection**  
Checking for missing values, data types, and a statistical summary.
- **Data Cleaning**  
Handling missing data and preparing columns for analysis.
- **Feature Engineering**  
Creating new, useful variables from the existing data.

First, let's get a handle on what each of these columns means. I've used **Google Search** to find detailed descriptions of the variables. This is a crucial first step in any data analysis project, as it ensures you understand the context of your data before you start working with it.


---



| Column Name | Description | Data Type |
| :--- | :--- | :--- |
| `hotel` | Type of hotel: "Resort Hotel" or "City Hotel" | Categorical |
| `is_canceled` | Whether the booking was canceled (1) or not (0) | Binary |
| `lead_time` | Number of days between booking and arrival | Numerical (Integer) |
| `arrival_date_year` | Year of arrival date | Numerical (Integer) |
| `arrival_date_month` | Month of arrival date | Categorical |
| `arrival_date_week_number` | Week number of arrival date | Numerical (Integer) |
| `arrival_date_day_of_month` | Day of the month of arrival date | Numerical (Integer) |
| `stays_in_weekend_nights` | Number of weekend nights (Saturday/Sunday) | Numerical (Integer) |
| `stays_in_week_nights` | Number of weeknights (Monday-Friday) | Numerical (Integer) |
| `adults` | Number of adults | Numerical (Integer) |
| `children` | Number of children | Numerical (Integer) |
| `babies` | Number of babies | Numerical (Integer) |
| `meal` | Type of meal booked | Categorical |
| `country` | Country of origin of the guest (as a code) | Categorical |
| `market_segment` | Market segment designation, like "Online TA" or "Groups" | Categorical |
| `distribution_channel` | Booking distribution channel, like "Direct" or "Corporate" | Categorical |
| `is_repeated_guest` | Whether the guest is a repeated guest (1) or not (0) | Binary |
| `previous_cancellations` | Number of previous bookings canceled by the customer | Numerical (Integer) |
| `previous_bookings_not_canceled` | Number of previous bookings not canceled | Numerical (Integer) |
| `reserved_room_type` | Code of the room type reserved | Categorical |
| `assigned_room_type` | Code of the room type assigned at check-in | Categorical |
| `booking_changes` | Number of changes made to the booking | Numerical (Integer) |
| `deposit_type` | Type of deposit made: "No Deposit," "Non Refund," or "Refundable" | Categorical |
| `agent` | ID of the travel agency that made the booking | Categorical/ID |
| `company` | ID of the company that made the booking | Categorical/ID |
| `days_in_waiting_list` | Number of days on the waiting list before booking confirmed | Numerical (Integer) |
| `customer_type` | Type of booking, like "Transient" or "Group" | Categorical |
| `adr` | Average Daily Rate (total lodging divided by number of nights) | Numerical (Float) |
| `required_car_parking_spaces` | Number of car parking spaces requested | Numerical (Integer) |
| `total_of_special_requests` | Number of special requests made | Numerical (Integer) |
| `reservation_status` | Last status of the reservation | Categorical |
| `reservation_status_date` | Date of the last status | Date |

In [None]:
#
# Import Libraries
#
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import calendar

#
# Google Colab: upload the dataset from the local machine into the session
# Alternatively, mount Google Drive and read the file from there
#
from google.colab import files

In [None]:
#
# Suppress scientific notation for clarity in output
#
# pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.set_option('display.float_format', '{:.2f}'.format)
np.set_printoptions(suppress = True, precision = 2)

In [None]:
#
# Upload Data File into Google Colab Worksheet
# /content/hotel_bookings.csv
#
uploadOutput = files.upload()

# **Load Data**

Load the dataset from a CSV file and run some initial checks.

In [None]:
#
# Read the File into a Data Frame and do initial checks
#
fileName = 'hotel_bookings.csv'

print("Loading dataset from '{}'...".format(fileName))
try:
  df = pd.read_csv(fileName)
  print("Dataset loaded successfully.")

  #
  # Print DataFrame Information
  #
  print("")
  print("DataFrame Information")
  print("=====================")
  df.info() # info() prints its report itself and returns None, so it is not wrapped in print()

  #
  # Print DataFrame Statistical Details
  #
  print("")
  print("DataFrame Statistical Details")
  print("=============================")
  print(df.describe())

  #
  # Print First 5 Rows of the DataFrame
  #
  print("")
  print("First 5 rows of the DataFrame")
  print("=============================")
  print(df.head())
except FileNotFoundError:
  print(f"Error: The file '{fileName}' was not found.")

In [None]:
#
# Get the Columns from the Data Frame
#
df.columns

In [None]:
#
# Get Shape of the Data Frame
#
df.shape

In [None]:
#
# Get Missing Values
#
df.isna().sum().sort_values(ascending = False)

In [None]:
#
# Get Missing Values
#
# df.isnull().sum()[df.isnull().sum() > 0]
df.isnull().sum()[df.isnull().sum() > 0].sort_values(ascending = False)

In [None]:
#
# Get Missing Values - as a fraction of rows per column
#
# df.isnull().mean()[df.isnull().mean() > 0]
df.isnull().mean()[df.isnull().mean() > 0].sort_values(ascending = False)

In [None]:
#
# Get Rows that have Missing Values
#
missingValueRows = df[df.isna().any(axis = 1)]
print(missingValueRows)

**Observation on Missing Values**

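The column `company` has 112,593 missing values, which is the large majority of its rows (about 94% if this is the standard 119,390-row version of the dataset), not a small fraction. Imputing a statistic into a column that is mostly empty would invent information, so we treat a missing `company` as "no company" and fill it with a placeholder value instead.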
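As a rough illustration of how to judge the scale of a gap, the sketch below puts the missing count and the missing share side by side. The `toy` frame and the `missingSummary` name are invented for this example and are not rows from the hotel dataset.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only (not rows from the hotel dataset)
toy = pd.DataFrame({
    'company':  [np.nan, np.nan, np.nan, 40.0],
    'children': [0.0, 1.0, np.nan, 0.0],
    'adults':   [2, 2, 1, 2],
})

# Count and share of missing values per column, side by side
missingSummary = pd.DataFrame({
    'missing_count': toy.isna().sum(),
    'missing_share': toy.isna().mean(),
})

# Keep only columns that actually have gaps, worst first
missingSummary = missingSummary[missingSummary['missing_count'] > 0]
missingSummary = missingSummary.sort_values('missing_share', ascending=False)
print(missingSummary)
```

Reading the share next to the count makes it harder to mistake a mostly empty column for a nearly complete one.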

In [None]:
#
# Check for duplicate Rows
#
duplicateRowCount = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicateRowCount}")

In [None]:
#
# Remove Duplicate Rows
#

#
# Take a copy of the original Data Frame before deleting duplicate rows
#
df_WithDuplicateRows = df.copy()
df.drop_duplicates(inplace=True)

In [None]:
#
# Check for logically invalid entries
#

#
# Count bookings with no guests
#
# print(df[['adults', 'children', 'babies']][(df['adults'] == 0) & (df['children'] == 0) & (df['babies'] == 0)])
noGuestsCount = df[(df['adults'] == 0) & (df['children'] == 0) & (df['babies'] == 0)].shape[0]
print(f"Number of bookings with no guests: {noGuestsCount}")

# Count bookings with a zero ADR
zeroADRCount = df[df['adr'] == 0].shape[0]
print(f"Number of bookings with a zero Average Daily Rate (ADR): {zeroADRCount}")

# **Data Cleansing and Feature Engineering**

This section handles missing values, removes invalid entries, and creates new, useful features for a more effective exploratory data analysis.

### **Handle Missing Values**

In [None]:
#
# Handling Missing Values
#

#
# For 'company' and 'agent', we will replace NaN values with 0
# These columns are IDs, and 0 is a good placeholder for "no company/agent"
#
df[['company', 'agent']] = df[['company', 'agent']].fillna(0)
print("Columns 'company' and 'agent' missing values filled with 0")

#
# Comparing 0 vs. Median Imputation
# Imputing with 0
# This is a domain-specific or logical imputation
# It relies on your knowledge of the business context—that a lack of data for
# children means there are no children.
# This approach is simple, accurate in this context, and doesn't distort the
# data distribution
#
#
# Imputing with Median
# This is a purely statistical imputation.
# It assumes that the missing data is missing at random, and the best guess
# for the missing value is the most common or central value of the data we have
# While it's a safe statistical method, for a column like children where 0 is a
# meaningful category in itself, using the median can sometimes be less
# intuitive
#

#
# For 'children', we will fill the few missing values with the median
# The median is robust to outliers and is a sensible choice here
#
# df[['children']] = df[['children']].fillna(df[['children']].median())
# print("Column 'children' missing values filled with the median")
#

#
# For 'children', we will fill the few missing values with 0.
# This is a domain-specific, logical assumption that a blank value means
# no children were booked, which is a more accurate approach than a statistical
# imputation like using the median.
#
# df['children'].fillna(0, inplace = True)
# Using inplace = True raises the below Warning
#
# /tmp/ipython-input-948976368.py:45: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
# The behavior will change in pandas 3.0.
# This inplace method will never work because the intermediate object on which
# we are setting values always behaves as a copy.
#
# For example, when doing 'df[col].method(value, inplace=True)',
# try using 'df.method({col: value}, inplace=True)' or
# df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
#
#   df['children'].fillna(0, inplace = True)
#

# df[['children']] = df[['children']].fillna(0) # Can use this as well
df.fillna({'children': 0}, inplace = True)
print("Column 'children' missing values filled with 0.")

#
# For the 'country' column, we will replace the missing values with 'UNKNOWN'
# This is a more accurate approach than filling with the mode, which could be misleading
#
df[['country']] = df[['country']].fillna('UNKNOWN')
print("Column 'country' missing values filled with 'UNKNOWN'")

#
# Check again for any remaining Missing Values
#
print("\nFinal check for Missing Values")
# df.isnull().sum()[df.isnull().sum() > 0]
df.isnull().sum()[df.isnull().sum() > 0].sort_values(ascending = False)

### **Handle Logical Inconsistencies**

In [None]:
#
# Handling Logical Inconsistencies
#

#
# We remove rows with zero adults. This is broader than the no-guest check above:
# it drops every booking without an adult, including any that list only children or babies.
# Bookings with no guests at all are invalid entries and are removed by the same filter.
#
initialShape = df.shape
# df.drop(df[df['adults'] == 0].index, inplace = True)
df = df.drop(df[df['adults'] == 0].index)
print(f"Removed rows with zero adults. Shape before: {initialShape}, Shape now: {df.shape}")

#
# We will also remove rows where the Average Daily Rate (ADR) is zero.
# ADR is a revenue metric, and a value of 0 is treated as invalid for this analysis.
#
initialShape = df.shape
# df.drop(df[df['adr'] == 0].index, inplace = True)
df = df.drop(df[df['adr'] == 0].index)
print(f"Removed rows with zero ADR. Shape before: {initialShape}, Shape now: {df.shape}")

### **Feature Engineering**

The `arrival_date_month` column stores full month names, so we first map them to numbers using a dictionary comprehension.

---

**Convert month names to numbers**

The comprehension creates a dictionary that maps full month names to their numerical representation.

The `enumerate` function provides an index `i` and value `name` for each item in `calendar.month_name`.

The `if i` condition skips the empty string at index 0, resulting in a dictionary like `{'January': 1, 'February': 2, ...}`.

The `.map()` method then applies this dictionary to the `arrival_date_month` column, transforming the month names into their corresponding numbers.
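The mapping described above can be tried on its own with nothing but the standard library. This is a minimal sketch; `sampleMonths` and `sampleNumbers` are hypothetical names used only here.

```python
import calendar

# calendar.month_name is indexable: index 0 is '', indices 1..12 are the month names
monthToNumber = {name: i for i, name in enumerate(calendar.month_name) if i}

# Hypothetical sample values standing in for the 'arrival_date_month' column
sampleMonths = ['July', 'August', 'January']
sampleNumbers = [monthToNumber[m] for m in sampleMonths]
print(sampleNumbers)
```

Note that `calendar.month_name` follows the current locale, so the English keys assume the default locale. With `.map()`, any name missing from the dictionary becomes `NaN` rather than raising an error, so a quick missing-value check after the mapping is worthwhile.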

In [None]:
#
# Feature Engineering
#

#
# Create a new column 'total_guests' by combining adults, children, and babies.
#
df['total_guests'] = df['adults'] + df['children'] + df['babies']
print("New feature 'total_guests' created.")

#
# Create a 'total_nights' column by combining weekend and weeknight stays.
# This gives a single, clear metric for the length of the stay.
#
df['total_nights'] = df['stays_in_weekend_nights'] + df['stays_in_week_nights']
print("New feature 'total_nights' created.")

#
# Create a new datetime column from the separate date fields.
# This is essential for any time-based analysis and visualization.
# Note: You may need to convert the 'arrival_date_month' to a number first.
#

#
# Mapping month names to numbers for the conversion
# Dictionary Comprehension
#
# Convert month names to numbers
#
# This creates a dictionary to map full month names to their numerical representation.
# The `enumerate` function provides an index `i` and value `name` for each item in the list.
# The `if i` condition cleverly skips the empty string at index 0,
# resulting in a dictionary like: {'January': 1, 'February': 2, ...}.
# The `.map()` method then applies this dictionary to the 'arrival_date_month' column,
# transforming the month names into their corresponding numbers.
#
monthToNumber = {name: i for i, name in enumerate(calendar.month_name) if i}
df['arrival_date_month'] = df['arrival_date_month'].map(monthToNumber)

df['arrival_date'] = pd.to_datetime(
                                     df['arrival_date_year'].astype(str) + '-' +
                                     df['arrival_date_month'].astype(str) + '-' +
                                     df['arrival_date_day_of_month'].astype(str)
                                   )

print("New feature 'arrival_date' created.")

In [None]:
#
# Display the first few rows with the new features to confirm changes.
#
print("")
print("DataFrame head after cleaning and feature engineering:")
# print(df[['total_guests', 'total_nights', 'arrival_date', 'adr', 'country']].head())
df.head()

# **Data Visualization**

Data visualization turns complex data into graphs and charts, which makes patterns, trends, and outliers easier to see and insights easier to communicate.

## **Cancellation Rate by Hotel Type**

In [None]:
#
# Cancellation Rate by Hotel Type
#
# We'll use a bar plot to compare the cancellation rates between the two hotel types
# A bar plot is ideal for comparing a single metric across different categories
# We group the data by 'hotel' and then calculate the mean of the 'is_canceled' column
#
plt.figure(figsize = (8, 6))

#
# The 'sns.barplot' function is used to create the bar chart.
#
# We compute the rate once and take both the labels and the values from the
# same grouped result, so each bar is guaranteed to carry the right label.
# (Pairing 'df['hotel'].unique()' with the grouped means is unsafe: 'unique()'
# returns values in order of appearance, while 'groupby' sorts its keys.)
# 'df.groupby('hotel')' groups the DataFrame by the 'hotel' column.
# '['is_canceled'].mean()' then computes the average of the 'is_canceled'
# column for each group. This average represents the cancellation rate.
#
cancellationByHotel = df.groupby('hotel')['is_canceled'].mean()
sns.barplot(
             x = cancellationByHotel.index,
             y = cancellationByHotel.values
           )

plt.title('Cancellation Rate by Hotel Type', fontsize = 16)
plt.xlabel('Hotel Type')
plt.ylabel('Cancellation Rate')
plt.show()

## **Observation**  

With each bar labeled from the grouped result, the plot shows that **City Hotels have a higher cancellation rate than Resort Hotels**, which agrees with the monthly and market-segment comparisons later in this analysis. Both types of hotels experience cancellations, but the difference is clearly visible.

This is an interesting finding. It suggests that the factors influencing a guest's decision to cancel might be different for a leisure-focused resort compared to a business or short-stay city hotel.

To dig deeper, we could try to answer some follow-up questions:

* Do resort hotel guests book further in advance, giving them more time to change their plans?

* Are there specific months or seasons where resort cancellations are higher?

* Do certain market segments (e.g., "Groups" or "Direct") have a disproportionately high cancellation rate for resorts?
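The cancellation rate behind the bar plot is just the mean of a 0/1 column per group. The sketch below shows that computation on a toy frame invented for illustration. It also shows why labels should come from the grouped result's index: `unique()` and `groupby` do not promise the same ordering.

```python
import pandas as pd

# Toy bookings, invented for illustration (not rows from the hotel dataset).
# Resort rows come first on purpose, so order of appearance differs from sorted order.
toy = pd.DataFrame({
    'hotel':       ['Resort Hotel', 'Resort Hotel', 'City Hotel',
                    'City Hotel', 'City Hotel', 'Resort Hotel'],
    'is_canceled': [0, 1, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the cancellation rate
rateByHotel = toy.groupby('hotel')['is_canceled'].mean()

appearanceOrder = list(toy['hotel'].unique())   # order of first appearance
groupedOrder = list(rateByHotel.index)          # group keys, sorted by default

print(rateByHotel)
print(appearanceOrder)
print(groupedOrder)
```

Because the two orderings differ here, plotting `unique()` labels against the grouped values would attach each rate to the wrong hotel. Using `rateByHotel.index` and `rateByHotel.values` together avoids that.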

**Why use a KDE plot?**  

A **KDE (Kernel Density Estimate)** plot is a powerful tool for visualizing the distribution of a continuous variable. It's similar to a histogram but presents the data as a smooth curve, making it easier to see the overall shape of the distribution without the "binning" effect of a histogram. We use a KDE plot for this analysis for a few key reasons:

**Direct Comparison**  
By plotting both Resort and City Hotels on the same graph, we can directly compare their distributions.

**Highlighting Peaks**  
The smooth curve clearly shows where the majority of cancellations occur. This is especially useful for seeing subtle differences in peaks, such as the one we'll discuss below.

**Overall Shape**  
It provides a clear visual summary of the data, helping us understand the full range of lead times for cancellations.
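To make the idea of a smooth density concrete, here is a minimal hand-written Gaussian KDE using only NumPy. It is a sketch of the general technique, not seaborn's implementation: seaborn picks the bandwidth automatically, while the fixed `bandwidth` of 20 days, the `gaussianKde` helper, and the sample lead times below are all assumptions made for this example.

```python
import numpy as np

def gaussianKde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each point of `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One row per grid point, one column per sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale by the bandwidth
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

# Hypothetical lead times in days, invented for illustration
leadTimes = [2, 5, 7, 10, 30, 45, 120, 200]
grid = np.linspace(-200, 500, 1401)   # evenly spaced, step of 0.5 days
density = gaussianKde(leadTimes, grid, bandwidth=20.0)

# A density should integrate to roughly 1 (simple Riemann sum over the grid)
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

Each observation contributes one small bell curve, and the estimate is their average. One side effect is that the curve can extend below zero even though lead times cannot be negative, which is one reason to limit the x-axis when plotting.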

## **Lead Time Distribution for Cancelled Bookings**

In [None]:
#
# Lead Time Distribution for Cancelled Bookings
#
# We'll use a KDE (Kernel Density Estimate) plot to visualize the distribution
# of 'lead_time' for cancelled bookings, separated by hotel type. This helps us
# see if guests who cancel resort bookings tend to book much further in advance.
#
plt.figure(figsize=(12, 6))
sns.kdeplot(
             data        = df[df['is_canceled'] == 1],
             x           = 'lead_time',
             hue         = 'hotel',
             fill        = True,
             common_norm = False,
           )
plt.title('Lead Time Distribution for Cancelled Bookings', fontsize = 16)
plt.xlabel('Lead Time (Days)')
plt.ylabel('Density')
plt.xlim(0, 400) # Limit the x-axis for better readability
plt.show()

## **Observation**
**Lead Time Distribution Analysis**  

The plot shows a **clear and distinct difference** in the lead time of cancelled bookings between the two hotel types.

**City Hotels** have a high density of cancelled bookings at very short lead times, peaking sharply around **0-25 days**. Because `lead_time` measures the gap between booking and arrival, this means many cancelled city bookings were made shortly before the stay. It does not by itself show when the cancellation happened.

**Resort Hotels** have a much broader and flatter distribution. While they still have a peak at short lead times, their density remains high for bookings made much further in advance, even out to **100-200 days** or more.

The takeaway here is that **cancelled resort bookings tend to have been made further in advance**. This is consistent with our initial hypothesis that guests planning a resort vacation long ahead have more time for their plans to change. It does not explain overall cancellation rates on its own, however, since the monthly and market-segment comparisons show City Hotels with the higher cancellation rate.
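A numeric summary can back up what the density curves suggest. The sketch below compares the median lead time of cancelled bookings per hotel type. The `toy` frame is invented for illustration, so its numbers say nothing about the real data.

```python
import pandas as pd

# Toy bookings, invented for illustration (not rows from the hotel dataset)
toy = pd.DataFrame({
    'hotel':       ['City Hotel'] * 4 + ['Resort Hotel'] * 4,
    'is_canceled': [1, 1, 0, 1, 1, 0, 1, 1],
    'lead_time':   [3, 10, 50, 20, 40, 5, 150, 90],
})

# Keep cancelled bookings only, then summarize lead time per hotel type
cancelled = toy[toy['is_canceled'] == 1]
leadTimeSummary = cancelled.groupby('hotel')['lead_time'].agg(['count', 'median'])
print(leadTimeSummary)
```

The median is used rather than the mean because lead times are typically right-skewed, and a few very early bookings would otherwise dominate the comparison.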

## **Monthly Cancellation Rate by Hotel Type**

In [None]:
#
# Monthly Cancellation Rate by Hotel Type
#
# A line plot is a good way to show how the cancellation rate changes over the year
# for each hotel type, allowing for a direct comparison of their seasonal patterns.
#
plt.figure(figsize=(12, 6))

#
# We group by both 'hotel' and 'arrival_date_month' and then take the mean of
# 'is_canceled' to get the cancellation rate for each month and hotel type
#
cancellationByMonth = df.groupby(['hotel', 'arrival_date_month'])['is_canceled'].mean().reset_index()
sns.lineplot(
              data   = cancellationByMonth,
              x      = 'arrival_date_month',
              y      = 'is_canceled',
              hue    = 'hotel',
              marker = 'o'
            )
plt.title('Monthly Cancellation Rate by Hotel Type', fontsize = 16)
plt.xlabel('Month')
plt.ylabel('Cancellation Rate')
plt.xticks(range(1, 13), calendar.month_abbr[1:]) # Use short month names for x-axis labels
# plt.xticks(range(1, 13), calendar.month_name[1:]) # Use full month names for x-axis labels
plt.grid(True)
plt.show()

## **Observation**

**Monthly Cancellation Rate Analysis**  

The line plot reveals a significant seasonal pattern for both hotel types, but with different magnitudes and peak periods.

**City Hotels** have a consistently higher cancellation rate throughout the year compared to Resort Hotels, with their rates fluctuating between approximately 26% and 34%. They experience their highest cancellation peaks during **April and August**, which are often popular travel months. Their lowest point is in **February**, a period of low travel.

**Resort Hotels** have a much more pronounced seasonal swing. Their cancellation rate is at its lowest in the cooler months, specifically **January and February**, but it rises sharply as the weather warms, reaching its highest point in **June, July, and August**. This coincides with the peak summer vacation season, when demand for vacation spots is highest.

The key inference is that while City Hotels see a high cancellation rate throughout the year, Resort Hotels see a marked increase in cancellations during their peak season. One plausible explanation is the nature of resort vacations: people book months in advance for their summer getaways, and in the meantime their plans may change or they may find a better deal. In this data, the peak season for resort bookings is also the peak season for resort cancellations.

## **Cancellation Rate by Market Segment for Each Hotel**

In [None]:
#
# Cancellation Rate by Market Segment for Each Hotel
#
# We use a bar plot to compare the cancellation rate across different market segments.
# The 'hue' parameter in seaborn separates the bars by 'hotel' type.
#
plt.figure(figsize=(14, 7))
sns.barplot(
             data = df.groupby(['market_segment', 'hotel'])['is_canceled'].mean().reset_index(),
             x    = 'market_segment',
             y    = 'is_canceled',
             hue  = 'hotel'
)
plt.title('Cancellation Rate by Market Segment for Each Hotel Type', fontsize = 16)
plt.xlabel('Market Segment')
plt.ylabel('Cancellation Rate')
plt.xticks(rotation = 45, ha = 'right')
plt.tight_layout()
plt.show()

## **Observation**  

**Interpreting the Bar Chart**  

The bar chart's key insight is the stark difference in cancellation behavior across various market segments and hotel types. The height of each bar represents the average cancellation rate for that specific combination.

**Online Travel Agents (Online TA) is the riskiest segment**  
Bookings coming from online travel agencies have a very high cancellation rate for both City and Resort hotels. This could be because customers are "shopping around" online, booking a room they might not be sure about, and then canceling later.

**The "Complimentary" segment has the lowest cancellation rate**  
It is the lowest on the entire chart. This is plausible, as these bookings are likely for a free stay (e.g., promotional bookings, staff stays, or rewards), so guests have little reason to give them up.

**"Corporate" bookings are also very reliable**  
They can still be canceled due to last-minute business changes, so their cancellation rate is slightly, but consistently, higher than that of "Complimentary" bookings.

**City Hotels generally have a higher cancellation rate**  
For nearly every market segment, the blue bars (City Hotel) are taller than the orange bars (Resort Hotel). This suggests that City Hotel bookings, perhaps due to their nature (business travel, short stays), are more likely to be canceled than resort vacations, which are often planned well in advance. The only exception is the Groups segment, where Resort Hotels show a slightly higher cancellation rate, likely due to the complexity of managing a large group's travel plans.

**The "Undefined" segment is a problem**  
The "Undefined" market segment has a cancellation rate of 100% for City Hotels and a very high rate for Resort Hotels. This suggests a data cleaning or collection issue. A rate of exactly 100% often means the segment contains only a handful of rows, so the number of bookings behind each bar is worth checking. This data should be investigated further or removed from the analysis, as it can skew the results and make it difficult to draw accurate conclusions about a hotel's performance.
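One way to check whether an extreme rate rests on a tiny sample is to report the booking count next to the rate for every segment. This is a minimal sketch on a toy frame invented for illustration. The `minBookings` threshold is an arbitrary assumption, not a recommended value.

```python
import pandas as pd

# Toy bookings, invented for illustration (not rows from the hotel dataset)
toy = pd.DataFrame({
    'market_segment': ['Online TA'] * 5 + ['Corporate'] * 4 + ['Undefined'],
    'is_canceled':    [1, 0, 1, 0, 0, 0, 0, 1, 0, 1],
})

# Booking count and cancellation rate per segment, smallest segments first
segmentSummary = (
    toy.groupby('market_segment')['is_canceled']
       .agg(bookings='count', cancellation_rate='mean')
       .sort_values('bookings')
)

# Flag segments too small for their rate to be meaningful
minBookings = 3   # arbitrary threshold chosen for this sketch
tooSmall = segmentSummary[segmentSummary['bookings'] < minBookings]

print(segmentSummary)
print(tooSmall.index.tolist())
```

Segments flagged this way can then be excluded from the chart or reported separately, so that a single booking does not show up as a 100% cancellation rate.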

## **Number of Bookings Per Month (Seasonality Analysis)**

In [None]:
#
# Number of Bookings Per Month (Seasonality Analysis)
#
# A line plot is perfect for showing trends over time, like monthly booking volume.
#
plt.figure(figsize=(10, 6))

#
# We use `sns.lineplot` to create the plot.
#
# `df['arrival_date_month'].value_counts()` counts the occurrences of each month.
# `.sort_index()` ensures the months are in chronological order (1, 2, 3, etc.).
# The counts are computed once and reused for both axes.
#
# `x` axis: `.index` gets the month numbers (e.g., [1, 2, 3, ...]).
# `y` axis: `.values` gets the corresponding booking counts for each month.
#
bookingsPerMonth = df['arrival_date_month'].value_counts().sort_index()
sns.lineplot(
              x = bookingsPerMonth.index,
              y = bookingsPerMonth.values,
            )
plt.title('Total Bookings per Month', fontsize = 16)
plt.xlabel('Month')
plt.ylabel('Number of Bookings')
plt.show()

## **Observation**

**The seasonal pattern in the "Total Bookings per Month"** chart is consistent with demand driven by school holidays.

The graph shows a clear and predictable trend that's very typical for the travel industry. Bookings steadily increase from the start of the year, culminating in a significant peak around months 7 and 8, which correspond to July and August. This is the heart of the summer vacation season, a time when families and students are off school, making it the most popular time for leisure travel.

Following this peak, there is a steep decline in September (month 9) as schools and businesses resume normal schedules, and bookings remain at a lower level throughout the fall and winter months.

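This pattern suggests that hotel bookings are heavily influenced by **leisure travelers** taking advantage of the summer break, as opposed to a steady flow of business travelers. This is a crucial insight for hotel management, as it allows them to predict demand and adjust pricing and staffing accordingly. One caveat: these are raw totals across all years in the data, so any month that appears in more years than the others is overstated. The commonly distributed version of this dataset covers arrivals from July 2015 to August 2017, in which case July and August are counted three times and the other months twice. Comparing per-year averages would be a fairer check.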

## **Total Bookings per Month**

In [None]:
#
# Total Bookings per Month
# Creates a line plot to show the total number of bookings by month, separated by hotel type
#

#
# Calculate the total number of bookings per month by hotel type
#
# This groups the DataFrame by month and hotel type, then counts the number of rows in each group
# using .size(), and resets the index to turn it back into a DataFrame.
#
monthlyBookings = df.groupby(['arrival_date_month', 'hotel']).size().reset_index(name='total_bookings')

#
# Define the correct order of months to ensure the plot is chronological
# This creates a list of full month names in chronological order
#
monthNames = [calendar.month_name[i] for i in range(1, 13)] # The categorical month was converted to a number above

#
# Map the numerical month to its name
#
# A new column 'arrival_date_month_name' is created by applying a lambda function
# that converts the numerical month to its corresponding name
#
monthlyBookings['arrival_date_month_name'] = monthlyBookings['arrival_date_month'].apply(lambda x: calendar.month_name[x])

#
# Convert the new month name column to a Categorical data type with a defined order.
# This ensures that when Seaborn plots the data, the months appear in the correct
# chronological sequence on the x-axis, rather than in alphabetical order
#
monthlyBookings['arrival_date_month_name'] = pd.Categorical(monthlyBookings['arrival_date_month_name'], categories = monthNames, ordered = True)

#
# Sort the DataFrame based on the ordered month names to prepare for plotting
#
monthlyBookings = monthlyBookings.sort_values('arrival_date_month_name')

#
# Create the line plot
#
plt.figure(figsize = (14, 8)) # Creates a new figure with a specified size

#
# sns.lineplot is a function from the Seaborn library for creating line plots
# data   : The DataFrame containing the data to be plotted
# x      : The column to be used for the x-axis (the ordered month names)
# y      : The column to be used for the y-axis (the total bookings)
# hue    : The column to use for color separation, creating a different line
#          for each hotel type
# marker : Adds a marker ('o' for circle) at each data point on the line
#
sns.lineplot(
              data   = monthlyBookings,
              x      = 'arrival_date_month_name',
              y      = 'total_bookings',
              hue    = 'hotel',
              marker = 'o'
            )

#
# plt.title is from Matplotlib and sets the title of the plot
#
plt.title('Total Bookings per Month by Hotel Type', fontsize = 18)

#
# plt.xlabel and plt.ylabel set the labels for the x and y axes
#
plt.xlabel('Month', fontsize = 14)
plt.ylabel('Total Bookings', fontsize = 14)

#
# plt.grid adds a grid to the plot for easier reading of values
#
plt.grid(True, linestyle = '--', alpha = 0.6)

#
# plt.legend sets the title for the legend, which explains what the colors
# represent
#
plt.legend(title = 'Hotel Type')

#
# plt.xticks rotates the x-axis labels by 45 degrees to prevent them from
# overlapping
#
plt.xticks(rotation = 45)

#
# plt.tight_layout automatically adjusts the plot parameters to give a tight
# layout
#
plt.tight_layout()

#
# plt.show() displays the generated plot
#
plt.show()

## **Observation**

**City hotels have a consistently higher total booking** volume than resort hotels throughout the year.

While both types of hotels experience their peak booking seasons during the summer months (July and August), the number of bookings for city hotels remains significantly higher than for resort hotels in every single month of the year. This suggests a more stable, year-round demand for city accommodations compared to the more seasonal nature of resort stays.

In [None]:
#
# Monthly Cancellation Rate - Alternative Method
# The observation under 'Monthly Cancellation Rate by Hotel Type' above still applies
#
# Creates a line plot to show the monthly cancellation rate by hotel type
#

#
# Calculate monthly cancellation rate by hotel type
# This groups the DataFrame by month and hotel type, then calculates the mean of 'is_canceled' for each group.
# The result is stored in the monthlyCancellations DataFrame.
#
monthlyCancellations = df.groupby(['arrival_date_month', 'hotel'])['is_canceled'].mean().reset_index()

# Define the correct order of months to ensure the plot is chronological
# This creates a list of full month names in chronological order.
monthNames = [calendar.month_name[i] for i in range(1, 13)]

# Map the numerical month to its name.
# A new column 'arrival_date_month_name' is created by applying a lambda function
# that converts the numerical month (e.g., 1) to its corresponding name (e.g., 'January').
monthlyCancellations['arrival_date_month_name'] = monthlyCancellations['arrival_date_month'].apply(lambda x: calendar.month_name[x])

# Convert the new month name column to a Categorical data type with a defined order.
# This ensures that when Seaborn plots the data, the months appear in the correct
# chronological sequence on the x-axis, rather than in alphabetical order.
monthlyCancellations['arrival_date_month_name'] = pd.Categorical(monthlyCancellations['arrival_date_month_name'], categories = monthNames, ordered = True)

# Sort the DataFrame based on the ordered month names to prepare for plotting.
monthlyCancellations = monthlyCancellations.sort_values('arrival_date_month_name')

#
# Create the line plot
#
plt.figure(figsize = (14, 8)) # Creates a new figure with a specified size for better readability.

#
# sns.lineplot is a function from the Seaborn library for creating line plots
# data   : The DataFrame containing the data to be plotted
# x      : The column to be used for the x-axis (the ordered month names)
# y      : The column to be used for the y-axis (the cancellation rate)
# hue    : The column to use for color separation, creating a different line for each hotel type
# marker : Adds a marker ('o' for circle) at each data point on the line
#
sns.lineplot(data = monthlyCancellations, x = 'arrival_date_month_name', y = 'is_canceled', hue = 'hotel', marker = 'o')

# plt.title is from Matplotlib and sets the title of the plot.
plt.title('Monthly Cancellation Rate by Hotel Type', fontsize = 18)

# plt.xlabel and plt.ylabel set the labels for the x and y axes.
plt.xlabel('Month', fontsize = 14)
plt.ylabel('Cancellation Rate', fontsize = 14)

# plt.grid adds a grid to the plot for easier reading of values.
plt.grid(True, linestyle = '--', alpha = 0.6)

# plt.legend sets the title for the legend, which explains what the colors represent.
plt.legend(title = 'Hotel Type')

# plt.xticks rotates the x-axis labels by 45 degrees to prevent them from overlapping.
plt.xticks(rotation = 45)

# plt.tight_layout automatically adjusts the plot parameters to give a tight layout.
plt.tight_layout()

# plt.show() displays the generated plot.
plt.show()

## **Top 10 Countries by Guest Count**

In [None]:
#
# Top 10 Countries by Guest Count
#
# A bar plot is a great way to visualize the number of bookings from the top
# countries (each row is one booking, so we count bookings, not individual guests)
#

#
# We use 'value_counts()' to count the occurrences of each country and then
# '.head(10)' to select only the top 10 most frequent countries
#
top10Countries = df['country'].value_counts().head(10)

plt.figure(figsize=(12, 6))
#
# 'top10Countries.index' provides the country codes for the x-axis,
# and 'top10Countries.values' provides the booking counts for the y-axis
#
sns.barplot(x = top10Countries.index, y = top10Countries.values)
plt.title('Top 10 Countries with Most Bookings', fontsize = 16)
plt.xlabel('Country')
plt.ylabel('Number of Bookings')
plt.show()

## **Observation**
Portugal (PRT) is by far the top country for hotel bookings, well ahead of every other country in the top 10.

Additionally, the chart shows that the vast majority of the top booking countries are in Europe.

## **Correlation and Heat Map**

This section generates a correlation heatmap for the numerical features
of the hotel booking dataset. A correlation heatmap is a powerful
tool to visualize the linear relationships between variables,
which can help in feature selection and understanding the data.

In [None]:
#
# Select only numerical features for the correlation matrix
#
numericalFeatures = df.select_dtypes(include=[np.number])

#
# Calculate the correlation matrix and store it under its own name,
# so 'numericalFeatures' keeps referring to the selected columns
#
correlationMatrix = numericalFeatures.corr()

#
# Create a heatmap of the correlation matrix
#
plt.figure(figsize = (14, 12))
sns.heatmap(correlationMatrix, annot = True, fmt = ".2f", cmap = 'coolwarm', linewidths = .5)
plt.title('Correlation Heatmap of Hotel Booking Features', fontsize = 18)
plt.show()

## **Heatmap Insights**


---


The heatmap visualizes the correlation coefficient between each pair of numerical features.  

The values range from -1 to 1. A value close to 1 indicates a strong positive correlation, a value close to -1 indicates a strong negative correlation, and a value close to 0 indicates little or no linear relationship (a non-linear relationship may still exist).  

Look for values close to 1 or -1 to identify the strongest relationships.
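
Reading the strongest relationships off a large heatmap by eye is error-prone, so it can help to rank them numerically. The sketch below uses a tiny made-up DataFrame (not the real dataset) and sorts the other features by the absolute value of their correlation with `is_canceled`. The names `sample`, `correlationMatrix`, and `targetCorrelations` are illustrative.

```python
import pandas as pd

# Tiny made-up sample; in the notebook this would be the numerical columns of `df`
sample = pd.DataFrame({
    "is_canceled": [1, 0, 1, 0, 1, 0],
    "lead_time": [200, 10, 150, 30, 300, 5],
    "adults": [2, 2, 1, 2, 2, 1],
    "babies": [0, 1, 0, 0, 0, 1],
})

correlationMatrix = sample.corr()

# Rank the remaining features by the strength (absolute value) of their
# correlation with the target, strongest first; the sign is kept in the output
targetCorrelations = (
    correlationMatrix["is_canceled"]
    .drop("is_canceled")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(targetCorrelations)
```

Sorting by absolute value keeps strong negative correlations near the top instead of burying them at the bottom of the list.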

---

# **Summary and Next Steps**

This project aims to provide a comprehensive exploratory data analysis of a hotel booking dataset. The analysis so far has focused on data cleaning and the creation of fundamental features, such as **`total_nights`** and **`total_guests`**. The initial visualizations have helped us understand basic relationships and trends within the data.

## **Key Insights from Initial Analysis**
- The `is_canceled` variable is the central focus and the natural target for a model that predicts booking cancellations.
- **`total_nights`** and **`total_guests`** are crucial metrics for segmenting and understanding customer behavior.
- Box plots of `lead_time` and `adr` against cancellation status have provided preliminary insights into the distribution differences between canceled and non-canceled bookings.

---

## **Possible Additional Feature Engineering**

To build a more robust and predictive model, we can create additional features that capture more nuanced aspects of the booking behavior.

### **Temporal Features**
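- **`booking_window`**: The time difference between the `reservation_status_date` and the arrival date. The dataset has no single `arrival_date` column, so it must first be assembled from `arrival_date_year`, `arrival_date_month`, and `arrival_date_day_of_month`. This can provide insight into how far in advance a cancellation or check-in occurs.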
- **`time_of_year`**: Categorize months into seasons (e.g., 'high_season', 'low_season') to see if cancellations are seasonal.
- **`day_of_week`**: Extract the day of the week from the arrival date to see if bookings on certain days are more prone to cancellation.
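
A minimal sketch of these three temporal features, on made-up rows, might look as follows. It assumes the arrival month is stored as a number (as the monthly cancellation plot above treats it); if it is stored as a month name, it would need to be converted first. The choice of June to August as `high_season` is purely hypothetical and should be replaced with a definition that fits the hotels in question.

```python
import pandas as pd

# Made-up rows; column names follow the dataset, values are illustrative only
sample = pd.DataFrame({
    "arrival_date_year": [2016, 2016, 2017],
    "arrival_date_month": [7, 12, 2],  # assumption: month stored as a number
    "arrival_date_day_of_month": [1, 24, 14],
    "reservation_status_date": ["2016-05-06", "2016-12-27", "2017-02-14"],
})

# Assemble a single datetime column from the three arrival columns
sample["arrival_date"] = pd.to_datetime({
    "year": sample["arrival_date_year"],
    "month": sample["arrival_date_month"],
    "day": sample["arrival_date_day_of_month"],
})
sample["reservation_status_date"] = pd.to_datetime(sample["reservation_status_date"])

# booking_window: days from the last status change to arrival
# (positive = status changed before arrival, negative = after arrival)
sample["booking_window"] = (
    sample["arrival_date"] - sample["reservation_status_date"]
).dt.days

# day_of_week: weekday name of the arrival date
sample["day_of_week"] = sample["arrival_date"].dt.day_name()

# time_of_year: hypothetical season definition, adjust as needed
highSeasonMonths = [6, 7, 8]
sample["time_of_year"] = (
    sample["arrival_date"].dt.month.isin(highSeasonMonths)
    .map({True: "high_season", False: "low_season"})
)
print(sample[["arrival_date", "booking_window", "day_of_week", "time_of_year"]])
```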

### **Behavioral Features**
- **`has_children`**: A simple binary flag (`True`/`False`) to indicate if the booking includes any children or babies. This is useful for analyzing family-specific trends.
- **`total_previous_bookings`**: The sum of `previous_cancellations` and `previous_bookings_not_canceled` to get a complete history of the customer's booking behavior.
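- **`cancellation_ratio`**: A ratio of `previous_cancellations` to `total_previous_bookings`. This can be a strong predictor of a repeated guest's cancellation probability. For guests with no booking history the denominator is zero, so the ratio needs a default value (for example, 0).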
- **`is_group_booking`**: A flag for bookings with a high number of guests, which may have different cancellation patterns.
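
The behavioral features above could be sketched like this on a few made-up rows. Two things here are assumptions rather than part of the analysis so far: the total guest count is taken to be `adults + children + babies`, and the group threshold of 5 guests (`groupThreshold`) is an arbitrary illustrative cut-off.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names follow the dataset, values are illustrative only
sample = pd.DataFrame({
    "adults": [2, 2, 6, 1],
    "children": [0, 1, 0, 0],
    "babies": [0, 0, 0, 1],
    "previous_cancellations": [0, 1, 0, 3],
    "previous_bookings_not_canceled": [0, 3, 2, 1],
})

# has_children: True if the booking includes any children or babies
sample["has_children"] = (sample["children"] + sample["babies"]) > 0

# total_previous_bookings: complete booking history of the customer
sample["total_previous_bookings"] = (
    sample["previous_cancellations"] + sample["previous_bookings_not_canceled"]
)

# cancellation_ratio: guard against division by zero for first-time guests
# by treating a zero denominator as missing, then defaulting the ratio to 0
sample["cancellation_ratio"] = (
    sample["previous_cancellations"]
    / sample["total_previous_bookings"].replace(0, np.nan)
).fillna(0.0)

# is_group_booking: hypothetical threshold, tune it against the real data
groupThreshold = 5
totalGuests = sample["adults"] + sample["children"] + sample["babies"]
sample["is_group_booking"] = totalGuests >= groupThreshold

print(sample)
```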

---

## **Additional Visualizations**

The current visualizations are a great start, but we can dive deeper into the data with more targeted plots.

### **Booking Status by Market Segment**
A stacked bar chart can reveal which market segments (`market_segment`) have the highest and lowest cancellation rates. This is a crucial insight for hotel revenue management.
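
One way to prepare the data for such a chart is a row-normalized cross-tabulation, sketched below on made-up rows. The segment labels and the name `segmentShares` are illustrative.

```python
import pandas as pd

# Made-up rows; values are illustrative only
sample = pd.DataFrame({
    "market_segment": ["Online TA", "Online TA", "Direct", "Direct",
                       "Groups", "Groups", "Groups", "Online TA"],
    "is_canceled": [1, 0, 0, 0, 1, 1, 0, 1],
})

# normalize="index" turns the counts in each row into shares that sum to 1,
# so segments of very different sizes become directly comparable
segmentShares = pd.crosstab(
    sample["market_segment"], sample["is_canceled"], normalize="index"
).rename(columns={0: "Not canceled", 1: "Canceled"})

# Put the segments with the highest cancellation share first
segmentShares = segmentShares.sort_values("Canceled", ascending=False)
print(segmentShares)
```

Calling `segmentShares.plot(kind="bar", stacked=True)` on the result would then draw the stacked bar chart, with each bar split into canceled and non-canceled shares.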

### **Cancellation Rate by Country**
A world map or a bar chart showing the top 10 countries with the highest cancellation rates can highlight potential geographical trends.
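
A caveat when ranking countries by rate: a country with a single canceled booking would show a 100% cancellation rate. The sketch below, on made-up rows, therefore filters out countries with too few bookings before ranking. The `minBookings` threshold of 3 only suits this toy sample; a much larger value would be appropriate for the real data.

```python
import pandas as pd

# Made-up rows; values are illustrative only
sample = pd.DataFrame({
    "country": ["PRT", "PRT", "PRT", "PRT", "GBR", "GBR", "GBR", "FRA"],
    "is_canceled": [1, 1, 1, 0, 0, 0, 1, 1],
})

# Booking count and cancellation rate per country
countryStats = sample.groupby("country")["is_canceled"].agg(
    bookings="count", cancellation_rate="mean"
)

# Hypothetical minimum sample size, so tiny countries do not dominate the ranking
minBookings = 3
topCancelers = (
    countryStats[countryStats["bookings"] >= minBookings]
    .sort_values("cancellation_rate", ascending=False)
    .head(10)
)
print(topCancelers)
```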

### **Impact of Deposit Type**
A simple count plot or a pie chart can show the distribution of cancellations based on the `deposit_type` (`No Deposit`, `Non Refund`, `Refundable`). This is likely to be a strong indicator, as non-refundable deposits are designed to discourage cancellations, but the assumption should be checked against the data rather than taken for granted.

### **Special Requests and Cancellations**
A bar chart comparing the cancellation rates for bookings with and without special requests can help us understand whether these requests are linked to more committed bookings.
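
As a rough sketch of the comparison, the rows below are made up, and the column name `total_of_special_requests` is an assumption about how the dataset stores the number of requests; it should be checked against the actual column list.

```python
import pandas as pd

# Made-up rows; 'total_of_special_requests' is an assumed column name
sample = pd.DataFrame({
    "total_of_special_requests": [0, 0, 1, 2, 0, 3],
    "is_canceled": [1, 1, 0, 0, 0, 1],
})

# Collapse the request count into a two-level label for plotting
sample["request_group"] = (
    sample["total_of_special_requests"].gt(0)
    .map({True: "With requests", False: "No requests"})
)

# Mean of a 0/1 column is the cancellation rate within each group
rateByRequest = sample.groupby("request_group")["is_canceled"].mean()
print(rateByRequest)
```

`rateByRequest.plot(kind="bar")` would turn this into the bar chart described above.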

### **Time-Series Analysis of ADR**
A line plot showing the average daily rate (`adr`) over time, segmented by `hotel` type, can reveal pricing strategies and seasonal price fluctuations.
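
A possible way to shape the data for this plot is to average `adr` per hotel and calendar month, as sketched below on made-up rows. The sketch assumes a datetime column named `arrival_date` has already been assembled from the arrival year, month, and day columns; the name `monthlyAdr` is illustrative.

```python
import pandas as pd

# Made-up rows; 'arrival_date' is assumed to have been built beforehand
sample = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel",
              "Resort Hotel", "City Hotel", "Resort Hotel"],
    "arrival_date": pd.to_datetime([
        "2016-07-03", "2016-07-20", "2016-07-10",
        "2016-08-05", "2016-08-15", "2016-08-25",
    ]),
    "adr": [100.0, 120.0, 150.0, 200.0, 90.0, 180.0],
})

# Average adr per hotel and calendar month, then pivot the hotel types
# into columns so that each hotel becomes one line in a plot
monthlyAdr = (
    sample
    .groupby(["hotel", sample["arrival_date"].dt.to_period("M")])["adr"]
    .mean()
    .unstack("hotel")
)
print(monthlyAdr)
```

With one column per hotel type, `monthlyAdr.plot(marker="o")` would draw one line per hotel over time.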

---
This video on [Feature Engineering for Predicting Hotel Bookings](https://www.youtube.com/watch?v=7JybsRgzyYI) provides a great overview of how to decide which features to engineer and how to evaluate their impact on a machine learning model.