<a href="https://colab.research.google.com/github/Gaurav-Negi142/My-Projects-/blob/main/Capstone_project_module_2_Hotel_booking_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **PROJECT NAME -  Booking.com(Hotel Booking Analysis)**



##### **Project Type**    - Classification
##### **Contribution**    - Individual


# **Project Summary -**


  The Hotel Booking Analysis project, based on data from BOOKING.COM, offers valuable insights into the key factors influencing hotel bookings and customer behavior. By leveraging these insights, hotel managers can make data-driven decisions to optimize operations, enhance the guest experience, and increase overall profitability. The analysis highlights the significance of understanding booking trends, customer segmentation, and cancellation patterns in today’s highly competitive hospitality industry.

# **GitHub Link -**

https://github.com/Gaurav-Negi142/My-Projects-/blob/07f0abd0c4e13d47a8f6498de3fb45483453b335/Capstone_project_module_2_Hotel_booking_analysis.ipynb

# Key Features of the Dataset :


Hotel Type: Specifies the category of the hotel, such as City Hotel or Resort Hotel.

Booking Time: Indicates when the booking was made relative to the check-in date.

Check-in/Check-out Dates: Records the guest's arrival and departure dates.

Length of Stay: Captures the duration of the stay, measured in nights.

Booking Channels: Identifies the source of the booking, such as Online Travel Agencies, direct bookings, or travel agents.

Customer Demographics: Provides customer information such as country of origin and customer type.

Booking Status: Reflects whether a booking was confirmed or canceled.

Special Requests: Notes any specific requests made by the guests during the booking process.


# **Problem Statement**


Key Questions to Address:
Booking Trends:
What are the peak booking periods (months or seasons)?
Which types of rooms are most frequently booked?

Customer Segmentation:
What is the demographic profile of the customers (e.g., country of origin, customer type)?
How does booking behavior vary across different customer segments?

Cancellation Analysis:
What is the overall cancellation rate?
What factors contribute to booking cancellations?

Revenue Analysis:
What is the average revenue per booking?
How does revenue vary based on factors such as length of stay and room type?

Channel Performance:
Which booking channels are the most popular?
How do different channels compare in terms of booking volume and revenue generation?

Special Requests and Preferences:
What are the most common special requests made by customers?
How do special requests impact customer satisfaction and booking decisions?

Analytical Approach:
Data Cleaning and Preprocessing:
Handle missing values and outliers to ensure data consistency and accuracy.

Descriptive Statistics:
Summarize key attributes to understand data distributions.
Visualize trends and patterns using charts and graphs.

Exploratory Data Analysis (EDA):
Identify correlations and relationships between variables.
Segment customers based on booking behavior and demographic profiles.

Recommendation Generation:
Provide actionable insights aimed at improving hotel performance, optimizing pricing strategies, and enhancing the customer experience.

Expected Outcomes:
Comprehensive insights into booking trends and customer behavior.
Identification of key factors influencing bookings and cancellations.

Actionable recommendations for optimizing the hotel’s booking processes and revenue management.
Enhanced understanding of customer preferences and special request patterns.

**By completing this analysis, hotel management will be equipped to make informed, data-driven decisions that optimize operational efficiency, enhance guest satisfaction, and increase overall profitability.**


#### **Business Objective**

The primary objective of this project is to analyze hotel booking data from BOOKING.COM to uncover patterns and insights that can help improve hotel operations, maximize revenue, and enhance customer satisfaction. By understanding booking trends, customer demographics, cancellation behaviors, revenue patterns, and channel performance, the project aims to provide actionable recommendations that enable hotel management to:

Optimize pricing and promotional strategies based on booking patterns and customer segments.

Reduce cancellation rates by identifying and addressing key contributing factors.

Improve channel management by focusing on high-performing booking sources.

Personalize customer experiences by catering to common preferences and special requests.

Strengthen overall operational efficiency and profitability through data-driven decision-making.

Ultimately, the goal is to empower hotels to make informed, strategic decisions that will strengthen their competitive position in the hospitality industry.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each and every chart, the following format is followed.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
file_path = "/content/drive/MyDrive/project datasets/Hotel Bookings.csv"
Hotel_booking_df = pd.read_csv(file_path)
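
Since the guidelines ask for exception handling and a notebook that runs end to end, the load step could be wrapped in a small guard. The sketch below is only an illustration: `load_bookings` is a hypothetical helper name, and the demo writes a two-row temporary CSV with made-up values instead of reading the real dataset from Drive.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_bookings(path):
    """Read a bookings CSV, failing early with a clear message if it is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"Dataset not found at '{csv_path}'. Check that Drive is mounted "
            "and that the path is spelled correctly."
        )
    return pd.read_csv(csv_path)


# Demo on a temporary file with made-up rows (stand-in for the real CSV).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_bookings.csv"
    demo_path.write_text("hotel,lead_time\nCity Hotel,12\nResort Hotel,45\n")
    demo_df = load_bookings(demo_path)

print(demo_df.shape)
```

In the notebook itself the call would presumably be `Hotel_booking_df = load_bookings(file_path)`.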

### Dataset First View

In [None]:
# Dataset First Look

Hotel_booking_df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
num_rows = Hotel_booking_df.shape[0]
num_columns = Hotel_booking_df.shape[1]

print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")

### Dataset Information

In [None]:
# Dataset Info
Hotel_booking_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = Hotel_booking_df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = Hotel_booking_df.isnull().sum()
print(missing_values)

In [None]:
# Total missing values
total_missing_values = missing_values.sum()
print(f"Total number of missing values: {total_missing_values}")

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(Hotel_booking_df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

The given dataset is from "Booking.com - Hotel Booking Analysis", where the objective is to analyze the data and extract meaningful insights.

So far, we have imported the necessary libraries, loaded the dataset, and performed an initial data quality check focusing on duplicate and missing values.

These checks are critical, as duplicates and missing values can significantly impact the accuracy of the analysis and the reliability of the insights derived from the data.

Key findings:

*  Duplicate values: 31,994 duplicate records were identified.
*  Missing values: A total of 129,425 missing values were found, concentrated mostly in the `agent` and `company` columns.

Identifying and addressing these issues early ensures a cleaner and more trustworthy dataset for deeper analysis.
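
The duplicate and missing-value checks above can be folded into one small report that also shows the percentage of missing values per column. This is a minimal sketch on a made-up four-row frame; `toy_df` and `quality_report` are hypothetical names, and the numbers are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for Hotel_booking_df (values are made up).
toy_df = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "agent": [9.0, 9.0, np.nan, np.nan],
    "company": [np.nan, np.nan, np.nan, 40.0],
})


def quality_report(df):
    """Return the duplicate-row count and a per-column missing-value table."""
    n_duplicates = int(df.duplicated().sum())
    missing = df.isnull().sum()
    report = pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(df) * 100).round(1),
    })
    # Keep only columns that actually contain missing values, worst first.
    report = report[report["missing"] > 0].sort_values("missing", ascending=False)
    return n_duplicates, report


n_duplicates, report = quality_report(toy_df)
print(f"Duplicate rows: {n_duplicates}")
print(report)
```

Applied to the real data, the call would be `quality_report(Hotel_booking_df)`.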



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
column_names = Hotel_booking_df.columns
print(column_names)

In [None]:
# Checking the columns' info
# info() prints its summary and returns None, so there is nothing to assign
Hotel_booking_df.info()

In [None]:
# Dataset Describe
Hotel_booking_df.describe(include = 'all')

### Variables Description


| Variable Name                    | Description                                                                 |
|----------------------------------|-----------------------------------------------------------------------------|
| `hotel`                          | Type of hotel: Resort Hotel or City Hotel.                                  |
| `is_canceled`                    | Indicates if the booking was canceled or not.                               |
| `lead_time`                      | Number of days between booking date and arrival date.                       |
| `arrival_date_year`             | Year of arrival date.                                                       |
| `arrival_date_month`            | Month of arrival date (e.g., January to December).                          |
| `arrival_date_week_number`      | Week number of the year for the arrival date.                               |
| `arrival_date_day_of_month`     | Day of the month for arrival.                                               |
| `stays_in_weekend_nights`       | Number of weekend nights (Saturday/Sunday) the guest stayed.                |
| `stays_in_week_nights`          | Number of week nights (Monday to Friday) the guest stayed.                  |
| `adults`                         | Number of adults included in the booking.                                   |
| `children`                       | Number of children included in the booking.                                 |
| `babies`                         | Number of babies included in the booking.                                   |
| `meal`                           | Type of meal plan booked.                                                   |
| `country`                        | Country of origin of the guest (represented by ISO country code).           |
| `market_segment`                 | Market segment through which the booking was made.                          |
| `distribution_channel`          | Channel through which the booking was distributed.                          |
| `is_repeated_guest`             | Indicates whether the guest has booked before.                              |
| `previous_cancellations`        | Number of previous bookings that were canceled by the guest.                |
| `previous_bookings_not_canceled`| Number of previous bookings that were not canceled.                         |
| `reserved_room_type`            | Code for the room type initially reserved.                                  |
| `assigned_room_type`            | Code for the room type actually assigned.                                   |
| `booking_changes`               | Number of changes made to the booking.                                      |
| `deposit_type`                  | Indicates whether a deposit was made and its type.                          |
| `agent`                          | ID of the booking agent who made the reservation.                           |
| `company`                        | ID of the company that made the booking.                                    |
| `booking_type`                  | Derived during data wrangling: source of the booking (agent, company, or individual). |
| `days_in_waiting_list`          | Number of days the booking stayed on the waiting list.                      |
| `customer_type`                 | Type of customer based on contract and stay patterns.                       |
| `adr`                            | Average Daily Rate — the average price per occupied room.                   |
| `required_car_parking_spaces`   | Number of car parking spaces requested by the guest.                        |
| `total_of_special_requests`     | Number of special requests made by the guest.                               |
| `reservation_status`            | Current status of the reservation: Canceled, No-Show, or Checked-Out.       |
| `reservation_status_date`       | Date on which the reservation status was last updated.                      |

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each column
unique_df = pd.DataFrame({
    'unique_values': [Hotel_booking_df[col].dropna().unique().tolist() for col in Hotel_booking_df.columns]
}, index=Hotel_booking_df.columns)
unique_df

## 3. ***Data Wrangling***

Correcting the Data Types

In [None]:
# Change Data-Type
Hotel_booking_df['is_canceled'] = Hotel_booking_df['is_canceled'].astype(bool)

Hotel_booking_df['is_repeated_guest'] = Hotel_booking_df['is_repeated_guest'].astype(bool)

### Data Wrangling Code

Removing duplicate values :

In [None]:
# Write your code to make your dataset analysis ready.

# Duplicate count found earlier, before removal
print(f"Number of duplicate rows: {duplicate_count}")
# Removing the duplicate values
Hotel_booking_df.drop_duplicates(inplace=True)
# Checking that no duplicates remain
duplicate_count = Hotel_booking_df.duplicated().sum()
print(f"Number of remaining duplicate rows: {duplicate_count}")

Handling missing values:

In [None]:
# Handling the missing values
# 'agent' and 'company' are mostly empty, so derive one 'booking_type' column:
# 'agent' if an agent ID is present, else 'company' if a company ID is present,
# else 'Individual Booking'.
Hotel_booking_df['booking_type'] = Hotel_booking_df.apply(
    lambda row: 'agent' if pd.notna(row['agent'])
    else ('company' if pd.notna(row['company']) else 'Individual Booking'),
    axis=1)
# Move the new column to position 25, right after 'company'
col_data = Hotel_booking_df.pop('booking_type')
Hotel_booking_df.insert(25, 'booking_type', col_data)


In [None]:
# Handling outliers in lead_time with the IQR rule
Q1 = Hotel_booking_df['lead_time'].quantile(0.25)
Q3 = Hotel_booking_df['lead_time'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Rows outside [lower, upper] are treated as outliers
outliers = Hotel_booking_df[
    (Hotel_booking_df['lead_time'] < lower) | (Hotel_booking_df['lead_time'] > upper)]
print(f"Number of lead_time outliers: {len(outliers)}")

# Cap the outliers at the IQR bounds instead of dropping the rows
Hotel_booking_df['lead_time'] = Hotel_booking_df['lead_time'].clip(lower=lower, upper=upper)

In [None]:
Hotel_booking_df.info()

In [None]:
unique_df = pd.DataFrame({
    'unique_values': [Hotel_booking_df[col].dropna().unique().tolist() for col in Hotel_booking_df.columns]
}, index=Hotel_booking_df.columns)
unique_df

### What all manipulations have you done and insights you found?

In the data wrangling phase, we first converted `is_canceled` and `is_repeated_guest` to boolean. We then removed the 31,994 duplicate rows with pandas' `drop_duplicates()`, since duplicates can distort insights. Next, we assessed missing values. The few missing country values were deemed negligible, but a large amount of data was missing in the `agent` (16,340) and `company` (112,593) columns, which identify who made the booking. Based on domain knowledge, bookings generally fall into three categories: agent, company, or individual. To capture this, we created a new column, `booking_type`, indicating the booking source for each record. Finally, `lead_time` values outside the 1.5 × IQR bounds were capped at those bounds to limit the influence of extreme values.
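
The `booking_type` rule described above (agent first, then company, otherwise individual) can also be written without a row-wise `apply`, which tends to be faster on large frames. The sketch below uses NumPy's `np.select` on a made-up four-row frame; `toy_df` is a hypothetical stand-in for `Hotel_booking_df`, and the labels mirror the ones used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: agent / company IDs with gaps (values are made up).
toy_df = pd.DataFrame({
    "agent": [9.0, np.nan, np.nan, 14.0],
    "company": [np.nan, 40.0, np.nan, 223.0],
})

# Conditions are checked in order, so 'agent' wins when both IDs are present.
conditions = [toy_df["agent"].notna(), toy_df["company"].notna()]
labels = ["agent", "company"]
toy_df["booking_type"] = np.select(conditions, labels, default="Individual Booking")

print(toy_df)
```

The ordering of `conditions` matters: it reproduces the precedence of the nested conditional in the original lambda.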

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Box plot for checking outliers
# Set the figure size
plt.figure(figsize=(5,8))

# Create the boxplot
sns.boxplot(data=Hotel_booking_df[['lead_time']], palette='Set2')

# Add title and axis labels
plt.title('Boxplot of Lead Time', fontsize=16)
plt.xlabel('Booking Features', fontsize=12)
plt.ylabel('Values', fontsize=12)

# Show grid for better readability
plt.grid(True, linestyle='--', alpha=0.5)

# Display the plot
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Box plots are ideal for identifying outliers in numerical data by showing the data distribution and highlighting values that lie beyond the typical range.

Box plots helped me:
*  Detect extreme values that might distort the overall analysis (e.g., very high lead times or unusually long stays).
*  Understand the spread, median, and skewness of the data.
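
The points a box plot draws beyond its whiskers follow the usual 1.5 × IQR rule, so the same outliers can be listed numerically. This is a small sketch on made-up lead times; `iqr_bounds` is a hypothetical helper name and the series does not come from the dataset.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up lead times in days; 400 is deliberately extreme.
lead_time = pd.Series([0, 5, 10, 20, 30, 45, 60, 90, 400])

lower, upper = iqr_bounds(lead_time)
is_outlier = (lead_time < lower) | (lead_time > upper)

print(f"Fences: ({lower}, {upper})")
print(f"Outliers: {lead_time[is_outlier].tolist()}")
```

Here Q1 is 10 and Q3 is 60, so the fences are -65 and 135 and only the 400-day value is flagged.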



##### 2. What is/are the insight(s) found from the chart?

1. There is a large number of outliers in lead_time, showing that some bookings were made very far in advance, which is not typical.

2. Most bookings happen with short to moderate lead times, indicating guests often book closer to their travel dates.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact:**

Understanding the booking behavior and stay duration helps hotel management:
1. Optimize inventory and staffing based on expected lead times and stay lengths.
2. Design targeted promotions for short-lead or long-stay bookings.
3. Reduce last-minute cancellation risk by offering incentives for early bookings.

**Negative growth:**

Yes. The high number of outliers in lead time suggests many bookings are made far in advance and may later be canceled, increasing unpredictability.
This can lead to:
1. Revenue loss from unused inventory.
2. Operational inefficiencies due to unexpected booking changes.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Pie chart of hotel type
hotel_counts = Hotel_booking_df['hotel'].value_counts()
plt.figure(figsize=(6, 6))
plt.pie(hotel_counts, labels=hotel_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Hotel Type')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()


##### 1. Why did you pick the specific chart?

To begin our analysis, we ask:
"What percentage of bookings come from Resort Hotel vs. City Hotel?"

Since this is a categorical variable with two distinct categories, a pie chart is an effective choice. It clearly shows the proportion of each hotel type in the dataset, providing a quick and intuitive visual summary of the distribution.

##### 2. What is/are the insight(s) found from the chart?

The pie chart shows the share of total bookings for each hotel type, Resort Hotel vs. City Hotel.

City Hotels have a significantly larger share, which may indicate that city hotels are:

*  More popular overall.
*  More accessible for business or travel needs.
*  Better in availability and pricing than resort hotels.

---

The histogram of lead_time (Chart - 4) shows how bookings are distributed across lead times:

*  It shows whether most bookings are made last-minute or well in advance.
*  Spikes or long tails can reveal unusual booking patterns or potential issues.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business impact insights from the pie chart:

Uses:

*  Helps stakeholders understand customer preferences.
*  Useful for marketing strategies or resource allocation (e.g., focusing promotions on underbooked hotel types).

Factors that may contribute to negative growth in bookings or revenue:

The pie chart shows that Resort Hotels have a significantly lower share of bookings than City Hotels, which could indicate underperformance.

*   This might be due to seasonal demand, poor accessibility, or lack of marketing.
*   It may also reflect customer dissatisfaction with resort experiences or facilities.

A long-term imbalance may lead to revenue decline or overhead strain on underutilized properties.


---


Business impact insights from the histogram (Chart - 4):

The histogram helps with demand forecasting, pricing strategy, and understanding customer booking behavior.

Long lead times and cancellation risk:

Insight: A histogram of lead_time that shows a high frequency of very long lead times (e.g., bookings made 200+ days in advance) might correlate with higher cancellation rates.

Justification:

*  Bookings with long lead times are often speculative or made for large groups.
*  They are more likely to be canceled, especially if no deposit is required.
*  This results in poor forecasting, lost inventory, and lost revenue.
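
Whether long lead times really go along with more cancellations can be checked directly rather than assumed. One possible approach is to bucket `lead_time` and compare the cancellation rate per bucket. The sketch below uses eight made-up bookings; the bucket edges (30, 90, 180 days) are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy bookings (values are made up, not from the dataset).
toy_df = pd.DataFrame({
    "lead_time":   [3, 10, 25, 60, 120, 210, 250, 300],
    "is_canceled": [False, False, True, False, True, True, False, True],
})

# Assumed bucket edges in days; intervals are (0, 30], (30, 90], (90, 180], (180, inf).
bins = [0, 30, 90, 180, float("inf")]
labels = ["0-30", "31-90", "91-180", "181+"]
toy_df["lead_time_bucket"] = pd.cut(
    toy_df["lead_time"], bins=bins, labels=labels, include_lowest=True
)

# The mean of a boolean column is the share of True values, i.e. the cancellation rate.
cancel_rate = toy_df.groupby("lead_time_bucket", observed=False)["is_canceled"].mean()
print(cancel_rate)
```

On the real data, a rate that rises from bucket to bucket would support the justification above; a flat profile would argue against it.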

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(8, 6))
# Count values and sort them for better visual arrangement
status_counts = Hotel_booking_df['reservation_status'].value_counts().sort_values(ascending=False)
# Create the bar chart using sorted categories
ax = sns.barplot(x=status_counts.index, y=status_counts.values, hue=status_counts.index, palette='Set2', legend=False)
# Annotate each bar with its value
for i, count in enumerate(status_counts.values):
    ax.text(i, count + 100, str(count), ha='center', va='bottom', fontsize=10)

# Set plot titles and labels
plt.title('Reservation Status Counts', fontsize=16)
plt.xlabel('Reservation Status', fontsize=12)
plt.ylabel('Count', fontsize=12)


plt.tight_layout()

# Print the total number of bookings, then show the plot
print(f"Total number of bookings: {len(Hotel_booking_df)}")
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart for this analysis because it clearly visualizes the frequency of categorical values, making it easier to compare categories at a glance.

##### 2. What is/are the insight(s) found from the chart?

The bar plot illustrates the distribution of hotel booking reservation statuses. It highlights that the majority of bookings result in check-outs, indicating successful stays. However, a considerable number of cancellations suggest potential revenue loss and areas where customer retention strategies could be improved.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can help create a positive business impact by identifying areas where customer experience or policies could be improved to reduce cancellations.

The high cancellation rate is a key indicator of negative growth, as it directly impacts revenue. Understanding the reasons behind these cancellations can help the hotel adjust pricing, improve communication, or offer flexible booking policies to convert more bookings into successful check-outs.
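
To put a number on "a considerable number of cancellations", the status counts can be turned into shares. This is a minimal sketch on eight made-up statuses; the label strings are assumed for illustration and should be checked against the actual values in `reservation_status`.

```python
import pandas as pd

# Made-up reservation statuses; label spellings are assumptions.
status = pd.Series(
    ["Check-Out", "Canceled", "Check-Out", "No-Show",
     "Check-Out", "Canceled", "Check-Out", "Check-Out"],
    name="reservation_status",
)

# normalize=True returns each category's share instead of its raw count.
shares = status.value_counts(normalize=True)

# Share of bookings that did not end in a completed stay.
not_realised = shares.drop("Check-Out").sum()

print(shares)
print(f"Share of bookings without a completed stay: {not_realised:.1%}")
```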

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Histogram of lead time
plt.figure(figsize=(8, 5))
sns.histplot(Hotel_booking_df['lead_time'], bins=20, kde=True)
plt.title('Distribution of Lead Time')
plt.xlabel('Lead Time (days)')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram for this analysis because it effectively displays the distribution of continuous data. It helps identify skewness, clusters, and gaps, and makes it easier to spot potential outliers or long tails in the dataset, especially for variables like lead time.

##### 2. What is/are the insight(s) found from the chart?

The histogram reveals the distribution of lead times for bookings. It shows that a significant number of reservations are made with short notice, indicating a trend toward last-minute bookings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying that most bookings have short lead times allows hotels to optimize last-minute marketing strategies and manage inventory more efficiently.

However, the large share of last-minute bookings can also cause problems such as unpredictable demand, difficulties in staff scheduling, and inefficient inventory management, ultimately affecting customer service quality and revenue optimization.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Convert arrival month and year to a single datetime column
Hotel_booking_df['arrival_date'] = pd.to_datetime(Hotel_booking_df['arrival_date_year'].astype(str) + '-' + Hotel_booking_df['arrival_date_month'] + '-01',format='%Y-%B-%d')

# Group by hotel type and arrival date, then count number of bookings
monthly_hotel_bookings = Hotel_booking_df.groupby(['hotel', 'arrival_date']).size().reset_index(name='num_bookings')

# Plot the line chart with 'hotel' as hue to differentiate Resort and City hotels
plt.figure(figsize=(18, 6))
sns.lineplot(data=monthly_hotel_bookings, x='arrival_date', y='num_bookings', hue='hotel', marker='o')

# Add titles and labels
plt.title('Monthly Hotel Bookings Over Time by Hotel Type', fontsize=16)
plt.xlabel('Arrival Date (Month-Year)', fontsize=12)
plt.ylabel('Number of Bookings', fontsize=12)
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

The line chart effectively illustrates booking trends over time, making it easier to assess business growth and identify seasonal patterns.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that City Hotels exhibit overall growth throughout the year, particularly peaking in the third quarter. In contrast, Resort Hotels demonstrate stagnated growth, with no significant upward trend across the same period.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can help create a positive business impact by highlighting growth trends in City Hotels, especially in the third quarter. This can guide strategic decisions such as targeted promotions during peak months.

However, the stagnation in Resort Hotel bookings indicates a potential area of concern, signaling negative growth. This could be due to seasonal factors, location-specific limitations, or lack of marketing, which, if not addressed, may lead to reduced revenue and missed business opportunities.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Count bookings per (year, month) by counting the non-null 'hotel' entries in each cell
month_heatmap = Hotel_booking_df.pivot_table(index='arrival_date_year', columns='arrival_date_month', values='hotel', aggfunc='count')

# Reorder the month columns into calendar order (pivot_table sorts them alphabetically)
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
month_heatmap = month_heatmap[month_order]

# Plot heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(month_heatmap, cmap='YlOrBr', annot=True, fmt='g')
plt.title('Heatmap of Bookings by Month and Year')
plt.xlabel('Month')
plt.ylabel('Year')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose heatmaps because they provide a clear and intuitive visual representation of booking density across time. By mapping months  against years, heatmaps help quickly identify seasonal trends, peak booking periods, and patterns in customer behavior over multiple years.

##### 2. What is/are the insight(s) found from the chart?

This heatmap gives two insights:
1. The chart shows a consistent year-over-year increase in bookings, indicating positive growth.

2. January consistently has the lowest number of bookings, while August—particularly in 2016 and 2017—shows a noticeable peak in customer reservations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding that bookings increase year-over-year highlights overall business growth, which can inform expansion and marketing strategies. Identifying low-performing months like January allows the business to run targeted promotions or campaigns to improve occupancy. No major signs of negative growth are seen, though seasonality (e.g., low bookings in January) suggests a need for proactive planning.
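
When building a month-by-year table like the one behind this heatmap, selecting columns with a fixed month list raises a `KeyError` if any month is absent from the data. A slightly more defensive variant, sketched here on six made-up bookings, uses `reindex`, which simply adds the missing months. `toy_df` is a hypothetical stand-in for `Hotel_booking_df`.

```python
import pandas as pd

month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Hypothetical toy bookings covering only three distinct months (values are made up).
toy_df = pd.DataFrame({
    "arrival_date_year":  [2015, 2015, 2016, 2016, 2016, 2017],
    "arrival_date_month": ["August", "July", "January", "August", "August", "January"],
})

# crosstab counts rows per (year, month); its columns come out alphabetically.
counts = pd.crosstab(toy_df["arrival_date_year"], toy_df["arrival_date_month"])

# reindex puts the columns in calendar order and fills absent months with 0.
counts = counts.reindex(columns=month_order, fill_value=0)

print(counts)
```

Whether absent months should show as 0 or stay empty is a design choice: dropping `fill_value=0` leaves them as NaN, which a heatmap renders as blank cells.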

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Set figure size
plt.figure(figsize=(12, 6))

# Define correct month order
month_order = ['January', 'February', 'March', 'April', 'May', 'June','July', 'August', 'September', 'October', 'November', 'December']

# Create the countplot grouped by hotel type
ax = sns.countplot(data=Hotel_booking_df,x='arrival_date_month',hue='hotel',order=month_order,palette='Set2')

# Add count labels on top of each bar
for container in ax.containers:
    ax.bar_label(container, fmt='%d', label_type='edge', fontsize=9, padding=2)

# Set titles and labels
plt.title('Total Bookings by Arrival Month and Hotel Type', fontsize=16)
plt.xlabel('Arrival Month', fontsize=12)
plt.ylabel('Number of Bookings', fontsize=12)
plt.xticks(rotation=45)
plt.legend(title='Hotel Type')
plt.tight_layout()

# Show plot
plt.show()


##### 1. Why did you pick the specific chart?

I chose a grouped bar chart because it effectively compares booking volumes across months while also showing the difference between Resort and City Hotels. This makes it easy to observe monthly trends and distinguish performance between hotel types.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that bookings peak during the summer months, especially in July and August, for both hotel types. City Hotels generally receive more bookings than Resort Hotels throughout the year, and January consistently sees the lowest booking volume.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can guide marketing and resource allocation strategies to maximize bookings during high-demand months. However, the consistently low bookings in January may indicate a period of negative growth. Understanding the cause—such as off-season travel patterns—can help in designing promotions or offers to increase bookings during that time.
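
The peak and the weakest month for each hotel type can also be read off programmatically rather than from the bar heights. The following sketch uses thirteen made-up bookings; `toy_df` is a hypothetical stand-in for `Hotel_booking_df`, and the counts are not real results.

```python
import pandas as pd

# Hypothetical toy bookings (values are made up).
toy_df = pd.DataFrame({
    "hotel": ["City Hotel"] * 7 + ["Resort Hotel"] * 6,
    "arrival_date_month": (
        ["August"] * 4 + ["July"] * 2 + ["January"]      # City Hotel
        + ["August"] * 3 + ["July"] * 2 + ["January"]    # Resort Hotel
    ),
})

# Rows are months, columns are hotel types, cells are booking counts.
monthly = pd.crosstab(toy_df["arrival_date_month"], toy_df["hotel"])

# idxmax / idxmin return, per hotel column, the month with the highest / lowest count.
peak_month = monthly.idxmax()
low_month = monthly.idxmin()

print(monthly)
print(peak_month)
print(low_month)
```

Note that `idxmax` and `idxmin` return the first matching month when counts are tied, so ties are worth checking separately on real data.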

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Count each customer type
customer_counts = Hotel_booking_df['customer_type'].value_counts()
# Pie chart
plt.figure(figsize=(8, 8))
plt.pie(customer_counts.values,labels=customer_counts.index,autopct='%1.1f%%',colors=plt.cm.Set3.colors,startangle=140)
plt.title('Distribution of Customer Types', fontsize=14)
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()


##### 1. Why did you pick the specific chart?

I chose a pie chart because it is ideal for representing the proportional distribution of categorical variables. It clearly shows how customer types are split across the dataset, making it easier to identify which segment dominates the bookings.

##### 2. What is/are the insight(s) found from the chart?

The pie chart reveals that Transient customers make up the largest share of bookings, followed by Transient-Party and Contract customers. Group bookings form the smallest portion. This indicates that most customers prefer individual or short-term stays over long-term or group contracts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can guide targeted marketing and service strategies. Knowing that transient customers dominate bookings, hotels can optimize last-minute offers, personalized services, and dynamic pricing to boost conversion. Conversely, the low share of group bookings highlights an area for growth through corporate or event partnerships.

#### Chart - 9

In [None]:
# Installing the pycountry library
!pip install pycountry

In [None]:
# Chart - 9 visualization code
import pycountry
# Step 1: Calculate total guests
Hotel_booking_df['total_guests'] = Hotel_booking_df['adults'] + Hotel_booking_df['children'] + Hotel_booking_df['babies']
# Step 2: Get top 10 countries by total guests
top_countries = Hotel_booking_df.groupby('country')['total_guests'].sum().sort_values(ascending=False).head(10)

# Step 3: Convert ISO country codes to full names
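def get_country_name(code):
    """Return the full country name for an ISO alpha-3 or alpha-2 code, or the code itself if unknown."""
    for key in ('alpha_3', 'alpha_2'):
        try:
            country = pycountry.countries.get(**{key: code})
        except LookupError:  # some pycountry versions raise instead of returning None
            country = None
        if country is not None:
            return country.name
    return code  # Return code if not found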

# Apply conversion
top_countries.index = [get_country_name(code) for code in top_countries.index]

# Step 4: Plot bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x=top_countries.values, y=top_countries.index, hue=top_countries.index, palette='Set2', dodge=False, legend=False)

# Step 5: Add labels
for i, value in enumerate(top_countries.values):
    plt.text(value + 50, i, str(int(value)), va='center')

# Titles and labels
plt.title('Top 10 Countries by Number of Guests', fontsize=16)
plt.xlabel('Number of Guests')
plt.ylabel('Country')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I selected a bar chart because it effectively displays the comparative number of guests from each country, making it easy to rank and visualize the top contributors. Bar charts are especially useful for showing differences in frequency across categorical variables like country names.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that the majority of guests come from a handful of countries, with Portugal and the United Kingdom contributing significantly to total bookings. This concentration indicates where most of the hotel business originates geographically.
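How concentrated the guest base is can be quantified as the share of all guests coming from the top N countries. The following is a minimal sketch on made-up data (`sample_df`, its values, and `top_n` are hypothetical). It also fills missing guest counts with zero before summing, on the assumption that `children` may still contain missing values at this point; the cleaning step is outside this section.

```python
import pandas as pd

# Hypothetical stand-in for Hotel_booking_df; codes and counts are illustrative only
sample_df = pd.DataFrame({
    'country': ['PRT', 'PRT', 'GBR', 'FRA', 'ESP', 'PRT'],
    'adults': [2, 1, 2, 2, 1, 2],
    'children': [0.0, None, 1.0, 0.0, 0.0, 2.0],  # one missing value on purpose
    'babies': [0, 0, 0, 1, 0, 0],
})

# Treat missing counts as zero so a single NaN does not blank out the row total
guest_cols = ['adults', 'children', 'babies']
sample_df['total_guests'] = sample_df[guest_cols].fillna(0).sum(axis=1)

# Total guests per country, largest first
guests_by_country = (
    sample_df.groupby('country')['total_guests'].sum().sort_values(ascending=False)
)

# Share of all guests contributed by the top N countries (N is a free choice)
top_n = 2
top_share = guests_by_country.head(top_n).sum() / guests_by_country.sum() * 100
print(f"Top {top_n} countries account for {top_share:.1f}% of guests")
```

A high share for a small N would support the claim that bookings are concentrated in a few source markets.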

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can help create a positive business impact by allowing targeted marketing and partnerships in the top contributing countries to boost bookings further. Conversely, countries with minimal guest numbers may indicate untapped or underserved markets, or potential negative growth if customer engagement is declining there.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Group by market segment and hotel, summing the cancellations
cancellation_data = Hotel_booking_df.groupby(['market_segment', 'hotel'])['is_canceled'].sum().unstack().fillna(0)

# Sort by total cancellations in descending order
cancellation_data['Total'] = cancellation_data.sum(axis=1)
cancellation_data = cancellation_data.sort_values(by='Total', ascending=False).drop(columns='Total')

# Plot
ax = cancellation_data.plot(kind='bar', stacked=True, colormap='Set2', figsize=(12, 7))

# Title and labels
plt.title('Cancellations by Hotel Type and Market Segment (Descending)', fontsize=16)
plt.xlabel('Market Segment', fontsize=12)
plt.ylabel('Number of Cancellations', fontsize=12)
plt.legend(title='Hotel Type')
plt.xticks(rotation=45)

# Add annotations
for i, idx in enumerate(cancellation_data.index):
    total = 0
    for hotel_type in cancellation_data.columns:
        value = cancellation_data.loc[idx, hotel_type]
        if value > 0:
            ax.text(i, total + value / 2, f'{int(value)}', ha='center', va='center', fontsize=9, color='black')
            total += value

plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

I chose a stacked bar chart because it clearly compares cancellations across different market segments while also showing the contribution of each hotel type within those segments. It helps visualize both absolute numbers and the composition in a single view.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that the highest number of cancellations comes from the Online TA (Travel Agency) segment, with a significant portion from City Hotels. Other segments like Groups and Offline TA/TO also show notable cancellations. Resort Hotels generally have fewer cancellations across segments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact :**
These insights can guide strategies to reduce cancellations. For instance, knowing that the Online TA segment accounts for the largest number of cancellations allows the business to review its cancellation policies or improve customer engagement in that segment.

**Negative Impact :**
High cancellations contribute to revenue loss and operational inefficiency, which could lead to negative growth if unaddressed.
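The chart reports cancellation counts, which grow with the size of a segment, so a large segment can top the chart even when it cancels less often than a small one. A minimal sketch of how a cancellation rate could be computed next to the count is shown below; `sample_df` and its values are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Hypothetical stand-in for Hotel_booking_df; values are illustrative only
sample_df = pd.DataFrame({
    'market_segment': ['Online TA'] * 5 + ['Groups'] * 2 + ['Direct'] * 3,
    'is_canceled': [1, 0, 0, 1, 0, 1, 1, 0, 0, 1],
})

# Cancellation count, booking count, and cancellation rate per market segment
segment_summary = (
    sample_df.groupby('market_segment')['is_canceled']
    .agg(cancellations='sum', bookings='count', cancel_rate='mean')
    .sort_values('cancel_rate', ascending=False)
)

# Express the rate as a percentage
segment_summary['cancel_rate'] = (segment_summary['cancel_rate'] * 100).round(1)
print(segment_summary)
```

In this toy data two segments have the same number of cancellations but very different rates, which is the distinction a count-only chart cannot show.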

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Group by parking request and hotel type
parking_hotel_counts = Hotel_booking_df.groupby(['required_car_parking_spaces', 'hotel']).size().reset_index(name='count')

# Plot bar chart with hue for hotel type
plt.figure(figsize=(8, 5))
ax = sns.barplot(data=parking_hotel_counts,x='required_car_parking_spaces',y='count',hue='hotel',palette='Set2')

# Annotate each bar
for container in ax.containers:
    ax.bar_label(container, fmt='%d', label_type='edge', fontsize=9)

# Set plot titles and labels
plt.title('Car Parking Space Requests by Hotel Type', fontsize=14)
plt.xlabel('Number of Car Parking Spaces Requested', fontsize=12)
plt.ylabel('Number of Bookings', fontsize=12)
plt.legend(title='Hotel Type')

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a grouped bar chart to clearly compare the number of parking space requests across hotel types. Using hue='hotel' helps highlight differences in guest preferences or needs between City and Resort hotels, which is essential for operational planning.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most guests do not request a parking space. However, among those who do, Resort Hotels have a noticeably higher number of parking requests compared to City Hotels, likely due to travel by private vehicle in less urban areas.
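Raw counts depend on how many bookings each hotel type has, so a share per hotel type can complement the chart. The sketch below uses a made-up frame (`sample_df` and the helper column `wants_parking` are hypothetical) and flags any booking with at least one requested space; the `> 0` test works whether the column is a 0/1 flag or a count.

```python
import pandas as pd

# Hypothetical stand-in for Hotel_booking_df; values are illustrative only
sample_df = pd.DataFrame({
    'hotel': ['City Hotel'] * 6 + ['Resort Hotel'] * 4,
    'required_car_parking_spaces': [0, 0, 0, 0, 1, 0, 1, 0, 2, 0],
})

# Flag bookings that asked for at least one parking space
sample_df['wants_parking'] = sample_df['required_car_parking_spaces'] > 0

# Percentage of bookings with a parking request, per hotel type
parking_share = sample_df.groupby('hotel')['wants_parking'].mean().mul(100).round(1)
print(parking_share)
```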

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can guide parking space management. If Resort Hotels see more parking demand, management can allocate or expand parking accordingly. Not meeting this demand may lead to customer dissatisfaction, which could negatively impact future bookings and reviews.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Set plot style
sns.set(style="whitegrid")

# Create a figure
plt.figure(figsize=(8, 6))

# Count the values for total_of_special_requests
special_request_counts = Hotel_booking_df['total_of_special_requests'].value_counts().sort_index()
special_requests_df = special_request_counts.reset_index()
special_requests_df.columns = ['total_of_special_requests', 'count']

# Create bar plot with hue same as x, and legend=False to avoid duplicate legend
ax = sns.barplot(data=special_requests_df,x='total_of_special_requests',y='count',palette='Set2',hue='total_of_special_requests',legend=False)

# Annotate bars with counts
for i, count in enumerate(special_requests_df['count']):
    ax.text(i, count + 100, str(count), ha='center', va='bottom', fontsize=10)

# Set title and labels
plt.title('Distribution of Special Requests', fontsize=16)
plt.xlabel('Number of Special Requests', fontsize=12)
plt.ylabel('Number of Bookings', fontsize=12)

# Show the plot
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a bar chart because it clearly displays the frequency of bookings for each count of special requests. Bar charts are well suited to discrete count variables like the number of special requests, making comparison straightforward.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the majority of customers either make no special request or only one, indicating minimal demand for personalized services. Very few customers request two or more special services, which may reflect either satisfaction with standard offerings or lack of awareness.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Insights :** These insights can help optimize service offerings. Since most guests make few or no special requests, hotels can streamline operations around standard packages, saving costs.

**Negative Insights :** If low request numbers are due to a lack of awareness, there’s a missed opportunity to upsell premium services, potentially limiting revenue growth.

#### Chart - 13 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Select only numeric columns from the DataFrame
numeric_df = Hotel_booking_df.select_dtypes(include='number')

# Calculate the correlation matrix
corr_matrix = numeric_df.corr()

# Set up the matplotlib figure
plt.figure(figsize=(12, 8))

# Create the heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)

# Set title and layout
plt.title('Correlation Heatmap of Hotel Booking Data', fontsize=16)
plt.tight_layout()

# Display the heatmap
plt.show()


##### 1. Why did you pick the specific chart?

I selected the correlation heatmap because it visually highlights the relationships between numerical variables in the dataset. It makes it easier to identify strong positive or negative correlations, helping in feature selection, understanding dependencies, and detecting multicollinearity before modeling.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we observe a strong positive correlation between total_stays and both stays_in_week_nights and stays_in_weekend_nights, which is expected. Additionally, lead_time shows a mild positive correlation with is_canceled, suggesting that bookings made far in advance may have a higher chance of being canceled. Some variables show little to no correlation with the others, so they may have less predictive power.
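Reading individual cells off a large heatmap is error-prone, so the same correlation matrix can be queried directly. The following is a minimal sketch on made-up data: `sample_df`, the 0.8 `threshold`, and the definition of `total_stays` as the sum of week and weekend nights are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Hotel_booking_df; values are illustrative only
sample_df = pd.DataFrame({
    'lead_time': [5, 40, 120, 200, 10, 300, 60, 150],
    'stays_in_week_nights': [1, 2, 3, 5, 1, 4, 2, 3],
    'stays_in_weekend_nights': [0, 1, 2, 2, 0, 2, 1, 1],
    'is_canceled': [0, 0, 1, 1, 0, 1, 0, 1],
})
# Assumed definition of the engineered column
sample_df['total_stays'] = (
    sample_df['stays_in_week_nights'] + sample_df['stays_in_weekend_nights']
)

corr_matrix = sample_df.corr()

# Correlation of every feature with the cancellation flag, strongest first
target_corr = corr_matrix['is_canceled'].drop('is_canceled').sort_values(ascending=False)
print(target_corr.round(2))

# Keep only the upper triangle so each feature pair appears once
upper_mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
upper = corr_matrix.where(upper_mask)

# Feature pairs above the threshold, as a quick multicollinearity check
threshold = 0.8
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs.abs() > threshold]
print(high_pairs.round(2))
```

Pairs that show up in `high_pairs` are candidates for dropping or combining before any modeling step.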

#### Chart - 14 - Pair Plot

In [None]:
# Pair Plot visualization code
# Select relevant numerical columns
num_cols = ['lead_time', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adr']

# Create pair plot with hue as hotel type
sns.pairplot(Hotel_booking_df[num_cols + ['hotel']], hue='hotel', corner=True)
plt.suptitle("Pair Plot of Booking Features by Hotel Type", y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a pair plot because it visually explores relationships between multiple numerical variables at once. It helps identify patterns, correlations, clusters, or outliers and how they differ across hotel types. It is especially useful when comparing variables like lead_time, adr, and stay durations.

##### 2. What is/are the insight(s) found from the chart?

1. Resort hotels tend to have longer lead times and more weekend night stays compared to city hotels.

2. City hotels show a tighter cluster in adr and shorter stays, indicating more last-minute or business-related bookings.

3. There's a slight positive relationship between lead_time and adr, especially for resort hotels.
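Patterns read from a pair plot are visual impressions, so it can help to cross-check them with per-hotel summary statistics. The sketch below is a minimal example on made-up data (`sample_df` and its values are hypothetical and were chosen only to illustrate the calculation); it computes per-hotel medians and the per-hotel correlation between lead_time and adr.

```python
import pandas as pd

# Hypothetical stand-in for Hotel_booking_df; values are illustrative only
sample_df = pd.DataFrame({
    'hotel': ['City Hotel'] * 3 + ['Resort Hotel'] * 3,
    'lead_time': [10, 30, 20, 90, 150, 120],
    'stays_in_weekend_nights': [0, 1, 0, 2, 2, 1],
    'stays_in_week_nights': [2, 1, 3, 5, 4, 3],
    'adr': [95.0, 110.0, 100.0, 80.0, 160.0, 120.0],
})
num_cols = ['lead_time', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adr']

# Median of each numeric feature per hotel type
hotel_medians = sample_df.groupby('hotel')[num_cols].median()
print(hotel_medians)

# Pearson correlation between lead_time and adr, computed separately per hotel type
lead_adr_corr = {
    hotel: group['lead_time'].corr(group['adr'])
    for hotel, group in sample_df.groupby('hotel')
}
print(lead_adr_corr)
```

Medians are used here rather than means because lead_time and adr are typically skewed; that choice is a suggestion, not something the original analysis specifies.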

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve the business objectives, the client should focus on reducing cancellations, particularly from specific market segments and hotel types, by improving booking policies, offering flexible booking options, and understanding the reasons behind cancellations. Implementing customer retention strategies like loyalty programs or targeted offers can reduce the negative impact of cancellations. Additionally, leveraging data-driven promotions during low-booking months (e.g., January) and for Resort Hotels will help boost occupancy rates. Insights into special requests and customer types allow for more personalized offerings, which can increase customer satisfaction and drive additional sales through upselling. Targeting customers from high-performing countries can further expand reach and improve bookings. Improving lead time management and analyzing booking trends will help optimize marketing efforts, ensuring better forecasting and operational planning. Lastly, by focusing on ADR (Average Daily Rate) optimization, the business can maximize profitability per booking and adjust pricing strategies to cater to both business and leisure customers effectively.

# **Conclusion**

In this hotel booking analysis project, we explored the various patterns and insights that could drive business decisions for both Resort and City Hotels. By analyzing key metrics like reservation status, lead time, special requests, customer types, and hotel occupancy trends, we identified critical factors influencing customer behavior, booking trends, and potential growth opportunities. The visualizations, including bar charts, heatmaps, histograms, and box plots, provided a comprehensive understanding of the data, helping to uncover actionable insights.

One of the major findings was the significant number of last-minute bookings, particularly for Resort Hotels, which could be problematic as it might lead to inefficient resource allocation and a lack of availability for customers who booked well in advance. Additionally, the varying performance of City Hotels and Resort Hotels over time highlighted seasonal trends, allowing the business to plan marketing and promotional strategies more effectively.

By identifying customer types and their preferences, we also gained a better understanding of the target market, and where efforts should be focused to maximize bookings and customer satisfaction. The analysis also suggested that City Hotels are experiencing steady growth, while Resort Hotels might need strategic adjustments to boost their performance, especially during off-peak periods.

In conclusion, the insights gained from this analysis will be crucial in shaping future business strategies. With careful attention to booking trends, customer preferences, and operational efficiency, the hotels can make informed decisions that will not only enhance their service offerings but also ensure positive growth in the coming years. Moving forward, continuous monitoring of customer behavior and adapting to changes in the market will help the business stay competitive and maximize profitability.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***