<a href="https://colab.research.google.com/github/NikamPratiksha0506/Hotel-Booking-Analysis/blob/main/Capstone_Project_2_Hotel_Booking_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  Hotel Booking Analysis





##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

In the highly competitive hotel industry, maximizing occupancy and reducing cancellations are critical to profitability and customer satisfaction. This project conducts an Exploratory Data Analysis (EDA) on hotel booking data to uncover insights that can guide decision-making for both city and resort hotels. We divide the workflow into three stages: data collection, data cleaning, and data manipulation. We used `head()`, `info()`, and the `columns` attribute to understand the dataset's structure. Univariate and bivariate analyses revealed insights such as higher cancellation rates for bookings made well in advance and peak bookings during holidays. Through an analysis of booking trends, customer demographics, and financial data, this project aims to identify patterns and factors that affect bookings, cancellations, and revenue generation, providing actionable insights for improving operational strategies.

In the EDA process, we began with univariate analysis to examine individual variables and understand the basic characteristics of the data. For example, analyzing the lead time showed that a significant number of bookings were made well in advance, indicating that certain groups of customers prefer to plan their stays months ahead. Bivariate analysis provided deeper insights by exploring relationships between two variables, such as the link between booking status (canceled or not) and lead time. Bookings with long lead times tended to have a higher cancellation rate, possibly due to changes in customer plans or competitive pricing from other providers closer to the arrival date.



# **GitHub Link -**
https://github.com/NikamPratiksha0506/EDA/blob/main/Capstone_Project_2_Hotel_Booking_Analysis.ipynb

# **Problem Statement**


This project seeks to analyze hotel booking data to identify patterns and factors that influence booking behaviors, cancellations, and revenue generation.

1. What are the seasonal trends in hotel bookings, and how do they differ between city and resort hotels?
2. Which factors are associated with higher cancellation rates, and can these factors be mitigated through specific booking policies or incentives?
3. How do different customer segments (e.g., nationality, booking lead time, previous stays) impact revenue and occupancy rates?
4. What strategies can hotel management implement to balance occupancy during peak and off-peak seasons effectively?

#### **Define Your Business Objective?**

1. Maximize Occupancy Rates: Identify booking patterns and seasonal trends to develop strategies for increasing occupancy, especially during off-peak periods.
2. Reduce Cancellations: Analyze factors contributing to cancellations and implement policies (e.g., non-refundable options, incentives) to minimize cancellations.
3. Optimize Revenue Generation: Use insights to align pricing with demand, maximizing revenue during peak times and minimizing underutilization during low-demand periods.
4. Enhance Customer Retention: Leverage customer demographics and preferences to offer personalized services and loyalty programs, fostering repeat business.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the questions in the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import scipy.stats as stats
from sklearn.preprocessing import LabelEncoder, StandardScaler
import missingno as msno
import datetime

### Dataset Loading

In [None]:
# Mount Google Drive so the dataset can be read (works only in Google Colab)
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
import os
file_path = '/content/drive/My Drive/Hotel Bookings.csv'
print(os.path.exists(file_path))    # to know if the file exists at the specified path

# Load the CSV file

Hotel_Booking_df = pd.read_csv(file_path)

# Display the first few rows
Hotel_Booking_df.head()


### Dataset First View

In [None]:
# Dataset First Look
Hotel_Booking_df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

rows, columns = Hotel_Booking_df.shape
print(f"The dataset has {rows} rows and {columns} columns.")

print(Hotel_Booking_df.index)  # refers to the row labels of a DataFrame
print('\n')
print(Hotel_Booking_df.columns)

### Dataset Information

In [None]:
# Dataset Info
Hotel_Booking_df.info()   # Shows the row count, each column's data type, and its non-null count


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

# Count duplicate rows before removing them; counting afterwards would always give 0
duplicate_count = Hotel_Booking_df.duplicated().sum()
print(f"The dataset has {duplicate_count} duplicate rows.")

# Remove the duplicate rows and record how many unique rows remain
Hotel_Booking_df.drop_duplicates(inplace=True)
uni_no_of_rows = Hotel_Booking_df.shape[0]
print(f"{uni_no_of_rows} unique rows remain.")


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Reset the index (drop_duplicates leaves gaps), then count nulls per column
Hotel_Booking_df = Hotel_Booking_df.reset_index(drop=True)
Hotel_Booking_df.isnull().sum()

In [None]:
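# Visualizing the missing values

# Count null values per column and sort in descending order
missing_val = Hotel_Booking_df.isnull().sum().sort_values(ascending=False)

# Display missing values
print(missing_val)

# Bar chart of non-null counts per column (missingno is imported above as msno)
msno.bar(Hotel_Booking_df)
plt.show()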

### What did you know about your dataset?

In an EDA of a hotel booking dataset, you would analyze booking trends, cancellation rates, and customer behavior. Key insights might include peak booking seasons, the correlation between lead time and cancellations, and variations in average daily rate across hotel types and customer segments. Visualizations like bar plots and heatmaps help uncover patterns, such as the higher cancellation rates among transient customers and higher ADR during peak months.
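
The link between lead time and cancellations mentioned above can be checked with a simple grouped rate. The snippet below is a minimal sketch on a handful of made-up rows, not on the real dataset; the bucket edges and the names `toy`, `lead_bucket`, and `cancel_rate` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy rows for illustration only; in the notebook this would be Hotel_Booking_df.
toy = pd.DataFrame({
    "lead_time":   [3, 10, 45, 60, 120, 200, 250, 400],
    "is_canceled": [0,  0,  0,  1,   1,   0,   1,   1],
})

# Bucket edges are an arbitrary choice for this sketch, not taken from the analysis.
bins = [-1, 30, 90, 180, float("inf")]
labels = ["0-30", "31-90", "91-180", "181+"]
toy["lead_bucket"] = pd.cut(toy["lead_time"], bins=bins, labels=labels)

# The mean of a 0/1 flag is the share of canceled bookings in each bucket.
cancel_rate = toy.groupby("lead_bucket", observed=True)["is_canceled"].mean()
print(cancel_rate)
```

Applied to the full DataFrame, a rate that rises from the shortest to the longest bucket would support the lead time observation.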

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df_columns= Hotel_Booking_df.columns
df_columns

In [None]:
# Dataset Describe
Hotel_Booking_df.describe()  # provides a quick summary of the numerical columns

### Variables Description

**Hotel**: The type of hotel (e.g., "Resort Hotel" or "City Hotel").        
**Is Canceled**: Indicates if the booking was canceled (1 = canceled, 0 = not canceled).                                                         
**Lead Time**: The number of days between the booking date and the arrival date.                                                            
**Arrival Date**: Contains information about the year, month, week, and day of the arrival.                                                     
**Country**: The guest's country of origin, represented by country codes (e.g., "PRT" for Portugal).                                    
**Market Segment**: The segment through which the reservation was made (e.g., "Direct," "Corporate").  
**Distribution Channel**: The channel through which the booking was made (e.g., "Direct," "Corporate," "TA/TO").                             
**Is Repeated Guest**: Indicates whether the guest is a returning customer (1 = yes, 0 = no).                                         
**Previous Cancellations**: The number of prior bookings canceled by the guest.                                                           
**Previous Bookings Not Canceled**: The number of previous bookings that were not canceled.                                            
**Reserved Room Type**: The type of room initially reserved by the guest.                                                                 
**Assigned Room Type**: The actual room assigned to the guest upon arrival.                                                           
**Booking Changes**: The number of changes made to the reservation after the initial booking.                                        
**Deposit Type**: The type of deposit required for the booking (e.g., "No Deposit," "Non Refundable").                                    
**Agent**: The agent responsible for making the booking (e.g., agent IDs).                                                                   
**Company**: The company making the reservation, if applicable.          
**Days in Waiting List**: The number of days the reservation was on the waiting list before being confirmed.                                   
**Customer Type**: The type of customer (e.g., "Transient," "Contract," "Group").                                                              
**ADR** (Average Daily Rate): The average price per night for the booking.                                                              
**Required Car Parking Spaces**: The number of parking spaces requested for the booking.                                                      
**Total of Special Requests**: The total number of special requests made by the guest.                                                  
**Reservation Status**: The current status of the reservation (e.g., "Check-Out," "Canceled").                                            
**Reservation Status Date**: The date when the reservation status was last updated.                                                        
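
Because the arrival date is spread across several columns, it can be convenient to combine them into one datetime column for seasonal analysis. The sketch below assumes the columns are named `arrival_date_year`, `arrival_date_month` (full English month names), and `arrival_date_day_of_month`; only `arrival_date_month` appears in this notebook, so the other two names, the toy rows, and the new `arrival_date` column are assumptions.

```python
import pandas as pd

# Toy rows for illustration; the year and day column names are assumed.
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016],
    "arrival_date_month": ["July", "December"],
    "arrival_date_day_of_month": [1, 25],
})

# Build a "YYYY-MonthName-D" string, then parse it with an explicit format.
date_text = (
    toy["arrival_date_year"].astype(str) + "-"
    + toy["arrival_date_month"] + "-"
    + toy["arrival_date_day_of_month"].astype(str)
)
toy["arrival_date"] = pd.to_datetime(date_text, format="%Y-%B-%d")
print(toy["arrival_date"])
```

With a real datetime column, grouping by month or by week becomes a matter of using the `.dt` accessor.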

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for element in Hotel_Booking_df.columns.tolist():
    print("no of Unique values in",element,"is",Hotel_Booking_df[element].nunique())

In [None]:
Hotel_Booking_df['hotel'].unique()  # check the unique value of hotel

In [None]:
# Check the unique values of the arrival_date_month column
Hotel_Booking_df['arrival_date_month'].unique()

In [None]:
# Check the unique values of the adults column
Hotel_Booking_df['adults'].unique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
missing_val[:10]      # Displays the top 10 columns with the most missing data

In [None]:
# Check the percentage of null values in the company column

# Get the total number of rows in the DataFrame
uni_no_of_rows = Hotel_Booking_df.shape[0]

# Look up the column by name rather than by its position in missing_val
percentage_company_null = (missing_val['company'] / uni_no_of_rows) * 100

print(percentage_company_null)

In [None]:
# Check the percentage of null values in the agent column

# Get the total number of rows in the DataFrame
uni_no_of_rows = Hotel_Booking_df.shape[0]

# Look up the column by name rather than by its position in missing_val
percentage_agent_null = (missing_val['agent'] / uni_no_of_rows) * 100

print(percentage_agent_null)

In [None]:
# lets check what is the percentage of null values present in all columns(Optional)

#divides the count of null values in each column by the total number of rows.
null_percentage = (Hotel_Booking_df.isnull().sum() / Hotel_Booking_df.shape[0]) * 100
null_percentage

In [None]:
# Fill missing values in the 'hotel' column with the most common hotel type (mode)

# Get the most common value (mode) in the 'hotel' column
most_common_hotel = Hotel_Booking_df['hotel'].mode()[0]

# Fill missing values in the 'hotel' column with the most common value (without inplace=True)
Hotel_Booking_df['hotel'] = Hotel_Booking_df['hotel'].fillna(most_common_hotel)

# Check if there are any remaining null values in the 'hotel' column
missing_hotel_values = Hotel_Booking_df['hotel'].isnull().sum()

print("Hotel:", missing_hotel_values)


# This is for agent column

# Fill missing values in the 'agent' column with 0 (without inplace=True)
Hotel_Booking_df['agent'] = Hotel_Booking_df['agent'].fillna(0)

# Check if there are any remaining null values in the 'agent' column
missing_agent_values = Hotel_Booking_df['agent'].isnull().sum()

print("agent:", missing_agent_values)


In [None]:
# Check the percentage of null values in the country column
percentage_country_null = (missing_val['country'] / uni_no_of_rows) * 100
percentage_country_null

In [None]:
# The country column has only a small percentage of nulls, so replace them with 'other'
# fillna returns a new Series, so assign it back to the column
Hotel_Booking_df['country'] = Hotel_Booking_df['country'].fillna('other')
Hotel_Booking_df['country'].isnull().sum()

In [None]:
#lets check now shape
Hotel_Booking_df.shape


In [None]:
Hotel_Booking_df.isnull().sum()

In [None]:
Hotel_Booking_df.info()

### What all manipulations have you done and insights you found?

Computing the percentage of missing values per column showed which columns needed the most attention during cleaning.
We then filled missing values with the mode or a predefined value (0 for `agent`, 'other' for `country`) so that nulls in these columns do not distort later analysis.
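
The per-column fills described above can also be written as a single call by passing a dictionary to `DataFrame.fillna`. This is a minimal sketch on made-up rows; `toy`, `fill_values`, `cleaned`, and `remaining` are hypothetical names, and the fill values mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "agent": [9.0, np.nan, 240.0],
    "country": ["PRT", None, "GBR"],
})

# One fill value per column: 0 for a missing agent ID, 'other' for a missing country.
fill_values = {"agent": 0, "country": "other"}

# fillna returns a new DataFrame, so the result must be assigned.
cleaned = toy.fillna(value=fill_values)

# Verify that the filled columns no longer contain nulls.
remaining = cleaned[list(fill_values)].isnull().sum()
print(remaining)
```

Keeping the fill values in one dictionary makes the cleaning rules easy to review and to extend to other columns.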

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

Data visualization is the graphical representation of data and information. It uses visual elements like charts, graphs, and maps to help people understand trends, outliers, and patterns in data.

#### Chart 1 - Plotting Histogram

A histogram is a graphical representation of the distribution of numerical data, used to visualize the frequency or count of data points within different ranges (bins).
Histograms are used when the data is continuous, meaning it can take any value within a given range.

In [None]:
# Use df as a shorter name for Hotel_Booking_df (same object, not a copy)
df = Hotel_Booking_df

In [None]:
sns.histplot(df['lead_time'], bins=30, kde=True)
plt.title('Distribution of Booking Lead Time')
plt.xlabel('Lead Time (days)') #number of days between the booking date and the arrival date.
plt.ylabel('Frequency')  # how many bookings fall into each "lead time" range.
plt.show()

##### 1. Why did you pick the specific chart?

I picked a histogram because it is well suited to visualizing the distribution of continuous numerical data such as lead time, length of stay, or booking price.

Plotting a histogram of booking lead time shows at a glance whether most customers book far in advance or at the last minute.

##### 2. What is/are the insight(s) found from the chart?

A long right tail in the histogram would indicate that most bookings are made shortly before arrival, while a smaller number are made far in advance.
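
The shape seen in a histogram can be backed up with a couple of summary numbers. The following is a rough sketch on made-up values rather than the real `lead_time` column; the names `skewness` and `summary` are illustrative.

```python
import pandas as pd

# Made-up lead times in days, for illustration only.
lead_time = pd.Series([0, 1, 2, 3, 5, 8, 14, 30, 90, 365], name="lead_time")

# Positive skewness indicates a longer tail on the right-hand side.
skewness = lead_time.skew()

# Quantiles show where the bulk of the bookings sit relative to the tail.
summary = lead_time.quantile([0.25, 0.5, 0.75, 0.95])

print(f"skewness: {skewness:.2f}")
print(summary)
```

A mean well above the median is another quick sign of a right-skewed distribution.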

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Analyzing hotel booking data can optimize pricing strategies, improve customer segmentation, and enhance operational efficiency, leading to increased bookings and profitability.

#### Chart 2 - Pie Chart

A pie chart is most effective when you have a small number of categories and want to visualize how each part contributes to the total, such as showing market share or budget distribution. Pie charts are simple and intuitive, making it easy to understand percentages or relative contributions at a glance.

In [None]:
# Chart - 2 visualization code
cancellation_counts = df['is_canceled'].value_counts()
cancellation_counts.plot.pie(autopct='%1.1f%%', colors=['#ff9999', '#66b3ff'], startangle=90, legend=False)
plt.title('Proportion of Canceled vs. Confirmed Bookings')
plt.ylabel('')  # Hide y-axis label for clarity
plt.show()

##### 1. Why did you pick the specific chart?

Pie charts are ideal for showing proportions of a whole. In this case, the goal is to compare two categories: canceled and confirmed bookings.

##### 2. What is/are the insight(s) found from the chart?

The chart lets us compare the share of canceled bookings with the share of confirmed bookings at a glance.
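
The percentages drawn on the pie chart can also be produced as a small table, which is easier to quote in a report. This is a minimal sketch on a made-up series; the labels "Not canceled" and "Canceled" and the name `summary` are assumptions for illustration.

```python
import pandas as pd

# Made-up flags for illustration only (1 = canceled, 0 = not canceled).
is_canceled = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="is_canceled")

# Absolute counts and percentage shares of each outcome.
counts = is_canceled.value_counts()
shares = is_canceled.value_counts(normalize=True).mul(100).round(1)

# Combine both into one table with readable row labels.
summary = pd.DataFrame({"bookings": counts, "share_pct": shares})
summary.index = summary.index.map({0: "Not canceled", 1: "Canceled"})
print(summary)
```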

##### 3. Will the gained insights help creating a positive business impact?


Are there any insights that lead to negative growth? Justify with specific reason.

By analyzing booking cancellations versus confirmed bookings, hotels can gain actionable insights into customer behavior, pricing strategies, operational efficiencies, and marketing. These insights can help reduce cancellations, improve the customer experience, optimize pricing and inventory management, and ultimately lead to higher revenue, better customer retention, and operational cost savings. Thus, the insights from this analysis can create a positive business impact by enhancing both profitability and customer satisfaction.

#### Chart 3 - Scatter Plot

A scatter plot is used to display the relationship between two continuous variables. It helps in visualizing trends, correlations, and patterns in the data, making it ideal for understanding how one variable impacts another.

In [None]:
# Chart - 3 visualization code
sns.scatterplot(x='lead_time', y='stays_in_weekend_nights', data=df)
plt.title('Lead Time vs. Length of Stay (Weekend Nights)')
plt.xlabel('Lead Time (days)')
plt.ylabel('Weekend Nights Stay')
plt.show()

##### 1. Why did you pick the specific chart?

Scatter plots are well suited to showing the relationship between two numerical variables (e.g., lead time vs. length of stay). They help visualize any correlation or trend between the two variables, identify patterns, clusters, or outliers, and show how one variable changes with the other.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot shows the relationship between lead time (the number of days between booking and arrival) and the number of weekend nights stayed. It can be read for the following aspects of customer booking behavior:

1. The relationship between lead time and weekend stay length
2. Clusters or groupings in the data
3. Outliers (deviations in booking behavior)
4. Peak booking times (seasonality or events)
5. Planned travelers (long lead time, longer stay)
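
A scatter plot shows the direction of a relationship but not its strength, so a correlation coefficient is a useful companion. The sketch below uses made-up rows; it computes Pearson's coefficient directly and a Spearman-style coefficient by correlating ranks, which is an illustrative choice rather than something taken from this analysis.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "lead_time": [1, 7, 30, 60, 120, 300],
    "stays_in_weekend_nights": [0, 1, 0, 2, 2, 4],
})

# Pearson measures the strength of a linear relationship.
pearson = toy["lead_time"].corr(toy["stays_in_weekend_nights"])

# Correlating the ranks gives Spearman's coefficient, which is less
# sensitive to outliers and captures any monotonic relationship.
spearman = toy["lead_time"].rank().corr(toy["stays_in_weekend_nights"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Values near 0 suggest little association, while values near 1 or -1 suggest a strong one.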

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This scatter plot thus provides actionable insights that can help the hotel optimize its marketing strategy, pricing structure, and booking policies to better cater to different customer groups based on their booking behaviors.



#### Chart 4 - Bar Chart

A bar chart is especially useful when you need to compare discrete groups or categories, such as comparing sales figures across different products or revenue by region. Bar plots are easy to interpret and make it straightforward to compare the magnitude of different categories.

In [None]:
# Chart - 4 visualization code
import matplotlib.pyplot as plt

# Define the function to plot the bar chart
def plot_bar_chart_from_column(df, column_name, title):
    # Check if the column exists in the dataframe
    if column_name not in df.columns:
        print(f"Error: '{column_name}' does not exist in the DataFrame")
        return

    # Count the occurrences of each value in the specified column
    value_counts = df[column_name].value_counts()

    # Plotting the bar chart
    plt.figure(figsize=(10, 6))
    value_counts.plot(kind='bar', color='skyblue')

    # Adding titles and labels
    plt.title(title)
    plt.xlabel(column_name)
    plt.ylabel('Count')

    # Rotate x-axis labels if necessary
    plt.xticks(rotation=45)

    # Adjust layout to prevent overlap
    plt.tight_layout()

    # Display the plot
    plt.show()

# Example usage (ensure Hotel_Booking_df is defined earlier in your code)
plot_bar_chart_from_column(Hotel_Booking_df, 'assigned_room_type', 'Assigned Room by Type')



##### 1. Why did you pick the specific chart?

I chose a bar chart because it effectively visualizes the distribution of categorical data, such as the count of each assigned room type.

##### 2. What is/are the insight(s) found from the chart?

The bar chart reveals which room types are most and least popular among guests, providing insights into customer preferences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can help optimize room allocation and marketing strategies, leading to increased bookings and profitability.

#### Chart 5 - Horizontal Bar Chart

A horizontal bar chart is particularly useful when the category labels are long, as it provides more space for them to be easily readable. It is also effective when dealing with a large number of categories, as it helps avoid crowding and makes comparisons between categories clearer.

In [None]:
# Chart - 5 visualization code horizontal bar chart
market_segment_counts = df['market_segment'].value_counts()
market_segment_counts.plot(kind='barh', color=['#4CAF50', '#FF9800', '#2196F3', '#F44336'])
plt.title('Market Segment Distribution')
plt.xlabel('Count of Bookings')
plt.ylabel('Market Segment')
plt.show()

##### 1. Why did you pick the specific chart?

I picked the horizontal bar chart because it effectively displays categorical data (market segments) along with their counts, making it easy to compare the relative sizes of each segment. Horizontal bars are especially useful when the category names are long or when there are many categories.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart would reveal the relative distribution of bookings across different market segments, highlighting which segments contribute the most or least to total bookings. This can help identify trends or areas for targeted marketing efforts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The chart identifies the market segments with the highest booking volume, which helps the business allocate resources and target those segments. A segment with significantly fewer bookings may signal a need for better marketing or service offerings; if neglected, it could contribute to negative growth.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve business objectives, I suggest the client focuses on optimizing room pricing and promotions based on demand patterns, while also considering the underutilized room types for potential rebranding or repurposing. Additionally, targeting customer segments with tailored offers for popular room types can increase bookings and profitability.

In [None]:
df.to_csv('Cleaned_data.csv', index=False)   # Save the cleaned DataFrame to a CSV file

In [None]:
# from google.colab import files
# files.download('Cleaned_data.csv')   # In Google Colab, downloads the file to the local machine

# **Conclusion**

In conclusion, the hotel booking analysis through EDA provides valuable insights into booking trends, customer preferences, and operational efficiencies. By understanding room type popularity, booking patterns, and seasonal trends, the hotel can optimize pricing, marketing strategies, and resource allocation to improve customer satisfaction and profitability. Additionally, identifying areas of inefficiency, such as underutilized room types, allows for better inventory management and targeted promotions, ultimately supporting the business's growth and long-term success.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***