# **Project Name**    - Hotel Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    -Individual
##### Project by               - Sandeep kumar

# **Project Summary -** Hotel Booking Analysis

This EDA project on Hotel Booking Analysis investigates cancellations and their underlying patterns, and suggests measures that can be implemented to reduce cancellations and secure revenue.

The project covers booking information for a city hotel and a resort hotel, including when each booking was made, the length of stay, and the number of adults and children. It walks through the basic EDA and visualization process.

In this project I perform Exploratory Data Analysis on the given dataset. For example, to reduce cancellations, hotels can offer discounts or promotions to customers who book early or who book longer stays. Hotels can also offer incentives such as free parking or free breakfast to customers who book directly with them instead of through third-party websites.

This EDA involves two preparatory steps. In the first step, I explored and inspected the raw data. In the second step, I dealt with data impurities and cleaned the data by handling null values and dropping irrelevant data from the dataset.

The analysis is divided into the following three parts:

Univariate analysis: the simplest of the three, where only one variable is analyzed at a time.

Bivariate analysis: two variables are compared to study their relationship.

Multivariate analysis: similar to bivariate analysis, but more than two variables are compared.

The project concludes that by analyzing hotel booking data and understanding cancellation patterns, hotels can take steps to reduce cancellations and increase revenue.

# **GitHub Link -**

https://github.com/SandyCherry96/EDA---Hotel-Booking-Analysis.git

# **Problem Statement**


Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions! This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data.

#### **Define Your Business Objective?**

The project aims to gain interesting insights into customers’ behavior when booking a hotel. Demand may differ across customer segments, and forecasting becomes harder because it may require a different model for each segment. These insights can guide hotels in adjusting their customer strategies and preparing for the unknown.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd

# for visualisation
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

### Dataset Loading

In [None]:
# Load Dataset
path = '/content/Hotel Bookings.csv'
hotelbooking_df = pd.read_csv(path)
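
Since the guidelines above ask for exception handling, the load step could be wrapped in a small helper. The sketch below is one possible approach and not the code this notebook actually ran. The function name `load_bookings` is hypothetical, and the demonstration writes a tiny throwaway CSV with made-up values instead of reading the real file.

```python
import os
import tempfile

import pandas as pd


def load_bookings(csv_path):
    """Read a bookings CSV, raising a clear error if the file is missing or has no rows."""
    if not os.path.isfile(csv_path):
        raise FileNotFoundError(f"Dataset not found at: {csv_path}")
    df = pd.read_csv(csv_path)
    if df.empty:
        raise ValueError(f"Dataset at {csv_path} contains no rows")
    return df


# Demonstration on a tiny throwaway file (illustrative values only)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_bookings.csv")
    pd.DataFrame({"hotel": ["City Hotel", "Resort Hotel"],
                  "is_canceled": [0, 1]}).to_csv(demo_path, index=False)
    demo_df = load_bookings(demo_path)

print(demo_df.shape)
```

In the notebook itself, the same helper could be called with the `path` variable defined above, so that a wrong path fails with a readable message instead of a long traceback.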

### Dataset First View

In [None]:
# Dataset First Look
hotelbooking_df

In [None]:
# First 5 Rows of dataset
hotelbooking_df.head()

In [None]:
# Last 5 rows of dataset
hotelbooking_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

rows,columns = hotelbooking_df.shape
print(f"Number of rows: {rows}")
print(f"Number of columns: {columns}")

### Dataset Information

In [None]:
# Dataset Info
hotelbooking_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
hotelbooking_df.duplicated().sum()
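
The count alone does not show which rows repeat or what removing them would do. The sketch below uses a small made-up frame (not the real dataset) to show how duplicates could be inspected and, optionally, dropped. Whether to drop them is an analysis decision: the dataset has no booking ID, so identical rows may still be separate bookings.

```python
import pandas as pd

# Small illustrative frame; the third row repeats the first exactly
sample_df = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 45, 10],
    "adr": [80.0, 120.5, 80.0],
})

# Count rows that are exact repeats of an earlier row
n_duplicates = int(sample_df.duplicated().sum())

# keep=False marks every member of a duplicated group, which helps inspection
duplicate_rows = sample_df[sample_df.duplicated(keep=False)]

# Dropping duplicates is shown only as an option, not as a step this notebook takes
deduplicated_df = sample_df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(duplicate_rows), len(deduplicated_df))
```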

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
hotelbooking_df.isnull().sum()
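
Raw counts are easier to judge when shown next to the share of rows they represent. The sketch below builds such a summary on a small made-up frame. The column names mirror the dataset, but the values are illustrative and `missing_summary` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Illustrative data only; the real dataset has many more rows and columns
sample_df = pd.DataFrame({
    "children": [0.0, np.nan, 2.0, 1.0],
    "country": ["PRT", None, "GBR", "PRT"],
    "agent": [9.0, np.nan, np.nan, 240.0],
    "company": [np.nan, np.nan, np.nan, 40.0],
    "adults": [2, 1, 2, 2],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    "missing_count": sample_df.isnull().sum(),
    "missing_pct": (sample_df.isnull().mean() * 100).round(1),
})

# Keep only columns that have missing values, worst first
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
missing_summary = missing_summary.sort_values("missing_pct", ascending=False)

print(missing_summary)
```

A column that is mostly empty is a candidate for dropping, while a column with only a handful of gaps is usually better filled.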

In [None]:
# Visualizing the missing values using a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(hotelbooking_df.isnull().transpose(), cbar=False, cmap='Blues')
plt.title('Missing Values Heatmap')
plt.show()


### What did you know about your dataset?

There are four columns in total with missing/null values: company, agent, country and children.

In the children column, I will replace null values with 0, assuming that the customer did not have any children.

The country column also has null values. I will replace them with 'Others', assuming the customer's country was not recorded while booking.

In the company and agent columns, null values may mean the customer did not book through a company or agent. As these two columns hold numeric IDs, I will replace the nulls with 0.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
hotelbooking_df.columns

In [None]:
# Dataset Describe
hotelbooking_df.describe()

### Variables Description

Hotel : (Resort Hotel or City Hotel)

is_canceled: Value indicating if the booking was canceled (1) or not (0)

lead_time : *Number of days that elapsed between the entering date of the booking into the PMS and the arrival date*

arrival_date_year : Year of arrival date

arrival_date_month : Month of arrival date

arrival_date_week_number : Week number of year for arrival date

arrival_date_day_of_month : Day of arrival date

stays_in_weekend_nights : Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

stays_in_week_nights : Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel

adults : Number of adults

children : Number of children

babies : Number of babies

meal : Type of meal booked. Categories are presented in standard hospitality meal packages

country : Country of origin.

 market_segment : Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”

distribution_channel : Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”

is_repeated_guest : Value indicating if the booking name was from a repeated guest (1) or not (0)

previous_cancellations : Number of previous bookings that were cancelled by the customer prior to the current booking

previous_bookings_not_canceled : Number of previous bookings not cancelled by the customer prior to the current booking

reserved_room_type : Code of room type reserved. Code is presented instead of designation for anonymity reasons.

assigned_room_type : Code for the type of room assigned to the booking.

booking_changes : Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation

deposit_type : Indication on if the customer made a deposit to guarantee the booking.

agent : ID of the travel agency that made the booking

company : ID of the company/entity that made the booking or responsible for paying the booking.

days_in_waiting_list : Number of days the booking was in the waiting list before it was confirmed to the customer

customer_type : Type of booking, assuming one of four categories

adr : Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights

required_car_parking_spaces : Number of car parking spaces required by the customer

total_of_special_requests : Number of special requests made by the customer (e.g. twin bed or high floor)

reservation_status : Reservation last status, assuming one of three categories:

*   Canceled – booking was cancelled by the customer
*   Check-Out – customer has checked in but already departed
*   No-Show – customer did not check-in and did inform the hotel of the reason why

reservation_status_date : Date at which the last status was set

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable
hotelbooking_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
hb_df = hotelbooking_df.copy()

In [None]:
# Write your code to make your dataset analysis ready.
hb_df.columns

In [None]:
# replacing null values in children column with 0 assuming that family had 0 children
# replacing null values in company and agent columns with 0 assuming those rooms were booked without company/agent
# replacing null values in country column with 'Others'
# (assigning the result back avoids chained in-place fills, which newer pandas versions warn about)

hb_df = hb_df.fillna({'children': 0, 'company': 0, 'agent': 0, 'country': 'Others'})

In [None]:
# checking for null values after replacing
hb_df.isnull().sum()

In [None]:
# dropping the 'company' column as most of its values were null (before filling) in comparison to other columns
hb_df.drop(['company'], axis=1, inplace=True)        # axis=1 drops a column rather than rows


In [None]:
# dropping rows where adults, children and babies are all 0, since a booking with no guests is not a valid stay

no_guest=hb_df[hb_df['adults']+hb_df['babies']+hb_df['children']==0]

hb_df.drop(no_guest.index, inplace=True)


In [None]:
# final dataset shape

hb_df.shape

### What all manipulations have you done and insights you found?

I created a copy of the dataset before doing any manipulation, then filled missing values with 0 in the children, company and agent columns, as those columns hold numerical values, and filled missing values in the country column with 'Others'. After dealing with missing values I dropped the company column, as most of its values were originally missing and it was of no use in our analysis. I also dropped rows with no guests at all (zero adults, children and babies).

After all the manipulation I checked the new dataset to confirm that it is ready to be analyzed.

After manipulating the dataset these were the insights I found:

1. There are 2 types of hotel which guests could book so I can find which type of hotel was booked most.

2. There are different types of guests and they come from different countries.

3. Guests can choose different foods from the menu.

4. Guests can book hotel directly or through different channels that are available.

5. Guests can cancel their booking and there are repeated guests also.

6. Guests can choose rooms of their liking while booking.

7. There is a column named 'adr' in the dataset which could be used to analyze the hotel's performance on the basis of revenue.
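
To illustrate point 7, the sketch below estimates revenue per booking as `adr` multiplied by the total nights booked, counting only bookings that were not cancelled. This is an assumption for illustration and not a calculation the notebook performs. The helper columns `total_nights` and `est_revenue` are hypothetical, and the values are made up.

```python
import pandas as pd

# Illustrative bookings; column names follow the dataset, values are made up
sample_df = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "is_canceled": [0, 1, 0, 0],
    "stays_in_weekend_nights": [1, 2, 0, 2],
    "stays_in_week_nights": [2, 3, 3, 5],
    "adr": [100.0, 90.0, 80.0, 120.0],
})

# Hypothetical helper columns (not part of the original dataset)
sample_df["total_nights"] = (
    sample_df["stays_in_weekend_nights"] + sample_df["stays_in_week_nights"]
)
sample_df["est_revenue"] = sample_df["adr"] * sample_df["total_nights"]

# Only bookings that were not cancelled contribute to realized revenue
realized = sample_df[sample_df["is_canceled"] == 0]
revenue_by_hotel = realized.groupby("hotel")["est_revenue"].sum()

print(revenue_by_hotel)
```

Comparing this figure with the same sum over cancelled bookings would give a rough sense of how much revenue cancellations put at risk.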

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1  
# **Which type of hotel is most preferred by the guests?**


In [None]:
# Chart - 1 visualization code

hb_df['hotel'].value_counts().plot.pie(autopct='%1.1f%%',figsize=(10,5),fontsize=10)
plt.title('Pie Chart for Most Preferred Hotel')

##### 1. Why did you pick the specific chart?

I used a pie chart here because it shows the proportion of each category within the whole.


##### 2. What is/are the insight(s) found from the chart?

I found that guests book the City Hotel more often than the Resort Hotel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight helps stakeholders see which hotel attracts more bookings and decide where to invest more capital. No insight here points to negative growth, but stakeholders can focus more on the City Hotel to get more bookings and increase overall revenue.

#### Chart - 2
# What is the percentage of cancellations?

In [None]:
hb_df['is_canceled'].value_counts().plot.pie(autopct='%1.1f%%',figsize=(10,5),fontsize=10)
plt.title("Cancellation and non Cancellation")

##### 1. Why did you pick the specific chart?

I had to show a part-to-whole relationship and the percentage of both values, and a pie chart is a good option for showing such segments.

##### 2. What is/are the insight(s) found from the chart?

Here we can see that around 63% of bookings are not cancelled by guests, while around 37% are cancelled.
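
The percentages on the pie chart can also be computed directly, and splitting them by hotel type shows where cancellations concentrate. The sketch below runs on a small made-up sample, so its numbers are illustrative and will not match the chart. The variable names are hypothetical.

```python
import pandas as pd

# Illustrative sample: 5 city bookings and 3 resort bookings
sample_df = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 3,
    "is_canceled": [1, 1, 0, 0, 0, 1, 0, 0],
})

# Share of cancelled (1) and not cancelled (0) bookings, in percent
overall_pct = sample_df["is_canceled"].value_counts(normalize=True).mul(100).round(1)

# The mean of a 0/1 column is the cancellation rate, here per hotel type
by_hotel_pct = sample_df.groupby("hotel")["is_canceled"].mean().mul(100).round(1)

print(overall_pct)
print(by_hotel_pct)
```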

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight helps stakeholders compare cancelled and non-cancelled bookings. Based on it, stakeholders can offer to reschedule bookings instead of cancelling them and set a flexible cancellation policy to reduce booking cancellations.

#### Chart - 3
# Which year has the most bookings ?

In [None]:
# Chart - 3
plt.figure(figsize=(10,4))
sns.countplot(x=hb_df['arrival_date_year'],hue=hb_df['hotel'])
plt.title("Number of bookings across years", fontsize = 25)
plt.show()

##### 1. Why did you pick the specific chart?

Bar graphs are used to compare things between different groups

##### 2. What is/are the insight(s) found from the chart?


From the above chart I found that the hotels were booked most often in 2016.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart shows that the number of bookings declined after 2016. Stakeholders can now investigate what went wrong after 2016 and fix the problem to increase the number of bookings. One way to do this is to ask guests for feedback and to meet with long-serving employees who were working in 2016. This comparison should be treated with caution if the data do not cover all twelve months of every year.

#### Chart - 4
# Which is the most preferred room type by the customers?

In [None]:
plt.figure(figsize=(8,5))

#plotting
sns.countplot(x=hb_df['assigned_room_type'],order=hb_df['assigned_room_type'].value_counts().index, palette='Set2')

plt.xlabel('Room Type')
plt.ylabel('Count of Room Type')
plt.title("Most preferred Room type")

##### 1. Why did you pick the specific chart?

A bar plot shows categorical data as rectangular bars with the height of bars proportional to the value they represent. It is often used to compare between values of different categories in the data.

##### 2. What is/are the insight(s) found from the chart?

By observing the above chart we can see that room type A is the most preferred by guests (almost 70,000 bookings).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


As room type A is clearly the most used, the hotel should increase the number of type A rooms to maximize revenue.

#### Chart - 5
# Which type of food is mostly preferred by the guests?

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(8,5))
sns.countplot(x=hb_df['meal'], palette='Set2')
plt.xlabel('Meal Type')
plt.ylabel('Count')
plt.title("Preferred Meal Type")

Types of meal in hotels:

*   BB - Bed and Breakfast
*   HB - Half Board
*   FB - Full Board
*   SC - Self Catering

##### 1. Why did you pick the specific chart?

A bar plot shows categorical data as rectangular bars with the height of bars proportional to the value they represent. It is often used to compare between values of different categories in the data.

##### 2. What is/are the insight(s) found from the chart?

By observing the above chart we can see that Bed and Breakfast is the most preferred meal type (almost 80,000 bookings).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


As BB (Bed and Breakfast) is clearly the most chosen meal type, the hotel should keep offering and promoting BB packages to maximize revenue.

#### Chart - 6
# Which month has the most bookings in each hotel type?


In [None]:
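# Chart - 6

# calendar order for the x-axis (otherwise months appear in order of first occurrence)
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

plt.figure(figsize=(15,5))
sns.countplot(x=hb_df['arrival_date_month'],hue=hb_df['hotel'],order=month_order)
plt.title("Number of bookings across months", fontsize = 25)
plt.show()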


##### 1. Why did you pick the specific chart?

I had to compare values across the months, and a bar chart is one of the best choices for that.

##### 2. What is/are the insight(s) found from the chart?


The chart shows that August and July were the two busiest months compared to the others.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


There is no negative insight, but the hotel can use this insight to arrange everything in advance and welcome guests in the best way possible. The hotel can also run promotional offers in these two months to attract more guests.

#### Chart - 7
# What is the Percentage of repeated guests?

In [None]:
# Chart - 7
hb_df['is_repeated_guest'].value_counts().plot.pie(autopct='%1.1f%%',figsize=(10,5),fontsize=10)

plt.title("Percentage (%) of repeated guests")

##### 1. Why did you pick the specific chart?

A pie chart helps organize and show data as a percentage of a whole

##### 2. What is/are the insight(s) found from the chart?

From the above chart we can see that 3.1% of guests are repeated guests.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The number of repeated guests is very low, which is a negative sign for guest retention. The hotel can offer loyalty discounts to its guests to increase repeat bookings.

#### Chart - 8
# Which country do most guests come from?

In [None]:
plt.figure(figsize=(15,5))
sns.countplot(x=hb_df['country'],order=hb_df['country'].value_counts().iloc[:10].index, palette='Set2')
plt.xlabel('Country')
plt.ylabel('Number of guests')
plt.title("Number of guests from different countries")

##### 1. Why did you pick the specific chart?

A bar plot shows categorical data as rectangular bars with the height of bars proportional to the value they represent. It is often used to compare between values of different categories in the data.

##### 2. What is/are the insight(s) found from the chart?

From the above chart I found that most guests come from PRT (Portugal).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There is no negative insight. Knowing that most guests come from Portugal, hotels can add more Portuguese cuisine to their menus to encourage guests to order more food.

#### Chart - 9
# Which distribution channel is most used in booking?

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(12, 5))
sns.countplot(x=hb_df['distribution_channel'],
              order=hb_df['distribution_channel'].value_counts().index, palette='Set1')
plt.title('Distribution Channel Used for Booking', fontsize=15)
plt.xlabel('Distribution Channel')
plt.ylabel('Count')

##### 1. Why did you pick the specific chart?

A bar plot shows categorical data as rectangular bars with the height of bars proportional to the value they represent. It is often used to compare between values of different categories in the data.

##### 2. What is/are the insight(s) found from the chart?

From the above chart it is clear that TA/TO (Travel Agents/Tour Operators) is the distribution channel guests use most.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There is no negative insight. Hotels can run promotional offers to encourage more bookings through the other channels.

# Bivariate and Multivariate Analysis

#### Chart - 10
# Which hotel type has the longer lead time?

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(12,6))
sns.boxplot(x='hotel', y='lead_time', data=hb_df)
plt.title('Lead Time for Each Hotel Type')
plt.xlabel('Hotel Type')
plt.ylabel('Lead Time')
plt.show()

##### 1. Why did you pick the specific chart?

Box plots show the distribution and skewness of numerical data by displaying the quartiles, the median and any outliers.

##### 2. What is/are the insight(s) found from the chart?

From the above chart we can see that the Resort Hotel has a slightly longer lead time than the City Hotel.
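
A difference described as "slight" on a box plot is worth checking against numbers. The sketch below shows how the median, mean and maximum lead time per hotel type could be tabulated. The sample values are made up for illustration and say nothing about which hotel actually has the longer lead time. `lead_time_summary` is a hypothetical name.

```python
import pandas as pd

# Illustrative lead times in days; not taken from the real dataset
sample_df = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "lead_time": [5, 20, 60, 300, 10, 40, 90, 400],
})

# The median is robust to the long right tail that lead times usually have
lead_time_summary = sample_df.groupby("hotel")["lead_time"].agg(["median", "mean", "max"])

print(lead_time_summary)
```

Running the same aggregation on `hb_df` would confirm or correct the reading of the box plot.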

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it creates a positive business impact: because Resort hotels have longer lead times, they can plan ahead and manage resources more efficiently.

#### Chart - 11

# Which Hotel type has the highest ADR?

In [None]:
# Chart - 11 visualization code
# mean ADR per hotel type
group_by_hotel=hb_df.groupby('hotel')
highest_adr=group_by_hotel['adr'].mean().reset_index()

#set plot size
plt.figure(figsize=(5,5))

#plot the graph
sns.barplot(x=highest_adr['hotel'],y=highest_adr['adr'])

# set labels after plotting so that seaborn's automatic axis labels
# (behaviour varies by version) cannot replace them
plt.xlabel('Hotel type')
plt.ylabel('ADR')
plt.title("Avg ADR of each Hotel type")

##### 1. Why did you pick the specific chart?

A bar plot shows categorical data as rectangular bars with the height of bars proportional to the value they represent. It is often used to compare between values of different categories in the data.

##### 2. What is/are the insight(s) found from the chart?

From the above chart we can see that City hotel has slightly higher ADR than Resort hotel.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it creates a positive business impact. As the City Hotel has the higher ADR, hotels can strategize accordingly to increase revenue.

#### Chart - 12
# Which distribution channel contributes most to ADR and so helps increase income?

In [None]:
# Chart - 12 visualization code
# group by distribution channel and hotel
distribution_channel_df=hb_df.groupby(['distribution_channel','hotel'])['adr'].mean().reset_index()

# set plot size and plot barchart
plt.figure(figsize=(8,5))
sns.barplot(x='distribution_channel', y='adr', data=distribution_channel_df, hue='hotel')
plt.title('ADR across Distribution channel')


##### 1. Why did you pick the specific chart?


A bar plot shows categorical data as rectangular bars with the height of bars proportional to the value they represent. It is often used to compare between values of different categories in the data

##### 2. What is/are the insight(s) found from the chart?

From the above chart we can see that the GDS channel has the highest ADR for the City Hotel and TA/TO has the highest ADR for the Resort Hotel.
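
The top channel per hotel can also be read off programmatically instead of by eye. The sketch below pivots mean ADR by channel and hotel on a small made-up sample and picks the highest channel in each column with `idxmax`. The values are illustrative, and `mean_adr` and `top_channel` are hypothetical names.

```python
import pandas as pd

# Illustrative ADR values; not taken from the real dataset
sample_df = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel",
              "Resort Hotel", "Resort Hotel", "Resort Hotel"],
    "distribution_channel": ["GDS", "TA/TO", "Direct",
                             "TA/TO", "Direct", "Corporate"],
    "adr": [120.0, 110.0, 105.0, 100.0, 95.0, 60.0],
})

# Rows: channels, columns: hotel types, cells: mean ADR
mean_adr = sample_df.pivot_table(
    index="distribution_channel", columns="hotel", values="adr", aggfunc="mean"
)

# For each hotel type, the channel with the highest mean ADR
# (channels a hotel never used appear as NaN and are skipped)
top_channel = mean_adr.idxmax()

print(top_channel)
```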

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it creates a positive business impact. As GDS has the highest ADR in the City Hotel and TA/TO has the highest ADR in the Resort Hotel, hotels can strategize accordingly to increase revenue.

The distribution channels are:

*   Corporate - corporate hotel booking companies that make bookings on behalf of guests.
*   GDS - a GDS is a worldwide conduit between travel bookers and suppliers, such as hotels and other accommodation providers. It communicates live product, price and availability data to travel agents and online booking engines, and allows for automated transactions.
*   Direct - bookings are made directly with the respective hotel.
*   TA/TO - bookings are made through travel agents or tour operators.
*   Undefined - the channel is not recorded; customers may have made their bookings on arrival.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(15,10))
# diverging colormap centred at 0: positive correlations in reds, negative in blues
sns.heatmap(hb_df.corr(numeric_only=True),annot=True,cmap='coolwarm',center=0)
plt.title('Correlation of the columns')

##### 1. Why did you pick the specific chart?

Heatmaps are great for visualizing the correlation between multiple variables. With a diverging colormap such as `coolwarm` centred at zero, positive correlations are shown in warmer colors (reds), while negative correlations are shown in cooler colors (blues).

##### 2. What is/are the insight(s) found from the chart?

total_of_special_requests has a positive correlation with adr, so bookings with more special requests tend to have a higher daily rate. lead_time has a slight negative correlation with adr, which suggests that bookings made further in advance tend to have a slightly lower daily rate. Correlation alone does not show that one causes the other.
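
A large heatmap can be hard to scan for a single variable. The sketch below pulls out only the correlations with `adr` and sorts them, so the strongest positive and negative relationships appear at the two ends. The sample values are made up for illustration, and `adr_corr` is a hypothetical name.

```python
import pandas as pd

# Illustrative values chosen so that one column rises and one falls with adr
sample_df = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel", "City Hotel"],
    "adr": [60.0, 80.0, 100.0, 120.0, 140.0],
    "total_of_special_requests": [0, 1, 1, 2, 3],
    "lead_time": [200, 150, 90, 60, 10],
})

# numeric_only=True skips text columns such as 'hotel'
adr_corr = (
    sample_df.corr(numeric_only=True)["adr"]
    .drop("adr")                      # a variable's correlation with itself is always 1
    .sort_values(ascending=False)
)

print(adr_corr)
```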

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(hb_df, hue='hotel')

##### 1. Why did you pick the specific chart?

Pair plots are a great way to visualize the relationship between all pairs of numerical variables in a dataset. They can help identify patterns and relationships that may not be apparent from looking at individual variables.

##### 2. What is/are the insight(s) found from the chart?

From the above pair plot we can see that total stay tends to be lower when cancellations are higher.
As the total number of people increases, adr also tends to increase, so adr and total people are positively associated.
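
The insights above refer to total stay and total people, which are not columns of the raw dataset. The sketch below shows one way such columns could be derived. The names `total_stay` and `total_people` and the formulas are assumptions, since the code that created them is not shown in this notebook. The sample values are made up.

```python
import pandas as pd

# Illustrative bookings; column names follow the dataset, values are made up
sample_df = pd.DataFrame({
    "stays_in_weekend_nights": [1, 0, 2],
    "stays_in_week_nights": [2, 3, 5],
    "adults": [2, 1, 2],
    "children": [0.0, 0.0, 2.0],   # stored as float in the raw data because of nulls
    "babies": [0, 0, 1],
    "adr": [90.0, 70.0, 150.0],
})

# Assumed definition: nights booked across weekend and week days
sample_df["total_stay"] = (
    sample_df["stays_in_weekend_nights"] + sample_df["stays_in_week_nights"]
)

# Assumed definition: everyone on the booking, cast back to whole numbers
sample_df["total_people"] = (
    sample_df["adults"] + sample_df["children"] + sample_df["babies"]
).astype(int)

# A short list of columns keeps a pair plot readable and much faster to draw
plot_columns = ["total_stay", "total_people", "adr"]
print(sample_df[plot_columns])
```

Passing only a handful of columns like these, plus `hotel` for the hue, to `sns.pairplot` also avoids drawing a grid over every numeric column of the full dataset.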

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

1. Focus on City Hotels: City hotels are more popular and generate higher ADR, so allocate more marketing and resources to attract more city hotel bookings.

2. Reduce Cancellation Rates: Implement strategies like flexible cancellation policies, offering incentives for non-refundable bookings, and sending reminders to reduce the high cancellation rate (37%).

3. Target Repeat Guests: Offer loyalty programs, personalized discounts, and exclusive perks to encourage repeat bookings, as the current rate is very low (3.1%).

4. Optimize Pricing for Lead Time: Analyze the relationship between lead time and ADR to optimize pricing strategies. Consider offering discounts for longer lead times or dynamic pricing based on demand.

5. Promote Popular Room Types: Ensure availability of room type A, as it is the most preferred among guests. Consider offering promotions or upgrades to less popular room types.

6. Offer Attractive Meal Packages: Bed and Breakfast is the most popular meal type. Consider offering discounts or bundling it with room rates to attract more bookings.

7. Leverage Peak Season: Prepare for the busy months of July and August by ensuring adequate staffing and resources. Consider promotional offers during the off-season to increase occupancy.

8. Maximize Distribution Channels: While TA/TO is the most used channel, explore ways to increase bookings from other channels like GDS, which has a high ADR for City Hotels.

9. Analyze Country-Specific Preferences: Offer customized experiences and services based on the guest's country of origin. For example, since many guests are from Portugal, consider incorporating Portuguese-speaking staff or offering Portuguese cuisine.

# **Conclusion**

This project analyzed hotel booking data to understand customer behavior and identify areas for improvement. The insights gained can help hotels optimize pricing, reduce cancellations, target specific customer segments, and ultimately increase revenue. By implementing data-driven strategies, hotels can enhance the guest experience and achieve their business objectives.


### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***