# **Project Name**    -


**Hotel Booking Analysis**

##### **Project Type**    - Exploratory Data Analysis
##### **Contribution**    - Individual

# **Project Summary -**

"I worked with a dataset containing hotel booking information. I began by loading the dataset into my Colab notebook and thoroughly examined its contents, identifying the various variables it contained. Next, I performed data cleaning and preprocessing tasks using functions such as drop(), fillna(), and isna()/isnull() to ensure the data was in good shape for analysis.

During this process, I added new columns that were necessary for my analysis and removed rows that were not relevant to my objectives. Using this refined dataset, I employed various visualization techniques, such as pie charts, count plots, bar plots, violin plots, and heatmaps, to explore the dataset's variables.

These visualizations allowed me to extract valuable insights, which in turn enabled me to formulate important findings and conclusions about the factors that significantly influence hotel bookings. With these insights in hand, I can propose meaningful business objectives and recommendations to our clients."







# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


# **Performed EDA and tried answering the following questions:**

1. Which type of hotel is most preferred by guests?
2. Which agent made the most bookings?
3. What is the percentage of repeated guests?
4. What is the most preferred room type by the customers?
5. What type of food is most preferred by guests?
6. In which month did most of the bookings happen?
7. Which distribution channel is mostly used for hotel booking?
8. Which hotel type has the highest ADR?
9. Which hotel has the longer waiting time?
10. How many people are reservations made for?
11. Which hotel type has the most advanced reservations?
12. Which country makes the most reservations?

#### **Define Your Business Objective?**

**Our main objective is to perform EDA on the given dataset and draw meaningful insights regarding overall trends in hotel bookings and how different factors combine to influence hotel reservations.**

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import libraries
import math  # standard-library math; `from numpy import math` no longer works on NumPy 2.0+
import numpy as np
import pandas as pd
from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams

### Dataset Loading

In [None]:
# My drive is mounted here.
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
path='/content/Hotel Bookings (1).csv'
df=pd.read_csv(path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()


In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape


### Dataset Information

In [None]:
#understanding the given information in dataset
df.info()


In [None]:
# Summary statistics for the numeric columns of the dataset
df.describe()

In [None]:
#Creating the copy of the dataset
df1 = df.copy()



#### Duplicate Values

In [None]:

# Dataset Duplicate Value Count
duplicate_values = df1.duplicated().value_counts()
duplicate_values

In [None]:

# Visualizing the duplicate values
plt.figure(figsize=(5,3))
sns.countplot(x=df1.duplicated())
plt.title('Visualisation of duplicated value', fontsize = 10)
plt.ylabel('Count of Duplicate Values', fontsize = 10)
plt.show()

In [None]:

# Drop the duplicate rows from the dataset
df1 = df1.drop_duplicates()



In [None]:
# shape of dataset after dropping the duplicates.
df1.shape


#### Missing Values/Null Values

In [None]:
# Missing/null value count (five columns with the most missing values)
missing_value = df1.isnull().sum().sort_values(ascending=False)[:5]
missing_value


In [None]:

# Calculate the count of missing values in each column
missing_counts = df1.isnull().sum()

# Create a bar plot to visualize missing values
plt.figure(figsize=(7, 3))
missing_counts.plot(kind='bar')
plt.title('Missing Values by Column')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=90)
plt.show()


### What did you know about your dataset?

The dataset provided contains information related to hotel bookings, and the goal is to analyze and explore this dataset to identify significant factors that influence hotel bookings. The dataset comprises 119,390 rows and 32 columns. It's worth noting that there are 31,994 rows that are duplicates of each other across all 32 columns. Additionally, there are four columns in the dataset that contain missing values, namely "company," "agent," "country," and "children."
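
The duplicate count and the list of columns with missing values quoted above can be reproduced with a few pandas calls. The sketch below is only an illustration: `toy` is a small made-up frame with hypothetical values, not the real booking file, so the numbers it prints are not the ones from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the hotel bookings file.
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "agent": [9.0, 9.0, np.nan, 240.0],
    "children": [0.0, 0.0, 1.0, np.nan],
})

# Rows that repeat an earlier row across every column.
n_duplicates = int(toy.duplicated().sum())

# Columns that contain at least one missing value.
missing_counts = toy.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()

print("duplicate rows:", n_duplicates)
print("columns with missing values:", columns_with_missing)
```

Running the same three calls on `df` instead of `toy` should give the counts described in this section.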

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

df1.columns

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Check unique values for each variable.
# A lambda function applied column by column returns the unique values of each individual column.
print(df1.apply(lambda col: col.unique()))


In [None]:
# Drop the duplicate rows that were identified earlier.
df1 = df1.drop_duplicates()
df1


In [None]:
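# Null values in these columns are replaced with 0 using .fillna()
null_columns = ['company', 'agent', 'children']
# Assign the result back: fillna(inplace=True) on a selected column may not update df1 in recent pandas
df1[null_columns] = df1[null_columns].fillna(0)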

In [None]:
# Null values in the country column are replaced by 'others' using .fillna()
df1['country'] = df1['country'].fillna('others')

In [None]:
df1.isna().sum().sort_values(ascending=False)[:5]


**Adding New Columns**

In [None]:

# Adding total staying days in hotels
df1['total_stay'] = df1['stays_in_weekend_nights']+df1['stays_in_week_nights']

# Adding total people num as a column
df1['total_people'] = df1['adults']+df1['children']+df1['babies']

# Creating 'guest_category' from variable 'total_people'

# Create a 'guest_category' column based on the 'total_people' column
df1['guest_category'] = np.where(df1['total_people'] == 1, 'single',
                                  np.where(df1['total_people'] == 2, 'couple', 'family'))




Some rows have a combined total of zero adults, children, and babies, which means no guests were actually booked. Such rows can be removed.

In [None]:
# Shape of the subset of rows that have no guests
df1[df1['adults']+df1['babies']+df1['children'] == 0].shape


In [None]:
# These rows are dropped here using the drop function
df1.drop(df1[df1['adults']+df1['babies']+df1['children'] == 0].index, inplace = True)

### What all manipulations have you done and insights you found?

1 . The dataframe contained 31,994 duplicate rows, which were removed.

2 . Four columns had missing values: 'company', 'agent', 'country', and 'children'. Missing values in 'company', 'agent', and 'children' were replaced with zero, and missing values in 'country' were replaced with 'others'.

3 . Three columns were added to the dataframe: total_stay, total_people, and guest_category.

4 . Rows in which 'adults', 'children', and 'babies' were all zero represent bookings with no guests, so these rows were removed.
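
The manipulations listed above can also be collected into one helper. This is a minimal sketch under my own assumptions: the function name `clean_bookings` and the `toy` frame are hypothetical and only mimic the column names used in this notebook.

```python
import numpy as np
import pandas as pd

def clean_bookings(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps listed above to a bookings-like frame."""
    out = raw.drop_duplicates().copy()
    # Numeric gaps become 0; unknown countries become 'others'.
    numeric_gaps = ["company", "agent", "children"]
    out[numeric_gaps] = out[numeric_gaps].fillna(0)
    out["country"] = out["country"].fillna("others")
    # Derived columns.
    out["total_stay"] = out["stays_in_weekend_nights"] + out["stays_in_week_nights"]
    out["total_people"] = out["adults"] + out["children"] + out["babies"]
    # Keep only rows that describe a booking for at least one guest.
    return out[out["total_people"] > 0].reset_index(drop=True)

# Hypothetical toy data: rows 0 and 1 are duplicates, the last row has no guests.
toy = pd.DataFrame({
    "company": [np.nan, np.nan, 40.0, np.nan],
    "agent": [9.0, 9.0, np.nan, 240.0],
    "children": [0.0, 0.0, np.nan, 0.0],
    "country": ["PRT", "PRT", None, "GBR"],
    "stays_in_weekend_nights": [1, 1, 2, 0],
    "stays_in_week_nights": [2, 2, 3, 1],
    "adults": [2, 2, 1, 0],
    "babies": [0, 0, 0, 0],
})

cleaned = clean_bookings(toy)
print(cleaned[["country", "total_stay", "total_people"]])
```

Keeping the steps in one function makes it easier to re-run the cleaning on a fresh copy of the data without repeating cells.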

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***


#### Chart - 1

1. Which type of hotel is most preferred by guests?

In [None]:

# Chart - 1 visualization code
hotel_value_counts = df1['hotel'].value_counts()
hotel_value_counts



In [None]:
hotel_value_counts.plot.pie(explode=[0.03, 0.03], autopct='%1.2f%%', shadow=True, figsize=(8, 4), fontsize=10)
plt.title('Pie Chart for Most Preferred Hotel', fontsize=10)
plt.show()


In [None]:
# piechart is used for visualization
explode = [0.04] * len(hotel_value_counts)  # Create a list with the same length as the data
hotel_value_counts.plot.pie(explode=explode, autopct='%1.2f%%', shadow=True, figsize=(8, 4), fontsize=10)
plt.title('Pie Chart for Most Preferred Hotel', fontsize=8)
plt.show()


##### 1. Why did you pick the specific chart?

I used a pie chart because it gives a simple, easy-to-understand picture of which hotel type has more bookings.

##### 2. What is/are the insight(s) found from the chart?

The city hotel has more bookings, at 61.07%, while the resort hotel has fewer, at 38.93%.
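
The percentages shown on the pie chart can also be computed directly, without reading them off the figure. The sketch below uses a made-up `hotels` series (hypothetical values), so its output differs from the real shares.

```python
import pandas as pd

# Hypothetical toy column standing in for df1['hotel'].
hotels = pd.Series(["City Hotel"] * 3 + ["Resort Hotel"] * 2, name="hotel")

# normalize=True returns fractions instead of raw counts.
share_pct = (hotels.value_counts(normalize=True) * 100).round(2)
print(share_pct)
```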

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained can help create a positive business impact.

The city hotel can add services to attract more guests and increase revenue.

The resort hotel can look for ways to attract guests, including studying which facilities the city hotel provides to attract them.



#### Chart - 2
2. Which agent made the most bookings?

In [None]:
# Chart - 2 visualization code
# rename_axis/reset_index(name=...) gives the same column names on both older and newer pandas
top_bookings_by_agent = df1['agent'].value_counts().rename_axis('agent').reset_index(name='num_of_bookings').head(10)
top_bookings_by_agent


In [None]:

# barplot is used for visualization
plt.figure(figsize=(8,4))
sns.barplot(x=top_bookings_by_agent['agent'],y=top_bookings_by_agent['num_of_bookings'],order=top_bookings_by_agent['agent'])
plt.title('Most bookings by the agent', fontsize=10)
plt.ylabel('Number of bookings', fontsize=5)
plt.xlabel('Agent ', fontsize=5)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a bar plot here because it presents the data pictorially, which makes comparison easy.

##### 2. What is/are the insight(s) found from the chart?


The insight here is that agent no. 9 made the most bookings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1 . Yes, agents no. 9 and 240 have the most bookings, which is a positive impact.

2 . Agents no. 1 and 6 have few bookings, which is a negative impact.

3 . Bookings made by agents no. 1 and 6 are about 4.27% of those made by agent no. 9, which has the highest number of bookings.
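
A comparison such as "about 4.27% of agent no. 9" can be obtained by dividing each agent's count by the highest count. The sketch below assumes a made-up series `bookings_per_agent`; the counts are hypothetical and are not the real figures.

```python
import pandas as pd

# Hypothetical booking counts per agent id (not the real figures).
bookings_per_agent = pd.Series({9: 200, 240: 120, 1: 10, 6: 6}, name="num_of_bookings")

# Agent with the highest count, and every agent's count as a percentage of it.
top_agent = bookings_per_agent.idxmax()
pct_of_top = (bookings_per_agent / bookings_per_agent.max() * 100).round(2)

print("top agent:", top_agent)
print(pct_of_top)
```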

#### Chart - 3

3. What is the percentage of repeated guests?

In [None]:

# Chart - 3 visualization code
repeated_guests_count = df1['is_repeated_guest'].value_counts()
repeated_guests_count


In [None]:

# pie chart is used for visualization
repeated_guests_count.plot.pie(explode=[0.03, 0.03], autopct='%1.2f%%', shadow=True, figsize=(5,3),fontsize=10)
plt.title('Percentage of repeated guests', fontsize = 20)
plt.show()

##### 1. Why did you pick the specific chart?

I used a pie chart because it gives a simple, easy-to-understand picture of how many guests book a particular hotel repeatedly.

##### 2. What is/are the insight(s) found from the chart?

The insight from the chart is that very few guests book the same hotel again.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained can help create a positive business impact: hotels that are not booked repeatedly can collect feedback from guests and try to improve their services.

#### Chart - 4
4. What is the most preferred room type by the customers?

In [None]:
# Chart - 4 visualization code
room_type = df1['assigned_room_type'].value_counts()
plt.figure(figsize=(10,5))
sns.countplot(x=df1['assigned_room_type'],order=df1['assigned_room_type'].value_counts().index)
plt.title("Most preferred Room type", fontsize = 10)
plt.xlabel('Type of the Room', fontsize = 10)
plt.ylabel('Room type count', fontsize = 10)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a count plot to visualize the most preferred room type because a count plot displays the number of observations in each category, and here we need to show room type against room type count.

##### 2. What is/are the insight(s) found from the chart?

The insight from the chart is that type A rooms are the most preferred, with a count of 46,283, followed by type D rooms, with a count of 22,419.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Type A rooms are the most preferred rooms. This has a positive impact on business.
2. Type H, I, K, and L rooms are less preferred, which has a negative impact.
3. This is because type A rooms have 46,283 bookings, while type L rooms have only one booking.

#### Chart - 5
5. What type of food is most preferred by guests?

In [None]:
# Chart - 5 visualization code
preferred_food = df1['meal'].value_counts()
plt.figure(figsize=(10,4))
sns.countplot(x=df1['meal'],order=df1['meal'].value_counts().index)
plt.title("Most preferred Food", fontsize = 10)
plt.xlabel('Type of the food', fontsize = 5)
plt.ylabel('Food type count', fontsize = 5)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a count plot to visualize the most preferred food because a count plot displays the number of observations in each category, and here we need to show food type against food type count.

##### 2. What is/are the insight(s) found from the chart?

The insight here is that the BB meal type is the most preferred and the FB meal type is less preferred.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. The BB meal type is the most preferred, which has a positive impact on business.
2. The Undefined and FB meal types are less preferred, which has a negative impact on business.

#### Chart - 6
6. In which month did most of the bookings happen?

In [None]:
# Chart - 6 visualization code
bookings_by_months=df1.groupby(['arrival_date_month'])['hotel'].count().reset_index().rename(columns={'hotel':"Counts of booking"})
sequence_of_months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
bookings_by_months['arrival_date_month']=pd.Categorical(bookings_by_months['arrival_date_month'],categories=sequence_of_months,ordered=True)
bookings_by_months=bookings_by_months.sort_values('arrival_date_month')
bookings_by_months


In [None]:

# barplot for visualization of month in which most booking happened.
plt.figure(figsize=(12,5))
sns.barplot(data=bookings_by_months, x="arrival_date_month", y="Counts of booking")
plt.title("Number of Bookings in Months", fontsize = 10)
plt.xlabel('Month', fontsize = 5)
plt.ylabel('Number of Bookings', fontsize = 5)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a bar plot here because it presents the data pictorially, which makes comparison easy.

##### 2. What is/are the insight(s) found from the chart?

The insight from the chart is that August has the maximum number of bookings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. November, December, and January have fewer bookings, which is a negative impact.
2. July and August have bookings above the monthly average, while November, December, and January have bookings below it.
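
The "above average" and "below average" months can be picked out by comparing each monthly count with the mean. This is a sketch with hypothetical counts in `monthly`; with the real data, the `Counts of booking` column of `bookings_by_months` would take its place.

```python
import pandas as pd

# Hypothetical monthly booking counts (not the real figures).
monthly = pd.Series({
    "January": 40, "July": 100, "August": 120, "November": 45, "December": 45,
})

average = monthly.mean()
above_average = monthly[monthly > average].index.tolist()
below_average = monthly[monthly < average].index.tolist()

print("average:", average)
print("above:", above_average)
print("below:", below_average)
```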

#### Chart - 7
7. Which distribution channel is mostly used for hotel booking?


In [None]:
# Chart - 7 visualization code
# Average ADR per distribution channel, split by hotel type
group_by_dc_hotel = df1.groupby(['distribution_channel', 'hotel'])
# .mean() replaces .agg(np.mean), which raises a FutureWarning on recent pandas
d5 = group_by_dc_hotel['adr'].mean().round(2).reset_index().rename(columns = {'adr': 'avg_adr'})
plt.figure(figsize = (7,5))
sns.barplot(x = d5['distribution_channel'], y = d5['avg_adr'], hue = d5['hotel'])
plt.ylim(40,140)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar plot because it is simple and easy to understand.

##### 2. What is/are the insight(s) found from the chart?

The most used distribution channel is TA/TO, with 69,028 bookings, which is 79.13% of all bookings.
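
The booking count and percentage quoted above come from counting bookings per channel rather than from the ADR chart. A minimal sketch of that count is shown here; `channels` is a made-up series with hypothetical values, standing in for `df1['distribution_channel']`.

```python
import pandas as pd

# Hypothetical toy column standing in for df1['distribution_channel'].
channels = pd.Series(
    ["TA/TO", "TA/TO", "TA/TO", "Direct", "Corporate"],
    name="distribution_channel",
)

# Number of bookings and percentage share for each channel.
summary = pd.DataFrame({
    "bookings": channels.value_counts(),
    "share_pct": (channels.value_counts(normalize=True) * 100).round(2),
})
most_used = summary["bookings"].idxmax()

print(summary)
print("most used channel:", most_used)
```

A count plot of the same column, as used for room type and meal type earlier, would be one way to visualize this result.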

#### Chart - 8
8. Which hotel type has the highest ADR?

In [None]:
# Chart - 8 visualization code
highest_adr = df1.groupby('hotel')['adr'].mean().reset_index()
plt.figure(figsize=(8,4))
sns.barplot(x=highest_adr['hotel'],y=highest_adr['adr'])
plt.title('Average ADR for each Hotel type', fontsize=10)
plt.xlabel('Type of hotel',fontsize=5)
plt.ylabel('ADR', fontsize=5)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar plot because it gives a simple pictorial diagram that is easy to understand.

##### 2. What is/are the insight(s) found from the chart?

The insight from the chart is that the city hotel has the highest ADR (average daily rate), which means it earns more revenue per booked night on average.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1 . The city hotel has a high ADR, which is a positive impact.

2 . The resort hotel has a lower ADR than the city hotel, which is a negative impact.

3 . The city hotel has an ADR of 110.98, meaning more revenue, and the resort hotel has an ADR of 99.02, meaning less revenue than the city hotel.

4 . The resort hotel should improve its facilities to increase revenue.

#### Chart - 9
**9. Which hotel has the longer waiting time?**

In [None]:
# Chart - 9 visualization code
Waiting_time = df1.groupby('hotel')['days_in_waiting_list'].mean().reset_index()
plt.figure(figsize=(8,4))
sns.barplot(x=Waiting_time['hotel'],y=Waiting_time['days_in_waiting_list'])
plt.title('Waiting time for each hotel type', fontsize=10)
plt.xlabel('Type of hotel',fontsize=5)
plt.ylabel('Waiting time', fontsize=5)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a bar plot because it gives an easy-to-understand pictorial diagram of which hotel has the longer waiting time.

##### 2. What is/are the insight(s) found from the chart?

The city hotel has a longer average waiting time, which suggests that the city hotel is much busier than the resort hotel.

#### Chart - 10
10. How many people are reservations made for?

In [None]:
# Chart - 10 visualization code


# Create a 'guest_category' column based on the 'total_people' column
df1['guest_category'] = np.where(df1['total_people'] == 1, 'single',
                                  np.where(df1['total_people'] == 2, 'couple', 'family'))

# Define the annot_percent function
def annot_percent(plot, feature, ax):
    total = len(feature)
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)
        x = p.get_x() + p.get_width() / 2 - 0.15
        y = p.get_y() + p.get_height()
        ax.annotate(percentage, (x, y), size=10)

plt.figure(figsize=(5, 5))
ax = sns.countplot(x=df1['hotel'], hue=df1['guest_category'])
ax.set_title('Hotel vs. Guest Category')
annot_percent(ax, df1['guest_category'], ax)
plt.show()


Most customers book hotels for two people (couples). Customers prefer city hotels over resorts for family bookings. A city hotel is preferred when booking for a single person.

#### Chart - 11
11. Which hotel type has the most advanced reservations?

In [None]:
# Chart - 11 visualization code
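# Plotting violin plot for hotel against lead_time
plt.figure(figsize=(5,5))
# Capture the axes returned by violinplot so the title is set on this chart, not on the previous one
ax = sns.violinplot(x=df1['hotel'], y=df1['lead_time'])
ax.set_title('hotel v/s lead_time')
plt.show()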

##### 1. Why did you pick the specific chart?

I chose a violin plot here because it shows the distribution of the data pictorially, which makes comparison easy.

##### 2. What is/are the insight(s) found from the chart?

Compared with city hotels, guests book resort hotels slightly further in advance.

#### Chart - 12
12. Which country makes the most reservations?

In [None]:
# Chart - 12 visualization code
# rename_axis/reset_index(name=...) gives the same column names on both older and newer pandas
country_df = df1['country'].value_counts().rename_axis('country').reset_index(name='num_of_bookings')

# Plotting bar plot for the top 10 countries by number of bookings
top_countries = country_df.head(10)
plt.figure(figsize=(7,5))
ax=sns.barplot(x=top_countries['country'], y=top_countries['num_of_bookings'])
ax.set_title('Top 10 countries with number of bookings')
plt.show()

The majority of reservations come from PRT (Portugal). The top 5 countries by number of bookings are PRT, GBR, FRA, ESP, and DEU.

#### Chart - 13

13. Does a longer waiting period cause the cancellation of bookings?

In [None]:
# Chart - 13 visualization code
#Selecting bookings with non zero waiting time
waiting_time=df1[df1['days_in_waiting_list']!=0]

#ploting graph

plt.figure(figsize=(5,5))
ax=sns.kdeplot(x=waiting_time['days_in_waiting_list'], hue=waiting_time['is_canceled'])
ax.set_title('days_in_waiting_list')
plt.tight_layout()
plt.show()

Most canceled bookings have a waiting period of less than 150 days. Bookings that were not canceled also mostly have a waiting period of less than 150 days, and their density in this range is higher than that of the canceled bookings. So a longer waiting period does not appear to be a reason for booking cancellation.
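
One way to check this reading of the density plot is to compare the cancellation rate below and above the 150-day mark. The sketch below is only an illustration on a made-up `toy` frame; the bucket labels and the values are my own assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df1.
toy = pd.DataFrame({
    "days_in_waiting_list": [10, 20, 40, 160, 200, 0],
    "is_canceled":          [0, 1, 0, 1, 0, 0],
})

# Keep bookings with a non-zero waiting time, as in the chart above.
waited = toy[toy["days_in_waiting_list"] != 0].copy()
waited["wait_bucket"] = np.where(
    waited["days_in_waiting_list"] >= 150, "150+ days", "under 150 days"
)

# The mean of a 0/1 column is the share of canceled bookings in each bucket.
cancel_rate = waited.groupby("wait_bucket")["is_canceled"].mean()
print(cancel_rate)
```

If the two rates are close on the real data, that would support the conclusion that waiting time alone does not drive cancellations.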

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(15,8))
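# numeric_only=True (pandas 1.5+) skips text columns such as 'hotel', which recent pandas would otherwise reject
sns.heatmap(df1.corr(numeric_only=True), vmin=-1, cmap='coolwarm', annot=True)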
plt.show()

# **Heatmap conclusion**

In the heatmap, we see high correlation between a few variables because we created the new variables total_stay and total_people from existing variables and did not drop the old ones.
The variables lead_time and is_canceled are only weakly related. Among the variables shown, a longer lead time is still the most likely factor associated with cancellation, although a weak correlation does not establish a cause.
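
The high correlation caused by derived columns is easy to reproduce on a small example. In the sketch below, `toy` is a made-up frame with hypothetical values: `total_stay` is the sum of the two stay columns, so it correlates strongly with each of them.

```python
import pandas as pd

# Hypothetical toy data with the two stay columns used in this notebook.
toy = pd.DataFrame({
    "stays_in_weekend_nights": [0, 1, 2, 2, 4],
    "stays_in_week_nights": [1, 2, 3, 5, 10],
})
toy["total_stay"] = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]

# The derived column is almost perfectly correlated with its components.
corr = toy.corr()
print(corr.round(3))

# Dropping the component columns removes that built-in correlation.
reduced = toy.drop(columns=["stays_in_weekend_nights", "stays_in_week_nights"])
print(reduced.columns.tolist())
```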

# **Conclusion**

1 . The city hotel has about 61% of bookings and the resort hotel has about 39%.

2 . Agent no. 9 made the most bookings, with 28,721 bookings.
The percentage of repeated guests is just 4%.

3 . Room type A is the most preferred room type, with 46,283 bookings.

4 . BB is the most preferred meal type, with 67,907 bookings.

5 . August has the maximum number of bookings, with 11,242 bookings.

6 . The TA/TO distribution channel is the most used channel, with 69,028 bookings.

7 . The year 2016 has 42,313 bookings.

8 . The city hotel has the highest ADR, with an average ADR of 111.27.

9 . The city hotel has a longer waiting time, which suggests it is the busier hotel type.

10 . The GDS distribution channel contributed more to ADR, increasing income in the city hotel.

11 . The optimal stay length in both hotel types is less than 7 days.

12 . The city hotel has a longer waiting time. Therefore, the city hotel is much busier than the resort hotel.

13 . Most customers book hotels for two people (couples). Customers prefer city hotels over resorts for family bookings. A city hotel is preferred when booking for a single person.

14 . Compared with city hotels, guests book resort hotels slightly further in advance.

15 . The majority of reservations are made through the country PRT, and the top five countries with the highest booking volumes are PRT, GBR, FRA, ESP, and DEU.

16 . A longer waiting period does not appear to be a significant reason for booking cancellations, as the majority of canceled bookings and non-canceled bookings have waiting periods of less than 150 days, with non-canceled bookings having a higher density in this range.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***