
# **Project Name**    - Hotel Booking EDA



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

Hotel booking EDA, or exploratory data analysis, involves analyzing a large dataset of hotel bookings to identify trends, patterns, and insights. The dataset includes information on city hotels, resort hotels, and various other features related to bookings.

The dataset contains 32 columns, including information such as hotel type, arrival date, lead time, number of adults and children, meal plan, room type, and reservation status. With such a large amount of data, it is important to perform thorough exploratory data analysis to gain a better understanding of the dataset and identify potential issues or insights.

One key trend that emerges from the data is the seasonal nature of hotel bookings. The dataset includes bookings made between July 2015 and August 2017, and it is clear that there are peaks and valleys in hotel bookings throughout the year. For example, bookings tend to be highest in the summer months, particularly in August, while they are lowest in the winter months, particularly in December and January.

Another important trend to consider is the difference in booking patterns between city hotels and resort hotels. The dataset includes information on both types of hotels, and it is clear that there are significant differences in the way bookings are made. For example, resort hotels tend to have longer lead times than city hotels, and they are more likely to offer all-inclusive meal plans.

One interesting finding from the data is the relationship between lead time and cancellation rates. The data shows that bookings made further in advance are more likely to be cancelled, with cancellation rates declining as the booking date approaches. This may be due to the fact that people are more likely to change their plans the further in advance they make a booking, whereas last-minute bookings are more likely to be committed.

Another important insight from the data is the relationship between customer demographics and booking patterns. The data includes information on the number of adults and children in each booking, as well as the country of origin for each booking. By analyzing this information, it is possible to identify trends and patterns in the types of customers who book different types of hotels.

Overall, exploratory data analysis of the hotel booking dataset provides valuable insights into trends, patterns, and potential issues in the data. By understanding these trends and patterns, hotel operators can better understand their customers and make more informed decisions about pricing, marketing, and other business strategies.

# **GitHub Link -**

https://github.com/RiyazAhammad555/EDA-Project

# **Problem Statement**


The goal is to identify trends, patterns, and potential issues in a large dataset of hotel bookings, in order to gain insights into customer behavior and inform business strategies for hotel operators. Specifically, the project aims to answer questions such as: What are the seasonal patterns in hotel bookings? How do booking patterns differ between city hotels and resort hotels? What is the relationship between lead time and cancellation rates? What are the demographics of customers who book different types of hotels? By answering these questions, the project seeks to provide valuable insights that can help hotel operators optimize their pricing, marketing, and other business strategies to better serve their customers and improve profitability.

#### **Define Your Business Objective?**

The business objective of the hotel booking EDA project is to use data-driven insights to optimize pricing, marketing, and other business strategies for hotel operators, with the ultimate goal of improving customer satisfaction and profitability. Specifically, the project aims to help hotel operators:

1. Identify seasonal patterns in hotel bookings and adjust pricing and marketing strategies accordingly to maximize occupancy and revenue.

2. Understand the differences in booking patterns between city hotels and resort hotels, and tailor marketing and service offerings to better serve each customer segment.

3. Analyze the relationship between lead time and cancellation rates to optimize revenue management and minimize lost revenue due to cancellations.

4. Identify customer demographics and preferences for different types of hotels to tailor marketing and service offerings to specific customer segments.

By achieving these business objectives, hotel operators can improve customer satisfaction by offering personalized and tailored services, while also increasing revenue and profitability by optimizing pricing and marketing strategies.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following points are answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

In [None]:
#mounting drive to access csv file
from google.colab import drive
drive.mount('/content/drive')

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import seaborn as sns
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
path='/content/drive/MyDrive/Hotel Dataset/'
df=pd.read_csv(path+'Hotel Bookings.csv')

### Dataset First View

In [None]:
# Dataset First Look
df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print('Number of rows are',len(df.index))
print('Number of Columns are',len(df.columns))

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# True counts the rows that duplicate an earlier row; print it so the result shows even though it is not the last line of the cell
print(df.duplicated().value_counts())

# Drop the duplicate rows from the dataset
df.drop_duplicates(inplace=True)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum().sort_values(ascending=False) #to find the null values

In [None]:
# Visualizing the missing values

# Null values were found in the company, agent, country and children columns; replace them.
# The result is assigned back to the column because recent pandas versions warn about chained in-place fills.
df['company']=df['company'].fillna(0)   # 0 stands for "no company"
df['agent']=df['agent'].fillna(0)       # 0 stands for "no agent"
df['country']=df['country'].fillna('Others')
df['children']=df['children'].fillna(df['children'].mode()[0])

# Check whether any null values are left
df.isna().sum()
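
To judge how large each gap is relative to the dataset, the null counts can also be expressed as percentages. The sketch below runs on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the hotel data); on the real data the same expression would be applied to `df` before the fill step.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bookings data (values are made up)
toy = pd.DataFrame({
    'company': [np.nan, np.nan, np.nan, 40.0],
    'agent': [9.0, np.nan, 240.0, 9.0],
    'country': ['PRT', 'GBR', None, 'PRT'],
    'children': [0.0, 1.0, 0.0, 0.0],
})

# isna() gives True/False per cell; the column mean of that is the missing share
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

A column that is mostly empty, as `company` is in this toy frame, is usually better treated as a "has a company or not" flag than as a numeric ID.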

### What did you know about your dataset?

The hotel bookings dataset contains a wide range of information related to hotel bookings, such as:

1. Hotel information: the type of hotel, City Hotel or Resort Hotel.

2. Booking information: lead time, arrival year, month, week number and day of month, and the number of weekend and week nights stayed.

3. Customer information: the number of adults, children and babies, the country of origin, the customer type, and whether the guest is a repeated guest. The dataset does not contain names, ages or contact information.

4. Room information: the reserved and assigned room types, the meal plan, required car parking spaces and special requests.

5. Pricing information: the average daily rate (adr) and the deposit type.

6. Cancellation information: whether the booking was canceled, the number of previous cancellations, and the final reservation status with its date. The reason for a cancellation is not recorded.

Overall, the hotel bookings dataset can provide valuable insights into customer behavior, pricing and revenue management, and other key aspects of hotel operations. By analyzing this data, hotel operators can make more informed decisions to optimize their business strategies and improve customer satisfaction.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns  #there are 32 columns in the given data set

In [None]:
# Dataset Describe
df.describe()   #this will give the statistical information of different columns

### Variables Description 

This dataset contains booking information for a city hotel and a resort hotel. It contains the following features:
- hotel: Type of hotel (City Hotel or Resort Hotel)
- is_canceled: Whether the booking was canceled (0 for not canceled, 1 for canceled)
- lead_time: Time (in days) between the booking transaction and the actual arrival.
- arrival_date_year: Year of arrival
- arrival_date_month: month of arrival
- arrival_date_week_number: week number of arrival date.
- arrival_date_day_of_month: Day of month of arrival date
- stays_in_weekend_nights: No. of weekend nights spent in a hotel
- stays_in_week_nights: No. of weeknights spent in a hotel
- adults: No. of adults in single booking record.
- children: No. of children in single booking record.
- babies: No. of babies in single booking record. 
- meal: Type of meal chosen 
- country: Country of origin of customers (as mentioned by them)
- market_segment: Market segment through which the booking was made.
- distribution_channel: Channel through which the booking was made.
- is_repeated_guest: Whether the customer has made a booking before (0 for No, 1 for Yes)
- previous_cancellations: No. of previous canceled bookings.
- previous_bookings_not_canceled: No. of previous non-canceled bookings.
- reserved_room_type: Room type reserved by a customer.
- assigned_room_type: Room type assigned to the customer.
- booking_changes: No. of booking changes done by customers
- deposit_type: Type of deposit at the time of making a booking (No Deposit / Refundable / Non Refund)
- agent: Id of agent for booking
- company: Id of the company making a booking
- days_in_waiting_list: No. of days on waiting list.
- customer_type: Type of customer(Transient, Group, etc.)
- adr: Average daily rate.
- required_car_parking_spaces: No. of car parking spaces requested in the booking
- total_of_special_requests: Total no. of special requests.
- reservation_status: Last status of the reservation (checked out, canceled, or no-show)
- reservation_status_date: Date on which the last reservation status was set.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

# creating a dictionary to store unique values of each column
unique_value_dict={}

#creating list of columns
list_of_columns=list(df.columns)

for i in list_of_columns:
  unique_value_dict[i]=df[i].unique()

#unique value dict.
unique_value_dict
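
Printing every unique value is hard to scan for columns such as country or agent. As a compact alternative, the number of distinct values per column can be tabulated next to the data type. This is a minimal sketch on a made-up frame (`toy` is an assumption for illustration); on the notebook's data the same expression would use `df`.

```python
import pandas as pd

# Toy frame standing in for the bookings data (values are made up)
toy = pd.DataFrame({
    'hotel': ['City Hotel', 'Resort Hotel', 'City Hotel'],
    'meal': ['BB', 'HB', 'BB'],
    'adults': [2, 2, 1],
})

# One row per column: its data type and how many distinct values it holds
summary = pd.DataFrame({
    'dtype': toy.dtypes.astype(str),
    'n_unique': toy.nunique(),
})
print(summary)
```

Columns with only a handful of distinct values are natural candidates for count plots and pie charts, while high-cardinality columns are better shown as a top-N bar chart.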

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df1=df.copy() # work on a copy of the dataframe

# Remove rows where adults, children and babies add up to zero (bookings with no guests)
df1.drop(df1[df1['adults']+df1['babies']+df1['children']==0].index,inplace=True)

# Convert the data type of adults, babies and children to int
df1=df1.astype({'adults':int,'babies':int,'children':int})

# Convert reservation_status_date to a datetime column
df1['reservation_status_date']=pd.to_datetime(df1['reservation_status_date'],format='%Y-%m-%d')

# Add total stay and total people columns (both built from df1 so they match the filtered rows)
df1['total_stays']=df1['stays_in_weekend_nights']+df1['stays_in_week_nights']
df1['total_people']=df1['adults']+df1['children']+df1['babies']

### What all manipulations have you done and insights you found?

The dataset has 119390 observations and 32 columns.
While preparing the data I found missing values in the company, agent, country and children columns and replaced them with suitable values. I also found many duplicate rows and dropped them from the dataset to keep the calculations simple.

I identified the categorical and non-categorical variables, converted the guest-count columns to integers and the reservation status date to a datetime, and created two new columns, total_stays and total_people, which are used in the Exploratory Data Analysis that follows.
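
The wrangling steps described above can be sketched end to end on a few made-up rows. The frame `toy` and its values are assumptions for illustration only; the column names match the dataset.

```python
import pandas as pd

# Toy bookings (values are made up); the second row has no guests at all
toy = pd.DataFrame({
    'adults': [2, 0, 1],
    'children': [1.0, 0.0, 0.0],
    'babies': [0, 0, 0],
    'stays_in_weekend_nights': [2, 1, 0],
    'stays_in_week_nights': [3, 2, 1],
    'reservation_status_date': ['2015-07-01', '2015-07-02', '2016-01-15'],
})

# 1. Keep only bookings with at least one guest
guests = toy['adults'] + toy['children'] + toy['babies']
clean = toy[guests > 0].copy()

# 2. Guest counts as integers, status date as a real datetime
clean = clean.astype({'adults': int, 'children': int, 'babies': int})
clean['reservation_status_date'] = pd.to_datetime(clean['reservation_status_date'])

# 3. Derived columns used by the later charts
clean['total_stays'] = clean['stays_in_weekend_nights'] + clean['stays_in_week_nights']
clean['total_people'] = clean['adults'] + clean['children'] + clean['babies']
print(clean)
```

Building the derived columns from the already filtered frame keeps every value aligned with the rows that remain.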

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
#Which agent makes the most no. of bookings
plt.rcParams['figure.figsize']=(10,5)
top_agent=df1['agent'].value_counts()
top_agent.iloc[:10].plot(kind='bar')
plt.title('Top agent with most bookings')
plt.xlabel('Agent IDs')
plt.ylabel('No of bookings')

Agent with most number of bookings is "9.0"

##### 1. Why did you pick the specific chart?

A bar plot makes it easy to spot the maximum value.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that agent 9 has made the most bookings across both hotels. If agent ID 0 appears in the chart, it is the placeholder used for missing agent values, not a real agent.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

It will have a positive impact: if hotels encourage their top agents, those agents are likely to bring in more customers.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#Which meal type is the most preferred meal of customers
sns.countplot(x=df1['meal'])

The most preferred meal is "BB"

##### 1. Why did you pick the specific chart?

A count plot shows the number of observations in each category as a separate bar, which makes it easy to understand.

##### 2. What is/are the insight(s) found from the chart?

From the chart we found that the most preferred meal is "BB" (bed and breakfast); "HB" (half board) and "SC" (self catering) come next.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

It will have a positive impact: since most guests choose the BB meal, hotels can plan their breakfast capacity and supplies around that demand.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
#what is the percentage of bookings in each hotel 
hotel_bookings=df1.groupby('hotel').size()
values=hotel_bookings.values
labels=hotel_bookings.index
plt.pie(values,labels=labels,autopct='%1.2f%%')

##### 1. Why did you pick the specific chart?

A pie chart is a compact way to compare the percentage shares of a small number of categories.

##### 2. What is/are the insight(s) found from the chart?

From the pie chart, the city hotel accounts for 61.13% of the bookings and the resort hotel for 38.87%.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing each hotel's share of the bookings helps the hotels plan staffing and maintenance in proportion to demand.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

#from which country most of the guests are coming ?

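# value_counts() is already sorted in descending order; naming the columns explicitly
# keeps the code independent of the column names value_counts() produces in different pandas versions
country=df1['country'].value_counts()[0:10].rename_axis('country').reset_index(name='bookings')
sns.barplot(x='country',y='bookings',data=country)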

##### 1. Why did you pick the specific chart?

A bar plot works well when comparing more than five categories.

##### 2. What is/are the insight(s) found from the chart?

From the chart we found that most guests come from PRT, i.e. Portugal.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

It will have a positive impact: knowing which countries most guests come from, hotels can tailor their offerings, for example by preparing authentic food for those guests.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
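#Which hotel has higher bookings cancellation rate

# is_canceled is 0/1, so its mean per hotel is that hotel's cancellation rate (.size() would only count bookings)
cancellations=df1.groupby('hotel')['is_canceled'].mean()*100
cancellations.plot(kind='bar')
plt.ylabel('Cancellation rate (%)')
plt.title('Cancellation rate by hotel')

##### 1. Why did you pick the specific chart?

A bar chart lets us compare the cancellation rates of the city and resort hotels side by side. The two rates are measured against different totals and do not add up to 100%, so a pie chart would be misleading here.

##### 2. What is/are the insight(s) found from the chart?

The bar heights give each hotel's cancellation rate, i.e. the share of its own bookings that were canceled.
Note that the split of 61.13% (city hotel) and 38.87% (resort hotel) is each hotel's share of all bookings, as in Chart 3, and not a cancellation rate.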

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

It'll help the hotels to look into their service and to think about the ways to reduce the percentage of cancellations.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
#Which hotel seems to make more revenue
#total_stays = stays_in_weekend_nights + stays_in_week_nights, used here as a rough proxy for revenue
stays_for_revenue=df1.groupby('hotel')['total_stays'].sum().reset_index()
sns.barplot(x='hotel',y='total_stays',data=stays_for_revenue)
plt.title('Revenue based on stays')

##### 1. Why did you pick the specific chart?

A bar plot is used because we are comparing the total stays of each hotel.

##### 2. What is/are the insight(s) found from the chart?

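The city hotel has more total nights booked, which suggests that it generates more revenue than the resort hotel. This is only a proxy: nights from canceled bookings are included and the room rate (adr) is not taken into account.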

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing each hotel's revenue helps the hotels with their accounting and budgeting.
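
Total nights are only a rough stand-in for revenue. A closer estimate multiplies the average daily rate by the nights stayed and leaves out canceled bookings. The sketch below shows the idea on made-up rows (`toy`, its values and the `est_revenue` column are assumptions for illustration); it still ignores extras, discounts and any charges kept from canceled bookings.

```python
import pandas as pd

# Toy bookings (values are made up)
toy = pd.DataFrame({
    'hotel': ['City Hotel', 'City Hotel', 'Resort Hotel', 'Resort Hotel'],
    'is_canceled': [0, 1, 0, 0],
    'adr': [100.0, 80.0, 150.0, 50.0],
    'total_stays': [3, 4, 4, 2],
})

# Only bookings that were not canceled are counted as room revenue
kept = toy[toy['is_canceled'] == 0].copy()

# Estimated revenue per booking = average daily rate x nights stayed
kept['est_revenue'] = kept['adr'] * kept['total_stays']
revenue_by_hotel = kept.groupby('hotel')['est_revenue'].sum()
print(revenue_by_hotel)
```

In this toy data the city hotel has more booked nights (7 against 6) but less estimated revenue, which is why the two measures should not be treated as interchangeable.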

#### Chart - 7

In [None]:
# Chart - 7 visualization code
#Which hotel has a  high chance that its customer will return for another stay ?
repeated_guests=df1.groupby('hotel')['is_repeated_guest'].value_counts()
plt.pie(repeated_guests.loc[['City Hotel','Resort Hotel'],1].values,labels=['City Hotel','Resort Hotel'],autopct='%1.2f%%')
plt.title("Repeated guests percentage")

##### 1. Why did you pick the specific chart?

A pie chart shows how the repeated guests are split between the two hotels.

##### 2. What is/are the insight(s) found from the chart?

The repeated guests are split almost evenly between the two hotels, with the resort hotel slightly ahead. Because the resort hotel has far fewer bookings overall (38.87%, see Chart 3), a near-even split means that a resort booking is more likely to come from a repeated guest.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels can collect feedback from repeated guests. Since these guests chose to book again, their feedback is likely to be positive and can help promote the business.
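
The chance that a booking comes from a returning customer is a rate per hotel, which differs from each hotel's share of all repeated guests. A minimal sketch of both quantities on made-up rows (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy bookings: 6 city and 4 resort bookings, one repeated guest each (values are made up)
toy = pd.DataFrame({
    'hotel': ['City Hotel'] * 6 + ['Resort Hotel'] * 4,
    'is_repeated_guest': [1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
})

# Rate: share of each hotel's own bookings made by repeated guests
repeat_rate = toy.groupby('hotel')['is_repeated_guest'].mean() * 100

# Share: how all repeated guests are split between the hotels (what a pie of counts shows)
repeats = toy.loc[toy['is_repeated_guest'] == 1, 'hotel']
share_of_repeats = repeats.value_counts(normalize=True) * 100

print(repeat_rate)
print(share_of_repeats)
```

Here the repeated guests are split 50/50, yet the resort hotel's repeat rate is higher because it has fewer bookings in total.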

#### Chart - 8

In [None]:
# Chart - 8 visualization code

#which year has generated most ADR

adr_df=df1.groupby('arrival_date_year')['adr'].sum().reset_index()
sns.barplot(x='arrival_date_year',y='adr',data=adr_df)
plt.title('ADR for Each Year')

##### 1. Why did you pick the specific chart?

To see which year generated the most ADR.

##### 2. What is/are the insight(s) found from the chart?

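The summed ADR is highest in 2016 and lowest in 2015. Note that 2016 is the only complete year in the data (bookings run from July 2015 to August 2017), so the sum mostly reflects how many bookings each year contains rather than a change in room rates.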

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

It will help hotels in maintaining their accounts.
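
Because ADR is already an average rate per booking, its yearly sum grows with the number of bookings. Putting the sum, the mean and the booking count next to each other makes the difference visible. A minimal sketch on made-up rows (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy bookings: 2016 has the most rows, 2017 the highest rates (values are made up)
toy = pd.DataFrame({
    'arrival_date_year': [2015, 2016, 2016, 2016, 2017, 2017],
    'adr': [80.0, 100.0, 100.0, 100.0, 120.0, 120.0],
})

# Sum, mean and number of bookings per arrival year
adr_by_year = toy.groupby('arrival_date_year')['adr'].agg(['sum', 'mean', 'count'])
print(adr_by_year)
```

In this toy data 2016 has the largest sum only because it has the most bookings, while the mean rate is highest in 2017. For comparing rates between years the mean is usually the more meaningful figure.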

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Which is the most common channel for booking hotels?
channels=df1['distribution_channel'].value_counts()
plt.figure(figsize = (10,8))
# explode needs one entry per slice, so its length follows the number of channels
plt.pie(channels.values,labels=channels.index,autopct="%0.2f%%",pctdistance=0.5,explode=[0.05]*len(channels),startangle=30)
plt.title("Booking percentage of Distribution Channels")

##### 1. Why did you pick the specific chart?

To visualize the booking percentage of each distribution channel.

##### 2. What is/are the insight(s) found from the chart?

The TA/TO (travel agents / tour operators) distribution channel has the highest percentage of bookings.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels can attract more guests through the other channels as well, for example by offering discounts.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
#Which are the most busy months?
# Name the columns explicitly so the code does not depend on the column names value_counts() produces
months_df=df1['arrival_date_month'].value_counts().rename_axis('months').reset_index(name='count')[0:5]
sns.barplot(x='months',y='count',data=months_df)
plt.title('Top 5 Busy months')

##### 1. Why did you pick the specific chart?

A bar plot makes it easy to compare the booking counts of different months.

##### 2. What is/are the insight(s) found from the chart?

August is the busiest month, with July close behind.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels can use this data to prepare their arrangements ahead of the busy months.
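
A top-5 chart hides the seasonal shape of the year. To see it, the monthly counts can be put in calendar order, with months that have no bookings kept as zero. A minimal sketch on a made-up series (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy arrival months (values are made up)
toy = pd.Series(
    ['August', 'July', 'August', 'January', 'May', 'July', 'August'],
    name='arrival_date_month',
)

# Calendar order, written out so the result does not depend on locale settings
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

# Count bookings per month, then reorder; months without bookings become 0
counts = toy.value_counts().reindex(month_order, fill_value=0)
print(counts)
```

Plotting `counts` as a bar or line chart then shows the summer peak and the winter dip in their natural order.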

#### Chart - 11

In [None]:
# Chart - 11 visualization code
#Which types of customers mostly make bookings?

# Name the columns explicitly so the code does not depend on the column names value_counts() produces
cust_df=df1['customer_type'].value_counts().rename_axis('customer type').reset_index(name='count')
sns.barplot(x='customer type',y='count',data=cust_df)
plt.title('Types of customers')

##### 1. Why did you pick the specific chart?

To visualize how many bookings each type of customer makes.

##### 2. What is/are the insight(s) found from the chart?

Transient customers have made the most bookings.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels can tailor their services to each customer type, giving priority to the largest group.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
#Which significant distribution channel has the highest cancellation percentage?

dist=df1.groupby('distribution_channel')['is_canceled'].sum().sort_values(ascending=False)
plt.pie(dist.values,labels=dist.index,autopct="%1.2f%%")

##### 1. Why did you pick the specific chart?

To see how the canceled bookings are distributed across the distribution channels.

##### 2. What is/are the insight(s) found from the chart?

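The TA/TO channel accounts for the largest share of all canceled bookings. Because the pie is built from cancellation counts, this reflects TA/TO's booking volume as well as its cancellation behaviour; it is not a per-channel cancellation rate.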
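
A channel's share of all cancellations and its own cancellation rate can point in different directions, so it helps to compute both. A minimal sketch on made-up rows (`toy`, its values and the derived column names are assumptions for illustration):

```python
import pandas as pd

# Toy bookings: TA/TO has many bookings, Direct has few (values are made up)
toy = pd.DataFrame({
    'distribution_channel': ['TA/TO'] * 8 + ['Direct'] * 2,
    'is_canceled': [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

# Canceled bookings and total bookings per channel
by_channel = toy.groupby('distribution_channel')['is_canceled'].agg(
    canceled='sum', bookings='count'
)

# Rate: canceled bookings relative to the channel's own bookings
by_channel['cancel_rate_pct'] = by_channel['canceled'] / by_channel['bookings'] * 100

# Share: canceled bookings relative to all cancellations (what a pie of counts shows)
by_channel['share_of_cancellations_pct'] = (
    by_channel['canceled'] / by_channel['canceled'].sum() * 100
)
print(by_channel)
```

In this toy data TA/TO holds the larger share of cancellations simply because it has more bookings, while Direct has the higher rate. Which channel has the highest rate in the real data has to be read from the computed table.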

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Numeric columns only
list_of_numeric=list(df1.select_dtypes(include='number').columns)
# adults, children, babies and the two stay columns are left out because total_people and total_stays already summarize them
columns_to_be_removed=['adults','babies','children','stays_in_weekend_nights','stays_in_week_nights']
corr_list=[x for x in list_of_numeric if x not in columns_to_be_removed]
# Plot the correlation matrix, not the raw values
plt.figure(figsize=(12,8))
sns.heatmap(df1[corr_list].corr(),annot=True,fmt='.2f',cmap='coolwarm')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot 

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***