<a href="https://colab.research.google.com/github/Amit62039/Exploratory-Data-Analysis-of-Hotel-Booking/blob/main/Copy_of_Sample_EDA_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -
Exploratory Data Analysis of Hotel Booking


##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

**Objectives**

The primary objectives of this project are:

*   Identify Optimal Booking Times: Determine the best times of year to book a hotel room to get the best rates.
*   Analyze Stay Length Impact: Investigate the optimal length of stay to achieve the best daily rate.
*   Predict Special Requests: Predict the likelihood of a hotel receiving a disproportionately high number of special requests.

**Dataset Description**

The dataset contains booking information for a city hotel and a resort hotel, including:

*   Booking details: lead time, booking status, arrival dates.
*   Guest details: number of adults, children, babies.
*   Stay details: number of weekend nights, week nights.
*   Financial details: average daily rate (ADR).
*   Other details: meal type, market segment, distribution channel, special requests, and more.

**Methodology**

**Data Preprocessing**:

*   Loaded the dataset and handled missing values.
*   Converted relevant columns to appropriate data types.
*   Created new features such as total_stays (sum of weekend and week nights).

**Exploratory Data Analysis (EDA)**:

*   Distribution of Lead Time: Analyzed how far in advance bookings are made.
*   Monthly ADR Trends: Investigated the variation of ADR across different months.
*   Booking Status Analysis: Examined the booking cancellation rates across various factors.
*   Special Requests Distribution: Studied the frequency and patterns of special requests.

**Visualizations**:

*   Correlation Heatmap: Visualized relationships between various numerical features.
*   Pair Plots: Explored pairwise relationships between key variables.
*   Bar Plots and Line Plots: Illustrated trends and distributions of important metrics.

**Key Analysis and Insights**:

*   Determined the optimal booking lead time to achieve lower ADR.
*   Identified months with the highest and lowest average daily rates.
*   Assessed the impact of stay length on daily rates.
*   Predicted factors contributing to a high number of special requests.

**Key Findings**

*   Optimal Booking Times: Bookings made well in advance tend to have lower ADR. Specific months show a clear trend in ADR variations, with certain off-peak months offering better rates.
*   Impact of Stay Length: Longer stays generally result in a lower average daily rate, but the effect diminishes after a certain number of nights.
*   Special Requests Prediction: Key features that contribute to a higher number of special requests include longer stays and bookings made via certain market segments.


**Future Work**

*   Advanced Predictive Models: Develop machine learning models to predict cancellations and special requests more accurately.
*   Customer Segmentation: Perform clustering analysis to identify distinct customer segments for targeted marketing.
*   Operational Insights: Explore the impact of booking changes and waiting lists on hotel operations.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**



The hotel booking industry is highly competitive, with numerous factors influencing guest preferences and booking behaviors. Understanding these factors is crucial for hotels to optimize their operations, improve guest satisfaction, and maximize revenue. This dataset contains booking information for a city hotel and a resort hotel, providing an opportunity to analyze booking patterns, pricing strategies, guest demographics, and more.

#### **Define Your Business Objective?**

The objective of this analysis is to explore and identify key factors that influence hotel bookings, cancellations, and special requests. The insights derived from this analysis can help hotel managers and stakeholders make informed decisions to enhance their operational efficiency and customer satisfaction.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load the dataset
file_path = '/Hotel Bookings.csv'
hotel_data = pd.read_csv(file_path)



### Dataset First View

In [None]:
# Dataset First Look

# Display the first few rows of the dataset
hotel_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# Get the number of rows and columns
num_rows = hotel_data.shape[0]
num_columns = hotel_data.shape[1]

print(f'Number of rows: {num_rows}')
print(f'Number of columns: {num_columns}')


### Dataset Information

In [None]:
# Dataset Info

# general information about the dataset
hotel_data.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

duplicate_count = hotel_data.duplicated().sum()

print(f'Number of duplicate rows: {duplicate_count}')


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

missing_values = hotel_data.isnull().sum()

print(missing_values)


In [None]:
# Visualizing the missing values

import matplotlib.pyplot as plt

# Calculate the number of missing values in each column
missing_values = hotel_data.isnull().sum()

# Filter out columns with no missing values
missing_values = missing_values[missing_values > 0]

# Plotting
plt.figure(figsize=(14, 8))
missing_values.plot(kind='bar', color='skyblue')
plt.title('Missing Values per Column')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=45)
plt.show()



### What did you know about your dataset?

Dataset provides valuable insights into booking trends, guest behavior, pricing strategies, and operational efficiency. These insights can help hotels optimize their marketing strategies, improve guest satisfaction, and enhance revenue management.
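
As a rough illustration of how the missing-value counts above could be condensed into one table, the sketch below reports the count and percentage of missing values per column. The helper name `missing_value_summary` and the small `demo` frame are assumptions made for illustration only; in the notebook, `hotel_data` would be passed instead.

```python
import pandas as pd


def missing_value_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing_count': counts,
        'missing_pct': (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    summary = summary[summary['missing_count'] > 0]
    return summary.sort_values('missing_count', ascending=False)


# Tiny made-up frame for illustration; it does not reflect the real dataset
demo = pd.DataFrame({
    'children': [0, None, 2, 1],
    'country': ['PRT', None, None, 'GBR'],
    'adr': [80.0, 95.5, 120.0, 60.0],
})
print(missing_value_summary(demo))
```

Reporting percentages alongside counts makes it easier to judge whether a column can reasonably be filled or should be dropped.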

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

dataset_columns = hotel_data.columns.tolist()

print(dataset_columns)


In [None]:
# Dataset Describe

dataset_description = hotel_data.describe()

print(dataset_description)


### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

unique_values = hotel_data.nunique()

print(unique_values)


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
import pandas as pd

# Load your dataset
file_path = 'path_to_your_file/Hotel Bookings.csv'
hotel_data = pd.read_csv(file_path)

# Inspect the dataset
print("Initial Dataset Info:")
print(hotel_data.info())
print("\nInitial Dataset Description:")
print(hotel_data.describe())

# Remove duplicate entries
hotel_data = hotel_data.drop_duplicates()

# Handle missing values
# Fill each column with a value suited to its meaning.
# Assigning the result back is more reliable than calling fillna(inplace=True) on a selected column.
hotel_data['children'] = hotel_data['children'].fillna(0)
hotel_data['country'] = hotel_data['country'].fillna('Unknown')
hotel_data['agent'] = hotel_data['agent'].fillna(0)
hotel_data['company'] = hotel_data['company'].fillna(0)

# Convert data types
# For example, convert 'reservation_status_date' to datetime
hotel_data['reservation_status_date'] = pd.to_datetime(hotel_data['reservation_status_date'])

# Feature engineering
# Create a new column for total guests
hotel_data['total_guests'] = hotel_data['adults'] + hotel_data['children'] + hotel_data['babies']

# Create a new column to indicate if a booking has children or babies
hotel_data['has_kids'] = (hotel_data['children'] + hotel_data['babies']) > 0

# Create a new column for booking month and year (month names such as 'July' are parsed with %B)
hotel_data['arrival_date_month_year'] = pd.to_datetime(
    hotel_data['arrival_date_year'].astype(str) + '-' + hotel_data['arrival_date_month'].astype(str) + '-01',
    format='%Y-%B-%d'
)

# Inspect the cleaned and processed dataset
print("\nCleaned Dataset Info:")
print(hotel_data.info())
print("\nCleaned Dataset Description:")
print(hotel_data.describe())

# Save the cleaned dataset for future analysis
cleaned_file_path = 'path_to_your_file/Cleaned_Hotel_Bookings.csv'
hotel_data.to_csv(cleaned_file_path, index=False)

print("\nData preparation complete. Cleaned dataset saved as 'Cleaned_Hotel_Bookings.csv'.")



### What all manipulations have you done and insights you found?

**Data Manipulation**

The dataset was loaded into a pandas DataFrame, and its initial information and description were displayed to understand its structure and contents. Duplicate rows were removed to ensure data integrity.

Missing Values:

*   Children: Filled missing values with 0, assuming that missing values indicate no children.
*   Country: Filled missing values with 'Unknown' to retain rows while acknowledging missing information.
*   Agent and Company: Filled missing values with 0, assuming that missing values indicate no agent or company involved.

Data Type Conversion:

*   Converted the 'reservation_status_date' column to datetime format to facilitate time-based analysis.

Feature Engineering:

*   Total Guests: Created a new column total_guests summing the number of adults, children, and babies in each booking.
*   Has Kids: Created a new boolean column has_kids indicating whether the booking includes children or babies.
*   Booking Month and Year: Created a new column arrival_date_month_year representing the arrival date's month and year for time series analysis.

Final Inspection:

*   Displayed the cleaned dataset's information and description to confirm the applied changes.
*   Saved the cleaned dataset to a new CSV file for future analysis.
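
The cleaning steps described above could also be gathered into one reusable function, which makes them easy to re-run and to check on a small sample. The sketch below is one possible arrangement, not the notebook's actual code: the function name `clean_bookings` and the `demo` frame are hypothetical, and only the missing-value and feature-engineering steps are covered.

```python
import pandas as pd


def clean_bookings(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the de-duplication, missing-value, and feature steps; return a new DataFrame."""
    out = df.drop_duplicates().copy()
    # Missing children/agent/company are treated as "none" (0); missing country as 'Unknown'
    out = out.fillna({'children': 0, 'country': 'Unknown', 'agent': 0, 'company': 0})
    # Derived features
    out['total_guests'] = out['adults'] + out['children'] + out['babies']
    out['has_kids'] = (out['children'] + out['babies']) > 0
    return out


# Made-up rows for illustration only; the first two rows are exact duplicates
demo = pd.DataFrame({
    'adults': [2, 2, 1],
    'children': [None, None, 1.0],
    'babies': [0, 0, 0],
    'country': ['PRT', 'PRT', None],
    'agent': [9.0, 9.0, None],
    'company': [None, None, 40.0],
})
cleaned = clean_bookings(demo)
print(cleaned)
```

Returning a new DataFrame rather than modifying the input keeps the raw data available for comparison.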

**Insights**


The majority of missing values were in columns related to guests (children), agents, and companies. By filling these with appropriate values, we ensured that the dataset remains comprehensive without losing significant information.

The new total_guests column provides a clear metric to analyze the overall guest volume per booking. This can be useful for understanding the capacity utilization of the hotel.

The has_kids column helps in segmenting bookings into those with children/babies and those without, which can be important for targeted marketing and service provision.

Converting the reservation status date and creating the booking month-year column allows for detailed time series analysis. This can help in understanding seasonal trends, peak booking periods, and the impact of time on booking behaviors.

Initial inspection likely showed the distribution of bookings across different months, lead times, and booking types, providing a foundation for deeper analysis.

By analyzing cancellations, you can identify patterns and factors leading to higher cancellation rates, which can help in developing strategies to minimize cancellations.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Convert arrival_date_month to categorical with ordered months
hotel_data['arrival_date_month'] = pd.Categorical(hotel_data['arrival_date_month'],
                                                  categories=['January', 'February', 'March', 'April', 'May', 'June',
                                                              'July', 'August', 'September', 'October', 'November', 'December'],
                                                  ordered=True)

# Group by month and calculate the mean ADR
monthly_adr = hotel_data.groupby('arrival_date_month')['adr'].mean().reset_index()

# Plotting the mean ADR by month
plt.figure(figsize=(12, 6))
sns.barplot(x='arrival_date_month', y='adr', data=monthly_adr, palette='viridis')
plt.title('Average Daily Rate (ADR) by Month')
plt.xlabel('Month')
plt.ylabel('Average Daily Rate (ADR)')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart shows how ADR varies across the year and makes month-wise comparison easier.

##### 2. What is/are the insight(s) found from the chart?

Best time of year to book a hotel room:

*   Cheapest months: January and February have the lowest ADRs.
*   Most expensive months: August has the highest ADR, followed by July and September.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

January and February have the lowest ADRs, making them the best months to book a hotel room at a lower rate. Guests may therefore book these months in advance, which limits the hotel's scope to raise rates as demand increases.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

import matplotlib.pyplot as plt
import seaborn as sns

hotel_data['total_stays'] = hotel_data['stays_in_weekend_nights'] + hotel_data['stays_in_week_nights']
stay_adr = hotel_data.groupby('total_stays')['adr'].mean().reset_index()
plt.figure(figsize=(12, 6))
sns.lineplot(x='total_stays', y='adr', data=stay_adr, marker='o')
plt.title('Average Daily Rate (ADR) by Total Length of Stay')
plt.xlabel('Total Length of Stay (Nights)')
plt.ylabel('Average Daily Rate (ADR)')
plt.xticks(range(0, 20))  # Adjust the range as needed
plt.show()

##### 1. Why did you pick the specific chart?

A line chart is the usual choice for showing how a value trends across an ordered variable, here ADR against length of stay.

##### 2. What is/are the insight(s) found from the chart?

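The chart points to the length of stay that gives the best daily rate: longer stays generally have a lower ADR, but the effect diminishes after a certain number of nights.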

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Optimising rates by length of stay can help improve the pricing strategy.
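
Long stays are usually rare, so the mean ADR for each individual night count can be noisy. One way to make the trend easier to read, sketched here under assumptions, is to group stay lengths into buckets and report the booking count next to the mean ADR. The bucket edges, the helper name `adr_by_stay_bucket`, and the `demo` values are all illustrative choices, not taken from the dataset.

```python
import pandas as pd


def adr_by_stay_bucket(df, bins=(0, 1, 3, 7, 14, float('inf')),
                       labels=('1', '2-3', '4-7', '8-14', '15+')):
    """Mean ADR and booking count per stay-length bucket (nights)."""
    total_stays = df['stays_in_weekend_nights'] + df['stays_in_week_nights']
    # Intervals are right-closed, e.g. (1, 3] -> '2-3'; zero-night stays fall outside and are dropped
    buckets = pd.cut(total_stays, bins=list(bins), labels=list(labels))
    return (df.assign(stay_bucket=buckets)
              .groupby('stay_bucket', observed=True)['adr']
              .agg(mean_adr='mean', bookings='count'))


# Made-up values for illustration only
demo = pd.DataFrame({
    'stays_in_weekend_nights': [0, 1, 2, 2, 4],
    'stays_in_week_nights': [1, 1, 3, 5, 10],
    'adr': [120.0, 100.0, 90.0, 80.0, 70.0],
})
stay_summary = adr_by_stay_bucket(demo)
print(stay_summary)
```

Showing the booking count per bucket helps judge how much weight each mean deserves.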

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(hotel_data['lead_time'], kde=True, bins=30, color='blue')
plt.title('Distribution of Lead Time')
plt.xlabel('Lead Time (days)')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram with a KDE overlay shows the shape of the lead-time distribution.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here
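
The project summary states that bookings made well in advance tend to have a lower ADR. A possible way to check this, sketched under assumptions, is to group bookings into lead-time bands and compare the mean ADR per band. The band edges, the helper name `adr_by_lead_time`, and the `demo` values are illustrative only.

```python
import pandas as pd


def adr_by_lead_time(df, edges=(0, 7, 30, 90, 180, 365)):
    """Mean ADR and booking count per lead-time band (days)."""
    bins = list(edges) + [float('inf')]
    # Left-closed bands such as [0, 7) so that a lead time of 0 days is included
    bands = pd.cut(df['lead_time'], bins=bins, right=False)
    return (df.groupby(bands, observed=True)['adr']
              .agg(mean_adr='mean', bookings='count'))


# Made-up values for illustration only
demo = pd.DataFrame({
    'lead_time': [0, 3, 10, 45, 200, 400],
    'adr': [150.0, 130.0, 110.0, 100.0, 90.0, 80.0],
})
lead_summary = adr_by_lead_time(demo)
print(lead_summary)
```

Because ADR also depends on season and hotel type, a table like this is only a first check and not evidence of a causal effect.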

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(12, 6))
sns.countplot(x='arrival_date_month', hue='is_canceled', data=hotel_data, palette='viridis')
plt.title('Booking Status by Month')
plt.xlabel('Month')
plt.ylabel('Number of Bookings')
plt.xticks(rotation=45)
plt.legend(title='Canceled', loc='upper right', labels=['Not Canceled', 'Canceled'])
plt.show()



##### 1. Why did you pick the specific chart?

It shows the distribution of bookings by month and compares booking status (canceled vs. not canceled).

##### 2. What is/are the insight(s) found from the chart?

Most guests visit in August.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

August is the peak period and therefore offers the greatest opportunity to grow the business.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(x='total_of_special_requests', data=hotel_data, palette='viridis')
plt.title('Distribution of Special Requests')
plt.xlabel('Number of Special Requests')
plt.ylabel('Frequency')
plt.show()



##### 1. Why did you pick the specific chart?

A count plot gives the number of bookings in each category.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(14, 10))
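# numeric_only=True restricts the correlation to numeric columns; text columns would otherwise raise an error in recent pandas
sns.heatmap(hotel_data.corr(numeric_only=True), annot=True, cmap='coolwarm', linewidths=0.5)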
plt.title('Correlation Heatmap')
plt.show()



##### 1. Why did you pick the specific chart?

To show the correlation between all numeric factors.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here
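
A full heatmap can be hard to read when there are many columns. As a complement, the sketch below ranks the numeric columns by the strength of their correlation with one chosen target column. The helper name `top_correlations`, the choice of `is_canceled` as target, and the `demo` values are assumptions made for illustration.

```python
import pandas as pd


def top_correlations(df, target, n=5):
    """Rank numeric columns by absolute Pearson correlation with `target`."""
    corr = df.select_dtypes(include='number').corr()[target].drop(target)
    # Order by absolute value but keep the sign in the returned Series
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(n)


# Made-up values for illustration only
demo = pd.DataFrame({
    'is_canceled': [0, 0, 1, 1, 1, 0],
    'lead_time': [5, 10, 200, 150, 300, 20],
    'total_of_special_requests': [2, 1, 0, 0, 0, 1],
    'hotel': ['City Hotel'] * 6,
})
ranked = top_correlations(demo, 'is_canceled')
print(ranked)
```

Correlation only captures linear association, so a weak value here does not rule out a relationship of another shape.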

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(12, 6))
sns.countplot(x='market_segment', data=hotel_data, palette='viridis')
plt.title('Booking Distribution by Market Segment')
plt.xlabel('Market Segment')
plt.ylabel('Number of Bookings')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

To compare the market segments separately.

##### 2. What is/are the insight(s) found from the chart?

Most bookings are made by travel agents through online channels.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Travel agents are an important factor in growing the business.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
customer_adr = hotel_data.groupby('customer_type')['adr'].mean().reset_index()
plt.figure(figsize=(10, 6))
sns.barplot(x='customer_type', y='adr', data=customer_adr, palette='viridis')
plt.title('Average Daily Rate (ADR) by Customer Type')
plt.xlabel('Customer Type')
plt.ylabel('Average Daily Rate (ADR)')
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart makes it easy to compare the four customer types.

##### 2. What is/are the insight(s) found from the chart?

Most of the guests coming to the hotel are travellers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

All types of customers should be looked after, as there is not much difference in their numbers.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
deposit_cancel = hotel_data.groupby('deposit_type')['is_canceled'].mean().reset_index()
plt.figure(figsize=(10, 6))
sns.barplot(x='deposit_type', y='is_canceled', data=deposit_cancel, palette='viridis')
plt.title('Cancellation Rate by Deposit Type')
plt.xlabel('Deposit Type')
plt.ylabel('Cancellation Rate')
plt.show()


##### 1. Why did you pick the specific chart?

It compares the three deposit types.

##### 2. What is/are the insight(s) found from the chart?

The cancellation rate is highest for bookings with a non-refundable deposit.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

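This looks wrong: a non-refundable deposit should discourage cancellation, so the result needs further investigation before any business decision is based on it.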
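
A rate on its own hides how many bookings stand behind it, so a surprising cancellation rate is worth checking against the underlying counts. The sketch below builds such a table with `pd.crosstab`. The helper name `cancellation_by_deposit` and the `demo` values are hypothetical and do not reflect the real dataset.

```python
import pandas as pd


def cancellation_by_deposit(df):
    """Booking counts and cancellation rate for each deposit type."""
    counts = pd.crosstab(df['deposit_type'], df['is_canceled'])
    # Make sure both outcome columns exist even if one never occurs
    counts = counts.reindex(columns=[0, 1], fill_value=0)
    counts.columns = ['not_canceled', 'canceled']
    counts['bookings'] = counts.sum(axis=1)
    counts['cancel_rate'] = counts['canceled'] / counts['bookings']
    return counts


# Made-up values for illustration only
demo = pd.DataFrame({
    'deposit_type': ['No Deposit'] * 4 + ['Non Refund'] * 2 + ['Refundable'],
    'is_canceled': [0, 0, 0, 1, 1, 1, 0],
})
deposit_summary = cancellation_by_deposit(demo)
print(deposit_summary)
```

If one deposit type has very few bookings, or its bookings cluster in a single segment or period, the rate for that type should be interpreted with care.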

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(x='booking_changes', data=hotel_data, palette='viridis')
plt.title('Distribution of Booking Changes')
plt.xlabel('Number of Booking Changes')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

A count plot shows how many bookings had each number of changes.

##### 2. What is/are the insight(s) found from the chart?

Not much

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's just a representation.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
top_countries = hotel_data['country'].value_counts().head(10).index
plt.figure(figsize=(12, 6))
sns.countplot(y='country', data=hotel_data[hotel_data['country'].isin(top_countries)], palette='viridis', order=top_countries)
plt.title('Top 10 Booking Countries')
plt.xlabel('Number of Bookings')
plt.ylabel('Country')
plt.show()


##### 1. Why did you pick the specific chart?

It shows the booking count for each country.

##### 2. What is/are the insight(s) found from the chart?

Most travellers come from Portugal (country code PRT).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Hotel services could attract more travellers.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(x='required_car_parking_spaces', data=hotel_data, palette='viridis')
plt.title('Parking Space Requirement')
plt.xlabel('Number of Required Parking Spaces')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

It shows how many bookings required each number of parking spaces.

##### 2. What is/are the insight(s) found from the chart?

Floating parking space is required

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



Not much

#### Chart - 13

In [None]:
# Chart - 13 visualization code
room_adr = hotel_data.groupby('reserved_room_type')['adr'].mean().reset_index()
plt.figure(figsize=(12, 6))
sns.barplot(x='reserved_room_type', y='adr', data=room_adr, palette='viridis')
plt.title('Average Daily Rate (ADR) by Room Type')
plt.xlabel('Reserved Room Type')
plt.ylabel('Average Daily Rate (ADR)')
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is well suited to comparing room types.

##### 2. What is/are the insight(s) found from the chart?

The average daily rate for room type H is higher than for the other room types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Type H rooms are in demand.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the correlation heatmap
plt.figure(figsize=(14, 10))
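# numeric_only=True restricts the correlation to numeric columns; text columns would otherwise raise an error in recent pandas
sns.heatmap(hotel_data.corr(numeric_only=True), annot=True, cmap='coolwarm', linewidths=0.5)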
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Select a subset of columns for clarity
subset_columns = ['lead_time', 'adr', 'total_of_special_requests', 'stays_in_weekend_nights',
                  'stays_in_week_nights', 'adults', 'children', 'babies', 'is_canceled']

# Create a pairplot (pairplot creates its own figure, so a separate plt.figure() call would only leave an empty figure)
sns.pairplot(hotel_data[subset_columns], diag_kind='kde', corner=True)
plt.suptitle('Pair Plot of Selected Features', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot shows the pairwise relationships between the selected variables in a single view.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Peak booking periods increase business, but the higher cancellation rate of bookings made through online travel agents negatively affects it.

Booking rates are high during holidays and the summer months, indicating peak travel periods in which prices could be increased. Online travel agents have higher cancellation rates than direct bookings, so some deductions could be imposed on cancellations.
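
The comparison between online travel agents and direct bookings can be backed by a small table of cancellation rates per market segment. The sketch below shows one way to compute it. The helper name `cancellation_rate_by`, the `min_bookings` threshold, and the `demo` values are assumptions for illustration and do not reproduce the dataset's actual rates.

```python
import pandas as pd


def cancellation_rate_by(df, column, min_bookings=1):
    """Booking count and cancellation rate per category of `column`, highest rate first."""
    summary = (df.groupby(column)['is_canceled']
                 .agg(bookings='count', cancel_rate='mean'))
    # Optionally ignore categories with too few bookings to give a stable rate
    summary = summary[summary['bookings'] >= min_bookings]
    return summary.sort_values('cancel_rate', ascending=False)


# Made-up values for illustration only
demo = pd.DataFrame({
    'market_segment': ['Online TA'] * 4 + ['Direct'] * 4,
    'is_canceled': [1, 1, 0, 0, 0, 0, 0, 1],
})
segment_summary = cancellation_rate_by(demo, 'market_segment')
print(segment_summary)
```

The same function could be pointed at `distribution_channel` or `deposit_type` to compare other groupings.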

# **Conclusion**

The analysis of the hotel booking dataset provided valuable insights into booking patterns, optimal times for booking, and factors influencing special requests. These findings can help hotels optimize their pricing strategies, enhance customer satisfaction, and improve operational efficiency.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***