<a href="https://colab.research.google.com/github/CodeNinjaSatyam/EDA-on-hotel-booking-analysis/blob/main/EDA_on_Hotel_Booking_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Hotel Booking EDA Project



##### **Project Type** - EDA (Exploratory Data Analysis)
##### **Contribution** - Individual
##### **Team Member 1 -** Satyam Satish Ghule

# **Project Summary -**

This project aims to perform an exploratory data analysis (EDA) on a dataset related to hotel bookings. The dataset contains information about hotel bookings, including guest details, booking dates, room types, and more. Through this EDA project, we will gain insights into booking trends, customer preferences, and other patterns that can help in making informed business decisions.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The goal of this EDA project is to understand the dataset and extract valuable insights that can assist hotel management in making data-driven decisions. Specifically, we aim to answer questions related to booking trends, customer demographics, and factors influencing booking cancellations.

#### **Define Your Business Objective?**

Our business objective is to:

1. Identify booking trends by analyzing the data.
2. Understand customer preferences and demographics.
3. Investigate factors that lead to booking cancellations.
4. Provide insights that can help improve the hotel's booking system and customer satisfaction.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
# Upload the data set file and run this
data = pd.read_csv("/content/Hotel Bookings.csv")

### Dataset First View

In [None]:
# Dataset First Look
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = data.shape
print(f"Number of rows: {rows}")
print(f"Number of columns: {columns}")

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = data.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = data.isnull().sum()
print("Missing values in each column:")
print(missing_values)

In [None]:
# Visualizing the missing values
plt.figure(figsize=(12, 6))
sns.heatmap(data.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

1. The dataset is loaded from "Hotel Bookings.csv" and contains hotel booking information.
2. It has 119390 rows and 32 columns.
3. Various data types are present, and some columns may need data type conversion.
4. There are 31994 duplicate rows in the dataset.
5. Missing values are present in several columns: 488 in country, 16340 in agent, and 112593 in company.
6. Data cleaning and preprocessing are required for handling missing values.
7. Further steps involve exploratory data analysis (EDA) to uncover insights and relationships in the data.
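
The missing-value counts above are easier to judge as a share of all rows. The following is a minimal sketch, assuming a hypothetical helper named `missing_summary` and a small toy DataFrame that stands in for the real `data` (the toy values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the missing-value count and percentage per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "country": ["PRT", None, "GBR", "ESP"],
    "agent": [9.0, np.nan, np.nan, 240.0],
    "company": [np.nan, np.nan, np.nan, 40.0],
    "adr": [75.0, 98.5, 120.0, 60.0],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(data)` on the real DataFrame would put the 112593 missing `company` values next to the 119390 total rows, which is useful when deciding between dropping and imputing.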

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe()

### Variables Description

1. hotel: Categorical variable indicating the type of hotel (Resort Hotel or City Hotel).

2. is_canceled: Binary variable (0 or 1) indicating whether the booking was canceled (1) or not (0).

3. lead_time: Numerical variable representing the number of days between the booking date and the arrival date.

4. arrival_date_year: Numerical variable representing the year of arrival.

5. arrival_date_month: Categorical variable representing the month of arrival.

6. arrival_date_week_number: Numerical variable representing the week number of the year for the arrival date.

7. arrival_date_day_of_month: Numerical variable representing the day of the month for the arrival date.

8. stays_in_weekend_nights: Numerical variable indicating the number of weekend nights (Saturday and Sunday) the guest stayed.

9. stays_in_week_nights: Numerical variable indicating the number of weekday nights (Monday to Friday) the guest stayed.

10. adults: Numerical variable indicating the number of adults in the booking.

11. children: Numerical variable indicating the number of children in the booking.

12. babies: Numerical variable indicating the number of babies in the booking.

13. meal: Categorical variable indicating the type of meal booked (e.g., BB for Bed & Breakfast).

14. country: Categorical variable representing the country of origin of the guest.

15. market_segment: Categorical variable indicating the market segment for the booking.

16. distribution_channel: Categorical variable indicating the distribution channel used to make the booking.

17. is_repeated_guest: Binary variable (0 or 1) indicating whether the guest is a repeated guest (1) or not (0).

18. previous_cancellations: Numerical variable indicating the number of previous booking cancellations by the guest.

19. previous_bookings_not_canceled: Numerical variable indicating the number of previous bookings that were not canceled by the guest.

20. reserved_room_type: Categorical variable indicating the originally reserved room type.

21. assigned_room_type: Categorical variable indicating the room type assigned to the guest upon arrival.

22. booking_changes: Numerical variable indicating the number of changes made to the booking.

23. deposit_type: Categorical variable indicating the type of deposit made for the booking.

24. agent: Numerical variable representing the ID of the booking agent.

25. company: Numerical variable representing the ID of the company that made the booking (null values indicate individual bookings).

26. days_in_waiting_list: Numerical variable indicating the number of days the booking was on the waiting list.

27. customer_type: Categorical variable indicating the type of customer (e.g., Transient).

28. adr: Numerical variable representing the average daily rate (price) for the booking.

29. required_car_parking_spaces: Numerical variable indicating the number of car parking spaces requested by the guest.

30. total_of_special_requests: Numerical variable indicating the total number of special requests made by the guest.

31. reservation_status: Categorical variable indicating the reservation status (e.g., Check-Out).

32. reservation_status_date: Date when the reservation status was last updated.

These descriptions provide a basic understanding of each variable in the dataset, including its data type and the nature of the information it contains.
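
Since the arrival date is spread across three columns (year, month, day of month), it can help to combine them into one datetime value for time-based analysis. This is a sketch only: it assumes the month column holds full English month names, and the `arrival_date` column name is a hypothetical choice, not part of the dataset.

```python
import pandas as pd

# Toy rows standing in for the real `data` (illustrative values only)
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016],
    "arrival_date_month": ["July", "February"],
    "arrival_date_day_of_month": [1, 29],
})

# Build a "YYYY-MonthName-D" string, then parse it with an explicit format
date_text = (
    toy["arrival_date_year"].astype(str)
    + "-" + toy["arrival_date_month"]
    + "-" + toy["arrival_date_day_of_month"].astype(str)
)
toy["arrival_date"] = pd.to_datetime(date_text, format="%Y-%B-%d")
print(toy["arrival_date"])
```

An explicit `format` makes the parse fail loudly if the month names are not what was assumed, rather than guessing.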

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = {}
for column in data.columns:
    unique_values[column] = data[column].nunique()

print("Unique values for each variable:")
print(unique_values)



## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
missing_data = data.isnull().sum()
print("Missing Data:\n", missing_data)

In [None]:
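# Fill missing values instead of dropping rows: 'company' is null in most rows,
# so dropna() would discard nearly the whole dataset
data = data.fillna({'children': 0, 'country': 'Unknown', 'agent': 0, 'company': 0})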

In [None]:
# Check for duplicates and remove them
data = data.drop_duplicates()

In [None]:
# Convert date columns to datetime format
data['reservation_status_date'] = pd.to_datetime(data['reservation_status_date'])


In [None]:
# Encoding categorical variables (example: meal)
data = pd.get_dummies(data, columns=['meal'], prefix='meal')

In [None]:
# Feature engineering example: total guests
data['total_guests'] = data['adults'] + data['children'] + data['babies']


In [None]:
# Data visualization (example: count of bookings by month)
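# Reindex to calendar order so the bars read January to December (assumes full English month names)
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
monthly_bookings = data['arrival_date_month'].value_counts().reindex(month_order, fill_value=0)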
sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))
sns.barplot(x=monthly_bookings.index, y=monthly_bookings.values)
plt.xlabel('Month')
plt.ylabel('Number of Bookings')
plt.xticks(rotation=45)
plt.title('Monthly Bookings')
plt.show()

### What all manipulations have you done and insights you found?

#### Data Wrangling Manipulations:

1. Data Loading: Load the dataset into a data structure (e.g., pandas DataFrame).
2. Data Cleaning: Handle missing values, either by imputation or dropping rows/columns with missing data.
3. Duplicate Removal: Check for and remove duplicate records to ensure data integrity.
4. Data Type Conversion: Ensure that data types are appropriate for analysis (e.g., converting dates to datetime format).
5. Categorical Encoding: Encode categorical variables as numerical values (e.g., one-hot encoding).
6. Feature Engineering: Create new features from existing ones (e.g., calculating the total number of guests).
7. Data Transformation: Apply mathematical operations or transformations to variables (e.g., log transformation).
8. Data Visualization: Visualize the data to gain insights and identify patterns.
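
Steps 5 to 7 in the list above (encoding, feature engineering, transformation) can be sketched together on a small toy DataFrame. The `lead_time_log` column name and the toy values, including the "HB" meal code, are illustrative assumptions; the log transformation is shown as an example of step 7 and is not applied elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real `data` (illustrative values only)
toy = pd.DataFrame({
    "adults": [2, 1, 2],
    "children": [0, 0, 1],
    "babies": [0, 1, 0],
    "lead_time": [0, 30, 365],
    "meal": ["BB", "HB", "BB"],
})

# Feature engineering: party size (sum over columns skips NaN by default)
toy["total_guests"] = toy[["adults", "children", "babies"]].sum(axis=1)
# Transformation: log1p is defined at 0 and compresses a long right tail
toy["lead_time_log"] = np.log1p(toy["lead_time"])
# Encoding: one indicator column per meal code
encoded = pd.get_dummies(toy, columns=["meal"], prefix="meal")
print(encoded)
```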

#### Insights from the Data:

1. Booking Trends: Identify trends in booking over time, such as monthly or seasonal patterns.
2. Cancellation Rates: Calculate and compare the cancellation rates for different types of bookings.
3. Customer Demographics: Explore the distribution of customer demographics, such as the origin of guests (country), customer type, and market segment.
4. Pricing Analysis: Analyze the average daily rates (ADR) and its variations over time or across customer segments.
5. Booking Channels: Investigate the effectiveness of different booking channels and their impact on bookings and cancellations.
6. Length of Stay: Understand the distribution of the length of stay and how it relates to booking outcomes.
7. Reservation Status: Explore the reservation status and its distribution over time.
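
As one example of how these questions translate into code, the length-of-stay item can be approached by adding the two night-count columns and summarizing the result by booking outcome. This is a minimal sketch on toy data; `total_nights` is a hypothetical column name and the values are illustrative.

```python
import pandas as pd

# Toy rows standing in for the real `data` (illustrative values only)
toy = pd.DataFrame({
    "stays_in_weekend_nights": [0, 2, 1, 2],
    "stays_in_week_nights": [1, 5, 2, 3],
    "is_canceled": [0, 1, 0, 1],
})

# Total length of stay is the sum of weekend and weekday nights
toy["total_nights"] = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]
# Compare stay length between kept (0) and canceled (1) bookings
stay_by_outcome = toy.groupby("is_canceled")["total_nights"].agg(["mean", "median", "count"])
print(stay_by_outcome)
```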

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1: Univariate Analysis

plt.figure(figsize=(8, 5))
sns.countplot(data=data, x='hotel', palette='Set1')
plt.title('Distribution of Hotel Types')
plt.xlabel('Hotel Type')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart to visualize the distribution of a categorical variable.


##### 2. What is/are the insight(s) found from the chart?

The chart shows that there are more bookings for City Hotels compared to Resort Hotels.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight that City Hotels have more bookings can help the business focus on improving and optimizing services for City Hotel customers. However, it may also mean higher competition in the city hotel segment.

#### Chart - 2

In [None]:
# Chart - 2: Bivariate Analysis (Numerical - Categorical)

plt.figure(figsize=(10, 6))
sns.boxplot(x='hotel', y='lead_time', data=data)
plt.title('Lead Time vs. Hotel Type')
plt.xlabel('Hotel Type')
plt.ylabel('Lead Time')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a box plot to visualize the relationship between a numerical variable (lead_time) and a categorical variable (hotel).

##### 2. What is/are the insight(s) found from the chart?

The box plot shows that Resort Hotels tend to have shorter lead times compared to City Hotels. City Hotels have a wider range of lead times.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight that Resort Hotels have shorter lead times may suggest that customers prefer booking Resort Hotels closer to their check-in dates, which could be a positive aspect for marketing and pricing strategies.


#### Chart - 3

In [None]:
# Chart - 3: Bivariate Analysis (Categorical - Categorical)

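cross_tab = pd.crosstab(data['hotel'], data['customer_type'])
# Pass figsize to DataFrame.plot, which creates its own figure;
# a separate plt.figure() call would leave an empty figure behind
cross_tab.plot(kind='bar', stacked=True, figsize=(10, 6))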
plt.title('Hotel vs. Customer Type')
plt.xlabel('Hotel Type')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a stacked bar chart to visualize the relationship between two categorical variables (hotel and customer_type).

##### 2. What is/are the insight(s) found from the chart?

The chart shows the distribution of customer types across different hotel types. City Hotels have a larger proportion of Transient customers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight can help in tailoring marketing and services for different customer types, potentially leading to a positive impact.

#### Chart - 4

In [None]:
# Chart - 4: Univariate Analysis

plt.figure(figsize=(10, 6))
plt.hist(data['lead_time'], bins=30, color='skyblue')
plt.title('Lead Time Distribution')
plt.xlabel('Lead Time (days)')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram to visualize the distribution of the 'lead_time' variable.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the distribution of lead times for bookings, with a peak at lower lead times.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight suggests that many bookings are made with short lead times, which can help in optimizing staffing and resources.

#### Chart - 5

In [None]:
# Chart - 5: Bivariate Analysis (Numerical - Categorical)

plt.figure(figsize=(10, 6))
sns.boxplot(x='hotel', y='adr', data=data, palette='pastel')
plt.title('Hotel vs. Average Daily Rate (ADR)')
plt.xlabel('Hotel Type')
plt.ylabel('Average Daily Rate (ADR)')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a boxplot to visualize the relationship between the 'hotel' and 'adr' (average daily rate) variables.

##### 2. What is/are the insight(s) found from the chart?

The boxplot shows the distribution of ADR for different hotel types. City Hotels have a wider range of ADR values.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can help in pricing strategies and optimizing revenue based on ADR trends for different hotel types.
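
The points a box plot draws beyond its whiskers follow the 1.5 × IQR rule by default, and the same rule can be used to list the flagged ADR values explicitly. The helper name `iqr_outliers` and the toy values below are assumptions made for illustration; this is a sketch, not the method used elsewhere in this notebook.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy ADR values with one extreme rate (illustrative only)
toy_adr = pd.Series([60.0, 75.0, 80.0, 90.0, 95.0, 110.0, 5400.0])
mask = iqr_outliers(toy_adr)
print(toy_adr[mask])
```

On the real data this could be applied per hotel type, for example with `data.groupby('hotel')['adr']`, so that each hotel is judged against its own price range.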

#### Chart - 6

In [None]:
# Chart - 6: Bivariate Analysis (Numerical - Numerical)
plt.figure(figsize=(10, 6))
plt.scatter(data['lead_time'], data['adr'], color='coral')
plt.title('Lead Time vs. Average Daily Rate (ADR)')
plt.xlabel('Lead Time (days)')
plt.ylabel('Average Daily Rate (ADR)')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a scatter plot to visualize the relationship between 'lead_time' and 'adr'.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot indicates that there is no strong correlation between lead time and ADR.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights suggest that lead time is not a significant predictor of ADR, which is valuable for pricing and demand forecasting.
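
A visual impression of "no strong correlation" can be backed by a number. The sketch below computes Pearson's r and Spearman's rho (as Pearson's r on ranks) for a toy pair of columns; the values are illustrative and are not results from the real dataset.

```python
import pandas as pd

# Toy rows standing in for the real `data` (illustrative values only)
toy = pd.DataFrame({
    "lead_time": [0, 10, 50, 120, 300],
    "adr": [95.0, 80.0, 110.0, 70.0, 90.0],
})

# Pearson measures linear association
pearson = toy["lead_time"].corr(toy["adr"])
# Spearman is Pearson applied to ranks, so it captures monotonic association
spearman = toy["lead_time"].rank().corr(toy["adr"].rank())
print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

Values close to 0 on the real columns would support the reading of the scatter plot; a large gap between the two coefficients would hint at outliers or a non-linear pattern.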

#### Chart - 7

In [None]:
# Chart - 7: Bivariate Analysis (Categorical - Categorical)
crosstab = pd.crosstab(data['market_segment'], data['is_canceled'])
# Divide each row by its total so every bar shows proportions that sum to 1
crosstab_percentage = crosstab.div(crosstab.sum(axis=1), axis=0)
crosstab_percentage.plot(kind='bar', stacked=True, colormap='viridis', figsize=(12, 6))
plt.title('Market Segment vs. Cancellation Rate')
plt.xlabel('Market Segment')
plt.ylabel('Proportion of Bookings')
plt.legend(title='Is Canceled', loc='upper right', labels=['Not Canceled', 'Canceled'])
plt.show()


##### 1. Why did you pick the specific chart?

I chose a stacked bar chart to visualize the relationship between 'market_segment' and 'is_canceled'.

##### 2. What is/are the insight(s) found from the chart?

The stacked bar chart shows the cancellation rate by market segment. Online Travel Agents (OTA) have a higher cancellation rate.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can help in refining marketing and distribution strategies to reduce cancellation rates in the OTA segment.
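
To rank segments by cancellation rate rather than read them off the bars, the row-normalized table can be built directly with the `normalize="index"` option of `pd.crosstab`. This is a minimal sketch on toy data; the segment labels and values are illustrative.

```python
import pandas as pd

# Toy rows standing in for the real `data` (illustrative values only)
toy = pd.DataFrame({
    "market_segment": ["Online TA", "Online TA", "Online TA", "Direct", "Direct", "Groups"],
    "is_canceled": [1, 1, 0, 0, 0, 1],
})

# normalize="index" divides each row by its row total
rates = pd.crosstab(toy["market_segment"], toy["is_canceled"], normalize="index")
# Column 1 holds the canceled share of each segment
ranking = rates[1].sort_values(ascending=False)
print(ranking)
```

Segments with very few bookings can show extreme rates, so it is worth checking the raw counts alongside the proportions.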

#### Chart - 8

In [None]:
# Chart - 8: Multivariate Analysis
numerical_variables = ['lead_time', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adr', 'booking_changes']
sns.pairplot(data[numerical_variables], diag_kind='kde')
plt.suptitle('Pair Plot of Numerical Variables', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a pair plot to visualize the relationships between multiple numerical variables.

##### 2. What is/are the insight(s) found from the chart?

The pair plot shows the distribution of each numerical variable on the diagonal (as KDE curves) and the pairwise relationships between the variables in the off-diagonal scatter plots.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The pair plot provides a quick overview of how lead time, stay length, ADR, and booking changes relate to one another, which helps decide which pairs deserve closer analysis.

#### Chart - 9

In [None]:
# Chart - 9: Multivariate Analysis
correlation_matrix = data[numerical_variables].corr()
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Numerical Variables')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a heatmap to visualize the correlation between numerical variables.

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows the correlation between numerical variables. Some variables have positive or negative correlations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from the correlation heatmap can guide decisions related to pricing, stay duration, and booking changes.
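
To move from "some variables are correlated" to specific pairs, the correlation matrix can be flattened and sorted by absolute value. The helper `top_correlations` below is a hypothetical name, and the toy values are illustrative rather than results from the dataset.

```python
from itertools import combinations

import pandas as pd


def top_correlations(df: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Visit each unordered pair once, which also skips the diagonal
    pairs = pd.Series({(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)})
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy rows standing in for data[numerical_variables] (illustrative values only)
toy = pd.DataFrame({
    "lead_time": [1, 2, 3, 4, 5],
    "adr": [10.0, 8.0, 6.0, 4.0, 2.0],
    "booking_changes": [0, 1, 0, 1, 0],
})
top = top_correlations(toy, n=2)
print(top)
```

On the real data, `top_correlations(data[numerical_variables])` would name the pairs that the heatmap colors most strongly.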

#### Chart - 10

In [None]:
# Chart - 10: Univariate Analysis

plt.figure(figsize=(10, 6))
sns.histplot(data['lead_time'], kde=True, color='skyblue', bins=20)
plt.title('Distribution of Lead Time')
plt.xlabel('Lead Time')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram to visualize the distribution of the 'lead_time' variable.

##### 2. What is/are the insight(s) found from the chart?

The histogram shows the distribution of lead times, with a peak in shorter lead times.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the lead time distribution can help in managing reservation systems and optimizing booking schedules.

#### Chart - 11

In [None]:
# Chart - 11: Bivariate Analysis (Numerical - Categorical)

plt.figure(figsize=(10, 6))
sns.boxplot(x='hotel', y='adr', data=data, palette='pastel')
plt.title('Distribution of ADR by Hotel Type')
plt.xlabel('Hotel Type')
plt.ylabel('ADR')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a box plot to visualize the distribution of 'adr' (average daily rate) by hotel type.

##### 2. What is/are the insight(s) found from the chart?

The box plot shows the distribution of ADR by hotel type, with resort hotels having lower ADR compared to city hotels.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can help in pricing strategies for different types of hotels, potentially leading to increased revenue for city hotels.

#### Chart - 12

In [None]:
# Chart - 12: Bivariate Analysis (Numerical - Categorical)

plt.figure(figsize=(12, 6))
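# Swarm plots position every point individually, which is very slow on large data,
# so plot a random sample (the sample size of 2000 is an arbitrary choice)
plot_sample = data.sample(n=min(len(data), 2000), random_state=42)
sns.swarmplot(data=plot_sample, x='market_segment', y='adr', hue='hotel', palette='Set2')
plt.title('Market Segment vs. ADR by Hotel Type')
plt.xlabel('Market Segment')
plt.ylabel('ADR')
# Let seaborn supply the legend labels; hard-coded labels can end up on the wrong hue level
plt.legend(title='Hotel Type', loc='upper right')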
plt.show()

##### 1. Why did you pick the specific chart?

I chose a swarm plot to visualize the relationship between 'market_segment' and 'adr' (average daily rate).

##### 2. What is/are the insight(s) found from the chart?

The swarm plot shows the distribution of ADR by market segment and hotel type. It indicates that online travel agents (OTA) have a wide range of ADR values for both hotel types.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can help in optimizing pricing strategies for different market segments, potentially leading to higher revenue.
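
The comparison in the chart can also be summarized as a table of median ADR per market segment and hotel type. This is a sketch on toy data with illustrative values; the median is chosen here because it is less sensitive to extreme rates than the mean.

```python
import pandas as pd

# Toy rows standing in for the real `data` (illustrative values only)
toy = pd.DataFrame({
    "market_segment": ["Online TA", "Online TA", "Direct", "Direct", "Online TA"],
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "City Hotel", "City Hotel"],
    "adr": [100.0, 80.0, 120.0, 100.0, 140.0],
})

# Rows are segments, columns are hotel types, cells are median ADR
adr_table = toy.pivot_table(index="market_segment", columns="hotel", values="adr", aggfunc="median")
print(adr_table)
```

Combinations with no bookings appear as NaN, which makes gaps in coverage visible.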

#### Chart - 13

In [None]:
# Chart - 13: Multivariate Analysis

numerical_variables = ['lead_time', 'stays_in_weekend_nights', 'stays_in_week_nights', 'booking_changes', 'adr']
sns.pairplot(data=data, vars=numerical_variables, hue='is_canceled', markers='o', palette='coolwarm')
plt.suptitle('Scatter Plot Matrix of Numerical Variables vs. ADR', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a scatter plot matrix to visualize the relationships between numerical variables and 'adr' (average daily rate).

##### 2. What is/are the insight(s) found from the chart?

The scatter plot matrix shows the relationships between numerical variables and ADR, separated by the cancellation status.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The scatter plot matrix provides insights into how numerical variables influence ADR and booking cancellations.


## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain briefly.

1. We have thoroughly examined the hotel booking dataset, understanding its structure and characteristics.
2. Identified missing data, duplicates, and outliers.
3. Discovered patterns and trends in booking behavior, like the impact of lead time and booking source.
4. Gained insights into customer preferences, such as meal choices and room types.
5. Recognized the importance of pricing and managing booking cancellations.
6. Noted the significance of customer loyalty and repeat bookings.
7. Emphasized the importance of data-driven decision-making in the hotel industry.
8. Proposed strategies for the business to improve bookings, customer satisfaction, and overall performance.

# **Conclusion**

We have thoroughly examined the hotel booking dataset, understanding its structure and characteristics. In the process, we:

* Identified missing data, duplicates, and outliers.
* Discovered patterns and trends in booking behavior, like the impact of lead time and booking source.
* Gained insights into customer preferences, such as meal choices and room types.
* Recognized the importance of pricing and managing booking cancellations.
* Noted the significance of customer loyalty and repeat bookings.
* Emphasized the importance of data-driven decision-making in the hotel industry.
* Proposed strategies for the business to improve bookings, customer satisfaction, and overall performance.

In essence, this EDA equi

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***