# **Project Name**    - **Hotel Booking Analysis**



##### **Project Type**    - Exploratory Data Analysis (EDA)
##### **Contribution**    - Individual
**Name** - Suraj Ratansing Bedwal

# **Project Summary -**

**This project involves a comprehensive examination of a dataset detailing hotel reservations spanning from 2015 to 2017 for both urban and resort hotels. The dataset includes various booking details for both types of hotels, such as the booking date, duration of stay, the number of adults, children, and infants, and the availability of parking spaces, among other details. All personally identifiable information has been removed from the data. We will conduct exploratory data analysis using Python to extract insights from this dataset.**

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Ever Wondered About the Best Time to Book a Hotel Room or How Long to Stay for the Best Rates?**
**Or maybe you're interested in determining if a hotel is likely to receive numerous special requests? This hotel booking dataset can provide the answers you're looking for!**

**This dataset contains details about reservations for both city and resort hotels. It includes information on the booking date, length of stay, the number of adults, children, and infants, and the number of parking spaces available. Any personal information has been removed to ensure privacy.**

**Explore the data to discover what factors are significant when people book hotels.**

#### **Define Your Business Objective?**

### **Objectives for Hotel Booking Data Analysis Project:**

**Cancellation Pattern Recognition:**

1. Identify and analyze patterns related to booking cancellations using exploratory data analysis.
2. Examine the influence of lead time, meal preferences, and customer types on cancellations.
3. Develop a framework to recognize and categorize reasons for cancellations.

**Optimal Booking Timing Determination:**

1. Assess historical booking data to identify peak periods and reservation trends.
2. Determine the best times to book based on lead times, seasonal changes, and historical patterns.
3. Provide actionable insights to help customers optimize their booking times.

**Peak Season and Customer Behavior Analysis:**

1. Investigate the connection between peak seasons and customer demographics, room preferences, and booking channels.
2. Identify key factors affecting customer behavior during peak seasons.
3. Offer hotels tailored strategies to adjust their services to meet peak season demands.

**Data Visualization for Decision-Making:**

1. Apply advanced data visualization techniques to present findings clearly.
2. Create visual aids such as charts and graphs to support informed decision-making.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below and answers the questions listed.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
'/content/drive/MyDrive/Module 2 Project/Week/Day/Hotel Bookings.csv'

In [None]:
# Load Dataset
df = pd.read_csv('/content/drive/My Drive/Module 2 Project/Week/Day/Hotel Bookings.csv')
df
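The guidelines above ask for exception handling, so the loading step could be wrapped as follows. This is a minimal sketch, not the notebook's actual loader: `load_bookings` is a hypothetical helper name, and the demonstration uses a tiny temporary CSV as a stand-in for the real file on Google Drive.

```python
import os
import tempfile

import pandas as pd


def load_bookings(csv_path):
    """Read the bookings CSV, raising a clear error if the path is wrong."""
    # Hypothetical helper: fail early with a readable message instead of a long traceback.
    if not os.path.isfile(csv_path):
        raise FileNotFoundError(f"Dataset not found at: {csv_path}")
    return pd.read_csv(csv_path)


# Demonstration on a tiny temporary file (a stand-in for the real dataset).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_bookings.csv")
    pd.DataFrame(
        {"hotel": ["City Hotel", "Resort Hotel"], "adr": [95.0, 120.5]}
    ).to_csv(demo_path, index=False)
    demo_df = load_bookings(demo_path)

print(demo_df.shape)
```

In the notebook itself, the call would take the Drive path used in the cell above.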

### Dataset First View

In [None]:
# Dataset First Look
# Top 5 Rows
df.head()

In [None]:
# Last 5 Rows
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

The given dataset has 119,390 rows and 32 columns.

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_value = df.duplicated().sum()
print(duplicate_value)

The dataset has 31,994 duplicate rows.
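It may help to be explicit about what `duplicated()` counts. The sketch below uses a small made-up frame (not the real data) to show that the default counts only the repeated copies, while `keep=False` counts every row that belongs to a duplicated group. Since the column list contains no booking ID, identical rows could in principle be distinct bookings; treating them as duplicates is an assumption of this analysis.

```python
import pandas as pd

# Toy data: rows 0, 1 and 3 are identical.
toy = pd.DataFrame(
    {
        "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
        "lead_time": [10, 10, 5, 10],
    }
)

# Default keep='first': the first occurrence is not flagged, only later copies are.
n_extra_copies = int(toy.duplicated().sum())

# keep=False: every member of a duplicated group is flagged.
n_rows_in_dup_groups = int(toy.duplicated(keep=False).sum())

print(n_extra_copies, n_rows_in_dup_groups)
```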

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values

plt.figure(figsize=(10,6))
sns.heatmap(df.isnull(),cmap='viridis', cbar=False)
plt.title('Missing Values')
plt.show()

### What did you know about your dataset?

**Overview of the Hotel Booking Dataset:**

The dataset provides detailed booking information for two types of hotels:

**1. City Hotel**

**2. Resort Hotel**

It includes various customer details such as the booking date, the year, month, and week of arrival, and the total duration of the stay. Additionally, it covers the type of meal chosen, the allocated room type, and some personal details like customer type, country of origin, whether they were alone or in a group, and the number of children or babies accompanying them.

**Exploratory Data Analysis Objective:**

"I am conducting an exploratory data analysis on a hotel booking dataset. The dataset contains information on reservations, customer demographics, and booking patterns. The goal is to uncover insights, trends, and potential factors influencing booking behaviors."

**Dataset Characteristics:**

**Size:** 119,390 rows and 32 columns

**Missing Values:** Some columns have missing values, including company, agent, country, and children

**Duplicate Values:** There are 31,994 duplicate entries in the dataset.


**Initial Goal:**

The primary objective is to clean the dataset to optimize it for efficient analysis. This involves handling missing values and removing duplicates.


**Key Points for Data Analysis:**

**Reservation Patterns:**
Analyze customer demographics to understand reservation behaviors, including booking frequency, preferred booking channels, and typical lead times.

**Cancellation Trends:**
Examine demographics to gain insights into cancellation tendencies, identifying which customer segments are more likely to cancel reservations and why. This information is vital for optimizing cancellation policies and improving customer satisfaction.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')   # include='all' summarizes every column, numeric and non-numeric; statistics that do not apply to a column show as NaN

### Variables Description

**Detailed Dataset Description:**

**Hotel Types:**
- **City Hotel**: Urban hotel
- **Resort Hotel**: Vacation hotel

**Booking and Customer Information:**
- **is_canceled**: Indicates if the booking was canceled (1) or not (0)
- **lead_time**: Number of days in advance the hotel was booked
- **arrival_date_year**: Year of arrival
- **arrival_date_month**: Month of arrival
- **arrival_date_week_number**: Week number of arrival
- **arrival_date_day_of_month**: Day of arrival
- **stays_in_weekend_nights**: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay
- **stays_in_week_nights**: Number of week nights (Monday to Friday) the guest stayed or booked to stay
- **adults**: Number of adults
- **children**: Number of children
- **babies**: Number of babies
- **meal**: Type of meal plan chosen by the customer
  - **BB**: Bed & Breakfast
  - **FB**: Full Board (Breakfast, Lunch, and Dinner)
  - **HB**: Half Board (Breakfast and Dinner)
  - **SC/Undefined**: No meal opted

**Customer Details:**
- **country**: Code of the country the customer belongs to
- **market_segment**: Market segment the customer belongs to
- **distribution_channel**: Booking channel used by the customer (direct/TA/TO)
- **is_repeated_guest**: Indicates if the guest is a returning customer (1) or a first-time visitor (0)
- **previous_cancellations**: Number of previous bookings canceled by the customer
- **previous_bookings_not_canceled**: Number of previous successful bookings by the customer

**Room Information:**
- **reserved_room_type**: Type of room reserved by the customer
- **assigned_room_type**: Type of room assigned to the customer
- **booking_changes**: Number of changes made to the booking

**Financial and Special Requests:**
- **deposit_type**: Deposit type chosen by the customer
- **agent**: ID of the travel agent who made the booking
- **company**: ID of the company that made the booking
- **customer_type**: Type of customer
  - **Transient**: Booking not part of a group or contract, and not associated with other transient bookings
  - **Contract**: Booking associated with a contract
  - **Transient_party**: Transient booking associated with at least one other transient booking
  - **Group**: Booking associated with a group
- **days_in_waiting_list**: Number of days the customer had to wait for booking confirmation
- **adr**: Average Daily Rate, the average rate paid per occupied room
- **required_car_parking_spaces**: Number of car parking spaces required by the customer
- **total_of_special_requests**: Number of additional special requests made by the customer

**Reservation Status:**
- **reservation_status**: Last reservation status (Check-Out, Canceled, or No-Show)
- **reservation_status_date**: Date of the last reservation status update


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
display(df.apply(lambda column:column.unique()).to_string())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Dropping all duplicate values
df.drop_duplicates(inplace=True)

In [None]:
# Check the number of rows and columns after dropping duplicates
df.shape

In [None]:
# Check whether contains null value or not
df.isnull().values.any()

In [None]:
print((df.isnull().sum()).sort_values(ascending=False))

**Here we find that company, agent, country, and children contain null values. The company column has by far the most nulls, so we will drop it in a later step and fill the null values in agent, country, and children.**

In [None]:
# Percentage of Null Values
pd.DataFrame(round(df.isna().sum()*100/len(df),4))

**From the above, the children, country, agent, and company variables have 0.0046%, 0.5172%, 13.9514%, and 93.9826% null values, respectively. Only company has more than 50% null values.**
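The counts and percentages can also be read side by side in a single table. The following is a minimal sketch on a made-up frame; the column names `missing_count` and `missing_pct` are illustrative choices, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in two of the three columns.
toy = pd.DataFrame(
    {
        "children": [0.0, 1.0, np.nan, 2.0],
        "country": ["PRT", None, None, "GBR"],
        "adr": [80.0, 95.5, 60.0, 110.0],
    }
)

# One row per column: number of nulls and their share of all rows.
missing = pd.DataFrame(
    {
        "missing_count": toy.isna().sum(),
        "missing_pct": (toy.isna().mean() * 100).round(2),
    }
)

# Keep only the columns that actually have nulls, worst first.
missing = missing[missing["missing_count"] > 0].sort_values(
    "missing_pct", ascending=False
)
print(missing)
```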

In [None]:
# Drop the columns which has more than 50% null values
df.drop(columns='company',inplace=True)

In [None]:
# Replacing null values with the most frequent value in a variable
df['children']=df['children'].fillna(df['children'].mode()[0])
df['country']=df['country'].fillna(df['country'].mode()[0])
df['agent']=df['agent'].fillna(df['agent'].mode()[0])

**Here children and agent are discrete numerical variables and country is categorical, so their null values were replaced with the mode. The company variable had more than 50% null values, so it was dropped.**
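Filling `agent` with its mode assigns every missing booking to the most common agent ID. If a missing `agent` is instead read as "no agent involved" (an assumption, not something this notebook verifies), a sentinel value keeps those bookings distinguishable. The sketch below compares the two options on made-up IDs; the sentinel `0` and the new column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy agent IDs with two missing entries.
toy = pd.DataFrame({"agent": [9.0, np.nan, 240.0, 9.0, np.nan]})

# Option used in this notebook: fill with the most frequent agent ID.
toy["agent_mode_filled"] = toy["agent"].fillna(toy["agent"].mode()[0]).astype("int64")

# Alternative (assumption): 0 is not a real agent ID and marks "no agent".
toy["agent_flagged"] = toy["agent"].fillna(0).astype("int64")

print(toy)
```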

In [None]:
# Check null values are removed or not?
df.isnull().sum()

**All the null values have been successfully handled.**

In [None]:
df.shape # shape method is used to know the total number of rows and columns in dataset

In [None]:
# Check unique values for hotel column
df['hotel'].unique()

**Here we can see that there are two types of hotels in our dataset**

In [None]:
# Unique values of is_canceled column
list(df['is_canceled'].unique())

**Here 0 means the booking was not canceled and 1 means the booking was canceled.**

In [None]:
# Unique values of arrival_date_year column
df['arrival_date_year'].unique()

**Handling Outliers**

**Categorical Variables**

In [None]:
# Find categorical variables
categorical_variables=[i for i in df.columns if df[i].dtype=='O']
print(f'Dataset having {len(categorical_variables)} categorical_variables')
print(categorical_variables)

**Numerical Variables**

In [None]:
# Find numerical variables
numerical_variables = [i for i in df.columns if df[i].dtype!='O']
print(f'Dataset having {len(numerical_variables)} numerical_variables')
print(numerical_variables)

**Finding the outliers in each column**

In [None]:
plt.rc('font', size=14)
plt.figure(figsize=(30,12))
sns.boxplot(data=df[['is_canceled', 'lead_time', 'arrival_date_week_number', 'adults', 'babies', 'previous_cancellations', 'booking_changes', 'days_in_waiting_list', 'adr', 'total_of_special_requests']])
plt.show()

**After checking each column, the most prominent outliers on this shared scale appear in three columns: lead_time, days_in_waiting_list, and adr.**

In [None]:
# Only see the Outliers in three columns
plt.rc('font', size=14)
plt.figure(figsize=(30,12))
sns.boxplot(data=df[['lead_time', 'days_in_waiting_list', 'adr']])
plt.show()

**Now Remove the Outliers**

In [None]:
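# Compute the IQR-based bounds used to handle the outliers
def remove_outlier(col):
  # Despite its name, this helper only returns the bounds; the caller applies them
  q1,q3=col.quantile([0.25,0.75])   # first and third quartiles
  iqr=q3-q1                         # interquartile range
  lower_bound=q1-(1.5*iqr)
  upper_bound=q3+(1.5*iqr)
  return lower_bound,upper_bound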

**In the above code we define a function remove_outlier that takes a column as its parameter and returns the lower and upper IQR bounds (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR). In the cells below, values outside these bounds are capped at the bounds rather than dropped, so no rows are removed.**
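A small worked example may make the IQR rule concrete. The sketch below uses a made-up series and a hypothetical helper name `iqr_bounds`; it caps with `Series.clip`, which gives the same result as a pair of `np.where` calls.

```python
import pandas as pd


def iqr_bounds(values):
    """Return (lower, upper) bounds at 1.5 * IQR beyond the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


toy = pd.Series([1, 2, 3, 4, 100])

# Q1 = 2 and Q3 = 4, so IQR = 2 and the bounds are -1 and 7.
low, high = iqr_bounds(toy)

# Values outside the bounds are pulled to the nearest bound; nothing is dropped.
capped = toy.clip(lower=low, upper=high)
print(low, high, capped.tolist())
```

Here the value 100 is replaced by the upper bound 7, and the series keeps all five entries.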

In [None]:
# 1. Capping outliers in the lead_time column at the IQR bounds
low, high = remove_outlier(df['lead_time'])
df['lead_time'] = np.where(df['lead_time'] > high, high, df['lead_time'])
df['lead_time'] = np.where(df['lead_time'] < low, low, df['lead_time'])

**In the above code we pass the 'lead_time' column to the 'remove_outlier' function, and the values returned by the 'remove_outlier' function are stored in low and high variables.**

In [None]:
# 2. Capping outliers in the days_in_waiting_list column at the IQR bounds
low,high=remove_outlier(df['days_in_waiting_list'])
df['days_in_waiting_list']=np.where(df['days_in_waiting_list']>high,high,df['days_in_waiting_list'])
df['days_in_waiting_list']=np.where(df['days_in_waiting_list']<low,low,df['days_in_waiting_list'])

**Here we pass the 'days_in_waiting_list' column to the 'remove_outlier' function, and the values returned by the 'remove_outlier' function are stored in low and high variables.**

In [None]:
# 3. Removing Outliers from adr column
low,high=remove_outlier(df['adr'])
df['adr']=np.where(df['adr']>high,high,df['adr'])
df['adr']=np.where(df['adr']<low,low,df['adr'])

**Here also we pass the 'adr' column to the 'remove_outlier' function, and the values returned by the 'remove_outlier' function are stored in low and high variables.**

In [None]:
# After Removing Outliers
sns.set_style('whitegrid')
plt.rc('font', size=20)
plt.figure(figsize=(30,12))
sns.boxplot(data=df[['lead_time', 'days_in_waiting_list', 'adr']])
plt.show()

**From the above graph we can see that the extreme values in these columns are now capped at the IQR bounds.**

In [None]:
# Adding the column 'total_stay' using columns : 'stays_in_weekend_nights' & 'stays_in_week_nights'
df['total_stay'] = df['stays_in_weekend_nights'] + df['stays_in_week_nights']

**Here we calculate the total number of nights each guest stays and store it in a new column, total_stay.**

In [None]:
# Now check unique values for children and agent column
print(df['children'].unique())
print(df['agent'].unique())

**The output above shows that the variables children and agent have the float64 datatype.**

**Now change datatype from float to int**

In [None]:
# Changing datatype
df[['children', 'agent']] = df[['children', 'agent']].astype('int64')

**The variable reservation_status_date currently has the object datatype, so we change it to the datetime datatype.**

In [None]:
# Changing datatype of reservation_status_date
df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'], format='%Y-%m-%d')

**Now check whether the datatypes are changed or not**

In [None]:
# Checking datatype
df[['children', 'agent', 'reservation_status_date']].dtypes

**Create a column total_visitors and add the value of the columns adults, children, and babies in it.**

In [None]:
# Creating column total_visitors
df['total_visitors'] = df['adults'] + df['children'] + df['babies']

**Create a variable 'reserved_room_assigned' which describes whether the assigned room type matches the reserved room type**

In [None]:
# Creating reserved_room_assigned variable
df['reserved_room_assigned']=np.where(df['reserved_room_type']==df['assigned_room_type'],'Yes','No')

**Now create visitors_category from the variable total_visitors**

In [None]:
df['visitors_category']=np.where(df['total_visitors']==1, 'single', np.where(df['total_visitors']==2, 'Couple', 'Family'))
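With the nested `np.where` above, every value other than 1 or 2 falls into the last branch, so a booking with 0 visitors (if any exist) would be labeled 'Family'. The sketch below shows one possible alternative using `np.select` on made-up values; the extra 'No guests' label and the capitalized 'Single' are illustrative assumptions, not labels the notebook uses.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"total_visitors": [0, 1, 2, 3, 5]})

# Conditions are checked in order; the first match wins.
conditions = [
    toy["total_visitors"] == 0,
    toy["total_visitors"] == 1,
    toy["total_visitors"] == 2,
]
labels = ["No guests", "Single", "Couple"]

# Anything that matches none of the conditions (3 or more) is a family.
toy["visitors_category"] = np.select(conditions, labels, default="Family")
print(toy)
```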

**Now creating lead_time_category from the variable lead_time to display category**

In [None]:
# Creating 'lead_time_category' from the 'lead_time' variable to display category
df['lead_time_category']=np.where(df['lead_time']<=15, 'low',np.where((df['lead_time']>15) & (df['lead_time']<90), 'medium', 'high'))
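The same three bands can be written with `pd.cut`, which keeps the cut points in one list. This is a sketch under the assumption that lead times in the band range are whole numbers of days; with `right=False` the bins are [-inf, 16), [16, 90) and [90, inf), which then match the rule above (up to 15 is low, 16 to 89 is medium, 90 and above is high).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"lead_time": [0, 15, 16, 89, 90, 200]})

# Left-closed bins so that 15 stays 'low' and 90 becomes 'high'.
toy["lead_time_category"] = pd.cut(
    toy["lead_time"],
    bins=[-np.inf, 16, 90, np.inf],
    right=False,
    labels=["low", "medium", "high"],
).astype(str)

print(toy)
```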

**The meal column contains an 'Undefined' meal type, which means the same as 'SC' (no meal package). The code below removes the 'Undefined' rows so that only the four named meal types remain.**

In [None]:
# First check for unique meal types
df['meal'].unique()

In [None]:
# Finding the most prefered meal type in the form of percentage
df.meal.value_counts(normalize=True)*100

In [None]:
# Drop the rows with meal 'Undefined' because it means the same as SC
df=df.drop(df[df['meal']=='Undefined'].index)

In [None]:
df.meal.unique()
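If 'Undefined' really is equivalent to 'SC', relabeling those rows instead of dropping them would keep the bookings in the analysis. The sketch below shows that alternative on made-up values; it is not what the cell above does, and the resulting percentages would differ slightly from the ones reported in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"meal": ["BB", "Undefined", "SC", "HB", "Undefined"]})

# Merge the two equivalent labels; the number of rows is unchanged.
toy["meal"] = toy["meal"].replace("Undefined", "SC")

print(toy["meal"].value_counts())
```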

**From the percentages computed above, we can see that Bed & Breakfast (BB) is the most preferred meal type.**

**Now we want to find the most commonly booked hotel type as a percentage**

In [None]:
# Most common hotel booked
df.value_counts('hotel',normalize=True)*100

**We can see that the City Hotel is booked more often than the Resort Hotel**

**Now find the year with the highest number of bookings**


In [None]:
df.arrival_date_year.value_counts(normalize=True)*100

**We can clearly see that 2016 was the peak year for bookings. Note that 2016 is also the only year the dataset covers in full, which partly explains its larger share.**

**We now know that 2016 was the peak year. Next, we find the peak arrival month; note that the count below pools all years rather than only 2016.**

In [None]:
# Peak arrival month (all years pooled)
df.arrival_date_month.value_counts(normalize=True)*100

**August has the highest share of bookings.**
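The peak month of a single year can differ from the pooled result. The following sketch, on made-up rows, shows how a cross-tabulation separates the years; the toy numbers are chosen only to illustrate that difference and say nothing about the real data.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "arrival_date_year": [2015, 2016, 2016, 2016, 2017, 2017],
        "arrival_date_month": ["August", "May", "May", "August", "August", "August"],
    }
)

# Rows are months, columns are years, cells are booking counts.
monthly_by_year = pd.crosstab(toy["arrival_date_month"], toy["arrival_date_year"])

# Month with the most bookings within each year.
peak_month_per_year = monthly_by_year.idxmax()

# Month with the most bookings when all years are pooled.
pooled_peak = toy["arrival_date_month"].value_counts().idxmax()

print(monthly_by_year)
print(peak_month_per_year)
print(pooled_peak)
```

On the real data, filtering with `df[df['arrival_date_year'] == 2016]` before `value_counts` would answer the 2016-only question directly.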

**Now we want to find how many August bookings were for the city hotel and how many for the resort hotel**

In [None]:
df[df['arrival_date_month']=='August'].hotel.value_counts()

**Find from which country the maximum visitors arrived**


In [None]:
df.country.value_counts()

**We can see that Portugal, Great Britain, France, and Spain are the countries from which the most visitors arrived**

**Now, find which country has higher percentage of hotel bookings**

In [None]:
df['country'].value_counts(normalize=True).idxmax()

**A higher percentage of hotel bookings comes from the country of Portugal**

**Now We want to know which type of room is mostly assigned to the customers/Visitors.**


In [None]:
# Mostly assigned room type
df.assigned_room_type.value_counts(normalize=True)*100

**Now find which hotel has the higher booking cancellation rate**

In [None]:
# Highest booking Canceled
cancel=df[df['is_canceled']==1].groupby('hotel')
x1 = pd.DataFrame(cancel.size()).rename(columns={0:'Canceled_booking'})
total_booking=df.groupby('hotel')

x2 = pd.DataFrame(total_booking.size()).rename(columns={0:'Total_booking'})
result=pd.concat([x1,x2],axis=1)
result['Percentage_cancel']=round((result['Canceled_booking']/result['Total_booking'])*100,2)
result

**The city hotel has the higher percentage of canceled bookings.**
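Because `is_canceled` is coded as 0/1, its mean within each group is the cancellation rate, so the same percentage can be obtained in one step. This is a sketch on made-up rows, not the notebook's figures.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
        "is_canceled": [1, 1, 0, 0, 1, 0, 0, 0],
    }
)

# Mean of a 0/1 flag = share of ones; multiply by 100 for a percentage.
cancel_pct = (toy.groupby("hotel")["is_canceled"].mean() * 100).round(2)
print(cancel_pct)
```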


### What all manipulations have you done and insights you found?

**Data Analysis Insights:**

**Outlier Identification and Removal:**
- **Lead Time, ADR, and Days in Waiting List**: Identified outliers in the 'lead_time,' 'adr,' and 'days_in_waiting_list' columns and capped them at the IQR bounds to reduce the influence of extreme values on the analysis.

**Feature Engineering:**
- **Total Stay**: Created a new column 'total_stay' by summing 'stays_in_weekend_nights' and 'stays_in_week_nights'.
- **Total Visitors**: Added a 'total_visitors' column by summing the 'adults,' 'children,' and 'babies' columns.

**Key Findings:**
- **Meal Preferences**: Approximately 77% of customers chose the BB (Bed & Breakfast) meal type.
- **Hotel Type Preferences**: About 61% of customers booked the city hotel, which also had the highest average ADR.
- **Booking Trends**:
  - **Peak Year**: 2016 was the year with the highest number of bookings.
  - **Peak Month**: August emerged as the peak month for bookings.
- **Customer Origin**: The majority of customers were from Portugal, followed by Great Britain, France, and Spain.
- **Customer Type**: Approximately 72% of customers were of the Transient type.
- **First-Time Bookings**: Around 97% of customers were making their first booking at the hotel.
- **Room Assignments**: A-type rooms were assigned to approximately 52% of the customers.
- **Cancellation Rates**: The city hotel had the highest cancellation rate, at around 30%.


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

**Which Type of Hotel is Booked by the Customer?**

In [None]:
# Chart - 1 visualization code
# Use a count plot (bar chart) to plot the chart for 'Hotel Booking'

# Suppose df is our dataframe and 'hotel' is a column in it

# Seaborn
sns.set_style('whitegrid')

# Set the size
plt.figure(figsize=(8, 6))

# Create a countplot for the column 'hotel'
sns.countplot(x='hotel', data=df, palette = ['tab:orange', 'tab:blue'])

# Set the title and labels
plt.title('Hotel Booking')
plt.xlabel('Hotel')
plt.ylabel('Count')

# Show the plot
plt.show()


In [None]:
hotel_counts = df.hotel.value_counts()  # labels are taken from the index so they always match the counts
plt.pie(x=hotel_counts,explode=[0.01,0],labels=hotel_counts.index, autopct="%0.1f%%", textprops={'fontsize': 14})
plt.legend(bbox_to_anchor=(1, 1))
plt.title('Percentage of Hotel Bookings', fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart to show which type of hotel has the maximum number of bookings, and a pie chart to show each hotel type's percentage share of bookings.

##### 2. What is/are the insight(s) found from the chart?

From the above charts I found that the City Hotel accounts for a larger share of bookings than the Resort Hotel. The City Hotel's share is almost 61%, so we can say that most customers prefer to book the City Hotel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

One possible explanation is that the Resort Hotel is costlier than the City Hotel, or that its services are less attractive, so most customers prefer the City Hotel. The chart alone cannot confirm either reason. Still, the Resort Hotel could review its prices and offer promotions to attract more customers.

#### Chart - 2

**Which is the Most Preferred Meal Type?**

In [None]:
# Chart - 2 visualization code
# Using Seaborn Countplot
sns.set_style('whitegrid')
plt.figure(figsize=(10, 6))
sns.countplot(x=df['meal'], palette = ['tab:blue', 'tab:orange', 'tab:red', 'tab:green'])
plt.xlabel('Meal Type', fontsize=20)
plt.ylabel('Count', fontsize=20)
plt.title('Meal Type Distribution', fontsize=20)
plt.show()

In [None]:
# Using Pie Chart to Show the Percentage Distribution
colors = ['tab:blue', 'tab:red', 'tab:green',  'tab:orange']

# Create a pie chart
plt.figure(figsize=(8, 8))
meal_counts = df.meal.value_counts()  # labels are taken from the index so they always match the counts
plt.pie(x=meal_counts, labels=meal_counts.index, autopct="%0.1f%%", textprops={'fontsize': 14}, colors=colors)
plt.legend(bbox_to_anchor=(1, 1))
plt.title('Percentage Distribution of Meal Type', fontsize=20)
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart makes it easy to see which meal type customers prefer most, and a pie chart shows each meal type's percentage share, so I chose both.

##### 2. What is/are the insight(s) found from the chart?

The insight from the above charts is that the meal type customers prefer most is Bed & Breakfast. Almost 78% of customers choose it.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Bed & Breakfast is the most preferred meal option among customers, so hotel management should first make sure the quality of the BB meal does not decline. It could also offer deals that encourage customers to upgrade from BB to HB or FB, which could increase revenue. For customers who do not opt for any meal plan, management might encourage them to choose the BB meal through discounts or special offers.

#### Chart - 3

**Advanced Booking Cancellation?**

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10, 6))
sns.barplot(x='arrival_date_year', y='lead_time', hue='is_canceled', data=df)
plt.xlabel('Year', fontsize=20)
plt.ylabel('Lead Time', fontsize=20)
plt.title('Advanced Booking Cancellation', fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

I picked this bar plot to compare the average lead time of hotel bookings across different years, highlighting the differences between canceled and non-canceled bookings. The chart shows how average lead time changes from year to year, and the hue parameter separates canceled from non-canceled bookings so the two groups can be compared side by side within each year.

##### 2. What is/are the insight(s) found from the chart?

From the chart, the key insights are that the average lead time for hotel bookings has varied over the years, and that canceled bookings have a noticeably higher average lead time than non-canceled bookings. This suggests that guests who book further in advance are more likely to cancel their reservations, although the chart shows average lead times rather than cancellation rates directly.
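To check the cancellation-rate reading more directly, the rate can be computed per lead-time band. The sketch below uses made-up rows and illustrative band edges (15 and 90 days); it only demonstrates the calculation and does not reproduce the notebook's results.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "lead_time": [2, 5, 10, 40, 60, 80, 120, 200, 300],
        "is_canceled": [0, 0, 0, 0, 1, 0, 1, 1, 0],
    }
)

# Right-closed bins: (-1, 15], (15, 90], (90, inf).
toy["lead_bucket"] = pd.cut(
    toy["lead_time"],
    bins=[-1, 15, 90, np.inf],
    labels=["0-15", "16-90", "91+"],
)

# Mean of the 0/1 flag per band = cancellation rate of that band.
rate_by_bucket = toy.groupby("lead_bucket", observed=True)["is_canceled"].mean()
print(rate_by_bucket)
```

A rate that rises from band to band would support the claim that longer lead times go with more cancellations.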

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can help create a positive business impact by enabling the hotel to better manage booking policies and reduce cancellation rates. For example, the hotel can implement stricter cancellation policies or offer incentives for guests with longer lead times to confirm their bookings, thereby reducing the likelihood of cancellations.

#### Chart - 4

**Distribution of Bookings?**

In [None]:
df.market_segment.value_counts()

In [None]:
# Chart - 4 visualization code
# Using the Pie chart to find the percentage of booking distribution

plt.figure(figsize=(10, 10))
segment_counts = df.market_segment.value_counts()  # labels are taken from the index so they always match the counts
plt.pie(segment_counts, labels=segment_counts.index, autopct="%0.1f%%", textprops={'fontsize': 14})
plt.legend(bbox_to_anchor=(1, 1))
plt.title('Booking Distribution', fontsize=20)
plt.show()

In [None]:
# Chart - 4 visualization code (alternative styling with a shadow)


# Using a pie chart to find the percentage of each market segment
plt.figure(figsize=(10 ,12))
segment_counts = df.market_segment.value_counts()  # labels are taken from the index so they always match the counts
plt.pie(segment_counts,labels=segment_counts.index,autopct="%0.1f%%",shadow=True,textprops={'fontsize':13})
plt.legend(bbox_to_anchor=(2,1))
plt.title('Distribution through different channels')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

**Which Year Has the Highest Number of Bookings?**

In [None]:
# Chart - 5 visualization code
# Find which year has high no of bookings
# Using Bivariate Analysis

plt.figure(figsize=(14,7))
sns.set_style('whitegrid')
plt.rc('font', size=14)
sns.countplot(x='arrival_date_year', hue='hotel', data=df)
plt.legend()
plt.title('Highest Booking Year', fontsize=20)
plt.xlabel('Year', fontsize=20)
plt.ylabel('Count of Bookings', fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

I chose this count plot to easily identify which year had the highest number of bookings and to compare booking counts between the two types of hotels.

##### 2. What is/are the insight(s) found from the chart?

From the chart, the insights are clear:

1. The year 2016 shows the highest number of bookings compared to other years.
2. The chart also reveals how bookings are distributed between different types of hotels (likely differentiating between resorts and city hotels), providing insights into customer preferences across years.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from the chart, showing the highest bookings in 2016 and the distribution between hotel types, can help hotels optimize resources during peak periods for better customer service. However, relying too heavily on peak years may lead to missed opportunities in marketing and resource allocation during other times, potentially affecting revenue generation negatively.

#### Chart - 6

**Which Month Has the Highest Number of Bookings?**

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10, 6))

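# Plotting arrivals per month in calendar order, with a different color for each month
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
my_plot = sns.countplot(data=df, x='arrival_date_month', order=month_order, palette='husl')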
plt.title('Arrivals per month', fontweight='bold', size=20)
plt.xlabel('Month')
plt.ylabel('Count')

# Rotate x-axis labels for better readability
my_plot.set_xticklabels(my_plot.get_xticklabels(), rotation=45)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose this count plot to visualize the distribution of hotel arrivals across different months of the year. Using different colors for each month makes it easy to compare arrival counts visually. This chart effectively highlights seasonal variations in hotel occupancy, providing insights into peak and off-peak periods throughout the year.

##### 2. What is/are the insight(s) found from the chart?

From the chart, the insights include:

1. August shows the highest number of arrivals, indicating it may be a peak tourist season.
2. There is a noticeable variation in arrivals across months, suggesting seasonal trends in hotel occupancy.
3. Understanding these patterns can help hotels anticipate demand fluctuations, adjust pricing strategies, and optimize staffing and resource allocation accordingly.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from the chart, such as the peak arrivals in August and seasonal variations across months, can help hotels optimize operations and marketing strategies effectively. By adjusting pricing and promotions to match demand fluctuations, hotels can enhance customer satisfaction and operational efficiency during peak seasons. However, focusing exclusively on peak months may lead to challenges like overbooking during high-demand periods and underutilization of resources in slower months, potentially impacting overall profitability. It's important for hotels to balance strategies to ensure consistent business performance throughout the year.

#### Chart - 7

**How Many Days Do Most Guests Stay?**

In [None]:
# Chart - 7 visualization code
# Keep only bookings that were not canceled, and only stays shorter than 10 days
df_2 = df[df['is_canceled'] == 0]
stay = df_2[df_2['total_stay'] < 10]
plt.figure(figsize=(10, 6))
sns.countplot(data=stay, x='total_stay', hue='hotel')
plt.title('Length of Stay by Hotel Type (Stays Under 10 Days)')
plt.xlabel('Number of Days')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

I chose this count plot to visualize how long people typically stay in hotels, focusing on stays of less than 10 days. By using the hue parameter to differentiate between types of hotels, this chart effectively shows the distribution of stay durations across different hotel categories. This helps identify common stay lengths and any differences in stay patterns between hotel types.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most guests stay for a short duration, typically 1 to 3 days. The distribution of stay lengths is similar across hotel types, indicating a consistent short-stay pattern regardless of hotel category.
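
To put a number on "most guests stay 1 to 3 days", the share of short stays can be computed per hotel. This is only a sketch: the `toy` data is invented, `short_stay_share` is a hypothetical helper, and the 3-night cutoff is an assumption taken from the observation above.

```python
import pandas as pd

# Toy stand-in for the bookings DataFrame (illustrative values only)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel",
              "Resort Hotel", "Resort Hotel", "Resort Hotel"],
    "is_canceled": [0, 0, 1, 0, 0, 0],
    "total_stay": [2, 3, 1, 1, 7, 2],
})

def short_stay_share(frame: pd.DataFrame, max_nights: int = 3) -> pd.Series:
    """Share of non-canceled bookings lasting 1..max_nights days, per hotel."""
    kept = frame[frame["is_canceled"] == 0]
    is_short = kept["total_stay"].between(1, max_nights)  # inclusive on both ends
    return is_short.groupby(kept["hotel"]).mean()

share = short_stay_share(toy)
print(share)
```

Comparing the two values returned for the real `df` would show whether the stay patterns of the two hotel types are as similar as the chart suggests.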

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can help hotels improve services for short-stay guests, boosting satisfaction and bookings. However, focusing only on short stays may miss opportunities with long-term guests. Balancing strategies for both ensures steady revenue and broader market reach.

#### Chart - 8

**Find the Percentage of Repeat and Non-Repeat Customers?**

In [None]:
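# Chart - 8 visualization code
# Map the 0/1 flag to names so each slice is labeled from its own value, not from its position
repeat_counts = df['is_repeated_guest'].map({0: 'Non-Repeat', 1: 'Repeat'}).value_counts()
plt.figure(figsize=(10, 6))
plt.pie(repeat_counts, labels=repeat_counts.index, autopct="%0.1f%%", textprops={'fontsize': 14})
plt.legend(bbox_to_anchor=(1, 1))
plt.title('Percentage of Repeat and Non-Repeat Customers', fontsize=20)
plt.show()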

##### 1. Why did you pick the specific chart?

I chose this pie chart to visualize the proportion of repeat versus non-repeat customers. Pie charts are effective for showing percentage distributions, making it easy to see the relative size of each category.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that a significant majority of the customers are non-repeat guests, while a smaller portion are repeat guests. This highlights the hotel's reliance on attracting new customers over retaining existing ones.
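
The exact split can be computed directly rather than read from the pie. A small sketch follows, assuming `is_repeated_guest` is a 0/1 flag (1 meaning a returning guest); the `toy` values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the bookings DataFrame (illustrative values only)
toy = pd.DataFrame({"is_repeated_guest": [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]})

# Assumed encoding: 0 = first-time guest, 1 = returning guest
labels = {0: "Non-Repeat", 1: "Repeat"}

pct = (
    toy["is_repeated_guest"]
    .map(labels)                    # readable labels tied to the values themselves
    .value_counts(normalize=True)   # fractions instead of raw counts
    .mul(100)
    .round(1)
)
print(pct)
```

Because the labels come from the values, the percentages stay correctly labeled even if the ordering of the counts changes.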

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights show a majority of customers are non-repeat guests, indicating a need to boost customer retention. Implementing loyalty programs and improving customer satisfaction can increase repeat bookings and create a positive business impact. However, relying too much on new customers without retention strategies may lead to higher marketing costs and inconsistent revenue. Balancing efforts between acquiring new and retaining existing guests ensures sustainable growth.

#### Chart - 9

**Find the Type of Customers?**

In [None]:
# Chart - 9 visualization code
# Which type of customer has the most bookings
# Take the labels from the counts themselves so they always match the slices
customer_counts = df['customer_type'].value_counts()
plt.figure(figsize=(10, 6))
plt.pie(customer_counts, labels=customer_counts.index, autopct="%0.1f%%", textprops={'fontsize': 14})
plt.legend(bbox_to_anchor=(1, 1))
plt.title('Type of Customers', fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

I chose this pie chart to visualize the distribution of different customer types, as it effectively shows the proportion of each category. This helps in quickly identifying which customer type has the maximum bookings.

##### 2. What is/are the insight(s) found from the chart?

The pie chart shows that 'Transient' customers account for the largest share of bookings, ahead of the 'Transient-Party', 'Contract', and 'Group' customer types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights show that 'Transient' customers have the highest bookings, highlighting the need to tailor marketing and services to individual travelers. This can positively impact business by enhancing customer satisfaction and retention. However, focusing too much on 'Transient' customers may neglect other types, leading to missed revenue opportunities. Balancing efforts to attract all customer types ensures steady and diverse bookings.

#### Chart - 10

**Mostly Preferred Booking Channel/Distribution Channel**

In [None]:
# Chart - 10 visualization code
# Find which is the most preferred distribution_channel
plt.figure(figsize=(15,6))
sns.set_style('whitegrid')
plt.rc('font', size=14)
# One color per channel (the extra color='blue' argument conflicted with palette and is dropped)
sns.countplot(x='distribution_channel', data=df, palette=['tab:orange', 'tab:blue', 'tab:red', 'tab:green', 'tab:purple'])
plt.title('Distribution Channel', fontsize=20)
plt.xlabel('Types of Distribution Channels', fontsize=20)
plt.ylabel('Count of Bookings', fontsize=20)
plt.show()

In [None]:
# Using a pie chart to find the percentage distribution of distribution channels
# Labels come from the counts themselves; a hard-coded list can mislabel slices
channel_counts = df['distribution_channel'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(channel_counts, labels=channel_counts.index, autopct="%0.1f%%", textprops={'fontsize': 10})
plt.legend(bbox_to_anchor=(1, 1))
plt.title('Distribution through different channels', fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

I chose the count plot to visualize the number of bookings made through each distribution channel, which gives clear insight into customer preferences and booking patterns. I added the pie chart to show the percentage share of each channel, which makes their relative importance easy to compare.

##### 2. What is/are the insight(s) found from the chart?

The count plot and the pie chart show that 'TA/TO' is the most preferred distribution channel, followed by 'Direct' and 'Corporate', while 'GDS' and 'Undefined' account for only a very small share. This indicates where most bookings originate and where to focus when optimizing distribution strategies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding 'TA/TO' and 'Direct' dominance can enhance booking efficiency and revenue. However, relying too much on 'TA/TO' might limit diversification. Balancing across channels ensures steady growth.

#### Chart - 11

**What Is the Average Daily Rate (ADR) of Each Hotel by Month?**

In [None]:
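# Chart - 11 visualization code
# Put months in calendar order; lineplot draws the mean ADR per month for each hotel
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
adr_df = df.assign(arrival_date_month=pd.Categorical(df['arrival_date_month'], categories=month_order, ordered=True))
plt.figure(figsize=(10, 6))
sns.lineplot(x='arrival_date_month', y='adr', hue='hotel', data=adr_df)
plt.xticks(rotation=45)
plt.title('Average Daily Rate of Hotel', fontsize=20)
plt.xlabel('Month', fontsize=20)
plt.ylabel('Average Daily Rate', fontsize=20)
plt.show()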

##### 1. Why did you pick the specific chart?

I chose the line plot to visualize the average daily rate (ADR) of hotels across different months, as it effectively shows trends and comparisons over time.

##### 2. What is/are the insight(s) found from the chart?

The line plot reveals how ADR varies throughout the year for different hotel types ('Resort Hotel' and 'City Hotel'). It shows any seasonal patterns or differences in pricing strategies between the two types of hotels.
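
The monthly ADR values behind the lines can be tabulated for a closer comparison. The sketch below uses an invented `toy` DataFrame and a hypothetical helper `mean_adr_by_month`; it assumes month names are English strings, as in the chart.

```python
import calendar
import pandas as pd

# Toy stand-in for the bookings DataFrame (illustrative values only)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel",
              "Resort Hotel", "City Hotel", "Resort Hotel"],
    "arrival_date_month": ["January", "August", "January",
                           "August", "August", "August"],
    "adr": [80.0, 120.0, 50.0, 180.0, 140.0, 200.0],
})

month_order = list(calendar.month_name)[1:]

def mean_adr_by_month(frame: pd.DataFrame) -> pd.DataFrame:
    """Mean ADR per month (rows) and hotel (columns), rows in calendar order."""
    table = frame.pivot_table(index="arrival_date_month", columns="hotel",
                              values="adr", aggfunc="mean")
    # Keep only the months that occur in the data, in calendar order
    present = [m for m in month_order if m in table.index]
    return table.loc[present]

adr_table = mean_adr_by_month(toy)
print(adr_table)
```

On the real data, the gap between the two columns in each row would show in which months the resort and city hotels price furthest apart.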

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the seasonal variations in ADR can help hotels adjust pricing strategies to maximize revenue during peak seasons. This insight into pricing trends also aids in competitive positioning and profitability. However, relying too heavily on seasonal pricing adjustments might lead to customer dissatisfaction or loss of bookings during off-peak periods. Balancing pricing strategies throughout the year ensures consistent revenue generation and customer satisfaction.

#### Chart - 12

**Which Room Types Are Most Often Assigned to Customers?**

In [None]:
# Chart - 12 visualization code
# Find which type of room is mostly assigned to the customers
plt.figure(figsize=(10, 6))
sns.set_style('whitegrid')
plt.rc('font', size=14)
sns.countplot(x='assigned_room_type', data=df, hue='hotel')
plt.title('Which Type of Room Assigned to the Customers', fontsize=20)
plt.xlabel('Types of Room', fontsize=20)
plt.ylabel('Count of Bookings', fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

I chose the count plot to visualize which types of rooms are most frequently assigned to customers across different hotels, providing insight into room allocation patterns.

##### 2. What is/are the insight(s) found from the chart?

From the count plot, it's clear that certain room types are more commonly assigned than others across both 'Resort Hotel' and 'City Hotel'. This helps understand room preference and allocation strategies.
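
To name which room types dominate in each hotel, the counts can be turned into percentages per hotel. This is a sketch on invented `toy` data; the room codes shown are placeholders and not results from the dataset.

```python
import pandas as pd

# Toy stand-in for the bookings DataFrame (illustrative values only)
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "assigned_room_type": ["A", "A", "A", "D", "A", "A", "D", "E"],
})

# normalize="columns" makes each hotel's column sum to 1 before scaling to percent
room_share = (
    pd.crosstab(toy["assigned_room_type"], toy["hotel"], normalize="columns")
    .mul(100)
    .round(1)
)
print(room_share)
```

Normalizing within each hotel keeps the comparison fair when one hotel has many more bookings than the other.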

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding room preferences can improve guest satisfaction and booking rates. However, focusing too much on popular room types might overlook other rooms, affecting revenue. Balancing room assignments ensures efficient use of resources and maximizes overall satisfaction.

#### Chart - 13

**Hotel Booking Distribution by the Nationality of Customers?**

In [None]:
# Example visualization code for booking distribution by customer nationality
plt.figure(figsize=(12, 6))
sns.countplot(x='country', hue='hotel', data=df[df['country'].isin(df['country'].value_counts().head(10).index)], palette='husl')
plt.title('Booking Distribution by Customer Nationality', fontsize=20)
plt.xlabel('Customer Nationality', fontsize=14)
plt.ylabel('Count of Bookings', fontsize=14)
plt.xticks(rotation=45)
plt.legend(title='Hotel Type', loc='upper right')
plt.show()


##### 1. Why did you pick the specific chart?

I chose to visualize the booking distribution by customer nationality and hotel type to understand the geographic origins of customers and their preferences across different hotel categories.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals the top nationalities of customers booking each type of hotel. It shows where customers are predominantly coming from and which types of hotels they prefer based on their nationality.
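
The leading nationalities and their share of all bookings can be listed as follows. This is a rough sketch: the `toy` counts are invented, and `top_countries` is a hypothetical helper rather than part of the notebook.

```python
import pandas as pd

# Toy stand-in for the bookings DataFrame (illustrative values only)
toy = pd.DataFrame({
    "country": ["PRT"] * 4 + ["GBR"] * 3 + ["FRA"] * 2 + ["ESP"]
})

def top_countries(frame: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Top-n countries by bookings, with their share of all bookings in percent."""
    counts = frame["country"].value_counts()
    top = counts.head(n)
    share = (top / counts.sum() * 100).round(1)
    return pd.DataFrame({"bookings": top, "share_pct": share})

top3 = top_countries(toy, n=3)
print(top3)
```

Running `top_countries(df, n=10)` on the real data would give the same ten countries the chart plots, together with how much of the total they cover.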

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

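Knowing which countries send the most guests helps hotels direct marketing, language support, and promotions at their main source markets, which can raise bookings. However, depending heavily on a few nationalities leaves revenue exposed if demand from those countries drops, so it is worth cultivating other markets as well.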

#### Chart - 14 - Correlation Heatmap

In [None]:
df.columns

In [None]:
# Correlation Heatmap visualization code
dcor = df[['lead_time', 'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes', 'adr', 'required_car_parking_spaces', 'total_of_special_requests', 'total_stay', 'total_visitors']]
df_cor = dcor.corr()
f, ax = plt.subplots(figsize=(15, 8))
sns.heatmap(df_cor, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap', fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

I chose the correlation heatmap to visualize the relationships between various numerical variables in the dataset. This helps in understanding how different factors are related to each other.

##### 2. What is/are the insight(s) found from the chart?

The heatmap displays the correlation coefficients between variables such as lead time, previous cancellations, booking changes, average daily rate (ADR), and others. It shows which pairs of variables have strong positive or negative correlations, providing insights into potential relationships and dependencies.
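
The strongest relationships in a heatmap are easier to spot as a ranked list. The sketch below ranks variable pairs by absolute correlation; the `toy` numbers are invented and `ranked_correlations` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the numeric columns (illustrative values only)
toy = pd.DataFrame({
    "lead_time": [10, 50, 100, 200, 300],
    "total_stay": [1, 2, 3, 4, 5],
    "adr": [150.0, 120.0, 110.0, 90.0, 60.0],
})

def ranked_correlations(frame: pd.DataFrame) -> pd.Series:
    """Pairwise Pearson correlations, strongest (by absolute value) first."""
    corr = frame.corr()
    # Upper triangle only: each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

ranked = ranked_correlations(toy)
print(ranked)
```

Applied to the `dcor` columns, the top few rows would name the variable pairs that the heatmap colors most strongly.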

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(data = df[df['adr']<500][['hotel', 'total_stay', 'adr', 'total_visitors']], hue='hotel')

In [None]:
# Scatter Plot between adr and total_visitors
# y now uses total_visitors so the data matches the title and axis label
plt.figure(figsize=(20, 10))
sns.scatterplot(x='adr', y='total_visitors', data=df)
plt.title('ADR vs Total Visitors', fontsize=20)
plt.xlim(0, 300)
plt.xlabel('ADR', fontsize=14)
plt.ylabel('Total Visitors', fontsize=14)
plt.show()

##### 1. Why did you pick the specific chart?

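I chose the pair plot to see how total stay, ADR, and total visitors are distributed and how they relate to one another for each hotel type. The scatter plot then takes a closer look at ADR against a single variable.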

##### 2. What is/are the insight(s) found from the chart?

I found that July and August are the peak months for bookings.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

### **My Recommendations for Achieving Business Objectives:**

**1. Analyze Cancellation Rates**

- Investigate the high rate of cancellations, which may be linked to lenient deposit policies, and consider implementing stricter policies or mandatory deposits.

**2. Peak Month Targeting**
- Focus marketing efforts on the peak months of May to August, capitalizing on the summer travel season to maximize occupancy.

**3. Offseason Discounts**
- Offer special discounts during the offseason to attract more customers and maintain a steady flow of guests.

**4. Loyalty Program**
- Implement a membership plan with special discounts and benefits to retain customers and encourage repeat bookings.

**5. Social Media Advertising**
- Utilize social media platforms for advertising, especially during peak seasons, to reach a broader audience.

**6. Customer Feedback Analysis**
- Dedicate efforts to regularly analyze customer feedback to enhance facilities and services based on guest preferences and complaints.

**7. Direct Booking Promotion**
- Promote bookings through the company website or app by offering exclusive discounts and better control over the booking process.

**8. Market Segment Monitoring**
- Monitor where cancellations are coming from, such as specific market segments and distribution channels, to address underlying issues.

**9. Special Packages**
- Offer special packages that include meals and additional facilities to attract more travelers.

**10. Geographical Targeting**
- Since the majority of guests are from Western Europe, allocate a significant portion of the marketing budget to target this region.

**11. Group and Long Stay Promotions**
- Introduce attractive offers for group bookings and longer stays to boost occupancy and revenue.

**12. Travel Agency Collaborations**
- Collaborate with both online and offline travel agencies and other booking partners to expand customer reach.

**13. Focus on City Hotels**
- Spend the most on targeting city hotels, as they are the most frequently booked by customers.

**14. Repeat Guest Campaign**
- Given the low percentage of repeat guests, target advertisements to encourage previous guests to return.

**15. Non-Refundable Rates**
- Set non-refundable rates, collect deposits, and implement more rigid cancellation policies to secure bookings.

**16. Customer Origin**
- Tailor marketing strategies to attract more customers from countries with the highest number of bookings, such as Portugal, Great Britain, France, and Spain.

**17. Exclusive Member Deals**
- Create exclusive deals for members, such as room upgrades or complimentary services, to enhance the value of the membership program.

**18. Enhanced Customer Service Training**
- Invest in customer service training for staff to improve guest satisfaction and encourage positive reviews and repeat visits.

**19. User-Friendly Booking System**
- Ensure the booking system on the hotel’s website or app is user-friendly and mobile-optimized to facilitate easy bookings.

**20. Customer Referral Program**
- Introduce a referral program where existing customers can refer friends or family and receive discounts or rewards.
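
Several recommendations above depend on knowing where cancellations come from. As a minimal sketch, the cancellation rate can be broken down by any categorical column. The `toy` data is invented and `cancellation_rate_by` is a hypothetical helper; `distribution_channel` is used here, and other columns such as a market segment or deposit type are assumed to work the same way if they exist in `df`.

```python
import pandas as pd

# Toy stand-in for the bookings DataFrame (illustrative values only)
toy = pd.DataFrame({
    "distribution_channel": ["TA/TO", "TA/TO", "TA/TO", "TA/TO",
                             "Direct", "Direct", "Corporate", "Corporate"],
    "is_canceled": [1, 1, 1, 0, 0, 1, 0, 0],
})

def cancellation_rate_by(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Bookings and cancellation rate (percent) per category, highest rate first."""
    summary = frame.groupby(column)["is_canceled"].agg(
        bookings="count",     # number of bookings in the category
        cancel_rate="mean",   # mean of a 0/1 flag is the cancellation fraction
    )
    summary["cancel_rate"] = summary["cancel_rate"].mul(100).round(1)
    return summary.sort_values("cancel_rate", ascending=False)

rates = cancellation_rate_by(toy, "distribution_channel")
print(rates)
```

Keeping the booking count next to the rate helps avoid over-reading a high rate that comes from only a handful of bookings.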

# **Conclusion**

**In conclusion, the comprehensive analysis of the hotel booking dataset has provided valuable insights into customer behavior, booking patterns, and key factors influencing reservations. By leveraging this information, the hotel can implement targeted strategies to enhance customer satisfaction, reduce cancellations, and optimize occupancy rates.**

**Key findings indicate that special packages, flexible booking options, and loyalty programs can significantly attract and retain customers. Addressing high cancellation rates through stricter policies and focusing marketing efforts on peak seasons and high-demand regions will further drive revenue growth. Additionally, collaboration with travel agencies, social media advertising, and promoting direct bookings will expand the hotel's reach and improve booking control.**

**By continuously analyzing customer feedback and utilizing data-driven approaches, the hotel can stay ahead of market trends and offer tailored services that meet the evolving needs of their guests. Implementing these recommendations will not only enhance the guest experience but also position the hotel for sustained success in a competitive market.**
