# **Project Name**    -  **Hotel Booking Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

## **Exploratory Data Analysis (EDA) of Hotel Booking Data: Summary**

## Introduction
The hotel booking dataset provides a comprehensive view of various aspects related to hotel reservations, including customer demographics, booking patterns, and cancellation trends. This analysis aims to uncover insights that can help reduce customer churn, improve guest satisfaction, and optimize hotel operations.

## Dataset Overview
The dataset comprises several key columns: 'hotel', 'is_canceled', 'lead_time', 'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'meal', 'country', 'market_segment', 'distribution_channel', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'reserved_room_type', 'assigned_room_type', 'booking_changes', 'deposit_type', and 'agent'. These variables offer a detailed view of each booking, from the type of hotel to customer behaviors and booking details.

## Key Findings

**Booking Patterns and Lead Time:**

The lead_time column records the number of days between the booking date and the arrival date. Lead time is a natural candidate indicator of cancellation risk: bookings made far in advance leave more time for plans to change, while last-minute bookings are usually made by guests with firm plans. The direction and strength of this relationship should be confirmed by comparing cancellation rates across lead-time ranges.

**Cancellation Trends:**

Analysis of the is_canceled column shows significant cancellation rates, which vary between city hotels and resort hotels. Understanding these trends can help in formulating better cancellation policies and strategies to mitigate losses from canceled bookings. High cancellation rates often correlate with specific market segments or distribution channels.

**Guest Demographics:**

The dataset includes variables such as adults, children, and babies, which provide insights into the composition of guests. Families with children may have different needs and preferences compared to solo travelers or business travelers, affecting their satisfaction and likelihood of returning.

**Seasonality and Arrival Dates:**

Columns like arrival_date_year, arrival_date_month, and arrival_date_week_number help identify peak booking periods and seasonal trends. This information is crucial for resource planning, marketing strategies, and dynamic pricing models to maximize occupancy and revenue.

**Meal Preferences and Services:**

The meal column indicates the type of meal plan booked. Analysis shows trends in meal plan preferences, which can inform menu planning and service improvements to better cater to guest preferences.

**Market Segmentation:**

The market_segment and distribution_channel columns highlight where bookings are coming from (e.g., direct bookings, travel agencies, online platforms). This helps in understanding the effectiveness of different marketing channels and adjusting strategies accordingly.

**Repeat Guests and Loyalty:**

The is_repeated_guest column identifies loyal customers. Hotels can leverage this information to create personalized offers and loyalty programs, enhancing customer retention. Repeat guests typically have lower cancellation rates and higher lifetime value.

**Booking Changes and Flexibility:**

The booking_changes column captures modifications made to bookings. High rates of booking changes might indicate dissatisfaction or changes in guest plans. Offering flexible booking options and clear communication can improve guest satisfaction.

**Room Allocation and Preferences:**

Comparing reserved_room_type and assigned_room_type helps assess the effectiveness of room allocation policies. Discrepancies between these columns might indicate overbooking issues or guest dissatisfaction with room assignments.

**Regional Analysis:**

The country column allows for geographical segmentation of guests, revealing which regions or countries contribute most to bookings and which might have higher churn rates. This insight can guide targeted marketing and service improvements in specific regions.
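
**Recommendations:**

Recommendations based on these findings are listed in Section 5, "Solution to Business Objective".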


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW: Hotel Booking Analysis**

Predicting booking cancellations is extremely important for any hotel business, as it identifies customers who are likely to cancel their bookings.

In the hotel industry, customers can choose from multiple hotels and actively switch their bookings from one establishment to another. In this highly competitive market, the hotel industry experiences a significant rate of booking cancellations. Given that it costs significantly more to acquire a new customer than to retain an existing one, customer retention and minimizing cancellations have now become even more important than customer acquisition.

For many hotel operators, retaining high-value customers and reducing booking cancellations is the number one business goal. To achieve this, hotels need to predict which customers are at high risk of canceling their bookings. In this project, you will analyze customer-level booking data of a leading hotel chain and perform exploratory data analysis to identify the main indicators of why customers cancel their bookings.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reasons.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import folium
import plotly.express as px
%matplotlib inline

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Hotel Bookings.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
df.isnull().sum() / len(df)*100

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull())
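
The count and percentage checks above can be combined into one table that lists only the columns with gaps. The following is a minimal sketch, not part of the original notebook: the helper name `missing_summary` is hypothetical, and the small `demo` frame uses made-up values in place of the real dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually contain missing values
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Tiny illustrative frame (made-up values, not the real dataset)
demo = pd.DataFrame({
    "agent": [9.0, np.nan, 240.0, np.nan],
    "country": ["PRT", None, "GBR", "PRT"],
    "adr": [75.0, 98.0, 120.5, 60.0],
})
report = missing_summary(demo)
print(report)
```

On the notebook's data the same call would be `missing_summary(df)`.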

### What did you know about your dataset?

The dataset comes from the hotel industry and contains booking records for a city hotel and a resort hotel.

The above dataset has 119390 rows and 32 columns.

## ***2. Understanding Your Variables***

In [None]:
df.dtypes

In [None]:
# Dataset Columns
df.columns

In [None]:
cat_type = df.select_dtypes(include='object').columns

In [None]:
cat_type

In [None]:
num_type = df.select_dtypes(include=['int64','float64']).columns

In [None]:
num_type

In [None]:
# Dataset Describe
df.describe()

In [None]:
df.describe(include='object')

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df['hotel'].unique()

In [None]:
df['hotel'].value_counts()

In [None]:
df['is_canceled'].unique()

In [None]:
df['arrival_date_year'].unique()

In [None]:
df['meal'].unique()

In [None]:
df['market_segment'].unique()

In [None]:
df['distribution_channel'].unique()

In [None]:
df['children'].unique()

In [None]:
df['children'].value_counts()

## 3. ***Data Wrangling***

### Data Wrangling Code



Handle missing values

In [None]:
df.isnull().sum().sort_values(ascending = False)

In [None]:
df.isnull().sum().sort_values(ascending = False)[:4]

In [None]:
# Drop the company column
df.drop('company', axis = 1, inplace = True)

In [None]:
df['agent'].median()

In [None]:
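# Fill null values in the agent column with the median
# (assign the result back; inplace fillna on a selected column is deprecated in recent pandas)
df['agent'] = df['agent'].fillna(df['agent'].median())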

In [None]:
df['country'].mode()

In [None]:
# Fill null values in the country column with the mode
df['country'] = df['country'].fillna(df['country'].mode()[0])

In [None]:
df['children'].mean()

In [None]:
# Fill null values in the children column with the mean
df['children'] = df['children'].fillna(df['children'].mean())

Remove Duplicates

In [None]:
df.drop_duplicates(inplace = True)

In [None]:
df.shape

In [None]:
df.isnull().sum()
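
The missing-value and duplicate steps above can also be written as one function that returns a cleaned copy, which makes the notebook easier to re-run from any cell. This is a sketch under assumptions: the name `clean_bookings` is hypothetical, the `demo` frame holds made-up values, and the function also resets the index, which the notebook does later.

```python
import numpy as np
import pandas as pd


def clean_bookings(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the bookings; the input frame is not modified."""
    # Drop the company column if it is present
    out = frame.drop(columns="company", errors="ignore")
    # Same imputation choices as the notebook: median, mode, mean
    out = out.assign(
        agent=out["agent"].fillna(out["agent"].median()),
        country=out["country"].fillna(out["country"].mode()[0]),
        children=out["children"].fillna(out["children"].mean()),
    )
    # Remove exact duplicate rows and renumber the index
    return out.drop_duplicates().reset_index(drop=True)


# Tiny illustrative frame (made-up values, not the real dataset)
demo = pd.DataFrame({
    "agent": [9.0, np.nan, 9.0, 240.0],
    "country": ["PRT", None, "PRT", "GBR"],
    "children": [0.0, np.nan, 0.0, 2.0],
    "company": [np.nan, np.nan, np.nan, 40.0],
})
cleaned = clean_bookings(demo)
print(cleaned)
```

Because nothing is modified in place, `demo` still contains its missing values after the call.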

Converting columns to appropriate datatypes.

In [None]:
df[['children' ,'agent']] = df[['children','agent']].astype('int64')

In [None]:
df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'], errors='coerce')

In [None]:
df['reservation_status_date']

Add new column

In [None]:
df['total_stay'] = df['stays_in_weekend_nights']+df['stays_in_week_nights']

In [None]:
df.head(3)

In [None]:
df['total_people'] = df['adults']+df['children']+df['babies']

Sort data

In [None]:
# Sort data based on reservation status date
df.sort_values('reservation_status_date',inplace=True)

In [None]:
# Reset the index after sorting and discard the old index
df.reset_index(drop=True, inplace=True)

In [None]:
df.head(3)

Filter data based on conditions

In [None]:
df[df['hotel']=='City Hotel'].shape

In [None]:
df[df['hotel']=='Resort Hotel'].shape

In [None]:
df[df['reservation_status']=='Canceled'].shape

Group by

In [None]:
df.groupby('arrival_date_year')['total_people'].sum().reset_index()

In [None]:
df.groupby('hotel')['total_people'].sum().reset_index()

In [None]:
df.groupby(['hotel','arrival_date_year'])['total_people'].sum().reset_index()

In [None]:
df.head(1)

In [None]:
df.groupby(['hotel','arrival_date_month','arrival_date_year'])['total_people'].sum().reset_index()

In [None]:
df.groupby(['hotel','arrival_date_month','arrival_date_year'])['total_people'].max().reset_index().sort_values('arrival_date_month')
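
Sorting by `arrival_date_month` as plain text orders the months alphabetically (April, August, December, ...). One way to get calendar order is an ordered categorical, sketched below. It assumes the column stores full English month names; `order_months` is a hypothetical helper and the `demo` values are made up.

```python
import calendar

import pandas as pd

# Full English month names in calendar order (assumes the default C/English locale)
MONTH_ORDER = list(calendar.month_name)[1:]


def order_months(frame: pd.DataFrame, column: str = "arrival_date_month") -> pd.DataFrame:
    """Return a copy whose month column is an ordered categorical."""
    out = frame.copy()
    out[column] = pd.Categorical(out[column], categories=MONTH_ORDER, ordered=True)
    return out


# Tiny illustrative frame (made-up values, not the real dataset)
demo = pd.DataFrame({
    "arrival_date_month": ["August", "January", "April", "January"],
    "total_people": [2, 3, 1, 2],
})
monthly = (
    order_months(demo)
    .groupby("arrival_date_month", observed=True)["total_people"]
    .sum()
    .reset_index()
)
print(monthly)
```

The grouped result now lists January before April and August, and line or bar plots built from it follow the same order.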

### What all manipulations have you done and insights you found?

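The following manipulations were performed: the company column was dropped; missing values in agent, country, and children were filled with the median, mode, and mean respectively; duplicate rows were removed; children and agent were cast to integers and reservation_status_date was converted to a datetime; total_stay and total_people columns were added; and the data was sorted by reservation_status_date. Grouping by hotel, year, and month then gives guest totals for each hotel over time.

The goal of the analysis is to understand the patterns and behaviors of customers who cancel their bookings, and to extract insights that can help hotels improve their services, reduce cancellation rates, and increase customer satisfaction. The factors explored include booking lead time, average daily rate (ADR), stay duration, special requests, and booking channels.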
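
Lead time is named as a factor but is not compared with cancellations in the charts. A minimal sketch of one way to do that is to bucket `lead_time` and compute the cancellation rate per bucket. The bucket edges, labels, and function name are assumptions chosen for illustration, and the `demo` values are made up.

```python
import pandas as pd

# Hypothetical bucket edges in days; adjust them to the business question
LEAD_TIME_EDGES = [0, 7, 30, 90, 180, 365, float("inf")]
LEAD_TIME_LABELS = ["0-6", "7-29", "30-89", "90-179", "180-364", "365+"]


def cancel_rate_by_lead_time(frame: pd.DataFrame) -> pd.Series:
    """Cancellation rate in percent for each lead-time bucket that has bookings."""
    buckets = pd.cut(
        frame["lead_time"],
        bins=LEAD_TIME_EDGES,
        labels=LEAD_TIME_LABELS,
        right=False,  # each bucket includes its lower edge and excludes its upper edge
    )
    rates = frame.groupby(buckets, observed=True)["is_canceled"].mean()
    return (rates * 100).round(1)


# Tiny illustrative frame (made-up values, not the real dataset)
demo = pd.DataFrame({
    "lead_time":   [0, 3, 10, 20, 45, 200, 400, 400],
    "is_canceled": [0, 0, 0, 1, 1, 1, 1, 0],
})
rates = cancel_rate_by_lead_time(demo)
print(rates)
```

On the notebook's data, `cancel_rate_by_lead_time(df)` would show whether cancellations rise or fall as lead time grows.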

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
df.head(1)

#### Chart - 1

In [None]:
# Chart - 1 visualization code
sns.lineplot(y = 'adr',x = 'total_stay',data=df)
plt.title('ADR vs Total Stay')
plt.show()

##### 1. Why did you pick the specific chart?

A line plot was chosen because it shows how ADR changes as the total length of stay increases.

##### 2. What is/are the insight(s) found from the chart?

The chart connects the mean ADR at each stay length with a line, so the trend in ADR across stay lengths can be read directly.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.scatterplot(y = 'adr',x = 'total_stay',data=df)
plt.title('ADR vs Total Stay')
plt.show()

In [None]:
df1= df.copy()

In [None]:
# Drop bookings with an ADR above 5000, which are treated as outliers
df1.drop(df1[df1['adr'] > 5000].index, inplace = True)

In [None]:
plt.figure(figsize = (12,6))
sns.scatterplot(y = 'adr', x = 'total_stay', data = df1)
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot was chosen because it shows each booking as a point, which makes the relationship between ADR and total stay, and any outliers, visible.
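
The notebook removes outliers with a fixed cut-off of 5000 on `adr`. A data-driven alternative, shown here only as a sketch, is Tukey's interquartile-range (IQR) rule. This is a different technique from the one used above, and on real data it would typically flag more rows than a single extreme value. The function name and the `adr` values are made up for illustration.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) bounds using Tukey's IQR rule with multiplier k."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up ADR values with one extreme entry
adr = pd.Series([60.0, 75.0, 80.0, 95.0, 110.0, 5400.0], name="adr")
low, high = iqr_bounds(adr)
outliers = adr[(adr < low) | (adr > high)]
print(low, high)
print(outliers)
```

Whether to use a fixed threshold or the IQR rule depends on how aggressively unusual rates should be excluded from the charts.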

#### Chart - 3

In [None]:
# Chart - 3 visualization code
sns.countplot(x = 'hotel', hue = 'is_canceled', data = df)
plt.show()

##### 1. Why did you pick the specific chart?

A count plot was chosen because it shows the number of bookings in each category.

##### 2. What is/are the insight(s) found from the chart?

This chart shows the number of canceled and non-canceled bookings for each hotel type.
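
Raw counts depend on how many bookings each hotel receives, so a rate is easier to compare across hotels. The sketch below computes the cancellation rate per hotel; `cancellation_rate` is a hypothetical helper and the `demo` rows are made up.

```python
import pandas as pd


def cancellation_rate(frame: pd.DataFrame, by: str = "hotel") -> pd.DataFrame:
    """Number of bookings and cancellation rate (percent) for each group."""
    summary = frame.groupby(by)["is_canceled"].agg(bookings="size", cancel_rate="mean")
    # Convert the fraction of canceled bookings to a percentage
    summary["cancel_rate_pct"] = (summary.pop("cancel_rate") * 100).round(1)
    return summary.reset_index()


# Tiny illustrative frame (made-up values, not the real dataset)
demo = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 5,
    "is_canceled": [1, 1, 0, 0, 1, 0, 0, 0, 0],
})
summary = cancellation_rate(demo)
print(summary)
```

The same helper accepts other grouping columns, for example `cancellation_rate(df, by="market_segment")`.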

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10, 8))
sns.countplot(x='meal', data=df1)
plt.title('Bookings by Meal Plan')
plt.show()

#### Chart - 5

In [None]:
# Chart - 5 visualization code: What is the percentage of bookings in each hotel?
grouped_by_hotel = df.groupby('hotel')
# Share of all bookings contributed by each hotel, in percent
d1 = (grouped_by_hotel.size() / len(df) * 100).reset_index(name='Booking %')
plt.figure(figsize=(8, 5))
sns.barplot(x='hotel', y='Booking %', data=d1)
plt.title('Booking Percentage by Hotel', fontsize=16)
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot was chosen to compare the booking percentage of each hotel.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the share of total bookings made at the city hotel and at the resort hotel.

#### Chart - 6

In [None]:
# Chart - 6: Which hotel seems to make more revenue?
# ADR is a per-night rate, so this compares price levels rather than total revenue
d3 = grouped_by_hotel['adr'].mean().reset_index().rename(columns={'adr': 'avg_adr'})
plt.figure(figsize=(8, 5))
sns.barplot(x='hotel', y='avg_adr', data=d3)
plt.title('Average ADR by Hotel (proxy for revenue)', fontsize=16)
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot was chosen to compare the average daily rate (ADR) of the two hotel types, which is used here as a proxy for revenue.
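
ADR is a rate per night, so the hotel with the higher average ADR does not necessarily earn more in total. A rough estimate of room revenue, under the assumption that revenue is approximately `adr * total_stay` for bookings that were not canceled, can be sketched as follows. The function name is hypothetical, the `demo` values are made up, and deposits or fees are ignored.

```python
import pandas as pd


def estimated_revenue(frame: pd.DataFrame) -> pd.Series:
    """Estimated room revenue per hotel from non-canceled bookings (adr * nights)."""
    stayed = frame[frame["is_canceled"] == 0]
    revenue = stayed["adr"] * stayed["total_stay"]
    return revenue.groupby(stayed["hotel"]).sum()


# Tiny illustrative frame (made-up values, not the real dataset)
demo = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "adr": [100.0, 120.0, 80.0, 90.0],
    "total_stay": [2, 1, 5, 2],
    "is_canceled": [0, 1, 0, 0],
})
revenue = estimated_revenue(demo)
print(revenue)
```

In this made-up example the resort hotel has the lower ADR but the higher estimated revenue, because its guests stay longer.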

#### Chart - 7

In [None]:
# Chart - 7  What is preferred stay length in each hotel?
not_canceled = df[df['is_canceled'] == 0]
s1 = not_canceled[not_canceled['total_stay'] < 15]
plt.figure(figsize = (10,5))
sns.countplot(x = s1['total_stay'], hue = s1['hotel'])
plt.title('Preferred stay length in each hotel', fontsize=16)
plt.show()

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Which hotel has the longer waiting time?
d5 = grouped_by_hotel['days_in_waiting_list'].mean().reset_index().rename(columns={'days_in_waiting_list': 'avg_waiting_period'})
plt.figure(figsize=(8, 5))
sns.barplot(x='hotel', y='avg_waiting_period', data=d5)
plt.title('Which hotel has longer waiting time?', fontsize=16)
plt.show()

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Which is the most common channel for booking hotels?
group_by_dc = df.groupby('distribution_channel')
d1 = (group_by_dc.size() / len(df) * 100).round(2).reset_index(name='Booking_%')
plt.figure(figsize=(8, 8))
# One explode offset per slice, so the call works for any number of channels
plt.pie(x=d1['Booking_%'], labels=d1['distribution_channel'], autopct="%.2f%%", explode=[0.05] * len(d1))
plt.title("Booking % by distribution channels", fontsize=14)
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart was chosen because it shows each category's share of the whole.

##### 2. What is/are the insight(s) found from the chart?

This pie chart shows the share of bookings that comes through each distribution channel.
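
The shares behind the pie chart can also be printed as numbers, which is easier to read when some slices are very small. A minimal sketch using `value_counts(normalize=True)` follows; `channel_share` is a hypothetical helper and the `demo` values are made up.

```python
import pandas as pd


def channel_share(frame: pd.DataFrame, column: str = "distribution_channel") -> pd.Series:
    """Percentage of bookings per category, largest first."""
    return frame[column].value_counts(normalize=True).mul(100).round(2)


# Tiny illustrative frame (made-up values, not the real dataset)
demo = pd.DataFrame({
    "distribution_channel": ["TA/TO"] * 6 + ["Direct"] * 3 + ["Corporate"],
})
share = channel_share(demo)
print(share)
```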

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Which channel is mostly used for early booking of hotels?
group_by_dc = df.groupby('distribution_channel')
d2 = pd.DataFrame(round(group_by_dc['lead_time'].median(),2)).reset_index().rename(columns = {'lead_time': 'median_lead_time'})
plt.figure(figsize = (7,5))
sns.barplot(x = d2['distribution_channel'], y = d2['median_lead_time'])
plt.show()

#### Chart - 11

In [None]:
# Count bookings for each combination of stay length and hotel
stay = df1.groupby(['total_stay', 'hotel']).size().reset_index(name='Number of stays')
stay

In [None]:
# Chart - 11 visualization code
plt.figure(figsize = (10,5))
sns.barplot(x = 'total_stay', y = 'Number of stays',data= stay,hue='hotel')
plt.show()

#### Chart - 12

In [None]:
df.columns

In [None]:
# Chart - 12 visualization code
# kids = children + babies in the booking
df['kids'] = df['children'] + df['babies']
sns.barplot(x="kids", y="total_of_special_requests", data=df, palette='rainbow')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot was chosen to compare special requests across bookings with different numbers of kids (children plus babies).

##### 2. What is/are the insight(s) found from the chart?

The bar plot shows the average number of special requests for each number of kids in a booking.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(15, 10))
sns.barplot(x="adults", y="total_of_special_requests", data=df)
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot was chosen to compare special requests across bookings with different numbers of adults.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the average number of special requests for each number of adults in a booking.

#### Chart - 14 - Correlation Heatmap

In [None]:
num_df = df[['lead_time','previous_cancellations','previous_bookings_not_canceled','booking_changes','days_in_waiting_list','adr','required_car_parking_spaces','total_of_special_requests','total_stay','total_people']]

In [None]:
# Correlation Heatmap visualization code
corrmat = num_df.corr()
f, ax = plt.subplots(figsize=(12, 7))
sns.heatmap(corrmat,annot = True,fmt='.2f', annot_kws={'size': 10},  vmax=.8, square=True);

##### 1. Why did you pick the specific chart?

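A correlation heatmap shows the pairwise linear correlation between numerical columns in a single view. Columns such as 'is_canceled', 'arrival_date_year', 'arrival_date_week_number', 'arrival_date_day_of_month', 'is_repeated_guest', and 'agent' (as well as 'company', which was dropped during wrangling) are categorical data stored as numbers, so they are left out of the correlation check.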

##### 2. What is/are the insight(s) found from the chart?

Because the total_stay and total_people columns were added, the heatmap uses them in place of the adults, children, babies, stays_in_weekend_nights, and stays_in_week_nights columns.
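
To read the strongest relationships off a correlation matrix without scanning the heatmap, the upper triangle can be flattened and ranked by absolute value. The sketch below shows one way to do this; `top_correlations` is a hypothetical helper and the `demo` columns are made up.

```python
import numpy as np
import pandas as pd


def top_correlations(frame: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Keep each pair once by masking the diagonal and the lower triangle
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Tiny illustrative frame (made-up values, not the real dataset)
demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 11],
    "z": [5, 3, 4, 1, 2],
})
top = top_correlations(demo, n=2)
print(top)
```

On the notebook's data the call would be `top_correlations(num_df)`.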

#### Chart - 15 - Pair Plot

In [None]:
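# Pair Plot visualization code
# Use the numeric columns selected for the heatmap; a pair plot of every column in df is very slow
sns.pairplot(num_df)
plt.show()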

##### 1. Why did you pick the specific chart?

A pair plot was chosen because it shows the pairwise relationships between numerical columns.

##### 2. What is/are the insight(s) found from the chart?

The pair plot shows the relationship between every pair of numerical columns, along with the distribution of each column.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

**Solution to Reduce Customer Churn of Hotel Booking**


* **Personalize Offers and Discounts:**

    Use the lead_time to identify guests who book in advance and offer them early bird discounts.

* **Proactive Communication:**

    Use arrival_date_year, arrival_date_month, and arrival_date_day_of_month to send timely reminders and personalized messages before arrival.

* **Analyze Cancellation Trends:**

    Analyze the is_canceled, previous_cancellations, and previous_bookings_not_canceled columns to identify patterns and address common reasons for cancellations.
    
    Implement flexible cancellation policies and communicate them clearly to potential guests.

* **Reward Loyalty:**

    Identify is_repeated_guest and offer loyalty rewards or special discounts to encourage repeat bookings.

* **Monitor and Improve Service Quality:**

    Regularly analyze booking_changes to understand why guests modify their bookings and address any underlying issues.
  
    Ensure consistent, high-quality service, especially when the assigned_room_type differs from the reserved_room_type.

* **Optimize Room Allocation:**

    Analyze data on reserved_room_type and assigned_room_type to ensure guests get the rooms they prefer.

    Offer upgrades or room changes proactively when possible.
  
* **Enhance Family-Friendly Services:**

    Use adults, children, and babies columns to identify families and offer them special packages, amenities, and services.

* **Improve Dining Options:**

    Analyze meal preferences and ensure a variety of options to cater to different tastes and dietary requirements.

* **Regular Maintenance and Upgrades:**

    Ensure all facilities and services are well-maintained to prevent negative experiences that might lead to churn.
  
    Periodically renovate and upgrade rooms and common areas to meet guest expectations.


* **Stay Competitive:**

    Regularly analyze market trends and competitors to ensure your pricing and offerings remain attractive.
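
The room allocation suggestion above can be checked against the data by comparing `reserved_room_type` with `assigned_room_type`. The following sketch counts how often the assigned room differs from the reserved one and compares cancellation rates for the two groups. The function name and group labels are hypothetical, and the `demo` values are made up.

```python
import pandas as pd


def room_mismatch_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Bookings and cancellation rate, split by whether the assigned room differs."""
    changed = frame["reserved_room_type"] != frame["assigned_room_type"]
    # Readable group labels instead of True/False
    label = changed.map({True: "changed", False: "as reserved"})
    return (
        frame.assign(room_status=label)
        .groupby("room_status")["is_canceled"]
        .agg(bookings="size", cancel_rate="mean")
    )


# Tiny illustrative frame (made-up values, not the real dataset)
demo = pd.DataFrame({
    "reserved_room_type": ["A", "A", "D", "E", "A", "D"],
    "assigned_room_type": ["A", "D", "D", "E", "C", "D"],
    "is_canceled":        [1, 0, 0, 1, 0, 0],
})
summary = room_mismatch_summary(demo)
print(summary)
```

A large share of changed rooms, or a clear difference in cancellation rate between the two groups, would point to room allocation as an area worth investigating.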


# **Conclusion**

The exploratory data analysis of the hotel booking dataset provides valuable insights into customer behavior, booking patterns, and cancellation trends. By leveraging these insights, hotels can develop targeted strategies to reduce customer churn, enhance guest satisfaction, and optimize operational efficiency. Implementing personalized offers, flexible policies, and robust loyalty programs, along with maintaining high service standards and targeted marketing, can significantly improve the overall guest experience and retention rates.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***