<a href="https://colab.research.google.com/github/AnjanaAnoop/Hotel-Booking-Analysis-EDA-Project/blob/main/Hotel_Booking_Analysis_EDA_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Hotel Booking Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name -** Anjana K



# **Project Summary -**

The aim of the **Hotel Booking Analysis** project is to perform Exploratory Data Analysis (EDA) on a dataset that contains hotel booking information for the period from 2015 to 2017.


First, I loaded the given Hotel Booking dataset (Hotel Bookings.csv) into Google Colab and got a glimpse of the data using the head() and tail() functions. The Hotel Booking dataset has 119390 rows and 32 columns. I further explored the dataset and found that it contains booking information for a City Hotel and a Resort Hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.


The info() and describe() methods were used to get a concise summary of the dataframe and some basic statistical details of the numerical columns, respectively. The 31994 duplicate rows were dropped from the dataset. The custom function dataframe_info() was used to return a dataframe displaying the column datatypes, the count of unique values, and the count and percentage of missing values in each column. From this output, I found that 4 columns (company, agent, country, children) have missing values. The null values in the "agent" and "company" columns were replaced with 0, null values in the "children" column were replaced with the mode, and missing values in the "country" column were replaced with "others". Then the rows where the sum of the "adults", "children" and "babies" columns is zero were dropped. The unique values of the 32 variables in the dataset were identified. The outliers in the numerical columns were treated by overwriting extreme values with fixed limits. After these steps the data was considered clean.


In the data wrangling phase, the data types of some columns were corrected and new columns relevant to the analysis were added. Different charts were then used to explore and analyse the data in order to derive insights into the factors that govern hotel bookings.

# **GitHub Link -**

https://github.com/AnjanaAnoop

# **Problem Statement**


The Hotel Booking dataset contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. The objective of this project is to perform data analysis and visualization to explore the key factors that govern hotel bookings.

#### **Define Your Business Objective?**

The main objective of this project is to explore the key factors driving hotel bookings, namely:

*   To know which type of hotel has the highest booking percentage and the highest booking cancellation percentage.
*   To know which agent has made the most bookings.
*   To identify the meal type most preferred by customers.
*   To identify the room type most preferred by customers.
*   To know how lead time affects hotel bookings.
*   To know the optimal length of stay in both types of hotel.
*   To understand which year has the highest number of bookings and which is the busiest month of the year.
*   To identify the most preferred hotel type for weekend and weekday stays.
*   To observe how bookings vary over the year for different types of customers.
*   To identify the most common channel for booking hotels.
*   To know which hotel type has the highest ADR and the longest waiting time.
*   To find which hotel has the higher chance that its customers will return for another stay.
*   To identify which distribution channel brings better revenue-generating deals for hotels.
*   To find the origin country of most of the customers.










# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

First, we import the libraries and modules that are used in this analysis.

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

### Dataset First View

In [None]:
# Dataset First Look - Reading and viewing the csv file
hotel_df = pd.read_csv('/content/drive/MyDrive/Hotel Booking Analysis EDA/Hotel Bookings.csv')

In [None]:
# First 5 rows of the dataset
hotel_df.head()

In [None]:
# Last 5 rows of the dataset
hotel_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
row,column = hotel_df.shape
print("Total number of rows in dataset :",row)
print("Total number of columns in dataset :",column)


### Dataset Information

In [None]:
# Dataset Info - To get a concise summary of the dataframe
hotel_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
hotel_df.duplicated().value_counts()

In [None]:
# Dropping Duplicate values
hotel_df.drop_duplicates(inplace = True)

In [None]:
# Dataset shape after removing the duplicates
hotel_df.shape

In [None]:
# Resetting the index of the dataframe after duplicate removal
hotel_df.reset_index(drop=True,inplace=True)
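The duplicate handling above can be tried on a small table first. The following is a minimal sketch on invented rows (`toy_df` and its values are assumptions for illustration, not part of the hotel data); it shows that `duplicated()` flags only the repeats of an earlier row and that `drop_duplicates()` keeps the first occurrence.

```python
import pandas as pd

# Toy bookings table (hypothetical values) in which row 1 repeats row 0 exactly
toy_df = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 10, 45, 3],
    "adults": [2, 2, 1, 2],
})

# duplicated() marks every repeat of an earlier row (keep="first" is the default)
n_duplicates = int(toy_df.duplicated().sum())

# drop_duplicates() keeps the first occurrence; ignore_index=True renumbers the rows,
# which combines the drop and the index reset into one step
deduped_df = toy_df.drop_duplicates(ignore_index=True)

print("Duplicate rows:", n_duplicates)
print("Shape after removal:", deduped_df.shape)
```

Using `ignore_index=True` is one possible alternative to calling `reset_index(drop=True)` in a separate cell; both leave a clean 0..n-1 index.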

#### Missing Values/Null Values

In [None]:
# Defining the function "dataframe_info"
def dataframe_info(df):
  '''
  Returns a dataframe displaying the column datatypes,
  count of unique values and count & percent of missing values in each column of the dataframe
  '''
  info_df=df.isnull().sum().reset_index()
  info_df.rename(columns={'index':'Column Name',0:'NaN Count'},inplace=True)
  info_df['% of  NaN']=round((info_df['NaN Count']/len(df))*100,2)
  info_df['Data Type']=df.dtypes.values
  info_df['Unique Count']=df.nunique(axis=0).values
  info_df.sort_values(by=['NaN Count'], ascending=False, inplace = True, ignore_index= True)
  return(info_df)

In [None]:
# Calling the function "dataframe_info" with hotel_df as the input parameter
dataframe_info(hotel_df)

We can see that we have 4 columns (company, agent, country, children) with missing values. The columns “agent” and “company” have a high percentage of missing values whereas the columns “children” and “country” have a low percentage of missing values.

In [None]:
# Replacing null values in "agent" and "company" columns with 0
hotel_df[['company', 'agent']] = hotel_df[['company', 'agent']].fillna(0)

In [None]:
# Replacing null values in "children" column with mode value
# Assigning the filled column back instead of filling in place on a column selection
hotel_df['children'] = hotel_df['children'].fillna(hotel_df['children'].mode()[0])

In [None]:
# Replacing missing values in "country" column with "others"
# Assigning the filled column back instead of filling in place on a column selection
hotel_df['country'] = hotel_df['country'].fillna('others')

Let's again check the missing values.

In [None]:
# Re-check if the hotel dataframe has any null values to ensure that all modifications are in place
dataframe_info(hotel_df)

Perfect! Now we don't have any missing values.
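The three fill rules used above (0 for "agent" and "company", the mode for "children", "others" for "country") can also be expressed as a single dictionary passed to `fillna`. This is only a sketch on an invented frame; `toy_df`, `fill_values` and the sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in the same four columns
toy_df = pd.DataFrame({
    "agent": [9.0, np.nan, 240.0],
    "company": [np.nan, np.nan, 40.0],
    "children": [0.0, np.nan, 0.0],
    "country": ["PRT", None, "GBR"],
})

# One fill value per column, mirroring the rules described in this section
fill_values = {
    "agent": 0,
    "company": 0,
    "children": toy_df["children"].mode()[0],  # most frequent value
    "country": "others",
}

# fillna() with a dict fills each listed column with its own value
filled_df = toy_df.fillna(value=fill_values)

# Total number of missing cells left after filling
remaining_nulls = int(filled_df.isna().sum().sum())
print("Remaining missing values:", remaining_nulls)
```

Keeping the rules in one dictionary makes it easy to see at a glance how each column is treated.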

In [None]:
# Getting the number of rows where sum of "adults","children" and "babies" columns is zero
hotel_df[hotel_df['adults'] + hotel_df['children'] + hotel_df['babies'] == 0].shape

In [None]:
# Dropping the rows where sum of "adults","children" and "babies" columns is zero
hotel_df.drop(hotel_df[hotel_df['adults'] + hotel_df['children'] + hotel_df['babies'] == 0].index, inplace = True)
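The zero-guest filter can be sketched with a row-wise sum and a boolean mask. The frame below is invented for illustration (`toy_df` and `guest_count` are assumed names), and the approach is an alternative way of writing the same condition, not a replacement for the cell above.

```python
import pandas as pd

# Toy frame (hypothetical values); rows 1 has no guests at all
toy_df = pd.DataFrame({
    "adults": [2, 0, 1, 0],
    "children": [0, 0, 1, 0],
    "babies": [0, 0, 0, 1],
})

# Row-wise total of the three guest columns
guest_count = toy_df[["adults", "children", "babies"]].sum(axis=1)

# Keep only the bookings that have at least one guest
with_guests_df = toy_df[guest_count > 0].reset_index(drop=True)

print("Rows before:", len(toy_df), "| rows after:", len(with_guests_df))
```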

### What did you know about your dataset?

The Hotel Booking dataset has 119390 rows and 32 columns. There are 31994 duplicate rows. After removing the duplicates, there are 87396 rows and 32 columns in the dataset. We found that 4 columns (company, agent, country, children) have missing values. We replaced null values in the "agent" and "company" columns with 0, null values in the "children" column with the mode, and missing values in the "country" column with "others". Then we dropped the rows where the sum of the "adults", "children" and "babies" columns is zero.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
hotel_df.columns

In [None]:
# Dataset Describe - To get some basic statistical details of the numerical columns
hotel_df.describe()

### Variables Description

**The columns and the data they represent are listed below :**

1.   **hotel** : Type of hotel (City Hotel or Resort Hotel)

2.   **is_canceled** : (1) for cancelled, (0) for not cancelled

3.   **lead_time** : Number of days between a guest confirming a reservation at the hotel, and their arrival date.

4.   **arrival_date_year** : Year of arrival date (2015, 2016, 2017)

5.   **arrival_date_month** : Month of arrival date (January to December)

6.   **arrival_date_week_number** : Week number of arrival date (1 - 53)

7.   **arrival_date_day_of_month** : Day of arrival date (1 - 31)

8.   **stays_in_weekend_nights** : Number of weekend nights spent at the hotel by the guests

9.   **stays_in_week_nights** : Number of week nights spent at the hotel by the guests

10.  **adults** : Number of adults among the guests

11.  **children** : Number of children among the guests

12.  **babies** : Number of babies among the guests

13.  **meal** : Type of meal booked by the guests. (**BB** - Bed & Breakfast, **FB** - Full Board (breakfast, lunch & dinner), **HB** - Half Board (breakfast and one other meal, usually dinner), **SC** - Self Catering (no meal package))

14.  **country** : Origin country of guests

15.  **market_segment** : Designation of the market segment. (Direct, Corporate, Online TA, Offline TA/TO, Complementary, Groups, Aviation)

16.  **distribution_channel** : Name of booking distribution channel. (Direct, Corporate, TA/TO, GDS)

17.  **is_repeated_guest** : (1) if the booking is from a repeated guest else (0)

18.  **previous_cancellations** : Number of previous bookings that were cancelled by the customer prior to the current booking

19.  **previous_bookings_not_canceled** : Number of previous bookings that were not cancelled by the customer prior to the current booking

20.  **reserved_room_type** : Code of room type reserved

21.  **assigned_room_type** : Code of room type assigned. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request.

22.  **booking_changes** : Number of changes/amendments made to the booking from the moment the booking was entered on the system until the moment of check-in or cancellation.

23. **deposit_type** : Indicates whether the customer made a deposit to guarantee the booking. This variable can take three categories: No Deposit, Non Refund, Refundable.

24. **agent** : ID of travel agent who made the booking.

25. **company** : ID of the company that made the booking.

26. **days_in_waiting_list** : Number of days the booking was in the waiting list.

27. **customer_type** : This variable shows 4 categories of guests, namely Contract, Group, Transient and Transient-party.

28. **adr** : This variable shows the Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights.

29. **required_car_parking_spaces** : Number of car parking spaces required by the customer.

30. **total_of_special_requests** : Number of special requests made by the customer (e.g. twin bed, high floor).

31. **reservation_status** : Reservation status of the customer. (Check-Out, Canceled, No-Show)

32. **reservation_status_date** : Date on which the reservation status got last updated




### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in hotel_df.columns:
  if hotel_df[col].nunique()<500:
    print(f"The unique values in {col} column are :")
    print(hotel_df[col].unique())
    print('\n')

### **Outlier treatment**

In [None]:
# Creating a boxplot for Outlier detection
columns = ['lead_time', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'required_car_parking_spaces', 'adr', 'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes']
n = 1
plt.figure(figsize=(20,15))

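for column in columns:
  plt.subplot(4,4,n)
  n = n+1
  # Passing the column explicitly as x draws one horizontal boxplot per subplot
  sns.boxplot(x=hotel_df[column])
  plt.title('Checking for outliers in {}'.format(column))

# Adjusting the subplot spacing once, after all boxplots are drawn
plt.tight_layout()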

The dataset contains outliers. Rather than dropping these rows, let's treat the outliers by using conditional selection to overwrite the extreme values with fixed limits.

In [None]:
# Overwriting values beyond a chosen threshold in each numerical column
hotel_df.loc[hotel_df.lead_time > 500, 'lead_time'] = 500
hotel_df.loc[hotel_df.stays_in_weekend_nights >=  5, 'stays_in_weekend_nights'] = 5
hotel_df.loc[hotel_df.adults > 4, 'adults'] = 4
hotel_df.loc[hotel_df.previous_bookings_not_canceled > 0, 'previous_bookings_not_canceled'] = 1
hotel_df.loc[hotel_df.previous_cancellations > 0, 'previous_cancellations'] = 1
hotel_df.loc[hotel_df.stays_in_week_nights > 10, 'stays_in_week_nights'] = 10
hotel_df.loc[hotel_df.booking_changes > 5, 'booking_changes'] = 5
hotel_df.loc[hotel_df.babies > 8, 'babies'] = 0
hotel_df.loc[hotel_df.required_car_parking_spaces > 5, 'required_car_parking_spaces'] = 0
hotel_df.loc[hotel_df.children > 8, 'children'] = 0
hotel_df.loc[hotel_df.adr > 1000, 'adr'] = 1000

In [None]:
hotel_df.describe()

We have treated the outliers by overwriting the extreme values. Our data is clean now.
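Several of the assignments above cap a column at an upper limit (for example `lead_time` at 500 and `adr` at 1000). As a hedged sketch, the same kind of capping can be written with `Series.clip`; the rows in `toy_df` are invented, and `upper_limits` is an assumed helper name that reuses only those two thresholds.

```python
import pandas as pd

# Toy frame (hypothetical values) with one extreme value per column
toy_df = pd.DataFrame({
    "lead_time": [12, 320, 709],
    "adr": [95.0, 5400.0, 180.5],
})

# Upper limits taken from two of the thresholds used in this section
upper_limits = {"lead_time": 500, "adr": 1000}

capped_df = toy_df.copy()
for column, limit in upper_limits.items():
    # clip(upper=...) replaces anything above the limit with the limit itself
    capped_df[column] = capped_df[column].clip(upper=limit)

print(capped_df)
```

Note that capping keeps every row; it only changes the extreme values, so the row count is unchanged. Assignments that reset a value to 0 or to 1 behave differently and are not covered by `clip`.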

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Correcting the datatype of "agent", "children" and "company" columns
hotel_df[['children', 'company', 'agent']] = hotel_df[['children', 'company', 'agent']].astype('int64')

In [None]:
# Adding "stays_in_weekend_nights" & "stays_in_week_nights" to get total stay duration
hotel_df['total_stay'] = hotel_df['stays_in_weekend_nights'] + hotel_df['stays_in_week_nights']

In [None]:
# Adding "adults","children" and "babies" columns into "total_people" column
hotel_df['total_people'] = hotel_df['adults'] + hotel_df['children'] + hotel_df['babies']

### What all manipulations have you done and insights you found?

The Hotel Booking dataset has 119390 rows and 32 columns. There are 31994 duplicate rows. After removing the duplicates, there are 87396 rows and 32 columns in the dataset. We found that 4 columns (company, agent, country, children) have missing values. We replaced null values in the "agent" and "company" columns with 0, null values in the "children" column with the mode, and missing values in the "country" column with "others". Then we dropped the rows where the sum of the "adults", "children" and "babies" columns is zero. The columns in the dataset were identified and the data that each column represents was described.

The unique values of the 32 variables were identified. The outliers in the numerical variables were treated by overwriting extreme values. The datatypes of the "agent", "children" and "company" columns were changed from float to integer. A new column 'total_stay', which is the sum of 'stays_in_weekend_nights' and 'stays_in_week_nights', was added. Another new column 'total_people', which is the sum of the 'adults', 'children' and 'babies' columns, was added. The data is clean now and is ready for analysis.



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

**Q1) Which type of hotel has the highest booking percentage and the highest booking cancellation percentage?**

In [None]:
# Chart - 1 visualization code
grouped_by_hotel = hotel_df.groupby('hotel')
booking_df = pd.DataFrame(grouped_by_hotel.size()).reset_index().rename(columns = {0: 'total_bookings'})
booking_df['book_%'] = round((booking_df['total_bookings'] / hotel_df['hotel'].count()) * 100,2)
plt.figure(figsize = (8,5))
sns.barplot(x = booking_df['hotel'], y = booking_df['book_%'] )
plt.title('Highest Booking Percentage by Hotel Type', fontsize=15)
plt.xlabel('Hotel Type', fontsize=10)
plt.ylabel('Booking %', fontsize=10)
plt.show()
booking_df

In [None]:
# Chart - 1 visualization code
cancel = hotel_df[hotel_df['is_canceled'] == 1]
cancel_grp = cancel.groupby('hotel')
cancellation_df = pd.DataFrame(cancel_grp.size()).reset_index().rename(columns = {0:'total_cancelled_bookings'})
cancellation_df['cancel_%'] = round((cancellation_df['total_cancelled_bookings'] / booking_df['total_bookings']) * 100,2)
plt.figure(figsize = (8,5))
sns.barplot(x = cancellation_df['hotel'], y = cancellation_df['cancel_%'] )
plt.title('Highest Booking Cancellation Percentage by Hotel Type', fontsize=15)
plt.xlabel('Hotel Type', fontsize=10)
plt.ylabel('Cancellation %', fontsize=10)
plt.show()
cancellation_df

##### 1. Why did you pick the specific chart?

A bar chart is used for fast data exploration and for comparing discrete categories. The X axis of the plot represents the categories being compared (Hotel Type), while the Y axis represents the measured values (Booking % and Cancellation %) for those categories. The advantage of bar charts over other chart types is that the human eye has a refined ability to compare the length of objects, as opposed to angle or area.


##### 2. What is/are the insight(s) found from the chart?


*   There are two types of hotel in the dataset: City Hotel and Resort Hotel. City Hotel is preferred over Resort Hotel.
*   The booking percentage for City Hotel is 61.07%, whereas for Resort Hotel it is 38.93%.
*   The booking cancellation percentage for City Hotel is 30.10%, whereas for Resort Hotel it is 23.48%.
*   Hence we can conclude that City Hotel has both the highest booking percentage and the highest booking cancellation percentage.
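The two percentages behind these insights can be computed in one grouped summary. The sketch below uses invented rows (`toy_df` and `summary_df` are assumed names, and the numbers are not the project's results); it defines the booking share as a hotel's bookings over all bookings, and the cancellation rate as a hotel's cancelled bookings over that hotel's own bookings.

```python
import pandas as pd

# Toy frame (hypothetical values): 4 City Hotel bookings and 2 Resort Hotel bookings
toy_df = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 2,
    "is_canceled": [1, 0, 0, 1, 0, 1],
})

# Named aggregation: number of bookings and number of cancelled bookings per hotel
summary_df = toy_df.groupby("hotel")["is_canceled"].agg(
    total_bookings="count",
    cancelled_bookings="sum",
)

# Share of all bookings that each hotel received
summary_df["book_%"] = (summary_df["total_bookings"] / len(toy_df) * 100).round(2)

# Share of each hotel's own bookings that were cancelled
summary_df["cancel_%"] = (
    summary_df["cancelled_bookings"] / summary_df["total_bookings"] * 100
).round(2)

print(summary_df)
```

Because both columns live in one frame indexed by hotel, the division is aligned by hotel name rather than by row position.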


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Customers are the lifeblood of every hotel business, as they account for a significant share of a hotel's revenue. The goal of every hotel owner is to attract customers (both new and returning) to their property because this increases the occupancy rate and the hotel's revenue. Hotels can get to know their guests and offer them free vouchers for the cinema, museums and other attractions, give free upgrades, accept early check-in, offer pickup and shuttle services, maintain excellent room service and embrace special events. One way to attract more customers is to ensure that every member of staff provides guests with an unforgettable experience.

City Hotel can look for more services that attract guests and increase its revenue. It can also analyse the probable reasons for its high booking cancellation percentage.

Resort Hotel can look for ways to address its low booking percentage and try to identify which facilities offered by City Hotel attract guests.

#### Chart - 2

**Q2) Which agent has made the most bookings?**

In [None]:
# Chart - 2 visualization code
# Counting bookings per agent; value_counts() already sorts the counts in descending order
top_bookings_by_agent = hotel_df['agent'].value_counts().rename_axis('agent').reset_index(name = 'num_of_bookings')
top_bookings_by_agent.drop(top_bookings_by_agent[top_bookings_by_agent['agent'] == 0].index, inplace = True) # Dropping the rows where the agent number is 0
top_bookings_by_agent = top_bookings_by_agent[:10] # Selecting top 10 performing agents
plt.figure(figsize = (10,5))
sns.barplot(x = 'agent', y = 'num_of_bookings', data = top_bookings_by_agent, order = top_bookings_by_agent.sort_values('num_of_bookings', ascending = False).agent)
plt.title('Most Number of Bookings by Agent', fontsize=15)
plt.xlabel('Agent number', fontsize=10)
plt.ylabel('Number of bookings', fontsize=10)
plt.show()
top_bookings_by_agent

##### 1. Why did you pick the specific chart?

A bar chart is used for fast data exploration and for comparing discrete categories. The X axis of the plot represents the categories being compared (Agent Number), while the Y axis represents the measured values (Number of bookings) for those categories. The advantage of bar charts over other chart types is that the human eye has a refined ability to compare the length of objects, as opposed to angle or area.

##### 2. What is/are the insight(s) found from the chart?

*   Agent number 9 has made the most bookings (28721).
*   Among the top 10 agents shown, agent number 6 has made the fewest bookings (1117).



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The good performance of agent number 9 can create a positive business impact. Agents 241, 28, 8, 1 and 6 need to put in more effort to increase their number of bookings so that the business is not negatively affected.

#### Chart - 3

**Q3) Which is the most preferred meal type by the customers?**

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(14,7))
sns.countplot(x=hotel_df['meal'],order=hotel_df['meal'].value_counts().index)
plt.title("Most Preferred Meal Type", fontsize = 20)
plt.xlabel('Meal Type', fontsize = 15)
plt.ylabel('Count', fontsize = 15)

##### 1. Why did you pick the specific chart?

A count plot is used here to show the counts of observations in each categorical bin using bars.

##### 2. What is/are the insight(s) found from the chart?

Bed & Breakfast (BB) is the meal type most preferred by customers, followed by SC (no meal package), HB (Half Board), Undefined and FB (Full Board).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels with a Bed & Breakfast (BB) package can attract more guests, as this is the most preferred meal type. They can consider offering discounts and offers on the BB package in order to attract more guests. Hotels without a BB package can try to include this meal type for the same reason.

#### Chart - 4

**Q4) Which is the most preferred room type by the customers?**

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(18,8))
plt.subplot(1, 2, 1)
sns.countplot(x=hotel_df['assigned_room_type'],order=hotel_df['assigned_room_type'].value_counts().index, palette ='husl')
plt.title("Most Preferred Room Type", fontsize = 20)
plt.xlabel('Room Type', fontsize = 15)
plt.ylabel('Count', fontsize = 15)

plt.subplot(1, 2, 2)
sns.boxplot(x = 'assigned_room_type', y='adr', data = hotel_df)
plt.title('ADR for each room type',fontweight="bold", size=20)
plt.subplots_adjust(right=1.7)

##### 1. Why did you pick the specific chart?

A count plot is used here to show the counts of observations in each categorical bin using bars.

##### 2. What is/are the insight(s) found from the chart?

"A" is the most preferred room type and "L" is the least preferred room type. However, the room types that generate better ADR are H, G and F.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

"A" type rooms are the most preferred, which has a positive impact on the business. "H", "I", "K" and "L" type rooms are less preferred and can have a negative impact. Hotels can try to include more "A" type rooms to maximize their revenue.

#### Chart - 5

**Q5) How does lead time affect hotel bookings?**

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(12,6))
sns.barplot(x='arrival_date_year', y='lead_time',hue='is_canceled', data= hotel_df, palette='vlag')
plt.title("Arriving year, Leadtime and Cancellations", fontsize = 20)
plt.xlabel('Arrival Year', fontsize = 15)
plt.ylabel('Lead Time', fontsize = 15)

##### 1. Why did you pick the specific chart?

A bar chart is used for fast data exploration and for comparing discrete categories. The X axis of the plot represents the categories being compared (Year), the Y axis represents the measured values (Lead Time) for those categories, and the is_canceled variable is used as the hue parameter. The advantage of bar charts over other chart types is that the human eye has a refined ability to compare the length of objects, as opposed to angle or area.

##### 2. What is/are the insight(s) found from the chart?

In all three years (2015-17), bookings with a lead time of less than 100 days were less likely to be cancelled, while bookings with a lead time of more than 100 days were more likely to be cancelled.
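One way to check a threshold claim like this directly is to split lead time into two bands and compare the cancellation rate in each. The sketch below runs on invented rows (`toy_df`, the band labels and the values are assumptions, not results from the hotel data), and it assumes that a lead time of exactly 100 days belongs to the lower band.

```python
import pandas as pd

# Toy frame (hypothetical values): lead time in days and the cancellation flag
toy_df = pd.DataFrame({
    "lead_time": [5, 40, 99, 100, 150, 300, 20, 250],
    "is_canceled": [0, 0, 1, 0, 1, 0, 0, 1],
})

# pd.cut assigns each booking to a band; the bins are right-inclusive by default,
# so 100 falls in the first band (an assumption made for this sketch)
toy_df["lead_time_band"] = pd.cut(
    toy_df["lead_time"],
    bins=[-1, 100, float("inf")],
    labels=["up to 100 days", "over 100 days"],
)

# The mean of a 0/1 flag is the share of cancelled bookings in each band
cancel_rate = toy_df.groupby("lead_time_band", observed=True)["is_canceled"].mean()

print(cancel_rate)
```

Comparing rates per band complements the bar chart, which shows the average lead time of cancelled and non-cancelled bookings rather than the cancellation rate itself.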

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels can formulate certain policies on the lead time as it is observed that bookings with lead time more than 100 days have more chances of being cancelled.

#### Chart - 6

**Q6) What is the optimal length of stay in both types of hotel?**

In [None]:
# Chart - 6 visualization code
confirmed_bookings = hotel_df[hotel_df['is_canceled'] == 0]
stay_length = confirmed_bookings.groupby(['total_stay','hotel']).agg('count').reset_index()
stay_length = stay_length.iloc[:, :3]
stay_length = stay_length.rename(columns={'is_canceled':'Number of stays'})
stay_length

In [None]:
plt.figure(figsize=(14,7))
sns.barplot(x='total_stay',y='Number of stays', data=stay_length,hue='hotel')
plt.title('Optimal Stay Length in Both Hotel Types', fontsize=20)
plt.xlabel('Total Stay in days',fontsize=15)
plt.ylabel('Count of stays', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is used for fast data exploration and for comparing discrete categories. The X axis of the plot represents the categories being compared (Total Stay in days), the Y axis represents the measured values (Count of stays) for those categories, and the hotel type variable is used as the hue parameter. The advantage of bar charts over other chart types is that the human eye has a refined ability to compare the length of objects, as opposed to angle or area.

##### 2. What is/are the insight(s) found from the chart?


*   The optimal length of stay in both hotel types is less than 7 days.
*   City Hotel is preferred for short stays and Resort Hotel is preferred for long stays.
*   The most common length of stay in City Hotel is 3 days, whereas in Resort Hotel it is 1 day.
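The most common stay length per hotel can also be read off numerically instead of from the bar heights. This is a minimal sketch on invented rows (`toy_df` and `most_common_stay` are assumed names; the values are chosen for illustration only) that restricts to confirmed bookings and takes the mode of `total_stay` for each hotel.

```python
import pandas as pd

# Toy frame (hypothetical values); the last booking of each hotel is cancelled
toy_df = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "total_stay": [3, 3, 1, 2, 1, 1, 7, 7],
    "is_canceled": [0, 0, 0, 1, 0, 0, 0, 1],
})

# Only confirmed (not cancelled) bookings count as actual stays
confirmed = toy_df[toy_df["is_canceled"] == 0]

# mode() returns the most frequent value(s); iloc[0] picks the smallest on a tie
most_common_stay = confirmed.groupby("hotel")["total_stay"].agg(
    lambda stays: stays.mode().iloc[0]
)

print(most_common_stay)
```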






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Both hotel types can give more discounts and offers on long stays in order to increase their revenue.

#### Chart - 7

**Q7) Which year has the highest number of bookings and which is the busiest month of the year?**

In [None]:
# Chart - 7 visualization code
confirmed_bookings = hotel_df[hotel_df['is_canceled'] == 0]
bookings_by_months = confirmed_bookings.groupby(['arrival_date_month'])['hotel'].count().reset_index().rename(columns={'hotel':"Count of Booking"})
sequence_of_months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
bookings_by_months['arrival_date_month']=pd.Categorical(bookings_by_months['arrival_date_month'],categories=sequence_of_months,ordered=True)
bookings_by_months=bookings_by_months.sort_values('arrival_date_month')

plt.figure(figsize=(15, 8))
plt.subplot(1, 2, 1)
sns.countplot(x='arrival_date_year',hue='hotel', data=confirmed_bookings,palette='husl')
plt.title("Arrivals per year in Both Hotels ",fontweight="bold", size=20)
plt.xlabel('Year')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
sns.barplot(data=bookings_by_months, x="arrival_date_month", y="Count of Booking")
plt.title("Number of Bookings in Months")
plt.xlabel('Month')
plt.ylabel('Number of Bookings')
plt.subplots_adjust(right=1.7)
plt.show()
bookings_by_months


##### 1. Why did you pick the specific chart?

A bar chart is used for fast data exploration and for comparing discrete categories. The X axis of the plot represents the categories being compared (Year and Month), while the Y axis represents the measured values (number of bookings) for those categories. The advantage of bar charts over other chart types is that the human eye has a refined ability to compare the length of objects, as opposed to angle or area.

##### 2. What is/are the insight(s) found from the chart?

* Most of the bookings were made in the year 2016, and most bookings were for City Hotel.
* August is the busiest month and January is the least busy month.
* Confirmed bookings range from their lowest value (3648) in January to their highest value (7620) in August.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The high number of confirmed bookings in July and August can have a positive impact on the business.
* The low number of confirmed bookings in January, November and December has a negative impact on the business.
* It is ideal to allocate more marketing budget to the busiest seasons.

#### Chart - 8

**Q8) Which is the most preferred hotel type for weekend and weekday stay?**

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(15, 8))
plt.subplot(1, 2, 1)
sns.countplot(x='stays_in_weekend_nights',hue='hotel', data=hotel_df, palette='cool')
plt.title("Number of stays on weekend nights",fontweight="bold", size=20)
plt.subplot(1, 2, 2)
sns.countplot(x = 'stays_in_weekend_nights', hue='is_canceled', data = hotel_df, palette='rocket')
plt.title('WeekendStay vs Cancellation',fontweight="bold", size=20)
plt.subplots_adjust(right=1.7)
plt.show()

In [None]:
plt.figure(figsize=(15, 8))
plt.subplot(1, 2, 1)
sns.countplot(x='stays_in_week_nights',hue='hotel', data = hotel_df, palette='rainbow_r')
plt.title("Number of stays on weekday nights",fontweight="bold", size=20)
plt.subplot(1, 2, 2)
sns.countplot(x = 'stays_in_week_nights', hue='is_canceled', data = hotel_df, palette='magma_r')
plt.title('WeekStay vs Cancellation',fontweight="bold", size=20)
plt.subplots_adjust(right=1.7)

##### 1. Why did you pick the specific chart?

A bar chart (here drawn as count plots) is used for fast data exploration and for comparing discrete categories. The X axis of the plot represents the categories being compared (number of weekend or weekday nights), while the Y axis represents the count of bookings in those categories. The advantage of bar charts over other chart types is that the human eye has a refined ability to compare the length of objects, as opposed to angle or area.

##### 2. What is/are the insight(s) found from the chart?

Most weekend-night stays were booked at City Hotel, and most bookings that included weekend nights were not cancelled. Weekday-night stays were also higher at City Hotel, and here too cancelled bookings were fewer than confirmed ones.
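
The visual comparison can be cross-checked with a few numbers. The sketch below assumes a made-up frame `toy_df` in place of `hotel_df` (illustrative values only) and computes each hotel's share of bookings that include at least one weekend night, plus the cancellation rate among those bookings:

```python
import pandas as pd

# Hypothetical toy data standing in for hotel_df; values are illustrative only.
toy_df = pd.DataFrame({
    'hotel': ['City Hotel', 'City Hotel', 'City Hotel',
              'Resort Hotel', 'Resort Hotel', 'City Hotel'],
    'stays_in_weekend_nights': [2, 1, 0, 2, 1, 1],
    'is_canceled': [0, 1, 0, 0, 0, 0],
})

# Keep only bookings that include at least one weekend night.
weekend_df = toy_df[toy_df['stays_in_weekend_nights'] > 0]

# Share of weekend bookings per hotel, in percent.
weekend_share = (weekend_df['hotel'].value_counts(normalize=True) * 100).round(2)

# Cancellation rate of weekend bookings per hotel, in percent.
weekend_cancel_rate = (weekend_df.groupby('hotel')['is_canceled'].mean() * 100).round(2)

print(weekend_share)
print(weekend_cancel_rate)
```

The same two lines applied to `stays_in_week_nights` would give the weekday equivalent.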

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Resort Hotel can offer additional services to encourage guests to book both weekend and weekday stays.

#### Chart - 9

**Q9) How do bookings vary over the year for different types of customers?**

Let us divide the customers into three categories, 'Single', 'Couple' and 'Family/Friends', and then check their monthly booking counts.

In [None]:
# Chart - 9 visualization code
confirmed_bookings = hotel_df[hotel_df['is_canceled'] == 0]
single   = confirmed_bookings[(confirmed_bookings['adults']==1) & (confirmed_bookings['children']==0) & (confirmed_bookings['babies']==0)]
couple   = confirmed_bookings[(confirmed_bookings['adults']==2) & (confirmed_bookings['children']==0) & (confirmed_bookings['babies']==0)]
family   = confirmed_bookings[confirmed_bookings['total_people']>2]
reindex = ['January', 'February','March','April','May','June','July','August','September','October','November','December']
fig, ax = plt.subplots(figsize=(12, 8))

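# Map each label to its dataframe explicitly (avoids eval and shadowing the built-in `type`)
customer_groups = {'single': single, 'couple': couple, 'family': family}
for label, group_df in customer_groups.items():
  d1 = group_df.groupby(['arrival_date_month']).size().reset_index().rename(columns = {0:'arrival_num'})
  d1['arrival_date_month'] = pd.Categorical(d1['arrival_date_month'],categories=reindex,ordered=True)
  sns.lineplot(data=d1, x= 'arrival_date_month', y='arrival_num', label=label, ax=ax)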

plt.grid()
plt.show()

##### 1. Why did you pick the specific chart?

A line plot displays information as a series of data points connected by straight line segments. It is used to show the trend of data over time. In this visualization, a line plot is used to show the trend of bookings over the year for different customer types (single, couple and family).

##### 2. What is/are the insight(s) found from the chart?

Most bookings are made by couples (the data does not state this explicitly; we assume that a booking with exactly two adults and no children or babies is a couple). The graph shows a sharp rise in arrivals of couples and families in July and August, so targeted plans can be made for these months to attract them.
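
The three customer groups can also be expressed as a single label column, which makes it easy to see whether any bookings fall outside all three rules. This is a minimal sketch on made-up rows; `toy_df`, the `customer_type` column and the `'other'` label are hypothetical names, and `total_people` is assumed here to be the sum of adults, children and babies:

```python
import numpy as np
import pandas as pd

# Hypothetical toy bookings; the notebook itself works on hotel_df.
toy_df = pd.DataFrame({
    'adults':   [1, 2, 2, 3, 2, 1],
    'children': [0, 0, 1, 0, 0, 0],
    'babies':   [0, 0, 0, 0, 0, 1],
})
# Assumption: total_people = adults + children + babies.
toy_df['total_people'] = toy_df['adults'] + toy_df['children'] + toy_df['babies']

no_kids = (toy_df['children'] == 0) & (toy_df['babies'] == 0)
conditions = [
    (toy_df['adults'] == 1) & no_kids,   # single
    (toy_df['adults'] == 2) & no_kids,   # couple
    toy_df['total_people'] > 2,          # family/friends
]

# np.select assigns the first matching label; rows matching no rule get 'other'.
toy_df['customer_type'] = np.select(conditions, ['single', 'couple', 'family'], default='other')
type_counts = toy_df['customer_type'].value_counts()

print(toy_df)
print(type_counts)
```

In this toy data the last row (one adult with one baby) matches none of the three rules, which suggests checking how many such bookings exist in the real data before interpreting the totals.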

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Allocate more marketing budget to the busiest season (July and August).

#### Chart - 10

**Q10) Which is the most common channel for booking hotels?**

In [None]:
# Chart - 10 visualization code
group_by_dc = hotel_df.groupby('distribution_channel')
d1 = pd.DataFrame(round((group_by_dc.size()/hotel_df.shape[0])*100,2)).reset_index().rename(columns = {0: 'Booking_%'})
plt.figure(figsize = (8,8))
data = d1['Booking_%']
labels = d1['distribution_channel']
plt.pie(x=data, autopct="%.2f%%", explode=[0.05]*len(data), labels=labels, pctdistance=0.5)  # one explode value per slice
plt.title("Booking % by Distribution Channels", fontsize=14);


##### 1. Why did you pick the specific chart?

A pie chart is used here to illustrate the numerical proportion. The above pie chart shows a simple and easy to understand picture of booking percentage through different distribution channels.

##### 2. What is/are the insight(s) found from the chart?

TA/TO has the highest booking percentage.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The TA/TO distribution channel has a positive impact on the business.
* The GDS and Undefined channels bring in very few bookings, which has a negative impact on the business.
* The booking percentage of TA/TO is 79.13%, whereas that of GDS is 0.21%.
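
Percentages like these can be obtained directly with `value_counts(normalize=True)`, without dividing by the row count by hand. A minimal sketch, assuming a made-up Series `channels` in place of `hotel_df['distribution_channel']`:

```python
import pandas as pd

# Hypothetical channel labels standing in for hotel_df['distribution_channel'].
channels = pd.Series(['TA/TO', 'TA/TO', 'TA/TO', 'Direct',
                      'Corporate', 'TA/TO', 'Direct', 'GDS'])

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages.
booking_pct = (channels.value_counts(normalize=True) * 100).round(2)

print(booking_pct)
```

The resulting Series is already sorted from the most to the least used channel, so it can be passed straight to a pie or bar chart.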

#### Chart - 11

**Q11) Which hotel type has the highest ADR?**

In [None]:
# Chart - 11 visualization code
adr_df = grouped_by_hotel['adr'].mean().reset_index().rename(columns = {'adr':'avg_adr'}) # calculating average adr
plt.figure(figsize = (8,5))
sns.barplot(x = adr_df['hotel'], y = adr_df['avg_adr'] )
plt.title("Average ADR by Hotel Type", fontsize = 20)
plt.xlabel('Hotel Type', fontsize = 15)
plt.ylabel('Average ADR', fontsize = 15)
plt.show()

##### 1. Why did you pick the specific chart?

The bar chart is used for fast data exploration and for comparing discrete categories. The X axis of the plot represents the categories (Hotel Type) being compared, while the Y axis represents the measured values (Average ADR) corresponding to those categories. The advantage of bar charts over other chart types is that the human eye has evolved a refined ability to compare the length of objects, as opposed to angle or area.

##### 2. What is/are the insight(s) found from the chart?

The average ADR of City Hotel is slightly higher than that of Resort Hotel. Hence, City Hotel seems to be making slightly more revenue per booked night.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The high ADR of City Hotel can have a positive impact on the business.
* The low ADR of Resort Hotel can have a negative impact on the business.
* The high ADR of City Hotel indicates that it makes more revenue than Resort Hotel.
* Resort Hotel should consider enhancing its facilities, which can increase its revenue.
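
ADR is a per-night rate, so total revenue also depends on how many nights are sold. The sketch below uses made-up numbers (not dataset values) to show how a hotel with a higher average ADR can still have a lower estimated revenue; `toy_df` and `est_revenue` are hypothetical names, and revenue is approximated as `adr * total_stay`:

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy_df = pd.DataFrame({
    'hotel': ['City Hotel', 'City Hotel', 'Resort Hotel', 'Resort Hotel'],
    'adr': [110.0, 100.0, 90.0, 100.0],
    'total_stay': [1, 2, 5, 4],   # nights per booking
})

# Assumption: revenue per booking is approximated as rate times nights.
toy_df['est_revenue'] = toy_df['adr'] * toy_df['total_stay']

summary = toy_df.groupby('hotel').agg(
    avg_adr=('adr', 'mean'),
    est_revenue=('est_revenue', 'sum'),
)

print(summary)
```

Running the same aggregation on the real data would show whether the ADR gap between the two hotels carries over to estimated revenue.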

#### Chart - 12

**Q12) Which hotel has longer waiting time?**

In [None]:
# Chart - 12 visualization code
d5 = pd.DataFrame(grouped_by_hotel['days_in_waiting_list'].mean().reset_index().rename(columns = {'days_in_waiting_list':'avg_waiting_period'}))
plt.figure(figsize = (8,5))
sns.barplot(x = d5['hotel'], y = d5['avg_waiting_period'] )
plt.title("Average Waiting Period by Hotel Type", fontsize = 20)
plt.xlabel('Hotel Type', fontsize = 15)
plt.ylabel('Average Waiting Period', fontsize = 15)
plt.show()

##### 1. Why did you pick the specific chart?

The bar chart is used for fast data exploration and for comparing discrete categories. The X axis of the plot represents the categories (Hotel Type) being compared, while the Y axis represents the measured values (Average Waiting Period) corresponding to those categories. The advantage of bar charts over other chart types is that the human eye has evolved a refined ability to compare the length of objects, as opposed to angle or area.

##### 2. What is/are the insight(s) found from the chart?

City Hotel has a significantly longer average waiting time, which suggests that City Hotel is much busier than Resort Hotel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The longer waiting time at City Hotel indicates that it is much busier than Resort Hotel. This means that City Hotel is in higher demand and attracts more guests, which in turn contributes to its revenue. Resort Hotel can enhance its facilities to attract more guests.
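
An average waiting period can be driven by a small number of very long waits, so it can help to look at the share of bookings that waited at all alongside the mean. A minimal sketch on made-up values; `toy_df`, `avg_days` and `share_waiting_pct` are hypothetical names:

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy_df = pd.DataFrame({
    'hotel': ['City Hotel'] * 4 + ['Resort Hotel'] * 4,
    'days_in_waiting_list': [0, 0, 0, 60, 0, 0, 0, 4],
})

waiting_summary = toy_df.groupby('hotel')['days_in_waiting_list'].agg(
    avg_days='mean',
    # Percentage of bookings that spent at least one day on the waiting list.
    share_waiting_pct=lambda s: (s > 0).mean() * 100,
)

print(waiting_summary)
```

In this toy example both hotels have the same share of waiting bookings, yet the averages differ widely because of a single long wait.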

#### Chart - 13

**Q13) Which hotel has a higher chance that its customers will return for another stay?**

In [None]:
# Chart - 13 visualization code
# Selecting and counting repeated customers bookings
repeated_guest_data = hotel_df[hotel_df['is_repeated_guest'] == 1]
repeat_guest_grp = repeated_guest_data.groupby('hotel')
total_repeated_df = pd.DataFrame(repeat_guest_grp.size()).rename(columns = {0:'total_repeated_guests'})

# Counting total bookings
total_booking = grouped_by_hotel.size()
total_booking_df = pd.DataFrame(total_booking).rename(columns = {0: 'total_bookings'})
combined_df = pd.concat([total_repeated_df,total_booking_df], axis = 1)

# Calculating repeat %
combined_df['repeat_%'] = round((combined_df['total_repeated_guests']/combined_df['total_bookings'])*100,2)

plt.figure(figsize = (10,5))
sns.barplot(x = combined_df.index, y = combined_df['repeat_%'])
plt.show()
combined_df

##### 1. Why did you pick the specific chart?

The bar chart is used for fast data exploration and for comparing discrete categories. The X axis of the plot represents the categories (Hotel Type) being compared, while the Y axis represents the measured values (Repeat %) corresponding to those categories. The advantage of bar charts over other chart types is that the human eye has evolved a refined ability to compare the length of objects, as opposed to angle or area.

##### 2. What is/are the insight(s) found from the chart?

Both hotels have a very small percentage of repeat guests, but Resort Hotel has a slightly higher repeat % than City Hotel.
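
Because `is_repeated_guest` is a 0/1 flag, the repeat percentage per hotel is simply the group mean of that column times 100. A minimal sketch, assuming a made-up frame `toy_df` in place of `hotel_df`:

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy_df = pd.DataFrame({
    'hotel': ['City Hotel'] * 4 + ['Resort Hotel'] * 4,
    'is_repeated_guest': [0, 0, 0, 1, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the fraction of ones in each group.
repeat_pct = (toy_df.groupby('hotel')['is_repeated_guest'].mean() * 100).round(2)

print(repeat_pct)
```

This gives the same ratio as dividing repeated-guest bookings by total bookings, in one step.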

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights help in creating a positive business impact. The hotels which are not booked repeatedly by guests can take feedback from the guests and try to improve their services. Post-stay communication increases the chances of a repeat booking. Staying in contact with your hotel visitors after they leave effectively brings customers back to a facility. It shows them you care about the experience they had at your hotel. Get in contact with them quickly, offer them a discount for future visits, a promotion, or something else to keep your hotel in the back of their mind, and address their complaints quickly. Loyalty programs are one strategy that can help ensure your customers revisit your property. These programs aim to deliver significant discounts and unique bonuses to returning customers. They allow guests to reach a certain level or number of points per year for visiting your hotel.

#### Chart - 14

**Q14) Which distribution channel brings better revenue generating deals for hotels?**

In [None]:
# Chart - 14 visualization code
group_by_dc_hotel = hotel_df.groupby(['distribution_channel', 'hotel'])
d5 = pd.DataFrame(round(group_by_dc_hotel['adr'].mean(), 2)).reset_index().rename(columns = {'adr': 'avg_adr'})
plt.figure(figsize = (7,5))
sns.barplot(x = d5['distribution_channel'], y = d5['avg_adr'], hue = d5['hotel'])
plt.ylim(40,140)
plt.show()

##### 1. Why did you pick the specific chart?

The bar chart is used for fast data exploration and for comparing discrete categories. The X axis of the plot represents the categories (Distribution Channel) being compared, while the Y axis represents the measured values (Average ADR) corresponding to those categories. The advantage of bar charts over other chart types is that the human eye has evolved a refined ability to compare the length of objects, as opposed to angle or area.

##### 2. What is/are the insight(s) found from the chart?

* The GDS channel brings higher revenue-generating deals for City Hotel, even though most bookings come via TA/TO. City Hotel can work to increase outreach on GDS channels to get more such deals.
* Resort Hotel gets more revenue-generating deals through the Direct and TA/TO channels. Resort Hotel needs to increase outreach on the GDS channel to increase revenue.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The GDS distribution channel contributed more to ADR for City Hotel, and the Undefined distribution channel contributed more to ADR for Resort Hotel. Higher ADR has a positive impact on the business.
* The GDS distribution channel made no contribution to the ADR of Resort Hotel, and the Undefined distribution channel contributed less to the ADR of City Hotel. Lower ADR has a negative impact on the business.
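
Since some channels carry very few bookings, it may be worth reporting the booking count next to each average ADR so that averages based on a handful of rows are not over-interpreted. A minimal sketch on made-up rows; `toy_df` and `channel_summary` are hypothetical names:

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy_df = pd.DataFrame({
    'distribution_channel': ['TA/TO', 'TA/TO', 'TA/TO', 'GDS', 'Direct', 'Direct'],
    'hotel': ['City Hotel', 'City Hotel', 'Resort Hotel',
              'City Hotel', 'Resort Hotel', 'City Hotel'],
    'adr': [100.0, 110.0, 90.0, 130.0, 105.0, 95.0],
})

# Average ADR and number of bookings for every channel/hotel pair.
channel_summary = (
    toy_df.groupby(['distribution_channel', 'hotel'])['adr']
          .agg(avg_adr='mean', bookings='count')
          .round(2)
          .reset_index()
)

print(channel_summary)
```

Channel/hotel pairs that never occur (GDS for Resort Hotel in this toy data) are simply absent from the result rather than shown with an ADR of zero.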

#### Chart - 15

**Q15) Which is the home country of most of the customers?**

In [None]:
# Chart - 15 visualization code
import plotly.express as px
country_visitors = hotel_df[hotel_df['is_canceled'] == 0].groupby(['country']).size().reset_index(name = 'count')
px.choropleth(country_visitors,
                    locations = "country",
                    color= "count" ,
                    hover_name= "country", # column to add to hover information
                    color_continuous_scale="Viridis",
                    title="Home country of visitors")

In [None]:
grouped_by_country = hotel_df.groupby('country')
d1 = pd.DataFrame(grouped_by_country.size()).reset_index().rename(columns = {0:'Count'}).sort_values('Count', ascending = False)[:10]
sns.barplot(x = d1['country'], y  = d1['Count'])
plt.show()

##### 1. Why did you pick the specific chart?

A choropleth map is used to plot the home country of most of the customers. A bar chart is also used, for fast data exploration and for comparing discrete categories. The X axis of the plot represents the categories (Country) being compared, while the Y axis represents the measured values corresponding to those categories. The advantage of bar charts over other chart types is that the human eye has evolved a refined ability to compare the length of objects, as opposed to angle or area.

##### 2. What is/are the insight(s) found from the chart?

Most visitors are from Western Europe, namely France, the UK and Portugal, with Portugal being the highest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

More marketing campaigns can be run in Western Europe, particularly France, the UK and Portugal, to attract more guests.

#### Chart - 16 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
new_hotel_df = hotel_df[['lead_time','previous_cancellations','previous_bookings_not_canceled','booking_changes','days_in_waiting_list','adr','required_car_parking_spaces','total_of_special_requests','total_stay','total_people']]
plt.figure(figsize=(12,8))
sns.heatmap(new_hotel_df.corr(),annot=True,cmap='RdYlGn')

##### 1. Why did you pick the specific chart?

Heat map is used in the above visualization to understand the correlation between the variables.

##### 2. What is/are the insight(s) found from the chart?

* Total stay and lead time are positively correlated. This may be interpreted as people usually planning longer hotel stays further in advance.
* ADR is positively correlated with total_people, which suggests that the rate increases as the number of people in a booking increases.
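
Each cell in the heatmap is a Pearson correlation coefficient, which is the default for `DataFrame.corr()`. The sketch below uses made-up numbers (not dataset values) to compute one coefficient with pandas and checks it against the textbook formula:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy_df = pd.DataFrame({
    'lead_time':  [5, 20, 40, 90, 150],
    'total_stay': [1, 2, 2, 4, 7],
})

# Pearson correlation as computed by pandas.
r_pandas = toy_df['lead_time'].corr(toy_df['total_stay'])

# The same coefficient from its definition:
# sum of products of deviations, divided by the product of their norms.
x = toy_df['lead_time'] - toy_df['lead_time'].mean()
y = toy_df['total_stay'] - toy_df['total_stay'].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(r_pandas, r_manual)
```

A positive coefficient only indicates that the two columns tend to move together; it does not by itself show that one causes the other.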

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?

* Customers are the lifeblood of every hotel business and account for the most significant share of a hotel’s revenue. The goal of every hotel owner is to attract customers (both new and returning) to their property because it increases the occupancy rate and the hotel’s revenue. Know your guests and offer them free vouchers to visit the cinema, museums, and other exciting places.

* Give guests free upgrades, accept early check-in, offer pickup and shuttle services, maintain excellent room service and embrace special events. One way to attract more customers to your hotel business is to ensure that every member of your staff provides guests with an unbeatable and unforgettable experience.

* Interesting discounts and offers can be given on BB package in order to attract more guests.

* Increase the number of rooms of types A and H in order to increase ADR and maximise profits.

* As the peak season is from May to August, with August as the most booked month, management should consider how to utilise staff effectively during this period.

* As the majority of customers are from Portugal and the rest of Western Europe, management should plan marketing activities in those regions.

* The most preferred length of stay is 3 days. Hotel management should introduce loyalty services, offers and tourism packages in order to lengthen customers' stays and generate more revenue.


* As City Hotel has more bookings, it generates more revenue, but it also has more cancellations. Management may consider offering customers an hourly booking option, as most customers prefer short stays at City Hotel.

* The hotels which are not booked repeatedly by guests can take feedback from the guests and try to improve their services.

* Post-stay communication increases the chances of a repeat booking. Staying in contact with your hotel visitors after they leave effectively brings customers back to a facility. It shows them you care about the experience they had at your hotel. Get in contact with them quickly, offer them a discount for future visits, a promotion, or something else to keep your hotel in the back of their mind, and address their complaints quickly.

* Loyalty programs are one strategy that can help ensure your customers revisit your property. These programs aim to deliver significant discounts and unique bonuses to returning customers. It allows guests to reach a certain level or point per year for visiting your hotel.

# **Conclusion**

*   **City Hotel** has the highest booking percentage (61.07%) and highest booking cancellation percentage (30.10%). Average ADR of City Hotel is slightly higher than that of Resort Hotel. Hence, City hotel seems to be making slightly more revenue.


*   **Agent number 9** has made the highest number of bookings (28721).


*   **Bed & Breakfast (BB)** is the most preferred meal type of the customers, followed by SC (no meal package), HB (Half Board), Undefined and FB (Full Board).


*   "**A**" is the most preferred room type and "**L**" is the least preferred room type.


*  Bookings with a lead time less than 100 days have fewer chances of getting cancelled, and bookings with lead time more than 100 days have more chances of getting cancelled.


* Optimal length of stay in both hotel types is less than 7 days.


*   The most demanded room type is **A**, but the room types generating better ADR are **H, G and F**. Hotels should increase the number of rooms of types A and H to maximise revenue.


*  Most of the bookings were in the year 2016, and most bookings were made at City Hotel. August is the most occupied month and January is the least occupied month.


* City Hotel is the most preferred hotel type for weekend and weekday stay.

* City Hotel has a significantly longer waiting time, which suggests that City Hotel is much busier than Resort Hotel.


* Both hotels have a very small percentage of repeat guests, but Resort Hotel has a slightly higher repeat % than City Hotel.


* Guests use different channels for making bookings, out of which most preferred one is **TA/TO**.

* The **GDS** channel brings higher revenue-generating deals for City Hotel.


* Most visitors are from Western Europe, namely France, the UK and Portugal, with **Portugal** being the highest.

* Couples are the most common guests, hence hotels can plan services according to couples' needs to increase revenue.
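
The lead-time conclusion above can be checked by bucketing bookings at the 100-day mark and comparing cancellation rates. This is a minimal sketch on made-up rows; `toy_df` and `lead_bucket` are hypothetical names, and placing a lead time of exactly 100 days in the lower bucket is an arbitrary choice:

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy_df = pd.DataFrame({
    'lead_time':   [3, 45, 99, 100, 120, 200, 310, 15],
    'is_canceled': [0, 0, 1, 0, 1, 1, 0, 0],
})

# Two right-closed bins: (-1, 100] and (100, inf).
toy_df['lead_bucket'] = pd.cut(
    toy_df['lead_time'],
    bins=[-1, 100, float('inf')],
    labels=['<=100 days', '>100 days'],
)

# Cancellation rate per bucket, in percent.
cancel_rate = (toy_df.groupby('lead_bucket', observed=True)['is_canceled'].mean() * 100).round(2)

print(cancel_rate)
```

Finer bins (for example every 30 days) would show whether the cancellation rate rises gradually with lead time or jumps around a particular threshold.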






### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***