# **Project Name**    - **Hotel Booking Analysis**



##### **Project Type**    - Capstone project / EDA
##### **Contribution**    - Individual
##### **Presented by** - Abhishek Dehankar


# **Project Summary** -

### The hotel booking analysis project involved exploring a dataset of booking information for a city hotel and a resort hotel. The goal was to understand factors influencing hotel bookings, cancellations, length of stay, meal preferences, customer types, and more.

### Using Python's pandas library, the dataset was loaded and examined to gain an initial understanding. Data cleaning was performed to handle missing values, outliers, and inconsistencies.

### Visualizations using matplotlib and seaborn were created to analyze the data. Key questions explored included the distribution of bookings between resort and city hotels, cancellation rates, seasonal patterns, lead time's impact on cancellations, and the relationship between guest count and meal preferences.

### The project also investigated market segments, distribution channels, room types, deposit types, booking changes, customer types, and car parking requirements to understand their influence on bookings and cancellations.

### The project's findings highlighted the dominance of city hotel bookings, lower cancellation rates for resort hotels, seasonal variations in booking volumes, and the importance of early bookings to reduce cancellations. It also revealed insights into guest demographics, preferences, and the impact of various factors on booking behavior.

### The analysis provided valuable insights for hotel managers and marketers to optimize pricing strategies, improve guest experiences, and manage cancellations effectively.

### In conclusion, the hotel booking analysis project utilized Python's pandas, matplotlib, and seaborn libraries to explore the dataset, uncover patterns, and provide data-driven insights for the hotel industry.

# **GitHub Link -**

https://github.com/Abhishek-Dehankar/Capstone-project.git

# **Problem Statement**


Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions!
This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data.
Explore and analyze the data to discover important factors that govern the bookings.

#### **Define Your Business Objective?**

Gain insights into hotel booking factors to optimize operations, improve customer satisfaction, and enhance revenue in the hotel industry.

# **General Guidelines** : -

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.

     The additional credits will have advantages over other students during Star Student selection.

             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following format.


```
# Chart visualization code
```


*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset from Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Read Hotel Booking Analysis CSV
# HBA-Hotel Booking Analysis
Hotel_booking_analysis_data ='/content/drive/MyDrive/1 My project/Hotel Bookings.csv'
HBA_df=pd.read_csv(Hotel_booking_analysis_data)
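The guidelines above mention exception handling, so the loading step can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the function name `load_bookings` and the demo path are hypothetical, and the error messages are only suggestions.

```python
from pathlib import Path

import pandas as pd


def load_bookings(csv_path):
    """Read the bookings CSV; print a message and return None if it cannot be loaded."""
    path = Path(csv_path)
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        print(f"File not found: {path}. Check that Google Drive is mounted.")
    except pd.errors.EmptyDataError:
        print(f"File is empty: {path}")
    return None


# Hypothetical path, used only to demonstrate the failure branch
missing_df = load_bookings("no_such_folder/Hotel Bookings.csv")
```

With such a helper, the notebook can check for `None` before continuing instead of stopping with a traceback.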

### Dataset First View

In [None]:
# HBA-Hotel Booking Analysis
HBA_df.head()

In [None]:
HBA_df.tail()

### Dataset Rows & Columns count

In [None]:
# with the help of shape we count Rows & Columns in Datasets
HBA_df.shape #(Rows -119390 ,Columns -32 )

### Dataset Information

In [None]:
# Dataset Info
HBA_df.info()

In [None]:
HBA_df.describe()

### Creating copy of Dataset

In [None]:
#creating copy of original data
# name change HBA_df to H_df
H_df=HBA_df.copy()
H_df.head()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_values = H_df.duplicated().value_counts()
duplicate_values

In [None]:
# removing duplicate values
H_df.drop_duplicates(inplace=True)

In [None]:
# View unique data
unique_num_of_rows = H_df.shape[0]

In [None]:
unique_num_of_rows

#### Missing Values/Null Values

In [None]:
#missing values count
missing_value =H_df.isnull().sum().sort_values(ascending=False)[:5]
missing_value

In [None]:
# Create the bar plot
sns.barplot(x=missing_value.values, y=missing_value.index)

# Set plot title and axis labels
plt.title("Missing Values")
plt.xlabel("Count")
plt.ylabel("Columns")

# Display the plot
plt.show()

### What did you know about your dataset?

This dataset is a single file that compares booking information for two hotels: a city hotel and a resort hotel. It includes information such as when the booking was made, the length of stay, the number of adults, children and/or babies, and the number of available parking spaces, among other things. The dataset contains a total of 119390 rows and 32 columns. It contains 31994 duplicate rows, which were removed with drop_duplicates(). We checked the data type of every column (int, float, string) and observed that some columns have an unsuitable data type, which is corrected later during data wrangling. We also look at the unique values of every column to see which values each column actually holds.
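The checks described above (shape, duplicate rows, data types) can be collected in one place. The following is a small sketch on made-up toy data; `toy_df` and its values are assumptions for illustration, and the same calls apply to `H_df`.

```python
import pandas as pd

# Toy stand-in for the bookings data (values are made up)
toy_df = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 10, 45, 3],
    "children": [0.0, 0.0, 2.0, None],
})

n_rows, n_cols = toy_df.shape
# duplicated() marks every fully identical row after its first occurrence
n_duplicates = int(toy_df.duplicated().sum())
# How many columns there are of each data type
dtype_counts = toy_df.dtypes.astype(str).value_counts()

print(f"{n_rows} rows, {n_cols} columns, {n_duplicates} duplicate rows")
print(dtype_counts)
```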

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
H_df.columns

In [None]:
# Dataset Describe
H_df.describe()

### Variables Description

The columns and the data they represent are listed below:

    1.hotel : Name of the hotel (Resort Hotel or City Hotel)

    2.is_canceled : If the booking was canceled (1) or not (0)

    3.lead_time: Number of days between the booking date and the arrival date

    4.arrival_date_year : Year of arrival date

    5.arrival_date_month : Month of arrival date

    6.arrival_date_week_number : Week number of year for arrival date

    7.arrival_date_day_of_month : Day of arrival date

    8.stays_in_weekend_nights : Number of weekend nights (Saturday or Sunday) spent at the hotel by the guests.

    9.stays_in_week_nights : Number of weeknights (Monday to Friday) spent at the hotel by the guests.

    10.adults : Number of adults among guests

    11.children : Number of children among guests

    12.babies : Number of babies among guests

    13.meal : Type of meal booked

    14.country : Country of guests

    15.market_segment : Designation of market segment

    16.distribution_channel : Name of booking distribution channel

    17.is_repeated_guest : If the booking was from a repeated guest (1) or not (0)

    18.previous_cancellations : Number of previous bookings that were cancelled by the customer prior to the current booking

    19.previous_bookings_not_canceled : Number of previous bookings not cancelled by the customer prior to the current booking

    20.reserved_room_type : Code of room type reserved

    21.assigned_room_type : Code of room type assigned

    22.booking_changes : Number of changes/amendments made to the booking

    23.deposit_type : Type of the deposit made by the guest

    24.agent : ID of travel agent who made the booking

    25.company : ID of the company that made the booking

    26.days_in_waiting_list : Number of days the booking was in the waiting list

    27.customer_type : Type of customer, assuming one of four categories

    28.adr : Average Daily Rate, as defined by dividing the sum of all lodging transactions by the total number of staying nights

    29.required_car_parking_spaces : Number of car parking spaces required by the customer

    30.total_of_special_requests : Number of special requests made by the customer

    31.reservation_status : Reservation status (Canceled, Check-Out or No-Show)

    32.reservation_status_date : Date at which the last reservation status was updated

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print(H_df.apply(lambda col: col.unique()))

In [None]:
# count the number of unique values in each column
for i in H_df.columns.tolist():
   print('Number of unique values', i, 'is', H_df[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
missing_value[:4]

In [None]:
# percentage of rows with a missing 'company' value (after removing duplicates)
percentage_company_null = missing_value['company'] / unique_num_of_rows*100
percentage_company_null
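The percentage of missing values can also be computed for all columns at once instead of one column at a time. A minimal sketch on made-up toy data (`toy_df` is an assumption for illustration); it relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (values are made up)
toy_df = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "agent": [9.0, np.nan, 240.0, 9.0],
    "country": ["PRT", "GBR", None, "PRT"],
    "adr": [75.0, 98.0, 120.5, 60.0],
})

# isnull().mean() gives the fraction of missing values per column
missing_percent = (toy_df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_percent)
```

A column that is mostly empty, like `company` here, is a candidate for dropping, while columns with only a few gaps can be filled.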

In [None]:
# It is better to drop the column 'company' altogether since the number of missing values is extremely high compared to the number of rows.
H_df.drop(['company'], axis=1, inplace=True)

In [None]:
# now let's check the percentage of null values in the 'agent' col

percentage_agent_null = missing_value['agent'] / unique_num_of_rows*100
percentage_agent_null

In [None]:
# As we have seen, there are relatively few null values in agent. Let's fill these values with 0

H_df['agent'] = H_df['agent'].fillna(0)
H_df['agent'].isnull().sum() # we re-check that column has no null value

In [None]:
# Check the percentage of null values in the 'country' col

percentage_country_null = missing_value['country'] / unique_num_of_rows*100
percentage_country_null

In [None]:
# We have few null values in the country col, so we will replace the nulls with 'others' as the country name.

H_df['country'] = H_df['country'].fillna('others')
H_df['country'].isnull().sum() # we re-check that column has no null value


In [None]:
# Check the percentage of null values in the 'children' col

percentage_children_null = missing_value['children'] / unique_num_of_rows*100
percentage_children_null

In [None]:
# We have few null values in the children col, so we will replace the nulls with 0.

H_df['children'] = H_df['children'].fillna(0)
H_df['children'].isnull().sum() # we re-check that column has no null value


In [None]:
#let's check whether database having any other null value

H_df.isnull().sum() # As we have seen, no column has any null value

In [None]:
#showing the info of the data to check datatype
H_df.info()

In [None]:
# The children & agent columns have datatype float although they contain only integer values, so let's change the datatype to 'int64'
H_df[['children', 'agent']] = H_df[['children', 'agent']].astype('int64')

# Adding new columns for data understanding

In [None]:
# total stay in nights: week nights + weekend nights
H_df['total_stay_in_nights'] = H_df['stays_in_week_nights'] + H_df['stays_in_weekend_nights']
pd.DataFrame(H_df['total_stay_in_nights'])

In [None]:
# We have created a col for revenue using total stay * adr
H_df['revenue'] = H_df['total_stay_in_nights'] *H_df['adr']
pd.DataFrame(H_df['revenue'])

In [None]:
# Also, for information, we will add a column with total guest coming for each booking
H_df['total_guest'] = H_df['adults'] + H_df['children'] + H_df['babies']
H_df['total_guest'].sum()

In [None]:
# for readability, in col 'is_canceled' we replace the values (0, 1) with 'not canceled' and 'is canceled'.
H_df['is_canceled'] = H_df['is_canceled'].replace([0,1], ['not canceled', 'is canceled'])
pd.DataFrame(H_df['is_canceled'])

In [None]:
#Same for 'is_repeated_guest' col
H_df['is_repeated_guest'] = H_df['is_repeated_guest'].replace([0,1], ['not repeated', 'repeated'])
pd.DataFrame(H_df['is_repeated_guest'])

In [None]:
H_df[['hotel', "revenue"]]

In [None]:
#Now, we will check overall revenue hotel wise
hotel_wise_total_revenue = H_df.groupby('hotel')['revenue'].sum()
hotel_wise_total_revenue

### What all manipulations have you done and insights you found?

We have done a few manipulations on the data.

**----Addition of columns----**

A few columns needed for the analysis were not in the data but could be derived from the given columns.

  a) Total Guests: This column helps us evaluate the volume of guests. We get this value by adding the number of adults, children and babies.

  b) Revenue: We find revenue by multiplying adr by the total stay in nights. This column is used to compare the revenue of each hotel.

  c) Total stay in nights: We get this value by adding stays_in_week_nights and stays_in_weekend_nights. It is also the basis for the revenue column.

**----Deletion of columns----**

  a) company: This column is almost entirely null, so we deleted it, as it would not contribute to the analysis.

**----Replacement of values in columns----**

  a) is_canceled & is_repeated_guest: These columns contain only 0 and 1 as values. In is_canceled, which represents the cancellation status of the booking, we replaced 0 and 1 with 'not canceled' and 'is canceled'. In the same way, in is_repeated_guest we replaced 0 and 1 with 'not repeated' and 'repeated'. These labels make the visualizations easier to understand.

**----Changes in data type of values in columns----**

  a) agent & children: These columns contained float values, which does not make sense because the values represent the ID of an agent and a count of guests. So we changed the data type of these columns from float to integer.

**----Removed null values & duplicate entries----**

  a) Before visualizing any data from the dataset, we have to do data wrangling. For that, we checked the null values in all the columns. Where a column had a very large number of null values, we dropped it with the .drop() method; this is how the 'company' column was dropped. Where a column had only a small number of null values, we filled them with suitable values by using .fillna().

  b) In the same way, we checked for duplicates and found that some rows were duplicated. We removed those rows from the dataset by using the .drop_duplicates() method.

In this way, we removed unnecessary data and made our data clean and ready to analyse.
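The cleaning steps described above can be gathered into one function, so the whole preparation runs in a single call. This is a sketch under assumptions: the function name `clean_bookings` and the toy data are hypothetical, and the fill values mirror the ones used in the cells above.

```python
import numpy as np
import pandas as pd


def clean_bookings(raw_df):
    """Return a cleaned copy: duplicates removed, 'company' dropped, nulls filled."""
    df = raw_df.drop_duplicates().drop(columns=["company"])
    # Same fill values as in the wrangling cells above
    df = df.fillna({"agent": 0, "country": "others", "children": 0})
    # Counts and IDs are whole numbers, so store them as integers
    return df.astype({"children": "int64", "agent": "int64"})


# Toy data (values are made up); the first two rows are identical
toy_raw = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel"],
    "company": [np.nan, np.nan, 40.0],
    "agent": [9.0, 9.0, np.nan],
    "country": ["PRT", "PRT", None],
    "children": [0.0, 0.0, np.nan],
})

cleaned = clean_bookings(toy_raw)
print(cleaned)
```

Because the function returns a new DataFrame and leaves its input untouched, the raw data stays available for comparison.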

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

1) Which type of hotel is mostly preferred by the guests?

In [None]:
# Chart - 1 visualization code
hotel_value_counts = H_df['hotel'].value_counts()
hotel_value_counts

In [None]:
# piechart is used for visualization
hotel_value_counts.plot.pie(explode=[0.05, 0.05], autopct='%1.2f%%', figsize=(6,6),fontsize=10,colors = ('lightblue', 'lightcoral'))
plt.title('Hotel Booking Percentages',fontsize = 15)

##### 1. Why did you pick the specific chart?

A pie chart is simple and easy to understand. It presents the data in percentage (%) format and shows which hotel has more bookings.

##### 2. What is/are the insight(s) found from the chart?

We found that the city hotel has more bookings than the resort hotel.

City hotel: 61.13%

Resort hotel: 38.87%
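The percentages shown on the pie chart can also be computed directly with `value_counts(normalize=True)`. A small sketch on made-up toy data (`toy_hotels` is an assumption for illustration):

```python
import pandas as pd

# Toy column with 3 city and 2 resort bookings (made up)
toy_hotels = pd.Series(["City Hotel"] * 3 + ["Resort Hotel"] * 2, name="hotel")

# normalize=True returns shares instead of raw counts
hotel_share = (toy_hotels.value_counts(normalize=True) * 100).round(2)
print(hotel_share)
```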

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights help create a positive business impact.

The city hotel can add services to attract more guests and increase revenue.

The resort hotel can look for ways to attract guests, for example by finding out which facilities the city hotel provides to attract them.

#### Chart - 2

2) What percentage of hotel bookings was canceled?

In [None]:
# Chart - 2 visualization code
# 'is_canceled' was already relabeled as 'not canceled' / 'is canceled' during data wrangling,
# so no further replacement is needed here
hotel_cancellation = H_df['is_canceled'].value_counts()
hotel_cancellation

In [None]:
# piechart is used for visualization
# round figuring autopct by %1.f%%.
hotel_cancellation.plot.pie(autopct='%1.f%%', figsize=(6,6),fontsize=10,colors = ('lightblue', 'lightcoral'))
plt.title('Cancellation percentage rate of Hotel',fontsize =15)

##### 1. Why did you pick the specific chart?

A pie chart shows the share of canceled and not canceled bookings out of all bookings.

##### 2. What is/are the insight(s) found from the chart?

Here, we found that overall more than 27% of bookings got canceled.
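The overall rate can be split by hotel type to compare the two hotels. A minimal sketch on made-up toy data, assuming the 'is canceled' / 'not canceled' labels created during data wrangling; the toy values say nothing about the real rates.

```python
import pandas as pd

# Toy data (values are made up)
toy_df = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "is_canceled": [
        "is canceled", "not canceled", "is canceled", "not canceled",
        "not canceled", "not canceled", "is canceled", "not canceled",
    ],
})

# The mean of a boolean flag is the share of True values, here the cancellation rate
cancel_rate_by_hotel = (
    toy_df["is_canceled"].eq("is canceled")
    .groupby(toy_df["hotel"])
    .mean()
    .mul(100)
    .round(2)
)
print(cancel_rate_by_hotel)
```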

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Here, we can see that more than 27% of bookings are getting canceled.

Solution:

1.Identify common cancellation reasons

2.Improve pricing and offers

3.Enhance customer satisfaction

4.Flexible cancellation policy

5.Improve communication

6.Monitor competitor trends

7.Loyalty programs

#### Chart - 3

3) What is the most preferred room type by the customers?

In [None]:
# Chart - 3 visualization code
room_type = H_df['assigned_room_type'].value_counts()
room_type

In [None]:
# countplot is used for visualization of most preferred room type
plt.figure(figsize=(14,7))
sns.countplot(x=H_df['assigned_room_type'],order=H_df['assigned_room_type'].value_counts().index)
plt.title("Most preferred Room type", fontsize = 20)
plt.xlabel('Type of the Room', fontsize = 15)
plt.ylabel('Room type count', fontsize = 15)

##### 1. Why did you pick the specific chart?

We chose a countplot to visualize the most preferred room type because a countplot displays the count of observations in each category, and here we have to show room type vs. room type count.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that type A rooms are the most preferred, with a count of 46313, followed by type D rooms, with a count of 22432.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels can offer the facilities provided in room type A in the other room types as well.

#### Chart - 4

4) What type of food is mostly preferred by the guests?

In [None]:
# Chart - 4 visualization code
preferred_food = H_df['meal'].value_counts()
preferred_food

In [None]:
# Visualization of most preferred food using countplot
plt.figure(figsize=(14,7))
sns.countplot(x=H_df['meal'],order=H_df['meal'].value_counts().index)
plt.title("Most preferred Food", fontsize = 20)
plt.xlabel('Type of the food')
plt.ylabel('Food type count')

##### 1. Why did you pick the specific chart?

We chose a countplot to visualize the most preferred food because a countplot displays the count of observations in each category, and here we have to show food type vs. food type count.

##### 2. What is/are the insight(s) found from the chart?

The insight found here is that the BB (Bed & Breakfast) meal type is the most preferred.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Using this insight, hotels can improve the quality of the other meal types.

#### Chart - 5

5) In which month most of the bookings happened?

In [None]:
# Chart - 5 visualization code
# using groupby on arrival_date_month and taking the hotel count
bookings_by_months_df = H_df.groupby(['arrival_date_month'])['hotel'].count().reset_index().rename(columns={'hotel':"counts"})

# create list of months in order
months = ['January','February','March','April','May','June','July','August','September','October','November','December']

# creating df which will map the order of above months list without changing its values.
bookings_by_months_df['arrival_date_month']=pd.Categorical(bookings_by_months_df['arrival_date_month'],categories=months,ordered=True)

#sorting by arrival date month
bookings_by_months_df=bookings_by_months_df.sort_values('arrival_date_month')

bookings_by_months_df

In [None]:
# set plot size
plt.figure(figsize=(20,8))

#plotting barplot on x-months & y-booking counts
sns.barplot(x=bookings_by_months_df['arrival_date_month'],
            y=bookings_by_months_df['counts'])

# set title for the plot
plt.title('Number of booking across each month', fontsize=15)

# set x labels
plt.xlabel('Month', fontsize=15)

# set y labels
plt.ylabel('Number of bookings', fontsize=15)

##### 1. Why did you pick the specific chart?

We chose a barplot here because it presents the data in pictorial form, which makes comparison between months easy.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that August has the maximum number of bookings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This insight shows that hotels should be well prepared for July and August, as the most bookings take place in these months. Better preparation and a good approach will definitely add to the growth of the hotels.

#### Chart - 6

6) Which year had the highest bookings?

In [None]:
# Chart - 6 visualization code
year_count = H_df['arrival_date_year'].value_counts().sort_index()
year_count

In [None]:
# Visualization of year wise booking using countplot chart
plt.figure(figsize=(14,7))
sns.countplot(x=H_df['arrival_date_year'],hue=H_df['hotel'])
plt.title('Year wise Bookings', fontsize = 20)
plt.xlabel('Arrival_date_year', fontsize = 15)
plt.ylabel('Count of bookings', fontsize = 15)

##### 1. Why did you pick the specific chart?

Because a countplot is easy to understand.

##### 2. What is/are the insight(s) found from the chart?

2016 had the highest bookings and 2015 had the lowest bookings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* 2016 had the highest bookings, which is a positive impact.

* 2015 had the lowest bookings, which is a negative impact.

* In 2016 there were 42391 bookings and in 2015 there were 13313 bookings. The dataset may not cover every year completely, so yearly totals should be compared with care.




#### Chart - 7

7) Which distribution channel is mostly used for hotel booking?

In [None]:
# Chart - 7 visualization code
# distribution channel value count
distribution_channel_counts = H_df['distribution_channel'].value_counts()
distribution_channel_counts

In [None]:
# rename_axis + reset_index(name=...) gives the same column names in old and new pandas versions
distribution_channel_counts.rename_axis('distribution_channel').reset_index(name='count')

In [None]:
# number of rows in the dataframe
dataframe_shape= H_df.shape[0]
dataframe_shape

In [None]:
# booking by distribution channel in percent
distribution_channel_df_percent = (distribution_channel_counts/dataframe_shape*100).round(2).rename_axis('distribution_channel').reset_index(name='% booking')
distribution_channel_df_percent

In [None]:
#Visualization of mostly used distribution channels using barplot
plt.figure(figsize=(14,7))
sns.barplot(data=distribution_channel_df_percent, x="distribution_channel", y="% booking")
plt.title("Mostly used distribution Channels", fontsize = 20)
plt.xlabel('Distribution Channel', fontsize = 15)
plt.ylabel('Booking by distribution channel in percent', fontsize = 15)

##### 1. Why did you pick the specific chart?

Because a barplot gives a simple pictorial chart that is easy to understand.

##### 2. What is/are the insight(s) found from the chart?

The most used distribution channel is TA/TO (travel agents / tour operators). Its total count of bookings is 69141, which is 79.11% of all bookings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can help the other channels improve their services.

#### Chart - 8

8) What percentage of bookings comes from repeated guests?

In [None]:
# Chart - 8 visualization code
hotel_guest = H_df['is_repeated_guest'].value_counts()
hotel_guest

In [None]:
# piechart is used for visualization
hotel_guest.plot.pie(explode=[0.05, 0.05], autopct='%1.2f%%', figsize=(6,6),fontsize=10,startangle=90)
plt.title('Hotel_repeated_guest_status',fontsize = 15)

##### 1. Why did you pick the specific chart?

To show the percentage share of repeated & non-repeated guests.

##### 2. What is/are the insight(s) found from the chart?

Here, we can see that the number of repeated guests is very small compared to the overall number of guests.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can give attractive offers to non-repeated customers during off seasons to enhance revenue.

#### Chart - 9

9) Which hotel type has the highest ADR?

ADR=Average Daily Rate

In [None]:
# Chart - 9 visualization code
highest_adr = H_df.groupby('hotel')['adr'].mean().reset_index()
highest_adr

In [None]:
# hotel revenues
plt.figure(figsize = (5,3))
hotel_wise_revenue = H_df.groupby('hotel')['revenue'].sum()
hotel_wise_revenue
ax = hotel_wise_revenue.plot(kind = 'bar',color = ('lightblue', 'lightcoral'))
plt.xlabel("Hotel", fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Brown'})
plt.ylabel("Total Revenue", fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Brown'} )
plt.title("Total Revenue", fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Green'} )

In [None]:
# Visualization of highest adr using barplot
plt.figure(figsize=(14,7))
sns.barplot(x=highest_adr['hotel'],y=highest_adr['adr'])
plt.title('Average ADR for each Hotel type', fontsize=20)
plt.xlabel('Type of hotel',fontsize=15)
plt.ylabel('ADR', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar plot because it gives a simple pictorial diagram that is easy to understand.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the city hotel has the highest ADR, which means the city hotel earns more per booked night. Total revenue also depends on the number of nights sold.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The city hotel has a high ADR, which is a positive impact.
* The resort hotel has a lower ADR compared to the city hotel, which is a negative impact.
* The city hotel has an ADR of 110.98 and the resort hotel has an ADR of 99.02, so the resort hotel earns less per night than the city hotel.
* The resort hotel should improve its facilities in order to increase revenue.
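ADR is a rate per night, while total revenue also depends on how many nights are sold, so the two measures can rank hotels differently. The following sketch uses made-up toy values (not the real data) to show ADR, stay length and revenue side by side with one `groupby`.

```python
import pandas as pd

# Toy data (values are made up)
toy_df = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "adr": [120.0, 100.0, 90.0, 110.0],
    "total_stay_in_nights": [1, 2, 7, 5],
})
# Same definition as in the data wrangling section: nights * adr
toy_df["revenue"] = toy_df["adr"] * toy_df["total_stay_in_nights"]

# Named aggregation: one output column per (input column, function) pair
summary = toy_df.groupby("hotel").agg(
    mean_adr=("adr", "mean"),
    mean_nights=("total_stay_in_nights", "mean"),
    total_revenue=("revenue", "sum"),
)
print(summary)
```

In this toy example the city hotel has the higher mean ADR but the lower total revenue, because its stays are shorter.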

#### Chart - 10

10) Which hotel has the longer waiting time?

In [None]:
# Chart - 10 visualization code
Waiting_time = H_df.groupby('hotel')['days_in_waiting_list'].mean().reset_index()
Waiting_time

In [None]:
# Visualization of hotel which has longer waiting time by using barplot
plt.figure(figsize=(14,7))
sns.barplot(x=Waiting_time['hotel'],y=Waiting_time['days_in_waiting_list'])
plt.title('Waiting time for each hotel type', fontsize=20)
plt.xlabel('Type of hotel',fontsize=15)
plt.ylabel('Waiting time', fontsize=15)

##### 1. Why did you pick the specific chart?

I chose a barplot because it gives an easy-to-understand pictorial diagram for visualizing which hotel has the longer waiting time.

##### 2. What is/are the insight(s) found from the chart?

The city hotel has a longer waiting time, which suggests that the city hotel is busier than the resort hotel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The city hotel has a longer waiting time, which indicates higher demand and is a positive impact on business.
* The resort hotel has a shorter waiting time, which indicates lower demand and is a negative impact on business.
* The mean number of days in the waiting list is about 1.02 for the city hotel and about 0.32 for the resort hotel.
* The resort hotel needs to improve its facilities so that its bookings increase.

#### Chart - 11

11) Which distribution channel contributed more to adr in order to increase the income?

In [None]:
# Chart - 11 visualization code
distribution_channel = H_df.groupby(['distribution_channel','hotel'])['adr'].mean().reset_index()
distribution_channel

In [None]:
# Visualization of contribution of distribution channel in adr using barplot
plt.figure(figsize=(14,7))
sns.barplot(x='distribution_channel',y='adr', data=distribution_channel,hue='hotel')
plt.title('ADR across Distribution channel', fontsize=20)
plt.xlabel('Distribution channel',fontsize=15)
plt.ylabel('ADR', fontsize=15)

##### 1. Why did you pick the specific chart?

I chose a barplot to visualise ADR across distribution channels because it gives an easy-to-understand visualization of a large amount of data.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the GDS channel has the highest ADR for the city hotel, while the Direct and TA/TO channels have nearly equal ADR in both hotel types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The GDS distribution channel contributed the most to ADR for the city hotel and the Undefined distribution channel contributed the most to ADR for the resort hotel, which is a positive impact.
* The GDS distribution channel made no contribution to ADR for the resort hotel and the Undefined distribution channel contributed little to ADR for the city hotel, which is a negative impact.
* If the GDS distribution channel brings in more bookings for the resort hotel, and the Undefined distribution channel brings in more bookings for the city hotel, their contribution to ADR will increase and income will increase.

#### Chart - 12

12) What is optimal stay length in both types of hotel?

In [None]:
# Chart - 12 visualization code
# count the bookings for each combination of stay length and hotel type
stay_length = H_df.groupby(['total_stay_in_nights','hotel']).size().reset_index(name='Number of stays')
stay_length

In [None]:
# Barplot is used for visualization of optimal stay length in hotel type
plt.figure(figsize=(14,8))
sns.barplot(x='total_stay_in_nights',y='Number of stays', data=stay_length,hue='hotel')
plt.title('Optimal Stay Length in Both hotel types', fontsize=20)
plt.xlabel('total_stay in days',fontsize=15)
plt.ylabel('count of stays', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

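Because a grouped barplot gives a simple visualization of the number of stays for each stay length in both hotel types.

##### 2. What is/are the insight(s) found from the chart?

The optimal (most common) stay length in both hotel types is less than 7 days.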
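The reading from the chart can be backed by two numbers per hotel: the most common stay length and the share of stays shorter than 7 nights. A minimal sketch on made-up toy data (`toy_df` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy data (values are made up)
toy_df = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 5,
    "total_stay_in_nights": [1, 2, 2, 3, 8, 1, 7, 7, 7, 10],
})

# Most frequent stay length per hotel (first mode if there is a tie)
most_common_stay = toy_df.groupby("hotel")["total_stay_in_nights"].agg(
    lambda nights: nights.mode().iloc[0]
)

# Share of stays shorter than 7 nights, in percent
share_under_7 = (
    toy_df["total_stay_in_nights"].lt(7)
    .groupby(toy_df["hotel"])
    .mean()
    .mul(100)
    .round(2)
)
print(most_common_stay)
print(share_under_7)
```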



#### Chart - 13



13) How is the data distributed in histogram format?

In [None]:
H_df.hist(figsize=(24,18))
plt.show()

##### 1. Why did you pick the specific chart?

To understand the data clearly, we used histograms here. A histogram summarizes discrete or continuous data measured on an interval scale, and it is often used to illustrate the major features of the distribution of the data in a convenient form. It is also useful when dealing with larger datasets (more than 100 observations), and it can help detect unusual observations (outliers) or gaps in the data. We therefore used histograms to analyse the distribution of each variable over the whole dataset, including whether it is symmetric.
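Outliers that a histogram hints at can be confirmed numerically. The sketch below uses the common 1.5 × IQR rule, which is an assumption here and not a method used elsewhere in this notebook; the `toy_adr` values are made up.

```python
import pandas as pd

# Toy ADR values with one extreme entry (made up)
toy_adr = pd.Series([60.0, 75.0, 80.0, 95.0, 100.0, 110.0, 5400.0], name="adr")

q1, q3 = toy_adr.quantile(0.25), toy_adr.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged as outliers
outliers = toy_adr[(toy_adr < lower) | (toy_adr > upper)]
print(outliers)
```

Flagged rows still need a manual look, because an extreme value can be a genuine booking rather than a data error.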

##### 2. What is/are the insight(s) found from the chart?

Most guests arrived in 2016. The arrival week with the most guests is week 30. Most arrivals happen at the end of the month. Most guests come without children. There is very little demand for car parking spaces.

##### 3. Will the gained insights help creating a positive business impact?

A histogram alone cannot establish business impact. It only shows how the values in each column are distributed across the dataset.

#### Chart - 14 - Correlation Heatmap

14) Create a correlation heatmap.

In [None]:
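# Correlation Heatmap visualization code
plt.figure(figsize=(20,10))
cmap = sns.color_palette("Blues")
# numeric_only=True restricts the correlation to numeric columns (required in pandas >= 2.0 when text columns are present)
sns.heatmap(H_df.corr(numeric_only=True),annot=True,cmap=cmap)
plt.title('Correlation of the columns',fontsize = 15)
plt.show()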

##### 1. Why did you pick the specific chart?

To understand the relationships between the numerical columns.

##### 2. What is/are the insight(s) found from the chart?

The strongest positive correlation between two columns is 0.95 (95%), and the strongest negative correlation is -0.51 (-51%).
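Reading the strongest correlations off a large heatmap is error-prone, so they can also be extracted directly from the correlation matrix. This is a minimal sketch on made-up data with hypothetical column names (`col_a`, `col_b`, `col_c`). On the real data, `sample` would be replaced by `H_df`.

```python
import numpy as np
import pandas as pd

# Made-up toy data with hypothetical column names
sample = pd.DataFrame({
    'col_a': [1, 2, 3, 4, 5],
    'col_b': [1, 2, 3, 4, 6],
    'col_c': [5, 4, 3, 2, 1],
})

corr = sample.corr(numeric_only=True)

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair
# appears once and self-correlations of 1.0 are ignored
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack().dropna()

strongest_pos = pairs.idxmax()  # pair with the highest correlation
strongest_neg = pairs.idxmin()  # pair with the lowest correlation
print(strongest_pos, pairs[strongest_pos])
print(strongest_neg, pairs[strongest_neg])
```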

#### Chart - 15 - Pair Plot

15) Create a pair plot.

In [None]:
# Pair Plot visualization code
# scatter_matrix creates its own figure, so its figsize argument sets the plot size
pd.plotting.scatter_matrix(H_df, figsize=(10,10),color='pink')
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot helps identify the set of features that best explains a relationship between variables or forms the most clearly separated clusters. It also helps in building simple classification models by drawing simple lines or linear separations in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how repeated guests relate to the other columns. More generally, it shows the relationship of each column with every other column.

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

**1) To grow the hotel business, factors such as high revenue generation, customer satisfaction, and the facilities provided by the hotel are important.**

**2) I address these factors by showing the client which hotel is most preferred, the percentage of repeated guests, the food most preferred by guests, which hotel has the highest ADR, and so on.**

**3) A countplot shows the most preferred room type, so the client can prepare well in advance. This insight helps the client further enhance their hospitality.**

**4) I show which food type is most preferred, so the client can offer it to the guests.**

**5) A barplot shows the most preferred months, so the client can prepare well in advance and face minimal grievances.**

**6) A barplot shows which hotel type has the higher ADR, so the client can see which hotel generates more income.**

**7) I show which hotel is the busiest, so the client can make relevant changes to the facilities in the less busy hotel type.**

**8) I show the relationship between repeated guests and previous bookings that were not cancelled, so the client can prioritize repeated guests.**

**9) A barplot shows the relationship between ADR and the total number of people, so the client can prioritize bookings with more people.**

# **Conclusion**

* **City Hotel appears to be more preferred among travellers, and it also generates more revenue and profit.**

* **More bookings are made in July and August than in the rest of the months.**

* **Room Type A is the most preferred room type among travellers.**

* **Most guests stay 1-4 days in the hotels.**

* **City Hotel retains more guests.**

* **Around one-fourth of all bookings are cancelled. More cancellations come from City Hotel.**

* **New guests tend to cancel bookings more often than repeated customers.**

* **Lead time, number of days on the waiting list, and assignment of the reserved room to the customer do not affect cancellation of bookings.**

* **Corporate has the highest percentage of repeated guests and TA/TO the lowest, whereas for cancelled bookings TA/TO has the highest percentage and Corporate the lowest.**

* **The length of stay decreases as ADR increases, probably to reduce cost.**
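The cancellation figures in the conclusions above are simple group averages of the 0/1 column `is_canceled`. The sketch below shows how such rates can be computed. The `sample` values are made up and do not reproduce the project's results, and the column name `is_repeated_guest` is an assumption about `H_df`.

```python
import pandas as pd

# Made-up sample; is_canceled and is_repeated_guest are 0/1 flags
sample = pd.DataFrame({
    'hotel': ['City Hotel'] * 4 + ['Resort Hotel'] * 4,
    'is_repeated_guest': [0, 0, 0, 1, 0, 0, 1, 1],
    'is_canceled':       [1, 1, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of rows equal to 1
overall_rate = sample['is_canceled'].mean()
rate_by_hotel = sample.groupby('hotel')['is_canceled'].mean()
rate_by_guest = sample.groupby('is_repeated_guest')['is_canceled'].mean()

print(overall_rate)
print(rate_by_hotel)
print(rate_by_guest)
```

Applying the same three lines to `H_df` would give the overall cancellation rate, the rate per hotel type, and the rate for new versus repeated guests.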



### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***