<a href="https://colab.research.google.com/github/MukundP2/Hotel-Booking-analysis/blob/main/Hotel_Booking_Analysis_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <b> Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions! </b>

## <b>This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data. </b>

## <b> Explore and analyze the data to discover important factors that govern the bookings. </b>

In [None]:
# import required modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as mtick
import plotly 
import plotly.express as px

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Path of the raw dataset of Hotel Booking
data_path = '/content/drive/MyDrive/Capstone  Hotel/project/Copy of Hotel Bookings.csv'

# **Exploring the dataset**

In [None]:
# Loading the dataset
hotel_booking_raw_ds = pd.read_csv(data_path)
hotel_booking_raw_ds

In [None]:
# creating a copy of the dataset so the raw data stays untouched
ds = hotel_booking_raw_ds.copy()

ds.head()



In [None]:
ds.tail()


In [None]:
ds.shape

In [None]:
ds.info()

In [None]:
# checking the null values
ds.isnull().sum().sort_values(ascending=False)

In [None]:
# null values percentage
percentage_miss = (ds.isnull().sum()/ds.isnull().count()*100).sort_values(ascending=False)
percentage_miss

**The agent column holds agent ID numbers, so we replace its null values with zero. The null values in the children column are filled with zero as well.**

In [None]:
# Filling 0.0 in place of null values in the agent and children columns
ds[['agent']] = ds[['agent']].fillna(0.0)
ds[['children']] = ds[['children']].fillna(0.0)

# confirming the change

ds.isnull().sum().sort_values(ascending=False)

In [None]:
ds[(ds.adults + ds.babies + ds.children) == 0].shape

There are 180 rows where adults + babies + children equals 0. A booking cannot have zero guests, so we will drop these rows.

In [None]:
# dropping rows where adults + babies + children = 0

ds = ds.drop(ds[(ds.adults + ds.babies + ds.children) == 0].index)

The country column contains the guests' country codes. It is a categorical feature, so we replace its null values with the mode (the most frequent value).

In [None]:
# replacing null values in the country column with the mode value
# mode() returns a Series, so [0] picks the most frequent country code

ds['country'] = ds['country'].fillna(ds['country'].mode()[0])
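The mode-based fill can be illustrated on a toy Series. This is only a minimal sketch: the country codes below are made up for illustration and are not taken from the dataset. It shows that `Series.mode()` returns a Series (ties are possible), so the scalar fill value is taken as its first element.

```python
import pandas as pd

# Hypothetical country codes with missing entries (illustrative only)
country = pd.Series(['PRT', 'GBR', None, 'PRT', 'FRA', None])

# mode() returns a Series (there can be ties), so take the first value
most_common = country.mode()[0]

# Replace the missing entries with the most frequent code
filled = country.fillna(most_common)
print(filled.tolist())
```

Passing the whole result of `mode()` (or its string form) to `fillna` would not insert the plain code, which is why the scalar is extracted first.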

In [None]:
ds.dtypes

Converting the agent and children columns from float to integer

In [None]:
ds[['children', 'agent']] = ds[['children', 'agent']].astype('int64')
ds[['children', 'agent']] 

# **What is the percentage of canceled bookings?**





In [None]:
# canceled bookings 
canceled_bookings= ds['is_canceled'].value_counts()
canceled_bookings



In [None]:
# Percentage of canceled bookings 


ds['is_canceled'].value_counts(normalize=True)*100

In [None]:
canceled_bookings.plot(kind='pie',autopct ='%1.1f%%',figsize =(8,8),fontsize= 15,colors=['red' ,'blue'],radius=1,labels=['not canceled','canceled'])

plt.title('Percentage of canceled bookings ',fontsize = 20)
plt.show()

# As we can see, more than 37% of bookings were canceled.
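Because `is_canceled` is a 0/1 flag, its mean is the cancellation share, which gives a quick cross-check of the normalized value counts. The sketch below uses a made-up flag column (not the real data) to show that both routes agree.

```python
import pandas as pd

# Hypothetical 0/1 cancellation flags (illustrative only)
is_canceled = pd.Series([0, 1, 0, 0, 1, 0, 1, 0])

# The mean of a 0/1 column is the fraction of ones
rate_from_mean = is_canceled.mean() * 100

# Same figure from normalized value counts (label 1 = canceled)
rate_from_counts = is_canceled.value_counts(normalize=True).loc[1] * 100

print(f"{rate_from_mean:.1f}% canceled")
```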

# **Which segment brings in most of the bookings?**




In [None]:
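# creating a reusable plot function for bar and line charts

def plot(x, y, x_label=None, y_label=None, title=None, figsize=(7,5), type='bar', colors=None):
    # `type` and `colors` keep their names because later cells pass them as keywords

    sns.set_style('darkgrid')

    fig, ax = plt.subplots(figsize=figsize)

    # x and y are passed as keywords, which recent seaborn versions require
    if type == 'bar':
        sns.barplot(x=x, y=y, ax=ax)
        # apply the optional bar colors, one per bar
        if colors:
            for bar, color in zip(ax.patches, colors):
                bar.set_facecolor(color)
    elif type == 'line':
        sns.lineplot(x=x, y=y, ax=ax, sort=False)

    # y values are percentages, so show a % sign on the ticks
    ax.yaxis.set_major_formatter(mtick.PercentFormatter())

    if x_label is not None:
        ax.set_xlabel(x_label)

    if y_label is not None:
        ax.set_ylabel(y_label)

    if title is not None:
        ax.set_title(title)

    plt.show()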

In [None]:
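# creating a function get_count that returns the category labels and their share (%)
# note: with `limit`, the shares are relative to the top `limit` categories only

def get_count(series, limit=None):

    counts = series.value_counts()
    if limit is not None:
        counts = counts[:limit]

    x = counts.index
    y = counts/counts.sum()*100

    return x.values, y.values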

In [None]:
# plot bar chart 
x,y = get_count(ds['market_segment'])
plot(x,y, x_label='market_segment', y_label='Total Booking (%)', title='market_segment-wise booking', figsize=(15,7),colors=['black','red','green','blue','orange','yellow','pink','brown'])

# Online Travel Agents, followed by Offline Travel Agents, bring in most of the bookings.

# **Which country do most bookings come from?**

In [None]:
# select the bookings which were not canceled
confirm_bookings = hotel_booking_raw_ds[hotel_booking_raw_ds['is_canceled']==0]

In [None]:
# Number of bookings from each country, stored in one variable
# (named country_counts so it does not overwrite the get_count function)
country_counts = confirm_bookings['country'].value_counts()

In [None]:
# Top 15 countries with the highest number of hotel bookings
country_counts.head(15)

In [None]:
Bookings_from_country = country_counts.head(15)
# full country names, typed by hand in the same order as the codes in the output above
country_names = ['Portugal','United Kingdom','France','Spain','Germany','Ireland','Italy','Belgium','Netherlands','USA','Brazil','Switzerland','Austria','China','Sweden']

In [None]:
plt.figure(figsize=(20,7))
plt.bar(country_names,Bookings_from_country,color = ['palegreen','mediumpurple','palevioletred','cadetblue','salmon','lightskyblue','palegreen','navajowhite','rosybrown','springgreen','coral','slategray','plum'])
plt.xlabel('Country')
plt.ylabel('Number of Bookings')
plt.title('Hotel Bookings across the countries')

### or

In [None]:
ds_not_canceled = ds[ds['is_canceled'] == 0]

In [None]:
# show on map
# rename_axis + reset_index(name=...) yields 'country' and 'count' columns on any pandas version
temp = ds_not_canceled['country'].value_counts().rename_axis('country').reset_index(name='count')
guest_map = px.choropleth(temp,locations=temp['country'],color=np.log(temp['count']), hover_name=temp['country'], 
                          color_continuous_scale=px.colors.sequential.Plasma,title="Home country of guests")

guest_map.show()


# Portugal is the country from which most hotel bookings come.

# **What is the most preferred meal type?**

In [None]:
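# readable names for the meal codes; unknown codes fall back to the raw code
meal_names = {'BB': 'Bed and Breakfast', 'HB': 'Half Board', 'SC': 'Self Catering', 'Undefined': 'Undefined', 'FB': 'Full Board'}
meal_count = confirm_bookings['meal'].value_counts()
# build the labels from the counted index so labels and counts always line up
meal_type = [meal_names.get(code, code) for code in meal_count.index]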

In [None]:
# Percentage of meal type count.
percentage_meal_count = confirm_bookings['meal'].value_counts(normalize= True)*100
percentage_meal_count

In [None]:
# Visualisation of the number of bookings per meal type
plt.figure(figsize=(10,5))
plt.bar(meal_type, meal_count)
plt.xlabel('Meal Type')
plt.ylabel('Number of Bookings')
plt.title('Most in-demand Meal Types')

# Bed and Breakfast is the most preferred meal type.

# **Which months are the busiest for hotel bookings?**

In [None]:
# creating a data frame with the bookings which did not get canceled

ds_not_canceled = ds[ds['is_canceled'] == 0]

In [None]:
# plot a line plot for month of arrival

new_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September',
             'October', 'November', 'December']

sorted_months = ds_not_canceled['arrival_date_month'].value_counts().reindex(new_order)

x = sorted_months.index
y = sorted_months/sorted_months.sum()*100

In [None]:
# line plot of the monthly booking share (%)
plot(x, y, x_label='Months', y_label='Booking (%)', title='Booking Trend (Monthly)', type='line', figsize=(18,6))

In [None]:
# Bar plot for months vs Number of Bookings
x_axis_data = sorted_months.index
y_axis_data = sorted_months
plt.figure(figsize=(12,7))
plt.bar(x_axis_data,y_axis_data, color = ['red','mediumpurple','palevioletred','cadetblue','salmon','lightskyblue','palegreen','green','springgreen','coral','slategray','plum'])
plt.xlabel('Months')
plt.ylabel('Number of Bookings')
plt.title('Number of Bookings in Each Month')

# August is the busiest month for hotel bookings.
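An alternative to reindexing by a hand-written month list is an ordered categorical, which keeps calendar order through `value_counts` and `sort_index`. The sketch below is only an illustration on made-up arrival months, not the dataset's values.

```python
import pandas as pd

# Hypothetical arrival months (illustrative only)
months = pd.Series(['August', 'January', 'August', 'March', 'January', 'August'])

month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
               'August', 'September', 'October', 'November', 'December']

# An ordered categorical makes sort_index follow calendar order
as_cat = pd.Categorical(months, categories=month_order, ordered=True)
monthly = pd.Series(as_cat).value_counts().sort_index()

# Month with the highest count
busiest = monthly.idxmax()
print(busiest, monthly[busiest])
```

Months that never occur still appear with a count of 0, so the x-axis of a later plot keeps all twelve months.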

# **How long do guests stay in the hotel?**

In [None]:
# get_count helper, repeated here so this section runs on its own
def get_count(series, limit=None):

    counts = series.value_counts()
    if limit is not None:
        counts = counts[:limit]

    x = counts.index
    y = counts/counts.sum()*100

    return x.values, y.values

In [None]:
# plotting a bar plot of booking percentage by length of stay (nights)

total_nights = ds_not_canceled['stays_in_weekend_nights']+ ds_not_canceled['stays_in_week_nights']
x,y = get_count(total_nights, limit=10)

plot(x,y, x_label='Number of Nights', y_label='Booking Percentage (%)', title='Night Stay Duration (Top 10)', figsize=(15,7))


# More than 60% of guests stay for 1, 2 or 3 nights.
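A claim such as "at most 3 nights" can be checked directly with a cumulative share. The following is a minimal sketch on made-up stay lengths (not the dataset's values); it computes the share per stay length and then the running total.

```python
import pandas as pd

# Hypothetical total nights per booking (illustrative only)
total_nights = pd.Series([1, 2, 2, 3, 3, 3, 4, 7, 1, 2])

# Share of each stay length, ordered by number of nights
share = total_nights.value_counts(normalize=True).sort_index() * 100

# Running total: share of bookings lasting at most n nights
cumulative = share.cumsum()

# Direct check of the same quantity for n = 3
short_stays = (total_nights <= 3).mean() * 100

print(round(cumulative.loc[3], 1), round(short_stays, 1))
```

Unlike a top-10 slice, this share is relative to all bookings, so the two figures can differ slightly on real data.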

# **Which was the most booked accommodation type (Single, Couple, Family)?**

In [None]:
# for single, couple and family / friends

single   = ds_not_canceled[(ds_not_canceled.adults==1) & (ds_not_canceled.children==0) & (ds_not_canceled.babies==0)]
couple   = ds_not_canceled[(ds_not_canceled.adults==2) & (ds_not_canceled.children==0) & (ds_not_canceled.babies==0)]
family   = ds_not_canceled[ds_not_canceled.adults + ds_not_canceled.children + ds_not_canceled.babies > 2]


# the list of Category names, and their total percentage
names = ['Single', 'Couple (No Children)', 'Family / Friends']
count = [single.shape[0],couple.shape[0], family.shape[0]]
count_percent = [x/ds_not_canceled.shape[0]*100 for x in count]


# plot
plot(names,count_percent,  y_label='Booking (%)', title='Accommodation Type', figsize=(10,7))

# Couples (2 adults) are the most popular accommodation type, so hotels can plan accordingly.
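The three filters above do not cover every booking (one adult with one child, for example, matches none of them). As a hedged sketch, `np.select` can assign each booking exactly one label and send the rest to a catch-all group. The toy frame and the 'Other' label are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical guest counts per booking (illustrative only)
guests = pd.DataFrame({
    'adults':   [1, 2, 2, 1, 3],
    'children': [0, 0, 1, 1, 0],
    'babies':   [0, 0, 0, 0, 0],
})

total = guests['adults'] + guests['children'] + guests['babies']
no_kids = (guests['children'] == 0) & (guests['babies'] == 0)

# Conditions are checked in order; the first match wins
conditions = [
    (guests['adults'] == 1) & no_kids,
    (guests['adults'] == 2) & no_kids,
    total > 2,
]
labels = ['Single', 'Couple (No Children)', 'Family / Friends']

# Bookings matching none of the rules (e.g. 1 adult + 1 child) become 'Other'
guests['group'] = np.select(conditions, labels, default='Other')

print(guests['group'].value_counts(normalize=True) * 100)
```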

# **Plotting the heatmap**

In [None]:
# correlation is only defined for numeric columns, so select those first
corr_matrix = hotel_booking_raw_ds.select_dtypes(include='number').corr()
sns.set(style='white',font_scale=2.2)
fig = plt.figure(figsize=[35,30])
# mask the upper triangle so each pair of columns is shown once
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix,cmap='seismic',linewidths=3,linecolor='white',vmax = 1, vmin=-1,mask=mask, annot=True,fmt='0.2f')
plt.title('Correlation Heatmap', weight='bold',fontsize=30)
plt.savefig('heatmap.png',transparent=True, bbox_inches='tight')

### or

In [None]:
# Plotting the heatmap to see the correlation between numeric columns


fig, ax = plt.subplots(figsize=(35,25))
sns.heatmap(ds.select_dtypes(include='number').corr(), annot=True, ax=ax);

# Correlation conclusions:
1. adr and children are positively correlated (coefficient of about 0.33).
2. arrival_date_week_number and arrival_date_year are negatively correlated (coefficient of about -0.54).
3. previous_bookings_not_canceled and is_repeated_guest are positively correlated (coefficient of about 0.42).
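A full heatmap can be hard to read, so one option is to rank the columns by the strength of their correlation with a single target. The sketch below does this on a small made-up frame (the values are illustrative, not results from the dataset) and drops text columns before calling `corr()`.

```python
import pandas as pd

# Hypothetical bookings (illustrative only)
toy = pd.DataFrame({
    'is_canceled': [0, 1, 0, 1, 1, 0],
    'lead_time':   [5, 120, 10, 200, 90, 30],
    'adr':         [80.0, 95.0, 120.0, 70.0, 100.0, 110.0],
    'hotel':       ['City', 'City', 'Resort', 'City', 'Resort', 'Resort'],
})

# Keep numeric columns only so text columns do not break corr()
numeric = toy.select_dtypes(include='number')

# Correlation of every numeric column with the target, strongest first
target_corr = (
    numeric.corr()['is_canceled']
    .drop('is_canceled')
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```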


# **Average Daily Rate (ADR) comparison of city hotel and resort hotel.**

In [None]:
# lineplot of ADR

plt.figure(figsize=(12,8))
sns.lineplot(x = 'arrival_date_month', y = 'adr', hue= 'hotel', data = ds_not_canceled)

# The Average Daily Rate (ADR) in July and August is strikingly higher for the Resort Hotel than for the City Hotel.
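The gap read off the line plot can also be put into numbers with a pivot table of mean ADR per month and hotel. This is a minimal sketch on made-up rows. The hotel labels and rates below are assumptions for illustration, not figures from the dataset.

```python
import pandas as pd

# Hypothetical bookings (illustrative only)
toy = pd.DataFrame({
    'hotel': ['City Hotel', 'Resort Hotel', 'City Hotel',
              'Resort Hotel', 'Resort Hotel', 'City Hotel'],
    'arrival_date_month': ['July', 'July', 'August', 'August', 'July', 'August'],
    'adr': [110.0, 150.0, 115.0, 190.0, 170.0, 125.0],
})

# Mean ADR per month (rows) and hotel (columns)
mean_adr = toy.pivot_table(index='arrival_date_month', columns='hotel',
                           values='adr', aggfunc='mean')

# Positive values mean the resort is more expensive that month
gap = mean_adr['Resort Hotel'] - mean_adr['City Hotel']
print(gap)
```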

# **What is the relation of deposits to the booking cancellation?**

In [None]:
# counting refundable deposits

ds[ds.deposit_type == 'Refundable'].deposit_type.count()

In [None]:
# plotting a count plot

fig = plt.gcf()
fig.set_size_inches(12, 8)
plt.title("Booking Canceled or not by Deposit type")
sns.countplot(x='deposit_type',data=ds ,hue='is_canceled')

# The number of cancellations is higher for No Deposit bookings than for the other deposit types.
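A count plot shows absolute numbers, which can tell a different story from the cancellation rate within each deposit type. The sketch below puts both side by side on a made-up frame. The pattern in this toy data is illustrative only and is not a result from the dataset.

```python
import pandas as pd

# Hypothetical bookings (illustrative only)
toy = pd.DataFrame({
    'deposit_type': ['No Deposit'] * 6 + ['Non Refund'] * 2 + ['Refundable'] * 2,
    'is_canceled':  [1, 1, 0, 0, 0, 0, 1, 1, 0, 1],
})

# Bookings, cancellations and cancellation rate for each deposit type
summary = toy.groupby('deposit_type')['is_canceled'].agg(
    bookings='count', canceled='sum', cancel_rate='mean'
)
summary['cancel_rate'] *= 100  # express the rate as a percentage
print(summary)
```

In this toy frame, No Deposit ties for the most cancellations by count but has the lowest rate, which is why both views are worth checking on the real data.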

# **What is the relationship between lead time and cancellation?**

In [None]:
a = ds.groupby("lead_time")['is_canceled'].describe()
a

In [None]:
fig = plt.gcf()
fig.set_size_inches(12, 8)
a = ds.groupby("lead_time")['is_canceled'].describe()
sns.scatterplot(x=a.index, y=a["mean"] * 100)
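Per-day cancellation rates are noisy when only a few bookings share a lead time, so one option is to group lead times into bins first. The following is a hedged sketch on made-up data. The bin edges are an arbitrary choice, not something taken from the analysis above.

```python
import pandas as pd

# Hypothetical bookings (illustrative only)
toy = pd.DataFrame({
    'lead_time':   [2, 5, 20, 45, 80, 150, 200, 300],
    'is_canceled': [0, 0, 0, 1, 0, 1, 1, 1],
})

# Group lead times into coarse bins (edges are an arbitrary choice)
bins = [0, 30, 90, 180, 365]
toy['lead_bin'] = pd.cut(toy['lead_time'], bins=bins, include_lowest=True)

# Cancellation rate (%) per bin
rate_by_bin = toy.groupby('lead_bin', observed=False)['is_canceled'].mean() * 100

# Overall correlation between lead time and the 0/1 cancellation flag
lead_corr = toy['lead_time'].corr(toy['is_canceled'])

print(rate_by_bin)
```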

# **Conclusion:**
1. More than 37% of bookings were canceled.
2. Online Travel Agents, followed by Offline Travel Agents, bring in most of the bookings.
3. Portugal is the country from which most hotel bookings come.
4. Bed and Breakfast is the most preferred meal type.
5. August is the busiest month for hotel bookings.
6. More than 60% of guests stay for 1, 2 or 3 nights.
7. Couples (2 adults) are the most popular accommodation type, so hotels can plan accordingly.
8. The correlation heatmap shows that:
   - adr and children are positively correlated (coefficient of about 0.33).
   - arrival_date_week_number and arrival_date_year are negatively correlated (coefficient of about -0.54).
   - previous_bookings_not_canceled and is_repeated_guest are positively correlated (coefficient of about 0.42).
9. The Average Daily Rate (ADR) in July and August is strikingly higher for the Resort Hotel than for the City Hotel.
10. No Deposit cancellations are high compared to the other categories, but these bookings should not be discouraged per se, as bookings in this category are also very high compared to non-refundable bookings.
11. Lead time has a positive correlation with cancellation.