## **Project Name**    - **Hotel Booking Analysis**



# **Project Type  - EDA**
# **Contribution    - Individual**


# **Project Summary -**

### By analyzing a large dataset of hotel bookings, the project seeks to uncover key factors influencing booking behavior, such as seasonal trends, booking lead times, and popular amenities. The dashboard will provide a real-time overview of booking trends and customer preferences, helping identify areas for improvement and opportunities for growth within the hotel's operations. By leveraging predictive analytics, the project will forecast future booking demand and occupancy rates to guide resource allocation and marketing initiatives, with the aim of maximizing revenue and customer satisfaction.

# **GitHub Link -**

https://github.com/AshwiniSuryakar09/Hotel-Booking-Analysis

# **Problem Statement**


### In the rapidly evolving hospitality industry, there is a pressing need to analyze hotel booking patterns and customer preferences comprehensively. Moreover, failing to address the factors that contribute to booking cancellations hurts overall profitability and hinders the ability to provide a seamless, personalized customer experience.

### To address these challenges, this project aims to develop a comprehensive data analysis framework that can extract meaningful insights from the available booking data.

#### **Define Your Business Objective?**

### Our primary goal is to conduct EDA on the provided dataset and derive valuable conclusions about broad hotel booking trends and how various factors interact to affect hotel bookings.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#To ignore warnings
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive                    # Mounting drive
drive.mount('/content/drive')

In [None]:

filepath="/content/Hotel Bookings.csv"
hotel_df=pd.read_csv(filepath)
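
Since the guidelines above ask for exception handling, the load step could be wrapped as in the sketch below. The helper name `load_bookings` is hypothetical (it is not part of the original notebook), and the path is simply the one used in the cell above.

```python
import pandas as pd


def load_bookings(path):
    """Read the bookings CSV; return None (with a message) if it cannot be loaded."""
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        print(f"File not found: {path}")
    except pd.errors.EmptyDataError:
        print(f"File is empty: {path}")
    return None


# Path taken from the cell above; adjust it if the file lives on the mounted drive
filepath = "/content/Hotel Bookings.csv"
hotel_df = load_bookings(filepath)
```

Returning `None` instead of raising keeps the notebook running, but every later cell then has to cope with a missing DataFrame, so re-raising the error may be the better choice in practice.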

### Dataset First View

In [None]:
# Dataset First Look
hotel_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
hotel_df.shape


In [None]:
hotel_df.describe()

### Dataset Information

In [None]:
# Dataset Info

hotel_df.info()

In [None]:
df = hotel_df.copy()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df[df.duplicated()].shape

In [None]:

df.drop_duplicates(inplace = True)

In [None]:

df.shape

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null_df=pd.DataFrame(df.isna().sum()).rename(columns={0:'number_of_nulls_values'})
null_df

### Null values for the features children, country, agent, and company are 4, 452, 12193, and 82137, respectively.

In [None]:
# Plotting the number of null values for each variable
null_df.plot(kind='bar', figsize=(7,7))
plt.title('Number of null values per variable')

In [None]:
# Percentage of null values
percentage_null_df=pd.DataFrame(round(df.isna().sum()*100/len(df),4)).rename(columns={0:'percentage_null_values'})
percentage_null_df

### The children, country, agent, and company variables have 0.0046%, 0.5172%, 13.9514%, and 93.9826% null values, respectively. Only the company variable has more than 50% null values.
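
The null counts and percentages computed above can also be gathered in one table. The following is a minimal sketch on a toy DataFrame; the helper name `null_summary` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def null_summary(frame):
    """Return null counts and percentages per column, keeping only columns with nulls."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_percent": (counts * 100 / len(frame)).round(4),
    })
    return summary[summary["null_count"] > 0].sort_values("null_count", ascending=False)


# Toy frame standing in for the bookings data (values are illustrative only)
toy = pd.DataFrame({
    "children": [0.0, np.nan, 1.0, 0.0],
    "company": [np.nan, np.nan, np.nan, 40.0],
    "adults": [2, 2, 1, 2],
})
print(null_summary(toy))
```

On the real data the call would be `null_summary(df)`.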

In [None]:
# Checking the categories of the features that have null values
df.country.value_counts()

In [None]:
df.agent.value_counts()

In [None]:
df.children.value_counts()

In [None]:
df.company.value_counts()

In [None]:
# Dropping variable having more than 50% null values
df.drop(columns='company', inplace=True)

In [None]:
# Check the remaining null values after dropping 'company'
df.isna().sum()

In [None]:
df.shape

### The company column has been dropped. The remaining null values in children, country, and agent are filled in the next cell.

In [None]:
# Imputing the missing values: mean for children, mode for country and agent
df['children'] = df['children'].fillna(df['children'].mean())
df['country'] = df['country'].fillna(df['country'].mode()[0])
df['agent'] = df['agent'].fillna(df['agent'].mode()[0])



In [None]:
# Basic statistical description of the dataset
df.describe()

### What did you know about your dataset?

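### From the above, children and agent are discrete numerical variables and country is categorical. Null values in children were replaced with the column mean, and null values in country and agent were replaced with their modes. The variable company had more than 50% null values, so it was removed.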
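
The two cleaning steps described above (dropping sparse columns, then imputing) can be sketched as one reusable helper. The function name `drop_sparse_and_impute`, its parameters, and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def drop_sparse_and_impute(frame, fill_with, max_null_fraction=0.5):
    """Drop columns whose null share exceeds the threshold, then fill the listed columns.

    fill_with maps a column name to "mean" or "mode".
    """
    null_fraction = frame.isna().mean()
    kept = frame.loc[:, null_fraction <= max_null_fraction].copy()
    for col, strategy in fill_with.items():
        value = kept[col].mean() if strategy == "mean" else kept[col].mode()[0]
        kept[col] = kept[col].fillna(value)
    return kept


# Toy frame (illustrative values): company is 75% null and gets dropped
toy = pd.DataFrame({
    "children": [0.0, 2.0, np.nan, 1.0],
    "country": ["PRT", "PRT", None, "GBR"],
    "company": [np.nan, np.nan, np.nan, 40.0],
})
clean = drop_sparse_and_impute(toy, {"children": "mean", "country": "mode"})
print(clean)
```

Assigning the result of `fillna` back to the column avoids relying on `inplace=True` applied to a column selection.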

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

hotel : Name of the hotel (Resort Hotel or City Hotel)

is_canceled : If the booking was canceled (1) or not (0)

lead_time : Number of days between the booking date and the arrival date

arrival_date_year : Year of arrival date

arrival_date_month : Month of arrival date

arrival_date_week_number : Week number of year for arrival date

arrival_date_day_of_month : Day of arrival date

stays_in_weekend_nights : Number of weekend nights (Saturday or Sunday) spent at the hotel by the guests.

stays_in_week_nights : Number of weeknights (Monday to Friday) spent at the hotel by the guests.

adults : Number of adults among guests

children : Number of children among guests

babies : Number of babies among guests





meal : Type of meal booked

country : Country of guests

market_segment : Designation of market segment

distribution_channel : Name of booking distribution channel

is_repeated_guest : If the booking was from a repeated guest (1) or not (0)

previous_cancellations : Number of previous bookings that were cancelled by the customer prior to the current booking

previous_bookings_not_canceled : Number of previous bookings not cancelled by the customer prior to the current booking

reserved_room_type : Code of room type reserved

assigned_room_type : Code of room type assigned

booking_changes : Number of changes/amendments made to the booking

deposit_type : Type of the deposit made by the guest

agent : ID of travel agent who made the booking

company : ID of the company that made the booking

days_in_waiting_list : Number of days the booking was in the waiting list

customer_type : Type of customer, assuming one of four categories

adr : Average Daily Rate, as defined by dividing the sum of all lodging transactions by the total number of staying nights

required_car_parking_spaces : Number of car parking spaces required by the customer

total_of_special_requests : Number of special requests made by the customer

reservation_status : Reservation status (Canceled, Check-Out or No-Show)

reservation_status_date : Date at which the last reservation status was updated

### Check Unique Values for each variable.

## **Categorical variables**

In [None]:
# Obtaining categorical variables
categorical_veriables=[i for i in df.columns if df[i].dtypes=='O']
print(f'Dataset having {len(categorical_veriables)} categorical variables')
print('--'*39)
print(categorical_veriables)

## **Numerical variables**

In [None]:
# Obtaining numerical variables
numerical_variables=[i for i in df.columns if df[i].dtypes!='O']
print(f'There are {len(numerical_variables)} numerical variables.')
print('--'*39)
print(numerical_variables)

### There are 19 numerical variables.

In [None]:
# Obtaining discrete variables from the numerical variables
# Variables with at most 150 unique values are considered discrete
descrete_variavles=[]
for i in numerical_variables:
  if df[i].nunique()<=150:
    descrete_variavles.append(i)
    print(i,':',df[i].unique())
    print('__'*39)

print(f'The dataset has {len(descrete_variavles)} discrete variables')

In [None]:
# Obtaining continuous variables from the numerical variables
contineous_variables=[i for i in numerical_variables if i not in descrete_variavles]
print(f'The dataset has {len(contineous_variables)} continuous variables')
print('--'*39)
print(contineous_variables)

### The dataset has 3 continuous variables.
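
The discrete/continuous split above can be written as a small function. This is a sketch under assumptions: the name `split_numeric_columns` is hypothetical, and it selects numeric columns with `select_dtypes` rather than the `dtypes != 'O'` test used in the notebook.

```python
import pandas as pd


def split_numeric_columns(frame, max_unique=150):
    """Split numeric columns into (discrete, continuous) by number of unique values."""
    numeric = frame.select_dtypes(include="number").columns
    discrete = [c for c in numeric if frame[c].nunique() <= max_unique]
    continuous = [c for c in numeric if c not in discrete]
    return discrete, continuous


# Toy frame (illustrative values); a small threshold is used so both groups appear
toy = pd.DataFrame({
    "adults": [1, 2, 2, 3, 1, 2],
    "adr": [75.0, 80.5, 99.9, 120.0, 64.3, 88.8],
    "hotel": ["City", "Resort", "City", "City", "Resort", "City"],
})
discrete, continuous = split_numeric_columns(toy, max_unique=3)
print(discrete, continuous)
```

The 150-value threshold is a rule of thumb. An identifier column such as agent can exceed it and still not be a continuous measurement.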

In [None]:
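# Checking for outliers in the continuous variables

from scipy.stats import norm
for i in contineous_variables:
  plt.figure(figsize=(15,6))
  plt.subplot(1,2,1)
  ax=sns.boxplot(data=df[i])
  ax.set_title(f'{i}')
  ax.set_ylabel(i)

  plt.subplot(1,2,2)
  # sns.distplot is deprecated, so use histplot and overlay a fitted normal curve
  ax=sns.histplot(df[i], kde=True, stat='density')
  x_vals=np.linspace(df[i].min(), df[i].max(), 200)
  ax.plot(x_vals, norm.pdf(x_vals, *norm.fit(df[i])), color='black')
  ax.set_title(f'skewness of {i} : {df[i].skew()}')
  ax.set_xlabel(i)
  print('__'*39)
  plt.show()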

### Outliers were found in the variables lead_time and adr, but not in the variable agent.

In [None]:
# Using the interquartile range (IQR) to cap outliers in the skewed variables

# Outlier columns
outliers_columns=['lead_time','adr']

# Copy dataset as new dataset
new_df=df.copy()

# Capping dataset
for i in outliers_columns:
    # Finding IQR
    Q1=new_df[i].quantile(0.25)
    Q3=new_df[i].quantile(0.75)
    IQR=Q3-Q1

    # Defining lower and upper limit
    lower_limit=Q1-1.5*IQR
    upper_limit=Q3+1.5*IQR

    # Capping values outside the limits at the limits
    new_df[i]=new_df[i].clip(lower=lower_limit, upper=upper_limit)

In [None]:
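# Re-checking the capped variables for outliers
from scipy.stats import norm
for i in outliers_columns:
  plt.figure(figsize=(15,6))
  plt.subplot(1,2,1)
  ax=sns.boxplot(data=new_df[i])
  ax.set_title(f'{i}')
  ax.set_ylabel(i)

  plt.subplot(1,2,2)
  # sns.distplot is deprecated, so use histplot and overlay a fitted normal curve
  ax=sns.histplot(new_df[i], kde=True, stat='density')
  x_vals=np.linspace(new_df[i].min(), new_df[i].max(), 200)
  ax.plot(x_vals, norm.pdf(x_vals, *norm.fit(new_df[i])), color='black')
  ax.set_title(f'skewness of {i} : {new_df[i].skew()}')
  ax.set_xlabel(i)
  print('__'*50)
  plt.show()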

### Outliers in the lead_time and adr variables were capped at the IQR limits.
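
To make the IQR rule concrete, here is a small worked sketch on toy values. The helper name `cap_outliers_iqr` is hypothetical. It clips values to the range [Q1 - 1.5·IQR, Q3 + 1.5·IQR], which is the same capping the loop above applies.

```python
import pandas as pd


def cap_outliers_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those limits."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy values: Q1 = 2, Q3 = 4, IQR = 2, so the limits are -1 and 7
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
print(cap_outliers_iqr(toy).tolist())  # → [1.0, 2.0, 3.0, 4.0, 7.0]
```

Capping keeps every row and only pulls extreme values back to the limit, so the number of observations does not change.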

In [None]:
# Describe outlier free new_df
new_df.describe()

In [None]:
# Check Unique Values for each variable.
print(df.nunique())
df['is_canceled'].unique()

In [None]:
df['arrival_date_year'].unique()

In [None]:
hotel_df['arrival_date_month'].unique()

In [None]:

hotel_df['arrival_date_week_number'].unique()

In [None]:
df['meal'].unique()

In [None]:
df['market_segment'].unique()

In [None]:
df['distribution_channel'].unique()

In [None]:

hotel_df['adults'].unique()

In [None]:
df['children'].unique()

In [None]:

hotel_df['babies'].unique()

In [None]:

hotel_df['reserved_room_type'].unique()

In [None]:
hotel_df['assigned_room_type'].unique()

In [None]:

hotel_df['deposit_type'].unique()

In [None]:

hotel_df['agent'].unique()

In [None]:
for elem in hotel_df.columns:
  print('Number of unique values in',elem,'column is',hotel_df[elem].nunique())

In [None]:
hotel_df['lead_time'].unique()

In [None]:
hotel_df['customer_type'].unique()

In [None]:
hotel_df['reservation_status_date'].unique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Dataset
new_df

In [None]:
#checking unique values in each variable
for i in new_df.columns:
    print(f'{i}:{new_df[i].unique()}')
    print('__'*50)

In [None]:
# Checking info of the newly formed dataset after capping outliers
new_df.info()

In [None]:
# Change datatype of variables children and agent to correct format from float64 to int64
new_df[['children','agent']]=new_df[['children','agent']].astype('int64')

# Change datatype of variable reservation_status_date to correct format from object to datetime64
new_df['reservation_status_date']=pd.to_datetime(new_df['reservation_status_date'], format='%Y-%m-%d')

In [None]:
# Checking datatype
new_df[['children','agent','reservation_status_date']].info()

In [None]:
# Adding weekend nights and week nights into one variable, 'total_stays'
new_df['total_stays']=new_df['stays_in_weekend_nights']+ new_df['stays_in_week_nights']

# Creating 'total_people' by adding "adults," "children," and "babies"
# (new_df['children'] is used so the integer-cast column is added, not the float column from df)
new_df['total_people']= new_df['adults']+ new_df['children']+ new_df['babies']

# Creating 'total_childrens' variable by adding 'children' and 'babies' variables
new_df['total_childrens']= new_df['children']+ new_df['babies']

# Creating 'reserved_room_assigned' variable, which describes whether the reserved room type was assigned
new_df['reserved_room_assigned']=np.where(new_df['reserved_room_type']==new_df['assigned_room_type'], 'yes', 'no')

# Creating 'guest_category' from variable 'total_people'
new_df['guest_category']=np.where(new_df['total_people']==1, 'single',
                                 np.where(new_df['total_people']==2, 'couple', 'family'))

# Creating 'lead_time_category' from the 'lead_time' variable to display category
new_df['lead_time_category']=np.where(new_df['lead_time']<=15, 'low',
                                 np.where((new_df['lead_time']>15) & (new_df['lead_time']<90), 'medium', 'high'))

#checking dataset
new_df.head()

In [None]:
new_df.shape

In [None]:
# Remove observations having value 0 in total_people variable
new_df.drop(new_df[new_df['total_people']==0].index, inplace=True)

### Because observations of the variable total_people cannot be zero, observations with 0 values are removed, reducing the number of observations to 87230 from 87396.

In [None]:
# Checking info of new dataset
new_df.info()

In [None]:
# Ensuring the variables total_people and total_childrens are stored as int64
new_df['total_people']=new_df['total_people'].astype('int64')
new_df['total_childrens']=new_df['total_childrens'].astype('int64')


# Checking datatype of total_people and total_childrens
new_df[['total_people','total_childrens']].info()

### What all manipulations have you done and insights you found?

* The variables "children" and "agent" were stored as float64, so they were converted to int64, as were the derived variables "total_people" and "total_childrens". The variable "reservation_status_date" was converted from the object datatype to datetime64.
* For convenience, the variables "total_stays" and "total_people" were created. "total_stays" is the sum of "stays_in_weekend_nights" and "stays_in_week_nights". "total_people" is the sum of "adults," "children," and "babies."
* The variable "reserved_room_assigned" is derived from "reserved_room_type" and "assigned_room_type" and describes whether or not the customer was assigned the room type they reserved. A new "total_childrens" variable was created by adding "children" and "babies."
* The variable "total_people" was used to create "guest_category," which describes whether a booking was made for an individual, a couple, or a family. The variable "lead_time_category" was created from "lead_time" to label lead time as low, medium, or high.
* The variable "total_people" cannot be 0, so observations where it is 0 were dropped.
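
The category rules described above can also be expressed with `np.select`, which avoids nesting `np.where` calls. This is a sketch on toy data. The thresholds mirror the ones used in the wrangling cell, and the toy values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy bookings (illustrative values only)
toy = pd.DataFrame({
    "adults": [1, 2, 2, 2],
    "children": [0, 0, 1, 0],
    "babies": [0, 0, 1, 0],
    "lead_time": [3, 15, 89, 90],
})

toy["total_people"] = toy["adults"] + toy["children"] + toy["babies"]

# 1 person -> single, 2 people -> couple, anything else -> family
toy["guest_category"] = np.select(
    [toy["total_people"] == 1, toy["total_people"] == 2],
    ["single", "couple"],
    default="family",
)

# lead_time <= 15 -> low, 15 < lead_time < 90 -> medium, lead_time >= 90 -> high
# (np.select takes the first matching condition, so the second test only sees values > 15)
toy["lead_time_category"] = np.select(
    [toy["lead_time"] <= 15, toy["lead_time"] < 90],
    ["low", "medium"],
    default="high",
)
print(toy[["total_people", "guest_category", "lead_time_category"]])
```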

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Selecting the variables to plot: those with at most 15 unique values
excluded_variables=[var for var in new_df.columns if len(new_df[var].value_counts()) > 15]
target_variables=[var for var in new_df.columns if var not in excluded_variables]

# Defining r, the number of subplot rows needed for a 3-column grid
r = int(len(target_variables)/3 +1)

In [None]:
# Defining a function to annotate the percent count of each value on the bars
def annot_percent(axes):
    '''Takes axes as input and labels the percent count of each bar in a countplot'''
    # Total height of all bars, computed once from the axes passed in
    total = sum(p.get_height() for p in axes.patches)
    for p in axes.patches:
        percent = round(100*p.get_height()/total, 2)
        x = p.get_x() + p.get_width()/2
        y = p.get_height()
        axes.annotate(f'{percent}%', (x, y), ha='center', va='bottom')

In [None]:
# Plotting the countplots for each variable in target_variables
plt.figure(figsize=(18,r*3))
for n,var in enumerate(target_variables):
    plot = plt.subplot(r,3,n+1)
    sns.countplot(x=new_df[var]).margins(y=0.15)
    plt.title(f'{var.title()}',weight='bold')
    annot_percent(plot)
# Adjust the layout once, after all subplots are drawn
plt.tight_layout()

##### 1. Why did you pick the specific chart?

Count plots were chosen because they give a concise yet informative view of the categorical and low-cardinality variables in the dataset, making it quick to assess their distributions and relative frequencies. This is particularly useful in the initial stage of exploratory data analysis (EDA) for understanding the data's composition and spotting potential patterns.

##### 2. What is/are the insight(s) found from the chart?

The countplots show how bookings are distributed across the categories of each plotted variable, with the percentage share annotated on every bar. The main things to look for are:

1. Distribution patterns: which categories dominate each variable.

2. Rare categories: categories that account for only a small share of bookings.

3. Potential outliers: values that appear in very few bookings and may need a closer look.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The insights gained from the countplots of the categorical variables can significantly influence business decisions, both positively and negatively.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

#Heatmap

num_df = df[['lead_time','previous_cancellations','previous_bookings_not_canceled','booking_changes','days_in_waiting_list','adr','required_car_parking_spaces','total_of_special_requests']]



In [None]:
#correlation matrix
corrmat = num_df.corr()
f, ax = plt.subplots(figsize=(12, 7))
sns.heatmap(corrmat,annot = True,fmt='.2f', annot_kws={'size': 10},  vmax=.8, square=True);

##### 1. Why did you pick the specific chart?

### The heatmap visualization of the correlation matrix is a popular choice for exploratory data analysis because it efficiently communicates the relationships between numerical variables, facilitating data-driven decision-making and further analysis. Heatmaps make it easier to identify patterns and relationships in the data and present information in a compact and visually appealing format. This makes them suitable for presentations.

##### 2. What is/are the insight(s) found from the chart?

### The heatmap shows the strength and direction of the correlations between the numerical variables. These correlations provide a starting point for further analysis and decision-making, such as refining pricing strategies, improving customer segmentation, or enhancing service offerings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### The insights gained from the correlation heatmap can potentially contribute to positive business impacts, but they also hold the possibility of leading to negative growth if not properly interpreted or addressed.

### Businesses can adjust their rates in line with the observed correlations, for example for longer stays, potentially increasing revenue. Identifying the relationship between booking changes and previous bookings not canceled could lead to strategies for enhancing customer engagement and loyalty, potentially resulting in repeat business and positive word-of-mouth.

### If one wrongly assumes that longer lead times always warrant lower rates, customers booking at the last minute might feel unfairly charged, potentially leading to dissatisfaction and loss of business. Likewise, while the correlation between booking changes and previous bookings not canceled may suggest customer loyalty, it could also signal operational inefficiencies if the changes stem from errors or inadequate booking management systems, potentially leading to negative reviews and fewer bookings.
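
Reading a heatmap by eye can miss the strongest relationships, so the correlation matrix can also be ranked numerically. This is a minimal sketch on toy data, and the helper name `top_correlations` is hypothetical. On the real data it would be called on `num_df`.

```python
from itertools import combinations

import pandas as pd


def top_correlations(frame, n=5):
    """Return the n variable pairs with the largest absolute Pearson correlation."""
    corr = frame.corr(numeric_only=True)
    # One entry per unordered pair of columns, skipping the diagonal
    pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
    return pd.Series(pairs).sort_values(key=abs, ascending=False).head(n)


# Toy frame (illustrative values): b is an exact multiple of a
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})
top = top_correlations(toy, n=2)
print(top)
```

Sorting by absolute value keeps strong negative correlations near the top as well.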

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Visualizing by pie chart
hotel_df['hotel'].value_counts().plot.pie(explode=[0.05, 0.05], autopct ='%1.1f%%', shadow = True, figsize =(10,9), fontsize = 20)

# Set labels
plt.title('Pie Chart for Most Preferred Hotel', fontsize = 20)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

### A pie chart expresses a part-to-whole relationship in the data. It makes percentage comparisons easy to read through the area each colored slice covers in the circle, so it is commonly used wherever shares of a whole are compared. I used a pie chart here because it shows the share of bookings for each hotel type clearly.

##### 2. What is/are the insight(s) found from the chart?

### The chart shows that City Hotel is the most preferred hotel and has the most bookings: 61.1% of bookings are for City Hotel, while only 38.9% are for Resort Hotel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Yes, this chart can have a positive business impact for both types of hotel.

### City Hotels are doing well, so they can keep extending their services to attract more guests and increase revenue. Resort Hotels attract fewer guests than City Hotels, so they need to find ways to attract guests and study what City Hotels have done. There is considerable scope for growth in Resort Hotels if they upgrade their services, learn from the successful strategies of City Hotels, and add new ideas of their own.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Group by Hotel
group_by_hotel = hotel_df.groupby('hotel')

# Mean ADR for each hotel type
highest_adr = group_by_hotel['adr'].mean().reset_index()

# Set plot size
plt.figure(figsize = (10,8))

# Draw the bar plot
ax = sns.barplot(x= highest_adr['hotel'], y= highest_adr['adr'])

# Set labels (tick labels come from the data, so only their size is changed)
ax.set_xlabel("Hotel type", fontsize = 20)
ax.set_ylabel("ADR", fontsize = 20)
ax.tick_params(axis='x', labelsize = 16)
ax.set_title('Average ADR of each Hotel type', fontsize = 20)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

### Bar charts compare a statistic, such as a count, a percentage, or a mean, across the levels of a categorical variable.

### I used a bar chart here to show the average ADR of each hotel type clearly.

##### 2. What is/are the insight(s) found from the chart?

### City Hotel has the higher average ADR, which means City Hotels earn more per occupied room night than Resort Hotels. All else being equal, a higher ADR means higher revenue, although total revenue also depends on the number of nights sold.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Yes. Hotels can advertise more to attract more customers, which will ultimately add to their revenue. City Hotels already enjoy a high ADR, and further efforts toward growth should add to their overall revenue.
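
The link between ADR and revenue drawn above can be checked more directly by estimating revenue per booking. The sketch below assumes revenue is roughly `adr` multiplied by the total nights of bookings that were not canceled. The helper name `estimated_revenue_by_hotel` and the toy values are hypothetical.

```python
import pandas as pd


def estimated_revenue_by_hotel(frame):
    """Approximate room revenue per hotel as adr * nights over non-canceled bookings."""
    stayed = frame[frame["is_canceled"] == 0].copy()
    nights = stayed["stays_in_weekend_nights"] + stayed["stays_in_week_nights"]
    stayed["estimated_revenue"] = stayed["adr"] * nights
    return stayed.groupby("hotel")["estimated_revenue"].agg(["sum", "mean"])


# Toy bookings (illustrative values only)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "is_canceled": [0, 1, 0, 0],
    "adr": [100.0, 120.0, 80.0, 90.0],
    "stays_in_weekend_nights": [1, 2, 2, 0],
    "stays_in_week_nights": [1, 3, 3, 2],
})
revenue = estimated_revenue_by_hotel(toy)
print(revenue)
```

In this toy example the hotel with the higher ADR earns less in total, because cancellations and shorter stays offset the higher rate.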



#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Group by adr and hotel, counting the bookings at each ADR value
adr_vs_total_stay = hotel_df.groupby(['adr','hotel']).agg('count').reset_index()
# Keep adr, hotel, and the first count column
adr_vs_total_stay = adr_vs_total_stay.iloc[:, :3]
adr_vs_total_stay = adr_vs_total_stay.rename(columns = {'is_canceled':'number_of_stays'})
# Keep the first 18000 rows
adr_vs_total_stay = adr_vs_total_stay[:18000]
adr_vs_total_stay

In [None]:

adr_vs_total_stay.groupby('hotel').size().plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)

##### 1. Why did you pick the specific chart?

A horizontal bar chart was used to compare the two hotel types side by side after grouping the bookings by ADR and hotel.

##### 2. What is/are the insight(s) found from the chart?

From this analysis, we found that ADR tends to increase as the total stay increases, so ADR is positively related to total stay.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels should focus on increasing their ADR. More advertising, better facilities, and attractive offers can encourage guests to stay longer, which, given the relationship above, should raise ADR and ultimately revenue.
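
The relationship between ADR and total stay discussed above can be examined with a table of mean ADR per stay length and hotel. This is a sketch under assumptions: the helper name `mean_adr_by_stay` is hypothetical, and total stay is taken as weekend nights plus week nights.

```python
import pandas as pd


def mean_adr_by_stay(frame):
    """Mean ADR for each total stay length, with one column per hotel."""
    data = frame.assign(
        total_stay=frame["stays_in_weekend_nights"] + frame["stays_in_week_nights"]
    )
    return data.pivot_table(index="total_stay", columns="hotel", values="adr", aggfunc="mean")


# Toy bookings (illustrative values only)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel", "Resort Hotel"],
    "stays_in_weekend_nights": [0, 0, 1, 1],
    "stays_in_week_nights": [1, 1, 2, 2],
    "adr": [90.0, 110.0, 120.0, 70.0],
})
table = mean_adr_by_stay(toy)
print(table)
```

Calling `table.plot()` on the result would draw one line per hotel, which is one way to see whether ADR actually rises with stay length.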

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Visualizing by pie chart
hotel_df['is_repeated_guest'].value_counts().plot.pie(explode=[0.05, 0.05], autopct ='%1.1f%%', shadow = True, figsize =(10,9), fontsize = 20)

# Set labels
plt.title('Percentage (%) of repeated guests', fontsize = 20)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

Pie charts show how a whole is divided into parts, with each slice representing the percentage of one category. I used one here to show the share of repeated guests (1) versus non-repeated guests (0) as differently colored slices of the circle.

##### 2. What is/are the insight(s) found from the chart?

Repeated guests are very few: only 3.9% of guests are repeat guests, while 96.1% are not returning to the same hotel. Management should therefore take steps to increase the number of repeated guests for both types of hotel. To retain guests, management should collect feedback from them and try to improve the services.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The proportion of repeated guests is very low, so if hotels work on this, the increase in repeated guests will ultimately boost their revenue. Hotels can give attractive offers to first-time customers during off seasons, and should take steps such as collecting feedback, resolving customer problems promptly, and offering the best deals.
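
Because the recommendation above applies to both hotel types, it can help to compute the repeat-guest rate per hotel rather than overall. The sketch below uses toy data with illustrative values. On the real data the same expression would run on `hotel_df`.

```python
import pandas as pd

# Toy bookings (illustrative values only)
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "is_repeated_guest": [0, 0, 0, 1, 0, 1, 1, 0],
})

# The mean of a 0/1 flag is the share of rows equal to 1
repeat_rate = toy.groupby("hotel")["is_repeated_guest"].mean().mul(100).round(1)
print(repeat_rate)
```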

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Count the bookings for each number of required parking spaces
parking_counts = hotel_df['required_car_parking_spaces'].value_counts()

# Visualizing by pie chart (one explode entry per slice)
parking_counts.plot.pie(explode=[0.05]*len(parking_counts), autopct ='%1.1f%%', shadow = False, figsize =(12,8), fontsize = 20, labels = None)

# Legend labels
labels = parking_counts.index

# Set labels
plt.title('% Distribution of\nrequired car parking spaces', fontsize = 20)
plt.legend(bbox_to_anchor = (0.85, 1), loc = 'upper left', labels = labels)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

I used a pie chart here because it presents the result in an easily understood way: the differently colored slices clearly reflect the demand for car parking spaces. Other charts could also be used, but I found the pie chart most relevant here.

##### 2. What is/are the insight(s) found from the chart?

This chart shows that 91.6% of guests did not require a parking space, and only 8.3% of guests required one.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained here can help the hotels provide better services. Hotels need to invest less in car parking, as only 8.3% of guests required a parking space. It is better to focus on other areas to improve the quality of the hotel rather than mainly on parking. The low demand for parking might be because many guests prefer to use public transport.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Set plot size
plt.figure(figsize=(10,6))

# Create the figure object
sns.countplot(x = hotel_df['meal'])

# Set labels
plt.xlabel('Meal Type', fontsize = 16)
plt.ylabel('Count', fontsize = 16)
plt.title('Preferred Meal Type', fontsize = 20)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

I used the count plot here because it shows the count of observations in each category using bars. Bar plots look similar, but instead of counts they show the mean of a quantitative variable within each category. To get clear insight into the counts of the different meal types, I used a count plot.

##### 2. What is/are the insight(s) found from the chart?

The insight from the above graph is that the most preferred meal type is BB (Bed and Breakfast), while HB (Half Board) and SC (Self Catering) are almost equally preferred. The meal types in the dataset are as follows:

BB - (Bed and Breakfast)

HB - (Half Board)

FB - (Full Board)

SC - (Self Catering)

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights also have a positive impact: hotels should focus most on the BB meal type so that the majority of customers are satisfied, while still giving the other meal types proper attention and managing food services well so as to offer the best service to customers.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
bookings_by_months_df = hotel_df.groupby(['arrival_date_month'])['hotel'].count().reset_index().rename(columns = {'hotel':'Counts'})

# Creating list of months in order
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# Creating dataframe which will map the order of above months list without changing its values
bookings_by_months_df['arrival_date_month'] = pd.Categorical(bookings_by_months_df['arrival_date_month'], categories = months, ordered = True)

# Sorting by arrival_date_month
bookings_by_months_df = bookings_by_months_df.sort_values('arrival_date_month')

bookings_by_months_df

In [None]:
# Visualizing with the help of line plot

# Set plot size
plt.figure(figsize = (14,6))

# Plotting lineplot on x- months & y- bookings counts
sns.lineplot(x = bookings_by_months_df['arrival_date_month'], y = bookings_by_months_df['Counts'])

# Set title
plt.title('Number of bookings across each month', fontsize = 20)

# Set labels
plt.xlabel('Month', fontsize = 16)
plt.ylabel('Number of bookings', fontsize = 16)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

For the 1st chart, I have picked a line chart because it helps to show small shifts that may be hard to spot in other graphs. Line charts show trends across periods and are easy to understand, so here we can easily track how the number of bookings changes from month to month.

For the 2nd chart, a bar plot has been used. I have used it to get a clear view of the relation between the total length of stay in days and the count of stays (that is, the total number of customers who stayed).

##### 2. What is/are the insight(s) found from the chart?

From the 1st chart, I have found that July and August had the most bookings. July and August generally fall in or near the summer vacation, so the summer vacation can be the reason for these bookings.

The 2nd chart gives us different insights. From the above observations, we have found that the optimal stay in both hotel types is less than 7 days; beyond that, the number of stays declines drastically.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The 1st chart makes it clear that hotels should be well prepared for the months of July and August, as the maximum number of bookings takes place in these months. Better preparation and a good approach will definitely add to the growth of hotels.

The 2nd chart also has a positive impact. From the insights gathered here, hotels can work to increase the length of stay of customers and thereby increase their revenue. The other takeaway is that customers usually prefer a stay of up to one week, so hotels need to work efficiently during these seven days so that customers return to the same hotel, which will increase revenue.
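
The stay-length observation can also be checked numerically. This is a minimal sketch on toy values (hypothetical, not taken from the dataset); it reuses the column names from this notebook, and the 7-night threshold is an assumption based on the one-week reading above.

```python
import pandas as pd

# Toy data: hypothetical values, NOT taken from the hotel dataset
toy_df = pd.DataFrame({
    'stays_in_weekend_nights': [0, 2, 1, 2, 4, 0],
    'stays_in_week_nights':    [2, 5, 3, 1, 10, 1],
})

# Total stay in nights, built the same way as 'total_stay' in this notebook
toy_df['total_stay'] = toy_df['stays_in_weekend_nights'] + toy_df['stays_in_week_nights']

# Number of bookings for each stay length, ordered by length
stay_counts = toy_df['total_stay'].value_counts().sort_index()

# Share of bookings that last at most one week (threshold is an assumption)
share_within_week = (toy_df['total_stay'] <= 7).mean()

print(stay_counts)
print(f'{share_within_week:.1%} of stays last 7 nights or fewer')
```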

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Set the plot size
hotel_df.hist(figsize = (23,18))

# To show
plt.show()

##### 1. Why did you pick the specific chart?

To understand the data clearly and gain proper insights, I have used histograms here. The histogram is a popular graphing tool used to summarize discrete or continuous data measured on an interval scale. It illustrates the major features of a distribution in a convenient form, it is useful when dealing with large datasets (more than 100 observations), and it can help detect unusual observations (outliers) or gaps in the data. Thus, I have used histograms to analyse the distribution of each variable over the whole dataset and to see whether it is symmetric or not.
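
Symmetry can also be quantified instead of being judged by eye. The sketch below is an illustration on toy values (hypothetical, not from the dataset) using pandas' `skew()`: a value near 0 suggests a roughly symmetric distribution, while a clearly positive value suggests a long right tail.

```python
import pandas as pd

# Toy data: hypothetical values, NOT taken from the hotel dataset
toy_df = pd.DataFrame({
    'lead_time': [0, 1, 2, 3, 4, 50],                 # one large value -> long right tail
    'arrival_date_week_number': [1, 2, 3, 4, 5, 6],   # evenly spread -> symmetric
})

# Sample skewness of every numeric column
skewness = toy_df.skew()

print(skewness)
```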

##### 2. What is/are the insight(s) found from the chart?

Some insights found from the chart are as follows:-

We can see that the maximum number of guests came in the year 2016.

The arrival week number with the most arrivals is around week 30.

Most arrivals happen towards the end of the month.

Most guests come with no children.

There is very little demand for car parking spaces.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Just a histogram cannot define business impact. It's done just to see the distribution of the column data over the dataset.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Finding out the percentage and counts of confirmed and canceled bookings
# Plotting a Count Plot chart using seaborn for counts of confirmed and canceled bookings

# Set plot size
plt.figure(figsize = (10,6))

# Create the figure object
sns.countplot(x = 'hotel', hue = 'is_canceled', palette = 'Set2', data = hotel_df)

# Set legends
plt.legend(['Confirmed', 'Canceled'])

# Set labels
plt.title('Hotel wise confirmation and cancelation of the bookings', fontsize = 20)
plt.ylabel('Count of\nconfirmation and cancelation', fontsize = 16)
plt.xlabel('Hotel Type', fontsize = 16)

# To show
plt.show()

In [None]:
# Plotting a Pie chart using matplotlib for percentage of confirmed and canceled bookings of Resort Hotel
resort_hotel = hotel_df.loc[(hotel_df['hotel'] == 'Resort Hotel')]
resort_hotel_checking_cancel = resort_hotel['is_canceled'].value_counts()

# Set labels
mylabels = ['Confirmed', 'Canceled']

# Set explode offsets (pull the first slice out of the pie)
myexplode = [0.2, 0]

# Create the figure object
resort_hotel_cancelation = plt.pie(resort_hotel_checking_cancel, labels = mylabels, explode = myexplode, autopct = '%1.1f%%')

# Set title
plt.title('Resort Hotel\nConfirmed and Cancelation')

resort_hotel_checking_cancel

In [None]:

# Removing the canceled bookings from the data and creating a new dataframe
data_not_canceled = hotel_df[hotel_df['is_canceled'] == 0]

# Year wise Bookings of hotels
# Set style
sns.set_style(style = 'darkgrid')

# Set plot size
plt.figure(figsize = (12,6))

# Create the figure object
sns.countplot(x= 'arrival_date_year', hue= 'hotel', palette = 'tab10', data = data_not_canceled)

# Set legends
plt.legend(['Resort Hotel', 'City Hotel'])

# Set labels
plt.title('Year wise bookings of hotels', fontsize = 20)
plt.ylabel('Number of bookings', fontsize = 16)
plt.xlabel('Year', fontsize = 16)

# To show
plt.show()

##### 1. Why did you pick the specific chart?


I have picked the count plot and pie plot to get proper insights into hotel-wise cancelation and confirmation of bookings.

##### 2. What is/are the insight(s) found from the chart?

We can clearly deduce from the above graphs that the City Hotel has a greater number of bookings than the Resort Hotel. However, the cancelation percentage of the City Hotel is also high.

From the above graphs, it can be summarised that in 2016 both hotels saw a massive increase in bookings, and 2016 is by far the year with the highest bookings for both hotels. In 2016 and 2017 the City Hotel has the higher number of bookings, but in 2015 the Resort Hotel has the higher number.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Overall the graphs show a positive outcome, but the cancelation graph is a matter of deep concern: more than 1/4th of all bookings got canceled. We need to look into this problem. The solution is to investigate the reasons why bookings are canceled and to get them sorted out as soon as possible at the business level, before the problem grows.
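
Because `is_canceled` is coded as 0/1, the cancelation percentage per hotel is simply the group mean. A minimal sketch on toy values (hypothetical, not from the dataset), using the same column names as this notebook:

```python
import pandas as pd

# Toy data: hypothetical values, NOT taken from the hotel dataset
toy_df = pd.DataFrame({
    'hotel': ['City Hotel'] * 4 + ['Resort Hotel'] * 4,
    'is_canceled': [1, 1, 0, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 flag is the fraction of canceled bookings; scale to percent
cancel_rate_pct = (
    toy_df.groupby('hotel')['is_canceled']
    .mean()
    .mul(100)
    .round(1)
)

print(cancel_rate_pct)
```

Comparing percentages rather than raw counts matters here because the two hotels have very different booking volumes.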

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Using groupby function
bookings_by_months_df = hotel_df.groupby(['arrival_date_month', 'hotel'])['adr'].mean().reset_index()

# Create month list
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# It will take the order of the month list in the dataframe along with values
bookings_by_months_df['arrival_date_month'] = pd.Categorical(bookings_by_months_df['arrival_date_month'], categories = months, ordered = True)

# Sorting values
bookings_by_months_df = bookings_by_months_df.sort_values('arrival_date_month')

bookings_by_months_df

In [None]:

# Visualizing with the help of line plot

# Set plot size
plt.figure(figsize = (14,6))

# Create the figure object and plotting the line
sns.lineplot(x = bookings_by_months_df['arrival_date_month'], y = bookings_by_months_df['adr'], hue = bookings_by_months_df['hotel'])

# Set labels
plt.title('ADR across Each Month', fontsize = 20)
plt.xlabel('Month', fontsize = 16)
plt.ylabel('ADR', fontsize = 16)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

I have picked a line chart here to get clear insights into the ADR of the City and Resort hotels across each month. A line chart is very useful because it helps to show small shifts that may be hard to spot in other graphs. It shows trends across periods and is easy to understand. To compare data, more than one line can be plotted on the same axes.

##### 2. What is/are the insight(s) found from the chart?

The insights found from the chart are as follows:-

For the Resort Hotel, ADR in the months of June, July and August is high compared to the City Hotel. The reason may be that people want to spend their summer vacation in resort hotels.

The best time for guests to visit Resort or City Hotels is January, February, March, April, October, November and December, as the average daily rate in these months is very low. So, a visit in these months would be more affordable.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The higher the ADR, the higher the revenue, so it's a good sign. Hotels should work to enhance their ADR by offering good schemes to attract customers during the winter vacation and other holidays as well.
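
The monthly ADR comparison can also be read from a small table rather than from the line plot. This is a sketch on toy values (hypothetical, not from the dataset) that pivots mean ADR by month and hotel; the ordered categorical keeps the months in calendar order, as in the plotting code of this chart.

```python
import pandas as pd

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

# Toy data: hypothetical values, NOT taken from the hotel dataset
toy_df = pd.DataFrame({
    'arrival_date_month': ['July', 'July', 'January', 'January', 'July', 'January'],
    'hotel': ['Resort Hotel', 'City Hotel', 'Resort Hotel',
              'City Hotel', 'Resort Hotel', 'City Hotel'],
    'adr': [150.0, 110.0, 50.0, 80.0, 170.0, 90.0],
})

# Ordered categorical so that months sort in calendar order, not alphabetically
toy_df['arrival_date_month'] = pd.Categorical(
    toy_df['arrival_date_month'], categories=months, ordered=True
)

# One row per month, one column per hotel, mean ADR in each cell
adr_by_month = toy_df.pivot_table(
    index='arrival_date_month', columns='hotel', values='adr',
    aggfunc='mean', observed=True
)

# Month with the highest mean ADR for each hotel
peak_month = adr_by_month.idxmax()

print(adr_by_month)
print(peak_month)
```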

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# The column 'total_stay' was already created above; it is recomputed here
# as the total number of nights stayed (weekend nights + week nights)
hotel_df['total_stay'] = hotel_df['stays_in_weekend_nights'] + hotel_df['stays_in_week_nights']

# Set the plot size
plt.figure(figsize=(14,7))

# Using a violin plot to see in which weeks visitors stay the longest
sns.violinplot(x = 'arrival_date_week_number', y = 'total_stay', palette = 'Set2', data = hotel_df)

# Set labels
plt.title('Week wise number of stays', fontsize = 20)
plt.ylabel('Number of stays', fontsize = 16)
plt.xlabel('Week number', fontsize = 16)

# To show
plt.show()

In [None]:
# Visualizing with the help of pie plot
hotel_df['is_canceled'].value_counts().plot.pie(explode = [0.05,0.05], autopct = '%1.1f%%', shadow = True, figsize = (10,8), fontsize = 20)

# Set title
plt.title('Cancelation and Non-Cancelation', fontsize = 20)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

I have used a violin plot here to examine the relation between the arrival week and the length of stay. Violin plots are used when one wants to observe the distribution of numeric data, and they are especially useful for comparing distributions between multiple groups. The peaks, valleys and tails of each group's density curve can be compared to see where groups are similar or different.

I have picked the pie plot because it gives a precise and clear view of the split between the two categories. As we can see, 27.5% of the bookings were canceled. Here, 0 denotes not canceled and 1 denotes canceled. I have used the pie plot because it represents data visually as a fractional part of a whole, which can be an effective communication tool even for an uninformed audience. It enables the audience to see a data comparison at a glance and to understand the information quickly.

##### 2. What is/are the insight(s) found from the chart?

From the above violin plot, we have found that weeks 28 to 31 show the longest stays, weeks 1 to 11 show a very steady trend in the number of stays, and weeks 18 to 22 show the fewest stays by visitors, aggregated over all 3 years (2015, 2016 and 2017).

From the pie chart, we have found that more than 1/4th of the overall bookings, i.e. approx. 27.5%, got canceled.
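
With around 50 violins on one axis, the week-wise pattern can be hard to read, so a per-week summary table is a useful companion. A minimal sketch on toy values (hypothetical, not from the dataset), assuming the `total_stay` column defined in this chart:

```python
import pandas as pd

# Toy data: hypothetical values, NOT taken from the hotel dataset
toy_df = pd.DataFrame({
    'arrival_date_week_number': [1, 1, 20, 20, 30, 30],
    'total_stay': [3, 3, 1, 2, 7, 5],
})

# Mean, median and number of stays for every arrival week
weekly_stay = (
    toy_df.groupby('arrival_date_week_number')['total_stay']
    .agg(['mean', 'median', 'count'])
)

# Week with the longest average stay
longest_week = weekly_stay['mean'].idxmax()

print(weekly_stay)
print('Week with the longest average stay:', longest_week)
```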

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, according to the outcomes, the client can plan better to provide better services to guests, so that revenue can be increased.

As we can see, more than 27% of bookings got canceled, which is a matter of deep concern. We need to look into this problem. The solution is to investigate the reasons why bookings are canceled and to get them sorted out as soon as possible at the business level, before the problem grows.

#### Chart - 14 - Correlation Heatmap

In [None]:
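# Chart - 14 visualization code
# Count plot of assigned room types (the correlation heatmap itself is plotted in the last cell of this chart)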
# Set the plot size
plt.figure(figsize = (14,6))

# Create the figure object
sns.countplot(x = hotel_df['assigned_room_type'], order = hotel_df['assigned_room_type'].value_counts().index)

# Set labels
plt.xlabel('Room Type', fontsize = 16)
plt.ylabel('Count of Room type', fontsize = 16)
plt.title('Most preferred Room Type', fontsize = 20)

# To show
plt.show()

In [None]:
# Using seaborn to plot a count plot showing which types of customer visit the most
# Set the plot size
plt.figure(figsize = (12,6))

# Create the figure object
sns.countplot(x = 'arrival_date_month', hue = 'customer_type', palette = 'Set2', data = hotel_df)

# Set labels
plt.xlabel('Months', fontsize = 16)
plt.ylabel('Number of customers', fontsize = 16)
plt.title('Types of customer arrived month wise', fontsize = 20)

# To show
plt.show()

In [None]:
# Plotting a correlation heatmap for the dataset
plt.figure(figsize=(15,8))

# Select only numerical columns for correlation
numerical_df = new_df.select_dtypes(include=['number'])

# Fix both ends of the colour scale so that zero correlation sits at the neutral colour
sns.heatmap(numerical_df.corr(), vmin=-1, vmax=1, cmap='coolwarm', annot=True)
plt.show()

##### 1. Why did you pick the specific chart?

For the 1st visualization, I have picked a bar chart (count plot) to show the distribution by volume (count of rooms) of the room types allotted. A bar graph summarises a large set of data in a simple visual form, displays the frequency of each category, and clarifies trends better than a table. So, I have used a bar graph here.

The 2nd visualization is a count plot because it gives clear insights into the total number of guests who visited, split by customer type. So, I have used a count plot here to learn about the types of guests.

##### 2. What is/are the insight(s) found from the chart?

From the 1st chart, it is found that the most preferred room type is 'A': the majority of guests have shown interest in this room type.

From the 2nd graph, it can be summarised that Transient customers visit the most, whereas customers of the Group type visit the least.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, from the graph it can be seen that there are positive impacts: room types 'A', 'D' and 'E' are preferred by guests, possibly due to the better services offered in these room types. Every booking benefits the hotel regardless of room type, but hotels should also look into the factors behind the low preference for some room types. If the other room types gain popularity as well, hotels will receive more bookings, resulting in more revenue.

Of course, a better understanding of the different types of guests will help in taking the right steps regarding services, facilities, requirements and offers, which will directly result in business growth.
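
The numbers behind the month-wise customer type plot can be tabulated with `pd.crosstab`. This is a sketch on toy values (hypothetical, not from the dataset), using the column names from this notebook:

```python
import pandas as pd

# Toy data: hypothetical values, NOT taken from the hotel dataset
toy_df = pd.DataFrame({
    'arrival_date_month': ['July', 'July', 'July', 'August', 'August', 'August'],
    'customer_type': ['Transient', 'Transient', 'Group',
                      'Transient', 'Contract', 'Transient-Party'],
})

# Rows: month, columns: customer type, cells: number of bookings
type_by_month = pd.crosstab(toy_df['arrival_date_month'], toy_df['customer_type'])

# Overall ranking of customer types across all months
type_totals = type_by_month.sum().sort_values(ascending=False)

print(type_by_month)
print(type_totals)
```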

#### Chart - 15 - Pair Plot

In [None]:
# Creating new dataset
new_df2=new_df[['hotel','is_canceled','lead_time','arrival_date_year','arrival_date_month','meal','market_segment','distribution_channel','reserved_room_type',
       'assigned_room_type','deposit_type','days_in_waiting_list', 'customer_type', 'adr','total_stays',
       'total_people', 'total_childrens', 'reserved_room_assigned',
       'guest_category', 'lead_time_category']]

# Plotting pair plot for dataset
# sns.pairplot creates its own figure, so no separate plt.figure() call is needed
ax = sns.pairplot(new_df2)
plt.show()

##### 1. Why did you pick the specific chart?

Pair plot is used to understand the best set of features to explain a relationship between two variables or to form the most separated clusters. It also helps to form some simple classification models by drawing some simple lines or make linear separation in our data-set.

Thus, I have used a pair plot to analyse the patterns in the data and the relationships between the features. It is similar in purpose to the correlation heatmap, but instead of a single coefficient per pair it shows the relationship graphically.
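
Since a pair plot with many columns is hard to scan, it can help to rank the numeric feature pairs by absolute correlation and then look at the corresponding panels first. The following is a sketch on toy values (hypothetical, not from the dataset); the column names follow `new_df2`, and the ranking step is an illustrative addition rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values, NOT taken from the hotel dataset
toy_df = pd.DataFrame({
    'lead_time':   [10, 20, 30, 40, 50],
    'adr':         [100, 90, 80, 70, 60],
    'total_stays': [1, 3, 2, 5, 4],
})

# Pearson correlation matrix of the numeric columns
corr = toy_df.select_dtypes(include=['number']).corr()

# Keep only the upper triangle so every pair appears once (no self-pairs)
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Long format: one row per feature pair, strongest relationships first
pairs = (
    corr.where(upper)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)

print(pairs)
```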

##### 2. What is/are the insight(s) found from the chart?

We have found the relationship of 'is_repeated_guest' with different types of columns. So, generally this chart reflects the relationship of a particular column with all other columns.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

* A city hotel has more bookings than a resort. Offer packages and promotions to promote bookings for the resort hotel.
* BB is the most requested food. The hotel should maintain food quality while also offering discounts on other foods to promote other food types, reducing the burden on kitchen management and keeping a variety of food options available to customers.
* Most of the bookings are made through the online platform. Hotels can cut costs by eliminating market segments such as complementary and aviation because bookings through these segments are very low.
* Because most bookings are made through the TA/TO distribution channel, followed by the corporate channel, hotels should invest in both TA/TO and corporate distribution channels. The GDS distribution channel can be eliminated by hoteliers because bookings made through it are extremely low.
* Very few customers (3.86%) visited again. So hotels can increase repeat bookings by offering the right repeat booking incentives, understanding the motivations behind repeat bookings, marketing to your guests’ past interests, and assessing past bookings to identify priority guests.
* Because rooms A and D are the most popular with customers, the hotel should maintain their quality. The hotel should promote rooms E, F, and G to increase demand by offering discounts. Because customers do not prefer to book room types B, C, H, and L, the hotel can eliminate them, lowering the cost of these rooms.
* Customers do not want to pay a pre-deposit for a reservation. Hotels should promote advance deposits because not only does an advance deposit allow you to recognize revenue faster, it also greatly decreases the risk of cancellations.
* Because 3 and 8 parking spaces were rarely requested by customers, hotels can limit bookings to 1 or 2 parking spaces to save money.
* 15% of customers were not given the rooms they reserved. Make sure that guests get the rooms they have booked.
* More than a quarter of bookings (about 27.5%) were canceled. Hotels should implement a cancellation policy, offer discounts on confirmed bookings, and send booking reminders to guests to reduce cancellations.
* People typically book rooms for two people, so encourage family and group bookings. You can maximize revenue by promoting it with a discounted offer for group bookings.
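
Several of the percentages quoted above are means of boolean or 0/1 columns. A minimal sketch on toy values (hypothetical, so the results do not match the figures in the list), assuming the dataset's `reserved_room_type`, `assigned_room_type` and `is_repeated_guest` columns:

```python
import pandas as pd

# Toy data: hypothetical values, NOT taken from the hotel dataset
toy_df = pd.DataFrame({
    'reserved_room_type': ['A', 'A', 'D', 'E', 'A'],
    'assigned_room_type': ['A', 'D', 'D', 'E', 'A'],
    'is_repeated_guest':  [0, 0, 1, 0, 0],
})

# Share of bookings where the assigned room differs from the reserved one
room_mismatch_pct = (
    toy_df['reserved_room_type'] != toy_df['assigned_room_type']
).mean() * 100

# Share of bookings made by returning guests
repeat_guest_pct = toy_df['is_repeated_guest'].mean() * 100

print(f'Reserved room not assigned: {room_mismatch_pct:.2f}%')
print(f'Repeated guests: {repeat_guest_pct:.2f}%')
```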

# **Conclusion**

* The country with the highest number of bookings is PRT, and the agent with the highest number of bookings is agent 9.
* Customers favored city hotels over resort hotels: city hotels account for 61.07 percent of bookings.
* More than one in four reservations is canceled.
* The most popular food is BB.
* The Online (internet) platform is used to make the majority of bookings.
* The majority of the bookings are made using TA/TO, the leading distribution channel.
* The vast majority of hotel bookings are made by new guests. Very few customers (3.86%) returned.
* Room type A is the most reserved by customers.
* Customers do not wish to make bookings with a pre-deposit.
* Most customers (80%) preferred to book a hotel for a short visit.
* Only 10% of people require space to park their cars.
* Most visitors are couples.
* The inability to assign a reserved room to a customer is not grounds for cancellation.
* Booking cancellations are not caused by a longer Lead time.
* A city hotel is busier than a resort.
* The busiest months for hotels are October and September. There isn't a lengthy wait for reservations in July.
* Not assigning a reserved room does not affect ADR.
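
Conclusions such as the ones about lead time and cancelations can be sanity-checked by comparing the lead-time distribution of canceled and non-canceled bookings. This is a sketch on toy values (hypothetical, not from the dataset); note that such a comparison only describes association and cannot by itself establish or rule out a cause.

```python
import pandas as pd

# Toy data: hypothetical values, NOT taken from the hotel dataset
toy_df = pd.DataFrame({
    'is_canceled': [0, 0, 0, 1, 1, 1],
    'lead_time':   [5, 30, 100, 10, 40, 90],
})

# Mean and median lead time for non-canceled (0) and canceled (1) bookings
lead_by_status = toy_df.groupby('is_canceled')['lead_time'].agg(['mean', 'median'])

print(lead_by_status)
```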

## Challenges
* The data contained a large number of duplicates.
* Several columns were stored with an improper data type.
* It was challenging to select the best visualization techniques.
* The dataset contained a large number of null values.
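
The three data-quality challenges above (duplicates, data types, null values) can be handled with a few pandas calls. The following is a minimal sketch on toy values (hypothetical, not from the dataset); the fill values `0` and `'Others'` are assumptions chosen for illustration, not necessarily the ones used earlier in this notebook.

```python
import pandas as pd

# Toy data: hypothetical values, NOT taken from the hotel dataset
toy_df = pd.DataFrame({
    'children': [0.0, 1.0, None, 0.0],
    'country':  ['PRT', None, 'GBR', 'PRT'],
    'adr':      [75.0, 98.5, 110.0, 75.0],
})

# Challenge 1: duplicates (the first and last rows are identical)
n_duplicates = toy_df.duplicated().sum()
clean_df = toy_df.drop_duplicates().copy()

# Challenge 2: null values, counted per column and then filled
null_counts = toy_df.isnull().sum()
clean_df['children'] = clean_df['children'].fillna(0)        # assumed fill value
clean_df['country'] = clean_df['country'].fillna('Others')   # assumed fill value

# Challenge 3: data types ('children' is a count, so store it as an integer)
clean_df['children'] = clean_df['children'].astype(int)

print('Duplicate rows:', n_duplicates)
print(null_counts)
print(clean_df.dtypes)
```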

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***