# **Project Name**    - Hotel Booking Analysis



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual




# **Project Summary -**
By analyzing a large dataset of hotel bookings, the project seeks to uncover the key factors that influence booking behavior, such as seasonal trends, booking lead times, and popular amenities.
The dashboard will provide a real-time overview of booking trends and customer preferences, making it easier to identify areas for improvement and opportunities for growth within the hotel's operations. By leveraging predictive analytics, the project will forecast future booking demand and occupancy rates to guide resource allocation and marketing initiatives that maximize revenue and customer satisfaction.
  

# **GitHub Link**

https://github.com/AshwiniSuryakar09

# **Problem Statement**
 In the rapidly evolving hospitality industry, there is a pressing need to apply data analysis techniques to understand hotel booking patterns and customer preferences. Moreover, failing to address the factors that contribute to booking cancellations hurts overall profitability and hinders the ability to provide a seamless and personalized customer experience.

 To address these challenges, this project aims to develop a comprehensive data analysis framework that can extract meaningful insights from the available booking data.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import folium
import plotly.express as px
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive                    # Mounting drive
drive.mount('/content/drive')

In [None]:
filepath="/content/Hotel Bookings.csv"
hotel_df=pd.read_csv(filepath)

In [None]:
hotel_df

### Dataset First View

In [None]:
# Dataset First Look
hotel_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(hotel_df.shape)   # (number of rows, number of columns)
hotel_df.columns

In [None]:
hotel_df.describe()

### Dataset Information

In [None]:
# Dataset Info
hotel_df.info()

In [None]:
df = hotel_df.copy()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df[df.duplicated()].shape

In [None]:
df.drop_duplicates(inplace = True)

In [None]:
df.shape

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum().sort_values(ascending = False)[:6]


In [None]:
df[['company','agent']] = df[['company','agent']].fillna(0)
print(df[['company','agent']])

In [None]:
df['children'].unique()

In [None]:
# Fill missing 'children' values with the column mean
# (assigning back avoids calling fillna(inplace=True) on a selected column)
df['children'] = df['children'].fillna(df['children'].mean())

In [None]:
# Fill missing 'country' values with the placeholder 'others'
df['country'] = df['country'].fillna('others')

In [None]:
df.isnull().sum().sort_values(ascending = False)[:6]

In [None]:
df[df['adults']+df['babies']+df['children'] == 0].shape

In [None]:
df.drop(df[df['adults']+df['babies']+df['children'] == 0].index, inplace = True)

In [None]:
df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'], format = '%Y-%m-%d')

print(df['reservation_status_date'])

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns: add total nights stayed (weekend nights + week nights)
df['total_stay'] = df['stays_in_weekend_nights']+df['stays_in_week_nights']

print(df['total_stay'])


In [None]:
df['total_people'] = df['adults']+df['children']+df['babies']

print(df['total_people'])

In [None]:
# Dataset Describe
df.describe()

### Variables Description

hotel : Name of the hotel (Resort Hotel or City Hotel)

is_canceled : If the booking was canceled (1) or not (0)

lead_time : Number of days between the booking date and the arrival date

arrival_date_year : Year of arrival date

arrival_date_month : Month of arrival date

arrival_date_week_number : Week number of year for arrival date

arrival_date_day_of_month : Day of arrival date

stays_in_weekend_nights : Number of weekend nights (Saturday or Sunday) spent at the hotel by the guests.

stays_in_week_nights : Number of weeknights (Monday to Friday) spent at the hotel by the guests.

adults : Number of adults among guests

children : Number of children among guests

babies : Number of babies among guests

meal : Type of meal booked

country : Country of guests

market_segment : Designation of market segment

distribution_channel : Name of booking distribution channel

is_repeated_guest : If the booking was from a repeated guest (1) or not (0)

previous_cancellations : Number of previous bookings that were cancelled by the customer prior to the current booking

previous_bookings_not_canceled : Number of previous bookings not cancelled by the customer prior to the current booking

reserved_room_type : Code of room type reserved

assigned_room_type : Code of room type assigned

booking_changes : Number of changes/amendments made to the booking

deposit_type : Type of the deposit made by the guest

agent : ID of travel agent who made the booking

company : ID of the company that made the booking

days_in_waiting_list : Number of days the booking was in the waiting list

customer_type : Type of customer, assuming one of four categories

adr : Average Daily Rate, as defined by dividing the sum of all lodging transactions by the total number of staying nights

required_car_parking_spaces : Number of car parking spaces required by the customer

total_of_special_requests : Number of special requests made by the customer

reservation_status : Reservation status (Canceled, Check-Out or No-Show)

reservation_status_date : Date at which the last reservation status was updated
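
The `adr` definition above is a simple ratio, and a small worked example may make it concrete. The sketch below uses made-up nightly charges for a single hypothetical booking; the function name `average_daily_rate` is introduced here only for illustration and is not part of the dataset or the notebook.

```python
# Hypothetical booking: nightly lodging charges for a 4-night stay (made-up numbers)
lodging_transactions = [95.0, 95.0, 110.0, 120.0]
staying_nights = len(lodging_transactions)


def average_daily_rate(transactions, nights):
    """Return the sum of lodging transactions divided by the staying nights."""
    if nights == 0:
        # A booking with zero nights would make the ratio undefined
        return float("nan")
    return sum(transactions) / nights


adr = average_daily_rate(lodging_transactions, staying_nights)
print(adr)  # 105.0
```

Here the four charges add up to 420, so the average daily rate over 4 nights is 105.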

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df['is_canceled'].unique()


In [None]:

df['arrival_date_year'].unique()


In [None]:
hotel_df['arrival_date_month'].unique()

In [None]:
hotel_df['arrival_date_week_number'].unique()

In [None]:

df['meal'].unique()


In [None]:

df['market_segment'].unique()


In [None]:

df['distribution_channel'].unique()


In [None]:
hotel_df['adults'].unique()


In [None]:

df['children'].unique()

In [None]:
hotel_df['babies'].unique()

In [None]:
hotel_df['reserved_room_type'].unique()

In [None]:
hotel_df['assigned_room_type'].unique()

In [None]:
hotel_df['deposit_type'].unique()

In [None]:
hotel_df['agent'].unique()

In [None]:
for elem in hotel_df.columns:
  print('Number of unique values in',elem,'column is',hotel_df[elem].nunique())

In [None]:
hotel_df['lead_time'].unique()

In [None]:
hotel_df['customer_type'].unique()

In [None]:
hotel_df['reservation_status_date'].unique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
hotel_df.isnull().sum().sort_values(ascending = False)[:6]

In [None]:
# Replacing null values of company, agent and children columns with 0 and country column with 'others'

hotel_df[['company','agent','children']] = hotel_df[['company','agent','children']].fillna(0)
hotel_df[['country']] = hotel_df[['country']].fillna('others')

In [None]:
# Checking if all null values are removed
hotel_df.isnull().sum().sort_values(ascending = False)[:6]

In [None]:
# Checking number of duplicate values in the dataset
len(hotel_df[hotel_df.duplicated()])

In [None]:
# Dropping the duplicate rows from dataset
hotel_df = hotel_df.drop_duplicates()

In [None]:
# Checking the shape of dataset after dropping duplicate values
hotel_df.shape

In [None]:
# Checking the shape of dataset whose combining values of adults, babies and children columns is 0
hotel_df[hotel_df['adults']+hotel_df['babies']+hotel_df['children'] == 0].shape

In [None]:
# Changing datatype of column 'reservation_status_date' from object to datetime
hotel_df['reservation_status_date'] = pd.to_datetime(hotel_df['reservation_status_date'], format = '%Y-%m-%d')

## **Adding important columns as per requirement**

In [None]:
# Adding total staying days in hotels
hotel_df['total_stay'] = hotel_df['stays_in_weekend_nights']+hotel_df['stays_in_week_nights']

# Adding total people number as column, i.e. total types of person = num of adults + children + babies
hotel_df['total_people'] = hotel_df['adults']+hotel_df['babies']+hotel_df['children']

In [None]:
# Checking the final number of rows and columns
hotel_df.shape

### What all manipulations have you done and insights you found?

I performed the following manipulations and found the following insights:

Four columns contained null values: company, agent, country and children.

1. For company and agent, I filled the missing values with 0.

2. For the country column, I filled the missing values with the string 'others' (assuming that the country was not available when the data was collected, so the 'Others' option was selected).

3. As the children column had only 4 missing values, they were replaced with 0, i.e. no children.

- The dataset also contained duplicate rows, so the duplicates were dropped.

- In some rows the combined value of adults, babies and children was 0, which means the booking had no guests. I therefore dropped the rows where the sum of the adults, babies and children columns was 0.

- The 'reservation_status_date' column was of object type, so it was converted to a datetime type for easier use.

- Two new columns were added: 'total_people' and 'total_stay'.
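
The wrangling steps listed above can be gathered into a single function so that they always run in the same order. This is a minimal sketch, assuming the column names of this dataset; the function name `clean_bookings` and the sample rows are hypothetical and exist only to make the example self-contained.

```python
import numpy as np
import pandas as pd


def clean_bookings(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the wrangling steps described above to a copy of the data."""
    df = raw.copy()
    # Missing company/agent/children -> 0, missing country -> 'others'
    df[['company', 'agent', 'children']] = df[['company', 'agent', 'children']].fillna(0)
    df['country'] = df['country'].fillna('others')
    # Drop exact duplicate rows
    df = df.drop_duplicates()
    # Drop bookings that have no guests at all
    guests = df['adults'] + df['children'] + df['babies']
    df = df[guests > 0].copy()
    # Parse the status date and add the two derived columns
    df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'], format='%Y-%m-%d')
    df['total_stay'] = df['stays_in_weekend_nights'] + df['stays_in_week_nights']
    df['total_people'] = df['adults'] + df['children'] + df['babies']
    return df


# Tiny made-up sample: rows 0 and 1 are duplicates, row 2 has no guests
sample = pd.DataFrame({
    'adults': [2, 2, 0, 1],
    'children': [np.nan, np.nan, 0, 1],
    'babies': [0, 0, 0, 0],
    'company': [np.nan, np.nan, 5, np.nan],
    'agent': [9, 9, np.nan, 3],
    'country': ['PRT', 'PRT', None, 'GBR'],
    'stays_in_weekend_nights': [1, 1, 0, 2],
    'stays_in_week_nights': [2, 2, 1, 3],
    'reservation_status_date': ['2016-07-01', '2016-07-01', '2016-07-02', '2016-07-03'],
})
cleaned = clean_bookings(sample)
print(cleaned[['country', 'total_stay', 'total_people']])
```

Returning a new DataFrame instead of modifying the input in place keeps the raw data available for comparison, which helps when the notebook is re-run from the top.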

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize = (12,6))
sns.scatterplot(y = 'adr', x = 'total_stay', data = df)
plt.show()

From the scatter plot we can see that as total_stay increases, adr decreases. This means that for longer stays, a better deal can be finalised for the customer.

##### 1. Why did you pick the specific chart?

A scatter plot is suitable for exploring the relationship between two continuous variables, 'total_stay' and 'adr'. It shows whether there is a linear relationship, a curve, or whether the points are scattered randomly.

##### 2. What is/are the insight(s) found from the chart?

  This chart shows how the data points are distributed along both axes, which helps in understanding the spread and concentration of the data. It also lets us see whether there is any discernible pattern or correlation between the two continuous variables.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the chart offers some positive business impacts:

1. Understanding the relationship between the length of stay ('total_stay') and the average daily rate ('adr') helps to optimize the pricing strategy.

2. The plot informs revenue management decisions.

3. With these insights, businesses can tailor marketing campaigns and services to different segments, improving customer satisfaction and loyalty.


However, insights from the relationship between 'total_stay' and 'adr' could also lead to negative growth if they are misinterpreted or if certain patterns are not properly addressed. For example, the plot may reflect changing trends in customer booking behavior over time, or a lack of a customer-centric approach.

We notice that there is an outlier in adr, so we will remove it to get a clearer scatter plot.

In [None]:


df.drop(df[df['adr'] > 5000].index, inplace = True)
plt.figure(figsize = (12,6))
sns.scatterplot(y = 'adr', x = 'total_stay', data = df)
plt.show()
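
Before dropping rows with a fixed cutoff, it can help to look at the largest values and count how many rows the cutoff would remove. The sketch below does this on made-up ADR values; the `demo` frame is an assumption for illustration, and the threshold of 5000 is the one used in the cell above.

```python
import pandas as pd

# Made-up ADR values; 5400.0 plays the role of the extreme outlier
demo = pd.DataFrame({'adr': [80.0, 95.5, 120.0, 60.0, 5400.0, 110.0]})

# Look at the largest values before deciding on a cutoff
print(demo['adr'].nlargest(3))

# Count what the fixed threshold would remove, then filter without mutating in place
threshold = 5000
n_removed = int((demo['adr'] > threshold).sum())
trimmed = demo[demo['adr'] <= threshold]
print(n_removed, len(trimmed))
```

Filtering into a new variable leaves the original frame untouched, so the effect of the cutoff can be compared before and after.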



#### Chart - 2

Let's first find the correlation between the numerical columns.

Columns such as 'is_canceled', 'arrival_date_year', 'arrival_date_week_number', 'arrival_date_day_of_month', 'is_repeated_guest', 'company' and 'agent' are categorical data stored as numbers, so we don't need to check them for correlation.

Also, since we have added the total_stay and total_people columns, we can leave out the adults, children, babies, stays_in_weekend_nights and stays_in_week_nights columns.

##### 1. Why did you pick the specific chart?

In [None]:
# Chart -2
#Heatmap

num_df = df[['lead_time','previous_cancellations','previous_bookings_not_canceled','booking_changes','days_in_waiting_list','adr','required_car_parking_spaces','total_of_special_requests','total_stay','total_people']]


In [None]:
#correlation matrix
corrmat = num_df.corr()
f, ax = plt.subplots(figsize=(12, 7))
sns.heatmap(corrmat,annot = True,fmt='.2f', annot_kws={'size': 10},  vmax=.8, square=True);

The heatmap visualization of the correlation matrix is a popular choice for exploratory data analysis because it efficiently communicates the relationships between numerical variables, facilitating data-driven decision-making and further analysis.
Heatmaps make it easier to identify patterns and relationships in the data and present information in a compact and visually appealing format. This makes them suitable for presentations.

##### 2. What is/are the insight(s) found from the chart?

The heatmap lets us examine the strength and direction of the correlations between the numerical variables.
These insights provide a starting point for further analysis and decision-making, such as refining pricing strategies, improving customer segmentation, or enhancing service offerings based on observed correlations between different variables in the dataset.
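
Reading the strongest relationships off a heatmap by eye can be error-prone, so the same correlation matrix can also be ranked numerically. This is a minimal sketch on made-up numbers; the `demo` frame is an assumption and only borrows a few column names from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values for a handful of numerical columns
demo = pd.DataFrame({
    'lead_time': [10, 50, 120, 200, 30, 75],
    'adr': [120.0, 100.0, 90.0, 70.0, 110.0, 95.0],
    'total_stay': [2, 3, 5, 7, 2, 4],
    'total_people': [2, 2, 3, 2, 1, 2],
})
corr = demo.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank the pairs by absolute correlation strength, strongest first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Applied to `corrmat`, the same three lines would list every column pair once, which makes it easier to quote specific correlation values alongside the chart.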

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The insights gained from the correlation heatmap can potentially contribute to positive business impacts, but they also hold the possibility of leading to negative growth if not properly interpreted or addressed.

 Businesses can adjust rates for longer stays accordingly, potentially increasing revenue. Identifying the relationship between booking changes and previous bookings not canceled could lead to strategies aimed at enhancing customer engagement and loyalty, potentially resulting in repeat business and positive word-of-mouth.

 If the correlations are misread to mean that longer lead times always warrant lower rates, customers booking last-minute might feel unfairly charged, potentially leading to dissatisfaction and loss of business.
 While the correlation between booking changes and previous bookings not canceled may suggest customer loyalty, it could also signal operational inefficiencies if the changes are due to errors or inadequate booking management systems, potentially leading to negative reviews and reduced bookings.

#### Chart - 3

In [None]:
# visualization code

# Visualizing by pie chart
hotel_df['hotel'].value_counts().plot.pie(explode=[0.05, 0.05], autopct ='%1.1f%%', shadow = True, figsize =(10,9), fontsize = 20)

# Set labels
plt.title('Pie Chart for Most Preferred Hotel', fontsize = 20)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart expresses a part-to-whole relationship in the data. The area of each colored slice makes percentage comparisons easy to explain, so pie charts are used frequently wherever shares of a whole are compared.
 So, I used a pie chart, which helped to show the percentage comparison clearly.

##### 2. What is/are the insight(s) found from the chart?

From the above chart, we see that City Hotel is the hotel most preferred by guests and therefore has the most bookings. 61.1% of guests preferred City Hotel, while only 38.9% showed interest in Resort Hotel.
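
The percentages shown on a pie chart can also be computed directly, which makes them easy to quote and to re-check after any change to the cleaning steps. The sketch below uses a small made-up series rather than the real `hotel` column.

```python
import pandas as pd

# Made-up sample: 3 City Hotel bookings and 2 Resort Hotel bookings
demo = pd.Series(['City Hotel'] * 3 + ['Resort Hotel'] * 2, name='hotel')

# normalize=True returns fractions; convert to percentages with one decimal
share = demo.value_counts(normalize=True).mul(100).round(1)
print(share)
```

On the real data, the same expression applied to `hotel_df['hotel']` would return the shares that the pie chart annotates.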

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, for both types of hotel, this chart can have a positive business impact.

City Hotel is doing well, so it can offer more services to attract more guests and increase revenue. Guests have shown less interest in Resort Hotel than in City Hotel, so Resort Hotel needs to find ways to attract guests and study what City Hotel has done to attract them. There is scope for substantial growth at Resort Hotel if it upgrades its services, learns from City Hotel's successful strategies and adds new ideas of its own.

#### Chart - 4 :Hotel type with highest adr (Bivariate with Categorical - Numerical)

In [None]:
# Chart - 4 visualization code

# Group by Hotel
group_by_hotel = hotel_df.groupby('hotel')

# Grouping by Hotel adr
highest_adr = group_by_hotel['adr'].mean().reset_index()

# Set plot size
plt.figure(figsize = (10,8))

# Draw the bar plot and keep the Axes object
ax = sns.barplot(x= highest_adr['hotel'], y= highest_adr['adr'])

# Set labels
ax.set_xlabel("Hotel type", fontsize = 20)
ax.set_ylabel("ADR", fontsize = 20)
ax.tick_params(axis = 'x', labelsize = 16)   # tick labels come from the 'hotel' column
ax.set_title('Average ADR of each Hotel type', fontsize = 20)

# To show (plt.show() does not take the Axes as an argument)
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different levels of a categorical or nominal variable. Sometimes, bar charts show other statistics such as percentages or means.

To show the average ADR of each hotel type clearly, I used a bar chart here.

##### 2. What is/are the insight(s) found from the chart?

City Hotel has the highest ADR, so on average it earns more per occupied room night than Resort Hotel. A higher ADR supports higher revenue, although total revenue also depends on how many nights are actually sold.
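
Whether a higher ADR goes together with more revenue can be checked with a rough estimate. The sketch below multiplies `adr` by `total_stay` for bookings that were not canceled; the `est_revenue` column and the sample rows are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

# Made-up bookings: City Hotel has the higher ADR but fewer nights sold
demo = pd.DataFrame({
    'hotel': ['City Hotel', 'City Hotel', 'Resort Hotel', 'Resort Hotel'],
    'is_canceled': [0, 1, 0, 0],
    'adr': [120.0, 150.0, 90.0, 100.0],
    'total_stay': [2, 3, 5, 4],
})

# Canceled bookings bring in no lodging revenue, so keep only completed stays
stayed = demo[demo['is_canceled'] == 0].copy()
stayed['est_revenue'] = stayed['adr'] * stayed['total_stay']

summary = stayed.groupby('hotel').agg(
    mean_adr=('adr', 'mean'),
    est_revenue=('est_revenue', 'sum'),
)
print(summary)
```

In this made-up sample the hotel with the higher mean ADR ends up with the lower estimated revenue, which is why it is worth computing both figures on the real data before drawing conclusions.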

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

So, City Hotel can do more advertising to attract more customers, which will ultimately add to its revenue. City Hotel already enjoys a high ADR, and a little more effort towards growth will add to its overall revenue.

#### Chart - 5 : Relationship between ADR and Total Stay (Bivariate with Numerical-Numerical)

In [None]:
# Chart - 5 visualization code
# Count bookings for each (total_stay, adr, hotel) combination.
# size() keeps the count column, which the earlier iloc[:, :3] slice dropped
# before it could be renamed.
adr_vs_total_stay = (
    hotel_df.groupby(['total_stay', 'adr', 'hotel'])
    .size()
    .reset_index(name = 'number_of_stays')
)
# Keep the first 18,000 grouped rows (sorted by total_stay, so the shorter stays)
adr_vs_total_stay = adr_vs_total_stay[:18000]
adr_vs_total_stay


In [None]:

# Plotting the graph in line chart
# Set plot size
plt.figure(figsize=(12,6))

# Create the figure object
sns.lineplot(x= 'total_stay', y= 'adr', data= adr_vs_total_stay)

# Set labels
plt.xlabel('Total Stay', fontsize = 16)
plt.ylabel('ADR', fontsize = 16)
plt.title('Relationship between ADR and Total Stay', fontsize = 20)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

This is a line chart, which helps to show small shifts that may be hard to spot in other graphs and to show trends over different periods. Line charts are easy to understand, so here we can easily track the ups and downs of the graph.

##### 2. What is/are the insight(s) found from the chart?

From this line chart, we found that ADR tends to rise as the total stay increases, so the two appear to be positively related for the stays plotted here (the first 18,000 grouped rows, which are the shorter stays).
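
A quick numeric check can complement the line chart. The sketch below computes the mean ADR for each stay length and the Pearson correlation between the two columns, using made-up values; the `demo` frame is an assumption and says nothing about what the real data shows.

```python
import pandas as pd

# Made-up bookings: two per stay length
demo = pd.DataFrame({
    'total_stay': [1, 1, 2, 2, 3, 3, 7, 7],
    'adr': [100.0, 120.0, 105.0, 115.0, 98.0, 110.0, 80.0, 90.0],
})

# Mean ADR for each stay length (what a line plot of aggregated data draws)
mean_adr_by_stay = demo.groupby('total_stay')['adr'].mean()
print(mean_adr_by_stay)

# Pearson correlation summarises the direction of the relationship in one number
rho = demo['total_stay'].corr(demo['adr'])
print(round(rho, 2))
```

Running the same two expressions on the full dataset, rather than on a truncated slice, would show whether the direction seen in the line chart agrees with the scatter plot in Chart 1.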

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels should focus on increasing their ADR. More advertising, better facilities and attractive offers can encourage guests to stay longer; since ADR rises with total stay in this chart, longer stays should raise ADR and ultimately revenue.

#### Chart - 6 :Percentage of repeated guests (Univariate)

In [None]:
# Chart - 6 visualization code
# Visualizing by pie chart
hotel_df['is_repeated_guest'].value_counts().plot.pie(explode=[0.05, 0.05], autopct ='%1.1f%%', shadow = True, figsize =(10,9), fontsize = 20)

# Set labels
plt.title('Percentage (%) of repeated guests', fontsize = 20)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

Pie charts represent parts of a whole in a single chart, with each slice showing the percentage of a particular category. So, I used a pie chart to show the percentage of repeated and non-repeated guests (where 0 is not a repeated guest and 1 is a repeated guest) as differently colored slices of a circle.

##### 2. What is/are the insight(s) found from the chart?

Repeated guests are very few: only 3.9%, while 96.1% of guests are not returning to the same hotel. Both types of hotel should think carefully about this and take proper steps to increase the number of repeated guests. To retain guests, management should collect feedback from them and try to improve the services.
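
Because `is_repeated_guest` is coded as 0/1, its mean is the share of repeated guests, and grouping by hotel shows whether the two hotel types differ. This is a small sketch on made-up rows; the numbers are illustrative only.

```python
import pandas as pd

# Made-up sample: 1 of 4 City Hotel guests and 2 of 4 Resort Hotel guests are repeats
demo = pd.DataFrame({
    'hotel': ['City Hotel'] * 4 + ['Resort Hotel'] * 4,
    'is_repeated_guest': [0, 0, 0, 1, 0, 0, 1, 1],
})

# Mean of a 0/1 column is a proportion; multiply by 100 for a percentage
repeat_rate = demo.groupby('hotel')['is_repeated_guest'].mean().mul(100)
print(repeat_rate)
```

A per-hotel breakdown like this could show which hotel type has more room to improve guest retention.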

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the proportion of repeated guests is very low, so if the hotels work on this, an increase in repeated guests will ultimately boost their revenue. Hotels can give attractive offers to first-time customers during off seasons to enhance revenue. The right steps include collecting feedback, solving customers' problems promptly and offering them the best deals.

#### Chart - 7 :Percentage distribution of required car parking spaces (Univariate)

In [None]:
# Chart - 7 visualization code
# Visualizing by pie chart
parking_counts = hotel_df['required_car_parking_spaces'].value_counts()
# One explode offset per slice, so the call still works if the number of categories changes
parking_counts.plot.pie(explode=[0.05]*len(parking_counts), autopct ='%1.1f%%', shadow = False, figsize =(12,8), fontsize = 20, labels = None)

# Legend labels: the distinct numbers of requested parking spaces
labels = parking_counts.index

# Set labels
plt.title('% Distribution of\nrequired car parking spaces', fontsize = 20)
plt.legend(bbox_to_anchor = (0.85, 1), loc = 'upper left', labels = labels)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

I used a pie chart here because it presents the output in an easily understandable way: the differently colored slices clearly reflect the demand for car parking spaces by guests. Other charts could also be used, but I found the pie chart more relevant here.

##### 2. What is/are the insight(s) found from the chart?

This chart shows that 91.6% of guests did not require a parking space. Only 8.3% of guests required one.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained here help the hotels to provide better services. Hotels need to invest less in car parking, as only 8.3% of guests required a single parking space. So, it's better to focus on other areas to improve the quality of the hotel rather than mainly on the car parking area. The low demand for parking might be because many guests prefer to use public transport.

#### Chart - 8 : Meal type Distribution (Univariate)

In [None]:
# Chart - 8 visualization code
# Set plot size
plt.figure(figsize=(10,6))

# Create the figure object
sns.countplot(x = hotel_df['meal'])

# Set labels
plt.xlabel('Meal Type', fontsize = 16)
plt.ylabel('Count', fontsize = 16)
plt.title('Preferred Meal Type', fontsize = 20)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

I used a count plot here because it shows the number of observations in each categorical bin using bars. Bar plots look similar to count plots, but instead of the count of observations in each category, they show the mean of a quantitative variable among the observations in each category. So, to get clear insights about the counts of the different meal types, I used a count plot.

##### 2. What is/are the insight(s) found from the chart?

The above graph shows that the meal type most preferred by guests is BB (Bed and Breakfast), while HB (Half Board) and SC (Self Catering) are about equally preferred. The meal types in hotels are as follows:

BB - (Bed and Breakfast)

HB - (Half Board)

FB - (Full Board)

SC - (Self Catering)
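
The meal codes listed above can be mapped to their full names before counting, so that tables and chart axes are readable without a legend. This is a minimal sketch; the `demo` series is made up, and the code 'XX' is a hypothetical unmapped value included only to show that unknown codes are kept as they are.

```python
import pandas as pd

# Mapping taken from the list of meal types above
meal_names = {
    'BB': 'Bed and Breakfast',
    'HB': 'Half Board',
    'FB': 'Full Board',
    'SC': 'Self Catering',
}

# Made-up sample of meal codes; 'XX' is a hypothetical code with no mapping
demo = pd.Series(['BB', 'BB', 'HB', 'SC', 'BB', 'FB', 'XX'], name='meal')

# map() returns NaN for unknown codes, so fall back to the original code
labelled = demo.map(meal_names).fillna(demo)
counts = labelled.value_counts()
print(counts)
```

Falling back to the original code avoids silently losing rows whose meal type is not in the mapping.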

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

So, the insights here also have a positive impact: hotels should focus more on the BB meal type so that the majority of customers are satisfied, while still giving the other meal types proper attention and managing food services well so as to offer the best service to customers.

#### Chart - 9  :  Bookings by Month and Optimal Stay Length in Hotels

In [None]:
# Using groupby on arrival_date_month and taking the hotel count
bookings_by_months_df = hotel_df.groupby(['arrival_date_month'])['hotel'].count().reset_index().rename(columns = {'hotel':'Counts'})

# Creating list of months in order
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# Creating dataframe which will map the order of above months list without changing its values
bookings_by_months_df['arrival_date_month'] = pd.Categorical(bookings_by_months_df['arrival_date_month'], categories = months, ordered = True)

# Sorting by arrival_date_month
bookings_by_months_df = bookings_by_months_df.sort_values('arrival_date_month')

bookings_by_months_df

In [None]:
# Visualizing with the help of line plot

# Set plot size
plt.figure(figsize = (14,6))

# Plotting lineplot on x- months & y- bookings counts
sns.lineplot(x = bookings_by_months_df['arrival_date_month'], y = bookings_by_months_df['Counts'])

# Set title
plt.title('Number of bookings across each month', fontsize = 20)

# Set labels
plt.xlabel('Month', fontsize = 16)
plt.ylabel('Number of bookings', fontsize = 16)

# To show
plt.show()

In [None]:
# Visualizing with the help of bar plot

# Using groupby function on total stay and hotel
stay = hotel_df.groupby(['total_stay', 'hotel']).agg('count').reset_index()

# Taking only first three columns
stay = stay.iloc[:, :3]

# Renaming the columns
stay = stay.rename(columns = {'is_canceled':'Number of stays'})

In [None]:
# Set plot size
plt.figure(figsize = (16,8))

# Plotting barchart
sns.barplot(x = 'total_stay', y = 'Number of stays', hue = 'hotel', data = stay)

# Set labels
plt.title('Optimal Stay Length in Both Hotel types', fontsize = 20)
plt.ylabel('Count of Stays', fontsize = 16)
plt.xlabel('Total Stays(Days)', fontsize = 16)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

For the 1st chart, I picked a line chart because it helps to show small shifts that may be hard to spot in other graphs and to show trends over different periods. Line charts are easy to understand, so here we can easily track how the number of bookings changes from month to month.

For the 2nd chart, a bar plot was used. It gives a clear view of the relation between the total stay in days and the count of stays (the total number of bookings with that stay length).

##### 2. What is/are the insight(s) found from the chart?

From the 1st chart, I found that July and August had the most bookings. As July and August fall in and around the summer vacation, the summer vacation may be the reason for these bookings.

The 2nd chart gives us different insights. It shows that the optimal stay in both types of hotel is less than 7 days; beyond that, the number of stays declines drastically.
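
The observation about stay length can be backed by a number: the share of bookings that last at most seven nights. The sketch below computes it on a made-up series of stay lengths; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up stay lengths in nights
demo = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 7, 10], name='total_stay')

# Share of bookings per stay length, ordered by length, and the running total
share_by_length = demo.value_counts(normalize=True).sort_index()
cumulative = share_by_length.cumsum()
print(cumulative)

# Share of bookings lasting at most one week
share_up_to_week = (demo <= 7).mean()
print(share_up_to_week)
```

On the real data, the same expressions applied to `hotel_df['total_stay']`, optionally grouped by hotel, would quantify how quickly bookings drop off after a week.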

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The 1st chart shows that hotels should be well prepared for July and August, as the most bookings take place in these months. Better preparation and a good approach will add to the growth of the hotels.

The 2nd chart also has a positive impact. From the insights gathered here, hotels can work on increasing the length of stay of customers to increase their revenue. The other takeaway is that customers usually prefer stays of up to one week, so hotels need to work efficiently during these seven days so that customers return to the same hotel, which will increase revenue.

#### Chart - 10 : Plotting Histogram

In [None]:
# Chart - 10 visualization code
# Set the plot size
hotel_df.hist(figsize = (23,18))

# To show
plt.show()


##### 1. Why did you pick the specific chart?

I used histograms here to understand the data clearly. The histogram is a popular graphing tool used to summarize discrete or continuous data measured on an interval scale. It illustrates the major features of the distribution of the data in a convenient form, is useful when dealing with large datasets (more than 100 observations), and can help detect unusual observations (outliers) or gaps in the data. Thus, I used histograms to analyze the distribution of each variable over the whole dataset and to see whether it is symmetric.

##### 2. What is/are the insight(s) found from the chart?

Some insights found from the chart are as follows:-

The largest number of guests arrived in 2016.

The most common arrival week number is around 30.

Arrivals are highest towards the end of the month.

Most guests come with no children.

Very few bookings require car parking spaces.
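
The readings above come from eyeballing histogram bars, so it can help to cross-check them numerically. The following is a minimal sketch under stated assumptions: `most_common_share` is a hypothetical helper, and `toy_df` is made-up data that only stands in for `hotel_df`.

```python
import pandas as pd


def most_common_share(df: pd.DataFrame, column: str) -> tuple:
    """Return the most frequent value in `column` and its share of all rows."""
    # value_counts sorts by frequency, so the first entry is the most common value
    shares = df[column].value_counts(normalize=True)
    return shares.index[0], float(shares.iloc[0])


# Toy stand-in for hotel_df (values are invented for illustration only)
toy_df = pd.DataFrame({
    'arrival_date_year': [2015, 2016, 2016, 2016, 2017],
    'children': [0, 0, 0, 1, 0],
})

print(most_common_share(toy_df, 'arrival_date_year'))
print(most_common_share(toy_df, 'children'))
```

On the real data the same helper could be called as `most_common_share(hotel_df, 'children')` to check how dominant the "no children" group actually is.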

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


A histogram alone cannot establish business impact. It is drawn only to see how each column's values are distributed across the dataset.

#### Chart - 11 :  Year and Hotel wise confirmed bookings and cancellation distribution

In [None]:
# Chart - 11 visualization code
# Finding out the percentage and counts of confirmed and canceled bookings
# Plotting a Count Plot chart using seaborn for counts of confirmed and canceled bookings

# Set plot size
plt.figure(figsize = (10,6))

# Create the figure object
sns.countplot(x = 'hotel', hue = 'is_canceled', palette = 'Set2', data = hotel_df)

# Set legends
plt.legend(['Confirmed', 'Canceled'])

# Set labels
plt.title('Hotel wise confirmation and cancelation of the bookings', fontsize = 20)
plt.ylabel('Count of\nconfirmation and cancelation', fontsize = 16)
plt.xlabel('Hotel Type', fontsize = 16)

# To show
plt.show()

In [None]:
# Plotting a Pie chart using matplotlib for percentage of confirmed and canceled bookings of Resort Hotel
resort_hotel = hotel_df.loc[(hotel_df['hotel'] == 'Resort Hotel')]
resort_hotel_checking_cancel = resort_hotel['is_canceled'].value_counts()

# Set labels
mylabels = ['Confirmed', 'Canceled']

# Set the explode offsets (pull the first wedge out of the pie)
myexplode = [0.2, 0]

# Create the figure object
resort_hotel_cancelation = plt.pie(resort_hotel_checking_cancel, labels = mylabels, explode = myexplode, autopct = '%1.1f%%')

# Set title
plt.title('Resort Hotel\nConfirmed and Cancelation')

resort_hotel_checking_cancel

In [None]:
# Removing the canceled bookings from the data and creating a new dataframe
data_not_canceled = hotel_df[hotel_df['is_canceled'] == 0]

# Year wise Bookings of hotels
# Set style
sns.set_style(style = 'darkgrid')

# Set plot size
plt.figure(figsize = (12,6))

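# Create the figure object (hue_order fixes the bar order so it matches the legend labels below)
sns.countplot(x= 'arrival_date_year', hue= 'hotel', hue_order = ['Resort Hotel', 'City Hotel'], palette = 'tab10', data = data_not_canceled)

# Set legends
plt.legend(['Resort Hotel', 'City Hotel'])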

# Set labels
plt.title('Year wise bookings of hotels', fontsize = 20)
plt.ylabel('Number of bookings', fontsize = 16)
plt.xlabel('Year', fontsize = 16)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

I have picked the count plot and pie plot to get proper insights into hotel-wise cancelation and confirmation of bookings.

##### 2. What is/are the insight(s) found from the chart?

The graphs above show that the City Hotel has a greater number of bookings than the Resort Hotel, but the City Hotel's cancelation percentage is also higher.

The year-wise graph shows that both hotels saw a massive increase in bookings in 2016, which is by far the year with the highest bookings for both. In 2016 and 2017 the City Hotel has the higher number of bookings, whereas in 2015 the Resort Hotel has the higher number.
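
The count plot shows absolute counts, so the cancelation percentage per hotel is easier to compare when computed directly. The sketch below assumes `is_canceled` is coded 0/1 as in the notebook; `cancellation_rate_by` is a hypothetical helper and `toy_df` is invented data standing in for `hotel_df`.

```python
import pandas as pd


def cancellation_rate_by(df: pd.DataFrame, key: str) -> pd.Series:
    """Percentage of canceled bookings (is_canceled == 1) within each group of `key`."""
    # The mean of a 0/1 column is the fraction of ones
    return (df.groupby(key)['is_canceled'].mean() * 100).round(1)


# Toy stand-in for hotel_df (values are invented for illustration only)
toy_df = pd.DataFrame({
    'hotel': ['City Hotel'] * 4 + ['Resort Hotel'] * 4,
    'is_canceled': [1, 1, 0, 0, 1, 0, 0, 0],
})

rates = cancellation_rate_by(toy_df, 'hotel')
print(rates)
```

Calling `cancellation_rate_by(hotel_df, 'hotel')` would give the per-hotel percentages, and `cancellation_rate_by(hotel_df, 'arrival_date_year')` the per-year ones.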

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Overall the graphs show a positive outcome, but the cancelation graph is a cause for deep concern: more than one quarter of all bookings were canceled. The business should investigate the reasons for cancelations and address them as soon as possible, before the problem grows.

#### Chart - 12 : ADR across different months

In [None]:
# Chart - 12 visualization code
# Using groupby function
bookings_by_months_df = hotel_df.groupby(['arrival_date_month', 'hotel'])['adr'].mean().reset_index()

# Create month list
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# It will take the order of the month list in the dataframe along with values
bookings_by_months_df['arrival_date_month'] = pd.Categorical(bookings_by_months_df['arrival_date_month'], categories = months, ordered = True)

# Sorting values
bookings_by_months_df = bookings_by_months_df.sort_values('arrival_date_month')

bookings_by_months_df

In [None]:
# Visualizing with the help of line plot

# Set plot size
plt.figure(figsize = (14,6))

# Create the figure object and plotting the line
sns.lineplot(x = bookings_by_months_df['arrival_date_month'], y = bookings_by_months_df['adr'], hue = bookings_by_months_df['hotel'])

# Set labels
plt.title('ADR across Each Month', fontsize = 20)
plt.xlabel('Month', fontsize = 16)
plt.ylabel('ADR', fontsize = 16)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

I have picked the line chart to get clear insights into the ADR of the City and Resort hotels across each month. A line chart is useful because it shows small shifts that may be hard to spot in other graphs, shows trends over different periods, and is easy to understand. To compare data, more than one line can be plotted on the same axes.

##### 2. What is/are the insight(s) found from the chart?

The insights found from the chart are as follows:-

For the Resort Hotel, ADR in June, July and August is high compared with the City Hotel. A likely reason is that people want to spend their summer vacation in resort hotels.

The best time for guests to visit Resort or City Hotels is January, February, March, April, October, November and December, as the average daily rate in these months is very low, which makes a stay more affordable.
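
The list of low-rate months above is read off the line chart. A small sketch of how the cheapest months could be ranked from the data is shown below; `cheapest_months` is a hypothetical helper and the ADR values in `toy_df` are invented.

```python
import pandas as pd


def cheapest_months(df: pd.DataFrame, n: int = 3) -> list:
    """Return the n months with the lowest mean ADR, cheapest first."""
    monthly_adr = df.groupby('arrival_date_month')['adr'].mean()
    return monthly_adr.nsmallest(n).index.tolist()


# Toy stand-in for hotel_df (values are invented for illustration only)
toy_df = pd.DataFrame({
    'arrival_date_month': ['January', 'January', 'July', 'July', 'November', 'November'],
    'adr': [50.0, 60.0, 150.0, 170.0, 70.0, 80.0],
})

print(cheapest_months(toy_df, n=2))
```

To rank the months separately for each hotel type, the same function could be applied to `hotel_df[hotel_df['hotel'] == 'Resort Hotel']` and to the City Hotel subset.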

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The higher the ADR, the higher the revenue, so a high ADR is a good sign. Hotels should work to raise their ADR outside the summer peak by offering attractive schemes that draw customers during winter vacations and other holidays.

#### Chart - 13 : Weekly stay distribution and Calculation of Cancelation and non-cancelation

In [None]:
# Chart - 13 visualization code
# As i have already created a column 'total_stay' above i.e.
# Adding total staying days in hotels
hotel_df['total_stay'] = hotel_df['stays_in_weekend_nights'] + hotel_df['stays_in_week_nights']

# Set the plot size
plt.figure(figsize=(14,7))

# Using a violin plot to know in which weeks, visitors stays the most
sns.violinplot(x = 'arrival_date_week_number', y = 'total_stay', palette = 'Set2', data = hotel_df)

# Set labels
plt.title('Week wise number of stays', fontsize = 20)
plt.ylabel('Number of stays', fontsize = 16)
plt.xlabel('Week number', fontsize = 16)

# To show
plt.show()

In [None]:
# Visualizing with the help of pie plot
hotel_df['is_canceled'].value_counts().plot.pie(explode = [0.05,0.05], autopct = '%1.1f%%', shadow = True, figsize = (10,8), fontsize = 20)

# Set title
plt.title('Cancelation and Non-Cancelation', fontsize = 20)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

I have used the violin plot here to examine the relation between week number and length of stay. Violin plots are used to observe the distribution of numeric data and are especially useful for comparing distributions between multiple groups. The peaks, valleys and tails of each group's density curve can be compared to see where groups are similar or different.

I have picked the pie plot because it gives a precise and clear view of the split between canceled and non-canceled bookings: 27.5% of bookings were canceled. Here, 0 denotes not canceled and 1 denotes canceled. A pie plot represents data visually as fractional parts of a whole, which makes it an effective communication tool even for an uninformed audience, who can see the comparison at a glance and understand the information quickly.

##### 2. What is/are the insight(s) found from the chart?

The violin plot shows that weeks 28 to 31 have the longest stays, weeks 1 to 11 show a very steady trend in the number of stays, and weeks 18 to 22 have the shortest stays, aggregated over all three years (2015, 2016 and 2017).

The pie plot shows that more than one quarter of all bookings, approximately 27.5%, were canceled.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Based on these outcomes, the client can plan better services for guests so that revenue can grow.

More than 27% of bookings were canceled, which is a matter of deep concern. The business should investigate the reasons for cancelations and address them as soon as possible, before the problem grows.

#### Chart - 14 : Room type preference and Customer types

In [None]:
# Chart - 14 visualization code


# Set the plot size
plt.figure(figsize = (14,6))

# Create the figure object
sns.countplot(x = hotel_df['assigned_room_type'], order = hotel_df['assigned_room_type'].value_counts().index)

# Set labels
plt.xlabel('Room Type', fontsize = 16)
plt.ylabel('Count of Room type', fontsize = 16)
plt.title('Most preferred Room Type', fontsize = 20)

# To show
plt.show()

In [None]:
# Using seaborn to plot a count plot showing which types of customer visit the most
# Set the plot size
plt.figure(figsize = (12,6))

# Create the figure object
sns.countplot(x = 'arrival_date_month', hue = 'customer_type', palette = 'Set2', data = hotel_df)

# Set labels
plt.xlabel('Months', fontsize = 16)
plt.ylabel('Number of customers', fontsize = 16)
plt.title('Types of customer arrived month wise', fontsize = 20)

# To show
plt.show()

##### 1. Why did you pick the specific chart?

For the first visualization, I have picked a bar chart to show the distribution by volume (count of rooms) of the room types allotted. A bar graph summarises a large set of data in a simple visual form, displays the frequency of each category, and clarifies trends better than a table.

The second visualization is a count plot because it gives clear insights into the number of guests of each customer type arriving in each month.

##### 2. What is/are the insight(s) found from the chart?

The first chart shows that the most preferred room type is 'A': the majority of guests have shown interest in this room type.

The second graph shows that Transient customers visit the most, whereas Group customers visit the least.

##### 3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the graph indicates positive impacts: room types 'A', 'D' and 'E' are preferred by guests, possibly because of better services offered in these room types. Every booking benefits the hotel regardless of room type, but hotels should also look into the factors behind the low preference for some room types. If those room types also gain popularity, hotels will see more bookings and, ultimately, more revenue.

A better understanding of the different types of guests will also help hotels take the right steps on services, facilities, requirements and offers, which will directly result in business growth.

#### Chart - 15 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
numeric_hotel_df = hotel_df.select_dtypes(include=[np.number])

# Create the heatmap
plt.figure(figsize=(18, 10))
sns.heatmap(numeric_hotel_df.corr(), annot=True, fmt=".2f")
plt.title("Co-relation of the columns", fontsize=20)
plt.show()


##### 1. Why did you pick the specific chart?

A correlation matrix is a table of correlation coefficients between variables. Each cell shows the correlation between two variables. A correlation matrix is used to summarize data, as an input to more advanced analyses, and as a diagnostic for them. Correlation values lie in the range [-1, 1].

Thus, to see the correlation between all the variables along with the correlation coefficients, I have used a correlation heatmap.

##### 2. What is/are the insight(s) found from the chart?
The insights found from the above chart are as follows:-

is_canceled and total_stay are negatively correlated. Read directly, this means that bookings with longer stays are somewhat less likely to be canceled; the heatmap by itself says nothing about whether being assigned a different room than the one reserved affects cancelations.

lead_time and total_stay are positively correlated. This means that the longer the stay, the longer the lead time tends to be.

adults, children and babies are correlated with each other. This suggests that the more people in a booking, the higher the ADR tends to be.

is_repeated_guest and previous_bookings_not_canceled have a strong correlation. This may be because repeated guests are less inclined to cancel their bookings.

So, these are some powerful insights found from the chart of correlation heatmap.
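
Reading pairs off a large annotated heatmap is error-prone, so the strongest relationships can also be listed programmatically. The sketch below is one way to do that under stated assumptions: `strongest_correlations` is a hypothetical helper, and `toy_df` contains invented numbers rather than values from `hotel_df`.

```python
import numpy as np
import pandas as pd


def strongest_correlations(df: pd.DataFrame, top_n: int = 5) -> pd.Series:
    """Return the top_n column pairs ranked by absolute Pearson correlation."""
    corr = df.select_dtypes(include=[np.number]).corr()
    cols = corr.columns
    # Indices of the upper triangle (k=1 skips the diagonal), so each pair appears once
    row_idx, col_idx = np.triu_indices(len(cols), k=1)
    pairs = pd.Series(
        corr.to_numpy()[row_idx, col_idx],
        index=pd.MultiIndex.from_arrays([cols[row_idx], cols[col_idx]]),
    ).dropna()
    # Sort by magnitude but keep the sign in the returned values
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top_n)


# Toy stand-in for hotel_df (values are invented for illustration only)
toy_df = pd.DataFrame({
    'lead_time': [10, 20, 30, 40, 50],
    'total_stay': [1, 2, 3, 4, 5],
    'adr': [100, 90, 95, 80, 85],
    'hotel': ['City Hotel', 'Resort Hotel', 'City Hotel', 'Resort Hotel', 'City Hotel'],
})

print(strongest_correlations(toy_df, top_n=3))
```

Running `strongest_correlations(hotel_df, top_n=10)` would show the sign and size of each coefficient, which makes it easier to confirm statements such as the ones listed above.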

#### Chart - 16 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(hotel_df, hue = 'is_repeated_guest')

# To show
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot is used to find the best set of features to explain a relationship between two variables or to form the most separated clusters. It also helps in forming simple classification models by drawing lines that linearly separate the data.

Thus, I have used the pair plot to analyse patterns in the data and relationships between the features. It is similar to the correlation heatmap, but it shows the relationships graphically as scatter plots rather than as coefficients.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the relationship of 'is_repeated_guest' with the other columns. In general, a pair plot reflects the relationship of each column with every other column.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1
### **Chart - 3 : Plotting Pie chart**

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

The null and alternative hypotheses for testing the distribution of required car parking spaces in the hotel_df dataset (the required_car_parking_spaces column, which holds numerical values) are:

Null Hypothesis (H₀): The distribution of required car parking spaces follows a uniform distribution.

Alternative Hypothesis (Hₐ): The distribution of required car parking spaces does not follow a uniform distribution.

This null hypothesis suggests that all numbers of required parking spaces (0, 1, 2, etc.) are equally likely for guests, while the alternative hypothesis states that there's a preference for certain numbers of parking spaces, leading to a non-uniform distribution.
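
A chi-square goodness-of-fit test is one standard way to test a uniformity hypothesis like the one above directly. The following is a minimal sketch, not the notebook's own method: `uniformity_test` is a hypothetical helper, the counts in `toy_parking` are invented, and the test only considers the categories that actually occur in the data.

```python
import pandas as pd
from scipy import stats


def uniformity_test(values: pd.Series):
    """Chi-square goodness-of-fit test of H0: every observed category is equally likely."""
    observed = values.value_counts().sort_index()
    # With f_exp omitted, scipy.stats.chisquare assumes equal expected frequencies
    statistic, p_value = stats.chisquare(observed)
    return float(statistic), float(p_value)


# Toy stand-in for hotel_df['required_car_parking_spaces'] (invented counts)
toy_parking = pd.Series([0] * 90 + [1] * 9 + [2] * 1)

statistic, p_value = uniformity_test(toy_parking)
alpha = 0.05
if p_value > alpha:
    print("Fail to reject H0: the counts are compatible with a uniform distribution.")
else:
    print("Reject H0: the counts are not uniformly distributed.")
```

With the real data the call would be `uniformity_test(hotel_df['required_car_parking_spaces'])`. Categories with very small expected counts can make the chi-square approximation unreliable, so rare values may need to be merged first.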

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy import stats

# Column under test: number of car parking spaces required per booking
variable_data = hotel_df['required_car_parking_spaces'].dropna()

# Significance level
alpha = 0.05

# Perform Shapiro-Wilk test (note: SciPy warns that the p-value may be inaccurate for N > 5000)
statistic, p_value = stats.shapiro(variable_data)

# Decision rule
if p_value > alpha:  # Fail to reject H₀ (data may be normally distributed)
    print("We fail to reject the null hypothesis. There is not enough evidence to conclude that the data is not normally distributed.")
else:  # Reject H₀ (data likely not normally distributed)
    print("We reject the null hypothesis. The data is likely not normally distributed.")


##### Which statistical test have you done to obtain P-Value?

The code above performs the Shapiro-Wilk test, which assesses whether a sample comes from a normally distributed population. It outputs a test statistic and a p-value. Note that this is a test of normality; it does not directly test the uniform-distribution hypothesis stated in the previous step.

Here's a breakdown of the test:

* Null Hypothesis (H₀): The data is normally distributed.
* Alternative Hypothesis (Hₐ): The data is not normally distributed.

The p-value obtained from the Shapiro-Wilk test is the probability of observing a test statistic as extreme or more extreme than the one calculated, assuming the null hypothesis (normality) is true.

Decision Rule:

* p-value > α (significance level, e.g., 0.05): fail to reject H₀. There is not enough evidence to conclude that the data is not normally distributed.
* p-value <= α: reject H₀. The data is likely not normally distributed.

This test is helpful for deciding whether the data can be analyzed with statistical methods that assume normality (e.g., t-tests, ANOVA).

If the data is not normally distributed, alternative methods or transformations may be needed.

##### Why did you choose the specific statistical test?

The Shapiro-Wilk test was chosen in the code above for a few reasons that make it a common first test for normality:

1. Power: The Shapiro-Wilk test is a relatively powerful test for normality, particularly for small and moderate sample sizes. For very large samples (SciPy's documentation mentions N > 5000) the reported p-value may not be accurate, and even tiny departures from normality tend to be flagged as significant.

2. Ease of Use: The Shapiro-Wilk test is easy to run in many statistical software packages, including Python's scipy.stats library used in the code above. This makes it an accessible option for many users.

3. Sensitivity: The test responds to several kinds of departure from normality, such as skewness and heavy tails, so a rejection is best followed by a look at the data to see which kind is present.

4. Alternative Tests: Depending on the situation, other normality tests might be better suited. For example, the Kolmogorov-Smirnov test compares the sample with a fully specified reference distribution. However, it can be less powerful than the Shapiro-Wilk test for smaller samples.

5. Visualization: While the Shapiro-Wilk test provides a formal result, it is always good practice to visualize the data with tools like histograms and Q-Q plots. These visualizations give a better understanding of the shape of the data and of potential departures from normality.

Overall, the Shapiro-Wilk test is a solid starting point for assessing normality, but it is important to consider other factors and potentially use additional methods for a more comprehensive analysis.
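
The Q-Q check mentioned in point 5 can also be summarised as a single number: the correlation between the ordered data and the theoretical normal quantiles, which is close to 1 when the Q-Q plot is a straight line. The sketch below uses `scipy.stats.probplot` for this; `qq_correlation` is a hypothetical helper and both samples are randomly generated, not taken from `hotel_df`.

```python
import numpy as np
from scipy import stats


def qq_correlation(values) -> float:
    """Correlation between the ordered data and theoretical normal quantiles."""
    data = np.asarray(values, dtype=float)
    # probplot returns ((theoretical, ordered), (slope, intercept, r)) when fit=True
    _, (slope, intercept, r) = stats.probplot(data, dist='norm')
    return float(r)


# Synthetic samples for illustration only
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=100, scale=20, size=500)
skewed_sample = rng.exponential(scale=100, size=500)

print(round(qq_correlation(normal_sample), 3))
print(round(qq_correlation(skewed_sample), 3))
```

Passing `plot=plt` (with `matplotlib.pyplot` imported as `plt`) to `stats.probplot` draws the Q-Q plot itself, which is usually the more informative view.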

### Hypothetical Statement - 2

### **Chart - 4 : Group by Hotel**

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Based on the code that visualizes the average daily rate (ADR) for the two hotel types ("City Hotel" and "Resort Hotel"), there are two possible sets of null and alternative hypotheses to consider:

Option 1: Comparing Means

* Null Hypothesis (H₀): The average daily rates (ADR) for city hotels and resort hotels are equal.
* Alternative Hypothesis (Hₐ): The average daily rates (ADR) for city hotels and resort hotels are not equal. (This could be one-sided, specifying which type is expected to have a higher ADR, or two-sided, leaving the direction open.)

This option focuses on whether there is a statistically significant difference in the mean ADR between the two hotel types.

Option 2: Comparing Distributions

* Null Hypothesis (H₀): The distributions of ADR for city hotels and resort hotels are identical.
* Alternative Hypothesis (Hₐ): The distributions of ADR for city hotels and resort hotels are not identical.

This option investigates whether the ADRs for each hotel type have the same shape and spread, beyond just the average values.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy import stats
# Group by Hotel and calculate mean ADR for each type (for reference)
group_by_hotel = hotel_df.groupby('hotel')
hotel_adr_means = group_by_hotel['adr'].mean().reset_index()
# Extract the individual ADR values for each hotel type into separate lists
# (the t-test needs the raw observations; a single mean per group gives a NaN p-value)
city_adr = hotel_df.loc[hotel_df['hotel'] == 'City Hotel', 'adr'].dropna().tolist()
resort_adr = hotel_df.loc[hotel_df['hotel'] == 'Resort Hotel', 'adr'].dropna().tolist()


In [None]:
# Perform t-test for independent samples with equal variances
statistic, p_value = stats.ttest_ind(city_adr, resort_adr, equal_var=True)
# Decision rule (assuming significance level α = 0.05)
if p_value > 0.05:
    print("We fail to reject the null hypothesis. There is not enough evidence to conclude that the average ADRs for city and resort hotels are different.")
else:
    print("We reject the null hypothesis. There is a statistically significant difference between the average ADRs for city and resort hotels.")

##### Which statistical test have you done to obtain P-Value?

The p-value for comparing average daily rates (ADR) between city and resort hotels was obtained with a t-test. Specifically, it is a two-sample independent t-test assuming equal variances.

Purpose: The t-test compares the means of two independent groups. Here the groups are city hotels and resort hotels, and the question is whether their average ADRs differ significantly.

The t-test assumes:

* Normality of data within each group (consider normality tests if unsure)
* Independence of observations (ADR values in each group are not related)
* Equal variances in both groups (the equal_var=True argument was used in the code)

Output: The t-test provides two key values:

1. Test statistic: This measures the observed difference between the means relative to its standard error, which is computed from the pooled variance.
2. P-value: This is the probability of observing a test statistic as extreme or more extreme than the calculated one, assuming the null hypothesis (equal means) is true.

Interpretation:

1. Low p-value (less than significance level): This gives evidence to reject the null hypothesis. There is a statistically significant difference between the average ADRs for the two hotel types.
2. High p-value (greater than significance level): We fail to reject the null hypothesis. There is not enough evidence to conclude that the average ADRs differ significantly.

##### Why did you choose the specific statistical test?

The specific statistical test chosen in this case - a two-sample independent t-test assuming equal variances - was selected for a few reasons:

1. Comparing Means: We're primarily interested in whether the average daily rates (ADR) for city and resort hotels are statistically different. The t-test is specifically designed to compare the means of two independent groups.

2. Independent Samples: The ADR data for city and resort hotels is assumed to be independent. The t-test is suitable for comparing independent groups where observations within each group don't influence each other.

3. Normality Assumption (if applicable): The t-test usually performs well when data within each group (city and resort hotels) is normally distributed. While normality tests weren't explicitly shown in the code, it's good practice to check this assumption for t-tests. If the data isn't normal, consider alternative non-parametric tests like the Mann-Whitney U test.

4. Equal Variances (if applicable): The code used equal_var=True in the t-test function. This assumes that the variances (spread) of ADR values within each hotel type are similar. If you suspect unequal variances, you can either transform the data or use a version of the t-test that doesn't assume equal variances (equal_var=False).

Overall, the two-sample independent t-test with the assumption of equal variances is a good choice because:

* It directly addresses the question of comparing means between two independent groups.
* It's a commonly used and well-understood test.
* It has reasonable assumptions (normality and equal variances) that can be assessed with additional tests if needed.

However, it's important to consider the limitations of the test and choose alternative approaches if the assumptions aren't met or if you're interested in comparing the entire distributions rather than just the means.
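
The alternatives named above (the unequal-variance t-test, the Mann-Whitney U test, and a comparison of whole distributions) can be run side by side as a robustness check. The sketch below is illustrative only: `compare_adr` is a hypothetical helper, the Kolmogorov-Smirnov test is one possible choice for comparing distributions, and both samples are randomly generated rather than taken from `hotel_df`.

```python
import numpy as np
from scipy import stats


def compare_adr(city_adr, resort_adr) -> dict:
    """p-values of three two-sample tests with weaker assumptions than the pooled t-test."""
    # Welch's t-test: compares means without assuming equal variances
    _, p_welch = stats.ttest_ind(city_adr, resort_adr, equal_var=False)
    # Mann-Whitney U: rank-based, does not assume normality
    _, p_mwu = stats.mannwhitneyu(city_adr, resort_adr, alternative='two-sided')
    # Two-sample Kolmogorov-Smirnov: compares the whole distributions
    _, p_ks = stats.ks_2samp(city_adr, resort_adr)
    return {
        'welch_t': float(p_welch),
        'mann_whitney_u': float(p_mwu),
        'kolmogorov_smirnov': float(p_ks),
    }


# Synthetic ADR samples for illustration only
rng = np.random.default_rng(42)
toy_city = rng.normal(loc=120, scale=20, size=200)
toy_resort = rng.normal(loc=90, scale=40, size=200)

p_values = compare_adr(toy_city, toy_resort)
for name, p in p_values.items():
    print(name, "reject H0" if p <= 0.05 else "fail to reject H0")
```

If all three tests agree with the pooled t-test, the conclusion does not hinge on the normality or equal-variance assumptions.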

### Hypothetical Statement - 3
### **Chart - 6 : Percentage of repeated guests (Univariate)**

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Based on the code that creates a pie chart for the distribution of repeated guests (presumably with values of True and False in the is_repeated_guest column), here are two possible sets of null and alternative hypotheses:

Option 1: Proportion of Repeated Guests

* Null Hypothesis (H₀): The proportion of repeated guests (True values) in the data is equal to 50%.
* Alternative Hypothesis (Hₐ): The proportion of repeated guests (True values) in the data is not equal to 50%.

This option specifically focuses on whether the percentage of repeated guests is exactly half (50%) or deviates from that value.

Option 2: Comparison with a Threshold

* Null Hypothesis (H₀): The proportion of repeated guests (True values) in the data is less than or equal to a specific threshold (e.g., 40%).
* Alternative Hypothesis (Hₐ): The proportion of repeated guests (True values) in the data is greater than the specific threshold (e.g., 40%).

This option allows you to test against a pre-defined threshold that might be considered "good" or "bad" for customer retention. You can adjust the threshold value based on your business needs.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import numpy as np
from scipy import stats
# Count the number of repeated guests (True values)
repeated_guests = hotel_df[hotel_df['is_repeated_guest'] == True].shape[0]

# Total number of guests
total_guests = hotel_df.shape[0]

# Proportion of repeated guests
p_hat = repeated_guests / total_guests
# Hypothesized proportion under H₀
p0 = 0.5
# Perform normal approximation (one-proportion z) test with continuity correction
# The standard error is computed under H₀, and the correction is 0.5 / n
standard_error = np.sqrt(p0 * (1 - p0) / total_guests)
statistic = max(abs(p_hat - p0) - 0.5 / total_guests, 0) / standard_error
# Two-sided p-value from the standard normal distribution
p_value = 2 * stats.norm.sf(statistic)

# Decision rule (assuming significance level α = 0.05)
if p_value > 0.05:
    print("We fail to reject the null hypothesis. There is not enough evidence to conclude that the proportion of repeated guests is different from 50%.")
else:
    print("We reject the null hypothesis. There is a statistically significant difference from the expected 50% proportion of repeated guests.")


##### Which statistical test have you done to obtain P-Value?

The statistical test used to obtain the p-value depends on which of the two hypotheses above is tested. The code above implements Option 1.

Option 1: Proportion of Repeated Guests (H₀: p = 0.5):

Test: Normal approximation test with continuity correction

This test assesses whether the proportion of repeated guests (True values in is_repeated_guest) deviates from a hypothesized value of 50% (null hypothesis). It assumes a binomial distribution for the data (repeated vs. non-repeated guests) and approximates it with a normal distribution, which is reasonable when the sample is large.

Option 2: Comparison with a Threshold (H₀: p <= threshold):

Test: One-tailed binomial test

This test is used when there is a specific threshold in mind for the proportion of repeated guests (e.g., H₀: proportion <= 40%). It works directly with the binomial distribution, so the threshold value can be specified in the null hypothesis and the test checks whether the observed proportion is greater than that threshold (one-tailed test).

##### Why did you choose the specific statistical test?

The specific statistical test chosen depends on the hypothesis you want to test regarding the proportion of repeated guests in your data. Here's a breakdown of the rationale behind the two options:

Option 1: Normal Approximation Test with Continuity Correction (H₀: p = 0.5):

We chose this test for the following reasons:

* Testing Against 50%: The null hypothesis specifically states that the proportion of repeated guests is exactly 50%. The normal approximation test is suitable for this scenario because it allows us to assess how likely it is to observe the obtained proportion (p_hat) if the true proportion were actually 50%.
* Binomial Data Approximation: The data likely follows a binomial distribution, with two categories (repeated vs. non-repeated guests). The normal approximation test works well when the sample size is large enough. It uses the observed proportion (p_hat) and calculates a p-value assuming a normal distribution that approximates the binomial distribution for this test.
* Continuity Correction: Since the normal approximation might not be perfect for small samples, the continuity correction is applied. This adjusts the test statistic slightly to account for the discreteness of the binomial distribution.

Option 2: One-Tailed Binomial Test (H₀: p <= threshold):

This test is a good choice if you have a specific threshold in mind:

* Predefined Threshold: The null hypothesis focuses on a comparison with a specific value (e.g., proportion of repeated guests is less than or equal to 40%). The one-tailed binomial test allows you to directly incorporate this threshold into the test.
* Direct Binomial Analysis: This test works directly with the binomial distribution, eliminating the need for an approximation like the normal test in Option 1. It assesses the probability of observing the obtained proportion (p_hat) or a higher proportion, assuming the true proportion is less than or equal to the threshold value (null hypothesis).
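
Option 2 is described above but not implemented in the notebook. A minimal sketch of the one-tailed exact binomial test is given below, assuming SciPy 1.7 or later (where `scipy.stats.binomtest` is available). `repeated_guest_threshold_test` is a hypothetical helper, the 40% threshold is the example value from the text, and the guest counts are invented.

```python
from scipy import stats


def repeated_guest_threshold_test(repeated: int, total: int, threshold: float = 0.4) -> float:
    """One-tailed exact binomial test of H0: p <= threshold against Ha: p > threshold."""
    result = stats.binomtest(repeated, total, p=threshold, alternative='greater')
    return float(result.pvalue)


# Invented counts for illustration only
p_low = repeated_guest_threshold_test(repeated=30, total=1000)
p_high = repeated_guest_threshold_test(repeated=500, total=1000)

for label, p in [('30 of 1000', p_low), ('500 of 1000', p_high)]:
    decision = "reject H0" if p <= 0.05 else "fail to reject H0"
    print(label, decision)
```

With the notebook's variables the call would be `repeated_guest_threshold_test(repeated_guests, total_guests, threshold=0.4)`.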

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Check for missing values
print(hotel_df.isnull().sum())

# Visualize missing values per column with a bar chart (optional)
import matplotlib.pyplot as plt

missing_values = hotel_df.isnull().sum()
missing_values.plot(kind='bar')
plt.show()


#### What all missing value imputation techniques have you used and why did you use those techniques?

1. Deletion: Remove the rows or columns that contain missing values.

2. Mean/Median/Mode Imputation: Replace missing values with the mean, median, or most frequent value of the feature, respectively.

3. K-Nearest Neighbors (KNN) Imputation: Estimate a missing value from the values of the k nearest neighbors, i.e., the data points most similar to the one with the missing value.

4. Model-based Imputation: Train a separate model (e.g., linear regression) to predict the missing values from the other features in the data.

The choice depends on several factors:

a. Amount of Missing Data: Deletion might be acceptable when only a small percentage of values is missing, but imputation becomes more important as the proportion grows.

b. Data Distribution:
* Mean imputation works well for roughly symmetric (e.g., normally distributed) numeric data, median imputation is more robust for skewed data or data with outliers, and mode imputation suits categorical features.
* KNN imputation is better at capturing relationships between features.

c. Missing Value Pattern:
* Random missingness: Any technique might work.
* Systematic missingness (e.g., only for a specific customer segment): Deletion might be less suitable because it can bias the data; consider model-based imputation if the pattern can be explained by other features.
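
The simpler techniques listed above (deletion, median imputation, mode imputation) can be sketched as follows. This is an illustrative example in plain Python, not the code used in this notebook: the rows, the column names (`children`, `country`, `agent`), and the helpers `impute` and `drop_missing` are all hypothetical. On a pandas DataFrame the same ideas would map to calls such as `DataFrame.dropna` and `Series.fillna`.

```python
from statistics import median, mode

# Hypothetical booking rows; None marks a missing value
rows = [
    {"children": 0, "country": "PRT", "agent": 9},
    {"children": None, "country": "GBR", "agent": None},
    {"children": 2, "country": None, "agent": 9},
    {"children": 0, "country": "PRT", "agent": 240},
    {"children": 1, "country": "PRT", "agent": None},
]

def impute(rows, column, strategy):
    """Return copies of the rows with missing values in `column` filled in.

    strategy: "median" for numeric columns, "mode" for categorical ones.
    """
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed) if strategy == "median" else mode(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def drop_missing(rows, column):
    """Deletion: keep only the rows where `column` is present."""
    return [r for r in rows if r[column] is not None]

# Numeric column: fill with the median of the observed values
children_filled = impute(rows, "children", "median")

# Categorical column: fill with the most frequent value
country_filled = impute(rows, "country", "mode")

# Deletion: drop the rows that have no agent
complete_rows = drop_missing(rows, "agent")

print([r["children"] for r in children_filled])
print([r["country"] for r in country_filled])
print(len(complete_rows), "rows kept after deletion")
```

The helpers return new lists rather than modifying `rows`, so each strategy can be tried against the same original data.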

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(Mandatory for textual datasets, i.e., NLP, Sentiment Analysis, Text Clustering, etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & Remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
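
The save and load steps described in this section can be sketched with the standard-library `pickle` module. This is only an illustration of the round trip, under stated assumptions: no model has been trained in this notebook, so `model` is a hypothetical stand-in (a plain dict), `predict` is a hypothetical helper, and the file name `best_model.pkl` is made up. A fitted estimator would be dumped and loaded with the same two calls.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model; any picklable object works the same way
model = {"name": "baseline", "cancel_threshold": 0.5}

def predict(model, cancel_probability):
    """Hypothetical helper: flag a booking as canceled (1) or not canceled (0)."""
    return int(cancel_probability >= model["cancel_threshold"])

# Write to a temporary directory so the example leaves the project folder untouched
model_path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")

# Save the file
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load the file again
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

# Sanity check: the loaded object matches the original and predicts on unseen inputs
unseen = [0.8, 0.2]
print(loaded_model == model)
print([predict(loaded_model, p) for p in unseen])
```

Only unpickle files that come from a trusted source, because loading a pickle can execute arbitrary code.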

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

* City hotels are the hotel type that guests book most often, so City hotels are busier than Resort hotels.

* The average ADR (average daily rate) of City hotels is higher than that of Resort hotels, which suggests that City hotels generate more revenue than Resort hotels.

* The total stay of guests is positively related to ADR: the longer the stay, the higher the ADR and the revenue.

* The percentage of repeated guests is very low: only 3.9% of guests had revisited the hotels, and the remaining 96.1% were new guests. The retention rate is therefore very low.

* Demand for car parking spaces is very low: most customers (91.6%) do not require a parking space, so having fewer parking spaces does not affect the business much.

* Among the different meal types, BB (Bed & Breakfast) is the one guests prefer most.

* 'Direct' and 'TA/TO' contribute almost equally to ADR in both hotel types, i.e. 'City Hotel' and 'Resort Hotel', while GDS contributes strongly to ADR in the 'City Hotel' type.

* The optimal stay length in both hotel types (City and Resort Hotel) is less than 7 days. Guests usually stay for up to a week, and stays longer than 1 week decline drastically.

* The largest number of bookings falls in July and August, which makes them the guests' favourite months for travel.

* The most used distribution channel for booking is 'TA/TO': 79.1% of bookings were made through TA/TO (travel agents/tour operators).

* Comparing ADR across months shows that Resort Hotel ADR is higher than City Hotel ADR in June, July, and August.

* Just over a quarter of all bookings were canceled: approximately 27.5% of the total.

* The majority of guests chose room type 'A', which makes it the most preferred room type.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***