<a href="https://colab.research.google.com/github/Linku1999/Hotel-Booking-Analysis/blob/main/EDA_Hotel_Booking_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Hotel Booking Analysis



##### **Project Type**    - EDA/Hotel Booking Analysis
##### **Contribution**    - Individual


# **Project Summary -**

The hotel booking dataset was analyzed to gather insights and make informed hotel booking decisions. The dataset comprised factors such as hotel type, food type, stay duration, and others. The analysis's purpose was to aid tourists in selecting the correct hotel at the right price, ensuring a safe and happy stay, and assisting hotel management in making decisions to improve their services.

The first step was to load the dataset into a notebook and examine its structure. The dataset had 119,390 rows and 32 columns. The "company" column, which was almost entirely null, was dropped. Null values in the "children" and "agent" columns were replaced with zeros, and missing "country" values were labelled "Others", so that missing data would not distort the later analysis.

Data visualisation was critical in comprehending the information contained in the dataset. To depict the data graphically and draw significant insights, various visual elements such as pie charts, bar charts, line charts, and correlation plots were used.

The project's goals were to identify the best time of year to book hotel rooms, establish the length of stay that gives the best daily rate, and predict the chance of a hotel receiving an unusually high number of special requests. Furthermore, factors influencing rates and service levels were investigated to help both tourists and hotel management make informed judgements.


# **GitHub Link -**

https://github.com/Linku1999/Hotel-Booking-Analysis

# **Problem Statement**


**Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests?
This hotel booking dataset can help you explore those questions! This dataset contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data.**

#### **Define Your Business Objective?**

We have been given a hotel booking dataset that contains information such as hotel type, meal type, and stay duration. On any tour or trip, I don't want to pay an arbitrary amount; I would rather pay an optimized price for my stay. I also want my vacation to be safe, with a good stay and good meals at that optimized price.
This analysis will help tourists choose the right hotel at the right price with a proper and safe stay. It will also help hotel management make the right decisions when changing service levels.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reasons.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import missingno as msno

### Dataset Loading

In [None]:
# Mount the drive to the notebook
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#Importing the dataset from the drive
df_data = pd.read_csv("/content/drive/MyDrive/Alamabetter/cohort enlighten/python/EDA Project/Hotel Bookings (1).csv")
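Since the guidelines ask for exception handling, the load step can be wrapped in a small helper. This is only a minimal sketch: the function name `load_bookings` is hypothetical and not part of the original notebook, and the checks shown are one reasonable choice among many.

```python
from pathlib import Path

import pandas as pd


def load_bookings(csv_path):
    """Read the bookings CSV and fail early with a clear message.

    `load_bookings` is a hypothetical helper, not part of the original notebook.
    """
    path = Path(csv_path)
    if not path.is_file():
        # Typical cause in Colab: the drive is not mounted or the path is misspelled
        raise FileNotFoundError(f"Dataset not found at: {path}")
    data = pd.read_csv(path)
    if data.empty:
        raise ValueError(f"Dataset at {path} contains no rows")
    return data
```

In the notebook it would be called with the same Drive path that is passed to `pd.read_csv` above.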

### Dataset First View

In [None]:
# Let's make a copy of our data so the original data is not affected
df = df_data.copy()

In [None]:
#Let's take a first view of data
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Check the dataset row and column
df.shape

### Dataset Information

In [None]:
# Let's check the info of the data
df.info()

In [None]:
df.describe()

#### Duplicate Values

In [None]:
# Count duplicate values
duplicate_counts = df.duplicated().sum()

# Print the result
print("Number of duplicate values:", duplicate_counts)
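The cell above only counts duplicate rows; it does not remove them. Whether to drop them is a judgment call, because two identical rows may be two genuinely separate bookings. The sketch below, on a small made-up frame (`toy` is illustrative, not the real data), shows how the count and an optional removal fit together.

```python
import pandas as pd

# Hypothetical toy data standing in for the bookings table
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel"],
    "adr": [100.0, 100.0, 80.0],
})

# duplicated() flags every row that repeats an earlier row exactly
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print("Duplicate rows:", n_duplicates)
print("Rows after dropping duplicates:", len(deduplicated))
```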

#### Missing Values/Null Values

In [None]:
#Checking the total null values in data
df.isnull().sum()

In [None]:
# Visualize missing values
msno.matrix(df)
plt.show()

In [None]:
# Null values found in the output above:
# 4 null values present in 'children'
# 488 null values present in 'country'
# 16340 null values present in 'agent'
# 112593 null values present in 'company'

# As 'company' has a huge number of null values, we will drop this column to make further analysis easier
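The decision to drop a column is easier to justify when the null counts are expressed as a share of all rows. A minimal sketch on made-up data follows; the 50% cut-off (`MAX_NULL_PCT`) is an assumption chosen for illustration, not a rule taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same kind of gaps as the bookings table
toy = pd.DataFrame({
    "children": [0.0, 1.0, np.nan, 2.0],
    "company": [np.nan, np.nan, np.nan, 40.0],
})

# Share of null values per column, as a percentage
null_share = toy.isnull().mean().mul(100).round(1)

# Assumed threshold: columns that are mostly empty are candidates for dropping
MAX_NULL_PCT = 50
cols_to_drop = null_share[null_share > MAX_NULL_PCT].index.tolist()

print(null_share)
print("Columns to drop:", cols_to_drop)
```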

In [None]:
df.drop(['company'], axis=1, inplace = True)

In [None]:
# Fill the remaining nulls: 0 for the numeric columns, 'Others' for country
# (assigning the result back avoids in-place fills on a column selection)
df['children'] = df['children'].fillna(0)
df['agent'] = df['agent'].fillna(0)
df['country'] = df['country'].fillna('Others')

In [None]:
# Verify again the data
df.isnull().sum()

### What did you know about your dataset?

**No null values remain in the dataset, so we are ready to proceed.**

## ***2. Understanding Your Variables***

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Some of the important variables:
1. hotel: type of hotel (City Hotel or Resort Hotel)
2. is_canceled: whether the booking was canceled (1) or not (0)
3. lead_time: number of days between the booking date and the arrival date
4. arrival_date_year: year of arrival
5. arrival_date_month: month of arrival
6. arrival_date_week_number: week number of the year of arrival
7. arrival_date_day_of_month: day of the month of arrival
8. stays_in_weekend_nights: number of weekend nights the guest stayed
9. stays_in_week_nights: number of week nights the guest stayed
10. meal: type of meal booked
    * BB: Bed & Breakfast
    * HB: two meals, including breakfast
    * FB: breakfast, lunch, and dinner
11. market_segment: market segment of the booking (TA: travel agents, TO: tour operators)
12. previous_cancellations: number of bookings the customer canceled in the past
13. previous_bookings_not_canceled: number of bookings the customer did not cancel in the past

### Check Unique Values for each variable.

In [None]:
# Iterate over each column in the DataFrame
for column in df.columns:
    unique_values = df[column].unique()
    num_unique_values = len(unique_values)

    # Print the unique values for the current column
    print("Unique values for column", column)
    print(unique_values)

    # Print the number of unique values for the current column
    print("Number of unique values:", num_unique_values)
    print()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Add a derived column: total number of nights stayed (week nights + weekend nights)
df['total_num_of_stay'] = df['stays_in_week_nights'] + df['stays_in_weekend_nights']

In [None]:
# Calculate revenue (total nights x ADR) and add it to the data
df['revenue'] = df['total_num_of_stay']*df['adr']

In [None]:
df.info()

In [None]:
# The is_canceled column holds 0/1 flags, so let's add a readable string version as is_cancelled
df['is_cancelled'] = df.is_canceled.replace(to_replace = [1,0], value = ['canceled', 'not_canceled'])
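The three derived columns can be checked on a tiny made-up frame before trusting them on the full data. The sketch below uses `Series.map` with a dictionary for the labels, which is an alternative to `replace`, not the notebook's own method; the values in `toy` are invented.

```python
import pandas as pd

# Hypothetical toy rows: a 3-night stay, a cancelled 0-night booking, a 5-night stay
toy = pd.DataFrame({
    "stays_in_week_nights": [2, 0, 3],
    "stays_in_weekend_nights": [1, 0, 2],
    "adr": [100.0, 80.0, 50.0],
    "is_canceled": [0, 1, 0],
})

# Total nights = week nights + weekend nights
toy["total_num_of_stay"] = toy["stays_in_week_nights"] + toy["stays_in_weekend_nights"]

# Revenue = nights x average daily rate (0 for bookings with no nights)
toy["revenue"] = toy["total_num_of_stay"] * toy["adr"]

# Map the 0/1 flag to readable labels with an explicit dictionary
toy["is_cancelled"] = toy["is_canceled"].map({1: "canceled", 0: "not_canceled"})

print(toy[["total_num_of_stay", "revenue", "is_cancelled"]])
```

Note that this revenue figure is also computed for cancelled bookings, so it is a booked value rather than money actually received.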

### What all manipulations have you done and insights you found?

We have added the columns revenue and total_num_of_stay, and we have added an is_cancelled column that holds readable labels for the is_canceled flag.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# Q.1 Which hotel is more preferred by customers?

In [None]:
# Let's use a pie chart to find the hotel preference
plt.rcParams['figure.figsize'] = (8,8)

# Set the chart type, colours, and font size
df['hotel'].value_counts().plot(kind='pie', colors = ['orange','yellow'], autopct='%.0f%%', fontsize = 16)
plt.title('Types of Hotel', fontsize = 18)
plt.ylabel('hotel', fontsize = 16)



*   **The pie chart above shows that the City Hotel is preferred over the Resort Hotel.**
*   **The City Hotel accounts for 66% of bookings and the Resort Hotel for 34%.**





# Q 2 Check the cancellation data using data visualization

In [None]:
# Let's plot the chart to show the data
df.is_canceled.value_counts().plot(kind='pie', colors = ['green', 'pink'], autopct='%.0f%%', fontsize = 16)
plt.title('Cancelation Plot for Hotel Booking Customers', fontsize = 18)
plt.ylabel('count', fontsize = 16)



*   **The plot shows that 37% of bookings were canceled by customers, while 63% were not canceled.**








# Q 3 Let's check out the arrival data

In [None]:
# Lets check the data by year
arrival_by_year = df[['arrival_date_year','hotel']].value_counts().groupby('arrival_date_year').sum()

In [None]:
# Lets print the plot
plt.rcParams['figure.figsize'] = (10,8)
arrival_by_year.plot(kind='bar', color = ['pink', 'blue','yellow'], fontsize = 16)
plt.title('Year Wise Bookings', fontsize = 18)
plt.xlabel('Arrival Year', fontsize = 15)
plt.ylabel('Count of arrival', fontsize = 15)

*   **We can observe that arrivals in 2016 are more than double those of the previous year.**


*   **Arrivals fell in 2017, which Hotel Management needs to investigate.**
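Year totals are only comparable if every year covers the same months, so it is worth checking the coverage before reading too much into a rise or fall. A minimal sketch of that check on made-up rows (`toy` is illustrative, not the real data):

```python
import pandas as pd

# Hypothetical toy data: each row is one arrival
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2015, 2016, 2016, 2016, 2017],
    "arrival_date_month": ["July", "August", "January", "July", "December", "March"],
})

# Number of distinct arrival months recorded for each year
months_per_year = toy.groupby("arrival_date_year")["arrival_date_month"].nunique()

print(months_per_year)
```

If the real data shows fewer than 12 months for a year, comparing arrivals per month is a fairer yardstick than comparing yearly totals.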








In [None]:
# Let's compare arrivals per year for both hotels
# Plot the graph for both hotels
plt.rcParams['figure.figsize'] = (10,6)
sns.countplot(data = df, x = 'arrival_date_year', hue = 'hotel')
plt.title('Arrival per year for each hotel', fontsize = 18)
plt.xlabel('Arrival Year', fontsize = 14)
plt.ylabel('Count of arrival', fontsize = 14)



*   **We can observe that the number of arrivals is highest in 2016 and is higher for the City Hotel than for the Resort Hotel, while bookings are lower in 2015 and 2017 for both hotels.**









In [None]:
# Let's check arrivals per month
# Create the plot of arrivals per month for each hotel
plt.rcParams['figure.figsize'] = (22,7)
sns.countplot(data = df, x = 'arrival_date_month', hue = 'hotel', order = ['January', 'February','March', 'April','May','June','July','August','September','October','November','December'])
plt.title('Arrival per month for each hotel', fontsize = 18)
plt.xlabel('Arrival Month', fontsize = 14)
plt.ylabel('Count of Arrival', fontsize = 14)


*   **We observed that arrivals in the first months of the year are lower than in the middle months.**


*   **The months of May, June, July, and August see the most arrivals.**

*   **The last two months of the year follow the same trend as January and February.**




In [None]:
# Let's check which countries most guests come from
# We will consider only the top 10 countries
most_preferred_country = df['country'].value_counts().head(10)
most_preferred_country

In [None]:
#let's print on the plot
plt.rcParams['figure.figsize'] = (20,10)
most_preferred_country.plot(kind='bar', color = ['yellow'], fontsize = 12)
plt.title('Preference by Country', fontsize = 18)
plt.xlabel('Country', fontsize = 15)
plt.ylabel('Counts', fontsize = 15)

* **Most guests come from PRT, followed by GBR.**



# Q 4 Let us visualize the cancellation data

In [None]:
#Lets find the cancelled booking
cancelation_data = df.groupby(['hotel','is_canceled'])['is_canceled'].count().unstack()

In [None]:
cancelation_data

In [None]:
# let's analyze using the plot
plt.rcParams['figure.figsize'] = (12,5)
cancelation_data.plot(kind = 'bar', color = ['greenyellow', 'blue'], fontsize = 12)
plt.title('Cancelation Data for Hotel (0 = Not Cancelled and 1 = Cancelled)', fontsize = 18)
plt.xlabel('Hotel Type', fontsize = 14)
plt.ylabel('Count', fontsize = 14)


*  **We observed that the City Hotel has more cancellations than the Resort Hotel, so Hotel Management needs to take appropriate action to minimize cancellations.**
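Raw cancellation counts favour whichever hotel has more bookings, so a cancellation rate per hotel gives a fairer comparison. Because `is_canceled` is a 0/1 flag, its mean within each group is the share of cancelled bookings. The sketch below uses invented rows to show a case where the counts differ but the rates do not.

```python
import pandas as pd

# Hypothetical toy data: the city hotel has twice as many bookings
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 2,
    "is_canceled": [1, 1, 0, 0, 1, 0],
})

# Count of cancelled bookings per hotel
cancel_count = toy.groupby("hotel")["is_canceled"].sum()

# Mean of a 0/1 flag = fraction cancelled; multiply by 100 for a percentage
cancel_rate = toy.groupby("hotel")["is_canceled"].mean().mul(100)

print(cancel_count)
print(cancel_rate)
```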



# Q 5 Analyse the data on the basis of ADR (Average Daily Rate)

In [None]:
# Let's find the average ADR for each type of hotel
average_adr_hotel = df.groupby(['hotel'])['adr'].mean()
average_adr_hotel

In [None]:
#let's visualize using plot
plt.rcParams['figure.figsize'] = (10,6)
average_adr_hotel.plot(kind = 'bar', color = ['skyblue', 'orange'], fontsize = 10)
plt.title('ADR on the basis of Hotel Type')
plt.xlabel('Hotel Type', fontsize = 14 )
plt.ylabel('Average ADR', fontsize = 14)



*    **ADR for the City Hotel is slightly higher than for the Resort Hotel.**





In [None]:
# Let's check the 10 countries with the highest average ADR
country_adr = df.groupby(['country'])['adr'].mean().sort_values(ascending = False)[0:10]

country_adr

In [None]:
# Let's check with the graph
plt.rcParams['figure.figsize'] = (14,8)
country_adr.plot(kind = 'bar', color = 'skyblue', fontsize = 10)
plt.title('ADR for top 10 countries')
plt.xlabel('Country', fontsize = 12 )
plt.ylabel('Average ADR', fontsize = 12)


*   **We can see that DJI has the highest ADR compared to all other countries.**
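An average computed from only a handful of bookings can top a ranking by chance, so it helps to report the booking count next to the mean and to filter on a minimum count. This is a sketch on invented numbers; the threshold `MIN_BOOKINGS` is an assumption for illustration and would need tuning on the real data.

```python
import pandas as pd

# Hypothetical toy data: one country has a single, very expensive booking
toy = pd.DataFrame({
    "country": ["PRT", "PRT", "PRT", "GBR", "GBR", "GBR", "DJI"],
    "adr": [90.0, 110.0, 100.0, 120.0, 90.0, 120.0, 273.0],
})

# Mean ADR and number of bookings per country
stats = toy.groupby("country")["adr"].agg(mean_adr="mean", bookings="count")

# Assumed threshold: ignore countries with too few bookings to be representative
MIN_BOOKINGS = 3
reliable = stats[stats["bookings"] >= MIN_BOOKINGS].sort_values("mean_adr", ascending=False)

print(stats.sort_values("mean_adr", ascending=False))
print(reliable)
```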




In [None]:
# Now let's check the ADR for each month and year
month_year_adr = df.groupby(['arrival_date_month','arrival_date_year'])['adr'].mean()

In [None]:
month_year_adr

In [None]:
# Let's do the visualization
fig, ax = plt.subplots(figsize=(18,5))
sns.lineplot(x='arrival_date_month', y='adr', data=df, hue='arrival_date_year', palette='dark', ax=ax)
ax.set_title('ADR for Month and Year', fontsize = 18)
ax.set_xlabel('Month', fontsize = 14)
ax.set_ylabel('ADR', fontsize = 14)



*  **The line plot above shows ADR rising from year to year, which suggests that the hotel business is scaling up every year.**
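Month names are plain strings, so a `groupby` on them sorts alphabetically (April, August, December, ...) rather than by calendar. Converting the column to an ordered categorical is one way to get calendar order in grouped results and in anything plotted from them. A minimal sketch on invented rows:

```python
import pandas as pd

MONTH_ORDER = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

# Hypothetical toy data with months in arbitrary order
toy = pd.DataFrame({
    "arrival_date_month": ["March", "January", "December", "January"],
    "adr": [90.0, 70.0, 110.0, 80.0],
})

# An ordered categorical makes sorting and grouping follow the calendar
toy["arrival_date_month"] = pd.Categorical(
    toy["arrival_date_month"], categories=MONTH_ORDER, ordered=True
)

# observed=True keeps only the months that actually occur in the data
monthly_adr = toy.groupby("arrival_date_month", observed=True)["adr"].mean()

print(monthly_adr)
```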




# Q 6 Let's check the daily rate

In [None]:
# First, select only the bookings that were not cancelled
not_cancelled_guests = df.loc[df['is_canceled']==0]

In [None]:
# Now calculate the price paid per stay (ADR x total nights) for non-cancelled bookings
# It is added to our data as a new 'price' column (NaN for cancelled bookings)
df['price'] = not_cancelled_guests['adr']*not_cancelled_guests['total_num_of_stay']
df.head()

In [None]:
# Let's visualize the price per month for each hotel using a line plot
plt.rcParams['figure.figsize'] = (15,6)
sns.lineplot(data=df, x='arrival_date_month', y='price', hue='hotel')
plt.title('Month-wise Price Paid for stay by Guests', fontsize = 12)
plt.xlabel('Month', fontsize = 12 )
plt.ylabel('Price paid for Stay', fontsize = 12)

*   **We observed that, during the summer months, the price paid by guests at the City Hotel is lower than at the Resort Hotel.**


*   **Prices at the Resort Hotel are higher than at the City Hotel in June, July, August, and September. For the rest of the months, prices at the City Hotel are consistently higher than at the Resort Hotel.**

# Q 7 Analysis using Correlation Heatmap

In [None]:
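# Correlate only the numeric columns; text columns such as 'hotel' cannot be correlated
plt.rcParams['figure.figsize'] = 24,12
sns.heatmap(df.corr(numeric_only=True), cmap = 'coolwarm', annot=True);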

*   **Focusing on revenue, stays_in_week_nights, and total_num_of_stay, we can verify that revenue is correlated almost equally with stays_in_week_nights and total_num_of_stay.**


*   **It is also observed that guests stayed more week nights than weekend nights.**
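A large annotated heatmap is hard to read, so the correlations of interest can also be pulled out as a sorted list. The sketch below does this for `revenue` on invented rows; `numeric_only=True` assumes a pandas version that supports that argument (1.5 or later).

```python
import pandas as pd

# Hypothetical toy data with one text column and a few numeric ones
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"],
    "stays_in_week_nights": [1, 2, 3, 4],
    "stays_in_weekend_nights": [0, 2, 0, 2],
    "adr": [100.0, 90.0, 110.0, 95.0],
})
toy["total_num_of_stay"] = toy["stays_in_week_nights"] + toy["stays_in_weekend_nights"]
toy["revenue"] = toy["total_num_of_stay"] * toy["adr"]

# Correlation of every numeric column with revenue, strongest first
corr_with_revenue = (
    toy.corr(numeric_only=True)["revenue"]
    .drop("revenue")
    .sort_values(ascending=False)
)

print(corr_with_revenue)
```

Correlation measures how two columns move together; it does not by itself show which column has the larger values, so comparisons such as week nights versus weekend nights are better read from `df.describe()` or a direct sum.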

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To fulfil the goal of minimising stay costs while providing a safe and satisfying experience, I would recommend the following to the client:
1. Examine Seasonality: Look through the dataset for patterns in booking trends throughout the year. Consider factors such as price, availability, and demand to determine the ideal time of year to book a hotel room. This study will assist the client in obtaining the most competitive rates and availability for their preferred stay.
2. Examine Length of Stay: Look into the relationship between length of stay and daily charges. Determine the appropriate length of stay for the most value for money. This research will allow the client to efficiently plan their stay time and maximise their money.
3. Compare Hotel and Meal alternatives: Compare the various hotel and meal alternatives available in the dataset. Examine each option's pricing, and supplement it with information on facilities and guest reviews from outside this dataset, which does not contain them. Recommend hotels that provide a good combination of pricing, safety, and quality to guarantee the guest has a pleasant stay.
4. Understand Special Requests: Investigate the elements that contribute to an unusually high number of special requests. Determine whether there are any patterns or specific hotel qualities that influence these requests. This research will help the client select hotels that meet their individual needs and preferences.
5. Monitor Reviews and Ratings: This dataset contains no guest reviews or ratings, so examine them from an external source. Concentrate on hotels that have received great evaluations about safety, service quality, and value for money. This information will assist the client in picking dependable hotels and ensuring a good vacation experience.

The client can make informed decisions about hotel selection, pricing negotiation, and stay duration by applying these recommendations and exploiting the insights from the dataset. This will allow them to meet their goal of reducing costs while assuring a safe and enjoyable holiday.
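The second recommendation, relating length of stay to the daily rate, can be sketched with a simple group-by. The rows below are invented, and the choice to keep only non-cancelled bookings with at least one night is an assumption for illustration; on the real data, stay lengths with very few bookings should be treated with caution.

```python
import pandas as pd

# Hypothetical toy data: one row per booking
toy = pd.DataFrame({
    "total_num_of_stay": [1, 1, 2, 2, 7],
    "adr": [120.0, 100.0, 95.0, 105.0, 80.0],
    "is_canceled": [0, 0, 0, 1, 0],
})

# Assumed filter: completed bookings with at least one night
completed = toy[(toy["is_canceled"] == 0) & (toy["total_num_of_stay"] > 0)]

# Mean daily rate and number of bookings for each length of stay
adr_by_stay = completed.groupby("total_num_of_stay")["adr"].agg(
    mean_adr="mean", bookings="count"
)

# Length of stay with the lowest mean daily rate
best_length = adr_by_stay["mean_adr"].idxmin()

print(adr_by_stay)
print("Length of stay with the lowest mean ADR:", best_length)
```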


# **Conclusions**

1. The City Hotel is preferred by customers over the Resort Hotel.
2. Across the whole dataset, about 37% of bookings were canceled by customers.
3. The number of cancellations at the City Hotel is much higher than at the Resort Hotel.
4. The number of arrivals is highest in 2016, while bookings are lower in 2015 and 2017.
5. The number of arrivals is high in May, June, July, and August for the City Hotel, whereas arrivals at the Resort Hotel are high only in July and August.
6. Most guests come from PRT, followed by GBR.
7. The line graph of ADR per month for the three years suggests that the hotel business is growing every year.
8. The line graph of price paid per stay for each month shows that the price paid at the City Hotel in July, August, and September was lower than at the Resort Hotel.
9. In the correlation map, we can observe that guests stayed more week nights than weekend nights.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***