<a href="https://colab.research.google.com/github/Siva778-gt/Hotel-Booking-Analysis-Almabetter/blob/main/Hotel_Booking_Analysis_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Hotel Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name**            - Bhajanthri Siva


# **Project Summary -**

This project is about hotel bookings for two types of hotels: City Hotel and Resort Hotel. The dataset has 119,390 rows and 32 columns. We split our work into three parts: Data Collection, Data Cleaning and Manipulation, and Exploratory Data Analysis (EDA).

First, in Data Collection, we explored the data using head(), tail(), info(), describe(), and the columns attribute. Some of the columns are hotel, is_canceled, lead_time, arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month, and stays_in_weekend_nights. We listed the unique values for each column and checked their data types, fixing any errors in the Data Cleaning part. We also found and removed 31,994 duplicate rows, leaving 87,396 unique rows.

Next, we did data wrangling before visualizing the data. We checked for null values in all columns. If a column had many null values, like the 'company' column, we dropped it. For columns with a few null values, we filled them using .fillna().

Finally, we used different charts to visualize the data and gain insights to meet our business goals.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Have you ever wondered when the best time to book a hotel room is, or how long to stay to get the best rate? Or maybe you want to predict if a hotel will get a lot of special requests? This hotel booking dataset can help you find answers! It includes booking information for a city hotel and a resort hotel, such as booking dates, length of stay, number of adults, children, and babies, and available parking spaces, among other details. All personal information has been removed from the data. Explore and analyze this data to find important factors that affect hotel bookings.

#### **Define Your Business Objective?**

Find the best times of year to book rooms for lower rates and the best length of stay for great daily rates. Look into how guest demographics and parking availability affect bookings, and study hotels with many special requests. Give practical tips to hotel owners and managers to help them set better prices, improve guest experiences, and boost bookings.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_columns',None)

### Dataset Loading

In [None]:
# Load Dataset
data = pd.read_csv("/content/Hotel Bookings (2).csv")

### Dataset First View

In [None]:
# Dataset First Look
data

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows,columns=data.shape
print("no of rows are:",rows)
print("no of columns are:",columns)

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

In [None]:
# Total rows = 119390, duplicated rows = 31994
# To remove them, we use drop_duplicates() to delete the duplicate rows.
data.drop_duplicates(inplace = True)

unique_rows= data.shape[0]
unique_rows


In [None]:
# View unique data

data.reset_index()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()


In [None]:
# Visualizing the missing values
sns.heatmap(data.isnull(), cbar=True, cmap='viridis')

plt.show()


In [None]:
miss_values =data.isnull().sum().sort_values(ascending=False)
miss_values

### What did you know about your dataset?

The dataset comprises a single file containing booking details for a city hotel and a resort hotel. It includes data such as booking dates, length of stay, number of adults, children, and babies, as well as available parking spaces, among other factors. The dataset consists of 119,390 rows and 32 columns. Initially, it contained 31,994 duplicate records, which were subsequently removed, leaving 87,396 rows. I examined the data types of each column (int, float, string) and noted inaccuracies in some columns, which were then corrected. Additionally, I identified the unique values in each column to understand the distinct entries present.
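
The duplicate check described above can be verified on a small frame. The sketch below uses a made-up DataFrame named `toy` (not the real dataset) to show that the count from `duplicated().sum()` matches the number of rows that `drop_duplicates()` removes.

```python
import pandas as pd

# Hypothetical miniature stand-in for the bookings data (illustration only).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 10, 3, 10],
    "adr": [80.0, 80.0, 120.0, 80.0],
})

rows_before = len(toy)
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)

# The three numbers should always reconcile.
assert rows_before - n_duplicates == len(deduped)
print(rows_before, n_duplicates, len(deduped))
```

The same reconciliation holds for the figures reported in this notebook: 119,390 - 31,994 = 87,396.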

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe()

### Variables Description

Here is the description of each variable:

1. **hotel**: Name of the hotel (Resort Hotel or City Hotel).
2. **is_canceled**: If the booking was canceled (1) or not (0).
3. **lead_time**: Number of days before the actual arrival of the guests.
4. **arrival_date_year**: Year of arrival date.
5. **arrival_date_month**: Month of arrival date.
6. **arrival_date_week_number**: Week number of the year for the arrival date.
7. **arrival_date_day_of_month**: Day of the month of the arrival date.
8. **stays_in_weekend_nights**: Number of weekend nights the guest stayed or booked to stay.
9. **stays_in_week_nights**: Number of week nights the guest stayed or booked to stay.
10. **adults**: Number of adults in the booking.
11. **children**: Number of children in the booking.
12. **babies**: Number of babies in the booking.
13. **meal**: Type of meal booked (e.g., "BB", "HB", "FB").
14. **country**: Country of origin of the guest.
15. **market_segment**: Market segment designation (e.g., "Direct", "Corporate").
16. **distribution_channel**: Distribution channel through which the booking was made (e.g., "Direct", "TA/TO").
17. **is_repeated_guest**: If the booking was from a repeated guest (1) or not (0).
18. **previous_cancellations**: Number of previous bookings canceled by the customer.
19. **previous_bookings_not_canceled**: Number of previous bookings not canceled by the customer.
20. **reserved_room_type**: Code of room type reserved by the customer.
21. **assigned_room_type**: Code for the type of room assigned to the customer.
22. **booking_changes**: Number of changes/amendments made to the booking.
23. **deposit_type**: Indication if a deposit was made to guarantee the booking.
24. **agent**: ID of the travel agency that made the booking.
25. **company**: ID of the company or entity that made the booking.
26. **days_in_waiting_list**: Number of days the booking was on the waiting list.
27. **customer_type**: Type of booking (e.g., "Contract", "Group").
28. **adr**: Average Daily Rate calculated by dividing the sum of all lodging transactions by the total number of staying nights.
29. **required_car_parking_spaces**: Number of car parking spaces required by the customer.
30. **total_of_special_requests**: Number of special requests made by the customer.
31. **reservation_status**: Reservation last status ("Canceled", "Check-Out", or "No-Show").
32. **reservation_status_date**: Date at which the last modification was made to the reservation status.
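
As a worked example of the `adr` definition in item 28, the snippet below computes the average daily rate for a single booking. The transaction amounts and the helper name `average_daily_rate` are invented for illustration and are not part of the dataset or the notebook.

```python
def average_daily_rate(lodging_transactions, staying_nights):
    """ADR = sum of lodging transactions / total staying nights."""
    if staying_nights <= 0:
        # A booking with no nights has no meaningful daily rate.
        raise ValueError("staying_nights must be positive")
    return sum(lodging_transactions) / staying_nights


# Hypothetical booking: three nights billed at different rates.
adr = average_daily_rate([90.0, 110.0, 100.0], staying_nights=3)
print(adr)  # 100.0
```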

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = data.apply(pd.Series.nunique)
print(unique_values)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# The null counts per column are already stored in miss_values; show the four columns with the most nulls.
miss_values[:4]

In [None]:
#Percentage of null values in each column, starting with 'company'
#(.iloc is used because positional [] indexing on a labelled Series is deprecated)

percentage_comp_null = miss_values.iloc[0] / unique_rows*100
percentage_comp_null

In [None]:
#Because there are so many empty spots in the 'company' column compared to the total number of entries, it's smarter to just delete that column altogether.
data.drop(['company'], axis=1, inplace=True)
data

In [None]:
#Percentage of null values in the 'agent' column
percentage_agent_null = miss_values.iloc[1] / unique_rows*100
percentage_agent_null

In [None]:
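#The 'agent' column has far fewer missing values than 'company', so we can simply replace them with zeros.
#Assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about.
data['agent'] = data['agent'].fillna(0)
data['agent'].isnull().sum()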

In [None]:
#Percentage of null values in the 'country' column
percentage_country_null = miss_values.iloc[2] / unique_rows*100
percentage_country_null

In [None]:
#Since there are only a few missing values in the 'country' column, we'll just fill them in with 'others' as the country name.

data['country'] = data['country'].fillna('others')
data['country'].isnull().sum()
#Now this column has no null value


In [None]:
#Percentage of null values in the 'children' column

percentage_children_null = miss_values.iloc[3] / unique_rows*100
percentage_children_null

In [None]:
#Given the limited number of missing values in the 'children' column, we can replace them with zeros.

data['children'] = data['children'].fillna(0)
data['children'].isnull().sum() #Now column has no null value

In [None]:
#Let's double-check if there are any additional null values elsewhere in the database.
data.isnull().sum() #Now no column has any null value
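
The per-column null percentages computed one at a time above can also be obtained in a single pass. This is a minimal sketch on a hypothetical three-column frame named `toy`; on the notebook's data the same expression would be applied to `data` before any filling.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with different amounts of missing data per column.
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "agent": [9.0, np.nan, 14.0, 9.0],
    "adr": [80.0, 95.5, 60.0, 120.0],
})

# isnull() gives booleans; their mean is the fraction of nulls per column.
null_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(null_pct)
```

Sorting the result makes it easy to separate columns that are mostly empty (candidates for dropping) from those with only a few gaps (candidates for filling).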

We should adjust the data type for certain columns as needed.

In [None]:
#Let's have a look at the data info to see the data types.
data.info()

In [None]:
data[['children', 'agent']] = data[['children', 'agent']].astype('int64')


Addition of new column as per requirement

In [None]:
#Total stay in nights
data['total_stay_in_nights'] = data['stays_in_week_nights'] + data['stays_in_weekend_nights']
data['total_stay_in_nights'] #A new column for total stays in nights has been created by adding the weeknight and weekend night stay columns.

In [None]:
#To create a column for revenue using total stay * adr
data['revenue'] = data['total_stay_in_nights'] *data['adr']
data['revenue']

In [None]:
#To create a column for total guest coming for each booking
data['total_guest'] = data['adults'] + data['children'] + data['babies']
data['total_guest'].sum()

In [None]:
#For readability, replace the values 0 and 1 in column 'is_canceled' with 'not canceled' and 'is canceled'.

data['is_canceled'] = data['is_canceled'].replace([0,1], ['not canceled', 'is canceled'])
data['is_canceled']

In [None]:
#Same in 'is_repeated_guest' column
data['is_repeated_guest'] = data['is_repeated_guest'].replace([0,1], ['not repeated', 'repeated'])
data['is_repeated_guest']

### What all manipulations have you done and insights you found?

I've made several adjustments to the data to prepare it for analysis:

**Addition of Columns:**
a) Total Stay in Nights: This column adds the weeknight and weekend-night stays to give the total length of each stay.
b) Total Guests: This column combines the counts of adults, children, and babies to give us the total number of guests staying, which is crucial for assessing revenue and volume.
c) Revenue: Calculated by multiplying the average daily rate (ADR) by the total stay in nights, this column helps us analyze the profitability and growth of each hotel.

**Deletion of Columns:**
a) Company: Since this column contains mostly null data, it won't have any impact on our analysis, so I've removed it.

**Replacement of Values in Columns:**
a) For columns like 'is_canceled' and 'is_repeated_guest', I've replaced the numerical values (0 and 1) with more meaningful labels ('not canceled' and 'is canceled', 'not repeated' and 'repeated' respectively). This enhances clarity during visualization and interpretation.

**Changes in Data Type of Values in Columns:**
a) Adjusted the data type of the 'agent' and 'children' columns from float to integer, as these columns represent agent IDs and counts of guests, respectively, and don't require decimal values.

**Removal of Null Values & Duplicate Entries:**
a) Conducted data wrangling by checking for null values in all columns. Columns with significant null values, like 'company', were dropped, while minimal null values were filled as necessary using .fillna().
b) Identified and removed duplicate rows using .drop_duplicates() to ensure the dataset is clean and ready for analysis.

These adjustments have streamlined the data and made it ready for thorough analysis.
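
The manipulations listed above can be gathered into one function so the cleaning is repeatable. The sketch below is an assumption about how that might look, not the notebook's own code: the function name `wrangle` and the three-row frame `toy` are hypothetical, while the column names and fill values follow the steps described in this section.

```python
import numpy as np
import pandas as pd


def wrangle(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps summarised above to a copy of df."""
    out = df.drop_duplicates().reset_index(drop=True)
    out = out.drop(columns=["company"])  # mostly null, so dropped
    out["agent"] = out["agent"].fillna(0).astype("int64")
    out["children"] = out["children"].fillna(0).astype("int64")
    out["country"] = out["country"].fillna("others")
    # Derived columns
    out["total_stay_in_nights"] = (
        out["stays_in_week_nights"] + out["stays_in_weekend_nights"]
    )
    out["revenue"] = out["total_stay_in_nights"] * out["adr"]
    out["total_guest"] = out["adults"] + out["children"] + out["babies"]
    return out


# Hypothetical rows: the second is an exact duplicate of the first.
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan],
    "agent": [9.0, 9.0, np.nan],
    "children": [1.0, 1.0, np.nan],
    "country": ["PRT", "PRT", None],
    "stays_in_week_nights": [2, 2, 3],
    "stays_in_weekend_nights": [1, 1, 0],
    "adr": [100.0, 100.0, 50.0],
    "adults": [2, 2, 1],
    "babies": [0, 0, 0],
})

clean = wrangle(toy)
print(clean[["country", "total_stay_in_nights", "revenue", "total_guest"]])
```

Returning a new frame instead of modifying `data` in place means the cell can be re-run without side effects, which fits the "executable in one go" guideline.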

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Plot a pie chart from grouped data

def get_count_from_column(df, column_label):
  df_grpd = df[column_label].value_counts()
  df_grpd = pd.DataFrame({'index':df_grpd.index, 'count':df_grpd.values})
  return df_grpd


def plot_pie_chart_from_column(df, column_label, t1, exp):
  df_grpd = get_count_from_column(df, column_label)
  fig, ax = plt.subplots(figsize=(14,9))
  ax.pie(df_grpd.loc[:, 'count'], labels=df_grpd.loc[:, 'index'],autopct='%1.2f%%',startangle=90,shadow=True, labeldistance = 1, explode = exp)
  plt.title(t1, bbox={'facecolor':'0.8', 'pad':3})
  ax.axis('equal')
  plt.legend()
  plt.show()

In [None]:
exp1 = [0.05,0.05]
plot_pie_chart_from_column(data, 'hotel', 'Booking percentage of Hotel by Name', exp1)

##### 1. Why did you pick the specific chart?

**This pie chart shows which hotel has received more bookings**

##### 2. What is/are the insight(s) found from the chart?

**The pie chart reveals that City Hotel has a higher booking rate at 61.1%, compared to Resort Hotel at 38.9%. Therefore, City Hotel experiences greater demand.**
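
The percentages shown on the pie chart can also be computed directly, which makes them easy to quote. This is a small sketch with made-up labels; the Series `hotel` stands in for the notebook's `data['hotel']` column.

```python
import pandas as pd

# Hypothetical labels; the real column is data['hotel'].
hotel = pd.Series(["City Hotel"] * 3 + ["Resort Hotel"] * 2, name="hotel")

# normalize=True returns fractions instead of raw counts.
share = hotel.value_counts(normalize=True).mul(100).round(1)
print(share)
```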

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, this insight can have a positive business impact for both hotels:**
**City Hotel: Provide more services to attract more guests and increase revenue.**
**Resort Hotel: Find ways to attract guests and learn what City Hotel does to attract them.**


#### Chart - 2

In [None]:
# Chart - 2 visualization code
exp2 = [0.2, 0,0,0,0,0,0,0,0,0,0,0.1]
plot_pie_chart_from_column(data, 'arrival_date_month', 'Month-wise booking', exp2)

##### 1. Why did you pick the specific chart?

**This pie chart shows the monthly distribution of bookings as a percentage of the total.**

##### 2. What is/are the insight(s) found from the chart?

**The percentages indicate that May, July, and August have the highest booking rates, likely because of the holiday season. It is recommended to increase advertising efforts during these months to attract more customers.**
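
Month names sort alphabetically or by frequency by default, which can make seasonal patterns harder to read. The sketch below shows one possible way to put monthly counts in calendar order; the Series `months` is hypothetical and stands in for `data['arrival_date_month']`.

```python
import pandas as pd

# Hypothetical arrivals; the notebook's column is data['arrival_date_month'].
months = pd.Series(["August", "May", "July", "August", "January"])

month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# reindex puts the counts in calendar order and fills absent months with 0.
counts = months.value_counts().reindex(month_order, fill_value=0)
print(counts[counts > 0])
```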

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, a higher volume of visitors will help the hotel manage revenue during downtime and will also help employee satisfaction and retention.**

#### Chart - 3

In [None]:
# Chart - 3 visualization code
exp3 =[0,0.3]
plot_pie_chart_from_column(data, 'is_repeated_guest', 'Guest repeating status', exp3)

##### 1. Why did you pick the specific chart?

**To show the percentage share of repeated & non-repeated guests.**

##### 2. What is/are the insight(s) found from the chart?

**The number of repeated guests is very low compared to the overall number of guests.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Offering attractive deals to new customers during slow periods can boost revenue.**

#### Chart - 4

In [None]:
# Chart - 4 visualization code
exp4 = [0,0.2]
plot_pie_chart_from_column(data, 'is_canceled', 'Cancellation volume of Hotel', exp4)

##### 1. Why did you pick the specific chart?

**This chart presents the cancellation rate of the hotel bookings.**

##### 2. What is/are the insight(s) found from the chart?

**Overall, more than 25% of bookings were cancelled.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


**The chart shows that more than 27% of bookings are cancelled.**

**Solution: Investigate the reasons for booking cancellations and address them at the business level.**
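
One possible first step when investigating cancellations is to break the rate down by a grouping column. The sketch below does this per hotel on a hypothetical frame named `toy`, using the 'is canceled' / 'not canceled' labels introduced during data wrangling; the numbers are invented and say nothing about the real data.

```python
import pandas as pd

# Hypothetical bookings using the relabelled 'is_canceled' values.
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "is_canceled": [
        "is canceled", "not canceled", "is canceled", "not canceled",
        "not canceled", "not canceled", "is canceled", "not canceled",
    ],
})

# Share of cancelled bookings per hotel, in percent.
cancel_rate = (
    toy["is_canceled"].eq("is canceled")
    .groupby(toy["hotel"])
    .mean()
    .mul(100)
)
print(cancel_rate)
```

The same pattern works with any other grouping column, for example `market_segment` or `deposit_type`.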

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# Function to get count from a column
def get_count_from_column_bar(df, column_label):
    df_grpd = df[column_label].value_counts()
    df_grpd = pd.DataFrame({'index':df_grpd.index, 'count':df_grpd.values})
    return df_grpd

# Function to plot a bar chart from a column
def plot_bar_chart_from_column(df, column_label, title):
    df_grpd = get_count_from_column_bar(df, column_label)
    plt.figure(figsize=(14, 6))
    plt.bar(df_grpd['index'], df_grpd['count'], width=0.4, align='edge', linewidth=4, color=['g','r','b','c','y'], linestyle=':', alpha=0.5)
    plt.title(title, bbox={'facecolor':'0.8', 'pad':3})
    plt.ylabel('Count')
    plt.xlabel(column_label)
    plt.show()

# Example usage
plot_bar_chart_from_column(data, 'assigned_room_type', 'Assignment of room by type')


##### 1. Why did you pick the specific chart?

**To show the distribution by volume, indicating which room types are most frequently allotted.**

##### 2. What is/are the insight(s) found from the chart?

**This chart shows that room type 'A' is the most preferred by guests.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, there is a positive impact: room types 'A', 'D', and 'E' are preferred by guests, possibly because of the better services offered in these room types.**

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Count bookings per market segment, largest first
market_segment_counts = data['market_segment'].value_counts()
plt.figure(figsize=(10,5))
market_segment_counts.plot(kind = 'bar', color=['g', 'r', 'c', 'b', 'y', 'black', 'brown'], fontsize = 20, legend=True)
plt.show()


##### 1. Why did you pick the specific chart?

**It shows the market segment through which the hotel bookings were made.**

##### 2. What is/are the insight(s) found from the chart?

**The most common method guests use to book the hotel is through Online TA.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, it's good for the business that guests mostly use the Online TA market segment to book hotels.**

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize = (10,5))
hotel_wise_revenue = data.groupby('hotel')['revenue'].sum()

ax = hotel_wise_revenue.plot(kind = 'bar', color = ('b', 'y'))
plt.xlabel("Hotel", fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Brown'})
plt.ylabel("Total Revenue", fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Brown'} )
plt.title("Total Revenue", fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Green'} )
plt.show()
print(hotel_wise_revenue)

##### 1. Why did you pick the specific chart?

**To show the total revenue generated by each hotel.**

##### 2. What is/are the insight(s) found from the chart?

**From the chart it is clear that the total revenue of City Hotel is higher than that of Resort Hotel.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**More advertising for City Hotel could bring in more customers, which would result in higher profit.**

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
sns.pairplot(data)

##### 1. Why did you pick the specific chart?

**A pair plot allows us to see both the distribution of a single variable and the relationship between two variables.**

**We can see the relationships between all the columns in the chart above.**

##### 2. What is/are the insight(s) found from the chart?

**From the pair plot we can see that as cancellations increase, total stay decreases.**

**As the total number of people increases, adr also increases. Thus adr and the total number of people are positively related.**
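
Trends can be hard to read off a large pair-plot grid, so a relationship noted above can also be checked numerically. The sketch below computes a Pearson correlation on a hypothetical frame named `toy`; the values are invented, so the resulting coefficient illustrates the method only and is not a finding about the dataset.

```python
import pandas as pd

# Hypothetical values; on the notebook's data use data[['total_guest', 'adr']].
toy = pd.DataFrame({
    "total_guest": [1, 2, 2, 3, 4],
    "adr": [60.0, 85.0, 90.0, 120.0, 150.0],
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"],
})

# numeric_only=True skips text columns such as 'hotel'.
corr = toy.corr(numeric_only=True)
r = corr.loc["total_guest", "adr"]
print(round(r, 3))
```

A coefficient close to +1 or -1 suggests a strong linear relationship, while a value near 0 suggests the pattern seen in the plot may be weak.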

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To help the client achieve their business objectives, I suggest the following:

1. **Offer discounts on Resort Hotel bookings** since the City Hotel is the most preferred.
2. **Reduce cancellations** by offering loyalty discounts to guests who don't cancel their bookings.
3. **Stock up on materials for Bed and Breakfast (BB) meals** in advance to avoid delays, as BB is the most preferred meal type.
4. **Increase the number of rooms in City Hotels** to reduce waiting times.
5. **Run promotions to attract bookings from other market segments,** as Online Travel Agents (Online TA) have the most bookings.
6. **Add more A type rooms** since they are the most preferred by guests.
7. **Investigate why repeat guests are low** and fix any issues to encourage more repeat bookings.
8. **Address high waiting times in City Hotels** compared to Resort Hotels, as City Hotels are busier.
9. **Take actions to improve performance** since the optimal stay in both types of hotels is less than 7 days, which is shorter than the typical one-week stay.
10. **Focus on guests from Portugal** since they are the majority of visitors.

# **Conclusion**


- City Hotel receives more bookings and generates more total revenue than Resort Hotel.
- July and August have the most bookings.
- Room Type A is the most preferred by travelers.
- Most bookings come from Portugal and Great Britain.
- Guests usually stay for 1-4 days.
- City Hotel keeps more guests coming back.
- More than 25% of bookings are canceled, with more cancellations at City Hotel.
- New guests cancel more often than repeat guests.
- Lead time, waiting list, and room assignment don't greatly affect cancellations.
- Corporate guests return the most and cancel the least; Travel Agents/Tour Operators cancel the most and return the least.
- Guests stay shorter when the average daily rate (ADR) increases, likely to save money.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***