<a href="https://colab.research.google.com/github/Inayat-M/Colab-Notebook/blob/main/Capstone_Project_Exploratory_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Hotel Booking Analysis EDA using Python



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

Objective:
The primary goal of this project is to analyze a dataset of hotel bookings to gain insights into booking patterns, customer behavior, and factors that impact cancellations and booking trends. By performing exploratory data analysis (EDA) using Python, we aim to uncover key findings that can guide business decisions for hotels to optimize revenue, improve customer experience, and reduce cancellations.

Dataset:
The dataset used contains information on bookings made at two types of hotels: a city hotel and a resort hotel. It includes details such as the booking date, length of stay, number of adults, children, and babies, meal options, market segment, distribution channels, customer types, and whether the booking was canceled.

# **GitHub Link -**

Provide your GitHub Link here.

#### **Define Your Business Objective?**

The primary business objective of this analysis is to provide data-driven insights that will help hotels optimize their operations, increase revenue, and enhance customer satisfaction.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
path = "/content/drive/MyDrive/Hotel Bookings.csv"
df = pd.read_csv(path)
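
Since the guidelines above ask for exception handling, the load step could be wrapped as in the sketch below. `load_csv_safely` is a hypothetical helper and is not part of the original notebook; it returns `None` instead of raising when the file is missing or empty.

```python
import pandas as pd

def load_csv_safely(csv_path):
    """Hypothetical helper: read a CSV into a DataFrame, or return None on failure."""
    try:
        return pd.read_csv(csv_path)
    except FileNotFoundError:
        # Typical cause in Colab: Drive is not mounted or the path is misspelled
        print(f"File not found: {csv_path}")
    except pd.errors.EmptyDataError:
        # Raised by pandas when the file has no columns to parse
        print(f"File is empty: {csv_path}")
    return None
```

A caller would then check `if df is None:` before continuing with the analysis.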

In [None]:
df

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# Get the number of rows and columns
rows, columns = df.shape

print(f"Number of Rows: {rows}")
print(f"Number of Columns: {columns}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
#Count duplicate rows
duplicate_count = df.duplicated().sum()

print(f"Number of Duplicate Rows: {duplicate_count}")


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()

print("Missing Values Count per Column:")
print(missing_values)

In [None]:
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]

print("Columns with Missing Values:")
print(missing_values)


In [None]:
null_value = df.isnull().sum().sort_values(ascending = False)
null_value

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title('Missing Values Heatmap')
plt.show()
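
The heatmap shows where values are missing but not how many. The sketch below tabulates the count and percentage per column; `missing_value_summary` is a hypothetical helper, and the toy frame is illustrative rather than taken from the hotel data.

```python
import pandas as pd

def missing_value_summary(frame):
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)

# Toy data for illustration only; these are not values from the hotel dataset
toy = pd.DataFrame({
    "agent": [1.0, None, 3.0, None],
    "country": ["PRT", None, "GBR", "FRA"],
    "adr": [80.0, 95.5, 60.0, 120.0],
})
print(missing_value_summary(toy))
```

Calling `missing_value_summary(df)` on the real DataFrame would give the same table for the booking data.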

### What did you know about your dataset?

The dataset provides valuable insights into booking patterns, customer behavior, and operational factors. It includes information such as booking dates, length of stay, customer demographics, and whether bookings were canceled. Through initial exploration, we identified that certain features, such as lead time and market segment, significantly influence cancellations. The dataset has missing values in certain columns, which need to be addressed, and some duplicate rows that might require removal. Descriptive statistics reveal key trends like peak booking periods and differences between city and resort hotel performances. Correlation analysis helps in understanding relationships between variables, while visualizations uncover patterns in customer preferences and booking behavior. Overall, this exploration lays the foundation for deeper analysis, predictive modeling, and actionable business strategies.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df_columns = df.columns
df_columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

1.**hotel** = Type of hotel (Resort Hotel or City Hotel)

2.**is_canceled** = Whether the booking was canceled (1) or not (0)

3.**lead_time** = Number of days between the booking date and the arrival date

4.**arrival_date_year** = Year of arrival date

5.**arrival_date_month** = Month of arrival date

6.**arrival_date_day_of_month** = Day of the month of arrival date

7.**arrival_date_week_number** = Week number of the year for arrival date

8.**stays_in_weekend_nights** = Number of weekend nights (Saturday or Sunday) spent at the hotel by the guest

9.**stays_in_week_nights** = Number of weeknights (Monday to Friday) spent at the hotel by the guest

10.**adults** = Number of adults among the guests

11.**children** = Number of children among the guests

12.**babies** = Number of babies among the guests

13.**meal** = Type of meal booked by the guests

14.**country** = Country the guest comes from

15.**market_segment** = Market segment designation

16.**distribution_channel** = Name of the booking distribution channel

17.**is_repeated_guest** = Whether the booking was from a repeated guest (1) or not (0)

18.**previous_cancellations** = Number of previous bookings that were canceled by the customer prior to the current booking

19.**previous_bookings_not_canceled** = Number of previous bookings not canceled by the customer prior to the current booking

20.**reserved_room_type** = Code of room type reserved

21.**assigned_room_type** = Code of room type assigned

22.**booking_changes** = Number of changes made to the booking

23.**deposit_type** = Type of deposit made by the guest

24.**agent** = ID of travel agent who made the booking

25.**company** = ID of the company that made the booking

26.**days_in_waiting_list** = Number of days the booking was on the waiting list

27.**customer_type** = Type of customer, one of four categories

28.**adr** = Average Daily Rate, defined as the sum of lodging transactions divided by the total number of staying nights

29.**required_car_parking_spaces** = Number of car parking spaces required by the customer

30.**total_of_special_requests** = Number of special requests made by the customer

31.**reservation_status** = Last reservation status (Canceled, Check-Out, or No-Show)

32.**reservation_status_date** = Date on which the last reservation status was set
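
Some quantities used in later charts (total guests, revenue) are not stored directly in these 32 columns. As a sketch, and assuming the column names listed above, they can be derived as follows: guests and nights by addition, and, because `adr` is lodging revenue divided by nights, a rough revenue estimate by multiplying `adr` back by nights. The helper `add_derived_columns` and the column names `total_guest`, `total_nights`, and `estimated_revenue` are illustrative choices, not part of the dataset.

```python
import pandas as pd

def add_derived_columns(frame):
    """Return a copy with illustrative derived columns (names are assumptions)."""
    out = frame.copy()
    # Assumption: a missing children count is treated as 0
    out["total_guest"] = out["adults"] + out["children"].fillna(0) + out["babies"]
    out["total_nights"] = out["stays_in_weekend_nights"] + out["stays_in_week_nights"]
    # adr = lodging revenue / nights, so adr * nights approximates revenue per booking
    out["estimated_revenue"] = out["adr"] * out["total_nights"]
    return out

# Toy data for illustration only; not real booking records
toy = pd.DataFrame({
    "adults": [2, 1], "children": [1.0, None], "babies": [0, 0],
    "stays_in_weekend_nights": [2, 0], "stays_in_week_nights": [3, 1],
    "adr": [100.0, 80.0],
})
print(add_derived_columns(toy)[["total_guest", "total_nights", "estimated_revenue"]])
```

Note that this estimate does not distinguish canceled bookings from completed stays; filtering on `is_canceled == 0` first may be more appropriate for realized revenue.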




### Check Unique Values for each variable.

In [None]:
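# Check Unique Values for each variable.
# Number of unique values per column (printed so it shows even when not the last expression)
print(df.nunique())
# Unique values of each column, one column per line
for col in df.columns:
    print(f"{col}: {df[col].unique()}")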

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
print("Initial Dataset Information:")
df.info()

In [None]:
#Display the number of missing values per column
print("\nMissing Values Count per Column:")
print(df.isnull().sum())

In [None]:
# Drop duplicate rows if any
duplicates_count = df.duplicated().sum()
if duplicates_count > 0:
    df = df.drop_duplicates()
    print(f"\nDropped {duplicates_count} duplicate rows.")
else:
    print("\nNo duplicate rows found.")

In [None]:
# Fill missing numerical values with the median of the column
numerical_cols = df.select_dtypes(include=[np.number]).columns
for col in numerical_cols:
    df[col] = df[col].fillna(df[col].median())  # assign back instead of inplace on a column selection

In [None]:
# Fill missing categorical values with the mode of the column
categorical_cols = df.select_dtypes(include=[object]).columns
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])  # assign back instead of inplace on a column selection

In [None]:
# Encode categorical variables
df_encoded = pd.get_dummies(df, drop_first=True)

In [None]:
# Display final dataset information
print("\nFinal Dataset Information:")
df_encoded.info()

print("\nSample Data After Preparation:")
print(df_encoded.head())

### What all manipulations have you done and insights you found?

**Manipulations Performed:**
**Loaded the Dataset:**

The dataset was read from a CSV file into a pandas DataFrame for analysis.

**Displayed Initial Information:**
Provided an overview of the dataset's structure, including the number of rows and columns, and basic data types for each column.

**Checked for Missing Values:**
Identified columns with missing values and displayed their counts.

**Removed Duplicates:**
Dropped any duplicate rows to ensure each record in the dataset is unique.

**Handled Missing Values:**
For numerical columns, missing values were filled with the median value of each column.
For categorical columns, missing values were filled with the mode (most frequent value) of each column.

**Encoded Categorical Variables:**
Applied one-hot encoding to convert categorical variables into a format suitable for analysis and modeling. This involved creating binary columns for each category and dropping the first category to avoid multicollinearity.

**Displayed Final Information:**
Provided an updated overview of the dataset, showing the structure after the manipulations and a sample of the prepared data.
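
The imputation and encoding steps described above can be seen on a tiny example. This is a sketch on made-up data (the values are not from the hotel dataset), showing median fill, mode fill, and what `drop_first=True` removes.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from the hotel dataset
toy = pd.DataFrame({
    "lead_time": [10.0, np.nan, 30.0, 50.0],
    "meal": ["BB", "HB", None, "BB"],
})

# Median imputation for numeric columns
for col in toy.select_dtypes(include=[np.number]).columns:
    toy[col] = toy[col].fillna(toy[col].median())

# Mode imputation for the remaining (non-numeric) columns
for col in toy.select_dtypes(exclude=[np.number]).columns:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# drop_first=True drops the first category of each encoded column ("BB" here),
# so only a meal_HB indicator column remains
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded)
```

On the full dataset, high-cardinality columns such as `country` would each expand into many indicator columns, so it may be worth encoding only the columns needed for a given analysis.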

**Insights Found:**
**Missing Values:**

Identified the presence of missing values in various columns, with strategies applied to address these issues. Numerical and categorical columns were handled differently to preserve data integrity.

**Duplicates:**
Determined whether any duplicate rows existed and removed them if necessary to ensure a clean dataset.

**Data Quality:**
The handling of missing values and duplicates improved the dataset’s quality, making it more reliable for analysis and modeling.

**Categorical Encoding:**
Categorical variables were converted into numerical format, enabling their inclusion in statistical analyses and machine learning models. One-hot encoding ensures that categorical data is appropriately represented.

**Prepared Data:**
The final dataset is now ready for deeper analysis, such as statistical exploration, correlation analysis, and predictive modeling. The preparation steps ensure that the data is clean, complete, and in a suitable format for further exploration.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
def get_count_from_column_bar(df,column_label):
  df_grpd = df[column_label].value_counts()
  df_grpd = pd.DataFrame({'index':df_grpd.index, 'count':df_grpd.values})
  return df_grpd

In [None]:
def plot_bar_chart_from_column(df,column_label,t1):
  df_grpd = get_count_from_column_bar(df,column_label)
  fig,ax = plt.subplots(figsize = (10,6))
  c = ['green','blue','red','yellow','brown']
  ax.bar(df_grpd['index'],df_grpd['count'],width = 0.4,align = 'edge',edgecolor = 'black',linewidth = 4 ,color = c,linestyle = ':',alpha = 0.5)
  plt.title(t1,bbox= {'facecolor':'0.8','pad':3})
  # No plt.legend() call: the bars have no labels, so it would only emit a warning
  # plt.xlabel(column_label)
  plt.ylabel('count')
  plt.xticks(rotation = 15)
  plt.show()


In [None]:
# Chart - 1 visualization code
def get_count_from_column(df,column_label):
  df_grpd = df[column_label].value_counts()
  df_grpd = pd.DataFrame({'index':df_grpd.index, 'count':df_grpd.values})
  return df_grpd

In [None]:
#plot the pie chart from grouped data
def plot_pie_chart_from_column(df,column_label,t1,exp):
  df_grpd = get_count_from_column(df,column_label)
  fig,ax = plt.subplots(figsize = (14,6))
  ax.pie(df_grpd.loc[:,'count'],labels = df_grpd.loc[:,'index'],autopct = '%1.2f%%',startangle = 90 ,labeldistance = 1.2,explode = exp)
  plt.title(t1,bbox={'facecolor':'0.8','pad':5})
  ax.axis('equal')
  plt.legend()
  plt.show()

In [None]:
exp1 = [0.05,0.05]
plot_pie_chart_from_column(df,'hotel','Booking Percentage of Hotel By Name',exp1)

##### 1. Why did you pick the specific chart?

I picked the pie chart because it’s particularly effective for showing the proportions of categorical data relative to a whole. Here’s why the pie chart is suitable for the given scenarios:

1. **Visualizing Proportions:**
Purpose: Pie charts excel at displaying how different parts contribute to the whole. This is useful for understanding the relative sizes of categories within a dataset.
**Scenario Example:** When examining booking cancellations by hotel type or the distribution of bookings by market segment, a pie chart clearly illustrates how each category (hotel type or market segment) compares to the total, making it easier to see which categories are most significant or have the largest share.

##### 2. What is/are the insight(s) found from the chart?

**Immediate Insights:**
**Purpose:** Pie charts offer a quick visual reference for understanding the distribution of data at a glance. This is particularly useful for presentations or executive summaries where clear and immediate insights are valuable.
**Scenario Example:** In a report or presentation, showing the proportion of booking cancellations by hotel type can help stakeholders quickly understand which hotel types are more prone to cancellations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the pie chart and other visualizations can significantly impact business strategy and operations. For instance, understanding the proportion of booking cancellations by hotel type can help tailor marketing and operational strategies to reduce cancellations, ultimately improving revenue. If a particular hotel type has a higher cancellation rate, targeted interventions such as better booking policies or customer retention strategies can be implemented. Conversely, if the analysis reveals a disproportionate reliance on a market segment with low conversion rates, it might indicate a need for diversification or a reevaluation of marketing tactics. However, if the insights reveal significant issues, such as high cancellation rates or low booking proportions in key segments, these could negatively impact growth by highlighting areas that require urgent attention or indicate underlying problems in the business model. Addressing these issues proactively can help mitigate potential negative effects and foster positive growth.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
exp2 = [0,0.1]
plot_pie_chart_from_column(df,'is_canceled','Cancellation Volume of Hotel',exp2)

##### 1. Why did you pick the specific chart?

A pie chart is a good way to show the cancellation rate of the hotel bookings, since there are only two categories.

##### 2. What is/are the insight(s) found from the chart?

Overall, more than 25% of bookings were canceled.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

More than 27% of bookings are being canceled.

Solution: investigate the reasons for the cancellations and address them at the business level.
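
The percentage quoted above can be computed directly rather than read off the pie chart. A minimal sketch, assuming `is_canceled` is coded 0/1 as in the variable description; `cancellation_rates` is a hypothetical helper and the toy rows are not real bookings.

```python
import pandas as pd

def cancellation_rates(frame):
    """Return the overall cancellation percentage and the percentage per hotel."""
    # The mean of a 0/1 column is the share of rows equal to 1
    overall = frame["is_canceled"].mean() * 100
    per_hotel = frame.groupby("hotel")["is_canceled"].mean() * 100
    return overall, per_hotel

# Toy data for illustration only; not the real booking records
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "is_canceled": [1, 0, 0, 0],
})
overall, per_hotel = cancellation_rates(toy)
print(f"Overall: {overall:.1f}%")
print(per_hotel)
```

Breaking the rate down per hotel is one way to start looking for the reasons behind cancellations.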

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plot_bar_chart_from_column(df,'distribution_channel','Distribution Channel Volume')

##### 1. Why did you pick the specific chart?

A bar chart shows clearly which distribution channel brings in the most bookings, with the channels ordered by volume in descending order.

##### 2. What is/are the insight(s) found from the chart?

As we can see, TA/TO (Travel Agents/Tour Operators) has the highest volume, so I recommend continuing to take bookings through TA/TO.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this shows a positive business impact.

A higher number of TA/TO bookings will help increase the hotels' revenue.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
exp3 = [0.2,0,0,0,0,0,0,0,0,0,0,0]
plot_pie_chart_from_column(df,'arrival_date_month','Month-wise booking',exp3)

##### 1. Why did you pick the specific chart?

A pie chart is useful and easy to understand here. It shows each month's percentage of bookings across the whole dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that January has the fewest bookings, while May, July, and August have the most, possibly because of holidays. I suggest advertising more around these months to make the most of the demand.
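
Seasonal patterns like this are easier to read when the months appear in calendar order rather than in order of frequency. A minimal sketch, assuming `arrival_date_month` holds full English month names; `monthly_booking_counts` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd

MONTH_ORDER = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

def monthly_booking_counts(frame):
    """Count bookings per arrival month, ordered January to December."""
    counts = frame["arrival_date_month"].value_counts()
    # reindex puts months in calendar order; months with no bookings get 0
    return counts.reindex(MONTH_ORDER, fill_value=0)

# Toy data for illustration only
toy = pd.DataFrame({"arrival_date_month": ["August", "January", "August", "May"]})
print(monthly_booking_counts(toy))
```

The resulting Series can be passed to a bar or line plot to show the trend across the year.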

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. A higher volume of customers will help the hotels sustain revenue during their down time, and it will also help customer satisfaction.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
exp4 = [0,0.2]
plot_pie_chart_from_column(df,'is_repeated_guest','Guest Repeating Status',exp4)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plot_bar_chart_from_column(df,'assigned_room_type','Assigned Room Type')

##### 1. Why did you pick the specific chart?

To show the distribution by volume, that is, which room type is allotted to customers most often.

##### 2. What is/are the insight(s) found from the chart?

This chart shows that room type 'A' is the most preferred by customers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, because the hotels can use this profitably. They can offer the same services and amenities in the other room types to increase revenue.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
market_segment_df = pd.DataFrame(df['market_segment'])
market_segment_data = market_segment_df.groupby('market_segment')['market_segment'].count()
market_segment_data.sort_values(ascending = False,inplace = True)
plt.figure(figsize = (14,6))
# legend takes a boolean, not the string 'True'
market_segment_data.plot(kind = 'bar',color = ['r','g','y','b','pink','black','brown'],fontsize = 20, legend = True)

##### 1. Why did you pick the specific chart?

This chart shows the market segments through which the hotels were booked.

##### 2. What is/are the insight(s) found from the chart?

Online TA is the segment customers use most frequently to book a hotel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this has a positive business impact. The owners can see that online platforms are the channel customers use most, so they can strengthen their online presence. Customers also benefit when the website is trustworthy and makes it easy to find information or book a room.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Assumption: 'total_guest' is not in the raw data, so derive it as adults + children + babies
df['total_guest'] = df['adults'] + df['children'] + df['babies']
# Sum guests per country and keep the ten largest
guest_country_wise_data = df.groupby('country')['total_guest'].sum().sort_values(ascending = False)
top_10_country_guest = guest_country_wise_data.head(10)
top_10_country_guest

In [None]:
plt.figure(figsize = (12,6))
sns.barplot(x = top_10_country_guest.index,y = top_10_country_guest).set(title = "Top 10 countries by guest ")
print("Top 10 countries by guest")
print("PRT = Portugal")
print("GBR = Great Britain & Northern Ireland")
print("FRA = France")
print("ESP = Spain")
print("DEU = Germany")
print("ITA = Italy")
print("IRL = Ireland")
print("BRA = Brazil")
print("BEL = Belgium")
print("NOR = Norway")

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
average_adr = df.groupby('hotel')['adr'].mean()
average_adr

In [None]:
plt.subplots(figsize = (8,5))
average_adr.plot(kind = 'bar',color = ['r','g'],fontsize = 10)
plt.xlabel("Hotel Name")
plt.ylabel('Average ADR')
plt.title("Average ADR of Hotel")

##### 1. Why did you pick the specific chart?

To show the average ADR of both hotels (city hotel and resort hotel).

##### 2. What is/are the insight(s) found from the chart?

Here we can see that the city hotel has a higher average ADR than the resort hotel. ADR is essentially the average revenue per night stayed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. There could be several reasons for the difference: customers may prefer rooms within their budget, or it may come down to facilities. The resort hotel could change some things to attract more customers.

In [None]:
df.columns

#### Chart - 10

In [None]:
if 'TotalRevenue' in df.columns:
    plt.figure(figsize=(8, 5))
    hotel_wise_revenue = df.groupby('hotel')['TotalRevenue'].sum()
    ax = hotel_wise_revenue.plot(kind='bar', color=('blue', 'green'))
    plt.xlabel("Hotel")
    plt.ylabel("Total Revenue")
    plt.title("Total Revenue of Hotels")
    plt.show()
else:
    print("The 'TotalRevenue' column is not found. Please verify the column name.")

In [None]:
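# Chart - 10 visualization code
# Assumption: 'revenue' is not in the raw data, so estimate it as adr * total nights stayed
df['revenue'] = df['adr'] * (df['stays_in_weekend_nights'] + df['stays_in_week_nights'])
plt.figure(figsize = (8,5))
# Group by hotel only, then sum the revenue column
hotel_wise_revenue = df.groupby('hotel')['revenue'].sum()
ax = hotel_wise_revenue.plot(kind = 'bar',color = ['blue','green'])
plt.xlabel("Hotel")
plt.ylabel("Total Revenue")
plt.title("Total Revenue of Hotels")
plt.show()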

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Count bookings per hotel and meal type; rows are hotels, columns are meal types
hotel__wise_meal = df.groupby(['hotel','meal'])['meal'].count().unstack()
# DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
hotel__wise_meal.plot(kind = 'bar',figsize = (12,5))
plt.xlabel('Hotel')
plt.ylabel('Count')
plt.title('Most Preferred Meal by Customers')
plt.legend(title = 'Meal')
hotel__wise_meal

##### 1. Why did you pick the specific chart?

We use this chart to show the meal most preferred by customers in each hotel.

##### 2. What is/are the insight(s) found from the chart?

The BB meal is the most preferred by customers in both hotels. Comparing the two, the city hotel has more customers choosing it than the resort hotel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this chart shows customers' preferred meal choices. The hotels can analyze this and make decisions that lead to profitable growth.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
corr_df = df[['lead_time','previous_cancellations','previous_bookings_not_canceled','days_in_waiting_list','adr','required_car_parking_spaces','total_of_special_requests','booking_changes']].corr()
f,ax = plt.subplots(figsize = (8,10))
sns.heatmap(corr_df,annot = True ,fmt = '.2f',annot_kws = {'size': 10},vmax = 1,square = True,cmap = "YlGnBu")

##### 1. Why did you pick the specific chart?

To show the correlation between the numeric variables in the dataset.

##### 2. What is/are the insight(s) found from the chart?

In this chart, the highest correlation coefficient is 0.39 and the lowest is -0.9.
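
Rather than reading the extremes off the heatmap by eye, they can be pulled from the correlation matrix programmatically. The sketch below uses a hypothetical helper, `extreme_correlations`, on made-up data; it ignores the diagonal (always 1.0) and counts each pair once.

```python
import numpy as np
import pandas as pd

def extreme_correlations(corr):
    """Return (pair, value) for the most positive and most negative off-diagonal entries."""
    # Keep only the upper triangle; k=1 excludes the diagonal of 1.0 values
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return (pairs.idxmax(), pairs.max()), (pairs.idxmin(), pairs.min())

# Toy data for illustration only; not the hotel dataset
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 9], "c": [4, 3, 2, 1]})
highest, lowest = extreme_correlations(toy.corr())
print("Highest:", highest)
print("Lowest:", lowest)
```

Passing `corr_df` from the cell above would report which variable pairs produce the highest and lowest values in the heatmap.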

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve the business objective of optimizing hotel bookings and increasing revenue, the client should focus on several key strategies. First, analyzing booking patterns through data insights can reveal factors driving both bookings and cancellations, enabling the implementation of policies to manage or reduce cancellations effectively. Targeted marketing efforts based on customer segments can enhance conversion rates and revenue, with special attention given to segments showing higher booking rates or those needing refined approaches. Optimizing revenue management through dynamic pricing strategies, informed by factors like booking lead time and ADR relationships, can maximize revenue per booking. Improving customer experience by addressing reasons for cancellations and tailoring services to guest preferences will boost satisfaction and retention. Finally, continuous monitoring and adjustment of strategies based on performance metrics and emerging trends will ensure ongoing optimization and growth. By leveraging these insights, the client can enhance operational efficiency, improve customer satisfaction, and drive business growth.

# **Conclusion**

In conclusion, leveraging data insights from the hotel booking dataset provides a powerful foundation for optimizing operations and achieving key business objectives. By analyzing booking patterns, targeting marketing efforts, and implementing dynamic pricing strategies, the client can enhance revenue and reduce cancellations. Improving customer experience through tailored services and addressing feedback will further drive satisfaction and loyalty. Continuous monitoring and adjustment of strategies based on performance metrics will ensure adaptability and ongoing growth. Embracing these data-driven strategies will enable the client to make informed decisions, improve operational efficiency, and ultimately achieve a competitive edge in the hotel industry.

 ***Hurrah! You have successfully completed your EDA Capstone Project !!!***