<a href="https://colab.research.google.com/github/Ankit-Anand-3399/EDA-Hotel-Booking-Analysis/blob/main/Personal_Hotel_Booking_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  Hotel Booking Analysis



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team
##### **Team Member 1 -** Ankit Anand
##### **Team Member 2 -** Prateek Singh
##### **Team Member 3 -** Aayush Jha
##### **Team Member 4 -** Vishal Saxena

# **Project Summary -**

Write the summary here within 500-600 words.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**
 
Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in
order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a
disproportionately high number of special requests? This hotel booking dataset can help you explore those
questions! This data set contains booking information for a city hotel and a resort hotel, and includes
information such as when the booking was made, length of stay, the number of adults, children, and/or babies,
and the number of available parking spaces, among other things. All personally identifying information has been
removed from the data. Explore and analyse the data to discover important factors that govern the bookings.

#### **Define Your Business Objective?**

Answer Here.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. For each chart, make sure the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

### Dataset Loading

In [None]:
# Load Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!ls "/content/drive/My Drive"

In [None]:
df = pd.read_csv('/content/drive/My Drive/EDA datasets/Hotel Bookings.csv')
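
The guidelines above ask for exception handling, so the loading step could be wrapped in a small helper. This is only a sketch: `load_bookings` is a hypothetical name that is not part of the original notebook, and the checks shown are assumptions about what is worth validating.

```python
from pathlib import Path

import pandas as pd


def load_bookings(csv_path):
    """Read the bookings CSV, failing early with a clear message (hypothetical helper)."""
    path = Path(csv_path)
    if not path.is_file():
        # Typical cause in Colab: Drive is not mounted or the folder name differs
        raise FileNotFoundError(f"Dataset not found at {path}")
    frame = pd.read_csv(path)
    if frame.empty:
        raise ValueError(f"{path} was read but contains no rows")
    return frame
```

It would be called with the same path as the cell above, for example `df = load_bookings('/content/drive/My Drive/EDA datasets/Hotel Bookings.csv')`.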

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
count_row = df.shape[0]
count_column = df.shape[1]
print('Rows', count_row)
print('Columns', count_column)

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().value_counts()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_val  = df.isna().sum()
print(missing_val)

In [None]:
# Visualizing the missing values
missing_values = df.isna().sum().value_counts()
print(missing_values)

In [None]:
visual_missing = df.isna().sum().sort_values(ascending=False)
visual_missing.index[:4]
visual_missing[:4]

In [None]:
plt.style.use('fivethirtyeight')
plt.bar(visual_missing.index[:4],visual_missing[:4], width = 0.5,  color='#444444', label="Null values")
plt.legend()
plt.yscale("log")
axis_font = {'fontname':'Arial', 'size':'14'}
plt.xlabel('Columns', **axis_font)
plt.ylabel('Number of Null Values', **axis_font)
plt.title('Null values count in each Column', **axis_font)

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
for col in df.columns:
  print(col)

In [None]:
# Dataset Describe
df.describe()

### Variables Description 

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print(df.apply(lambda col:col.unique()))

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

**DATA Cleaning**

Dealing with NULL values in the Data set

In [None]:
#finding null values in the dataset 
df.isna().sum().sort_values(ascending = False)[:6]

In [None]:
df = df.drop(df[df['babies']+df['adults']+df['children']==0].index)

In [None]:
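#filling null values with 0
#assigning the result back avoids chained in-place modification, which newer pandas versions warn about
df['agent'] = df['agent'].fillna(0)
df['company'] = df['company'].fillna(0)
df['country'] = df['country'].fillna(0)
df['children'] = df['children'].fillna(0)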

In [None]:
#null values cleaned
df.isnull().sum()

**Dealing with Duplicate data in the dataset**

In [None]:
#finding duplicate values in the datasets 
#true means there are duplicates
df.duplicated().value_counts()

In [None]:
#dropping duplicate values
#keep='first' retains one copy of each duplicated row; keep=False would delete every copy
df = df.drop_duplicates(keep='first')
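
The `keep` argument of `drop_duplicates` decides which copies of a repeated row survive. A minimal sketch on made-up rows (the `toy` frame is illustrative and not taken from the dataset) shows the difference:

```python
import pandas as pd

# Two identical rows and one unique row (made-up values)
toy = pd.DataFrame({"hotel": ["City", "City", "Resort"], "adr": [80.0, 80.0, 120.0]})

# keep=False removes every member of a duplicated group
dropped_all = toy.drop_duplicates(keep=False)

# keep="first" removes only the repeats and keeps one copy
kept_first = toy.drop_duplicates(keep="first")

print(len(dropped_all), len(kept_first))
```

With `keep=False` the repeated booking disappears from the data entirely, while `keep="first"` leaves one copy of it.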

In [None]:
df.head()

**Checking Datatype for all columns**

In [None]:
df.info()

In [None]:
#children, agent and company hold whole numbers but are stored as float64.
#adr is a genuine decimal value, so it is left as float instead of being truncated.
int_cols = ['children', 'agent', 'company']
df = df.astype({col: 'int64' for col in int_cols})

df.dtypes
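
When the whole-number columns are not known in advance, they can be detected before casting. The sketch below uses made-up values; the rule "cast a float column only if every value is a whole number" is an assumption for illustration, not something the original notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    "children": [0.0, 2.0],
    "agent": [9.0, 0.0],
    "adr": [75.5, 101.25],  # a real decimal column that must stay float
})

# Float columns in which every value is a whole number
whole_number_cols = [
    col for col in toy.select_dtypes("float64").columns
    if (toy[col] % 1 == 0).all()
]

# Cast only those columns; all others keep their dtype
toy = toy.astype({col: "int64" for col in whole_number_cols})
```

Here `adr` keeps its decimals, whereas a blanket `apply(int)` over every float column would round it down.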

In [None]:
#checking children and agent data type #should be int64
df.info()

**Adding/Removing/Merging columns as per requirements**

In [None]:
#total nights stayed = weekend nights + week nights
df['total_night_stays'] = df['stays_in_weekend_nights'] + df['stays_in_week_nights']
df.head()

In [None]:
#taking total head count
df['total_heads'] = df['adults'] + df['children'] + df['babies']
df.head()

### What all manipulations have you done and insights you found?

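The manipulations done on the dataset are:
1. Dropping bookings that have zero guests in total.
2. Replacing the null values with 0 so that every column holds a uniform placeholder.
3. Dropping the duplicate rows to avoid data redundancy.
4. Deriving two new columns: `total_night_stays` (weekend nights plus week nights) and `total_heads` (adults plus children plus babies).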

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code    
def get_count_for_bar_columns(df, column_label):
  group_df = df[column_label].value_counts()
  group_df = pd.DataFrame({'index':group_df.index, 'count':group_df.values})
  return group_df


def plot_bar_chart_from_column(df, column_label, t1):
  df_grouped = get_count_for_bar_columns(df, column_label)
  fig, ax = plt.subplots(figsize=(14, 6))
  c= ['g','r','b','c','y']
  ax.bar(df_grouped['index'], df_grouped['count'], width = 0.4, align = 'edge', edgecolor = 'black', linewidth = 4, color = c, linestyle = ':', alpha = 0.5)
  plt.title(t1, bbox={'facecolor':'0.8', 'pad':3})
  plt.ylabel('Count')
  plt.xticks(rotation = 15) # rotate the x-axis labels so they do not overlap
  plt.xlabel(column_label)
  plt.show()

In [None]:
plot_bar_chart_from_column(df, 'distribution_channel', 'Distribution Channel Contribution')

##### 1. Why did you pick the specific chart?

**This chart shows which distribution channel accounts for the largest volume of bookings. We chose a bar graph because it makes the booking counts easy to compare across channels in descending order.**

##### 2. What is/are the insight(s) found from the chart?

**TA/TO (Travel Agents / Tour Operators) clearly accounts for the highest number of bookings, so we recommend continuing to take bookings through TA/TO.**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, this shows a positive business impact.**

**A higher number of TA/TO bookings will help increase the hotel's revenue.**

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# total guests per arrival month, largest first
guest_month_wise_df = df.groupby('arrival_date_month')['total_heads'].sum().sort_values(ascending = False)
guest_month_wise_df

In [None]:
# number of bookings per market segment, largest first
market_segment_df_data = df.groupby('market_segment')['market_segment'].count().sort_values(ascending = False)
plt.figure(figsize=(15,6))
market_segment_df_data.plot(kind = 'bar', color=['r', 'g', 'b', 'c', 'y', 'black', 'brown'], fontsize = 20, legend=True)

##### 1. Why did you pick the specific chart?

**This bar chart shows the number of bookings made through each market segment, which makes the segments easy to compare.**

##### 2. What is/are the insight(s) found from the chart?

**Online TA is the market segment guests use most frequently to book the hotel.**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, it has a positive business impact: guests prefer the Online TA market segment for booking hotels.**

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize = (10,10))
sns.scatterplot(y = 'total_night_stays', x = 'adr', data = df[df['adr'] < 1000])
plt.show()

##### 1. Why did you pick the specific chart?

**To compare the total length of stay with ADR (average daily rate) and show how one affects the other.**

##### 2. What is/are the insight(s) found from the chart?

**Here we found that ADR tends to be higher when the guest's stay is shorter.**
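
A scatter plot with this many points can hide the trend, so the same relationship could also be checked numerically. A minimal sketch with made-up numbers (the `toy` frame is an assumption, not the real data): group by stay length and compare the median ADR of each group.

```python
import pandas as pd

# Made-up bookings: stay length in nights and average daily rate
toy = pd.DataFrame({
    "total_night_stays": [1, 1, 2, 2, 7, 7],
    "adr": [150.0, 130.0, 110.0, 100.0, 70.0, 60.0],
})

# Median ADR for each stay length; the median is less sensitive to extreme rates than the mean
median_adr_by_stay = toy.groupby("total_night_stays")["adr"].median()
print(median_adr_by_stay)
```

If the median falls as the stay length grows, that supports the reading of the chart. On the real data the same `groupby` could be run on `df[df['adr'] < 1000]`.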

#### Chart - 4

In [None]:
# Chart - 4 visualization code
def get_count_from_column(df, column_label):
  df_grpd = df[column_label].value_counts()
  df_grpd = pd.DataFrame({'index':df_grpd.index, 'count':df_grpd.values})
  return df_grpd

# plot a pie chart from grouped data
def plot_pie_chart_from_column(df, column_label, t1, exp):
  df_grpd = get_count_from_column(df, column_label)
  fig, ax = plt.subplots(figsize=(14,9))
  ax.pie(df_grpd.loc[:, 'count'], labels=df_grpd.loc[:, 'index'], autopct='%1.2f%%',startangle=90,shadow=True, labeldistance = 1, explode = exp)
  plt.title(t1, bbox={'facecolor':'0.8', 'pad':3})
  ax.axis('equal')
  plt.legend()
  plt.show()  

In [None]:
exp2 = [0.2, 0,0,0,0,0,0,0,0,0,0,0.1]
plot_pie_chart_from_column(df, 'arrival_date_month', 'Month-wise booking', exp2)

##### 1. Why did you pick the specific chart?

**A pie chart is used to show each month's percentage share of the overall bookings.**

##### 2. What is/are the insight(s) found from the chart?

**The percentages show that May, July and August are the months with the most bookings, owing to the holiday season. We recommend aggressive advertising to attract more customers.**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes. An increased volume of visitors will help the hotel manage revenue during down time, and will also help employee satisfaction and retention.**

#### Chart - 5

In [None]:
# Chart - 5 visualization code
guest_country_wise = pd.DataFrame(df[['country', 'total_heads']])
guest_country_wise_df = guest_country_wise.groupby(['country'])['total_heads'].sum()
guest_country_wise_df.sort_values(ascending = False, inplace = True)
top_10_country_by_guest = guest_country_wise_df.head(10)

In [None]:
plt.figure(figsize=(12,6))
# recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=top_10_country_by_guest.index, y=top_10_country_by_guest.values).set(title='Top 10 Countries by Guest')
print("\n\nPRT = Portugal\nGBR = Great Britain & Northern Ireland\nFRA = France\nESP = Spain\nDEU = Germany\nITA = Italy\nIRL = Ireland\nBRA = Brazil\nBEL = Belgium\nNLD = Netherlands")

##### 1. Why did you pick the specific chart?

**To see which countries most guests come from.**

***The chart shows the top 10 countries.***

##### 2. What is/are the insight(s) found from the chart?

**As we can see, the largest number of guests comes from Portugal.**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**We can advertise more and provide attractive offers to guests from Portugal to increase customer volume.**

#### Chart - 6

In [None]:
# Chart - 6 visualization code
hotel_bookings = df['hotel'].value_counts()
print(hotel_bookings)

In [None]:
df_not_canceled = df[df['is_canceled']==0].copy()  #hotel bookings which are not cancelled
final_booked = df_not_canceled['hotel'] #hotel type of each booking that was not cancelled
t_bookings = final_booked.value_counts() #non-cancelled bookings per hotel type
r_bookings = t_bookings['Resort Hotel'] #total bookings in resort (selected by label, not by position)
c_bookings = t_bookings['City Hotel'] #total bookings in city

In [None]:
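plt.style.use('fivethirtyeight')
# select each bar by label so the legend always matches the bar it describes
plt.bar('City Hotel', t_bookings['City Hotel'], color='blue', label='City bookings')
plt.bar('Resort Hotel', t_bookings['Resort Hotel'], color='red', label='Resort bookings')
plt.legend()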
axis_font = {'fontname':'Arial', 'size':'14'}
plt.xlabel('Hotels', **axis_font)
plt.ylabel('Number of Bookings', **axis_font)
plt.title('Total Bookings', **axis_font)

##### 1. Why did you pick the specific chart?

**A bar chart presents categorical data, and here we want to compare bookings across the two categories of hotel.**

##### 2. What is/are the insight(s) found from the chart?

**The City hotel has more non-cancelled bookings than the Resort hotel.**

**More than 60% of the non-cancelled bookings were for the City hotel.**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes. This is only a comparison of booking counts between the two hotels, so there are no insights here that point to negative growth.**

#### Chart - 7

In [None]:
# Chart - 7 visualization code
percent_b_per_y = ((df['arrival_date_year'].value_counts())/len(df['arrival_date_year'])*100) #percentage of bookings per year
print(percent_b_per_y.sort_index())

In [None]:
plt.style.use('fivethirtyeight')
percent_b_per_y.plot(kind='pie',  autopct='%1.0f%%')

##### 1. Why did you pick the specific chart?

**A pie chart (or circle chart) is a circular statistical graphic divided into slices to illustrate numerical proportion. Here it shows each year's percentage share of the bookings.**

##### 2. What is/are the insight(s) found from the chart?

**More than twice as many bookings were made in 2016 as in the previous year, but bookings decreased by almost 12% the following year.**
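
A statement such as "decreased by 12%" can mean a drop in the share of all bookings or a relative drop in the yearly count, and the two numbers differ. The sketch below computes both on made-up counts (the `arrival_years` series is illustrative only):

```python
import pandas as pd

# Made-up arrival years: 2 bookings in 2015, 5 in 2016, 4 in 2017
arrival_years = pd.Series([2015] * 2 + [2016] * 5 + [2017] * 4, name="arrival_date_year")

bookings_per_year = arrival_years.value_counts().sort_index()

# Share of all bookings that falls in each year (what the pie chart shows)
share_pct = bookings_per_year / bookings_per_year.sum() * 100

# Relative change in the booking count compared with the previous year
yoy_change_pct = bookings_per_year.pct_change() * 100
```

On the real data, `df['arrival_date_year']` would take the place of `arrival_years`. Stating which of the two measures is meant keeps the insight unambiguous.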

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes. Bookings increased from 2015 to 2016, but the rapid fall from 2016 to 2017 points to negative growth.**

#### Chart - 8

In [None]:
# Chart - 8 visualization code
single   = df_not_canceled[(df_not_canceled.adults==1) & (df_not_canceled.children==0) & (df_not_canceled.babies==0)]
couple   = df_not_canceled[(df_not_canceled.adults==2) & (df_not_canceled.children==0) & (df_not_canceled.babies==0)]
family   = df_not_canceled[df_not_canceled.adults + df_not_canceled.children + df_not_canceled.babies > 2]
names = ['Single', 'Couple', 'Family']
count = [single.shape[0],couple.shape[0], family.shape[0]]
count_percent = [x/df_not_canceled.shape[0]*100 for x in count] #calculating percentage of bookings

In [None]:
plt.style.use('fivethirtyeight')
plt.bar(names,count_percent,color=['green', 'red', 'orange'])
axis_font = {'fontname':'Arial', 'size':'14'}
plt.xlabel('Type of booking', **axis_font)
plt.ylabel('Booking (%)', **axis_font)
plt.title('Booking Type', **axis_font)

##### 1. Why did you pick the specific chart?

**The booking types are distinct categories, which is why we used a bar graph.**

##### 2. What is/are the insight(s) found from the chart?

**Couples account for the most bookings among the categories (single, couple, family).**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes. Couples book the most while families book far less, which may mean that the hotel is not able to fulfil the requirements of families.**

#### Chart - 9

In [None]:
# Chart - 9 visualization code
new_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September',
             'October', 'November', 'December']

# years present in the data, in ascending order
unique_years = df_not_canceled['arrival_date_year'].value_counts().sort_index()

fig, ax = plt.subplots(figsize=(15,6))

# one line per year (the first three years, as before)
for year in unique_years.index[:3]:
  # bookings per month for this year, in calendar order
  sorted_months = df_not_canceled.loc[df_not_canceled['arrival_date_year']==year,'arrival_date_month'].value_counts().reindex(new_order)
  y = sorted_months/sorted_months.sum()*100 # %Bookings
  # recent seaborn versions require x and y to be passed as keyword arguments
  sns.lineplot(x=sorted_months.index, y=y.values, label=str(year), sort=False, ax=ax)

ax.set_xlabel('Months')
ax.set_ylabel('Booking (%)')
ax.set_title('Booking Trend (Monthly) Yearwise')

##### 1. Why did you pick the specific chart?

**A line chart with multiple groups shows the evolution of several series on the same figure.**

##### 2. What is/are the insight(s) found from the chart?

**On average, bookings are highest in September and lowest in January.**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes. We can see that bookings kept decreasing from 2016 to 2017, which points to negative growth.**

#### Chart - 10

In [None]:
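# Chart - 10 visualization code
# .copy() gives an independent frame, so the in-place calls below do not act on a slice of df_not_canceled
repeated_guest = df_not_canceled[df_not_canceled['is_repeated_guest']==1].copy()
# reset_index moves the original row labels into a column named 'index';
# note that these values are row labels, not a count of guests
repeated_guest.reset_index(level=0, inplace=True)
repeated_guest.rename(columns={'index': 'no_of_guests'}, inplace=True)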

In [None]:
sns.set_style("whitegrid")
sns.boxplot(x = 'reserved_room_type', y = 'no_of_guests', data = repeated_guest)

##### 1. Why did you pick the specific chart?

**A boxplot summarizes the distribution of a numeric variable for one or several groups. It lets us quickly read off the median, quartiles and outliers.**

##### 2. What is/are the insight(s) found from the chart?

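**The chart was read as room G having the maximum (119070) and room E the least (13937). Note that `no_of_guests` is built from the original row index, so these values are row labels rather than guest counts; a count of repeated guests per room type is needed to rank the rooms.**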
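
The number of repeated guests per room type can be obtained directly with `value_counts`. A minimal sketch on made-up rows (the `toy` frame is an assumption; on the real data `df_not_canceled` would be used instead):

```python
import pandas as pd

# Made-up bookings: repeated-guest flag and reserved room type
toy = pd.DataFrame({
    "is_repeated_guest": [1, 1, 0, 1, 0],
    "reserved_room_type": ["A", "A", "A", "D", "G"],
})

# Keep repeated guests only, then count bookings per room type (largest first)
repeated_per_room = (
    toy.loc[toy["is_repeated_guest"] == 1, "reserved_room_type"].value_counts()
)
print(repeated_per_room)
```

A bar chart of `repeated_per_room` would then rank room types by how often repeated guests reserve them.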

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the room type that is used most often will require more attention than the others.**
**Giving an equal amount of attention to all room types would waste effort on the rooms that are used less often.**

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# 20 most frequent (country, meal) combinations; value_counts already sorts in descending order
country_meal = df_not_canceled[['country','meal']].value_counts().head(20).reset_index(name='values')

In [None]:
sns.set(rc = {'figure.figsize':(8,5)})
sns.scatterplot(data=country_meal, x="country", y="values", hue="meal",style = "meal",s=80)

##### 1. Why did you pick the specific chart?

**A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values of variables for a set of data.**

##### 2. What is/are the insight(s) found from the chart?

**The chart shows the meal types most often ordered by guests from the top booking countries. Most countries prefer the BB meal, and SC is the least ordered.**

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes. Since BB is the meal ordered most in most countries, the hotels can focus more on that meal.**

**No, there are no insights that lead to negative growth.**

#### Chart - 12

In [None]:
# Chart - 12 visualization code

# total_night_stays = stays_in_week_nights + stays_in_weekend_nights (added during data wrangling)
# head(10) selects the ten most frequent stay lengths
top_10_nights = df_not_canceled['total_night_stays'].value_counts().head(10).index

fig, ax = plt.subplots(figsize=(12,6))
sns.countplot(x='total_night_stays', hue='hotel', data=df_not_canceled, order=top_10_nights, ax=ax)
# set the labels after plotting so seaborn's defaults do not overwrite them
ax.set_xlabel('No of Nights')
ax.set_ylabel('No of Bookings')
ax.set_title('Hotel wise night stay duration (Top 10)');

##### 1. Why did you pick the specific chart?

**A grouped count plot places the bars for the two hotels side by side for each stay duration, which makes the most common durations easy to compare between hotels.**

##### 2. What is/are the insight(s) found from the chart?

**For the Resort hotel, the most popular stay durations are three, two, one and four days, in that order. For the City hotel, they are one, two, seven (a week) and three days.**

#### Chart - 13

In [None]:
# Chart - 13 visualization code
sns.catplot(x = "is_canceled", y = "previous_cancellations", data = df, kind = "box", col = "hotel")

##### 1. Why did you pick the specific chart?

**A catplot draws the same kind of plot for each category in separate panels on a shared scale. Here it places box plots of previous cancellations, split by cancellation status, side by side for the two hotels.**

##### 2. What is/are the insight(s) found from the chart?

**The cancellation percentage is more than 28% in the City hotel and 22% in the Resort hotel.**
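
A box plot does not show cancellation percentages directly, so they could be computed alongside the chart. Because `is_canceled` is coded as 0/1, its mean per hotel is the cancellation rate. The sketch below uses made-up rows (the `toy` frame is illustrative):

```python
import pandas as pd

# Made-up bookings: 2 of 4 City bookings and 1 of 4 Resort bookings are cancelled
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "is_canceled": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 flag is the fraction of 1s, so this is the cancellation rate in percent
cancel_rate_pct = toy.groupby("hotel")["is_canceled"].mean() * 100
print(cancel_rate_pct)
```

On the real data the same expression would be applied to `df`.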

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

correlation_df = df[['lead_time','previous_cancellations', 'previous_bookings_not_canceled', 'total_heads',
                    'booking_changes', 'days_in_waiting_list', 'adr', 'required_car_parking_spaces', 'total_of_special_requests']].corr()
f, ax = plt.subplots(figsize=(12, 12))
sns.heatmap(correlation_df,annot =True, fmt='.2f', annot_kws={'size': 10},  vmax=1, square=True, cmap="OrRd")

##### 1. Why did you pick the specific chart?

**Correlation heatmaps can be used to find potential relationships between variables and to understand the strength of these relationships. The coefficients shown measure linear association, so a value near zero does not rule out a nonlinear relationship.**

##### 2. What is/are the insight(s) found from the chart?

**The highest correlation between two different variables is 0.39 (positive) and the lowest is -0.09 (negative).**
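
The extreme values can be read from the correlation matrix programmatically instead of by eye. This is a sketch on made-up numbers (the `toy` frame and its three columns are assumptions): it masks the diagonal and the duplicated lower triangle, then sorts the remaining pairs.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "lead_time": [10, 50, 90, 200, 300],
    "adr": [120.0, 100.0, 95.0, 80.0, 60.0],
    "total_of_special_requests": [0, 2, 1, 0, 1],
})

corr = toy.corr()

# Upper triangle only: k=1 skips the diagonal, where every value is 1.0
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One row per distinct pair of variables, sorted from most negative to most positive
pairs = corr.where(upper).stack().dropna().sort_values()

most_negative_pair = pairs.index[0]
most_positive_pair = pairs.index[-1]
```

On the real data, `correlation_df` from the cell above would replace `corr`.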

#### Chart - 15

In [None]:
# Chart - 15 visualization code (histogram of lead time)
df_canceled = df[df['is_canceled']!=0]
fig, plot = plt.subplots(figsize=(16,8))
plt.title("Lead Time Stats")
plt.hist([df_not_canceled.lead_time, df_canceled.lead_time], color = ["orange","skyblue"], bins= np.arange(0, 600,100));
plt.legend(["Confirmed bookings", "Canceled booking"]);

##### 1. Why did you pick the specific chart?

**A histogram shows how lead time is distributed, and plotting confirmed and cancelled bookings side by side in the same bins makes the two groups easy to compare.**

##### 2. What is/are the insight(s) found from the chart?


**The first 100 days of lead time contain the highest number of cancellations. The number of cancellations falls as lead time increases, yet the cancellation rate (the share of bookings in each bin that are cancelled) rises as lead time increases.**
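
The difference between the number of cancellations and the cancellation rate can be made explicit by binning lead time and computing both per bin. A minimal sketch with made-up rows (the `toy` frame is an assumption; the bin edges mirror the 100-day bins of the histogram):

```python
import pandas as pd

toy = pd.DataFrame({
    "lead_time": [5, 40, 90, 150, 180, 320, 410],
    "is_canceled": [0, 0, 1, 1, 0, 1, 1],
})

# 100-day bins closed on the left: [0, 100), [100, 200), ...
edges = list(range(0, 600, 100))
toy["lead_bin"] = pd.cut(toy["lead_time"], bins=edges, right=False)

# Per bin: number of cancellations, number of bookings, and their ratio
summary = toy.groupby("lead_bin", observed=True)["is_canceled"].agg(
    cancellations="sum", bookings="count", cancel_rate="mean"
)
print(summary)
```

A bin can hold few cancellations in absolute terms and still have a high `cancel_rate` when it contains few bookings overall.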

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***