# **Project Name**    -



##### **Project Type**    - Exploratory Data Analysis
##### **Contribution**    - Individual



# **Project Summary -**

## HOTEL BOOKINGS

  Hotels need to keep track of the customers who use their service, how their stay went, and whether it was satisfactory.
  
  To do that, it helps to know how customers came to know about the hotel, through which medium they booked their rooms, whether they had any previous cancellations, whether they made any special requests, etc., so that their needs are met and the customer experience improves.

  In that regard, I will be analysing the booking data to understand the type of customers staying at the hotel and how the hotel can use that data to serve its customers better.

# **GitHub Link -**

[Link to Github](https://github.com/AgathianDevaraj/Capstone)

# **Problem Statement**


**The idea is to make sense of the abundant data available and help the hotel understand who its customer base is and what it can expect based on the existing booking data.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# The basic ones
import pandas as pd
import requests as r
import numpy as np

# The ones for visualization
import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
# Load Dataset

dataframe = pd.read_csv('https://raw.githubusercontent.com/AgathianDevaraj/Capstone/main/Hotel-Bookings.csv', index_col=0)

# You can also store the URL in a variable first to keep the read_csv call less cluttered
# Example:

# url = 'https://www.youtube.com'
# df = pd.read_csv(url)
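
Since the guidelines ask for exception handling and a notebook that runs in one go, the load step could be wrapped in a small helper. This is only a minimal sketch: `load_bookings` is a hypothetical name that is not part of the original notebook, and unlike the cell above it does not set `index_col`.

```python
import pandas as pd


def load_bookings(source):
    """Read a bookings CSV from a path or URL; return None if it cannot be read."""
    try:
        return pd.read_csv(source)
    except (OSError, pd.errors.EmptyDataError, pd.errors.ParserError) as err:
        # OSError covers missing files and failed downloads (URLError is a subclass)
        print(f'Could not load {source!r}: {err}')
        return None
```

If the helper returns `None`, the notebook can stop early with a clear message instead of failing in a later cell.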

### Dataset First View

In [None]:
# Dataset First Look

dataframe.head( )

### Dataset Rows & Columns count

In [None]:
# Rows and columns count

dataframe.shape

### Dataset Information

In [None]:
# Dataset Info

# Learn all basic info about the table using .info()
# It shows the column list, datatypes in the columns, and the non null count of the content as well

dataframe.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

len(dataframe[dataframe.duplicated()])


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

# To get bool results of where the null values are, use .isnull()
# As we need count as well, we sum up all the true results for every column and return it.

dataframe.isnull().sum()

In [None]:
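# Visualizing the missing values

# Count nulls per column and keep only the columns that actually have nulls
# (drop_duplicates would wrongly drop columns that share the same null count)
null_counts = dataframe.isnull().sum()
null_counts = null_counts[null_counts > 0]

plt.barh(y = null_counts.index, width = null_counts)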

plt.xlabel('Number of Null values')
plt.ylabel('Columns with Null values')

plt.title('Null Values in data')

plt.show()

In [None]:

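# Replace the null values with the number 0
# Assigning the result back avoids chained inplace calls, which newer pandas versions warn about

dataframe['company'] = dataframe['company'].fillna(0)
dataframe['agent'] = dataframe['agent'].fillna(0)
dataframe['children'] = dataframe['children'].fillna(0)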

### What did you know about your dataset?

From the bookings dataset, we have learnt what data is available, from the customers' reservation details to cancellation details and everything in between. The missing values also tell us that some fields are optional.

Many rows have no company recorded, so we can assume that those bookings were not made through a company and are probably not business stays. Some rows have no agent recorded, so we can assume those guests did not book through an agent.

We will learn more as the data analysis progresses.
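
To put numbers behind the observation about optional fields, the share of missing values per column can be computed. The sketch below uses a small made-up frame (`sample` is hypothetical and its values are not taken from the bookings file); on the real data the same lines would have to run on `dataframe` before the nulls are filled with 0.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the bookings data; values are made up for illustration
sample = pd.DataFrame({
    'company': [np.nan, np.nan, 40.0, np.nan],
    'agent': [9.0, np.nan, 14.0, 9.0],
    'children': [0.0, 1.0, np.nan, 0.0],
    'adults': [2, 2, 1, 2],
})

# Mean of the boolean null mask is the fraction of missing values per column
missing_pct = sample.isnull().mean().mul(100).round(1)

# Keep only columns that have missing values, largest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

A percentage is easier to compare across columns than a raw count, because it does not depend on the number of rows.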

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

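# The first CSV column was used as the index on load (index_col=0),
# so reset_index turns it back into a regular column first.
# Then get the names of the columns using .columns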

dataframe.reset_index(inplace = True)
dataframe.columns

In [None]:
# Dataset Describe

# Learn what type of data and the distribution of the same using .describe()

dataframe.describe()

### Variables Description

* **hotel** - Hotel type (City Hotel or Resort Hotel)
* **is_canceled** - Whether the booking was canceled (1) or not (0)
* **lead_time** - Number of days between the booking date and the arrival date
* **arrival_date_year** - Year of arrival
* **arrival_date_month** - Month of arrival
* **arrival_date_week_number** - Week of the year of arrival
* **arrival_date_day_of_month** - Day of the month of arrival
* **stays_in_weekend_nights** - Number of weekend nights (Saturday or Sunday) booked
* **stays_in_week_nights** - Number of week nights (Monday to Friday) booked
* **adults** - Number of adults
* **children** - Number of children
* **babies** - Number of babies
* **meal** - Type of meal booked
* **country** - Country of origin
* **market_segment** - Market segment
* **distribution_channel** - Distribution channel
* **is_repeated_guest** - Whether the guest is a repeated guest (1) or not (0)
* **previous_cancellations** - Number of previous bookings canceled by the customer
* **previous_bookings_not_canceled** - Number of previous bookings that were not canceled
* **reserved_room_type** - Type of room reserved
* **assigned_room_type** - Type of room received
* **booking_changes** - Number of changes made to the booking
* **deposit_type** - Type of deposit made
* **agent** - ID of the travel agent that made the booking
* **company** - ID of the company that made the booking
* **days_in_waiting_list** - Days the booking spent on the waiting list
* **customer_type** - Customer type
* **adr** - Average Daily Rate
* **required_car_parking_spaces** - Number of car parking spaces required
* **total_of_special_requests** - Number of special requests placed
* **reservation_status** - Status of reservation
* **reservation_status_date** - Date the status of reservation was last updated

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for data in dataframe.columns:
  print("Number of unique values in ",data," is ",dataframe[data].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Making sense of which times of the year see the most bookings

df_month = dataframe.groupby('arrival_date_month')['arrival_date_year'].value_counts()
print(df_month)

In [None]:
# Doing the same for the corporate bookings

df_corporate = dataframe[dataframe['distribution_channel']=='Corporate']
df_corporate.groupby('arrival_date_month')['arrival_date_year'].value_counts()

In [None]:
# Checking to see if there is any repeated guest in customer type

df_corporate.groupby('customer_type')['is_repeated_guest'].value_counts()

In [None]:
# Checking to see if customers travelling with their kids request parking spaces

dataframe.groupby('children')['required_car_parking_spaces'].value_counts()

In [None]:
# Checking to see if cancellation has a pattern based on deposit type selected

dataframe.groupby('deposit_type')['is_canceled'].value_counts()

In [None]:
# Getting numbers for the canceled bookings

canceled = dataframe[dataframe['is_canceled']==1]

# value_counts on the column alone gives one count per deposit type
can = canceled['deposit_type'].value_counts()

for deposit, count in can.items():
  print(f'The number of cancellations if the deposit type is "{deposit}" is {count}')

In [None]:
# Checking to see if the corporate guests were repeat guests and also their reservation status

df_corporate.groupby('is_repeated_guest')['reservation_status'].value_counts()

In [None]:
# Checking to see if bringing children makes the customers request anything special

# Count bookings with children by number of special requests, ordered 0, 1, 2, ...
spl = dataframe[dataframe['children'] > 0]['total_of_special_requests'].value_counts().sort_index().head(3)

for requests, count in spl.items():
  print(f'Number of bookings with children that made {requests} special requests is {count}')


In [None]:
dataframe['days_in_waiting_list'].max()

In [None]:
dataframe['days_in_waiting_list'].mean()

In [None]:
# Checking the waiting list details for bookings made as a group

dataframe[dataframe['customer_type']=='Group']['days_in_waiting_list'].value_counts()

### What all manipulations have you done and insights you found?

From the data manipulation, we can see that:

* Bookings are comparatively low in January and February but significantly higher in May and June.
* Corporate reservations were made the most in October and November.
* Whatever the customer type, repeat bookings are very low.
* Oddly, bookings with no deposit were canceled about 25% of the time, while bookings with a non-refundable deposit were canceled about 99% of the time.
* Repeated guests show a higher cancellation trend than other guests.
* While some guests with children make a couple of special requests, many make none.
* The wait time for a reservation can go up to about 390 days, but it averages only about 2 days. Group bookings have almost no waiting time.
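
The cancellation percentages per deposit type quoted above can be derived with a row-normalised cross-tabulation rather than read off raw counts. A minimal sketch on a hypothetical mini-sample (the values in `sample` are made up; the notebook would pass the `dataframe` columns instead):

```python
import pandas as pd

# Hypothetical mini-sample; values are made up for illustration
sample = pd.DataFrame({
    'deposit_type': ['No Deposit'] * 4 + ['Non Refund'] * 2 + ['Refundable'] * 2,
    'is_canceled':  [1, 0, 0, 0, 1, 1, 0, 1],
})

# Row-normalised cross-tabulation: each row sums to 1, so the column for
# is_canceled == 1 is the cancellation rate within that deposit type
rates = pd.crosstab(sample['deposit_type'], sample['is_canceled'], normalize='index')
cancel_rate_pct = (rates[1] * 100).round(1)
print(cancel_rate_pct)
```

Normalising by row keeps the comparison fair even though the deposit types have very different numbers of bookings.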

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# value_counts on the column alone gives one count per customer type, with plain labels
customer = dataframe['customer_type'].value_counts()

# Charting a pie chart

plt.pie(
    x = customer ,
    labels = customer.index,
    autopct="%1.1f%%",
    shadow = True
)

plt.title('Types of customers')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is one of the easiest charts to read when visualising data as percentages. It shows the share of each customer type among the bookings, so the subsets can be compared at a glance.

##### 2. What is/are the insight(s) found from the chart?

From the chart we can see that Transient customers account for about 75% of bookings, while Transient-Party customers account for another 21%. This suggests that most bookings are short stays by guests who may not return. It may also indicate that the hotels are located in or near tourist destinations, or along a highway, where repeat customers are less common.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

That the majority of bookings come from transient customers does not have to be a bad thing. The hotel can use the data and feedback to take better care of its customers, so that their reviews attract more customers, even transient ones.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Getting the counts ready; both series share one deposit-type order so the bars line up
deposit_types = sorted(dataframe['deposit_type'].unique())
canceled = dataframe[dataframe['is_canceled'] == 1]['deposit_type'].value_counts().reindex(deposit_types, fill_value = 0)
not_canceled = dataframe[dataframe['is_canceled'] == 0]['deposit_type'].value_counts().reindex(deposit_types, fill_value = 0)

# Creating subplots for better visualisation
# figsize has to be passed when the figure is created; sharey hides the repeated labels
fig, (sub1, sub2) = plt.subplots(1, 2, sharex = True, sharey = True, figsize = (10, 6))

sub1.barh(y = canceled.index, width = canceled)
sub2.barh(y = not_canceled.index, width = not_canceled)

sub1.set_xlabel('Count')
sub2.set_xlabel('Count')

# Setting mini titles and major one.
sub1.set_title('Canceled Bookings')
sub2.set_title('Bookings not canceled')

fig.suptitle('Cancellation data based on deposit',fontsize = 16)

plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are handy for comparing 2 simple sets of data. Here they show, for each deposit type, how many bookings were canceled and how many were not.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that bookings made with no deposit were canceled about 25% of the time and that, surprisingly, a large number of non-refundable bookings were canceled as well.

##### 3. Will the gained insights help creating a positive business impact?

Since cancellations of no-deposit bookings amount to nearly a quarter of those bookings, the hotel can use this information to plan measures that prevent many of these cancellations.


#### Chart - 3

In [None]:
# Chart - 3 visualization code

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September',
          'October', 'November', 'December']

# Readying the dataset
canceled = dataframe[dataframe['is_canceled'] == 1]['arrival_date_month'].value_counts().reindex(months)
not_canceled = dataframe[dataframe['is_canceled'] == 0]['arrival_date_month'].value_counts().reindex(months)

# Preparing the chart
plt.figure(figsize = (15,6))
sns.lineplot(canceled, label = 'Canceled')
sns.lineplot(not_canceled, label = 'Not canceled')

# Axis labels and title
plt.xlabel('Months of the Year')
plt.ylabel('Number of occurrences')
plt.title('Cancellation status over the months')
plt.show()

##### 1. Why did you pick the specific chart?

Line charts are great for showing data over time. To see how canceled and non-canceled bookings vary across the months, we plot both as lines.

##### 2. What is/are the insight(s) found from the chart?

From the chart, bookings are low in January, November, and December, and the number of cancellations dips in those months as well.

Bookings rise in July and August. Because of the higher volume, these months also see more cancellations than the others.
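
The line chart compares counts, so busy months naturally show more cancellations. To separate volume from the actual cancellation rate, the share of canceled bookings per month can be computed as well. A minimal sketch on a hypothetical mini-sample (the values in `sample` are made up; the notebook would use `dataframe`):

```python
import pandas as pd

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

# Hypothetical mini-sample; values are made up for illustration
sample = pd.DataFrame({
    'arrival_date_month': ['January', 'January', 'August', 'August', 'August', 'August'],
    'is_canceled':        [0, 1, 1, 0, 0, 0],
})

# The mean of a 0/1 flag is the share of canceled bookings in each month;
# reindex puts the months in calendar order (months without data become NaN)
monthly_rate = (sample.groupby('arrival_date_month')['is_canceled']
                      .mean()
                      .reindex(months))
print(monthly_rate.dropna())
```

In this made-up sample August has more bookings than January but a lower cancellation rate, which is exactly the distinction a count-based chart cannot show.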



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Based on this data, the hotels can plan ways to attract more bookings in their slow months, or use those months to give their staff vacation time. On the other hand, they can hire temporary workers in the busy months to handle the higher number of guests.

This will help the organisation directly and indirectly improve the quality of services offered.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

Country = dataframe[dataframe['is_canceled'] == 0]['country'].value_counts().head()

# Calculating what the percentage the rest of the customers make up
Country.loc['Others'] = dataframe[dataframe['is_canceled'] == 0]['country'].value_counts().sum() - Country.sum()

plt.pie(Country, autopct="%1.1f%%", labels = Country.index)
plt.title('Top 5 Countries to book the hotels')
plt.show()

##### 1. Why did you pick the specific chart?

As discussed before, a pie chart works well for showing percentages, so it suits showing the share of guests coming from each country.

##### 2. What is/are the insight(s) found from the chart?

The top 5 countries that guests come from are PRT (Portugal), GBR (Great Britain), FRA (France), ESP (Spain), and DEU (Germany). All countries outside the top 5 together account for about 30% of bookings, while Portugal alone accounts for about 28%. These figures are for non-canceled bookings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Based on this graph, the hotels can offer things like traditional foods of these countries and features that guests from these countries would enjoy, so that customers enjoy their stay, leave good reviews, and possibly visit again.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# Bar chart to see how many bookings were canceled and how many were not
sns.countplot(data = dataframe[dataframe['booking_changes'] < 5],
              x = 'booking_changes',
              hue='is_canceled' )

plt.xlabel('Changes requested before canceled')
plt.title('Cancellation and Booking changes')
plt.show()

##### 1. Why did you pick the specific chart?

A grouped bar chart makes it easy to compare 2 values under the same category. The canceled and not-canceled counts are shown side by side, which helps us understand how they relate.

##### 2. What is/are the insight(s) found from the chart?

Even when no changes were requested, many cancellations are recorded. Among bookings with changes requested, the share of cancellations is considerably lower. This suggests that if customers are given the option to request changes or special arrangements for their stay, cancellations may decrease.

##### 3. Will the gained insights help creating a positive business impact?


With this insight, we can let customers state what they would like to find in their room, and where the requests are reasonable the hotel can accommodate them for a more enjoyable stay. Knowing that their request was at least considered also makes customers feel valued; they might reconsider cancelling and be willing to reach a compromise that works for both the customer and the hotel.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

sns.countplot(x = 'adults', hue = 'hotel', data = dataframe[dataframe['adults'] < 5])
plt.xlabel('Number of Adults')
plt.ylabel('Count')
plt.title('Number of adults per booking')
plt.show()

##### 1. Why did you pick the specific chart?

I had to compare counts across 2 similar groups, and as discussed before a bar chart is useful for that. Seaborn was used because it builds the legend and labels from the data without extra code.

##### 2. What is/are the insight(s) found from the chart?

Bookings for two adults are the most common, and the City Hotel consistently sees more bookings than the Resort Hotel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

From this observation, we can conclude, or at least assume, that the hotels are booked mostly by couples, who are likely to be on vacation.

Based on that assumption, if the hotels organise events that couples might enjoy, they may gain repeat customers, or at least positive feedback.

#### Chart - 7

In [None]:
# Chart - 7 visualization code


plt.figure(figsize=(20,8))

# Getting the data and plotting them based on market segments
sns.countplot(x = 'market_segment', hue = 'arrival_date_year', data = dataframe)

# Rotate the tick labels after plotting so the rotation applies to the final labels
plt.xticks(rotation = 90)

plt.xlabel('Market Segments')
plt.ylabel('Count')
plt.title('Market segments over the years')
plt.show()



##### 1. Why did you pick the specific chart?

Since market segments are compared on booking counts, a bar graph is a good fit. Seaborn is used here to group the bars by year so the chart can be read easily.

##### 2. What is/are the insight(s) found from the chart?

Over the years, online travel agents contributed the most bookings. The year 2016 also has the most bookings of the 3 years, but this may be because not all months of 2015 and 2017 are included in the data.
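
Whether 2015 and 2017 are only partially covered can be checked directly by counting the distinct arrival months recorded for each year. A minimal sketch on a hypothetical mini-sample (the values in `sample` are made up; the notebook would use `dataframe`):

```python
import pandas as pd

# Hypothetical mini-sample; values are made up for illustration
sample = pd.DataFrame({
    'arrival_date_year':  [2015, 2015, 2016, 2016, 2016, 2017],
    'arrival_date_month': ['July', 'August', 'January', 'July', 'December', 'January'],
})

# Number of distinct arrival months recorded for each year;
# a value below 12 means the year is only partially covered
months_per_year = sample.groupby('arrival_date_year')['arrival_date_month'].nunique()
print(months_per_year)
```

If a year turns out to cover fewer than 12 months, comparing bookings per month is fairer than comparing yearly totals.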

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We learn that online travel agencies bring in the most bookings. Hotels could therefore partner with more online agencies, create websites to handle their own bookings, or advertise in other ways to attract customers online. It would also help to develop the other channels and raise their numbers.

# **Conclusion**

In conclusion, we learn that hotel bookings peak in July and August. If we assume these are summer holiday bookings made by couples travelling with their children, the hotels can use this information to better prepare the services they provide.

Also, armed with the knowledge of where most customers come from, whether they are staying temporarily (most were transient customers), whether they travel as couples or alone, etc., the hotel can hire extra help to manage higher volumes when needed, or give staff vacation time in slow periods so that they come back to work fully charged.

Some observations made, to wrap it up are:



1.  Most customers booking the hotels are transient customers.
2.  July and August show the most bookings.
3.  November to January show the fewest.
4.  In corporate bookings, the cancellation percentage is comparatively lower.
5.  Repeated transient customer numbers are at about 40%.
6.  There is an alarming number of cancellations when a non-refundable deposit is involved.
7.  The largest share of guests is from Portugal, at about 28%.
8.  When zero changes were made to a booking, there is a higher chance of it being canceled.
9.  Couples make up most of the bookings.
10. People prefer the City Hotel more than the Resort Hotel.
11. Online TA contributes the most bookings for the hotels.




### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***