# EDA of Hotel Booking Demand

This project performs Exploratory Data Analysis (EDA) on the data set and derives meaningful estimators from it, so that anyone who wishes to go on to build a Machine Learning model can do so.

This data set consists of a single file that compares booking information for two hotels: a city hotel and a resort hotel. Both hotels are located in Portugal (southern Europe): “H1 at the resort region of Algarve and H2 at the city of Lisbon”.

This dataset can be downloaded from [here](https://www.sciencedirect.com/science/article/pii/S2352340918315191).

The tools used for data analysis in this project are the NumPy and pandas packages; Matplotlib and Seaborn are used to visualize and explore the data.

We try to answer these questions:


1.  **Where do the guests come from?**
2.  **How much do guests pay for a room per night?**
3.  **How does the price per night vary over the year?**
4.  **Which are the busiest months for hotel bookings?**
5.  **How long do people typically stay at the hotels?**
6.  **What are the booking patterns by market segment?**
7.  **How many bookings were canceled, and which month had the highest number of cancellations?**
8.  **Is there a repeated guest effect on cancellations?**
9.  **What is the relationship between the number of nights spent at hotels and booking types (resort vs. city)?**
10.  **How does the deposit affect cancellations, and is there a difference by market segment?**

Let's start. 

#### Import all necessary libraries

In [1]:
import pandas as pd  # Used for data manipulation and analysis
import numpy as np  # Used for scientific computing
import matplotlib.pyplot as plt  # Used for plotting graphs
import seaborn as sns  # Used for plotting graphs


Let's import and display the data set.

In [4]:
# Read the dataset into a dataframe named data
data = pd.read_csv("../Hotels_Analysis/dataset/hotel_bookings.csv")

In [9]:
# Let's see number of rows and columns in our dataset
data.shape

(119390, 32)

So we have 119390 rows and 32 columns, which gives us plenty of data to work with.

#### Let's see the first few rows of the data set.

In [3]:
# Display top 5 rows of the table.
data.head(5)


| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |


#### Data Pre-processing

Let's copy our dataset so that our original dataset remains unchanged. 

In [5]:
# Copy the dataset
df = data.copy()
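
As a small illustration of why the copy matters, the sketch below uses a made-up two-row frame (not the bookings data) and assumes the default deep copy of `DataFrame.copy()`. Changes made to the copy should leave the original untouched.

```python
import pandas as pd

# Toy frame standing in for the bookings data (values are made up)
original = pd.DataFrame({"adults": [2, 1], "children": [0.0, None]})

# copy() makes a deep copy by default, so the data itself is duplicated
working = original.copy()
working["children"] = working["children"].fillna(0)

# The original still holds its missing value; only the copy changed
print(original["children"].isnull().sum())  # 1
print(working["children"].isnull().sum())   # 0
```

A plain assignment such as `working = original` would only create a second name for the same object, so edits through either name would show up in both.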


Let's check for missing values in our dataset.

In [10]:
# Count the null values in each column.
# Sort in descending order and show only the top 10, as the dataset has many columns.
df.isnull().sum().sort_values(ascending=False)[:10]


company                   112593
agent                      16340
country                      488
children                       4
reserved_room_type             0
assigned_room_type             0
booking_changes                0
deposit_type                   0
hotel                          0
previous_cancellations         0
dtype: int64

So we do have null values in our dataset. Let's handle them.
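
Before choosing a fill strategy, it can help to read the counts as shares of the 119390 rows: `company` is missing in roughly 94% of rows, `agent` in about 14%, and `country` in about 0.4%. The sketch below shows one way to compute such percentages; the frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) standing in for the bookings data
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 110.0],
    "agent": [np.nan, 304.0, 240.0, 9.0],
    "hotel": ["Resort Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
})

# isnull() marks missing cells; mean() of the booleans is the missing fraction
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

The same expression applied to `df` would give the percentage of missing values per column.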

The `agent` and `company` columns hold an ID number for each agent or company, so we will simply replace their missing values with 0, which then marks bookings with no agent or company recorded.

In [11]:
# Replace missing values in the `agent` and `company` columns with 0.0
df[['agent', 'company']] = df[['agent', 'company']].fillna(0.0)

The `children` column contains the number of children, so we will replace its missing values with the rounded mean.

In [12]:
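# Replace the null values in the children column with the rounded mean.
# Assigning the result back avoids a chained inplace fillna, which newer pandas versions warn about.
df["children"] = df["children"].fillna(round(df["children"].mean()))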
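
To make the rounded-mean step concrete, here is a minimal sketch on a made-up column (the values are illustrative, not taken from the dataset). `mean()` skips missing values by default, and `round()` turns the average back into a whole count.

```python
import numpy as np
import pandas as pd

# Toy column (made-up values) with one missing entry
children = pd.Series([0.0, 0.0, 1.0, 2.0, np.nan], name="children")

# mean() ignores NaN, so this is the mean of [0, 0, 1, 2] = 0.75
fill_value = round(children.mean())

# fillna returns a new Series with the gap filled
filled = children.fillna(fill_value)
print(filled.tolist())  # [0.0, 0.0, 1.0, 2.0, 1.0]
```

With only 4 missing values in the real column, the choice of fill value has little effect on the overall distribution.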

As `country` is a categorical column, we will replace its missing values with the most frequent country, or **MODE**. The mode is the value that appears most frequently in a data set.

In [13]:
# Replace missing values in the country column with the mode.
# mode() returns a Series, so take its first element to get the country code itself.
df['country'] = df['country'].fillna(df['country'].mode()[0])
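
One detail worth illustrating: `Series.mode()` returns a Series rather than a single value, because several values can tie for most frequent. The sketch below, on a made-up column, picks the first mode as a plain string before filling.

```python
import numpy as np
import pandas as pd

# Toy column (made-up values) with one missing entry
country = pd.Series(["PRT", "GBR", "PRT", np.nan, "ESP"], name="country")

modes = country.mode()          # a Series; NaN is ignored by default
most_frequent = modes.iloc[0]   # first mode as a plain string

filled = country.fillna(most_frequent)
print(most_frequent)    # PRT
print(filled.tolist())  # ['PRT', 'GBR', 'PRT', 'PRT', 'ESP']
```

Checking `filled.unique()` afterwards is a quick way to confirm that only genuine country codes remain in the column.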

In [22]:
# Check the data with no adults, no children and no babies
df[(df.adults+df.children+df.babies) == 0].shape

(180, 32)

So we have 180 such rows, which is only about 0.15% of the 119390 rows in our dataset, so we can drop them.

In [24]:
# Drop the data with no adults, no children and no babies
df.drop(df[(df.adults+df.children+df.babies) == 0].index, inplace=True)
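
An equivalent way to express this filter, sketched here on a made-up frame, is to build the boolean mask once and keep its complement. On the real data, 119390 - 180 = 119210 rows should remain if the step worked as intended.

```python
import pandas as pd

# Toy frame (made-up values); the second row has no guests at all
toy = pd.DataFrame({
    "adults": [2, 0, 1],
    "children": [0.0, 0.0, 1.0],
    "babies": [0, 0, 0],
})

# True for rows where the total number of guests is zero
no_guests = (toy["adults"] + toy["children"] + toy["babies"]) == 0

# Keeping the complement of the mask removes the flagged rows
cleaned = toy[~no_guests]

print(no_guests.sum())  # 1
print(cleaned.shape)    # (2, 3)
```

Comparing `df.shape` before and after the drop is a simple check that the expected number of rows was removed.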