## Hotel Booking Demand Analysis

Group Members: Jingwei Chen, Ju-Hsuan Hsieh, Junze Li, Yuening Zhan, Yunbei Wang

### Outline

- 1. Data Fetching and Cleaning
- 2. Exploratory Data Analysis
- 3. Feature Engineering
- 4. Cancellation Prediction

### 1. Data Fetching and Cleaning

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [2]:
booking_data = pd.read_csv('hotel_bookings.csv')

In [3]:
booking_data.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


In [4]:
booking_data.dtypes

hotel                              object
is_canceled                         int64
lead_time                           int64
arrival_date_year                   int64
arrival_date_month                 object
arrival_date_week_number            int64
arrival_date_day_of_month           int64
stays_in_weekend_nights             int64
stays_in_week_nights                int64
adults                              int64
children                          float64
babies                              int64
meal                               object
country                            object
market_segment                     object
distribution_channel               object
is_repeated_guest                   int64
previous_cancellations              int64
previous_bookings_not_canceled      int64
reserved_room_type                 object
assigned_room_type                 object
booking_changes                     int64
deposit_type                       object
agent                             

In [5]:
booking_data.isnull().any()

hotel                             False
is_canceled                       False
lead_time                         False
arrival_date_year                 False
arrival_date_month                False
arrival_date_week_number          False
arrival_date_day_of_month         False
stays_in_weekend_nights           False
stays_in_week_nights              False
adults                            False
children                           True
babies                            False
meal                              False
country                            True
market_segment                    False
distribution_channel              False
is_repeated_guest                 False
previous_cancellations            False
previous_bookings_not_canceled    False
reserved_room_type                False
assigned_room_type                False
booking_changes                   False
deposit_type                      False
agent                              True
company                            True


The output shows missing values in **children**, **country**, **agent** and **company**. **children** and **country** have only 4 and 488 missing values respectively, which together is about 0.4% of the 119,390 records. We therefore drop the booking records that have a missing value in **children** or **country**.
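The `isnull().any()` check only shows whether a column has gaps, not how many. A minimal sketch of how per-column counts and shares could be obtained is given below. It uses a small toy frame with made-up values in place of `booking_data`, and the names `toy`, `missing_counts`, `missing_share` and `summary` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for booking_data (hypothetical values, not taken from hotel_bookings.csv)
toy = pd.DataFrame({
    "children": [0.0, np.nan, 1.0, 0.0],
    "country": ["PRT", "GBR", None, "PRT"],
    "agent": [np.nan, 304.0, 240.0, np.nan],
})

# Number of missing values per column
missing_counts = toy.isnull().sum()

# Share of missing values per column (mean of the boolean mask)
missing_share = toy.isnull().mean()

# Side-by-side summary, largest gaps first
summary = pd.DataFrame({"missing": missing_counts, "share": missing_share})
summary = summary.sort_values("missing", ascending=False)
print(summary)
```

Applied to the real data, the same two calls on `booking_data` would give the counts quoted above.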

In [6]:
# Collect the row indices that have a missing value in children or country
children_del_list = booking_data[booking_data['children'].isnull()].index.tolist()
country_del_list = booking_data[booking_data['country'].isnull()].index.tolist()

In [7]:
booking_data_cleaned = booking_data.drop(set(children_del_list + country_del_list))

In [8]:
print('We delete {} booking records and there are {} records remaining.'.format(len(booking_data)-len(booking_data_cleaned), len(booking_data_cleaned)))

We delete 492 booking records and there are 118898 records remaining.
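The index-based deletion above can also be written with `DataFrame.dropna` and its `subset` parameter. The sketch below compares the two approaches on a toy frame with made-up values. It is an alternative formulation for illustration, not the code used in this notebook, and `toy`, `via_index` and `via_dropna` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for booking_data (hypothetical values)
toy = pd.DataFrame({
    "children": [0.0, np.nan, 1.0, 0.0, 2.0],
    "country": ["PRT", "GBR", None, "PRT", "ESP"],
    "adr": [75.0, 98.0, 0.0, 60.0, 120.0],
})

# Index-based approach, mirroring the cells above
children_del_list = toy[toy["children"].isnull()].index.tolist()
country_del_list = toy[toy["country"].isnull()].index.tolist()
via_index = toy.drop(index=sorted(set(children_del_list + country_del_list)))

# One-step alternative: drop rows with a missing value in either column
via_dropna = toy.dropna(subset=["children", "country"])

print(len(toy) - len(via_dropna), "rows dropped,", len(via_dropna), "remaining")
print("Same result:", via_dropna.equals(via_index))
```

Both variants leave missing values in other columns, such as **agent** and **company**, untouched.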


For the remaining two attributes, **agent** and **company**, we count the missing values:

In [9]:
print('There are {} missing values in agent and {} in company.'.format(booking_data_cleaned['agent'].isnull().sum(), booking_data_cleaned['company'].isnull().sum()))

There are 16004 missing values in agent and 112275 in company.


These two attributes hold the agent or company ID when a booking is made through an agent or company; for all other records the value is NULL. A large proportion of bookings are not made through an agent or company, so we cannot simply drop the records with missing values here.
- For the EDA part, we can analyze the impact of agent/company.
- For the feature engineering part, we can remove these two attributes before the machine-learning-based prediction.
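One possible way to carry out the two steps above is sketched here on a toy frame with made-up values. The indicator columns `has_agent` and `has_company` are assumptions introduced for illustration and are not part of the original dataset or analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for booking_data_cleaned (hypothetical values)
toy = pd.DataFrame({
    "agent": [np.nan, 304.0, 240.0, np.nan],
    "company": [np.nan, np.nan, 110.0, np.nan],
    "is_canceled": [0, 1, 1, 0],
})

# EDA: treat NULL as "no agent/company" and derive 0/1 indicators (assumed names)
eda = toy.assign(
    has_agent=toy["agent"].notnull().astype(int),
    has_company=toy["company"].notnull().astype(int),
)

# Cancellation rate for bookings without (0) and with (1) an agent
cancel_rate_by_agent = eda.groupby("has_agent")["is_canceled"].mean()
print(cancel_rate_by_agent)

# Feature engineering: drop the two ID columns before modeling
features = toy.drop(columns=["agent", "company"])
print(features.columns.tolist())
```

Grouping on an indicator rather than on the raw IDs keeps the NULL records in the comparison, which matters because they make up a large share of the data.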

### 2. Exploratory Data Analysis

### 3. Feature Engineering

### 4. Cancellation Prediction

---