# Hotel Booking Data Analysis - Part I: Data Cleaning

This dataset stores booking information from the property management systems (PMS) of **a resort hotel** and **a city hotel**, located in the resort region of Algarve and in the city of Lisbon, Portugal, respectively. The dataset comprises bookings due to arrive between **1 July 2015** and **31 August 2017**, including bookings that effectively arrived and bookings that were canceled.

## Goal
In this analysis we inspect various aspects of the booking information and make data-based suggestions to the hotels. To improve readability, this notebook covers only the data cleaning. The data analysis is in a separate Jupyter notebook (*Hotel Booking Data Analysis - Part II: Data Analysis*) and in a post displaying Tableau dashboards.

Our analysis covers these topics:  
  
#### Python analysis
- What does the cancellation rate look like? Is it especially high for any market segment?
- What are the common profiles shared by the repeat customers?
- Does the current room type configuration meet the demand? 
  
#### Tableau analysis
- How does the revenue vary by month?
- How do market segment, customer type, customer country of origin, and room type contribute to revenue?

## Data Source
**The original dataset**:  

*[Antonio, Nuno, Ana de Almeida, and Luis Nunes. "Hotel booking demand datasets." Data in brief 22 (2019): 41-49.](https://www.sciencedirect.com/science/article/pii/S2352340918315191#f0010)*

**The dataset used in this analysis**:
  
*[tidytuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)*  
  
This dataset combined the original data from both hotels into one, formatted the original column names, and separated the date information. We start from here.  

## Data Description
The data from both hotels share the same structure, with **31 columns** describing **40,060 bookings of the resort hotel** and **79,330 bookings of the city hotel**. The combined dataset used here adds a `hotel` column, giving 32 columns in total.

Due to the large number of columns, we do not explicitly describe each of them here separately. The explanation of a column will be given when mentioned for the first time.

## Data Cleaning
The data source paper states that the PMS assured **no missing data exists in its database tables**. Therefore we do not need to clean the missing data in this analysis. **In this data, *NULL* should not be considered a missing value, but rather “not applicable”.** For example, if a booking **Agent** is defined as *NULL*, it means that the booking did not come from a travel agent.
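
One practical consequence is worth illustrating: by default, `pd.read_csv` parses the literal string `NULL` as `NaN`, so "not applicable" entries show up as missing values after loading. The sketch below uses a small hypothetical CSV (the rows are made up for illustration, not taken from the dataset) to show both the default behavior and how `keep_default_na=False` preserves the literal string.

```python
import io
import pandas as pd

# Toy CSV (hypothetical rows): the literal string NULL marks "not applicable".
raw = io.StringIO(
    "hotel,agent,company\n"
    "Resort Hotel,NULL,NULL\n"
    "City Hotel,240,NULL\n"
)

# Default behavior: the string "NULL" is parsed as NaN.
parsed = pd.read_csv(raw)
print(parsed.isna().sum())

# Disable the default NA strings to keep "NULL" as an explicit value.
raw.seek(0)
literal = pd.read_csv(raw, keep_default_na=False)
print(literal["agent"].tolist())
```

Which option is preferable depends on the column: for identifier-like columns such as **agent**, keeping `NULL` as its own category may be closer to the meaning described in the paper.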

In [1]:
# Load necessary modules and read in the data
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_csv("hotel_bookings.csv", encoding='latin-1')

### 1. Data Overview

In [3]:
data.columns

Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
       'company', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date'],
      dtype='object')

We can see that the column names are already clean (all lowercase, with underscores instead of spaces). Therefore there is no need to unify the column names. Let's take a look at what the data look like.

In [4]:
data.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


### 2. Drop the columns less relevant to our goal
After reading through the data description, we decide that the columns listed below are less informative regarding our analysis goal. Therefore we are going to drop these columns.
- **arrival_date_week_number**: Week number of year for arrival date.
- **babies**: Number of babies.
- **distribution_channel**: Booking distribution channel.
- **previous_cancellations**: Number of previous bookings that were cancelled by the customer prior to the current booking.
- **previous_bookings_not_canceled**: Number of previous bookings not cancelled by the customer prior to the current booking.
- **booking_changes**: Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation.
- **deposit_type**: Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories.
- **agent**: ID of the travel agency that made the booking.
- **company**: ID of the company/entity that made the booking or is responsible for paying for it.
- **required_car_parking_spaces**: Number of car parking spaces required by the customer.
- **total_of_special_requests**: Number of special requests made by the customer (e.g. twin bed or high floor).
- **reservation_status**: Reservation last status, assuming one of three categories: 'Canceled', 'Check-Out', 'No-Show'.
- **reservation_status_date**: Date at which the last status was set.

In [5]:
cols_to_drop = ['arrival_date_week_number', 'babies', 'distribution_channel', \
                'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes', \
                'deposit_type', 'agent', 'company', 'required_car_parking_spaces', \
                'total_of_special_requests', 'reservation_status', 'reservation_status_date']
data_clean = data.drop(cols_to_drop, axis=1)
del cols_to_drop

Take a look at the data information after dropping the irrelevant columns.

In [6]:
data_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 19 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   hotel                      119390 non-null  object 
 1   is_canceled                119390 non-null  int64  
 2   lead_time                  119390 non-null  int64  
 3   arrival_date_year          119390 non-null  int64  
 4   arrival_date_month         119390 non-null  object 
 5   arrival_date_day_of_month  119390 non-null  int64  
 6   stays_in_weekend_nights    119390 non-null  int64  
 7   stays_in_week_nights       119390 non-null  int64  
 8   adults                     119390 non-null  int64  
 9   children                   119386 non-null  float64
 10  meal                       119390 non-null  object 
 11  country                    118902 non-null  object 
 12  market_segment             119390 non-null  object 
 13  is_repeated_guest          11

In [7]:
(119390 - 118902) / 119390 * 100

0.40874445095904177

As we can see above, the column **country** has 488 *NULL* values and **children** has 4. Because the affected rows account for only about 0.41% of the total data, we discard them.
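
The share of missing values can also be computed for every column at once instead of by hand. The following is a minimal sketch on a toy frame (the values are hypothetical); on the real data the same calls would be applied to `data_clean`.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (hypothetical values).
toy = pd.DataFrame({
    "country": ["PRT", None, "GBR", "FRA"],
    "children": [0.0, 1.0, np.nan, 2.0],
    "adults": [2, 2, 1, 2],
})

# Count and percentage of missing values per column.
missing_count = toy.isna().sum()
missing_pct = toy.isna().mean() * 100
summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary[summary["missing"] > 0])

# Number of rows that dropna() would remove (rows with at least one NaN).
rows_lost = len(toy) - len(toy.dropna())
print(rows_lost)
```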

In [8]:
data_clean.dropna(inplace=True)
data_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 118898 entries, 0 to 119389
Data columns (total 19 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   hotel                      118898 non-null  object 
 1   is_canceled                118898 non-null  int64  
 2   lead_time                  118898 non-null  int64  
 3   arrival_date_year          118898 non-null  int64  
 4   arrival_date_month         118898 non-null  object 
 5   arrival_date_day_of_month  118898 non-null  int64  
 6   stays_in_weekend_nights    118898 non-null  int64  
 7   stays_in_week_nights       118898 non-null  int64  
 8   adults                     118898 non-null  int64  
 9   children                   118898 non-null  float64
 10  meal                       118898 non-null  object 
 11  country                    118898 non-null  object 
 12  market_segment             118898 non-null  object 
 13  is_repeated_guest          11

We can see that there are **118898** bookings and **19** columns left in this dataset.

### 3. Look into the remaining data, decide further cleaning strategies

The basic data cleaning is done. Now let's take a closer look at the data column by column. Necessary within-column operations, such as *value replacement*, are performed along the way. Row-level and column-level operations are deferred to the next section and performed together so that the data stay consistent.  
  
Below, we list the values in each column and their counts. Each heading gives the meaning of the column, and the first line of the output gives the column name.
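
The cells below step through the columns one at a time with a manual counter. As an alternative, the same listing could be produced in a single loop. This is only a sketch on a toy frame; `column_counts` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for data_clean (hypothetical values).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel"],
    "is_canceled": [0, 1, 0],
})

def column_counts(frame):
    """Return {column name: value counts sorted by frequency} for every column."""
    return {col: frame[col].value_counts() for col in frame.columns}

for name, counts in column_counts(toy).items():
    print("Column:", name)
    print(counts.to_string(), "\n")
```

Keeping one cell per column, as done here, has the advantage that each output sits next to the explanation of that column.
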
#### The type of hotel - Resort / City Hotel

In [9]:
idx_column = 0

print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  hotel 
 City Hotel      79302
Resort Hotel    39596
Name: hotel, dtype: int64


#### If the booking was canceled (1) or not (0)

In [10]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  is_canceled 
 0    74745
1    44153
Name: is_canceled, dtype: int64


**NOTE**: eventually only bookings that were not canceled will be put into analysis.
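
To see how much data that restriction would set aside, the counts above can be turned into percentages. The sketch below hard-codes the two counts from the output above; on the real data, `data_clean['is_canceled'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Counts copied from the is_canceled output above.
counts = pd.Series({0: 74745, 1: 44153}, name="is_canceled")

# Share of each class in percent.
rates = counts / counts.sum() * 100
print(rates.round(1))
```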

#### Number of days that elapsed between the entering date of the booking into the PMS and the arrival date

In [11]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  lead_time 
 0      6223
1      3393
2      2033
3      1802
4      1696
       ... 
737       1
370       1
435       1
458       1
709       1
Name: lead_time, Length: 479, dtype: int64


#### Year of arrival date

In [12]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  arrival_date_year 
 2016    56435
2017    40604
2015    21859
Name: arrival_date_year, dtype: int64


#### Month of arrival date

In [13]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  arrival_date_month 
 August       13852
July         12628
May          11779
October      11095
April        11045
June         10927
September    10467
March         9739
February      8012
November      6752
December      6728
January       5874
Name: arrival_date_month, dtype: int64


#### Day of arrival date

In [14]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  arrival_date_day_of_month 
 17    4390
5     4294
15    4161
25    4144
26    4137
9     4082
12    4077
16    4058
19    4041
2     4037
20    4019
18    3986
24    3980
28    3927
8     3908
3     3833
30    3833
6     3813
14    3799
27    3787
21    3754
4     3747
13    3723
7     3657
1     3609
23    3608
11    3590
22    3584
29    3567
10    3554
31    2199
Name: arrival_date_day_of_month, dtype: int64


**NOTE**: in the next step, we will combine the year, month, and day into a single date column. We will also restrict the data to bookings arriving between **1 July 2015** and **30 June 2017** for the subsequent analyses.

#### Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

In [15]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  stays_in_weekend_nights 
 0     51680
2     33249
1     30526
4      1849
3      1253
6       153
5        78
8        60
7        19
9        11
10        7
12        5
16        3
13        3
14        2
Name: stays_in_weekend_nights, dtype: int64


#### Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel

In [16]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  stays_in_week_nights 
 2     33574
1     30091
3     22203
5     11051
4      9554
0      7593
6      1491
10     1030
7      1027
8       654
9       231
15       85
11       55
19       44
12       42
20       41
14       35
13       27
16       16
21       15
22        7
18        6
25        6
30        5
17        4
24        3
40        2
26        1
32        1
33        1
34        1
35        1
41        1
Name: stays_in_week_nights, dtype: int64


**NOTE**: we will add up the values in **stays_in_weekend_nights** and **stays_in_week_nights** as a new column representing overall stay length.

#### Number of adults

In [17]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  adults 
 2     89495
1     22735
3      6197
0       393
4        62
26        5
27        2
20        2
5         2
55        1
50        1
40        1
10        1
6         1
Name: adults, dtype: int64


#### Number of children

In [18]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  children 
 0.0     110319
1.0       4852
2.0       3650
3.0         76
10.0         1
Name: children, dtype: int64


**NOTE**: the two columns **adults** and **children** will be combined into a new column **guest_number**.

#### Type of meal booked. 
Categories are presented in standard hospitality meal packages: 
- **Undefined**/**SC** – no meal package;
- **BB** – Bed & Breakfast; 
- **HB** – Half board (breakfast and one other meal – usually dinner); 
- **FB** – Full board (breakfast, lunch and dinner)

In [19]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  meal 
 BB           91863
HB           14434
SC           10638
Undefined     1165
FB             798
Name: meal, dtype: int64


According to the data description, the values *Undefined* and *SC* have the same meaning. We convert all *Undefined* values to *SC* to avoid confusion.

In [20]:
# Replace 'Undefined' with 'SC'; all other values are left unchanged
data_clean['meal'] = data_clean['meal'].replace('Undefined', 'SC')
data_clean['meal'].value_counts(ascending=False)

BB    91863
HB    14434
SC    11803
FB      798
Name: meal, dtype: int64

#### Country of origin. Categories are represented as three-letter ISO 3166-1 alpha-3 country codes

In [21]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  country 
 PRT    48586
GBR    12129
FRA    10415
ESP     8568
DEU     7287
       ...  
MRT        1
CYM        1
HND        1
FJI        1
BDI        1
Name: country, Length: 177, dtype: int64


#### Market segment designation.

In [22]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  market_segment 
 Online TA        56402
Offline TA/TO    24160
Groups           19806
Direct           12448
Corporate         5111
Complementary      734
Aviation           237
Name: market_segment, dtype: int64


#### Value indicating if the booking name was from a repeated guest (1) or not (0).

In [23]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  is_repeated_guest 
 0    115092
1      3806
Name: is_repeated_guest, dtype: int64


#### Code of room type reserved. 
Code is presented instead of designation for anonymity reasons.

In [24]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  reserved_room_type 
 A    85601
D    19173
E     6497
F     2890
G     2083
B     1114
C      931
H      601
L        6
P        2
Name: reserved_room_type, dtype: int64


#### Code for the type of room assigned to the booking. 
Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons.

In [25]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  assigned_room_type 
 A    73863
D    25166
E     7738
F     3732
G     2539
C     2354
B     2159
H      708
I      357
K      279
P        2
L        1
Name: assigned_room_type, dtype: int64


#### Number of days the booking was in the waiting list before it was confirmed to the customer.

In [26]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  days_in_waiting_list 
 0      115200
39        227
58        164
44        141
31        127
        ...  
175         1
117         1
89          1
92          1
183         1
Name: days_in_waiting_list, Length: 128, dtype: int64


#### Type of booking
assuming one of four categories:
- **Contract** - when the booking has an allotment or other type of contract associated to it;
- **Group** – when the booking is associated to a group;
- **Transient** – when the booking is not part of a group or contract, and is not associated to other transient booking;
- **Transient-party** – when the booking is transient, but is associated to at least other transient booking.

In [27]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))
idx_column += 1

Column:  customer_type 
 Transient          89174
Transient-Party    25078
Contract            4076
Group                570
Name: customer_type, dtype: int64


#### Average Daily Rate (adr) as defined by dividing the sum of all lodging transactions by the total number of staying nights.

In [28]:
print('Column: ', data_clean.columns[idx_column], '\n', data_clean[data_clean.columns[idx_column]].value_counts(ascending=False))

Column:  adr 
 62.00     3753
75.00     2710
90.00     2471
65.00     2397
0.00      1938
          ... 
202.74       1
87.64        1
69.83        1
160.83       1
35.64        1
Name: adr, Length: 8870, dtype: int64


It is interesting to see the value 0 appears 1938 times, because as a rate measure **adr** is supposed to be above 0. Let's take a further look to see if there are more bizarre **adr** values.

In [29]:
data_clean[data_clean['adr']<=0]['adr'].value_counts()

 0.00    1938
-6.38       1
Name: adr, dtype: int64

In [30]:
data_clean[data_clean['adr']<=0]['is_canceled'].value_counts()

0    1736
1     203
Name: is_canceled, dtype: int64

In [31]:
1939 / data_clean.shape[0] * 100

1.6308096015071745

We can see that 1938 entries have an **adr** of 0 and 1 entry has a negative **adr** of -6.38. Most of these bookings (1736 of 1939) were not canceled, so the values cannot be explained by cancellation. Because these entries account for only 1.63% of the total data, we drop them in the next step.
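
The three checks above (count, cancellation breakdown, share of the data) can be derived from a single boolean mask. This is a minimal sketch on a toy frame with hypothetical values; the mask name `bad_adr` is an assumption for illustration.

```python
import pandas as pd

# Toy frame standing in for data_clean (hypothetical values).
toy = pd.DataFrame({
    "adr": [62.0, 0.0, -6.38, 75.0, 0.0],
    "is_canceled": [0, 0, 0, 1, 1],
})

# One mask marks every booking with a non-positive rate.
bad_adr = toy["adr"] <= 0

share = bad_adr.mean() * 100                                   # percent of rows affected
canceled_breakdown = toy.loc[bad_adr, "is_canceled"].value_counts()
kept = toy[~bad_adr]                                           # rows that survive the filter

print(round(share, 1))
print(canceled_breakdown)
print(len(kept))
```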

### 4. Further data cleaning, prepare the data for the subsequent analyses
#### Information integration
- Integrate **adults**, **children** into a new column **guest_number**.
- Integrate **stays_in_weekend_nights**, **stays_in_week_nights** into a new column **stay_length**.

Then delete the **adults**, **children**, **stays_in_weekend_nights** and **stays_in_week_nights**.
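
A minimal sketch of this integration step on toy data (hypothetical values) is given below. Note that `astype` returns a new Series rather than modifying the column in place, so the cast only takes effect if the result is assigned back.

```python
import pandas as pd

# Toy frame standing in for data_clean (hypothetical values).
toy = pd.DataFrame({
    "adults": [2, 1, 0],
    "children": [0.0, 2.0, 0.0],
    "stays_in_weekend_nights": [2, 0, 1],
    "stays_in_week_nights": [3, 1, 0],
})

# astype() returns a new Series, so assign the result back to the column.
toy["children"] = toy["children"].astype("int64")

# Build the combined columns, then drop their components.
toy["guest_number"] = toy["adults"] + toy["children"]
toy["stay_length"] = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]
toy = toy.drop(columns=["adults", "children",
                        "stays_in_weekend_nights", "stays_in_week_nights"])

print(toy)
print(toy.dtypes)
```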

In [32]:
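# NOTE: astype() returns a new Series; because the result is not assigned back, this line has
# no effect and guest_number stays float64 (as the recorded output below shows).
data_clean['children'].astype('int64')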
data_clean['guest_number'] = data_clean['adults'] + data_clean['children']
data_clean['stay_length'] = data_clean['stays_in_weekend_nights'] + data_clean['stays_in_week_nights']
data_clean.drop(['adults', 'children', 'stays_in_weekend_nights', 'stays_in_week_nights'], axis=1, inplace=True)
data_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 118898 entries, 0 to 119389
Data columns (total 17 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   hotel                      118898 non-null  object 
 1   is_canceled                118898 non-null  int64  
 2   lead_time                  118898 non-null  int64  
 3   arrival_date_year          118898 non-null  int64  
 4   arrival_date_month         118898 non-null  object 
 5   arrival_date_day_of_month  118898 non-null  int64  
 6   meal                       118898 non-null  object 
 7   country                    118898 non-null  object 
 8   market_segment             118898 non-null  object 
 9   is_repeated_guest          118898 non-null  int64  
 10  reserved_room_type         118898 non-null  object 
 11  assigned_room_type         118898 non-null  object 
 12  days_in_waiting_list       118898 non-null  int64  
 13  customer_type              11

Check the values in **guest_number**.

In [33]:
data_clean['guest_number'].value_counts(ascending=False)

2.0     82594
1.0     22299
3.0      9919
4.0      3797
0.0       170
5.0       104
26.0        5
27.0        2
20.0        2
55.0        1
6.0         1
12.0        1
10.0        1
50.0        1
40.0        1
Name: guest_number, dtype: int64

There are 170 entries where the **guest_number** is 0, which is abnormal. We drop these entries.

In [34]:
data_clean.drop(data_clean[data_clean['guest_number']==0].index, inplace=True)
data_clean['guest_number'].value_counts(ascending=False)

2.0     82594
1.0     22299
3.0      9919
4.0      3797
5.0       104
26.0        5
27.0        2
20.0        2
55.0        1
6.0         1
12.0        1
10.0        1
50.0        1
40.0        1
Name: guest_number, dtype: int64

Check the values in **stay_length**.

In [35]:
data_clean['stay_length'].value_counts(ascending=False)

2     27518
3     27013
1     20781
4     17353
7      8626
5      7752
6      3839
8      1151
10     1131
14      910
9       837
0       640
11      392
12      220
13      139
21       71
15       71
16       40
25       37
18       35
28       34
19       22
17       20
20       14
29       13
30       13
22       13
23        8
24        6
26        6
35        5
27        4
42        4
33        3
56        2
48        1
34        1
38        1
45        1
46        1
Name: stay_length, dtype: int64

In [36]:
data_clean[data_clean['stay_length']==0]['is_canceled'].value_counts()

0    617
1     23
Name: is_canceled, dtype: int64

There are 640 entries whose **stay_length** is 0, and most of them (617) were not canceled, so cancellation does not explain the zero value. Therefore we drop these entries.

In [37]:
data_clean.drop(data_clean[data_clean['stay_length']==0].index, inplace=True)
data_clean['stay_length'].value_counts(ascending=False)

2     27518
3     27013
1     20781
4     17353
7      8626
5      7752
6      3839
8      1151
10     1131
14      910
9       837
11      392
12      220
13      139
21       71
15       71
16       40
25       37
18       35
28       34
19       22
17       20
20       14
22       13
29       13
30       13
23        8
24        6
26        6
35        5
27        4
42        4
33        3
56        2
48        1
34        1
38        1
45        1
46        1
Name: stay_length, dtype: int64

#### Clean the column ***adr***
From the last section we learned that there are **adr** outliers (values less than or equal to 0). We now drop the entries with such values that remain in the data (1939 were counted before the drops above).

In [38]:
data_clean.drop(data_clean[data_clean['adr']<=0].index, inplace=True)

#### Combine the date data & focus on the data between 1 July 2015 and 30 June 2017

In [39]:
# Convert the month data to numeric format.
month_map = {'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5, 'June': 6, 'July': 7, 'August': 8, \
            'September': 9, 'October': 10, 'November': 11, 'December': 12}
data_clean['arrival_date_month'] = data_clean['arrival_date_month'].map(month_map)

# Combine the columns representing the year, month and day of the month into a single column
data_clean['arrival_date_year'] = data_clean['arrival_date_year'].astype('string')
data_clean['arrival_date_month'] = data_clean['arrival_date_month'].astype('string')
data_clean['arrival_date_day_of_month'] = data_clean['arrival_date_day_of_month'].astype('string')
data_clean['date'] = data_clean['arrival_date_year'] + '/' + data_clean['arrival_date_month'] + '/' + \
                    data_clean[ 'arrival_date_day_of_month']

def convert_dt(cell):
    tmp = cell.split('/')
    tmp2 = [int(tt) for tt in tmp]
    return dt.datetime(tmp2[0], tmp2[1], tmp2[2])

data_clean['date'] = data_clean['date'].apply(convert_dt)
data_clean['date'].head()

# Limit the analysis to exactly two years: from 1 July 2015 to 30 June 2017
data_clean.drop(data_clean[data_clean['date']>dt.datetime(2017, 6, 30)].index, inplace=True)

# Now the data span two whole years: 1 July 2015 to 30 June 2016, and 1 July 2016 to 30 June 2017.
# To make the subsequent analyses clearer, we assign '1' to the column 'arrival_date_year' for the first year
# and '2' for the second year. Use .loc to avoid chained assignment, which may not modify data_clean.
first_year = (data_clean['date']>=dt.datetime(2015, 7, 1)) & (data_clean['date']<=dt.datetime(2016, 6, 30))
second_year = (data_clean['date']>=dt.datetime(2016, 7, 1)) & (data_clean['date']<=dt.datetime(2017, 6, 30))
data_clean.loc[first_year, 'arrival_date_year'] = '1'
data_clean.loc[second_year, 'arrival_date_year'] = '2'
del month_map
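
As an alternative to building date strings and parsing them by hand, `pd.to_datetime` can assemble dates directly from columns named `year`, `month`, and `day`. The sketch below shows this on a toy frame with hypothetical rows; the column name `period` is an assumption used here instead of overwriting `arrival_date_year`.

```python
import pandas as pd

# Toy frame standing in for data_clean (hypothetical rows).
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016, 2017, 2017],
    "arrival_date_month": ["July", "June", "June", "July"],
    "arrival_date_day_of_month": [1, 30, 30, 1],
})

month_names = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]
month_map = {name: number for number, name in enumerate(month_names, start=1)}

# pd.to_datetime assembles dates from columns named year, month, and day.
toy["date"] = pd.to_datetime(pd.DataFrame({
    "year": toy["arrival_date_year"],
    "month": toy["arrival_date_month"].map(month_map),
    "day": toy["arrival_date_day_of_month"],
}))

# Keep 1 July 2015 through 30 June 2017 (both ends inclusive).
in_window = toy["date"].between(pd.Timestamp(2015, 7, 1), pd.Timestamp(2017, 6, 30))
toy = toy[in_window].copy()

# Label the two 12-month periods: '1' before 1 July 2016, '2' from then on.
toy["period"] = (toy["date"] >= pd.Timestamp(2016, 7, 1)).map({False: "1", True: "2"})
print(toy[["date", "period"]])
```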

#### Separate the data from the resort hotel and those from the city hotel
Because our dataset contains data from two different types of hotels that have no connection to each other, we are going to analyze each of them separately. Now that the data cleaning and preparation are finished, let's reset the index of the resulting datasets and save them for the subsequent analyses.
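
The split can also be expressed with `groupby`, which generalizes if more hotels were ever added. This is a sketch on a toy frame with hypothetical values; `per_hotel` is an assumed name, not one used in the notebook.

```python
import pandas as pd

# Toy frame with a non-contiguous index, as left behind by earlier row drops.
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel"],
    "adr": [75.0, 98.0, 62.0],
}, index=[3, 7, 9])

# One frame per hotel, without the now-constant 'hotel' column and with a fresh index.
per_hotel = {
    name: group.drop(columns="hotel").reset_index(drop=True)
    for name, group in toy.groupby("hotel")
}

print(per_hotel["City Hotel"])
```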

In [40]:
resort_h = data_clean[data_clean['hotel']=='Resort Hotel']    # The data of the resort hotel
city_h = data_clean[data_clean['hotel']=='City Hotel']        # The data of the city hotel
data_clean = data_clean.reset_index(drop=True)
resort_h = resort_h.reset_index(drop=True).drop('hotel', axis=1)
city_h = city_h.reset_index(drop=True).drop('hotel', axis=1)
with pd.ExcelWriter('all_bookings_clean.xlsx') as writer:  
    data_clean.to_excel(writer, sheet_name='data')
    resort_h.to_excel(writer, sheet_name='rh')
    city_h.to_excel(writer, sheet_name='ch')

The data loss rates are shown below.

In [41]:
data_clean['hotel'].value_counts()

City Hotel      71430
Resort Hotel    35333
Name: hotel, dtype: int64

In [42]:
data['hotel'].value_counts()

City Hotel      79330
Resort Hotel    40060
Name: hotel, dtype: int64

In [43]:
print('Total data lost: {0:.1f}%'.format((data.shape[0] - data_clean.shape[0]) / data.shape[0] * 100))
print('City Hotel data lost: {0:.1f}%'.format((79330-71430) / 79330 * 100))
print('Resort Hotel data lost: {0:.1f}%'.format((40060-35333) / 40060 * 100))

Total data lost: 10.6%
City Hotel data lost: 10.0%
Resort Hotel data lost: 11.8%
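
The per-hotel loss rates can also be computed from the two `value_counts()` results directly, so that each rate is labeled by the hotel it belongs to. The sketch below hard-codes the counts from the outputs above; on the real data the two Series would come from `data['hotel'].value_counts()` and `data_clean['hotel'].value_counts()`.

```python
import pandas as pd

# Counts copied from the value_counts() outputs above.
raw_counts = pd.Series({"City Hotel": 79330, "Resort Hotel": 40060})
clean_counts = pd.Series({"City Hotel": 71430, "Resort Hotel": 35333})

# Index alignment pairs each hotel with its own counts.
loss_pct = (raw_counts - clean_counts) / raw_counts * 100
total_loss_pct = (raw_counts.sum() - clean_counts.sum()) / raw_counts.sum() * 100

print(loss_pct.round(1))
print(round(total_loss_pct, 1))
```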
