### Hotel Booking Analysis 

About dataset: https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib5

Key business questions answered:

1. Where do guests come from?
2. Resort Hotel vs City Hotel traffic?
3. How much, on average, are guests paying for a room per night?
4. How do the daily prices vary over the year?
5. What is the monthly traffic (busiest month)?
6. How long do guests stay in the hotels?
7. Bookings per market segment?
    
Tableau dashboard: https://public.tableau.com/app/profile/ikenna4609/viz/HotelBooking_16377872125360/Dashboard1?publish=yes

In [36]:
#read dataset 
import pandas as pd
import os

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


input_path = 'C:/Users/ikennan/Downloads/Datasets/'
output_path = 'C:/Users/ikennan/Downloads/Datasets/output/hotel bookings'

bookings = pd.read_csv(os.path.join(input_path, 'hotel_bookings.csv'))
bookings.head(10)

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03
5,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03
6,Resort Hotel,0,0,2015,July,27,1,0,2,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,0,No Deposit,,,0,Transient,107.0,0,0,Check-Out,2015-07-03
7,Resort Hotel,0,9,2015,July,27,1,0,2,2,0.0,0,FB,PRT,Direct,Direct,0,0,0,C,C,0,No Deposit,303.0,,0,Transient,103.0,0,1,Check-Out,2015-07-03
8,Resort Hotel,1,85,2015,July,27,1,0,3,2,0.0,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,82.0,0,1,Canceled,2015-05-06
9,Resort Hotel,1,75,2015,July,27,1,0,3,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,0,D,D,0,No Deposit,15.0,,0,Transient,105.5,0,0,Canceled,2015-04-22


In [37]:
#check for missing values
bookings.shape

bookings.isna().sum()

hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company         


#### Replace missing values 
Missing values identified in the following fields:
1. country 2. agent 3. company 4. children

The property management system (PMS) assured no missing data exists in its database tables. 
However, in some categorical variables like Agent or Company, “NULL” is presented as one of the categories. 
This should not be considered a missing value, but rather “not applicable”. 
For example, if a booking's “Agent” is “NULL”, the booking did not come from a travel agent.
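The replacement rule described above can be sketched on a small, made-up frame. This is only an illustration under assumptions: the `toy` data is hypothetical, and casting the ID columns to `object` before filling is a precaution added here, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the four columns with gaps.
toy = pd.DataFrame({
    "country": ["PRT", None, "GBR"],
    "children": [0.0, np.nan, 1.0],
    "agent": [240.0, np.nan, 15.0],
    "company": [np.nan, np.nan, 40.0],
})

# Cast the ID columns to object so a text label can sit next to numeric IDs.
toy = toy.astype({"agent": object, "company": object})

# One fillna call with a column -> value mapping, assigned back to the name.
toy = toy.fillna({
    "country": "UNKNOWN",   # origin not recorded
    "children": 0,          # treat a missing count as no children
    "agent": "N/A",         # "not applicable": no travel agent involved
    "company": "N/A",       # "not applicable": no company involved
})

remaining_gaps = int(toy.isna().sum().sum())
```

Passing a mapping keeps all replacement values in one place, which makes it easy to see that "not applicable" and "unknown" are labelled differently.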


In [38]:
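#fill na (assign the result back; inplace=True on a selected column may not update the DataFrame in newer pandas versions)

bookings['country'] = bookings['country'].fillna('UNKNOWN')
bookings['children'] = bookings['children'].fillna(0)
bookings['agent'] = bookings['agent'].fillna('N/A')
bookings['company'] = bookings['company'].fillna('N/A')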

bookings.isna().sum()

hotel                             0
is_canceled                       0
lead_time                         0
arrival_date_year                 0
arrival_date_month                0
arrival_date_week_number          0
arrival_date_day_of_month         0
stays_in_weekend_nights           0
stays_in_week_nights              0
adults                            0
children                          0
babies                            0
meal                              0
country                           0
market_segment                    0
distribution_channel              0
is_repeated_guest                 0
previous_cancellations            0
previous_bookings_not_canceled    0
reserved_room_type                0
assigned_room_type                0
booking_changes                   0
deposit_type                      0
agent                             0
company                           0
days_in_waiting_list              0
customer_type                     0
adr                         

In [39]:
bookings.head(10)

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03
5,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03
6,Resort Hotel,0,0,2015,July,27,1,0,2,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,0,No Deposit,,,0,Transient,107.0,0,0,Check-Out,2015-07-03
7,Resort Hotel,0,9,2015,July,27,1,0,2,2,0.0,0,FB,PRT,Direct,Direct,0,0,0,C,C,0,No Deposit,303.0,,0,Transient,103.0,0,1,Check-Out,2015-07-03
8,Resort Hotel,1,85,2015,July,27,1,0,3,2,0.0,0,BB,PRT,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,82.0,0,1,Canceled,2015-05-06
9,Resort Hotel,1,75,2015,July,27,1,0,3,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,0,D,D,0,No Deposit,15.0,,0,Transient,105.5,0,0,Canceled,2015-04-22


In [40]:
bookings.describe()

Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,booking_changes,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests
count,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0
mean,0.370416,104.011416,2016.156554,27.165173,15.798241,0.927599,2.500302,1.856403,0.103886,0.007949,0.031912,0.087118,0.137097,0.221124,2.321149,101.831122,0.062518,0.571363
std,0.482918,106.863097,0.707476,13.605138,8.780829,0.998613,1.908286,0.579261,0.398555,0.097436,0.175767,0.844336,1.497437,0.652306,17.594721,50.53579,0.245291,0.792798
min,0.0,0.0,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-6.38,0.0,0.0
25%,0.0,18.0,2016.0,16.0,8.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,69.29,0.0,0.0
50%,0.0,69.0,2016.0,28.0,16.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,94.575,0.0,0.0
75%,1.0,160.0,2017.0,38.0,23.0,2.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,126.0,0.0,1.0
max,1.0,737.0,2017.0,53.0,31.0,19.0,50.0,55.0,10.0,10.0,1.0,26.0,72.0,21.0,391.0,5400.0,8.0,5.0


In [41]:
# Drop bookings with no adults, no children and no babies (no guests at all)
bookings = bookings.loc[(bookings.adults != 0) | (bookings.children != 0) | (bookings.babies != 0)]
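An equivalent way to express the "no guests" filter is to add up the party size and keep rows where it is positive. The sketch below uses a hypothetical `toy` frame and assumes the three counts are never negative.

```python
import pandas as pd

# Hypothetical bookings: row 1 has nobody on it and should be dropped.
toy = pd.DataFrame({
    "adults":   [2, 0, 0, 1],
    "children": [0.0, 0.0, 2.0, 0.0],
    "babies":   [0, 0, 0, 1],
})

# Total head count per booking; zero means the booking has no guests.
party_size = toy[["adults", "children", "babies"]].sum(axis=1)
with_guests = toy.loc[party_size > 0]

dropped = len(toy) - len(with_guests)
```

Counting `dropped` gives a quick sanity check on how many rows the filter removed.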

In [42]:
bookings.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


In [44]:
bookings.describe()

Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,booking_changes,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests
count,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0,119210.0
mean,0.370766,104.109227,2016.156472,27.163376,15.798717,0.927053,2.499195,1.859206,0.104043,0.007961,0.031499,0.087191,0.137094,0.218799,2.321215,101.969092,0.062553,0.571504
std,0.483012,106.87545,0.707485,13.601107,8.78107,0.995117,1.897106,0.575186,0.398836,0.097509,0.174663,0.844918,1.498137,0.638504,17.598002,50.434007,0.24536,0.792876
min,0.0,0.0,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-6.38,0.0,0.0
25%,0.0,18.0,2016.0,16.0,8.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,69.5,0.0,0.0
50%,0.0,69.0,2016.0,28.0,16.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,94.95,0.0,0.0
75%,1.0,161.0,2017.0,38.0,23.0,2.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,126.0,0.0,1.0
max,1.0,737.0,2017.0,53.0,31.0,19.0,50.0,55.0,10.0,10.0,1.0,26.0,72.0,18.0,391.0,5400.0,8.0,5.0


The source data article describes two datasets with hotel demand data. 
One of the hotels (H1) is a resort hotel and the other (H2) is a city hotel.

In [9]:
#separate resort hotel from city hotel 

bookings.hotel.value_counts()

City Hotel      79163
Resort Hotel    40047
Name: hotel, dtype: int64

In [10]:
rh = bookings.loc[bookings.hotel == 'Resort Hotel'] 
ch = bookings.loc[bookings.hotel == 'City Hotel'] 

In [11]:
rh.describe()

Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,booking_changes,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests
count,40047.0,40047.0,40047.0,40047.0,40047.0,40047.0,40047.0,40047.0,40047.0,40047.0,40047.0,40047.0,40047.0,40047.0,40047.0,40047.0,40047.0,40047.0
mean,0.277674,92.69381,2016.121482,27.139636,15.819437,1.189827,3.128549,1.867755,0.128724,0.013909,0.044398,0.101755,0.146503,0.287562,0.521837,94.983054,0.138088,0.619972
std,0.447857,97.290559,0.722288,14.00309,8.883495,1.147849,2.461146,0.696587,0.445261,0.119017,0.20598,1.335331,1.002114,0.724918,7.380019,61.429486,0.351024,0.813985
min,0.0,0.0,2015.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-6.38,0.0,0.0
25%,0.0,10.0,2016.0,16.0,8.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,50.0,0.0,0.0
50%,0.0,57.0,2016.0,28.0,16.0,1.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,75.0,0.0,0.0
75%,1.0,155.0,2017.0,38.0,24.0,2.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,125.0,0.0,1.0
max,1.0,737.0,2017.0,53.0,31.0,19.0,50.0,55.0,10.0,2.0,1.0,26.0,30.0,17.0,185.0,508.0,8.0,5.0


In [12]:
ch.describe()

Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,booking_changes,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests
count,79163.0,79163.0,79163.0,79163.0,79163.0,79163.0,79163.0,79163.0,79163.0,79163.0,79163.0,79163.0,79163.0,79163.0,79163.0,79163.0,79163.0,79163.0
mean,0.417859,109.884062,2016.174172,27.175385,15.788234,0.794121,2.180817,1.854882,0.091558,0.004952,0.024974,0.079823,0.132335,0.184013,3.231484,105.503191,0.024342,0.546985
std,0.49321,110.964784,0.699216,13.393231,8.728835,0.878689,1.433095,0.502676,0.372537,0.084412,0.156046,0.415744,1.694625,0.586932,20.888723,43.407605,0.154846,0.78084
min,0.0,0.0,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,23.0,2016.0,17.0,8.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,79.2,0.0,0.0
50%,0.0,74.0,2016.0,27.0,16.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,99.96,0.0,0.0
75%,1.0,164.0,2017.0,38.0,23.0,2.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,126.0,0.0,1.0
max,1.0,629.0,2017.0,53.0,31.0,14.0,34.0,4.0,3.0,10.0,1.0,21.0,72.0,18.0,391.0,5400.0,3.0,5.0


#### Q1. Where do guests come from? 

#### - summarize per country and count

In [13]:
#hotel guests' country 

rh_guest_country = rh.loc[rh.is_canceled == 0].country.value_counts().rename_axis('Country').reset_index(name='Count')
rh_guest_country.to_excel(os.path.join(output_path, 'rh_guest_country.xlsx'), index=False)

ch_guest_country = ch.loc[ch.is_canceled == 0].country.value_counts().rename_axis('Country').reset_index(name='Count')
ch_guest_country.to_excel(os.path.join(output_path, 'ch_guest_country.xlsx'), index=False)

ch_guest_country.head()

Unnamed: 0,Country,Count
0,PRT,10793
1,FRA,7069
2,DEU,5010
3,GBR,3746
4,ESP,3278
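Raw counts are easier to compare across the two hotels when a share column sits next to them. A minimal sketch on made-up data follows; the `Share_%` column is an addition for illustration and is not part of the exported files.

```python
import pandas as pd

# Hypothetical bookings; only non-cancelled rows count as guests.
toy = pd.DataFrame({
    "country":     ["PRT", "PRT", "FRA", "DEU", "PRT", "PRT"],
    "is_canceled": [0, 0, 0, 1, 1, 0],
})

arrived = toy.loc[toy["is_canceled"] == 0, "country"]
counts = arrived.value_counts()  # sorted from most to least frequent

guest_country = pd.DataFrame({
    "Country": counts.index,
    "Count": counts.to_numpy(),
    "Share_%": (counts / counts.sum() * 100).round(1).to_numpy(),
})
```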


#### Q2. Resort Hotel vs City Hotel Traffic

#### - compare RH to CH guest count 
#### - identify percentage of cancellations for both hotels

In [14]:
#actual guests 
actual_guests = bookings.loc[bookings.is_canceled == 0].hotel.value_counts().rename_axis('Hotel').reset_index(name='Count')
#actual_guests.to_csv(os.path.join(output_path, 'actual_guests.csv'))

#expected guests
expected_guests = bookings.hotel.value_counts().rename_axis('Hotel').reset_index(name='Count')

#join expected and actual 
guest_count = pd.merge(expected_guests, actual_guests, on='Hotel', suffixes=('_Expected', '_Actual'))

#compute cancellation percentage
guest_count['Cancellation_Percentage'] = ((guest_count.Count_Expected - guest_count.Count_Actual)/guest_count.Count_Expected)*100

guest_count.to_excel(os.path.join(output_path, 'guest_count.xlsx'), index=False)
guest_count

Unnamed: 0,Hotel,Count_Expected,Count_Actual,Cancellation_Percentage
0,City Hotel,79163,46084,41.785935
1,Resort Hotel,40047,28927,27.767373
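The same expected/actual/cancellation figures can also be produced in one pass with a grouped aggregation, since `is_canceled` is a 0/1 flag. This is an alternative sketch on hypothetical data, not the approach used in the cell above; the `Cancelled` column name is an assumption.

```python
import pandas as pd

# Hypothetical bookings: 5 for the city hotel (2 cancelled), 4 for the resort (1 cancelled).
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 4,
    "is_canceled": [1, 1, 0, 0, 0, 1, 0, 0, 0],
})

# size = all bookings, sum of the 0/1 flag = cancelled bookings.
summary = toy.groupby("hotel")["is_canceled"].agg(Count_Expected="size", Cancelled="sum")
summary["Count_Actual"] = summary["Count_Expected"] - summary["Cancelled"]
summary["Cancellation_Percentage"] = summary["Cancelled"] / summary["Count_Expected"] * 100
```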


#### Q3. How much, on average, are guests paying for a room per night?

#### - divide average daily rate by the number of paying guests
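One detail worth keeping in mind with this division: a booking that has only babies on it has zero paying guests, so the rate would be divided by zero. The sketch below shows one possible guard on hypothetical data; it is an assumption about how such rows could be handled, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical bookings; the last row has no paying guests.
toy = pd.DataFrame({
    "adr":      [100.0, 90.0, 60.0],
    "adults":   [2, 1, 0],
    "children": [0.0, 2.0, 0.0],
})

paying_guests = toy["adults"] + toy["children"]

# Replace a zero divisor with NaN so the result is NaN rather than inf.
toy["adr_per_guest"] = toy["adr"] / paying_guests.replace(0, np.nan)

# mean() skips NaN by default, so the guarded row does not distort the average.
avg_guest_pymt = toy["adr_per_guest"].mean()
```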

In [15]:
#extract all non-cancelled bookings for CH (.copy() gives an independent frame, so adding columns does not raise SettingWithCopyWarning)
ch_non_cancelled = ch.loc[ch.is_canceled==0].copy()

#paying guests = adults + children (CH)
ch_non_cancelled['adr_per_guest'] = (ch_non_cancelled.adr/(ch_non_cancelled.adults + ch_non_cancelled.children))
ch_avg_guest_pymt = ch_non_cancelled.adr_per_guest.mean()
ch_avg_guest_pymt


#average daily room rate
ch_avg_room_rate = ch_non_cancelled.adr.mean()
ch_avg_room_rate


106.03614117697545

In [16]:
#extract all non-cancelled bookings for RH (.copy() gives an independent frame, so adding columns does not raise SettingWithCopyWarning)
rh_non_cancelled = rh.loc[rh.is_canceled==0].copy()

#paying guests = adults + children (RH)
rh_non_cancelled['adr_per_guest'] = (rh_non_cancelled.adr / (rh_non_cancelled.adults + rh_non_cancelled.children))
rh_avg_guest_pymt = rh_non_cancelled.adr_per_guest.mean()
rh_avg_guest_pymt

#average daily room rate
rh_avg_room_rate = rh_non_cancelled.adr.mean()
rh_avg_room_rate


90.82252670515652

In [17]:
avg_guest_pymts = pd.DataFrame({'Hotel':['City','Resort'], 'Avg_Guest_Pymt':[ch_avg_guest_pymt, rh_avg_guest_pymt], 'Avg_Room_Rate':[ch_avg_room_rate, rh_avg_room_rate]})
avg_guest_pymts.to_excel(os.path.join(output_path, 'avg_guest_pymt.xlsx'), index=False)
avg_guest_pymts

Unnamed: 0,Hotel,Avg_Guest_Pymt,Avg_Room_Rate
0,City,59.272988,106.036141
1,Resort,47.488866,90.822527


In [18]:
#RH rooms stats for paying guests
rh_room_stats = rh_non_cancelled.groupby('assigned_room_type').adr_per_guest.describe().rename_axis('assigned_room_type')
rh_room_stats

assigned_room_type,count,mean,std,min,25%,50%,75%,max
A,10996.0,46.111934,26.518271,0.0,28.0,40.0,58.25375,225.0
B,150.0,55.923589,31.191622,0.0,35.0,51.495,69.475,150.0
C,1781.0,45.392404,25.818368,0.0,27.72,40.0,57.733333,254.0
D,8249.0,44.567274,26.045225,0.0,27.2,37.8,54.65,248.0
E,4210.0,53.031747,28.274232,0.0,34.0,45.9,66.0,254.0
F,1525.0,59.083839,31.248179,0.0,37.185,52.3,76.88,231.69
G,1201.0,54.811204,29.729641,0.0,34.0,50.0,69.283333,201.0
H,461.0,50.723876,25.490082,-3.19,32.0,47.0,67.5,137.0
I,354.0,20.387745,31.365546,0.0,0.0,0.0,35.75875,212.8


In [19]:
#CH rooms stats for paying guests
ch_room_stats = ch_non_cancelled.groupby('assigned_room_type').adr_per_guest.describe().rename_axis('assigned_room_type')
ch_room_stats

assigned_room_type,count,mean,std,min,25%,50%,75%,max
A,30081.0,59.123993,26.356967,0.0,42.5,54.0,69.5,251.0
B,1493.0,54.204633,29.891499,0.0,38.5,47.85,62.0,206.0
C,143.0,59.102238,35.562489,0.0,40.2125,53.775,70.5,199.0
D,10698.0,58.983686,26.680532,0.0,42.5,54.0,69.5,228.0
E,1626.0,70.1664,36.342973,0.0,47.25,62.666667,83.5,228.53
F,1299.0,57.867544,30.449076,0.0,43.66,51.975,64.346667,270.0
G,568.0,63.261562,44.189569,0.0,45.15375,57.5,73.756667,510.0
K,176.0,42.317363,39.503602,0.0,0.0,44.56,64.375,159.0


In [20]:
#outer join both tables (rh & ch room stats) to compare rooms and average guest rate
room_stats = pd.merge(rh_room_stats, ch_room_stats, on='assigned_room_type', how='outer', suffixes=('_RH', '_CH'))
room_stats

assigned_room_type,count_RH,mean_RH,std_RH,min_RH,25%_RH,50%_RH,75%_RH,max_RH,count_CH,mean_CH,std_CH,min_CH,25%_CH,50%_CH,75%_CH,max_CH
A,10996.0,46.111934,26.518271,0.0,28.0,40.0,58.25375,225.0,30081.0,59.123993,26.356967,0.0,42.5,54.0,69.5,251.0
B,150.0,55.923589,31.191622,0.0,35.0,51.495,69.475,150.0,1493.0,54.204633,29.891499,0.0,38.5,47.85,62.0,206.0
C,1781.0,45.392404,25.818368,0.0,27.72,40.0,57.733333,254.0,143.0,59.102238,35.562489,0.0,40.2125,53.775,70.5,199.0
D,8249.0,44.567274,26.045225,0.0,27.2,37.8,54.65,248.0,10698.0,58.983686,26.680532,0.0,42.5,54.0,69.5,228.0
E,4210.0,53.031747,28.274232,0.0,34.0,45.9,66.0,254.0,1626.0,70.1664,36.342973,0.0,47.25,62.666667,83.5,228.53
F,1525.0,59.083839,31.248179,0.0,37.185,52.3,76.88,231.69,1299.0,57.867544,30.449076,0.0,43.66,51.975,64.346667,270.0
G,1201.0,54.811204,29.729641,0.0,34.0,50.0,69.283333,201.0,568.0,63.261562,44.189569,0.0,45.15375,57.5,73.756667,510.0
H,461.0,50.723876,25.490082,-3.19,32.0,47.0,67.5,137.0,,,,,,,,
I,354.0,20.387745,31.365546,0.0,0.0,0.0,35.75875,212.8,,,,,,,,
K,,,,,,,,,176.0,42.317363,39.503602,0.0,0.0,44.56,64.375,159.0
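In the outer join above, room types that exist in only one hotel show up as rows with empty cells on the other side. A small sketch with hypothetical statistics shows how `indicator=True` can label those rows explicitly; the numbers and the `only_one_hotel` name are illustrative assumptions.

```python
import pandas as pd

# Hypothetical per-room-type means for each hotel.
rh_stats = pd.DataFrame({"assigned_room_type": ["A", "H"], "mean": [46.1, 50.7]})
ch_stats = pd.DataFrame({"assigned_room_type": ["A", "K"], "mean": [59.1, 42.3]})

# indicator=True adds a _merge column: both, left_only or right_only.
merged = pd.merge(
    rh_stats, ch_stats,
    on="assigned_room_type", how="outer",
    suffixes=("_RH", "_CH"), indicator=True,
)

# Room types present in just one of the two hotels.
only_one_hotel = merged.loc[merged["_merge"] != "both", "assigned_room_type"].tolist()
```

The `_merge` column can be dropped again before exporting if it is only needed as a check.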


In [21]:
room_stats.to_excel(os.path.join(output_path, 'room_stats.xlsx'))

#### Q4. How do the daily prices vary over the year?

#### - determine average monthly prices

In [22]:
rh_price_avg = rh_non_cancelled.groupby('arrival_date_month').adr_per_guest.mean().rename_axis('Month').reset_index(name='Avg_Price')

In [23]:
ch_price_avg = ch_non_cancelled.groupby('arrival_date_month').adr_per_guest.mean().rename_axis('Month').reset_index(name='Avg_Price')

In [24]:
hotel_avg_rates = pd.merge(rh_price_avg, ch_price_avg, on='Month', suffixes=('_RH', '_CH'))
hotel_avg_rates

Unnamed: 0,Month,Avg_Price_RH,Avg_Price_CH
0,April,43.726059,58.715028
1,August,83.322653,57.77163
2,December,37.6663,47.724939
3,February,30.845022,50.950846
4,January,31.169218,51.280071
5,July,70.262366,57.258853
6,June,56.346298,66.335898
7,March,34.10069,52.477652
8,May,42.254335,69.98785
9,November,30.002893,57.142431


In [25]:
#sort months in proper order

sort_order = ['January', 'February', 'March', 'April', 'May', 'June', 
              'July', 'August', 'September', 'October', 'November', 'December']

hotel_avg_rates.index = pd.CategoricalIndex(hotel_avg_rates.Month, categories=sort_order, ordered=True)
hotel_avg_rates = hotel_avg_rates.sort_index().reset_index(drop=True)
hotel_avg_rates.to_excel(os.path.join(output_path, 'avg_price_per_month.xlsx'), index=False)

hotel_avg_rates

Unnamed: 0,Month,Avg_Price_RH,Avg_Price_CH
0,January,31.169218,51.280071
1,February,30.845022,50.950846
2,March,34.10069,52.477652
3,April,43.726059,58.715028
4,May,42.254335,69.98785
5,June,56.346298,66.335898
6,July,70.262366,57.258853
7,August,83.322653,57.77163
8,September,50.372746,67.042091
9,October,35.144775,61.800547
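The calendar ordering can also be done directly on the `Month` column with an ordered categorical, without going through the index. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Hypothetical averages, listed alphabetically as groupby would return them.
toy = pd.DataFrame({
    "Month": ["April", "August", "December", "February"],
    "Avg_Price": [43.7, 83.3, 37.7, 30.8],
})

# An ordered categorical sorts by position in month_order, not alphabetically.
toy["Month"] = pd.Categorical(toy["Month"], categories=month_order, ordered=True)
toy = toy.sort_values("Month").reset_index(drop=True)

sorted_months = toy["Month"].tolist()
```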


#### Q5. What is the monthly traffic - Busiest Month?

#### - summarize by month and sum guests

In [26]:
#assign returns a new DataFrame, which avoids SettingWithCopyWarning
rh_non_cancelled = rh_non_cancelled.assign(total_paying_guest=rh_non_cancelled.adults + rh_non_cancelled.children)
rh_monthly_traffic = rh_non_cancelled.groupby('arrival_date_month').total_paying_guest.sum().rename_axis('Month').reset_index(name='Count')
rh_monthly_traffic

#normalize: divide July and August by 3 years of data, every other month by 2
rh_monthly_traffic['Count_Normalized'] = (rh_monthly_traffic['Count']/3).where(rh_monthly_traffic['Month'].isin(['July','August']), rh_monthly_traffic['Count']/2)
rh_monthly_traffic


Unnamed: 0,Month,Count,Count_Normalized
0,April,4663.0,2331.5
1,August,7413.0,2471.0
2,December,3750.0,1875.0
3,February,4278.0,2139.0
4,January,3168.0,1584.0
5,July,6986.0,2328.666667
6,June,4092.0,2046.0
7,March,4586.0,2293.0
8,May,4815.0,2407.5
9,November,3436.0,1718.0


In [27]:
# Dataset contains July and August for 3 years (2015-2017); every other month appears in 2 years (2015-2016 or 2016-2017).
rh_non_cancelled.groupby(['arrival_date_month', 'arrival_date_year']).total_paying_guest.count()

arrival_date_month  arrival_date_year
April               2016                 1345
                    2017                 1205
August              2015                 1043
                    2016                 1107
                    2017                 1107
December            2015                  959
                    2016                 1055
February            2016                 1113
                    2017                 1195
January             2016                  765
                    2017                 1101
July                2015                 1058
                    2016                  985
                    2017                 1094
June                2016                  993
                    2017                 1044
March               2016                 1409
                    2017                 1162
May                 2016                 1323
                    2017                 1212
November            2015                  
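Instead of hard-coding the divisors 3 and 2, the number of years each month appears in can be read from the data itself. The following is a sketch on hypothetical rows, offered as an alternative to the `where` expression rather than a description of it; the `Years` column is an assumption introduced here.

```python
import pandas as pd

# Hypothetical guest totals: July spans 3 years, May and November 2 each.
toy = pd.DataFrame({
    "arrival_date_month": ["July", "July", "July", "May", "May", "November", "November"],
    "arrival_date_year":  [2015, 2016, 2017, 2016, 2017, 2015, 2016],
    "total_paying_guest": [30.0, 30.0, 30.0, 20.0, 10.0, 8.0, 4.0],
})

monthly = (
    toy.groupby("arrival_date_month")
       .agg(Count=("total_paying_guest", "sum"),
            Years=("arrival_date_year", "nunique"))  # distinct years per month
       .rename_axis("Month")
       .reset_index()
)

# Average guests per year for each calendar month.
monthly["Count_Normalized"] = monthly["Count"] / monthly["Years"]

normalized = dict(zip(monthly["Month"], monthly["Count_Normalized"]))
```

This keeps the normalization correct if the date range of the input file changes.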

In [28]:
#assign returns a new DataFrame, which avoids SettingWithCopyWarning
ch_non_cancelled = ch_non_cancelled.assign(total_paying_guest=ch_non_cancelled.adults + ch_non_cancelled.children)
ch_monthly_traffic = ch_non_cancelled.groupby('arrival_date_month').total_paying_guest.sum().rename_axis('Month').reset_index(name='Count')
ch_monthly_traffic

#normalize: divide July and August by 3 years of data, every other month by 2
ch_monthly_traffic['Count_Normalized'] = (ch_monthly_traffic['Count']/3).where(ch_monthly_traffic['Month'].isin(['July', 'August']), ch_monthly_traffic['Count']/2)
ch_monthly_traffic


Unnamed: 0,Month,Count,Count_Normalized
0,April,8130.0,4065.0
1,August,11567.0,3855.666667
2,December,4680.0,2340.0
3,February,5693.0,2846.5
4,January,3973.0,1986.5
5,July,10239.0,3413.0
6,June,8374.0,4187.0
7,March,7606.0,3803.0
8,May,8566.0,4283.0
9,November,4546.0,2273.0


In [29]:
#merge and sort 

hotel_monthly_traffic = pd.merge(rh_monthly_traffic,ch_monthly_traffic, on='Month', suffixes=('_RH', '_CH'))
hotel_monthly_traffic

hotel_monthly_traffic.index = pd.CategoricalIndex(hotel_monthly_traffic.Month, categories=sort_order, ordered=True)
hotel_monthly_traffic = hotel_monthly_traffic.sort_index().reset_index(drop=True)
hotel_monthly_traffic.to_excel(os.path.join(output_path, 'hotel_monthly_traffic.xlsx'), index=False)

hotel_monthly_traffic

Unnamed: 0,Month,Count_RH,Count_Normalized_RH,Count_CH,Count_Normalized_CH
0,January,3168.0,1584.0,3973.0,1986.5
1,February,4278.0,2139.0,5693.0,2846.5
2,March,4586.0,2293.0,7606.0,3803.0
3,April,4663.0,2331.5,8130.0,4065.0
4,May,4815.0,2407.5,8566.0,4283.0
5,June,4092.0,2046.0,8374.0,4187.0
6,July,6986.0,2328.666667,10239.0,3413.0
7,August,7413.0,2471.0,11567.0,3855.666667
8,September,4143.0,2071.5,7857.0,3928.5
9,October,4769.0,2384.5,7897.0,3948.5


#### Q6. How long do guests stay in the hotels?

#### - summarize by stay-ins and count

In [30]:
#total nights = weekend nights + week nights (assign returns a new DataFrame, which avoids SettingWithCopyWarning)
rh_non_cancelled = rh_non_cancelled.assign(stay_ins=rh_non_cancelled.stays_in_weekend_nights + rh_non_cancelled.stays_in_week_nights)

#count() counts rows, so guest_count is the number of bookings per stay length
rh_guest_duration = rh_non_cancelled.groupby('stay_ins').total_paying_guest.count().rename_axis('days').reset_index(name='guest_count')
total_guest_count = rh_guest_duration.guest_count.sum()
rh_guest_duration['%_of_guests'] = (rh_guest_duration.guest_count/total_guest_count)*100
rh_guest_duration['hotel'] = 'Resort'
rh_guest_duration.to_excel(os.path.join(output_path, 'rh_guest_duration.xlsx'), index=False)

rh_guest_duration.head()


Unnamed: 0,days,guest_count,%_of_guests,hotel
0,0,371,1.282539,Resort
1,1,6579,22.743458,Resort
2,2,4488,15.514917,Resort
3,3,3828,13.233311,Resort
4,4,3321,11.480624,Resort


In [31]:
#total nights = week nights + weekend nights (assign returns a new DataFrame, which avoids SettingWithCopyWarning)
ch_non_cancelled = ch_non_cancelled.assign(stay_ins=ch_non_cancelled.stays_in_week_nights + ch_non_cancelled.stays_in_weekend_nights)

#count() counts rows, so guest_count is the number of bookings per stay length
ch_guest_duration = ch_non_cancelled.groupby('stay_ins').total_paying_guest.count().rename_axis('days').reset_index(name='guest_count')
total_guest_count = ch_guest_duration.guest_count.sum()
ch_guest_duration['%_of_guests'] = (ch_guest_duration.guest_count/total_guest_count)*100
ch_guest_duration['hotel'] = 'City'
ch_guest_duration.to_excel(os.path.join(output_path, 'ch_guest_duration.xlsx'), index=False)
ch_guest_duration.head()


Unnamed: 0,days,guest_count,%_of_guests,hotel
0,0,251,0.544658,City
1,1,9155,19.865897,City
2,2,10983,23.832567,City
3,3,11889,25.798542,City
4,4,7694,16.695599,City
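The percentage column can also be obtained in one step with `value_counts(normalize=True)`. The sketch below uses hypothetical bookings and, like the cells above, measures the share of bookings per stay length; the `pct_of_bookings` name is an assumption.

```python
import pandas as pd

# Hypothetical non-cancelled bookings.
toy = pd.DataFrame({
    "stays_in_weekend_nights": [0, 1, 2, 0],
    "stays_in_week_nights":    [1, 1, 1, 1],
})

# Total nights per booking.
nights = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]

# Fraction of bookings per stay length, ordered by number of nights.
share = nights.value_counts(normalize=True).sort_index() * 100

duration = pd.DataFrame({"days": share.index, "pct_of_bookings": share.to_numpy()})
```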


#### Q7. Bookings per market segment?

#### In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”

In [32]:
cancelled = bookings.loc[bookings.is_canceled == 1]
cancelled = cancelled.groupby('market_segment').is_canceled.count().rename_axis('Market_Segment').reset_index(name='Count')

In [33]:
not_cancelled = bookings.loc[bookings.is_canceled == 0]
not_cancelled = not_cancelled.groupby('market_segment').is_canceled.count().rename_axis('Market_Segment').reset_index(name='Count')

In [34]:
market_segment = pd.merge(not_cancelled, cancelled, on='Market_Segment', suffixes=('_NC', '_C'))
market_segment['Total_Bookings'] = market_segment.Count_NC + market_segment.Count_C
market_segment['%_Not_Cancelled'] = (market_segment.Count_NC/market_segment.Total_Bookings)*100
market_segment['%_Cancelled'] = (market_segment.Count_C/market_segment.Total_Bookings)*100
#market_segment = market_segment.dropna()

market_segment.to_excel(os.path.join(output_path, 'market_segment.xlsx'), index=False)
market_segment

Unnamed: 0,Market_Segment,Count_NC,Count_C,Total_Bookings,%_Not_Cancelled,%_Cancelled
0,Aviation,183,52,235,77.87234,22.12766
1,Complementary,639,89,728,87.774725,12.225275
2,Corporate,4291,991,5282,81.238167,18.761833
3,Direct,10648,1934,12582,84.628835,15.371165
4,Groups,7697,12094,19791,38.891415,61.108585
5,Offline TA/TO,15880,8302,24182,65.668679,34.331321
6,Online TA,35673,20735,56408,63.24103,36.75897
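Because the two counts are combined with an inner merge, a segment that has only cancelled or only non-cancelled bookings would drop out of the table. A cross-tabulation keeps such segments with a zero count. The sketch below runs on hypothetical data, and `Segment_X` is a made-up segment used only to show that case.

```python
import pandas as pd

# Hypothetical bookings; Segment_X has cancelled bookings only.
toy = pd.DataFrame({
    "market_segment": ["Direct", "Direct", "Groups", "Groups", "Groups", "Segment_X"],
    "is_canceled":    [0, 0, 1, 1, 0, 1],
})

# Rows = segments, columns = the 0/1 cancellation flag, cells = booking counts.
counts = pd.crosstab(toy["market_segment"], toy["is_canceled"])
counts = counts.rename(columns={0: "Count_NC", 1: "Count_C"})

counts["Total_Bookings"] = counts["Count_NC"] + counts["Count_C"]
counts["%_Cancelled"] = counts["Count_C"] / counts["Total_Bookings"] * 100
```

Segments missing from one side appear with a count of 0 instead of disappearing, so the totals still add up to the number of bookings.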
