# Feature Engineering: Datetime Features

---

**Calculating Dates**

The source data does not include ready-made datetime features. Instead, it provides several components that I can combine to create them:
* The arrival year, month name, and day of month.
* The number of weekday and weekend nights.
* The "booking lead time," or how far in advance the guest booked their reservation.

Using these features, I can create three new datetime features to start: the arrival, departure, and booking dates.
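
As a rough illustration of the arithmetic involved (not the notebook's actual pipeline), the sketch below derives all three dates for a single reservation using only the standard library. The helper name `derive_dates` and its arguments are hypothetical; the input values mirror the first row of the dataset.

```python
from datetime import datetime, timedelta

def derive_dates(year, month_name, day, week_nights, weekend_nights, lead_time_days):
    """Return (booking, arrival, departure) dates for one reservation."""
    # Parse e.g. "2015-July-1"; %B expects an English month name in the default locale
    arrival = datetime.strptime(f"{year}-{month_name}-{day}", "%Y-%B-%d").date()
    # Departure is the arrival date plus the total number of nights stayed
    departure = arrival + timedelta(days=week_nights + weekend_nights)
    # Booking date is the arrival date minus the lead time
    booking = arrival - timedelta(days=lead_time_days)
    return booking, arrival, departure

booking, arrival, departure = derive_dates(2015, "July", 1, 0, 0, 342)
print(booking, arrival, departure)  # 2014-07-24 2015-07-01 2015-07-01
```

The same three steps are applied column-wise with pandas later in the notebook.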

---

**Extracting Date Details**

For each of these separate dates, I can go into more detail:
* Calculating the number of days since the last holiday and the number of days until the next.
* Determining the week of the year, day of the week, etc. to help capture more temporal features.
* Calculating the number of days between the last reservation change and the arrival date (for those reservations changed on or before the arrival date).

---

**Final Considerations**

This process will create many new features, which may limit future modeling performance. Prior to modeling, I may need to apply feature selection methods to keep only the most impactful features.

By the end of this notebook, I will have a new set of temporally-focused data to use for more extensive modeling and forecasting in the next steps of the workflow.

---

In [49]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [50]:
## Enabling access to custom functions in separate directory

# Import necessary modules
import sys
import os

# Construct the absolute path to the 'src' directory
src_path = os.path.abspath(os.path.join('../..', 'src'))

# Append the path to 'sys.path'
if src_path not in sys.path:
    sys.path.append(src_path)

import db_utils

In [51]:
import datetime as dt
import json
import holidays
import numpy as np
import pandas as pd
import seaborn as sns

# *TEMPORARILY CHANGED TO USE BACKUP FILES* | Read Data from DuckDB

In [52]:
# # Path to the DuckDB database file
# db_path = '../data/hotel_reservations.duckdb'

# ## Select subset of data for review
# q = 'SELECT * FROM res_data LIMIT 5'

# with db_utils.duckdb_connection(db_path) as conn:
#     display(conn.execute(q).df())

In [53]:
# ## Select subset of data for review
# q = 'SELECT IsCanceled FROM res_data LIMIT 5'

# with db_utils.duckdb_connection(db_path) as conn:
#     display(conn.execute(q).df())

In [54]:
# ## Convert Arrival columns to strings

# q = ('''
# SELECT uuid, ArrivalDateYear, ArrivalDateMonth, ArrivalDateDayOfMonth,
# StaysInWeekNights, StaysInWeekendNights, LeadTime 
# FROM res_data''')

# with db_utils.duckdb_connection(db_path) as conn:
#     df_data = conn.execute(q).df()

# # df_data = arrival_cols.astype(str)
# df_data.head()

In [55]:
# ## Specify subset of temporal features
# date_features = ['ReservationStatusDate', 'ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth',
#                  'StaysInWeekNights', 'StaysInWeekendNights', 'LeadTime']
# date_features

In [56]:
with open('../../data/column_groups.json') as file:
    col_groups = json.load(file)

## Slice subset of features
col_groups['temporal_features'].extend(['ADR', 'IsCanceled'])

In [57]:
path = '../../data/source/full_data.feather'

# df_data = pd.read_feather(path, columns = col_groups['temporal_features'])
df_data = pd.read_feather(path) ## Use full dataset

df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,UUID
0,0,342,2015,July,27,1,0,0,2,0.0,...,,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1,873e7749-dcb5-4c01-b54f-46082557421a
1,0,737,2015,July,27,1,0,0,2,0.0,...,,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1,17c448f2-8715-43d3-a34d-06e0c6ccb502
2,0,7,2015,July,27,1,0,1,1,0.0,...,,0,Transient,75.00,0,0,Check-Out,2015-07-02,H1,ccfd11ba-608d-46bb-a97a-7af0dc59fc7d
3,0,13,2015,July,27,1,0,1,1,0.0,...,,0,Transient,75.00,0,0,Check-Out,2015-07-02,H1,0b80d489-ff99-4533-b2e2-c07747cb9681
4,0,14,2015,July,27,1,0,2,2,0.0,...,,0,Transient,98.00,0,1,Check-Out,2015-07-03,H1,9b03c838-e88c-4682-b491-1f929402c92d
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,0,23,2017,August,35,30,2,5,2,0.0,...,,0,Transient,96.14,0,0,Check-Out,2017-09-06,H2,6ba8edf2-3269-47dd-a643-1717a82977db
119386,0,102,2017,August,35,31,2,5,3,0.0,...,,0,Transient,225.43,0,2,Check-Out,2017-09-07,H2,1772f97b-c98e-483c-b3ed-4e19741a0c0b
119387,0,34,2017,August,35,31,2,5,2,0.0,...,,0,Transient,157.71,0,4,Check-Out,2017-09-07,H2,1da248f6-1191-4391-9490-066d3bc5d9a8
119388,0,109,2017,August,35,31,2,5,2,0.0,...,,0,Transient,104.40,0,0,Check-Out,2017-09-07,H2,2665afe4-8c01-43b7-a2db-80122963f613


In [58]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 33 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   IsCanceled                   119390 non-null  int64  
 1   LeadTime                     119390 non-null  int64  
 2   ArrivalDateYear              119390 non-null  int64  
 3   ArrivalDateMonth             119390 non-null  object 
 4   ArrivalDateWeekNumber        119390 non-null  int64  
 5   ArrivalDateDayOfMonth        119390 non-null  int64  
 6   StaysInWeekendNights         119390 non-null  int64  
 7   StaysInWeekNights            119390 non-null  int64  
 8   Adults                       119390 non-null  int64  
 9   Children                     119386 non-null  float64
 10  Babies                       119390 non-null  int64  
 11  Meal                         119390 non-null  object 
 12  Country                      118902 non-null  object 
 13 

## Convert ReservationStatusDate to Datetime Format

In [59]:
df_data['ReservationStatusDate'] = pd.to_datetime(df_data['ReservationStatusDate'], yearfirst = True)
df_data['ReservationStatusDate']

0        2015-07-01
1        2015-07-01
2        2015-07-02
3        2015-07-02
4        2015-07-03
            ...    
119385   2017-09-06
119386   2017-09-07
119387   2017-09-07
119388   2017-09-07
119389   2017-09-07
Name: ReservationStatusDate, Length: 119390, dtype: datetime64[ns]

# Feature Engineering: Arrival, Departure, and Booking Dates

## Arrival Date

In [60]:
## Create new column of strings formatted as YYYY-MonthName-D (e.g., 2015-July-1), then convert to datetime

arrival_details = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth']

df_data[arrival_details] = df_data[arrival_details].astype(str)

df_data['ArrivalDate'] = (df_data['ArrivalDateYear']
                          .str.cat(df_data[['ArrivalDateMonth',
                                            'ArrivalDateDayOfMonth']],
                                   '-')
                          )

df_data['ArrivalDate'] = pd.to_datetime(df_data['ArrivalDate'], yearfirst = True)

df_data = df_data.sort_values(by = 'ArrivalDate', ignore_index = False)

df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,UUID,ArrivalDate
0,0,342,2015,July,27,1,0,0,2,0.0,...,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1,873e7749-dcb5-4c01-b54f-46082557421a,2015-07-01
75559,0,257,2015,July,27,1,0,2,1,0.0,...,0,Transient,80.00,0,0,Check-Out,2015-07-03,H2,34ffe79b-1c75-4a7e-9353-974435aeaa76,2015-07-01
75560,0,257,2015,July,27,1,0,2,2,0.0,...,0,Transient,101.50,0,0,Check-Out,2015-07-03,H2,be96389e-f204-4118-beb0-f8c57752beb0,2015-07-01
75561,0,257,2015,July,27,1,0,2,2,0.0,...,0,Transient,101.50,0,0,Check-Out,2015-07-03,H2,1f50ec6f-ea8d-48b1-86cb-aa4999b8d460,2015-07-01
75562,0,257,2015,July,27,1,0,2,2,0.0,...,0,Transient,101.50,0,0,Check-Out,2015-07-03,H2,d38621e9-377a-4302-a587-bd1267eee0ac,2015-07-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,0,108,2017,August,35,31,2,5,2,0.0,...,0,Transient,207.03,0,1,Check-Out,2017-09-07,H1,e8240939-e608-4177-8b0a-9c1acbdc25b6,2017-08-31
40040,0,194,2017,August,35,31,2,5,2,1.0,...,0,Transient,312.29,1,1,Check-Out,2017-09-07,H1,55c827d5-b681-4b84-be50-9315d15bb979,2017-08-31
13794,1,17,2017,August,35,31,0,3,2,0.0,...,0,Transient,207.00,0,2,Canceled,2017-08-14,H1,858a4520-54e8-44ee-a36b-6e69d2a06988,2017-08-31
40038,0,191,2017,August,35,31,2,5,2,0.0,...,0,Contract,114.80,0,0,Check-Out,2017-09-07,H1,37a6b3f1-27e8-4719-ae03-a4d280730517,2017-08-31
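
The conversion above relies on pandas inferring the layout of strings such as `2015-July-1`. One possible alternative, sketched here on a small made-up frame named `toy`, is to pass an explicit `format` so that unexpected values raise an error instead of being guessed at. The frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with the same three arrival columns as the source data
toy = pd.DataFrame({
    'ArrivalDateYear': [2015, 2017],
    'ArrivalDateMonth': ['July', 'August'],
    'ArrivalDateDayOfMonth': [1, 31],
})

# Build "YYYY-MonthName-D" strings from the three components
arrival_str = (toy['ArrivalDateYear'].astype(str)
               + '-' + toy['ArrivalDateMonth']
               + '-' + toy['ArrivalDateDayOfMonth'].astype(str))

# %B matches the full month name; %d accepts one- or two-digit days
toy['ArrivalDate'] = pd.to_datetime(arrival_str, format='%Y-%B-%d')

print(toy['ArrivalDate'].dt.strftime('%Y-%m-%d').tolist())
```

An explicit format also documents the expected input for anyone reading the cell later.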


In [61]:
## Drop features post-conversion
df_data = (df_data
           .drop(columns = arrival_details)
           .drop(columns = 'ArrivalDateWeekNumber'))
df_data

Unnamed: 0,IsCanceled,LeadTime,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,Meal,Country,MarketSegment,...,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,UUID,ArrivalDate
0,0,342,0,0,2,0.0,0,BB,PRT,Direct,...,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1,873e7749-dcb5-4c01-b54f-46082557421a,2015-07-01
75559,0,257,0,2,1,0.0,0,HB,PRT,Offline TA/TO,...,0,Transient,80.00,0,0,Check-Out,2015-07-03,H2,34ffe79b-1c75-4a7e-9353-974435aeaa76,2015-07-01
75560,0,257,0,2,2,0.0,0,HB,PRT,Offline TA/TO,...,0,Transient,101.50,0,0,Check-Out,2015-07-03,H2,be96389e-f204-4118-beb0-f8c57752beb0,2015-07-01
75561,0,257,0,2,2,0.0,0,HB,PRT,Offline TA/TO,...,0,Transient,101.50,0,0,Check-Out,2015-07-03,H2,1f50ec6f-ea8d-48b1-86cb-aa4999b8d460,2015-07-01
75562,0,257,0,2,2,0.0,0,HB,PRT,Offline TA/TO,...,0,Transient,101.50,0,0,Check-Out,2015-07-03,H2,d38621e9-377a-4302-a587-bd1267eee0ac,2015-07-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,0,108,2,5,2,0.0,0,HB,GBR,Online TA,...,0,Transient,207.03,0,1,Check-Out,2017-09-07,H1,e8240939-e608-4177-8b0a-9c1acbdc25b6,2017-08-31
40040,0,194,2,5,2,1.0,0,HB,ITA,Online TA,...,0,Transient,312.29,1,1,Check-Out,2017-09-07,H1,55c827d5-b681-4b84-be50-9315d15bb979,2017-08-31
13794,1,17,0,3,2,0.0,0,HB,ESP,Online TA,...,0,Transient,207.00,0,2,Canceled,2017-08-14,H1,858a4520-54e8-44ee-a36b-6e69d2a06988,2017-08-31
40038,0,191,2,5,2,0.0,0,HB,GBR,Offline TA/TO,...,0,Contract,114.80,0,0,Check-Out,2017-09-07,H1,37a6b3f1-27e8-4719-ae03-a4d280730517,2017-08-31


## Departure Date

In [62]:
## Convert number of nights stayed to timedelta,
## then use to calculate departure date and stay length

timedelta_wknd = pd.to_timedelta(
                    df_data.loc[:, 'StaysInWeekendNights'],
                    unit = 'D')
timedelta_wk = pd.to_timedelta(
                    df_data.loc[:, 'StaysInWeekNights'],
                    unit = 'D')

df_data['DepartureDate'] = (df_data.loc[:, 'ArrivalDate'] 
                            + timedelta_wk 
                            + timedelta_wknd)

df_data['Length of Stay'] = df_data['StaysInWeekendNights'] + df_data['StaysInWeekNights']

df_data = df_data.drop(columns = ['StaysInWeekendNights', 'StaysInWeekNights'])

df_data

Unnamed: 0,IsCanceled,LeadTime,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,...,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,UUID,ArrivalDate,DepartureDate,Length of Stay
0,0,342,2,0.0,0,BB,PRT,Direct,Direct,0,...,0.00,0,0,Check-Out,2015-07-01,H1,873e7749-dcb5-4c01-b54f-46082557421a,2015-07-01,2015-07-01,0
75559,0,257,1,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,...,80.00,0,0,Check-Out,2015-07-03,H2,34ffe79b-1c75-4a7e-9353-974435aeaa76,2015-07-01,2015-07-03,2
75560,0,257,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,...,101.50,0,0,Check-Out,2015-07-03,H2,be96389e-f204-4118-beb0-f8c57752beb0,2015-07-01,2015-07-03,2
75561,0,257,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,...,101.50,0,0,Check-Out,2015-07-03,H2,1f50ec6f-ea8d-48b1-86cb-aa4999b8d460,2015-07-01,2015-07-03,2
75562,0,257,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,...,101.50,0,0,Check-Out,2015-07-03,H2,d38621e9-377a-4302-a587-bd1267eee0ac,2015-07-01,2015-07-03,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,0,108,2,0.0,0,HB,GBR,Online TA,TA/TO,0,...,207.03,0,1,Check-Out,2017-09-07,H1,e8240939-e608-4177-8b0a-9c1acbdc25b6,2017-08-31,2017-09-07,7
40040,0,194,2,1.0,0,HB,ITA,Online TA,TA/TO,0,...,312.29,1,1,Check-Out,2017-09-07,H1,55c827d5-b681-4b84-be50-9315d15bb979,2017-08-31,2017-09-07,7
13794,1,17,2,0.0,0,HB,ESP,Online TA,TA/TO,0,...,207.00,0,2,Canceled,2017-08-14,H1,858a4520-54e8-44ee-a36b-6e69d2a06988,2017-08-31,2017-09-03,3
40038,0,191,2,0.0,0,HB,GBR,Offline TA/TO,TA/TO,0,...,114.80,0,0,Check-Out,2017-09-07,H1,37a6b3f1-27e8-4719-ae03-a4d280730517,2017-08-31,2017-09-07,7


## `BookingDate` from `LeadTime`

In [63]:
## Convert to TimeDelta
df_data['LeadTime'] = pd.to_timedelta(df_data['LeadTime'], unit = 'D')

## Subtract LeadTime from ArrivalDate to calculate BookingDate
df_data['BookingDate'] = df_data['ArrivalDate'] - df_data['LeadTime']

df_data = df_data.drop(columns = 'LeadTime')

df_data

Unnamed: 0,IsCanceled,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,...,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,UUID,ArrivalDate,DepartureDate,Length of Stay,BookingDate
0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,...,0,0,Check-Out,2015-07-01,H1,873e7749-dcb5-4c01-b54f-46082557421a,2015-07-01,2015-07-01,0,2014-07-24
75559,0,1,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,2015-07-03,H2,34ffe79b-1c75-4a7e-9353-974435aeaa76,2015-07-01,2015-07-03,2,2014-10-17
75560,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,2015-07-03,H2,be96389e-f204-4118-beb0-f8c57752beb0,2015-07-01,2015-07-03,2,2014-10-17
75561,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,2015-07-03,H2,1f50ec6f-ea8d-48b1-86cb-aa4999b8d460,2015-07-01,2015-07-03,2,2014-10-17
75562,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,2015-07-03,H2,d38621e9-377a-4302-a587-bd1267eee0ac,2015-07-01,2015-07-03,2,2014-10-17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,0,2,0.0,0,HB,GBR,Online TA,TA/TO,0,0,...,0,1,Check-Out,2017-09-07,H1,e8240939-e608-4177-8b0a-9c1acbdc25b6,2017-08-31,2017-09-07,7,2017-05-15
40040,0,2,1.0,0,HB,ITA,Online TA,TA/TO,0,0,...,1,1,Check-Out,2017-09-07,H1,55c827d5-b681-4b84-be50-9315d15bb979,2017-08-31,2017-09-07,7,2017-02-18
13794,1,2,0.0,0,HB,ESP,Online TA,TA/TO,0,0,...,0,2,Canceled,2017-08-14,H1,858a4520-54e8-44ee-a36b-6e69d2a06988,2017-08-31,2017-09-03,3,2017-08-14
40038,0,2,0.0,0,HB,GBR,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,2017-09-07,H1,37a6b3f1-27e8-4719-ae03-a4d280730517,2017-08-31,2017-09-07,7,2017-02-21


In [64]:
df_data.head(10)

Unnamed: 0,IsCanceled,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,...,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,UUID,ArrivalDate,DepartureDate,Length of Stay,BookingDate
0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,...,0,0,Check-Out,2015-07-01,H1,873e7749-dcb5-4c01-b54f-46082557421a,2015-07-01,2015-07-01,0,2014-07-24
75559,0,1,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,2015-07-03,H2,34ffe79b-1c75-4a7e-9353-974435aeaa76,2015-07-01,2015-07-03,2,2014-10-17
75560,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,2015-07-03,H2,be96389e-f204-4118-beb0-f8c57752beb0,2015-07-01,2015-07-03,2,2014-10-17
75561,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,2015-07-03,H2,1f50ec6f-ea8d-48b1-86cb-aa4999b8d460,2015-07-01,2015-07-03,2,2014-10-17
75562,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,2015-07-03,H2,d38621e9-377a-4302-a587-bd1267eee0ac,2015-07-01,2015-07-03,2,2014-10-17
75563,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,2015-07-03,H2,193a3a70-9544-4771-b172-31dc9eabcbfd,2015-07-01,2015-07-03,2,2014-10-17
75564,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,2015-07-03,H2,e3f1e8de-2395-4868-a42f-526d94a62001,2015-07-01,2015-07-03,2,2014-10-17
75565,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,2015-07-03,H2,b61770a7-5156-470c-a939-fde4a9b72ccf,2015-07-01,2015-07-03,2,2014-10-17
75566,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,2015-07-03,H2,7d797058-f23a-4bf8-9c29-1a0fb31b1647,2015-07-01,2015-07-03,2,2014-10-17
75558,0,1,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,2015-07-03,H2,c69de7c6-c1a9-4f0d-bbdb-8d62b5092d6a,2015-07-01,2015-07-03,2,2014-10-17


# Feature Inspection: `ReservationStatusDate`

The last date-like column in the original source is `ReservationStatusDate`, the date on which the reservation's status last changed.

As a date feature, it may be useful for feature engineering. However, based on my domain knowledge, I suspect that a status date earlier than the arrival date indicates that the reservation was canceled. Used directly, it would leak the target into the model, but I may be able to derive a different feature from it.

I will start by identifying the reservations whose `ReservationStatusDate` is earlier than the arrival date. Then, I will take the `IsCanceled` feature and filter for those reservations. Finally, I will calculate the share of those reservations that were canceled. If that share is very high (90-100%), I will consider how to generate a new feature from this data.
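
The three steps described above can be sketched on a small made-up frame. This is only an illustration of the filter-then-average idea; the frame `toy` and the variable names are assumptions, and the values are invented.

```python
import pandas as pd

# Invented reservations: two changed before arrival, two changed after
toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2017-08-31'] * 4),
    'ReservationStatusDate': pd.to_datetime(
        ['2017-08-14', '2017-09-07', '2017-08-20', '2017-09-03']),
    'IsCanceled': [1, 0, 1, 0],
})

# Step 1: flag reservations whose status changed before the arrival date
changed_before_arrival = toy['ReservationStatusDate'] < toy['ArrivalDate']

# Step 2: share of all reservations that were changed early
share_changed_early = changed_before_arrival.mean()

# Step 3: cancellation rate among only those early-changed reservations
cancel_rate_given_early = toy.loc[changed_before_arrival, 'IsCanceled'].mean()

print(f'{share_changed_early:.0%} changed before arrival; '
      f'{cancel_rate_given_early:.0%} of those were canceled')
```

Keeping the two quantities separate matters: the first describes how common early changes are, while the second is the conditional rate that the 90-100% threshold refers to.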



In [65]:
## Review data prior to changes
df_data['ReservationStatusDate'].head(10)

0       2015-07-01
75559   2015-07-03
75560   2015-07-03
75561   2015-07-03
75562   2015-07-03
75563   2015-07-03
75564   2015-07-03
75565   2015-07-03
75566   2015-07-03
75558   2015-07-03
Name: ReservationStatusDate, dtype: datetime64[ns]

## Compare with `IsCanceled` Data

In [66]:
## Flag reservations whose status last changed before the arrival date
change_filter = (df_data['ReservationStatusDate'] < df_data['ArrivalDate'])

## Share of all reservations whose status changed before arrival
avg_resstatdate_before_arrival = change_filter.mean()

## Overall cancellation rate
avg_cxl = df_data['IsCanceled'].mean()

print((f'''The overall average number of canceled reservations is: {avg_cxl:.2%}\n'''))

print(' '.join(['The average number of canceled reservations with a ReservationStatusDate',
             f'prior to the arrival date is: {avg_resstatdate_before_arrival:.2%}\n']))

## Print advice based on results
## (thresholds are applied to the overall cancellation rate, avg_cxl)
if avg_cxl >= .9:
    print(' '.join(['The `ReservationStatusDate` feature is too strongly indicative of the `IsCanceled` feature.',
          'It should not be used for modeling.']))
elif avg_cxl >= .25:
    print(' '.join(['This feature is related to the `IsCanceled` feature.',
          'Make sure to review it in more detail to determine whether to use it.']))
else:
    print('The `ReservationStatusDate` feature is unlikely to be predictive of the `IsCanceled` feature.')

The overall average number of canceled reservations is: 37.04%

The average number of canceled reservations with a ReservationStatusDate prior to the arrival date is: 35.29%

This feature is related to the `IsCanceled` feature. Make sure to review it in more detail to determine whether to use it.


### EDA Questions

In [67]:
# ## What is the breakdown of reservation statuses for those reservations with matching Arrival and Status Dates?
# ## (A.K.A. "same-day departures" or "day-use reservations.")

# sameday_status = (df_data['ReservationStatusDate'] == df_data['ArrivalDate'])

# (df_data[sameday_status]
#  .value_counts(subset = 'ReservationStatus',normalize = True)
#  .round(2))

In [68]:
# ## What is the breakdown of IsCanceled statuses
# ## for those reservations with matching Arrival and Status Dates?

# (df_data[sameday_status]
#  .value_counts(subset = 'IsCanceled',normalize = True)
#  .round(2))

In [69]:
# ## What is the average rate for these day-use/same-day-departure reservations?

# sameday_adr = (sameday_status & (df_data['ReservationStatus'] == 'Check-Out'))

# sameday_adr_median = df_data[sameday_adr]['ADR'].median()

# print(f'The median ADR for same-day reservations is: ${sameday_adr_median:.2f}')

# sameday_adr_gt_zero = (df_data[sameday_adr]['ADR'] > 0).mean().round(2)

# print(f'The number of same-day reservations with an ADR greater than zero is: {sameday_adr_gt_zero:.1%}')

In [70]:
# sameday_departure = (df_data['ReservationStatusDate'] == df_data['DepartureDate'])
# df_data[sameday_departure].value_counts(subset = 'IsCanceled', normalize =True).round(4)

In [71]:
# df_data[sameday_departure].value_counts(subset = 'ReservationStatus', normalize =True).round(4)

In [72]:
# df_data[(sameday_departure & (df_data['ReservationStatus'] != 'Check-Out'))]

## ReservationStatusDate Later Than Arrival Date

In [73]:
# after_arrival_filter = (df_data['ReservationStatusDate'] > df_data['ArrivalDate'])

# avg_resstatdate_after_arrival = (after_arrival_filter.mean())

# print(f'The average number of reservations changed after arrival is: {avg_resstatdate_after_arrival:.0%}.')

# Future Work: Investigating Canceled Reservations

---

Although the `ReservationStatusDate` feature is not appropriate for feature engineering, I could use it to calculate the number of days between booking and cancellation (for canceled reservations only).

This would be outside of the scope of the current feature engineering, but I am noting it as future work for analysis.

---
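
As a possible starting point for that future work, the sketch below computes the number of days between booking and cancellation and keeps it only for canceled reservations. The column name `DaysToCancellation` and the toy values are hypothetical and not part of the dataset.

```python
import pandas as pd

# Invented reservations: rows 0 and 2 were canceled, row 1 checked out
toy = pd.DataFrame({
    'BookingDate': pd.to_datetime(['2017-08-14', '2017-02-21', '2017-05-01']),
    'ReservationStatusDate': pd.to_datetime(['2017-08-14', '2017-09-07', '2017-05-11']),
    'IsCanceled': [1, 0, 1],
})

# Days elapsed between the booking and the last status change
days_elapsed = (toy['ReservationStatusDate'] - toy['BookingDate']).dt.days

# Keep the value only where the reservation was canceled; other rows become NaN
toy['DaysToCancellation'] = days_elapsed.where(toy['IsCanceled'] == 1)

print(toy[['IsCanceled', 'DaysToCancellation']])
```

Because the value is only defined for canceled reservations, it suits descriptive analysis of cancellations rather than use as a model input.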

In [74]:
## Calculate the number of days between the status and booking dates

df_data['DaysSinceBooking'] = (df_data['ReservationStatusDate'] - df_data['BookingDate']).dt.days

df_data['DaysSinceBooking']

0         342
75559     259
75560     259
75561     259
75562     259
         ... 
40039     115
40040     201
13794       0
40038     198
117424      4
Name: DaysSinceBooking, Length: 119390, dtype: int64

In [75]:
df_data.head(10).T

Unnamed: 0,0,75559,75560,75561,75562,75563,75564,75565,75566,75558
IsCanceled,0,0,0,0,0,0,0,0,0,0
Adults,2,1,2,2,2,2,2,2,2,1
Children,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Babies,0,0,0,0,0,0,0,0,0,0
Meal,BB,HB,HB,HB,HB,HB,HB,HB,HB,HB
Country,PRT,PRT,PRT,PRT,PRT,PRT,PRT,PRT,PRT,PRT
MarketSegment,Direct,Offline TA/TO,Offline TA/TO,Offline TA/TO,Offline TA/TO,Offline TA/TO,Offline TA/TO,Offline TA/TO,Offline TA/TO,Offline TA/TO
DistributionChannel,Direct,TA/TO,TA/TO,TA/TO,TA/TO,TA/TO,TA/TO,TA/TO,TA/TO,TA/TO
IsRepeatedGuest,0,0,0,0,0,0,0,0,0,0
PreviousCancellations,0,0,0,0,0,0,0,0,0,0


## Inspection Results: `ReservationStatusDate`

---

Based on the share of reservations last changed before arrival, which is close to the overall cancellation rate, this feature appears to be almost exactly indicative of whether a reservation was canceled prior to arrival.

Additionally, the number of days between the `ArrivalDate` and `ReservationStatusDate` features matches the length of stay, which already exists.

However, I did calculate the age of each reservation (`DaysSinceBooking`) and can use this information for further analysis of canceled reservations.

**Final Determination:** The `ReservationStatusDate` feature is not appropriate for feature engineering. Given its nearly exact correlation with the cancellation status of a reservation, it should be dropped from future datasets prior to modeling.

---

In [76]:
df_data = df_data.drop(columns = 'ReservationStatusDate')
df_data

Unnamed: 0,IsCanceled,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,...,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,HotelNumber,UUID,ArrivalDate,DepartureDate,Length of Stay,BookingDate,DaysSinceBooking
0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,...,0,0,Check-Out,H1,873e7749-dcb5-4c01-b54f-46082557421a,2015-07-01,2015-07-01,0,2014-07-24,342
75559,0,1,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,H2,34ffe79b-1c75-4a7e-9353-974435aeaa76,2015-07-01,2015-07-03,2,2014-10-17,259
75560,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,H2,be96389e-f204-4118-beb0-f8c57752beb0,2015-07-01,2015-07-03,2,2014-10-17,259
75561,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,H2,1f50ec6f-ea8d-48b1-86cb-aa4999b8d460,2015-07-01,2015-07-03,2,2014-10-17,259
75562,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,H2,d38621e9-377a-4302-a587-bd1267eee0ac,2015-07-01,2015-07-03,2,2014-10-17,259
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,0,2,0.0,0,HB,GBR,Online TA,TA/TO,0,0,...,0,1,Check-Out,H1,e8240939-e608-4177-8b0a-9c1acbdc25b6,2017-08-31,2017-09-07,7,2017-05-15,115
40040,0,2,1.0,0,HB,ITA,Online TA,TA/TO,0,0,...,1,1,Check-Out,H1,55c827d5-b681-4b84-be50-9315d15bb979,2017-08-31,2017-09-07,7,2017-02-18,201
13794,1,2,0.0,0,HB,ESP,Online TA,TA/TO,0,0,...,0,2,Canceled,H1,858a4520-54e8-44ee-a36b-6e69d2a06988,2017-08-31,2017-09-03,3,2017-08-14,0
40038,0,2,0.0,0,HB,GBR,Offline TA/TO,TA/TO,0,0,...,0,0,Check-Out,H1,37a6b3f1-27e8-4719-ae03-a4d280730517,2017-08-31,2017-09-07,7,2017-02-21,198


# Feature Engineering: Holidays

In [77]:
min_year = (df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']]
                 .min()
                 .min()
                 .year)

max_year = (df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']]
                 .max()
                 .max()
                 .year)

min_year, max_year

(2013, 2017)

In [78]:
# Fetch holidays for Portugal across the range of years spanned by the data
country_code = 'PT'
years = list(range(min_year, max_year + 1))

pt_holidays = holidays.country_holidays(country_code, years = years)

def holiday_past(date, holidays):

    # Convert Timestamp to datetime.date
    date = date.date()

    # Find the closest past holiday
    past_holidays = [(date - h_date).days for h_date in holidays if h_date <= date]
    
    if past_holidays:
        days_after = min((d for d in past_holidays if d >= 0), default=None)
    else:
        days_after = None
   
    return days_after


def holiday_upcoming(date, holidays):

    # Convert Timestamp to datetime.date
    date = date.date()

    # Find the closest upcoming holiday
    future_holidays = [(h_date - date).days for h_date in holidays if h_date > date]
        
    if future_holidays:
        days_before = min((d for d in future_holidays if d >= 0), default=None)
    else:
        days_before = None

    return days_before


# Function to calculate the proximity to holidays for a list of dates
def calculate_holiday_proximity(dates, holidays):
    days_after_recent_holiday = []
    days_before_next_holiday = []

    # Use `date` as the loop variable to avoid shadowing the `dt` module alias
    for date in dates:

        days_after_recent_holiday.append(holiday_past(date, holidays))
        days_before_next_holiday.append(holiday_upcoming(date, holidays))
    
    return days_after_recent_holiday, days_before_next_holiday
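
The functions above scan every holiday for every date. If that becomes slow, one possible alternative is a binary search over a sorted list of holiday dates. The sketch below is a minimal standard-library version under that assumption; `holiday_proximity` and `example_holidays` are hypothetical names, and the three dates are hand-picked for illustration rather than a complete calendar.

```python
from bisect import bisect_right
from datetime import date

def holiday_proximity(day, sorted_holidays):
    """Return (days_since_last_holiday, days_until_next_holiday) for one date.

    A holiday falling on `day` itself counts as the most recent holiday (0 days),
    matching the <= and > comparisons used in the functions above.
    """
    # Index of the first holiday strictly after `day`
    i = bisect_right(sorted_holidays, day)
    days_after = (day - sorted_holidays[i - 1]).days if i > 0 else None
    days_before = (sorted_holidays[i] - day).days if i < len(sorted_holidays) else None
    return days_after, days_before

# Hand-picked dates for illustration only, not a complete holiday calendar
example_holidays = sorted([date(2015, 6, 10), date(2015, 8, 15), date(2015, 10, 5)])

print(holiday_proximity(date(2015, 7, 1), example_holidays))   # (21, 45)
print(holiday_proximity(date(2015, 8, 15), example_holidays))  # (0, 51)
```

With the notebook's objects, the sorted list could presumably be built with `sorted(pt_holidays)`, since the functions above already iterate over that object to obtain dates.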

In [79]:
# Apply the function to each date column in the dataframe
for column in ['ArrivalDate', 'DepartureDate', 'BookingDate']:
    after, before = calculate_holiday_proximity(df_data[column], pt_holidays)
    df_data[f'{column}_DaysBeforeHoliday'] = before
    df_data[f'{column}_DaysAfterHoliday'] = after

df_data

Unnamed: 0,IsCanceled,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,...,DepartureDate,Length of Stay,BookingDate,DaysSinceBooking,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday
0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,...,2015-07-01,0,2014-07-24,342,45,21,45,21,22,44
75559,0,1,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,2015-07-03,2,2014-10-17,259,45,21,43,23,52,63
75560,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,2015-07-03,2,2014-10-17,259,45,21,43,23,52,63
75561,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,2015-07-03,2,2014-10-17,259,45,21,43,23,52,63
75562,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,2015-07-03,2,2014-10-17,259,45,21,43,23,52,63
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,0,2,0.0,0,HB,GBR,Online TA,TA/TO,0,0,...,2017-09-07,7,2017-05-15,115,35,16,28,23,26,14
40040,0,2,1.0,0,HB,ITA,Online TA,TA/TO,0,0,...,2017-09-07,7,2017-02-18,201,35,16,28,23,55,48
13794,1,2,0.0,0,HB,ESP,Online TA,TA/TO,0,0,...,2017-09-03,3,2017-08-14,0,35,16,32,19,1,60
40038,0,2,0.0,0,HB,GBR,Offline TA/TO,TA/TO,0,0,...,2017-09-07,7,2017-02-21,198,35,16,28,23,52,51


# Feature Engineering: ISO Day of Week, ISO Week of Year

In [80]:
df_data['ArrivalDate'].dt.dayofweek.head()

0        2
75559    2
75560    2
75561    2
75562    2
Name: ArrivalDate, dtype: int32
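
Note that `.dt.dayofweek` counts from Monday = 0, while the ISO calendar counts from Monday = 1, so the same Wednesday appears as 2 here and as 3 in the ISO-based columns. The standard-library sketch below illustrates both conventions for a single date, along with an edge case where the ISO week belongs to the previous ISO year.

```python
from datetime import date

d = date(2015, 7, 1)                 # a Wednesday
print(d.weekday())                   # 2  (Monday = 0, as in `.dt.dayofweek`)
print(d.isoweekday())                # 3  (Monday = 1, as in the ISO `day`)

iso_year, iso_week, iso_day = d.isocalendar()
print(iso_year, iso_week, iso_day)   # 2015 27 3

# Early-January dates can fall in the last ISO week of the previous year
edge_year, edge_week, edge_day = date(2016, 1, 1).isocalendar()
print(edge_year, edge_week, edge_day)  # 2015 53 5
```

The edge case is worth keeping in mind when treating the ISO week number as a within-year position.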

In [81]:
arrival_isocal = (df_data['ArrivalDate']
                  .dt.isocalendar()[['week', 'day']]
                  .rename(columns = {'week':'ArrivalWeek', 'day': 'ArrivalDay'}))
arrival_isocal

Unnamed: 0,ArrivalWeek,ArrivalDay
0,27,3
75559,27,3
75560,27,3
75561,27,3
75562,27,3
...,...,...
40039,35,4
40040,35,4
13794,35,4
40038,35,4


In [82]:
df_data = pd.concat([df_data, arrival_isocal], axis = 1)
df_data

Unnamed: 0,IsCanceled,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,...,BookingDate,DaysSinceBooking,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,ArrivalWeek,ArrivalDay
0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,...,2014-07-24,342,45,21,45,21,22,44,27,3
75559,0,1,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,2014-10-17,259,45,21,43,23,52,63,27,3
75560,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,2014-10-17,259,45,21,43,23,52,63,27,3
75561,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,2014-10-17,259,45,21,43,23,52,63,27,3
75562,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,2014-10-17,259,45,21,43,23,52,63,27,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,0,2,0.0,0,HB,GBR,Online TA,TA/TO,0,0,...,2017-05-15,115,35,16,28,23,26,14,35,4
40040,0,2,1.0,0,HB,ITA,Online TA,TA/TO,0,0,...,2017-02-18,201,35,16,28,23,55,48,35,4
13794,1,2,0.0,0,HB,ESP,Online TA,TA/TO,0,0,...,2017-08-14,0,35,16,32,19,1,60,35,4
40038,0,2,0.0,0,HB,GBR,Offline TA/TO,TA/TO,0,0,...,2017-02-21,198,35,16,28,23,52,51,35,4


# Feature Engineering: Day of Week, Month as Categorical

In [83]:
df_day_name = (df_data['ArrivalDate']
                 .dt.day_name()
                 .astype('category'))
df_day_name.name = 'ArrivalDateDayName'
df_day_name

df_data = pd.concat([df_data, df_day_name], axis = 1)

df_data['ArrivalDateDayName'].head().T

0        Wednesday
75559    Wednesday
75560    Wednesday
75561    Wednesday
75562    Wednesday
Name: ArrivalDateDayName, dtype: category
Categories (7, object): ['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday']

In [84]:
df_month_name = (df_data['ArrivalDate']
                 .dt.month_name()
                 .astype('category'))
df_month_name.name = 'ArrivalDateMonthName'

df_data = pd.concat([df_data, df_month_name], axis = 1)

df_data['ArrivalDateMonthName'].head().T

0        July
75559    July
75560    July
75561    July
75562    July
Name: ArrivalDateMonthName, dtype: category
Categories (12, object): ['April', 'August', 'December', 'February', ..., 'May', 'November', 'October', 'September']
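
Both categorical features above list their categories alphabetically ('Friday', 'Monday', ... and 'April', 'August', ...). If calendar order matters downstream, for example when plotting or ordinal-encoding, one option is an ordered categorical with explicit categories. The sketch below assumes English day and month names and uses made-up dates.

```python
import calendar
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ['2015-07-01', '2015-07-03', '2015-08-15', '2016-01-04']))

# Calendar-ordered labels (English names in the default locale)
day_order = list(calendar.day_name)           # Monday ... Sunday
month_order = list(calendar.month_name)[1:]   # January ... December (index 0 is '')

day_name = pd.Categorical(dates.dt.day_name(),
                          categories=day_order, ordered=True)
month_name = pd.Categorical(dates.dt.month_name(),
                            categories=month_order, ordered=True)

print(day_name.categories.tolist())
print(month_name.min(), month_name.max())
```

With ordered categories, sorting and comparisons follow the calendar rather than the alphabet.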

In [85]:
df_data.head().T

Unnamed: 0,0,75559,75560,75561,75562
IsCanceled,0,0,0,0,0
Adults,2,1,2,2,2
Children,0.0,0.0,0.0,0.0,0.0
Babies,0,0,0,0,0
Meal,BB,HB,HB,HB,HB
Country,PRT,PRT,PRT,PRT,PRT
MarketSegment,Direct,Offline TA/TO,Offline TA/TO,Offline TA/TO,Offline TA/TO
DistributionChannel,Direct,TA/TO,TA/TO,TA/TO,TA/TO
IsRepeatedGuest,0,0,0,0,0
PreviousCancellations,0,0,0,0,0


# Feature Engineering: Rolling Averages, Rolling Standard Deviation, and Lag

---

> To capture the time-series behavior of ADR, I will also add lag features, rolling averages, rolling standard deviations, and exponentially smoothed values as new features.
>
> This approach does use the target feature for engineering. Each feature is shifted so that it only draws on earlier rows, so as long as I split my data chronologically on the arrival date, I'm confident that I can avoid data leakage.

---
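
The commented-out cell below shifts by rows, which only matches "days" if there is exactly one row per date. As a hedged alternative, the sketch below first collapses ADR to one value per arrival date, then shifts and rolls on that daily series before splitting chronologically. It is an illustration on made-up numbers, not the method used in this notebook; `toy`, `ADR_3d_avg`, and the cutoff date are assumptions.

```python
import pandas as pd

# Toy data standing in for df_data; the ADR values are made up.
toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2015-07-01', '2015-07-01', '2015-07-02',
                                   '2015-07-03', '2015-07-04', '2015-07-05']),
    'ADR': [100.0, 120.0, 90.0, 110.0, 130.0, 95.0],
})

# Collapse to one mean ADR per arrival date so that a shift of 1 means one day.
daily = toy.groupby('ArrivalDate')['ADR'].mean().asfreq('D')

# Shift first so each day's feature only uses earlier days, then roll.
daily_features = pd.DataFrame({
    'ADR_lag_1': daily.shift(1),
    'ADR_3d_avg': daily.shift(1).rolling(window=3, min_periods=1).mean().round(2),
})

# Attach the daily features back onto the booking-level rows.
toy = toy.merge(daily_features, left_on='ArrivalDate', right_index=True, how='left')

# Chronological split: rows before the cutoff train, the rest test.
cutoff = pd.Timestamp('2015-07-04')
train = toy[toy['ArrivalDate'] < cutoff]
test = toy[toy['ArrivalDate'] >= cutoff]
```

Because every feature is built from strictly earlier dates, a row in `test` never sees its own ADR or any later one. The first day has no history, so its features are `NaN` and would need to be dropped or imputed.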

In [86]:
# # NOTE: shift() and rolling() below count rows, not calendar days, so they
# # assume df_data is sorted by ArrivalDate.

# # Lag features (1 and 7 rows back)
# df_data['ADR_lag_1'] = df_data['ADR'].shift(1)
# df_data['ADR_lag_7'] = df_data['ADR'].shift(7)

# # Rolling average over the previous 3 rows (column is named 7d, but window=3)
# df_data['ADR_7d_avg'] = df_data['ADR'].shift(1).rolling(window=3).mean().round(2)
# # Rolling average over the previous 7 rows (column is named 30d, but window=7)
# df_data['ADR_30d_avg'] = df_data['ADR'].shift(1).rolling(window=7).mean().round(2)
# # Rolling standard deviation over the previous 3 rows
# df_data['ADR_7d_std'] = df_data['ADR'].shift(1).rolling(window=3).std().round(2)
# # Rolling standard deviation over the previous 7 rows
# df_data['ADR_30d_std'] = df_data['ADR'].shift(1).rolling(window=7).std().round(2)

# # Exponential smoothing over previous rows (spans of 3 and 7)
# df_data['ADR_ewm_3'] = df_data['ADR'].shift(1).ewm(span=3, adjust=False).mean().round(2)
# df_data['ADR_ewm_7'] = df_data['ADR'].shift(1).ewm(span=7, adjust=False).mean().round(2)

# df_data

# Prepare to Save Data

In [87]:
df_data = df_data.reset_index(drop = True)
df_data

Unnamed: 0,IsCanceled,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,...,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,ArrivalWeek,ArrivalDay,ArrivalDateDayName,ArrivalDateMonthName
0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,...,45,21,45,21,22,44,27,3,Wednesday,July
1,0,1,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,45,21,43,23,52,63,27,3,Wednesday,July
2,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,45,21,43,23,52,63,27,3,Wednesday,July
3,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,45,21,43,23,52,63,27,3,Wednesday,July
4,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,45,21,43,23,52,63,27,3,Wednesday,July
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,0,2,0.0,0,HB,GBR,Online TA,TA/TO,0,0,...,35,16,28,23,26,14,35,4,Thursday,August
119386,0,2,1.0,0,HB,ITA,Online TA,TA/TO,0,0,...,35,16,28,23,55,48,35,4,Thursday,August
119387,1,2,0.0,0,HB,ESP,Online TA,TA/TO,0,0,...,35,16,32,19,1,60,35,4,Thursday,August
119388,0,2,0.0,0,HB,GBR,Offline TA/TO,TA/TO,0,0,...,35,16,28,23,52,51,35,4,Thursday,August


# Final Inspection

---

I extracted a good deal of information about the booking and stay dates and added several temporal features derived from them. While this approach adds a significant number of features, I am confident that the additional data will be worthwhile, and feature selection can trim the set before modeling if needed.

---

In [88]:
df_data.columns

Index(['IsCanceled', 'Adults', 'Children', 'Babies', 'Meal', 'Country',
       'MarketSegment', 'DistributionChannel', 'IsRepeatedGuest',
       'PreviousCancellations', 'PreviousBookingsNotCanceled',
       'ReservedRoomType', 'AssignedRoomType', 'BookingChanges', 'DepositType',
       'Agent', 'Company', 'DaysInWaitingList', 'CustomerType', 'ADR',
       'RequiredCarParkingSpaces', 'TotalOfSpecialRequests',
       'ReservationStatus', 'HotelNumber', 'UUID', 'ArrivalDate',
       'DepartureDate', 'Length of Stay', 'BookingDate', 'DaysSinceBooking',
       'ArrivalDate_DaysBeforeHoliday', 'ArrivalDate_DaysAfterHoliday',
       'DepartureDate_DaysBeforeHoliday', 'DepartureDate_DaysAfterHoliday',
       'BookingDate_DaysBeforeHoliday', 'BookingDate_DaysAfterHoliday',
       'ArrivalWeek', 'ArrivalDay', 'ArrivalDateDayName',
       'ArrivalDateMonthName'],
      dtype='object')

In [89]:
new_temporal_features = ['Length of Stay', 'DaysSinceBooking',
                         'ArrivalDate_DaysBeforeHoliday', 
                         'ArrivalDate_DaysAfterHoliday',
                         'DepartureDate_DaysBeforeHoliday',
                         'DepartureDate_DaysAfterHoliday',
                         'BookingDate_DaysBeforeHoliday',
                         'BookingDate_DaysAfterHoliday',
                         'ArrivalWeek', 'ArrivalDay',
                         'ArrivalDateDayName',
                         'ArrivalDateMonthName']
new_temporal_features.sort()
new_temporal_features

['ArrivalDateDayName',
 'ArrivalDateMonthName',
 'ArrivalDate_DaysAfterHoliday',
 'ArrivalDate_DaysBeforeHoliday',
 'ArrivalDay',
 'ArrivalWeek',
 'BookingDate_DaysAfterHoliday',
 'BookingDate_DaysBeforeHoliday',
 'DaysSinceBooking',
 'DepartureDate_DaysAfterHoliday',
 'DepartureDate_DaysBeforeHoliday',
 'Length of Stay']
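
Since this list is maintained by hand, a quick check that every name actually exists as a column can catch typos or features that were commented out. The sketch below uses shortened, hypothetical stand-ins (`columns`, `features`) for `df_data.columns` and `new_temporal_features`.

```python
# Hypothetical stand-ins: a shortened column list and feature list.
columns = ['ADR', 'ArrivalDate', 'Length of Stay', 'ArrivalWeek', 'ArrivalDay']
features = ['Length of Stay', 'ArrivalWeek', 'ArrivalDay', 'ADR_lag_1']

# Any listed feature that is not a column in the DataFrame.
missing = sorted(set(features) - set(columns))
if missing:
    print(f'Not found in the DataFrame: {missing}')
```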

In [90]:
# new_temporal_features = ['Length of Stay', 'DaysSinceBooking',
#                          'ArrivalDate_DaysBeforeHoliday', 
#                          'ArrivalDate_DaysAfterHoliday',
#                          'DepartureDate_DaysBeforeHoliday',
#                          'DepartureDate_DaysAfterHoliday',
#                          'BookingDate_DaysBeforeHoliday',
#                          'BookingDate_DaysAfterHoliday',
#                          'ArrivalWeek', 'ArrivalDay',
#                          'ArrivalDateDayName',
#                          'ArrivalDateMonthName',
#                          'ADR_lag_1', 'ADR_lag_7',
#                          'ADR_7d_avg','ADR_30d_avg',
#                          'ADR_7d_std', 'ADR_30d_std',
#                          'ADR_ewm_3', 'ADR_ewm_7']
# new_temporal_features.sort()
# new_temporal_features

In [91]:
## Save results to JSON file

# Step 1: Load the existing column groups
file_name = '../../data/column_groups.json'
with open(file_name, 'r') as file:
    data = json.load(file)

# Step 2: Add (or overwrite) the list of new temporal features
data['new_temporal_features'] = new_temporal_features

# Step 3: Write the updated JSON data back to the file
with open(file_name, 'w') as file:
    json.dump(data, file, indent=4)  # indent=4 pretty-prints the file

print("Key-value pair added successfully.")

Key-value pair added successfully.
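
The read, update, and write steps above assume that `column_groups.json` already exists. A minimal sketch of the same pattern as a reusable helper that also tolerates a missing file is shown here; `update_json_key` and the demo path are hypothetical names, and the example writes to a temporary directory rather than the real data folder.

```python
import json
import os
import tempfile

def update_json_key(path, key, value):
    """Read a JSON object from path (or start empty), set key, and write it back."""
    if os.path.exists(path):
        with open(path, 'r') as file:
            data = json.load(file)
    else:
        data = {}
    data[key] = value
    with open(path, 'w') as file:
        json.dump(data, file, indent=4)
    return data

# Demonstrate against a throwaway file, not the real column_groups.json.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'column_groups_demo.json')
result = update_json_key(demo_path, 'new_temporal_features',
                         ['ArrivalWeek', 'ArrivalDay'])
```

Calling the helper a second time with the same key overwrites the stored list, which matches the behavior of the cell above when the notebook is re-run.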


In [92]:
df_data.to_feather('../../data/3.1_temporally_updated_data.feather')