# Feature Engineering: Datetime Features

---

**Calculating Dates**

The source data does not include specific datetime features. Instead, it offers a selection of different pieces that I can use to create new datetime features:
* The arrival year, month name, and day of month.
* The number of weekday and weekend nights.
* The "booking lead time," or how far in advance the guest booked their reservation.

Using these features, I can create three new datetime features to start: the arrival, departure, and booking dates.
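
As a minimal sketch of the date arithmetic behind these three features, the standard-library example below derives all three dates for a single reservation. The helper name `derive_dates` is hypothetical and only for illustration; the notebook itself performs the same steps on whole columns with pandas.

```python
from datetime import datetime, timedelta


def derive_dates(year, month_name, day, week_nights, weekend_nights, lead_time_days):
    """Return (arrival, departure, booking) dates for one reservation."""
    # '%B' parses a full English month name such as 'July' (locale-dependent)
    arrival = datetime.strptime(f'{year}-{month_name}-{day}', '%Y-%B-%d').date()

    # Departure is the arrival date plus the total number of nights
    departure = arrival + timedelta(days=week_nights + weekend_nights)

    # Booking is the arrival date minus the lead time
    booking = arrival - timedelta(days=lead_time_days)

    return arrival, departure, booking


arrival, departure, booking = derive_dates(2015, 'July', 1, 2, 0, 257)
print(arrival, departure, booking)  # 2015-07-01 2015-07-03 2014-10-17
```

The sample values mirror a reservation shown later in the notebook (two week nights, a lead time of 257 days).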

---

**Extracting Date Details**

For each of these separate dates, I can go into more detail:
* Calculating the number of days since the last holiday and the number of days until the next.
* Determining the week of the year, day of the week, etc. to help capture more temporal features.
* Calculating the number of days between the last reservation change and the arrival date (for those reservations changed on or before the arrival date).

---

**Final Considerations**

This process will create many new features, which may hurt model performance. Before modeling, I may need to apply feature selection to keep only the most informative ones.

By the end of this notebook, I will have a new set of temporally-focused data to use for more extensive modeling and forecasting in the next steps of the workflow.

---

In [12]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [13]:
## Enabling access to custom functions in separate directory

# Import necessary modules
import sys
import os

# Construct the absolute path to the 'src' directory
src_path = os.path.abspath(os.path.join('../..', 'src'))

# Append the path to 'sys.path'
if src_path not in sys.path:
    sys.path.append(src_path)

import db_utils

In [14]:
import datetime as dt

import holidays
import pandas as pd

# *TEMPORARILY CHANGED TO USE BACKUP FILES* | Read Data from DuckDB

In [15]:
# # Path to the DuckDB database file
# db_path = '../data/hotel_reservations.duckdb'

# ## Select subset of data for review
# q = 'SELECT * FROM res_data LIMIT 5'

# with db_utils.duckdb_connection(db_path) as conn:
#     display(conn.execute(q).df())

In [16]:
# ## Select subset of data for review
# q = 'SELECT IsCanceled FROM res_data LIMIT 5'

# with db_utils.duckdb_connection(db_path) as conn:
#     display(conn.execute(q).df())

In [17]:
# ## Convert Arrival columns to strings

# q = ('''
# SELECT uuid, ArrivalDateYear, ArrivalDateMonth, ArrivalDateDayOfMonth,
# StaysInWeekNights, StaysInWeekendNights, LeadTime 
# FROM res_data''')

# with db_utils.duckdb_connection(db_path) as conn:
#     df_data = conn.execute(q).df()

# # df_data = arrival_cols.astype(str)
# df_data.head()

In [18]:
# ## Specify subset of temporal features
# date_features = ['ReservationStatusDate', 'ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth',
#                  'StaysInWeekNights', 'StaysInWeekendNights', 'LeadTime']
# date_features

In [19]:
path = '../../data/source/full_data.feather'

# df_data = pd.read_feather(path, columns = date_features) ## Use full dataset
df_data = pd.read_feather(path)

df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,UUID
0,0,342,2015,July,27,1,0,0,2,0.0,...,,,0,Transient,0.00,0,0,Check-Out,2015-07-01,1502832c-f9fe-4c2b-a7be-cce3a85d5441
1,0,737,2015,July,27,1,0,0,2,0.0,...,,,0,Transient,0.00,0,0,Check-Out,2015-07-01,2f85c742-d1c8-44fb-84da-c6e3dbc039a2
2,0,7,2015,July,27,1,0,1,1,0.0,...,,,0,Transient,75.00,0,0,Check-Out,2015-07-02,cbe1e6ae-2b44-4de2-a634-72356a93617a
3,0,13,2015,July,27,1,0,1,1,0.0,...,304,,0,Transient,75.00,0,0,Check-Out,2015-07-02,5a29a4bd-5718-425f-953e-0e18f3ff1d29
4,0,14,2015,July,27,1,0,2,2,0.0,...,240,,0,Transient,98.00,0,1,Check-Out,2015-07-03,a020d6df-bcba-45ea-a1d7-02688d117fbc
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79325,0,23,2017,August,35,30,2,5,2,0.0,...,394,,0,Transient,96.14,0,0,Check-Out,2017-09-06,56bcbcf2-ef31-4986-bd3a-71a5f0257e37
79326,0,102,2017,August,35,31,2,5,3,0.0,...,9,,0,Transient,225.43,0,2,Check-Out,2017-09-07,07fdd8a6-059a-4441-beb0-0f73f7e71faa
79327,0,34,2017,August,35,31,2,5,2,0.0,...,9,,0,Transient,157.71,0,4,Check-Out,2017-09-07,c3d83a7c-8c2e-4233-a8e0-9b1f39bf989d
79328,0,109,2017,August,35,31,2,5,2,0.0,...,89,,0,Transient,104.40,0,0,Check-Out,2017-09-07,bbd4055e-9d03-48e7-b4de-fc0a3b047a14


In [20]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 119390 entries, 0 to 79329
Data columns (total 32 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   IsCanceled                   119390 non-null  int64  
 1   LeadTime                     119390 non-null  int64  
 2   ArrivalDateYear              119390 non-null  int64  
 3   ArrivalDateMonth             119390 non-null  object 
 4   ArrivalDateWeekNumber        119390 non-null  int64  
 5   ArrivalDateDayOfMonth        119390 non-null  int64  
 6   StaysInWeekendNights         119390 non-null  int64  
 7   StaysInWeekNights            119390 non-null  int64  
 8   Adults                       119390 non-null  int64  
 9   Children                     119386 non-null  float64
 10  Babies                       119390 non-null  int64  
 11  Meal                         119390 non-null  object 
 12  Country                      118902 non-null  object 
 13  Marke

## Convert ReservationStatusDate to Datetime Format

In [21]:
df_data['ReservationStatusDate'] = pd.to_datetime(df_data['ReservationStatusDate'], yearfirst = True)
df_data['ReservationStatusDate']

0       2015-07-01
1       2015-07-01
2       2015-07-02
3       2015-07-02
4       2015-07-03
           ...    
79325   2017-09-06
79326   2017-09-07
79327   2017-09-07
79328   2017-09-07
79329   2017-09-07
Name: ReservationStatusDate, Length: 119390, dtype: datetime64[ns]

# Feature Engineering: Arrival, Departure, and Booking Dates

## Arrival Date

In [22]:
## Create new column of strings formatted as YYYY-MM-DD, then convert to datetime

arrival_details = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth']

df_data[arrival_details] = df_data[arrival_details].astype(str)

df_data['ArrivalDate'] = (df_data['ArrivalDateYear']
                          .str.cat(df_data[['ArrivalDateMonth',
                                            'ArrivalDateDayOfMonth']],
                                   '-')
                          )

df_data['ArrivalDate'] = pd.to_datetime(df_data['ArrivalDate'], yearfirst = True)

df_data = df_data.sort_values(by = 'ArrivalDate', ignore_index = True)

df_data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,UUID,ArrivalDate
0,0,342,2015,July,27,1,0,0,2,0.0,...,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1502832c-f9fe-4c2b-a7be-cce3a85d5441,2015-07-01
1,0,257,2015,July,27,1,0,2,1,0.0,...,,0,Transient,80.0,0,0,Check-Out,2015-07-03,2412bc0e-51ce-4b2b-9e7a-79957c90e9c4,2015-07-01
2,0,257,2015,July,27,1,0,2,2,0.0,...,,0,Transient,101.5,0,0,Check-Out,2015-07-03,85f71aaf-fc60-4d40-8e86-21926242d1cc,2015-07-01
3,0,257,2015,July,27,1,0,2,2,0.0,...,,0,Transient,101.5,0,0,Check-Out,2015-07-03,1d47be78-60d1-49a0-a562-a5f1368deb56,2015-07-01
4,0,257,2015,July,27,1,0,2,2,0.0,...,,0,Transient,101.5,0,0,Check-Out,2015-07-03,1e3cbdae-b796-4bff-b194-765f8ae5b51d,2015-07-01
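
As an alternative sketch, the same conversion can be written with an explicit format string, so that pandas does not have to infer the layout of the `YYYY-Month-D` strings. The toy frame below is invented for illustration, and `%B` assumes English month names.

```python
import pandas as pd

# Toy frame with the same three arrival columns as the source data
toy = pd.DataFrame({
    'ArrivalDateYear': [2015, 2017],
    'ArrivalDateMonth': ['July', 'August'],
    'ArrivalDateDayOfMonth': [1, 31],
})

# Join the parts into 'YYYY-Month-D' strings
arrival_str = (toy['ArrivalDateYear'].astype(str)
               + '-' + toy['ArrivalDateMonth']
               + '-' + toy['ArrivalDateDayOfMonth'].astype(str))

# An explicit format skips inference; unparseable rows raise an error by default
toy['ArrivalDate'] = pd.to_datetime(arrival_str, format='%Y-%B-%d')

print(toy['ArrivalDate'])
```

Failing loudly on a malformed row is usually preferable here, since a silently misparsed arrival date would carry through to the departure and booking dates.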


In [23]:
## Drop features post-conversion
df_data = df_data.drop(columns = arrival_details).drop(columns = 'ArrivalDateWeekNumber')
df_data

Unnamed: 0,IsCanceled,LeadTime,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,Meal,Country,MarketSegment,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,UUID,ArrivalDate
0,0,342,0,0,2,0.0,0,BB,PRT,Direct,...,,0,Transient,0.00,0,0,Check-Out,2015-07-01,1502832c-f9fe-4c2b-a7be-cce3a85d5441,2015-07-01
1,0,257,0,2,1,0.0,0,HB,PRT,Offline TA/TO,...,,0,Transient,80.00,0,0,Check-Out,2015-07-03,2412bc0e-51ce-4b2b-9e7a-79957c90e9c4,2015-07-01
2,0,257,0,2,2,0.0,0,HB,PRT,Offline TA/TO,...,,0,Transient,101.50,0,0,Check-Out,2015-07-03,85f71aaf-fc60-4d40-8e86-21926242d1cc,2015-07-01
3,0,257,0,2,2,0.0,0,HB,PRT,Offline TA/TO,...,,0,Transient,101.50,0,0,Check-Out,2015-07-03,1d47be78-60d1-49a0-a562-a5f1368deb56,2015-07-01
4,0,257,0,2,2,0.0,0,HB,PRT,Offline TA/TO,...,,0,Transient,101.50,0,0,Check-Out,2015-07-03,1e3cbdae-b796-4bff-b194-765f8ae5b51d,2015-07-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,0,108,2,5,2,0.0,0,HB,GBR,Online TA,...,,0,Transient,207.03,0,1,Check-Out,2017-09-07,bba6a12e-50bc-40c8-9053-7f8f3a4ece8e,2017-08-31
119386,0,194,2,5,2,1.0,0,HB,ITA,Online TA,...,,0,Transient,312.29,1,1,Check-Out,2017-09-07,2f58e4d6-9d61-481f-a413-cfa43d1af3f0,2017-08-31
119387,1,17,0,3,2,0.0,0,HB,ESP,Online TA,...,,0,Transient,207.00,0,2,Canceled,2017-08-14,e906ae58-d310-4886-8f5c-df89e100f204,2017-08-31
119388,0,191,2,5,2,0.0,0,HB,GBR,Offline TA/TO,...,,0,Contract,114.80,0,0,Check-Out,2017-09-07,82a13c2e-5455-4ee2-b5ed-a4a6f3636586,2017-08-31


## Departure Date

In [24]:
## Convert the number of nights stayed to timedeltas,
## then use them to calculate the departure date and length of stay

timedelta_wknd = pd.to_timedelta(
                    df_data.loc[:, 'StaysInWeekendNights'],
                    unit = 'D')
timedelta_wk = pd.to_timedelta(
                    df_data.loc[:, 'StaysInWeekNights'],
                    unit = 'D')

df_data['DepartureDate'] = (df_data.loc[:, 'ArrivalDate'] 
                            + timedelta_wk 
                            + timedelta_wknd)

df_data['Length of Stay'] = df_data['StaysInWeekendNights'] + df_data['StaysInWeekNights']

df_data = df_data.drop(columns = ['StaysInWeekendNights', 'StaysInWeekNights'])

df_data

Unnamed: 0,IsCanceled,LeadTime,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,...,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,UUID,ArrivalDate,DepartureDate,Length of Stay
0,0,342,2,0.0,0,BB,PRT,Direct,Direct,0,...,Transient,0.00,0,0,Check-Out,2015-07-01,1502832c-f9fe-4c2b-a7be-cce3a85d5441,2015-07-01,2015-07-01,0
1,0,257,1,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,...,Transient,80.00,0,0,Check-Out,2015-07-03,2412bc0e-51ce-4b2b-9e7a-79957c90e9c4,2015-07-01,2015-07-03,2
2,0,257,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,...,Transient,101.50,0,0,Check-Out,2015-07-03,85f71aaf-fc60-4d40-8e86-21926242d1cc,2015-07-01,2015-07-03,2
3,0,257,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,...,Transient,101.50,0,0,Check-Out,2015-07-03,1d47be78-60d1-49a0-a562-a5f1368deb56,2015-07-01,2015-07-03,2
4,0,257,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,...,Transient,101.50,0,0,Check-Out,2015-07-03,1e3cbdae-b796-4bff-b194-765f8ae5b51d,2015-07-01,2015-07-03,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,0,108,2,0.0,0,HB,GBR,Online TA,TA/TO,0,...,Transient,207.03,0,1,Check-Out,2017-09-07,bba6a12e-50bc-40c8-9053-7f8f3a4ece8e,2017-08-31,2017-09-07,7
119386,0,194,2,1.0,0,HB,ITA,Online TA,TA/TO,0,...,Transient,312.29,1,1,Check-Out,2017-09-07,2f58e4d6-9d61-481f-a413-cfa43d1af3f0,2017-08-31,2017-09-07,7
119387,1,17,2,0.0,0,HB,ESP,Online TA,TA/TO,0,...,Transient,207.00,0,2,Canceled,2017-08-14,e906ae58-d310-4886-8f5c-df89e100f204,2017-08-31,2017-09-03,3
119388,0,191,2,0.0,0,HB,GBR,Offline TA/TO,TA/TO,0,...,Contract,114.80,0,0,Check-Out,2017-09-07,82a13c2e-5455-4ee2-b5ed-a4a6f3636586,2017-08-31,2017-09-07,7


## `BookingDate` from `LeadTime`

In [25]:
## Convert to TimeDelta
df_data['LeadTime'] = pd.to_timedelta(df_data['LeadTime'], unit = 'D')

## Subtract LeadTime from ArrivalDate to calculate BookingDate
df_data['BookingDate'] = df_data['ArrivalDate'] - df_data['LeadTime']

df_data = df_data.drop(columns = 'LeadTime')

df_data

Unnamed: 0,IsCanceled,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,...,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,UUID,ArrivalDate,DepartureDate,Length of Stay,BookingDate
0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,...,0.00,0,0,Check-Out,2015-07-01,1502832c-f9fe-4c2b-a7be-cce3a85d5441,2015-07-01,2015-07-01,0,2014-07-24
1,0,1,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,80.00,0,0,Check-Out,2015-07-03,2412bc0e-51ce-4b2b-9e7a-79957c90e9c4,2015-07-01,2015-07-03,2,2014-10-17
2,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,101.50,0,0,Check-Out,2015-07-03,85f71aaf-fc60-4d40-8e86-21926242d1cc,2015-07-01,2015-07-03,2,2014-10-17
3,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,101.50,0,0,Check-Out,2015-07-03,1d47be78-60d1-49a0-a562-a5f1368deb56,2015-07-01,2015-07-03,2,2014-10-17
4,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,101.50,0,0,Check-Out,2015-07-03,1e3cbdae-b796-4bff-b194-765f8ae5b51d,2015-07-01,2015-07-03,2,2014-10-17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,0,2,0.0,0,HB,GBR,Online TA,TA/TO,0,0,...,207.03,0,1,Check-Out,2017-09-07,bba6a12e-50bc-40c8-9053-7f8f3a4ece8e,2017-08-31,2017-09-07,7,2017-05-15
119386,0,2,1.0,0,HB,ITA,Online TA,TA/TO,0,0,...,312.29,1,1,Check-Out,2017-09-07,2f58e4d6-9d61-481f-a413-cfa43d1af3f0,2017-08-31,2017-09-07,7,2017-02-18
119387,1,2,0.0,0,HB,ESP,Online TA,TA/TO,0,0,...,207.00,0,2,Canceled,2017-08-14,e906ae58-d310-4886-8f5c-df89e100f204,2017-08-31,2017-09-03,3,2017-08-14
119388,0,2,0.0,0,HB,GBR,Offline TA/TO,TA/TO,0,0,...,114.80,0,0,Check-Out,2017-09-07,82a13c2e-5455-4ee2-b5ed-a4a6f3636586,2017-08-31,2017-09-07,7,2017-02-21


In [26]:
df_data.head(10)

Unnamed: 0,IsCanceled,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,...,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,UUID,ArrivalDate,DepartureDate,Length of Stay,BookingDate
0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,...,0.0,0,0,Check-Out,2015-07-01,1502832c-f9fe-4c2b-a7be-cce3a85d5441,2015-07-01,2015-07-01,0,2014-07-24
1,0,1,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,80.0,0,0,Check-Out,2015-07-03,2412bc0e-51ce-4b2b-9e7a-79957c90e9c4,2015-07-01,2015-07-03,2,2014-10-17
2,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,101.5,0,0,Check-Out,2015-07-03,85f71aaf-fc60-4d40-8e86-21926242d1cc,2015-07-01,2015-07-03,2,2014-10-17
3,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,101.5,0,0,Check-Out,2015-07-03,1d47be78-60d1-49a0-a562-a5f1368deb56,2015-07-01,2015-07-03,2,2014-10-17
4,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,101.5,0,0,Check-Out,2015-07-03,1e3cbdae-b796-4bff-b194-765f8ae5b51d,2015-07-01,2015-07-03,2,2014-10-17
5,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,101.5,0,0,Check-Out,2015-07-03,b5e61051-be32-4fd8-bf3f-27c7e7615829,2015-07-01,2015-07-03,2,2014-10-17
6,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,101.5,0,0,Check-Out,2015-07-03,0a5b5cd2-e08e-446e-b94c-d0a43aab8d7b,2015-07-01,2015-07-03,2,2014-10-17
7,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,101.5,0,0,Check-Out,2015-07-03,781ec705-fdd7-4c9a-bdee-78047faa1fb2,2015-07-01,2015-07-03,2,2014-10-17
8,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,101.5,0,0,Check-Out,2015-07-03,995c35cf-a7b7-43b3-ae66-0da61d25a30f,2015-07-01,2015-07-03,2,2014-10-17
9,0,1,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,80.0,0,0,Check-Out,2015-07-03,7bf2459f-710e-4393-821a-5d7690bb0011,2015-07-01,2015-07-03,2,2014-10-17


# Feature Inspection: `ReservationStatusDate`

The final date-like feature in the original source is `ReservationStatusDate`, the date on which the reservation status was last changed.

As a date feature, it may be useful for feature engineering. However, based on my domain knowledge, I suspect that a status date earlier than the arrival date indicates that the reservation was canceled. Used directly, the feature would leak the target into the model, but I may be able to derive a different feature from it.

I will start by identifying the reservations where `ReservationStatusDate` is earlier than the arrival date. Then, I will filter the `IsCanceled` feature to those reservations and calculate the share that were canceled. If that share is very high (90-100%), I will consider how to generate a new feature from this data.



In [27]:
## Review data prior to changes
df_data['ReservationStatusDate'].head(10)

0   2015-07-01
1   2015-07-03
2   2015-07-03
3   2015-07-03
4   2015-07-03
5   2015-07-03
6   2015-07-03
7   2015-07-03
8   2015-07-03
9   2015-07-03
Name: ReservationStatusDate, dtype: datetime64[ns]

## Read-In `IsCanceled` Data to Match Reservations

In [28]:
## Identify reservations last changed before arrival
change_filter = (df_data['ReservationStatusDate'] < df_data['ArrivalDate'])

## Calculate the share of reservations last changed before arrival
avg_resstatdate_before_arrival = change_filter.mean()

## Calculate the overall share of canceled reservations
avg_cxl = df_data['IsCanceled'].mean()

print(f'The overall share of canceled reservations is: {avg_cxl:.2%}\n')

print(' '.join(['The share of reservations with a ReservationStatusDate',
                f'prior to the arrival date is: {avg_resstatdate_before_arrival:.2%}\n']))

## Print advice based on results
if avg_cxl >= .9:
    print(' '.join(['The `ReservationStatusDate` feature is too strongly indicative of the `IsCanceled` feature.',
                    'It should not be used for modeling.']))
elif avg_cxl >= .25:
    print(' '.join(['This feature is related to the `IsCanceled` feature.',
                    'Make sure to review it in more detail to determine whether to use it.']))
else:
    print('The `ReservationStatusDate` feature is unlikely to be predictive of the `IsCanceled` feature.')

The overall share of canceled reservations is: 37.04%

The share of reservations with a ReservationStatusDate prior to the arrival date is: 35.29%

This feature is related to the `IsCanceled` feature. Make sure to review it in more detail to determine whether to use it.
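
The conditional check described earlier (the cancellation rate among reservations whose status date precedes arrival) can be sketched as follows. This is a minimal illustration under stated assumptions: the toy values are invented and not drawn from the dataset, and the variable names are hypothetical.

```python
import pandas as pd

# Invented toy data, for illustration only
toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2017-08-31'] * 4),
    'ReservationStatusDate': pd.to_datetime(['2017-09-07', '2017-08-14',
                                             '2017-09-03', '2017-08-20']),
    'IsCanceled': [0, 1, 0, 1],
})

status_before_arrival = toy['ReservationStatusDate'] < toy['ArrivalDate']

# Cancellation rate within each group, rather than the share of rows in the group
cxl_rate_before = toy.loc[status_before_arrival, 'IsCanceled'].mean()
cxl_rate_on_or_after = toy.loc[~status_before_arrival, 'IsCanceled'].mean()

print(f'Canceled when status date precedes arrival: {cxl_rate_before:.0%}')
print(f'Canceled otherwise: {cxl_rate_on_or_after:.0%}')
```

Comparing the two rates separates "how many reservations changed before arrival" from "how many of those were cancellations", which is the quantity the 90-100% threshold refers to.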


### EDA Questions

In [29]:
# ## What is the breakdown of reservation statuses for those reservations with matching Arrival and Status Dates?
# ## (A.K.A. "same-day departures" or "day-use reservations.")

# sameday_status = (df_data['ReservationStatusDate'] == df_data['ArrivalDate'])

# (df_data[sameday_status]
#  .value_counts(subset = 'ReservationStatus',normalize = True)
#  .round(2))

In [30]:
# ## What is the breakdown of IsCanceled statuses
# ## for those reservations with matching Arrival and Status Dates?

# (df_data[sameday_status]
#  .value_counts(subset = 'IsCanceled',normalize = True)
#  .round(2))

In [31]:
# ## What is the average rate for these day-use/same-day-departure reservations?

# sameday_adr = (sameday_status & (df_data['ReservationStatus'] == 'Check-Out'))

# sameday_adr_median = df_data[sameday_adr]['ADR'].median()

# print(f'The median ADR for same-day reservations is: ${sameday_adr_median:.2f}')

# sameday_adr_gt_zero = (df_data[sameday_adr]['ADR'] > 0).mean().round(2)

# print(f'The number of same-day reservations with an ADR greater than zero is: {sameday_adr_gt_zero:.1%}')

In [32]:
# sameday_departure = (df_data['ReservationStatusDate'] == df_data['DepartureDate'])
# df_data[sameday_departure].value_counts(subset = 'IsCanceled', normalize =True).round(4)

In [33]:
# df_data[sameday_departure].value_counts(subset = 'ReservationStatus', normalize =True).round(4)

In [34]:
# df_data[(sameday_departure & (df_data['ReservationStatus'] != 'Check-Out'))]


## ReservationStatusDate Later Than Arrival Date

In [35]:
after_arrival_filter = (df_data['ReservationStatusDate'] > df_data['ArrivalDate'])

avg_resstatdate_after_arrival = (after_arrival_filter.mean())

print(f'The share of reservations last changed after arrival is: {avg_resstatdate_after_arrival:.0%}.')

The share of reservations last changed after arrival is: 62%.


# Future Work: Investigating Canceled Reservations

---

Although the `ReservationStatusDate` feature is not appropriate for feature engineering, I could use it to calculate the number of days between booking and cancellation (for canceled reservations only).

This would be outside of the scope of the current feature engineering, but I am noting it as future work for analysis.
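
A minimal sketch of that future calculation, restricted to canceled reservations, might look like the following. The column name `DaysBookingToCancellation` and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Invented toy data, for illustration only
toy = pd.DataFrame({
    'BookingDate': pd.to_datetime(['2017-02-21', '2017-08-14', '2017-05-01']),
    'ReservationStatusDate': pd.to_datetime(['2017-09-07', '2017-08-14', '2017-06-15']),
    'IsCanceled': [0, 1, 1],
})

days_elapsed = (toy['ReservationStatusDate'] - toy['BookingDate']).dt.days

# Keep the value only for canceled reservations; other rows become missing (NaN)
toy['DaysBookingToCancellation'] = days_elapsed.where(toy['IsCanceled'] == 1)

print(toy['DaysBookingToCancellation'])
```

Masking with `where` keeps the column aligned with the full dataset while making it explicit that the value is undefined for reservations that were not canceled.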

---

In [36]:
## Calculate the number of days between the status and booking dates

df_data['DaysSinceBooking'] = (df_data['ReservationStatusDate'] - df_data['BookingDate']).dt.days

df_data['DaysSinceBooking']

0         342
1         259
2         259
3         259
4         259
         ... 
119385    115
119386    201
119387      0
119388    198
119389      4
Name: DaysSinceBooking, Length: 119390, dtype: int64

In [37]:
df_data.head(10).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
IsCanceled,0,0,0,0,0,0,0,0,0,0
Adults,2,1,2,2,2,2,2,2,2,1
Children,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Babies,0,0,0,0,0,0,0,0,0,0
Meal,BB,HB,HB,HB,HB,HB,HB,HB,HB,HB
Country,PRT,PRT,PRT,PRT,PRT,PRT,PRT,PRT,PRT,PRT
MarketSegment,Direct,Offline TA/TO,Offline TA/TO,Offline TA/TO,Offline TA/TO,Offline TA/TO,Offline TA/TO,Offline TA/TO,Offline TA/TO,Offline TA/TO
DistributionChannel,Direct,TA/TO,TA/TO,TA/TO,TA/TO,TA/TO,TA/TO,TA/TO,TA/TO,TA/TO
IsRepeatedGuest,0,0,0,0,0,0,0,0,0,0
PreviousCancellations,0,0,0,0,0,0,0,0,0,0


## Inspection Results: `ReservationStatusDate`

---

Roughly 35% of reservations were last changed before arrival, close to the overall cancellation rate of about 37%. This suggests that a status date earlier than the arrival date is almost exactly indicative of a reservation canceled prior to arrival.

Additionally, for checked-out reservations, the number of days between `ArrivalDate` and `ReservationStatusDate` matches the length of stay, which already exists as a feature.

However, I did calculate the age of each reservation at its last status change and can use this information for further analysis of canceled reservations.

**Final Determination:** The feature `ReservationStatusDate` is not appropriate for feature engineering. Given its nearly exact correspondence to a reservation's cancellation status, it should be dropped from future datasets prior to modeling.

---

In [38]:
df_data = df_data.drop(columns = 'ReservationStatusDate')
df_data

Unnamed: 0,IsCanceled,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,...,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,UUID,ArrivalDate,DepartureDate,Length of Stay,BookingDate,DaysSinceBooking
0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,...,0.00,0,0,Check-Out,1502832c-f9fe-4c2b-a7be-cce3a85d5441,2015-07-01,2015-07-01,0,2014-07-24,342
1,0,1,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,80.00,0,0,Check-Out,2412bc0e-51ce-4b2b-9e7a-79957c90e9c4,2015-07-01,2015-07-03,2,2014-10-17,259
2,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,101.50,0,0,Check-Out,85f71aaf-fc60-4d40-8e86-21926242d1cc,2015-07-01,2015-07-03,2,2014-10-17,259
3,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,101.50,0,0,Check-Out,1d47be78-60d1-49a0-a562-a5f1368deb56,2015-07-01,2015-07-03,2,2014-10-17,259
4,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,101.50,0,0,Check-Out,1e3cbdae-b796-4bff-b194-765f8ae5b51d,2015-07-01,2015-07-03,2,2014-10-17,259
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,0,2,0.0,0,HB,GBR,Online TA,TA/TO,0,0,...,207.03,0,1,Check-Out,bba6a12e-50bc-40c8-9053-7f8f3a4ece8e,2017-08-31,2017-09-07,7,2017-05-15,115
119386,0,2,1.0,0,HB,ITA,Online TA,TA/TO,0,0,...,312.29,1,1,Check-Out,2f58e4d6-9d61-481f-a413-cfa43d1af3f0,2017-08-31,2017-09-07,7,2017-02-18,201
119387,1,2,0.0,0,HB,ESP,Online TA,TA/TO,0,0,...,207.00,0,2,Canceled,e906ae58-d310-4886-8f5c-df89e100f204,2017-08-31,2017-09-03,3,2017-08-14,0
119388,0,2,0.0,0,HB,GBR,Offline TA/TO,TA/TO,0,0,...,114.80,0,0,Check-Out,82a13c2e-5455-4ee2-b5ed-a4a6f3636586,2017-08-31,2017-09-07,7,2017-02-21,198


# Feature Engineering: Holidays

In [39]:
min_year = (df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']]
                 .min()
                 .min()
                 .year)

max_year = (df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']]
                 .max()
                 .max()
                 .year)

min_year, max_year

(2013, 2017)

In [40]:
# Fetch holidays for the specific country and range of years (2013-2017)
country_code = 'PT'
years = list(range(min_year, max_year + 1))

# country_holidays() replaces the deprecated CountryHoliday() constructor
pt_holidays = holidays.country_holidays(country_code, years = years)


def holiday_past(date, holiday_dates):
    """Return the days since the most recent holiday on or before `date` (None if there is none)."""

    # Convert Timestamp to datetime.date
    date = date.date()

    # Days elapsed since each holiday on or before the date (0 if the date is a holiday)
    past_holidays = [(date - h_date).days for h_date in holiday_dates if h_date <= date]

    return min(past_holidays, default=None)


def holiday_upcoming(date, holiday_dates):
    """Return the days until the next holiday strictly after `date` (None if there is none)."""

    # Convert Timestamp to datetime.date
    date = date.date()

    # Days remaining until each holiday after the date
    future_holidays = [(h_date - date).days for h_date in holiday_dates if h_date > date]

    return min(future_holidays, default=None)


# Function to calculate the proximity to holidays for a list of dates
def calculate_holiday_proximity(dates, holiday_dates):
    days_after_recent_holiday = []
    days_before_next_holiday = []

    for date in dates:
        days_after_recent_holiday.append(holiday_past(date, holiday_dates))
        days_before_next_holiday.append(holiday_upcoming(date, holiday_dates))

    return days_after_recent_holiday, days_before_next_holiday

In [41]:
# Apply the function to each date column in the dataframe
for column in ['ArrivalDate', 'DepartureDate', 'BookingDate']:
    after, before = calculate_holiday_proximity(df_data[column], pt_holidays)
    df_data[f'{column}_DaysBeforeHoliday'] = before
    df_data[f'{column}_DaysAfterHoliday'] = after

df_data

Unnamed: 0,IsCanceled,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,...,DepartureDate,Length of Stay,BookingDate,DaysSinceBooking,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday
0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,...,2015-07-01,0,2014-07-24,342,45,21,45,21,22,44
1,0,1,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,2015-07-03,2,2014-10-17,259,45,21,43,23,52,63
2,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,2015-07-03,2,2014-10-17,259,45,21,43,23,52,63
3,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,2015-07-03,2,2014-10-17,259,45,21,43,23,52,63
4,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,2015-07-03,2,2014-10-17,259,45,21,43,23,52,63
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,0,2,0.0,0,HB,GBR,Online TA,TA/TO,0,0,...,2017-09-07,7,2017-05-15,115,35,16,28,23,26,14
119386,0,2,1.0,0,HB,ITA,Online TA,TA/TO,0,0,...,2017-09-07,7,2017-02-18,201,35,16,28,23,55,48
119387,1,2,0.0,0,HB,ESP,Online TA,TA/TO,0,0,...,2017-09-03,3,2017-08-14,0,35,16,32,19,1,60
119388,0,2,0.0,0,HB,GBR,Offline TA/TO,TA/TO,0,0,...,2017-09-07,7,2017-02-21,198,35,16,28,23,52,51
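
The row-by-row scan above compares every date against every holiday. As a hedged alternative sketch, a binary search over a sorted list of holiday dates finds both neighbors directly. The function name `holiday_proximity` and the three sample dates are assumptions for illustration; the sample list is not a complete holiday calendar.

```python
from bisect import bisect_right
from datetime import date


def holiday_proximity(day, sorted_holidays):
    """Return (days_since_last, days_until_next) for `day`.

    A holiday on `day` itself counts as the last holiday (0 days since),
    and the next holiday is strictly later. Either value is None when
    no holiday exists on that side.
    """
    # Index of the first holiday strictly after `day`
    idx = bisect_right(sorted_holidays, day)

    days_since = (day - sorted_holidays[idx - 1]).days if idx > 0 else None
    days_until = (sorted_holidays[idx] - day).days if idx < len(sorted_holidays) else None

    return days_since, days_until


# Illustrative subset of dates, not a complete holiday calendar
sample_holidays = sorted([date(2015, 6, 10), date(2015, 8, 15), date(2015, 12, 25)])

print(holiday_proximity(date(2015, 7, 1), sample_holidays))   # (21, 45)
print(holiday_proximity(date(2015, 8, 15), sample_holidays))  # (0, 132)
```

Because many reservations share the same date, computing the pair once per unique date and mapping the results back onto the column would likely reduce the work further.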


# Feature Engineering: ISO Day of Week, ISO Week of Year

In [42]:
df_data['ArrivalDate'].dt.dayofweek.head()

0    2
1    2
2    2
3    2
4    2
Name: ArrivalDate, dtype: int32

In [43]:
arrival_isocal = (df_data['ArrivalDate']
                  .dt.isocalendar()[['week', 'day']]
                  .rename(columns = {'week':'ArrivalWeek', 'day': 'ArrivalDay'}))
arrival_isocal

Unnamed: 0,ArrivalWeek,ArrivalDay
0,27,3
1,27,3
2,27,3
3,27,3
4,27,3
...,...,...
119385,35,4
119386,35,4
119387,35,4
119388,35,4
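
The two outputs above differ by one because they use different numbering conventions. The standard-library sketch below shows the same distinction for a single date, along with one caveat: the ISO year can differ from the calendar year near year boundaries, which matters when only the ISO week is kept.

```python
from datetime import date

arrival = date(2015, 7, 1)  # a Wednesday

# weekday() is zero-based with Monday = 0, matching pandas' .dt.dayofweek
print(arrival.weekday())     # 2

# isoweekday() is one-based with Monday = 1, matching the ISO 'day' field
print(arrival.isoweekday())  # 3

# isocalendar() returns the ISO year, ISO week, and ISO weekday
iso_year, iso_week, iso_day = arrival.isocalendar()
print(iso_year, iso_week, iso_day)  # 2015 27 3

# Near a year boundary, the ISO year may not match the calendar year
print(tuple(date(2016, 1, 1).isocalendar()))  # (2015, 53, 5)
```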


In [44]:
df_data = pd.concat([df_data, arrival_isocal], axis = 1)
df_data

Unnamed: 0,IsCanceled,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,...,BookingDate,DaysSinceBooking,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,ArrivalWeek,ArrivalDay
0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,...,2014-07-24,342,45,21,45,21,22,44,27,3
1,0,1,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,2014-10-17,259,45,21,43,23,52,63,27,3
2,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,2014-10-17,259,45,21,43,23,52,63,27,3
3,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,2014-10-17,259,45,21,43,23,52,63,27,3
4,0,2,0.0,0,HB,PRT,Offline TA/TO,TA/TO,0,0,...,2014-10-17,259,45,21,43,23,52,63,27,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,0,2,0.0,0,HB,GBR,Online TA,TA/TO,0,0,...,2017-05-15,115,35,16,28,23,26,14,35,4
119386,0,2,1.0,0,HB,ITA,Online TA,TA/TO,0,0,...,2017-02-18,201,35,16,28,23,55,48,35,4
119387,1,2,0.0,0,HB,ESP,Online TA,TA/TO,0,0,...,2017-08-14,0,35,16,32,19,1,60,35,4
119388,0,2,0.0,0,HB,GBR,Offline TA/TO,TA/TO,0,0,...,2017-02-21,198,35,16,28,23,52,51,35,4


In [45]:
df_data.head().T

Unnamed: 0,0,1,2,3,4
IsCanceled,0,0,0,0,0
Adults,2,1,2,2,2
Children,0.0,0.0,0.0,0.0,0.0
Babies,0,0,0,0,0
Meal,BB,HB,HB,HB,HB
Country,PRT,PRT,PRT,PRT,PRT
MarketSegment,Direct,Offline TA/TO,Offline TA/TO,Offline TA/TO,Offline TA/TO
DistributionChannel,Direct,TA/TO,TA/TO,TA/TO,TA/TO
IsRepeatedGuest,0,0,0,0,0
PreviousCancellations,0,0,0,0,0


# Feature Engineering: Rolling Averages, Lag, and Polynomial Features

---

***EXPANDING DATASET USING TIME-SERIES APPROACHES***

> To help capture the time-series structure of the ADR, I will also introduce lag features, rolling averages, rolling standard deviations, and exponential smoothing as new features.
>
> This approach does use the target feature for engineering, but as long as I split my data chronologically on the arrival date and each feature uses only earlier rows, I expect to avoid data leakage.
>
> Polynomial features will be added during the modeling pipeline.

---
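
A chronological split on the arrival date, as mentioned in the note above, could be sketched like this. The helper `split_by_arrival`, the cutoff, and the toy values are hypothetical and only illustrate the idea.

```python
import pandas as pd


def split_by_arrival(df, cutoff, date_col='ArrivalDate'):
    """Split rows into (train, test) at a cutoff arrival date."""
    cutoff = pd.Timestamp(cutoff)

    # Rows arriving before the cutoff train the model; the rest are held out
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]

    return train, test


# Invented toy data, for illustration only
toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2017-06-01', '2017-07-15', '2017-08-01', '2017-08-31']),
    'ADR': [90.0, 110.0, 120.0, 105.0],
})

train, test = split_by_arrival(toy, '2017-08-01')
print(len(train), len(test))  # 2 2
```

Whether a split like this fully prevents leakage also depends on how the lag and rolling features are built, since reservations that share an arrival date sit next to each other in the sorted data.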

In [48]:
# Lag features (the data is sorted by arrival date; each lag is a number of rows, not days)
df_data['ADR_lag_1'] = df_data['ADR'].shift(1)
df_data['ADR_lag_7'] = df_data['ADR'].shift(7)

# NOTE: The windows below count rows (reservations), not calendar days,
# and the '7d'/'30d' column names do not match the window sizes of 3 and 7.

# Rolling average over the previous 3 rows
df_data['ADR_7d_avg'] = df_data['ADR'].shift(1).rolling(window=3).mean()
# Rolling average over the previous 7 rows
df_data['ADR_30d_avg'] = df_data['ADR'].shift(1).rolling(window=7).mean()
# Rolling standard deviation over the previous 3 rows
df_data['ADR_7d_std'] = df_data['ADR'].shift(1).rolling(window=3).std()
# Rolling standard deviation over the previous 7 rows
df_data['ADR_30d_std'] = df_data['ADR'].shift(1).rolling(window=7).std()

# Exponential smoothing over the previous rows (spans of 3 and 7)
df_data['ADR_ewm_3'] = df_data['ADR'].shift(1).ewm(span=3, adjust=False).mean()
df_data['ADR_ewm_7'] = df_data['ADR'].shift(1).ewm(span=7, adjust=False).mean()
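
Since several reservations share each arrival date, a window measured in calendar days needs one value per day first. The sketch below is one possible approach, not the notebook's method: it averages ADR per arrival date, takes a rolling mean over the three preceding days, and maps the result back to each reservation. The column name `ADR_prev_3d_avg` and the toy values are assumptions.

```python
import pandas as pd

# Invented toy data, for illustration only
toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2015-07-01', '2015-07-01', '2015-07-02',
                                   '2015-07-03', '2015-07-04']),
    'ADR': [80.0, 100.0, 120.0, 60.0, 90.0],
})

# One value per calendar day: the mean ADR of that day's arrivals
daily_adr = toy.groupby('ArrivalDate')['ADR'].mean()

# Mean over the 3 calendar days before each date;
# closed='left' excludes the current day from its own window
daily_3d_avg = daily_adr.rolling('3D', closed='left').mean()

# Attach the daily feature back to each reservation by arrival date
toy['ADR_prev_3d_avg'] = toy['ArrivalDate'].map(daily_3d_avg)

print(toy)
```

With this layout, every reservation arriving on the same day receives the same value, and none of them contributes to its own feature.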

# Final Inspection

---

I extracted a good deal of information about booking and stay dates and added temporal features. While this approach adds a significant number of features, I expect the additional data to be worthwhile.

---

In [49]:
df_data.to_feather('../../data/2.2_temporally_updated_data.feather')