# Feature Engineering: Datetime Features

---

**Calculating Dates**

The source data does not include ready-made datetime columns for these dates. Instead, it provides the components I need to build them:
* The arrival year, month name, and day of month.
* The number of weekday and weekend nights.
* The "booking lead time," or how far in advance the guest booked their reservation.

Using these features, I can create three new datetime features to start: the arrival, departure, and booking dates.

---

**Extracting Date Details**

For each of these separate dates, I can go into more detail:
* Calculating the number of days since the last holiday and the number of days until the next.
* Determining the week of the year, day of the week, etc. to help capture more temporal features.
* Calculating the number of days between the last reservation change and the arrival date (for reservations changed on or before the arrival date).
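
The last bullet can be sketched roughly as follows. This is a minimal illustration, assuming that `ReservationStatusDate` (a column in `res_data`) records the date of the last reservation change; that reading of the column, the toy rows, and the `DaysFromChangeToArrival` name are assumptions made for the example.

```python
import pandas as pd

# Toy rows; 'ReservationStatusDate' is ASSUMED to hold the date of the last change
toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2015-07-01', '2015-07-01', '2015-07-10']),
    'ReservationStatusDate': pd.to_datetime(['2015-06-20', '2015-07-03', '2015-07-10']),
})

# Whole days from the last change to the arrival date
delta_days = (toy['ArrivalDate'] - toy['ReservationStatusDate']).dt.days

# Keep the value only when the change happened on or before arrival;
# later changes are left missing (NaN) rather than stored as negative numbers
toy['DaysFromChangeToArrival'] = delta_days.where(delta_days >= 0)
```

Leaving later changes as missing values keeps the feature restricted to the case described in the bullet; how to encode the remaining rows is a separate modeling decision.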

---

**Final Considerations**

This process will create many new features, some of them likely redundant, which could hurt future modeling performance. Prior to modeling, I may need to use feature selection methods to keep only the most informative ones.
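
One simple form of feature selection is dropping near-duplicate columns. The sketch below is only an illustration of that idea, not the method used later in this workflow: the `drop_highly_correlated` helper, the 0.95 threshold, and the toy columns are all assumptions.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one column from each pair whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair of columns is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy data: 'b' is an exact multiple of 'a', so it adds no new information
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
reduced = drop_highly_correlated(toy)
```

A filter like this could be relevant here because holiday-proximity features for arrival and departure dates will often move together for short stays.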

By the end of this notebook, I will have a new set of temporally focused features to use for more extensive modeling and forecasting in the next steps of the workflow.

---

In [7]:
%load_ext autoreload
%autoreload 2



In [8]:
## Enabling access to custom functions in separate directory

# Import necessary modules
import sys
import os

# Construct the absolute path to the 'src' directory
src_path = os.path.abspath(os.path.join('..', 'src'))

# Append the path to 'sys.path'
if src_path not in sys.path:
    sys.path.append(src_path)

import db_utils

In [9]:
import holidays
import pandas as pd

# Read Data from DuckDB

In [10]:
# Path to the DuckDB database file
db_path = '../data/hotel_reservations.duckdb'

## Select subset of data for review
q = 'SELECT * FROM res_data LIMIT 5'

with db_utils.duckdb_connection(db_path) as conn:
    display(conn.execute(q).df())

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,UUID
0,0,342,2015,July,27,1,0,0,2,0,...,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1,f0308f81-b9a3-414f-99da-07882aeb3093
1,0,737,2015,July,27,1,0,0,2,0,...,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1,45191a56-9557-4c70-84e4-724e4549a35c
2,0,7,2015,July,27,1,0,1,1,0,...,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1,599d7902-a7b9-466d-8d12-3dd66fa0c6eb
3,0,13,2015,July,27,1,0,1,1,0,...,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1,66c0071a-36b2-4cd8-b930-2242c1aace20
4,0,14,2015,July,27,1,0,2,2,0,...,,0,Transient,98.0,0,1,Check-Out,2015-07-03,1,c7b7e1d6-cd15-4f9a-94bd-a2d774e4f4db


In [11]:
## Select the columns needed to build the date features

q = ('''
SELECT uuid, ArrivalDateYear, ArrivalDateMonth, ArrivalDateDayOfMonth,
StaysInWeekNights, StaysInWeekendNights, LeadTime 
FROM res_data''')

with db_utils.duckdb_connection(db_path) as conn:
    df_data = conn.execute(q).df()

# Preview the selected columns
df_data.head()

Unnamed: 0,UUID,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime
0,f0308f81-b9a3-414f-99da-07882aeb3093,2015,July,1,0,0,342
1,45191a56-9557-4c70-84e4-724e4549a35c,2015,July,1,0,0,737
2,599d7902-a7b9-466d-8d12-3dd66fa0c6eb,2015,July,1,1,0,7
3,66c0071a-36b2-4cd8-b930-2242c1aace20,2015,July,1,1,0,13
4,c7b7e1d6-cd15-4f9a-94bd-a2d774e4f4db,2015,July,1,2,0,14


# Feature Engineering: Arrival, Departure, and Booking Dates

## Arrival Date

In [12]:
## Create new column of strings such as '2015-July-1', then convert to datetime

arrival_details = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth']

df_data[arrival_details] = df_data[arrival_details].astype(str)

df_data['ArrivalDate'] = df_data['ArrivalDateYear'].str.cat(df_data[['ArrivalDateMonth',
                                                                     'ArrivalDateDayOfMonth']],
                                                            sep = '-')

# An explicit format avoids slow, ambiguous inference; %B expects full English month names
df_data['ArrivalDate'] = pd.to_datetime(df_data['ArrivalDate'], format = '%Y-%B-%d')

df_data.head()

Unnamed: 0,UUID,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime,ArrivalDate
0,f0308f81-b9a3-414f-99da-07882aeb3093,2015,July,1,0,0,342,2015-07-01
1,45191a56-9557-4c70-84e4-724e4549a35c,2015,July,1,0,0,737,2015-07-01
2,599d7902-a7b9-466d-8d12-3dd66fa0c6eb,2015,July,1,1,0,7,2015-07-01
3,66c0071a-36b2-4cd8-b930-2242c1aace20,2015,July,1,1,0,13,2015-07-01
4,c7b7e1d6-cd15-4f9a-94bd-a2d774e4f4db,2015,July,1,2,0,14,2015-07-01


## Departure Date

In [13]:
# Convert the weekend-night and weeknight counts to timedeltas
timedelta_wknd = pd.to_timedelta(df_data['StaysInWeekendNights'], unit = 'D')
timedelta_wk = pd.to_timedelta(df_data['StaysInWeekNights'], unit = 'D')

# Departure date = arrival date + total nights stayed
df_data['DepartureDate'] = df_data['ArrivalDate'] + timedelta_wk + timedelta_wknd

df_data.head()

Unnamed: 0,UUID,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime,ArrivalDate,DepartureDate
0,f0308f81-b9a3-414f-99da-07882aeb3093,2015,July,1,0,0,342,2015-07-01,2015-07-01
1,45191a56-9557-4c70-84e4-724e4549a35c,2015,July,1,0,0,737,2015-07-01,2015-07-01
2,599d7902-a7b9-466d-8d12-3dd66fa0c6eb,2015,July,1,1,0,7,2015-07-01,2015-07-02
3,66c0071a-36b2-4cd8-b930-2242c1aace20,2015,July,1,1,0,13,2015-07-01,2015-07-02
4,c7b7e1d6-cd15-4f9a-94bd-a2d774e4f4db,2015,July,1,2,0,14,2015-07-01,2015-07-03


## Booking Date

In [14]:
df_data['LeadTime']

0         342
1         737
2           7
3          13
4          14
         ... 
119385     23
119386    102
119387     34
119388    109
119389    205
Name: LeadTime, Length: 119390, dtype: int64

In [15]:
df_data['LeadTimeDelta'] = pd.to_timedelta(df_data['LeadTime'], unit = 'D')
df_data['LeadTimeDelta']

0        342 days
1        737 days
2          7 days
3         13 days
4         14 days
           ...   
119385    23 days
119386   102 days
119387    34 days
119388   109 days
119389   205 days
Name: LeadTimeDelta, Length: 119390, dtype: timedelta64[ns]

In [16]:
# Booking date = arrival date - lead time
df_data['BookingDate'] = df_data['ArrivalDate'] - df_data['LeadTimeDelta']

df_data.head(10)

Unnamed: 0,UUID,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime,ArrivalDate,DepartureDate,LeadTimeDelta,BookingDate
0,f0308f81-b9a3-414f-99da-07882aeb3093,2015,July,1,0,0,342,2015-07-01,2015-07-01,342 days,2014-07-24
1,45191a56-9557-4c70-84e4-724e4549a35c,2015,July,1,0,0,737,2015-07-01,2015-07-01,737 days,2013-06-24
2,599d7902-a7b9-466d-8d12-3dd66fa0c6eb,2015,July,1,1,0,7,2015-07-01,2015-07-02,7 days,2015-06-24
3,66c0071a-36b2-4cd8-b930-2242c1aace20,2015,July,1,1,0,13,2015-07-01,2015-07-02,13 days,2015-06-18
4,c7b7e1d6-cd15-4f9a-94bd-a2d774e4f4db,2015,July,1,2,0,14,2015-07-01,2015-07-03,14 days,2015-06-17
5,03710df6-604b-4d72-a36d-93e52d2c0739,2015,July,1,2,0,14,2015-07-01,2015-07-03,14 days,2015-06-17
6,f9eb5e25-88c9-45ce-9956-bdfcfff01ad7,2015,July,1,2,0,0,2015-07-01,2015-07-03,0 days,2015-07-01
7,a55e2cf5-2af5-4e82-a657-94c34c6b2129,2015,July,1,2,0,9,2015-07-01,2015-07-03,9 days,2015-06-22
8,08917955-1d70-4206-a553-e1b6613dd1db,2015,July,1,3,0,85,2015-07-01,2015-07-04,85 days,2015-04-07
9,2f0aa580-0a90-47e1-93e6-6550f5117d05,2015,July,1,3,0,75,2015-07-01,2015-07-04,75 days,2015-04-17


In [17]:
# Drop the source and intermediate columns now that the three dates exist
drop_cols = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth',
             'StaysInWeekNights', 'StaysInWeekendNights', 'LeadTime', 'LeadTimeDelta']
df_data = df_data.drop(columns = drop_cols)
df_data.head()

Unnamed: 0,UUID,ArrivalDate,DepartureDate,BookingDate
0,f0308f81-b9a3-414f-99da-07882aeb3093,2015-07-01,2015-07-01,2014-07-24
1,45191a56-9557-4c70-84e4-724e4549a35c,2015-07-01,2015-07-01,2013-06-24
2,599d7902-a7b9-466d-8d12-3dd66fa0c6eb,2015-07-01,2015-07-02,2015-06-24
3,66c0071a-36b2-4cd8-b930-2242c1aace20,2015-07-01,2015-07-02,2015-06-18
4,c7b7e1d6-cd15-4f9a-94bd-a2d774e4f4db,2015-07-01,2015-07-03,2015-06-17


In [18]:
df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']].min()

ArrivalDate     2015-07-01
DepartureDate   2015-07-01
BookingDate     2013-06-24
dtype: datetime64[ns]
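
Beyond checking the minimum dates, a quick consistency check can confirm that each booking precedes (or coincides with) its arrival, and each arrival precedes (or coincides with) its departure. The sketch below uses two toy rows copied from the output above; the `ordered` and `n_violations` names are illustrative only.

```python
import pandas as pd

# Two rows taken from the table above
toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2015-07-01', '2015-07-01']),
    'DepartureDate': pd.to_datetime(['2015-07-01', '2015-07-03']),
    'BookingDate': pd.to_datetime(['2014-07-24', '2015-06-17']),
})

# Expected ordering: BookingDate <= ArrivalDate <= DepartureDate
ordered = (toy['BookingDate'] <= toy['ArrivalDate']) & (toy['ArrivalDate'] <= toy['DepartureDate'])
n_violations = int((~ordered).sum())
```

Running the same comparison on `df_data` should give zero violations as long as lead times and night counts are non-negative.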

# Feature Engineering: Holidays

In [19]:
# Fetch Portuguese holidays for the years covered by the data (2013-2017);
# the earliest booking date falls in 2013
pt_holidays = holidays.country_holidays('PT', years=[2013, 2014, 2015, 2016, 2017])

# Function to calculate the proximity to holidays for a list of dates
# (the parameter is named holiday_dates to avoid shadowing the holidays module)
def calculate_holiday_proximity(dates, holiday_dates):
    days_after_recent_holiday = []
    days_before_next_holiday = []

    for dt in dates:
        date = dt.date()  # Convert Timestamp to datetime.date

        # Days since the closest past holiday (None if there is none in range);
        # a holiday falling on the date itself is not counted
        days_after = min(((date - h_date).days for h_date in holiday_dates if h_date < date),
                         default=None)

        # Days until the closest upcoming holiday (None if there is none in range)
        days_before = min(((h_date - date).days for h_date in holiday_dates if h_date > date),
                          default=None)

        days_after_recent_holiday.append(days_after)
        days_before_next_holiday.append(days_before)

    return days_after_recent_holiday, days_before_next_holiday

In [20]:
# Apply the function to each date column in the dataframe
for column in ['ArrivalDate', 'DepartureDate', 'BookingDate']:
    after, before = calculate_holiday_proximity(df_data[column], pt_holidays)
    df_data[f'{column}_DaysBeforeHoliday'] = before
    df_data[f'{column}_DaysAfterHoliday'] = after

df_data

Unnamed: 0,UUID,ArrivalDate,DepartureDate,BookingDate,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday
0,f0308f81-b9a3-414f-99da-07882aeb3093,2015-07-01,2015-07-01,2014-07-24,45,21,45,21,22,44
1,45191a56-9557-4c70-84e4-724e4549a35c,2015-07-01,2015-07-01,2013-06-24,45,21,45,21,52,14
2,599d7902-a7b9-466d-8d12-3dd66fa0c6eb,2015-07-01,2015-07-02,2015-06-24,45,21,44,22,52,14
3,66c0071a-36b2-4cd8-b930-2242c1aace20,2015-07-01,2015-07-02,2015-06-18,45,21,44,22,58,8
4,c7b7e1d6-cd15-4f9a-94bd-a2d774e4f4db,2015-07-01,2015-07-03,2015-06-17,45,21,43,23,59,7
...,...,...,...,...,...,...,...,...,...,...
119385,9544f2a3-d202-4993-be8f-8a1b7639ba80,2017-08-30,2017-09-06,2017-08-07,36,15,29,22,8,53
119386,4bc68dbe-ce80-47e8-9e2f-204c01033720,2017-08-31,2017-09-07,2017-05-21,35,16,28,23,20,20
119387,ba81771c-d85f-44ac-8184-a3119f121404,2017-08-31,2017-09-07,2017-07-28,35,16,28,23,18,43
119388,186c0f9d-72ea-4885-b287-e0f97ac1de43,2017-08-31,2017-09-07,2017-05-14,35,16,28,23,27,13
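
The loop above compares every date against every holiday, which can be slow for roughly 119,000 rows and three date columns. A vectorized alternative is sketched below using `numpy.searchsorted`. It is an assumption-laden sketch rather than a drop-in replacement: the `holiday_proximity_vectorized` name is hypothetical, missing values are returned as `NaN` instead of `None`, and the toy holiday list is hard-coded for illustration (in the notebook, `list(pt_holidays)` would presumably supply the dates).

```python
import numpy as np
import pandas as pd

def holiday_proximity_vectorized(dates: pd.Series, holiday_dates) -> pd.DataFrame:
    """Days since the previous holiday and until the next one, excluding the date itself."""
    hol = np.sort(pd.to_datetime(list(holiday_dates)).values.astype('datetime64[D]'))
    d = dates.values.astype('datetime64[D]')

    # left  = number of holidays strictly before each date
    # right = number of holidays on or before each date
    left = np.searchsorted(hol, d, side='left')
    right = np.searchsorted(hol, d, side='right')

    days_after = np.full(len(d), np.nan)
    has_prev = left > 0
    days_after[has_prev] = (d[has_prev] - hol[left[has_prev] - 1]).astype('int64')

    days_before = np.full(len(d), np.nan)
    has_next = right < len(hol)
    days_before[has_next] = (hol[right[has_next]] - d[has_next]).astype('int64')

    return pd.DataFrame({'DaysAfterHoliday': days_after,
                         'DaysBeforeHoliday': days_before},
                        index=dates.index)

# Toy check with three hard-coded dates standing in for the holiday calendar
toy_holidays = ['2015-05-01', '2015-06-10', '2015-08-15']
toy_dates = pd.Series(pd.to_datetime(['2015-07-01', '2015-06-10', '2015-04-01']))
result = holiday_proximity_vectorized(toy_dates, toy_holidays)
```

For 2015-07-01 this reproduces the 21 and 45 days shown in the first row of the table above. The third toy date has no earlier holiday in the list, so its `DaysAfterHoliday` is missing.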


# Feature Engineering: ISO Day of Week, ISO Week of Year

In [21]:
# pandas dayofweek is zero-based (Monday = 0), so a Wednesday is 2
df_data['ArrivalDate'].dt.dayofweek.head()

0    2
1    2
2    2
3    2
4    2
Name: ArrivalDate, dtype: int32

In [22]:
# The ISO calendar day is one-based (Monday = 1), so the same Wednesday is 3
df_data['ArrivalDate'].dt.isocalendar().head()

Unnamed: 0,year,week,day
0,2015,27,3
1,2015,27,3
2,2015,27,3
3,2015,27,3
4,2015,27,3
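
The two outputs above differ by one because `dt.dayofweek` counts from Monday = 0 while the ISO calendar counts from Monday = 1. The short sketch below shows that difference on two hand-picked dates, along with an edge case worth keeping in mind: an ISO week number belongs to the ISO year, which can differ from the calendar year around New Year. The variable names are illustrative only.

```python
import pandas as pd

# 2015-07-01 is a Wednesday; 2016-01-01 is a Friday
s = pd.Series(pd.to_datetime(['2015-07-01', '2016-01-01']))

zero_based_day = s.dt.dayofweek   # Monday = 0
iso = s.dt.isocalendar()          # DataFrame with 'year', 'week', 'day' (Monday = 1)

# 2016-01-01 falls in ISO week 53 of ISO year 2015, not in week 1 of 2016
print(pd.concat([s.rename('date'), zero_based_day.rename('dayofweek'), iso], axis=1))
```

If the week number is used as a seasonal feature, dates in early January may therefore carry week 52 or 53.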


In [23]:
for column in ['ArrivalDate', 'DepartureDate', 'BookingDate']:
    iso = df_data[column].dt.isocalendar()  # Compute the ISO calendar once per column
    df_data[f'{column}_WeekNumber'] = iso['week']
    df_data[f'{column}_DayOfWeek'] = iso['day']

df_data.head()

Unnamed: 0,UUID,ArrivalDate,DepartureDate,BookingDate,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,ArrivalDate_WeekNumber,ArrivalDate_DayOfWeek,DepartureDate_WeekNumber,DepartureDate_DayOfWeek,BookingDate_WeekNumber,BookingDate_DayOfWeek
0,f0308f81-b9a3-414f-99da-07882aeb3093,2015-07-01,2015-07-01,2014-07-24,45,21,45,21,22,44,27,3,27,3,30,4
1,45191a56-9557-4c70-84e4-724e4549a35c,2015-07-01,2015-07-01,2013-06-24,45,21,45,21,52,14,27,3,27,3,26,1
2,599d7902-a7b9-466d-8d12-3dd66fa0c6eb,2015-07-01,2015-07-02,2015-06-24,45,21,44,22,52,14,27,3,27,4,26,3
3,66c0071a-36b2-4cd8-b930-2242c1aace20,2015-07-01,2015-07-02,2015-06-18,45,21,44,22,58,8,27,3,27,4,25,4
4,c7b7e1d6-cd15-4f9a-94bd-a2d774e4f4db,2015-07-01,2015-07-03,2015-06-17,45,21,43,23,59,7,27,3,27,5,25,3


In [24]:
df_data.to_parquet('../data/engineered_data_dates.parquet', compression = 'snappy')