# Feature Engineering: Datetime Features

---

**Calculating Dates**

The source data has no ready-made datetime columns. Instead, it provides the components I need to build them:
* The arrival year, month name, and day of month.
* The number of weekday and weekend nights.
* The "booking lead time," or how far in advance the guest booked their reservation.

Using these features, I can create three new datetime features to start: the arrival, departure, and booking dates.
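
As a minimal sketch of the arithmetic for a single reservation (the values are illustrative and mirror one row of the data shown later in this notebook), the three dates follow directly from the components listed above:

```python
import pandas as pd

# Components for one illustrative reservation
arrival_year, arrival_month, arrival_day = 2015, 7, 1
week_nights, weekend_nights = 2, 0
lead_time_days = 14

# Arrival date comes straight from the year, month, and day of month
arrival = pd.Timestamp(year=arrival_year, month=arrival_month, day=arrival_day)

# Departure date is the arrival date plus the total number of nights
departure = arrival + pd.Timedelta(days=week_nights + weekend_nights)

# Booking date is the arrival date minus the lead time
booking = arrival - pd.Timedelta(days=lead_time_days)

print(arrival.date(), departure.date(), booking.date())
```

The same logic is applied column-wise to the full dataset in the sections below.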

---

**Extracting Date Details**

For each of these separate dates, I can go into more detail:
* Calculating the number of days since the last holiday and the number of days until the next.
* Determining the week of the year, day of the week, etc. to help capture more temporal features.
* Calculating the number of days between the last reservation change and the arrival date (for reservations changed on or before the arrival date).
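
The last item is not computed in this notebook. A rough sketch of how it could work is below. It assumes a hypothetical `LastChangeDate` column holding the date of the most recent change, which is not one of the columns loaded here, and the output column name is also only illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2015-07-01', '2015-07-01', '2015-07-03']),
    # Hypothetical column: date of the most recent change to the reservation
    'LastChangeDate': pd.to_datetime(['2015-06-20', '2015-07-02', '2015-07-03']),
})

# Signed gap in days between the last change and the arrival
days_from_change = (toy['ArrivalDate'] - toy['LastChangeDate']).dt.days

# Keep the gap only where the change happened on or before arrival;
# later changes become missing values
toy['DaysFromLastChangeToArrival'] = days_from_change.where(days_from_change >= 0)
```

Reservations changed after arrival are left as missing rather than given a negative value, which matches the restriction stated in the list.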

---

**Final Considerations**

This process will create many new features, some of which may be redundant or noisy and could hurt model performance. Before modeling, I may need to apply feature selection to keep only the most informative ones.

By the end of this notebook, I will have a new set of temporally focused features to use for more extensive modeling and forecasting in the next steps of the workflow.

---

In [23]:
%load_ext autoreload
%autoreload 2



In [24]:
## Enabling access to custom functions in separate directory

# Import necessary modules
import sys
import os

# Construct the absolute path to the 'src' directory
src_path = os.path.abspath(os.path.join('..', 'src'))

# Append the path to 'sys.path'
if src_path not in sys.path:
    sys.path.append(src_path)

import db_utils

In [25]:
import holidays
import pandas as pd

# *TEMPORARILY CHANGED TO USE BACKUP FILES* | Read Data from DuckDB

In [26]:
# # Path to the DuckDB database file
# db_path = '../data/hotel_reservations.duckdb'

# ## Select subset of data for review
# q = 'SELECT * FROM res_data LIMIT 5'

# with db_utils.duckdb_connection(db_path) as conn:
#     display(conn.execute(q).df())

In [27]:
# ## Select subset of data for review
# q = 'SELECT IsCanceled FROM res_data LIMIT 5'

# with db_utils.duckdb_connection(db_path) as conn:
#     display(conn.execute(q).df())

In [28]:
# ## Convert Arrival columns to strings

# q = ('''
# SELECT uuid, ArrivalDateYear, ArrivalDateMonth, ArrivalDateDayOfMonth,
# StaysInWeekNights, StaysInWeekendNights, LeadTime 
# FROM res_data''')

# with db_utils.duckdb_connection(db_path) as conn:
#     df_data = conn.execute(q).df()

# # df_data = arrival_cols.astype(str)
# df_data.head()

In [29]:
date_features = ['UUID', 'ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth',
                 'StaysInWeekNights', 'StaysInWeekendNights', 'LeadTime']
date_features

['UUID',
 'ArrivalDateYear',
 'ArrivalDateMonth',
 'ArrivalDateDayOfMonth',
 'StaysInWeekNights',
 'StaysInWeekendNights',
 'LeadTime']

In [30]:
path_to_backup_files = '../data/data_condensed_with_uuid.parquet'

df_data = pd.read_parquet(path_to_backup_files, columns=date_features)

df_data

Unnamed: 0,UUID,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime
0,ba900e76-b827-45a0-b4c9-25c7c8dd1b91,2015,July,1,0,0,342
1,a5140c89-83b0-4d86-a5c8-00aa3a3e7e14,2015,July,1,0,0,737
2,ca76ba7a-aeb3-4302-954a-3fd64fc2123d,2015,July,1,1,0,7
3,f45b6800-21ac-4212-aabc-b3ba62e74dbc,2015,July,1,1,0,13
4,70351011-dea4-45fb-ab38-f11cc8fdd53f,2015,July,1,2,0,14
...,...,...,...,...,...,...,...
79325,c06e053b-6856-4200-9011-904ac4fc59af,2017,August,30,5,2,23
79326,831fc051-24ce-483d-b6fb-42f1c8daf5fe,2017,August,31,5,2,102
79327,a1f82972-185d-4116-b9da-8cdf6b901a07,2017,August,31,5,2,34
79328,494e8060-664c-47bc-a15b-440781d9de34,2017,August,31,5,2,109


# Feature Engineering: Arrival, Departure, and Booking Dates

## Arrival Date

In [31]:
## Create new column of strings formatted as YYYY-MonthName-D (e.g., '2015-July-1'),
## then convert to datetime

arrival_details = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth']

df_data[arrival_details] = df_data[arrival_details].astype(str)

df_data['ArrivalDate'] = df_data['ArrivalDateYear'].str.cat(df_data[['ArrivalDateMonth',
                                                                     'ArrivalDateDayOfMonth']],
                                                            sep = '-')

# An explicit format avoids slow, ambiguous per-row inference (%B = full English month name)
df_data['ArrivalDate'] = pd.to_datetime(df_data['ArrivalDate'], format = '%Y-%B-%d')

df_data.head()

Unnamed: 0,UUID,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime,ArrivalDate
0,ba900e76-b827-45a0-b4c9-25c7c8dd1b91,2015,July,1,0,0,342,2015-07-01
1,a5140c89-83b0-4d86-a5c8-00aa3a3e7e14,2015,July,1,0,0,737,2015-07-01
2,ca76ba7a-aeb3-4302-954a-3fd64fc2123d,2015,July,1,1,0,7,2015-07-01
3,f45b6800-21ac-4212-aabc-b3ba62e74dbc,2015,July,1,1,0,13,2015-07-01
4,70351011-dea4-45fb-ab38-f11cc8fdd53f,2015,July,1,2,0,14,2015-07-01
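
One way to gain confidence in the parsed dates is a round-trip check: rebuild each component from `ArrivalDate` and compare it with the source strings. The sketch below uses a small toy frame with the same column names rather than `df_data`, and `roundtrip_ok` and `n_mismatched` are names chosen only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'ArrivalDateYear': ['2015', '2017'],
    'ArrivalDateMonth': ['July', 'August'],
    'ArrivalDateDayOfMonth': ['1', '31'],
    'ArrivalDate': pd.to_datetime(['2015-07-01', '2017-08-31']),
})

# Rebuild each component from the parsed date and compare with the source strings
roundtrip_ok = (
    (toy['ArrivalDate'].dt.year.astype(str) == toy['ArrivalDateYear'])
    & (toy['ArrivalDate'].dt.month_name() == toy['ArrivalDateMonth'])
    & (toy['ArrivalDate'].dt.day.astype(str) == toy['ArrivalDateDayOfMonth'])
)

# Number of rows whose parsed date disagrees with its source components
n_mismatched = int((~roundtrip_ok).sum())
print(n_mismatched)
```

A non-zero count would point to rows where the month name or day was misread during parsing.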


## Departure Date

In [32]:
timedelta_wknd = pd.to_timedelta(df_data.loc[:, 'StaysInWeekendNights'], unit = 'D')
timedelta_wk = pd.to_timedelta(df_data.loc[:, 'StaysInWeekNights'], unit = 'D')

df_data['DepartureDate'] = df_data.loc[:, 'ArrivalDate'] + timedelta_wk + timedelta_wknd

df_data.head()

Unnamed: 0,UUID,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime,ArrivalDate,DepartureDate
0,ba900e76-b827-45a0-b4c9-25c7c8dd1b91,2015,July,1,0,0,342,2015-07-01,2015-07-01
1,a5140c89-83b0-4d86-a5c8-00aa3a3e7e14,2015,July,1,0,0,737,2015-07-01,2015-07-01
2,ca76ba7a-aeb3-4302-954a-3fd64fc2123d,2015,July,1,1,0,7,2015-07-01,2015-07-02
3,f45b6800-21ac-4212-aabc-b3ba62e74dbc,2015,July,1,1,0,13,2015-07-01,2015-07-02
4,70351011-dea4-45fb-ab38-f11cc8fdd53f,2015,July,1,2,0,14,2015-07-01,2015-07-03
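
The first two rows above have zero nights in both columns, so their departure date equals their arrival date. A small sketch of how such rows could be counted, and how the stay length can be checked against the two night columns, is shown below on toy data. The variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2015-07-01', '2015-07-01', '2015-07-01']),
    'StaysInWeekNights': [0, 1, 2],
    'StaysInWeekendNights': [0, 0, 0],
})

# Departure is arrival plus the total number of nights
total_nights = toy['StaysInWeekNights'] + toy['StaysInWeekendNights']
toy['DepartureDate'] = toy['ArrivalDate'] + pd.to_timedelta(total_nights, unit='D')

# The stay length recovered from the dates should equal the total nights
stay_length = (toy['DepartureDate'] - toy['ArrivalDate']).dt.days
lengths_match = bool((stay_length == total_nights).all())

# Same-day departures correspond to zero-night reservations
n_zero_night = int((stay_length == 0).sum())
```

Whether zero-night reservations should be kept or treated separately is a modeling decision that this check does not settle.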


## Booking Date

In [33]:
df_data['LeadTime']

0        342
1        737
2          7
3         13
4         14
        ... 
79325     23
79326    102
79327     34
79328    109
79329    205
Name: LeadTime, Length: 119390, dtype: int64

In [34]:
df_data['LeadTimeDelta'] = pd.to_timedelta(df_data['LeadTime'], unit = 'D')
df_data['LeadTimeDelta']

0       342 days
1       737 days
2         7 days
3        13 days
4        14 days
          ...   
79325    23 days
79326   102 days
79327    34 days
79328   109 days
79329   205 days
Name: LeadTimeDelta, Length: 119390, dtype: timedelta64[ns]

In [35]:
df_data['BookingDate'] = df_data['ArrivalDate'] - df_data['LeadTimeDelta']

df_data.head(10)

Unnamed: 0,UUID,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime,ArrivalDate,DepartureDate,LeadTimeDelta,BookingDate
0,ba900e76-b827-45a0-b4c9-25c7c8dd1b91,2015,July,1,0,0,342,2015-07-01,2015-07-01,342 days,2014-07-24
1,a5140c89-83b0-4d86-a5c8-00aa3a3e7e14,2015,July,1,0,0,737,2015-07-01,2015-07-01,737 days,2013-06-24
2,ca76ba7a-aeb3-4302-954a-3fd64fc2123d,2015,July,1,1,0,7,2015-07-01,2015-07-02,7 days,2015-06-24
3,f45b6800-21ac-4212-aabc-b3ba62e74dbc,2015,July,1,1,0,13,2015-07-01,2015-07-02,13 days,2015-06-18
4,70351011-dea4-45fb-ab38-f11cc8fdd53f,2015,July,1,2,0,14,2015-07-01,2015-07-03,14 days,2015-06-17
5,a51c1681-66d1-4aaf-a945-f3cd92f4bd48,2015,July,1,2,0,14,2015-07-01,2015-07-03,14 days,2015-06-17
6,78391ed3-d18a-41d5-9451-8b6cfefcc0f4,2015,July,1,2,0,0,2015-07-01,2015-07-03,0 days,2015-07-01
7,9412feee-b5ea-4ef3-9812-1603156567ee,2015,July,1,2,0,9,2015-07-01,2015-07-03,9 days,2015-06-22
8,8ec81372-6406-4c38-9365-52d34cc6b0c9,2015,July,1,3,0,85,2015-07-01,2015-07-04,85 days,2015-04-07
9,26d04349-7c42-4ee3-b8c7-1a6d78560f7c,2015,July,1,3,0,75,2015-07-01,2015-07-04,75 days,2015-04-17


In [36]:
drop_cols = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth',
             'StaysInWeekNights', 'StaysInWeekendNights', 'LeadTime', 'LeadTimeDelta']
df_data = df_data.drop(columns = drop_cols)
df_data.head()

Unnamed: 0,UUID,ArrivalDate,DepartureDate,BookingDate
0,ba900e76-b827-45a0-b4c9-25c7c8dd1b91,2015-07-01,2015-07-01,2014-07-24
1,a5140c89-83b0-4d86-a5c8-00aa3a3e7e14,2015-07-01,2015-07-01,2013-06-24
2,ca76ba7a-aeb3-4302-954a-3fd64fc2123d,2015-07-01,2015-07-02,2015-06-24
3,f45b6800-21ac-4212-aabc-b3ba62e74dbc,2015-07-01,2015-07-02,2015-06-18
4,70351011-dea4-45fb-ab38-f11cc8fdd53f,2015-07-01,2015-07-03,2015-06-17


In [37]:
df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']].min()

ArrivalDate     2015-07-01
DepartureDate   2015-07-01
BookingDate     2013-06-24
dtype: datetime64[ns]
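
The earliest booking date falls in 2013, which determines how far back the holiday calendar has to reach. As a sketch, the range of years could be derived from the data instead of being typed by hand. The toy frame below reuses a few dates that appear in this notebook, and `years_needed` is an illustrative name.

```python
import pandas as pd

toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2015-07-01', '2017-08-31']),
    'DepartureDate': pd.to_datetime(['2015-07-01', '2017-09-07']),
    'BookingDate': pd.to_datetime(['2013-06-24', '2017-05-14']),
})

date_cols = ['ArrivalDate', 'DepartureDate', 'BookingDate']

# Earliest and latest calendar year across all three date columns
first_year = toy[date_cols].min().min().year
last_year = toy[date_cols].max().max().year

# Every year the holiday calendar needs to cover
years_needed = list(range(first_year, last_year + 1))
print(years_needed)
```

Dates close to the start or end of this range may have no earlier or later holiday inside it, so padding the range by one year on each side could be worth considering.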

# FE: Holidays

In [38]:
# Fetch holidays for the specific range of years (2013-2017); the earliest booking date is in 2013
pt_holidays = holidays.CountryHoliday('PT', years=[2013, 2014, 2015, 2016, 2017])

# Function to calculate the proximity to holidays for a list of dates
def calculate_holiday_proximity(dates, holiday_dates):
    # 'holiday_dates' is any iterable of datetime.date objects (e.g., the keys of
    # pt_holidays); the name avoids shadowing the imported 'holidays' module
    holiday_dates = list(holiday_dates)
    days_after_recent_holiday = []
    days_before_next_holiday = []

    for dt in dates:
        date = dt.date()  # Convert Timestamp to datetime.date

        # Days since the most recent holiday strictly before this date (None if there is none)
        days_after = min(((date - h_date).days for h_date in holiday_dates if h_date < date),
                         default=None)

        # Days until the next holiday strictly after this date (None if there is none)
        days_before = min(((h_date - date).days for h_date in holiday_dates if h_date > date),
                          default=None)

        days_after_recent_holiday.append(days_after)
        days_before_next_holiday.append(days_before)

    return days_after_recent_holiday, days_before_next_holiday

In [39]:
# Apply the function to each date column in the dataframe
for column in ['ArrivalDate', 'DepartureDate', 'BookingDate']:
    after, before = calculate_holiday_proximity(df_data[column], pt_holidays)
    df_data[f'{column}_DaysBeforeHoliday'] = before
    df_data[f'{column}_DaysAfterHoliday'] = after

df_data

Unnamed: 0,UUID,ArrivalDate,DepartureDate,BookingDate,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday
0,ba900e76-b827-45a0-b4c9-25c7c8dd1b91,2015-07-01,2015-07-01,2014-07-24,45,21,45,21,22,44
1,a5140c89-83b0-4d86-a5c8-00aa3a3e7e14,2015-07-01,2015-07-01,2013-06-24,45,21,45,21,52,14
2,ca76ba7a-aeb3-4302-954a-3fd64fc2123d,2015-07-01,2015-07-02,2015-06-24,45,21,44,22,52,14
3,f45b6800-21ac-4212-aabc-b3ba62e74dbc,2015-07-01,2015-07-02,2015-06-18,45,21,44,22,58,8
4,70351011-dea4-45fb-ab38-f11cc8fdd53f,2015-07-01,2015-07-03,2015-06-17,45,21,43,23,59,7
...,...,...,...,...,...,...,...,...,...,...
79325,c06e053b-6856-4200-9011-904ac4fc59af,2017-08-30,2017-09-06,2017-08-07,36,15,29,22,8,53
79326,831fc051-24ce-483d-b6fb-42f1c8daf5fe,2017-08-31,2017-09-07,2017-05-21,35,16,28,23,20,20
79327,a1f82972-185d-4116-b9da-8cdf6b901a07,2017-08-31,2017-09-07,2017-07-28,35,16,28,23,18,43
79328,494e8060-664c-47bc-a15b-440781d9de34,2017-08-31,2017-09-07,2017-05-14,35,16,28,23,27,13
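
The loop above compares every date against every holiday, which can be slow on a dataset of this size. A possible vectorized alternative, sketched below, sorts the holidays once and uses `numpy.searchsorted` to locate the neighbors of each date. This is not the method used in this notebook: the function name `holiday_proximity` and the two hard-coded 2015 holidays (Portugal Day on June 10 and Assumption Day on August 15) are stand-ins for illustration, and missing neighbors are returned as NaN rather than `None`.

```python
import numpy as np
import pandas as pd

def holiday_proximity(dates, holiday_dates):
    """Return (days_after_last, days_before_next) as float arrays.

    A value is NaN when there is no holiday on that side of the date.
    """
    # Sorted holidays and input dates at day resolution
    hol = np.sort(pd.to_datetime(list(holiday_dates)).values.astype('datetime64[D]'))
    d = pd.to_datetime(dates).values.astype('datetime64[D]')

    # side='left' gives the first holiday >= date, so idx - 1 is strictly earlier
    prev_idx = np.searchsorted(hol, d, side='left') - 1
    # side='right' gives the first holiday strictly after the date
    next_idx = np.searchsorted(hol, d, side='right')

    days_after = np.full(len(d), np.nan)
    days_before = np.full(len(d), np.nan)
    has_prev = prev_idx >= 0
    has_next = next_idx < len(hol)

    days_after[has_prev] = (d[has_prev] - hol[prev_idx[has_prev]]).astype(int)
    days_before[has_next] = (hol[next_idx[has_next]] - d[has_next]).astype(int)
    return days_after, days_before

# Toy usage with two Portuguese holidays from 2015
toy_holidays = [pd.Timestamp('2015-06-10'), pd.Timestamp('2015-08-15')]
toy_dates = pd.Series(pd.to_datetime(['2015-07-01', '2015-06-01']))
after, before = holiday_proximity(toy_dates, toy_holidays)
print(after, before)
```

For 2015-07-01 this gives 21 days after and 45 days before, the same values as the `ArrivalDate` columns in the first row of the output above.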


# FE: ISO Day of Week, ISO Week of Year

In [40]:
df_data['ArrivalDate'].dt.dayofweek.head()

0    2
1    2
2    2
3    2
4    2
Name: ArrivalDate, dtype: int32

In [41]:
df_data['ArrivalDate'].dt.isocalendar().head()

Unnamed: 0,year,week,day
0,2015,27,3
1,2015,27,3
2,2015,27,3
3,2015,27,3
4,2015,27,3


In [42]:
for column in ['ArrivalDate', 'DepartureDate', 'BookingDate']:
    iso = df_data[column].dt.isocalendar()  # ISO year, week, and day (1 = Monday)
    df_data[f'{column}_WeekNumber'] = iso['week']
    df_data[f'{column}_DayOfWeek'] = iso['day']

df_data.head()

Unnamed: 0,UUID,ArrivalDate,DepartureDate,BookingDate,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,ArrivalDate_WeekNumber,ArrivalDate_DayOfWeek,DepartureDate_WeekNumber,DepartureDate_DayOfWeek,BookingDate_WeekNumber,BookingDate_DayOfWeek
0,ba900e76-b827-45a0-b4c9-25c7c8dd1b91,2015-07-01,2015-07-01,2014-07-24,45,21,45,21,22,44,27,3,27,3,30,4
1,a5140c89-83b0-4d86-a5c8-00aa3a3e7e14,2015-07-01,2015-07-01,2013-06-24,45,21,45,21,52,14,27,3,27,3,26,1
2,ca76ba7a-aeb3-4302-954a-3fd64fc2123d,2015-07-01,2015-07-02,2015-06-24,45,21,44,22,52,14,27,3,27,4,26,3
3,f45b6800-21ac-4212-aabc-b3ba62e74dbc,2015-07-01,2015-07-02,2015-06-18,45,21,44,22,58,8,27,3,27,4,25,4
4,70351011-dea4-45fb-ab38-f11cc8fdd53f,2015-07-01,2015-07-03,2015-06-17,45,21,43,23,59,7,27,3,27,5,25,3
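
Two details of the ISO calendar are easy to miss. The ISO day runs from 1 (Monday) to 7 (Sunday), whereas `dt.dayofweek` runs from 0 to 6. The ISO week also belongs to an ISO year that can differ from the calendar year around January 1. The short sketch below shows both on two example dates; the second date is chosen for illustration and is not taken from the dataset.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2015-07-01', '2016-01-01']))

iso = dates.dt.isocalendar()          # columns: year, week, day
zero_based_day = dates.dt.dayofweek   # Monday = 0 ... Sunday = 6

# 2015-07-01 is a Wednesday: ISO day 3, dayofweek 2, ISO week 27
# 2016-01-01 is a Friday that falls in ISO week 53 of ISO year 2015
print(iso)
print(zero_based_day.tolist())
```

Because week 53 and week 1 can sit next to each other in time, a model that treats `WeekNumber` as a plain number will see them as far apart, which may be worth keeping in mind at the feature selection stage.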


In [43]:
df_data.to_parquet('../data/engineered_data_dates.parquet', compression = 'snappy')