# Feature Engineering: Datetime Features

---

**Calculating Dates**

The source data does not include specific datetime features. Instead, it offers a selection of different pieces that I can use to create new datetime features:
* The arrival year, month name, and day of month.
* The number of weekday and weekend nights.
* The "booking lead time," or how far in advance the guest booked their reservation.

Using these features, I can create three new datetime features to start: the arrival, departure, and booking dates.
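
As a minimal sketch of the three calculations described above, assuming the column names used later in this notebook and a two-row toy frame (the rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy rows for illustration only; column names mirror the source data
toy = pd.DataFrame({
    'ArrivalDateYear': [2015, 2016],
    'ArrivalDateMonth': ['July', 'February'],
    'ArrivalDateDayOfMonth': [1, 28],
    'StaysInWeekNights': [2, 1],
    'StaysInWeekendNights': [0, 2],
    'LeadTime': [14, 30],
})

# Arrival date: join year, month name, and day, then parse with an explicit format
# (%B matches full month names and assumes an English locale)
arrival_str = (toy['ArrivalDateYear'].astype(str) + '-'
               + toy['ArrivalDateMonth'] + '-'
               + toy['ArrivalDateDayOfMonth'].astype(str))
toy['ArrivalDate'] = pd.to_datetime(arrival_str, format='%Y-%B-%d')

# Departure date: arrival plus the total number of nights
nights = toy['StaysInWeekNights'] + toy['StaysInWeekendNights']
toy['DepartureDate'] = toy['ArrivalDate'] + pd.to_timedelta(nights, unit='D')

# Booking date: arrival minus the lead time
toy['BookingDate'] = toy['ArrivalDate'] - pd.to_timedelta(toy['LeadTime'], unit='D')

print(toy[['ArrivalDate', 'DepartureDate', 'BookingDate']])
```

Passing an explicit `format` is a design choice for the sketch: it avoids relying on format inference for strings such as `2015-July-1`.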

---

**Extracting Date Details**

For each of these separate dates, I can go into more detail:
* Calculating the number of days since the last holiday and the number of days until the next.
* Determining the week of the year, day of the week, etc. to help capture more temporal features.
* Calculating the number of days between the last reservation change and the arrival date (for those reservations changed on or before the arrival date).

---

**Final Considerations**

This process will create many new features, and a large feature set can hurt later modeling performance. Before modeling, I may need to apply feature selection methods to keep only the most impactful features.

By the end of this notebook, I will have a new set of temporally-focused data to use for more extensive modeling and forecasting in the next steps of the workflow.

---

In [29]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [30]:
## Enabling access to custom functions in separate directory

# Import necessary modules
import sys
import os

# Construct the absolute path to the 'src' directory
src_path = os.path.abspath(os.path.join('..', 'src'))

# Append the path to 'sys.path'
if src_path not in sys.path:
    sys.path.append(src_path)

import db_utils

In [31]:
import holidays
import pandas as pd

# *TEMPORARILY CHANGED TO USE BACKUP FILES* | Read Data from DuckDB

In [32]:
# # Path to the DuckDB database file
# db_path = '../data/hotel_reservations.duckdb'

# ## Select subset of data for review
# q = 'SELECT * FROM res_data LIMIT 5'

# with db_utils.duckdb_connection(db_path) as conn:
#     display(conn.execute(q).df())

In [33]:
# ## Select subset of data for review
# q = 'SELECT IsCanceled FROM res_data LIMIT 5'

# with db_utils.duckdb_connection(db_path) as conn:
#     display(conn.execute(q).df())

In [34]:
# ## Convert Arrival columns to strings

# q = ('''
# SELECT uuid, ArrivalDateYear, ArrivalDateMonth, ArrivalDateDayOfMonth,
# StaysInWeekNights, StaysInWeekendNights, LeadTime 
# FROM res_data''')

# with db_utils.duckdb_connection(db_path) as conn:
#     df_data = conn.execute(q).df()

# # df_data = arrival_cols.astype(str)
# df_data.head()

In [35]:
date_features = ['UUID','ReservationStatusDate', 'ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth',
                 'StaysInWeekNights', 'StaysInWeekendNights', 'LeadTime']
date_features

['UUID',
 'ReservationStatusDate',
 'ArrivalDateYear',
 'ArrivalDateMonth',
 'ArrivalDateDayOfMonth',
 'StaysInWeekNights',
 'StaysInWeekendNights',
 'LeadTime']

In [36]:
path_to_backup_files = '../data/data_condensed_with_uuid.parquet'

df_data = pd.read_parquet(path_to_backup_files, columns=date_features)

df_data

Unnamed: 0,UUID,ReservationStatusDate,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime
0,ba900e76-b827-45a0-b4c9-25c7c8dd1b91,2015-07-01,2015,July,1,0,0,342
1,a5140c89-83b0-4d86-a5c8-00aa3a3e7e14,2015-07-01,2015,July,1,0,0,737
2,ca76ba7a-aeb3-4302-954a-3fd64fc2123d,2015-07-02,2015,July,1,1,0,7
3,f45b6800-21ac-4212-aabc-b3ba62e74dbc,2015-07-02,2015,July,1,1,0,13
4,70351011-dea4-45fb-ab38-f11cc8fdd53f,2015-07-03,2015,July,1,2,0,14
...,...,...,...,...,...,...,...,...
79325,c06e053b-6856-4200-9011-904ac4fc59af,2017-09-06,2017,August,30,5,2,23
79326,831fc051-24ce-483d-b6fb-42f1c8daf5fe,2017-09-07,2017,August,31,5,2,102
79327,a1f82972-185d-4116-b9da-8cdf6b901a07,2017-09-07,2017,August,31,5,2,34
79328,494e8060-664c-47bc-a15b-440781d9de34,2017-09-07,2017,August,31,5,2,109


# Feature Engineering: Arrival, Departure, and Booking Dates

## Arrival Date

In [37]:
## Create new column of strings formatted as YYYY-MM-DD, then convert to datetime

arrival_details = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth']

df_data[arrival_details] = df_data[arrival_details].astype(str)

df_data['ArrivalDate'] = df_data['ArrivalDateYear'].str.cat(df_data[['ArrivalDateMonth',
                                                                     'ArrivalDateDayOfMonth']],
                                                            '-')

df_data['ArrivalDate'] = pd.to_datetime(df_data['ArrivalDate'], yearfirst = True)

df_data.head()

Unnamed: 0,UUID,ReservationStatusDate,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime,ArrivalDate
0,ba900e76-b827-45a0-b4c9-25c7c8dd1b91,2015-07-01,2015,July,1,0,0,342,2015-07-01
1,a5140c89-83b0-4d86-a5c8-00aa3a3e7e14,2015-07-01,2015,July,1,0,0,737,2015-07-01
2,ca76ba7a-aeb3-4302-954a-3fd64fc2123d,2015-07-02,2015,July,1,1,0,7,2015-07-01
3,f45b6800-21ac-4212-aabc-b3ba62e74dbc,2015-07-02,2015,July,1,1,0,13,2015-07-01
4,70351011-dea4-45fb-ab38-f11cc8fdd53f,2015-07-03,2015,July,1,2,0,14,2015-07-01


## Departure Date

In [38]:
timedelta_wknd = pd.to_timedelta(df_data.loc[:, 'StaysInWeekendNights'], unit = 'D')
timedelta_wk = pd.to_timedelta(df_data.loc[:, 'StaysInWeekNights'], unit = 'D')

df_data['DepartureDate'] = df_data.loc[:, 'ArrivalDate'] + timedelta_wk + timedelta_wknd

df_data.head()

Unnamed: 0,UUID,ReservationStatusDate,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime,ArrivalDate,DepartureDate
0,ba900e76-b827-45a0-b4c9-25c7c8dd1b91,2015-07-01,2015,July,1,0,0,342,2015-07-01,2015-07-01
1,a5140c89-83b0-4d86-a5c8-00aa3a3e7e14,2015-07-01,2015,July,1,0,0,737,2015-07-01,2015-07-01
2,ca76ba7a-aeb3-4302-954a-3fd64fc2123d,2015-07-02,2015,July,1,1,0,7,2015-07-01,2015-07-02
3,f45b6800-21ac-4212-aabc-b3ba62e74dbc,2015-07-02,2015,July,1,1,0,13,2015-07-01,2015-07-02
4,70351011-dea4-45fb-ab38-f11cc8fdd53f,2015-07-03,2015,July,1,2,0,14,2015-07-01,2015-07-03


## Booking Date

### Convert `LeadTime` to TimeDelta Datatype

In [39]:
df_data['LeadTime']

0        342
1        737
2          7
3         13
4         14
        ... 
79325     23
79326    102
79327     34
79328    109
79329    205
Name: LeadTime, Length: 119390, dtype: int64

In [40]:
df_data['LeadTimeDelta'] = pd.to_timedelta(df_data['LeadTime'], unit = 'D')
df_data['LeadTimeDelta']

0       342 days
1       737 days
2         7 days
3        13 days
4        14 days
          ...   
79325    23 days
79326   102 days
79327    34 days
79328   109 days
79329   205 days
Name: LeadTimeDelta, Length: 119390, dtype: timedelta64[ns]

### Calculate BookingDate Using LeadTimeDelta

In [41]:
df_data['BookingDate'] = df_data['ArrivalDate'] - df_data['LeadTimeDelta']

df_data.head(10)

Unnamed: 0,UUID,ReservationStatusDate,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime,ArrivalDate,DepartureDate,LeadTimeDelta,BookingDate
0,ba900e76-b827-45a0-b4c9-25c7c8dd1b91,2015-07-01,2015,July,1,0,0,342,2015-07-01,2015-07-01,342 days,2014-07-24
1,a5140c89-83b0-4d86-a5c8-00aa3a3e7e14,2015-07-01,2015,July,1,0,0,737,2015-07-01,2015-07-01,737 days,2013-06-24
2,ca76ba7a-aeb3-4302-954a-3fd64fc2123d,2015-07-02,2015,July,1,1,0,7,2015-07-01,2015-07-02,7 days,2015-06-24
3,f45b6800-21ac-4212-aabc-b3ba62e74dbc,2015-07-02,2015,July,1,1,0,13,2015-07-01,2015-07-02,13 days,2015-06-18
4,70351011-dea4-45fb-ab38-f11cc8fdd53f,2015-07-03,2015,July,1,2,0,14,2015-07-01,2015-07-03,14 days,2015-06-17
5,a51c1681-66d1-4aaf-a945-f3cd92f4bd48,2015-07-03,2015,July,1,2,0,14,2015-07-01,2015-07-03,14 days,2015-06-17
6,78391ed3-d18a-41d5-9451-8b6cfefcc0f4,2015-07-03,2015,July,1,2,0,0,2015-07-01,2015-07-03,0 days,2015-07-01
7,9412feee-b5ea-4ef3-9812-1603156567ee,2015-07-03,2015,July,1,2,0,9,2015-07-01,2015-07-03,9 days,2015-06-22
8,8ec81372-6406-4c38-9365-52d34cc6b0c9,2015-05-06,2015,July,1,3,0,85,2015-07-01,2015-07-04,85 days,2015-04-07
9,26d04349-7c42-4ee3-b8c7-1a6d78560f7c,2015-04-22,2015,July,1,3,0,75,2015-07-01,2015-07-04,75 days,2015-04-17


# Feature Inspection: `ReservationStatusDate`

The final date-like data from the original source is the `ReservationStatusDate`, indicating the date on which the reservation status was last changed.

As it is a date feature, it may be useful for feature engineering. However, based on my domain knowledge, I suspect that any of these dates that occur prior to the arrival date will indicate that a reservation was canceled. Used directly, this would leak the target into the model, but I may be able to generate a different feature from this information.

I will start by identifying those reservations where the `ReservationStatusDate` feature is earlier than the arrival date. Then, I will get the `IsCanceled` feature from the dataset and filter for those reservations. Finally, I will calculate the share of those reservations that were canceled. If the share is very high (90-100%), then I will consider how to generate a new feature from this data.
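
The check described above can also be sketched as a single grouped comparison. This is a minimal illustration on made-up rows: the column names follow the dataset, but the values and the `StatusBeforeArrival` helper column are assumptions for the example.

```python
import pandas as pd

# Made-up reservations for illustration only
toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2015-07-01', '2015-07-01', '2015-07-01', '2015-07-01']),
    'ReservationStatusDate': ['2015-05-06', '2015-06-23', '2015-07-03', '2015-07-04'],
    'IsCanceled': [1, 1, 0, 0],
})

# Parse the status date so the comparison is datetime-to-datetime
status_date = pd.to_datetime(toy['ReservationStatusDate'])

# Hypothetical helper column: True when the last status change precedes arrival
toy['StatusBeforeArrival'] = status_date < toy['ArrivalDate']

# Share of canceled reservations within each group
cxl_rate = toy.groupby('StatusBeforeArrival')['IsCanceled'].mean()
print(cxl_rate)
```

A rate near 1.0 for the `True` group and near 0.0 for the `False` group would suggest that the date comparison alone reveals the target.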



In [42]:
df_data['ReservationStatusDate'].head()

0    2015-07-01
1    2015-07-01
2    2015-07-02
3    2015-07-02
4    2015-07-03
Name: ReservationStatusDate, dtype: object

In [43]:
change_filter = (df_data['ReservationStatusDate'] < df_data['ArrivalDate'])

avg_resstatdate_before_arrival = (change_filter
                                  .mean()
                                  .round(2))
print(f'The share of reservations last changed before arrival is: {avg_resstatdate_before_arrival:.0%}.')

The share of reservations last changed before arrival is: 35%.


## Read-In `IsCanceled` Data to Match Reservations

In [44]:
path_to_backup_files = '../data/data_condensed_with_uuid.parquet'

df_UUID = pd.read_parquet(path_to_backup_files, columns = ['UUID', 'IsCanceled'])

df_UUID.head()

Unnamed: 0,UUID,IsCanceled
0,ba900e76-b827-45a0-b4c9-25c7c8dd1b91,0
1,a5140c89-83b0-4d86-a5c8-00aa3a3e7e14,0
2,ca76ba7a-aeb3-4302-954a-3fd64fc2123d,0
3,f45b6800-21ac-4212-aabc-b3ba62e74dbc,0
4,70351011-dea4-45fb-ab38-f11cc8fdd53f,0


In [45]:
uuids = df_data[change_filter]['UUID'].to_list()
uuids[:5]

['8ec81372-6406-4c38-9365-52d34cc6b0c9',
 '26d04349-7c42-4ee3-b8c7-1a6d78560f7c',
 '7e167004-b650-49d4-ac08-a7d4ba009847',
 'b7dc30c4-5081-4ca1-a14a-5b3d2f91820d',
 '9919d530-52f1-455d-a022-8d7e6445f384']

In [46]:
uuid_filter = df_UUID['UUID'].isin(uuids)

avg_cxl = df_UUID[uuid_filter]['IsCanceled'].mean()

print((f'''The average number of canceled reservations with a ReservationStatusDate earlier than the arrival date is: {avg_cxl:.2%}\n'''))

if avg_cxl >= .9:
    print('This feature is too strongly indicative of the `IsCanceled` feature. It should not be used for modeling.')
elif avg_cxl >= .75:
    print('This feature is strongly related to the `IsCanceled` feature. Make sure to review it in more detail to determine whether to use it.')
else:
    print('This feature is not strongly indicative of the `IsCanceled` feature. It may be suitable for modeling.')

The average number of canceled reservations with a ReservationStatusDate earlier than the arrival date is: 100.00%

This feature is too strongly indicative of the `IsCanceled` feature. It should not be used for modeling.


## ReservationStatusDate Later Than Arrival Date

In [47]:
after_arrival_filter = (df_data['ReservationStatusDate'] > df_data['ArrivalDate'])

avg_resstatdate_after_arrival = (after_arrival_filter
                                  .mean()
                                  .round(2))
print(f'The share of reservations last changed after arrival is: {avg_resstatdate_after_arrival:.0%}.')

The share of reservations last changed after arrival is: 62%.


In [48]:
uuids_post_arv = df_data[after_arrival_filter]['UUID'].to_list()
# uuids[:10]

In [49]:
uuid_filter = df_UUID['UUID'].isin(uuids_post_arv)

avg_cxl = df_UUID[uuid_filter]['IsCanceled'].mean()

print((f'''The average number of canceled reservations with a ReservationStatusDate later than the arrival date is: {avg_cxl:.2%}\n'''))

The average number of canceled reservations with a ReservationStatusDate later than the arrival date is: 0.00%



# Future Work: Investigating Canceled Reservations

---

Although the `ReservationStatusDate` feature is not appropriate for feature engineering, I could use it to calculate the number of days between booking and cancellation (for canceled reservations only).

This would be outside of the scope of the current feature engineering, but I am noting it as future work for analysis.

---

In [50]:
df_UUID.head()

Unnamed: 0,UUID,IsCanceled
0,ba900e76-b827-45a0-b4c9-25c7c8dd1b91,0
1,a5140c89-83b0-4d86-a5c8-00aa3a3e7e14,0
2,ca76ba7a-aeb3-4302-954a-3fd64fc2123d,0
3,f45b6800-21ac-4212-aabc-b3ba62e74dbc,0
4,70351011-dea4-45fb-ab38-f11cc8fdd53f,0


In [51]:
## Create list of UUIDs from cancelled reservations
cxl_filter = df_UUID['IsCanceled'] == 1
cxl_uuids = df_UUID.loc[cxl_filter, 'UUID'].to_list()  # apply the filter to the frame it was built from
cxl_uuids[:10]

['8ec81372-6406-4c38-9365-52d34cc6b0c9',
 '26d04349-7c42-4ee3-b8c7-1a6d78560f7c',
 '7e167004-b650-49d4-ac08-a7d4ba009847',
 'b7dc30c4-5081-4ca1-a14a-5b3d2f91820d',
 '9919d530-52f1-455d-a022-8d7e6445f384',
 'e406ebdc-aedc-4da9-9ef6-37889b7094cb',
 'ac0e697b-495c-48fc-a998-7a16e96eef9a',
 '957bab97-0815-4ec5-8221-231d01e324d8',
 '9b506680-a057-4928-a3c4-b2550c4ae29c',
 '85bcb5ed-bba0-486a-b9c8-75b5abf1ab41']

In [52]:
## Subset the date-engineered dataframe for cancelled reservations
cxl_res = df_data[df_data['UUID'].isin(cxl_uuids)]
cxl_res.head()

Unnamed: 0,UUID,ReservationStatusDate,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime,ArrivalDate,DepartureDate,LeadTimeDelta,BookingDate
8,8ec81372-6406-4c38-9365-52d34cc6b0c9,2015-05-06,2015,July,1,3,0,85,2015-07-01,2015-07-04,85 days,2015-04-07
9,26d04349-7c42-4ee3-b8c7-1a6d78560f7c,2015-04-22,2015,July,1,3,0,75,2015-07-01,2015-07-04,75 days,2015-04-17
10,7e167004-b650-49d4-ac08-a7d4ba009847,2015-06-23,2015,July,1,4,0,23,2015-07-01,2015-07-05,23 days,2015-06-08
27,b7dc30c4-5081-4ca1-a14a-5b3d2f91820d,2015-05-11,2015,July,1,5,2,60,2015-07-01,2015-07-08,60 days,2015-05-02
32,9919d530-52f1-455d-a022-8d7e6445f384,2015-05-29,2015,July,1,8,2,96,2015-07-01,2015-07-11,96 days,2015-03-27


In [53]:
# Work on an explicit copy so the assignment does not trigger a SettingWithCopyWarning
cxl_res = cxl_res.copy()
cxl_res['ReservationStatusDate'] = pd.to_datetime(cxl_res['ReservationStatusDate'])


In [54]:
## Calculate number of days between booking and cancellation
age_at_cxl = (cxl_res['ReservationStatusDate'] - cxl_res['BookingDate']).dt.days

age_at_cxl.name = 'DaysOldAtCancelation'

cxl_res = pd.concat([cxl_res, age_at_cxl], axis = 1)
cxl_res.head()

Unnamed: 0,UUID,ReservationStatusDate,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime,ArrivalDate,DepartureDate,LeadTimeDelta,BookingDate,DaysOldAtCancelation
8,8ec81372-6406-4c38-9365-52d34cc6b0c9,2015-05-06,2015,July,1,3,0,85,2015-07-01,2015-07-04,85 days,2015-04-07,29
9,26d04349-7c42-4ee3-b8c7-1a6d78560f7c,2015-04-22,2015,July,1,3,0,75,2015-07-01,2015-07-04,75 days,2015-04-17,5
10,7e167004-b650-49d4-ac08-a7d4ba009847,2015-06-23,2015,July,1,4,0,23,2015-07-01,2015-07-05,23 days,2015-06-08,15
27,b7dc30c4-5081-4ca1-a14a-5b3d2f91820d,2015-05-11,2015,July,1,5,2,60,2015-07-01,2015-07-08,60 days,2015-05-02,9
32,9919d530-52f1-455d-a022-8d7e6445f384,2015-05-29,2015,July,1,8,2,96,2015-07-01,2015-07-11,96 days,2015-03-27,63


In [56]:
cxl_res[['UUID', 'DaysOldAtCancelation']].head()

Unnamed: 0,UUID,DaysOldAtCancelation
8,8ec81372-6406-4c38-9365-52d34cc6b0c9,29
9,26d04349-7c42-4ee3-b8c7-1a6d78560f7c,5
10,7e167004-b650-49d4-ac08-a7d4ba009847,15
27,b7dc30c4-5081-4ca1-a14a-5b3d2f91820d,9
32,9919d530-52f1-455d-a022-8d7e6445f384,63


In [57]:
cxl_res[['UUID', 'DaysOldAtCancelation']].to_parquet('../data/cxl_res_age.parquet', compression = 'snappy')

## Inspection Results: `ReservationStatusDate`

---

Every reservation whose `ReservationStatusDate` falls before the arrival date was canceled (100%), and none of those with a later date were (0%), so this feature almost exactly indicates whether a reservation was canceled prior to arrival.

Additionally, for reservations that were not canceled, the number of days between the `ArrivalDate` and `ReservationStatusDate` features matches the length of stay, which the dataset already captures.

However, I did calculate the age of each reservation and can use this information for further analysis of cancelled reservations.

**Final Determination:** The feature `ReservationStatusDate` is not appropriate for feature engineering, and considering its nearly-exact correlation to the cancelation status of a reservation, it should be dropped from future datasets prior to modeling.
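
The length-of-stay observation above can be checked directly. A minimal sketch on made-up completed stays, assuming that for reservations that were not canceled the `ReservationStatusDate` records the check-out date:

```python
import pandas as pd

# Made-up completed (not canceled) stays for illustration only
toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2015-07-01', '2015-07-01']),
    'ReservationStatusDate': pd.to_datetime(['2015-07-02', '2015-07-08']),
    'StaysInWeekNights': [1, 5],
    'StaysInWeekendNights': [0, 2],
})

# Days from arrival to the last status change
days_to_status = (toy['ReservationStatusDate'] - toy['ArrivalDate']).dt.days

# Length of stay as already recorded in the dataset
length_of_stay = toy['StaysInWeekNights'] + toy['StaysInWeekendNights']

matches = days_to_status.eq(length_of_stay)
print(f'{matches.mean():.0%} of rows match the length of stay')
```

Running the same comparison on the non-canceled rows of the full dataset would show how often the two quantities actually agree.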

---

# Subset DataFrame to Focus on Engineered Dates

In [None]:
drop_cols = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth',
'StaysInWeekNights', 'StaysInWeekendNights', 'LeadTime', 'LeadTimeDelta']
df_data = df_data.drop(columns = drop_cols)
df_data.head()

In [None]:
df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']].min()

# FE: Holidays

In [None]:
# Fetch holidays for the specific country and range of years (2013-2017)
country_code = 'PT'
years=[2013, 2014, 2015, 2016, 2017]

# `country_holidays` replaces the deprecated `CountryHoliday` helper
pt_holidays = holidays.country_holidays(country_code, years = years)

def holiday_past(date, holiday_dates):
    """Days since the most recent holiday on or before `date` (0 on a holiday)."""

    # Convert Timestamp to datetime.date
    date = date.date()

    # Day gaps to every holiday on or before the date; the smallest is the closest
    gaps = [(date - h_date).days for h_date in holiday_dates if h_date <= date]

    # None when no holiday falls on or before the date
    return min(gaps, default=None)


def holiday_upcoming(date, holiday_dates):
    """Days until the next holiday strictly after `date`."""

    # Convert Timestamp to datetime.date
    date = date.date()

    # Day gaps to every holiday after the date; the smallest is the closest
    gaps = [(h_date - date).days for h_date in holiday_dates if h_date > date]

    # None when no holiday falls after the date
    return min(gaps, default=None)


# Function to calculate the proximity to holidays for a list of dates
def calculate_holiday_proximity(dates, holiday_dates):
    days_after_recent_holiday = [holiday_past(dt, holiday_dates) for dt in dates]
    days_before_next_holiday = [holiday_upcoming(dt, holiday_dates) for dt in dates]

    return days_after_recent_holiday, days_before_next_holiday

In [None]:
# Apply the function to each date column in the dataframe
for column in ['ArrivalDate', 'DepartureDate', 'BookingDate']:
    after, before = calculate_holiday_proximity(df_data[column], pt_holidays)
    df_data[f'{column}_DaysBeforeHoliday'] = before
    df_data[f'{column}_DaysAfterHoliday'] = after

df_data
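
The row-by-row loop above scans every holiday for every date, which can be slow on a large table. As an alternative sketch (not the approach used in this notebook), the same two gaps can be computed with a sorted array and `numpy.searchsorted`. The function name `holiday_proximity_vectorized` and the short hand-picked list of 2015 Portuguese holidays are assumptions for illustration; in practice the dates would come from `pt_holidays`.

```python
import numpy as np
import pandas as pd

def holiday_proximity_vectorized(dates, holiday_dates):
    """Return (days_after_last_holiday, days_before_next_holiday) as float arrays.

    NaN marks dates with no earlier (or later) holiday in `holiday_dates`.
    """
    hol = np.sort(pd.to_datetime(list(holiday_dates)).to_numpy().astype('datetime64[D]'))
    d = pd.to_datetime(pd.Series(dates)).to_numpy().astype('datetime64[D]')

    # Position of the first holiday strictly after each date
    idx_next = np.searchsorted(hol, d, side='right')
    # The holiday just before that position is the latest one on or before the date
    idx_prev = idx_next - 1

    days_after = np.full(len(d), np.nan)
    has_prev = idx_prev >= 0
    days_after[has_prev] = (d[has_prev] - hol[idx_prev[has_prev]]).astype('int64')

    days_before = np.full(len(d), np.nan)
    has_next = idx_next < len(hol)
    days_before[has_next] = (hol[idx_next[has_next]] - d[has_next]).astype('int64')

    return days_after, days_before


# Hand-picked subset of 2015 Portuguese holidays, for illustration only
sample_holidays = ['2015-01-01', '2015-04-25', '2015-05-01', '2015-06-10',
                   '2015-08-15', '2015-12-08', '2015-12-25']
sample_dates = pd.to_datetime(['2014-12-31', '2015-07-01', '2015-12-25'])

after, before = holiday_proximity_vectorized(sample_dates, sample_holidays)
print(after, before)
```

The sketch keeps the same convention as the loop: a date that falls on a holiday is 0 days after it, and the upcoming holiday is always strictly later. It returns NaN where the loop returns `None`.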

# FE: ISO Day of Week, ISO Week of Year

In [None]:
df_data['ArrivalDate'].dt.dayofweek.head()

In [None]:
df_data['ArrivalDate'].dt.isocalendar().head()

In [None]:
for column in ['ArrivalDate', 'DepartureDate', 'BookingDate']:
    iso = df_data[column].dt.isocalendar()  # ISO year, week, and day (1 = Monday)
    df_data[f'{column}_WeekNumber'] = iso['week']
    df_data[f'{column}_DayOfWeek'] = iso['day']
    
df_data.head()
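
One detail worth keeping in mind when reading these features: `dt.dayofweek` counts from 0 (Monday), while the ISO day counts from 1 (Monday), and ISO week numbers can assign dates around New Year to the neighboring ISO year. A small sketch with dates chosen for illustration:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2015-07-01', '2016-01-01']))

iso = dates.dt.isocalendar()          # columns: year, week, day
comparison = pd.DataFrame({
    'date': dates,
    'dayofweek': dates.dt.dayofweek,  # Monday = 0 ... Sunday = 6
    'iso_day': iso['day'],            # Monday = 1 ... Sunday = 7
    'iso_week': iso['week'],
    'iso_year': iso['year'],
})
print(comparison)
```

The second row shows that 1 January 2016 falls in ISO week 53 of ISO year 2015, so week numbers alone do not identify the calendar year.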

# Saving Results

In [None]:
df_data.to_parquet('../data/engineered_data_dates.parquet', compression = 'snappy')