# Feature Engineering: Datetime Features

---

**Calculating Dates**

The source data does not include ready-made datetime features. Instead, it provides several components that I can combine to create them:
* The arrival year, month name, and day of month.
* The number of weekday and weekend nights.
* The "booking lead time," or how far in advance the guest booked their reservation.

Using these features, I can create three new datetime features to start: the arrival, departure, and booking dates.
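
As a rough illustration of that plan, the sketch below derives all three dates for two toy reservations. The column names mirror the source data, but the toy frame and the explicit English month lookup (`month_numbers`) are assumptions made for this example; the notebook itself parses the month name directly with `pd.to_datetime`.

```python
import pandas as pd

# Toy reservations for illustration only
toy = pd.DataFrame({
    'ArrivalDateYear': [2015, 2017],
    'ArrivalDateMonth': ['July', 'August'],
    'ArrivalDateDayOfMonth': [1, 30],
    'StaysInWeekendNights': [0, 2],
    'StaysInWeekNights': [2, 5],
    'LeadTime': [14, 23],
})

# Explicit English month lookup (assumption: month names are always English)
month_numbers = {name: number for number, name in enumerate(
    ['January', 'February', 'March', 'April', 'May', 'June', 'July',
     'August', 'September', 'October', 'November', 'December'], start=1)}

# Assemble the arrival date from its year/month/day components
parts = pd.DataFrame({
    'year': toy['ArrivalDateYear'],
    'month': toy['ArrivalDateMonth'].map(month_numbers),
    'day': toy['ArrivalDateDayOfMonth'],
})
toy['ArrivalDate'] = pd.to_datetime(parts)

# Departure = arrival + total nights; booking = arrival - lead time
nights = toy['StaysInWeekendNights'] + toy['StaysInWeekNights']
toy['DepartureDate'] = toy['ArrivalDate'] + pd.to_timedelta(nights, unit='D')
toy['BookingDate'] = toy['ArrivalDate'] - pd.to_timedelta(toy['LeadTime'], unit='D')

print(toy[['ArrivalDate', 'DepartureDate', 'BookingDate']])
```

Building the date from numeric components avoids relying on string-format inference, at the cost of maintaining the month lookup by hand.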

---

**Extracting Date Details**

For each of these separate dates, I can go into more detail:
* Calculating the number of days since the last holiday and the number of days until the next.
* Determining the week of the year, day of the week, etc. to help capture more temporal features.
* Calculating the number of days between the last reservation change and the arrival date (for those reservations changed on or before the arrival date).

---

**Final Considerations**

This process will create many new features, which could hurt later modeling performance. Before modeling, I may need to apply feature selection so that only the most informative features are used.
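
One simple form of feature selection is to drop features that are nearly duplicates of others. The sketch below is only one possible approach, not the method used later in this workflow; the `drop_correlated` helper, its `threshold` default, and the toy columns are all hypothetical.

```python
import pandas as pd

def drop_correlated(df, threshold=0.95):
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    columns = list(corr.columns)
    to_drop = []
    for i, col in enumerate(columns):
        if col in to_drop:
            continue
        for other in columns[i + 1:]:
            if other not in to_drop and corr.loc[col, other] > threshold:
                to_drop.append(other)
    return df.drop(columns=to_drop)

# Toy numeric features: 'b' is an exact multiple of 'a', 'c' is only weakly related
toy_features = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 1, 4, 2, 3],
})

selected = drop_correlated(toy_features)
print(list(selected.columns))
```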

By the end of this notebook, I will have a new set of temporally-focused data to use for more extensive modeling and forecasting in the next steps of the workflow.

---

In [1]:
%load_ext autoreload
%autoreload 2

In [1]:
## Enabling access to custom functions in separate directory

# Import necessary modules
import sys
import os

# Construct the absolute path to the 'src' directory
src_path = os.path.abspath(os.path.join('../..', 'src'))

# Append the path to 'sys.path'
if src_path not in sys.path:
    sys.path.append(src_path)

import db_utils

In [3]:
import datetime as dt

from feature_engine.imputation import CategoricalImputer, MeanMedianImputer
from feature_engine.encoding import DecisionTreeEncoder, MeanEncoder, OneHotEncoder, RareLabelEncoder
from feature_engine.outliers import OutlierTrimmer
from feature_engine.pipeline import Pipeline

import holidays
import pandas as pd

# *TEMPORARILY CHANGED TO USE BACKUP FILES* | Read Data from DuckDB

In [4]:
# # Path to the DuckDB database file
# db_path = '../data/hotel_reservations.duckdb'

# ## Select subset of data for review
# q = 'SELECT * FROM res_data LIMIT 5'

# with db_utils.duckdb_connection(db_path) as conn:
#     display(conn.execute(q).df())

In [5]:
# ## Select subset of data for review
# q = 'SELECT IsCanceled FROM res_data LIMIT 5'

# with db_utils.duckdb_connection(db_path) as conn:
#     display(conn.execute(q).df())

In [6]:
# ## Convert Arrival columns to strings

# q = ('''
# SELECT uuid, ArrivalDateYear, ArrivalDateMonth, ArrivalDateDayOfMonth,
# StaysInWeekNights, StaysInWeekendNights, LeadTime 
# FROM res_data''')

# with db_utils.duckdb_connection(db_path) as conn:
#     df_data = conn.execute(q).df()

# # df_data = arrival_cols.astype(str)
# df_data.head()

In [7]:
# date_features = ['ReservationStatusDate', 'ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth',
#                  'StaysInWeekNights', 'StaysInWeekendNights', 'LeadTime']
# date_features

In [4]:
path = '../../data/source/full_data.feather'

# df_data = pd.read_feather(path, columns = date_features)
df_data = pd.read_feather(path)

df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber
0,0,342,2015,July,27,1,0,0,2,0.0,...,,,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1
1,0,737,2015,July,27,1,0,0,2,0.0,...,,,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1
2,0,7,2015,July,27,1,0,1,1,0.0,...,,,0,Transient,75.00,0,0,Check-Out,2015-07-02,H1
3,0,13,2015,July,27,1,0,1,1,0.0,...,304,,0,Transient,75.00,0,0,Check-Out,2015-07-02,H1
4,0,14,2015,July,27,1,0,2,2,0.0,...,240,,0,Transient,98.00,0,1,Check-Out,2015-07-03,H1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79325,0,23,2017,August,35,30,2,5,2,0.0,...,394,,0,Transient,96.14,0,0,Check-Out,2017-09-06,H2
79326,0,102,2017,August,35,31,2,5,3,0.0,...,9,,0,Transient,225.43,0,2,Check-Out,2017-09-07,H2
79327,0,34,2017,August,35,31,2,5,2,0.0,...,9,,0,Transient,157.71,0,4,Check-Out,2017-09-07,H2
79328,0,109,2017,August,35,31,2,5,2,0.0,...,89,,0,Transient,104.40,0,0,Check-Out,2017-09-07,H2


In [5]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 119390 entries, 0 to 79329
Data columns (total 32 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   IsCanceled                   119390 non-null  int64  
 1   LeadTime                     119390 non-null  int64  
 2   ArrivalDateYear              119390 non-null  int64  
 3   ArrivalDateMonth             119390 non-null  object 
 4   ArrivalDateWeekNumber        119390 non-null  int64  
 5   ArrivalDateDayOfMonth        119390 non-null  int64  
 6   StaysInWeekendNights         119390 non-null  int64  
 7   StaysInWeekNights            119390 non-null  int64  
 8   Adults                       119390 non-null  int64  
 9   Children                     119386 non-null  float64
 10  Babies                       119390 non-null  int64  
 11  Meal                         119390 non-null  object 
 12  Country                      118902 non-null  object 
 13  Marke

## Convert ReservationStatusDate to Datetime Format

In [6]:
df_data['ReservationStatusDate'] = pd.to_datetime(df_data['ReservationStatusDate'], yearfirst = True)
df_data['ReservationStatusDate']

0       2015-07-01
1       2015-07-01
2       2015-07-02
3       2015-07-02
4       2015-07-03
           ...    
79325   2017-09-06
79326   2017-09-07
79327   2017-09-07
79328   2017-09-07
79329   2017-09-07
Name: ReservationStatusDate, Length: 119390, dtype: datetime64[ns]

# Feature Engineering: Arrival, Departure, and Booking Dates

## Arrival Date

In [7]:
## Create new column of strings formatted as YYYY-MM-DD, then convert to datetime

arrival_details = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth']

df_data[arrival_details] = df_data[arrival_details].astype(str)

df_data['ArrivalDate'] = (df_data['ArrivalDateYear']
                          .str.cat(df_data[['ArrivalDateMonth',
                                            'ArrivalDateDayOfMonth']],
                                   '-')
                          )

df_data['ArrivalDate'] = pd.to_datetime(df_data['ArrivalDate'], yearfirst = True)

df_data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,ArrivalDate
0,0,342,2015,July,27,1,0,0,2,0.0,...,,0,Transient,0.0,0,0,Check-Out,2015-07-01,H1,2015-07-01
1,0,737,2015,July,27,1,0,0,2,0.0,...,,0,Transient,0.0,0,0,Check-Out,2015-07-01,H1,2015-07-01
2,0,7,2015,July,27,1,0,1,1,0.0,...,,0,Transient,75.0,0,0,Check-Out,2015-07-02,H1,2015-07-01
3,0,13,2015,July,27,1,0,1,1,0.0,...,,0,Transient,75.0,0,0,Check-Out,2015-07-02,H1,2015-07-01
4,0,14,2015,July,27,1,0,2,2,0.0,...,,0,Transient,98.0,0,1,Check-Out,2015-07-03,H1,2015-07-01


In [8]:
df_data = df_data.drop(columns = arrival_details)
df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateWeekNumber,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,Meal,Country,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,ArrivalDate
0,0,342,27,0,0,2,0.0,0,BB,PRT,...,,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1,2015-07-01
1,0,737,27,0,0,2,0.0,0,BB,PRT,...,,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1,2015-07-01
2,0,7,27,0,1,1,0.0,0,BB,GBR,...,,0,Transient,75.00,0,0,Check-Out,2015-07-02,H1,2015-07-01
3,0,13,27,0,1,1,0.0,0,BB,GBR,...,,0,Transient,75.00,0,0,Check-Out,2015-07-02,H1,2015-07-01
4,0,14,27,0,2,2,0.0,0,BB,GBR,...,,0,Transient,98.00,0,1,Check-Out,2015-07-03,H1,2015-07-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79325,0,23,35,2,5,2,0.0,0,BB,BEL,...,,0,Transient,96.14,0,0,Check-Out,2017-09-06,H2,2017-08-30
79326,0,102,35,2,5,3,0.0,0,BB,FRA,...,,0,Transient,225.43,0,2,Check-Out,2017-09-07,H2,2017-08-31
79327,0,34,35,2,5,2,0.0,0,BB,DEU,...,,0,Transient,157.71,0,4,Check-Out,2017-09-07,H2,2017-08-31
79328,0,109,35,2,5,2,0.0,0,BB,GBR,...,,0,Transient,104.40,0,0,Check-Out,2017-09-07,H2,2017-08-31


## Departure Date

In [9]:
## Convert number of nights stays to timedelta,
## then use to calculate departure date and stay length

timedelta_wknd = pd.to_timedelta(
                    df_data.loc[:, 'StaysInWeekendNights'],
                    unit = 'D')
timedelta_wk = pd.to_timedelta(
                    df_data.loc[:, 'StaysInWeekNights'],
                    unit = 'D')

df_data['DepartureDate'] = (df_data.loc[:, 'ArrivalDate'] 
                            + timedelta_wk 
                            + timedelta_wknd)

df_data['Length of Stay'] = df_data['StaysInWeekendNights'] + df_data['StaysInWeekNights']

df_data = df_data.drop(columns = ['StaysInWeekendNights', 'StaysInWeekNights'])

df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateWeekNumber,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,...,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,ArrivalDate,DepartureDate,Length of Stay
0,0,342,27,2,0.0,0,BB,PRT,Direct,Direct,...,Transient,0.00,0,0,Check-Out,2015-07-01,H1,2015-07-01,2015-07-01,0
1,0,737,27,2,0.0,0,BB,PRT,Direct,Direct,...,Transient,0.00,0,0,Check-Out,2015-07-01,H1,2015-07-01,2015-07-01,0
2,0,7,27,1,0.0,0,BB,GBR,Direct,Direct,...,Transient,75.00,0,0,Check-Out,2015-07-02,H1,2015-07-01,2015-07-02,1
3,0,13,27,1,0.0,0,BB,GBR,Corporate,Corporate,...,Transient,75.00,0,0,Check-Out,2015-07-02,H1,2015-07-01,2015-07-02,1
4,0,14,27,2,0.0,0,BB,GBR,Online TA,TA/TO,...,Transient,98.00,0,1,Check-Out,2015-07-03,H1,2015-07-01,2015-07-03,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79325,0,23,35,2,0.0,0,BB,BEL,Offline TA/TO,TA/TO,...,Transient,96.14,0,0,Check-Out,2017-09-06,H2,2017-08-30,2017-09-06,7
79326,0,102,35,3,0.0,0,BB,FRA,Online TA,TA/TO,...,Transient,225.43,0,2,Check-Out,2017-09-07,H2,2017-08-31,2017-09-07,7
79327,0,34,35,2,0.0,0,BB,DEU,Online TA,TA/TO,...,Transient,157.71,0,4,Check-Out,2017-09-07,H2,2017-08-31,2017-09-07,7
79328,0,109,35,2,0.0,0,BB,GBR,Online TA,TA/TO,...,Transient,104.40,0,0,Check-Out,2017-09-07,H2,2017-08-31,2017-09-07,7


## `BookingDate` from `LeadTime`

In [10]:
## Convert to TimeDelta
df_data['LeadTime'] = pd.to_timedelta(df_data['LeadTime'], unit = 'D')

## Subtract LeadTime from ArrivalDate to calculate BookingDate
df_data['BookingDate'] = df_data['ArrivalDate'] - df_data['LeadTime']
df_data['BookingDate']

df_data = df_data.drop(columns = 'LeadTime')

df_data

Unnamed: 0,IsCanceled,ArrivalDateWeekNumber,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,...,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,ArrivalDate,DepartureDate,Length of Stay,BookingDate
0,0,27,2,0.0,0,BB,PRT,Direct,Direct,0,...,0.00,0,0,Check-Out,2015-07-01,H1,2015-07-01,2015-07-01,0,2014-07-24
1,0,27,2,0.0,0,BB,PRT,Direct,Direct,0,...,0.00,0,0,Check-Out,2015-07-01,H1,2015-07-01,2015-07-01,0,2013-06-24
2,0,27,1,0.0,0,BB,GBR,Direct,Direct,0,...,75.00,0,0,Check-Out,2015-07-02,H1,2015-07-01,2015-07-02,1,2015-06-24
3,0,27,1,0.0,0,BB,GBR,Corporate,Corporate,0,...,75.00,0,0,Check-Out,2015-07-02,H1,2015-07-01,2015-07-02,1,2015-06-18
4,0,27,2,0.0,0,BB,GBR,Online TA,TA/TO,0,...,98.00,0,1,Check-Out,2015-07-03,H1,2015-07-01,2015-07-03,2,2015-06-17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79325,0,35,2,0.0,0,BB,BEL,Offline TA/TO,TA/TO,0,...,96.14,0,0,Check-Out,2017-09-06,H2,2017-08-30,2017-09-06,7,2017-08-07
79326,0,35,3,0.0,0,BB,FRA,Online TA,TA/TO,0,...,225.43,0,2,Check-Out,2017-09-07,H2,2017-08-31,2017-09-07,7,2017-05-21
79327,0,35,2,0.0,0,BB,DEU,Online TA,TA/TO,0,...,157.71,0,4,Check-Out,2017-09-07,H2,2017-08-31,2017-09-07,7,2017-07-28
79328,0,35,2,0.0,0,BB,GBR,Online TA,TA/TO,0,...,104.40,0,0,Check-Out,2017-09-07,H2,2017-08-31,2017-09-07,7,2017-05-14


In [12]:
df_data.head(10).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
IsCanceled,0,0,0,0,0,0,0,0,1,1
ArrivalDateWeekNumber,27,27,27,27,27,27,27,27,27,27
Adults,2,2,1,1,2,2,2,2,2,2
Children,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Babies,0,0,0,0,0,0,0,0,0,0
Meal,BB,BB,BB,BB,BB,BB,BB,FB,BB,HB
Country,PRT,PRT,GBR,GBR,GBR,GBR,PRT,PRT,PRT,PRT
MarketSegment,Direct,Direct,Direct,Corporate,Online TA,Online TA,Direct,Direct,Online TA,Offline TA/TO
DistributionChannel,Direct,Direct,Direct,Corporate,TA/TO,TA/TO,Direct,Direct,TA/TO,TA/TO
IsRepeatedGuest,0,0,0,0,0,0,0,0,0,0


# Feature Inspection: `ReservationStatusDate`

The final date-like data from the original source is the `ReservationStatusDate`, indicating the date on which the reservation was changed last.

As it is a date feature, it may be useful for feature engineering. However, based on my domain knowledge, I suspect that a `ReservationStatusDate` earlier than the arrival date indicates that the reservation was cancelled. If so, the feature would leak the target into the model, but I may be able to generate a different feature from this information.

I will start by identifying those reservations where the `ReservationStatusDate` feature is earlier than the arrival date. Then, I will filter the `IsCanceled` feature to those reservations. Finally, I will calculate the share of them that were cancelled. If that share is very high (90-100%), then I will consider how to generate a new feature from this data.



In [14]:
## Review data prior to changes
df_data['ReservationStatusDate'].head(10)

0   2015-07-01
1   2015-07-01
2   2015-07-02
3   2015-07-02
4   2015-07-03
5   2015-07-03
6   2015-07-03
7   2015-07-03
8   2015-05-06
9   2015-04-22
Name: ReservationStatusDate, dtype: datetime64[ns]

## Read-In `IsCanceled` Data to Match Reservations

In [17]:
change_filter = (df_data['ReservationStatusDate'] < df_data['ArrivalDate'])

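# Note: this is the share of ALL reservations whose status date precedes arrival,
# not the cancellation rate among those reservations
avg_resstatdate_before_arrival = change_filter.mean()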

avg_cxl = df_data['IsCanceled'].mean()

print((f'''The overall average number of canceled reservations is: {avg_cxl:.2%}\n'''))

print(' '.join(['The average number of canceled reservations with a ReservationStatusDate',
             f'prior to the arrival date is: {avg_resstatdate_before_arrival:.2%}\n''']))

# Note: the thresholds below are applied to the overall cancellation rate (avg_cxl)
if avg_cxl >= .9:
    print(' '.join(['The `ReservationStatusDate` feature is too strongly indicative of the `IsCanceled` feature.',
          'It should not be used for modeling.']))
elif avg_cxl >= .25:
    print(' '.join(['This feature is related to the `IsCanceled` feature.',
          'Make sure to review it in more detail to determine whether to use it.']))
else:
    print('The `ReservationStatusDate` feature is unlikely to be predictive of the `IsCanceled` feature.')

The overall average number of canceled reservations is: 37.04%

The average number of canceled reservations with a ReservationStatusDate prior to the arrival date is: 35.29%

This feature is related to the `IsCanceled` feature. Make sure to review it in more detail to determine whether to use it.
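
The second figure printed above is the share of all reservations whose `ReservationStatusDate` precedes arrival, which is not the same as the cancellation rate among those reservations. A minimal sketch of the conditional check might look like the following; the toy values are made up, and `cxl_rate_before_arrival` is a name introduced here for illustration.

```python
import pandas as pd

# Toy reservations for illustration only
toy = pd.DataFrame({
    'IsCanceled': [0, 0, 1, 1, 0],
    'ArrivalDate': pd.to_datetime(['2015-07-01'] * 5),
    'ReservationStatusDate': pd.to_datetime(
        ['2015-07-01', '2015-07-03', '2015-05-06', '2015-04-22', '2015-07-02']),
})

before_arrival = toy['ReservationStatusDate'] < toy['ArrivalDate']

# Share of all reservations last updated before arrival
share_before_arrival = before_arrival.mean()

# Cancellation rate among only those reservations
cxl_rate_before_arrival = toy.loc[before_arrival, 'IsCanceled'].mean()

print(f'{share_before_arrival:.2%} of reservations were last updated before arrival')
print(f'{cxl_rate_before_arrival:.2%} of those reservations were canceled')
```

On the real data, a conditional rate close to 100% would support the leakage concern more directly than comparing the two overall percentages.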


In [18]:
# ## What is the breakdown of reservation statuses for those reservations with matching Arrival and Status Dates?
# ## (A.K.A. "same-day departures" or "day-use reservations.")

# sameday_status = (df_data['ReservationStatusDate'] == df_data['ArrivalDate'])

# (df_data[sameday_status]
#  .value_counts(subset = 'ReservationStatus',normalize = True)
#  .round(2))

In [19]:
# ## What is the breakdown of IsCanceled statuses
# ## for those reservations with matching Arrival and Status Dates?

# (df_data[sameday_status]
#  .value_counts(subset = 'IsCanceled',normalize = True)
#  .round(2))

In [20]:
# ## What is the average rate for these day-use/same-day-departure reservations?

# sameday_adr = (sameday_status & (df_data['ReservationStatus'] == 'Check-Out'))

# sameday_adr_median = df_data[sameday_adr]['ADR'].median()

# print(f'The median ADR for same-day reservations is: ${sameday_adr_median:.2f}')

# sameday_adr_gt_zero = (df_data[sameday_adr]['ADR'] > 0).mean().round(2)

# print(f'The number of same-day reservations with an ADR greater than zero is: {sameday_adr_gt_zero:.1%}')

In [21]:
# sameday_departure = (df_data['ReservationStatusDate'] == df_data['DepartureDate'])
# df_data[sameday_departure].value_counts(subset = 'IsCanceled', normalize =True).round(4)

In [22]:
# df_data[sameday_departure].value_counts(subset = 'ReservationStatus', normalize =True).round(4)

In [23]:
# df_data[(sameday_departure & (df_data['ReservationStatus'] != 'Check-Out'))]


## ReservationStatusDate Later Than Arrival Date

In [24]:
after_arrival_filter = (df_data['ReservationStatusDate'] > df_data['ArrivalDate'])

avg_resstatdate_after_arrival = (after_arrival_filter
                                  .mean()
                                  .round(2))
print(f'The average number of reservations changed after arrival is: {avg_resstatdate_after_arrival:.0%}.')

The average number of reservations changed after arrival is: 62%.


# Future Work: Investigating Canceled Reservations

---

Although the `ReservationStatusDate` feature is not appropriate for feature engineering, I could use it to calculate the number of days between booking and cancellation (for canceled reservations only).

This would be outside of the scope of the current feature engineering, but I am noting it as future work for analysis.

---
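
A minimal sketch of that future feature, assuming the gap should be kept only for canceled reservations and left missing otherwise. The column name `DaysBookingToCancellation` and the toy values are hypothetical.

```python
import pandas as pd

# Toy reservations for illustration only
toy = pd.DataFrame({
    'IsCanceled': [0, 1, 1],
    'BookingDate': pd.to_datetime(['2015-06-17', '2015-03-01', '2015-04-01']),
    'ReservationStatusDate': pd.to_datetime(['2015-07-03', '2015-05-06', '2015-04-22']),
})

# Days between booking and the last status change, for every reservation
days_since_booking = (toy['ReservationStatusDate'] - toy['BookingDate']).dt.days

# Keep the gap only where the reservation was canceled; other rows become missing
toy['DaysBookingToCancellation'] = days_since_booking.where(toy['IsCanceled'] == 1)

print(toy[['IsCanceled', 'DaysBookingToCancellation']])
```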

In [25]:
## Calculate the number of days between the status and booking dates

df_data['DaysSinceBooking'] = (df_data['ReservationStatusDate'] - df_data['BookingDate']).dt.days

df_data['DaysSinceBooking']

0        342
1        737
2          8
3         14
4         16
        ... 
79325     30
79326    109
79327     41
79328    116
79329    214
Name: DaysSinceBooking, Length: 119390, dtype: int64

In [26]:
df_data.head(10).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
IsCanceled,0,0,0,0,0,0,0,0,1,1
ArrivalDateWeekNumber,27,27,27,27,27,27,27,27,27,27
Adults,2,2,1,1,2,2,2,2,2,2
Children,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Babies,0,0,0,0,0,0,0,0,0,0
Meal,BB,BB,BB,BB,BB,BB,BB,FB,BB,HB
Country,PRT,PRT,GBR,GBR,GBR,GBR,PRT,PRT,PRT,PRT
MarketSegment,Direct,Direct,Direct,Corporate,Online TA,Online TA,Direct,Direct,Online TA,Offline TA/TO
DistributionChannel,Direct,Direct,Direct,Corporate,TA/TO,TA/TO,Direct,Direct,TA/TO,TA/TO
IsRepeatedGuest,0,0,0,0,0,0,0,0,0,0


## Inspection Results: `ReservationStatusDate`

---

Based on the share of reservations last changed before arrival (35.29%, close to the overall cancellation rate of 37.04%), it appears that this feature will be almost exactly indicative of whether a reservation was canceled prior to arrival.

Additionally, the number of days between the ArrivalDate and ReservationStatusDate features matches the length of stay, which already exists.

However, I did calculate the age of each reservation and can use this information for further analysis of cancelled reservations.

**Final Determination:** The `ReservationStatusDate` feature is not appropriate for feature engineering. Given its nearly exact correlation with the cancellation status of a reservation, it should be dropped from future datasets prior to modeling.

---

# FE: Holidays

In [27]:
min_year = (df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']]
                 .min()
                 .min()
                 .year)
min_year

2013

In [28]:
max_year = (df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']]
                 .max()
                 .max()
                 .year)
max_year

2017

In [29]:
# Fetch holidays for the specific country and range of years (2013-2017)
country_code = 'PT'
years = list(range(min_year, max_year + 1))

# country_holidays() supersedes the deprecated CountryHoliday()
pt_holidays = holidays.country_holidays(country_code, years = years)

def holiday_past(date, holidays):

    # Convert Timestamp to datetime.date
    date = date.date()

    # Days elapsed since each holiday on or before this date
    past_holidays = [(date - h_date).days for h_date in holidays if h_date <= date]

    # The smallest gap belongs to the most recent holiday (None if there is none)
    days_after = min(past_holidays, default=None)

    return days_after


def holiday_upcoming(date, holidays):

    # Convert Timestamp to datetime.date
    date = date.date()

    # Days remaining until each holiday strictly after this date
    future_holidays = [(h_date - date).days for h_date in holidays if h_date > date]

    # The smallest gap belongs to the next holiday (None if there is none)
    days_before = min(future_holidays, default=None)

    return days_before


# Function to calculate the proximity to holidays for a list of dates
def calculate_holiday_proximity(dates, holidays):
    days_after_recent_holiday = []
    days_before_next_holiday = []

    # Loop variable is named `date` so it does not shadow the `dt` alias for datetime
    for date in dates:

        days_after_recent_holiday.append(holiday_past(date, holidays))
        days_before_next_holiday.append(holiday_upcoming(date, holidays))
    
    return days_after_recent_holiday, days_before_next_holiday
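
The functions above scan every holiday for every row. As a possible alternative, the same two values can be found with a binary search over a sorted list of holiday dates. The sketch below uses only the standard library; `holiday_proximity` and the hand-picked `toy_holidays` list are hypothetical and stand in for the full Portuguese calendar.

```python
from bisect import bisect_right
from datetime import date

def holiday_proximity(day, sorted_holidays):
    """Return (days since the latest holiday on or before `day`,
    days until the first holiday strictly after `day`)."""
    # Index of the first holiday strictly after `day`
    nxt = bisect_right(sorted_holidays, day)

    days_after = (day - sorted_holidays[nxt - 1]).days if nxt > 0 else None
    days_before = ((sorted_holidays[nxt] - day).days
                   if nxt < len(sorted_holidays) else None)
    return days_after, days_before

# Hand-picked subset of holidays for illustration only
toy_holidays = sorted([date(2015, 6, 10), date(2015, 8, 15),
                       date(2015, 12, 8), date(2015, 12, 25)])

print(holiday_proximity(date(2015, 7, 1), toy_holidays))  # (21, 45)
```

The result for 1 July 2015 agrees with the 21 and 45 days reported for that arrival date in the output of the following cell. To use this with the real data, each `Timestamp` would first need converting with `.date()`, as the original functions do.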

In [30]:
# Apply the function to each date column in the dataframe
for column in ['ArrivalDate', 'DepartureDate', 'BookingDate']:
    after, before = calculate_holiday_proximity(df_data[column], pt_holidays)
    df_data[f'{column}_DaysBeforeHoliday'] = before
    df_data[f'{column}_DaysAfterHoliday'] = after

df_data

Unnamed: 0,IsCanceled,ArrivalDateWeekNumber,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,...,DepartureDate,Length of Stay,BookingDate,DaysSinceBooking,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday
0,0,27,2,0.0,0,BB,PRT,Direct,Direct,0,...,2015-07-01,0,2014-07-24,342,45,21,45,21,22,44
1,0,27,2,0.0,0,BB,PRT,Direct,Direct,0,...,2015-07-01,0,2013-06-24,737,45,21,45,21,52,14
2,0,27,1,0.0,0,BB,GBR,Direct,Direct,0,...,2015-07-02,1,2015-06-24,8,45,21,44,22,52,14
3,0,27,1,0.0,0,BB,GBR,Corporate,Corporate,0,...,2015-07-02,1,2015-06-18,14,45,21,44,22,58,8
4,0,27,2,0.0,0,BB,GBR,Online TA,TA/TO,0,...,2015-07-03,2,2015-06-17,16,45,21,43,23,59,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79325,0,35,2,0.0,0,BB,BEL,Offline TA/TO,TA/TO,0,...,2017-09-06,7,2017-08-07,30,36,15,29,22,8,53
79326,0,35,3,0.0,0,BB,FRA,Online TA,TA/TO,0,...,2017-09-07,7,2017-05-21,109,35,16,28,23,20,20
79327,0,35,2,0.0,0,BB,DEU,Online TA,TA/TO,0,...,2017-09-07,7,2017-07-28,41,35,16,28,23,18,43
79328,0,35,2,0.0,0,BB,GBR,Online TA,TA/TO,0,...,2017-09-07,7,2017-05-14,116,35,16,28,23,27,13


In [35]:
df_data.to_feather('../../data/df_data.feather')

# FE: ISO Day of Week, ISO Week of Year

In [31]:
# df_data['ArrivalDate'].dt.dayofweek.head()

In [32]:
# df_data['ArrivalDate'].dt.isocalendar().head()

In [33]:
# for column in ['ArrivalDate', 'DepartureDate', 'BookingDate']:
#     df_data[f'{column}_WeekNumber'] = df_data[column].dt.isocalendar()['week']
#     df_data[f'{column}_DayOfWeek'] = df_data[column].dt.isocalendar()['day']
    
# df_data.head()
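
The commented-out cells above outline the ISO week and weekday extraction. A minimal runnable sketch on toy data, using two arrival dates that appear in this notebook, might look like this; the toy frame is an assumption of the example.

```python
import pandas as pd

# Two arrival dates taken from the rows shown earlier in this notebook
toy = pd.DataFrame({'ArrivalDate': pd.to_datetime(['2015-07-01', '2017-08-30'])})

# isocalendar() returns a frame with 'year', 'week', and 'day' columns
iso = toy['ArrivalDate'].dt.isocalendar()

toy['ArrivalDate_WeekNumber'] = iso['week']
toy['ArrivalDate_DayOfWeek'] = iso['day']  # ISO numbering: Monday=1 ... Sunday=7

print(toy)
```

Note that `.dt.dayofweek` numbers Monday as 0, whereas the ISO `day` column numbers Monday as 1, so the two should not be mixed within one feature set.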