# Feature Engineering: Datetime Features

---

**Calculating Dates**

The source data does not include specific datetime features. Instead, it offers a selection of different pieces that I can use to create new datetime features:
* The arrival year, month name, and day of month.
* The number of weekday and weekend nights.
* The "booking lead time," or how far in advance the guest booked their reservation.

Using these features, I can create three new datetime features to start: the arrival, departure, and booking dates.
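
As a minimal sketch of that calculation, the snippet below builds the three dates on a two-row toy frame. The frame `sample` and its values are assumptions chosen for illustration; the column names follow the source data. Parsing the month name with `%B` assumes an English locale.

```python
import pandas as pd

# Hypothetical two-row sample; column names follow the source data
sample = pd.DataFrame({
    'ArrivalDateYear': [2015, 2017],
    'ArrivalDateMonth': ['July', 'August'],
    'ArrivalDateDayOfMonth': [1, 31],
    'StaysInWeekendNights': [0, 4],
    'StaysInWeekNights': [1, 10],
    'LeadTime': [7, 161],
})

# Arrival date: join year, month name, and day, then parse (e.g. "2015-July-1")
arrival_text = (sample['ArrivalDateYear'].astype(str) + '-'
                + sample['ArrivalDateMonth'] + '-'
                + sample['ArrivalDateDayOfMonth'].astype(str))
sample['ArrivalDate'] = pd.to_datetime(arrival_text, format='%Y-%B-%d')

# Departure date: arrival plus the total number of nights
nights = sample['StaysInWeekendNights'] + sample['StaysInWeekNights']
sample['DepartureDate'] = sample['ArrivalDate'] + pd.to_timedelta(nights, unit='D')

# Booking date: arrival minus the lead time in days
sample['BookingDate'] = sample['ArrivalDate'] - pd.to_timedelta(sample['LeadTime'], unit='D')

print(sample[['ArrivalDate', 'DepartureDate', 'BookingDate']])
```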

---

**Extracting Date Details**

For each of these separate dates, I can go into more detail:
* Calculating the number of days since the last holiday and the number of days until the next.
* Determining the week of the year, day of the week, etc. to help capture more temporal features.
* Calculating the number of days between the last reservation change and the arrival date (for reservations changed on or before the arrival date).
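
The week-of-year and day-of-week details can be sketched with the pandas `.dt` accessor. The two dates and the output column names (`IsoWeek`, `IsoDay`, `DayOfWeek`, `DayName`) are assumptions for illustration. Note that `dayofweek` counts Monday as 0, while the ISO day counts Monday as 1.

```python
import pandas as pd

# Two hypothetical dates for illustration
dates = pd.Series(pd.to_datetime(['2015-07-01', '2017-09-14']))

# ISO calendar components (columns: year, week, day)
iso = dates.dt.isocalendar()

features = pd.DataFrame({
    'IsoWeek': iso['week'].astype(int),   # ISO week of the year
    'IsoDay': iso['day'].astype(int),     # Monday = 1 ... Sunday = 7
    'DayOfWeek': dates.dt.dayofweek,      # Monday = 0 ... Sunday = 6
    'DayName': dates.dt.day_name(),
})

print(features)
```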

---

**Final Considerations**

This process will create many new features, which could hurt later modeling performance. Before modeling, I may need to apply feature selection methods to keep only the most impactful features.

By the end of this notebook, I will have a new set of temporally focused data to use for more extensive modeling and forecasting in the next steps of the workflow.

---

In [1]:
import datetime as dt
import holidays
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
path = '../../data/5.1_dataset_exploded.parquet'
df_data = pd.read_parquet(path)
df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,UUID,ArrivalDate,LoS_Numeric,LoS_Days,DepartureDate,Date
0,0,342,2015,July,27,1,0,0,2,0.0,...,0,Check-Out,2015-07-01,H1,9af79666-f290-45c5-868c-2f9601b8f98b,2015-07-01,0,0 days,2015-07-01,2015-07-01
1,0,737,2015,July,27,1,0,0,2,0.0,...,0,Check-Out,2015-07-01,H1,81440274-e84e-4502-89f3-e01681d0672a,2015-07-01,0,0 days,2015-07-01,2015-07-01
2,0,7,2015,July,27,1,0,1,1,0.0,...,0,Check-Out,2015-07-02,H1,60fe936c-f7ba-48d9-ac73-71c21e1b3978,2015-07-01,1,1 days,2015-07-02,2015-07-01
3,0,13,2015,July,27,1,0,1,1,0.0,...,0,Check-Out,2015-07-02,H1,5b2aae61-1d0c-4314-b4c1-603595e43163,2015-07-01,1,1 days,2015-07-02,2015-07-01
4,0,12,2015,July,27,1,0,1,2,0.0,...,0,Check-Out,2015-07-02,H1,8225801b-2e43-490a-9703-4fec1baa34d8,2015-07-01,1,1 days,2015-07-02,2015-07-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
528642,0,161,2017,August,35,31,4,10,2,0.0,...,0,Check-Out,2017-09-14,H1,8ebd4e66-52a2-4630-a636-b3da3297918d,2017-08-31,14,14 days,2017-09-14,2017-09-12
528643,0,211,2017,August,35,31,4,10,2,0.0,...,1,Check-Out,2017-09-14,H1,5370a7d8-c7ba-41d2-92b5-3ce1c3469c5a,2017-08-31,14,14 days,2017-09-14,2017-09-13
528644,0,161,2017,August,35,31,4,10,2,0.0,...,0,Check-Out,2017-09-14,H1,8ebd4e66-52a2-4630-a636-b3da3297918d,2017-08-31,14,14 days,2017-09-14,2017-09-13
528645,0,211,2017,August,35,31,4,10,2,0.0,...,1,Check-Out,2017-09-14,H1,5370a7d8-c7ba-41d2-92b5-3ce1c3469c5a,2017-08-31,14,14 days,2017-09-14,2017-09-14


In [3]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 528647 entries, 0 to 528646
Data columns (total 38 columns):
 #   Column                       Non-Null Count   Dtype          
---  ------                       --------------   -----          
 0   IsCanceled                   528647 non-null  int64          
 1   LeadTime                     528647 non-null  int64          
 2   ArrivalDateYear              528647 non-null  int64          
 3   ArrivalDateMonth             528647 non-null  object         
 4   ArrivalDateWeekNumber        528647 non-null  int64          
 5   ArrivalDateDayOfMonth        528647 non-null  int64          
 6   StaysInWeekendNights         528647 non-null  int64          
 7   StaysInWeekNights            528647 non-null  int64          
 8   Adults                       528647 non-null  int64          
 9   Children                     528631 non-null  float64        
 10  Babies                       528647 non-null  int64          
 11  Meal         

## Convert ReservationStatusDate to Datetime Format

In [4]:
df_data['ReservationStatusDate'] = pd.to_datetime(df_data['ReservationStatusDate'], yearfirst = True)
df_data['ReservationStatusDate']

0        2015-07-01
1        2015-07-01
2        2015-07-02
3        2015-07-02
4        2015-07-02
            ...    
528642   2017-09-14
528643   2017-09-14
528644   2017-09-14
528645   2017-09-14
528646   2017-09-14
Name: ReservationStatusDate, Length: 528647, dtype: datetime64[ns]

# Feature Engineering: Arrival, Departure, and Booking Dates

## Arrival Date

In [5]:
# ## Create new column of strings formatted as YYYY-MM-DD, then convert to datetime

# arrival_details = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth']

# df_data[arrival_details] = df_data[arrival_details].astype(str)

# df_data['ArrivalDate'] = (df_data['ArrivalDateYear']
#                           .str.cat(df_data[['ArrivalDateMonth',
#                                             'ArrivalDateDayOfMonth']],
#                                    '-')
#                           )

# df_data['ArrivalDate'] = pd.to_datetime(df_data['ArrivalDate'], yearfirst = True)

# df_data = df_data.sort_values(by = 'ArrivalDate', ignore_index = False)

# df_data

In [6]:
# ## Drop features post-conversion
# df_data = (df_data
#            .drop(columns = ['ArrivalDateMonth',
#                             'ArrivalDateWeekNumber',
#                             'ArrivalDateDayOfMonth']))
# df_data

## Departure Date

In [7]:
# ## Convert number of nights stays to timedelta,
# ## then use to calculate departure date and stay length

# timedelta_wknd = pd.to_timedelta(
#                     df_data.loc[:, 'StaysInWeekendNights'],
#                     unit = 'D')
# timedelta_wk = pd.to_timedelta(
#                     df_data.loc[:, 'StaysInWeekNights'],
#                     unit = 'D')

# df_data['DepartureDate'] = (df_data.loc[:, 'ArrivalDate'] 
#                             + timedelta_wk 
#                             + timedelta_wknd)

# df_data['Length of Stay'] = df_data['StaysInWeekendNights'] + df_data['StaysInWeekNights']

# df_data = df_data.drop(columns = ['StaysInWeekendNights', 'StaysInWeekNights'])

# df_data

## `BookingDate` from `LeadTime`

In [8]:
## Convert LeadTime (in days) to a Timedelta
df_data['LeadTime'] = pd.to_timedelta(df_data['LeadTime'], unit = 'D')

## Subtract LeadTime from ArrivalDate to calculate BookingDate
df_data['BookingDate'] = df_data['ArrivalDate'] - df_data['LeadTime']

## LeadTime is now captured by BookingDate, so drop it
df_data = df_data.drop(columns = 'LeadTime')

df_data

Unnamed: 0,IsCanceled,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,...,ReservationStatus,ReservationStatusDate,HotelNumber,UUID,ArrivalDate,LoS_Numeric,LoS_Days,DepartureDate,Date,BookingDate
0,0,2015,July,27,1,0,0,2,0.0,0,...,Check-Out,2015-07-01,H1,9af79666-f290-45c5-868c-2f9601b8f98b,2015-07-01,0,0 days,2015-07-01,2015-07-01,2014-07-24
1,0,2015,July,27,1,0,0,2,0.0,0,...,Check-Out,2015-07-01,H1,81440274-e84e-4502-89f3-e01681d0672a,2015-07-01,0,0 days,2015-07-01,2015-07-01,2013-06-24
2,0,2015,July,27,1,0,1,1,0.0,0,...,Check-Out,2015-07-02,H1,60fe936c-f7ba-48d9-ac73-71c21e1b3978,2015-07-01,1,1 days,2015-07-02,2015-07-01,2015-06-24
3,0,2015,July,27,1,0,1,1,0.0,0,...,Check-Out,2015-07-02,H1,5b2aae61-1d0c-4314-b4c1-603595e43163,2015-07-01,1,1 days,2015-07-02,2015-07-01,2015-06-18
4,0,2015,July,27,1,0,1,2,0.0,0,...,Check-Out,2015-07-02,H1,8225801b-2e43-490a-9703-4fec1baa34d8,2015-07-01,1,1 days,2015-07-02,2015-07-01,2015-06-19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
528642,0,2017,August,35,31,4,10,2,0.0,0,...,Check-Out,2017-09-14,H1,8ebd4e66-52a2-4630-a636-b3da3297918d,2017-08-31,14,14 days,2017-09-14,2017-09-12,2017-03-23
528643,0,2017,August,35,31,4,10,2,0.0,0,...,Check-Out,2017-09-14,H1,5370a7d8-c7ba-41d2-92b5-3ce1c3469c5a,2017-08-31,14,14 days,2017-09-14,2017-09-13,2017-02-01
528644,0,2017,August,35,31,4,10,2,0.0,0,...,Check-Out,2017-09-14,H1,8ebd4e66-52a2-4630-a636-b3da3297918d,2017-08-31,14,14 days,2017-09-14,2017-09-13,2017-03-23
528645,0,2017,August,35,31,4,10,2,0.0,0,...,Check-Out,2017-09-14,H1,5370a7d8-c7ba-41d2-92b5-3ce1c3469c5a,2017-08-31,14,14 days,2017-09-14,2017-09-14,2017-02-01


In [9]:
df_data.head(10)

Unnamed: 0,IsCanceled,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,...,ReservationStatus,ReservationStatusDate,HotelNumber,UUID,ArrivalDate,LoS_Numeric,LoS_Days,DepartureDate,Date,BookingDate
0,0,2015,July,27,1,0,0,2,0.0,0,...,Check-Out,2015-07-01,H1,9af79666-f290-45c5-868c-2f9601b8f98b,2015-07-01,0,0 days,2015-07-01,2015-07-01,2014-07-24
1,0,2015,July,27,1,0,0,2,0.0,0,...,Check-Out,2015-07-01,H1,81440274-e84e-4502-89f3-e01681d0672a,2015-07-01,0,0 days,2015-07-01,2015-07-01,2013-06-24
2,0,2015,July,27,1,0,1,1,0.0,0,...,Check-Out,2015-07-02,H1,60fe936c-f7ba-48d9-ac73-71c21e1b3978,2015-07-01,1,1 days,2015-07-02,2015-07-01,2015-06-24
3,0,2015,July,27,1,0,1,1,0.0,0,...,Check-Out,2015-07-02,H1,5b2aae61-1d0c-4314-b4c1-603595e43163,2015-07-01,1,1 days,2015-07-02,2015-07-01,2015-06-18
4,0,2015,July,27,1,0,1,2,0.0,0,...,Check-Out,2015-07-02,H1,8225801b-2e43-490a-9703-4fec1baa34d8,2015-07-01,1,1 days,2015-07-02,2015-07-01,2015-06-19
5,0,2015,July,27,1,0,1,2,0.0,0,...,Check-Out,2015-07-02,H1,1dd5456d-f8be-4fa5-a40b-0d03c3572d23,2015-07-01,1,1 days,2015-07-02,2015-07-01,2015-07-01
6,0,2015,July,27,1,0,1,1,0.0,0,...,Check-Out,2015-07-02,H1,4286ab17-3f53-424b-b6ae-71c062d78664,2015-07-01,1,1 days,2015-07-02,2015-07-01,2015-06-27
7,0,2015,July,27,1,0,1,1,0.0,0,...,Check-Out,2015-07-02,H1,14b34771-9b50-4f2a-8eae-4c72e85a5f2e,2015-07-01,1,1 days,2015-07-02,2015-07-01,2015-06-29
8,0,2015,July,27,1,0,1,1,0.0,0,...,Check-Out,2015-07-02,H1,78f5d91e-05c8-47d2-a953-084ddf7ed4b1,2015-07-01,1,1 days,2015-07-02,2015-07-01,2015-06-30
9,0,2015,July,27,1,0,2,2,0.0,0,...,Check-Out,2015-07-03,H1,e92881a3-faf8-402b-beff-64dad4707236,2015-07-01,2,2 days,2015-07-03,2015-07-01,2015-06-17


# Feature Inspection: `ReservationStatusDate`

The final date-like feature from the original source is `ReservationStatusDate`, which records the date on which the reservation status was last changed.

As it is a date feature, it may be useful for feature engineering. However, based on my domain knowledge, I suspect that any of these dates that fall before the arrival date indicate that the reservation was canceled. Using the feature directly would leak the target into the model, but I may be able to generate a different feature from this information.

I will start by identifying those reservations where `ReservationStatusDate` is earlier than the arrival date. Then, I will get the `IsCanceled` feature from the dataset and filter for those reservations. Finally, I will calculate the share of those reservations that were canceled. If the share is very high (90-100%), then I will consider how to generate a new feature from this data.
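
A minimal sketch of those three steps on a toy frame follows. The frame `toy` and its values are assumptions for illustration; the column names match the dataset.

```python
import pandas as pd

# Hypothetical four-row sample; column names follow the dataset
toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2016-05-10', '2016-05-10', '2016-05-12', '2016-05-15']),
    'ReservationStatusDate': pd.to_datetime(['2016-04-01', '2016-05-12', '2016-05-11', '2016-05-15']),
    'IsCanceled': [1, 0, 1, 0],
})

# Step 1: reservations whose status last changed before arrival
changed_before_arrival = toy['ReservationStatusDate'] < toy['ArrivalDate']

# Steps 2 and 3: cancellation rate within that group only
cancel_rate_when_early = toy.loc[changed_before_arrival, 'IsCanceled'].mean()

# For comparison: share of all reservations that fall in the group
share_changed_early = changed_before_arrival.mean()

print(f'Share changed before arrival: {share_changed_early:.0%}')
print(f'Cancellation rate within that group: {cancel_rate_when_early:.0%}')
```

The two numbers answer different questions: the first measures how common early status changes are, while the second measures how strongly an early change implies a cancellation.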



In [10]:
## Review data prior to changes
df_data['ReservationStatusDate'].head(10)

0   2015-07-01
1   2015-07-01
2   2015-07-02
3   2015-07-02
4   2015-07-02
5   2015-07-02
6   2015-07-02
7   2015-07-02
8   2015-07-02
9   2015-07-03
Name: ReservationStatusDate, dtype: datetime64[ns]

## Compare with `IsCanceled` Data

In [11]:
## Identify reservations whose status last changed before arrival
change_filter = (df_data['ReservationStatusDate'] < df_data['ArrivalDate'])

## Share of all reservations whose status last changed before arrival
## (note: this is not the cancellation rate within that group)
avg_resstatdate_before_arrival = change_filter.mean()

## Overall cancellation rate
avg_cxl = df_data['IsCanceled'].mean()

print(f'The overall average number of canceled reservations is: {avg_cxl:.2%}\n')

print(' '.join(['The average number of canceled reservations with a ReservationStatusDate',
                f'prior to the arrival date is: {avg_resstatdate_before_arrival:.2%}\n']))

## Print advice based on results
if avg_cxl >= .9:
    ## str.join takes a single iterable, so the strings must be wrapped in a list
    print(' '.join(['The `ReservationStatusDate` feature is too strongly indicative of the `IsCanceled` feature.',
                    'It should not be used for modeling.']))
elif avg_cxl >= .25:
    print(' '.join(['This feature is related to the `IsCanceled` feature.',
                    'Make sure to review it in more detail to determine whether to use it.']))
else:
    print('The `ReservationStatusDate` feature is unlikely to be predictive of the `IsCanceled` feature.')

The overall average number of canceled reservations is: 37.54%

The average number of canceled reservations with a ReservationStatusDate prior to the arrival date is: 35.88%

This feature is related to the `IsCanceled` feature. Make sure to review it in more detail to determine whether to use it.


### EDA Questions

In [12]:
# ## What is the breakdown of reservation statuses for those reservations with matching Arrival and Status Dates?
# ## (A.K.A. "same-day departures" or "day-use reservations.")

# sameday_status = (df_data['ReservationStatusDate'] == df_data['ArrivalDate'])

# (df_data[sameday_status]
#  .value_counts(subset = 'ReservationStatus',normalize = True)
#  .round(2))

In [13]:
# ## What is the breakdown of IsCanceled statuses
# ## for those reservations with matching Arrival and Status Dates?

# (df_data[sameday_status]
#  .value_counts(subset = 'IsCanceled',normalize = True)
#  .round(2))

In [14]:
# ## What is the average rate for these day-use/same-day-departure reservations?

# sameday_adr = (sameday_status & (df_data['ReservationStatus'] == 'Check-Out'))

# sameday_adr_median = df_data[sameday_adr]['ADR'].median()

# print(f'The median ADR for same-day reservations is: ${sameday_adr_median:.2f}')

# sameday_adr_gt_zero = (df_data[sameday_adr]['ADR'] > 0).mean().round(2)

# print(f'The number of same-day reservations with an ADR greater than zero is: {sameday_adr_gt_zero:.1%}')

In [15]:
# sameday_departure = (df_data['ReservationStatusDate'] == df_data['DepartureDate'])
# df_data[sameday_departure].value_counts(subset = 'IsCanceled', normalize =True).round(4)

In [16]:
# df_data[sameday_departure].value_counts(subset = 'ReservationStatus', normalize =True).round(4)

In [17]:
# df_data[(sameday_departure & (df_data['ReservationStatus'] != 'Check-Out'))]

## ReservationStatusDate Later Than Arrival Date

In [18]:
# after_arrival_filter = (df_data['ReservationStatusDate'] > df_data['ArrivalDate'])

# avg_resstatdate_after_arrival = (after_arrival_filter.mean())

# print(f'The average number of reservations changed after arrival is: {avg_resstatdate_after_arrival:.0%}.')

# Future Work: Investigating Canceled Reservations

---

Although the `ReservationStatusDate` feature is not appropriate for feature engineering, I could use it to calculate the number of days between booking and cancellation (for canceled reservations only).

This would be outside of the scope of the current feature engineering, but I am noting it as future work for analysis.
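
As a rough sketch of that future calculation, the day count can be masked so that only canceled reservations keep a value. The frame `toy` and the column name `DaysBookingToCancel` are hypothetical and used here for illustration only.

```python
import pandas as pd

# Hypothetical two-row sample: one canceled reservation, one completed stay
toy = pd.DataFrame({
    'BookingDate': pd.to_datetime(['2016-03-01', '2016-04-20']),
    'ReservationStatusDate': pd.to_datetime(['2016-04-01', '2016-05-12']),
    'IsCanceled': [1, 0],
})

# Days between booking and the last status change
days_to_status = (toy['ReservationStatusDate'] - toy['BookingDate']).dt.days

# Keep the value only where the reservation was canceled; others become NaN
toy['DaysBookingToCancel'] = days_to_status.where(toy['IsCanceled'].eq(1))

print(toy)
```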

---

In [19]:
## Calculate the number of days between the status and booking dates

df_data['DaysSinceBooking'] = (df_data['ReservationStatusDate'] - df_data['BookingDate']).dt.days

df_data['DaysSinceBooking']

0         342
1         737
2           8
3          14
4          13
         ... 
528642    175
528643    225
528644    175
528645    225
528646    175
Name: DaysSinceBooking, Length: 528647, dtype: int64

In [20]:
df_data.head(10).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
IsCanceled,0,0,0,0,0,0,0,0,0,0
ArrivalDateYear,2015,2015,2015,2015,2015,2015,2015,2015,2015,2015
ArrivalDateMonth,July,July,July,July,July,July,July,July,July,July
ArrivalDateWeekNumber,27,27,27,27,27,27,27,27,27,27
ArrivalDateDayOfMonth,1,1,1,1,1,1,1,1,1,1
StaysInWeekendNights,0,0,0,0,0,0,0,0,0,0
StaysInWeekNights,0,0,1,1,1,1,1,1,1,2
Adults,2,2,1,1,2,2,1,1,1,2
Children,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Babies,0,0,0,0,0,0,0,0,0,0


## Inspection Results: `ReservationStatusDate`

---

The share of reservations last changed before arrival (35.88%) is close to the overall cancellation rate (37.54%), which suggests that this feature is almost exactly indicative of whether a reservation was canceled prior to arrival.

Additionally, the number of days between `ArrivalDate` and `ReservationStatusDate` matches the length of stay, which already exists as a feature.

However, I did calculate the age of each reservation (`DaysSinceBooking`) and can use this information for further analysis of canceled reservations.

**Final Determination:** The feature `ReservationStatusDate` is not appropriate for feature engineering. Given its nearly exact correlation with the cancellation status of a reservation, it should be dropped from future datasets prior to modeling.

---

In [21]:
df_data = df_data.drop(columns = 'ReservationStatusDate')
df_data

Unnamed: 0,IsCanceled,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,...,ReservationStatus,HotelNumber,UUID,ArrivalDate,LoS_Numeric,LoS_Days,DepartureDate,Date,BookingDate,DaysSinceBooking
0,0,2015,July,27,1,0,0,2,0.0,0,...,Check-Out,H1,9af79666-f290-45c5-868c-2f9601b8f98b,2015-07-01,0,0 days,2015-07-01,2015-07-01,2014-07-24,342
1,0,2015,July,27,1,0,0,2,0.0,0,...,Check-Out,H1,81440274-e84e-4502-89f3-e01681d0672a,2015-07-01,0,0 days,2015-07-01,2015-07-01,2013-06-24,737
2,0,2015,July,27,1,0,1,1,0.0,0,...,Check-Out,H1,60fe936c-f7ba-48d9-ac73-71c21e1b3978,2015-07-01,1,1 days,2015-07-02,2015-07-01,2015-06-24,8
3,0,2015,July,27,1,0,1,1,0.0,0,...,Check-Out,H1,5b2aae61-1d0c-4314-b4c1-603595e43163,2015-07-01,1,1 days,2015-07-02,2015-07-01,2015-06-18,14
4,0,2015,July,27,1,0,1,2,0.0,0,...,Check-Out,H1,8225801b-2e43-490a-9703-4fec1baa34d8,2015-07-01,1,1 days,2015-07-02,2015-07-01,2015-06-19,13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
528642,0,2017,August,35,31,4,10,2,0.0,0,...,Check-Out,H1,8ebd4e66-52a2-4630-a636-b3da3297918d,2017-08-31,14,14 days,2017-09-14,2017-09-12,2017-03-23,175
528643,0,2017,August,35,31,4,10,2,0.0,0,...,Check-Out,H1,5370a7d8-c7ba-41d2-92b5-3ce1c3469c5a,2017-08-31,14,14 days,2017-09-14,2017-09-13,2017-02-01,225
528644,0,2017,August,35,31,4,10,2,0.0,0,...,Check-Out,H1,8ebd4e66-52a2-4630-a636-b3da3297918d,2017-08-31,14,14 days,2017-09-14,2017-09-13,2017-03-23,175
528645,0,2017,August,35,31,4,10,2,0.0,0,...,Check-Out,H1,5370a7d8-c7ba-41d2-92b5-3ce1c3469c5a,2017-08-31,14,14 days,2017-09-14,2017-09-14,2017-02-01,225


# Feature Engineering: Holidays

In [22]:
min_year = (df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']]
                 .min()
                 .min()
                 .year)

max_year = (df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']]
                 .max()
                 .max()
                 .year)

min_year, max_year

(2013, 2017)

In [23]:
# Fetch holidays for the specific country and range of years (2013-2017)
country_code = 'PT'
years = list(range(min_year, max_year + 1))

# `country_holidays` replaces the deprecated `CountryHoliday` entry point
pt_holidays = holidays.country_holidays(country_code, years = years)
pt_holidays

{datetime.date(2016, 1, 1): 'Ano Novo', datetime.date(2016, 3, 25): 'Sexta-feira Santa', datetime.date(2016, 3, 27): 'Páscoa', datetime.date(2016, 5, 26): 'Corpo de Deus', datetime.date(2016, 10, 5): 'Implantação da República', datetime.date(2016, 11, 1): 'Dia de Todos os Santos', datetime.date(2016, 12, 1): 'Restauração da Independência', datetime.date(2016, 4, 25): 'Dia da Liberdade', datetime.date(2016, 5, 1): 'Dia do Trabalhador', datetime.date(2016, 6, 10): 'Dia de Portugal, de Camões e das Comunidades Portuguesas', datetime.date(2016, 8, 15): 'Assunção de Nossa Senhora', datetime.date(2016, 12, 8): 'Imaculada Conceição', datetime.date(2016, 12, 25): 'Dia de Natal', datetime.date(2017, 1, 1): 'Ano Novo', datetime.date(2017, 4, 14): 'Sexta-feira Santa', datetime.date(2017, 4, 16): 'Páscoa', datetime.date(2017, 6, 15): 'Corpo de Deus', datetime.date(2017, 10, 5): 'Implantação da República', datetime.date(2017, 11, 1): 'Dia de Todos os Santos', datetime.date(2017, 12, 1): 'Restauraçã

In [24]:
def holiday_past(date, holiday_dates):
    """Days since the most recent holiday on or before `date` (0 on a holiday)."""

    # Convert Timestamp to datetime.date
    date = date.date()

    # Day counts back to every holiday on or before the date
    past_holidays = [(date - h_date).days for h_date in holiday_dates if h_date <= date]

    # The smallest count is the closest past holiday; None if there is none
    return min(past_holidays, default=None)


def holiday_upcoming(date, holiday_dates):
    """Days until the next holiday strictly after `date`."""

    # Convert Timestamp to datetime.date
    date = date.date()

    # Day counts forward to every holiday after the date
    future_holidays = [(h_date - date).days for h_date in holiday_dates if h_date > date]

    # The smallest count is the closest upcoming holiday; None if there is none
    return min(future_holidays, default=None)


# Function to calculate the proximity to holidays for a list of dates
def calculate_holiday_proximity(dates, holiday_dates):
    days_after_recent_holiday = []
    days_before_next_holiday = []

    # Loop variable is `date` (not `dt`) to avoid shadowing the `datetime as dt` import
    for date in dates:

        days_after_recent_holiday.append(holiday_past(date, holiday_dates))
        days_before_next_holiday.append(holiday_upcoming(date, holiday_dates))

    return days_after_recent_holiday, days_before_next_holiday
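
With roughly half a million rows and four date columns, the row-by-row loop can be slow. One possible alternative, sketched below under stated assumptions, is a vectorized lookup with `numpy.searchsorted` on a sorted array of holiday dates. It keeps the same convention as the functions above: a date that falls on a holiday is 0 days after it, and the next holiday is strictly later. The four holiday dates are hard-coded for illustration rather than pulled from the `holidays` package, and missing values are returned as `NaN` instead of `None`.

```python
import numpy as np
import pandas as pd

# A few Portuguese holiday dates from 2015, hard-coded for illustration
hol = pd.to_datetime(['2015-05-01', '2015-06-10',
                      '2015-08-15', '2015-12-08']).sort_values().to_numpy()

# Hypothetical dates to score
dates = pd.Series(pd.to_datetime(['2015-07-01', '2015-08-15',
                                  '2015-12-25', '2015-01-01']))
vals = dates.to_numpy()

# Index of the first holiday strictly after each date;
# the holiday on or before the date sits one position earlier
idx_next = np.searchsorted(hol, vals, side='right')
has_next = idx_next < len(hol)
has_prev = idx_next > 0

one_day = np.timedelta64(1, 'D')
days_before = np.full(len(vals), np.nan)
days_after = np.full(len(vals), np.nan)
days_before[has_next] = (hol[idx_next[has_next]] - vals[has_next]) / one_day
days_after[has_prev] = (vals[has_prev] - hol[idx_next[has_prev] - 1]) / one_day

proximity = pd.DataFrame({'Date': dates,
                          'DaysBeforeHoliday': days_before,
                          'DaysAfterHoliday': days_after})
print(proximity)
```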

In [25]:
# Apply the function to each date column in the dataframe
for column in ['ArrivalDate', 'DepartureDate', 'BookingDate', 'Date']:
    after, before = calculate_holiday_proximity(df_data[column], pt_holidays)
    df_data[f'{column}_DaysBeforeHoliday'] = before
    df_data[f'{column}_DaysAfterHoliday'] = after

df_data

Unnamed: 0,IsCanceled,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,...,BookingDate,DaysSinceBooking,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,Date_DaysBeforeHoliday,Date_DaysAfterHoliday
0,0,2015,July,27,1,0,0,2,0.0,0,...,2014-07-24,342,45,21,45,21,22,44,45,21
1,0,2015,July,27,1,0,0,2,0.0,0,...,2013-06-24,737,45,21,45,21,52,14,45,21
2,0,2015,July,27,1,0,1,1,0.0,0,...,2015-06-24,8,45,21,44,22,52,14,45,21
3,0,2015,July,27,1,0,1,1,0.0,0,...,2015-06-18,14,45,21,44,22,58,8,45,21
4,0,2015,July,27,1,0,1,2,0.0,0,...,2015-06-19,13,45,21,44,22,57,9,45,21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
528642,0,2017,August,35,31,4,10,2,0.0,0,...,2017-03-23,175,35,16,21,30,22,81,23,28
528643,0,2017,August,35,31,4,10,2,0.0,0,...,2017-02-01,225,35,16,21,30,72,31,22,29
528644,0,2017,August,35,31,4,10,2,0.0,0,...,2017-03-23,175,35,16,21,30,22,81,22,29
528645,0,2017,August,35,31,4,10,2,0.0,0,...,2017-02-01,225,35,16,21,30,72,31,21,30


# Feature Engineering: ISO Day of Week, ISO Week of Year

In [26]:
# df_data['ArrivalDate'].dt.dayofweek.head()
df_data['Date'].dt.dayofweek.head()

0    2
1    2
2    2
3    2
4    2
Name: Date, dtype: int32

In [27]:
# arrival_isocal = (df_data['ArrivalDate']
#                   .dt.isocalendar()[['week', 'day']]
#                   .rename(columns = {'week':'ArrivalWeek', 'day': 'ArrivalDay'}))
# arrival_isocal

arrival_isocal = (df_data['Date']
                  .dt.isocalendar()[['week', 'day']]
                  .rename(columns = {'week':'DateWeek', 'day': 'DateDay'}))
arrival_isocal

Unnamed: 0,DateWeek,DateDay
0,27,3
1,27,3
2,27,3
3,27,3
4,27,3
...,...,...
528642,37,2
528643,37,3
528644,37,3
528645,37,4


In [28]:
df_data = pd.concat([df_data, arrival_isocal], axis = 1)
df_data

Unnamed: 0,IsCanceled,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,...,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,Date_DaysBeforeHoliday,Date_DaysAfterHoliday,DateWeek,DateDay
0,0,2015,July,27,1,0,0,2,0.0,0,...,45,21,45,21,22,44,45,21,27,3
1,0,2015,July,27,1,0,0,2,0.0,0,...,45,21,45,21,52,14,45,21,27,3
2,0,2015,July,27,1,0,1,1,0.0,0,...,45,21,44,22,52,14,45,21,27,3
3,0,2015,July,27,1,0,1,1,0.0,0,...,45,21,44,22,58,8,45,21,27,3
4,0,2015,July,27,1,0,1,2,0.0,0,...,45,21,44,22,57,9,45,21,27,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
528642,0,2017,August,35,31,4,10,2,0.0,0,...,35,16,21,30,22,81,23,28,37,2
528643,0,2017,August,35,31,4,10,2,0.0,0,...,35,16,21,30,72,31,22,29,37,3
528644,0,2017,August,35,31,4,10,2,0.0,0,...,35,16,21,30,22,81,22,29,37,3
528645,0,2017,August,35,31,4,10,2,0.0,0,...,35,16,21,30,72,31,21,30,37,4


# Feature Engineering: Day of Week, Month as Categorical

In [29]:
# df_day_name = (df_data['ArrivalDate']
#                  .dt.day_name()
#                  .astype('category'))
# df_day_name.name = 'ArrivalDateDayName'
# df_day_name

# df_data = pd.concat([df_data, df_day_name], axis = 1)

# df_data['ArrivalDateDayName'].head().T

df_day_name = (df_data['Date']
                 .dt.day_name()
                 .astype('category'))
df_day_name.name = 'DateDayName'
df_day_name

df_data = pd.concat([df_data, df_day_name], axis = 1)

df_data['DateDayName'].head().T

0    Wednesday
1    Wednesday
2    Wednesday
3    Wednesday
4    Wednesday
Name: DateDayName, dtype: category
Categories (7, object): ['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday']

In [30]:
df_month_name = (df_data['Date']
                 .dt.month_name()
                 .astype('category'))
df_month_name.name = 'DateMonthName'

df_data = pd.concat([df_data, df_month_name], axis = 1)

df_data['DateMonthName'].head().T

0    July
1    July
2    July
3    July
4    July
Name: DateMonthName, dtype: category
Categories (12, object): ['April', 'August', 'December', 'February', ..., 'May', 'November', 'October', 'September']

In [31]:
df_data.head().T

Unnamed: 0,0,1,2,3,4
IsCanceled,0,0,0,0,0
ArrivalDateYear,2015,2015,2015,2015,2015
ArrivalDateMonth,July,July,July,July,July
ArrivalDateWeekNumber,27,27,27,27,27
ArrivalDateDayOfMonth,1,1,1,1,1
StaysInWeekendNights,0,0,0,0,0
StaysInWeekNights,0,0,1,1,1
Adults,2,2,1,1,2
Children,0.0,0.0,0.0,0.0,0.0
Babies,0,0,0,0,0


# Feature Engineering: Rolling Averages, Rolling Standard Deviation, and Lag

---

> To help capture the time-series behavior of ADR, I will also introduce rolling averages and rolling standard deviations, and apply exponential smoothing, to create new features.
>
> This approach does use the target feature for engineering, but as long as each window uses only past values and I split my data on the arrival date, I should be able to avoid data leakage.

---
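
A minimal sketch of this idea on a hypothetical daily ADR series is shown below. It assumes one row per calendar day, which the exploded dataset does not have without aggregating first, and the column names are illustrative. Shifting by one row before each window means every feature is built from earlier days only.

```python
import pandas as pd

# Hypothetical daily ADR series, one row per calendar day
daily = pd.DataFrame({
    'Date': pd.date_range('2015-07-01', periods=6, freq='D'),
    'ADR': [100.0, 110.0, 90.0, 120.0, 130.0, 80.0],
}).set_index('Date')

# Shift by one day so each feature excludes the current day's ADR
past_adr = daily['ADR'].shift(1)

daily['ADR_lag_1'] = past_adr
daily['ADR_3d_avg'] = past_adr.rolling(window=3).mean()
daily['ADR_3d_std'] = past_adr.rolling(window=3).std()
daily['ADR_ewm_3'] = past_adr.ewm(span=3, adjust=False).mean()

print(daily.round(2))
```

The first rows of each rolling column are `NaN` because a full window of past values is not yet available.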

In [32]:
# # Lag features
# df_data['ADR_lag_1'] = df_data['ADR'].shift(1)
# df_data['ADR_lag_7'] = df_data['ADR'].shift(7)

# # 3-day rolling average (past 3 days)
# df_data['ADR_3d_avg'] = df_data['ADR'].shift(1).rolling(window=3).mean().round(2)
# # 7-day rolling average (past 7 days)
# df_data['ADR_7d_avg'] = df_data['ADR'].shift(1).rolling(window=7).mean().round(2)
# # 3-day moving standard deviation (past 3 days)
# df_data['ADR_3d_std'] = df_data['ADR'].shift(1).rolling(window=3).std().round(2)
# # 7-day moving standard deviation (past 7 days)
# df_data['ADR_7d_std'] = df_data['ADR'].shift(1).rolling(window=7).std().round(2)

# # Exponential smoothing
# df_data['ADR_ewm_3'] = df_data['ADR'].shift(1).ewm(span=3, adjust=False).mean().round(2)
# df_data['ADR_ewm_7'] = df_data['ADR'].shift(1).ewm(span=7, adjust=False).mean().round(2)

# df_data

# Prepare to Save Data

In [33]:
df_data = df_data.reset_index(drop = True)
df_data

Unnamed: 0,IsCanceled,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,...,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,Date_DaysBeforeHoliday,Date_DaysAfterHoliday,DateWeek,DateDay,DateDayName,DateMonthName
0,0,2015,July,27,1,0,0,2,0.0,0,...,45,21,22,44,45,21,27,3,Wednesday,July
1,0,2015,July,27,1,0,0,2,0.0,0,...,45,21,52,14,45,21,27,3,Wednesday,July
2,0,2015,July,27,1,0,1,1,0.0,0,...,44,22,52,14,45,21,27,3,Wednesday,July
3,0,2015,July,27,1,0,1,1,0.0,0,...,44,22,58,8,45,21,27,3,Wednesday,July
4,0,2015,July,27,1,0,1,2,0.0,0,...,44,22,57,9,45,21,27,3,Wednesday,July
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
528642,0,2017,August,35,31,4,10,2,0.0,0,...,21,30,22,81,23,28,37,2,Tuesday,September
528643,0,2017,August,35,31,4,10,2,0.0,0,...,21,30,72,31,22,29,37,3,Wednesday,September
528644,0,2017,August,35,31,4,10,2,0.0,0,...,21,30,22,81,22,29,37,3,Wednesday,September
528645,0,2017,August,35,31,4,10,2,0.0,0,...,21,30,72,31,21,30,37,4,Thursday,September


# Final Inspection

---

I extracted a good deal of information about booking and stay dates and added several temporal features. While this approach adds a significant number of features, I am confident that the additional data will be worthwhile.

---

In [34]:
df_data.to_parquet('../../data/5.2_exploded_temporal_update.parquet', compression = 'zstd')

# df_data.to_excel('../../data/5.2_exploded_temporal_update.xlsx', index = False)