# Feature Engineering: Datetime Features

---

**Calculating Dates**

The source data does not include specific datetime features. Instead, it offers a selection of different pieces that I can use to create new datetime features:
* The arrival year, month name, and day of month.
* The number of weekday and weekend nights.
* The "booking lead time," or how far in advance the guest booked their reservation.

Using these features, I can create three new datetime features to start: the arrival, departure, and booking dates.
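
As a minimal sketch of the arithmetic for a single reservation, using only the standard library: the function name `derive_dates` is hypothetical, and parsing the month name with `%B` assumes an English locale. The input values mirror one row of the source data.

```python
from datetime import datetime, timedelta

def derive_dates(year, month_name, day, week_nights, weekend_nights, lead_time):
    """Derive arrival, departure, and booking dates for one reservation."""
    # Parse e.g. "2015-July-1"; %B expects the full month name (English locale assumed)
    arrival = datetime.strptime(f"{year}-{month_name}-{day}", "%Y-%B-%d").date()
    # Length of stay is the sum of weekday and weekend nights
    departure = arrival + timedelta(days=week_nights + weekend_nights)
    # Lead time is the number of days between booking and arrival
    booking = arrival - timedelta(days=lead_time)
    return arrival, departure, booking

arrival, departure, booking = derive_dates(2015, "July", 1, 2, 0, 14)
print(arrival, departure, booking)  # 2015-07-01 2015-07-03 2015-06-17
```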

---

**Extracting Date Details**

For each of these separate dates, I can go into more detail:
* Calculating the number of days since the last holiday and the number of days until the next.
* Determining the week of the year, day of the week, etc. to help capture more temporal features.
* Calculating the number of days between the last reservation change and the arrival date (for reservations last changed on or before the arrival date).

---

**Final Considerations**

This process will create many new features, which may hurt downstream modeling performance. Prior to modeling, I may need to apply feature selection to keep only the most informative features.

By the end of this notebook, I will have a new set of temporally-focused data to use for more extensive modeling and forecasting in the next steps of the workflow.

---

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
## Enabling access to custom functions in separate directory

# Import necessary modules
import sys
import os

# Construct the absolute path to the 'src' directory
src_path = os.path.abspath(os.path.join('..', 'src'))

# Append the path to 'sys.path'
if src_path not in sys.path:
    sys.path.append(src_path)

import db_utils

In [3]:
import datetime as dt
import holidays
import pandas as pd

# *TEMPORARILY CHANGED TO USE BACKUP FILES* | Read Data from DuckDB

In [4]:
# # Path to the DuckDB database file
# db_path = '../data/hotel_reservations.duckdb'

# ## Select subset of data for review
# q = 'SELECT * FROM res_data LIMIT 5'

# with db_utils.duckdb_connection(db_path) as conn:
#     display(conn.execute(q).df())

In [5]:
# ## Select subset of data for review
# q = 'SELECT IsCanceled FROM res_data LIMIT 5'

# with db_utils.duckdb_connection(db_path) as conn:
#     display(conn.execute(q).df())

In [6]:
# ## Convert Arrival columns to strings

# q = ('''
# SELECT uuid, ArrivalDateYear, ArrivalDateMonth, ArrivalDateDayOfMonth,
# StaysInWeekNights, StaysInWeekendNights, LeadTime 
# FROM res_data''')

# with db_utils.duckdb_connection(db_path) as conn:
#     df_data = conn.execute(q).df()

# # df_data = arrival_cols.astype(str)
# df_data.head()

In [7]:
date_features = ['UUID','ReservationStatusDate', 'ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth',
                 'StaysInWeekNights', 'StaysInWeekendNights', 'LeadTime']
date_features

['UUID',
 'ReservationStatusDate',
 'ArrivalDateYear',
 'ArrivalDateMonth',
 'ArrivalDateDayOfMonth',
 'StaysInWeekNights',
 'StaysInWeekendNights',
 'LeadTime']

In [8]:
path_to_backup_files = '../data/data_condensed_with_uuid.parquet'

df_data = pd.read_parquet(path_to_backup_files, columns=date_features)

df_data

Unnamed: 0,UUID,ReservationStatusDate,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime
0,8ca998d6-fae7-4ee4-a706-3765721aaff5,2015-07-01,2015,July,1,0,0,342
1,e535835e-b19a-4e32-9e9f-6d70a0182d4b,2015-07-01,2015,July,1,0,0,737
2,9429383d-0efd-4c37-bb9b-0aaa63d5aade,2015-07-02,2015,July,1,1,0,7
3,dd6424ee-6838-4007-ad85-de9ff96be14b,2015-07-02,2015,July,1,1,0,13
4,50ff56ee-6a72-40dc-8ff1-4246b831c779,2015-07-03,2015,July,1,2,0,14
...,...,...,...,...,...,...,...,...
119385,2ccdf728-5829-47d6-ae2c-016f9706c24d,2017-09-06,2017,August,30,5,2,23
119386,be937240-f461-4eee-9971-a83c8c09bf04,2017-09-07,2017,August,31,5,2,102
119387,04e0baed-c9a6-487c-a1af-82cb0f830fe4,2017-09-07,2017,August,31,5,2,34
119388,4d96b250-c5c4-46e5-bf52-e4334991d81b,2017-09-07,2017,August,31,5,2,109


In [9]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 8 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   UUID                   119390 non-null  object
 1   ReservationStatusDate  119390 non-null  object
 2   ArrivalDateYear        119390 non-null  int64 
 3   ArrivalDateMonth       119390 non-null  object
 4   ArrivalDateDayOfMonth  119390 non-null  int64 
 5   StaysInWeekNights      119390 non-null  int64 
 6   StaysInWeekendNights   119390 non-null  int64 
 7   LeadTime               119390 non-null  int64 
dtypes: int64(5), object(3)
memory usage: 7.3+ MB


## Convert ReservationStatusDate to Datetime Format

In [10]:
df_data['ReservationStatusDate'] = pd.to_datetime(df_data['ReservationStatusDate'], yearfirst = True)
df_data['ReservationStatusDate']

0        2015-07-01
1        2015-07-01
2        2015-07-02
3        2015-07-02
4        2015-07-03
            ...    
119385   2017-09-06
119386   2017-09-07
119387   2017-09-07
119388   2017-09-07
119389   2017-09-07
Name: ReservationStatusDate, Length: 119390, dtype: datetime64[ns]

# Feature Engineering: Arrival, Departure, and Booking Dates

## Arrival Date

In [11]:
## Create new column of strings formatted as YYYY-MM-DD, then convert to datetime

arrival_details = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth']

df_data[arrival_details] = df_data[arrival_details].astype(str)

df_data['ArrivalDate'] = df_data['ArrivalDateYear'].str.cat(df_data[['ArrivalDateMonth',
                                                                     'ArrivalDateDayOfMonth']],
                                                            '-')

df_data['ArrivalDate'] = pd.to_datetime(df_data['ArrivalDate'], yearfirst = True)

df_data.head()

Unnamed: 0,UUID,ReservationStatusDate,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime,ArrivalDate
0,8ca998d6-fae7-4ee4-a706-3765721aaff5,2015-07-01,2015,July,1,0,0,342,2015-07-01
1,e535835e-b19a-4e32-9e9f-6d70a0182d4b,2015-07-01,2015,July,1,0,0,737,2015-07-01
2,9429383d-0efd-4c37-bb9b-0aaa63d5aade,2015-07-02,2015,July,1,1,0,7,2015-07-01
3,dd6424ee-6838-4007-ad85-de9ff96be14b,2015-07-02,2015,July,1,1,0,13,2015-07-01
4,50ff56ee-6a72-40dc-8ff1-4246b831c779,2015-07-03,2015,July,1,2,0,14,2015-07-01
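
The cell above relies on pandas inferring the layout of strings such as `2015-July-1`. A possible alternative, shown here only as a sketch on a toy frame (the `toy` rows are illustrative, and `%B` assumes English month names), is to pass an explicit `format` so the parse does not depend on inference:

```python
import pandas as pd

# Toy rows mirroring the three arrival columns (values are illustrative)
toy = pd.DataFrame({
    'ArrivalDateYear': [2015, 2017],
    'ArrivalDateMonth': ['July', 'August'],
    'ArrivalDateDayOfMonth': [1, 31],
})

# Build "YYYY-Month-D" strings, then parse them with an explicit format
arrival_str = (toy['ArrivalDateYear'].astype(str) + '-'
               + toy['ArrivalDateMonth'] + '-'
               + toy['ArrivalDateDayOfMonth'].astype(str))
toy['ArrivalDate'] = pd.to_datetime(arrival_str, format='%Y-%B-%d')
print(toy['ArrivalDate'])
```

An explicit format also raises an error on malformed rows instead of silently guessing.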


## Departure Date

In [12]:
timedelta_wknd = pd.to_timedelta(df_data.loc[:, 'StaysInWeekendNights'], unit = 'D')
timedelta_wk = pd.to_timedelta(df_data.loc[:, 'StaysInWeekNights'], unit = 'D')

df_data['DepartureDate'] = df_data.loc[:, 'ArrivalDate'] + timedelta_wk + timedelta_wknd

df_data['DepartureDate'] 

0        2015-07-01
1        2015-07-01
2        2015-07-02
3        2015-07-02
4        2015-07-03
            ...    
119385   2017-09-06
119386   2017-09-07
119387   2017-09-07
119388   2017-09-07
119389   2017-09-07
Name: DepartureDate, Length: 119390, dtype: datetime64[ns]

## Booking Date

### Convert `LeadTime` to TimeDelta Datatype

In [13]:
## Initial datatype - integer
df_data['LeadTime']

0         342
1         737
2           7
3          13
4          14
         ... 
119385     23
119386    102
119387     34
119388    109
119389    205
Name: LeadTime, Length: 119390, dtype: int64

In [14]:
## Convert to TimeDelta
df_data['LeadTimeDelta'] = pd.to_timedelta(df_data['LeadTime'], unit = 'D')
df_data['LeadTimeDelta']

0        342 days
1        737 days
2          7 days
3         13 days
4         14 days
           ...   
119385    23 days
119386   102 days
119387    34 days
119388   109 days
119389   205 days
Name: LeadTimeDelta, Length: 119390, dtype: timedelta64[ns]

### Calculate BookingDate Using LeadTimeDelta

In [15]:
## Subtract LeadTimeDelta from ArrivalDate to calculate BookingDate
df_data['BookingDate'] = df_data['ArrivalDate'] - df_data['LeadTimeDelta']
df_data['BookingDate']

0        2014-07-24
1        2013-06-24
2        2015-06-24
3        2015-06-18
4        2015-06-17
            ...    
119385   2017-08-07
119386   2017-05-21
119387   2017-07-28
119388   2017-05-14
119389   2017-02-05
Name: BookingDate, Length: 119390, dtype: datetime64[ns]

# Feature Inspection: `ReservationStatusDate`

The final date-like data from the original source is the `ReservationStatusDate`, indicating the date on which the reservation's status last changed.

As a date feature, it may be useful for feature engineering. However, based on my domain knowledge, I suspect that a `ReservationStatusDate` earlier than the arrival date indicates that the reservation was canceled. Used directly, this would leak the target into the model, but I may be able to derive a different feature from it.

I will start by identifying the reservations whose `ReservationStatusDate` is earlier than the arrival date. Then, I will get the `IsCanceled` feature from the dataset and filter for those reservations. Finally, I will calculate the proportion of those reservations that were canceled. If the proportion is very high (90-100%), I will consider how to generate a new feature from this data.
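
The check described above can be sketched on a toy frame. This is an illustration only: the rows are made up, `timing` is a hypothetical helper column, and the sketch assumes `IsCanceled` has already been joined onto the date columns.

```python
import pandas as pd

# Toy data: made-up reservations with the target already joined on
toy = pd.DataFrame({
    'ReservationStatusDate': pd.to_datetime(
        ['2015-05-06', '2015-06-23', '2015-07-03', '2015-07-01']),
    'ArrivalDate': pd.to_datetime(
        ['2015-07-01', '2015-07-01', '2015-07-01', '2015-07-01']),
    'IsCanceled': [1, 1, 0, 0],
})

# Label each reservation by when its status last changed relative to arrival
delta_days = (toy['ReservationStatusDate'] - toy['ArrivalDate']).dt.days
toy['timing'] = pd.cut(delta_days,
                       bins=[-float('inf'), -1, 0, float('inf')],
                       labels=['before', 'same_day', 'after'])

# Share of canceled reservations within each timing group
cancel_rate = toy.groupby('timing', observed=True)['IsCanceled'].mean()
print(cancel_rate)
```

A rate close to 1.0 in the `before` group and close to 0.0 elsewhere would suggest that the date encodes the target almost directly.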



In [16]:
df_data['ReservationStatusDate'].head()

0   2015-07-01
1   2015-07-01
2   2015-07-02
3   2015-07-02
4   2015-07-03
Name: ReservationStatusDate, dtype: datetime64[ns]

In [17]:
change_filter = (df_data['ReservationStatusDate'] < df_data['ArrivalDate'])

avg_resstatdate_before_arrival = (change_filter
                                  .mean()
                                  .round(2))
print(f'The average number of reservations last changed before arrival is: {avg_resstatdate_before_arrival:.0%}.')

The average number of reservations last changed before arrival is: 35%.


## Read-In `IsCanceled` Data to Match Reservations

In [18]:
path_to_backup_files = '../data/data_condensed_with_uuid.parquet'

df_UUID = pd.read_parquet(path_to_backup_files, columns = ['UUID', 'IsCanceled'])

df_UUID.head()

Unnamed: 0,UUID,IsCanceled
0,8ca998d6-fae7-4ee4-a706-3765721aaff5,0
1,e535835e-b19a-4e32-9e9f-6d70a0182d4b,0
2,9429383d-0efd-4c37-bb9b-0aaa63d5aade,0
3,dd6424ee-6838-4007-ad85-de9ff96be14b,0
4,50ff56ee-6a72-40dc-8ff1-4246b831c779,0


In [19]:
uuids = df_data[change_filter]['UUID'].to_list()
uuids[:5]

['199901a6-9abc-4fec-85d7-9d79f2471abf',
 'aef8ee02-1762-4f17-a8d7-5550a84a182e',
 'a28bca44-b17a-4aff-ab7c-b7505b294554',
 '8b3cf29f-9540-4a2e-9b6c-eef0bc7649ec',
 '67f2d23d-e2cb-44bc-8452-b7481c756194']

In [20]:
uuid_filter = df_UUID['UUID'].isin(uuids)

avg_cxl = df_UUID[uuid_filter]['IsCanceled'].mean()

print((f'''The average number of canceled reservations with a ReservationStatusDate earlier than the arrival date is: {avg_cxl:.2%}\n'''))

if avg_cxl >= .9:
    print('''This feature is too strongly indicative of the `IsCanceled` feature. It should not be used for modeling.''')
elif avg_cxl >= .75:
    print('''This feature is strongly related to the `IsCanceled` feature. Make sure to review it in more detail to determine whether to use it.''')
else:
    # Below 75%: the date is not a near-proxy for the target
    print('''This feature is not strongly indicative of the `IsCanceled` feature. It may be usable for modeling.''')

The average number of canceled reservations with a ReservationStatusDate earlier than the arrival date is: 100.00%

This feature is too strongly indicative of the `IsCanceled` feature. It should not be used for modeling.


## ReservationStatusDate Later Than Arrival Date

In [21]:
after_arrival_filter = (df_data['ReservationStatusDate'] > df_data['ArrivalDate'])

avg_resstatdate_after_arrival = (after_arrival_filter
                                  .mean()
                                  .round(2))
print(f'The average number of reservations changed after arrival is: {avg_resstatdate_after_arrival:.0%}.')

The average number of reservations changed after arrival is: 62%.


In [22]:
uuids_post_arv = df_data[after_arrival_filter]['UUID'].to_list()
# uuids[:10]

In [23]:
uuid_filter = df_UUID['UUID'].isin(uuids_post_arv)

avg_cxl = df_UUID[uuid_filter]['IsCanceled'].mean()

print((f'''The average number of canceled reservations with a ReservationStatusDate later than the arrival date is: {avg_cxl:.2%}\n'''))

The average number of canceled reservations with a ReservationStatusDate later than the arrival date is: 0.00%



# Future Work: Investigating Canceled Reservations

---

Although the `ReservationStatusDate` feature is not appropriate for feature engineering, I could use it to calculate the number of days between booking and cancellation (for canceled reservations only).

This would be outside of the scope of the current feature engineering, but I am noting it as future work for analysis.

---

In [24]:
df_UUID.head()

Unnamed: 0,UUID,IsCanceled
0,8ca998d6-fae7-4ee4-a706-3765721aaff5,0
1,e535835e-b19a-4e32-9e9f-6d70a0182d4b,0
2,9429383d-0efd-4c37-bb9b-0aaa63d5aade,0
3,dd6424ee-6838-4007-ad85-de9ff96be14b,0
4,50ff56ee-6a72-40dc-8ff1-4246b831c779,0


In [25]:
## Create list of UUIDs from cancelled reservations
cxl_filter = df_UUID['IsCanceled'] == 1
# Filter the frame the mask was built from, rather than relying on df_data sharing its row order
cxl_uuids = df_UUID[cxl_filter]['UUID'].to_list()
cxl_uuids[:10]

['199901a6-9abc-4fec-85d7-9d79f2471abf',
 'aef8ee02-1762-4f17-a8d7-5550a84a182e',
 'a28bca44-b17a-4aff-ab7c-b7505b294554',
 '8b3cf29f-9540-4a2e-9b6c-eef0bc7649ec',
 '67f2d23d-e2cb-44bc-8452-b7481c756194',
 '4a9b4cd8-44e8-4f00-95f9-866264b2a7c0',
 'fda23171-fb72-448a-99b6-d5a82a1fe37a',
 'f68ee930-968d-46f5-9ecb-8591f8539f48',
 '71af9319-8482-45b1-8590-6a1b85cdd292',
 '3379ddc2-e265-44d3-929a-9fb8784e90b9']

In [26]:
## Subset the date-engineered dataframe for cancelled reservations
cxl_res = df_data[df_data['UUID'].isin(cxl_uuids)]
cxl_res = cxl_res.reset_index(drop = True)
cxl_res.head()

Unnamed: 0,UUID,ReservationStatusDate,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime,ArrivalDate,DepartureDate,LeadTimeDelta,BookingDate
0,199901a6-9abc-4fec-85d7-9d79f2471abf,2015-05-06,2015,July,1,3,0,85,2015-07-01,2015-07-04,85 days,2015-04-07
1,aef8ee02-1762-4f17-a8d7-5550a84a182e,2015-04-22,2015,July,1,3,0,75,2015-07-01,2015-07-04,75 days,2015-04-17
2,a28bca44-b17a-4aff-ab7c-b7505b294554,2015-06-23,2015,July,1,4,0,23,2015-07-01,2015-07-05,23 days,2015-06-08
3,8b3cf29f-9540-4a2e-9b6c-eef0bc7649ec,2015-05-11,2015,July,1,5,2,60,2015-07-01,2015-07-08,60 days,2015-05-02
4,67f2d23d-e2cb-44bc-8452-b7481c756194,2015-05-29,2015,July,1,8,2,96,2015-07-01,2015-07-11,96 days,2015-03-27


In [27]:
cxl_res.loc[:, 'ReservationStatusDate'] = pd.to_datetime(cxl_res.loc[:, 'ReservationStatusDate'])
cxl_res.loc[:, 'BookingDate'] = pd.to_datetime(cxl_res.loc[:, 'BookingDate'])

In [28]:
cxl_res.loc[8, 'ReservationStatusDate']

Timestamp('2015-05-18 00:00:00')

In [29]:
cxl_res.loc[:, 'BookingDate']

0       2015-04-07
1       2015-04-17
2       2015-06-08
3       2015-05-02
4       2015-03-27
           ...    
44219   2016-12-14
44220   2017-06-01
44221   2017-05-24
44222   2017-07-11
44223   2017-08-02
Name: BookingDate, Length: 44224, dtype: datetime64[ns]

In [30]:
## Calculate number of days between booking and cancellation
age_at_cxl = (cxl_res['ReservationStatusDate'] - cxl_res['BookingDate']).dt.days

age_at_cxl.name = 'DaysOldAtCancelation'

cxl_res = pd.concat([cxl_res, age_at_cxl], axis = 1)
cxl_res.head()

Unnamed: 0,UUID,ReservationStatusDate,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth,StaysInWeekNights,StaysInWeekendNights,LeadTime,ArrivalDate,DepartureDate,LeadTimeDelta,BookingDate,DaysOldAtCancelation
0,199901a6-9abc-4fec-85d7-9d79f2471abf,2015-05-06,2015,July,1,3,0,85,2015-07-01,2015-07-04,85 days,2015-04-07,29
1,aef8ee02-1762-4f17-a8d7-5550a84a182e,2015-04-22,2015,July,1,3,0,75,2015-07-01,2015-07-04,75 days,2015-04-17,5
2,a28bca44-b17a-4aff-ab7c-b7505b294554,2015-06-23,2015,July,1,4,0,23,2015-07-01,2015-07-05,23 days,2015-06-08,15
3,8b3cf29f-9540-4a2e-9b6c-eef0bc7649ec,2015-05-11,2015,July,1,5,2,60,2015-07-01,2015-07-08,60 days,2015-05-02,9
4,67f2d23d-e2cb-44bc-8452-b7481c756194,2015-05-29,2015,July,1,8,2,96,2015-07-01,2015-07-11,96 days,2015-03-27,63


In [31]:
cxl_res[['UUID', 'DaysOldAtCancelation']].head()

Unnamed: 0,UUID,DaysOldAtCancelation
0,199901a6-9abc-4fec-85d7-9d79f2471abf,29
1,aef8ee02-1762-4f17-a8d7-5550a84a182e,5
2,a28bca44-b17a-4aff-ab7c-b7505b294554,15
3,8b3cf29f-9540-4a2e-9b6c-eef0bc7649ec,9
4,67f2d23d-e2cb-44bc-8452-b7481c756194,63


In [32]:
cxl_res[['UUID', 'DaysOldAtCancelation']].to_parquet('../data/cxl_res_age.parquet', compression = 'snappy')

## Inspection Results: `ReservationStatusDate`

---

Based on the cancellation rate among reservations last changed before arrival (100%), it is clear that this feature is almost exactly indicative of whether a reservation was canceled prior to arrival.

Additionally, for reservations that were not canceled, the number of days between the `ArrivalDate` and `ReservationStatusDate` features matches the length of stay, which already exists.
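
One way to spot-check this claim, sketched on a few illustrative rows: the toy frame below assumes `IsCanceled` has been joined onto the date columns, whereas the notebook keeps it in the separate `df_UUID` frame.

```python
import pandas as pd

# Illustrative rows: two completed stays and one cancellation
toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2015-07-01', '2015-07-01', '2015-07-01']),
    'DepartureDate': pd.to_datetime(['2015-07-02', '2015-07-03', '2015-07-04']),
    'ReservationStatusDate': pd.to_datetime(['2015-07-02', '2015-07-03', '2015-05-06']),
    'IsCanceled': [0, 0, 1],
})

# Restrict to reservations that were not canceled
stayed = toy[toy['IsCanceled'] == 0]

# Share of those rows where the status date equals the departure date
match_rate = (stayed['ReservationStatusDate'] == stayed['DepartureDate']).mean()
print(f'{match_rate:.0%}')
```

If the share is at or near 100% on the full data, the status date adds nothing beyond the departure date for completed stays.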

However, I did calculate the age of each canceled reservation at the time of cancellation and can use this information for further analysis of canceled reservations.

**Final Determination:** The feature `ReservationStatusDate` is not appropriate for feature engineering. Given its nearly exact correlation with the cancellation status of a reservation, it should be dropped from future datasets prior to modeling.

---

# Subset DataFrame to Focus on Engineered Dates

In [33]:
drop_cols = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth',
'StaysInWeekNights', 'StaysInWeekendNights', 'LeadTime', 'LeadTimeDelta']
df_data = df_data.drop(columns = drop_cols)
df_data.head()

Unnamed: 0,UUID,ReservationStatusDate,ArrivalDate,DepartureDate,BookingDate
0,8ca998d6-fae7-4ee4-a706-3765721aaff5,2015-07-01,2015-07-01,2015-07-01,2014-07-24
1,e535835e-b19a-4e32-9e9f-6d70a0182d4b,2015-07-01,2015-07-01,2015-07-01,2013-06-24
2,9429383d-0efd-4c37-bb9b-0aaa63d5aade,2015-07-02,2015-07-01,2015-07-02,2015-06-24
3,dd6424ee-6838-4007-ad85-de9ff96be14b,2015-07-02,2015-07-01,2015-07-02,2015-06-18
4,50ff56ee-6a72-40dc-8ff1-4246b831c779,2015-07-03,2015-07-01,2015-07-03,2015-06-17


# FE: Holidays

In [34]:
min_year = (df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']]
                 .min()
                 .min()
                 .year)
min_year

2013

In [35]:
max_year = (df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']]
                 .max()
                 .max()
                 .year)
max_year

2017

In [36]:
# Fetch holidays for the specific country and range of years (2013-2017)
country_code = 'PT'
years = list(range(min_year, max_year + 1))

# `country_holidays` is the current entry point; `CountryHoliday` is a deprecated alias
pt_holidays = holidays.country_holidays(country_code, years = years)

def holiday_past(date, holiday_dates):
    """Return the days since the most recent holiday on or before `date`."""

    # Convert Timestamp to datetime.date
    date = date.date()

    # Days elapsed since each holiday that falls on or before the date
    days_since = [(date - h_date).days for h_date in holiday_dates if h_date <= date]

    # The smallest gap belongs to the most recent holiday (None if there is none)
    return min(days_since, default=None)


def holiday_upcoming(date, holiday_dates):
    """Return the days until the next holiday strictly after `date`."""

    # Convert Timestamp to datetime.date
    date = date.date()

    # Days remaining until each holiday that falls after the date
    days_until = [(h_date - date).days for h_date in holiday_dates if h_date > date]

    # The smallest gap belongs to the next holiday (None if there is none)
    return min(days_until, default=None)


# Function to calculate the proximity to holidays for a list of dates
def calculate_holiday_proximity(dates, holiday_dates):
    days_after_recent_holiday = []
    days_before_next_holiday = []

    # `date` avoids shadowing the `dt` alias imported for the datetime module
    for date in dates:
        days_after_recent_holiday.append(holiday_past(date, holiday_dates))
        days_before_next_holiday.append(holiday_upcoming(date, holiday_dates))

    return days_after_recent_holiday, days_before_next_holiday
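
The functions above scan every holiday for every date. A possible faster variant, shown as a sketch rather than the notebook's method, sorts the holidays once and uses binary search. The name `holiday_proximity` and the short hand-picked holiday list are assumptions for illustration; in the notebook, the sorted keys of `pt_holidays` would play that role.

```python
from bisect import bisect_right
from datetime import date

def holiday_proximity(day, sorted_holidays):
    """Return (days since last holiday, days until next holiday) for one date."""
    # Index just past the last holiday that falls on or before `day`
    i = bisect_right(sorted_holidays, day)
    # Holiday at i - 1 is the most recent one (0 days if `day` is itself a holiday)
    days_after = (day - sorted_holidays[i - 1]).days if i > 0 else None
    # Holiday at i is the first one strictly after `day`
    days_before = (sorted_holidays[i] - day).days if i < len(sorted_holidays) else None
    return days_after, days_before

# Hand-picked holiday dates for illustration only
example_holidays = sorted([date(2015, 6, 10), date(2015, 8, 15), date(2015, 12, 25)])
print(holiday_proximity(date(2015, 7, 1), example_holidays))  # (21, 45)
```

The comparison semantics match the functions above: a date that is itself a holiday counts as zero days after a holiday, and the next holiday is always strictly later.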

In [37]:
# Apply the function to each date column in the dataframe
for column in ['ArrivalDate', 'DepartureDate', 'BookingDate']:
    after, before = calculate_holiday_proximity(df_data[column], pt_holidays)
    df_data[f'{column}_DaysBeforeHoliday'] = before
    df_data[f'{column}_DaysAfterHoliday'] = after

df_data

Unnamed: 0,UUID,ReservationStatusDate,ArrivalDate,DepartureDate,BookingDate,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday
0,8ca998d6-fae7-4ee4-a706-3765721aaff5,2015-07-01,2015-07-01,2015-07-01,2014-07-24,45,21,45,21,22,44
1,e535835e-b19a-4e32-9e9f-6d70a0182d4b,2015-07-01,2015-07-01,2015-07-01,2013-06-24,45,21,45,21,52,14
2,9429383d-0efd-4c37-bb9b-0aaa63d5aade,2015-07-02,2015-07-01,2015-07-02,2015-06-24,45,21,44,22,52,14
3,dd6424ee-6838-4007-ad85-de9ff96be14b,2015-07-02,2015-07-01,2015-07-02,2015-06-18,45,21,44,22,58,8
4,50ff56ee-6a72-40dc-8ff1-4246b831c779,2015-07-03,2015-07-01,2015-07-03,2015-06-17,45,21,43,23,59,7
...,...,...,...,...,...,...,...,...,...,...,...
119385,2ccdf728-5829-47d6-ae2c-016f9706c24d,2017-09-06,2017-08-30,2017-09-06,2017-08-07,36,15,29,22,8,53
119386,be937240-f461-4eee-9971-a83c8c09bf04,2017-09-07,2017-08-31,2017-09-07,2017-05-21,35,16,28,23,20,20
119387,04e0baed-c9a6-487c-a1af-82cb0f830fe4,2017-09-07,2017-08-31,2017-09-07,2017-07-28,35,16,28,23,18,43
119388,4d96b250-c5c4-46e5-bf52-e4334991d81b,2017-09-07,2017-08-31,2017-09-07,2017-05-14,35,16,28,23,27,13


# FE: ISO Day of Week, ISO Week of Year

In [38]:
df_data['ArrivalDate'].dt.dayofweek.head()

0    2
1    2
2    2
3    2
4    2
Name: ArrivalDate, dtype: int32

In [39]:
df_data['ArrivalDate'].dt.isocalendar().head()

Unnamed: 0,year,week,day
0,2015,27,3
1,2015,27,3
2,2015,27,3
3,2015,27,3
4,2015,27,3
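
The two outputs above differ by one because `dt.dayofweek` counts from Monday=0, while the ISO calendar counts from Monday=1. The same conventions exist in the standard library, which makes for a quick sanity check. The second date is an added example showing that the ISO year can differ from the calendar year around New Year.

```python
from datetime import date

d = date(2015, 7, 1)             # a Wednesday, the first arrival date in the data
print(d.weekday())               # 2 (Monday=0 convention)
print(d.isoweekday())            # 3 (ISO convention, Monday=1)
print(tuple(d.isocalendar()))    # (2015, 27, 3): ISO year, ISO week, ISO weekday

# Near New Year, a date can belong to the last ISO week of the previous year
edge = date(2016, 1, 1)
print(tuple(edge.isocalendar())) # (2015, 53, 5)
```

Because of this edge case, an ISO week number is only unambiguous when paired with the ISO year rather than the calendar year.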


In [40]:
for column in ['ArrivalDate', 'DepartureDate', 'BookingDate']:
    # Compute the ISO calendar once per column and reuse it
    iso_calendar = df_data[column].dt.isocalendar()
    df_data[f'{column}_WeekNumber'] = iso_calendar['week']
    df_data[f'{column}_DayOfWeek'] = iso_calendar['day']
    
df_data.head()

Unnamed: 0,UUID,ReservationStatusDate,ArrivalDate,DepartureDate,BookingDate,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,ArrivalDate_WeekNumber,ArrivalDate_DayOfWeek,DepartureDate_WeekNumber,DepartureDate_DayOfWeek,BookingDate_WeekNumber,BookingDate_DayOfWeek
0,8ca998d6-fae7-4ee4-a706-3765721aaff5,2015-07-01,2015-07-01,2015-07-01,2014-07-24,45,21,45,21,22,44,27,3,27,3,30,4
1,e535835e-b19a-4e32-9e9f-6d70a0182d4b,2015-07-01,2015-07-01,2015-07-01,2013-06-24,45,21,45,21,52,14,27,3,27,3,26,1
2,9429383d-0efd-4c37-bb9b-0aaa63d5aade,2015-07-02,2015-07-01,2015-07-02,2015-06-24,45,21,44,22,52,14,27,3,27,4,26,3
3,dd6424ee-6838-4007-ad85-de9ff96be14b,2015-07-02,2015-07-01,2015-07-02,2015-06-18,45,21,44,22,58,8,27,3,27,4,25,4
4,50ff56ee-6a72-40dc-8ff1-4246b831c779,2015-07-03,2015-07-01,2015-07-03,2015-06-17,45,21,43,23,59,7,27,3,27,5,25,3


# Saving Results

In [41]:
df_data.to_parquet('../data/engineered_data_dates.parquet', compression = 'snappy')