# **Feature Engineering: Datetime Features**

---

**Creating Datetime Features**

The source data lacks specific datetime features, but it does provide key components like arrival year, month, day, the number of weekday and weekend nights, and booking lead time. Using these, I will create new datetime features, including arrival, departure, and booking dates.
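As a minimal sketch of that plan, the toy frame below rebuilds the three dates for two bookings. The frame name `toy` is hypothetical, the values are copied from the first two rows of the sorted data shown later in this notebook, and the explicit `format` string assumes English month names under the active locale.

```python
import pandas as pd

# Two illustrative bookings (values taken from rows shown later in this notebook)
toy = pd.DataFrame({
    'ArrivalDateYear': [2015, 2015],
    'ArrivalDateMonth': ['July', 'July'],
    'ArrivalDateDayOfMonth': [1, 1],
    'StaysInWeekendNights': [0, 0],
    'StaysInWeekNights': [0, 2],
    'LeadTime': [342, 257],
})

# Arrival date: join year, month name, and day, then parse with an explicit format
arrival_text = (toy['ArrivalDateYear'].astype(str)
                + '-' + toy['ArrivalDateMonth']
                + '-' + toy['ArrivalDateDayOfMonth'].astype(str))
toy['ArrivalDate'] = pd.to_datetime(arrival_text, format='%Y-%B-%d')

# Departure date: arrival plus the total number of nights
nights = toy['StaysInWeekendNights'] + toy['StaysInWeekNights']
toy['DepartureDate'] = toy['ArrivalDate'] + pd.to_timedelta(nights, unit='D')

# Booking date: arrival minus the lead time in days
toy['BookingDate'] = toy['ArrivalDate'] - pd.to_timedelta(toy['LeadTime'], unit='D')

print(toy[['ArrivalDate', 'DepartureDate', 'BookingDate']])
```

Passing an explicit `format` avoids relying on pandas to infer how the month name should be parsed.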

---

**Extracting Temporal Details**

From these dates, I can derive additional temporal features:
- Days since the last holiday and until the next.
- Week of the year, day of the week, etc., to capture more temporal patterns.
- Days between the last reservation change and the arrival date (for reservations changed on or before arrival).
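
A rough sketch of the calendar-based items in this list, using a small hypothetical frame named `sample`. The new column names `ArrivalDate_WeekOfYear` and `DaysFromStatusChangeToArrival` are assumptions made for illustration, not columns created elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical sample: the third row has a status change after arrival
sample = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2015-07-01', '2017-08-31', '2015-07-01']),
    'ReservationStatusDate': pd.to_datetime(['2015-07-01', '2017-08-14', '2015-07-03']),
})

# ISO calendar fields: week of year and day of week (Monday = 1 ... Sunday = 7)
iso = sample['ArrivalDate'].dt.isocalendar()
sample['ArrivalDate_WeekOfYear'] = iso['week'].astype('int64')
sample['ArrivalDate_DayOfWeek'] = iso['day'].astype('int64')

# Days from the last status change to arrival, kept only when the change
# happened on or before arrival (later changes become NaN)
days_to_arrival = (sample['ArrivalDate'] - sample['ReservationStatusDate']).dt.days
sample['DaysFromStatusChangeToArrival'] = days_to_arrival.where(days_to_arrival >= 0)

print(sample)
```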

---

**Final Considerations**

This process will generate many new features, and the added dimensionality could hurt modeling performance. Before modeling, I may need to apply feature selection to keep only the most relevant features. By the end of this notebook, I will have a new set of temporal features ready for more extensive modeling and forecasting.

---

In [None]:
import datetime as dt
import holidays
import numpy as np
import pandas as pd

In [3]:
path = '../../data/source/full_data.parquet'
df_data = pd.read_parquet(path)
df_data

Unnamed: 0,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,...,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber
0,342,2015,July,27,1,0,0,2,0.0,0,...,,,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1
1,737,2015,July,27,1,0,0,2,0.0,0,...,,,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1
2,7,2015,July,27,1,0,1,1,0.0,0,...,,,0,Transient,75.00,0,0,Check-Out,2015-07-02,H1
3,13,2015,July,27,1,0,1,1,0.0,0,...,304,,0,Transient,75.00,0,0,Check-Out,2015-07-02,H1
4,14,2015,July,27,1,0,2,2,0.0,0,...,240,,0,Transient,98.00,0,1,Check-Out,2015-07-03,H1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,23,2017,August,35,30,2,5,2,0.0,0,...,394,,0,Transient,96.14,0,0,Check-Out,2017-09-06,H2
119386,102,2017,August,35,31,2,5,3,0.0,0,...,9,,0,Transient,225.43,0,2,Check-Out,2017-09-07,H2
119387,34,2017,August,35,31,2,5,2,0.0,0,...,9,,0,Transient,157.71,0,4,Check-Out,2017-09-07,H2
119388,109,2017,August,35,31,2,5,2,0.0,0,...,89,,0,Transient,104.40,0,0,Check-Out,2017-09-07,H2


In [4]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 31 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   LeadTime                     119390 non-null  int64  
 1   ArrivalDateYear              119390 non-null  int64  
 2   ArrivalDateMonth             119390 non-null  object 
 3   ArrivalDateWeekNumber        119390 non-null  int64  
 4   ArrivalDateDayOfMonth        119390 non-null  int64  
 5   StaysInWeekendNights         119390 non-null  int64  
 6   StaysInWeekNights            119390 non-null  int64  
 7   Adults                       119390 non-null  int64  
 8   Children                     119386 non-null  float64
 9   Babies                       119390 non-null  int64  
 10  Meal                         119390 non-null  object 
 11  Country                      118902 non-null  object 
 12  MarketSegment                119390 non-null  object 
 13 

## Convert ReservationStatusDate to Datetime Format

In [5]:
df_data['ReservationStatusDate'] = pd.to_datetime(df_data['ReservationStatusDate'], yearfirst = True)
df_data['ReservationStatusDate']

0        2015-07-01
1        2015-07-01
2        2015-07-02
3        2015-07-02
4        2015-07-03
            ...    
119385   2017-09-06
119386   2017-09-07
119387   2017-09-07
119388   2017-09-07
119389   2017-09-07
Name: ReservationStatusDate, Length: 119390, dtype: datetime64[ns]

# Feature Engineering: Arrival, Departure, and Booking Dates

## Arrival Date

In [6]:
## Create new column of strings formatted as YYYY-MonthName-D (e.g. 2015-July-1), then convert to datetime

arrival_details = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth']

df_data[arrival_details] = df_data[arrival_details].astype(str)

df_data['ArrivalDate'] = (df_data['ArrivalDateYear']
                          .str.cat(df_data[['ArrivalDateMonth',
                                            'ArrivalDateDayOfMonth']],
                                   '-')
                          )

df_data['ArrivalDate'] = pd.to_datetime(df_data['ArrivalDate'], yearfirst = True)

df_data = df_data.sort_values(by = 'ArrivalDate', ignore_index = False)

df_data

Unnamed: 0,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,ArrivalDate
0,342,2015,July,27,1,0,0,2,0.0,0,...,,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1,2015-07-01
75559,257,2015,July,27,1,0,2,1,0.0,0,...,,0,Transient,80.00,0,0,Check-Out,2015-07-03,H2,2015-07-01
75560,257,2015,July,27,1,0,2,2,0.0,0,...,,0,Transient,101.50,0,0,Check-Out,2015-07-03,H2,2015-07-01
75561,257,2015,July,27,1,0,2,2,0.0,0,...,,0,Transient,101.50,0,0,Check-Out,2015-07-03,H2,2015-07-01
75562,257,2015,July,27,1,0,2,2,0.0,0,...,,0,Transient,101.50,0,0,Check-Out,2015-07-03,H2,2015-07-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,108,2017,August,35,31,2,5,2,0.0,0,...,,0,Transient,207.03,0,1,Check-Out,2017-09-07,H1,2017-08-31
40040,194,2017,August,35,31,2,5,2,1.0,0,...,,0,Transient,312.29,1,1,Check-Out,2017-09-07,H1,2017-08-31
13794,17,2017,August,35,31,0,3,2,0.0,0,...,,0,Transient,207.00,0,2,Canceled,2017-08-14,H1,2017-08-31
40038,191,2017,August,35,31,2,5,2,0.0,0,...,,0,Contract,114.80,0,0,Check-Out,2017-09-07,H1,2017-08-31


In [6]:
# ## Drop features post-conversion
# df_data = (df_data
#            .drop(columns = ['ArrivalDateMonth',
#                             'ArrivalDateWeekNumber',
#                             'ArrivalDateDayOfMonth']))
# df_data

## Departure Date

In [7]:
## Convert the number of nights stayed to timedeltas,
## then use them to calculate the departure date and stay length

timedelta_wknd = pd.to_timedelta(
                    df_data.loc[:, 'StaysInWeekendNights'],
                    unit = 'D')
timedelta_wk = pd.to_timedelta(
                    df_data.loc[:, 'StaysInWeekNights'],
                    unit = 'D')

df_data['DepartureDate'] = (df_data.loc[:, 'ArrivalDate'] 
                            + timedelta_wk 
                            + timedelta_wknd)

df_data['Length of Stay'] = df_data['StaysInWeekendNights'] + df_data['StaysInWeekNights']

df_data = df_data.drop(columns = ['StaysInWeekendNights', 'StaysInWeekNights'])

df_data

Unnamed: 0,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,Adults,Children,Babies,Meal,Country,...,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,ArrivalDate,DepartureDate,Length of Stay
0,342,2015,July,27,1,2,0.0,0,BB,PRT,...,Transient,0.00,0,0,Check-Out,2015-07-01,H1,2015-07-01,2015-07-01,0
75559,257,2015,July,27,1,1,0.0,0,HB,PRT,...,Transient,80.00,0,0,Check-Out,2015-07-03,H2,2015-07-01,2015-07-03,2
75560,257,2015,July,27,1,2,0.0,0,HB,PRT,...,Transient,101.50,0,0,Check-Out,2015-07-03,H2,2015-07-01,2015-07-03,2
75561,257,2015,July,27,1,2,0.0,0,HB,PRT,...,Transient,101.50,0,0,Check-Out,2015-07-03,H2,2015-07-01,2015-07-03,2
75562,257,2015,July,27,1,2,0.0,0,HB,PRT,...,Transient,101.50,0,0,Check-Out,2015-07-03,H2,2015-07-01,2015-07-03,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,108,2017,August,35,31,2,0.0,0,HB,GBR,...,Transient,207.03,0,1,Check-Out,2017-09-07,H1,2017-08-31,2017-09-07,7
40040,194,2017,August,35,31,2,1.0,0,HB,ITA,...,Transient,312.29,1,1,Check-Out,2017-09-07,H1,2017-08-31,2017-09-07,7
13794,17,2017,August,35,31,2,0.0,0,HB,ESP,...,Transient,207.00,0,2,Canceled,2017-08-14,H1,2017-08-31,2017-09-03,3
40038,191,2017,August,35,31,2,0.0,0,HB,GBR,...,Contract,114.80,0,0,Check-Out,2017-09-07,H1,2017-08-31,2017-09-07,7


## `BookingDate` from `LeadTime`

In [8]:
## Convert to TimeDelta
df_data['LeadTime'] = pd.to_timedelta(df_data['LeadTime'], unit = 'D')

## Subtract LeadTime from ArrivalDate to calculate BookingDate
df_data['BookingDate'] = df_data['ArrivalDate'] - df_data['LeadTime']
df_data['BookingDate']

# df_data = df_data.drop(columns = 'LeadTime')

df_data

Unnamed: 0,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,Adults,Children,Babies,Meal,Country,...,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,ArrivalDate,DepartureDate,Length of Stay,BookingDate
0,342 days,2015,July,27,1,2,0.0,0,BB,PRT,...,0.00,0,0,Check-Out,2015-07-01,H1,2015-07-01,2015-07-01,0,2014-07-24
75559,257 days,2015,July,27,1,1,0.0,0,HB,PRT,...,80.00,0,0,Check-Out,2015-07-03,H2,2015-07-01,2015-07-03,2,2014-10-17
75560,257 days,2015,July,27,1,2,0.0,0,HB,PRT,...,101.50,0,0,Check-Out,2015-07-03,H2,2015-07-01,2015-07-03,2,2014-10-17
75561,257 days,2015,July,27,1,2,0.0,0,HB,PRT,...,101.50,0,0,Check-Out,2015-07-03,H2,2015-07-01,2015-07-03,2,2014-10-17
75562,257 days,2015,July,27,1,2,0.0,0,HB,PRT,...,101.50,0,0,Check-Out,2015-07-03,H2,2015-07-01,2015-07-03,2,2014-10-17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,108 days,2017,August,35,31,2,0.0,0,HB,GBR,...,207.03,0,1,Check-Out,2017-09-07,H1,2017-08-31,2017-09-07,7,2017-05-15
40040,194 days,2017,August,35,31,2,1.0,0,HB,ITA,...,312.29,1,1,Check-Out,2017-09-07,H1,2017-08-31,2017-09-07,7,2017-02-18
13794,17 days,2017,August,35,31,2,0.0,0,HB,ESP,...,207.00,0,2,Canceled,2017-08-14,H1,2017-08-31,2017-09-03,3,2017-08-14
40038,191 days,2017,August,35,31,2,0.0,0,HB,GBR,...,114.80,0,0,Check-Out,2017-09-07,H1,2017-08-31,2017-09-07,7,2017-02-21


# Feature Inspection: `ReservationStatusDate`

The final date-like data from the original source is the `ReservationStatusDate`, indicating the date on which the reservation was changed last.

As it is a date feature, it may be useful for feature engineering. However, based on my domain knowledge, I suspect that any of these dates that fall before the arrival date indicate a cancelled reservation. Used directly, the feature would leak the cancellation outcome into the model, but I may be able to generate a different feature from this information.

I will start by identifying the reservations whose `ReservationStatusDate` is earlier than the arrival date. Then, I will take the `IsCanceled` feature from the dataset and filter for those reservations. Finally, I will calculate the share of those reservations that were cancelled. If that share is very high (90-100%), I will consider how to generate a new feature from this data.
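
The check described above can be sketched on a few rows. The frame `bookings` is hypothetical and mostly made up for illustration, and the cancellation flag is derived here from `ReservationStatus == 'Canceled'`, which is an assumption standing in for the `IsCanceled` feature.

```python
import pandas as pd

# Hypothetical bookings: two last changed before arrival, two changed after
bookings = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(
        ['2017-08-31', '2017-08-31', '2015-07-01', '2015-07-01']),
    'ReservationStatusDate': pd.to_datetime(
        ['2017-08-14', '2017-09-07', '2015-06-20', '2015-07-03']),
    'ReservationStatus': ['Canceled', 'Check-Out', 'Canceled', 'Check-Out'],
})

# Reservations whose last status change came before the arrival date
changed_before_arrival = bookings['ReservationStatusDate'] < bookings['ArrivalDate']

# Assumed stand-in for IsCanceled
is_canceled = bookings['ReservationStatus'].eq('Canceled')

# Cancellation share overall versus among early status changes
overall_cancel_rate = is_canceled.mean()
cancel_rate_changed_early = is_canceled[changed_before_arrival].mean()

print(f'Overall cancellation share: {overall_cancel_rate:.0%}')
print(f'Cancellation share when changed before arrival: {cancel_rate_changed_early:.0%}')
```

Comparing the two shares, rather than looking at only one of them, shows how much the early status change says beyond the base cancellation rate.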



In [9]:
## Review data prior to changes
df_data['ReservationStatusDate'].head(10)

0       2015-07-01
75559   2015-07-03
75560   2015-07-03
75561   2015-07-03
75562   2015-07-03
75563   2015-07-03
75564   2015-07-03
75565   2015-07-03
75566   2015-07-03
75558   2015-07-03
Name: ReservationStatusDate, dtype: datetime64[ns]

## Compare with `ReservationStatus` Data

In [10]:
# ## Identify reservations last changed before arrival
# change_filter = (df_data['ReservationStatusDate'] < df_data['ArrivalDate'])

# ## Calculate share of reservations last changed before arrival
# avg_resstatdate_before_arrival = change_filter.mean()

# ## Calculate share of canceled reservations (the status column holds strings)
# avg_cxl = (df_data['ReservationStatus'] == 'Canceled').mean()

# print(f'The overall share of canceled reservations is: {avg_cxl:.2%}\n')

# print(' '.join(['The share of reservations with a ReservationStatusDate',
#              f'prior to the arrival date is: {avg_resstatdate_before_arrival:.2%}\n']))

# ## Print advice based on results
# if avg_cxl >= .9:
#     print(' '.join(['The `ReservationStatusDate` feature is too strongly indicative of the `ReservationStatus` feature.',
#           'It should not be used for modeling.']))
# elif .25 <= avg_cxl < .9:
#     print(' '.join(['This feature is related to the `ReservationStatus` feature.',
#           'Make sure to review it in more detail to determine whether to use it.']))
# else:
#     print('The `ReservationStatusDate` feature is unlikely to be predictive of the `ReservationStatus` feature.')

### EDA Questions

In [11]:
# ## What is the breakdown of reservation statuses for those reservations with matching Arrival and Status Dates?
# ## (A.K.A. "same-day departures" or "day-use reservations.")

# sameday_status = (df_data['ReservationStatusDate'] == df_data['ArrivalDate'])

# (df_data[sameday_status]
#  .value_counts(subset = 'ReservationStatus',normalize = True)
#  .round(2))

In [12]:
# ## What is the breakdown of IsCanceled statuses
# ## for those reservations with matching Arrival and Status Dates?

# (df_data[sameday_status]
#  .value_counts(subset = 'IsCanceled',normalize = True)
#  .round(2))

In [13]:
# ## What is the average rate for these day-use/same-day-departure reservations?

# sameday_adr = (sameday_status & (df_data['ReservationStatus'] == 'Check-Out'))

# sameday_adr_median = df_data[sameday_adr]['ADR'].median()

# print(f'The median ADR for same-day reservations is: ${sameday_adr_median:.2f}')

# sameday_adr_gt_zero = (df_data[sameday_adr]['ADR'] > 0).mean().round(2)

# print(f'The number of same-day reservations with an ADR greater than zero is: {sameday_adr_gt_zero:.1%}')

In [14]:
# sameday_departure = (df_data['ReservationStatusDate'] == df_data['DepartureDate'])
# df_data[sameday_departure].value_counts(subset = 'IsCanceled', normalize =True).round(4)

In [15]:
# df_data[sameday_departure].value_counts(subset = 'ReservationStatus', normalize =True).round(4)

In [16]:
# df_data[(sameday_departure & (df_data['ReservationStatus'] != 'Check-Out'))]

## ReservationStatusDate Later Than Arrival Date

In [17]:
# after_arrival_filter = (df_data['ReservationStatusDate'] > df_data['ArrivalDate'])

# avg_resstatdate_after_arrival = (after_arrival_filter.mean())

# print(f'The share of reservations changed after arrival is: {avg_resstatdate_after_arrival:.0%}.')

## Inspection Results: `ReservationStatusDate`

---

Based on the share of reservations last changed before arrival, it is clear that this feature will be almost exactly indicative of whether a reservation was canceled prior to arrival.

Additionally, the number of days between the `ArrivalDate` and `ReservationStatusDate` features matches the length of stay, which already exists as a feature.

However, I did calculate the age of each reservation and can use this information for further analysis of cancelled reservations.

**Final Determination:** The feature `ReservationStatusDate` is not appropriate for feature engineering, and considering its nearly exact correlation with the cancellation status of a reservation, it should be dropped from future datasets prior to modeling.

---

In [18]:
df_data = df_data.drop(columns = 'ReservationStatusDate')
df_data

Unnamed: 0,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,Adults,Children,Babies,Meal,Country,...,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,HotelNumber,ArrivalDate,DepartureDate,Length of Stay,BookingDate
0,342 days,2015,July,27,1,2,0.0,0,BB,PRT,...,Transient,0.00,0,0,Check-Out,H1,2015-07-01,2015-07-01,0,2014-07-24
75559,257 days,2015,July,27,1,1,0.0,0,HB,PRT,...,Transient,80.00,0,0,Check-Out,H2,2015-07-01,2015-07-03,2,2014-10-17
75560,257 days,2015,July,27,1,2,0.0,0,HB,PRT,...,Transient,101.50,0,0,Check-Out,H2,2015-07-01,2015-07-03,2,2014-10-17
75561,257 days,2015,July,27,1,2,0.0,0,HB,PRT,...,Transient,101.50,0,0,Check-Out,H2,2015-07-01,2015-07-03,2,2014-10-17
75562,257 days,2015,July,27,1,2,0.0,0,HB,PRT,...,Transient,101.50,0,0,Check-Out,H2,2015-07-01,2015-07-03,2,2014-10-17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,108 days,2017,August,35,31,2,0.0,0,HB,GBR,...,Transient,207.03,0,1,Check-Out,H1,2017-08-31,2017-09-07,7,2017-05-15
40040,194 days,2017,August,35,31,2,1.0,0,HB,ITA,...,Transient,312.29,1,1,Check-Out,H1,2017-08-31,2017-09-07,7,2017-02-18
13794,17 days,2017,August,35,31,2,0.0,0,HB,ESP,...,Transient,207.00,0,2,Canceled,H1,2017-08-31,2017-09-03,3,2017-08-14
40038,191 days,2017,August,35,31,2,0.0,0,HB,GBR,...,Contract,114.80,0,0,Check-Out,H1,2017-08-31,2017-09-07,7,2017-02-21


# Feature Engineering: Holidays

In [19]:
min_year = (df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']]
                 .min()
                 .min()
                 .year)

max_year = (df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']]
                 .max()
                 .max()
                 .year)

min_year, max_year

(2013, 2017)

In [22]:
# Fetch holidays for the specific country and range of years (2013-2017)
country_code = 'PT'
years = list(range(min_year, max_year + 1))

# country_holidays supersedes the deprecated CountryHoliday entry point
pt_holidays = holidays.country_holidays(country_code, years = years)

def holiday_past(date, holiday_dates):
    """Return days since the most recent holiday on or before `date` (0 on a holiday)."""

    # Convert Timestamp to datetime.date
    date = date.date()

    # Offsets to every holiday on or before the date; the smallest is the closest
    past_holidays = [(date - h_date).days for h_date in holiday_dates if h_date <= date]

    # None when no holiday in the calendar falls on or before the date
    return min(past_holidays, default=None)


def holiday_upcoming(date, holiday_dates):
    """Return days until the next holiday strictly after `date`."""

    # Convert Timestamp to datetime.date
    date = date.date()

    # Offsets to every holiday after the date; the smallest is the closest
    future_holidays = [(h_date - date).days for h_date in holiday_dates if h_date > date]

    # None when no holiday in the calendar falls after the date
    return min(future_holidays, default=None)


# Function to calculate the proximity to holidays for a list of dates
def calculate_holiday_proximity(dates, holiday_dates):
    days_after_recent_holiday = []
    days_before_next_holiday = []

    # Named `date` (not `dt`) so the `datetime as dt` alias is not shadowed here
    for date in dates:

        days_after_recent_holiday.append(holiday_past(date, holiday_dates))
        days_before_next_holiday.append(holiday_upcoming(date, holiday_dates))

    return days_after_recent_holiday, days_before_next_holiday

In [23]:
# Apply the function to each date column in the dataframe
for column in ['ArrivalDate', 'DepartureDate', 'BookingDate']:
    after, before = calculate_holiday_proximity(df_data[column], pt_holidays)
    df_data[f'{column}_DaysBeforeHoliday'] = before
    df_data[f'{column}_DaysAfterHoliday'] = after

df_data

Unnamed: 0,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,Adults,Children,Babies,Meal,Country,...,ArrivalDate,DepartureDate,Length of Stay,BookingDate,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday
0,342 days,2015,July,27,1,2,0.0,0,BB,PRT,...,2015-07-01,2015-07-01,0,2014-07-24,45,21,45,21,22,44
75559,257 days,2015,July,27,1,1,0.0,0,HB,PRT,...,2015-07-01,2015-07-03,2,2014-10-17,45,21,43,23,52,63
75560,257 days,2015,July,27,1,2,0.0,0,HB,PRT,...,2015-07-01,2015-07-03,2,2014-10-17,45,21,43,23,52,63
75561,257 days,2015,July,27,1,2,0.0,0,HB,PRT,...,2015-07-01,2015-07-03,2,2014-10-17,45,21,43,23,52,63
75562,257 days,2015,July,27,1,2,0.0,0,HB,PRT,...,2015-07-01,2015-07-03,2,2014-10-17,45,21,43,23,52,63
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,108 days,2017,August,35,31,2,0.0,0,HB,GBR,...,2017-08-31,2017-09-07,7,2017-05-15,35,16,28,23,26,14
40040,194 days,2017,August,35,31,2,1.0,0,HB,ITA,...,2017-08-31,2017-09-07,7,2017-02-18,35,16,28,23,55,48
13794,17 days,2017,August,35,31,2,0.0,0,HB,ESP,...,2017-08-31,2017-09-03,3,2017-08-14,35,16,32,19,1,60
40038,191 days,2017,August,35,31,2,0.0,0,HB,GBR,...,2017-08-31,2017-09-07,7,2017-02-21,35,16,28,23,52,51
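
The row-by-row loop compares every date against every holiday. A vectorized alternative is sketched below with `np.searchsorted`. The function name `holiday_proximity_vectorized` and its output column names are hypothetical, and the three holidays are a hand-typed subset of 2015 Portuguese holidays used only for illustration, not the full calendar from the `holidays` package. The sketch keeps the same convention as the functions above: a date that is itself a holiday counts as zero days after, and the next holiday is strictly later.

```python
import numpy as np
import pandas as pd

def holiday_proximity_vectorized(dates, holiday_dates):
    """Sketch: days since the last holiday and until the next, via binary search."""
    hol = np.sort(pd.to_datetime(list(holiday_dates)).values.astype('datetime64[D]'))
    d = dates.values.astype('datetime64[D]')

    # Index of the first holiday strictly after each date
    nxt = np.searchsorted(hol, d, side='right')
    # Index of the last holiday on or before each date (-1 if none)
    prev = nxt - 1

    has_prev = prev >= 0
    has_next = nxt < len(hol)

    # NaN marks dates with no holiday on that side of the calendar
    days_after = np.full(len(d), np.nan)
    days_before = np.full(len(d), np.nan)
    days_after[has_prev] = (d[has_prev] - hol[prev[has_prev]]).astype(int)
    days_before[has_next] = (hol[nxt[has_next]] - d[has_next]).astype(int)

    return pd.DataFrame({'DaysAfterHoliday': days_after,
                         'DaysBeforeHoliday': days_before},
                        index=dates.index)

# Illustrative subset of 2015 holidays: Portugal Day, Assumption, Immaculate Conception
example_holidays = ['2015-06-10', '2015-08-15', '2015-12-08']
example_dates = pd.Series(pd.to_datetime(
    ['2015-07-01', '2015-07-03', '2015-06-10', '2015-01-01', '2015-12-25']))

proximity = holiday_proximity_vectorized(example_dates, example_holidays)
print(proximity)
```

For the first two dates this gives 21 and 23 days after and 45 and 43 days before, the same values the loop produced for arrivals and departures in early July 2015.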


# Feature Engineering: Convert LeadTime to Numeric

In [24]:
df_data['LeadTime']

0        342 days
75559    257 days
75560    257 days
75561    257 days
75562    257 days
           ...   
40039    108 days
40040    194 days
13794     17 days
40038    191 days
117424     3 days
Name: LeadTime, Length: 119390, dtype: timedelta64[ns]

In [25]:
df_data['LeadTime'] = df_data['LeadTime'].dt.days
df_data['LeadTime']

0         342
75559     257
75560     257
75561     257
75562     257
         ... 
40039     108
40040     194
13794      17
40038     191
117424      3
Name: LeadTime, Length: 119390, dtype: int64

# Feature Engineering: ISO Day of Week

In [26]:
df_data['ArrivalDate'].dt.dayofweek.head()

0        2
75559    2
75560    2
75561    2
75562    2
Name: ArrivalDate, dtype: int32

In [27]:
arrival_isocal = (df_data['ArrivalDate']
                  .dt.isocalendar()[['day']]
                  .rename(columns = {'day': 'ArrivalDate_DayOfWeek'}))
arrival_isocal

Unnamed: 0,ArrivalDate_DayOfWeek
0,3
75559,3
75560,3
75561,3
75562,3
...,...
40039,4
40040,4
13794,4
40038,4
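
The two outputs above differ by one because `dt.dayofweek` counts from Monday = 0 while the ISO calendar counts from Monday = 1. A small sketch of the relationship, using two arrival dates from this dataset in a hypothetical series named `arrival`:

```python
import pandas as pd

arrival = pd.Series(pd.to_datetime(['2015-07-01', '2017-08-31']))

# Zero-based day of week: Monday = 0 ... Sunday = 6
zero_based = arrival.dt.dayofweek

# ISO day of week: Monday = 1 ... Sunday = 7
iso_day = arrival.dt.isocalendar()['day'].astype('int64')

print(pd.DataFrame({'dayofweek': zero_based, 'iso_day': iso_day}))
```

Either encoding carries the same information; what matters is using one convention consistently across the date columns.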


In [28]:
df_data = pd.concat([df_data, arrival_isocal], axis = 1)
df_data

Unnamed: 0,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,Adults,Children,Babies,Meal,Country,...,DepartureDate,Length of Stay,BookingDate,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,ArrivalDate_DayOfWeek
0,342,2015,July,27,1,2,0.0,0,BB,PRT,...,2015-07-01,0,2014-07-24,45,21,45,21,22,44,3
75559,257,2015,July,27,1,1,0.0,0,HB,PRT,...,2015-07-03,2,2014-10-17,45,21,43,23,52,63,3
75560,257,2015,July,27,1,2,0.0,0,HB,PRT,...,2015-07-03,2,2014-10-17,45,21,43,23,52,63,3
75561,257,2015,July,27,1,2,0.0,0,HB,PRT,...,2015-07-03,2,2014-10-17,45,21,43,23,52,63,3
75562,257,2015,July,27,1,2,0.0,0,HB,PRT,...,2015-07-03,2,2014-10-17,45,21,43,23,52,63,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,108,2017,August,35,31,2,0.0,0,HB,GBR,...,2017-09-07,7,2017-05-15,35,16,28,23,26,14,4
40040,194,2017,August,35,31,2,1.0,0,HB,ITA,...,2017-09-07,7,2017-02-18,35,16,28,23,55,48,4
13794,17,2017,August,35,31,2,0.0,0,HB,ESP,...,2017-09-03,3,2017-08-14,35,16,32,19,1,60,4
40038,191,2017,August,35,31,2,0.0,0,HB,GBR,...,2017-09-07,7,2017-02-21,35,16,28,23,52,51,4


# Final Preparations

In [29]:
df_data = df_data.reset_index(drop = True)
df_data

Unnamed: 0,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,Adults,Children,Babies,Meal,Country,...,DepartureDate,Length of Stay,BookingDate,ArrivalDate_DaysBeforeHoliday,ArrivalDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,ArrivalDate_DayOfWeek
0,342,2015,July,27,1,2,0.0,0,BB,PRT,...,2015-07-01,0,2014-07-24,45,21,45,21,22,44,3
1,257,2015,July,27,1,1,0.0,0,HB,PRT,...,2015-07-03,2,2014-10-17,45,21,43,23,52,63,3
2,257,2015,July,27,1,2,0.0,0,HB,PRT,...,2015-07-03,2,2014-10-17,45,21,43,23,52,63,3
3,257,2015,July,27,1,2,0.0,0,HB,PRT,...,2015-07-03,2,2014-10-17,45,21,43,23,52,63,3
4,257,2015,July,27,1,2,0.0,0,HB,PRT,...,2015-07-03,2,2014-10-17,45,21,43,23,52,63,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,108,2017,August,35,31,2,0.0,0,HB,GBR,...,2017-09-07,7,2017-05-15,35,16,28,23,26,14,4
119386,194,2017,August,35,31,2,1.0,0,HB,ITA,...,2017-09-07,7,2017-02-18,35,16,28,23,55,48,4
119387,17,2017,August,35,31,2,0.0,0,HB,ESP,...,2017-09-03,3,2017-08-14,35,16,32,19,1,60,4
119388,191,2017,August,35,31,2,0.0,0,HB,GBR,...,2017-09-07,7,2017-02-21,35,16,28,23,52,51,4


# Final Inspection

---

I extracted booking, arrival, and departure dates, along with several temporal features derived from them. Although this approach has increased the number of features, I expect the additional data to improve the modeling process and provide deeper insight into reservation patterns, subject to the feature selection noted earlier.

---

In [30]:
df_data.to_parquet('../../data/3.1_temporally_updated_data.parquet', compression = 'zstd')