# Feature Engineering: Temporal Features

---

## Overview

This notebook creates comprehensive temporal features from hotel booking data, including basic date features, cyclical encodings, Fourier transformations, and polynomial trends to enhance predictive modeling capabilities.

---

## Features Created

**Core Date Features (3):**
- `ArrivalDate`: Constructed from year, month, and day components
- `DepartureDate`: Calculated using arrival date and length of stay
- `BookingDate`: Derived from arrival date minus lead time

**Basic Temporal Features (10):**
- Day of week, Month, Quarter, Week of year, Season
- Weekend indicators and proximity
- Holiday proximity (days before/after)
- Holiday indicators (boolean flags)

**Advanced Modeling Features (35+):**
- **Cyclical encodings**: Sin/cos transformations for day of week, month, day of year
- **Fourier features**: Multi-frequency periodic patterns (annual, semi-annual, quarterly, monthly, weekly), plus 2nd and 3rd annual harmonics and a 2nd monthly harmonic
- **Polynomial trends**: Linear, quadratic, and cubic time progression
- **Interaction features**: Time trends combined with seasonal patterns

---

## Data Quality & Methodology

- **Data validation**: Checks for missing values, invalid dates, and range anomalies
- **Data leakage prevention**: `ReservationStatusDate` excluded (almost perfectly correlated with the target)
- **Memory optimization**: Features converted to appropriate types (int8, int16, float32, bool)
- **Cyclical encoding rationale**: Preserves circular nature of time (e.g., December → January continuity)
- **Fourier transformations**: Capture multiple seasonal frequencies simultaneously

---

## Expected Outcome

By the end of this notebook, the dataset will include **60+ engineered temporal features** optimized for machine learning models, including tree-based algorithms (Random Forest, XGBoost) and neural networks.

In [1]:
import bisect
import datetime as dt
import holidays
import numpy as np
import pandas as pd

In [2]:
path = '../../data/raw/combined.parquet'
df_data = pd.read_parquet(path)
df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate
0,0,342,2015,July,27,1,0,0,2,0.0,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
1,0,737,2015,July,27,1,0,0,2,0.0,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
2,0,7,2015,July,27,1,0,1,1,0.0,...,No Deposit,,,0,Transient,75.00,0,0,Check-Out,2015-07-02
3,0,13,2015,July,27,1,0,1,1,0.0,...,No Deposit,304,,0,Transient,75.00,0,0,Check-Out,2015-07-02
4,0,14,2015,July,27,1,0,2,2,0.0,...,No Deposit,240,,0,Transient,98.00,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,0,23,2017,August,35,30,2,5,2,0.0,...,No Deposit,394,,0,Transient,96.14,0,0,Check-Out,2017-09-06
119386,0,102,2017,August,35,31,2,5,3,0.0,...,No Deposit,9,,0,Transient,225.43,0,2,Check-Out,2017-09-07
119387,0,34,2017,August,35,31,2,5,2,0.0,...,No Deposit,9,,0,Transient,157.71,0,4,Check-Out,2017-09-07
119388,0,109,2017,August,35,31,2,5,2,0.0,...,No Deposit,89,,0,Transient,104.40,0,0,Check-Out,2017-09-07


In [3]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 31 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   IsCanceled                   119390 non-null  int64  
 1   LeadTime                     119390 non-null  int64  
 2   ArrivalDateYear              119390 non-null  int64  
 3   ArrivalDateMonth             119390 non-null  object 
 4   ArrivalDateWeekNumber        119390 non-null  int64  
 5   ArrivalDateDayOfMonth        119390 non-null  int64  
 6   StaysInWeekendNights         119390 non-null  int64  
 7   StaysInWeekNights            119390 non-null  int64  
 8   Adults                       119390 non-null  int64  
 9   Children                     119386 non-null  float64
 10  Babies                       119390 non-null  int64  
 11  Meal                         119390 non-null  object 
 12  Country                      118902 non-null  object 
 13 

## Data Validation

Verify data quality before proceeding with feature engineering.

In [4]:
# Check for invalid date combinations (will be validated after datetime conversion)
# Create a test conversion to catch any invalid dates
try:
    test_dates = pd.to_datetime(
        df_data['ArrivalDateYear'].astype(str) + '-' + 
        df_data['ArrivalDateMonth'].astype(str) + '-' + 
        df_data['ArrivalDateDayOfMonth'].astype(str),
        yearfirst=True,
        errors='coerce'
    )
    invalid_dates = test_dates.isnull().sum()
    
    if invalid_dates > 0:
        print(f"WARNING: {invalid_dates} invalid date combinations found")
    else:
        print("✓ All date combinations are valid")
except Exception as e:
    print(f"ERROR during date validation: {e}")

✓ All date combinations are valid


In [5]:
# Validate data ranges
print("Data Range Validation:\n")

# Check for negative values
if (df_data['LeadTime'] < 0).any():
    print(f"WARNING: {(df_data['LeadTime'] < 0).sum()} negative lead times found")
else:
    print("✓ All lead times are non-negative")

if (df_data['StaysInWeekendNights'] < 0).any() or (df_data['StaysInWeekNights'] < 0).any():
    print("WARNING: Negative stay durations found")
else:
    print("✓ All stay durations are non-negative")
    
# Check for zero-night stays
zero_nights = ((df_data['StaysInWeekendNights'] == 0) & 
               (df_data['StaysInWeekNights'] == 0)).sum()
print(f"\nRecords with zero-night stays: {zero_nights:,} ({zero_nights/len(df_data)*100:.2f}%)")

# Display value ranges
print(f"\nLead Time range: {df_data['LeadTime'].min()} to {df_data['LeadTime'].max()} days")
print(f"Year range: {df_data['ArrivalDateYear'].min()} to {df_data['ArrivalDateYear'].max()}")

Data Range Validation:

✓ All lead times are non-negative
✓ All stay durations are non-negative

Records with zero-night stays: 715 (0.60%)

Lead Time range: 0 to 737 days
Year range: 2015 to 2017


In [6]:
# Check for missing values in critical date components
date_components = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth', 
                   'LeadTime', 'StaysInWeekendNights', 'StaysInWeekNights']

missing_values = df_data[date_components].isnull().sum()
if missing_values.any():
    print("WARNING: Missing values found in date components:")
    print(missing_values[missing_values > 0])
else:
    print("✓ No missing values in date components")
    
# Display the count
print(f"\nTotal records: {len(df_data):,}")

✓ No missing values in date components

Total records: 119,390


## Analysis: `ReservationStatusDate` Feature

The `ReservationStatusDate` column records the date on which a reservation's status was last updated. Initial analysis revealed that this feature is almost perfectly correlated with cancellation status: reservations whose status changed before the arrival date are almost always cancellations.

**Key Findings:**
- Reservations with status dates before arrival are nearly 100% cancellations
- This creates data leakage for predictive modeling
- The feature provides no additional value beyond the target variable

**Decision:** This feature will be dropped from the dataset to prevent data leakage in modeling.

In [7]:
# Drop ReservationStatusDate to prevent data leakage
df_data = df_data.drop(columns='ReservationStatusDate')

df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,BookingChanges,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus
0,0,342,2015,July,27,1,0,0,2,0.0,...,3,No Deposit,,,0,Transient,0.00,0,0,Check-Out
1,0,737,2015,July,27,1,0,0,2,0.0,...,4,No Deposit,,,0,Transient,0.00,0,0,Check-Out
2,0,7,2015,July,27,1,0,1,1,0.0,...,0,No Deposit,,,0,Transient,75.00,0,0,Check-Out
3,0,13,2015,July,27,1,0,1,1,0.0,...,0,No Deposit,304,,0,Transient,75.00,0,0,Check-Out
4,0,14,2015,July,27,1,0,2,2,0.0,...,0,No Deposit,240,,0,Transient,98.00,0,1,Check-Out
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,0,23,2017,August,35,30,2,5,2,0.0,...,0,No Deposit,394,,0,Transient,96.14,0,0,Check-Out
119386,0,102,2017,August,35,31,2,5,3,0.0,...,0,No Deposit,9,,0,Transient,225.43,0,2,Check-Out
119387,0,34,2017,August,35,31,2,5,2,0.0,...,0,No Deposit,9,,0,Transient,157.71,0,4,Check-Out
119388,0,109,2017,August,35,31,2,5,2,0.0,...,0,No Deposit,89,,0,Transient,104.40,0,0,Check-Out


# Feature Engineering: Core Date Features

Create fundamental date features from the source data components.
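As a sanity check of the date arithmetic used in this section, the sketch below derives all three core dates for a single booking using only the standard library. The helper name `core_dates` is hypothetical and not part of the notebook; the two sample records are copied from the data previews shown in this notebook.

```python
import datetime as dt

def core_dates(year, month_name, day, lead_time, weekend_nights, week_nights):
    """Return (arrival, departure, booking) dates for one booking record."""
    # %B parses a full English month name such as 'July' (assumes an English/C locale)
    arrival = dt.datetime.strptime(f"{year}-{month_name}-{day}", "%Y-%B-%d").date()
    # Departure = arrival plus the total number of nights stayed
    departure = arrival + dt.timedelta(days=weekend_nights + week_nights)
    # Booking = arrival minus the lead time in days
    booking = arrival - dt.timedelta(days=lead_time)
    return arrival, departure, booking

# Row 0: zero-night stay booked 342 days ahead
first = core_dates(2015, "July", 1, lead_time=342, weekend_nights=0, week_nights=0)
# Row 13794: three week nights booked 17 days ahead
canceled = core_dates(2017, "August", 31, lead_time=17, weekend_nights=0, week_nights=3)
print(first)
print(canceled)
```

The results should agree with the `ArrivalDate`, `DepartureDate`, and `BookingDate` values that the pandas cells produce for the same rows.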

## `ArrivalDate`

In [8]:
# Combine year, month, and day columns to create arrival date
arrival_details = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth']
df_data[arrival_details] = df_data[arrival_details].astype(str)

df_data['ArrivalDate'] = (
    df_data['ArrivalDateYear']
    .str.cat(df_data[['ArrivalDateMonth', 'ArrivalDateDayOfMonth']], sep='-')
)

# Convert to datetime and sort by arrival date
df_data['ArrivalDate'] = pd.to_datetime(df_data['ArrivalDate'], yearfirst=True)
df_data = df_data.sort_values(by='ArrivalDate', ignore_index=False)

df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ArrivalDate
0,0,342,2015,July,27,1,0,0,2,0.0,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
75559,0,257,2015,July,27,1,0,2,1,0.0,...,No Deposit,6,,0,Transient,80.00,0,0,Check-Out,2015-07-01
75560,0,257,2015,July,27,1,0,2,2,0.0,...,No Deposit,6,,0,Transient,101.50,0,0,Check-Out,2015-07-01
75561,0,257,2015,July,27,1,0,2,2,0.0,...,No Deposit,6,,0,Transient,101.50,0,0,Check-Out,2015-07-01
75562,0,257,2015,July,27,1,0,2,2,0.0,...,No Deposit,6,,0,Transient,101.50,0,0,Check-Out,2015-07-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,0,108,2017,August,35,31,2,5,2,0.0,...,No Deposit,241,,0,Transient,207.03,0,1,Check-Out,2017-08-31
40040,0,194,2017,August,35,31,2,5,2,1.0,...,No Deposit,240,,0,Transient,312.29,1,1,Check-Out,2017-08-31
13794,1,17,2017,August,35,31,0,3,2,0.0,...,No Deposit,240,,0,Transient,207.00,0,2,Canceled,2017-08-31
40038,0,191,2017,August,35,31,2,5,2,0.0,...,No Deposit,40,,0,Contract,114.80,0,0,Check-Out,2017-08-31


## `DepartureDate`

In [9]:
# Calculate departure date by adding weekend and weekday nights to arrival date
timedelta_weekend = pd.to_timedelta(df_data['StaysInWeekendNights'], unit='D')
timedelta_weekday = pd.to_timedelta(df_data['StaysInWeekNights'], unit='D')

df_data['DepartureDate'] = df_data['ArrivalDate'] + timedelta_weekday + timedelta_weekend

# Calculate total length of stay and drop individual night columns
df_data['Length of Stay'] = df_data['StaysInWeekendNights'] + df_data['StaysInWeekNights']
df_data = df_data.drop(columns=['StaysInWeekendNights', 'StaysInWeekNights'])

df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,Adults,Children,Babies,Meal,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ArrivalDate,DepartureDate,Length of Stay
0,0,342,2015,July,27,1,2,0.0,0,BB,...,,0,Transient,0.00,0,0,Check-Out,2015-07-01,2015-07-01,0
75559,0,257,2015,July,27,1,1,0.0,0,HB,...,,0,Transient,80.00,0,0,Check-Out,2015-07-01,2015-07-03,2
75560,0,257,2015,July,27,1,2,0.0,0,HB,...,,0,Transient,101.50,0,0,Check-Out,2015-07-01,2015-07-03,2
75561,0,257,2015,July,27,1,2,0.0,0,HB,...,,0,Transient,101.50,0,0,Check-Out,2015-07-01,2015-07-03,2
75562,0,257,2015,July,27,1,2,0.0,0,HB,...,,0,Transient,101.50,0,0,Check-Out,2015-07-01,2015-07-03,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,0,108,2017,August,35,31,2,0.0,0,HB,...,,0,Transient,207.03,0,1,Check-Out,2017-08-31,2017-09-07,7
40040,0,194,2017,August,35,31,2,1.0,0,HB,...,,0,Transient,312.29,1,1,Check-Out,2017-08-31,2017-09-07,7
13794,1,17,2017,August,35,31,2,0.0,0,HB,...,,0,Transient,207.00,0,2,Canceled,2017-08-31,2017-09-03,3
40038,0,191,2017,August,35,31,2,0.0,0,HB,...,,0,Contract,114.80,0,0,Check-Out,2017-08-31,2017-09-07,7


## `BookingDate` from `LeadTime`

In [10]:
# Calculate booking date by subtracting lead time from arrival date
df_data['LeadTime'] = pd.to_timedelta(df_data['LeadTime'], unit='D')
df_data['BookingDate'] = df_data['ArrivalDate'] - df_data['LeadTime']

df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,Adults,Children,Babies,Meal,...,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ArrivalDate,DepartureDate,Length of Stay,BookingDate
0,0,342 days,2015,July,27,1,2,0.0,0,BB,...,0,Transient,0.00,0,0,Check-Out,2015-07-01,2015-07-01,0,2014-07-24
75559,0,257 days,2015,July,27,1,1,0.0,0,HB,...,0,Transient,80.00,0,0,Check-Out,2015-07-01,2015-07-03,2,2014-10-17
75560,0,257 days,2015,July,27,1,2,0.0,0,HB,...,0,Transient,101.50,0,0,Check-Out,2015-07-01,2015-07-03,2,2014-10-17
75561,0,257 days,2015,July,27,1,2,0.0,0,HB,...,0,Transient,101.50,0,0,Check-Out,2015-07-01,2015-07-03,2,2014-10-17
75562,0,257 days,2015,July,27,1,2,0.0,0,HB,...,0,Transient,101.50,0,0,Check-Out,2015-07-01,2015-07-03,2,2014-10-17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,0,108 days,2017,August,35,31,2,0.0,0,HB,...,0,Transient,207.03,0,1,Check-Out,2017-08-31,2017-09-07,7,2017-05-15
40040,0,194 days,2017,August,35,31,2,1.0,0,HB,...,0,Transient,312.29,1,1,Check-Out,2017-08-31,2017-09-07,7,2017-02-18
13794,1,17 days,2017,August,35,31,2,0.0,0,HB,...,0,Transient,207.00,0,2,Canceled,2017-08-31,2017-09-03,3,2017-08-14
40038,0,191 days,2017,August,35,31,2,0.0,0,HB,...,0,Contract,114.80,0,0,Check-Out,2017-08-31,2017-09-07,7,2017-02-21


# Feature Engineering: Holiday Proximity Features

Portuguese holiday data will be used to calculate proximity features for each date column (Arrival, Departure, and Booking dates). These features capture how many days before or after a holiday each event occurs, which may be useful for predicting booking patterns and cancellations.
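To make the proximity lookup concrete, here is a toy sketch of the same `bisect` idea on a hand-written holiday list. The list `toy_holidays` and the helper `holiday_gaps` are illustrative assumptions, not the notebook's function, and the sketch returns `None` whenever no earlier or later holiday exists in the list.

```python
import bisect
import datetime as dt

# Hypothetical, hand-written holiday list for illustration only
toy_holidays = [dt.date(2016, 1, 1), dt.date(2016, 3, 25), dt.date(2016, 4, 25)]

def holiday_gaps(day, sorted_holidays):
    """Return (days since previous holiday, days until next holiday)."""
    # Index of the first holiday that is on or after `day`
    idx = bisect.bisect_left(sorted_holidays, day)
    on_holiday = idx < len(sorted_holidays) and sorted_holidays[idx] == day
    prev_idx = idx - 1
    # When `day` is itself a holiday, look at the following holiday instead
    next_idx = idx + 1 if on_holiday else idx
    days_after = (day - sorted_holidays[prev_idx]).days if prev_idx >= 0 else None
    days_before = (
        (sorted_holidays[next_idx] - day).days if next_idx < len(sorted_holidays) else None
    )
    return days_after, days_before

between = holiday_gaps(dt.date(2016, 3, 1), toy_holidays)      # between two holidays
on_a_holiday = holiday_gaps(dt.date(2016, 3, 25), toy_holidays)  # exactly on a holiday
before_all = holiday_gaps(dt.date(2015, 12, 31), toy_holidays)   # before the first holiday
print(between, on_a_holiday, before_all)
```

Because the holiday list is sorted, each lookup is a binary search rather than a scan over every holiday.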

In [11]:
# Get the date range for holidays
min_year = df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']].min().min().year
max_year = df_data[['ArrivalDate', 'DepartureDate', 'BookingDate']].max().max().year

# Fetch Portuguese holidays for the date range in the dataset
country_code = 'PT'
years = range(min_year, max_year + 1)
pt_holidays = holidays.country_holidays(country_code, years=years)

# Convert holidays to a sorted list for efficient searching
holiday_dates = sorted(pt_holidays.keys())


def calculate_days_from_holidays(date, holiday_dates):
    """
    Calculate days before next holiday and days after most recent holiday.
    
    Parameters:
    -----------
    date : pd.Timestamp
        The date to calculate holiday proximity for
    holiday_dates : list
        Sorted list of holiday dates
    
    Returns:
    --------
    tuple : (days_after_recent_holiday, days_before_next_holiday)
    """
    date = date.date()
    
    # Find position where this date would be inserted in the sorted holiday list
    idx = bisect.bisect_left(holiday_dates, date)
    
    # Calculate days after most recent holiday
    if idx > 0:
        days_after = (date - holiday_dates[idx - 1]).days
    else:
        days_after = None
    
    # Calculate days before next holiday
    if idx < len(holiday_dates):
        days_before = (holiday_dates[idx] - date).days
        # If the date is itself a holiday, skip ahead to the following holiday
        if days_before == 0 and idx + 1 < len(holiday_dates):
            days_before = (holiday_dates[idx + 1] - date).days
    else:
        days_before = None
    
    return days_after, days_before


# Apply holiday proximity calculation to each date column
for column in ['ArrivalDate', 'DepartureDate', 'BookingDate']:
    results = df_data[column].apply(lambda x: calculate_days_from_holidays(x, holiday_dates))
    df_data[f'{column}_DaysAfterHoliday'] = results.apply(lambda x: x[0])
    df_data[f'{column}_DaysBeforeHoliday'] = results.apply(lambda x: x[1])

df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,Adults,Children,Babies,Meal,...,ArrivalDate,DepartureDate,Length of Stay,BookingDate,ArrivalDate_DaysAfterHoliday,ArrivalDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday
0,0,342 days,2015,July,27,1,2,0.0,0,BB,...,2015-07-01,2015-07-01,0,2014-07-24,21,45,21,45,44,22
75559,0,257 days,2015,July,27,1,1,0.0,0,HB,...,2015-07-01,2015-07-03,2,2014-10-17,21,45,23,43,63,52
75560,0,257 days,2015,July,27,1,2,0.0,0,HB,...,2015-07-01,2015-07-03,2,2014-10-17,21,45,23,43,63,52
75561,0,257 days,2015,July,27,1,2,0.0,0,HB,...,2015-07-01,2015-07-03,2,2014-10-17,21,45,23,43,63,52
75562,0,257 days,2015,July,27,1,2,0.0,0,HB,...,2015-07-01,2015-07-03,2,2014-10-17,21,45,23,43,63,52
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,0,108 days,2017,August,35,31,2,0.0,0,HB,...,2017-08-31,2017-09-07,7,2017-05-15,16,35,23,28,14,26
40040,0,194 days,2017,August,35,31,2,1.0,0,HB,...,2017-08-31,2017-09-07,7,2017-02-18,16,35,23,28,48,55
13794,1,17 days,2017,August,35,31,2,0.0,0,HB,...,2017-08-31,2017-09-03,3,2017-08-14,16,35,19,32,60,1
40038,0,191 days,2017,August,35,31,2,0.0,0,HB,...,2017-08-31,2017-09-07,7,2017-02-21,16,35,23,28,51,52


# Special Indicator Features - Holidays, Seasons, Weekends

In [12]:
# Create binary indicators for whether dates fall on a public holiday
df_data['ArrivalDate_IsHoliday'] = df_data['ArrivalDate'].apply(
    lambda x: x.date() in holiday_dates
)

df_data['DepartureDate_IsHoliday'] = df_data['DepartureDate'].apply(
    lambda x: x.date() in holiday_dates
)

df_data['BookingDate_IsHoliday'] = df_data['BookingDate'].apply(
    lambda x: x.date() in holiday_dates
)

# Display holiday statistics
print(f"Arrivals on holidays: {df_data['ArrivalDate_IsHoliday'].sum():,} ({df_data['ArrivalDate_IsHoliday'].mean()*100:.2f}%)")
print(f"Departures on holidays: {df_data['DepartureDate_IsHoliday'].sum():,} ({df_data['DepartureDate_IsHoliday'].mean()*100:.2f}%)")
print(f"Bookings on holidays: {df_data['BookingDate_IsHoliday'].sum():,} ({df_data['BookingDate_IsHoliday'].mean()*100:.2f}%)")

Arrivals on holidays: 3,932 (3.29%)
Departures on holidays: 4,698 (3.94%)
Bookings on holidays: 2,756 (2.31%)


In [13]:
# Create season feature based on month (Northern Hemisphere seasons)
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:  # 9, 10, 11
        return 'Fall'

df_data['ArrivalDate_Season'] = df_data['ArrivalDate'].dt.month.apply(get_season)

df_data[['ArrivalDate', 'ArrivalDate_Season']].value_counts().sort_index()

ArrivalDate  ArrivalDate_Season
2015-07-01   Summer                122
2015-07-02   Summer                 93
2015-07-03   Summer                 56
2015-07-04   Summer                 88
2015-07-05   Summer                 53
                                  ... 
2017-08-27   Summer                174
2017-08-28   Summer                211
2017-08-29   Summer                125
2017-08-30   Summer                 89
2017-08-31   Summer                134
Name: count, Length: 793, dtype: int64

In [14]:
# Extract ISO day of week from arrival date (1=Monday, 7=Sunday)
arrival_isocal = (
    df_data['ArrivalDate']
    .dt.isocalendar()[['day']]
    .rename(columns={'day': 'ArrivalDate_DayOfWeek'})
)

# Add day of week to the dataframe
df_data = pd.concat([df_data, arrival_isocal], axis=1)

df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,Adults,Children,Babies,Meal,...,ArrivalDate_DaysBeforeHoliday,DepartureDate_DaysAfterHoliday,DepartureDate_DaysBeforeHoliday,BookingDate_DaysAfterHoliday,BookingDate_DaysBeforeHoliday,ArrivalDate_IsHoliday,DepartureDate_IsHoliday,BookingDate_IsHoliday,ArrivalDate_Season,ArrivalDate_DayOfWeek
0,0,342 days,2015,July,27,1,2,0.0,0,BB,...,45,21,45,44,22,False,False,False,Summer,3
75559,0,257 days,2015,July,27,1,1,0.0,0,HB,...,45,23,43,63,52,False,False,False,Summer,3
75560,0,257 days,2015,July,27,1,2,0.0,0,HB,...,45,23,43,63,52,False,False,False,Summer,3
75561,0,257 days,2015,July,27,1,2,0.0,0,HB,...,45,23,43,63,52,False,False,False,Summer,3
75562,0,257 days,2015,July,27,1,2,0.0,0,HB,...,45,23,43,63,52,False,False,False,Summer,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40039,0,108 days,2017,August,35,31,2,0.0,0,HB,...,35,23,28,14,26,False,False,False,Summer,4
40040,0,194 days,2017,August,35,31,2,1.0,0,HB,...,35,23,28,48,55,False,False,False,Summer,4
13794,1,17 days,2017,August,35,31,2,0.0,0,HB,...,35,19,32,60,1,False,False,False,Summer,4
40038,0,191 days,2017,August,35,31,2,0.0,0,HB,...,35,23,28,51,52,False,False,False,Summer,4


In [15]:
# Create weekend arrival indicator (Friday=5, Saturday=6, Sunday=7)
df_data['ArrivalDate_IsWeekend'] = df_data['ArrivalDate_DayOfWeek'].isin([5, 6, 7])

# Days until the next Friday. Note: Saturday and Sunday yield 6 and 5, not 0,
# even though they are flagged as weekend days above.
df_data['ArrivalDate_DaysUntilWeekend'] = (5 - df_data['ArrivalDate_DayOfWeek']) % 7

# Days since the weekend: 0 on weekend days, otherwise counted from Monday = 0
# (Tuesday = 1, Wednesday = 2, Thursday = 3).
df_data['ArrivalDate_DaysSinceWeekend'] = df_data['ArrivalDate_DayOfWeek'].apply(
    lambda x: 0 if x in [5, 6, 7] else x - 1
)

df_data[['ArrivalDate_DayOfWeek', 'ArrivalDate_IsWeekend', 
         'ArrivalDate_DaysUntilWeekend', 'ArrivalDate_DaysSinceWeekend']].head(10)

Unnamed: 0,ArrivalDate_DayOfWeek,ArrivalDate_IsWeekend,ArrivalDate_DaysUntilWeekend,ArrivalDate_DaysSinceWeekend
0,3,False,2,2
75559,3,False,2,2
75560,3,False,2,2
75561,3,False,2,2
75562,3,False,2,2
75563,3,False,2,2
75564,3,False,2,2
75565,3,False,2,2
75566,3,False,2,2
75558,3,False,2,2


In [16]:
# Extract quarter, month, and week of year from arrival date
df_data['ArrivalDate_Quarter'] = df_data['ArrivalDate'].dt.quarter
df_data['ArrivalDate_Month'] = df_data['ArrivalDate'].dt.month
df_data['ArrivalDate_WeekOfYear'] = df_data['ArrivalDate'].dt.isocalendar().week

df_data[['ArrivalDate', 'ArrivalDate_Quarter', 'ArrivalDate_Month', 'ArrivalDate_WeekOfYear']].head()

Unnamed: 0,ArrivalDate,ArrivalDate_Quarter,ArrivalDate_Month,ArrivalDate_WeekOfYear
0,2015-07-01,3,7,27
75559,2015-07-01,3,7,27
75560,2015-07-01,3,7,27
75561,2015-07-01,3,7,27
75562,2015-07-01,3,7,27


# Feature Engineering: Advanced Temporal Features for Modeling

Create sophisticated temporal features including cyclical encodings, Fourier transformations, and polynomial trends.

## Feature Engineering Rationale

**Why Fourier Transformations?**
- Capture multiple seasonal patterns simultaneously (yearly, quarterly, monthly, weekly)
- Each frequency/harmonic combination detects different cyclic behaviors
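- Most useful for models that do not learn periodicity from a raw time index (linear models, neural networks)
- Tree-based models can split on these features as well, although they typically gain less because they can already partition raw month or day-of-week values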

**Why Cyclical Encoding (sin/cos)?**
- Preserves circular nature: December (12) and January (1) are actually adjacent
- Raw numeric encoding creates artificial gaps (11 units apart)
- Sin/cos encoding: December and January have similar values
- Enables distance-based algorithms to work correctly with time
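The adjacency argument above can be checked numerically. The following minimal sketch uses only the standard library; the helper names `encode_month` and `distance` are illustrative and do not appear in the notebook.

```python
import math

def encode_month(month):
    """Map a month number (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * month / 12
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two encoded months."""
    return math.dist(encode_month(a), encode_month(b))

dec_jan = distance(12, 1)   # adjacent months across the year boundary
jun_jul = distance(6, 7)    # adjacent months in mid-year
dec_jun = distance(12, 6)   # months on opposite sides of the year
print(f"Dec-Jan: {dec_jan:.4f}, Jun-Jul: {jun_jul:.4f}, Dec-Jun: {dec_jun:.4f}")
```

Every pair of adjacent months ends up the same distance apart, and months half a year apart sit at the maximum distance of 2, which is the property the raw 1-12 encoding lacks.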

**Why Polynomial Trends?**
- Captures non-linear changes over time (bookings may increase/decrease at varying rates)
- Low-degree polynomial terms (quadratic and cubic here) allow flexible trend fitting; keeping the degree low limits, but does not eliminate, the risk of overfitting
- Normalized to [0,1] for numerical stability
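A small worked example shows why the normalization matters. It uses the 792-day span and the median of 433 days that `describe()` reports for `DaysSinceStart` in this notebook; the variable names are illustrative only.

```python
max_days = 792      # largest DaysSinceStart in this dataset
median_days = 433   # median DaysSinceStart reported by describe()

# Without normalization the cubic term reaches hundreds of millions
raw_cubed = max_days ** 3

# With normalization every polynomial term stays inside [0, 1]
t = median_days / max_days
trend = (t, t ** 2, t ** 3)

print(raw_cubed)
print([round(value, 6) for value in trend])
```

Keeping all three terms on the same scale avoids one feature dominating gradient-based or distance-based models purely because of its magnitude.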

**Why Interaction Features?**
- Time × Seasonality: Seasonal effects may strengthen/weaken over time
- LeadTime × Seasonality: Booking windows may vary by season
- Captures complex, real-world booking patterns

**Model-Specific Considerations:**
- **Tree-based (RF, XGBoost)**: Can use all features; automatically find interactions
- **Linear models**: Fourier + interactions critical for good performance
- **Neural networks**: Cyclical encodings essential; Fourier features helpful
- **Feature selection recommended**: Not all features will be important; use correlation analysis or model-based selection

## Cyclical Encoding

Encode cyclical features (month, day of week, day of year) using sine and cosine transformations to preserve their circular nature.

In [17]:
# Cyclical encoding for day of week (7 days)
df_data['ArrivalDate_DayOfWeek_Sin'] = np.sin(2 * np.pi * df_data['ArrivalDate_DayOfWeek'] / 7)
df_data['ArrivalDate_DayOfWeek_Cos'] = np.cos(2 * np.pi * df_data['ArrivalDate_DayOfWeek'] / 7)

# Cyclical encoding for month (12 months)
df_data['ArrivalDate_Month_Sin'] = np.sin(2 * np.pi * df_data['ArrivalDate_Month'] / 12)
df_data['ArrivalDate_Month_Cos'] = np.cos(2 * np.pi * df_data['ArrivalDate_Month'] / 12)

# Cyclical encoding for day of year (365.25 days, the mean year length)
df_data['ArrivalDate_DayOfYear'] = df_data['ArrivalDate'].dt.dayofyear
df_data['ArrivalDate_DayOfYear_Sin'] = np.sin(2 * np.pi * df_data['ArrivalDate_DayOfYear'] / 365.25)
df_data['ArrivalDate_DayOfYear_Cos'] = np.cos(2 * np.pi * df_data['ArrivalDate_DayOfYear'] / 365.25)

# Display sample
print("Cyclical encodings for day of week, month, and day of year:")
df_data[['ArrivalDate_DayOfWeek', 'ArrivalDate_DayOfWeek_Sin', 'ArrivalDate_DayOfWeek_Cos',
         'ArrivalDate_Month', 'ArrivalDate_Month_Sin', 'ArrivalDate_Month_Cos']].head()

Cyclical encodings for day of week, month, and day of year:


Unnamed: 0,ArrivalDate_DayOfWeek,ArrivalDate_DayOfWeek_Sin,ArrivalDate_DayOfWeek_Cos,ArrivalDate_Month,ArrivalDate_Month_Sin,ArrivalDate_Month_Cos
0,3,0.433884,-0.900969,7,-0.5,-0.866025
75559,3,0.433884,-0.900969,7,-0.5,-0.866025
75560,3,0.433884,-0.900969,7,-0.5,-0.866025
75561,3,0.433884,-0.900969,7,-0.5,-0.866025
75562,3,0.433884,-0.900969,7,-0.5,-0.866025


In [18]:
# Compare raw month vs cyclical encoding
print("Month Encoding Comparison:")
print("="*60)
print("\nRaw month values (December=12 to January=1):")
print("  Problem: Distance from Dec (12) to Jan (1) = 11 (wrong!)")
print("  Reality: They are adjacent months, distance should be ~1")

print("\nCyclical encoding (sin/cos):")
sample_months = df_data[df_data['ArrivalDate_Month'].isin([12, 1])].head(10)
print("\nSample December and January encodings:")
print(sample_months[['ArrivalDate_Month', 'ArrivalDate_Month_Sin', 'ArrivalDate_Month_Cos']].to_string())

print("\n✓ Cyclical encoding preserves month adjacency for ML models")

Month Encoding Comparison:

Raw month values (December=12 to January=1):
  Problem: Distance from Dec (12) to Jan (1) = 11 (wrong!)
  Reality: They are adjacent months, distance should be ~1

Cyclical encoding (sin/cos):

Sample December and January encodings:
       ArrivalDate_Month  ArrivalDate_Month_Sin  ArrivalDate_Month_Cos
81148                 12          -2.449294e-16                    1.0
18858                 12          -2.449294e-16                    1.0
81075                 12          -2.449294e-16                    1.0
81146                 12          -2.449294e-16                    1.0
81078                 12          -2.449294e-16                    1.0
45948                 12          -2.449294e-16                    1.0
81081                 12          -2.449294e-16                    1.0
81072                 12          -2.449294e-16                    1.0
3274                  12          -2.449294e-16                    1.0
45949                 12     

## Polynomial Time Trend Features

Create polynomial features to capture non-linear trends over time.

In [19]:
# Calculate days since start of dataset for time-based features
dataset_start = df_data['ArrivalDate'].min()
df_data['DaysSinceStart'] = (df_data['ArrivalDate'] - dataset_start).dt.days

# Normalize DaysSinceStart for polynomial features (0-1 scale)
max_days = df_data['DaysSinceStart'].max()
df_data['TimeTrend_Normalized'] = df_data['DaysSinceStart'] / max_days

# Create polynomial features (degree 2 and 3)
df_data['TimeTrend_Squared'] = df_data['TimeTrend_Normalized'] ** 2
df_data['TimeTrend_Cubed'] = df_data['TimeTrend_Normalized'] ** 3

print("✓ Created polynomial time trend features (normalized)")
print(f"\nTime trend range: {df_data['TimeTrend_Normalized'].min():.3f} to {df_data['TimeTrend_Normalized'].max():.3f}")
df_data[['DaysSinceStart', 'TimeTrend_Normalized', 'TimeTrend_Squared', 'TimeTrend_Cubed']].describe()

✓ Created polynomial time trend features (normalized)

Time trend range: 0.000 to 1.000


Unnamed: 0,DaysSinceStart,TimeTrend_Normalized,TimeTrend_Squared,TimeTrend_Cubed
count,119390.0,119390.0,119390.0,119390.0
mean,424.694279,0.53623,0.367287,0.279222
std,223.654607,0.282392,0.300646,0.29097
min,0.0,0.0,0.0,0.0
25%,256.0,0.323232,0.104479,0.033771
50%,433.0,0.546717,0.2989,0.163414
75%,626.0,0.790404,0.624739,0.493796
max,792.0,1.0,1.0,1.0


## Fourier Features

Create Fourier features to capture periodic patterns at different frequencies (annual, semi-annual, quarterly, monthly, and weekly cycles).
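Each Fourier feature is a sine or cosine of the elapsed days scaled by a period. The sketch below generates the first-harmonic pairs for one value of `DaysSinceStart` using only the standard library. The helpers `fourier_pair` and `fourier_row` are hypothetical names; the periods are the ones used in this section.

```python
import math

def fourier_pair(day, period, harmonic=1):
    """Return (sin, cos) of the given harmonic for a cycle of `period` days."""
    angle = 2 * math.pi * harmonic * day / period
    return math.sin(angle), math.cos(angle)

# Cycle lengths in days, as used in this section
periods = {
    "Annual": 365.25,
    "SemiAnnual": 182.625,
    "Quarterly": 91.31,
    "Monthly": 30.44,
    "Weekly": 7,
}

def fourier_row(day):
    """Build a {feature_name: value} dict for one DaysSinceStart value."""
    row = {}
    for name, period in periods.items():
        sin_value, cos_value = fourier_pair(day, period)
        row[f"Fourier_{name}_Sin_1"] = sin_value
        row[f"Fourier_{name}_Cos_1"] = cos_value
    return row

features_day0 = fourier_row(0)   # first day of the dataset
features_day7 = fourier_row(7)   # exactly one weekly cycle later
print(len(features_day0))
```

On day 0 every sine is 0 and every cosine is 1, which is why the first rows of the sample output all show `0.0` and `1.0`.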

In [20]:
# Create Fourier features for different periodicities
# Annual cycle (365.25 days)
df_data['Fourier_Annual_Sin_1'] = np.sin(2 * np.pi * df_data['DaysSinceStart'] / 365.25)
df_data['Fourier_Annual_Cos_1'] = np.cos(2 * np.pi * df_data['DaysSinceStart'] / 365.25)

# Semi-annual cycle (182.625 days)
df_data['Fourier_SemiAnnual_Sin_1'] = np.sin(2 * np.pi * df_data['DaysSinceStart'] / 182.625)
df_data['Fourier_SemiAnnual_Cos_1'] = np.cos(2 * np.pi * df_data['DaysSinceStart'] / 182.625)

# Quarterly cycle (~91 days)
df_data['Fourier_Quarterly_Sin_1'] = np.sin(2 * np.pi * df_data['DaysSinceStart'] / 91.31)
df_data['Fourier_Quarterly_Cos_1'] = np.cos(2 * np.pi * df_data['DaysSinceStart'] / 91.31)

# Monthly cycle (~30 days)
df_data['Fourier_Monthly_Sin_1'] = np.sin(2 * np.pi * df_data['DaysSinceStart'] / 30.44)
df_data['Fourier_Monthly_Cos_1'] = np.cos(2 * np.pi * df_data['DaysSinceStart'] / 30.44)

# Weekly cycle (7 days)
df_data['Fourier_Weekly_Sin_1'] = np.sin(2 * np.pi * df_data['DaysSinceStart'] / 7)
df_data['Fourier_Weekly_Cos_1'] = np.cos(2 * np.pi * df_data['DaysSinceStart'] / 7)

print(f"Created Fourier features for 5 different periodicities (10 features total)")
print(f"\nSample Fourier features:")
df_data[['DaysSinceStart', 'Fourier_Annual_Sin_1', 'Fourier_Annual_Cos_1', 
         'Fourier_Weekly_Sin_1', 'Fourier_Weekly_Cos_1']].head()

Created Fourier features for 5 different periodicities (10 features total)

Sample Fourier features:


Unnamed: 0,DaysSinceStart,Fourier_Annual_Sin_1,Fourier_Annual_Cos_1,Fourier_Weekly_Sin_1,Fourier_Weekly_Cos_1
0,0,0.0,1.0,0.0,1.0
75559,0,0.0,1.0,0.0,1.0
75560,0,0.0,1.0,0.0,1.0
75561,0,0.0,1.0,0.0,1.0
75562,0,0.0,1.0,0.0,1.0


## Interaction Features

Create interaction features between time trends and cyclical patterns.
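One pitfall to watch for: `LeadTime` was converted to a timedelta when `BookingDate` was derived, and a timedelta multiplied by a float is still a timedelta rather than a plain number. The standard-library sketch below illustrates the distinction; treating the pandas behaviour as analogous is an assumption, and the variable names are illustrative.

```python
import datetime as dt
import math

lead_time = dt.timedelta(days=17)                 # lead time stored as a duration
seasonal = math.sin(2 * math.pi * 792 / 365.25)   # an annual sine value

# Multiplying the duration directly keeps the result a duration
as_timedelta = lead_time * seasonal

# Converting to a day count first yields an ordinary float
as_number = lead_time.days * seasonal

print(type(as_timedelta).__name__, type(as_number).__name__)
```

For modeling, the numeric form is the one most estimators expect, so converting the lead time to whole days before forming the product is the safer choice.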

In [21]:
# Interaction: Time trend with annual seasonality
df_data['Interaction_Trend_Annual_Sin'] = df_data['TimeTrend_Normalized'] * df_data['Fourier_Annual_Sin_1']
df_data['Interaction_Trend_Annual_Cos'] = df_data['TimeTrend_Normalized'] * df_data['Fourier_Annual_Cos_1']

# Interaction: Time trend with monthly cycle
df_data['Interaction_Trend_Month_Sin'] = df_data['TimeTrend_Normalized'] * df_data['ArrivalDate_Month_Sin']
df_data['Interaction_Trend_Month_Cos'] = df_data['TimeTrend_Normalized'] * df_data['ArrivalDate_Month_Cos']

# Interaction: Lead time with seasonality
# Note: LeadTime is a timedelta at this point, so both products are timedelta columns
df_data['Interaction_LeadTime_Annual_Sin'] = df_data['LeadTime'] * df_data['Fourier_Annual_Sin_1']
df_data['Interaction_LeadTime_Annual_Cos'] = df_data['LeadTime'] * df_data['Fourier_Annual_Cos_1']

print("✓ Created 6 interaction features between time trends and cyclical patterns")

✓ Created 6 interaction features between time trends and cyclical patterns
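
If numeric lead-time interactions are preferred over timedelta products, lead time can be converted to whole days before multiplying. A minimal sketch on toy data, assuming `LeadTime` has a timedelta dtype (the values below are made up):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "LeadTime": pd.to_timedelta([342, 257, 17], unit="D"),
    "DaysSinceStart": [0, 30, 200],
})
demo["Fourier_Annual_Sin_1"] = np.sin(2 * np.pi * demo["DaysSinceStart"] / 365.25)

# .dt.days yields integer day counts, so the product is float64 rather than timedelta64
lead_time_days = demo["LeadTime"].dt.days
demo["Interaction_LeadTime_Annual_Sin"] = lead_time_days * demo["Fourier_Annual_Sin_1"]
print(demo.dtypes)
```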


## Higher-Order Fourier Harmonics

Add second and third harmonics for the annual cycle, plus a second harmonic for the monthly cycle, to capture more complex seasonal patterns.

In [22]:
# Second harmonic for annual cycle
df_data['Fourier_Annual_Sin_2'] = np.sin(4 * np.pi * df_data['DaysSinceStart'] / 365.25)
df_data['Fourier_Annual_Cos_2'] = np.cos(4 * np.pi * df_data['DaysSinceStart'] / 365.25)

# Third harmonic for annual cycle
df_data['Fourier_Annual_Sin_3'] = np.sin(6 * np.pi * df_data['DaysSinceStart'] / 365.25)
df_data['Fourier_Annual_Cos_3'] = np.cos(6 * np.pi * df_data['DaysSinceStart'] / 365.25)

# Second harmonic for monthly cycle
df_data['Fourier_Monthly_Sin_2'] = np.sin(4 * np.pi * df_data['DaysSinceStart'] / 30.44)
df_data['Fourier_Monthly_Cos_2'] = np.cos(4 * np.pi * df_data['DaysSinceStart'] / 30.44)

print("✓ Added higher-order harmonics (6 additional features)")
print(f"\nTotal Fourier features: {len([col for col in df_data.columns if 'Fourier_' in col])}")

✓ Added higher-order harmonics (6 additional features)

Total Fourier features: 16
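
A second harmonic doubles the frequency, so the annual second harmonic has a period of 365.25 / 2 = 182.625 days, the same period used for the semi-annual features. A quick check on a synthetic day range (not the booking data) suggests the two column pairs coincide, in which case one pair could be dropped:

```python
import numpy as np

# Synthetic day counter; the 800-day length is an arbitrary choice
t = np.arange(0, 800)

annual_sin_2 = np.sin(4 * np.pi * t / 365.25)
semiannual_sin_1 = np.sin(2 * np.pi * t / 182.625)
annual_cos_2 = np.cos(4 * np.pi * t / 365.25)
semiannual_cos_1 = np.cos(2 * np.pi * t / 182.625)

# True when both pairs agree up to floating-point rounding
duplicates = np.allclose(annual_sin_2, semiannual_sin_1) and np.allclose(annual_cos_2, semiannual_cos_1)
print(duplicates)
```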


## Feature Summary

Display summary of all advanced temporal features created.

In [23]:
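# Count all temporal feature categories
# Note: the '_Sin'/'_Cos' match also picks up the 6 Interaction_* columns, so they
# appear under "Other cyclical" and are counted again in the total printed below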
cyclical_features = [col for col in df_data.columns if '_Sin' in col or '_Cos' in col]
fourier_features = [col for col in df_data.columns if 'Fourier_' in col]
polynomial_features = [col for col in df_data.columns if 'TimeTrend_' in col]
interaction_features = [col for col in df_data.columns if 'Interaction_' in col]

print("="*70)
print("Advanced Temporal Features Summary")
print("="*70)
print(f"Cyclical encodings (sin/cos):        {len(cyclical_features)} features")
print(f"  - Fourier transforms:               {len(fourier_features)} features")
print(f"  - Other cyclical:                   {len(cyclical_features) - len(fourier_features)} features")
print(f"Polynomial trend features:            {len(polynomial_features)} features")
print(f"Interaction features:                 {len(interaction_features)} features")
print(f"\nTotal advanced temporal features:     {len(cyclical_features) + len(polynomial_features) + len(interaction_features)} features")
print("="*70)

Advanced Temporal Features Summary
Cyclical encodings (sin/cos):        28 features
  - Fourier transforms:               16 features
  - Other cyclical:                   12 features
Polynomial trend features:            3 features
Interaction features:                 6 features

Total advanced temporal features:     37 features


# Final Preparations

Finalize the dataset with optimized data types and prepare for export.

In [24]:
# Reset index for clean sequential indexing
df_data = df_data.reset_index(drop=True)

df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,Adults,Children,Babies,Meal,...,Interaction_Trend_Month_Sin,Interaction_Trend_Month_Cos,Interaction_LeadTime_Annual_Sin,Interaction_LeadTime_Annual_Cos,Fourier_Annual_Sin_2,Fourier_Annual_Cos_2,Fourier_Annual_Sin_3,Fourier_Annual_Cos_3,Fourier_Monthly_Sin_2,Fourier_Monthly_Cos_2
0,0,342 days,2015,July,27,1,2,0.0,0,BB,...,-0.000000,-0.0,0 days 00:00:00,342 days 00:00:00,0.000000,1.000000,0.000000,1.00000,0.000000,1.000000
1,0,257 days,2015,July,27,1,1,0.0,0,HB,...,-0.000000,-0.0,0 days 00:00:00,257 days 00:00:00,0.000000,1.000000,0.000000,1.00000,0.000000,1.000000
2,0,257 days,2015,July,27,1,2,0.0,0,HB,...,-0.000000,-0.0,0 days 00:00:00,257 days 00:00:00,0.000000,1.000000,0.000000,1.00000,0.000000,1.000000
3,0,257 days,2015,July,27,1,2,0.0,0,HB,...,-0.000000,-0.0,0 days 00:00:00,257 days 00:00:00,0.000000,1.000000,0.000000,1.00000,0.000000,1.000000
4,0,257 days,2015,July,27,1,2,0.0,0,HB,...,-0.000000,-0.0,0 days 00:00:00,257 days 00:00:00,0.000000,1.000000,0.000000,1.00000,0.000000,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,0,108 days,2017,August,35,31,2,0.0,0,HB,...,-0.866025,-0.5,94 days 02:32:30.488652077,52 days 23:47:28.419637815,0.855075,-0.518505,-0.032249,-0.99948,0.229128,0.973396
119386,0,194 days,2017,August,35,31,2,1.0,0,HB,...,-0.866025,-0.5,169 days 01:00:36.988875028,95 days 04:30:49.938979037,0.855075,-0.518505,-0.032249,-0.99948,0.229128,0.973396
119387,1,17 days,2017,August,35,31,2,0.0,0,HB,...,-0.866025,-0.5,14 days 19:30:40.354695234,8 days 08:11:21.695683730,0.855075,-0.518505,-0.032249,-0.99948,0.229128,0.973396
119388,0,191 days,2017,August,35,31,2,0.0,0,HB,...,-0.866025,-0.5,166 days 10:16:22.808634692,93 days 17:11:10.816211320,0.855075,-0.518505,-0.032249,-0.99948,0.229128,0.973396


In [25]:
# Display comprehensive feature breakdown
temporal_features = [col for col in df_data.columns if any(x in col for x in 
    ['Arrival', 'Departure', 'Booking', 'Days', 'Week', 'Month', 'Quarter', 
     'Season', 'Holiday', 'Fourier', 'TimeTrend', 'Interaction', 'Sin', 'Cos'])]

print("\n" + "="*70)
print("COMPLETE TEMPORAL FEATURE INVENTORY")
print("="*70)
print(f"\nTotal temporal features: {len(temporal_features)}")
print(f"Dataset shape: {df_data.shape}")
print(f"Records: {len(df_data):,}")
print("\nFeature types:")
print(f"  Date/Datetime: {df_data.select_dtypes(include=['datetime64']).shape[1]}")
print(f"  Integer (int8): {len([c for c in df_data.columns if df_data[c].dtype == 'int8'])}")
print(f"  Integer (int16/Int16): {len([c for c in df_data.columns if str(df_data[c].dtype) in ['int16', 'Int16']])}")
print(f"  Float (float32): {len([c for c in df_data.columns if df_data[c].dtype == 'float32'])}")
print(f"  Boolean: {len([c for c in df_data.columns if df_data[c].dtype == 'bool'])}")
print(f"  Category: {len([c for c in df_data.columns if df_data[c].dtype.name == 'category'])}")
print("="*70)


COMPLETE TEMPORAL FEATURE INVENTORY

Total temporal features: 60
Dataset shape: (119390, 82)
Records: 119,390

Feature types:
  Date/Datetime: 3
  Integer (int8): 0
  Integer (int16/Int16): 0
  Float (float32): 0
  Boolean: 3
  Category: 0


In [26]:
# Display sample of key engineered features
sample_cols = ['ArrivalDate', 'ArrivalDate_Month', 'ArrivalDate_Month_Sin', 'ArrivalDate_Month_Cos',
               'ArrivalDate_Season', 'Fourier_Annual_Sin_1', 'Fourier_Weekly_Sin_1',
               'TimeTrend_Normalized', 'ArrivalDate_IsWeekend', 'ArrivalDate_IsHoliday']

print("\nSample of engineered temporal features:")
print(df_data[sample_cols].head(10).to_string())

print(f"\n✓ Dataset ready with {len(df_data.columns)} total features")


Sample of engineered temporal features:
  ArrivalDate  ArrivalDate_Month  ArrivalDate_Month_Sin  ArrivalDate_Month_Cos ArrivalDate_Season  Fourier_Annual_Sin_1  Fourier_Weekly_Sin_1  TimeTrend_Normalized  ArrivalDate_IsWeekend  ArrivalDate_IsHoliday
0  2015-07-01                  7                   -0.5              -0.866025             Summer                   0.0                   0.0                   0.0                  False                  False
1  2015-07-01                  7                   -0.5              -0.866025             Summer                   0.0                   0.0                   0.0                  False                  False
2  2015-07-01                  7                   -0.5              -0.866025             Summer                   0.0                   0.0                   0.0                  False                  False
3  2015-07-01                  7                   -0.5              -0.866025             Summer                   0.0

In [27]:
# Capture initial memory usage
initial_memory = df_data.memory_usage(deep=True).sum() / 1024**2  # Convert to MB

print(f"Initial memory usage: {initial_memory:.2f} MB")
print(f"\nMemory usage by column (top 15):")
print(df_data.memory_usage(deep=True).sort_values(ascending=False).head(15) / 1024**2)

Initial memory usage: 154.42 MB

Memory usage by column (top 15):
AssignedRoomType         7.400846
ReservedRoomType         7.400846
DepositType              7.286987
Agent                    6.831551
Company                  6.831551
CustomerType             6.741505
MarketSegment            6.606083
Meal                     6.603832
ReservationStatus        6.560506
ArrivalDateMonth         6.251231
ArrivalDate_Season       6.207968
DistributionChannel      6.187484
ArrivalDateYear          6.034536
Country                  5.906426
ArrivalDateDayOfMonth    5.773314
dtype: float64


In [28]:
# Convert categorical season feature
df_data['ArrivalDate_Season'] = df_data['ArrivalDate_Season'].astype('category')

print("✓ Season feature converted to category dtype")

✓ Season feature converted to category dtype


In [29]:
# Convert boolean features
df_data['ArrivalDate_IsWeekend'] = df_data['ArrivalDate_IsWeekend'].astype('bool')
df_data['ArrivalDate_IsHoliday'] = df_data['ArrivalDate_IsHoliday'].astype('bool')
df_data['DepartureDate_IsHoliday'] = df_data['DepartureDate_IsHoliday'].astype('bool')
df_data['BookingDate_IsHoliday'] = df_data['BookingDate_IsHoliday'].astype('bool')

print("✓ Boolean features converted to bool dtype")

✓ Boolean features converted to bool dtype


In [30]:
# Convert holiday proximity features to Int16 (nullable, so NaN values are preserved)
holiday_proximity_cols = [
    'ArrivalDate_DaysBeforeHoliday', 'ArrivalDate_DaysAfterHoliday',
    'DepartureDate_DaysBeforeHoliday', 'DepartureDate_DaysAfterHoliday',
    'BookingDate_DaysBeforeHoliday', 'BookingDate_DaysAfterHoliday'
]

for col in holiday_proximity_cols:
    if col not in df_data.columns:
        continue
    # Downcast only when the largest value fits in Int16 (max 32,767);
    # an all-NaN column has a NaN max and is left untouched
    max_val = df_data[col].max()
    if pd.notna(max_val) and max_val <= 32767:
        df_data[col] = df_data[col].astype('Int16')  # Nullable integer type

print("✓ Holiday proximity features converted to Int16")

✓ Holiday proximity features converted to Int16


# Data Type Optimization

Optimize memory usage by converting features to appropriate data types.

In [31]:
# Convert integer features to appropriate smaller types
# int8 can hold values from -128 to 127
# int16 can hold values from -32,768 to 32,767
# uint8 can hold values from 0 to 255
# uint16 can hold values from 0 to 65,535

# Day of week (1-7)
df_data['ArrivalDate_DayOfWeek'] = df_data['ArrivalDate_DayOfWeek'].astype('int8')

# Quarter (1-4)
df_data['ArrivalDate_Quarter'] = df_data['ArrivalDate_Quarter'].astype('int8')

# Month (1-12)
df_data['ArrivalDate_Month'] = df_data['ArrivalDate_Month'].astype('int8')

# Week of year (1-53)
df_data['ArrivalDate_WeekOfYear'] = df_data['ArrivalDate_WeekOfYear'].astype('int8')

# Days until/since weekend (0-6)
df_data['ArrivalDate_DaysUntilWeekend'] = df_data['ArrivalDate_DaysUntilWeekend'].astype('int8')
df_data['ArrivalDate_DaysSinceWeekend'] = df_data['ArrivalDate_DaysSinceWeekend'].astype('int8')

# Convert DaysSinceStart and DayOfYear to int16
df_data['DaysSinceStart'] = df_data['DaysSinceStart'].astype('int16')
df_data['ArrivalDate_DayOfYear'] = df_data['ArrivalDate_DayOfYear'].astype('int16')

# Length of stay (check max value first)
max_stay = df_data['Length of Stay'].max()
if max_stay <= 127:
    df_data['Length of Stay'] = df_data['Length of Stay'].astype('int8')
elif max_stay <= 32767:
    df_data['Length of Stay'] = df_data['Length of Stay'].astype('int16')
    
print(f"✓ Length of Stay (max: {max_stay}) converted to {df_data['Length of Stay'].dtype}")

✓ Length of Stay (max: 69) converted to int8
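
The manual range checks above can also be delegated to pandas. A minimal sketch on toy columns, using `pd.to_numeric` with `downcast='integer'`, which selects the smallest signed integer dtype that holds every value (the column values here are made up):

```python
import pandas as pd

demo = pd.DataFrame({
    "ArrivalDate_Quarter": [1, 2, 3, 4],
    "DaysSinceStart": [0, 200, 500, 792],
})

# Each column is downcast independently based on its own value range
for col in demo.columns:
    demo[col] = pd.to_numeric(demo[col], downcast="integer")

print(demo.dtypes)
```

Explicit `astype` calls, as used in this notebook, have the advantage of failing loudly or documenting the expected range; the automatic variant is more convenient when many columns are involved.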


In [32]:
# Calculate final memory usage and savings
final_memory = df_data.memory_usage(deep=True).sum() / 1024**2  # Convert to MB
memory_saved = initial_memory - final_memory
pct_reduction = (memory_saved / initial_memory) * 100

print(f"\n{'='*60}")
print(f"Memory Optimization Results:")
print(f"{'='*60}")
print(f"Initial memory usage:  {initial_memory:>10.2f} MB")
print(f"Final memory usage:    {final_memory:>10.2f} MB")
print(f"Memory saved:          {memory_saved:>10.2f} MB")
print(f"Reduction:             {pct_reduction:>10.2f}%")
print(f"{'='*60}")


Memory Optimization Results:
Initial memory usage:      154.42 MB
Final memory usage:        140.24 MB
Memory saved:               14.18 MB
Reduction:                   9.18%


# Summary

---

## Features Successfully Created

This notebook engineered the following temporal features from the hotel booking data:

### Core Date Features (3)
- `ArrivalDate`, `DepartureDate`, `BookingDate`
- `Length of Stay` (consolidated from weekend/weekday nights)

### Basic Temporal Features (10)
- Day of week, Month, Quarter, Week of year, Day of year
- Weekend indicators and proximity
- Season (categorical)
- `DaysSinceStart` (time progression)

### Holiday Features (9)
- Holiday proximity (6 features: days before/after for each date type)
- Holiday indicators (3 boolean features)

### Advanced Modeling Features (31)
- **Cyclical encodings** (6): Sin/cos transformations for day of week, month, day of year
- **Fourier features** (16): Multi-frequency periodic patterns (annual, semi-annual, quarterly, monthly, weekly) with harmonics
- **Polynomial trends** (3): Linear, squared, and cubed time trends
- **Interaction features** (6): Time trends × seasonality, Lead time × seasonality

### Data Quality & Optimization
- Data validation checks at start
- Removed `ReservationStatusDate` to prevent data leakage
- Optimized data types (int8, int16, Int16, bool, category)
- Memory-efficient storage

**Total temporal features created: 60+**

---

## Advanced Features for Machine Learning

The Fourier transformations and cyclical encodings are particularly valuable for:
- **Capturing seasonality** at multiple time scales
- **Avoiding feature discontinuity** (e.g., December/January boundary)
- **Tree-based models**: Can split directly on these smooth features
- **Linear/neural models**: Gain the most, since a linear term on a raw month or day number cannot express a cycle
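
The December/January point can be made concrete. A small sketch, using month numbers 1 to 12 and the usual sin/cos encoding with angle 2π·month/12 (the helper name `encode_month` is illustrative):

```python
import numpy as np


def encode_month(month):
    """Map a month number (1..12) onto the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])


# Raw month numbers put December and January 11 units apart
raw_gap = abs(12 - 1)

# On the circle, December to January is as close as any other adjacent pair
dec_jan = np.linalg.norm(encode_month(12) - encode_month(1))
jun_jul = np.linalg.norm(encode_month(6) - encode_month(7))
print(raw_gap, round(dec_jan, 4), round(jun_jul, 4))
```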

The polynomial and interaction features enable:
- **Non-linear trend detection**
- **Time-varying seasonal effects**
- **Complex booking behavior patterns**

---

## Next Steps

The enhanced dataset with rich temporal features is ready for:
- Feature importance analysis and selection
- Advanced classification modeling (Random Forest, XGBoost, Neural Networks)
- Time series forecasting
- Pattern discovery and customer segmentation

---

In [33]:
# Export the enhanced dataset with temporal features
df_data.to_parquet('../../data/3.1_temporally_updated_data.parquet', compression='zstd')

## Best Practices for Using These Features

**Feature Selection Recommendations:**
1. Start with feature importance analysis (Random Forest, XGBoost)
2. Check correlation matrix to remove highly correlated features
3. Consider LASSO or elastic net for automatic feature selection
4. You may not need all Fourier harmonics; add them incrementally and keep only those that improve validation performance
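
The correlation check in step 2 might look like the following. This is a sketch under stated assumptions: `drop_highly_correlated` is a hypothetical helper, the 0.95 cutoff is arbitrary, and the toy columns are synthetic rather than taken from the booking data.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.95):
    """Drop one column from every pair whose |Pearson r| exceeds threshold.

    Illustrative helper (not from the notebook); the default cutoff is an assumption.
    """
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


t = np.arange(0, 800)
demo = pd.DataFrame({
    "Fourier_SemiAnnual_Sin_1": np.sin(2 * np.pi * t / 182.625),
    "Fourier_Annual_Sin_2": np.sin(4 * np.pi * t / 365.25),
    "Fourier_Weekly_Sin_1": np.sin(2 * np.pi * t / 7),
})
reduced, dropped = drop_highly_correlated(demo)
print(dropped)
```

The helper keeps the first column of each correlated pair, so column order determines which feature survives.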

**Modeling Tips:**
- For tree-based models: All features can coexist; model will select what's useful
- For linear models: Consider standardization and regularization
- For neural networks: Normalize/standardize continuous features
- Watch for multicollinearity between raw and encoded features

**Memory Considerations:**
- The dtype conversions above cut memory by about 9% (154.42 MB → 140.24 MB); casting the float64 sin/cos features to float32 would halve their footprint
- For very large datasets, consider storing only the most important features
- Parquet format with compression is highly efficient for mixed types
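
Downcasting float64 columns to float32 could be sketched as follows. The example uses toy columns; selecting columns by float64 dtype is an assumption about how one might apply it to `df_data`.

```python
import numpy as np
import pandas as pd

t = np.arange(0, 1000)
demo = pd.DataFrame({
    "Fourier_Weekly_Sin_1": np.sin(2 * np.pi * t / 7),
    "Fourier_Weekly_Cos_1": np.cos(2 * np.pi * t / 7),
})

before = demo.memory_usage(index=False).sum()

# Sin/cos values lie in [-1, 1], where float32 keeps roughly 7 significant digits
float_cols = demo.select_dtypes(include="float64").columns
demo = demo.astype({col: "float32" for col in float_cols})

after = demo.memory_usage(index=False).sum()
print(f"{before} bytes -> {after} bytes")
```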

**Validation:**
- Use time-based splits (not random) for temporal data
- Ensure test set is chronologically after training set
- Watch for concept drift across years
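
A time-based split can be sketched in a few lines. The helper name `chronological_split` and the 80/20 ratio are assumptions for illustration, and the toy frame stands in for the exported dataset:

```python
import pandas as pd


def chronological_split(df, date_col, test_fraction=0.2):
    """Split so that no test row is dated before any training row.

    Hypothetical helper; the default 80/20 ratio is an assumption.
    """
    ordered = df.sort_values(date_col, kind="stable")
    cutoff = int(len(ordered) * (1 - test_fraction))
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


demo = pd.DataFrame({
    "ArrivalDate": pd.date_range("2015-07-01", periods=10, freq="30D"),
    "IsCanceled": [0, 1, 0, 0, 1, 0, 1, 1, 0, 0],
}).sample(frac=1, random_state=0)  # shuffle to mimic unordered input

train, test = chronological_split(demo, "ArrivalDate")
print(train["ArrivalDate"].max(), test["ArrivalDate"].min())
```

When many bookings share an arrival date, a row-count cutoff can place the same date on both sides of the split; cutting at a fixed date avoids that.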