# Dataset Explosion

---

**Introduction**

In this section, I focus on transforming the original reservation data to enhance our ability to analyze and model hotel occupancy and reservation patterns.

The process begins by engineering the arrival date of each reservation from the existing features for year, month, and day of month, which gives a precise starting point for each guest's stay. Next, I calculate the length of stay by summing the weekday nights and weekend nights, which gives the total number of nights for each reservation.

With the arrival date and length of stay determined, I convert the length of stay to a timedelta feature and add it to the arrival date to obtain the departure date for each reservation. I then generate the series of dates covered by each stay using the `pd.date_range` function; the range includes both the arrival date and the departure date, so a stay of n nights yields n + 1 dates. Finally, I apply the `DataFrame.explode` method, which converts each reservation into separate rows, one for each date of the guest's stay.

This process of expanding the dataset allows me to analyze occupancy and other time-sensitive metrics on a daily basis, rather than just at the reservation level. By doing so, I aim to extract more granular insights and features that can be leveraged in subsequent modeling tasks, improving my model's ability to predict key outcomes such as cancellations.
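
To make the sequence of steps concrete, here is a minimal sketch of the same pipeline on a two-row toy frame. The values and the names `toy` and `toy_exploded` are illustrative assumptions, not rows from the real dataset; the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy reservations (illustrative values only, not taken from the real data)
toy = pd.DataFrame({
    'ArrivalDateYear': [2015, 2015],
    'ArrivalDateMonth': ['July', 'July'],
    'ArrivalDateDayOfMonth': [1, 3],
    'StaysInWeekendNights': [0, 1],
    'StaysInWeekNights': [2, 0],
})

# 1. Arrival date from year, month name, and day of month
toy['ArrivalDate'] = pd.to_datetime(
    toy['ArrivalDateYear'].astype(str) + '-'
    + toy['ArrivalDateMonth'] + '-'
    + toy['ArrivalDateDayOfMonth'].astype(str),
    format='%Y-%B-%d')

# 2. Length of stay in nights, 3. departure date
toy['LoS_Numeric'] = toy['StaysInWeekendNights'] + toy['StaysInWeekNights']
toy['DepartureDate'] = toy['ArrivalDate'] + pd.to_timedelta(toy['LoS_Numeric'], unit='D')

# 4. One date range per reservation (both endpoints included), then one row per date
toy['Date'] = toy.apply(lambda row: pd.date_range(row['ArrivalDate'],
                                                  row['DepartureDate']),
                        axis=1)
toy_exploded = toy.explode('Date').reset_index(drop=True)
```

In this sketch the first reservation (2 nights) contributes three rows and the second (1 night) contributes two.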

---

# Import Packages and Load Data

In [1]:
import pandas as pd

In [2]:
data = pd.read_parquet('../../data/source/full_data.parquet')
data

Unnamed: 0,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,...,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber
0,342,2015,July,27,1,0,0,2,0.0,0,...,,,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1
1,737,2015,July,27,1,0,0,2,0.0,0,...,,,0,Transient,0.00,0,0,Check-Out,2015-07-01,H1
2,7,2015,July,27,1,0,1,1,0.0,0,...,,,0,Transient,75.00,0,0,Check-Out,2015-07-02,H1
3,13,2015,July,27,1,0,1,1,0.0,0,...,304,,0,Transient,75.00,0,0,Check-Out,2015-07-02,H1
4,14,2015,July,27,1,0,2,2,0.0,0,...,240,,0,Transient,98.00,0,1,Check-Out,2015-07-03,H1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,23,2017,August,35,30,2,5,2,0.0,0,...,394,,0,Transient,96.14,0,0,Check-Out,2017-09-06,H2
119386,102,2017,August,35,31,2,5,3,0.0,0,...,9,,0,Transient,225.43,0,2,Check-Out,2017-09-07,H2
119387,34,2017,August,35,31,2,5,2,0.0,0,...,9,,0,Transient,157.71,0,4,Check-Out,2017-09-07,H2
119388,109,2017,August,35,31,2,5,2,0.0,0,...,89,,0,Transient,104.40,0,0,Check-Out,2017-09-07,H2


# Create ArrivalDate Column Using Existing Features

In [3]:
arrival_date_cols = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth']

data[arrival_date_cols]

Unnamed: 0,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth
0,2015,July,1
1,2015,July,1
2,2015,July,1
3,2015,July,1
4,2015,July,1
...,...,...,...
119385,2017,August,30
119386,2017,August,31
119387,2017,August,31
119388,2017,August,31


In [4]:
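## Combine the year, month name, and day columns into a single datetime column.
## The explicit format ('%B' = full month name) avoids per-row format inference.
data['ArrivalDate'] = pd.to_datetime(
                        (data[arrival_date_cols]
                         .astype(str)
                         .agg('-'.join,
                              axis=1)),
                        format='%Y-%B-%d')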
data['ArrivalDate']

0        2015-07-01
1        2015-07-01
2        2015-07-01
3        2015-07-01
4        2015-07-01
            ...    
119385   2017-08-30
119386   2017-08-31
119387   2017-08-31
119388   2017-08-31
119389   2017-08-29
Name: ArrivalDate, Length: 119390, dtype: datetime64[ns]
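
Because the date is assembled from three separate columns, it can be worth checking that every combination is a real calendar date. The following is a minimal sketch under assumptions: the frame `sample` and its values are made up for illustration (the second row is deliberately impossible), and the check relies on `errors='coerce'` turning unparseable strings into `NaT`.

```python
import pandas as pd

# Illustrative rows only; 30 February does not exist
sample = pd.DataFrame({
    'ArrivalDateYear': [2016, 2016],
    'ArrivalDateMonth': ['February', 'February'],
    'ArrivalDateDayOfMonth': [29, 30],
})

date_strings = (sample['ArrivalDateYear'].astype(str) + '-'
                + sample['ArrivalDateMonth'] + '-'
                + sample['ArrivalDateDayOfMonth'].astype(str))

# '%B' matches the full English month name; invalid dates become NaT
parsed = pd.to_datetime(date_strings, format='%Y-%B-%d', errors='coerce')

# Number of rows whose components do not form a valid date
n_invalid = int(parsed.isna().sum())
```

A non-zero `n_invalid` on the real data would point to rows that need inspection before the later steps.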

# Calculate Length of Stay (LoS)

## As Numeric

In [5]:
data['LoS_Numeric'] = data[['StaysInWeekendNights', 'StaysInWeekNights']].sum(axis=1)
data['LoS_Numeric']

0         0
1         0
2         1
3         1
4         2
         ..
119385    7
119386    7
119387    7
119388    7
119389    9
Name: LoS_Numeric, Length: 119390, dtype: int64

## As TimeDelta

In [6]:
data['LoS_Days'] = pd.to_timedelta(data['LoS_Numeric'], unit='D')
data['LoS_Days']

0        0 days
1        0 days
2        1 days
3        1 days
4        2 days
          ...  
119385   7 days
119386   7 days
119387   7 days
119388   7 days
119389   9 days
Name: LoS_Days, Length: 119390, dtype: timedelta64[ns]

# Calculate Departure Date

In [7]:
# DepartureDate = ArrivalDate + length of stay (as a timedelta)
data['DepartureDate'] = data['ArrivalDate'] + data['LoS_Days']
data['DepartureDate']

0        2015-07-01
1        2015-07-01
2        2015-07-02
3        2015-07-02
4        2015-07-03
            ...    
119385   2017-09-06
119386   2017-09-07
119387   2017-09-07
119388   2017-09-07
119389   2017-09-07
Name: DepartureDate, Length: 119390, dtype: datetime64[ns]

In [8]:
data

Unnamed: 0,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,...,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,ArrivalDate,LoS_Numeric,LoS_Days,DepartureDate
0,342,2015,July,27,1,0,0,2,0.0,0,...,0.00,0,0,Check-Out,2015-07-01,H1,2015-07-01,0,0 days,2015-07-01
1,737,2015,July,27,1,0,0,2,0.0,0,...,0.00,0,0,Check-Out,2015-07-01,H1,2015-07-01,0,0 days,2015-07-01
2,7,2015,July,27,1,0,1,1,0.0,0,...,75.00,0,0,Check-Out,2015-07-02,H1,2015-07-01,1,1 days,2015-07-02
3,13,2015,July,27,1,0,1,1,0.0,0,...,75.00,0,0,Check-Out,2015-07-02,H1,2015-07-01,1,1 days,2015-07-02
4,14,2015,July,27,1,0,2,2,0.0,0,...,98.00,0,1,Check-Out,2015-07-03,H1,2015-07-01,2,2 days,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,23,2017,August,35,30,2,5,2,0.0,0,...,96.14,0,0,Check-Out,2017-09-06,H2,2017-08-30,7,7 days,2017-09-06
119386,102,2017,August,35,31,2,5,3,0.0,0,...,225.43,0,2,Check-Out,2017-09-07,H2,2017-08-31,7,7 days,2017-09-07
119387,34,2017,August,35,31,2,5,2,0.0,0,...,157.71,0,4,Check-Out,2017-09-07,H2,2017-08-31,7,7 days,2017-09-07
119388,109,2017,August,35,31,2,5,2,0.0,0,...,104.40,0,0,Check-Out,2017-09-07,H2,2017-08-31,7,7 days,2017-09-07


# Explode the Dataset

## Create Date Range per Reservation

In [9]:
## Ensure ArrivalDate and DepartureDate are in datetime format
data['ArrivalDate'] = pd.to_datetime(data['ArrivalDate'])
data['DepartureDate'] = pd.to_datetime(data['DepartureDate'])

## Create a daily date range for each row.
## pd.date_range includes both endpoints, so a stay of n nights yields n + 1 dates.
data['DateRange'] = data.apply(lambda row: pd.date_range(row['ArrivalDate'],
                                                         row['DepartureDate']),
                               axis=1)

data

Unnamed: 0,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,...,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,ArrivalDate,LoS_Numeric,LoS_Days,DepartureDate,DateRange
0,342,2015,July,27,1,0,0,2,0.0,0,...,0,0,Check-Out,2015-07-01,H1,2015-07-01,0,0 days,2015-07-01,"DatetimeIndex(['2015-07-01'], dtype='datetime6..."
1,737,2015,July,27,1,0,0,2,0.0,0,...,0,0,Check-Out,2015-07-01,H1,2015-07-01,0,0 days,2015-07-01,"DatetimeIndex(['2015-07-01'], dtype='datetime6..."
2,7,2015,July,27,1,0,1,1,0.0,0,...,0,0,Check-Out,2015-07-02,H1,2015-07-01,1,1 days,2015-07-02,"DatetimeIndex(['2015-07-01', '2015-07-02'], dt..."
3,13,2015,July,27,1,0,1,1,0.0,0,...,0,0,Check-Out,2015-07-02,H1,2015-07-01,1,1 days,2015-07-02,"DatetimeIndex(['2015-07-01', '2015-07-02'], dt..."
4,14,2015,July,27,1,0,2,2,0.0,0,...,0,1,Check-Out,2015-07-03,H1,2015-07-01,2,2 days,2015-07-03,"DatetimeIndex(['2015-07-01', '2015-07-02', '20..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,23,2017,August,35,30,2,5,2,0.0,0,...,0,0,Check-Out,2017-09-06,H2,2017-08-30,7,7 days,2017-09-06,"DatetimeIndex(['2017-08-30', '2017-08-31', '20..."
119386,102,2017,August,35,31,2,5,3,0.0,0,...,0,2,Check-Out,2017-09-07,H2,2017-08-31,7,7 days,2017-09-07,"DatetimeIndex(['2017-08-31', '2017-09-01', '20..."
119387,34,2017,August,35,31,2,5,2,0.0,0,...,0,4,Check-Out,2017-09-07,H2,2017-08-31,7,7 days,2017-09-07,"DatetimeIndex(['2017-08-31', '2017-09-01', '20..."
119388,109,2017,August,35,31,2,5,2,0.0,0,...,0,0,Check-Out,2017-09-07,H2,2017-08-31,7,7 days,2017-09-07,"DatetimeIndex(['2017-08-31', '2017-09-01', '20..."
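
Row-wise `apply` builds one `DatetimeIndex` per reservation, which can be slow on a frame of this size. As a possible alternative, the sketch below reaches the same one-row-per-date layout without per-row Python calls, by repeating each reservation and adding a day offset. It is an illustration on toy data, not the method used in this notebook; `toy`, `expanded`, and the `ReservationRow` column are hypothetical names.

```python
import pandas as pd

# Toy reservations (illustrative values only)
toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2015-07-01', '2015-07-03']),
    'LoS_Numeric': [2, 0],
})

# Each reservation touches LoS_Numeric + 1 calendar days
# (arrival through departure, both included)
n_days = (toy['LoS_Numeric'] + 1).to_numpy()

# Repeat every row once per day and keep the original row label as a column
expanded = (toy.loc[toy.index.repeat(n_days)]
            .rename_axis('ReservationRow')
            .reset_index())

# Day offset within each reservation: 0, 1, 2, ...
offset = expanded.groupby('ReservationRow').cumcount()
expanded['Date'] = expanded['ArrivalDate'] + pd.to_timedelta(offset, unit='D')
```

Unlike `explode`, this variant produces a `Date` column that is already `datetime64[ns]`.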


## Explode the Data on `DateRange`

In [10]:
## Explode the DataFrame
exploded_data = data.explode('DateRange')

## Rename the exploded column to 'Date'
exploded_data = exploded_data.rename(columns={'DateRange': 'Date'})

## Reset the index
exploded_data = exploded_data.reset_index(drop=True)

exploded_data

Unnamed: 0,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,...,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,ArrivalDate,LoS_Numeric,LoS_Days,DepartureDate,Date
0,342,2015,July,27,1,0,0,2,0.0,0,...,0,0,Check-Out,2015-07-01,H1,2015-07-01,0,0 days,2015-07-01,2015-07-01
1,737,2015,July,27,1,0,0,2,0.0,0,...,0,0,Check-Out,2015-07-01,H1,2015-07-01,0,0 days,2015-07-01,2015-07-01
2,7,2015,July,27,1,0,1,1,0.0,0,...,0,0,Check-Out,2015-07-02,H1,2015-07-01,1,1 days,2015-07-02,2015-07-01
3,7,2015,July,27,1,0,1,1,0.0,0,...,0,0,Check-Out,2015-07-02,H1,2015-07-01,1,1 days,2015-07-02,2015-07-02
4,13,2015,July,27,1,0,1,1,0.0,0,...,0,0,Check-Out,2015-07-02,H1,2015-07-01,1,1 days,2015-07-02,2015-07-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
528642,205,2017,August,35,29,2,7,2,0.0,0,...,0,2,Check-Out,2017-09-07,H2,2017-08-29,9,9 days,2017-09-07,2017-09-03
528643,205,2017,August,35,29,2,7,2,0.0,0,...,0,2,Check-Out,2017-09-07,H2,2017-08-29,9,9 days,2017-09-07,2017-09-04
528644,205,2017,August,35,29,2,7,2,0.0,0,...,0,2,Check-Out,2017-09-07,H2,2017-08-29,9,9 days,2017-09-07,2017-09-05
528645,205,2017,August,35,29,2,7,2,0.0,0,...,0,2,Check-Out,2017-09-07,H2,2017-08-29,9,9 days,2017-09-07,2017-09-06
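
A quick way to check an explosion like this is to compare the row count with what the date ranges imply: with both endpoints included, each reservation should contribute `LoS_Numeric + 1` rows. The sketch below shows that check on toy data (hypothetical names and values). It also converts the exploded column back to a datetime dtype, since `explode` returns an object-dtype column.

```python
import pandas as pd

# Toy reservations (illustrative values only)
toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2015-07-01', '2015-07-01']),
    'LoS_Numeric': [0, 2],
})
toy['DepartureDate'] = toy['ArrivalDate'] + pd.to_timedelta(toy['LoS_Numeric'], unit='D')
toy['Date'] = toy.apply(lambda row: pd.date_range(row['ArrivalDate'],
                                                  row['DepartureDate']),
                        axis=1)
toy_exploded = toy.explode('Date').reset_index(drop=True)

# Each reservation should yield LoS_Numeric + 1 rows and no missing dates
expected_rows = int((toy['LoS_Numeric'] + 1).sum())
assert len(toy_exploded) == expected_rows
assert toy_exploded['Date'].notna().all()

# explode leaves an object column; restore a proper datetime dtype
toy_exploded['Date'] = pd.to_datetime(toy_exploded['Date'])
```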


# Sort Data by Date, ArrivalDate, DepartureDate

In [11]:
exploded_data = (exploded_data.sort_values(by=['Date',
                                               'ArrivalDate',
                                               'DepartureDate'])
                 .reset_index(drop=True))
exploded_data

Unnamed: 0,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,...,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,ArrivalDate,LoS_Numeric,LoS_Days,DepartureDate,Date
0,342,2015,July,27,1,0,0,2,0.0,0,...,0,0,Check-Out,2015-07-01,H1,2015-07-01,0,0 days,2015-07-01,2015-07-01
1,737,2015,July,27,1,0,0,2,0.0,0,...,0,0,Check-Out,2015-07-01,H1,2015-07-01,0,0 days,2015-07-01,2015-07-01
2,7,2015,July,27,1,0,1,1,0.0,0,...,0,0,Check-Out,2015-07-02,H1,2015-07-01,1,1 days,2015-07-02,2015-07-01
3,13,2015,July,27,1,0,1,1,0.0,0,...,0,0,Check-Out,2015-07-02,H1,2015-07-01,1,1 days,2015-07-02,2015-07-01
4,12,2015,July,27,1,0,1,2,0.0,0,...,0,0,Check-Out,2015-07-02,H1,2015-07-01,1,1 days,2015-07-02,2015-07-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
528642,161,2017,August,35,31,4,10,2,0.0,0,...,0,0,Check-Out,2017-09-14,H1,2017-08-31,14,14 days,2017-09-14,2017-09-12
528643,211,2017,August,35,31,4,10,2,0.0,0,...,0,1,Check-Out,2017-09-14,H1,2017-08-31,14,14 days,2017-09-14,2017-09-13
528644,161,2017,August,35,31,4,10,2,0.0,0,...,0,0,Check-Out,2017-09-14,H1,2017-08-31,14,14 days,2017-09-14,2017-09-13
528645,211,2017,August,35,31,4,10,2,0.0,0,...,0,1,Check-Out,2017-09-14,H1,2017-08-31,14,14 days,2017-09-14,2017-09-14
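
With one row per reservation per date, daily metrics reduce to a group-by on `Date`. The sketch below counts rows per hotel and date on a small hypothetical frame. Note the assumptions: the count includes every row regardless of `ReservationStatus` and also counts the departure date, so it is a raw row count rather than a finished occupancy figure. `ReservationsOnDate` is an illustrative column name.

```python
import pandas as pd

# Toy exploded data (illustrative values only)
toy_exploded = pd.DataFrame({
    'HotelNumber': ['H1', 'H1', 'H1', 'H2'],
    'Date': pd.to_datetime(['2015-07-01', '2015-07-01',
                            '2015-07-02', '2015-07-01']),
})

# Number of reservation rows per hotel and calendar date
daily_rows = (toy_exploded
              .groupby(['HotelNumber', 'Date'])
              .size()
              .rename('ReservationsOnDate')
              .reset_index())
```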


# Save Results

In [14]:
exploded_data.to_parquet('../../data/5.1_dataset_exploded.parquet', compression='zstd')

# exploded_data.to_excel('../../data/5.1_dataset_exploded.xlsx', index = False)