# Pre-Pre-Processing

Before performing any EDA or modeling, I will split my data so that the training set is kept separate from a holdout set.

Since the data is temporal (hotel reservations across multiple years), I will split it chronologically. There are two candidate datetime features to split on: the booking date and the arrival date.

---

# Consulting ChatGPT

I asked ChatGPT to help me decide if one feature would be better than the other. Its responses are as follows:

## Booking Date

### Pros:
> * *Forecasting Lead Time:* Splitting by booking date allows you to model the lead time between booking and arrival, which is valuable for forecasting future demand well in advance.
> * *Real-world Scenario Modeling:* This approach mirrors real-world scenarios more closely, where predictions are made as bookings occur, regardless of the future arrival date.
> * *Dynamic Pricing and Revenue Management:* Particularly useful for applications in dynamic pricing and revenue management, where understanding booking patterns ahead of time can inform strategic decisions.

### Cons:
> * *Increased Complexity:* The time gap between booking and arrival dates introduces additional complexity, as you need to account for changes over time that could affect the stay (e.g., cancellations, modifications).



## Arrival Date

### Pros:
> * *Cleaner Data Relationships:* Utilizing the arrival date might result in cleaner, more straightforward modeling, as it directly ties the prediction to the period of service or product consumption.
> * *Seasonality and Trend Analysis:* Easier to model seasonal trends and effects, as the arrival date directly corresponds to the period being analyzed.
> * *Accuracy in Performance Metrics:* Predictions based on arrival date can be more closely aligned with actual occupancy and revenue, potentially improving model accuracy in terms of performance metrics.

### Cons:

> * *Reduced Forecasting Horizon:* The model may be less effective at predicting bookings well in advance since it's oriented around the arrival date. This could limit its usefulness for long-term planning.
> * *Possible Lag in Actionable Insights:* May not provide as much lead time for implementing strategies based on the predictions, such as staffing or promotional offers, since the focus is on the period closer to the actual stay.

---

# Consultation Conclusions

After considering ChatGPT's suggestions and insights, I will take a more greedy approach and create separate datasets for both the booking and arrival dates. This will give me more flexibility when modeling as I will have different time perspectives to utilize for different purposes (e.g., future forecasting vs. analyzing actualized performance).

---

# Date Preparation

Before I can split the datasets, I need to perform some slight feature engineering. The source datasets do not have an exact datetime feature for either the arrival date or the booking date. I will use the separate Year, Month, and Day of Month features to create an `Arrival_Date` feature, derive a `Booking_Date` feature by subtracting `LeadTime` from it, then use these features for splitting my data.

In [1]:
import pandas as pd

In [2]:
## Sharing 'df_data' to reuse code from a prior notebook

## Maintaining separate hotel data

hotel_number = '1'
# hotel_number = '2'

path = f'./data/H{hotel_number}.parquet'
df_data = pd.read_parquet(path)

df_data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate
0,0,342,2015,July,27,1,0,0,2,0,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,0,737,2015,July,27,1,0,0,2,0,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,0,7,2015,July,27,1,0,1,1,0,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,0,13,2015,July,27,1,0,1,1,0,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,0,14,2015,July,27,1,0,2,2,0,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


In [3]:
## Convert Arrival columns to strings

arrival_date_cols = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth']

arrival_date_cols_str = df_data[arrival_date_cols].astype(str)
arrival_date_cols_str.head()

Unnamed: 0,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth
0,2015,July,1
1,2015,July,1
2,2015,July,1
3,2015,July,1
4,2015,July,1


In [4]:
## Create new column of strings formatted as YYYY-Month-D (e.g., 2015-July-1), then convert to datetime

arrival_date_full_str = arrival_date_cols_str['ArrivalDateYear'] + '-' + \
                        arrival_date_cols_str['ArrivalDateMonth'] + '-' + \
                        arrival_date_cols_str['ArrivalDateDayOfMonth']

arrival_date_dt = pd.to_datetime(arrival_date_full_str, yearfirst = True)
arrival_date_dt.name = 'Arrival_Date'
arrival_date_dt

0       2015-07-01
1       2015-07-01
2       2015-07-01
3       2015-07-01
4       2015-07-01
           ...    
40055   2017-08-31
40056   2017-08-30
40057   2017-08-29
40058   2017-08-31
40059   2017-08-31
Name: Arrival_Date, Length: 40060, dtype: datetime64[ns]
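
The conversion above relies on pandas inferring the layout of strings such as `2015-July-1`. As a minimal sketch, assuming the month column always holds full English month names, the same parse can be written with an explicit format string so that an unexpected value raises an error instead of being guessed. The toy frame below is illustrative and is not the hotel data.

```python
import pandas as pd

# Toy stand-in for the three arrival columns (illustrative values only)
toy = pd.DataFrame({
    'ArrivalDateYear': [2015, 2016, 2017],
    'ArrivalDateMonth': ['July', 'February', 'August'],
    'ArrivalDateDayOfMonth': [1, 29, 31],
})

# Join the parts as 'YYYY-Month-D', e.g. '2015-July-1'
parts = toy.astype(str)
date_str = (parts['ArrivalDateYear'] + '-'
            + parts['ArrivalDateMonth'] + '-'
            + parts['ArrivalDateDayOfMonth'])

# %B = full month name, %d = day of month (zero padding is not required when parsing)
arrival_date = pd.to_datetime(date_str, format='%Y-%B-%d')
arrival_date.name = 'Arrival_Date'
```

Note that `%B` depends on the active locale, so this sketch assumes an English locale.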

In [5]:
## Concatenate new column
df_data = pd.concat([df_data, arrival_date_dt], axis = 1)
df_data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,Arrival_Date
0,0,342,2015,July,27,1,0,0,2,0,...,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,2015-07-01
1,0,737,2015,July,27,1,0,0,2,0,...,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,2015-07-01
2,0,7,2015,July,27,1,0,1,1,0,...,,,0,Transient,75.0,0,0,Check-Out,2015-07-02,2015-07-01
3,0,13,2015,July,27,1,0,1,1,0,...,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02,2015-07-01
4,0,14,2015,July,27,1,0,2,2,0,...,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03,2015-07-01


In [6]:
df_data = df_data.drop(columns=[ 'ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth'])
df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateWeekNumber,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,Meal,Country,...,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,Arrival_Date
0,0,342,27,0,0,2,0,0,BB,PRT,...,,,0,Transient,0.00,0,0,Check-Out,2015-07-01,2015-07-01
1,0,737,27,0,0,2,0,0,BB,PRT,...,,,0,Transient,0.00,0,0,Check-Out,2015-07-01,2015-07-01
2,0,7,27,0,1,1,0,0,BB,GBR,...,,,0,Transient,75.00,0,0,Check-Out,2015-07-02,2015-07-01
3,0,13,27,0,1,1,0,0,BB,GBR,...,304,,0,Transient,75.00,0,0,Check-Out,2015-07-02,2015-07-01
4,0,14,27,0,2,2,0,0,BB,GBR,...,240,,0,Transient,98.00,0,1,Check-Out,2015-07-03,2015-07-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40055,0,212,35,2,8,2,1,0,BB,GBR,...,143,,0,Transient,89.75,0,0,Check-Out,2017-09-10,2017-08-31
40056,0,169,35,2,9,2,0,0,BB,IRL,...,250,,0,Transient-Party,202.27,0,1,Check-Out,2017-09-10,2017-08-30
40057,0,204,35,4,10,2,0,0,BB,IRL,...,250,,0,Transient,153.57,0,3,Check-Out,2017-09-12,2017-08-29
40058,0,211,35,4,10,2,0,0,HB,GBR,...,40,,0,Contract,112.80,0,1,Check-Out,2017-09-14,2017-08-31


In [7]:
## Create timedelta series based on the number of weekend and week nights.
timedelta_wknd = pd.to_timedelta(df_data.loc[:, 'StaysInWeekendNights'], unit = 'D')
timedelta_wk = pd.to_timedelta(df_data.loc[:, 'StaysInWeekNights'], unit = 'D')

In [8]:
## Calculate the departure date by adding the timedeltas to the arrival date
departure_date = df_data.loc[:, 'Arrival_Date'] + timedelta_wk + timedelta_wknd
departure_date.name = 'Departure_Date'
departure_date

0       2015-07-01
1       2015-07-01
2       2015-07-02
3       2015-07-02
4       2015-07-03
           ...    
40055   2017-09-10
40056   2017-09-10
40057   2017-09-12
40058   2017-09-14
40059   2017-09-14
Name: Departure_Date, Length: 40060, dtype: datetime64[ns]
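
To make the arithmetic concrete, here is a small sketch on a toy frame whose values are copied from rows 0, 2, 4, and 40055 of the outputs above. The two night counts are summed into a single stay length, converted to a timedelta, and added to the arrival date. The `total_nights` name is introduced here for illustration only.

```python
import pandas as pd

# Values copied from rows 0, 2, 4, and 40055 shown above
toy = pd.DataFrame({
    'Arrival_Date': pd.to_datetime(['2015-07-01', '2015-07-01', '2015-07-01', '2017-08-31']),
    'StaysInWeekendNights': [0, 0, 0, 2],
    'StaysInWeekNights': [0, 1, 2, 8],
})

# Total length of stay in nights, then shift the arrival date by that many days
total_nights = toy['StaysInWeekendNights'] + toy['StaysInWeekNights']
toy['Departure_Date'] = toy['Arrival_Date'] + pd.to_timedelta(total_nights, unit='D')
```

A stay of zero nights gives a departure date equal to the arrival date, as in the first row.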

In [9]:
## Concatenate with original dataframe
df_data = pd.concat([df_data, departure_date], axis = 1)
df_data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateWeekNumber,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,Meal,Country,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,Arrival_Date,Departure_Date
0,0,342,27,0,0,2,0,0,BB,PRT,...,,0,Transient,0.0,0,0,Check-Out,2015-07-01,2015-07-01,2015-07-01
1,0,737,27,0,0,2,0,0,BB,PRT,...,,0,Transient,0.0,0,0,Check-Out,2015-07-01,2015-07-01,2015-07-01
2,0,7,27,0,1,1,0,0,BB,GBR,...,,0,Transient,75.0,0,0,Check-Out,2015-07-02,2015-07-01,2015-07-02
3,0,13,27,0,1,1,0,0,BB,GBR,...,,0,Transient,75.0,0,0,Check-Out,2015-07-02,2015-07-01,2015-07-02
4,0,14,27,0,2,2,0,0,BB,GBR,...,,0,Transient,98.0,0,1,Check-Out,2015-07-03,2015-07-01,2015-07-03


In [10]:
df_data = df_data.drop(columns=['StaysInWeekendNights', 'StaysInWeekNights'])
df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateWeekNumber,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,Arrival_Date,Departure_Date
0,0,342,27,2,0,0,BB,PRT,Direct,Direct,...,,0,Transient,0.00,0,0,Check-Out,2015-07-01,2015-07-01,2015-07-01
1,0,737,27,2,0,0,BB,PRT,Direct,Direct,...,,0,Transient,0.00,0,0,Check-Out,2015-07-01,2015-07-01,2015-07-01
2,0,7,27,1,0,0,BB,GBR,Direct,Direct,...,,0,Transient,75.00,0,0,Check-Out,2015-07-02,2015-07-01,2015-07-02
3,0,13,27,1,0,0,BB,GBR,Corporate,Corporate,...,,0,Transient,75.00,0,0,Check-Out,2015-07-02,2015-07-01,2015-07-02
4,0,14,27,2,0,0,BB,GBR,Online TA,TA/TO,...,,0,Transient,98.00,0,1,Check-Out,2015-07-03,2015-07-01,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40055,0,212,35,2,1,0,BB,GBR,Offline TA/TO,TA/TO,...,,0,Transient,89.75,0,0,Check-Out,2017-09-10,2017-08-31,2017-09-10
40056,0,169,35,2,0,0,BB,IRL,Direct,Direct,...,,0,Transient-Party,202.27,0,1,Check-Out,2017-09-10,2017-08-30,2017-09-10
40057,0,204,35,2,0,0,BB,IRL,Direct,Direct,...,,0,Transient,153.57,0,3,Check-Out,2017-09-12,2017-08-29,2017-09-12
40058,0,211,35,2,0,0,HB,GBR,Offline TA/TO,TA/TO,...,,0,Contract,112.80,0,1,Check-Out,2017-09-14,2017-08-31,2017-09-14


In [11]:
leadtime_timedelta = pd.to_timedelta(df_data['LeadTime'], unit = 'D')
leadtime_timedelta

0       342 days
1       737 days
2         7 days
3        13 days
4        14 days
          ...   
40055   212 days
40056   169 days
40057   204 days
40058   211 days
40059   161 days
Name: LeadTime, Length: 40060, dtype: timedelta64[ns]

In [12]:
df_data['Booking_Date'] = df_data['Arrival_Date'] - leadtime_timedelta
df_data['Booking_Date']

0       2014-07-24
1       2013-06-24
2       2015-06-24
3       2015-06-18
4       2015-06-17
           ...    
40055   2017-01-31
40056   2017-03-14
40057   2017-02-06
40058   2017-02-01
40059   2017-03-23
Name: Booking_Date, Length: 40060, dtype: datetime64[ns]
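
As a quick sanity check on this subtraction, the sketch below recomputes the booking date for the first three rows (arrival dates and lead times copied from the outputs above) and then confirms that the gap between arrival and booking recovers the original lead time. The round trip shows that `LeadTime` can be rebuilt from the two date columns if it is needed later. `recovered_lead_time` is a name introduced here for illustration.

```python
import pandas as pd

# Values copied from rows 0, 1, and 2 shown above
toy = pd.DataFrame({
    'Arrival_Date': pd.to_datetime(['2015-07-01', '2015-07-01', '2015-07-01']),
    'LeadTime': [342, 737, 7],
})

# Booking date = arrival date minus lead time (in days)
toy['Booking_Date'] = toy['Arrival_Date'] - pd.to_timedelta(toy['LeadTime'], unit='D')

# Round trip: the day difference should equal the original LeadTime
recovered_lead_time = (toy['Arrival_Date'] - toy['Booking_Date']).dt.days
```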

In [13]:
df_data = df_data.drop(columns = ['LeadTime'])

df_data.head(10)

Unnamed: 0,IsCanceled,ArrivalDateWeekNumber,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,...,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,Arrival_Date,Departure_Date,Booking_Date
0,0,27,2,0,0,BB,PRT,Direct,Direct,0,...,0,Transient,0.0,0,0,Check-Out,2015-07-01,2015-07-01,2015-07-01,2014-07-24
1,0,27,2,0,0,BB,PRT,Direct,Direct,0,...,0,Transient,0.0,0,0,Check-Out,2015-07-01,2015-07-01,2015-07-01,2013-06-24
2,0,27,1,0,0,BB,GBR,Direct,Direct,0,...,0,Transient,75.0,0,0,Check-Out,2015-07-02,2015-07-01,2015-07-02,2015-06-24
3,0,27,1,0,0,BB,GBR,Corporate,Corporate,0,...,0,Transient,75.0,0,0,Check-Out,2015-07-02,2015-07-01,2015-07-02,2015-06-18
4,0,27,2,0,0,BB,GBR,Online TA,TA/TO,0,...,0,Transient,98.0,0,1,Check-Out,2015-07-03,2015-07-01,2015-07-03,2015-06-17
5,0,27,2,0,0,BB,GBR,Online TA,TA/TO,0,...,0,Transient,98.0,0,1,Check-Out,2015-07-03,2015-07-01,2015-07-03,2015-06-17
6,0,27,2,0,0,BB,PRT,Direct,Direct,0,...,0,Transient,107.0,0,0,Check-Out,2015-07-03,2015-07-01,2015-07-03,2015-07-01
7,0,27,2,0,0,FB,PRT,Direct,Direct,0,...,0,Transient,103.0,0,1,Check-Out,2015-07-03,2015-07-01,2015-07-03,2015-06-22
8,1,27,2,0,0,BB,PRT,Online TA,TA/TO,0,...,0,Transient,82.0,0,1,Canceled,2015-05-06,2015-07-01,2015-07-04,2015-04-07
9,1,27,2,0,0,HB,PRT,Offline TA/TO,TA/TO,0,...,0,Transient,105.5,0,0,Canceled,2015-04-22,2015-07-01,2015-07-04,2015-04-17


# Subset Data by Booking_Date and Arrival_Date

In [14]:
df_data['Arrival_Date']

0       2015-07-01
1       2015-07-01
2       2015-07-01
3       2015-07-01
4       2015-07-01
           ...    
40055   2017-08-31
40056   2017-08-30
40057   2017-08-29
40058   2017-08-31
40059   2017-08-31
Name: Arrival_Date, Length: 40060, dtype: datetime64[ns]

In [15]:
df_data = (df_data.set_index(keys = ['Arrival_Date'])
           .sort_index())
df_data

Arrival_Date,IsCanceled,ArrivalDateWeekNumber,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,...,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,Departure_Date,Booking_Date
2015-07-01,0,27,2,0,0,BB,PRT,Direct,Direct,0,...,,0,Transient,0.00,0,0,Check-Out,2015-07-01,2015-07-01,2014-07-24
2015-07-01,0,27,2,0,0,HB,GBR,Offline TA/TO,TA/TO,0,...,,0,Contract,94.95,0,1,Check-Out,2015-07-01,2015-07-08,2015-02-24
2015-07-01,0,27,2,0,0,BB,PRT,Offline TA/TO,TA/TO,0,...,,0,Transient,63.60,1,0,Check-Out,2015-07-08,2015-07-08,2015-04-14
2015-07-01,0,27,2,0,0,BB,IRL,Offline TA/TO,TA/TO,0,...,,0,Contract,79.50,0,0,Check-Out,2015-07-08,2015-07-08,2015-05-14
2015-07-01,1,27,2,0,0,BB,PRT,Online TA,TA/TO,0,...,,0,Transient,107.00,0,2,Canceled,2015-05-11,2015-07-08,2015-05-02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2017-08-31,0,35,2,0,0,BB,NLD,Online TA,TA/TO,0,...,,0,Transient,175.00,0,2,Check-Out,2017-09-02,2017-09-02,2017-08-21
2017-08-31,0,35,2,0,0,BB,GBR,Online TA,TA/TO,0,...,,0,Transient,187.19,0,1,Check-Out,2017-09-05,2017-09-05,2017-06-28
2017-08-31,0,35,2,0,0,BB,GBR,Offline TA/TO,TA/TO,0,...,,0,Contract,116.50,0,0,Check-Out,2017-09-05,2017-09-05,2017-07-11
2017-08-31,1,35,2,0,0,BB,GBR,Online TA,TA/TO,0,...,,0,Transient,174.00,0,1,Canceled,2017-07-24,2017-09-05,2017-07-17


In [17]:
# df_data is indexed by Arrival_Date; hold out the last 90 days of arrivals
max_date = df_data.index.max()
cutoff_date = max_date - pd.Timedelta(days=90)

# Split the dataset
train_df = df_data[df_data.index <= cutoff_date]
holdout_df = df_data[df_data.index > cutoff_date]

In [20]:
train_df = train_df.reset_index(drop=False)
holdout_df = holdout_df.reset_index(drop=False)
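
The split above uses only `Arrival_Date`. Since the plan is to keep a separate split for each date feature, the same logic could be wrapped in a small helper and called once per column. The sketch below is one possible way to do that and is not code from the original notebook: `split_by_date` and its `holdout_days` parameter are hypothetical names, and the toy frame stands in for `df_data`.

```python
import pandas as pd

def split_by_date(df, date_col, holdout_days=90):
    """Split df chronologically on date_col.

    Rows dated within the last `holdout_days` days (relative to the latest
    date in date_col) go to the holdout set; all earlier rows are training.
    """
    cutoff_date = df[date_col].max() - pd.Timedelta(days=holdout_days)
    ordered = df.sort_values(date_col)
    train = ordered[ordered[date_col] <= cutoff_date].reset_index(drop=True)
    holdout = ordered[ordered[date_col] > cutoff_date].reset_index(drop=True)
    return train, holdout

# Toy stand-in for df_data (illustrative values only)
toy = pd.DataFrame({
    'Arrival_Date': pd.to_datetime(['2017-01-15', '2017-06-02', '2017-06-03', '2017-08-31']),
    'Booking_Date': pd.to_datetime(['2016-12-01', '2017-01-10', '2017-05-30', '2017-08-21']),
})

# One split per time perspective
train_arrival, holdout_arrival = split_by_date(toy, 'Arrival_Date')
train_booking, holdout_booking = split_by_date(toy, 'Booking_Date')
```

With a 90-day window and a latest arrival of 2017-08-31, the cutoff falls on 2017-06-02, and the `<=` comparison keeps that day in the training set.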

In [None]:
training_path = f'./data/H{hotel_number}_Training.parquet'
holdout_path = f'./data/H{hotel_number}_Validation.parquet'

train_df.to_parquet(training_path, engine='pyarrow', compression='brotli')
holdout_df.to_parquet(holdout_path, engine='pyarrow', compression='brotli')