# Calculating Hotel Occupancy

---

> Hotel occupancy is a critical factor during the booking process and can provide additional insight into the likelihood of cancellations and help forecast future ADR.
> 
> However, *there's no clear indication of the total number of guest rooms for either hotel.*
>
> I will count the active reservations on each date for each hotel and use the maximum daily count as a placeholder for maximum occupancy.

---

# Import Packages and Read Data

In [1]:
## Automatically reload imported modules before executing code
%load_ext autoreload
%autoreload 2

In [2]:
## Data Handling
import pandas as pd
import numpy as np

import datetime as dt

## Visualizations
import matplotlib.pyplot as plt
import plotly.express as px
# import seaborn as sns

## Custom-made Functions
# from src import eda

In [3]:
## Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 50)
%matplotlib inline

## Read-In Data

---

***NOTE:** Load data for one hotel at a time; occupancies must be calculated for each hotel separately.*

---

In [5]:
# date_column = 'Arrival_Date'
date_column = 'Booking_Date'

# hotel_number = 1
hotel_number = 2

In [6]:
data_path = f'../data/Datasets_for_{date_column}/H{hotel_number}_Training.parquet'

df_data = pd.read_parquet(data_path)
df_data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateWeekNumber,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,PreviousBookingsNotCanceled,ReservedRoomType,AssignedRoomType,BookingChanges,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,Arrival_Date,Departure_Date,Booking_Date
0,1,283,31,3,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,1,Non Refund,1,,0,Transient-Party,84.0,0,0,Canceled,2015-07-02,2015-07-27,2015-07-29,2014-10-17
1,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17
2,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17
3,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17
4,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17


In [7]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71397 entries, 0 to 71396
Data columns (total 29 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   IsCanceled                   71397 non-null  int64         
 1   LeadTime                     71397 non-null  int64         
 2   ArrivalDateWeekNumber        71397 non-null  int64         
 3   Adults                       71397 non-null  int64         
 4   Children                     71393 non-null  float64       
 5   Babies                       71397 non-null  int64         
 6   Meal                         71397 non-null  object        
 7   Country                      71377 non-null  object        
 8   MarketSegment                71397 non-null  object        
 9   DistributionChannel          71397 non-null  object        
 10  IsRepeatedGuest              71397 non-null  int64         
 11  PreviousCancellations        71397 non-nu

# Feature Engineering Based on Dates

---

**IMPORTANT NOTE:**

Much of the date-related feature engineering was moved to the pre-EDA/-cleaning notebook used to subset the data based on either the Arrival Date or Booking Date features.

The code is kept here as backup, but is not intended to be run.

---

## Calculate Arrival Date

In [8]:
# ## Convert Arrival columns to strings

# arrival_date_cols = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth']

# arrival_date_cols_str = df_data[arrival_date_cols].astype(str)
# arrival_date_cols_str.head()

In [9]:
# ## Create new column of strings formatted as YYYY-MM-DD, then convert to datetime

# arrival_date_full_str = arrival_date_cols_str['ArrivalDateYear'] + '-' + \
#                         arrival_date_cols_str['ArrivalDateMonth'] + '-' + \
#                         arrival_date_cols_str['ArrivalDateDayOfMonth']

# arrival_date_dt = pd.to_datetime(arrival_date_full_str, yearfirst = True)
# arrival_date_dt.name = 'Arrival_Date'
# arrival_date_dt

In [10]:
# ## Concatenate new column
# df_data = pd.concat([df_data, arrival_date_dt], axis = 1)
# df_data.head()

In [11]:
# df_data = df_data.drop(columns=[ 'ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth'])
# df_data

## Calculate Departure Date

In [12]:
# ## Testing creation of timedelta series to add additional nights to arrival date
# pd.to_timedelta(df_data.loc[:, 'StaysInWeekendNights'], unit = 'D')

In [13]:
# ## Create timedelta series based on number of weekday/end nights.
# timedelta_wknd = pd.to_timedelta(df_data.loc[:, 'StaysInWeekendNights'], unit = 'D')
# timedelta_wk = pd.to_timedelta(df_data.loc[:, 'StaysInWeekNights'], unit = 'D')

In [14]:
# ## Calculate the departure date by adding the timedeltas to the arrival date
# departure_date = df_data.loc[:, 'Arrival_Date'] + timedelta_wk + timedelta_wknd
# departure_date.name = 'Departure_Date'
# departure_date

In [15]:
# ## Concatenate with original dataframe
# df_data = pd.concat([df_data, departure_date], axis = 1)
# df_data.head()

In [16]:
# df_data = df_data.drop(columns=['StaysInWeekendNights', 'StaysInWeekNights'])
# df_data

## Calculate Booking Date

In [17]:
# leadtime_timedelta = pd.to_timedelta(df_data['LeadTime'], unit = 'D')
# leadtime_timedelta

In [18]:
# df_data['Booking_Date'] = df_data['Arrival_Date'] - leadtime_timedelta
# df_data['Booking_Date']
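
The commented-out cells above can be condensed into a short sketch on toy data. This is only an illustration, not the code from the pre-EDA notebook: the `raw` DataFrame and its values are made up, and it assumes the arrival month is stored as an English month name.

```python
import pandas as pd

# Toy stand-in for the raw columns (hypothetical values).
# Assumption: the month is stored as an English month name.
raw = pd.DataFrame({
    'ArrivalDateYear': [2015, 2016],
    'ArrivalDateMonth': ['July', 'February'],
    'ArrivalDateDayOfMonth': [27, 28],
    'StaysInWeekendNights': [0, 2],
    'StaysInWeekNights': [2, 1],
    'LeadTime': [283, 10],
})

# Arrival date: join year, month name, and day, then parse
date_str = (raw['ArrivalDateYear'].astype(str) + '-'
            + raw['ArrivalDateMonth'] + '-'
            + raw['ArrivalDateDayOfMonth'].astype(str))
raw['Arrival_Date'] = pd.to_datetime(date_str, format='%Y-%B-%d')

# Departure date: arrival plus the total number of nights stayed
nights = raw['StaysInWeekendNights'] + raw['StaysInWeekNights']
raw['Departure_Date'] = raw['Arrival_Date'] + pd.to_timedelta(nights, unit='D')

# Booking date: arrival minus the lead time in days
raw['Booking_Date'] = raw['Arrival_Date'] - pd.to_timedelta(raw['LeadTime'], unit='D')

print(raw[['Arrival_Date', 'Departure_Date', 'Booking_Date']])
```

The first toy row mirrors the first reservation shown earlier (arrival 2015-07-27, lead time 283 days), which yields the same booking date of 2014-10-17.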

## Calculate Day of Week and Day of Year for Booking, Arrival, and Departure Dates

---

**IMPORTANT NOTE:**

The date-related feature engineering resumes at this point.

---

In [19]:
## ISO day of week: 1 = Monday through 7 = Sunday
df_data['Booking_Date_DoW'] = df_data['Booking_Date'].dt.isocalendar().day
df_data['Arrival_Date_DoW'] = df_data['Arrival_Date'].dt.isocalendar().day
df_data['Departure_Date_DoW'] = df_data['Departure_Date'].dt.isocalendar().day

df_data[['Booking_Date_DoW', 'Arrival_Date_DoW', 'Departure_Date_DoW']]

Unnamed: 0,Booking_Date_DoW,Arrival_Date_DoW,Departure_Date_DoW
0,5,1,3
1,5,4,6
2,5,4,6
3,5,4,6
4,5,4,6
...,...,...,...
71392,1,4,5
71393,1,7,2
71394,1,5,6
71395,1,3,5
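
For reference, a small sketch of how the ISO numbering used above compares with pandas' zero-based `dayofweek`. The three dates are taken from rows shown in this notebook; the variable names are only for illustration.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2014-10-17', '2015-07-27', '2017-03-19']))

iso_day = dates.dt.isocalendar().day   # ISO: 1 = Monday ... 7 = Sunday
zero_based = dates.dt.dayofweek        # pandas: 0 = Monday ... 6 = Sunday
names = dates.dt.day_name()            # weekday names

print(pd.DataFrame({'date': dates, 'iso_day': iso_day,
                    'dayofweek': zero_based, 'name': names}))
```

The ISO value is always one more than `dayofweek`, which matters if these features are later compared with values produced by other tools.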


In [20]:
df_data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateWeekNumber,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,PreviousBookingsNotCanceled,ReservedRoomType,AssignedRoomType,BookingChanges,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,Arrival_Date,Departure_Date,Booking_Date,Booking_Date_DoW,Arrival_Date_DoW,Departure_Date_DoW
0,1,283,31,3,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,1,Non Refund,1,,0,Transient-Party,84.0,0,0,Canceled,2015-07-02,2015-07-27,2015-07-29,2014-10-17,5,1,3
1,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6
2,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6
3,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6
4,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6


In [21]:
df_data['Booking_Date_DoY'] = df_data['Booking_Date'].dt.dayofyear
df_data['Arrival_Date_DoY'] = df_data['Arrival_Date'].dt.dayofyear
df_data['Departure_Date_DoY'] = df_data['Departure_Date'].dt.dayofyear

df_data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateWeekNumber,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,PreviousBookingsNotCanceled,ReservedRoomType,AssignedRoomType,BookingChanges,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,Arrival_Date,Departure_Date,Booking_Date,Booking_Date_DoW,Arrival_Date_DoW,Departure_Date_DoW,Booking_Date_DoY,Arrival_Date_DoY,Departure_Date_DoY
0,1,283,31,3,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,1,Non Refund,1,,0,Transient-Party,84.0,0,0,Canceled,2015-07-02,2015-07-27,2015-07-29,2014-10-17,5,1,3,290,208,210
1,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6,290,190,192
2,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6,290,190,192
3,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6,290,190,192
4,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6,290,190,192


## Calculate Daily Occupancies

In [22]:
min_date = df_data['Arrival_Date'].min()
max_date = df_data['Departure_Date'].max()

print(f'The earliest date is: {min_date}.\nThe latest date is: {max_date}.')

The earliest date is: 2015-07-01 00:00:00.
The latest date is: 2017-09-07 00:00:00.


### ChatGPT-Generated Code

#### Walkthrough of Code Below

The steps below break the code into its key stages, explaining each part and its purpose in calculating the number of active hotel reservations for each date:

**1. Sample Data Creation**

```python
df_data = pd.DataFrame({
    'Arrival_Date': ['2023-01-01', '2023-01-02', '2023-01-05', '2023-01-05'],
    'Departure_Date': ['2023-01-04', '2023-01-03', '2023-01-06', '2023-01-07']
})
```
- This step initializes `df_data`, a pandas DataFrame with two columns: `Arrival_Date` and `Departure_Date`. Each row represents a reservation with its arrival and departure dates.

**2. Convert Dates to Datetime Format**

```python
df_data['Arrival_Date'] = pd.to_datetime(df_data['Arrival_Date'])
df_data['Departure_Date'] = pd.to_datetime(df_data['Departure_Date'])
```
- Converts the date columns from strings (or any other format they might be in) to pandas datetime objects, allowing for date arithmetic and other time-series operations.

**3. Generate Counts for Arrivals and Departures**

```python
arrivals = df_data['Arrival_Date'].value_counts().rename('count')
departures = df_data['Departure_Date'].value_counts().rename('count')
```
- Counts how many reservations start (`arrivals`) and end (`departures`) on each date. The `value_counts()` method tallies occurrences of each date, and `rename('count')` changes the Series name to `'count'`, which aids in clarity for later operations.

**4. Combine Arrival and Departure Counts**

```python
df_counts = pd.concat([arrivals, -departures]).sort_index().reset_index()
df_counts.columns = ['Date', 'Count']
```
- Combines the arrivals and departures into a single DataFrame, `df_counts`, with arrivals contributing positively to the count and departures negatively (indicating the end of a reservation). The data is then sorted by date.

**5. Aggregate Counts on the Same Date**

```python
df_counts = df_counts.groupby('Date').sum()
```
- Because the concatenated data can contain two entries for the same date (one for arrivals, one for departures), this step sums all counts for each date. Each date is then unique, which the reindexing in the next step requires: `reindex` raises an error when the index contains duplicate labels.

**6. Generate a Complete Date Range and Reindex**

```python
date_range = pd.date_range(start=df_counts.index.min(), end=df_counts.index.max())
df_counts = df_counts.reindex(date_range, fill_value=0)
```
- Creates a continuous range of dates covering the entire period from the earliest to the latest date in `df_counts`. It then reindexes `df_counts` to include every date in this range, filling any dates without data with `0`, ensuring there's a record for every single day in the period.

**7. Calculate Cumulative Sum for Active Reservations**

```python
df_counts['Active_Reservations'] = df_counts['Count'].cumsum()
```
- Computes the cumulative sum of the daily net reservation counts (`Count`). This step effectively calculates the total number of active reservations for each date by adding up the arrivals and subtracting the departures as they occur over time.

**Conclusion**

- The final output, `df_counts['Active_Reservations']`, shows the total number of active reservations for each date in the range. A reservation counts as active from its arrival date up to, but not including, its departure date. Aggregating by date before the reindexing step keeps each date unique and avoids duplicate-label errors.
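
One way to sanity-check the cumulative-sum approach is to compare it with a brute-force count on the sample data. The sketch below is an illustration only; the helper names `active_by_cumsum` and `active_by_brute_force` are hypothetical, and the first helper uses `Series.sub` with `fill_value=0` in place of the concat-and-groupby steps, which should give the same net change per date.

```python
import pandas as pd

# Sample reservations from the walkthrough above
sample = pd.DataFrame({
    'Arrival_Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-05', '2023-01-05']),
    'Departure_Date': pd.to_datetime(['2023-01-04', '2023-01-03', '2023-01-06', '2023-01-07']),
})


def active_by_cumsum(df):
    """Net daily change (arrivals minus departures), accumulated over a full date range."""
    arrivals = df['Arrival_Date'].value_counts()
    departures = df['Departure_Date'].value_counts()
    net = arrivals.sub(departures, fill_value=0).sort_index()
    full_range = pd.date_range(net.index.min(), net.index.max())
    return net.reindex(full_range, fill_value=0).cumsum().astype(int)


def active_by_brute_force(df):
    """Directly count reservations with arrival <= day < departure."""
    days = pd.date_range(df['Arrival_Date'].min(), df['Departure_Date'].max())
    counts = [int(((df['Arrival_Date'] <= day) & (day < df['Departure_Date'])).sum())
              for day in days]
    return pd.Series(counts, index=days)


fast = active_by_cumsum(sample)
slow = active_by_brute_force(sample)
print(fast.tolist())                   # [1, 2, 1, 0, 2, 1, 0]
print(fast.tolist() == slow.tolist())  # True
```

The brute-force version scans every reservation for every day, so it is only practical for small checks like this one.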

### Code

In [23]:
# Generate counts for arrivals and departures on their respective dates
arrivals = df_data['Arrival_Date'].value_counts().rename('count')
departures = df_data['Departure_Date'].value_counts().rename('count')
arrivals.head(), departures.head()

(Arrival_Date
 2015-10-16    340
 2016-10-13    322
 2016-11-07    320
 2015-09-18    311
 2015-08-14    289
 Name: count, dtype: int64,
 Departure_Date
 2016-06-17    370
 2015-09-30    344
 2015-10-18    330
 2016-11-10    319
 2016-10-16    306
 Name: count, dtype: int64)

In [24]:
# Create a DataFrame from arrivals and departures, marking departures as negative
df_counts = pd.concat([arrivals, -departures]).sort_index().reset_index()
df_counts.columns = ['Date', 'Count']
df_counts

Unnamed: 0,Date,Count
0,2015-07-01,79
1,2015-07-02,49
2,2015-07-03,16
3,2015-07-03,-73
4,2015-07-04,-41
...,...,...
1586,2017-09-03,-25
1587,2017-09-04,-15
1588,2017-09-05,-5
1589,2017-09-06,-3


In [25]:
# Aggregate counts on the same date to avoid duplicate labels
df_counts = df_counts.groupby('Date').sum()
df_counts

Unnamed: 0_level_0,Count
Date,Unnamed: 1_level_1
2015-07-01,79
2015-07-02,49
2015-07-03,-57
2015-07-04,-3
2015-07-05,-11
...,...
2017-09-03,-25
2017-09-04,-15
2017-09-05,-5
2017-09-06,-3


In [26]:
# Generate a complete date range covering the period in df_data
date_range = pd.date_range(start=df_counts.index.min(), end=df_counts.index.max())
date_range

DatetimeIndex(['2015-07-01', '2015-07-02', '2015-07-03', '2015-07-04',
               '2015-07-05', '2015-07-06', '2015-07-07', '2015-07-08',
               '2015-07-09', '2015-07-10',
               ...
               '2017-08-29', '2017-08-30', '2017-08-31', '2017-09-01',
               '2017-09-02', '2017-09-03', '2017-09-04', '2017-09-05',
               '2017-09-06', '2017-09-07'],
              dtype='datetime64[ns]', length=800, freq='D')

In [27]:
# Reindex the aggregated count DataFrame to include all dates in the range, filling missing dates with 0
df_counts = df_counts.reindex(date_range, fill_value=0)
df_counts

Unnamed: 0,Count
2015-07-01,79
2015-07-02,49
2015-07-03,-57
2015-07-04,-3
2015-07-05,-11
...,...
2017-09-03,-25
2017-09-04,-15
2017-09-05,-5
2017-09-06,-3


In [28]:
# Calculate the cumulative sum to determine active reservations for each date
df_counts['Active_Reservations'] = df_counts['Count'].cumsum()

df_counts['Active_Reservations']

2015-07-01     79
2015-07-02    128
2015-07-03     71
2015-07-04     68
2015-07-05     57
             ... 
2017-09-03     26
2017-09-04     11
2017-09-05      6
2017-09-06      3
2017-09-07      0
Freq: D, Name: Active_Reservations, Length: 800, dtype: int64

#### Groupby.Sum vs. Cumsum

The use of both `groupby().sum()` and the `cumsum()` methods serves two different purposes in the process of calculating the total number of active reservations for each date. Here's a clarification of the roles each step plays in the computation:

**1. GroupBy().sum()**

- **Purpose:** This step aggregates the daily net changes in reservations (arrivals and departures) for each unique date. Since arrivals are counted positively and departures negatively, the sum for each date tells us the net reservation change on that day. 
- **What It Solves:** If, for instance, 5 reservations start (arrive) and 3 end (depart) on a particular date, the net change in reservations for that day would be +2. This calculation consolidates all changes into a single value per date, ensuring there's no duplication of dates in the dataset, which is necessary for the next steps.

**2. cumsum()**

- **Purpose:** The cumulative sum (`cumsum()`) takes these daily net changes and accumulates them over the entire period to calculate the total number of active reservations for each date. It essentially adds up the net changes from the start date, rolling forward, to show how many reservations are active on any given day.
- **What It Solves:** This step provides the running total of active reservations. It accounts for the ongoing balance of reservations as they begin and end over time, showing the total active reservations on each date. This is crucial for understanding the capacity or occupancy on any given day.

**Illustrative Example:**

Let's say you have data for three days:

- **Day 1:** 5 arrivals, 0 departures (net +5)
- **Day 2:** 3 arrivals, 1 departure (net +2)
- **Day 3:** 2 arrivals, 4 departures (net -2)

After `groupby().sum()`, you'd have a net change sequence of [+5, +2, -2].

Applying `cumsum()` to this sequence gives you the total active reservations for each day: [5, 7, 5]. This demonstrates how the occupancy evolves:

- **Day 1:** Starts with 5,
- **Day 2:** Increases to 7,
- **Day 3:** Decreases back to 5.

**Conclusion:**

- **`groupby().sum()`** is used for condensing the dataset into a form where each date has a single net change value, resolving any issues with duplicate dates.
- **`cumsum()`** transforms these net changes into a running total of active reservations, reflecting how the number of active reservations builds up or reduces over time.
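
The three-day example above can be reproduced in a few lines. The dates and the `events` DataFrame are hypothetical; arrivals and departures are entered as separate rows so that `groupby().sum()` has something to consolidate.

```python
import pandas as pd

# One row per arrival or departure tally; departures are negative
events = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01',                 # Day 1: 5 arrivals
                            '2023-01-02', '2023-01-02',   # Day 2: 3 arrivals, 1 departure
                            '2023-01-03', '2023-01-03']), # Day 3: 2 arrivals, 4 departures
    'Count': [5, 3, -1, 2, -4],
})

net_change = events.groupby('Date')['Count'].sum()  # one net value per date
active = net_change.cumsum()                        # running total of active reservations

print(net_change.tolist())  # [5, 2, -2]
print(active.tolist())      # [5, 7, 5]
```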

## Adding Arrival/Departure Occupancies to Original Data

In [29]:
df_counts['Active_Reservations'].head(10)

2015-07-01     79
2015-07-02    128
2015-07-03     71
2015-07-04     68
2015-07-05     57
2015-07-06     80
2015-07-07     92
2015-07-08     77
2015-07-09    116
2015-07-10     81
Freq: D, Name: Active_Reservations, dtype: int64

In [30]:
# Define active_reservations
active_reservations = df_counts['Active_Reservations']

# Find the maximum occupancy
max_occupancy = active_reservations.max()

# Map the occupancy on arrival and departure dates to each reservation
df_data['occupancy_at_arrival'] = df_data['Arrival_Date'].map(active_reservations)
df_data['occupancy_at_departure'] = df_data['Departure_Date'].map(active_reservations)

# Convert these occupancies to percentages of the maximum occupancy
df_data['occupancy_pct_at_arrival'] = (df_data['occupancy_at_arrival'] / max_occupancy) * 100
df_data['occupancy_pct_at_departure'] = (df_data['occupancy_at_departure'] / max_occupancy) * 100

df_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateWeekNumber,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,PreviousBookingsNotCanceled,ReservedRoomType,AssignedRoomType,BookingChanges,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,Arrival_Date,Departure_Date,Booking_Date,Booking_Date_DoW,Arrival_Date_DoW,Departure_Date_DoW,Booking_Date_DoY,Arrival_Date_DoY,Departure_Date_DoY,occupancy_at_arrival,occupancy_at_departure,occupancy_pct_at_arrival,occupancy_pct_at_departure
0,1,283,31,3,0.00,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,1,Non Refund,1,,0,Transient-Party,84.00,0,0,Canceled,2015-07-02,2015-07-27,2015-07-29,2014-10-17,5,1,3,290,208,210,244,36,40.20,5.93
1,1,265,28,2,0.00,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.00,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6,290,190,192,116,92,19.11,15.16
2,1,265,28,2,0.00,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.00,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6,290,190,192,116,92,19.11,15.16
3,1,265,28,2,0.00,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.00,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6,290,190,192,116,92,19.11,15.16
4,1,265,28,2,0.00,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.00,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6,290,190,192,116,92,19.11,15.16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71392,0,3,10,1,0.00,0,SC,BRA,Direct,Direct,0,0,0,A,A,0,No Deposit,14,,0,Transient,88.00,0,0,Check-Out,2017-03-10,2017-03-09,2017-03-10,2017-03-06,1,4,5,65,68,69,294,340,48.43,56.01
71393,0,13,12,1,0.00,0,BB,GBR,Online TA,TA/TO,0,0,0,D,D,0,No Deposit,8,,0,Transient,110.18,0,0,Check-Out,2017-03-21,2017-03-19,2017-03-21,2017-03-06,1,7,2,65,78,80,283,319,46.62,52.55
71394,0,53,17,1,0.00,0,BB,FIN,Online TA,TA/TO,0,0,0,A,A,1,No Deposit,9,,0,Transient,126.00,0,1,Check-Out,2017-04-29,2017-04-28,2017-04-29,2017-03-06,1,5,6,65,118,119,440,358,72.49,58.98
71395,0,2,10,1,0.00,0,BB,PRT,Groups,TA/TO,0,0,0,A,A,0,No Deposit,1,,0,Transient,65.00,0,0,Check-Out,2017-03-10,2017-03-08,2017-03-10,2017-03-06,1,3,5,65,67,69,268,340,44.15,56.01
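
A minimal sketch of the mapping step on toy data, to show how `Series.map` behaves with a date-indexed lookup. The `active` and `bookings` objects and their values are made up for illustration.

```python
import pandas as pd

# Toy daily occupancy, indexed by date
active = pd.Series([2, 4, 1],
                   index=pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']))

# Toy reservations; the last arrival date is not in the lookup
bookings = pd.DataFrame({
    'Arrival_Date': pd.to_datetime(['2023-01-02', '2023-01-01', '2023-01-09'])
})

# Look up the occupancy on each arrival date, then scale by the observed maximum
bookings['occupancy_at_arrival'] = bookings['Arrival_Date'].map(active)
bookings['occupancy_pct_at_arrival'] = bookings['occupancy_at_arrival'] / active.max() * 100

print(bookings)
```

Dates missing from the lookup map to `NaN`. Because the denominator is the observed maximum rather than a known room count, the busiest day is 100% by construction, so the percentages are relative to peak demand rather than true capacity.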


In [31]:
df_data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateWeekNumber,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,PreviousBookingsNotCanceled,ReservedRoomType,AssignedRoomType,BookingChanges,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,Arrival_Date,Departure_Date,Booking_Date,Booking_Date_DoW,Arrival_Date_DoW,Departure_Date_DoW,Booking_Date_DoY,Arrival_Date_DoY,Departure_Date_DoY,occupancy_at_arrival,occupancy_at_departure,occupancy_pct_at_arrival,occupancy_pct_at_departure
0,1,283,31,3,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,1,Non Refund,1,,0,Transient-Party,84.0,0,0,Canceled,2015-07-02,2015-07-27,2015-07-29,2014-10-17,5,1,3,290,208,210,244,36,40.2,5.93
1,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6,290,190,192,116,92,19.11,15.16
2,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6,290,190,192,116,92,19.11,15.16
3,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6,290,190,192,116,92,19.11,15.16
4,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6,290,190,192,116,92,19.11,15.16


In [32]:
df_data[['LeadTime', 'Arrival_Date', 'Departure_Date']]

Unnamed: 0,LeadTime,Arrival_Date,Departure_Date
0,283,2015-07-27,2015-07-29
1,265,2015-07-09,2015-07-11
2,265,2015-07-09,2015-07-11
3,265,2015-07-09,2015-07-11
4,265,2015-07-09,2015-07-11
...,...,...,...
71392,3,2017-03-09,2017-03-10
71393,13,2017-03-19,2017-03-21
71394,53,2017-04-28,2017-04-29
71395,2,2017-03-08,2017-03-10


In [33]:
# ## Used to generate a sub-sample of the dataset for inspection or use with ChatGPT
# df_data.loc[:1000,:].to_excel(f'../data/Feature_Engineering/df_data_H{hotel_number}.xlsx', index = False)

# Time Series Metrics and Analysis

In [34]:
# NOTE: df_data has one row per reservation and is not sorted by Arrival_Date,
# so this window covers the previous 7 rows (reservations), not 7 calendar days.
# The calendar-based rolling averages are computed on daily data further below.

# Calculate the 7-row rolling average of occupancy_pct_at_arrival
df_data['occupancy_pct_at_arrival_7d_avg'] = df_data['occupancy_pct_at_arrival'].rolling(window=7, min_periods=1).mean()

# Display the updated DataFrame to verify the calculation
df_data[['Arrival_Date', 'occupancy_pct_at_arrival', 'occupancy_pct_at_arrival_7d_avg']].head()


Unnamed: 0,Arrival_Date,occupancy_pct_at_arrival,occupancy_pct_at_arrival_7d_avg
0,2015-07-27,40.2,40.2
1,2015-07-09,19.11,29.65
2,2015-07-09,19.11,26.14
3,2015-07-09,19.11,24.38
4,2015-07-09,19.11,23.33
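
Note that `rolling(window=7)` on a reservation-level DataFrame counts rows, not calendar days. The sketch below uses toy data (hypothetical names and values, with a window of 3 for brevity) to show how a row-based window differs from a window applied after resampling to one row per day.

```python
import pandas as pd

# Toy reservation-level data: three reservations on the first day
toy = pd.DataFrame({
    'Arrival_Date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-01',
                                    '2023-01-02', '2023-01-03']),
    'occupancy_pct': [10.0, 10.0, 10.0, 40.0, 70.0],
})

# Row-based window: averages the previous 3 reservations
row_based = toy['occupancy_pct'].rolling(window=3, min_periods=1).mean()

# Day-based window: collapse to one value per day first, then roll over 3 days
daily = toy.set_index('Arrival_Date')['occupancy_pct'].resample('D').mean()
day_based = daily.rolling(window=3, min_periods=1).mean()

print(row_based.tolist())  # [10.0, 10.0, 10.0, 20.0, 40.0]
print(day_based.tolist())  # [10.0, 25.0, 40.0]
```

Busy days contribute many rows, so a row-based window covers a shorter and more variable span of time than its size suggests.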


In [35]:
df_data_rollavg = df_data[['Arrival_Date', 'occupancy_pct_at_arrival']].copy()
df_data_rollavg

Unnamed: 0,Arrival_Date,occupancy_pct_at_arrival
0,2015-07-27,40.20
1,2015-07-09,19.11
2,2015-07-09,19.11
3,2015-07-09,19.11
4,2015-07-09,19.11
...,...,...
71392,2017-03-09,48.43
71393,2017-03-19,46.62
71394,2017-04-28,72.49
71395,2017-03-08,44.15


In [36]:
df_data_rollavg = df_data_rollavg.set_index(keys = 'Arrival_Date')
df_data_rollavg

Unnamed: 0_level_0,occupancy_pct_at_arrival
Arrival_Date,Unnamed: 1_level_1
2015-07-27,40.20
2015-07-09,19.11
2015-07-09,19.11
2015-07-09,19.11
2015-07-09,19.11
...,...
2017-03-09,48.43
2017-03-19,46.62
2017-04-28,72.49
2017-03-08,44.15


In [37]:
df_data_rollavg = df_data_rollavg.resample('D').mean()
df_data_rollavg

Unnamed: 0_level_0,occupancy_pct_at_arrival
Arrival_Date,Unnamed: 1_level_1
2015-07-01,13.01
2015-07-02,21.09
2015-07-03,11.70
2015-07-04,11.20
2015-07-05,9.39
...,...
2017-08-27,21.58
2017-08-28,31.96
2017-08-29,29.16
2017-08-30,26.69


In [38]:
df_data_rollavg.index

DatetimeIndex(['2015-07-01', '2015-07-02', '2015-07-03', '2015-07-04',
               '2015-07-05', '2015-07-06', '2015-07-07', '2015-07-08',
               '2015-07-09', '2015-07-10',
               ...
               '2017-08-22', '2017-08-23', '2017-08-24', '2017-08-25',
               '2017-08-26', '2017-08-27', '2017-08-28', '2017-08-29',
               '2017-08-30', '2017-08-31'],
              dtype='datetime64[ns]', name='Arrival_Date', length=793, freq='D')

In [39]:
# df_data_rollavg holds one row per calendar day (after the daily resample),
# so each window below spans that many days.

# Calculate the 3-, 7-, 14-, and 28-day rolling averages of occupancy_pct_at_arrival
df_data_rollavg['occupancy_pct_at_arrival_3d_avg'] = df_data_rollavg['occupancy_pct_at_arrival'].rolling(window=3, min_periods=1).mean()
df_data_rollavg['occupancy_pct_at_arrival_7d_avg'] = df_data_rollavg['occupancy_pct_at_arrival'].rolling(window=7, min_periods=1).mean()
df_data_rollavg['occupancy_pct_at_arrival_14d_avg'] = df_data_rollavg['occupancy_pct_at_arrival'].rolling(window=14, min_periods=1).mean()
df_data_rollavg['occupancy_pct_at_arrival_28d_avg'] = df_data_rollavg['occupancy_pct_at_arrival'].rolling(window=28, min_periods=1).mean()
# df_data_rollavg['occupancy_pct_at_arrival_90d_avg'] = df_data_rollavg['occupancy_pct_at_arrival'].rolling(window=90, min_periods=1).mean()

# Display the updated DataFrame to verify the calculation
df_data_rollavg.head()

Unnamed: 0_level_0,occupancy_pct_at_arrival,occupancy_pct_at_arrival_3d_avg,occupancy_pct_at_arrival_7d_avg,occupancy_pct_at_arrival_14d_avg,occupancy_pct_at_arrival_28d_avg
Arrival_Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2015-07-01,13.01,13.01,13.01,13.01,13.01
2015-07-02,21.09,17.05,17.05,17.05,17.05
2015-07-03,11.7,15.27,15.27,15.27,15.27
2015-07-04,11.2,14.66,14.25,14.25,14.25
2015-07-05,9.39,10.76,13.28,13.28,13.28


In [40]:
# px.line(df_data_rollavg)

In [41]:
rolling_avg_bookings = df_data_rollavg['occupancy_pct_at_arrival_7d_avg']

In [42]:
# Define high-demand threshold as the 75th percentile of the 7-day rolling average
high_demand_threshold = rolling_avg_bookings.quantile(0.75).round(4)
high_demand_threshold

56.6016

In [43]:
# Identify days that are considered high demand
high_demand_days = rolling_avg_bookings[rolling_avg_bookings > high_demand_threshold].index
high_demand_days

DatetimeIndex(['2015-09-20', '2015-09-21', '2015-09-22', '2015-09-23',
               '2015-09-24', '2015-09-25', '2015-09-26', '2015-09-27',
               '2015-09-28', '2015-09-29',
               ...
               '2017-04-14', '2017-04-16', '2017-04-17', '2017-04-18',
               '2017-04-27', '2017-04-28', '2017-04-29', '2017-04-30',
               '2017-05-01', '2017-05-02'],
              dtype='datetime64[ns]', name='Arrival_Date', length=198, freq=None)

In [44]:
# Flag each booking whose booking date falls on a high-demand day
# (1 = high demand, 0 = normal pricing).
# high_demand_days is derived from the arrival-date occupancy averages above.
# Booking_Date must already be in datetime format.
df_data['Dynamic_Pricing_Indicator'] = df_data['Booking_Date'].isin(high_demand_days).astype(int)


In [45]:
df_data['Dynamic_Pricing_Indicator'].value_counts(dropna=False, normalize=True, ascending=False)

Dynamic_Pricing_Indicator
0   0.76
1   0.24
Name: proportion, dtype: float64
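
The thresholding and flagging steps can be illustrated on toy data. This is a sketch under made-up values; `daily_avg`, `high_days`, and `bookings` are hypothetical stand-ins for the rolling average, the high-demand dates, and the reservation data.

```python
import pandas as pd

# Toy 7-day rolling averages, one value per day
daily_avg = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0],
                      index=pd.date_range('2023-01-01', periods=5))

# 75th percentile as the high-demand threshold
threshold = daily_avg.quantile(0.75)

# Strictly greater than the threshold, so a day equal to it is not flagged
high_days = daily_avg[daily_avg > threshold].index

# Toy bookings; the last date falls outside the daily series
bookings = pd.DataFrame({
    'Booking_Date': pd.to_datetime(['2023-01-05', '2023-01-02',
                                    '2023-01-05', '2022-12-31'])
})
bookings['Dynamic_Pricing_Indicator'] = bookings['Booking_Date'].isin(high_days).astype(int)

print(threshold)                                       # 40.0
print(bookings['Dynamic_Pricing_Indicator'].tolist())  # [1, 0, 1, 0]
```

Booking dates outside the range of the daily series are simply left at 0, which applies to any booking made before the earliest arrival date in the data.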

In [46]:
df_data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateWeekNumber,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,PreviousBookingsNotCanceled,ReservedRoomType,AssignedRoomType,BookingChanges,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,Arrival_Date,Departure_Date,Booking_Date,Booking_Date_DoW,Arrival_Date_DoW,Departure_Date_DoW,Booking_Date_DoY,Arrival_Date_DoY,Departure_Date_DoY,occupancy_at_arrival,occupancy_at_departure,occupancy_pct_at_arrival,occupancy_pct_at_departure,occupancy_pct_at_arrival_7d_avg,Dynamic_Pricing_Indicator
0,1,283,31,3,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,1,Non Refund,1,,0,Transient-Party,84.0,0,0,Canceled,2015-07-02,2015-07-27,2015-07-29,2014-10-17,5,1,3,290,208,210,244,36,40.2,5.93,40.2,0
1,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6,290,190,192,116,92,19.11,15.16,29.65,0
2,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6,290,190,192,116,92,19.11,15.16,26.14,0
3,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6,290,190,192,116,92,19.11,15.16,24.38,0
4,1,265,28,2,0.0,0,BB,PRT,Groups,TA/TO,0,1,0,A,A,0,No Deposit,1,,0,Contract,62.0,0,0,Canceled,2015-01-01,2015-07-09,2015-07-11,2014-10-17,5,4,6,290,190,192,116,92,19.11,15.16,23.33,0


In [47]:
df_data.to_parquet(f'../data/Datasets_for_{date_column}/Feature_Engineering/H{hotel_number}_T_Date_Features.parquet', engine='pyarrow', compression='snappy')