# Calculating Hotel Occupancy

---

Daily occupancy is a key factor in understanding booking patterns and can offer valuable insights into customer behavior, revenue management, and the likelihood of cancellations.

However, the dataset does not directly provide the occupancy of each hotel on each date. Instead, it records an arrival date and a departure date for each reservation.

I will calculate the number of active reservations per day from those dates, then attach the occupancy at arrival and at departure to each booking. These derived features will support further analysis, including modeling guest behavior, forecasting occupancy rates, and predicting cancellations and revenue.

---

# Import Packages and Read Data

In [None]:
## Data Handling
import datetime as dt
import pandas as pd

## Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')

## Load Pre-Reviewed Data

In [None]:
data_path = '../../data/3.1_temporally_updated_data.parquet'
df_data = pd.read_parquet(data_path)
df_data

# Calculate Daily Occupancies

#### Workflow Walkthrough


**1. Generate Counts for Arrivals and Departures**

```python
arrivals = df_data_h1['ArrivalDate'].value_counts().rename('count')
departures = df_data_h1['DepartureDate'].value_counts().rename('count')
```
- Counts how many reservations start (`arrivals`) and end (`departures`) on each date. The `value_counts()` method tallies occurrences of each date, and `rename('count')` changes the Series name to `'count'`, which aids in clarity for later operations.
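To make the shape of these two Series concrete, here is a minimal sketch on a hypothetical three-row frame (the `toy` data and its dates are invented for illustration; only the column names come from the notebook):

```python
import pandas as pd

# Hypothetical toy reservations, used only for illustration.
toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2024-01-01', '2024-01-01', '2024-01-02']),
    'DepartureDate': pd.to_datetime(['2024-01-02', '2024-01-04', '2024-01-04']),
})

# One row per distinct date, valued by how many reservations share that date.
arrivals = toy['ArrivalDate'].value_counts().rename('count')
departures = toy['DepartureDate'].value_counts().rename('count')

print(arrivals.sort_index())
print(departures.sort_index())
```

Each result is indexed by date, so the two can later be combined on that index.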

**2. Combine Arrival and Departure Counts**

```python
df_counts = pd.concat([arrivals, -departures]).sort_index().reset_index()
df_counts.columns = ['Date', 'Count']
```
- Combines the arrivals and departures into a single DataFrame, `df_counts`, with arrivals contributing positively to the count and departures negatively (indicating the end of a reservation). The data is then sorted by date.

**3. Aggregate Counts on the Same Date**

```python
df_counts = df_counts.groupby('Date').sum()
```
- Because a date can have both arrivals and departures, the combined data can contain two rows for the same date. This step sums the counts for each date so that every date appears exactly once. The next step depends on this: `reindex` raises an error when the existing index contains duplicate labels.
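The duplicate-label problem can be reproduced on a small scale. The sketch below uses hypothetical counts in which one date has both an arrival and a departure, and it works on Series directly rather than on the notebook's `Date`/`Count` DataFrame:

```python
import pandas as pd

# Hypothetical toy counts: 2024-01-02 has both an arrival and a departure.
arrivals = pd.Series([2, 1], index=pd.to_datetime(['2024-01-01', '2024-01-02']), name='count')
departures = pd.Series([1, 2], index=pd.to_datetime(['2024-01-02', '2024-01-04']), name='count')

combined = pd.concat([arrivals, -departures]).sort_index()
print(combined.index.is_unique)  # False: 2024-01-02 appears twice

full_range = pd.date_range(combined.index.min(), combined.index.max())

# Reindexing a non-unique index is rejected by pandas.
reindex_failed = False
try:
    combined.reindex(full_range, fill_value=0)
except ValueError as err:
    reindex_failed = True
    print(f'reindex failed: {err}')

# Summing per date collapses the duplicates into one net change per day.
net_change = combined.groupby(level=0).sum()
print(net_change.index.is_unique)  # True
print(net_change.reindex(full_range, fill_value=0))
```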

**4. Generate a Complete Date Range and Reindex**

```python
date_range = pd.date_range(start=df_counts.index.min(), end=df_counts.index.max())
df_counts = df_counts.reindex(date_range, fill_value=0)
```
- Creates a continuous range of dates covering the entire period from the earliest to the latest date in `df_counts`. It then reindexes `df_counts` to include every date in this range, filling any dates without data with `0`, ensuring there's a record for every single day in the period.
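As a small illustration of the gap filling, assume a hypothetical `df_counts` with no arrivals or departures on one day:

```python
import pandas as pd

# Hypothetical net changes with no activity recorded on 2024-01-03.
df_counts = pd.DataFrame(
    {'Count': [2, 1, -3]},
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-04']),
)

date_range = pd.date_range(start=df_counts.index.min(), end=df_counts.index.max())
df_filled = df_counts.reindex(date_range, fill_value=0)

print(df_filled)  # four rows; 2024-01-03 is present with Count 0
```

A net change of `0` leaves the running total unchanged on quiet days, so the filled rows do not alter the later cumulative sum.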

**5. Calculate Cumulative Sum for Active Reservations**

```python
df_counts['Active_Reservations'] = df_counts['Count'].cumsum()
```
- Computes the cumulative sum of the daily net reservation counts (`Count`). This step effectively calculates the total number of active reservations for each date by adding up the arrivals and subtracting the departures as they occur over time.
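One consequence of subtracting a departure on the departure date itself is that a reservation counts as active on the nights stayed, but not on the day the guest leaves. A minimal sketch with a single hypothetical reservation:

```python
import pandas as pd

# One hypothetical reservation: arrives 2024-01-01, departs 2024-01-03 (two nights).
net_change = pd.Series(
    [1, 0, -1],
    index=pd.date_range('2024-01-01', periods=3),
    name='Count',
)

active = net_change.cumsum()
print(active.tolist())  # [1, 1, 0]
```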

**Conclusion**

- The final output, `df_counts['Active_Reservations']`, gives the number of active reservations for each date in the range. Aggregating to one row per date before reindexing avoids the duplicate-label error that `reindex` would otherwise raise.

#### Groupby.Sum vs. Cumsum


`groupby().sum()` and `cumsum()` serve two different purposes in calculating the number of active reservations for each date:

**1. GroupBy().sum()**

- **Purpose:** This step aggregates the daily net changes in reservations (arrivals and departures) for each unique date. Since arrivals are counted positively and departures negatively, the sum for each date tells us the net reservation change on that day. 
- **What It Solves:** If, for instance, 5 reservations start (arrive) and 3 end (depart) on a particular date, the net change in reservations for that day would be +2. This calculation consolidates all changes into a single value per date, ensuring there's no duplication of dates in the dataset, which is necessary for the next steps.

**2. cumsum()**

- **Purpose:** The cumulative sum (`cumsum()`) takes these daily net changes and accumulates them over the entire period to calculate the total number of active reservations for each date. It essentially adds up the net changes from the start date, rolling forward, to show how many reservations are active on any given day.
- **What It Solves:** This step provides the running total of active reservations. It accounts for the ongoing balance of reservations as they begin and end over time, showing the total active reservations on each date. This is crucial for understanding the capacity or occupancy on any given day.

**Illustrative Example:**

Let's say you have data for three days:

- **Day 1:** 5 arrivals, 0 departures (net +5)
- **Day 2:** 3 arrivals, 1 departure (net +2)
- **Day 3:** 2 arrivals, 4 departures (net -2)

After `groupby().sum()`, you'd have a net change sequence of [+5, +2, -2].

Applying `cumsum()` to this sequence gives you the total active reservations for each day: [5, 7, 5]. This demonstrates how the occupancy evolves:

- **Day 1:** Starts with 5,
- **Day 2:** Increases to 7,
- **Day 3:** Decreases back to 5.
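The three-day example can be checked directly. In this sketch the calendar dates are hypothetical placeholders; the arrival and departure numbers are the ones listed above:

```python
import pandas as pd

# The three-day example from the text; the dates are placeholders.
days = pd.date_range('2024-01-01', periods=3)
arrivals = pd.Series([5, 3, 2], index=days)
departures = pd.Series([0, 1, 4], index=days)

changes = pd.concat([arrivals, -departures]).reset_index()
changes.columns = ['Date', 'Count']

net_change = changes.groupby('Date')['Count'].sum()
active = net_change.cumsum()

print(net_change.tolist())  # [5, 2, -2]
print(active.tolist())      # [5, 7, 5]
```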

**Conclusion:**

- **`groupby().sum()`** is used for condensing the dataset into a form where each date has a single net change value, resolving any issues with duplicate dates.
- **`cumsum()`** transforms these net changes into a running total of active reservations, reflecting how the number of active reservations builds up or reduces over time.

### Functions: Calculating Occupancies

In [None]:
def get_counts(dataframe, arrivaldate, departuredate, name = 'count'):
    
    '''Generate counts for arrivals and departures on their respective dates.'''
    
    arrivals = (dataframe[arrivaldate]
                .value_counts()
                .rename(name))
    
    departures = (dataframe[departuredate]
                  .value_counts()
                  .rename(name))
    
    return arrivals, departures


def aggregate_counts_by_date(arrivals, departures):
    
    '''Create a DataFrame from arrivals and departures,
    marking departures as negative.'''
    
    df_counts = (pd.concat([arrivals, -departures])
                 .sort_index()
                 .reset_index())
    
    df_counts.columns = ['Date', 'Count']
    
    return df_counts.groupby('Date').sum()


def generate_date_range(df_counts):
    '''Generate a complete date range covering the period in df_counts.'''
    
    return pd.date_range(start=df_counts.index.min(),
                         end=df_counts.index.max())


def reindex_and_fill_zero(df_counts, date_range):
    '''Reindex the aggregated count DataFrame to include all dates in the range,
    filling missing dates with 0'''
    
    return df_counts.reindex(date_range, fill_value=0)


def calculate_daily_active_res(df_counts):
    '''Calculate the cumulative sum to determine active reservations for each date'''
   
    return df_counts['Count'].cumsum()


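def calculate_daily_occupancy(dataframe, arrivaldate, departuredate, name = 'count'):
    
    '''Run the full pipeline: count arrivals and departures, net them per date,
    fill in missing dates, and return the daily number of active reservations.'''
    
    arrivals, departures = get_counts(dataframe, arrivaldate, departuredate, name = name)
    
    daily_counts = aggregate_counts_by_date(arrivals, departures)
    
    date_range = generate_date_range(daily_counts)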
    
    df_reindexed = reindex_and_fill_zero(daily_counts, date_range)
    
    return calculate_daily_active_res(df_reindexed)
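One way to gain confidence in this pipeline is to compare it against a slow, direct count on a small sample. The sketch below is not part of the original notebook: the `toy` frame is hypothetical, and the pipeline is written inline (same technique as `calculate_daily_occupancy`) so the block runs on its own.

```python
import pandas as pd

# Hypothetical toy reservations; column names mirror the notebook's.
toy = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2024-01-01', '2024-01-01', '2024-01-03', '2024-01-06']),
    'DepartureDate': pd.to_datetime(['2024-01-03', '2024-01-02', '2024-01-07', '2024-01-08']),
})

# Net change per date, then a running total over a gap-free date range.
net_change = (pd.concat([toy['ArrivalDate'].value_counts(),
                         -toy['DepartureDate'].value_counts()])
              .groupby(level=0).sum())
all_days = pd.date_range(net_change.index.min(), net_change.index.max())
fast = net_change.reindex(all_days, fill_value=0).cumsum()

# Brute force: count reservations with arrival <= day < departure.
slow = pd.Series(
    [((toy['ArrivalDate'] <= day) & (toy['DepartureDate'] > day)).sum() for day in all_days],
    index=all_days,
)

assert (fast == slow).all()
print(fast.tolist())  # [2, 1, 1, 1, 1, 2, 1, 0]
```

The brute-force version scales with days times reservations, so it is only practical as a spot check on a sample.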

# Calculating Occupancy for Hotel 1

In [None]:
## Subset the data

hotel = 'H1'

hotel_filter = (df_data['HotelNumber'] == hotel)

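# .copy() gives an independent frame, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning
df_data_h1 = df_data[hotel_filter].copy()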

df_data_h1

## Calculate Daily and Max Occupancies

In [None]:
df_counts = calculate_daily_occupancy(df_data_h1, 'ArrivalDate', 'DepartureDate')
df_counts.name = 'Active_Reservations'
df_counts.head(10)

In [None]:
# Define active_reservations
active_reservations = df_counts

# Find the maximum occupancy
max_occupancy = active_reservations.max()
max_occupancy
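Note that `max_occupancy` here is the highest number of simultaneously active reservations observed in the data, which serves as a stand-in for capacity rather than a known room count. If it helps to see when that peak occurred, a sketch along these lines could be used (the series values are hypothetical):

```python
import pandas as pd

# Hypothetical daily active-reservation counts.
active_reservations = pd.Series(
    [3, 7, 5, 7, 2],
    index=pd.date_range('2024-01-01', periods=5),
    name='Active_Reservations',
)

max_occupancy = active_reservations.max()
first_peak_date = active_reservations.idxmax()  # first date on which the maximum occurs
peak_dates = active_reservations.index[active_reservations == max_occupancy]

print(max_occupancy, first_peak_date.date(), len(peak_dates))  # 7 2024-01-02 2
```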

## Append Occupancies to Subset Data

In [None]:
# Map the occupancy on arrival and departure dates to each reservation
df_data_h1.loc[:, 'occupancy_at_arrival'] = df_data_h1.loc[:, 'ArrivalDate'].map(active_reservations)
df_data_h1.loc[:, 'occupancy_at_departure'] = df_data_h1.loc[:, 'DepartureDate'].map(active_reservations)

# Express these occupancies as a fraction (0 to 1) of the maximum observed occupancy
df_data_h1.loc[:, 'occupancy_pct_at_arrival'] = (df_data_h1.loc[:, 'occupancy_at_arrival'] / max_occupancy)
df_data_h1.loc[:, 'occupancy_pct_at_departure'] = (df_data_h1.loc[:, 'occupancy_at_departure'] / max_occupancy)

df_data_h1
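The mapping step relies on `Series.map` accepting another Series and looking each date up in that Series' index. A minimal sketch with hypothetical data shows the lookup, including what happens when a date is absent from the occupancy index:

```python
import pandas as pd

# Hypothetical occupancy series and reservations.
active_reservations = pd.Series(
    [2, 4, 3],
    index=pd.date_range('2024-01-01', periods=3),
    name='Active_Reservations',
)
reservations = pd.DataFrame({
    'ArrivalDate': pd.to_datetime(['2024-01-01', '2024-01-02']),
    'DepartureDate': pd.to_datetime(['2024-01-03', '2024-01-05']),
})

max_occupancy = active_reservations.max()

# Dates are looked up in the index of active_reservations; missing dates give NaN.
reservations['occupancy_at_arrival'] = reservations['ArrivalDate'].map(active_reservations)
reservations['occupancy_at_departure'] = reservations['DepartureDate'].map(active_reservations)
reservations['occupancy_pct_at_arrival'] = reservations['occupancy_at_arrival'] / max_occupancy

print(reservations)
```

In the notebook itself every arrival and departure date is part of the occupancy index by construction, so a `NaN` after mapping would suggest that the wrong occupancy series was used.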

# Calculating Occupancy for Hotel 2

In [None]:
hotel = 'H2'

hotel_filter = (df_data['HotelNumber'] == hotel)

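# .copy() gives an independent frame, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning
df_data_h2 = df_data[hotel_filter].copy()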

df_counts_h2 = calculate_daily_occupancy(df_data_h2, 'ArrivalDate', 'DepartureDate')
df_counts_h2.name = 'Active_Reservations'
df_counts_h2

## Calculate Daily and Max Occupancies

In [None]:
# Define active_reservations (use the Hotel 2 series, not Hotel 1's df_counts)
active_reservations = df_counts_h2

# Find the maximum occupancy
max_occupancy = active_reservations.max()

## Append Occupancies to Subset Data

In [None]:
# Map the occupancy on arrival and departure dates to each reservation
df_data_h2.loc[:, 'occupancy_at_arrival'] = df_data_h2.loc[:, 'ArrivalDate'].map(active_reservations)
df_data_h2.loc[:, 'occupancy_at_departure'] = df_data_h2.loc[:, 'DepartureDate'].map(active_reservations)

# Express these occupancies as a fraction (0 to 1) of the maximum observed occupancy
df_data_h2.loc[:, 'occupancy_pct_at_arrival'] = (df_data_h2.loc[:, 'occupancy_at_arrival'] / max_occupancy)
df_data_h2.loc[:, 'occupancy_pct_at_departure'] = (df_data_h2.loc[:, 'occupancy_at_departure'] / max_occupancy)

df_data_h2
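Because the Hotel 1 and Hotel 2 cells repeat the same steps with shared variable names, it is easy to reuse a value from the wrong hotel. One possible alternative, sketched here under assumptions, wraps the steps in a function and applies it per hotel. `daily_active` and `add_occupancy` are hypothetical helpers, with `daily_active` a compact stand-in for `calculate_daily_occupancy`; the toy data is invented.

```python
import pandas as pd

def daily_active(df, arrival='ArrivalDate', departure='DepartureDate'):
    """Hypothetical compact stand-in for calculate_daily_occupancy."""
    net = (pd.concat([df[arrival].value_counts(), -df[departure].value_counts()])
           .groupby(level=0).sum())
    days = pd.date_range(net.index.min(), net.index.max())
    return net.reindex(days, fill_value=0).cumsum()

def add_occupancy(df):
    """Return a copy of one hotel's reservations with occupancy columns added."""
    df = df.copy()
    active = daily_active(df)
    peak = active.max()
    for when, col in [('arrival', 'ArrivalDate'), ('departure', 'DepartureDate')]:
        df[f'occupancy_at_{when}'] = df[col].map(active)
        df[f'occupancy_pct_at_{when}'] = df[f'occupancy_at_{when}'] / peak
    return df

# Hypothetical toy data with two hotels.
toy = pd.DataFrame({
    'HotelNumber': ['H1', 'H1', 'H2'],
    'ArrivalDate': pd.to_datetime(['2024-01-01', '2024-01-01', '2024-01-02']),
    'DepartureDate': pd.to_datetime(['2024-01-03', '2024-01-02', '2024-01-04']),
})

# Each hotel is processed with its own occupancy series and its own peak.
full = pd.concat([add_occupancy(group) for _, group in toy.groupby('HotelNumber')])
print(full)
```

Keeping the occupancy series and peak local to the function means neither can leak from one hotel to the other.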

# Combine Subset Datasets into Single DataFrame

In [None]:
full_dataset = pd.concat([df_data_h1, df_data_h2], axis = 0)

full_dataset.to_parquet('../../data/3.2_data_with_occupancies.parquet', compression = 'zstd')
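Before relying on the saved file, a few sanity checks on the combined frame may be worthwhile. The sketch below is a hypothetical helper, not part of the original notebook; in practice `expected_rows` could be `len(df_data)`, which would also reveal any rows belonging to hotels other than `H1` and `H2`.

```python
import pandas as pd

def check_occupancy_columns(df, expected_rows):
    """Hypothetical sanity check; raises AssertionError on the first failed condition."""
    pct_cols = ['occupancy_pct_at_arrival', 'occupancy_pct_at_departure']
    assert len(df) == expected_rows, 'row count changed during the split and concat'
    assert df[pct_cols].notna().all().all(), 'some dates did not map to an occupancy'
    in_range = df[pct_cols].ge(0).all().all() and df[pct_cols].le(1).all().all()
    assert in_range, 'occupancy fraction outside [0, 1]'
    return True

# Hypothetical toy frame standing in for full_dataset.
toy = pd.DataFrame({
    'occupancy_pct_at_arrival': [0.5, 1.0],
    'occupancy_pct_at_departure': [0.25, 0.0],
})

print(check_occupancy_columns(toy, expected_rows=2))  # True
```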

# **Final Review**

---

Calculating occupancy at the time of arrival and departure gives each reservation a view of how busy the hotel was at those two points in the stay. The `occupancy_pct_*` columns are stored as fractions (0 to 1) of the peak number of active reservations observed in the data, not of physical room capacity. These values show how availability fluctuates over time and may help explain guest behavior, including cancellations.

These occupancy metrics will play an important role in refining the predictive models and improving the accuracy of forecasts related to booking patterns, cancellations, and revenue management. The addition of these features strengthens the dataset, making it more robust for the next stages of analysis and modeling.

---