# Calculating Hotel Occupancy

---

> Hotel occupancy is a critical factor during the booking process and can provide additional insight into the likelihood of cancellations and into forecasting future average daily rate (ADR).
>
> However, *there's no clear indication of the total number of guest rooms for either hotel.*
>
> I will determine the number of rooms occupied on each date for each hotel; the maximum can then serve as a placeholder for each hotel's maximum occupancy.

---

# Import Packages and Read Data

In [1]:
## Automatically reload imported modules when their source files change
%load_ext autoreload
%autoreload 2

In [2]:
## Data Handling
import pandas as pd
import numpy as np

import datetime as dt

## Visualizations
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

## Custom-made Functions
# from bmc_functions import eda

In [3]:
## Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 50)
%matplotlib inline

In [4]:
data_path_H1 = './data/H1.csv'

data1 = pd.read_csv(data_path_H1)
data1['HotelNumber'] = 1
# data1.head()

In [5]:
data_path_H2 = './data/H2.csv'

data2 = pd.read_csv(data_path_H2)
data2['HotelNumber'] = 2
# data2.head()

In [6]:
## Combine both datasets
df_data = pd.concat([data1, data2], axis = 0).reset_index(drop=True)
df_data.head()

,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,PreviousBookingsNotCanceled,ReservedRoomType,AssignedRoomType,BookingChanges,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber
0,0,342,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1
1,0,737,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1
2,0,7,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1
3,0,13,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1
4,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03,1


In [7]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   IsCanceled                   119390 non-null  int64  
 1   LeadTime                     119390 non-null  int64  
 2   ArrivalDateYear              119390 non-null  int64  
 3   ArrivalDateMonth             119390 non-null  object 
 4   ArrivalDateWeekNumber        119390 non-null  int64  
 5   ArrivalDateDayOfMonth        119390 non-null  int64  
 6   StaysInWeekendNights         119390 non-null  int64  
 7   StaysInWeekNights            119390 non-null  int64  
 8   Adults                       119390 non-null  int64  
 9   Children                     119386 non-null  float64
 10  Babies                       119390 non-null  int64  
 11  Meal                         119390 non-null  object 
 12  Country                      118902 non-null  object 
 13 

# Begin Time Series Resampling

## Calculate Arrival Date

In [8]:
## Convert Arrival columns to strings

arrival_date_cols = ['ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateDayOfMonth']

arrival_date_cols_str = df_data[arrival_date_cols].astype(str)
arrival_date_cols_str.head()

,ArrivalDateYear,ArrivalDateMonth,ArrivalDateDayOfMonth
0,2015,July,1
1,2015,July,1
2,2015,July,1
3,2015,July,1
4,2015,July,1


In [9]:
## Join the columns into strings such as '2015-July-1' (year, month name, day), then convert to datetime

arrival_date_full_str = arrival_date_cols_str['ArrivalDateYear'] + '-' + \
                        arrival_date_cols_str['ArrivalDateMonth'] + '-' + \
                        arrival_date_cols_str['ArrivalDateDayOfMonth']

arrival_date_dt = pd.to_datetime(arrival_date_full_str, yearfirst = True)
arrival_date_dt.name = 'Arrival_Date'
arrival_date_dt

0        2015-07-01
1        2015-07-01
2        2015-07-01
3        2015-07-01
4        2015-07-01
            ...    
119385   2017-08-30
119386   2017-08-31
119387   2017-08-31
119388   2017-08-31
119389   2017-08-29
Name: Arrival_Date, Length: 119390, dtype: datetime64[ns]
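
The cell above lets pandas infer the date layout. As a minimal sketch, and not the notebook's own method, the same conversion can be written with an explicit `format` string so that a malformed row fails loudly instead of being guessed at. The toy rows are illustrative, and the `%B` directive assumes English month names (locale-dependent).

```python
import pandas as pd

# Toy rows mirroring the notebook's three arrival columns (values are illustrative).
sample = pd.DataFrame({
    'ArrivalDateYear': [2015, 2016],
    'ArrivalDateMonth': ['July', 'February'],
    'ArrivalDateDayOfMonth': [1, 29],
})

# Build strings such as '2015-July-1'.
date_strings = (
    sample['ArrivalDateYear'].astype(str) + '-'
    + sample['ArrivalDateMonth'] + '-'
    + sample['ArrivalDateDayOfMonth'].astype(str)
)

# %Y = four-digit year, %B = full month name, %d = day of month.
arrival_dates = pd.to_datetime(date_strings, format='%Y-%B-%d')
arrival_dates.name = 'Arrival_Date'
print(arrival_dates)
```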

In [10]:
## Concatenate new column
df_data = pd.concat([df_data, arrival_date_dt], axis = 1)
df_data.head()

,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,PreviousBookingsNotCanceled,ReservedRoomType,AssignedRoomType,BookingChanges,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,Arrival_Date
0,0,342,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1,2015-07-01
1,0,737,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1,2015-07-01
2,0,7,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1,2015-07-01
3,0,13,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1,2015-07-01
4,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03,1,2015-07-01


## Calculate Departure Date

In [11]:
## Testing creation of timedelta series to add additional nights to arrival date
pd.to_timedelta(df_data.loc[:, 'StaysInWeekendNights'], unit = 'D')

0        0 days
1        0 days
2        0 days
3        0 days
4        0 days
          ...  
119385   2 days
119386   2 days
119387   2 days
119388   2 days
119389   2 days
Name: StaysInWeekendNights, Length: 119390, dtype: timedelta64[ns]

In [12]:
## Create timedelta series from the number of weekend nights and week nights.
timedelta_wknd = pd.to_timedelta(df_data.loc[:, 'StaysInWeekendNights'], unit = 'D')
timedelta_wk = pd.to_timedelta(df_data.loc[:, 'StaysInWeekNights'], unit = 'D')

In [13]:
## Calculate the departure date by adding the timedeltas to the arrival date
departure_date = df_data.loc[:, 'Arrival_Date'] + timedelta_wk + timedelta_wknd
departure_date.name = 'Departure_Date'
departure_date

0        2015-07-01
1        2015-07-01
2        2015-07-02
3        2015-07-02
4        2015-07-03
            ...    
119385   2017-09-06
119386   2017-09-07
119387   2017-09-07
119388   2017-09-07
119389   2017-09-07
Name: Departure_Date, Length: 119390, dtype: datetime64[ns]
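
The departure calculation can also be sketched in one step by summing the two night counts before converting to a timedelta. This is an illustrative variant on made-up rows, not a replacement for the cells above. It also flags zero-night bookings, whose departure date equals their arrival date (as in the first two rows of the real data); these add and subtract on the same day, so they contribute nothing to a daily net count.

```python
import pandas as pd

# Toy bookings; column names follow the notebook, values are illustrative.
stays = pd.DataFrame({
    'Arrival_Date': pd.to_datetime(['2015-07-01', '2015-07-01', '2017-08-30']),
    'StaysInWeekendNights': [0, 0, 2],
    'StaysInWeekNights': [0, 1, 5],
})

# Total length of stay in nights, converted to a timedelta in days.
total_nights = stays['StaysInWeekendNights'] + stays['StaysInWeekNights']
stays['Departure_Date'] = stays['Arrival_Date'] + pd.to_timedelta(total_nights, unit='D')

# Zero-night bookings arrive and depart on the same date.
zero_night = stays['Departure_Date'] == stays['Arrival_Date']
print(stays[['Arrival_Date', 'Departure_Date']])
print(zero_night.tolist())
```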

In [14]:
## Concatenate with original dataframe
df_data = pd.concat([df_data, departure_date], axis = 1)
df_data.head()

,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,Babies,Meal,Country,MarketSegment,DistributionChannel,IsRepeatedGuest,PreviousCancellations,PreviousBookingsNotCanceled,ReservedRoomType,AssignedRoomType,BookingChanges,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate,HotelNumber,Arrival_Date,Departure_Date
0,0,342,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1,2015-07-01,2015-07-01
1,0,737,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,1,2015-07-01,2015-07-01
2,0,7,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1,2015-07-01,2015-07-02
3,0,13,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02,1,2015-07-01,2015-07-02
4,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03,1,2015-07-01,2015-07-03


## Calculate Daily Occupancies

In [15]:
min_date = df_data['Arrival_Date'].min()
max_date = df_data['Departure_Date'].max()

print(f'The earliest date is: {min_date}.\nThe latest date is: {max_date}.')

The earliest date is: 2015-07-01 00:00:00.
The latest date is: 2017-09-14 00:00:00.


### ChatGPT-Generated Code

#### Walkthrough of Code Below

The walkthrough below breaks the code into its key steps and explains how each one contributes to calculating the number of active hotel reservations for each date:

### **1. Sample Data Creation**

```python
df_data = pd.DataFrame({
    'Arrival_Date': ['2023-01-01', '2023-01-02', '2023-01-05', '2023-01-05'],
    'Departure_Date': ['2023-01-04', '2023-01-03', '2023-01-06', '2023-01-07']
})
```
- This step initializes `df_data`, a pandas DataFrame with two columns: `Arrival_Date` and `Departure_Date`. Each row represents a reservation with its arrival and departure dates.

### **2. Convert Dates to Datetime Format**

```python
df_data['Arrival_Date'] = pd.to_datetime(df_data['Arrival_Date'])
df_data['Departure_Date'] = pd.to_datetime(df_data['Departure_Date'])
```
- Converts the date columns from strings (or any other format they might be in) to pandas datetime objects, allowing for date arithmetic and other time-series operations.

### **3. Generate Counts for Arrivals and Departures**

```python
arrivals = df_data['Arrival_Date'].value_counts().rename('count')
departures = df_data['Departure_Date'].value_counts().rename('count')
```
- Counts how many reservations start (`arrivals`) and end (`departures`) on each date. The `value_counts()` method tallies occurrences of each date, and `rename('count')` changes the Series name to `'count'`, which aids in clarity for later operations.
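
One detail worth seeing on a small example: `value_counts()` orders its result by frequency, not by date, which is why a later step sorts the index. The dates below are made up for illustration.

```python
import pandas as pd

# Toy arrival dates; 2023-01-05 appears twice.
arrival_dates = pd.Series(
    pd.to_datetime(['2023-01-05', '2023-01-01', '2023-01-05', '2023-01-02']),
    name='Arrival_Date',
)

# Most frequent date first.
by_frequency = arrival_dates.value_counts().rename('count')

# Chronological order, as needed for time-series work.
by_date = by_frequency.sort_index()
print(by_frequency)
print(by_date)
```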

### **4. Combine Arrival and Departure Counts**

```python
df_counts = pd.concat([arrivals, -departures]).sort_index().reset_index()
df_counts.columns = ['Date', 'Count']
```
- Combines the arrivals and departures into a single DataFrame, `df_counts`, with arrivals contributing positively to the count and departures negatively (indicating the end of a reservation). The data is then sorted by date.

### **5. Aggregate Counts on the Same Date**

```python
df_counts = df_counts.groupby('Date').sum()
```
- Since combining arrivals and departures can produce two entries for the same date (one positive, one negative), this step sums all counts for each date. Each date then appears exactly once, which the reindexing step requires: pandas cannot reindex an axis that contains duplicate labels.
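
The duplicate-label problem can be reproduced on a few made-up events. In this sketch, reindexing fails while one date still carries two entries and succeeds once the entries are summed per date.

```python
import pandas as pd

# Toy signed events: 2023-01-02 has both an arrival entry (+2) and a departure entry (-1).
events = pd.Series(
    [1, 2, -1, -1],
    index=pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03']),
    name='Count',
)
full_range = pd.date_range('2023-01-01', '2023-01-04')

# Reindexing while the index still holds a duplicate date raises ValueError.
try:
    events.reindex(full_range, fill_value=0)
    reindex_failed = False
except ValueError as err:
    reindex_failed = True
    print(f'reindex failed: {err}')

# Summing per date leaves one net value per date, so reindexing now works.
net = events.groupby(level=0).sum()
net_full = net.reindex(full_range, fill_value=0)
print(net_full.tolist())  # [1, 1, -1, 0]
```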

### **6. Generate a Complete Date Range and Reindex**

```python
date_range = pd.date_range(start=df_counts.index.min(), end=df_counts.index.max())
df_counts = df_counts.reindex(date_range, fill_value=0)
```
- Creates a continuous range of dates covering the entire period from the earliest to the latest date in `df_counts`. It then reindexes `df_counts` to include every date in this range, filling any dates without data with `0`, ensuring there's a record for every single day in the period.

### **7. Calculate Cumulative Sum for Active Reservations**

```python
df_counts['Active_Reservations'] = df_counts['Count'].cumsum()
```
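- Computes the cumulative sum of the daily net reservation counts (`Count`). This step effectively calculates the total number of active reservations for each date by adding up the arrivals and subtracting the departures as they occur over time. Because a departure is subtracted on the departure date itself, a reservation counts as active from its arrival date through the night before it departs.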

### **Conclusion**

- The final output, `df_counts['Active_Reservations']`, shows the total number of active reservations for each date in the range. This method is efficient and avoids the problem of duplicate labels by ensuring that each date is unique before the reindexing step, leveraging pandas' capabilities for handling time series data.
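
Putting the steps together, a minimal end-to-end sketch on the sample reservations from step 1 might look as follows. The function name `active_reservations` is hypothetical, and the brute-force comparison is only there as a sanity check that the running total matches a direct count of stays with `arrival <= day < departure`.

```python
import pandas as pd

def active_reservations(arrivals, departures):
    """Return active reservations per calendar day (departure day excluded)."""
    arrivals = pd.to_datetime(pd.Series(arrivals))
    departures = pd.to_datetime(pd.Series(departures))
    # +1 per arrival and -1 per departure, summed to one net change per date.
    net = pd.concat([
        arrivals.value_counts(),
        -departures.value_counts(),
    ]).groupby(level=0).sum()
    # One row per calendar day, then a running total.
    days = pd.date_range(net.index.min(), net.index.max())
    return net.reindex(days, fill_value=0).cumsum()

arr = ['2023-01-01', '2023-01-02', '2023-01-05', '2023-01-05']
dep = ['2023-01-04', '2023-01-03', '2023-01-06', '2023-01-07']
active = active_reservations(arr, dep)

# Brute-force cross-check: count stays with arrival <= day < departure.
a, d = pd.to_datetime(arr), pd.to_datetime(dep)
brute = [int(((a <= day) & (day < d)).sum()) for day in active.index]

print(active.tolist())  # [1, 2, 1, 0, 2, 1, 0]
print(brute)            # [1, 2, 1, 0, 2, 1, 0]
```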

In [17]:
# Generate counts for arrivals and departures on their respective dates
arrivals = df_data['Arrival_Date'].value_counts().rename('count')
departures = df_data['Departure_Date'].value_counts().rename('count')
arrivals.head(), departures.head()

(Arrival_Date
 2015-12-05    448
 2016-11-07    366
 2015-10-16    356
 2016-10-13    344
 2015-09-18    340
 Name: count, dtype: int64,
 Departure_Date
 2015-12-08    452
 2016-06-17    411
 2015-09-20    398
 2015-09-30    392
 2015-10-18    372
 Name: count, dtype: int64)

In [18]:
# Create a DataFrame from arrivals and departures, marking departures as negative
df_counts = pd.concat([arrivals, -departures]).sort_index().reset_index()
df_counts.columns = ['Date', 'Count']
df_counts

,Date,Count
0,2015-07-01,122
1,2015-07-01,-2
2,2015-07-02,93
3,2015-07-02,-7
4,2015-07-03,-83
...,...,...
1594,2017-09-09,-9
1595,2017-09-10,-4
1596,2017-09-11,-2
1597,2017-09-12,-1


In [19]:
# Aggregate counts on the same date to avoid duplicate labels
df_counts = df_counts.groupby('Date').sum()
df_counts

Date,Count
2015-07-01,120
2015-07-02,86
2015-07-03,-27
2015-07-04,36
2015-07-05,-4
...,...
2017-09-09,-9
2017-09-10,-4
2017-09-11,-2
2017-09-12,-1


In [20]:
# Generate a complete date range covering the period in df_data
date_range = pd.date_range(start=df_counts.index.min(), end=df_counts.index.max())
date_range

DatetimeIndex(['2015-07-01', '2015-07-02', '2015-07-03', '2015-07-04',
               '2015-07-05', '2015-07-06', '2015-07-07', '2015-07-08',
               '2015-07-09', '2015-07-10',
               ...
               '2017-09-05', '2017-09-06', '2017-09-07', '2017-09-08',
               '2017-09-09', '2017-09-10', '2017-09-11', '2017-09-12',
               '2017-09-13', '2017-09-14'],
              dtype='datetime64[ns]', length=807, freq='D')

In [21]:
# Reindex the aggregated count DataFrame to include all dates in the range, filling missing dates with 0
df_counts = df_counts.reindex(date_range, fill_value=0)
df_counts

,Count
2015-07-01,120
2015-07-02,86
2015-07-03,-27
2015-07-04,36
2015-07-05,-4
...,...
2017-09-10,-4
2017-09-11,-2
2017-09-12,-1
2017-09-13,0


In [22]:
# Calculate the cumulative sum to determine active reservations for each date
df_counts['Active_Reservations'] = df_counts['Count'].cumsum()

df_counts['Active_Reservations']

2015-07-01    120
2015-07-02    206
2015-07-03    179
2015-07-04    215
2015-07-05    211
             ... 
2017-09-10      5
2017-09-11      3
2017-09-12      2
2017-09-13      2
2017-09-14      0
Freq: D, Name: Active_Reservations, Length: 807, dtype: int64
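
The cells above run on the combined `df_data`, so the running total covers both hotels and every booking, canceled or not. Since the goal stated at the top is a per-hotel maximum, one possible extension is sketched below on toy data. Two things here are assumptions rather than part of the notebook: the helper name `daily_active`, and the decision to drop rows with `IsCanceled == 1` on the grounds that a canceled booking never occupied a room.

```python
import pandas as pd

# Toy bookings; column names follow the notebook, values are made up.
bookings = pd.DataFrame({
    'HotelNumber': [1, 1, 1, 2, 2],
    'IsCanceled': [0, 0, 1, 0, 0],
    'Arrival_Date': pd.to_datetime(
        ['2023-01-01', '2023-01-02', '2023-01-02', '2023-01-01', '2023-01-03']),
    'Departure_Date': pd.to_datetime(
        ['2023-01-04', '2023-01-03', '2023-01-05', '2023-01-02', '2023-01-06']),
})

def daily_active(group):
    """Running count of active reservations for one hotel's bookings."""
    net = pd.concat([
        group['Arrival_Date'].value_counts(),
        -group['Departure_Date'].value_counts(),
    ]).groupby(level=0).sum()
    days = pd.date_range(net.index.min(), net.index.max())
    return net.reindex(days, fill_value=0).cumsum()

# Assumption: canceled bookings never occupied a room, so drop them first.
kept = bookings[bookings['IsCanceled'] == 0]

# Peak daily count per hotel, usable as a placeholder capacity.
max_occupancy = {
    int(hotel): int(daily_active(group).max())
    for hotel, group in kept.groupby('HotelNumber')
}
print(max_occupancy)  # {1: 2, 2: 1}
```

Whether canceled bookings should be excluded depends on how the placeholder will be used, so the filter is easy to remove.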

#### Groupby.Sum vs. Cumsum

`groupby().sum()` and `cumsum()` serve two different purposes in calculating the total number of active reservations for each date. The roles of each step are as follows:

### **1. GroupBy().sum()**

- **Purpose:** This step aggregates the daily net changes in reservations (arrivals and departures) for each unique date. Since arrivals are counted positively and departures negatively, the sum for each date tells us the net reservation change on that day. 
- **What It Solves:** If, for instance, 5 reservations start (arrive) and 3 end (depart) on a particular date, the net change in reservations for that day would be +2. This calculation consolidates all changes into a single value per date, ensuring there's no duplication of dates in the dataset, which is necessary for the next steps.

### **2. cumsum()**

- **Purpose:** The cumulative sum (`cumsum()`) takes these daily net changes and accumulates them over the entire period to calculate the total number of active reservations for each date. It essentially adds up the net changes from the start date, rolling forward, to show how many reservations are active on any given day.
- **What It Solves:** This step provides the running total of active reservations. It accounts for the ongoing balance of reservations as they begin and end over time, showing the total active reservations on each date. This is crucial for understanding the capacity or occupancy on any given day.

### **Illustrative Example:**

Let's say you have data for three days:

- **Day 1:** 5 arrivals, 0 departures (net +5)
- **Day 2:** 3 arrivals, 1 departure (net +2)
- **Day 3:** 2 arrivals, 4 departures (net -2)

After `groupby().sum()`, you'd have a net change sequence of [+5, +2, -2].

Applying `cumsum()` to this sequence gives you the total active reservations for each day: [5, 7, 5]. This demonstrates how the occupancy evolves:

- **Day 1:** Starts with 5,
- **Day 2:** Increases to 7,
- **Day 3:** Decreases back to 5.
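
The three-day example can be reproduced directly in pandas. In this small sketch the day labels and column names are illustrative, with each arrival or departure tally recorded as its own signed row.

```python
import pandas as pd

# One row per signed tally: arrivals positive, departures negative.
events = pd.DataFrame({
    'Day': [1, 2, 2, 3, 3],
    'Change': [5, 3, -1, 2, -4],
})

# groupby().sum() collapses the rows to one net change per day.
net_change = events.groupby('Day')['Change'].sum()

# cumsum() turns the net changes into a running total of active reservations.
active = net_change.cumsum()

print(net_change.tolist())  # [5, 2, -2]
print(active.tolist())      # [5, 7, 5]
```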

### **Conclusion:**

- **`groupby().sum()`** is used for condensing the dataset into a form where each date has a single net change value, resolving any issues with duplicate dates.
- **`cumsum()`** transforms these net changes into a running total of active reservations, reflecting how the number of active reservations builds up or reduces over time.