# Revisiting the Reservations - EDA and Data Wrangling

---

***Warning: Work-in-Progress***

Originally, I used this notebook to perform EDA with the intention of using the dataset only for classifying whether a reservation would cancel.

Now, as part of my efforts to revisit and revamp this overall repository and workflow, I am updating and adapting it for broader uses, including regression modeling, classification modeling, and time series forecasting.

The end goal is to have a comprehensive overview of the data and to be flexible enough to handle different workflows.


As this is a revamp of the original workbook, some of the code and comments may be outdated. I intend to update and clarify all steps in time, but there may be some parts that are out of place while I clean things up.

---

**Revenue Forecasting**

> Proper revenue forecasting is a critical aspect of the revenue management cycle, with the goal of maximizing the hotel's revenue. Using the source dataset from two Portuguese hotels (one in an urban location and the other a resort), I will use machine learning models to forecast the hotels' average daily rate ("ADR") based on common reservation details that would be known prior to arrival.
>
> Before I train any models, I need to get a sense of the data, including its data types, distributions, and outliers, and to brainstorm ideas for feature engineering.
>
> In this notebook, I explore the original dataset and identify specific features to remove from the data. I will decide whether to keep each feature in my modeling based on my domain knowledge of whether it is commonly known prior to arrival.
>
> Once the data is prepared, I will perform further modeling and feature engineering in additional notebooks.

---

# Import Packages

In [None]:
## DEPRECATED - not using custom functions in this notebook.
## Keep this code for future reference/use.

## Used to re-import custom functions during development
# %load_ext autoreload
# %autoreload 2

# ## Import necessary modules to enable access to custom functions in separate directory
# import os
# import sys

# # Construct the absolute path to the 'src' directory
# src_path = os.path.abspath(os.path.join('../..', 'src'))

# # Append the path to 'sys.path'
# if src_path not in sys.path:
#     sys.path.append(src_path)

In [None]:
## Data Handling
import duckdb
import json
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

## Visualizations
import matplotlib.pyplot as plt
import sweetviz as sv

## Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 50)
%matplotlib inline

# Load Dataset from Database

In [None]:
# ## Create connection to database file

# con = create_engine("duckdb:///../../data/reservations.duckdb")
# q = '''SELECT *
# FROM reservations'''
# res_data = pd.read_sql(q, con)

# q = '''SELECT *
# FROM guests'''
# guests_data = pd.read_sql(q, con)

# q = '''SELECT *
# FROM rooms'''
# rooms_data = pd.read_sql(q, con)

# q = '''SELECT *
# FROM bookingagents'''
# ba_data = pd.read_sql(q, con)

# q = '''SELECT *
# FROM BookingDetails'''
# bd_data = pd.read_sql(q, con)

# data = pd.merge(left = res_data, right = rooms_data, on = 'UUID', how = 'outer')
# data = pd.merge(left = data, right = guests_data, on = 'UUID', how = 'outer')
# data = pd.merge(left = data, right = ba_data, on = 'UUID', how = 'outer')
# data = pd.merge(left = data, right = bd_data, on = 'UUID', how = 'outer')
# data = data.set_index('UUID')
# data

In [None]:
path = '../../data/source/full_data.feather'

data = pd.read_feather(path)

data

# Exploratory Data Analysis

## Sweetviz Report

In [None]:
report = sv.analyze(data,pairwise_analysis = 'off')

report.show_notebook()

---

***Feature Reviews***

IsCanceled - Binary, but needs to be validated (i.e., what indicates the reservation was canceled? How was this feature created?)

LeadTime - Right-skewed; may benefit from a Yeo-Johnson transformation, or the outliers could simply be dropped.

All date features - move to feature engineering for time series details.

Adults, Children, Babies - Right-skewed; transform or drop.

Meal - Categorical, perform encoding during modeling pipeline preprocessing

Country - Categorical, perform encoding during modeling pipeline preprocessing

PreviousCancellations/PreviousBookingsNotCanceled - useful for forecasting after reservations are booked; harder to use for pre-booking revenue forecasting.

ReservedRoomType/AssignedRoomType - `AssignedRoomType` is not known until after the stay; possible candidate for classification modeling for feature engineering (e.g., which room type is assigned; whether the assignment matches the booking).

BookingChanges - Right-skewed, transform during preprocessing

DepositType - Categorical, perform encoding during modeling pipeline preprocessing

Agent, Company - Categorical, perform encoding during modeling pipeline preprocessing

DaysInWaitingList - Right-skewed, transform during preprocessing

ADR - target feature; contains negative values (decide whether to keep them, drop them, or shift all ADRs upward by the magnitude of the most negative value so that the minimum is zero); right-skewed, would benefit from a `TransformedTargetRegressor` with a `PowerTransformer`.

RequiredCarParkingSpaces - Right-skewed, transform during preprocessing

ReservationStatus - used to determine `IsCanceled`; possibly use with other features to engineer new target feature that includes ADR, etc.

ReservationStatusDate - use in temporal feature engineering
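
As a rough illustration of the ADR note above, here is a minimal sketch of wrapping a regressor in a `TransformedTargetRegressor` with a Yeo-Johnson `PowerTransformer`. The data is synthetic (an assumption, not drawn from the reservations dataset), and `LinearRegression` is only a placeholder model. One point worth noting: Yeo-Johnson is defined for negative inputs, so the target would not need to be shifted to zero before transforming.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PowerTransformer

# Synthetic stand-in for the real data (assumption): a right-skewed target
# with a handful of negative values, loosely mimicking the shape of ADR.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = np.exp(0.5 * X[:, 0] + rng.normal(scale=0.3, size=500)) * 100 - 30

# The transformer is fit on the target during .fit(); the regressor is
# trained on the transformed target.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),  # placeholder model for illustration
    transformer=PowerTransformer(method='yeo-johnson'),
)
model.fit(X, y)

# Predictions are inverse-transformed back to the original target scale.
preds = model.predict(X)
```

The fitted transformer is available as `model.transformer_`, which makes it possible to inspect the estimated lambda before committing to this approach.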

---

**Review Results**

Based on the report, each hotel exhibits several numeric features with outliers, as well as several features with high cardinality. There are also several features showing a high degree of multicollinearity.

---

**Processing Steps**

* *Outliers and Skew:* Many of the numeric features are right-skewed with some extreme outliers. My initial thoughts are to preserve the details and perform a Yeo-Johnson transformation using a `PowerTransformer` during the modeling preprocessing. However, I may choose to drop these records as they are such extreme values.

* *Categoricals and High Cardinality:* I need to encode each categorical feature, but I want to avoid one-hot encoding because several features have high cardinality. Instead, I will use a `CountFrequencyEncoder` to replace each category with its frequency.

* *Multicollinearity:* Only concerning for linear models. Use methods such as the variance inflation factor (VIF) to identify and remove features with high multicollinearity.
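
To make two of these steps concrete, the sketch below shows frequency encoding and a VIF calculation using only pandas and NumPy. Both helpers (`frequency_encode`, `vif`) are hypothetical stand-ins written for illustration; they approximate what a `CountFrequencyEncoder` or a library VIF routine would do, and the toy data is not taken from the reservations dataset.

```python
import numpy as np
import pandas as pd

def frequency_encode(series: pd.Series) -> pd.Series:
    """Replace each category with its relative frequency in the series."""
    freqs = series.value_counts(normalize=True)
    return series.map(freqs)

def vif(df: pd.DataFrame) -> pd.Series:
    """Variance inflation factor per column: VIF = 1 / (1 - R^2)."""
    X = df.to_numpy(dtype=float)
    scores = {}
    for i, col in enumerate(df.columns):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        # Include an intercept so R^2 is measured around the column mean
        design = np.column_stack([np.ones(len(others)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - resid.var() / target.var()
        scores[col] = 1 / (1 - r2)
    return pd.Series(scores, name='VIF')

# Toy categorical column (illustrative values only)
toy = pd.DataFrame({'Country': ['PRT', 'PRT', 'GBR', 'PRT', 'ESP', 'GBR']})
toy['Country_freq'] = frequency_encode(toy['Country'])

# Toy numeric columns: 'c' is nearly a copy of 'a', so both get a high VIF
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
nums = pd.DataFrame({'a': a, 'b': b, 'c': a + 0.1 * rng.normal(size=200)})
vif_scores = vif(nums)
```

In a modeling pipeline, the frequencies would need to be learned from the training split only and then applied to new data, which is what a fitted encoder handles.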

---

## Summary Stats via Describe Method

In [None]:
## Numeric Stats
data.describe(include = 'number').T

---

- Outliers present in nearly half of the features
- Power transformations/removal may be required in preprocessing pipeline to minimize impact on models

---

In [None]:
## Non-Numeric Stats
data.describe(exclude = 'number').T

---

- High cardinality in Country, Agent, Company

---

## Review Missing Values

In [None]:
nan_sum = data.isna().sum()
nan_sum[nan_sum > 0]

In [None]:
nan_avg = data.isna().mean()
nan_avg[nan_avg > 0]

---

- Two features missing values
- Share of missing values is less than 1%
- No action taken; will address in model pipeline
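
The two checks above can also be combined into a single table. This is a minimal sketch; `missing_summary` is a hypothetical helper, and the toy frame (including which columns contain gaps) is made up for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values for columns that have any."""
    summary = pd.DataFrame({
        'n_missing': df.isna().sum(),
        'pct_missing': df.isna().mean() * 100,
    })
    has_missing = summary['n_missing'] > 0
    return summary[has_missing].sort_values('n_missing', ascending=False)

# Toy data for illustration only
toy = pd.DataFrame({
    'Children': [0, 1, np.nan, 2],
    'Country': ['PRT', None, None, 'GBR'],
    'Adults': [2, 2, 1, 2],
})
summary = missing_summary(toy)
```

Calling `missing_summary(data)` on the full dataset would return one row per affected feature, with the count and percentage side by side.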

---

# Feature Selection - Known at Booking/Pre-Arrival

---

Based on my domain knowledge and experience, I know that several features would not be known at booking or before arrival. These features should be excluded from the analysis and modeling, though I could build additional models to predict their values.

**Reservation-Specific Features**

* Meal
* IsRepeatedGuest
* PreviousCancellations
* PreviousBookingsNotCanceled

**Post-Stay Details**

* AssignedRoomType
* BookingChanges
* DepositType
* DaysInWaitingList
* RequiredCarParkingSpaces
* ReservationStatus
* ReservationStatusDate

---

In [None]:
data.columns

## Create JSON File to Record Column Groups

---

Since I know I will need to use subsets of different features for different analyses and models, I will preemptively define different groups of features to use as filters in the rest of my workflow.

---

In [None]:
## Define custom groups of features
column_groups = {
    'booking_details': ['UUID', 'HotelNumber', 'LeadTime', 'ArrivalDateMonth',
                        'ArrivalDateWeekNumber','ArrivalDateDayOfMonth',
                        'StaysInWeekendNights','StaysInWeekNights','Adults',
                        'Children', 'Babies','Country','MarketSegment',
                        'DistributionChannel','ReservedRoomType',
                        'DepositType','Agent','Company','CustomerType','ADR'],
    'post_stay_details': ['UUID', 'AssignedRoomType','BookingChanges',
                          'DaysInWaitingList','RequiredCarParkingSpaces',
                          'ReservationStatus','ReservationStatusDate'],
    'reservation_specific': ['UUID', 'Meal', 'IsRepeatedGuest','PreviousCancellations',
                             'PreviousBookingsNotCanceled'],
    'temporal_features': ['UUID', 'LeadTime', 'ArrivalDateYear', 'ArrivalDateMonth',
                          'ArrivalDateWeekNumber','ArrivalDateDayOfMonth',
                          'StaysInWeekendNights','StaysInWeekNights',
                          'ReservationStatusDate']
}

## Save results to JSON file
file_name = '../../data/column_groups.json'
with open(file_name, 'w') as json_file:
    json.dump(column_groups, json_file, indent=4)

In [None]:
data[column_groups['booking_details']]
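
In later notebooks, the saved groups could be reloaded and used as column filters. The sketch below assumes a hypothetical helper, `select_group`, that skips any listed column missing from the frame instead of raising a `KeyError`; the toy data and the temporary file path are placeholders, not the project's real files.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def select_group(df: pd.DataFrame, groups: dict, name: str) -> pd.DataFrame:
    """Return the columns of a named group, skipping any absent from df."""
    wanted = groups[name]
    present = [col for col in wanted if col in df.columns]
    missing = sorted(set(wanted) - set(present))
    if missing:
        print(f'Skipping columns not in the data: {missing}')
    return df[present]

# Toy stand-ins for the real column groups and dataset
groups = {'reservation_specific': ['UUID', 'Meal', 'IsRepeatedGuest']}
toy = pd.DataFrame({'UUID': ['a', 'b'], 'Meal': ['BB', 'HB'], 'ADR': [80.0, 120.5]})

# Round-trip the groups through a temporary JSON file
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'column_groups.json'
    path.write_text(json.dumps(groups, indent=4))
    loaded = json.loads(path.read_text())

subset = select_group(toy, loaded, 'reservation_specific')
```

Skipping absent columns is a design choice for exploratory work; a stricter version could raise an error so that a renamed or dropped feature is caught early.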

# Initial EDA - Review

---

**Dysfunctional Features**

> This initial EDA highlighted several problems with the overall dataset, including:
> * Extreme outliers for several numeric features;
> * High cardinality for several categorical features;
> * Temporal features that need to be re-engineered;
> * Questions about how the `IsCanceled` feature was created.

**Next Steps**

> The next steps will include:
> * Transforming skewed numeric features via `PowerTransformer` to reduce the influence of outliers;
> * Rare-label encoding for categorical features with high cardinality;
> * Re-engineering the temporal features for regression and classification modeling;
> * Either dropping or re-engineering the `IsCanceled` feature due to the ambiguity of its engineering.
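
As a preview of two of these steps, the sketch below builds a single arrival date from the separate date columns and groups infrequent categories under one label. Both helpers are hypothetical, the toy values are invented, and the sketch assumes `ArrivalDateMonth` holds English month names (e.g., "July"), which should be verified against the actual data.

```python
import pandas as pd

def build_arrival_date(df: pd.DataFrame) -> pd.Series:
    """Combine year, month name, and day of month into one datetime column."""
    text = (
        df['ArrivalDateYear'].astype(str) + '-'
        + df['ArrivalDateMonth'].astype(str) + '-'
        + df['ArrivalDateDayOfMonth'].astype(str).str.zfill(2)
    )
    # %B parses full month names; assumes English names (see lead-in)
    return pd.to_datetime(text, format='%Y-%B-%d')

def group_rare_labels(series: pd.Series, min_share: float = 0.05,
                      other: str = 'Rare') -> pd.Series:
    """Replace categories whose share is below min_share with one label."""
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < min_share].index
    return series.where(~series.isin(rare), other)

# Toy data for illustration only
dates = pd.DataFrame({
    'ArrivalDateYear': [2015, 2016],
    'ArrivalDateMonth': ['July', 'January'],
    'ArrivalDateDayOfMonth': [1, 15],
})
arrival = build_arrival_date(dates)

countries = pd.Series(['PRT'] * 8 + ['GBR'] * 3 + ['ESP'])
grouped = group_rare_labels(countries, min_share=0.1)
```

The `min_share` threshold is arbitrary here; in practice it would be tuned, and the set of rare labels would be learned from the training split only.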

---