# **Hotel Cancel Culture** - **EDA Notebook**

---

**Author:** Ben McCarty

**Extension of Capstone Project** - Expanding Hotel Reservation dataset analysis and modeling

**Contact:** bmccarty505@gmail.com

---

## Revisiting the Reservations

---

Originally, I used this notebook to perform EDA with the intention of using the dataset only for classifying whether a reservation would cancel.

Now, as part of my efforts to revisit and revamp this overall repository and workflow, I am adapting it for broader uses, such as regression modeling and time series forecasting.

The end goal is a comprehensive overview of the data and a preparation workflow flexible enough to support different modeling tasks.

**Warning: Work-in-Progress**

As this is a revamp of the original notebook, some of the code and comments may be outdated. I intend to update and clarify all steps in time, but some parts may be out of place while I clean things up.

---

**Of Demand and Cancellations**

*This was the initial intro to the notebook with a focus on classification modeling.*

>**Every aspect of hospitality depends on accurately anticipating business demand**: how many rooms to clean; how many rooms are available to sell; what would be the best rate; and how to bring it all together to make every guest satisfied. 
>
> Proper forecasting is critical to every department and staff member, and to generate our forecasts, **hotel managers need to know how many guests will cancel prior to arrival**. Using data from two European hotels, I developed a model to predict whether a given reservation would cancel based on 30 different reservation details.

**In order to develop and train my models, I need to prepare the data in advance.**

>In this notebook, I explore the original dataset and its features; condense several features into smaller subsets; engineer new features; and remove unwanted features from the data.
**Once the data is prepared, I will reload the data in a new notebook to create and train my models to determine my predictions of who will stay and who will cancel.**

# **Import Packages**

In [None]:
## Used to re-import custom functions during development
%load_ext autoreload
%autoreload 2

In [None]:
## Enabling access to custom functions in separate directory

# Import necessary modules
import os
import sys

# Construct the absolute path to the 'src' directory
src_path = os.path.abspath(os.path.join('..', 'src'))

# Append the path to 'sys.path'
if src_path not in sys.path:
    sys.path.append(src_path)

import db_utils, eda

## Data Handling
import numpy as np
import pandas as pd

## Visualizations
import matplotlib.pyplot as plt
from missingno import matrix
import plotly.express as px
import seaborn as sns

In [None]:
## Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 50)
%matplotlib inline

# Read Source Data (with UUIDs)

In [None]:
# # Path to the DuckDB database file
# db_path = '../data/hotel_reservations.duckdb'

# ## Select subset of data for review
# q = 'SELECT * FROM res_data LIMIT 5'

# with db_utils.duckdb_connection(db_path) as conn:
#     data = conn.execute(q).df()
    
# data.head()

In [None]:
backup_data_path = '../data/data_condensed_with_uuid.parquet'

data = pd.read_parquet(backup_data_path)

data.head()

## Add Pre-Engineered Date Features

In [None]:
filepath = '../data/engineered_data_dates.parquet'

df_dates = pd.read_parquet(filepath)
df_dates.head()

In [None]:
df_dates.info()

## Condense to Single DataFrame

In [None]:
data = data.merge(right = df_dates, how = 'left', on = 'UUID')
data.head()

In [None]:
data.info()
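
The left merge above assumes that `UUID` is unique in both tables. A minimal sketch of how that assumption could be checked, using toy frames with hypothetical columns (`ADR` values and `ArrivalDay` are placeholders, not the real data): pandas can validate key uniqueness and flag unmatched rows during the merge itself.

```python
import pandas as pd

# Toy stand-ins for the reservation table and the engineered date table.
# Column names other than UUID are hypothetical placeholders.
left = pd.DataFrame({'UUID': ['a', 'b', 'c'], 'ADR': [80.0, 120.5, 95.0]})
right = pd.DataFrame({'UUID': ['a', 'b'], 'ArrivalDay': ['Mon', 'Fri']})

# validate='one_to_one' raises a MergeError if UUID repeats on either side;
# indicator=True adds a '_merge' column showing where each row was found.
merged = left.merge(right, how='left', on='UUID',
                    validate='one_to_one', indicator=True)

# Rows present only in the left table have no engineered date features.
unmatched = (merged['_merge'] == 'left_only').sum()
print(f'Rows: {len(merged)}, unmatched UUIDs: {unmatched}')
```

When both tables share a non-key column, pandas appends the default `_x` and `_y` suffixes to the two copies, which is presumably where `ReservationStatusDate_x` and `ReservationStatusDate_y` come from.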

## Dropping Old Features

*Some features were used to engineer new features - particularly arrival details.*

In [None]:
drop_feats = ['LeadTime', 'ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateWeekNumber',
              'ArrivalDateDayOfMonth', 'StaysInWeekendNights', 'StaysInWeekNights',
              'ReservationStatusDate_x', 'ReservationStatusDate_y', 'ArrivalDate',
              'DepartureDate', 'BookingDate']
drop_feats

In [None]:
data = data.drop(columns = drop_feats)
data.head()

# Abbreviated EDA

---

- Original notebook reviewed each feature in depth
- Abbreviating review for simplicity.

---

## Summary Stats via Describe Method

In [None]:
## Numeric Stats
data.describe(include = 'number')

---

- Outliers present in many features
- Outlier detection/removal may be required in preprocessing pipeline for certain model types

---
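
To put a number on "outliers present", one option is to count the share of values per numeric feature that fall outside the usual IQR fences. This is only a sketch on toy data; the helper name `iqr_outlier_share` and the multiplier `k=1.5` are my own assumptions, not part of the original analysis.

```python
import pandas as pd

def iqr_outlier_share(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Share of values per numeric column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    num = df.select_dtypes(include='number')
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns
    outside = (num < (q1 - k * iqr)) | (num > (q3 + k * iqr))
    return outside.mean().sort_values(ascending=False)

# Toy frame with one extreme ADR value (hypothetical numbers)
toy = pd.DataFrame({'ADR': [90, 95, 100, 105, 110, 5000],
                    'Adults': [2, 2, 2, 2, 2, 2]})
shares = iqr_outlier_share(toy)
print(shares)
```

On the real data, the same call would be `iqr_outlier_share(data)`; features with a large share are the ones most likely to need trimming or capping for outlier-sensitive models.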

In [None]:
## Non-Numeric Stats
data.describe(exclude = 'number')

---

- High cardinality in `Country`, `Agent`, and `Company` (disregard `UUID`, which is a unique reservation ID)

---

## Missing Values

In [None]:
nan_sum = data.isna().sum()
nan_sum[nan_sum>0]

In [None]:
nan_avg = data.isna().mean()
nan_avg[nan_avg>0]

---

- Two features missing values
- Share of missing values is below 1% in each
- No action taken; will address in model pipeline

---
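
The count and share of missing values can also be viewed side by side in a single table. A small sketch, assuming a hypothetical helper `missing_summary` and toy data rather than the actual reservations:

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values for columns with at least one NaN."""
    summary = pd.DataFrame({'n_missing': df.isna().sum(),
                            'share_missing': df.isna().mean()})
    return (summary[summary['n_missing'] > 0]
            .sort_values('n_missing', ascending=False))

# Toy frame (hypothetical values)
toy = pd.DataFrame({'Country': ['PRT', None, 'GBR', 'ESP'],
                    'Children': [0.0, np.nan, np.nan, 1.0],
                    'Adults': [2, 2, 1, 2]})
summary = missing_summary(toy)
print(summary)
```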

## Visualizing Data

In [None]:
data_number = data.select_dtypes(include = 'number').columns
data_non_num = data.select_dtypes(exclude = 'number').columns

### Numeric

In [None]:
data[data_number].hist(bins = 20, figsize = (18,21), layout = (-1, 5));

### Non-Numeric

In [None]:
vc_params = {'normalize':True, 'dropna': False, 'ascending': False}

for col in data_non_num:
    if data[col].nunique() < 10:
        print(data[col].value_counts(**vc_params),'\n')
    else:
        print(data[col].value_counts(**vc_params)[:5], '\n')

---

- Rare-label encoding for categories with a frequency below 5%
- Binary encoding for features with low variance

---
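
To illustrate what rare-label encoding does, here is a plain-pandas sketch that groups every category below a 5% frequency into a single label. The helper `group_rare_labels` is hypothetical and simpler than a library encoder (it has no minimum-category threshold, for example); the toy country counts are made up.

```python
import pandas as pd

def group_rare_labels(s: pd.Series, tol: float = 0.05,
                      label: str = 'Rare') -> pd.Series:
    """Replace categories whose relative frequency is below `tol`."""
    freq = s.value_counts(normalize=True)
    rare = freq[freq < tol].index
    # Keep values that are not rare; replace the rest with `label`
    return s.where(~s.isin(rare), label)

# Toy column: two frequent countries and two rare ones (hypothetical counts)
toy = pd.Series(['PRT'] * 60 + ['GBR'] * 37 + ['ESP'] * 2 + ['ITA'] * 1)
grouped = group_rare_labels(toy)
print(grouped.value_counts(normalize=True))
```

In a modeling pipeline the rare categories should be learned on the training split only and then applied to the test split, which is what a fitted encoder handles.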

# Drop ReservationStatus, UUID

---

> `ReservationStatus` is nearly identical to my target feature and would be too strong of a predictor in my models.
>
> `UUID` is a non-informative unique identifier for each reservation and should be dropped.

---

In [None]:
data[['ReservationStatus', 'IsCanceled']].value_counts()

In [None]:
## Dropping 'ReservationStatus' and 'UUID'
data = data.drop(columns = ['ReservationStatus', 'UUID'])

In [None]:
## Confirming 'ReservationStatus' removal from dataframe
'ReservationStatus' not in data

# **Post-EDA Updates**

1. Outliers (continuous features): [OutlierTrimmer using MAD](https://feature-engine.trainindata.com/en/latest/user_guide/outliers/OutlierTrimmer.html#maximum-absolute-deviation)
2. Categorical encoding: [DecisionTreeEncoder](https://feature-engine.trainindata.com/en/latest/api_doc/encoding/DecisionTreeEncoder.html#decisiontreeencoder)
3. Rare labels (categoricals): [RareLabelEncoder](https://feature-engine.trainindata.com/en/latest/api_doc/encoding/RareLabelEncoder.html#rarelabelencoder)
4. Datetime-related features: Review [DatetimeFeatures](https://feature-engine.trainindata.com/en/latest/user_guide/datetime/DatetimeFeatures.html#automating-feature-extraction)

In [None]:
data.head().T

## Test Missing Value Imputation

In [None]:
data.isna().sum()[data.isna().sum() > 0]

In [None]:
from feature_engine.imputation import CategoricalImputer, MeanMedianImputer

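# Fill missing Country values with the most frequent category
cat_imputer = CategoricalImputer(variables=['Country'], imputation_method = 'frequent')
data_imputed = cat_imputer.fit_transform(data)

# Fill missing Children values with the median
num_imputer = MeanMedianImputer(imputation_method = 'median', variables = ['Children'])
data_imputed = num_imputer.fit_transform(data_imputed)

# Total remaining missing values (0 if Country and Children were the only features with gaps)
data_imputed.isna().sum().sum()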

## Test Categorical Encoder

In [None]:
data_imputed.select_dtypes('object').columns

In [None]:
from feature_engine.encoding import DecisionTreeEncoder, RareLabelEncoder

In [None]:
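# Group categories below 5% frequency as 'Rare'; only applies to features with at least 5 categories
rle = RareLabelEncoder(tol=0.05,n_categories=5,replace_with='Rare')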

data_imputed = rle.fit_transform(data_imputed)

In [None]:
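# Encode categoricals using decision-tree predictions of ADR (regression target)
dte = DecisionTreeEncoder(regression = True, random_state=42)

# Note: ADR is passed as y and dropped from X, so data_imputed no longer contains it
data_imputed = dte.fit_transform(data_imputed.drop(columns=['ADR']), data_imputed['ADR'])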

In [None]:
data_imputed.dtypes

## Test Outlier Trimming

In [None]:
from feature_engine.outliers import OutlierTrimmer

In [None]:
# Testing the Gaussian rule here; the Post-EDA plan above lists MAD as the intended method
olt = OutlierTrimmer(capping_method='gaussian')

data_no_outs = olt.fit_transform(data_imputed)
data_no_outs.head()

In [None]:
## Number of rows removed by trimming
data_imputed.shape[0] - data_no_outs.shape[0]
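
The number of rows removed depends heavily on the rule used to set the limits. As a rough sketch (my own NumPy implementation with an assumed `fold` of 3 for both rules and an unscaled MAD, not feature-engine's exact logic), the Gaussian rule uses mean ± fold × standard deviation, while a MAD rule uses median ± fold × median absolute deviation.

```python
import numpy as np

def gaussian_bounds(x: np.ndarray, fold: float = 3.0) -> tuple:
    """Limits at mean +/- fold * sample standard deviation."""
    mu, sd = np.mean(x), np.std(x, ddof=1)
    return mu - fold * sd, mu + fold * sd

def mad_bounds(x: np.ndarray, fold: float = 3.0) -> tuple:
    """Limits at median +/- fold * median absolute deviation (unscaled)."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return med - fold * mad, med + fold * mad

# Toy ADR-like values with one extreme observation (hypothetical numbers)
x = np.array([90, 95, 100, 105, 110, 5000], dtype=float)

kept = {}
for name, (lo, hi) in {'gaussian': gaussian_bounds(x),
                       'mad': mad_bounds(x)}.items():
    kept[name] = x[(x >= lo) & (x <= hi)]
    print(f'{name}: limits=({lo:,.1f}, {hi:,.1f}), rows kept={len(kept[name])}')
```

In this toy sample the single extreme value inflates the standard deviation enough that the Gaussian rule keeps it, whereas the MAD rule drops it. That difference is one reason to prefer a median-based rule for heavily skewed features.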

# Preserving the Pandas (DataFrame)

---

Now I am ready to save the cleaned and processed data for modeling in my next notebook.

---

In [None]:
# ## Pickling with Pandas
# data.to_pickle(path = '../data/data_prepped.pickle',
#             compression = 'gzip')
# print(f'Successfully pickled!')

In [None]:
## Saving to Parquet with Pandas
data.to_parquet(path = '../data/data_prepped.parquet')
print('Successfully saved!')

# Future Work: EDA

---

In the future, I will revisit the visualization code in my EDA function and convert the figures from Plotly Express to Matplotlib. The goal with Plotly Express was additional interactivity; however, the interactive figures crippled my notebook's performance. Matplotlib figures would be more appropriate in this case, and I will revisit this work when I have more time.

---


# Moving to Modeling!

---

> Now that I have completed the pre-processing and EDA steps, I will move to my next notebook to perform my classification modeling.

---