# EDA and Data Wrangling

## Revisiting the Reservations

---

Originally, I used this notebook to perform EDA with the intention of using the dataset only for classifying whether a reservation would cancel.

Now, as part of my efforts to revisit and revamp this overall repository and workflow, I am adapting it for broader uses, such as regression modeling and time series forecasting.

The end goal is a comprehensive overview of the data and a preparation workflow flexible enough to support different modeling tasks.

**Warning: Work-in-Progress**

As this is a revamp of the original workbook, some of the code and comments may be outdated. I intend to update and clarify all steps in time, but there may be some parts that are out of place while I clean things up.

---

**Of Demand and Cancellations**

*This was the initial intro to the notebook with a focus on classification modeling.*

>**Every aspect of hospitality depends on accurately anticipating business demand**: how many rooms to clean; how many rooms are available to sell; what would be the best rate; and how to bring it all together to make every guest satisfied. 
>
> Proper forecasting is critical to every department and staff member, and to generate those forecasts, **hotel managers need to know how many guests will cancel prior to arrival**. Using data from two European hotels, I developed a model to predict whether a given reservation would cancel based on 30 different reservation details.

**In order to develop and train my models, I need to prepare the data in advance.**

>In this notebook, I explore the original dataset and its features; condense several features into smaller subsets; engineer new features; and remove unwanted features from the data.
>
**Once the data is prepared, I will reload the data in a new notebook to create and train my models to determine my predictions of who will stay and who will cancel.**

# Import Packages

In [None]:
## Used to re-import custom functions during development
%load_ext autoreload
%autoreload 2

In [None]:
## Enabling access to custom functions in separate directory

# Import necessary modules
import os
import sys

# Construct the absolute path to the 'src' directory
src_path = os.path.abspath(os.path.join('..', 'src'))

# Append the path to 'sys.path'
if src_path not in sys.path:
    sys.path.append(src_path)

import db_utils, eda

## Data Handling
import numpy as np
import pandas as pd

## Visualizations
import matplotlib.pyplot as plt
from missingno import matrix
import plotly.express as px
import seaborn as sns

## Feature Preprocessing
from feature_engine.imputation import CategoricalImputer, MeanMedianImputer
from feature_engine.encoding import OneHotEncoder, RareLabelEncoder
from feature_engine.outliers import OutlierTrimmer

from sklearn.ensemble import IsolationForest, HistGradientBoostingRegressor, HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

In [None]:
## Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 50)
%matplotlib inline

# Read Source Data (with UUIDs)

In [None]:
# # Path to the DuckDB database file
# db_path = '../data/hotel_reservations.duckdb'

# ## Select subset of data for review
# q = 'SELECT * FROM res_data LIMIT 5'

# with db_utils.duckdb_connection(db_path) as conn:
#     data = conn.execute(q).df()
    
# data.head()

In [None]:
backup_data_path = '../data/data_condensed_with_uuid.parquet'

data = pd.read_parquet(backup_data_path)

data.head()

## Add Pre-Engineered Date Features

In [None]:
filepath = '../data/engineered_data_dates.parquet'

df_dates = pd.read_parquet(filepath)
df_dates.head()

In [None]:
df_dates.info()

## Condense to Single DataFrame

In [None]:
data = data.merge(right = df_dates, how = 'left', on = 'UUID')
data.head()

In [None]:
data.info()
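
A left merge on `UUID` silently adds rows if the key repeats in either frame. As a minimal sketch, assuming `UUID` is meant to be unique on both sides, the `validate` and `indicator` arguments of `DataFrame.merge` can confirm that. The toy frames and their values are invented stand-ins for `data` and `df_dates`.

```python
import pandas as pd

# Toy stand-ins for `data` and `df_dates`; values are invented for illustration.
left = pd.DataFrame({"UUID": ["a", "b", "c"], "ADR": [80.0, 120.5, 95.0]})
right = pd.DataFrame({"UUID": ["a", "b"], "ArrivalDate_DayOfWeek": [1, 5]})

# validate raises MergeError if UUID is duplicated on either side;
# indicator adds a `_merge` column showing where each row came from.
merged = left.merge(right, how="left", on="UUID",
                    validate="one_to_one", indicator=True)

# Rows from the left frame that found no match in the right frame.
unmatched = merged.loc[merged["_merge"] == "left_only", "UUID"].tolist()
print(len(merged), unmatched)
```

The `_x`/`_y` suffixes on `ReservationStatusDate` in the drop list are pandas' default markers for a non-key column that exists in both frames.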

## Dropping Select Features

*Some features, particularly the arrival details, were only needed to engineer new features and are now redundant, so I drop them here along with the `UUID` merge key.*

In [None]:
drop_feats = ['UUID','LeadTime', 'ArrivalDateYear', 'ArrivalDateMonth', 'ArrivalDateWeekNumber',
              'ArrivalDateDayOfMonth', 'StaysInWeekendNights', 'StaysInWeekNights',
              'ReservationStatusDate_x', 'ReservationStatusDate_y', 'ArrivalDate',
              'DepartureDate', 'BookingDate']
drop_feats

In [None]:
data = data.drop(columns = drop_feats)
data.head()

# Abbreviated EDA

---

- Original notebook reviewed each feature in depth
- Abbreviating review for simplicity.

---

## Summary Stats via Describe Method

In [None]:
## Numeric Stats
data.describe(include = 'number')

---

- Outliers present in many features
- Outlier detection/removal may be required in preprocessing pipeline for certain model types

---

In [None]:
## Non-Numeric Stats
data.describe(exclude = 'number')

---

- High cardinality in Country, Agent, Company (UUID, the reservation ID, was already dropped above)

---

## Review Missing Values

In [None]:
nan_sum = data.isna().sum()
nan_sum[nan_sum>0]

In [None]:
nan_avg = data.isna().mean()
nan_avg[nan_avg>0]

---

- Two features have missing values
- Each is missing in less than 1% of rows
- No action taken; will address in model pipeline

---
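
The two cells above report counts and shares separately. As a small sketch, both can be combined into one table. The toy frame below is invented for illustration, and the same calls would apply to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame; the real notebook would apply the same calls to `data`.
toy = pd.DataFrame({
    "Children": [0.0, np.nan, 1.0, 0.0],
    "Country": ["PRT", "GBR", None, "PRT"],
    "Adults": [2, 2, 1, 2],
})

# Count and share of missing values side by side, keeping only affected columns.
nan_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})
nan_summary = nan_summary[nan_summary["n_missing"] > 0]
print(nan_summary)
```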

## Visualizing Data

In [None]:
data_number = data.select_dtypes(include = 'number').columns
data_non_num = data.select_dtypes(exclude = 'number').columns

### Numeric

In [None]:
data[data_number].hist(bins = 20, figsize = (18,21), layout = (-1, 5));

### Non-Numeric

In [None]:
vc_params = {'normalize':True, 'dropna': False, 'ascending': False}

for col in data_non_num:
    if data[col].nunique() < 10:
        print(data[col].value_counts(**vc_params),'\n')
    else:
        print(data[col].value_counts(**vc_params)[:5], '\n')

---

Plan: rare-label encoding for categories that appear in fewer than 5% of rows; binary encoding for features with low variance.

---

# Drop ReservationStatus

---

> `ReservationStatus` is nearly identical to my target feature, `IsCanceled`, so keeping it would leak the target into my classification models.

---
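
One way to check the "nearly identical" claim is a row-normalized crosstab: if every status value maps to a single target value, the feature determines the target. This is a sketch on toy rows, and the status labels are illustrative rather than read from the dataset.

```python
import pandas as pd

# Toy rows; the status labels are illustrative, not read from the dataset.
toy = pd.DataFrame({
    "ReservationStatus": ["Check-Out", "Canceled", "No-Show", "Check-Out", "Canceled"],
    "IsCanceled": [0, 1, 1, 0, 1],
})

# Row-normalized crosstab: if every row puts all of its mass in one column,
# the feature fully determines the target and would leak it.
leak_check = pd.crosstab(toy["ReservationStatus"], toy["IsCanceled"], normalize="index")
is_deterministic = bool((leak_check.max(axis=1) == 1.0).all())
print(leak_check)
print(is_deterministic)
```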

In [None]:
data[['ReservationStatus', 'IsCanceled']].value_counts()

In [None]:
## Dropping 'ReservationStatus'
data = data.drop(columns = ['ReservationStatus'])

In [None]:
## Confirming 'ReservationStatus' removal from dataframe
'ReservationStatus' not in data

# Feature Preprocessing with `Feature-Engine` Package

1. Outliers (continuous features): [OutlierTrimmer using MAD](https://feature-engine.trainindata.com/en/latest/user_guide/outliers/OutlierTrimmer.html#maximum-absolute-deviation)
   * *Test alternative methods; MAD trimming reduces the dataset by 50%.*
2. Categorical encoding: [DecisionTreeEncoder](https://feature-engine.trainindata.com/en/latest/api_doc/encoding/DecisionTreeEncoder.html#decisiontreeencoder)
   * *Only usable with a target feature; not ideal for all-purpose preprocessing.*
3. Rare labels (categoricals): [RareLabelEncoder](https://feature-engine.trainindata.com/en/latest/api_doc/encoding/RareLabelEncoder.html#rarelabelencoder)
   * *Best option for high-cardinality features.*
4. Datetime-related features: Review [DatetimeFeatures](https://feature-engine.trainindata.com/en/latest/user_guide/datetime/DatetimeFeatures.html#automating-feature-extraction)
   * *Possibly useful for revising the datetime feature engineering notebook.*
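
The note on MAD trimming removing so many rows can be illustrated with a plain NumPy sketch. It assumes limits at median ± `k` × scaled MAD with a hypothetical `k = 3`; this is not necessarily the exact formula or default that `OutlierTrimmer` uses. Because a row is dropped when any one feature falls outside its limits, modest per-feature losses compound across features.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy skewed features (log-normal), invented for illustration.
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 5))

def mad_bounds(col, k=3.0):
    """Return (lower, upper) limits at median +/- k * scaled MAD."""
    med = np.median(col)
    mad = np.median(np.abs(col - med))
    scaled = 1.4826 * mad  # makes MAD comparable to a standard deviation under normality
    return med - k * scaled, med + k * scaled

# A row is kept only if it is inside the limits for every feature.
keep = np.ones(len(X), dtype=bool)
per_feature_drop = []
for j in range(X.shape[1]):
    lo, hi = mad_bounds(X[:, j])
    inside = (X[:, j] >= lo) & (X[:, j] <= hi)
    per_feature_drop.append(1 - inside.mean())
    keep &= inside

print([round(p, 3) for p in per_feature_drop])  # share dropped by each feature alone
print(round(1 - keep.mean(), 3))                # total share of rows removed
```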

In [None]:
data.head().T

## Missing Value Imputation

In [None]:
data.isna().sum()[data.isna().sum() > 0]

In [None]:
## Impute categorical (most frequent) and continuous (median) features

cat_imputer = CategoricalImputer(variables=['Country'], imputation_method = 'frequent')
data = cat_imputer.fit_transform(data)

num_imputer = MeanMedianImputer(imputation_method = 'median', variables = ['Children'])
data = num_imputer.fit_transform(data)


data.isna().sum().sum()
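
For reference, here is a plain-pandas sketch of what the two strategies amount to: fill a categorical column with its most common label and a numeric column with its median. It approximates the behavior on an invented toy frame and is not the library's implementation.

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Country": ["PRT", "PRT", None, "GBR"],
    "Children": [0.0, 2.0, np.nan, 0.0],
})

# 'frequent' imputation: fill with the most common category.
country_mode = toy["Country"].mode().iloc[0]
# 'median' imputation: fill with the column median (NaN is ignored by default).
children_median = toy["Children"].median()

filled = toy.fillna({"Country": country_mode, "Children": children_median})
print(filled)
```

Fitting the imputers on the full dataset, as in the cell above, lets rows that later land in the test split influence the fill values. Fitting them inside the model pipeline on the training split only would avoid that.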

## Categorical Encoding

### Rare Label Encoding

---

* Several features have high cardinality
* Performing rare label encoding will reduce the cardinality, making one-hot encoding more efficient and effective

---
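
As a rough plain-pandas sketch of the idea: labels whose relative frequency falls below a tolerance are collapsed into one `'Rare'` label. The toy column and its counts are invented, and the threshold handling approximates `RareLabelEncoder` rather than reproducing it.

```python
import pandas as pd

# Toy column; country codes and counts are invented for illustration.
country = pd.Series(["PRT"] * 60 + ["GBR"] * 25 + ["FRA"] * 11 + ["ESP"] * 3 + ["BRA"] * 1)

tol = 0.05  # same threshold as the RareLabelEncoder call in this notebook
freq = country.value_counts(normalize=True)
frequent = freq[freq >= tol].index

# Keep frequent labels and replace everything else with a single 'Rare' label.
encoded = country.where(country.isin(frequent), "Rare")
print(encoded.value_counts(normalize=True))
```

In the encoder itself, the `n_categories` argument additionally leaves features with only a few distinct labels untouched.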

In [None]:
data.describe(include = 'object')

In [None]:
rle = RareLabelEncoder(tol=0.05,n_categories=3,replace_with='Rare')

data = rle.fit_transform(data)

In [None]:
data.describe(include = 'object')

### One-Hot Encoding

---

* Used after rare label encoding, this will convert my categorical features to numeric.

---
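
The `drop_last = True` option used below keeps k-1 indicator columns per feature. A small sketch of why nothing is lost, using `pd.get_dummies` on an invented toy column: a row of all zeros identifies the dropped label. Which label counts as "last" depends on the encoder's own ordering, so the dropped column may differ from this sketch.

```python
import pandas as pd

# Toy column; labels are invented for illustration.
toy = pd.DataFrame({"MarketSegment": ["Online", "Direct", "Rare", "Online"]})

# Full one-hot encoding creates one column per label.
full = pd.get_dummies(toy["MarketSegment"], prefix="MarketSegment", dtype=int)

# Dropping one column (here the last) leaves k-1 columns with no information lost:
# a row of all zeros identifies the dropped label.
k_minus_1 = full.iloc[:, :-1]
print(k_minus_1)
```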

In [None]:
ohe = OneHotEncoder(drop_last = True)

data = ohe.fit_transform(data)

data.head()

## Outlier Detection: Isolation Forest

In [None]:
model = IsolationForest(n_estimators=100, contamination='auto', random_state=42)
model.fit(data)

# Predict anomalies
labels = model.predict(data)

# # Calculate anomaly scores
# scores = model.decision_function(data)

# # Example: Inspect the first 5 predictions and scores
# for i in range(5):
#     print(f"Data Point {i+1}: Label = {labels[i]}, Score = {scores[i]:.3f}")

In [None]:
data[labels == -1].describe()
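
A minimal sketch of how the labels and scores behave, on synthetic points rather than the reservation data: `predict` (or `fit_predict`) returns `1` for inliers and `-1` for outliers, and `decision_function` is negative for the rows flagged as outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 200 ordinary points plus one obvious outlier; all values are synthetic.
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[10.0, 10.0]]])

forest = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
labels = forest.fit_predict(X)        # 1 = inlier, -1 = outlier
scores = forest.decision_function(X)  # lower = more anomalous; negative = outlier

outlier_share = (labels == -1).mean()
X_clean = X[labels == 1]              # same boolean-mask filter used on `data`
print(labels[-1], round(outlier_share, 3), X_clean.shape)
```

Note that the forest in this notebook is fitted on `data` while it still contains the target columns (`ADR`, `IsCanceled`), so target values influence which rows are flagged.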

# Test: Modeling

In [None]:
data.columns

In [None]:
data = data[labels == 1]
data

In [None]:
target = 'ADR'
# target = 'IsCanceled'

X = data.drop(columns = [target])
y = data[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [None]:
model = HistGradientBoostingRegressor(random_state = 42)
# model = HistGradientBoostingClassifier(random_state = 42)

model.fit(X_train, y_train)

# y_pred = model.predict(X_test)

model.score(X_test, y_test)
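
What `score` reports depends on the estimator: scikit-learn regressors return R², while classifiers return mean accuracy, so the two target options above are not directly comparable. A short NumPy sketch of both metrics on invented toy values:

```python
import numpy as np

# Toy targets and predictions, invented for illustration.
y_true = np.array([100.0, 80.0, 120.0, 95.0])
y_pred = np.array([90.0, 85.0, 110.0, 100.0])

# Regressors: score() is R^2 = 1 - SS_res / SS_tot.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Classifiers: score() is mean accuracy.
labels_true = np.array([0, 1, 1, 0])
labels_pred = np.array([0, 1, 0, 0])
accuracy = np.mean(labels_true == labels_pred)
print(round(r2, 3), accuracy)
```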