# Baseline Modeling - Regression

---

* Goal: develop baseline models before feature engineering so their performance can be compared against post-engineering models.

My goal with this notebook is to develop a series of baseline models using minimal preprocessing. These models establish a baseline performance to improve on with additional feature engineering. Additionally, the most impactful features for each model can indicate whether any features are too strongly predictive of the target.

---

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sweetviz as sv

In [None]:
## SKLearn and Modeling Tools

from feature_engine.encoding import CountFrequencyEncoder
from feature_engine.outliers import Winsorizer
from feature_engine.pipeline import Pipeline as fePipeline

from sklearn import metrics
from sklearn import set_config
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler

set_config(transform_output='pandas')

## Load Data

In [None]:
df_data = pd.read_feather('../../data/source/full_data.feather')
# df_data = df_data.set_index('UUID')
df_data

## Set Target Feature

In [None]:
target_feature = 'ADR'

## Quick Overview

In [None]:
sv.analyze(df_data,pairwise_analysis = 'off').show_notebook()

---

**Quick Overview: Review**

Based on the quick EDA, I see there are both categorical features (several with high cardinality) and continuous (with right-tailed skews and some extreme outliers).
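
The cardinality and skew noted above can also be quantified directly. The following is a minimal sketch on a hypothetical toy frame (the column names and distributions are illustrative assumptions, not taken from `df_data`); the same two calls should apply to the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_data; columns are illustrative only.
rng = np.random.default_rng(903)
toy = pd.DataFrame({
    'Agent': rng.integers(0, 200, size=1_000).astype(str),  # high-cardinality categorical
    'Meal': rng.choice(['BB', 'HB', 'SC'], size=1_000),     # low-cardinality categorical
    'LeadTime': rng.exponential(scale=80, size=1_000),      # right-skewed numeric
    'Adults': rng.integers(1, 4, size=1_000),               # roughly symmetric numeric
})

# Unique values per non-numeric column, largest first
cardinality = toy.select_dtypes(exclude='number').nunique().sort_values(ascending=False)

# Sample skewness per numeric column; values well above 0 indicate a right tail
skewness = toy.select_dtypes('number').skew().sort_values(ascending=False)

print(cardinality)
print(skewness)
```

Columns at the top of `cardinality` are candidates for frequency encoding rather than one-hot encoding, and columns at the top of `skewness` are candidates for a power transform.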

**Questionable Features**

First, I will drop the column `UUID` as it is a unique identifier and does not have any predictive value.

There are two features that I can identify from domain knowledge as being too strongly predictive of the ADR (`IsCanceled`, `ReservationStatus`). These features indicate whether or not a guest stayed (if they cancel or no-show, the revenue is zero).

Additionally, there are some temporal features that are either irrelevant to predictive modeling (`ArrivalDateYear`) or too closely related to the predictive features above (`ReservationStatusDate`).

I will drop these features so that the training data more closely matches the information that would be available when making real-world predictions.
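
The selection above rests on domain knowledge. One possible data-driven cross-check is the correlation ratio (eta squared): the share of target variance explained by the group means of a single categorical feature. The sketch below uses a hypothetical toy frame in which the target is zero for canceled bookings by construction; the helper `correlation_ratio` and the column values are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def correlation_ratio(categories: pd.Series, target: pd.Series) -> float:
    """Share of target variance explained by per-category means (eta squared)."""
    group_means = target.groupby(categories).transform('mean')
    ss_between = ((group_means - target.mean()) ** 2).sum()
    ss_total = ((target - target.mean()) ** 2).sum()
    return float(ss_between / ss_total)


# Hypothetical toy data: the target is forced to zero whenever Status is 'Canceled'.
rng = np.random.default_rng(903)
n = 1_000
status = rng.choice(['Check-Out', 'Canceled'], size=n)
toy = pd.DataFrame({
    'Status': status,
    'Meal': rng.choice(['BB', 'HB', 'SC'], size=n),  # unrelated to the target
    'ADR': np.where(status == 'Canceled', 0.0, rng.normal(100, 20, size=n)),
})

eta_status = correlation_ratio(toy['Status'], toy['ADR'])
eta_meal = correlation_ratio(toy['Meal'], toy['ADR'])
print(f'Status: {eta_status:.3f}, Meal: {eta_meal:.3f}')
```

A value close to 1 for a single feature suggests it nearly determines the target on its own, which is a reason to check whether it would be known at prediction time.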

---

In [None]:
df_data = df_data.drop(columns = ['UUID', 'IsCanceled',
                                  'ReservationStatus',
                                  'ReservationStatusDate',
                                  'ArrivalDateYear'])

In [None]:
df_data.head().T

# Train-Test Split and Preprocessor

In [None]:
df_data.head()

In [None]:
X = df_data.drop(columns = target_feature)
y = df_data[target_feature]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 903)

# Create Feature-Engine Pipeline

In [None]:
cat_feats = X.select_dtypes('object').columns
num_feats = X.select_dtypes('number').columns

cat_pipeline = fePipeline([('imputer', SimpleImputer(strategy = 'most_frequent')),
                           ('encoder', CountFrequencyEncoder(encoding_method = 'frequency',
                                                             unseen = 'encode',
                                                             missing_values = 'ignore'))])

num_pipeline = fePipeline([('imputer', SimpleImputer(strategy='mean')),
                           ('powertransformer', PowerTransformer(method = 'yeo-johnson')),
                           ('scaler', StandardScaler())])

## Combine transformers into a single ColumnTransformer
preprocessor = ColumnTransformer(transformers=[('num', num_pipeline, num_feats),
                                               ('cat', cat_pipeline, cat_feats)
                                               ])

## Define the target transformer
target_transformer = PowerTransformer(method='yeo-johnson')

## Instantiate the model
base_regressor = RandomForestRegressor(n_jobs = -1,
                                       random_state = 903,
                                       min_samples_split = 2,
                                       min_samples_leaf = 2)

# Create the TransformedTargetRegressor with Yeo-Johnson transformation
regressor = TransformedTargetRegressor(regressor=base_regressor, transformer=target_transformer)
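
Because the target is wrapped in a `TransformedTargetRegressor`, predictions come back on the original target scale: the wrapper applies the transformer's inverse after the inner regressor predicts. The sketch below illustrates this on synthetic data, with `LinearRegression` swapped in for the random forest to keep it fast; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed target (illustrative only)
rng = np.random.default_rng(903)
X_toy = rng.normal(size=(500, 3))
y_toy = np.exp(X_toy @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=500))

ttr = TransformedTargetRegressor(regressor=LinearRegression(),
                                 transformer=PowerTransformer(method='yeo-johnson'))
ttr.fit(X_toy, y_toy)

# predict() returns values on the original scale of y_toy
preds = ttr.predict(X_toy)

# Equivalent manual route: predict in transformed space, then invert the transform
inner = ttr.regressor_.predict(X_toy)
manual = ttr.transformer_.inverse_transform(inner.reshape(-1, 1)).ravel()

same = np.allclose(preds, manual, equal_nan=True)
print(same)
```

This is why error metrics computed from the pipeline's predictions can be read directly in ADR units without any manual back-transformation.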


# Test Models with Feature-Engine

In [None]:
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', regressor)
])
model_pipeline.fit(X_train, y_train)

In [None]:
train_score = round(model_pipeline.score(X_train, y_train),4)

test_score = round(model_pipeline.score(X_test, y_test), 4)

train_score, test_score

In [None]:
preds = model_pipeline.predict(X_test)

# scikit-learn metrics take (y_true, y_pred) in that order
mean_ae = metrics.mean_absolute_error(y_test, preds)
median_ae = metrics.median_absolute_error(y_test, preds)

print(f'The MAE is {mean_ae:,.2f} and the Median Absolute Error is {median_ae:,.2f}.')

In [None]:
depths = [tree.get_depth() for tree in model_pipeline[-1].regressor_.estimators_]

sns.histplot(depths);

In [None]:
# Calculate permutation importances
result = permutation_importance(model_pipeline,
                                X_test, y_test,
                                random_state=42,
                                n_jobs=-1)

In [None]:
# Extract importances and standard deviations
perm_importances = result.importances_mean
perm_importances_std = result.importances_std

# Create a DataFrame for easy plotting
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': perm_importances,
    'Importance_std': perm_importances_std
}).sort_values(by='Importance', ascending=False)

# Plot the feature importances
sns.barplot(x='Importance', y='Feature',data=importance_df)
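
The `Importance_std` column is computed above but not shown in the bar plot. A minimal sketch of one way to draw it as error bars with Matplotlib follows, using synthetic data and a `LinearRegression` model as stand-ins (both are assumptions for illustration; the real pipeline and `X_test`/`y_test` would take their place).

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data with four features, two of them informative
X_toy, y_toy = make_regression(n_samples=300, n_features=4, n_informative=2,
                               noise=5.0, random_state=903)
X_toy = pd.DataFrame(X_toy, columns=[f'feat_{i}' for i in range(4)])

model = LinearRegression().fit(X_toy, y_toy)
result = permutation_importance(model, X_toy, y_toy, n_repeats=10, random_state=903)

# Sort ascending so the most important feature is drawn at the top of the chart
importance_df = pd.DataFrame({
    'Feature': X_toy.columns,
    'Importance': result.importances_mean,
    'Importance_std': result.importances_std,
}).sort_values(by='Importance', ascending=True)

# Horizontal bars with one standard deviation across repeats as error bars
fig, ax = plt.subplots()
ax.barh(importance_df['Feature'], importance_df['Importance'],
        xerr=importance_df['Importance_std'])
ax.set_xlabel('Mean decrease in R² after permutation')
ax.set_ylabel('Feature')
```

Error bars that overlap zero suggest the feature's measured importance may not be distinguishable from noise.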

In [None]:
raise Exception('End of Testing')

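# Convert NaN and Non-Positive ADRs to .0001

In [None]:
# Start from the raw target; NaN fails the `> 0` check, so missing values are replaced too
df_data_target = df_data[target_feature]
df_data_target = df_data_target.where(df_data_target > 0, 0.0001)
df_data_target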

In [None]:
df_data_target.describe()

# Pipeline

In [None]:
def create_and_test_bl_model(X_train, y_train,
                             X_test, y_test,
                             regressor,
                             show_metrics = True):

    ### ---  Creating ColumnTransformer and sub-transformers for imputation and encoding --- ###
    num_cols = X_train.select_dtypes('number').columns
    cat_cols = X_train.select_dtypes('object').columns
    
    cat_pipe = Pipeline(steps=[('cat_imp', SimpleImputer(strategy = 'most_frequent')),
                               ('ohe',OneHotEncoder(drop = 'if_binary',
                                              handle_unknown='ignore',
                                              sparse_output=False))])
    
    num_pipe = Pipeline(steps=[('num_imp', SimpleImputer(strategy = 'most_frequent')),
                               ('scaler', StandardScaler())])
    
    preprocessor = ColumnTransformer(transformers=[('num', num_pipe, num_cols),
                                                   ('cat', cat_pipe, cat_cols)])
        
    # Integrating the preprocessor with the regressor into a pipeline
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('regressor', regressor)])
    
    pipeline.fit(X_train, y_train)
    
    # Report test-set metrics only when requested
    if show_metrics:
        preds = pipeline.predict(X_test)
        mae = metrics.mean_absolute_error(y_test, preds)
        rmse = metrics.root_mean_squared_error(y_test, preds)
        r2 = metrics.r2_score(y_test, preds)
        
        print(f'\nThe MAE is: {mae:.2f}',
              f'\nThe RMSE is: {rmse:.2f}',
              f'\nThe R2 is: {r2:.2f}')

    return pipeline

## DummyRegressor

In [None]:
# DummyRegressor is deterministic and does not accept a random_state argument
create_and_test_bl_model(X_train, y_train, X_test, y_test, DummyRegressor())

## HistGradientBoostingRegressor

In [None]:
create_and_test_bl_model(X_train,y_train, X_test, y_test,
                         HistGradientBoostingRegressor(random_state = 903))

## RandomForestRegressor

In [None]:
rfr_model = create_and_test_bl_model(X_train,y_train, X_test, y_test, 
                                     RandomForestRegressor(n_jobs = -1,
                                                           min_samples_split=2,
                                                           max_depth=75))

In [None]:
depths = [tree.get_depth() for tree in rfr_model[-1].estimators_]

sns.histplot(depths);

In [None]:
rfr_model[-1]

In [None]:
rfr_model[-1].feature_importances_

## SGDRegressor

In [None]:
# ### ---  Creating ColumnTransformer and sub-transformers for imputation and encoding --- ###
# num_cols = X.select_dtypes('number').columns
# cat_cols = X.select_dtypes('object').columns

# cat_pipe = Pipeline(steps=[('cat_imp', SimpleImputer(strategy = 'most_frequent')),
#                            ('ohe',
#                             OneHotEncoder(drop = 'if_binary',
#                                           handle_unknown='ignore',
#                                           sparse_output=False))])

# num_pipe = Pipeline(steps=[('cat_imp', SimpleImputer(strategy = 'most_frequent')),
#                            ('scaler', StandardScaler())])

# preprocessor = ColumnTransformer(transformers=[('num', num_pipe, num_cols),
#                                                ('cat', cat_pipe, cat_cols)])

# # Integrating the preprocessor with the SGDRegressor into a pipeline
# pipeline = Pipeline(steps=[('preprocessor', preprocessor),
#                            ('regressor', SGDRegressor(loss='huber',
#                                                       penalty='elasticnet',
#                                                       random_state=903))])

# pipeline.fit(X_train, y_train)


# preds = pipeline.predict(X_test)
# mae = metrics.mean_absolute_error(y_test, preds)
# rmse = metrics.root_mean_squared_error(y_test, preds)
# r2 = metrics.r2_score(y_test, preds)

# print(f'\nThe MAE is: {mae:.2f}',
#       f'\nThe RMSE is: {rmse:.2f}'
#       f'\nThe R2 is: {r2:.2f}')

## XGBRegressor

In [None]:
# pipeline = Pipeline(steps=[('preprocessor', preprocessor),
#                            ('regressor', XGBRegressor(objective='reg:squarederror', random_state=42))])

# # Fit the pipeline to the training data
# pipeline.fit(X_train, y_train)

# # Make predictions on the test data
# y_pred = pipeline.predict(X_test)

# # Evaluate the model
# mae = metrics.mean_absolute_error(y_test, y_pred)
# mse = metrics.mean_squared_error(y_test, y_pred)
# r2 = metrics.r2_score(y_test, y_pred)

# # Print the results
# print(f"Mean Absolute Error (MAE): {mae:,.2f}",)
# print(f"Mean Squared Error (MSE): {mse:,.2f}",)
# print(f"R-squared (R²): {r2:,.2f}")

# Results

---

The best model was the Random Forest Regressor model, with an MAE of # and R^2 of #. This model performed well with minor pre-processing, leading me to believe there may be features that are strongly predictive of the ADR. I will need to investigate further to confirm.

---