### VARMAX & Elastic Net Regression

Our goals with this experiment are to:

1) Handle Multivariate Data:

VARMA: Designed specifically for multivariate time series data in which the variables are interdependent.
Elastic Net: Can also handle regression problems with many variables, though it does not model the time series aspect.

2) Explore Regularization and Feature Selection:

VARMA: VARMA itself does not include regularization. VARMAX extends it with exogenous variables, and some implementations can add L1 regularization (similar to Lasso); the fit below does not apply any.
Elastic Net: Combines L1 and L2 regularization. The L1 penalty helps with feature selection and the L2 penalty helps with multicollinearity.
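
To make the two penalties concrete, here is a minimal sketch on synthetic data (not the fare dataset). It assumes scikit-learn's documented objective, `1/(2n) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2`, and compares an ordinary least squares fit with an Elastic Net fit when two features are nearly collinear. The helper names `penalty` and `objective` and the synthetic coefficients are illustrative only.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

# Synthetic data: 11 features, only three of which drive the target (assumption)
rng = np.random.default_rng(42)
n_samples, n_features = 500, 11
X = rng.normal(size=(n_samples, n_features))
# Make feature 1 a near-copy of feature 0 to create multicollinearity
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n_samples)
true_coef = np.zeros(n_features)
true_coef[[0, 4, 7]] = [5.0, -3.0, 2.0]
y = X @ true_coef + rng.normal(scale=1.0, size=n_samples)

alpha, l1_ratio = 1.0, 0.5  # scikit-learn's default ElasticNet settings


def penalty(w):
    """Combined L1 + L2 penalty used by ElasticNet."""
    l1 = alpha * l1_ratio * np.abs(w).sum()
    l2 = 0.5 * alpha * (1 - l1_ratio) * (w ** 2).sum()
    return l1 + l2


def objective(model):
    """Penalized least squares objective evaluated at a fitted model."""
    resid = y - model.predict(X)
    return (resid ** 2).sum() / (2 * n_samples) + penalty(model.coef_)


ols = LinearRegression().fit(X, y)
enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42).fit(X, y)

print("OLS coefficients:        ", np.round(ols.coef_, 2))
print("Elastic Net coefficients:", np.round(enet.coef_, 2))
print("Coefficients set exactly to zero:", int(np.sum(enet.coef_ == 0.0)))
print("Penalty  (OLS vs Elastic Net):", penalty(ols.coef_), penalty(enet.coef_))
```

Coefficients that the L1 term drives exactly to zero correspond to dropped features, while the L2 term keeps the weights on the two collinear columns from growing large with opposite signs.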

Ideally we would obtain results from VARMAX, but it is computationally expensive. Elastic Net Regression offers similar benefits (minus the time series component), so we will fall back on it if VARMAX does not succeed.

In [1]:
import pandas as pd
import numpy as np
from statsmodels.tsa.statespace.varmax import VARMAX
from sklearn.metrics import mean_squared_error
import joblib
from sklearn.model_selection import train_test_split

# Load the saved sets from data/processed using numpy
X_train = np.load('../data/processed/X_train.npy')
X_val = np.load('../data/processed/X_val.npy')
y_train = np.load('../data/processed/y_train.npy')
y_val = np.load('../data/processed/y_val.npy')

# Specify the fraction of data to use (e.g., 20%)
data_fraction = 0.2

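# Take a contiguous subset of the training data. A random sample would
# shuffle the rows and destroy the temporal ordering that VARMA relies on.
# This assumes the rows are stored in chronological order.
X_train_subset, _, y_train_subset, _ = train_test_split(
    X_train, y_train, train_size=data_fraction, shuffle=False
)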

# Create a pandas DataFrame from your training data
feature_columns = [f'feature{i}' for i in range(1, 12)]  # feature1 ... feature11
train_df = pd.DataFrame(X_train_subset, columns=feature_columns)

# Create a pandas DataFrame for the target variable
target_df = pd.DataFrame(y_train_subset, columns=['totalFare'])

# Combine the target variable and feature variables
train_df = pd.concat([target_df, train_df], axis=1)

# Fit a VARMA model
p = 2  # Specify the order for autoregressive (AR) component
q = 1  # Specify the order for moving average (MA) component
model = VARMAX(train_df, order=(p, q))
model_fitted = model.fit(disp=False)

# Save the fitted model to a .joblib file
joblib.dump(model_fitted, '../models/varma_model.joblib')

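# Forecast as many steps as there are validation rows. The forecast contains
# one column per series in train_df, so keep only the target column.
forecast = model_fitted.forecast(steps=len(X_val))
fare_forecast = forecast['totalFare']

# Calculate Mean Squared Error (MSE) for the target predictions
mse = mean_squared_error(y_val, fare_forecast)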

print(f'Mean Squared Error (MSE) on Validation Data: {mse:.2f}')

  warn('Estimation of VARMA(p,q) models is not generically robust,'


The model was left to run overnight. After 1000 minutes, and despite training on only 20% of the data, it had produced no results. As expected, VARMAX was too computationally expensive for a dataset of this size. We will use Elastic Net Regression instead.
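
Part of the cost can be seen by counting the parameters the optimizer has to estimate. The sketch below assumes an unrestricted VARMA(p, q) with one constant per series and an unstructured error covariance matrix, which we believe matches the statsmodels defaults. The helper `varma_param_count` is hypothetical and written only for this illustration.

```python
def varma_param_count(k_endog, p, q, trend_terms=1):
    """Count the free parameters of an unrestricted VARMA(p, q).

    k_endog     -- number of series modelled jointly
    p, q        -- autoregressive and moving average orders
    trend_terms -- trend coefficients per series (1 for a constant)
    """
    ar = p * k_endog ** 2                 # one k x k matrix per AR lag
    ma = q * k_endog ** 2                 # one k x k matrix per MA lag
    trend = trend_terms * k_endog         # intercept for each series
    cov = k_endog * (k_endog + 1) // 2    # unique entries of the error covariance
    return ar + ma + trend + cov


k = 12  # totalFare plus the 11 feature columns
print(varma_param_count(k, p=2, q=1))  # 288 + 144 + 12 + 78 = 522
```

Because the matrix terms grow with the square of the number of series, treating all 11 features as endogenous series alongside `totalFare` leaves several hundred parameters to fit by maximum likelihood.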

In [3]:
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
import joblib

# Load the saved sets from data/processed using numpy
X_train = np.load('../data/processed/X_train.npy')
X_val = np.load('../data/processed/X_val.npy')
y_train = np.load('../data/processed/y_train.npy')
y_val = np.load('../data/processed/y_val.npy')

# Initialize the ElasticNet model with scikit-learn's default
# hyperparameters (alpha=1.0, l1_ratio=0.5); no tuning is done here
elastic_net_model = ElasticNet(random_state=42)

# Fit the model
elastic_net_model.fit(X_train, y_train)

# Save the fitted model to a .joblib file
joblib.dump(elastic_net_model, '../models/elastic_net_model.joblib')

# Make predictions on both training and validation data
y_train_pred = elastic_net_model.predict(X_train)
y_val_pred = elastic_net_model.predict(X_val)

# Calculate Mean Squared Error (MSE) for the predictions
mse_train = mean_squared_error(y_train, y_train_pred)
mse_val = mean_squared_error(y_val, y_val_pred)

# Print MSE for training and validation data
print(f'Mean Squared Error (MSE) on Training Data: {mse_train:.2f}')
print(f'Mean Squared Error (MSE) on Validation Data: {mse_val:.2f}')


Mean Squared Error (MSE) on Training Data: 27711.33
Mean Squared Error (MSE) on Validation Data: 27828.64


The model performs better than the base model but worse than Random Forest, XGBoost and Linear Regression. 
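
The Elastic Net above uses scikit-learn's default penalty settings, so some of the gap to Linear Regression may come from an untuned penalty rather than from the model itself. As a possible next step, the sketch below shows how `ElasticNetCV` could select `alpha` and `l1_ratio` by cross-validation. It runs on synthetic data generated with `make_regression`, so its numbers say nothing about the fare dataset; the candidate `l1_ratios` grid is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real X_train / y_train arrays (assumption)
X, y = make_regression(
    n_samples=1000, n_features=11, n_informative=6, noise=10.0, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate mixes between L2-heavy (0.1) and pure Lasso (1.0)
l1_ratios = [0.1, 0.5, 0.9, 1.0]

# Cross-validated search over alpha (automatic grid) and l1_ratio
tuned = ElasticNetCV(l1_ratio=l1_ratios, cv=5, random_state=42).fit(X_tr, y_tr)

# Untuned model with default alpha=1.0, l1_ratio=0.5, as in the cell above
default = ElasticNet(random_state=42).fit(X_tr, y_tr)

mse_default = mean_squared_error(y_va, default.predict(X_va))
mse_tuned = mean_squared_error(y_va, tuned.predict(X_va))

print(f'Selected alpha: {tuned.alpha_:.4f}, l1_ratio: {tuned.l1_ratio_}')
print(f'Validation MSE with defaults: {mse_default:.2f}')
print(f'Validation MSE after tuning:  {mse_tuned:.2f}')
```

On the real data the same pattern would replace the synthetic arrays with `X_train`, `y_train`, `X_val` and `y_val`. Whether tuning closes the gap to the other models would have to be checked empirically.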