# <span style="font-weight:bold; font-size: 3rem; color:#2656a3;">**Msc. BDS Module - Data Engineering and Machine Learning Operations in Business (MLOPs)** </span> <span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 03: Training Pipeline</span>

## <span style='color:#2656a3'> 🗒️ This notebook is divided into the following sections:
1. Feature selection.
2. Creating a Feature View.
3. Training datasets creation - splitting into train and test sets.
4. Training the model.
5. Register the model to Hopsworks Model Registry.

## <span style='color:#2656a3'> ⚙️ Import of libraries and packages
We start by importing the libraries needed for this notebook. We also suppress warnings to keep the output clean.

In [6]:
# Importing the packages and libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

## <span style="color:#2656a3;"> 📡 Connecting to Hopsworks Feature Store
We connect to Hopsworks Feature Store so we can retrieve the Feature Groups and select features for training data.

In [7]:
# Importing the hopsworks module for interacting with the Hopsworks platform
import hopsworks

# Logging into the Hopsworks project
project = hopsworks.login()

# Getting the feature store from the project
fs = project.get_feature_store() 

Connected. Call `.close()` to terminate connection gracefully.



In [None]:
# Retrieve the feature groups
electricity_fg = fs.get_feature_group(
    name='electricity_prices',
    version=1,
)

weather_fg = fs.get_feature_group(
    name='weather_measurements',
    version=1,
)

danish_calendar_fg = fs.get_feature_group(
    name='dk_calendar',
    version=1,
)

## <span style="color:#2656a3;"> 🖍 Feature View Creation and Retrieving </span>

We first select the features that we want to include for model training.

Since we specified `date` and `timestamp` as the `primary_key` in `1_feature_backfill`, we can now join `electricity_fg`, `weather_fg` and `danish_calendar_fg` on those keys.

`join_type` specifies the type of join to perform. An inner join retains only the rows whose keys are present in all of the joined feature groups.
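
To illustrate what an inner join keeps, here is a minimal pandas sketch. The two toy dataframes, their column names and their values are made up for illustration and are not the actual feature groups. Hopsworks performs the join through its own query object, but the row-matching idea is the same.

```python
import pandas as pd

# Hypothetical toy data: prices exist for hours 0-3, weather only for hours 1-4
prices = pd.DataFrame({"timestamp": [0, 1, 2, 3], "price": [0.50, 0.42, 0.47, 0.61]})
weather = pd.DataFrame({"timestamp": [1, 2, 3, 4], "temperature": [4.1, 3.8, 3.5, 3.9]})

# An inner join keeps only the keys that appear in both dataframes
joined = prices.merge(weather, on="timestamp", how="inner")
print(joined)
```

Only the timestamps 1, 2 and 3 survive, because hour 0 has no weather row and hour 4 has no price row.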

In [None]:
# Select features for training data and join them, excluding duplicate columns
selected_features_training = electricity_fg.select_all()\
    .join(weather_fg.select_except(["timestamp", "datetime", "hour"]), join_type="inner")\
    .join(danish_calendar_fg.select_all(), join_type="inner")

In [None]:
# Display the first 5 rows of the selected features
selected_features_training.show(5)

A `Feature View` stands between the **Feature Groups** and the **Training Dataset**. By combining **Feature Groups** we can create a **Feature View**, which stores metadata about our data. From the **Feature View** we can then create a **Training Dataset**.

To create a Feature View we can use the `fs.get_or_create_feature_view()` method.

We can specify the following parameters:

- `name` - Name of the feature view to create.
- `version` - Version of the feature view to create.
- `query` - Query object with the data.

In [None]:
# Getting or creating a feature view named 'dk1_electricity_training_feature_view'
version = 1
feature_view_training = fs.get_or_create_feature_view(
    name='dk1_electricity_training_feature_view',
    version=version,
    query=selected_features_training,
)

## <span style="color:#2656a3;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, a training dataset is generated from a query defined by the parent FeatureView, which determines the set of features.

**Training Dataset may contain splits such as:** 
* Training set: This subset of the training data is utilized for model training.
* Validation set: Used for evaluating hyperparameters during model training. *(We have not included a validation set for this project)*
* Test set: Reserved as a holdout subset of training data for evaluating a trained model's performance.

The training dataset is created using the Feature View's `training_data()` method.

In [None]:
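# Retrieve training data from the feature view 'feature_view_training', assigning the features to 'df'.
# The second return value (labels) is ignored because the feature view was created without labels.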
df, _ = feature_view_training.training_data(
    description = 'Electricity Prices Training Dataset',
)

In [None]:
# Sorting the dataframe by timestamp so that lag features and the train/test split follow chronological order
df.sort_values(by='timestamp', ascending=True, inplace=True)

# Resetting the index of the dataframe
df = df.reset_index(drop=True)

# Display the first 5 rows of the training data
df.head()

### Lagged data

In [None]:
# Creating a copy of the dataframe
df_lagged = df.copy()

# Creating lag features for time-series data
def create_lag_features(data, lag_steps=7*24): # lag_steps is the number of past observations to add as lag features
    for i in range(1, lag_steps + 1):
        data[f'lag_{i}'] = data['dk1_spotpricedkk_kwh'].shift(i) # Price observed i rows (hours) earlier
    return data

# Applying lag feature creation to the dataset
lagged_data = create_lag_features(df_lagged, lag_steps=7*24) # Creating 168 lag features (7 days x 24 hours) on dk1_spotpricedkk_kwh

In [None]:
lagged_data.tail()
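
As a small illustration of what `shift` does, the sketch below builds two lag features on a made-up price series (the values and the column name `price` are placeholders). Each `lag_i` column holds the value from `i` rows earlier, so the first `i` rows have no history and contain `NaN`.

```python
import pandas as pd

# Made-up price series for illustration
toy = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0, 5.0]})

# lag_1 is the previous row's price, lag_2 the price two rows back
for i in range(1, 3):
    toy[f"lag_{i}"] = toy["price"].shift(i)
print(toy)

# Count the rows that lack a complete history (at least one NaN lag)
n_incomplete = int(toy.isna().any(axis=1).sum())
print(n_incomplete)
```

With 168 lags, the same logic leaves the first 168 rows with incomplete history. XGBoost treats `NaN` as a missing value, so these rows do not strictly have to be dropped before training.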

### Rolling mean

In [None]:
df_rolling = df.copy()

# Creating rolling mean for time-series data
def create_rolling_mean(data, window_size=5): # window_size is the number of previous observations to consider // 5 hours
    # Shift by one row so the window covers only previous observations and excludes the current target value
    data['rolling_mean'] = data['dk1_spotpricedkk_kwh'].shift(1).rolling(window=window_size).mean()
    return data

# Applying rolling mean to the dataset
rolled_data = create_rolling_mean(df_rolling, window_size=5) # 5 hours

In [None]:
rolled_data.tail()
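
One thing to watch with rolling statistics on the target column: `rolling(window).mean()` on its own includes the current row in each window, so the feature would partly contain the value we are trying to predict. A minimal sketch on made-up numbers, showing how shifting by one row first restricts the window to past observations:

```python
import pandas as pd

# Made-up price series for illustration
price = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="price")

# Window of 2 that includes the current observation
including_current = price.rolling(window=2).mean()

# Window of 2 over the previous observations only
previous_only = price.shift(1).rolling(window=2).mean()

print(pd.DataFrame({
    "price": price,
    "including_current": including_current,
    "previous_only": previous_only,
}))
```

In the last row the first variant averages 4.0 and 5.0 and therefore uses the current price, while the second averages 3.0 and 4.0.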

### Fourier Transformation

In [None]:
df_ft = df.copy()

# Applying Fourier transformation for capturing seasonality

from scipy.fft import fft

def apply_fourier_transform(data):

    values = data['dk1_spotpricedkk_kwh'].values

    fourier_transform = fft(values)

    data['fourier_transform'] = np.abs(fourier_transform)

    return data

# Applying Fourier transformation to the dataset

fourier_data = apply_fourier_transform(df_ft)

In [None]:
fourier_data
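
Note that the output of an FFT is indexed by frequency rather than by time, so its magnitudes do not line up row by row with the hourly observations. A common way to use the transform for seasonality is to identify the dominant period and then encode it with sine and cosine features. The following is a sketch of that alternative on a synthetic signal, not the approach used in this notebook. The 24-hour cycle and all variable names are assumptions made for illustration, and `numpy.fft` is used in place of `scipy.fft` only to keep the example dependency-free.

```python
import numpy as np

# Synthetic hourly signal: 10 days with a pure 24-hour cycle (assumed for illustration)
n_hours = 240
t = np.arange(n_hours)
signal = np.sin(2 * np.pi * t / 24)

# Magnitude spectrum of the real-valued signal; index k means k cycles over n_hours
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(n_hours, d=1.0)  # frequencies in cycles per hour

# Skip index 0 (the mean of the signal) and pick the strongest frequency
peak = int(np.argmax(spectrum[1:]) + 1)
dominant_period = int(round(1.0 / freqs[peak]))
print(dominant_period)

# Time-indexed seasonal features built from the detected period
sin_feature = np.sin(2 * np.pi * t / dominant_period)
cos_feature = np.cos(2 * np.pi * t / dominant_period)
```

Unlike the raw FFT magnitudes, `sin_feature` and `cos_feature` have one value per observation and can be computed for future timestamps as well.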

### <span style="color:#2656a3;"> ⛳️ Dataset with train and test splits</span>

Here we define our train and test splits for training the model.

In [None]:
lagged_data = lagged_data.drop(['timestamp', 'datetime', 'date'], axis=1)
rolled_data = rolled_data.drop(['timestamp', 'datetime', 'date'], axis=1)
fourier_data = fourier_data.drop(['timestamp', 'datetime', 'date'], axis=1)


In [None]:
# lagged_data = lagged_data.dropna()
# rolled_data = rolled_data.dropna()
# fourier_data = fourier_data.dropna()

In [None]:
# Splitting time-series data chronologically into training (first 80%) and testing (last 20%) sets
# .copy() creates independent dataframes so the target column can later be popped without a SettingWithCopyWarning
train_size_lagged = int(len(lagged_data) * 0.8)
X_train_data_lag, X_test_data_lag = lagged_data[:train_size_lagged].copy(), lagged_data[train_size_lagged:].copy()

train_size_rolled = int(len(rolled_data) * 0.8)
X_train_data_roll, X_test_data_roll = rolled_data[:train_size_rolled].copy(), rolled_data[train_size_rolled:].copy()

train_size_fourier = int(len(fourier_data) * 0.8)
X_train_data_fourier, X_test_data_fourier = fourier_data[:train_size_fourier].copy(), fourier_data[train_size_fourier:].copy()

In [None]:
y_train_data_lag = X_train_data_lag.pop('dk1_spotpricedkk_kwh')
y_test_data_lag = X_test_data_lag.pop('dk1_spotpricedkk_kwh')
y_train_data_roll = X_train_data_roll.pop('dk1_spotpricedkk_kwh')
y_test_data_roll = X_test_data_roll.pop('dk1_spotpricedkk_kwh')
y_train_data_fourier = X_train_data_fourier.pop('dk1_spotpricedkk_kwh')
y_test_data_fourier = X_test_data_fourier.pop('dk1_spotpricedkk_kwh')

## <span style="color:#2656a3;">🧬 Modeling</span>

For Modeling we initialize the `XGBoost Regressor`.

The XGBoost Regressor is a powerful and versatile algorithm known for its effectiveness in a wide range of regression tasks, including predictive modeling and time series forecasting. Specifically tailored for regression tasks, it aims to predict continuous numerical values. The algorithm constructs an ensemble of regression trees, optimizing them to minimize a specified loss function, commonly the mean squared error for regression tasks. Ultimately, the final prediction is derived by aggregating the predictions of individual trees.
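
To make the idea of an additive tree ensemble concrete, here is a heavily simplified sketch of gradient boosting for squared error, using depth-one trees (stumps) in plain NumPy. It is not XGBoost's actual algorithm, which adds regularization, second-order gradient information and many optimizations. The function names, the learning rate and the toy data are assumptions made for illustration.

```python
import numpy as np

LEARNING_RATE = 0.3  # illustrative value, not a tuned setting

def fit_stump(x, residual):
    """Find the threshold on a single feature that best splits the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:  # dropping the largest value keeps the right side non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        # Squared error when each side predicts its own mean
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_trees=50):
    """Fit stumps one after another, each on the residuals of the current ensemble."""
    base = y.mean()
    prediction = np.full(len(y), base, dtype=float)
    stumps = []
    for _ in range(n_trees):
        residual = y - prediction  # negative gradient of the squared error loss
        stump = fit_stump(x, residual)
        prediction += LEARNING_RATE * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps

def predict(base, stumps, x):
    """The final prediction is the base value plus the scaled sum of all stump outputs."""
    return base + LEARNING_RATE * sum(predict_stump(s, x) for s in stumps)

# Toy regression problem
x = np.linspace(0, 6, 60)
y = np.sin(x)

base, stumps = boost(x, y)
mse_baseline = np.mean((y - y.mean()) ** 2)
mse_boosted = np.mean((y - predict(base, stumps, x)) ** 2)
print(mse_baseline, mse_boosted)
```

Each stump corrects part of what the previous ones got wrong, which is why the training error of the ensemble ends up below that of the constant baseline.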

In [None]:
# Training the XGBoost model

from xgboost import XGBRegressor

xgb_model_lag = XGBRegressor(objective='reg:squarederror')

xgb_model_lag.fit(X_train_data_lag, y_train_data_lag)

In [None]:
# Importing the model validation metric functions from the sklearn library
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [None]:
# Predict target values on the test set
y_pred_lag = xgb_model_lag.predict(X_test_data_lag)

# Calculate Mean Squared Error (MSE) using sklearn
mse = mean_squared_error(y_test_data_lag, y_pred_lag)
print("⛳️ MSE:", mse)

# Calculate R squared using sklearn
r2 = r2_score(y_test_data_lag, y_pred_lag)
print("⛳️ R^2:", r2)

# Calculate Mean Absolute Error (MAE) using sklearn
mae = mean_absolute_error(y_test_data_lag, y_pred_lag)
print("⛳️ MAE:", mae)

In [None]:
# Importing the matplotlib library for plotting the predictions against the expected values
import matplotlib.pyplot as plt

# Plot the predictions against the expected values
plt.title('Expected vs Predicted Electricity Prices for area DK1')

# Plot the predicted values
plt.bar(x=np.arange(len(y_pred_lag)), height=y_pred_lag, label='predicted', alpha=0.7)

# Plot the expected values
plt.bar(x=np.arange(len(y_pred_lag)), height=y_test_data_lag, label='actual', alpha=0.7)

# Add labels to the x-axis and y-axis
plt.xlabel('Time')
plt.ylabel('Price in DKK')

# Add a legend and display the plot
plt.legend()
plt.show() 

In [None]:
# Import the plot_importance function from XGBoost
from xgboost import plot_importance
# Plot feature importances using the plot_importance function from XGBoost
plot_importance(
    xgb_model_lag, 
    max_num_features=25,  # Display the top 25 most important features
)
plt.show()

In [None]:
# Training the XGBoost model

xgb_model_roll = XGBRegressor(objective='reg:squarederror')

xgb_model_roll.fit(X_train_data_roll, y_train_data_roll)

In [None]:
# Predict target values on the test set
y_pred_roll = xgb_model_roll.predict(X_test_data_roll)

# Calculate Mean Squared Error (MSE) using sklearn
mse = mean_squared_error(y_test_data_roll, y_pred_roll)
print("⛳️ MSE:", mse)

# Calculate R squared using sklearn
r2 = r2_score(y_test_data_roll, y_pred_roll)
print("⛳️ R^2:", r2)

# Calculate Mean Absolute Error (MAE) using sklearn
mae = mean_absolute_error(y_test_data_roll, y_pred_roll)
print("⛳️ MAE:", mae)

In [None]:
# Importing the matplotlib library for plotting the predictions against the expected values
import matplotlib.pyplot as plt

# Plot the predictions against the expected values
plt.title('Expected vs Predicted Electricity Prices for area DK1')

# Plot the predicted values
plt.bar(x=np.arange(len(y_pred_roll)), height=y_pred_roll, label='predicted', alpha=0.7)

# Plot the expected values
plt.bar(x=np.arange(len(y_pred_roll)), height=y_test_data_roll, label='actual', alpha=0.7)

# Add labels to the x-axis and y-axis
plt.xlabel('Time')
plt.ylabel('Price in DKK')

# Add a legend and display the plot
plt.legend()
plt.show() 

In [None]:
# Import the plot_importance function from XGBoost
from xgboost import plot_importance
# Plot feature importances using the plot_importance function from XGBoost
plot_importance(
    xgb_model_roll, 
    max_num_features=25,  # Display the top 25 most important features
)
plt.show()

In [None]:
# Training the XGBoost model

xgb_model_fourier = XGBRegressor(objective='reg:squarederror')

xgb_model_fourier.fit(X_train_data_fourier, y_train_data_fourier)

In [None]:
# Predict target values on the test set
y_pred_fourier = xgb_model_fourier.predict(X_test_data_fourier)

# Calculate Mean Squared Error (MSE) using sklearn
mse = mean_squared_error(y_test_data_fourier, y_pred_fourier)
print("⛳️ MSE:", mse)

# Calculate R squared using sklearn
r2 = r2_score(y_test_data_fourier, y_pred_fourier)
print("⛳️ R^2:", r2)

# Calculate Mean Absolute Error (MAE) using sklearn
mae = mean_absolute_error(y_test_data_fourier, y_pred_fourier)
print("⛳️ MAE:", mae)

In [None]:
# Importing the matplotlib library for plotting the predictions against the expected values
import matplotlib.pyplot as plt

# Plot the predictions against the expected values
plt.title('Expected vs Predicted Electricity Prices for area DK1')

# Plot the predicted values
plt.bar(x=np.arange(len(y_pred_fourier)), height=y_pred_fourier, label='predicted', alpha=0.7)

# Plot the expected values
plt.bar(x=np.arange(len(y_pred_fourier)), height=y_test_data_fourier, label='actual', alpha=0.7)

# Add labels to the x-axis and y-axis
plt.xlabel('Time')
plt.ylabel('Price in DKK')

# Add a legend and display the plot
plt.legend()
plt.show() 

In [None]:
# Import the plot_importance function from XGBoost
from xgboost import plot_importance
# Plot feature importances using the plot_importance function from XGBoost
plot_importance(
    xgb_model_fourier, 
    max_num_features=25,  # Display the top 25 most important features
)
plt.show()

## <span style='color:#ff5f27'> ⚖️ Model Validation

After fitting the XGBoost Regressor, we evaluate the performance using the following validation metrics.

**Mean Squared Error (MSE):**
- Measures the average squared difference between the actual and predicted values in a regression problem. 
- It squares the differences between predicted and actual values to penalize larger errors more heavily.
- Lower MSE values indicate better model performance.

**R-squared (R²):**
- Measures the proportion of the variance in the dependent variable (target) that is predictable from the independent variables (features) in a regression model.
- R-squared is at most 1. A value of 1 indicates that the model explains all the variability in the target variable, and 0 indicates that it explains none, that is, it does no better than always predicting the mean. On held-out data the value can even be negative when the model does worse than that baseline.
- R-squared is a useful metric for assessing how well the regression model fits the data it is computed on. Here it is computed on the test set, so it reflects the fit on data that was not used for training.

**Mean Absolute Error (MAE):**
- Measures the average absolute difference between the actual and predicted values.
- MAE is less sensitive to outliers compared to MSE because it does not square the errors.
- Like MSE, lower MAE values indicate better model performance.

MSE focuses on the magnitude of errors, while R-squared provides insight into the proportion of variance explained by the model. MAE provides a measure of average error without considering the direction of errors.
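
The three metrics can be computed by hand, which makes their definitions concrete. The sketch below uses made-up actual and predicted values (not results from this notebook) and plain NumPy. The corresponding functions in `sklearn.metrics` follow the same definitions.

```python
import numpy as np

# Made-up values for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

errors = y_true - y_pred

# Mean Squared Error: average of the squared errors
mse = np.mean(errors ** 2)

# Mean Absolute Error: average of the absolute errors
mae = np.mean(np.abs(errors))

# R squared: 1 minus the ratio of residual variation to total variation around the mean
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)  # 0.375
print("MAE:", mae)  # 0.5
print("R^2:", r2)   # approximately 0.7
```

The single error of 1.0 contributes 1.0 of the 1.5 total squared error but only half of the total absolute error, which shows why MSE reacts more strongly to large errors than MAE.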

In this case, the `MSE` is 0.0546, which suggests that, on average, the squared difference between the actual and predicted values is relatively low. An `R^2` value of 0.933 indicates that approximately 93.3% of the variance in the dependent variable is predictable from the feature variables in the model. This is a high value, suggesting that the model explains a large portion of the variability in the data. An `MAE` of 0.1604 suggests that, on average, the model's predictions are off by approximately 0.1604 DKK per kWh from the actual values. As with MSE, a lower MAE indicates better accuracy of the model.

In summary, based on these metrics, the model seems to perform quite well. It has relatively low error (both in terms of MSE and MAE), and a high percentage of the variance in the dependent variable is explained by the feature variables, as indicated by the high R-squared value.

As shown in the feature importance plot above, features like `temperature`, `day`, `hour` and `month` are the most important for predicting the dependent variable.

## <span style='color:#2656a3'>🗄 Model Registry</span>

The Model Registry in Hopsworks enables us to store the trained model. The model registry centralizes model management, enabling models to be securely accessed and governed. We can also save model metrics with the model, enabling the user to understand the performance of the model on test (or unseen) data.

### <span style="color:#ff5f27;">⚙️ Model Schema</span>
A model schema defines the structure and format of the input and output data that a machine learning model expects and produces, respectively. It serves as a **blueprint** for understanding how to interact with the model in terms of input features and output predictions. In the context of the Hopsworks platform, a model schema is typically defined using the Schema class, which specifies the features expected in the input data and the target variable in the output data. This schema helps ensure consistency and compatibility between the model and the data it operates on.
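
The steps described above could be sketched as follows. This is only an outline of how registration might look, under several assumptions: the calls `project.get_model_registry()`, `Schema`, `ModelSchema`, `mr.python.create_model()` and `model.save()` are, to our knowledge, part of the Hopsworks Python API, but they should be checked against the documentation of the installed version. The helper names `build_metrics` and `register_model`, the model name, the directory and the file name are hypothetical.

```python
import os

def build_metrics(mse, r2, mae):
    """Collect the evaluation metrics in a plain dict that can be stored with the model."""
    return {"mse": float(mse), "r2": float(r2), "mae": float(mae)}

def register_model(project, model, X_train, y_train, metrics,
                   model_name="electricity_price_prediction_model",  # hypothetical name
                   model_dir="electricity_price_model"):             # hypothetical directory
    # Imported inside the function so the sketch can be loaded without a Hopsworks installation
    import joblib
    from hsml.schema import Schema
    from hsml.model_schema import ModelSchema

    # Describe the model's inputs and outputs based on the training data
    model_schema = ModelSchema(
        input_schema=Schema(X_train),
        output_schema=Schema(y_train),
    )

    # Serialize the trained model to a local directory
    os.makedirs(model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(model_dir, "xgboost_model.pkl"))  # hypothetical file name

    # Create the registry entry and upload the directory with the saved model
    mr = project.get_model_registry()
    registry_model = mr.python.create_model(
        name=model_name,
        metrics=metrics,
        model_schema=model_schema,
        input_example=X_train.sample(),
        description="DK1 electricity price predictor",
    )
    registry_model.save(model_dir)
    return registry_model

# Metric values as reported in the Model Validation section
metrics = build_metrics(mse=0.0546, r2=0.933, mae=0.1604)
print(metrics)
```

With a live Hopsworks connection, the function could be called as `register_model(project, xgb_model_lag, X_train_data_lag, y_train_data_lag, metrics)`, using whichever of the three trained models is chosen for deployment.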

## <span style="color:#2656a3;">⏭️ **Next:** Part 04: Batch Inference </span>

In the next notebook, we will use the registered model to make predictions on batch data.