# <span style="font-weight:bold; font-size: 3rem; color:#2656a3;">**Data Engineering and Machine Learning Operations in Business** </span> <span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 03: Training Pipeline</span>

## 🗒️ This notebook is divided into the following sections:
1. Feature selection.
2. Feature transformations.
3. Training datasets creation.
4. Loading the training data.
5. Train the model.
6. Register model to Hopsworks model registry.

## <span style='color:#2656a3'> ⚙️ Import of libraries and packages </span>

In [1]:
!pip install tensorflow --quiet

In [2]:
# Importing the libraries needed in this notebook
import inspect
import datetime

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

## <span style="color:#2656a3;"> 📡 Connecting to Hopsworks Feature Store </span>

In [3]:
# Importing the hopsworks module
import hopsworks

# Logging in to the Hopsworks project
project = hopsworks.login()

# Getting the feature store from the project
fs = project.get_feature_store() 

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/556180
Connected. Call `.close()` to terminate connection gracefully.


In [4]:
# Retrieve the feature groups
electricity_fg = fs.get_feature_group(
    name='electricity_prices',
    version=1,
)

weather_fg = fs.get_feature_group(
    name='weather_measurements',
    version=1,
)

danish_holidays_fg = fs.get_feature_group(
    name='danish_holidayss',
    version=1,
)
forecast_renewable_energy_fg = fs.get_feature_group(
    name='forecast_renewable_energy',
    version=1
)

## <span style="color:#2656a3;"> 🖍 Feature View Creation and Retrieving </span>

We first select the features that we want to include for model training.

Since we specified `date` as the `primary_key` and `timestamp` as the `event_time` in Part 01, we can now join the `electricity_fg`, `weather_fg` and `forecast_renewable_energy_fg` feature groups.

Hmm, should 'time' actually be 'date'???

In [5]:
# Select features for training data
selected_features = electricity_fg.select_all()\
    .join(weather_fg.select_except(["timestamp", "time"]))\
    .join(forecast_renewable_energy_fg.select_except(["timestamp", "time"]))\
    .join(danish_holidays_fg.select_all())

In [6]:
# Uncomment this if you would like to view your selected features
# selected_features.show(5)

### <span style="color:#2656a3;"> 🤖 Transformation Functions</span>

We preprocess our data using *min-max scaling* on the numerical features and *label encoding* on the one categorical feature we have.
To achieve this, we create a mapping between our features and transformation functions. Because the transformation functions are attached to the feature view, the statistics they rely on (such as the minimum and maximum used for min-max scaling) are computed from the training data only, which prevents information from the validation or test sets from leaking into the model.
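
As a rough illustration of why the scaler is fitted on the training split only, here is a minimal plain-Python sketch. It does not use the Hopsworks built-in `min_max_scaler` or `label_encoder`, whose implementation details may differ; the helper names and the toy values (including the category names) are hypothetical.

```python
def fit_min_max(values):
    """Return the (min, max) statistics of the training values."""
    return min(values), max(values)


def apply_min_max(values, stats):
    """Scale values with previously fitted statistics: (x - min) / (max - min)."""
    lo, hi = stats
    span = hi - lo
    # Guard against a constant column, where max == min
    return [0.0 if span == 0 else (v - lo) / span for v in values]


def fit_label_encoder(categories):
    """Map each distinct category to an integer, in sorted order."""
    return {cat: idx for idx, cat in enumerate(sorted(set(categories)))}


# Hypothetical toy data: a numerical feature split into train and validation
train_temperature = [2.0, 4.0, 10.0]
val_temperature = [6.0, 12.0]

# The statistics are fitted on the training split only ...
stats = fit_min_max(train_temperature)
train_scaled = apply_min_max(train_temperature, stats)
# ... and reused unchanged for the validation split
val_scaled = apply_min_max(val_temperature, stats)

# Hypothetical category names for the one categorical feature
encoder = fit_label_encoder(["Not a Holiday", "National Holiday", "Not a Holiday"])

print(train_scaled, val_scaled, encoder)
```

A validation value that lies outside the training range is scaled to more than 1 (here `12.0` becomes `1.25`). This is expected when the statistics come from the training data alone.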

In [7]:
# Defining transformation functions for feature scaling and encoding
transformation_functions = {
        "dk1_spotpricedkk_kwh": fs.get_transformation_function(name="min_max_scaler"), 
        "dk1_offshore_wind_forecastintraday_kwh": fs.get_transformation_function(name="min_max_scaler"), 
        "dk1_onshore_wind_forecastintraday_kwh": fs.get_transformation_function(name="min_max_scaler"), 
        "dk1_solar_forecastintraday_kwh": fs.get_transformation_function(name="min_max_scaler"), 
        "temperature_2m": fs.get_transformation_function(name="min_max_scaler"), 
        "relative_humidity_2m": fs.get_transformation_function(name="min_max_scaler"), 
        "precipitation": fs.get_transformation_function(name="min_max_scaler"), 
        "rain": fs.get_transformation_function(name="min_max_scaler"), 
        "snowfall": fs.get_transformation_function(name="min_max_scaler"), 
        "weather_code": fs.get_transformation_function(name="min_max_scaler"), 
        "cloud_cover": fs.get_transformation_function(name="min_max_scaler"), 
        "wind_speed_10m": fs.get_transformation_function(name="min_max_scaler"),
        "wind_gusts_10m": fs.get_transformation_function(name="min_max_scaler"),
        "type": fs.get_transformation_function(name="label_encoder"),
    }

`Feature Views` serve as an intermediary between **Feature Groups** and the **Training Dataset**. By combining various **Feature Groups**, we can construct **Feature Views**, which store metadata about our data. From a **Feature View**, we can then create a **Training Dataset**.

A Feature View defines its schema through a query (optionally with filters), identifies the model's target feature or label, and can apply additional transformation functions.

To create a Feature View, we use the `FeatureStore.get_or_create_feature_view()` method, where we specify the following parameters:

- `name`: The name of the feature view.

- `version`: The version of the feature view.

- `labels`: Our target variable.

- `transformation_functions`: Functions to transform our features.

- `query`: A query object containing the relevant data.

In [8]:
# Getting or creating a feature view named 'electricity_feature_view'
version = 1 # Defining the version for the feature view
feature_view = fs.get_or_create_feature_view(
    name='electricity_feature_view',
    version=version,
    labels=[], # Labels will be defined manually later for our 'y'
    transformation_functions=transformation_functions,
    query=selected_features,
)

## <span style="color:#2656a3;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query whose projection (set of features) is determined by the parent Feature View, with an optional snapshot on disk of the data returned by the query.

**A Training Dataset may contain splits such as:** 
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

A training dataset is created using the `fs.create_training_dataset()` method. In this notebook, the splits are retrieved directly from the feature view with `train_validation_test_split()`.

**From the Feature View APIs you can also create training datasets based on event time filters by specifying `start_time` and `end_time`.**
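
To make the idea of a time-based split concrete, here is a small plain-Python sketch that partitions rows into train, validation and test sets by date range. It uses the same date ranges as this notebook, but it is only an illustration: the helper `split_by_date` and the toy rows are hypothetical, and treating both range boundaries as inclusive is an assumption that may not match how Hopsworks handles them.

```python
from datetime import date


def split_by_date(rows, start, end):
    """Keep the rows whose 'date' lies in the range [start, end] (both ends inclusive)."""
    return [row for row in rows if start <= row["date"] <= end]


# Hypothetical rows; only the 'date' field matters for the split
rows = [
    {"date": date(2022, 1, 1), "price": 0.18},
    {"date": date(2023, 6, 30), "price": 0.21},
    {"date": date(2023, 7, 1), "price": 0.25},
    {"date": date(2023, 10, 15), "price": 0.30},
]

# Non-overlapping, chronologically ordered ranges
train = split_by_date(rows, date(2022, 1, 1), date(2023, 6, 30))
validation = split_by_date(rows, date(2023, 7, 1), date(2023, 9, 30))
test = split_by_date(rows, date(2023, 10, 1), date(2023, 12, 31))

print(len(train), len(validation), len(test))
```

Splitting chronologically rather than randomly keeps every validation and test observation later in time than the training data, which mirrors how a forecasting model is used in practice.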

### <span style="color:#2656a3;"> ⛳️ Dataset with train, test and validation splits</span>

In [9]:
# Splitting the feature view data into train, validation, and test sets
# We didn't specify 'labels' in feature view creation, it will therefore return 'None' for Y
X_train, X_val, X_test, _, _, _ = feature_view.train_validation_test_split(
    train_start="2022-01-01",
    train_end="2023-06-30",
    validation_start="2023-07-01",
    validation_end="2023-09-30",
    test_start="2023-10-01",
    test_end="2023-12-31",
    description='Electricity price prediction dataset',
)

Finished: Reading data from Hopsworks, using ArrowFlight (199.29s) 




In [10]:
# Sorting the training, validation, and test datasets based on the 'timestamp' column
X_train.sort_values(["timestamp"], inplace=True)
X_val.sort_values(["timestamp"], inplace=True)
X_test.sort_values(["timestamp"], inplace=True)

In [11]:
# Extracting the target variable 'dk1_spotpricedkk_kwh' and defining 'y_train', 'y_val' and 'y_test'
y_train = X_train[["dk1_spotpricedkk_kwh"]]
y_val = X_val[["dk1_spotpricedkk_kwh"]]
y_test = X_test[["dk1_spotpricedkk_kwh"]]

In [None]:
# # Dropping the 'date', 'time' and 'timestamp' columns from the training, validation, and test datasets
# X_train.drop(["date", "time", "timestamp"], axis=1, inplace=True)
# X_val.drop(["date", "time", "timestamp"], axis=1, inplace=True)
# X_test.drop(["date", "time", "timestamp"], axis=1, inplace=True)

In [None]:
# # Dropping the dependent variable (y) column from the training, validation, and test datasets
# X_train.drop(["dk1_spotpricedkk_kwh"], axis=1, inplace=True)
# X_val.drop(["dk1_spotpricedkk_kwh"], axis=1, inplace=True)
# X_test.drop(["dk1_spotpricedkk_kwh"], axis=1, inplace=True)

In [12]:
# Displaying the first 5 rows of the train dataset (X_train)
X_train.head()

Unnamed: 0,timestamp,time,date,dk1_spotpricedkk_kwh,temperature_2m,relative_humidity_2m,precipitation,rain,snowfall,weather_code,cloud_cover,wind_speed_10m,wind_gusts_10m,dk1_offshore_wind_forecastintraday_kwh,dk1_onshore_wind_forecastintraday_kwh,dk1_solar_forecastintraday_kwh,type
5905751,1640995200000,2022-01-01 00:00:00+00:00,2022-01-01,0.179988,0.435268,0.986667,0.011364,0.011364,0.0,0.68,1.0,0.315152,0.272633,0.945277,0.481878,0.0,1
19398,1640995200000,2022-01-01 00:00:00+00:00,2022-01-01,0.179988,0.435268,0.986667,0.011364,0.011364,0.0,0.68,1.0,0.315152,0.272633,0.934795,0.446702,8e-06,1
5919627,1640995200000,2022-01-01 00:00:00+00:00,2022-01-01,0.179988,0.417411,0.933333,0.0,0.0,0.0,0.04,1.0,0.082828,0.074922,0.773045,0.264375,1.8e-05,1
4719247,1640995200000,2022-01-01 00:00:00+00:00,2022-01-01,0.179988,0.426339,0.933333,0.0,0.0,0.0,0.04,1.0,0.19596,0.187305,0.913059,0.358547,1.2e-05,1
4743896,1640995200000,2022-01-01 00:00:00+00:00,2022-01-01,0.179988,0.417411,0.933333,0.0,0.0,0.0,0.04,1.0,0.082828,0.074922,0.493641,0.133456,0.005406,1


In [14]:
# Selecting the date and the target column as input for the Prophet model
df = X_train[["date", "dk1_spotpricedkk_kwh"]]

Unnamed: 0,date,dk1_spotpricedkk_kwh
5905751,2022-01-01,0.179988
19398,2022-01-01,0.179988
5919627,2022-01-01,0.179988
4719247,2022-01-01,0.179988
4743896,2022-01-01,0.179988


In [25]:
# Prophet expects the columns to be named 'ds' (datestamp) and 'y' (value to forecast)
df.columns = ["ds", "y"]
df.head()

Unnamed: 0,ds,y
5905751,2022-01-01,0.179988
19398,2022-01-01,0.179988
5919627,2022-01-01,0.179988
4719247,2022-01-01,0.179988
4743896,2022-01-01,0.179988


## <span style="color:#2656a3;">🗃 Window timeseries dataset </span>

## <span style="color:#2656a3;">🧬 Modeling Testing</span>

In [22]:
from prophet import Prophet

In [26]:
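# Initializing Prophet with a 95% uncertainty interval and daily seasonality enabled
m = Prophet(interval_width=0.95, daily_seasonality=True)
# Fitting the model; fit() returns the fitted Prophet object itself
model = m.fit(df)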

14:24:30 - cmdstanpy - INFO - Chain [1] start processing


In [None]:
# Extending the dataframe 100 days into the future and predicting over the whole range
future = m.make_future_dataframe(periods=100, freq='D')
forecast = m.predict(future)
forecast.head()

In [None]:
# Plotting the forecast together with the observed data points
plot1 = m.plot(forecast)


## <span style="color:#2656a3;">🧬 Modeling</span>

In [None]:
# import pandas as pd
# import numpy as np
# import xgboost as xgb
# from sklearn.metrics import mean_squared_error
# import os

In [None]:
# # Initialize the XGBoost regressor
# model = xgb.XGBRegressor()
# model_val = xgb.XGBRegressor()

In [None]:
# # Train the model on the training data
# model.fit(X_train, y_train)

In [None]:
# # Make predictions on the test set
# y_test_pred = model.predict(X_test)

In [None]:
# # Calculate RMSE on the test set (squared=False returns the root of the MSE)
# rmse = mean_squared_error(y_test, y_test_pred, squared=False)
# print(f"Root Mean Squared Error (RMSE): {rmse}")
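
Since MSE and RMSE are easy to mix up, the following plain-Python sketch shows how the two metrics relate. It is an illustration only and does not replace `sklearn.metrics.mean_squared_error`; the helper names and the toy values are hypothetical.

```python
import math


def mean_squared_error_py(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error_py(y_true, y_pred):
    """Square root of the MSE, expressed in the same unit as the target."""
    return math.sqrt(mean_squared_error_py(y_true, y_pred))


# Hypothetical targets and predictions
y_true = [0.20, 0.30, 0.50]
y_pred = [0.10, 0.30, 0.70]

mse = mean_squared_error_py(y_true, y_pred)
rmse = root_mean_squared_error_py(y_true, y_pred)

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
```

Because RMSE is in the same unit as the target, it is usually easier to interpret than MSE. Note that if the target has been min-max scaled, both metrics are expressed on the scaled range rather than in the original price unit.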

## <span style='color:#2656a3'>🗄 Model Registry</span>

In [None]:
# Exporting the trained model to a directory
model_dir = "electricity_price_model"
print('Exporting trained model to: {}'.format(model_dir))

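# Saving the model using TensorFlow's saved_model.save function
# NOTE: this only works if `model` is a TensorFlow/Keras model; the Prophet model fitted above cannot be saved this way
tf.saved_model.save(model, model_dir)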

In [None]:
# Retrieving the Model Registry
mr = project.get_model_registry()

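# Extracting loss value from the training history
# NOTE: `history_dict` is not defined in this notebook; it must hold the training history of a TensorFlow model
metrics = {'loss': history_dict['val_loss'][0]}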

# Creating a TensorFlow model in the Model Registry
tf_model = mr.tensorflow.create_model(
    name="DK_electricity_price_prediction_model",
    metrics=metrics,
    description="Hourly electricity price prediction model.",
    input_example=n_step_window.example[0].numpy(),  # NOTE: `n_step_window` is not defined in this notebook
)

# Uploading the contents of the model directory to the Model Registry
tf_model.save(model_dir)

---

## <span style="color:#2656a3;">⏭️ **Next:** Part 04: Batch Inference </span>

In the next notebook you will use your registered model to predict batch data.