# <span style="font-weight:bold; font-size: 3rem; color:#2656a3;">**Data Engineering and Machine Learning Operations in Business** </span> <span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 01: Feature Backfill</span>

## 🗒️ This notebook is divided into the following sections:
1. Load the data and process features
2. Connect to the Hopsworks feature store
3. Create feature groups and upload them to the feature store

## <span style='color:#2656a3'> ⚙️ Import of libraries and packages

First, we'll install the Python packages required for this notebook. We add the `--quiet` flag after the package name to keep the installation output silent. Then we import all the necessary libraries.

In [1]:
# Install the hopsworks package
# !pip install -U hopsworks --quiet

In [2]:
# First, move one level up in the directory tree so the folder with our functions can be found
%cd ..

# Now we can import our own functions from the features folder
from features import electricity_prices, weater_measures

# Go back into the notebooks folder
%cd notebooks

/Users/tobiasmjensen/Documents/aau_bds/m5_data-engineering-and-mlops/exam_assigment/MLOPs-Assignment-
/Users/tobiasmjensen/Documents/aau_bds/m5_data-engineering-and-mlops/exam_assigment/MLOPs-Assignment-/notebooks
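
As an alternative to changing directories with `%cd`, the parent folder can be added to Python's import path. The following is a minimal sketch under the assumption that the notebook runs from inside the `notebooks` folder; the helper name `add_parent_to_path` is hypothetical and not part of the repository.

```python
import sys
from pathlib import Path


def add_parent_to_path(start=None):
    """Add the parent folder of `start` (default: current working directory) to sys.path."""
    base = Path(start) if start is not None else Path.cwd()
    parent = str(base.resolve().parent)
    if parent not in sys.path:
        sys.path.insert(0, parent)
    return parent


# Run from inside the notebooks folder, this makes the repository root importable
repo_root = add_parent_to_path()

# The project imports would then work without %cd (left commented out here,
# because the features folder only exists inside the repository):
# from features import electricity_prices, weater_measures
```

With this approach the working directory never changes, so relative file paths used later in the notebook keep pointing at the same place.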


In [3]:
# Import the libraries needed in this notebook
import pandas as pd
import requests
from datetime import datetime, timedelta

# Ignore warnings
import warnings 
warnings.filterwarnings('ignore')

## <span style="color:#2656a3;"> 💽 Load the historical data

The data you will use comes from three different sources:

- Hourly electricity spot prices and renewable energy forecasts for Denmark from [Energinet](https://www.energidataservice.dk).
- Meteorological observations and forecasts from [Open Meteo](https://www.open-meteo.com).
- A Danish calendar that labels each date as a workday or not a workday. This file was created manually by the group and is located in the "*data*" folder of this repository.

### <span style="color:#2656a3;">💸 Electricity prices per day from Energinet
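The first dataset contains electricity spot prices from Energinet's Energi Data Service. Here we use the `electricity_prices` function from our `features` folder to fetch the prices for the DK1 price area.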

In [4]:
# Fetch historical electricity prices from '2022-01-01' up to today.
# When historical is set to True, today is excluded because its data is not yet historical.
electricity_df = electricity_prices.electricity_prices(
    historical=True, 
    area=["DK1"], 
    start='2022-01-01', 
    #end='2023-12-31'
)

In [5]:
# Display the first 5 rows of the dataframe
electricity_df.head(5)

Unnamed: 0,timestamp,time,date,dk1_spotpricedkk_kwh
0,1640995200000,2022-01-01 00:00:00,2022-01-01,0.3722
1,1640998800000,2022-01-01 01:00:00,2022-01-01,0.30735
2,1641002400000,2022-01-01 02:00:00,2022-01-01,0.32141
3,1641006000000,2022-01-01 03:00:00,2022-01-01,0.33806
4,1641009600000,2022-01-01 04:00:00,2022-01-01,0.28013
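
The `timestamp` values in this output are consistent with Unix epoch milliseconds for the matching `time` values. The sketch below checks that reading on the first two rows shown above; treating the values as epoch milliseconds is an assumption based on the printed output, not on the source code of `electricity_prices`.

```python
import pandas as pd

# First two timestamp values from the output above (assumed to be epoch milliseconds)
timestamps_ms = pd.Series([1640995200000, 1640998800000])

# Epoch milliseconds -> datetime
times = pd.to_datetime(timestamps_ms, unit="ms")
print(times.tolist())

# Datetime -> epoch milliseconds (the reverse direction)
back_to_ms = (times - pd.Timestamp("1970-01-01")) // pd.Timedelta(milliseconds=1)
print(back_to_ms.tolist())
```

Consecutive rows differ by 3,600,000 ms, which is one hour.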


In [6]:
# Display the last 5 rows of the dataframe
electricity_df.tail(5)

Unnamed: 0,timestamp,time,date,dk1_spotpricedkk_kwh
20248,1713898800000,2024-04-23 19:00:00,2024-04-23,0.9178
20249,1713902400000,2024-04-23 20:00:00,2024-04-23,0.93317
20250,1713906000000,2024-04-23 21:00:00,2024-04-23,0.78417
20251,1713909600000,2024-04-23 22:00:00,2024-04-23,0.69188
20252,1713913200000,2024-04-23 23:00:00,2024-04-23,0.60846


In [7]:
# Showing the information for the electricity dataframe
electricity_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20253 entries, 0 to 20252
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   timestamp             20253 non-null  int64         
 1   time                  20253 non-null  datetime64[ns]
 2   date                  20253 non-null  object        
 3   dk1_spotpricedkk_kwh  20253 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 633.0+ KB


### <span style="color:#2656a3;">☀️💨 Next-day renewable energy forecast from Energinet

In [8]:
# Fetch historical renewable energy forecasts from '2022-01-01' up to today.
# When historical is set to True, today is excluded because its data is not yet historical.
forecast_renewable_energy_df = electricity_prices.forecast_renewable_energy(
    historical=True, 
    area = ["DK1"],
    start= '2022-01-01', 
    #end='2023-12-31'
)

In [9]:
# Display the first 5 rows of the dataframe
forecast_renewable_energy_df.head(5)

Unnamed: 0,timestamp,time,date,dk1_offshore_wind_forecastintraday_kwh,dk1_onshore_wind_forecastintraday_kwh,dk1_solar_forecastintraday_kwh
0,1641024000000,2022-01-01 08:00:00,2022-01-01,0.611708,0.236792,5e-05
1,1641027600000,2022-01-01 09:00:00,2022-01-01,0.459708,0.196667,0.004841
2,1641031200000,2022-01-01 10:00:00,2022-01-01,0.310375,0.1785,0.020353
3,1641034800000,2022-01-01 11:00:00,2022-01-01,0.32075,0.201125,0.035719
4,1641038400000,2022-01-01 12:00:00,2022-01-01,0.355667,0.277667,0.038027


In [10]:
# Display the last 5 rows of the dataframe
forecast_renewable_energy_df.tail(5)

Unnamed: 0,timestamp,time,date,dk1_offshore_wind_forecastintraday_kwh,dk1_onshore_wind_forecastintraday_kwh,dk1_solar_forecastintraday_kwh
14234,1713898800000,2024-04-23 19:00:00,2024-04-23,0.584125,0.351958,0.092361
14235,1713902400000,2024-04-23 20:00:00,2024-04-23,0.52175,0.307208,0.009653
14236,1713906000000,2024-04-23 21:00:00,2024-04-23,0.461208,0.331917,0.000294
14237,1713909600000,2024-04-23 22:00:00,2024-04-23,0.450083,0.376583,0.0
14238,1713913200000,2024-04-23 23:00:00,2024-04-23,0.463958,0.401042,0.0


In [11]:
# Showing the information for the forecast_renewable_energy dataframe
forecast_renewable_energy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14239 entries, 0 to 14238
Data columns (total 6 columns):
 #   Column                                  Non-Null Count  Dtype         
---  ------                                  --------------  -----         
 0   timestamp                               14239 non-null  int64         
 1   time                                    14239 non-null  datetime64[ns]
 2   date                                    14239 non-null  object        
 3   dk1_offshore_wind_forecastintraday_kwh  14223 non-null  float64       
 4   dk1_onshore_wind_forecastintraday_kwh   14223 non-null  float64       
 5   dk1_solar_forecastintraday_kwh          14223 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 667.6+ KB
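
The `info()` output reports 14223 non-null values in 14239 rows for each of the three forecast columns, so 16 values are missing per column. One way to inspect and handle such gaps is sketched below on a small made-up dataframe; the values are illustrative, and whether dropping rows is the right treatment for this dataset is left open.

```python
import numpy as np
import pandas as pd

# Toy stand-in for forecast_renewable_energy_df (values are made up)
toy_df = pd.DataFrame({
    "timestamp": [1, 2, 3, 4],
    "dk1_offshore_wind_forecastintraday_kwh": [0.61, np.nan, 0.31, 0.32],
    "dk1_solar_forecastintraday_kwh": [0.0, np.nan, 0.02, 0.04],
})

# Count missing values per column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# dropna() returns a new dataframe, so the result must be assigned to take effect
cleaned_df = toy_df.dropna().reset_index(drop=True)
print(len(toy_df), len(cleaned_df))
```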


### <span style="color:#2656a3;"> 🌤 Weather measurements from Open Meteo

Note: the end date should be 2023-12-31. url = ("https://archive-api.open-meteo.com/v1/archive?latitude=57.048&longitude=9.9187&start_date=2022-01-01&end_date=2023-12-31&hourly=temperature_2m,relative_humidity_2m,precipitation,rain,snowfall,weather_code,cloud_cover,wind_speed_10m,wind_gusts_10m")
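
The archive URL in the note above can be assembled from its parts, which makes the date range easy to change. This is a minimal sketch; the function name `build_archive_url` is hypothetical, and it only builds the URL string without sending a request.

```python
from urllib.parse import urlencode

BASE_URL = "https://archive-api.open-meteo.com/v1/archive"

# Hourly variables taken from the URL in the note above
HOURLY_VARIABLES = [
    "temperature_2m", "relative_humidity_2m", "precipitation", "rain",
    "snowfall", "weather_code", "cloud_cover", "wind_speed_10m", "wind_gusts_10m",
]


def build_archive_url(start_date, end_date, latitude=57.048, longitude=9.9187):
    """Build the archive request URL for a date range and location."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,
        "end_date": end_date,
        "hourly": ",".join(HOURLY_VARIABLES),
    }
    # safe="," keeps the commas between the hourly variables unescaped
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


url = build_archive_url("2022-01-01", "2023-12-31")
print(url)
```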

#### <span style="color:#2656a3;"> 🕰️ Historical Weather Measures

In [12]:
# Fetching historical weather measurements
historical_weather_df = weater_measures.historical_weater_measures(
    historical=True, 
    start = '2022-01-01', 
    #end = '2023-12-31'
)

In [13]:
# Display the first 5 rows of the dataframe
historical_weather_df.head(5)

Unnamed: 0,timestamp,date,time,temperature_2m,relative_humidity_2m,precipitation,rain,snowfall,weather_code,cloud_cover,wind_speed_10m,wind_gusts_10m
0,1640995200000,2022-01-01,2022-01-01 00:00:00,6.7,100.0,0.0,0.0,0.0,3.0,100.0,16.2,36.0
1,1640998800000,2022-01-01,2022-01-01 01:00:00,6.6,100.0,0.0,0.0,0.0,3.0,100.0,16.2,30.2
2,1641002400000,2022-01-01,2022-01-01 02:00:00,6.7,99.0,0.0,0.0,0.0,3.0,100.0,15.5,30.6
3,1641006000000,2022-01-01,2022-01-01 03:00:00,6.7,100.0,0.0,0.0,0.0,3.0,100.0,12.7,28.8
4,1641009600000,2022-01-01,2022-01-01 04:00:00,6.7,99.0,0.0,0.0,0.0,3.0,100.0,10.6,23.8


In [14]:
# Display the last 5 rows of the dataframe
historical_weather_df.tail(5)

Unnamed: 0,timestamp,date,time,temperature_2m,relative_humidity_2m,precipitation,rain,snowfall,weather_code,cloud_cover,wind_speed_10m,wind_gusts_10m
20227,1713812400000,2024-04-22,2024-04-22 19:00:00,2.6,69.0,0.0,0.0,0.0,0.0,0.0,9.4,17.6
20228,1713816000000,2024-04-22,2024-04-22 20:00:00,1.9,79.0,0.0,0.0,0.0,0.0,0.0,5.6,16.6
20229,1713819600000,2024-04-22,2024-04-22 21:00:00,0.7,84.0,0.0,0.0,0.0,0.0,0.0,6.5,9.4
20230,1713823200000,2024-04-22,2024-04-22 22:00:00,1.1,83.0,0.0,0.0,0.0,0.0,0.0,5.4,9.4
20231,1713826800000,2024-04-22,2024-04-22 23:00:00,1.9,83.0,0.0,0.0,0.0,0.0,7.0,7.7,12.2


In [15]:
# Showing the information for the weather dataframe
historical_weather_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20232 entries, 0 to 20231
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   timestamp             20232 non-null  int64         
 1   date                  20232 non-null  object        
 2   time                  20232 non-null  datetime64[ns]
 3   temperature_2m        20232 non-null  float64       
 4   relative_humidity_2m  20232 non-null  float64       
 5   precipitation         20232 non-null  float64       
 6   rain                  20232 non-null  float64       
 7   snowfall              20232 non-null  float64       
 8   weather_code          20232 non-null  float64       
 9   cloud_cover           20232 non-null  float64       
 10  wind_speed_10m        20232 non-null  float64       
 11  wind_gusts_10m        20232 non-null  float64       
dtypes: datetime64[ns](1), float64(9), int64(1), object(1)
memory usage: 2.0+ MB

In [16]:
# deleting rows with missing values
# historical_weather_df.dropna()

#### <span style="color:#2656a3;"> 🌈 Weather Forecast

In [17]:
historical_weather_df.tail(5)

Unnamed: 0,timestamp,date,time,temperature_2m,relative_humidity_2m,precipitation,rain,snowfall,weather_code,cloud_cover,wind_speed_10m,wind_gusts_10m
20227,1713812400000,2024-04-22,2024-04-22 19:00:00,2.6,69.0,0.0,0.0,0.0,0.0,0.0,9.4,17.6
20228,1713816000000,2024-04-22,2024-04-22 20:00:00,1.9,79.0,0.0,0.0,0.0,0.0,0.0,5.6,16.6
20229,1713819600000,2024-04-22,2024-04-22 21:00:00,0.7,84.0,0.0,0.0,0.0,0.0,0.0,6.5,9.4
20230,1713823200000,2024-04-22,2024-04-22 22:00:00,1.1,83.0,0.0,0.0,0.0,0.0,0.0,5.4,9.4
20231,1713826800000,2024-04-22,2024-04-22 23:00:00,1.9,83.0,0.0,0.0,0.0,0.0,7.0,7.7,12.2


In [18]:
# Fetching the weather forecast for the next 5 days
weather_forecast_df = weater_measures.forecast_weater_measures(
    forecast_length=5
)

In [19]:
# Display the first 5 rows of the dataframe
weather_forecast_df.head(5)

Unnamed: 0,timestamp,date,time,temperature_2m,relative_humidity_2m,precipitation,rain,snowfall,weather_code,cloud_cover,wind_speed_10m,wind_gusts_10m
0,1713916800000,2024-04-24,2024-04-24 00:00:00,5.4,94,0.0,0.0,0.0,3,100,12.6,22.3
1,1713920400000,2024-04-24,2024-04-24 01:00:00,5.2,90,0.0,0.0,0.0,3,100,15.8,28.1
2,1713924000000,2024-04-24,2024-04-24 02:00:00,4.7,88,0.0,0.0,0.0,3,99,20.2,35.6
3,1713927600000,2024-04-24,2024-04-24 03:00:00,4.6,90,0.1,0.1,0.0,51,100,23.0,39.6
4,1713931200000,2024-04-24,2024-04-24 04:00:00,4.7,91,0.3,0.3,0.0,51,88,21.2,41.0


In [20]:
# Display the last 5 rows of the dataframe
weather_forecast_df.tail(5)

Unnamed: 0,timestamp,date,time,temperature_2m,relative_humidity_2m,precipitation,rain,snowfall,weather_code,cloud_cover,wind_speed_10m,wind_gusts_10m
115,1714330800000,2024-04-28,2024-04-28 19:00:00,11.6,86,0.2,0.0,0.0,1,82,15.3,36.4
116,1714334400000,2024-04-28,2024-04-28 20:00:00,11.0,90,0.2,0.0,0.0,1,63,12.7,29.5
117,1714338000000,2024-04-28,2024-04-28 21:00:00,10.6,93,0.2,0.0,0.0,1,45,10.7,22.3
118,1714341600000,2024-04-28,2024-04-28 22:00:00,10.3,95,0.0,0.0,0.0,2,52,10.0,21.6
119,1714345200000,2024-04-28,2024-04-28 23:00:00,10.0,96,0.0,0.0,0.0,2,59,10.4,20.9


In [21]:
# Showing the information for the weather forecast dataframe
weather_forecast_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   timestamp             120 non-null    int64         
 1   date                  120 non-null    object        
 2   time                  120 non-null    datetime64[ns]
 3   temperature_2m        120 non-null    float64       
 4   relative_humidity_2m  120 non-null    int64         
 5   precipitation         120 non-null    float64       
 6   rain                  120 non-null    float64       
 7   snowfall              120 non-null    float64       
 8   weather_code          120 non-null    int64         
 9   cloud_cover           120 non-null    int64         
 10  wind_speed_10m        120 non-null    float64       
 11  wind_gusts_10m        120 non-null    float64       
dtypes: datetime64[ns](1), float64(6), int64(4), object(1)
memory usage: 11.4+ KB
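
Comparing the two `info()` outputs, the forecast dataframe stores `relative_humidity_2m`, `weather_code`, and `cloud_cover` as `int64`, while the historical dataframe stores them as `float64`. If both are ever written to the same feature group, aligning the dtypes first may avoid schema mismatches. The sketch below shows one way to do that on toy data; the helper `align_dtypes` is hypothetical and not part of the repository.

```python
import pandas as pd


def align_dtypes(df, reference):
    """Return a copy of df whose shared columns use the dtypes of reference."""
    shared = [col for col in df.columns if col in reference.columns]
    return df.astype({col: reference[col].dtype for col in shared})


# Toy stand-ins for the historical and forecast dataframes
historical_toy = pd.DataFrame({"relative_humidity_2m": [100.0, 99.0], "cloud_cover": [100.0, 7.0]})
forecast_toy = pd.DataFrame({"relative_humidity_2m": [94, 90], "cloud_cover": [100, 99]})

aligned = align_dtypes(forecast_toy, historical_toy)
print(aligned.dtypes)
```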


### <span style="color:#2656a3;"> 🗓️ Calendar of Danish workdays and holidays 

In [22]:
# Read the CSV file with the calendar
calender_df = pd.read_csv('https://raw.githubusercontent.com/Camillahannesbo/MLOPs-Assignment-/main/data/calendar_incl_holiday.csv', delimiter=';', usecols=['date', 'type'])
 
# Display the DataFrame
calender_df.head()

Unnamed: 0,date,type
0,01/01/2022,Not a Workday
1,02/01/2022,Not a Workday
2,03/01/2022,Workday
3,04/01/2022,Workday
4,05/01/2022,Workday


In [23]:
# Reformat the date column from 'DD/MM/YYYY' to 'YYYY-MM-DD'
calender_df["date"] = calender_df["date"].map(lambda x: datetime.strptime(x, '%d/%m/%Y').strftime("%Y-%m-%d"))
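
The same conversion can also be written with pandas' vectorized datetime functions instead of a row-by-row `map`. This is a sketch on a small made-up dataframe, offered as an equivalent alternative rather than a required change.

```python
import pandas as pd

# Toy stand-in for calender_df with dates in DD/MM/YYYY format
toy_calendar = pd.DataFrame({"date": ["01/01/2022", "02/01/2022", "31/12/2024"]})

# Parse with an explicit format, then write the dates back as YYYY-MM-DD strings
toy_calendar["date"] = (
    pd.to_datetime(toy_calendar["date"], format="%d/%m/%Y").dt.strftime("%Y-%m-%d")
)
print(toy_calendar["date"].tolist())
```

Passing the format explicitly avoids any ambiguity between day-first and month-first dates.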

In [24]:
# Display the first 5 rows of the dataframe
calender_df.head(5)

Unnamed: 0,date,type
0,2022-01-01,Not a Workday
1,2022-01-02,Not a Workday
2,2022-01-03,Workday
3,2022-01-04,Workday
4,2022-01-05,Workday


In [25]:
# Display the last 5 rows of the dataframe
calender_df.tail(5)

Unnamed: 0,date,type
1091,2024-12-27,Workday
1092,2024-12-28,Not a Workday
1093,2024-12-29,Not a Workday
1094,2024-12-30,Workday
1095,2024-12-31,Workday


In [26]:
# Showing the information for the calendar dataframe
calender_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1096 entries, 0 to 1095
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    1096 non-null   object
 1   type    1096 non-null   object
dtypes: object(2)
memory usage: 17.3+ KB
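
The 1096 rows match the number of days from 2022-01-01 to 2024-12-31 (365 + 365 + 366, since 2024 is a leap year). Because the calendar file is maintained by hand, a quick completeness check can be useful. The sketch below uses a hypothetical helper, `missing_dates`, and made-up input.

```python
import pandas as pd

# Number of calendar days in the covered period
expected_days = pd.date_range("2022-01-01", "2024-12-31", freq="D")
print(len(expected_days))


def missing_dates(date_strings, start, end):
    """Return the YYYY-MM-DD dates between start and end that are absent from date_strings."""
    expected = pd.date_range(start, end, freq="D").strftime("%Y-%m-%d")
    return sorted(set(expected) - set(date_strings))


# Toy example: 2022-01-02 is missing from the list
print(missing_dates(["2022-01-01", "2022-01-03"], "2022-01-01", "2022-01-03"))
```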


## <span style="color:#2656a3;"> 📡 Connecting to Hopsworks Feature Store

First, we connect to the Hopsworks Feature Store so that we can create and access feature groups.
Feature groups can also be used to define a namespace for features. For instance, in a real-life setting you would likely want to experiment with different window lengths. In that case, you can create one feature group per window length, all sharing an identical schema.

Before you can create a feature group, you need to connect to the feature store.
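
To illustrate the window-length idea, the sketch below computes the same rolling-mean feature for two window lengths on a made-up hourly price series. The window sizes and names are hypothetical; each resulting series could then go into its own feature group with the same schema.

```python
import pandas as pd

# Made-up hourly prices, one value per hour
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name="dk1_spotpricedkk_kwh")

# One rolling-mean feature per window length (in hours)
windows = (2, 3)
features_by_window = {
    f"window_{w}h": prices.rolling(w).mean().rename(f"price_mean_{w}h")
    for w in windows
}

for name, feature in features_by_window.items():
    print(name, feature.tolist())
```

The first `w - 1` values of each series are `NaN` because a full window is not yet available.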

In [27]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.







Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/554133
Connected. Call `.close()` to terminate connection gracefully.


### <span style="color:#2656a3;"> 🪄 Creating Feature Groups

When creating a feature group, you must name it and designate a primary key. Additionally, it's helpful to include a description of the feature group's contents and a version number; if not defined, it will default to `1`. 

We've configured `online_enabled` as `True` to enable the feature group to be read via the Online API for a Feature View.
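
Before choosing a primary key, it can help to check whether the candidate columns identify each row uniquely. The sketch below runs that check on a small made-up dataframe with hourly rows; the helper `is_unique_key` is hypothetical, and whether a unique key is required depends on how the feature group is meant to be queried.

```python
import pandas as pd

# Toy hourly data: two rows share the same date (the last price is made up)
toy_df = pd.DataFrame({
    "timestamp": [1640995200000, 1640998800000, 1641081600000],
    "date": ["2022-01-01", "2022-01-01", "2022-01-02"],
    "dk1_spotpricedkk_kwh": [0.3722, 0.30735, 0.5],
})


def is_unique_key(df, key_columns):
    """Return True if no two rows share the same values in key_columns."""
    return not df.duplicated(subset=key_columns).any()


print(is_unique_key(toy_df, ["date"]))
print(is_unique_key(toy_df, ["timestamp"]))
```

With hourly data, `date` alone repeats up to 24 times per day, while `timestamp` differs for every row.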

In [28]:
# Creating the feature group for the weather data
weather_fg = fs.get_or_create_feature_group(
    name="weather_measurements",
    version=1,
    description="Weather measurements from Open Meteo API",
    primary_key=["date"],
    event_time="timestamp",
    online_enabled=True,
)

So far, you have only defined the metadata for the feature group. No data is stored yet, and no schema is defined. To persist the feature group, populate it with its data using the `insert` function.

In [29]:
# Inserting historical_weather_df into the feature group named weather_fg
weather_fg.insert(historical_weather_df)

Uploading Dataframe: 100.00% |██████████| Rows 20232/20232 | Elapsed Time: 00:11 | Remaining Time: 00:00


Launching job: weather_measurements_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/554133/jobs/named/weather_measurements_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x16b182150>, None)

We add a description for each feature in the feature group. This gives users of the feature group more information and documentation.

In [30]:
weather_feature_descriptions = [
    {"name": "timestamp", "description": "Timestamp for the event_time"},
    {"name": "date", "description": "Date of the weather measurement"},
    {"name": "time", "description": "Time of the weather measurement"},
    {"name": "temperature_2m", "description": "Temperature at 2m above ground"},
    {"name": "relative_humidity_2m", "description": "Relative humidity at 2m above ground"},
    {"name": "precipitation", "description": "Precipitation"},
    {"name": "rain", "description": "Rain"},
    {"name": "snowfall", "description": "Snowfall"},   
    {"name": "weather_code", "description": "Weather code"},   
    {"name": "cloud_cover", "description": "Cloud cover"},   
    {"name": "wind_speed_10m", "description": "Wind speed at 10m above ground"},   
    {"name": "wind_gusts_10m", "description": "Wind gusts at 10m above ground"},   
]

for desc in weather_feature_descriptions: 
    weather_fg.update_feature_description(desc["name"], desc["description"])

We repeat the process for `electricity_fg`, `forecast_renewable_energy_fg`, and `danish_holidays_fg`: we create each feature group and insert the corresponding dataframe into it.

In [31]:
# Creating the feature group for the electricity prices
electricity_fg = fs.get_or_create_feature_group(
    name="electricity_prices",
    version=1,
    description="Electricity prices from Energidata API",
    primary_key=["date"],
    online_enabled=True,
    event_time="timestamp",
)

In [32]:
# Inserting the electricity_df into the feature group named electricity_fg
electricity_fg.insert(electricity_df)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/554133/fs/549956/fg/750929


Uploading Dataframe: 100.00% |██████████| Rows 20253/20253 | Elapsed Time: 00:09 | Remaining Time: 00:00


Launching job: electricity_prices_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/554133/jobs/named/electricity_prices_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x16b10ef10>, None)

In [33]:
electricity_feature_descriptions = [
    {"name": "timestamp", "description": "Timestamp for the event_time"},
    {"name": "date", "description": "Date of the electricity measurement"},
    {"name": "time", "description": "Time of the electricity measurement"},
    {"name": "dk1_spotpricedkk_kwh", "description": "Spot price in DKK per kWh"},
]

for desc in electricity_feature_descriptions: 
    electricity_fg.update_feature_description(desc["name"], desc["description"])

In [34]:
# Creating the feature group for the renewable energy forecast
forecast_renewable_energy_fg = fs.get_or_create_feature_group(
    name="forecast_renewable_energy",
    version=1,
    description="Forecast on Renewable Energy on ForecastType from Energidata API",
    primary_key=["date"],
    online_enabled=True,
    event_time="timestamp",
)

In [35]:
# Inserting forecast_renewable_energy_df into the feature group named forecast_renewable_energy_fg
forecast_renewable_energy_fg.insert(forecast_renewable_energy_df)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/554133/fs/549956/fg/748886


Uploading Dataframe: 100.00% |██████████| Rows 14239/14239 | Elapsed Time: 00:09 | Remaining Time: 00:00


Launching job: forecast_renewable_energy_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/554133/jobs/named/forecast_renewable_energy_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x169a23710>, None)

In [36]:
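forecast_renewable_energy_feature_descriptions = [
    {"name": "timestamp", "description": "Timestamp for the event_time"},
    {"name": "date", "description": "Date of the forecast"},
    {"name": "time", "description": "Time of the forecast"},
    {"name": "dk1_offshore_wind_forecastintraday_kwh", "description": "Offshore wind forecast for the coming day, made at 6 am Danish time"},
    {"name": "dk1_onshore_wind_forecastintraday_kwh", "description": "Onshore wind forecast for the coming day, made at 6 am Danish time"},
    {"name": "dk1_solar_forecastintraday_kwh", "description": "Solar forecast for the coming day, made at 6 am Danish time"},
]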

for desc in forecast_renewable_energy_feature_descriptions: 
    forecast_renewable_energy_fg.update_feature_description(desc["name"], desc["description"])
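
Since the description lists are written by hand, a small check can flag dataframe columns that have no description yet. The sketch below uses a hypothetical helper, `undescribed_columns`; the column names mirror the forecast dataframe, and the description list is deliberately left incomplete for illustration.

```python
def undescribed_columns(columns, descriptions):
    """Return the columns that have no entry in the list of description dicts."""
    described = {entry["name"] for entry in descriptions}
    return [col for col in columns if col not in described]


# Column names as in forecast_renewable_energy_df
columns = [
    "timestamp", "date", "time",
    "dk1_offshore_wind_forecastintraday_kwh",
    "dk1_onshore_wind_forecastintraday_kwh",
    "dk1_solar_forecastintraday_kwh",
]

# Deliberately incomplete example list
example_descriptions = [
    {"name": "timestamp", "description": "Timestamp for the event_time"},
    {"name": "date", "description": "Date"},
    {"name": "time", "description": "Time"},
    {"name": "dk1_offshore_wind_forecastintraday_kwh", "description": "Offshore wind forecast"},
]

print(undescribed_columns(columns, example_descriptions))
```

In the notebook, the first argument could be `list(forecast_renewable_energy_df.columns)`.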

In [37]:
# Creating the feature group for the Danish holidays
danish_holidays_fg = fs.get_or_create_feature_group(
    name="danish_holidays",
    version=1,
    description="Danish holidays calendar.",
    online_enabled=True,
    primary_key=["date"],
)

In [38]:
# Inserting the calender_df into the feature group named danish_holidays_fg
danish_holidays_fg.insert(calender_df)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/554133/fs/549956/fg/748887


Uploading Dataframe: 100.00% |██████████| Rows 1096/1096 | Elapsed Time: 00:07 | Remaining Time: 00:00


Launching job: danish_holidays_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/554133/jobs/named/danish_holidays_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x16b1dd210>, None)

In [39]:
danish_holidays_feature_descriptions = [
    {"name": "date", "description": "Date in the calendar"},
    {"name": "type", "description": "Type of day: 'Workday' or 'Not a Workday'"},
]

for desc in danish_holidays_feature_descriptions: 
    danish_holidays_fg.update_feature_description(desc["name"], desc["description"])

---
## <span style="color:#2656a3;">⏭️ **Next:** Part 02: Feature Pipeline </span>

In the next notebook, you will be generating new data for the Feature Groups.