# <span style="font-weight:bold; font-size: 3rem; color:#2656a3;">**MSc. BDS - M7 Second Semester Project** </span> <span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 02: Feature Pipeline</span>

## <span style='color:#2656a3'> 🗒️ The notebook is divided into the following sections:</span>
1. Parsing new data.
2. Inserting the new data into the Feature Store.

## <span style='color:#2656a3'> ⚙️ Import of libraries and packages</span>

We start by moving up to the project root so that we can import from the `features` folder, which holds the functions (including live API calls and data preprocessing) for electricity prices and weather measures. We then import the libraries needed for this notebook and suppress warnings to keep the output clean.

In [1]:
# Move up one directory to the project root so the features folder can be imported
%cd ..

# Import the modules from the features folder
# These are the functions we have created to generate features for electricity prices and weather measures
from features import electricity_prices, weather_measures

# Move back into the pipeline folder
%cd pipeline

/Users/camillahannesbo/Documents/AAU/Master - BDS/2. semester/2. semester project/bds_m7_second-semester-project
/Users/camillahannesbo/Documents/AAU/Master - BDS/2. semester/2. semester project/bds_m7_second-semester-project/pipeline
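
The `%cd` approach above depends on the notebook being started from the `pipeline` folder. As a hedged alternative, the sketch below adds the project root to `sys.path` instead of changing the working directory. The helper name `add_project_root` and the idea of using the `features` folder as a marker are assumptions made for illustration; they are not part of the project code.

```python
import sys
from pathlib import Path


def add_project_root(start: Path, marker: str = "features") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found.

    The directory that contains `marker` is put at the front of sys.path so that
    `from features import ...` works regardless of the current working directory.
    `marker` is a hypothetical convention, not something the project defines.
    """
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            root = str(candidate)
            if root not in sys.path:
                sys.path.insert(0, root)
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' folder")


# Usage inside the notebook (left as comments because it depends on the project layout):
# add_project_root(Path.cwd())
# from features import electricity_prices, weather_measures
```

This keeps the working directory unchanged, so relative paths used later in the notebook are not affected.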


In [2]:
# Importing libraries for data handling
import pandas as pd
from datetime import datetime, timedelta

# Ignore all warnings (this also covers DeprecationWarning)
import warnings
warnings.filterwarnings('ignore')

## <span style='color:#2656a3'> 🪄 Parsing New Data</span>
To fetch the latest (non-historical) electricity prices, we set `historical` to `False`. The same setting is used to fetch the moving averages of the electricity prices.

To provide up-to-date weather measures, we fetch a weather forecast for the next 5 days.

The calendar data does not change, so no new data is retrieved for it.

### <span style="color:#2656a3;">💸 Electricity Prices per day from Energinet</span>

In [3]:
# Fetching non-historical electricity prices for area DK1
electricity_price_df = electricity_prices.electricity_prices(
    historical=False,
    area=["DK1"]
)

In [4]:
# Display the electricity dataframe
electricity_price_df

|  | timestamp | datetime | date | hour | dk1_spotpricedkk_kwh |
|---|---|---|---|---|---|
| 0 | 1716681600000 | 2024-05-26 00:00:00 | 2024-05-26 | 0 | 0.80109 |
| 1 | 1716685200000 | 2024-05-26 01:00:00 | 2024-05-26 | 1 | 0.73348 |
| 2 | 1716688800000 | 2024-05-26 02:00:00 | 2024-05-26 | 2 | 0.6773 |
| 3 | 1716692400000 | 2024-05-26 03:00:00 | 2024-05-26 | 3 | 0.6723 |
| 4 | 1716696000000 | 2024-05-26 04:00:00 | 2024-05-26 | 4 | 0.68021 |
| 5 | 1716699600000 | 2024-05-26 05:00:00 | 2024-05-26 | 5 | 0.66118 |
| 6 | 1716703200000 | 2024-05-26 06:00:00 | 2024-05-26 | 6 | 0.62298 |
| 7 | 1716706800000 | 2024-05-26 07:00:00 | 2024-05-26 | 7 | 0.53649 |
| 8 | 1716710400000 | 2024-05-26 08:00:00 | 2024-05-26 | 8 | 0.09887 |
| 9 | 1716714000000 | 2024-05-26 09:00:00 | 2024-05-26 | 9 | 0.00015 |


In [5]:
# Display the information of the electricity dataframe
electricity_price_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   timestamp             24 non-null     int64         
 1   datetime              24 non-null     datetime64[ns]
 2   date                  24 non-null     datetime64[ns]
 3   hour                  24 non-null     int64         
 4   dk1_spotpricedkk_kwh  24 non-null     float64       
dtypes: datetime64[ns](2), float64(1), int64(2)
memory usage: 1.1 KB
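
In the output above, the `timestamp` column appears to hold milliseconds since the Unix epoch for the matching `datetime` value (for example, `1716681600000` for `2024-05-26 00:00:00`). The following is a minimal sketch of how such columns could be derived with pandas, assuming the naive datetimes are interpreted as UTC. The helper `add_time_columns` is hypothetical and is not the function used inside `electricity_prices`.

```python
import pandas as pd


def add_time_columns(df: pd.DataFrame, datetime_col: str = "datetime") -> pd.DataFrame:
    """Return a copy of `df` with `timestamp` (epoch ms), `date`, and `hour` columns."""
    out = df.copy()
    dt = pd.to_datetime(out[datetime_col])
    # Milliseconds since 1970-01-01, treating naive datetimes as UTC (an assumption)
    out["timestamp"] = (dt - pd.Timestamp("1970-01-01")) // pd.Timedelta(milliseconds=1)
    # Midnight of the same day, kept as datetime64 like the `date` column above
    out["date"] = dt.dt.normalize()
    out["hour"] = dt.dt.hour
    return out


# Three hourly rows starting at the first datetime shown in the table above
example = add_time_columns(
    pd.DataFrame(
        {"datetime": pd.date_range("2024-05-26", periods=3, freq=pd.Timedelta(hours=1))}
    )
)
print(example)
```

The three generated timestamps match the first three rows of the displayed dataframe, which supports the epoch-milliseconds reading.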


### <span style="color:#2656a3;">🪟 Rolling Window for Electricity Prices</span>

In [6]:
# Fetching non-historical electricity prices and rolling window for area DK1
electricity_price_window_df = electricity_prices.electricity_prices_window(
    historical=False,
    area=["DK1"],
    end=datetime.now().date() + timedelta(days=(7*1)) 
)

In [7]:
# Display the electricity_price_window dataframe
electricity_price_window_df

|  | timestamp | datetime | prev_1w_mean | prev_2w_mean | prev_4w_mean | prev_6w_mean | prev_8w_mean | prev_12w_mean |
|---|---|---|---|---|---|---|---|---|
| 38568 | 1716681600000 | 2024-05-26 00:00:00 | 0.510743 | 0.351968 | 0.398314 | 0.451050 | 0.422507 | 0.433568 |
| 38569 | 1716685200000 | 2024-05-26 01:00:00 | 0.509817 | 0.351875 | 0.398395 | 0.451492 | 0.422451 | 0.433584 |
| 38570 | 1716688800000 | 2024-05-26 02:00:00 | 0.508897 | 0.351975 | 0.398549 | 0.451934 | 0.422395 | 0.433601 |
| 38571 | 1716692400000 | 2024-05-26 03:00:00 | 0.508046 | 0.352159 | 0.398793 | 0.452376 | 0.422348 | 0.433610 |
| 38572 | 1716696000000 | 2024-05-26 04:00:00 | 0.507220 | 0.352357 | 0.399083 | 0.452810 | 0.422327 | 0.433621 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38755 | 1717354800000 | 2024-06-02 19:00:00 | NaN | 0.538760 | 0.417957 | 0.446383 | 0.432043 | 0.430358 |
| 38756 | 1717358400000 | 2024-06-02 20:00:00 | NaN | 0.536656 | 0.417094 | 0.446060 | 0.432109 | 0.430374 |
| 38757 | 1717362000000 | 2024-06-02 21:00:00 | NaN | 0.534053 | 0.416295 | 0.445775 | 0.432172 | 0.430392 |
| 38758 | 1717365600000 | 2024-06-02 22:00:00 | NaN | 0.532491 | 0.415630 | 0.445504 | 0.432251 | 0.430411 |


In [8]:
# Display the information of the electricity_price_window dataframe
electricity_price_window_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 192 entries, 38568 to 38759
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   timestamp      192 non-null    int64         
 1   datetime       192 non-null    datetime64[ns]
 2   prev_1w_mean   167 non-null    float64       
 3   prev_2w_mean   192 non-null    float64       
 4   prev_4w_mean   192 non-null    float64       
 5   prev_6w_mean   192 non-null    float64       
 6   prev_8w_mean   192 non-null    float64       
 7   prev_12w_mean  192 non-null    float64       
dtypes: datetime64[ns](1), float64(6), int64(1)
memory usage: 13.5 KB
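
The column names (`prev_1w_mean`, `prev_2w_mean`, ...) suggest means over the preceding 1, 2, 4, 6, 8, and 12 weeks. The notebook does not show how `electricity_prices_window` computes them, so the following is only a sketch of one way to build such columns with a time-based pandas rolling window. The helper name, the `closed="left"` choice (which excludes the current hour), and the example values are assumptions.

```python
import pandas as pd


def previous_window_means(
    df: pd.DataFrame, value_col: str, weeks=(1, 2, 4), datetime_col: str = "datetime"
) -> pd.DataFrame:
    """Add one `prev_{w}w_mean` column per entry in `weeks`.

    Each column is the mean of `value_col` over the `w` weeks strictly before
    the row's datetime (hypothetical definition, chosen for illustration).
    """
    out = df.sort_values(datetime_col).set_index(datetime_col)
    for w in weeks:
        # Offset-based window; closed="left" leaves out the current observation
        out[f"prev_{w}w_mean"] = out[value_col].rolling(f"{7 * w}D", closed="left").mean()
    return out.reset_index()


# Tiny made-up series: four hourly prices
prices = pd.DataFrame(
    {
        "datetime": pd.date_range("2024-05-26", periods=4, freq=pd.Timedelta(hours=1)),
        "price": [1.0, 2.0, 3.0, 4.0],
    }
)
windowed = previous_window_means(prices, value_col="price", weeks=(1,))
print(windowed)
```

With this definition the first row has no preceding observations, so its mean is `NaN`. Rows with missing values in a window column, such as the `NaN` entries in `prev_1w_mean` above, are therefore expected wherever the window contains no data.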


### <span style="color:#2656a3;"> 🌈 Forecast Weather Measures from Open Meteo</span>

In [9]:
# Fetching weather forecast measures for the next 5 days
forecast_weather_df = weather_measures.forecast_weather_measures(
    forecast_length=5
)

In [10]:
# Display the weather forecast dataframe
forecast_weather_df

|  | timestamp | datetime | date | hour | temperature_2m | relative_humidity_2m | precipitation | rain | snowfall | weather_code | cloud_cover | wind_speed_10m | wind_gusts_10m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1716681600000 | 2024-05-26 00:00:00 | 2024-05-26 | 0 | 17.5 | 88.0 | 0.0 | 0.0 | 0.0 | 2.0 | 56.0 | 11.9 | 21.2 |
| 1 | 1716685200000 | 2024-05-26 01:00:00 | 2024-05-26 | 1 | 16.6 | 92.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 | 11.2 | 21.2 |
| 2 | 1716688800000 | 2024-05-26 02:00:00 | 2024-05-26 | 2 | 16.0 | 94.0 | 0.0 | 0.0 | 0.0 | 1.0 | 25.0 | 10.4 | 19.4 |
| 3 | 1716692400000 | 2024-05-26 03:00:00 | 2024-05-26 | 3 | 15.5 | 94.0 | 0.0 | 0.0 | 0.0 | 3.0 | 100.0 | 9.0 | 17.3 |
| 4 | 1716696000000 | 2024-05-26 04:00:00 | 2024-05-26 | 4 | 15.8 | 94.0 | 0.0 | 0.0 | 0.0 | 3.0 | 100.0 | 11.5 | 20.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 115 | 1717095600000 | 2024-05-30 19:00:00 | 2024-05-30 | 19 | 15.7 | 92.0 | 0.4 | 0.0 | 0.0 | 2.0 | 86.0 | 8.0 | 24.5 |
| 116 | 1717099200000 | 2024-05-30 20:00:00 | 2024-05-30 | 20 | 15.4 | 92.0 | 0.4 | 0.0 | 0.0 | 2.0 | 72.0 | 9.5 | 24.5 |
| 117 | 1717102800000 | 2024-05-30 21:00:00 | 2024-05-30 | 21 | 15.1 | 93.0 | 0.4 | 0.0 | 0.0 | 2.0 | 58.0 | 11.2 | 24.8 |
| 118 | 1717106400000 | 2024-05-30 22:00:00 | 2024-05-30 | 22 | 14.8 | 94.0 | 0.0 | 0.0 | 0.0 | 2.0 | 57.0 | 11.2 | 23.0 |


In [11]:
# Display the information of the weather forecast dataframe
forecast_weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   timestamp             120 non-null    int64         
 1   datetime              120 non-null    datetime64[ns]
 2   date                  120 non-null    datetime64[ns]
 3   hour                  120 non-null    int64         
 4   temperature_2m        120 non-null    float64       
 5   relative_humidity_2m  120 non-null    float64       
 6   precipitation         120 non-null    float64       
 7   rain                  120 non-null    float64       
 8   snowfall              120 non-null    float64       
 9   weather_code          120 non-null    float64       
 10  cloud_cover           120 non-null    float64       
 11  wind_speed_10m        120 non-null    float64       
 12  wind_gusts_10m        120 non-null    float64       
dtypes: datetime64[ns](2), float64(9), int64(2)
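
With `forecast_length=5` the forecast dataframe holds 120 rows, i.e. 24 hourly rows for each of the 5 days. A quick coverage check like the one sketched below can confirm that no hour is missing or duplicated before the data is uploaded. The helper `check_hourly_coverage` is hypothetical and assumes a `datetime` column with hourly resolution.

```python
import pandas as pd


def check_hourly_coverage(df: pd.DataFrame, days: int, datetime_col: str = "datetime") -> dict:
    """Compare the datetimes in `df` with a full hourly range of `days` days."""
    dt = pd.to_datetime(df[datetime_col]).sort_values()
    # Expected hours, starting at the earliest datetime found in the data
    expected = pd.date_range(dt.iloc[0], periods=24 * days, freq=pd.Timedelta(hours=1))
    missing = expected.difference(pd.DatetimeIndex(dt))
    return {
        "rows": len(dt),
        "expected_rows": len(expected),
        "missing": list(missing),
        "duplicates": int(dt.duplicated().sum()),
    }


# Made-up complete forecast frame: 5 days of hourly datetimes
complete = pd.DataFrame(
    {"datetime": pd.date_range("2024-05-26", periods=120, freq=pd.Timedelta(hours=1))}
)
report = check_hourly_coverage(complete, days=5)
print(report["rows"], report["expected_rows"], len(report["missing"]), report["duplicates"])
```

In the notebook the same call could be applied as `check_hourly_coverage(forecast_weather_df, days=5)`.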

## <span style="color:#2656a3;"> 📡 Connecting to Hopsworks Feature Store</span>

We connect to the Hopsworks Feature Store so that we can retrieve the Feature Groups and upload the new data to them.

In [12]:
# Importing the hopsworks module for interacting with the Hopsworks platform
import hopsworks

# Logging into the Hopsworks project
project = hopsworks.login()

# Getting the feature store from the project
fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/550040
Connected. Call `.close()` to terminate connection gracefully.


In [13]:
# Retrieve the feature groups
electricity_price_fg = fs.get_feature_group(
    name="electricity_spot_prices",
    version=1,
)

electricity_price_window_fg = fs.get_feature_group(
    name="electricity_spot_price_window",
    version=1,
)

weather_fg = fs.get_feature_group(
    name="weather_measurements",
    version=1,
)

### <span style="color:#2656a3;"> ⬆️ Uploading new data to the Feature Store</span>
Here we upload the new data to the retrieved Feature Groups using the `insert` method.
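
Before inserting, it can be useful to check each dataframe for duplicate keys and missing values, since `wait_for_job` set to `False` means the call returns without waiting for the materialization job to finish. The sketch below is a plain pandas check and does not call Hopsworks. The helper `pre_insert_report` and the choice of `timestamp` as key column are assumptions; the actual primary keys are defined where the Feature Groups were created.

```python
import pandas as pd


def pre_insert_report(df: pd.DataFrame, key_cols) -> dict:
    """Summarise row count, duplicate keys, and columns containing nulls."""
    return {
        "rows": len(df),
        "duplicate_keys": int(df.duplicated(subset=list(key_cols)).sum()),
        "null_counts": {col: int(n) for col, n in df.isna().sum().items() if n > 0},
    }


# Made-up frame with one repeated key and one missing value
sample = pd.DataFrame(
    {
        "timestamp": [1716681600000, 1716685200000, 1716685200000],
        "prev_1w_mean": [0.51, None, 0.50],
    }
)
summary = pre_insert_report(sample, key_cols=["timestamp"])
print(summary)
```

In the notebook this could be run as `pre_insert_report(electricity_price_window_df, key_cols=["timestamp"])`, where the null count for `prev_1w_mean` would reflect the missing values seen in its `info()` output.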

In [14]:
# Inserting the electricity_price_df into the feature group electricity_price_fg
electricity_price_fg.insert(electricity_price_df,
                            write_options={"wait_for_job": False})

Uploading Dataframe: 100.00% |██████████| Rows 24/24 | Elapsed Time: 00:06 | Remaining Time: 00:00


Launching job: electricity_spot_prices_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/550040/jobs/named/electricity_spot_prices_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x13ca8e410>, None)

In [15]:
# Inserting the electricity_price_window_df into the feature group electricity_price_window_fg
electricity_price_window_fg.insert(electricity_price_window_df,
                                   write_options={"wait_for_job": False})

Uploading Dataframe: 100.00% |██████████| Rows 192/192 | Elapsed Time: 00:06 | Remaining Time: 00:00


Launching job: electricity_spot_price_window_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/550040/jobs/named/electricity_spot_price_window_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x13cab6a50>, None)

In [16]:
# Inserting the forecast_weather_df into the feature group weather_fg
weather_fg.insert(forecast_weather_df,
                  write_options={"wait_for_job": False})

Uploading Dataframe: 100.00% |██████████| Rows 120/120 | Elapsed Time: 00:06 | Remaining Time: 00:00


Launching job: weather_measurements_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/550040/jobs/named/weather_measurements_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x13cad76d0>, None)

---
## <span style="color:#2656a3;">⏭️ **Next:** Part 03: Training Pipeline </span>

Next, we will create a feature view and a training dataset. We will then train a model and save it in the model registry. There are two training pipelines: `3_1_training_pipeline_xgb` and `3_2_training_pipeline_lstm`.