# <span style="font-weight:bold; font-size: 3rem; color:#2656a3;">**Data Engineering and Machine Learning Operations in Business** </span> <span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 04: Batch Inference</span>

## <span style='color:#2656a3'> 🗒️ This notebook is divided into the following sections:

1. Load batch data.
2. Predict using model from Model Registry.

## <span style='color:#2656a3'> ⚙️ Import of libraries and packages

First, we import the Python packages required for this notebook. If a package is missing from your environment, install it with `pip install` first; adding the `--quiet` flag after the package names keeps the installation output silent.

In [1]:
# Importing the libraries needed in this notebook
import joblib
import inspect 
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import os

#%config InlineBackend.figure_format='retina'
#%matplotlib inline

## <span style="color:#2656a3;"> 📡 Connecting to Hopsworks Feature Store

In [2]:
# Importing the hopsworks module
import hopsworks

# Logging in to the Hopsworks project
project = hopsworks.login()

# Getting the feature store from the project
fs = project.get_feature_store() 

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/550040
Connected. Call `.close()` to terminate connection gracefully.


### <span style='color:#2656a3'> ⚙️ Feature View Retrieval

In [3]:
# Retrieve the 'electricity_feature_view2' feature view
feature_view = fs.get_feature_view(
    name='electricity_feature_view2',
    version=1,
)

### <span style='color:#2656a3'> 🗄 Model Registry

In [4]:
# Retrieve the model registry
mr = project.get_model_registry()

Connected. Call `.close()` to terminate connection gracefully.


## <span style='color:#2656a3'> 📮 Retrieving model from Model Registry

In [5]:
# Retrieving the model from the Model Registry
retrieved_model = mr.get_model(
    name="electricity_price_prediction_model", 
    version=1,
)

# Downloading the saved model to a local directory
saved_model_dir = retrieved_model.download()

# Loading the saved XGB model
retrieved_xgboost_model = joblib.load(saved_model_dir + "/dk_electricity_model.pkl")

Downloading model artifact (0 dirs, 3 files)... DONE
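The string concatenation above assumes that `saved_model_dir` has no trailing separator and that the artifact file is present. A minimal sketch of a more defensive lookup, using only the standard library, is shown below. The helper name `resolve_model_path` is hypothetical and is not part of Hopsworks or joblib.

```python
import os
import tempfile


def resolve_model_path(model_dir, filename="dk_electricity_model.pkl"):
    """Build the artifact path portably and fail early if the file is missing."""
    path = os.path.join(model_dir, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No model artifact found at {path}")
    return path


# Demonstration: a temporary directory stands in for the download folder
with tempfile.TemporaryDirectory() as demo_dir:
    # Create an empty placeholder file with the expected artifact name
    open(os.path.join(demo_dir, "dk_electricity_model.pkl"), "wb").close()
    demo_path = resolve_model_path(demo_dir)
    demo_name = os.path.basename(demo_path)
```

In this notebook it could be used as `joblib.load(resolve_model_path(saved_model_dir))`.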

In [6]:
# Display the retrieved XGBoost regressor model
retrieved_xgboost_model

## <span style='color:#2656a3'> ✨ Load Batch Data

In [8]:
import datetime

# # Calculating the start date as 5 days ago from the current date
# start_date = datetime.datetime.now() - datetime.timedelta(days=5)

# # Converting the start date to a timestamp in milliseconds
# start_time = int(start_date.timestamp()) * 1000

# # Displaying the start date in timestamp format
# start_time
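The commented-out cell above converts a start date into the millisecond timestamp that is passed as `start_time`. Below is a minimal sketch of that conversion, assuming timestamps are interpreted in UTC. The original `datetime.datetime.now()` is timezone-naive, so it would use the local timezone instead. The helper names are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def to_epoch_millis(dt):
    """Convert a timezone-aware datetime to whole milliseconds since the Unix epoch."""
    return int(dt.timestamp()) * 1000


def batch_start_millis(days_back, now=None):
    """Return the epoch-millisecond timestamp for `days_back` days before `now`."""
    now = now or datetime.now(timezone.utc)
    return to_epoch_millis(now - timedelta(days=days_back))


# A fixed reference time keeps the example reproducible
reference = datetime(2024, 5, 7, tzinfo=timezone.utc)
start_time = batch_start_millis(5, now=reference)
```

With this reference time, `start_time` is `1714608000000`, which corresponds to 2024-05-02 00:00:00 UTC.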

In [7]:
# # Initializing batch scoring
# feature_view.init_batch_scoring(1)

# # Retrieving batch data from the feature view starting from the specified start time
# batch_data = feature_view.get_batch_data(
#     start_time=start_time,
# )

# batch_data

In [9]:
# First, move up one directory to access the folder containing our functions
%cd ..

# Import the modules from the features folder
# These contain the functions we created to generate electricity price, weather, and calendar features
from features import electricity_prices, weather_measures, calendar

# Move back into the notebooks folder
%cd notebooks

/Users/camillahannesbo/Documents/AAU/Master - BDS/2. semester/Data Engineering and Machine learning operations in Business/MLOPs-Assignment-
/Users/camillahannesbo/Documents/AAU/Master - BDS/2. semester/Data Engineering and Machine learning operations in Business/MLOPs-Assignment-/notebooks


In [10]:
# Fetching weather forecast measures for the next 5 days
weather_forecast_df = weather_measures.forecast_weather_measures(
    forecast_length=5
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
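The warning above is pandas' `SettingWithCopyWarning`, raised when a column is assigned on a DataFrame that may be a view of another one. The internals of `forecast_weather_measures` are not shown in this notebook, so the following is only a sketch of the usual remedy on made-up data: take an explicit `.copy()` of the slice before adding columns.

```python
import pandas as pd

# Made-up stand-in for a raw hourly forecast response
raw_df = pd.DataFrame({
    "time": ["2024-05-02 00:00", "2024-05-02 01:00"],
    "temperature_2m": [14.9, 14.2],
    "rain": [0.0, 0.0],
})

# An explicit copy makes the subset independent of raw_df
subset_df = raw_df[["time", "temperature_2m"]].copy()

# Plain assignment is safe on the copy
subset_df["datetime"] = pd.to_datetime(subset_df["time"])

# .loc assignment, the form the warning recommends
subset_df.loc[:, "hour"] = subset_df["datetime"].dt.hour
```

Because `subset_df` is a copy, the new columns do not appear in `raw_df`.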


In [11]:
calendar_df = calendar.get_calendar()
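The implementation of `calendar.get_calendar()` lives in the project's `features` folder and is not shown in this notebook. The merged batch data contains calendar-style columns (`dayofweek`, `day`, `month`, `year`, `holiday`), so the sketch below shows one plausible way to derive such columns with the standard library. The function names and the explicit holiday set are assumptions for illustration only.

```python
from datetime import date, timedelta


def calendar_features(day, holidays=frozenset()):
    """Return calendar features for one date; Monday is dayofweek 0."""
    return {
        "date": day.isoformat(),
        "dayofweek": day.weekday(),
        "day": day.day,
        "month": day.month,
        "year": day.year,
        "holiday": int(day in holidays),
    }


def calendar_rows(start, n_days, holidays=frozenset()):
    """Build one feature row per day for `n_days` consecutive days."""
    return [
        calendar_features(start + timedelta(days=i), holidays)
        for i in range(n_days)
    ]


# Five days starting 2 May 2024, with no holidays assumed
rows = calendar_rows(date(2024, 5, 2), 5)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)`.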

In [12]:
# # Read the CSV file with the calendar
# calender_df = pd.read_csv('https://raw.githubusercontent.com/Camillahannesbo/MLOPs-Assignment-/main/data/calendar_incl_holiday.csv', delimiter=';', usecols=['date', 'type'])

# calender_df

In [13]:
# from datetime import datetime, timedelta

# # Formatting the date column to 'YYYY-MM-DD' dateformat
# calender_df["date"] = calender_df["date"].map(lambda x: datetime.strptime(x, '%d/%m/%Y').strftime("%Y-%m-%d"))

In [14]:
import numpy as np

In [17]:
# # Add features to the calendar DataFrame
# calender_df['date_'] = pd.to_datetime(calender_df['date'])
# calender_df['day'] = calender_df['date_'].dt.dayofweek
# calender_df['month'] = calender_df['date_'].dt.month
# calender_df['holiday'] = np.where(calender_df['type'] == 'Not a Workday', 1, 0)

# # Drop the columns 'type' and 'date_' to finalize the calendar DataFrame
# calender_df = calender_df.drop(['type','date_'], axis=1)

# Merge the weather forecast with the calendar features on the shared 'date' column
merged_df = pd.merge(weather_forecast_df, calendar_df, how='inner', on='date')
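An inner merge keeps only rows whose `date` appears in both frames, so a date missing from the calendar would silently drop its forecast hours. The small sketch below, on made-up data, shows how the `validate` argument and a row-count comparison can guard against that. The frames are stand-ins, not the notebook's real data.

```python
import pandas as pd

# Made-up hourly forecast rows (many per date)
weather_demo = pd.DataFrame({
    "date": ["2024-05-02", "2024-05-02", "2024-05-03"],
    "hour": [0, 1, 0],
    "temperature_2m": [14.9, 14.2, 12.0],
})

# Made-up calendar rows (exactly one per date)
calendar_demo = pd.DataFrame({
    "date": ["2024-05-02", "2024-05-03"],
    "holiday": [0, 0],
})

# validate="many_to_one" raises a MergeError if a date is duplicated in the calendar
merged_demo = pd.merge(
    weather_demo, calendar_demo, how="inner", on="date", validate="many_to_one"
)

# An inner merge can drop rows; compare counts to detect missing calendar dates
rows_dropped = len(weather_demo) - len(merged_demo)
```

A non-zero `rows_dropped` would indicate forecast dates that have no match in the calendar.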

In [18]:
# Use the merged DataFrame as the batch data
batch_data = merged_df

# Display the last 5 rows of the batch data
batch_data.tail()

Unnamed: 0,timestamp,datetime,date,hour,temperature_2m,relative_humidity_2m,precipitation,rain,snowfall,weather_code,cloud_cover,wind_speed_10m,wind_gusts_10m,dayofweek,day,month,year,holiday
115,1715022000000,2024-05-06 19:00:00,2024-05-06,19,10.7,91.0,1.4,1.4,0.0,61.0,100.0,16.6,32.0,0,6,5,2024,0
116,1715025600000,2024-05-06 20:00:00,2024-05-06,20,10.1,90.0,1.4,1.4,0.0,61.0,100.0,19.5,37.1,0,6,5,2024,0
117,1715029200000,2024-05-06 21:00:00,2024-05-06,21,9.5,88.0,1.4,1.4,0.0,61.0,100.0,21.6,42.1,0,6,5,2024,0
118,1715032800000,2024-05-06 22:00:00,2024-05-06,22,9.3,86.0,0.6,0.6,0.0,3.0,100.0,22.0,41.0,0,6,5,2024,0
119,1715036400000,2024-05-06 23:00:00,2024-05-06,23,9.1,84.0,0.6,0.6,0.0,3.0,100.0,21.3,40.3,0,6,5,2024,0


### <span style="color:#ff5f27;">🤖 Making the predictions</span>

In [None]:
# from sklearn.preprocessing import LabelEncoder

# # Create a LabelEncoder object
# label_encoder = LabelEncoder()

# # Fit the encoder to the data in the 'type' column
# label_encoder.fit(batch_data[['type']])

# # Transform the 'type' column data using the fitted encoder
# encoded = label_encoder.transform(batch_data[['type']])

In [19]:
batch_data

Unnamed: 0,timestamp,datetime,date,hour,temperature_2m,relative_humidity_2m,precipitation,rain,snowfall,weather_code,cloud_cover,wind_speed_10m,wind_gusts_10m,dayofweek,day,month,year,holiday
0,1714608000000,2024-05-02 00:00:00,2024-05-02,0,14.9,66.0,0.0,0.0,0.0,0.0,13.0,21.6,41.4,3,2,5,2024,0
1,1714611600000,2024-05-02 01:00:00,2024-05-02,1,14.2,71.0,0.0,0.0,0.0,0.0,4.0,20.5,37.1,3,2,5,2024,0
2,1714615200000,2024-05-02 02:00:00,2024-05-02,2,13.4,73.0,0.0,0.0,0.0,2.0,70.0,21.2,36.7,3,2,5,2024,0
3,1714618800000,2024-05-02 03:00:00,2024-05-02,3,13.2,72.0,0.1,0.1,0.0,51.0,51.0,22.3,39.2,3,2,5,2024,0
4,1714622400000,2024-05-02 04:00:00,2024-05-02,4,12.7,73.0,0.0,0.0,0.0,2.0,78.0,21.6,38.9,3,2,5,2024,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,1715022000000,2024-05-06 19:00:00,2024-05-06,19,10.7,91.0,1.4,1.4,0.0,61.0,100.0,16.6,32.0,0,6,5,2024,0
116,1715025600000,2024-05-06 20:00:00,2024-05-06,20,10.1,90.0,1.4,1.4,0.0,61.0,100.0,19.5,37.1,0,6,5,2024,0
117,1715029200000,2024-05-06 21:00:00,2024-05-06,21,9.5,88.0,1.4,1.4,0.0,61.0,100.0,21.6,42.1,0,6,5,2024,0
118,1715032800000,2024-05-06 22:00:00,2024-05-06,22,9.3,86.0,0.6,0.6,0.0,3.0,100.0,22.0,41.0,0,6,5,2024,0


In [25]:
# # Convert the output of the label encoding to a dense array and concatenate with the original data
# X_batch = pd.concat([batch_data, pd.DataFrame(encoded)], axis=1)

X_batch = batch_data

# Drop the columns 'date', 'datetime', and 'timestamp', which are not model features
X_batch = X_batch.drop(columns=['date', 'datetime', 'timestamp'])

# # Rename the newly added label-encoded column to 'type_encoded'
# X_batch = X_batch.rename(columns={0: "type_encoded"})

# Displaying the first 5 rows of the modified DataFrame
X_batch.head()

Unnamed: 0,hour,temperature_2m,relative_humidity_2m,precipitation,rain,snowfall,weather_code,cloud_cover,wind_speed_10m,wind_gusts_10m,dayofweek,day,month,year,holiday
0,0,14.9,66.0,0.0,0.0,0.0,0.0,13.0,21.6,41.4,3,2,5,2024,0
1,1,14.2,71.0,0.0,0.0,0.0,0.0,4.0,20.5,37.1,3,2,5,2024,0
2,2,13.4,73.0,0.0,0.0,0.0,2.0,70.0,21.2,36.7,3,2,5,2024,0
3,3,13.2,72.0,0.1,0.1,0.0,51.0,51.0,22.3,39.2,3,2,5,2024,0
4,4,12.7,73.0,0.0,0.0,0.0,2.0,78.0,21.6,38.9,3,2,5,2024,0
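Before calling `predict`, it can help to confirm that the batch columns match the features the model was trained on, in the same order. The training feature list is not shown in this notebook, so `expected_features` below is an assumption copied from the `X_batch` columns above, and the helper is a hypothetical sketch.

```python
def check_feature_columns(expected, actual):
    """Raise ValueError if columns are missing, unexpected, or out of order."""
    missing = [c for c in expected if c not in actual]
    extra = [c for c in actual if c not in expected]
    if missing or extra:
        raise ValueError(f"missing={missing}, unexpected={extra}")
    if list(expected) != list(actual):
        raise ValueError("Columns match but are in a different order")
    return True


# Assumed training feature order, copied from the X_batch output
expected_features = [
    "hour", "temperature_2m", "relative_humidity_2m", "precipitation", "rain",
    "snowfall", "weather_code", "cloud_cover", "wind_speed_10m", "wind_gusts_10m",
    "dayofweek", "day", "month", "year", "holiday",
]

# Columns of batch_data before the drop, for comparison
raw_columns = ["timestamp", "datetime", "date"] + expected_features
unexpected = [c for c in raw_columns if c not in expected_features]
columns_ok = check_feature_columns(expected_features, expected_features)
```

In the notebook, the check would be `check_feature_columns(expected_features, list(X_batch.columns))`.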


In [22]:
# # Extract the target variable 'dk1_spotpricedkk_kwh' from the batch data
# y_batch = X_batch.pop('dk1_spotpricedkk_kwh')

# # Displaying the first 5 rows of the modified DataFrame
# y_batch.head()

In [26]:
X_batch

Unnamed: 0,hour,temperature_2m,relative_humidity_2m,precipitation,rain,snowfall,weather_code,cloud_cover,wind_speed_10m,wind_gusts_10m,dayofweek,day,month,year,holiday
0,0,14.9,66.0,0.0,0.0,0.0,0.0,13.0,21.6,41.4,3,2,5,2024,0
1,1,14.2,71.0,0.0,0.0,0.0,0.0,4.0,20.5,37.1,3,2,5,2024,0
2,2,13.4,73.0,0.0,0.0,0.0,2.0,70.0,21.2,36.7,3,2,5,2024,0
3,3,13.2,72.0,0.1,0.1,0.0,51.0,51.0,22.3,39.2,3,2,5,2024,0
4,4,12.7,73.0,0.0,0.0,0.0,2.0,78.0,21.6,38.9,3,2,5,2024,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,19,10.7,91.0,1.4,1.4,0.0,61.0,100.0,16.6,32.0,0,6,5,2024,0
116,20,10.1,90.0,1.4,1.4,0.0,61.0,100.0,19.5,37.1,0,6,5,2024,0
117,21,9.5,88.0,1.4,1.4,0.0,61.0,100.0,21.6,42.1,0,6,5,2024,0
118,22,9.3,86.0,0.6,0.6,0.0,3.0,100.0,22.0,41.0,0,6,5,2024,0


In [27]:
# Make predictions on the batch data using the retrieved XGBoost regressor model
predictions = retrieved_xgboost_model.predict(X_batch)

# Display the first 5 predictions
predictions[:5]

array([0.08131132, 0.13439555, 0.11292348, 0.13106975, 0.07179935],
      dtype=float32)

In [30]:
# label = batch_data["time"]
# y_pred = retrieved_xgboost_model.predict(X_batch)

# data = {
#     'prediction': [y_pred],
#     'time': [label],
# }

# monitor_df = pd.DataFrame(data)
# monitor_df

In [31]:
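# Use the forecast timestamps as the time label for each prediction
label = batch_data["datetime"]

# Predict electricity prices for the batch data
y_pred = retrieved_xgboost_model.predict(X_batch)

# Combine predictions and timestamps into a single DataFrame
data = {
    'prediction': y_pred,
    'time': label,
}

monitor_df = pd.DataFrame(data)
monitor_df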

Unnamed: 0,prediction,time
0,0.081311,2024-05-02 00:00:00
1,0.134396,2024-05-02 01:00:00
2,0.112923,2024-05-02 02:00:00
3,0.131070,2024-05-02 03:00:00
4,0.071799,2024-05-02 04:00:00
...,...,...
115,0.571495,2024-05-06 19:00:00
116,0.556356,2024-05-06 20:00:00
117,0.413346,2024-05-06 21:00:00
118,0.409960,2024-05-06 22:00:00


---
## <span style="color:#ff5f27;">👾 Next: creating our Streamlit app</span>