# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 04: Batch Inference</span>

## 🗒️ This notebook is divided into the following sections:

1. Download model and batch inference data
2. Make predictions, generate PNG for forecast
3. Store predictions in a monitoring feature group and generate PNG for hindcast

## <span style='color:#ff5f27'> 📝 Imports</span>

In [1]:
import datetime
import pandas as pd
from xgboost import XGBRegressor
import hopsworks
from functions import util

In [2]:
# Getting the current date
today = datetime.date.today()
# start_day = today - datetime.timedelta(days = 100)
country="sweden"
city="stockholm"
street="stockholm-hornsgatan-108-gata"

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [3]:
project = hopsworks.login()
fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://snurran.hops.works/p/8321
Connected. Call `.close()` to terminate connection gracefully.


## <span style="color:#ff5f27;"> ⚙️ Feature View Retrieval</span>


In [4]:
# Retrieve the 'air_quality_fv' feature view
feature_view = fs.get_feature_view(
    name='air_quality_fv',
    version=1,
)

## <span style="color:#ff5f27;">🪝 Download the model from Model Registry</span>

In [5]:
mr = project.get_model_registry()

retrieved_model = mr.get_model(
    name="air_quality_xgboost_model",
    version=1,
)

# Download the saved model artifacts to a local directory
saved_model_dir = retrieved_model.download()

Connected. Call `.close()` to terminate connection gracefully.
Downloading model artifact (1 dirs, 6 files)... DONE

In [6]:
# Load the XGBoost regressor from the downloaded model directory
# retrieved_xgboost_model = joblib.load(saved_model_dir + "/xgboost_regressor.pkl")
retrieved_xgboost_model = XGBRegressor()

retrieved_xgboost_model.load_model(saved_model_dir + "/model.json")

# Displaying the retrieved XGBoost regressor model
retrieved_xgboost_model
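
The load step above joins `saved_model_dir` and `/model.json` by string concatenation. A minimal sketch of the same lookup with `pathlib`, assuming only that the downloaded directory should contain a file named `model.json` (as in the cell above). `resolve_model_path` is a hypothetical helper, not part of Hopsworks or XGBoost; it fails early with a clear message if the artifact is missing.

```python
from pathlib import Path


def resolve_model_path(model_dir: str, filename: str = "model.json") -> Path:
    """Return the path to the model file, raising if it does not exist.

    Hypothetical helper for illustration; not part of Hopsworks or XGBoost.
    """
    path = Path(model_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected model artifact at {path}")
    return path
```

It could then be used as `retrieved_xgboost_model.load_model(str(resolve_model_path(saved_model_dir)))`.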

## <span style="color:#ff5f27;">✨ Get Weather Forecast Features with Feature View   </span>



In [7]:
weather_fg = fs.get_feature_group(
    name='weather',
    version=1,
)

f = weather_fg.read()

Finished: Reading data from Hopsworks, using ArrowFlight (0.42s) 


In [8]:
air_quality_fg = fs.get_feature_group(
    name='air_quality',
    version=1,
)
a = air_quality_fg.read()
a = a.sort_values(by=['date'])
a

Finished: Reading data from Hopsworks, using ArrowFlight (0.44s) 


Unnamed: 0,date,pm25,country,city,street
1465,2017-10-04 00:00:00+00:00,13.0,sweden,stockholm,stockholm-hornsgatan-108-gata
1076,2017-10-05 00:00:00+00:00,9.0,sweden,stockholm,stockholm-hornsgatan-108-gata
1917,2017-10-06 00:00:00+00:00,8.0,sweden,stockholm,stockholm-hornsgatan-108-gata
1941,2017-10-07 00:00:00+00:00,13.0,sweden,stockholm,stockholm-hornsgatan-108-gata
1355,2017-10-08 00:00:00+00:00,8.0,sweden,stockholm,stockholm-hornsgatan-108-gata
...,...,...,...,...,...
1569,2024-02-29 00:00:00+00:00,38.0,sweden,stockholm,stockholm-hornsgatan-108-gata
1312,2024-03-01 00:00:00+00:00,46.0,sweden,stockholm,stockholm-hornsgatan-108-gata
2143,2024-03-02 00:00:00+00:00,59.0,sweden,stockholm,stockholm-hornsgatan-108-gata
2232,2024-03-03 00:00:00+00:00,48.0,sweden,stockholm,stockholm-hornsgatan-108-gata


In [9]:
batch_data = f[f['date'] >= str(today)]
batch_data = batch_data.sort_values(by=['date'])
batch_data

Unnamed: 0,date,temperature_2m_mean,precipitation_sum,wind_speed_10m_max,wind_direction_10m_dominant,city
2352,2024-03-12 00:00:00+00:00,5.95,0.0,6.618519,112.38018,stockholm
2354,2024-03-13 00:00:00+00:00,4.4,0.0,10.799999,216.86998,stockholm
2357,2024-03-14 00:00:00+00:00,9.4,0.0,17.414474,209.744797,stockholm
2359,2024-03-15 00:00:00+00:00,8.65,0.5,14.825706,209.054504,stockholm
2350,2024-03-16 00:00:00+00:00,8.8,0.2,8.287822,145.619598,stockholm
2351,2024-03-17 00:00:00+00:00,-1.75,0.1,22.366402,3.691312,stockholm
2355,2024-03-18 00:00:00+00:00,-1.5,0.0,14.241629,16.144413,stockholm
2353,2024-03-19 00:00:00+00:00,1.8,0.0,4.32,360.0,stockholm
2356,2024-03-20 00:00:00+00:00,2.6,0.0,3.319036,77.471199,stockholm
2358,2024-03-21 00:00:00+00:00,3.1,0.0,15.175612,157.693741,stockholm
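
The filter in `In [9]` compares a timezone-aware `date` column against `str(today)`. A minimal sketch of the same cut-off with an explicit UTC timestamp, run on made-up rows rather than the feature group data. `forecast_rows` is a hypothetical helper name, and the sketch assumes the `date` column is stored in UTC, as the `batch_data.info()` output suggests.

```python
import datetime

import pandas as pd


def forecast_rows(weather_df: pd.DataFrame, today: datetime.date) -> pd.DataFrame:
    """Keep rows dated today or later (UTC), sorted by date.

    Hypothetical helper; assumes a timezone-aware UTC 'date' column.
    """
    cutoff = pd.Timestamp(today.isoformat(), tz="UTC")
    return weather_df[weather_df["date"] >= cutoff].sort_values(by="date")


# Made-up sample rows, for illustration only
sample = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-03-13", "2024-03-11", "2024-03-12"], utc=True),
        "temperature_2m_mean": [4.4, 3.0, 5.95],
    }
)
result = forecast_rows(sample, datetime.date(2024, 3, 12))
```

Making the time zone explicit avoids relying on how a bare date string is interpreted during the comparison.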


In [10]:
batch_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 2352 to 2358
Data columns (total 6 columns):
 #   Column                       Non-Null Count  Dtype              
---  ------                       --------------  -----              
 0   date                         10 non-null     datetime64[us, UTC]
 1   temperature_2m_mean          10 non-null     float32            
 2   precipitation_sum            10 non-null     float32            
 3   wind_speed_10m_max           10 non-null     float32            
 4   wind_direction_10m_dominant  10 non-null     float32            
 5   city                         10 non-null     object             
dtypes: datetime64[us, UTC](1), float32(4), object(1)
memory usage: 400.0+ bytes


In [11]:
# batch_data = feature_view.get_batch_data(start_time=today, event_time=True, primary_key=True)
# pred_df = batch_data.drop(columns=['date'])
# print(feature_view.query.to_string())

### <span style="color:#ff5f27;">🤖 Making the predictions</span>

In [12]:
batch_data['predicted_pm25'] = retrieved_xgboost_model.predict(
    batch_data[['temperature_2m_mean', 'precipitation_sum', 'wind_speed_10m_max', 'wind_direction_10m_dominant']])
batch_data

Unnamed: 0,date,temperature_2m_mean,precipitation_sum,wind_speed_10m_max,wind_direction_10m_dominant,city,predicted_pm25
2352,2024-03-12 00:00:00+00:00,5.95,0.0,6.618519,112.38018,stockholm,58.964191
2354,2024-03-13 00:00:00+00:00,4.4,0.0,10.799999,216.86998,stockholm,43.018803
2357,2024-03-14 00:00:00+00:00,9.4,0.0,17.414474,209.744797,stockholm,39.864429
2359,2024-03-15 00:00:00+00:00,8.65,0.5,14.825706,209.054504,stockholm,33.077267
2350,2024-03-16 00:00:00+00:00,8.8,0.2,8.287822,145.619598,stockholm,35.166565
2351,2024-03-17 00:00:00+00:00,-1.75,0.1,22.366402,3.691312,stockholm,16.268124
2355,2024-03-18 00:00:00+00:00,-1.5,0.0,14.241629,16.144413,stockholm,20.995886
2353,2024-03-19 00:00:00+00:00,1.8,0.0,4.32,360.0,stockholm,27.325489
2356,2024-03-20 00:00:00+00:00,2.6,0.0,3.319036,77.471199,stockholm,49.869041
2358,2024-03-21 00:00:00+00:00,3.1,0.0,15.175612,157.693741,stockholm,39.632515
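
The prediction cell passes four weather columns to the model in a fixed order. A small sketch that keeps that list in one place and checks the columns exist before predicting. The column names and their order are copied from the cell above; that they match the training order is an assumption, and `select_features` is a hypothetical helper.

```python
import pandas as pd

# Copied from the prediction cell; assumed to match the training column order
FEATURE_COLUMNS = [
    "temperature_2m_mean",
    "precipitation_sum",
    "wind_speed_10m_max",
    "wind_direction_10m_dominant",
]


def select_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the model input columns, in FEATURE_COLUMNS order."""
    missing = [col for col in FEATURE_COLUMNS if col not in df.columns]
    if missing:
        raise KeyError(f"Missing feature columns: {missing}")
    return df[FEATURE_COLUMNS]
```

The prediction call could then read `retrieved_xgboost_model.predict(select_features(batch_data))`.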


In [13]:
batch_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 2352 to 2358
Data columns (total 7 columns):
 #   Column                       Non-Null Count  Dtype              
---  ------                       --------------  -----              
 0   date                         10 non-null     datetime64[us, UTC]
 1   temperature_2m_mean          10 non-null     float32            
 2   precipitation_sum            10 non-null     float32            
 3   wind_speed_10m_max           10 non-null     float32            
 4   wind_direction_10m_dominant  10 non-null     float32            
 5   city                         10 non-null     object             
 6   predicted_pm25               10 non-null     float32            
dtypes: datetime64[us, UTC](1), float32(5), object(1)
memory usage: 440.0+ bytes


In [14]:
batch_data['street'] = street
batch_data['city'] = city
batch_data['country'] = country
# Forecast horizon in days: 1 for the first forecast date, 2 for the next, and so on
# (relies on batch_data being sorted by date)
batch_data['days_before_forecast_day'] = range(1, len(batch_data)+1)
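
The cell above numbers rows by position, which matches the forecast horizon only when the rows are sorted by date and no day is missing. A sketch that derives the same column from the dates instead, assuming (as in the cell above) that the row for the base date counts as 1. `add_forecast_horizon` is a hypothetical helper and the sample dates are made up.

```python
import datetime

import pandas as pd


def add_forecast_horizon(df: pd.DataFrame, base_date: datetime.date) -> pd.DataFrame:
    """Add 'days_before_forecast_day' as whole days from base_date, starting at 1.

    Hypothetical helper; assumes a timezone-aware UTC 'date' column.
    """
    base = pd.Timestamp(base_date.isoformat(), tz="UTC")
    out = df.copy()
    out["days_before_forecast_day"] = (out["date"].dt.normalize() - base).dt.days + 1
    return out


# Made-up dates with a one-day gap, for illustration only
sample = pd.DataFrame({"date": pd.to_datetime(["2024-03-12", "2024-03-14"], utc=True)})
with_horizon = add_forecast_horizon(sample, datetime.date(2024, 3, 12))
```

With a gap in the dates, the positional `range(...)` would assign 1 and 2, while the date-based version assigns 1 and 3.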

In [15]:
batch_data[['date', 'predicted_pm25']]

Unnamed: 0,date,predicted_pm25
2352,2024-03-12 00:00:00+00:00,58.964191
2354,2024-03-13 00:00:00+00:00,43.018803
2357,2024-03-14 00:00:00+00:00,39.864429
2359,2024-03-15 00:00:00+00:00,33.077267
2350,2024-03-16 00:00:00+00:00,35.166565
2351,2024-03-17 00:00:00+00:00,16.268124
2355,2024-03-18 00:00:00+00:00,20.995886
2353,2024-03-19 00:00:00+00:00,27.325489
2356,2024-03-20 00:00:00+00:00,49.869041
2358,2024-03-21 00:00:00+00:00,39.632515


In [16]:
batch_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 2352 to 2358
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype              
---  ------                       --------------  -----              
 0   date                         10 non-null     datetime64[us, UTC]
 1   temperature_2m_mean          10 non-null     float32            
 2   precipitation_sum            10 non-null     float32            
 3   wind_speed_10m_max           10 non-null     float32            
 4   wind_direction_10m_dominant  10 non-null     float32            
 5   city                         10 non-null     object             
 6   predicted_pm25               10 non-null     float32            
 7   street                       10 non-null     object             
 8   country                      10 non-null     object             
 9   days_before_forecast_day     10 non-null     int64              
dtypes: datetime64[us, UTC](1), float32(5), int64(1), obj

### Create Forecast Graph
Plot the predictions against their dates, save the chart as a PNG in the GitHub repo, and display it on GitHub Pages.

In [None]:
file_path = "../../docs/air-quality/assets/img/pm25_forecast.png"
plt = util.plot_air_quality_forecast(batch_data, file_path)
plt.show()

In [18]:
# Get or create feature group
monitor_fg = fs.get_or_create_feature_group(
    name='aq_monitoring',
    description='Air Quality prediction monitoring',
    version=1,
    primary_key=['country','street','date', 'days_before_forecast_day'],
    event_time="date"
)
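
The feature group above declares a composite primary key. A minimal sketch that checks a DataFrame for rows sharing the same key before it is inserted. `find_duplicate_keys` is a hypothetical helper and the sample rows are made up; how Hopsworks itself treats duplicate keys is not covered here.

```python
import pandas as pd

# Same key columns as in the feature group definition above
PRIMARY_KEY = ["country", "street", "date", "days_before_forecast_day"]


def find_duplicate_keys(df: pd.DataFrame, key_columns: list) -> pd.DataFrame:
    """Return every row whose key combination appears more than once."""
    return df[df.duplicated(subset=key_columns, keep=False)]


# Made-up rows: the first two share the same primary key
sample = pd.DataFrame(
    {
        "country": ["sweden", "sweden", "sweden"],
        "street": ["street-a", "street-a", "street-a"],
        "date": pd.to_datetime(["2024-03-12", "2024-03-12", "2024-03-13"], utc=True),
        "days_before_forecast_day": [1, 1, 2],
        "predicted_pm25": [58.9, 57.0, 43.0],
    }
)
duplicates = find_duplicate_keys(sample, PRIMARY_KEY)
```

An empty result means every row has a distinct key.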

In [19]:
monitor_fg.insert(batch_data, wait=True)

Feature Group created successfully, explore it at 
https://snurran.hops.works/p/8321/fs/8269/fg/9309


Uploading Dataframe: 0.00% |          | Rows 0/10 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: aq_monitoring_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://snurran.hops.works/p/8321/jobs/named/aq_monitoring_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7fc682586410>, None)

In [20]:
from hsfs.feature import Feature

# We will create a hindcast chart only for the forecasts made 1 day beforehand
monitoring_df = monitor_fg.filter(Feature("days_before_forecast_day") == 1).read()
monitoring_df

Finished: Reading data from Hopsworks, using ArrowFlight (0.50s) 


Unnamed: 0,date,temperature_2m_mean,precipitation_sum,wind_speed_10m_max,wind_direction_10m_dominant,city,predicted_pm25,street,country,days_before_forecast_day
0,2024-03-12 00:00:00+00:00,5.95,0.0,6.618519,112.38018,stockholm,58.964191,stockholm-hornsgatan-108-gata,sweden,1


In [21]:
air_quality_fg = fs.get_feature_group(
    name='air_quality',
    version=1,
)
air_quality_df = air_quality_fg.read()
air_quality_df

Finished: Reading data from Hopsworks, using ArrowFlight (0.43s) 


Unnamed: 0,date,pm25,country,city,street
0,2017-10-18 00:00:00+00:00,10.0,sweden,stockholm,stockholm-hornsgatan-108-gata
1,2020-06-17 00:00:00+00:00,30.0,sweden,stockholm,stockholm-hornsgatan-108-gata
2,2023-04-12 00:00:00+00:00,62.0,sweden,stockholm,stockholm-hornsgatan-108-gata
3,2020-03-22 00:00:00+00:00,16.0,sweden,stockholm,stockholm-hornsgatan-108-gata
4,2018-11-11 00:00:00+00:00,57.0,sweden,stockholm,stockholm-hornsgatan-108-gata
...,...,...,...,...,...
2265,2017-10-12 00:00:00+00:00,10.0,sweden,stockholm,stockholm-hornsgatan-108-gata
2266,2020-10-08 00:00:00+00:00,17.0,sweden,stockholm,stockholm-hornsgatan-108-gata
2267,2018-04-23 00:00:00+00:00,17.0,sweden,stockholm,stockholm-hornsgatan-108-gata
2268,2019-03-27 00:00:00+00:00,33.0,sweden,stockholm,stockholm-hornsgatan-108-gata


In [22]:
air_quality_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2270 entries, 0 to 2269
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype              
---  ------   --------------  -----              
 0   date     2270 non-null   datetime64[us, UTC]
 1   pm25     2270 non-null   float32            
 2   country  2270 non-null   object             
 3   city     2270 non-null   object             
 4   street   2270 non-null   object             
dtypes: datetime64[us, UTC](1), float32(1), object(3)
memory usage: 79.9+ KB


In [23]:
outcome_df = air_quality_df[['date', 'pm25']]
preds_df = monitoring_df[['date', 'predicted_pm25']]

hindcast_df = pd.merge(preds_df, outcome_df, on="date")
hindcast_df = hindcast_df.sort_values(by=['date'])
hindcast_df

Unnamed: 0,date,predicted_pm25,pm25
0,2024-03-12 00:00:00+00:00,58.964191,46.0
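
`pd.merge` performs an inner join by default, so the hindcast keeps only dates that have both a prediction and an observed value. A sketch on made-up numbers that states the join type explicitly, guards against duplicate dates with `validate="one_to_one"`, and adds an absolute-error column. The error metric is an illustration and not part of the original notebook.

```python
import pandas as pd

# Made-up predictions and observations, for illustration only
preds = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-03-12", "2024-03-13"], utc=True),
        "predicted_pm25": [50.0, 40.0],
    }
)
outcomes = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-03-11", "2024-03-12"], utc=True),
        "pm25": [30.0, 46.0],
    }
)

# Inner join: only 2024-03-12 appears in both frames
hindcast = pd.merge(preds, outcomes, on="date", how="inner", validate="one_to_one")
hindcast["abs_error"] = (hindcast["predicted_pm25"] - hindcast["pm25"]).abs()
mean_abs_error = hindcast["abs_error"].mean()
```

Dates with a forecast but no measurement yet (here 2024-03-13) drop out of the result, which is why the hindcast above has a single row.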


In [24]:
# import plotly.express as px

# fig = px.line(hindcast_df, x="date", y=['pm25', 'predicted_pm25'])
# filename = "../../docs/air-quality/assets/img/pm25_hindcast_1day.png"
# fig.write_image(filename)

In [None]:
file_path = "../../docs/air-quality/assets/img/pm25_hindcast_1day.png"
plt = util.plot_air_quality_forecast(hindcast_df, file_path, hindcast=True)
plt.show()

---