# Aurora Forecasting - Part 02: Daily Feature Pipeline

🗒️ This notebook is divided into the following sections:

1. Initialize the Hopsworks connection.
2. Fetch the latest real-time Solar Wind data from NOAA.
3. Fetch the latest Cloud Cover forecast for Stockholm, Luleå, and Kiruna.
4. Update the Feature Groups in the Hopsworks Feature Store.

# Imports and Login

In [1]:
import pandas as pd
import datetime
import hopsworks
from config import HopsworksSettings
import util
import warnings
warnings.filterwarnings("ignore")
import numpy

# Setup settings
settings = HopsworksSettings()

print(settings.HOPSWORKS_PROJECT)

# Login to Hopsworks
project = hopsworks.login(
    project=settings.HOPSWORKS_PROJECT,
    api_key_value=settings.HOPSWORKS_API_KEY.get_secret_value()
)
fs = project.get_feature_store()


HopsworksSettings initialized!
mac64
2026-01-11 21:04:12,779 INFO: Initializing external client
2026-01-11 21:04:12,780 INFO: Base URL: https://c.app.hopsworks.ai:443






2026-01-11 21:04:15,799 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1299605


# Step 1: Get Real-time Solar Wind Data

We use the NOAA SWPC API to get the most recent measurements from the DSCOVR/ACE satellites. These will serve as the features for our real-time inference.

In [2]:
print("Fetching real-time solar wind data from NOAA...")

# Uses the helper function from util.py to fetch and merge mag/plasma data
new_solar_df = util.get_noaa_realtime_hourly_data(
    settings.NOAA_MAG_URL,
    settings.NOAA_PLASMA_URL,
    settings.KP_INDEX_URL
)

# Format the time_tag for Hopsworks compatibility
#new_solar_df['time'] = new_solar_df['time'].dt.strftime('%Y-%m-%d %H:%M:%S')

# Drop unnecessary columns; errors='ignore' skips any that are not present
new_solar_df.drop(columns=['bx_gsm', 'lon_gsm', 'lat_gsm', 'bt', 'temperature', 'a_running', 'station_count'], inplace=True, errors='ignore')

print(f"Successfully retrieved {len(new_solar_df)} new solar wind records.")
new_solar_df

Fetching real-time solar wind data from NOAA...
Raw Magnetometer data:
     bx_gsm  by_gsm  bz_gsm  lon_gsm  lat_gsm     bt             date_and_time
0     0.27    9.51   -7.31    88.34   -37.53  12.00 2026-01-10 20:00:00+00:00
1    -3.74    0.30   -7.49   175.35   -63.40   8.38 2026-01-10 21:00:00+00:00
2    -3.14   -3.51   -8.43   228.13   -60.79   9.65 2026-01-10 22:00:00+00:00
3     1.98   -0.41   -5.64   348.27   -70.27   5.99 2026-01-10 23:00:00+00:00
4    10.19   -9.02    5.37   318.50    21.52  14.63 2026-01-11 00:00:00+00:00
5     9.59  -12.38   -3.59   307.76   -12.91  16.07 2026-01-11 01:00:00+00:00
6     6.95  -12.53    0.47   298.99     1.87  14.34 2026-01-11 02:00:00+00:00
7     7.36   -8.28    5.37   311.63    25.86  12.31 2026-01-11 03:00:00+00:00
8     6.66   -9.77   -1.16   304.30    -5.60  11.88 2026-01-11 04:00:00+00:00
9     4.20  -11.25   -6.64   290.46   -28.95  13.72 2026-01-11 05:00:00+00:00
10    9.13   -7.06    5.18   322.29    24.15  12.65 2026-01-11 06:00:0

Unnamed: 0,by_gsm,bz_gsm,date_and_time,density,speed,kp_index
0,9.51,-7.31,2026-01-10 20:00:00+00:00,5.38,584.4,
1,0.3,-7.49,2026-01-10 21:00:00+00:00,5.04,581.6,5.67
2,-3.51,-8.43,2026-01-10 22:00:00+00:00,6.38,569.8,
3,-0.41,-5.64,2026-01-10 23:00:00+00:00,0.4,540.8,
4,-9.02,5.37,2026-01-11 00:00:00+00:00,0.7,510.4,4.67
5,-12.38,-3.59,2026-01-11 01:00:00+00:00,0.91,552.9,
6,-12.53,0.47,2026-01-11 02:00:00+00:00,0.54,506.0,
7,-8.28,5.37,2026-01-11 03:00:00+00:00,0.54,528.1,3.0
8,-9.77,-1.16,2026-01-11 04:00:00+00:00,0.88,513.5,
9,-11.25,-6.64,2026-01-11 05:00:00+00:00,1.1,542.8,
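
`util.get_noaa_realtime_hourly_data` lives in `util.py` and is not shown in this notebook. As a rough sketch of what such a helper might do, the snippet below parses two header-first row lists (the layout we assume for the SWPC mag and plasma JSON products), averages them to hourly values, and joins them on time. The function name `rows_to_hourly`, the `time_tag` column, and the choice of an hourly mean are assumptions for illustration, and the Kp index merge is left out.

```python
import pandas as pd


def rows_to_hourly(rows, value_cols):
    """Convert a header-first list of rows into hourly means indexed by UTC time.

    Hypothetical helper: the real logic in util.py may differ.
    """
    header, *records = rows
    df = pd.DataFrame(records, columns=header)
    # Assumed column name for the timestamp in the raw payload
    df["date_and_time"] = pd.to_datetime(df["time_tag"], utc=True)
    for col in value_cols:
        # Values arrive as strings; unparseable entries become NaN
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df.set_index("date_and_time")[value_cols].resample("1h").mean()


# Small hand-made payloads standing in for the downloaded JSON
mag_rows = [
    ["time_tag", "by_gsm", "bz_gsm"],
    ["2026-01-10 20:00:00.000", "9.0", "-7.0"],
    ["2026-01-10 20:30:00.000", "10.0", "-8.0"],
    ["2026-01-10 21:00:00.000", "0.3", "-7.5"],
]
plasma_rows = [
    ["time_tag", "density", "speed"],
    ["2026-01-10 20:00:00.000", "5.0", "580.0"],
    ["2026-01-10 21:00:00.000", "5.0", "582.0"],
]

mag = rows_to_hourly(mag_rows, ["by_gsm", "bz_gsm"])
plasma = rows_to_hourly(plasma_rows, ["density", "speed"])

# Keep only the hours present in both sources
merged = mag.join(plasma, how="inner").reset_index()
print(merged)
```

An inner join keeps only hours for which both instruments reported, which avoids half-empty feature rows at the cost of dropping some timestamps.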


In [3]:
# Add dynamic pressure calculation
new_solar_df = util.calculate_dynamic_pressure(new_solar_df)
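
The body of `util.calculate_dynamic_pressure` is not shown here, but the `dynamic_pressure` values printed later in this notebook are consistent with `density * speed ** 2`, without the proton-mass constant. A minimal sketch under that assumption (the function name `add_dynamic_pressure` is ours, not the one in `util.py`):

```python
import pandas as pd


def add_dynamic_pressure(df):
    """Add an unscaled dynamic pressure column: density * speed^2.

    Assumption: this mirrors the values printed in the notebook;
    the actual helper in util.py may be implemented differently.
    """
    out = df.copy()
    out["dynamic_pressure"] = out["density"] * out["speed"] ** 2
    return out


# Two rows taken from the real-time output above
sample = pd.DataFrame({"density": [5.38, 0.4], "speed": [584.4, 540.8]})
result = add_dynamic_pressure(sample)
print(result)
```

Because the constant is omitted, the column is a proxy rather than a pressure in physical units. Assuming density in cm^-3 and speed in km/s, multiplying by roughly 1.6726e-6 would convert it to nPa.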

In [4]:
# Aggregate the new solar wind data into 3-hour intervals
new_solar_aggregated_df = util.aggregate_solar_wind_3h(new_solar_df)
new_solar_aggregated_df

Unnamed: 0,window_start,window_end,by_gsm_mean,by_gsm_min,by_gsm_max,by_gsm_std,bz_gsm_mean,bz_gsm_min,bz_gsm_max,bz_gsm_std,...,density_std,speed_mean,speed_min,speed_max,speed_std,dynamic_pressure_mean,dynamic_pressure_min,dynamic_pressure_max,dynamic_pressure_std,kp_index
0,2026-01-10 21:00:00+00:00,2026-01-11 00:00:00+00:00,-1.206667,-3.51,0.3,2.026088,-7.186667,-8.43,-5.64,1.419519,...,3.138089,564.066667,540.8,581.6,20.995555,1297739.0,116985.859375,2071408.0,1038860.0,5.67
1,2026-01-11 00:00:00+00:00,2026-01-11 03:00:00+00:00,-11.31,-12.53,-9.02,1.984616,0.75,-3.59,5.37,4.486558,...,0.185562,523.1,506.0,552.9,25.901158,199600.2,138259.4375,278185.6,71539.22,4.67
2,2026-01-11 03:00:00+00:00,2026-01-11 06:00:00+00:00,-9.766667,-11.25,-8.28,1.485003,-0.81,-6.64,5.37,6.012645,...,0.282135,528.133333,513.5,542.8,14.650028,235578.6,150600.390625,324095.0,86801.42,3.0
3,2026-01-11 06:00:00+00:00,2026-01-11 09:00:00+00:00,-8.303333,-8.95,-7.06,1.077048,-0.27,-4.49,5.18,4.950949,...,0.095394,525.9,521.6,529.4,3.96106,630573.2,601986.0,655818.6,27071.46,3.0
4,2026-01-11 09:00:00+00:00,2026-01-11 12:00:00+00:00,-8.726667,-9.23,-8.43,0.438216,-4.85,-5.12,-4.53,0.298161,...,1.278684,535.533333,530.2,542.8,6.518691,904743.4,586541.6875,1340575.0,390536.9,2.67
5,2026-01-11 12:00:00+00:00,2026-01-11 15:00:00+00:00,-5.983333,-7.14,-5.35,1.003212,-7.706667,-8.6,-6.38,1.171722,...,0.445084,520.633333,501.4,534.1,17.095711,815163.8,730272.8125,955984.5,122811.0,4.67
6,2026-01-11 15:00:00+00:00,2026-01-11 18:00:00+00:00,-4.673333,-5.32,-3.86,0.744133,-8.13,-9.62,-5.49,2.292619,...,4.006549,506.866667,489.0,528.2,19.828599,1112409.0,399332.0625,2441208.0,1151803.0,5.0
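
The column names above suggest a mean/min/max/std summary per 3-hour window. One way to get that shape with pandas is sketched below; `aggregate_3h` is a hypothetical stand-in for `util.aggregate_solar_wind_3h`, and it does not reproduce how the real helper attaches `kp_index` or handles incomplete windows.

```python
import pandas as pd


def aggregate_3h(df, value_cols, time_col="date_and_time"):
    """Summarise hourly rows into 3-hour windows aligned to midnight UTC."""
    indexed = df.set_index(time_col).sort_index()
    # Default resample bins are left-closed and labelled by their start time
    stats = indexed[value_cols].resample("3h").agg(["mean", "min", "max", "std"])
    # Flatten the (column, statistic) MultiIndex into names like by_gsm_mean
    stats.columns = [f"{col}_{stat}" for col, stat in stats.columns]
    stats = stats.reset_index().rename(columns={time_col: "window_start"})
    stats.insert(1, "window_end", stats["window_start"] + pd.Timedelta(hours=3))
    return stats


# First four hourly rows from the real-time output above
hourly = pd.DataFrame({
    "date_and_time": pd.date_range("2026-01-10 20:00", periods=4, freq="h", tz="UTC"),
    "by_gsm": [9.51, 0.3, -3.51, -0.41],
    "speed": [584.4, 581.6, 569.8, 540.8],
})
agg = aggregate_3h(hourly, ["by_gsm", "speed"])
print(agg)
```

A window holding a single row gets a `NaN` standard deviation, so a later `dropna()` would remove it.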


In [5]:
# Drop rows with missing values and sort by window_start
new_solar_aggregated_df = new_solar_aggregated_df.dropna()
new_solar_aggregated_df = new_solar_aggregated_df.sort_values(["window_start"])
new_solar_aggregated_df = new_solar_aggregated_df.reset_index(drop=True)

new_solar_aggregated_df

Unnamed: 0,window_start,window_end,by_gsm_mean,by_gsm_min,by_gsm_max,by_gsm_std,bz_gsm_mean,bz_gsm_min,bz_gsm_max,bz_gsm_std,...,density_std,speed_mean,speed_min,speed_max,speed_std,dynamic_pressure_mean,dynamic_pressure_min,dynamic_pressure_max,dynamic_pressure_std,kp_index
0,2026-01-10 21:00:00+00:00,2026-01-11 00:00:00+00:00,-1.206667,-3.51,0.3,2.026088,-7.186667,-8.43,-5.64,1.419519,...,3.138089,564.066667,540.8,581.6,20.995555,1297739.0,116985.859375,2071408.0,1038860.0,5.67
1,2026-01-11 00:00:00+00:00,2026-01-11 03:00:00+00:00,-11.31,-12.53,-9.02,1.984616,0.75,-3.59,5.37,4.486558,...,0.185562,523.1,506.0,552.9,25.901158,199600.2,138259.4375,278185.6,71539.22,4.67
2,2026-01-11 03:00:00+00:00,2026-01-11 06:00:00+00:00,-9.766667,-11.25,-8.28,1.485003,-0.81,-6.64,5.37,6.012645,...,0.282135,528.133333,513.5,542.8,14.650028,235578.6,150600.390625,324095.0,86801.42,3.0
3,2026-01-11 06:00:00+00:00,2026-01-11 09:00:00+00:00,-8.303333,-8.95,-7.06,1.077048,-0.27,-4.49,5.18,4.950949,...,0.095394,525.9,521.6,529.4,3.96106,630573.2,601986.0,655818.6,27071.46,3.0
4,2026-01-11 09:00:00+00:00,2026-01-11 12:00:00+00:00,-8.726667,-9.23,-8.43,0.438216,-4.85,-5.12,-4.53,0.298161,...,1.278684,535.533333,530.2,542.8,6.518691,904743.4,586541.6875,1340575.0,390536.9,2.67
5,2026-01-11 12:00:00+00:00,2026-01-11 15:00:00+00:00,-5.983333,-7.14,-5.35,1.003212,-7.706667,-8.6,-6.38,1.171722,...,0.445084,520.633333,501.4,534.1,17.095711,815163.8,730272.8125,955984.5,122811.0,4.67
6,2026-01-11 15:00:00+00:00,2026-01-11 18:00:00+00:00,-4.673333,-5.32,-3.86,0.744133,-8.13,-9.62,-5.49,2.292619,...,4.006549,506.866667,489.0,528.2,19.828599,1112409.0,399332.0625,2441208.0,1151803.0,5.0


In [6]:
# Drop the Kp index column: it is not used as a feature for inference on the real-time data
new_solar_df = new_solar_df.drop(columns=['kp_index'])
new_solar_df.dropna(inplace=True)
new_solar_df = new_solar_df.sort_values(["date_and_time"])
new_solar_df = new_solar_df.reset_index(drop=True)
new_solar_df

Unnamed: 0,by_gsm,bz_gsm,date_and_time,density,speed,dynamic_pressure
0,9.51,-7.31,2026-01-10 20:00:00+00:00,5.38,584.4,1837396.0
1,0.3,-7.49,2026-01-10 21:00:00+00:00,5.04,581.6,1704823.0
2,-3.51,-8.43,2026-01-10 22:00:00+00:00,6.38,569.8,2071408.0
3,-0.41,-5.64,2026-01-10 23:00:00+00:00,0.4,540.8,116985.9
4,-9.02,5.37,2026-01-11 00:00:00+00:00,0.7,510.4,182355.7
5,-12.38,-3.59,2026-01-11 01:00:00+00:00,0.91,552.9,278185.6
6,-12.53,0.47,2026-01-11 02:00:00+00:00,0.54,506.0,138259.4
7,-8.28,5.37,2026-01-11 03:00:00+00:00,0.54,528.1,150600.4
8,-9.77,-1.16,2026-01-11 04:00:00+00:00,0.88,513.5,232040.4
9,-11.25,-6.64,2026-01-11 05:00:00+00:00,1.1,542.8,324095.0


# Step 3: Insert into Feature Groups

Now we push the new observations into the Feature Store. Hopsworks will handle the deduplication based on the primary keys defined in the backfill notebook.
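
To make the deduplication idea concrete, here is a small pandas-only illustration of upsert semantics: rows whose key already exists are replaced by the newer version, and new keys are appended. This is not Hopsworks code and says nothing about its internals; the `upsert` function is hypothetical, and using `date_and_time` as the key is an assumption, since the backfill notebook is not shown here.

```python
import pandas as pd


def upsert(existing, incoming, key):
    """Merge incoming rows into existing ones, keeping the newest row per key."""
    combined = pd.concat([existing, incoming], ignore_index=True)
    # keep="last" lets incoming rows win over existing rows with the same key
    combined = combined.drop_duplicates(subset=key, keep="last")
    return combined.sort_values(key).reset_index(drop=True)


existing = pd.DataFrame({
    "date_and_time": pd.to_datetime(["2026-01-10 20:00", "2026-01-10 21:00"], utc=True),
    "speed": [584.4, 581.6],
})
incoming = pd.DataFrame({
    "date_and_time": pd.to_datetime(["2026-01-10 21:00", "2026-01-10 22:00"], utc=True),
    "speed": [580.0, 569.8],
})

merged = upsert(existing, incoming, key=["date_and_time"])
print(merged)
```

With this behaviour, the overlap between consecutive daily runs does not create duplicate rows.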

In [7]:
print("Before casting the aggregated data:\n", new_solar_aggregated_df)
# Clean and cast to correct types for Feature Store compatibility
# Convert numeric columns to float32 (Feature Store expects 'float' not 'double')
df = new_solar_aggregated_df.copy()

for col in df.columns:
    if col not in ["window_start", "window_end"]:
        df[col] = pd.to_numeric(df[col], errors='coerce').astype('float32')

new_solar_aggregated_df = df
# check data types of each column
print("After casting:\n", new_solar_aggregated_df.dtypes)
new_solar_aggregated_df

Before casting the aggregated data:
                window_start                window_end  by_gsm_mean  \
0 2026-01-10 21:00:00+00:00 2026-01-11 00:00:00+00:00    -1.206667   
1 2026-01-11 00:00:00+00:00 2026-01-11 03:00:00+00:00   -11.310000   
2 2026-01-11 03:00:00+00:00 2026-01-11 06:00:00+00:00    -9.766667   
3 2026-01-11 06:00:00+00:00 2026-01-11 09:00:00+00:00    -8.303333   
4 2026-01-11 09:00:00+00:00 2026-01-11 12:00:00+00:00    -8.726667   
5 2026-01-11 12:00:00+00:00 2026-01-11 15:00:00+00:00    -5.983333   
6 2026-01-11 15:00:00+00:00 2026-01-11 18:00:00+00:00    -4.673333   

   by_gsm_min  by_gsm_max  by_gsm_std  bz_gsm_mean  bz_gsm_min  bz_gsm_max  \
0       -3.51        0.30    2.026088    -7.186667       -8.43       -5.64   
1      -12.53       -9.02    1.984616     0.750000       -3.59        5.37   
2      -11.25       -8.28    1.485003    -0.810000       -6.64        5.37   
3       -8.95       -7.06    1.077048    -0.270000       -4.49        5.18   
4       -9.2

Unnamed: 0,window_start,window_end,by_gsm_mean,by_gsm_min,by_gsm_max,by_gsm_std,bz_gsm_mean,bz_gsm_min,bz_gsm_max,bz_gsm_std,...,density_std,speed_mean,speed_min,speed_max,speed_std,dynamic_pressure_mean,dynamic_pressure_min,dynamic_pressure_max,dynamic_pressure_std,kp_index
0,2026-01-10 21:00:00+00:00,2026-01-11 00:00:00+00:00,-1.206667,-3.51,0.3,2.026088,-7.186666,-8.43,-5.64,1.419519,...,3.138089,564.06665,540.799988,581.599976,20.995556,1297739.0,116985.859375,2071408.0,1038860.0,5.67
1,2026-01-11 00:00:00+00:00,2026-01-11 03:00:00+00:00,-11.31,-12.53,-9.02,1.984616,0.75,-3.59,5.37,4.486557,...,0.185562,523.099976,506.0,552.900024,25.901157,199600.2,138259.4375,278185.6,71539.22,4.67
2,2026-01-11 03:00:00+00:00,2026-01-11 06:00:00+00:00,-9.766666,-11.25,-8.28,1.485003,-0.81,-6.64,5.37,6.012645,...,0.282135,528.133362,513.5,542.799988,14.650028,235578.6,150600.390625,324095.0,86801.42,3.0
3,2026-01-11 06:00:00+00:00,2026-01-11 09:00:00+00:00,-8.303333,-8.95,-7.06,1.077048,-0.27,-4.49,5.18,4.950949,...,0.095394,525.900024,521.599976,529.400024,3.961061,630573.2,601986.0,655818.6,27071.46,3.0
4,2026-01-11 09:00:00+00:00,2026-01-11 12:00:00+00:00,-8.726666,-9.23,-8.43,0.438216,-4.85,-5.12,-4.53,0.298161,...,1.278684,535.533325,530.200012,542.799988,6.518691,904743.4,586541.6875,1340575.0,390536.9,2.67
5,2026-01-11 12:00:00+00:00,2026-01-11 15:00:00+00:00,-5.983333,-7.14,-5.35,1.003211,-7.706666,-8.6,-6.38,1.171722,...,0.445084,520.633362,501.399994,534.099976,17.095711,815163.8,730272.8125,955984.5,122811.0,4.67
6,2026-01-11 15:00:00+00:00,2026-01-11 18:00:00+00:00,-4.673333,-5.32,-3.86,0.744133,-8.13,-9.62,-5.49,2.292619,...,4.006549,506.866669,489.0,528.200012,19.828598,1112409.0,399332.0625,2441208.0,1151803.0,5.0


In [8]:
print("Before casting the real time data:\n", new_solar_df)
# Clean and cast to correct types for Feature Store compatibility
# Convert numeric columns to float32 (Feature Store expects 'float' not 'double')
df = new_solar_df.copy()

for col in df.columns:
    if col not in ["date_and_time"]:
        df[col] = pd.to_numeric(df[col], errors='coerce').astype('float32')

new_solar_df = df
# check data types of each column
print("After casting:\n", new_solar_df.dtypes)
new_solar_df

Before casting the real time data:
     by_gsm  bz_gsm             date_and_time  density  speed  dynamic_pressure
0     9.51   -7.31 2026-01-10 20:00:00+00:00     5.38  584.4      1.837396e+06
1     0.30   -7.49 2026-01-10 21:00:00+00:00     5.04  581.6      1.704823e+06
2    -3.51   -8.43 2026-01-10 22:00:00+00:00     6.38  569.8      2.071408e+06
3    -0.41   -5.64 2026-01-10 23:00:00+00:00     0.40  540.8      1.169859e+05
4    -9.02    5.37 2026-01-11 00:00:00+00:00     0.70  510.4      1.823557e+05
5   -12.38   -3.59 2026-01-11 01:00:00+00:00     0.91  552.9      2.781856e+05
6   -12.53    0.47 2026-01-11 02:00:00+00:00     0.54  506.0      1.382594e+05
7    -8.28    5.37 2026-01-11 03:00:00+00:00     0.54  528.1      1.506004e+05
8    -9.77   -1.16 2026-01-11 04:00:00+00:00     0.88  513.5      2.320404e+05
9   -11.25   -6.64 2026-01-11 05:00:00+00:00     1.10  542.8      3.240950e+05
10   -7.06    5.18 2026-01-11 06:00:00+00:00     2.17  526.7      6.019860e+05
11   -8.95   -1.

Unnamed: 0,by_gsm,bz_gsm,date_and_time,density,speed,dynamic_pressure
0,9.51,-7.31,2026-01-10 20:00:00+00:00,5.38,584.400024,1837396.0
1,0.3,-7.49,2026-01-10 21:00:00+00:00,5.04,581.599976,1704823.0
2,-3.51,-8.43,2026-01-10 22:00:00+00:00,6.38,569.799988,2071408.0
3,-0.41,-5.64,2026-01-10 23:00:00+00:00,0.4,540.799988,116985.9
4,-9.02,5.37,2026-01-11 00:00:00+00:00,0.7,510.399994,182355.7
5,-12.38,-3.59,2026-01-11 01:00:00+00:00,0.91,552.900024,278185.6
6,-12.53,0.47,2026-01-11 02:00:00+00:00,0.54,506.0,138259.4
7,-8.28,5.37,2026-01-11 03:00:00+00:00,0.54,528.099976,150600.4
8,-9.77,-1.16,2026-01-11 04:00:00+00:00,0.88,513.5,232040.4
9,-11.25,-6.64,2026-01-11 05:00:00+00:00,1.1,542.799988,324095.0
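
Cells 7 and 8 apply the same cast to two DataFrames. If we wanted to avoid the repetition, a small helper along these lines could be shared between them; `cast_features_to_float32` is a hypothetical name and is not part of `util.py`.

```python
import pandas as pd


def cast_features_to_float32(df, keep_cols):
    """Return a copy with every column except keep_cols cast to float32."""
    out = df.copy()
    for col in out.columns.difference(keep_cols):
        # Non-numeric entries become NaN instead of raising
        out[col] = pd.to_numeric(out[col], errors="coerce").astype("float32")
    return out


sample = pd.DataFrame({
    "date_and_time": pd.to_datetime(["2026-01-10 20:00", "2026-01-10 21:00"], utc=True),
    "speed": [584.4, 581.6],
    "density": ["5.38", "bad"],
})
casted = cast_features_to_float32(sample, keep_cols=["date_and_time"])
print(casted.dtypes)
```

The timestamp columns passed in `keep_cols` are left untouched, and column order is preserved.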


In [9]:
# Retrieve references to the Feature Groups
solar_wind_fg = fs.get_feature_group(name="solar_wind_fg", version=8)
solar_wind_aggregated_fg = fs.get_feature_group(name="solar_wind_aggregated_fg", version=3)

# Insert new data
# Note: for real-time pipelines, feature groups are often created with
# online_enabled=True so the data is available for immediate inference.
solar_wind_fg.insert(new_solar_df)
solar_wind_aggregated_fg.insert(new_solar_aggregated_df)

print("Daily Feature Pipeline execution complete!")

Uploading Dataframe: 100.00% |██████████| Rows 25/25 | Elapsed Time: 00:01 | Remaining Time: 00:00


Launching job: solar_wind_fg_8_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1299605/jobs/named/solar_wind_fg_8_offline_fg_materialization/executions


Uploading Dataframe: 100.00% |██████████| Rows 7/7 | Elapsed Time: 00:01 | Remaining Time: 00:00


Launching job: solar_wind_aggregated_fg_3_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1299605/jobs/named/solar_wind_aggregated_fg_3_offline_fg_materialization/executions
Daily Feature Pipeline execution complete!
