# Aurora Forecasting - Part 01: Feature Backfill

🗒️ This notebook is divided into the following sections:

1. Initialize the Hopsworks connection.
2. Fetch historical Solar Wind & Kp index data using cdasws and spacepy (OMNI dataset).
3. Fetch historical Cloud Cover for Stockholm, Luleå, and Kiruna using Open-Meteo.
4. Create and insert data into Feature Groups in the Hopsworks Feature Store.

# Imports

In [9]:
!pip install spacepy cdasws



In [10]:
import pandas as pd
import datetime
import hopsworks
from config import HopsworksSettings
import util
import spacepy.omni as omni
import spacepy.time as spt
import spacepy.toolbox as tb
from cdasws import CdasWs
from cdasws.timeinterval import TimeInterval

# Setup settings
settings = HopsworksSettings()

HopsworksSettings initialized!


# Step 1: Historical Solar Wind & Kp Data

We use the hourly OMNI dataset (`OMNI2_H0_MRG1HR`), fetched with cdasws and spacepy, to get historical satellite data. This includes the magnetic field components and the proton parameters needed to predict the Kp index.

https://helio.data.nasa.gov/dataset/OMNI_PT1H 

In [11]:
cdas = CdasWs()

# Get the OMNI dataset
#dataset = 'OMNI_HRO_1MIN'  # 1 minute resolution
dataset = 'OMNI2_H0_MRG1HR'  # 1 hour resolution
var_names = cdas.get_variable_names(dataset)

print('Variable names:', var_names)

# Time interval
end_date = datetime.datetime.now()
start_date = end_date - datetime.timedelta(days=2190)  # 6 years

# cdas.get_data expects a cdasws TimeInterval (or ISO start/end strings).
# Build a TimeInterval from the start and end datetimes to avoid "invalid time0 type".
time_interval = TimeInterval(start_date.strftime('%Y-%m-%dT%H:%M:%S'),
                             end_date.strftime('%Y-%m-%dT%H:%M:%S'))

# Fetch data
status, data = cdas.get_data(dataset, var_names, time_interval)

# Inspect the returned data (pattern suggested on the official site)
if 'spacepy' in str(type(data)):
    #  see https://spacepy.github.io/datamodel.html
    print(var_names[0], '=', data[var_names[0]])
    print(data[var_names[0]].attrs)
    for var in var_names:
        print(var, '=', data[var])
        print(data[var].attrs)
        print("\n")
else:
    #  see https://github.com/MAVENSDC/cdflib
    print(var_names[0], '=', data.data_vars[var_names[0]].values)
    print(data.data_vars[var_names[0]].attrs)


Variable names: ['Rot1800', 'IMF1800', 'PLS1800', 'IMF_PTS1800', 'PLS_PTS1800', 'ABS_B1800', 'F1800', 'THETA_AV1800', 'PHI_AV1800', 'BX_GSE1800', 'BY_GSE1800', 'BZ_GSE1800', 'BY_GSM1800', 'BZ_GSM1800', 'SIGMA-ABS_B1800', 'SIGMA-B1800', 'SIGMA-Bx1800', 'SIGMA-By1800', 'SIGMA-Bz1800', 'T1800', 'N1800', 'V1800', 'PHI-V1800', 'THETA-V1800', 'Ratio1800', 'Pressure1800', 'SIGMA-T1800', 'SIGMA-N1800', 'SIGMA-V1800', 'SIGMA-PHI-V1800', 'SIGMA-THETA-V1800', 'SIGMA-ratio1800', 'E1800', 'Beta1800', 'Mach_num1800', 'Mgs_mach_num1800', 'PR-FLX_11800', 'PR-FLX_21800', 'PR-FLX_41800', 'PR-FLX_101800', 'PR-FLX_301800', 'PR-FLX_601800', 'MFLX1800', 'R1800', 'F10_INDEX1800', 'KP1800', 'DST1800', 'AE1800', 'AP_INDEX1800', 'AL_INDEX1800', 'AU_INDEX1800', 'PC_N_INDEX1800', 'Solar_Lyman_alpha1800', 'Proton_QI1800']
Rot1800 = [2543 2543 2543 ... 9999 9999 9999]
{'FIELDNAM': 'Bartels Rotation Number', 'VALIDMIN': 1700, 'VALIDMAX': 9998, 'SCALEMIN': 1700, 'SCALEMAX': 9998, 'FORMAT': 'I4', 'FILLVAL': 9999, 'VAR

In [12]:
# Extract the needed variables into a pandas DataFrame
solar_wind_df = pd.DataFrame({
    'date_and_time': data['Epoch'],
    'by_gsm': data['BY_GSM1800'],
    'bz_gsm': data['BZ_GSM1800'],
    'density': data['N1800'],
    'speed': data['V1800'],
    'kp_index': data['KP1800']
})

print(f"Extracted {len(solar_wind_df)} records")
solar_wind_df.tail(200)

Extracted 52296 records


Unnamed: 0,date_and_time,by_gsm,bz_gsm,density,speed,kp_index
52096,2025-12-18 05:00:00,999.900024,999.900024,2.200000,689.0,27
52097,2025-12-18 06:00:00,999.900024,999.900024,2.200000,688.0,37
52098,2025-12-18 07:00:00,999.900024,999.900024,2.400000,676.0,37
52099,2025-12-18 08:00:00,999.900024,999.900024,2.300000,651.0,37
52100,2025-12-18 09:00:00,999.900024,999.900024,2.300000,638.0,33
...,...,...,...,...,...,...
52291,2025-12-26 08:00:00,999.900024,999.900024,999.900024,9999.0,99
52292,2025-12-26 09:00:00,999.900024,999.900024,999.900024,9999.0,99
52293,2025-12-26 10:00:00,999.900024,999.900024,999.900024,9999.0,99
52294,2025-12-26 11:00:00,999.900024,999.900024,999.900024,9999.0,99


In [13]:
# Data cleaning: OMNI marks missing data with fill values
# (99 for Kp, 999.9 for the field components and density, 9999 for speed)
solar_wind_df = solar_wind_df[(solar_wind_df['kp_index'] < 99) & 
                        (solar_wind_df['by_gsm'] < 999.9) &
                        (solar_wind_df['bz_gsm'] < 999.9) &
                        (solar_wind_df['density'] < 999.9) &
                        (solar_wind_df['speed'] < 9999)  # 9999.0 is the fill value, so "< 9999.9" would let it through
                        ]
solar_wind_df = solar_wind_df.dropna()

# OMNI stores Kp multiplied by 10; divide by 10 to get the real value
solar_wind_df['kp_index'] = solar_wind_df['kp_index'] / 10.0
solar_wind_df['kp_index'] = solar_wind_df['kp_index'].astype('float32')

# Convert the date column to a string for Hopsworks, because otherwise Hopsworks complains that it does not support a timestamp as a primary key
#solar_wind_df['time'] = solar_wind_df['time'].dt.strftime('%Y-%m-%d %H:%M:%S')

print(solar_wind_df.dtypes)


date_and_time    datetime64[ns]
by_gsm                  float32
bz_gsm                  float32
density                 float32
speed                   float32
kp_index                float32
dtype: object
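
The cleaning cell above drops every row that contains a fill value. A minimal sketch of the same idea, assuming the sentinels visible in the raw output (99 for Kp, 999.9 for the field components and density, 9999 for speed), masks each column against its own sentinel before dropping rows. `FILL_VALUES` and `mask_fill_values` are illustrative names, not part of `util.py`, and the sample rows are made up for the demonstration.

```python
import numpy as np
import pandas as pd

# Assumed sentinels, read off the raw output above; check the FILLVAL
# attribute of each OMNI variable before relying on them.
FILL_VALUES = {
    "kp_index": 99,
    "by_gsm": 999.9,
    "bz_gsm": 999.9,
    "density": 999.9,
    "speed": 9999,
}


def mask_fill_values(df: pd.DataFrame, fill_values: dict) -> pd.DataFrame:
    """Return a copy where values reaching a column's sentinel become NaN."""
    out = df.copy()
    for col, sentinel in fill_values.items():
        # float32 stores 999.9 as 999.900024, so compare against the sentinel
        # minus a small tolerance instead of testing for exact equality.
        values = out[col].astype("float64")
        out[col] = values.where(values < sentinel - 0.05)
    return out


# Two made-up hourly rows: the second one carries fill values.
raw = pd.DataFrame({
    "by_gsm": np.array([1.1, 999.9], dtype="float32"),
    "bz_gsm": np.array([-1.5, 999.9], dtype="float32"),
    "density": np.array([2.9, 2.2], dtype="float32"),
    "speed": np.array([451.0, 9999.0], dtype="float32"),
    "kp_index": [30, 27],
})

clean = mask_fill_values(raw, FILL_VALUES).dropna().reset_index(drop=True)
clean["kp_index"] = clean["kp_index"] / 10.0  # OMNI stores Kp multiplied by 10
print(clean)
```

Keeping the sentinels in one mapping makes it harder for a threshold to drift away from the actual fill value of its column.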


In [14]:
# Sort based on the date and time to be sure and reset the index
solar_wind_df = solar_wind_df.sort_values("date_and_time")
solar_wind_df = solar_wind_df.reset_index(drop=True)

print(f"Extracted {len(solar_wind_df)} records")
solar_wind_df.tail(200)

Extracted 51101 records


Unnamed: 0,date_and_time,by_gsm,bz_gsm,density,speed,kp_index
50901,2025-12-01 17:00:00,1.1,-1.5,2.9,451.0,3.0
50902,2025-12-01 18:00:00,0.7,-1.1,3.5,439.0,1.7
50903,2025-12-01 19:00:00,0.9,0.4,3.7,436.0,1.7
50904,2025-12-01 20:00:00,1.5,0.5,3.5,433.0,1.7
50905,2025-12-01 21:00:00,2.6,-1.4,3.5,432.0,1.3
...,...,...,...,...,...,...
51096,2025-12-09 21:00:00,-0.0,-2.6,4.0,351.0,1.3
51097,2025-12-09 22:00:00,-1.4,-2.2,3.8,347.0,1.3
51098,2025-12-09 23:00:00,-2.0,-2.1,4.0,345.0,1.3
51099,2025-12-10 00:00:00,-0.9,-2.2,2.9,335.0,2.0


In [15]:
# Aggregate to 3-hour windows to match the physical meaning of Kp.
# Kp is a 3-hour index, so the hourly data repeats the same Kp value within each
# 3H window; training on those repeated rows would not be correct.

# Use the reusable function from util.py
solar_wind_df = util.aggregate_solar_wind_3h(
    df=solar_wind_df,
    time_col='date_and_time',
    feature_cols=['by_gsm', 'bz_gsm', 'density', 'speed'],
    target_col='kp_index'
)

print(f"After 3H aggregation: {len(solar_wind_df)} records")
print("\nNew columns:")
print(solar_wind_df.columns.tolist())
print("\nData types:")
print(solar_wind_df.dtypes)
print("\nSample data showing 3H windows:")
solar_wind_df.tail(10)



After 3H aggregation: 17044 records

New columns:
['window_start', 'window_end', 'by_gsm_mean', 'by_gsm_min', 'by_gsm_max', 'by_gsm_std', 'bz_gsm_mean', 'bz_gsm_min', 'bz_gsm_max', 'bz_gsm_std', 'density_mean', 'density_min', 'density_max', 'density_std', 'speed_mean', 'speed_min', 'speed_max', 'speed_std', 'kp_index']

Data types:
window_start    datetime64[ns]
window_end      datetime64[ns]
by_gsm_mean            float32
by_gsm_min             float32
by_gsm_max             float32
by_gsm_std             float32
bz_gsm_mean            float32
bz_gsm_min             float32
bz_gsm_max             float32
bz_gsm_std             float32
density_mean           float32
density_min            float32
density_max            float32
density_std            float32
speed_mean             float32
speed_min              float32
speed_max              float32
speed_std              float32
kp_index               float32
dtype: object

Sample data showing 3H windows:


Unnamed: 0,window_start,window_end,by_gsm_mean,by_gsm_min,by_gsm_max,by_gsm_std,bz_gsm_mean,bz_gsm_min,bz_gsm_max,bz_gsm_std,density_mean,density_min,density_max,density_std,speed_mean,speed_min,speed_max,speed_std,kp_index
17034,2025-12-08 21:00:00,2025-12-09 00:00:00,-0.166667,-0.6,0.7,0.750555,-3.366667,-3.7,-2.9,0.416333,0.633333,0.6,0.7,0.057735,349.333344,339.0,363.0,12.34234,0.3
17035,2025-12-09 00:00:00,2025-12-09 03:00:00,-2.066667,-3.0,-0.7,1.209683,-2.433333,-2.6,-2.3,0.152752,0.9,0.9,0.9,0.0,357.666656,340.0,376.0,18.009256,0.7
17036,2025-12-09 03:00:00,2025-12-09 06:00:00,-0.666667,-1.3,-0.1,0.602771,-2.066667,-2.7,-1.2,0.776745,1.166667,1.0,1.3,0.152753,364.0,361.0,366.0,2.645751,0.0
17037,2025-12-09 06:00:00,2025-12-09 09:00:00,-0.4,-0.7,0.0,0.360555,-3.266667,-3.5,-3.1,0.208167,1.066667,1.0,1.1,0.057735,365.666656,361.0,369.0,4.163332,0.7
17038,2025-12-09 09:00:00,2025-12-09 12:00:00,-0.533333,-2.1,1.5,1.844813,-3.566667,-3.8,-3.2,0.321455,1.366667,1.3,1.5,0.11547,389.333344,372.0,399.0,15.044379,1.0
17039,2025-12-09 12:00:00,2025-12-09 15:00:00,1.7,1.4,2.0,0.424264,-3.3,-3.4,-3.2,0.141421,1.4,1.3,1.5,0.141421,345.0,345.0,345.0,0.0,1.0
17040,2025-12-09 15:00:00,2025-12-09 18:00:00,2.0,1.2,3.3,1.135782,-3.5,-4.0,-2.9,0.556776,3.166667,2.9,3.4,0.251661,365.333344,353.0,372.0,10.692677,1.3
17041,2025-12-09 18:00:00,2025-12-09 21:00:00,-1.7,-3.2,-0.4,1.410674,-2.133333,-2.6,-1.8,0.416333,3.066667,2.6,3.5,0.450925,368.0,360.0,376.0,8.0,1.3
17042,2025-12-09 21:00:00,2025-12-10 00:00:00,-1.133333,-2.0,-0.0,1.02632,-2.3,-2.6,-2.1,0.264575,3.933333,3.8,4.0,0.11547,347.666656,345.0,351.0,3.05505,1.3
17043,2025-12-10 00:00:00,2025-12-10 03:00:00,-1.45,-2.0,-0.9,0.777817,-2.2,-2.2,-2.2,0.0,2.95,2.9,3.0,0.070711,333.0,331.0,335.0,2.828427,2.0
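
`util.aggregate_solar_wind_3h` is not shown in this notebook. The following is a minimal sketch of how such an aggregation could be written with pandas, inferred only from the output columns above; the function name `aggregate_3h_sketch` and its internals are assumptions, not the actual `util.py` implementation.

```python
import pandas as pd


def aggregate_3h_sketch(df, time_col, feature_cols, target_col):
    """Aggregate hourly rows into 3-hour windows aligned to 00:00, 03:00, ..."""
    df = df.copy()
    # Floor each timestamp to the start of its 3-hour window.
    df["window_start"] = df[time_col].dt.floor("3h")
    grouped = df.groupby("window_start")

    # mean/min/max/std per feature; the result has two-level column labels.
    features = grouped[feature_cols].agg(["mean", "min", "max", "std"])
    features.columns = [f"{col}_{stat}" for col, stat in features.columns]

    # Kp is constant within a window, so the first value represents it.
    target = grouped[target_col].first()

    out = features.join(target).reset_index()
    out.insert(1, "window_end", out["window_start"] + pd.Timedelta(hours=3))
    return out


# Four hourly rows taken from the cleaned output shown earlier.
hourly = pd.DataFrame({
    "date_and_time": pd.date_range("2025-12-09 21:00", periods=4, freq="h"),
    "speed": [351.0, 347.0, 345.0, 335.0],
    "kp_index": [1.3, 1.3, 1.3, 2.0],
})
windows = aggregate_3h_sketch(hourly, "date_and_time", ["speed"], "kp_index")
print(windows)
```

A window that contains a single hourly record gets `NaN` for the `_std` columns in this sketch, so a real pipeline would need to decide how to handle such windows before inserting them.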


# Step 2: Historical City Weather (The Visibility Constraint)

We fetch historical cloud cover for our three target cities. In the final system, the Aurora is only "Visible" if the cloud cover is low.
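
The visibility constraint can be pictured as a simple rule that combines geomagnetic activity with sky conditions. The sketch below is only an illustration: the function `aurora_visible` and both thresholds (`kp_threshold`, `max_cloud_cover`) are hypothetical values chosen for the example, not the ones used by the final system.

```python
def aurora_visible(kp_index: float, cloud_cover: float,
                   kp_threshold: float = 4.0,
                   max_cloud_cover: float = 30.0) -> bool:
    """Return True when activity is high enough and the sky is clear enough.

    kp_threshold and max_cloud_cover are hypothetical defaults; cloud_cover
    is a percentage between 0 and 100.
    """
    return kp_index >= kp_threshold and cloud_cover <= max_cloud_cover


print(aurora_visible(kp_index=5.0, cloud_cover=10.0))   # strong activity, clear sky
print(aurora_visible(kp_index=5.0, cloud_cover=100.0))  # strong activity, overcast
```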

In [16]:
weather_backfill_list = []

for city, coords in settings.CITIES.items():
    print(f"Fetching historical cloud cover for {city}...")

    # We use the historical weather function from util.py
    # Modified to fetch 'cloud_cover' specifically
    df_city = util.get_city_weather_history(
        city=city,
        start_date=start_date.strftime("%Y-%m-%d"),
        end_date=end_date.strftime("%Y-%m-%d"),
        latitude=coords['lat'],
        longitude=coords['lon']
    )

    # Store the city name as a string column
    df_city['city'] = city
    # Convert the date column to datetime
    df_city['date_and_time'] = pd.to_datetime(df_city['date'])

    # Ensure cloud_cover is present
    if 'cloud_cover' not in df_city.columns:
        # Fallback if your util function uses different naming like cloud_cover_mean
        df_city = df_city.rename(columns={'cloud_cover_mean': 'cloud_cover'})

    weather_backfill_list.append(df_city[['city', 'date_and_time', 'cloud_cover']])

weather_df = pd.concat(weather_backfill_list)
#weather_df['date'] = pd.to_datetime(weather_df['date']).dt.strftime('%Y-%m-%d')
print(weather_df.dtypes)
print(f"Extracted {len(weather_df)} weather records")
weather_df.head(50)

Fetching historical cloud cover for Kiruna...
Fetching historical cloud cover for Luleå...
Fetching historical cloud cover for Stockholm...
city                     object
date_and_time    datetime64[ns]
cloud_cover               int64
dtype: object
Extracted 157752 weather records


Unnamed: 0,city,date_and_time,cloud_cover
0,Kiruna,2020-01-08 00:00:00,100
1,Kiruna,2020-01-08 01:00:00,100
2,Kiruna,2020-01-08 02:00:00,100
3,Kiruna,2020-01-08 03:00:00,100
4,Kiruna,2020-01-08 04:00:00,100
5,Kiruna,2020-01-08 05:00:00,100
6,Kiruna,2020-01-08 06:00:00,100
7,Kiruna,2020-01-08 07:00:00,100
8,Kiruna,2020-01-08 08:00:00,100
9,Kiruna,2020-01-08 09:00:00,100
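
`util.get_city_weather_history` is defined outside this notebook. As a rough sketch of what an hourly cloud cover request to the Open-Meteo historical archive could look like, the helpers below build the request URL and turn the `hourly` part of a response into the same three columns used above. The helper names, the approximate Kiruna coordinates, and the sample payload are assumptions made for illustration; no network call is made here.

```python
from urllib.parse import urlencode

import pandas as pd

ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"


def build_archive_url(latitude, longitude, start_date, end_date):
    """Build an Open-Meteo archive URL asking for hourly cloud cover."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,  # YYYY-MM-DD
        "end_date": end_date,      # YYYY-MM-DD
        "hourly": "cloud_cover",
    }
    return f"{ARCHIVE_URL}?{urlencode(params)}"


def hourly_payload_to_frame(payload, city):
    """Convert the 'hourly' block of a JSON response into a DataFrame."""
    hourly = payload["hourly"]
    return pd.DataFrame({
        "city": city,
        "date_and_time": pd.to_datetime(hourly["time"]),
        "cloud_cover": hourly["cloud_cover"],
    })


# Approximate coordinates for Kiruna, used only for this example.
url = build_archive_url(67.86, 20.23, "2020-01-08", "2020-01-09")
print(url)

# Hand-written sample shaped like the 'hourly' block of a response.
sample_payload = {
    "hourly": {
        "time": ["2020-01-08T00:00", "2020-01-08T01:00"],
        "cloud_cover": [100, 95],
    }
}
frame = hourly_payload_to_frame(sample_payload, "Kiruna")
print(frame)
```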


In [17]:
print(f"Extracted {len(weather_df)} weather records before cleaning.")
# Data cleaning: Remove rows with missing cloud cover
weather_df = weather_df.dropna()
weather_df = weather_df.sort_values(["city", "date_and_time"])
weather_df = weather_df.reset_index(drop=True)

print(f"Extracted {len(weather_df)} weather records after cleaning.")
weather_df.head(500)

Extracted 157752 weather records before cleaning.
Extracted 157752 weather records after cleaning.


Unnamed: 0,city,date_and_time,cloud_cover
0,Kiruna,2020-01-08 00:00:00,100
1,Kiruna,2020-01-08 01:00:00,100
2,Kiruna,2020-01-08 02:00:00,100
3,Kiruna,2020-01-08 03:00:00,100
4,Kiruna,2020-01-08 04:00:00,100
...,...,...,...
495,Kiruna,2020-01-28 15:00:00,100
496,Kiruna,2020-01-28 16:00:00,100
497,Kiruna,2020-01-28 17:00:00,100
498,Kiruna,2020-01-28 18:00:00,100


# Step 3: Create Feature Groups and Insert Data

Now we register these datasets in the Hopsworks Feature Store.

In [18]:
# Login to Hopsworks
project = hopsworks.login(
    project=settings.HOPSWORKS_PROJECT,
    api_key_value=settings.HOPSWORKS_API_KEY.get_secret_value()
)
fs = project.get_feature_store()

2026-01-06 12:17:20,696 INFO: Initializing external client
2026-01-06 12:17:20,697 INFO: Base URL: https://c.app.hopsworks.ai:443
To ensure compatibility please install the latest bug fix release matching the minor version of your backend (4.2) by running 'pip install hopsworks==4.2.*'







2026-01-06 12:17:22,963 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1299605


In [20]:
# 1. Create Solar Wind Feature Group
solar_wind_fg = fs.get_or_create_feature_group(
    name="solar_wind_fg",
    version=4,
    primary_key=['window_start', 'window_end'],
    event_time="window_start",
    description="Satellite measurements (Bz, speed, density) and Kp index labels aggregated over 3H windows to match Kp physical meaning",
    #online_enabled=True,
    statistics_config={"enabled": True, "histograms": True, "correlations": True}
)

# 2. Create City Weather Feature Group
city_weather_fg = fs.get_or_create_feature_group(
    name="city_weather_fg",
    version=2,
    primary_key=['city', 'date_and_time'],
    event_time="date_and_time",
    description="Historical cloud cover for Stockholm, Luleå, and Kiruna",
    #online_enabled=True,
    statistics_config={"enabled": True, "histograms": True}
)

# Insert Data
solar_wind_fg.insert(solar_wind_df)
city_weather_fg.insert(weather_df)

print("Backfill Complete! Data is now in the Hopsworks Feature Store.")

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1299605/fs/1287235/fg/1908110


Uploading Dataframe: 100.00% |██████████| Rows 17044/17044 | Elapsed Time: 00:04 | Remaining Time: 00:00


Launching job: solar_wind_fg_4_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1299605/jobs/named/solar_wind_fg_4_offline_fg_materialization/executions


Uploading Dataframe: 100.00% |██████████| Rows 157752/157752 | Elapsed Time: 00:25 | Remaining Time: 00:00


Launching job: city_weather_fg_2_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1299605/jobs/named/city_weather_fg_2_offline_fg_materialization/executions
Backfill Complete! Data is now in the Hopsworks Feature Store.
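
Both feature groups declare primary keys, so a quick check that the key columns are unique and non-null can catch problems before an insert. The helper below is a minimal sketch under that assumption; `check_primary_key` is a hypothetical name, not a Hopsworks API, and the sample rows are made up.

```python
import pandas as pd


def check_primary_key(df: pd.DataFrame, primary_key: list) -> None:
    """Raise ValueError if the key columns contain nulls or duplicate rows."""
    keys = df[primary_key]
    null_rows = int(keys.isna().any(axis=1).sum())
    duplicate_rows = int(keys.duplicated().sum())
    if null_rows or duplicate_rows:
        raise ValueError(
            f"{null_rows} null and {duplicate_rows} duplicate key rows "
            f"for primary key {primary_key}"
        )


# Made-up sample shaped like weather_df.
sample = pd.DataFrame({
    "city": ["Kiruna", "Kiruna", "Stockholm"],
    "date_and_time": pd.to_datetime(
        ["2020-01-08 00:00", "2020-01-08 01:00", "2020-01-08 00:00"]
    ),
    "cloud_cover": [100, 100, 80],
})

check_primary_key(sample, ["city", "date_and_time"])  # passes silently
```

In this notebook the same call could be made with `solar_wind_df` and `['window_start', 'window_end']`, or with `weather_df` and `['city', 'date_and_time']`, right before the corresponding `insert`.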
