# Aurora Forecasting - Part 01: Feature Backfill

🗒️ This notebook is divided into the following sections:

1. Initialize the Hopsworks connection.
2. Fetch historical solar wind and Kp index data from the OMNI dataset using cdasws (with spacepy as the data model).
3. Fetch historical cloud cover for Stockholm, Luleå, and Kiruna using Open-Meteo.
4. Create feature groups in the Hopsworks Feature Store and insert the data.

# Imports

In [1]:
!pip install spacepy cdasws


In [2]:
import pandas as pd
import datetime
import hopsworks
from config import HopsworksSettings
import util
import spacepy.omni as omni
import spacepy.time as spt
import spacepy.toolbox as tb
from cdasws import CdasWs
from cdasws.timeinterval import TimeInterval

# Setup settings
settings = HopsworksSettings()


Aurora Project Settings initialized!


# Step 1: Historical Solar Wind & Kp Data (The Label)

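We fetch the OMNI dataset through NASA's CDAS web service (cdasws), which returns spacepy data structures, to get hourly historical solar wind data. This includes the interplanetary magnetic field components and the proton density and speed used as features, together with the Kp index, which serves as the prediction label.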

https://helio.data.nasa.gov/dataset/OMNI_PT1H 
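Because OMNI timestamps are in UTC, it can help to build the two-year backfill window in UTC as well, rather than from the local clock. The following is a minimal sketch using only the standard library; `backfill_window` is a hypothetical helper, not part of cdasws or this project, and it is worth verifying that the client accepts this exact string form before relying on it.

```python
from datetime import datetime, timedelta, timezone


def backfill_window(days=730, now=None):
    """Return (start, end) ISO 8601 strings for a UTC backfill window."""
    end = now or datetime.now(timezone.utc)
    # Truncate to the whole hour so repeated runs align with hourly records.
    end = end.replace(minute=0, second=0, microsecond=0)
    start = end - timedelta(days=days)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt)


# A fixed "now" keeps the example reproducible.
start_iso, end_iso = backfill_window(
    days=730, now=datetime(2025, 12, 26, 11, 42, tzinfo=timezone.utc)
)
print(start_iso, end_iso)
```

Passing an explicit `now` also makes the window easy to pin in tests or when re-running a backfill for a past date.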

In [3]:
cdas = CdasWs()

# Get the OMNI dataset
#dataset = 'OMNI_HRO_1MIN'  # 1 minute resolution
dataset = 'OMNI2_H0_MRG1HR'  # 1 hour resolution
var_names = cdas.get_variable_names(dataset)

print('Variable names:', var_names)

# Time interval
end_date = datetime.datetime.now()
start_date = end_date - datetime.timedelta(days=730)  # 2 years

# cdas.get_data expects a cdasws TimeInterval (or ISO start/end strings).
# Build a TimeInterval from the start and end datetimes to avoid "invalid time0 type".
time_interval = TimeInterval(start_date.strftime('%Y-%m-%dT%H:%M:%S'),
                             end_date.strftime('%Y-%m-%dT%H:%M:%S'))

# Fetch data
status, data = cdas.get_data(dataset, var_names, time_interval)

# Inspect the returned variables and their metadata (pattern suggested on the official site)
if 'spacepy' in str(type(data)):
    #  see https://spacepy.github.io/datamodel.html
    print(var_names[0], '=', data[var_names[0]])
    print(data[var_names[0]].attrs)
    for var in var_names:
        print(var, '=', data[var])
        print(data[var].attrs)
        print("\n")
else:
    #  see https://github.com/MAVENSDC/cdflib
    print(var_names[0], '=', data.data_vars[var_names[0]].values)
    print(data.data_vars[var_names[0]].attrs)


Variable names: ['Rot1800', 'IMF1800', 'PLS1800', 'IMF_PTS1800', 'PLS_PTS1800', 'ABS_B1800', 'F1800', 'THETA_AV1800', 'PHI_AV1800', 'BX_GSE1800', 'BY_GSE1800', 'BZ_GSE1800', 'BY_GSM1800', 'BZ_GSM1800', 'SIGMA-ABS_B1800', 'SIGMA-B1800', 'SIGMA-Bx1800', 'SIGMA-By1800', 'SIGMA-Bz1800', 'T1800', 'N1800', 'V1800', 'PHI-V1800', 'THETA-V1800', 'Ratio1800', 'Pressure1800', 'SIGMA-T1800', 'SIGMA-N1800', 'SIGMA-V1800', 'SIGMA-PHI-V1800', 'SIGMA-THETA-V1800', 'SIGMA-ratio1800', 'E1800', 'Beta1800', 'Mach_num1800', 'Mgs_mach_num1800', 'PR-FLX_11800', 'PR-FLX_21800', 'PR-FLX_41800', 'PR-FLX_101800', 'PR-FLX_301800', 'PR-FLX_601800', 'MFLX1800', 'R1800', 'F10_INDEX1800', 'KP1800', 'DST1800', 'AE1800', 'AP_INDEX1800', 'AL_INDEX1800', 'AU_INDEX1800', 'PC_N_INDEX1800', 'Solar_Lyman_alpha1800', 'Proton_QI1800']
Rot1800 = [2596 2596 2596 ... 9999 9999 9999]
{'FIELDNAM': 'Bartels Rotation Number', 'VALIDMIN': 1700, 'VALIDMAX': 9998, 'SCALEMIN': 1700, 'SCALEMAX': 9998, 'FORMAT': 'I4', 'FILLVAL': 9999, 'VAR

In [4]:
# Extract the needed variables into a pandas DataFrame
solar_wind_df = pd.DataFrame({
    'time': data['Epoch'],
    'by_gsm': data['BY_GSM1800'],
    'bz_gsm': data['BZ_GSM1800'],
    'density': data['N1800'],
    'speed': data['V1800'],
    'kp_index': data['KP1800']
})

print(f"Extracted {len(solar_wind_df)} records")
solar_wind_df.tail(200)

Extracted 17418 records


Unnamed: 0,time,by_gsm,bz_gsm,density,speed,kp_index
17218,2025-12-18 05:00:00,999.900024,999.900024,2.200000,689.0,27
17219,2025-12-18 06:00:00,999.900024,999.900024,2.200000,688.0,37
17220,2025-12-18 07:00:00,999.900024,999.900024,2.400000,676.0,37
17221,2025-12-18 08:00:00,999.900024,999.900024,2.300000,651.0,37
17222,2025-12-18 09:00:00,999.900024,999.900024,2.300000,638.0,33
...,...,...,...,...,...,...
17413,2025-12-26 08:00:00,999.900024,999.900024,999.900024,9999.0,99
17414,2025-12-26 09:00:00,999.900024,999.900024,999.900024,9999.0,99
17415,2025-12-26 10:00:00,999.900024,999.900024,999.900024,9999.0,99
17416,2025-12-26 11:00:00,999.900024,999.900024,999.900024,9999.0,99


In [5]:
# Data cleaning: drop rows holding OMNI fill values (99 for Kp, 999.9 for field and density, 9999 for speed)
solar_wind_df = solar_wind_df[(solar_wind_df['kp_index'] < 99) & 
                        (solar_wind_df['by_gsm'] < 999.9) &
                        (solar_wind_df['bz_gsm'] < 999.9) &
                        (solar_wind_df['density'] < 999.9) &
                        (solar_wind_df['speed'] < 9999.9)
                        ]
solar_wind_df = solar_wind_df.dropna()

print(f"Extracted {len(solar_wind_df)} records")
solar_wind_df.tail(200)

# Convert the time column to strings for the Hopsworks feature group
solar_wind_df['time'] = solar_wind_df['time'].dt.strftime('%Y-%m-%d %H:%M:%S')


Extracted 16737 records
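The threshold comparisons above work because the float32 fill value prints as 999.900024, which is not less than 999.9. A more explicit alternative is to compare each column against its own fill marker with a small tolerance. The sketch below uses plain Python on a list of row dictionaries; the fill values are read off the table output above, `is_fill` and `drop_fill_rows` are hypothetical helpers, and the second sample row is made up for illustration.

```python
import math

# Fill markers per column, as they appear in the table output (assumed).
FILL_VALUES = {
    "by_gsm": 999.9,
    "bz_gsm": 999.9,
    "density": 999.9,
    "speed": 9999.0,
    "kp_index": 99,
}


def is_fill(value, fill, rel_tol=1e-5):
    """True if value equals the fill marker, tolerating float32 rounding."""
    return math.isclose(value, fill, rel_tol=rel_tol)


def drop_fill_rows(rows, fill_values=FILL_VALUES):
    """Keep only rows in which no column carries its fill marker."""
    return [
        row for row in rows
        if not any(is_fill(row[col], fill) for col, fill in fill_values.items())
    ]


rows = [
    # Taken from the table: magnetic field columns hold the fill value.
    {"time": "2025-12-18 05:00:00", "by_gsm": 999.900024, "bz_gsm": 999.900024,
     "density": 2.2, "speed": 689.0, "kp_index": 27},
    # Illustrative row with no fill values (not real data).
    {"time": "2025-01-01 00:00:00", "by_gsm": 1.5, "bz_gsm": -2.0,
     "density": 4.1, "speed": 420.0, "kp_index": 13},
]
clean = drop_fill_rows(rows)
print(len(clean))
```

Naming the fill value for each column makes it harder to drop a legitimate measurement by accident if a threshold is mistyped.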


# Step 2: Historical City Weather (The Visibility Constraint)

We fetch historical cloud cover for our three target cities. In the final system, the aurora counts as "visible" only when cloud cover is low.
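One way to picture that visibility constraint is a simple rule that combines geomagnetic activity with cloud cover. The sketch below is only an illustration: `is_aurora_visible` is a hypothetical function, and both thresholds are placeholder assumptions, not values used by this project.

```python
def is_aurora_visible(kp, cloud_cover_pct, kp_min=5.0, cloud_max_pct=30.0):
    """Return True when activity is high enough and the sky is clear enough.

    kp is on the 0-9 scale; cloud_cover_pct is a percentage from 0 to 100,
    matching the cloud_cover column. Both thresholds are hypothetical defaults.
    """
    if not 0 <= cloud_cover_pct <= 100:
        raise ValueError("cloud_cover_pct must be between 0 and 100")
    return kp >= kp_min and cloud_cover_pct <= cloud_max_pct


print(is_aurora_visible(kp=6.0, cloud_cover_pct=16))
print(is_aurora_visible(kp=6.0, cloud_cover_pct=100))
```

In practice the Kp threshold would likely depend on latitude, since Kiruna sees aurora at lower activity than Stockholm, so a per-city `kp_min` is a natural extension.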

In [6]:
weather_backfill_list = []

for city, coords in settings.CITIES.items():
    print(f"Fetching historical cloud cover for {city}...")

    # We use the historical weather function from util.py
    # Modified to fetch 'cloud_cover' specifically
    df_city = util.get_city_weather_history(
        city=city,
        start_date=start_date.strftime("%Y-%m-%d"),
        end_date=end_date.strftime("%Y-%m-%d"),
        latitude=coords['lat'],
        longitude=coords['lon']
    )

    # Standardize columns
    df_city['city'] = city

    # Ensure cloud_cover is present
    if 'cloud_cover' not in df_city.columns:
        # Fallback if your util function uses different naming like cloud_cover_mean
        df_city = df_city.rename(columns={'cloud_cover_mean': 'cloud_cover'})

    weather_backfill_list.append(df_city[['city', 'date', 'cloud_cover']])

weather_df = pd.concat(weather_backfill_list)
weather_df['date'] = pd.to_datetime(weather_df['date']).dt.strftime('%Y-%m-%d')
weather_df.head()

Fetching historical cloud cover for Kiruna...
Fetching historical cloud cover for Luleå...
Fetching historical cloud cover for Stockholm...


Unnamed: 0,city,date,cloud_cover
0,Kiruna,2023-12-31,100
1,Kiruna,2024-01-01,85
2,Kiruna,2024-01-02,75
3,Kiruna,2024-01-03,16
4,Kiruna,2024-01-04,0


In [7]:
print(f"Extracted {len(weather_df)} weather records before cleaning.")
# Data cleaning: Remove rows with missing cloud cover
weather_df = weather_df.dropna(subset=['cloud_cover'])
print(f"Extracted {len(weather_df)} weather records after cleaning.")
weather_df.head(500)

Extracted 2193 weather records before cleaning.
Extracted 2193 weather records after cleaning.


Unnamed: 0,city,date,cloud_cover
0,Kiruna,2023-12-31,100
1,Kiruna,2024-01-01,85
2,Kiruna,2024-01-02,75
3,Kiruna,2024-01-03,16
4,Kiruna,2024-01-04,0
...,...,...,...
495,Kiruna,2025-05-09,100
496,Kiruna,2025-05-10,97
497,Kiruna,2025-05-11,42
498,Kiruna,2025-05-12,41


# Step 3: Create Feature Groups and Insert Data

Now we register these datasets in the Hopsworks Feature Store.
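Each feature group is keyed by a primary key (`time` for solar wind, `city` plus `date` for weather), so it can be worth confirming that the key is unique before inserting. The following is a minimal sketch in plain Python; `duplicate_keys` is a hypothetical helper, and the sample records reuse values from the weather table with one deliberate duplicate.

```python
from collections import Counter


def duplicate_keys(records, primary_key):
    """Return the primary-key tuples that occur more than once in records."""
    counts = Counter(tuple(rec[col] for col in primary_key) for rec in records)
    return [key for key, n in counts.items() if n > 1]


weather_records = [
    {"city": "Kiruna", "date": "2024-01-01", "cloud_cover": 85},
    {"city": "Kiruna", "date": "2024-01-02", "cloud_cover": 75},
    {"city": "Kiruna", "date": "2024-01-02", "cloud_cover": 75},  # deliberate duplicate
]
print(duplicate_keys(weather_records, ["city", "date"]))
```

With the DataFrames from this notebook, the records could come from `weather_df.to_dict("records")`, or the same check could be done directly with `weather_df.duplicated(subset=["city", "date"]).any()`.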

In [8]:
# Login to Hopsworks
project = hopsworks.login(
    project=settings.HOPSWORKS_PROJECT,
    api_key_value=settings.HOPSWORKS_API_KEY.get_secret_value()
)
fs = project.get_feature_store()

2025-12-30 18:03:05,786 INFO: Initializing external client
2025-12-30 18:03:05,787 INFO: Base URL: https://c.app.hopsworks.ai:443
To ensure compatibility please install the latest bug fix release matching the minor version of your backend (4.2) by running 'pip install hopsworks==4.2.*'

2025-12-30 18:03:07,961 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1299605


In [9]:
# 1. Create Solar Wind Feature Group
solar_wind_fg = fs.get_or_create_feature_group(
    name="solar_wind_fg",
    version=1,
    primary_key=['time'],
    description="Satellite measurements (By, Bz, speed, density) and Kp index labels",
    online_enabled=True,
    statistics_config={"enabled": True, "histograms": True, "correlations": True}
)

# 2. Create City Weather Feature Group
city_weather_fg = fs.get_or_create_feature_group(
    name="city_weather_fg",
    version=1,
    primary_key=['city', 'date'],
    description="Historical cloud cover for Stockholm, Luleå, and Kiruna",
    online_enabled=True,
    statistics_config={"enabled": True, "histograms": True}
)

# Insert Data
solar_wind_fg.insert(solar_wind_df)
city_weather_fg.insert(weather_df)

print("Backfill Complete! Data is now in the Hopsworks Feature Store.")

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1299605/fs/1287235/fg/1876500


Uploading Dataframe: 100.00% |██████████| Rows 16737/16737 | Elapsed Time: 00:14 | Remaining Time: 00:00


Launching job: solar_wind_fg_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1299605/jobs/named/solar_wind_fg_1_offline_fg_materialization/executions
Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1299605/fs/1287235/fg/1880495


Uploading Dataframe: 100.00% |██████████| Rows 2193/2193 | Elapsed Time: 00:02 | Remaining Time: 00:00


Launching job: city_weather_fg_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1299605/jobs/named/city_weather_fg_1_offline_fg_materialization/executions
Backfill Complete! Data is now in the Hopsworks Feature Store.
