In this notebook we train the model on data from the feature store (Hopsworks) rather than the local `parquet` files, and save the model to the Hopsworks model registry instead of local disk.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append('../')
import src.config as config

In [3]:
import hopsworks
from hsfs.feature import Feature

# Connect to the project 
project = hopsworks.login(
    project=config.HOPSWORKS_PROJECT_NAME,
    api_key_value=config.HOPSWORKS_API_KEY
)

# Connect to the feature store
feature_store = project.get_feature_store()

# Use a NEW version to avoid the corrupted schema
# Increment version to create fresh feature group
FEATURE_GROUP_VERSION_NEW = 3
FEATURE_VIEW_VERSION_NEW = 3

print(f"Using new versions: FG v{FEATURE_GROUP_VERSION_NEW}, FV v{FEATURE_VIEW_VERSION_NEW}")

# Load and transform data for the feature group
print("\nLoading data for feature group...")
from datetime import datetime
import pandas as pd
from src.data import load_raw_data, transform_raw_data_into_ts_data

from_year = 2024
to_year = datetime.now().year
print(f"Downloading raw data from {from_year} to {to_year}")

# Load one DataFrame per year and concatenate them in a single call
rides = pd.concat(
    [load_raw_data(year) for year in range(from_year, to_year + 1)]
)

print(f"✓ Loaded {len(rides)} rides")

# Transform to time series data
ts_data = transform_raw_data_into_ts_data(rides)
print(f"✓ Transformed data shape: {ts_data.shape}")
print(f"Columns: {list(ts_data.columns)}")
print(f"Data types:\n{ts_data.dtypes}")

# Ensure correct data types before inserting
ts_data['pickup_hour'] = pd.to_datetime(ts_data['pickup_hour'])
ts_data['pickup_location_id'] = ts_data['pickup_location_id'].astype('int64')
ts_data['rides'] = ts_data['rides'].astype('int64')

print(f"\n✓ Data types after conversion:\n{ts_data.dtypes}")

# Delete old feature view if exists (to avoid dependency issues)
try:
    feature_store.delete_feature_view(
        name=config.FEATURE_VIEW_NAME,
        version=FEATURE_VIEW_VERSION_NEW
    )
    print(f"✓ Deleted existing feature view v{FEATURE_VIEW_VERSION_NEW}")
except Exception:
    # Nothing to delete (or the delete failed); continue with creation
    pass

# Delete old feature group if exists
try:
    old_fg = feature_store.get_feature_group(
        name=config.FEATURE_GROUP_NAME,
        version=FEATURE_GROUP_VERSION_NEW
    )
    old_fg.delete()
    print(f"✓ Deleted existing feature group v{FEATURE_GROUP_VERSION_NEW}")
except Exception:
    # Nothing to delete (or the delete failed); continue with creation
    pass

# Define explicit schema to avoid corruption bug
schema = [
    Feature(name="pickup_hour", type="timestamp"),
    Feature(name="pickup_location_id", type="bigint"),
    Feature(name="rides", type="bigint"),
]

# Create fresh feature group with EXPLICIT schema
print(f"\nCreating fresh feature group v{FEATURE_GROUP_VERSION_NEW} with explicit schema...")
feature_group = feature_store.get_or_create_feature_group(
    name=config.FEATURE_GROUP_NAME,
    version=FEATURE_GROUP_VERSION_NEW,
    description="Time-series data at hourly frequency",
    primary_key=['pickup_location_id', 'pickup_hour'],
    event_time='pickup_hour',
    features=schema  # Explicit schema definition
)

# Insert data
print("Inserting data into feature group...")
feature_group.insert(ts_data, write_options={"wait_for_job": True})
print("✓ Data inserted successfully!")

2025-12-31 14:33:28,638 INFO: Initializing external client
2025-12-31 14:33:28,641 INFO: Base URL: https://c.app.hopsworks.ai:443




To ensure compatibility please install the latest bug fix release matching the minor version of your backend (4.2) by running 'pip install hopsworks==4.2.*'


2025-12-31 14:33:29,856 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1329302
Using new versions: FG v3, FV v3

Loading data for feature group...
Downloading raw data from 2024 to 2025
File 2024-01 was already in local storage
File 2024-02 was already in local storage
File 2024-03 was already in local storage
File 2024-04 was already in local storage
File 2024-05 was already in local storage
File 2024-06 was already in local storage
File 2024-07 was already in local storage
File 2024-08 was already in local storage
File 2024-09 was already in local storage
File 2024-10 was already in local storage
File 2024-11 was already in local storage
File 2024-12 was already in local storage
File 2025-01 was already in local storage
File 2025-02 was already in local storage
File 2025-03 was already in local storage
File 2025-04 was already in local storage
File 2025-05 was already in local storage
File 2025-06 was already in local storage


100%|██████████| 263/263 [00:10<00:00, 25.91it/s]


✓ Transformed data shape: (4418400, 3)
Columns: ['pickup_hour', 'rides', 'pickup_location_id']
Data types:
pickup_hour           datetime64[ns]
rides                          int64
pickup_location_id             int32
dtype: object

✓ Data types after conversion:
pickup_hour           datetime64[ns]
rides                          int64
pickup_location_id             int64
dtype: object

Creating fresh feature group v3 with explicit schema...
Inserting data into feature group...
Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1329302/fs/1317957/fg/1878604


Uploading Dataframe: 100.00% |██████████| Rows 4418400/4418400 | Elapsed Time: 05:33 | Remaining Time: 00:00


Launching job: time_series_hourly_feature_group_3_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1329302/jobs/named/time_series_hourly_feature_group_3_offline_fg_materialization/executions
2025-12-31 14:40:19,486 INFO: Waiting for execution to finish. Current state: INITIALIZING. Final status: UNDEFINED
2025-12-31 14:40:22,575 INFO: Waiting for execution to finish. Current state: SUBMITTED. Final status: UNDEFINED
2025-12-31 14:40:25,663 INFO: Waiting for execution to finish. Current state: RUNNING. Final status: UNDEFINED
2025-12-31 14:44:42,656 INFO: Waiting for execution to finish. Current state: AGGREGATING_LOGS. Final status: SUCCEEDED
2025-12-31 14:44:42,780 INFO: Waiting for log aggregation to finish.
2025-12-31 14:45:00,720 INFO: Execution finished successfully.
✓ Data inserted successfully!
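
The dtype conversion above matters because the explicit schema declares `bigint` for both integer columns, while `pickup_location_id` arrives as `int32`. A minimal sketch of a pre-insert check follows. The dtype-to-type mapping is an assumption covering only the three columns used here, and `schema_mismatches` is a hypothetical helper, not part of Hopsworks.

```python
# Assumed pandas-dtype -> feature-store type mapping for the columns used
# in this notebook; extend it for other dtypes as needed.
PANDAS_TO_FEATURE_TYPE = {
    "datetime64[ns]": "timestamp",
    "int64": "bigint",
    "int32": "int",
}

# Mirrors the explicit schema passed to the feature group.
EXPECTED_SCHEMA = {
    "pickup_hour": "timestamp",
    "pickup_location_id": "bigint",
    "rides": "bigint",
}


def schema_mismatches(dtypes, expected=EXPECTED_SCHEMA):
    """Return {column: (actual_type, expected_type)} for every mismatch.

    dtypes maps column name -> pandas dtype string.
    """
    problems = {}
    for column, wanted in expected.items():
        dtype = dtypes.get(column)
        actual = PANDAS_TO_FEATURE_TYPE.get(dtype, f"unmapped({dtype})")
        if actual != wanted:
            problems[column] = (actual, wanted)
    return problems


# dtypes as printed before and after the conversion step
before = {"pickup_hour": "datetime64[ns]", "rides": "int64", "pickup_location_id": "int32"}
after = {**before, "pickup_location_id": "int64"}

print(schema_mismatches(before))
print(schema_mismatches(after))
```

With a real DataFrame, the input could be built as `{c: str(t) for c, t in ts_data.dtypes.items()}` and the insert skipped whenever the returned dict is non-empty.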


In [4]:
print("Inspecting feature group schema...")
print(f"Feature group name: {feature_group.name}")
print(f"Feature group version: {feature_group.version}")

# Get schema from the feature group
print("\nColumns in feature group:")
if hasattr(feature_group, 'schema'):
    for feature in feature_group.schema:
        print(f"  - {feature.name}: {feature.type}")
else:
    print("Schema not directly available, checking features...")
    # Alternative: check the feature_group structure
    print(f"Feature group attributes: {dir(feature_group)}")

print(f"\nPrimary keys: {feature_group.primary_key}")
print(f"Event time: {feature_group.event_time}")


Inspecting feature group schema...
Feature group name: time_series_hourly_feature_group
Feature group version: 3

Columns in feature group:
  - pickup_hour: timestamp
  - pickup_location_id: bigint
  - rides: bigint

Primary keys: ['pickup_hour', 'pickup_location_id']
Event time: pickup_hour


In [5]:
print("Setting up feature view...")

try:
    # Delete any existing feature view first
    try:
        feature_store.delete_feature_view(
            name=config.FEATURE_VIEW_NAME,
            version=FEATURE_VIEW_VERSION_NEW
        )
        print(f'✓ Deleted existing feature view v{FEATURE_VIEW_VERSION_NEW}')
    except Exception:
        print(f'ℹ No existing feature view v{FEATURE_VIEW_VERSION_NEW} to delete')
    
    # Create fresh feature view with new version
    # Use select() with explicit column names instead of select_all()
    query = feature_group.select(["pickup_hour", "pickup_location_id", "rides"])
    feature_view = feature_store.create_feature_view(
        name=config.FEATURE_VIEW_NAME,
        version=FEATURE_VIEW_VERSION_NEW,
        query=query
    )
    print(f"✓ Feature view v{FEATURE_VIEW_VERSION_NEW} created successfully")
    
except Exception as e:
    print(f"Error creating feature view: {e}")
    # Try to get existing feature view as fallback
    # Any error raised here propagates unchanged
    feature_view = feature_store.get_feature_view(
        name=config.FEATURE_VIEW_NAME,
        version=FEATURE_VIEW_VERSION_NEW
    )
    print(f'✓ Using existing feature view v{FEATURE_VIEW_VERSION_NEW}')

print(f"✓ Feature view ready: {feature_view}")

Setting up feature view...
ℹ No existing feature view v3 to delete
Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1329302/fs/1317957/fv/time_series_hourly_feature_view/version/3
✓ Feature view v3 created successfully
✓ Feature view ready: <hsfs.feature_view.FeatureView object at 0x000001B0DD3C5290>
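
The delete-then-create steps above repeat the same try/except pattern several times. One way to factor it out is sketched below. `delete_if_exists` is a hypothetical helper, and `_FakeStore` is a stand-in so the example runs without a Hopsworks connection; only the `delete_feature_view(name, version)` call shape is taken from the cells above.

```python
import logging

logger = logging.getLogger(__name__)


def delete_if_exists(delete_fn, label):
    """Call delete_fn(); return True on success, False if it raised.

    Catching Exception (rather than a bare except) lets KeyboardInterrupt
    and SystemExit propagate, and logging the error keeps it visible.
    """
    try:
        delete_fn()
    except Exception as exc:  # assumption: a failure means "nothing to delete"
        logger.info("Skipping delete of %s: %s", label, exc)
        return False
    logger.info("Deleted %s", label)
    return True


class _FakeStore:
    """Stand-in for the feature store, for illustration only."""

    def __init__(self):
        self.views = {("demo_view", 3)}

    def delete_feature_view(self, name, version):
        self.views.remove((name, version))  # raises KeyError if absent


store = _FakeStore()
first = delete_if_exists(
    lambda: store.delete_feature_view(name="demo_view", version=3),
    "feature view v3",
)
second = delete_if_exists(
    lambda: store.delete_feature_view(name="demo_view", version=3),
    "feature view v3",
)
print(first, second)
```

Against the real store the same helper would be called with `lambda: feature_store.delete_feature_view(name=config.FEATURE_VIEW_NAME, version=FEATURE_VIEW_VERSION_NEW)`. Treating every failure as "nothing to delete" is a simplification; a permissions or network error would be swallowed too, which is why the message is logged.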


In [7]:
# WORKAROUND: Hopsworks Query Service has a bug with schema serialization
# The server is storing column names as dictionary strings like "{'name': 'pickup_hour', 'type': 'timestamp'}"
# Instead of fetching from feature store, we use the local ts_data that was already transformed

print("Using local training data (Hopsworks Query Service bug workaround)...")
print("Note: Data was already inserted into feature store for inference pipeline")

# ts_data is already available from the data-loading cell (In [3]) after transformation
# Just verify it's ready
if 'ts_data' in globals() and ts_data is not None and len(ts_data) > 0:
    print(f"✓ Training data ready! Shape: {ts_data.shape}")
    print(f"Columns: {list(ts_data.columns)}")
    print(f"\nData types:\n{ts_data.dtypes}")
    print(f"\nFirst few rows:\n{ts_data.head()}")
    print(f"\nData range: {ts_data['pickup_hour'].min()} to {ts_data['pickup_hour'].max()}")
else:
    raise RuntimeError("ts_data not available - please run the data-loading cell (In [3]) first")

print("\n✓ Training data ready!")

Using local training data (Hopsworks Query Service bug workaround)...
Note: Data was already inserted into feature store for inference pipeline
✓ Training data ready! Shape: (4418400, 3)
Columns: ['pickup_hour', 'rides', 'pickup_location_id']

Data types:
pickup_hour           datetime64[ns]
rides                          int64
pickup_location_id             int64
dtype: object

First few rows:
          pickup_hour  rides  pickup_location_id
0 2024-01-01 00:00:00     25                   4
1 2024-01-01 01:00:00     29                   4
2 2024-01-01 02:00:00     34                   4
3 2024-01-01 03:00:00     31                   4
4 2024-01-01 04:00:00     32                   4

Data range: 2024-01-01 00:00:00 to 2025-11-30 23:00:00

✓ Training data ready!
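
As a sanity check on the shape printed above, the row count can be compared with a complete hourly grid. This is a small sketch using only the printed values. It assumes that 263 (the length of the progress bar during the transformation) is the number of pickup locations, and that the transformation emits one row per location per hour.

```python
from datetime import datetime

# Values copied from the output above
first_hour = datetime(2024, 1, 1, 0)
last_hour = datetime(2025, 11, 30, 23)
n_rows = 4_418_400

# Assumption: 263 pickup locations (taken from the progress bar length)
n_locations = 263

# Inclusive count of hourly slots between the first and last timestamp
n_hours = int((last_hour - first_hour).total_seconds() // 3600) + 1

print(f"Hourly slots per location: {n_hours}")
print(f"Rows expected from a full grid: {n_hours * n_locations}")
print(f"Matches reported shape: {n_hours * n_locations == n_rows}")
```

Under those assumptions the product equals the reported row count, which suggests the time series has no missing location-hour combinations.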

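
The workaround cell describes column names coming back as dictionary strings such as `"{'name': 'pickup_hour', 'type': 'timestamp'}"`. If data read from the feature store shows that symptom, the plain names can be recovered on the client side. This is a hedged sketch: `clean_column_name` is a hypothetical helper, and the sample labels only follow the format quoted in the comment above.

```python
import ast


def clean_column_name(raw):
    """Return the plain feature name from a possibly corrupted column label.

    A label that looks like a dict literal is parsed with ast.literal_eval,
    which only evaluates Python literals. Anything else is returned unchanged.
    """
    text = str(raw).strip()
    if text.startswith("{") and text.endswith("}"):
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return raw
        if isinstance(parsed, dict) and "name" in parsed:
            return parsed["name"]
    return raw


# Sample labels in the corrupted format, mixed with a healthy one
columns = [
    "{'name': 'pickup_hour', 'type': 'timestamp'}",
    "pickup_location_id",
    "{'name': 'rides', 'type': 'bigint'}",
]
cleaned = [clean_column_name(c) for c in columns]
print(cleaned)
```

With pandas, the same function can be passed directly to `DataFrame.rename(columns=clean_column_name)`. This only repairs labels after a read succeeds; it does not address a query that fails on the server.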