# Flight Delay Prediction Feature Store Demo

This notebook demonstrates how to use the Feast feature store for flight delay prediction. We'll cover:
1. Setting up the feature store
2. Retrieving historical features
3. Managing online features
4. Working with feature services
5. Simulating real-time data ingestion

## 1. Importing Required Libraries

In [40]:
import subprocess
from datetime import datetime

import pandas as pd

from feast import FeatureStore
from feast.data_source import PushMode

In [41]:
# Initialize the connection to the feature store.
# repo_path points to the feature repository directory that contains feature_store.yaml.
store = FeatureStore(repo_path=".")

Feast supports several patterns of feature retrieval:

1. Training data generation (via feature_store.get_historical_features(...))

2. Offline feature retrieval for batch scoring (via feature_store.get_historical_features(...))

3. Online feature retrieval for real-time model predictions

In this tutorial, we will focus on training data generation and online feature retrieval.

## 2. Historical Feature Retrieval

Historical features represent point-in-time correct data used primarily for training machine learning models. These features are retrieved using the get_historical_features method, which ensures that only data available at the specified timestamp is used, preventing data leakage.
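
To make the point-in-time rule concrete, here is a minimal sketch in plain Python. It is not Feast's implementation: the `history` data and the `value_as_of` helper are hypothetical, and the sketch only illustrates the idea that a lookup at a given timestamp may use the latest row recorded at or before that timestamp, never a later one.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical feature history for a single flight, sorted by event time.
history = [
    (datetime(2024, 9, 1, tzinfo=timezone.utc), {"Distance": 392.0}),
    (datetime(2024, 9, 15, tzinfo=timezone.utc), {"Distance": 619.0}),
]


def value_as_of(history, as_of):
    """Return the latest feature row with event time <= as_of, or None."""
    timestamps = [ts for ts, _ in history]
    # bisect_right counts the rows whose timestamp is <= as_of.
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


# A lookup on 2024-09-10 must not see the row recorded on 2024-09-15.
print(value_as_of(history, datetime(2024, 9, 10, tzinfo=timezone.utc)))
```

A lookup dated before the first recorded row returns `None`, which corresponds to the missing values a training dataframe would contain for that entity.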

The function `fetch_historical_features_entity_df` demonstrates how to retrieve historical features for specific flights, showing how to properly structure entity dataframes with timestamps and feature selection.

In [56]:
def fetch_historical_features_entity_df(store: FeatureStore, searchdate: datetime):
    entity_df = pd.DataFrame.from_dict(
        {
            "flight_ID": ["WN_3609", "WN_3610", "WN_3611"],
            # Every flight is looked up as of the same point in time.
            "event_timestamp": [searchdate] * 3,
        }
    )

    training_df = store.get_historical_features(
        entity_df=entity_df,
        features=[
            "flight_stats:Distance",
            "flight_stats:CRSElapsedTime",
            "flight_stats:DayOfWeek",
            "flight_stats:Month",
            "flight_stats:WeatherDelay",
            "flight_stats:NASDelay",
        ],
    ).to_df()
    print(training_df.head())

## 3. Online Feature Retrieval

Online feature retrieval enables real-time serving of feature values for model inference in production environments. This is critical for making predictions on live data where low-latency access to current feature values is required.

The function `fetch_online_features` shows different approaches to retrieving online features, from an explicit list of feature references to feature services, including one backed by a push source.

In [46]:
def fetch_online_features(store, source: str = ""):
    entity_rows = [
        {
            "flight_ID": "WN_3609",
        },
        {
            "flight_ID": "WN_3610",
        },
        {
            "flight_ID": "WN_3611",
        }
    ]

    if source == "feature_service":
        features_to_fetch = store.get_feature_service("flight_prediction_v1")
    elif source == "advanced_feature_service":
        features_to_fetch = store.get_feature_service("flight_prediction_v2")
    elif source == "push":
        features_to_fetch = store.get_feature_service("flight_prediction_v3")
    else:
        features_to_fetch = [
            "flight_stats:Distance",
            "flight_stats:WeatherDelay",
        ]

    returned_features = store.get_online_features(
        features=features_to_fetch,
        entity_rows=entity_rows,
    ).to_dict()

    for key, value in sorted(returned_features.items()):
        print(key, " : ", value)
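
The branching on `source` above could also be written as a lookup table. The sketch below is one possible arrangement, not part of the original notebook: `FEATURE_SERVICES`, `DEFAULT_FEATURES`, and `resolve_features` are hypothetical names, while the service names and feature references are copied from `fetch_online_features`.

```python
# Service names copied from fetch_online_features; the table itself is hypothetical.
FEATURE_SERVICES = {
    "feature_service": "flight_prediction_v1",
    "advanced_feature_service": "flight_prediction_v2",
    "push": "flight_prediction_v3",
}

# Fallback used when no known source is given.
DEFAULT_FEATURES = ["flight_stats:Distance", "flight_stats:WeatherDelay"]


def resolve_features(store, source=""):
    """Return a feature service for a known source, else the default feature list."""
    service_name = FEATURE_SERVICES.get(source)
    if service_name is None:
        return list(DEFAULT_FEATURES)
    # get_feature_service is the same FeatureStore call used in the cell above.
    return store.get_feature_service(service_name)
```

The returned value can be passed as the `features` argument of `store.get_online_features`, exactly as `features_to_fetch` is in the function above.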

In [47]:
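# searchdate is a required argument; 2024-09-01 matches the timestamps in the output below.
fetch_historical_features_entity_df(store, datetime(2024, 9, 1))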

+----------------------------+------------+-------------+
| Merge columns              | left dtype | right dtype |
+----------------------------+------------+-------------+
| ('flight_ID', 'flight_ID') | object     | string      |
+----------------------------+------------+-------------+
Cast dtypes explicitly to avoid unexpected results.


  flight_ID           event_timestamp  Distance  CRSElapsedTime  DayOfWeek  \
0   WN_3609 2024-09-01 00:00:00+00:00     619.0           105.0          7   
1   WN_3610 2024-09-01 00:00:00+00:00    1670.0           250.0          7   

   Month  WeatherDelay  NASDelay  
0      9           NaN       NaN  
1      9           NaN       NaN  


## 4. Feature Store Operations

### 4.1 Materialize Features
Materialization is the process of loading the latest feature values from the offline (batch) source into the online store for fast access. This ensures that features are readily available for real-time serving without querying the batch source at request time.

This example demonstrates materializing features for a specific date range (September 2024), showing how to keep the online store updated with relevant historical data.
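
Conceptually, materialization over a date range can be pictured as keeping, for each entity, the most recent row that falls inside the window. The following is a simplified sketch in plain Python, not Feast's implementation: `materialize_rows` and the sample rows are hypothetical, and treating both ends of the window as inclusive is an assumption made for the example.

```python
from datetime import datetime, timezone


def materialize_rows(offline_rows, start_date, end_date,
                     key="flight_ID", ts="event_timestamp"):
    """Build a key-value 'online store' holding the latest in-window row per entity."""
    online = {}
    for row in offline_rows:
        # Skip rows outside the requested window (inclusive bounds are an assumption).
        if not (start_date <= row[ts] <= end_date):
            continue
        current = online.get(row[key])
        # Keep only the most recent row for each entity key.
        if current is None or row[ts] > current[ts]:
            online[row[key]] = row
    return online


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical offline rows; the values are illustrative only.
offline_rows = [
    {"flight_ID": "WN_3609", "event_timestamp": utc(2024, 9, 1), "Distance": 392.0},
    {"flight_ID": "WN_3609", "event_timestamp": utc(2024, 9, 20), "Distance": 400.0},
    {"flight_ID": "WN_3609", "event_timestamp": utc(2024, 10, 5), "Distance": 999.0},
    {"flight_ID": "WN_3610", "event_timestamp": utc(2024, 8, 15), "Distance": 787.0},
]

online = materialize_rows(offline_rows, utc(2024, 9, 1), utc(2024, 9, 30))
print(online)
```

In this sketch only `WN_3609` ends up in the online dictionary, with its 2024-09-20 row, because the other rows fall outside September 2024.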

In [48]:
from datetime import datetime

start_date = datetime.strptime('2024-09-01', '%Y-%m-%d')
end_date = datetime.strptime('2024-09-30', '%Y-%m-%d')

store.materialize(start_date=start_date, end_date=end_date)

Materializing 2 feature views from 2024-09-01 00:00:00+00:00 to 2024-09-30 00:00:00+00:00 into the sqlite online store.

flight_stats:


100%|███████████████████████████████████████████████████████| 21956/21956 [00:09<00:00, 2204.61it/s]


flight_stats_fresh:


100%|███████████████████████████████████████████████████████| 21956/21956 [00:07<00:00, 2979.28it/s]


### 4.2 Test Feature Retrieval
Retrieve features using different methods and services.

We demonstrate four approaches to feature retrieval: an explicit list of feature references, feature service v1 (basic flight information) for a basic model, feature service v2 (comprehensive flight information) for an advanced model, and feature service v3, which is backed by a push source.

In [49]:
fetch_online_features(store)

Distance  :  [392.0, 787.0, 879.0]
WeatherDelay  :  [None, 0.0, None]
flight_ID  :  ['WN_3609', 'WN_3610', 'WN_3611']
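
The dictionary printed above is column-oriented: each feature name maps to a list with one value per requested entity, in request order. When a per-flight view is more convenient, the columns can be regrouped. The helper below is a hypothetical convenience, not a Feast API; the sample dictionary repeats the values printed above.

```python
def rows_by_entity(columns, key="flight_ID"):
    """Regroup a column-oriented feature dict into one dict of features per entity."""
    feature_names = [name for name in columns if name != key]
    return {
        entity: {name: columns[name][i] for name in feature_names}
        for i, entity in enumerate(columns[key])
    }


# Values copied from the output above.
returned_features = {
    "Distance": [392.0, 787.0, 879.0],
    "WeatherDelay": [None, 0.0, None],
    "flight_ID": ["WN_3609", "WN_3610", "WN_3611"],
}

for flight, features in rows_by_entity(returned_features).items():
    print(flight, features)
```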


In [50]:
print("\n--- Online features retrieved through feature service v1 ---")
fetch_online_features(store, source="feature_service")


--- Online features retrieved through feature service v1 ---
CRSElapsedTime  :  [80.0, 145.0, 150.0]
DayOfWeek  :  [7, 7, 6]
Distance  :  [392.0, 787.0, 879.0]
Month  :  [9, 9, 9]
flight_ID  :  ['WN_3609', 'WN_3610', 'WN_3611']


In [51]:
print("\n--- Online features retrieved through feature service v2 ---")
fetch_online_features(store, source="advanced_feature_service")


--- Online features retrieved through feature service v2 ---
ArrDelay  :  [-19.0, 53.0, -6.0]
CRSElapsedTime  :  [80.0, 145.0, 150.0]
CarrierDelay  :  [None, 45.0, None]
DayOfWeek  :  [7, 7, 6]
DepDelay  :  [-6.0, 45.0, 7.0]
Dest  :  ['ICT', 'MCO', 'PHX']
Distance  :  [392.0, 787.0, 879.0]
LateAircraftDelay  :  [None, 0.0, None]
Month  :  [9, 9, 9]
NASDelay  :  [None, 8.0, None]
Origin  :  ['STL', 'BWI', 'DAL']
Quarter  :  [3, 3, 3]
SecurityDelay  :  [None, 0.0, None]
WeatherDelay  :  [None, 0.0, None]
flight_ID  :  ['WN_3609', 'WN_3610', 'WN_3611']


In [52]:
print("\n--- Online features retrieved using feature service v3 (with push source) ---")
fetch_online_features(store, source="push")


--- Online features retrieved using feature service v3 (with push source) ---
CRSElapsedTime  :  [80.0, 145.0, 150.0]
DayOfWeek  :  [7, 7, 6]
DepDelay  :  [-6.0, 45.0, 7.0]
Dest  :  ['ICT', 'MCO', 'PHX']
Distance  :  [392.0, 787.0, 879.0]
Month  :  [9, 9, 9]
NASDelay  :  [None, 8.0, None]
Origin  :  ['STL', 'BWI', 'DAL']
Quarter  :  [3, 3, 3]
WeatherDelay  :  [None, 0.0, None]
flight_ID  :  ['WN_3609', 'WN_3610', 'WN_3611']


## 5. Real-time Feature Updates

Real-time feature updates allow the feature store to incorporate the latest data as it becomes available. This is crucial for maintaining up-to-date feature values in dynamic environments.

The example simulates a stream event for flight WN_3609, demonstrating how to push new feature values to both online and offline storage (`to=PushMode.ONLINE_AND_OFFLINE`) using the push source mechanism, ensuring consistency across both serving layers.
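
When building a simulated event by hand, the calendar columns (`DayOfWeek`, `Month`, `Quarter`) can be derived from the flight date so they stay consistent with it. The helper below is a hypothetical sketch, not part of the original notebook. It assumes `DayOfWeek` runs from 1 (Monday) to 7 (Sunday), which agrees with the 2024-09-01 rows shown earlier (a Sunday, reported with `DayOfWeek` 7), but that convention should be checked against the dataset.

```python
from datetime import datetime


def calendar_fields(flight_date):
    """Derive calendar feature columns from a flight date."""
    return {
        # Assumption: 1 = Monday ... 7 = Sunday, as returned by isoweekday().
        "DayOfWeek": flight_date.isoweekday(),
        "Month": flight_date.month,
        # Months 1-3 map to quarter 1, months 4-6 to quarter 2, and so on.
        "Quarter": (flight_date.month - 1) // 3 + 1,
    }


print(calendar_fields(datetime(2024, 9, 1)))
```

Each returned value can be wrapped in a single-element list and merged into the dictionary passed to `pd.DataFrame.from_dict`.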

In [53]:
print("\n--- Simulate a stream event ingestion ---")
event_df = pd.DataFrame.from_dict(
    {
        "flight_ID": ["WN_3609"],
        "FlightDate": [datetime.now()],
        "Origin": ["ABQ"],
        "Dest": ["AUS"],
        "Distance": [619.0],
        "CRSElapsedTime": [95.0],
        "DayOfWeek": [7],
        "Month": [9],
        "Quarter": [3],
        "DepDelay": [15.0],
        "WeatherDelay": [10.0],
        "NASDelay": [0.0],
        "SecurityDelay": [0.0],
        "LateAircraftDelay": [0.0],
        "ArrDelay": [25.0],  # sum of delays
        "CarrierDelay": [0.0],
    }
)
print(event_df)
store.push("flight_stats_push_source", event_df, to=PushMode.ONLINE_AND_OFFLINE)


--- Simulate a stream event ingestion ---
  flight_ID                 FlightDate Origin Dest  Distance  CRSElapsedTime  \
0   WN_3609 2025-01-20 19:03:20.453291    ABQ  AUS     619.0            95.0   

   DayOfWeek  Month  Quarter  DepDelay  WeatherDelay  NASDelay  SecurityDelay  \
0          7      9        3      15.0          10.0       0.0            0.0   

   LateAircraftDelay  ArrDelay  CarrierDelay  
0                0.0      25.0           0.0  


Fetching the online features again shows the updated entry for flight `WN_3609` with the newly ingested data.

In [54]:
print("\n--- Online features again with updated values from stream push ---")
fetch_online_features(store, source="push")


--- Online features again with updated values from stream push ---
CRSElapsedTime  :  [95.0, 145.0, 150.0]
DayOfWeek  :  [7, 7, 6]
DepDelay  :  [15.0, 45.0, 7.0]
Dest  :  ['AUS', 'MCO', 'PHX']
Distance  :  [619.0, 787.0, 879.0]
Month  :  [9, 9, 9]
NASDelay  :  [0.0, 8.0, None]
Origin  :  ['ABQ', 'BWI', 'DAL']
Quarter  :  [3, 3, 3]
WeatherDelay  :  [10.0, 0.0, None]
flight_ID  :  ['WN_3609', 'WN_3610', 'WN_3611']


Because the event was pushed with `to=PushMode.ONLINE_AND_OFFLINE`, Feast writes the new entry to both the online store and the batch source file (`flights.parquet`).

As a result, calling `get_historical_features` for `'WN_3609'` with `datetime.now()` as the timestamp returns the newly ingested data.

In [59]:
fetch_historical_features_entity_df(store, datetime.now())

+----------------------------+------------+-------------+
| Merge columns              | left dtype | right dtype |
+----------------------------+------------+-------------+
| ('flight_ID', 'flight_ID') | object     | string      |
+----------------------------+------------+-------------+
Cast dtypes explicitly to avoid unexpected results.


  flight_ID                  event_timestamp  Distance  CRSElapsedTime  \
0   WN_3609 2025-01-20 19:07:04.080467+00:00     619.0            95.0   

   DayOfWeek  Month  WeatherDelay  NASDelay  
0          7      9          10.0       0.0  
