# Lesson 5: Feature Stores with Feast

**Module 3: Data & Pipeline Engineering** | **Time**: 4-5 hours | **Difficulty**: Intermediate-Advanced

---

## 🎯 Learning Objectives

✅ Understand the training-serving skew problem and why feature stores exist  
✅ Learn the dual-database architecture behind feature stores  
✅ Implement a feature store using Feast  
✅ Understand point-in-time joins for temporal correctness  
✅ Answer 5 interview questions on feature stores  

---

## 📚 Table of Contents

1. [The Training-Serving Skew Problem](#1-skew)
2. [What Is a Feature Store?](#2-what-is)
3. [Architecture: Dual-Database Design](#3-architecture)
4. [Point-in-Time Joins](#4-pit-joins)
5. [Feast: Open-Source Feature Store](#5-feast)
6. [Hands-On: End-to-End Feast Demo](#6-hands-on)
7. [Feature Store Best Practices](#7-best-practices)
8. [Exercises](#8-exercises)
9. [Interview Preparation](#9-interview)

---

## 1. The Training-Serving Skew Problem <a id='1-skew'></a>

One of the **biggest operational challenges** in production ML is ensuring that the features used during **training** are computed exactly the same way as those used during **inference (serving)**.

### What Goes Wrong Without a Feature Store

```
  ❌ WITHOUT FEATURE STORE (Training-Serving Skew):

  TRAINING PIPELINE:                    SERVING PIPELINE:
  ┌─────────────────┐                ┌─────────────────┐
  │ Python script   │                │ Java service    │
  │ Pandas logic    │                │ Different logic │
  │ Version A       │                │ Version B       │
  └────────┬────────┘                └────────┬────────┘
         │                                    │
    Different                            Different
    feature logic!                       feature values!
         │                                    │
         └──────────▶ SKEW 💥 ◀──────────┘
```

### Types of Skew

| Skew Type | Description | Example |
|-----------|-------------|----------|
| **Code skew** | Different languages/logic for train vs serve | Python Pandas vs Java SQL |
| **Data skew** | Different data sources for train vs serve | Batch DB vs real-time API |
| **Time skew** | Using future data in training features | Averaging over a window that includes future data |
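
To make code skew concrete, the sketch below computes the same z-score feature in two ways. It is an illustrative assumption rather than a real incident: the "training" path uses pandas' `Series.std()` (sample standard deviation, `ddof=1`), while the hypothetical "serving" path uses NumPy's `np.std()` (population standard deviation, `ddof=0`). All values are made up.

```python
import numpy as np
import pandas as pd

spend = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])  # made-up historical values
new_value = 60.0                                    # value seen at serving time

# "Training" implementation: pandas defaults to the sample std (ddof=1)
train_mean, train_std = spend.mean(), spend.std()
train_feature = (new_value - train_mean) / train_std

# "Serving" implementation: NumPy defaults to the population std (ddof=0)
values = spend.to_numpy()
serve_mean, serve_std = np.mean(values), np.std(values)
serve_feature = (new_value - serve_mean) / serve_std

print(f"training z-score: {train_feature:.4f}")
print(f"serving  z-score: {serve_feature:.4f}")
print(f"skew:             {abs(train_feature - serve_feature):.4f}")
```

Both code paths look reasonable in isolation, yet the model would see a z-score of about 1.90 during training and about 2.12 at serving time for the same input.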

### The Consequence

> A model trained with skewed features will perform differently in production than in development, often **degrading silently** without obvious errors.

---

## 2. What Is a Feature Store? <a id='2-what-is'></a>

A feature store is a **centralized repository** for storing, managing, and serving ML features. It ensures that **the same feature definitions and values** are used for both training and serving.

### Feature Store Benefits

```
  ✅ WITH FEATURE STORE:

  TRAINING PIPELINE:         FEATURE STORE          SERVING PIPELINE:
  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
  │ get_features()  │───▶│ Single Source   │◀───│ get_features()  │
  │ (batch read)    │    │ of Truth        │    │ (online read)   │
  └─────────────────┘    │                 │    └─────────────────┘
                         │ • Consistent    │
                         │ • Versioned     │
                         │ • Discoverable  │
                         └─────────────────┘
          SAME features for training AND serving!
```

### Key Capabilities

| Capability | Description |
|-----------|-------------|
| **Feature Registry** | Central catalog of all features with metadata |
| **Offline Store** | Historical features for training (batch) |
| **Online Store** | Low-latency features for serving (real-time) |
| **Point-in-Time Joins** | Temporal correctness for training data |
| **Feature Sharing** | Teams can discover and reuse features |
| **Monitoring** | Track feature drift, freshness, quality |

---

## 3. Architecture: Dual-Database Design <a id='3-architecture'></a>

Feature stores use a **dual-database architecture** to serve two very different access patterns:

### The Dual-Database Pattern

```
                    ┌─────────────────┐
                    │  Feature Store   │
                    │  (Central Hub)   │
                    └────────┬────────┘
                             │
              ┌────────────┴────────────┐
              │                           │
   ┌─────────────────┐      ┌─────────────────┐
   │ OFFLINE STORE    │      │  ONLINE STORE    │
   │ (Batch/Training) │      │ (Real-time/Serve)│
   ├─────────────────┤      ├─────────────────┤
   │ • Parquet/BQ     │      │ • Redis/DynamoDB │
   │ • Full history   │      │ • Latest values  │
   │ • High throughput│      │ • Low latency    │
   │ • Seconds OK     │      │ • <10ms response │
   └─────────────────┘      └─────────────────┘
       │                           │
   Used by:                    Used by:
   Model Training              Model Serving
   Batch Inference             Real-time Inference
```

### Why Two Databases?

| Requirement | Training (Offline) | Serving (Online) |
|------------|-------------------|------------------|
| **Data Volume** | Millions/billions of rows | Single entity lookup |
| **Latency** | Seconds to minutes acceptable | <10ms required |
| **Access Pattern** | Full scan / batch | Key-value lookup |
| **Data Span** | Historical (months/years) | Latest values only |
| **Storage** | Parquet, BigQuery | Redis, DynamoDB |

A columnar format or data warehouse (Parquet files, BigQuery) is optimized for the first; a key-value store (Redis, DynamoDB) for the second. No single database excels at both.
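
The two access patterns can be sketched with plain pandas and a dictionary. This is a toy stand-in under stated assumptions, not how any real store is implemented: the DataFrame plays the offline store, the dict plays the online store, and the "materialization" step simply keeps the most recent row per `user_id`. The data is invented.

```python
import pandas as pd

# Offline store stand-in: every historical row is kept (toy data)
offline = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "event_timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-03"]
    ),
    "daily_spend": [20.0, 35.0, 12.0, 18.0],
})

# Training access pattern: scan a whole time range for many entities
training_rows = offline[offline["event_timestamp"] >= "2024-01-01"]

# "Materialization": keep only the most recent row per entity
latest = offline.sort_values("event_timestamp").groupby("user_id").tail(1)

# Online store stand-in: entity key -> latest feature values
online = {
    row.user_id: {"daily_spend": row.daily_spend}
    for row in latest.itertuples(index=False)
}

print(len(training_rows))  # historical rows scanned for training
print(online[1])           # single key lookup at serving time
```

The offline side answers "give me everything in this range", while the online side answers "give me the current values for this one key", which is why the storage engines differ.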

---

## 4. Point-in-Time Joins <a id='4-pit-joins'></a>

When building training data, you need features **as they were at the time of each training example** — not as they are today. This is called a **point-in-time join**.

### Why Regular Joins Cause Leakage

```
  REGULAR JOIN (LEAKY):
  Label event: User X purchased on Jan 15
  Feature:     User X's avg_spend = $120  (computed over ALL time including after Jan 15!)
  → Uses future information! 💥

  POINT-IN-TIME JOIN (CORRECT):
  Label event: User X purchased on Jan 15
  Feature:     User X's avg_spend = $85  (computed ONLY with data before Jan 15)
  → Only past information ✅

  Timeline:
  ────────────────────┼─────────────────▶
    Past data only     │ Jan 15       Future
    (use for features) │ (event)      (NEVER use)
```

Feature stores like Feast automate these point-in-time joins, ensuring temporal correctness.
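
The difference between the two joins above can be reproduced in a few lines of pandas. The purchase amounts below are hypothetical, chosen only so the averages match the $120 and $85 in the illustration.

```python
import pandas as pd

# Hypothetical purchase history for User X
purchases = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-05", "2024-01-10", "2024-01-20", "2024-02-01"]
    ),
    "amount": [80.0, 90.0, 150.0, 160.0],
})
label_time = pd.Timestamp("2024-01-15")  # when the label event happened

# Leaky: aggregates over all rows, including those after the label event
leaky_avg_spend = purchases["amount"].mean()

# Point-in-time correct: only rows strictly before the label event
past = purchases[purchases["timestamp"] < label_time]
pit_avg_spend = past["amount"].mean()

print(f"leaky avg_spend:         ${leaky_avg_spend:.0f}")
print(f"point-in-time avg_spend: ${pit_avg_spend:.0f}")
```

The leaky value is inflated by purchases the model could not have known about on Jan 15, so a model trained on it would look better offline than it can be in production.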

---

## 5. Feast: Open-Source Feature Store <a id='5-feast'></a>

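**Feast** (Feature Store) is a widely used open-source feature store. It provides Python-based feature definitions, an offline store for historical retrieval, and an online store for low-latency serving.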

### Feast Architecture

```
  ┌───────────────────────────────────────────────────┐
  │                    FEAST                               │
  ├────────────────┬────────────────┬─────────────────┤
  │ Feature Defs  │ Offline Store  │  Online Store     │
  │ (Python)      │ (Parquet/BQ)   │  (SQLite/Redis)   │
  │               │                │                   │
  │ Entity        │ Historical     │  Latest values    │
  │ FeatureView   │ features for   │  for real-time    │
  │ FeatureService│ training       │  serving          │
  └────────────────┴────────────────┴─────────────────┘
```

### Core Feast Concepts

| Concept | Description | Example |
|---------|-------------|----------|
| **Entity** | The real-world object features describe | `user_id`, `product_id` |
| **Feature View** | A group of related features from a source | `user_purchase_features` |
| **Data Source** | Where the raw feature data lives | Parquet file, BigQuery table |
| **Feature Service** | A bundle of features for a specific model | `fraud_detection_features` |

---

## 6. Hands-On: End-to-End Feast Demo <a id='6-hands-on'></a>

We’ll build a complete feature store workflow:
1. Generate feature data
2. Define Feast entities and feature views
3. Materialize features to online store
4. Retrieve features for training and serving

### Prerequisites
```bash
pip install feast
```

In [None]:
# ============================================================
# Step 1: Generate Feature Data (simulating a data pipeline output)
# ============================================================
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import os

np.random.seed(42)

# Generate user features over time
n_users = 500
n_days = 90
records = []

for day in range(n_days):
    event_time = datetime(2024, 1, 1) + timedelta(days=day)
    for user_id in range(1, n_users + 1):
        records.append({
            'event_timestamp': event_time,
            'user_id': user_id,
            # Clamp at 0: the seasonal term can push small draws below zero
            'daily_spend': round(max(0.0, np.random.exponential(30) + np.sin(day/7) * 10), 2),
            'num_sessions': int(np.random.poisson(3)),
            'avg_session_duration_min': round(np.random.gamma(2, 5), 1),
            'items_viewed': int(np.random.poisson(10)),
            'items_purchased': int(np.random.poisson(1)),
            'is_premium': int(np.random.random() > 0.7),
        })

user_features_df = pd.DataFrame(records)

# Save as Parquet (this is what Feast reads)
os.makedirs('feast_demo/data', exist_ok=True)
user_features_df.to_parquet('feast_demo/data/user_features.parquet', index=False)

print(f"Generated {len(user_features_df):,} feature records")
print(f"Users: {n_users} | Days: {n_days}")
print(f"Columns: {list(user_features_df.columns)}")
print(f"\nDate range: {user_features_df['event_timestamp'].min()} to {user_features_df['event_timestamp'].max()}")
user_features_df.head()

In [None]:
# ============================================================
# Step 2: Define Feast Feature Store
# ============================================================

# Create the feature_store.yaml configuration
feast_config = """project: user_activity
registry: feast_demo/registry.db
provider: local
online_store:
  type: sqlite
  path: feast_demo/online_store.db
offline_store:
  type: file
entity_key_serialization_version: 2
"""

with open('feast_demo/feature_store.yaml', 'w') as f:
    f.write(feast_config)

print("✅ Created feature_store.yaml")
print(feast_config)

In [None]:
# ============================================================
# Step 3: Define Entities and Feature Views programmatically
# ============================================================

feature_def_code = '''
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Define the entity (the "who" or "what" the features describe)
user = Entity(
    name="user_id",
    description="Unique user identifier",
)

# Define the data source
user_source = FileSource(
    path="data/user_features.parquet",
    timestamp_field="event_timestamp",
)

# Define the feature view
user_activity_fv = FeatureView(
    name="user_activity",
    entities=[user],
    ttl=timedelta(days=7),  # Features older than 7 days are stale
    schema=[
        Field(name="daily_spend", dtype=Float32),
        Field(name="num_sessions", dtype=Int64),
        Field(name="avg_session_duration_min", dtype=Float32),
        Field(name="items_viewed", dtype=Int64),
        Field(name="items_purchased", dtype=Int64),
        Field(name="is_premium", dtype=Int64),
    ],
    source=user_source,
    online=True,
)
'''

with open('feast_demo/features.py', 'w') as f:
    f.write(feature_def_code)

print("✅ Created features.py with entity and feature view definitions")
print("\nKey components:")
print("  Entity:       user_id (the primary key)")
print("  FeatureView:  user_activity (6 features)")
print("  TTL:          7 days (features expire after 7 days)")
print("  Source:       Parquet file")

In [None]:
# ============================================================
# Step 4: Apply Feast definitions and materialize
# ============================================================
try:
    from feast import FeatureStore
    
    # Initialize the feature store
    store = FeatureStore(repo_path='feast_demo')
    
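    # Load the definitions written to feast_demo/features.py in Step 3
    import importlib.util
    spec = importlib.util.spec_from_file_location('features', 'feast_demo/features.py')
    features = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(features)

    # Register the entity and feature view (an empty list would register nothing)
    store.apply([features.user, features.user_activity_fv])
    print("✅ Feature store initialized!")

    # Materialize features to the online store: copy the latest value per
    # entity within the given window. materialize_incremental() would start
    # from (now - TTL) on a first run and miss this 2024 demo data.
    store.materialize(
        start_date=datetime(2024, 1, 1),
        end_date=datetime(2024, 4, 1),
    )
    print("✅ Features materialized to online store!")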
    
except ImportError:
    print("⚠️ Feast not installed. Run: pip install feast")
    print("\nThe code above would:")
    print("1. Initialize the feature store from feast_demo/")
    print("2. Register entity and feature view definitions")
    print("3. Materialize (copy) latest features to online store (SQLite)")
except Exception as e:
    print(f"⚠️ Feast setup failed: {e}")
    print("Later cells fall back to simulated output.")
    print("From the repo directory you can also use the CLI:")
    print("  feast apply && feast materialize <start_ts> <end_ts>")

In [None]:
# ============================================================
# Step 5: Retrieve features for Training (Offline)
# Point-in-time join demonstration
# ============================================================
print("="*60)
print("OFFLINE FEATURE RETRIEVAL (Training)")
print("="*60)

# Create entity DataFrame (who + when)
# These represent training examples with their timestamps
entity_df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 1, 2, 3],
    'event_timestamp': [
        datetime(2024, 2, 15),
        datetime(2024, 2, 15),
        datetime(2024, 2, 15),
        datetime(2024, 2, 15),
        datetime(2024, 2, 15),
        datetime(2024, 3, 1),
        datetime(2024, 3, 1),
        datetime(2024, 3, 1),
    ],
    'label': [1, 0, 1, 0, 1, 0, 1, 0]  # Our target variable
})

print("Entity DataFrame (training examples):")
print(entity_df)

try:
    # This performs a POINT-IN-TIME JOIN
    training_df = store.get_historical_features(
        entity_df=entity_df,
        features=[
            'user_activity:daily_spend',
            'user_activity:num_sessions',
            'user_activity:items_purchased',
        ]
    ).to_df()
    
    print("\nTraining data with features (point-in-time correct):")
    print(training_df)
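except Exception as e:
    # Fall back to a description if Feast is unavailable or Step 4 failed
    print(f"\n⚠️ Could not query the offline store: {e}")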
    print("\n[Simulated] Point-in-time join would return:")
    print("Each row gets features AS THEY WERE at that timestamp.")
    print("User 1 on Feb 15 gets different features than User 1 on Mar 1.")
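
Conceptually, the retrieval above behaves like an as-of join with a lookback limit. The pandas sketch below is an approximation for intuition, not Feast's actual implementation: for each entity row, `pd.merge_asof` picks the most recent feature row at or before the entity timestamp, and `tolerance` stands in for the 7-day TTL. The toy data is made up.

```python
import pandas as pd

features = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "event_timestamp": pd.to_datetime(
        ["2024-02-10", "2024-02-14", "2024-02-20", "2024-01-01"]
    ),
    "daily_spend": [25.0, 40.0, 99.0, 10.0],
})
entities = pd.DataFrame({
    "user_id": [1, 2],
    "event_timestamp": pd.to_datetime(["2024-02-15", "2024-02-15"]),
    "label": [1, 0],
})

# merge_asof requires both frames to be sorted by the "on" column
training = pd.merge_asof(
    entities.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="user_id",
    direction="backward",            # only look at rows at or before the label time
    tolerance=pd.Timedelta(days=7),  # stand-in for the feature view TTL
)
print(training)
```

User 1 receives the Feb 14 value (40.0) rather than the later Feb 20 value, and user 2 receives a missing value because the only available row is more than 7 days old.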

In [None]:
# ============================================================
# Step 6: Retrieve features for Serving (Online)
# ============================================================
print("="*60)
print("ONLINE FEATURE RETRIEVAL (Serving)")
print("="*60)

try:
    # Get the LATEST features for a user (low-latency lookup)
    online_features = store.get_online_features(
        features=[
            'user_activity:daily_spend',
            'user_activity:num_sessions',
            'user_activity:items_purchased',
        ],
        entity_rows=[{'user_id': 1}, {'user_id': 2}]
    ).to_dict()
    
    print("Online features (latest values):")
    for key, values in online_features.items():
        print(f"  {key}: {values}")
except Exception as e:
    print(f"⚠️ Could not query the online store: {e}")
    print("[Simulated] Online lookup for user_id=1:")
    print("  daily_spend: 42.50")
    print("  num_sessions: 5")
    print("  items_purchased: 2")
    print("\n>>> This would return in <10ms in production (Redis/DynamoDB)")

# Cleanup
import shutil
shutil.rmtree('feast_demo', ignore_errors=True)
print("\n(Cleaned up demo files)")

---

## 7. Feature Store Best Practices <a id='7-best-practices'></a>

| Practice | Description |
|----------|-------------|
| **Name features clearly** | `user_7d_avg_spend` not `feature_42` |
| **Set appropriate TTL** | Stale features can degrade model performance |
| **Monitor feature freshness** | Alert when features haven’t been updated |
| **Version feature definitions** | Track changes to feature logic |
| **Document features** | Include business context, not just technical description |
| **Share across teams** | Avoid duplicate feature engineering work |

---

## 8. Exercises <a id='8-exercises'></a>

### Exercise 1: Feature Design
Design a feature view for a fraud detection system. What entities, features, TTL, and data sources would you define?

### Exercise 2: Point-in-Time Join Manually
Given two DataFrames (user features with timestamps and training labels with timestamps), implement a point-in-time join using pandas `pd.merge_asof()`.

### Exercise 3: Feature Freshness
Write a monitoring script that checks when each feature was last updated and alerts if any feature is older than its TTL.

---

## 9. Interview Preparation <a id='9-interview'></a>

### Q1: "What is training-serving skew and how does a feature store prevent it?"

**Answer:**  
"Training-serving skew occurs when features used during training differ from those at inference. Causes include different code (Python vs Java), different data sources, or different time windows.

A feature store prevents this by providing a **single source of truth** for features. Both training and serving read from the same definitions. The offline store serves historical features for training; the online store serves the latest values for inference. Same logic, same data — no skew."

---

### Q2: "Explain the dual-database architecture in a feature store."

**Answer:**  
"Feature stores use two databases because training and serving have fundamentally different access patterns:

- **Offline store** (Parquet/BigQuery): Stores full feature history. Used for batch training — high throughput, latency in seconds is fine.
- **Online store** (Redis/DynamoDB): Stores only the latest feature values. Used for real-time serving — key-value lookups in <10ms.

The `materialize` operation copies the latest feature values from the offline store into the online store. No single database is optimized for both patterns."

---

### Q3: "What is a point-in-time join and why is it critical?"

**Answer:**  
"A point-in-time join retrieves features **as they existed at the time of each training example**, not as they exist today. Without it, you get temporal leakage — using future feature values to predict past events.

Example: predicting churn for User X on Jan 15. A regular join might use User X’s latest activity (from March), but a point-in-time join uses only data available before Jan 15. Feast automates this, ensuring temporal correctness."

---

### Q4: "Compare Feast to Tecton/Hopsworks. When would you choose each?"

**Answer:**  
"**Feast** (open-source): Simple setup, Python-native, and a good way to get started. Best for teams that want control over their infrastructure and don't need heavy real-time feature computation.

**Tecton** (managed): A commercial platform that shares many concepts with Feast and adds real-time feature transformations, streaming integration, and enterprise features. Best for teams that need managed, production-grade infrastructure with real-time features.

**Hopsworks**: Full ML platform with a built-in feature store. Best when you want an integrated solution. I’d start with Feast for a proof of concept, then consider Tecton if real-time transformations or managed operations become requirements."

---

### Q5: "How do you handle feature freshness and staleness?"

**Answer:**  
"Every feature view should have a **TTL (Time-To-Live)** that defines maximum allowed staleness. My approach:

1. **Set TTL per feature view**: User demographics (30 days), purchase history (1 day), real-time signals (1 hour)
2. **Monitor materialization lag**: Alert when the gap between latest data and online store exceeds a threshold
3. **Fallback strategy**: When features are stale, use default values or a simpler model rather than serving stale features
4. **Dashboard**: Track feature freshness, null rates, and distribution drift"
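
The fallback strategy in point 3 can be sketched as a small helper. The function name `resolve_feature` and its parameters are hypothetical, introduced only for illustration; a real system would read the value and its timestamp from the online store.

```python
from datetime import datetime, timedelta

def resolve_feature(value, updated_at, now, ttl, default):
    """Return (value_to_use, is_stale): the stored value if within TTL, else the default."""
    if value is None or now - updated_at > ttl:
        return default, True   # stale or missing: fall back and flag for monitoring
    return value, False

now = datetime(2024, 3, 1, 12, 0)
one_day = timedelta(days=1)

fresh = resolve_feature(42.5, datetime(2024, 3, 1, 6, 0), now, one_day, default=0.0)
stale = resolve_feature(42.5, datetime(2024, 2, 20), now, one_day, default=0.0)

print(fresh)  # (42.5, False)
print(stale)  # (0.0, True)
```

Returning the staleness flag alongside the value lets the serving layer count fallbacks, which feeds the freshness dashboard in point 4.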

---

## 🎓 Key Takeaways

1. **Training-serving skew** is one of the most common operational ML problems, and feature stores are designed to prevent it
2. **Dual-database architecture**: offline (batch/training) + online (real-time/serving)
3. **Point-in-time joins** prevent temporal leakage in training data
4. **Feast** is the go-to open-source feature store for Python-based ML teams
5. **Feature sharing** across teams reduces duplicate engineering work
6. **TTL and monitoring** ensure features stay fresh and reliable

---

➡️ **Next Lesson**: [Lesson 6: Apache Spark & PySpark](./lesson_06_spark_pyspark.ipynb) — Scale your data processing.