# Lesson 5: Feature Stores (Feast)

**Module 3: Data & Pipeline Engineering**  
**Estimated Time**: 2 hours  
**Difficulty**: Advanced

---

## 🎯 Learning Objectives

By the end of this lesson, you will:

✅ Understand the problem of Training-Serving Skew  
✅ Learn Feature Store architecture (Offline vs Online)  
✅ Implement a basic feature store using **Feast**  
✅ Master Point-in-Time Correctness (Time Travel)  
✅ Answer "Why Feature Store?" interview questions  

---

## 📚 Table of Contents

1. [The Problem: Inconsistent Features](#1-the-problem-inconsistent-features)
2. [What is a Feature Store?](#2-what-is-a-feature-store)
3. [Deep Dive: Feast Architecture](#3-deep-dive-feast-architecture)
4. [Hands-On: Defining Features](#4-hands-on-defining-features)
5. [Interview Preparation](#5-interview-preparation)

---

## 1. The Problem: Inconsistent Features

### Scenario
You have a feature: `average_transactions_last_7_days`.

- **Training**: Data Scientist implements it in SQL / Pandas on Warehouse data.
- **Serving**: Backend Engineer implements it in Java / Go on Production DB.

### Training-Serving Skew
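The two implementations mostly match but handle edge cases differently (e.g., timezones, nulls). A model trained on the Python-computed features then receives slightly different values from the Java implementation in production, and its predictions can degrade without any error being raised.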
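As a minimal sketch of how this skew can arise, the two functions below compute the same 7-day average but disagree on how to treat missing days. The data and function names are hypothetical and stand in for the SQL/Pandas and Java/Go implementations.

```python
from typing import List, Optional

# Hypothetical daily transaction counts for one user over 7 days.
# None marks a day with no record in the source table.
daily_counts: List[Optional[int]] = [4, None, 6, 2, None, 8, 1]


def avg_training(counts: List[Optional[int]]) -> float:
    """Training-side logic: skip missing days (pandas' mean() skips NaN by default)."""
    present = [c for c in counts if c is not None]
    return sum(present) / len(present)


def avg_serving(counts: List[Optional[int]]) -> float:
    """Serving-side logic: treat missing days as zero transactions."""
    return sum(c or 0 for c in counts) / len(counts)


train_value = avg_training(daily_counts)  # 21 / 5
serve_value = avg_serving(daily_counts)   # 21 / 7
print(f"training sees {train_value:.1f}, serving sees {serve_value:.1f}")
```

Both functions are defensible on their own; the skew comes from having two of them. The model learns from one distribution and is scored on another.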

**Solution**: Define the feature logic ONCE, serve it EVERYWHERE.

## 2. What is a Feature Store?

A feature store is a centralized interface between data engineering/operations and ML. It has three main components:

1. **Offline Store**: High latency, huge scale (S3/BigQuery). Used for generating **Training Data**.
2. **Online Store**: Low latency (Redis/DynamoDB). Used for **Real-time Inference**.
3. **Registry**: Single source of truth for feature definitions.
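The relationship between the three components can be sketched in plain Python. This is a toy model under stated assumptions, not Feast's implementation: a list stands in for the offline store, a dict for the online store, and `materialize` is a hypothetical helper that copies the latest value per entity from one to the other.

```python
from datetime import datetime

# Offline store: full, timestamped history (stands in for Parquet files or a warehouse table).
offline_store = [
    {"driver_id": 1001, "event_timestamp": datetime(2023, 1, 1), "conv_rate": 0.50},
    {"driver_id": 1001, "event_timestamp": datetime(2023, 1, 3), "conv_rate": 0.62},
    {"driver_id": 1002, "event_timestamp": datetime(2023, 1, 2), "conv_rate": 0.31},
]

# Registry: the single definition that both stores are built from.
registry = {"driver_hourly_stats": {"entity": "driver_id", "features": ["conv_rate"]}}


def materialize(history, view, as_of):
    """Copy the latest value per entity (at or before `as_of`) into a key-value map."""
    online = {}
    for row in sorted(history, key=lambda r: r["event_timestamp"]):
        if row["event_timestamp"] <= as_of:
            key = row[view["entity"]]
            # Later rows overwrite earlier ones, so only the newest value survives.
            online[key] = {name: row[name] for name in view["features"]}
    return online


# Online store: one row per entity, looked up by key at inference time.
online_store = materialize(
    offline_store, registry["driver_hourly_stats"], as_of=datetime(2023, 1, 4)
)
print(online_store[1001])
```

The offline store keeps every historical row so training data can be rebuilt for any past timestamp, while the online store keeps only what a single-key lookup needs.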

## 3. Deep Dive: Feast Architecture

**Feast** (short for **Fea**ture **St**ore) is a widely used open-source feature store.

### Point-in-Time Correctness
When generating training data, we need the feature values **as they were** at the time of the event.

- Event: Fraud check on `2023-01-05`.
- Feature: `avg_spend_7d`.
- We must compute the average from `2022-12-29` to `2023-01-05`.
- We must NOT include data from `2023-01-06` (Leakage!).

Feast handles this "Time Travel" join automatically.
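To make the idea concrete, here is a minimal pure-Python sketch of a point-in-time lookup for a single entity. It is not how Feast implements the join. The feature history, the `point_in_time_lookup` helper, and the 7-day `ttl` are assumptions chosen for illustration.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Hypothetical history of avg_spend_7d for one customer:
# (timestamp at which the value became known, value), sorted by timestamp.
feature_rows = [
    (datetime(2022, 12, 30), 40.0),
    (datetime(2023, 1, 4), 55.0),
    (datetime(2023, 1, 6), 900.0),  # written after the event; must not be used
]


def point_in_time_lookup(rows, event_ts, ttl):
    """Return the latest value with timestamp <= event_ts, or None if missing or older than ttl."""
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, event_ts)  # count of rows at or before event_ts
    if idx == 0:
        return None  # no feature value was known yet
    ts, value = rows[idx - 1]
    return value if event_ts - ts <= ttl else None


event_ts = datetime(2023, 1, 5)  # the fraud check
pit_value = point_in_time_lookup(feature_rows, event_ts, ttl=timedelta(days=7))
naive_value = feature_rows[-1][1]  # "latest value" join: leaks the 2023-01-06 row
print(f"point-in-time: {pit_value}, naive latest: {naive_value}")
```

The naive lookup returns the value written on `2023-01-06`, which did not exist when the fraud check ran. For whole tables, `pandas.merge_asof` with its default backward direction performs the same kind of "latest row at or before" match.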

## 4. Hands-On: Defining Features

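Simulating a Feast feature definition file (`features.py`). The project-level configuration (which offline and online stores to use) lives separately in `feature_store.yaml`.

**Note**: Feast requires running infrastructure, so we will look at the *Interface* code here.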

```python
# This is a conceptual representation of a feature definition file (features.py).
# The definitions are printed rather than executed, so this cell runs without Feast installed.
# Note: this is the older Feast API. Newer releases use Field/schema=, source=, and
# timestamp_field= in place of Feature/features=, batch_source=, and event_timestamp_column=.

print("Defining Feature View...")

code = """
from feast import Entity, Feature, FeatureView, FileSource, ValueType
from datetime import timedelta

# 1. Define Entity (The primary key)
driver = Entity(name="driver_id", value_type=ValueType.INT64)

# 2. Define Source (Where raw data lives)
driver_stats_source = FileSource(
    path="/data/driver_stats.parquet",
    event_timestamp_column="datetime",
    created_timestamp_column="created"
)

# 3. Define Feature View (Group of features)
driver_stats_view = FeatureView(
    name="driver_hourly_stats",
    entities=["driver_id"],
    ttl=timedelta(days=1),  # Feature values older than 1 day are ignored at lookup time
    features=[
        Feature(name="conv_rate", dtype=ValueType.FLOAT),
        Feature(name="acc_rate", dtype=ValueType.FLOAT),
    ],
    online=True,  # Make this view available in the online store (e.g., Redis)
    batch_source=driver_stats_source,
    tags={"owner": "driver_team"}
)
"""
print(code)

print("\n--- USAGE ---")
print("store = FeatureStore(repo_path='.')  # from feast import FeatureStore")
print("\n1. Training (Point-in-time join):")
print("training_df = store.get_historical_features(")
print("    entity_df=events_df,  # Contains entity keys and event timestamps")
print("    features=['driver_hourly_stats:conv_rate']")
print(").to_df()")

print("\n2. Serving (Low latency):")
print("features = store.get_online_features(")
print("    features=['driver_hourly_stats:conv_rate'],")
print("    entity_rows=[{'driver_id': 1001}]")
print(").to_dict()")
```

## 5. Interview Preparation

### Common Questions

#### Q1: "Why do we need an Online AND Offline store?"
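**Answer**: "They serve different access patterns. The Offline store (e.g., S3) is optimized for scanning huge datasets for training (throughput). The Online store (e.g., Redis) is optimized for single-record lookups for serving (latency). Feast copies the latest feature values from Offline to Online through materialization (`feast materialize`), which is typically run on a schedule rather than happening on its own."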

#### Q2: "What is Point-in-Time Correctness?"
**Answer**: "It ensures that for every training example, we join feature values that were known *at that specific timestamp*, avoiding future leakage. A plain SQL JOIN on the entity key fails this when features are updated in place, because it returns the latest value rather than the value at event time."

#### Q3: "How do you share features across teams?"
**Answer**: "Through the Feature Registry. Team A defines `driver_stats`. Team B, working on a different model, can discover `driver_stats` in the registry and reuse it without re-engineering the pipeline."