# Day 05: Feature Stores for ML Systems

## Production ML - Week 23

Feature stores are a critical component of production ML systems, providing a centralized repository for storing, managing, and serving features. In quantitative finance, feature stores help ensure consistency between training and inference while managing the complexity of financial time-series data.

### Learning Objectives
1. Understand feature store architecture and components
2. Implement offline and online feature stores
3. Handle point-in-time correctness for financial data
4. Build feature pipelines with versioning and lineage
5. Implement feature serving for real-time trading systems

### Why Feature Stores Matter in Finance
- **Training-Serving Skew**: Ensure features computed during training match production
- **Point-in-Time Correctness**: Prevent look-ahead bias in backtesting
- **Feature Reuse**: Share features across multiple models and teams
- **Data Quality**: Centralized validation and monitoring

In [None]:
import numpy as np
import pandas as pd
import sqlite3
import json
import hashlib
import pickle
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Any, Tuple, Union
from dataclasses import dataclass, field, asdict
from abc import ABC, abstractmethod
from collections import defaultdict
import threading
from concurrent.futures import ThreadPoolExecutor
import time
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
print("Feature Store Development Environment Ready")

---
## Part 1: Feature Store Architecture

A feature store typically consists of:
1. **Feature Registry**: Metadata about features (schema, lineage, ownership)
2. **Offline Store**: Historical features for training (data warehouse)
3. **Online Store**: Low-latency features for inference (key-value store)
4. **Feature Pipelines**: Transform raw data into features
5. **Feature Serving**: APIs for retrieving features
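
As a rough orientation before the fuller classes in this notebook, the five components can be sketched with plain Python containers. Everything in this block (`TinyFeatureStore`, `run_pipeline`, `materialize`, `serve`, and the `ret_1d` feature) is a hypothetical name chosen for illustration, not part of any library or of the implementation that follows.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]


@dataclass
class TinyFeatureStore:
    """Toy illustration of the five components (all names are hypothetical)."""
    registry: Dict[str, str] = field(default_factory=dict)    # 1. name -> description
    offline: List[Tuple[str, str, Features]] = field(default_factory=list)  # 2. full history
    online: Dict[str, Features] = field(default_factory=dict)  # 3. latest values only

    def run_pipeline(self, entity_id: str, ts: str, raw: Features,
                     fn: Callable[[Features], Features]) -> None:
        """4. Feature pipeline: transform raw data and append to the offline store."""
        self.offline.append((entity_id, ts, fn(raw)))

    def materialize(self) -> None:
        """Copy the most recent offline row per entity into the online store."""
        # ISO date strings sort chronologically, so later rows overwrite earlier ones.
        for entity_id, _, feats in sorted(self.offline, key=lambda r: r[1]):
            self.online[entity_id] = feats

    def serve(self, entity_id: str) -> Features:
        """5. Feature serving: key-based lookup of the latest values."""
        return self.online.get(entity_id, {})


def compute(raw: Features) -> Features:
    return {"ret_1d": raw["close"] / raw["prev_close"] - 1}


store = TinyFeatureStore()
store.registry["ret_1d"] = "One-day simple return"
store.run_pipeline("AAPL", "2024-06-13", {"close": 101.0, "prev_close": 100.0}, compute)
store.run_pipeline("AAPL", "2024-06-14", {"close": 103.02, "prev_close": 101.0}, compute)
store.materialize()
print(store.serve("AAPL"))
```

The offline list keeps both rows, while the online dict holds only the 2024-06-14 values. That split is the main design point the remaining parts elaborate on.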

In [None]:
@dataclass
class FeatureDefinition:
    """Metadata definition for a feature."""
    name: str
    dtype: str
    description: str
    entity: str  # e.g., 'ticker', 'portfolio'
    owner: str
    tags: List[str] = field(default_factory=list)
    version: int = 1
    created_at: datetime = field(default_factory=datetime.now)
    updated_at: datetime = field(default_factory=datetime.now)
    
    # Feature computation metadata
    computation_fn: Optional[str] = None  # Serialized function name
    dependencies: List[str] = field(default_factory=list)
    window_size: Optional[str] = None  # e.g., '20D', '1H'
    
    # Data quality constraints
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    nullable: bool = False
    
    def to_dict(self) -> Dict:
        d = asdict(self)
        d['created_at'] = self.created_at.isoformat()
        d['updated_at'] = self.updated_at.isoformat()
        return d
    
    @classmethod
    def from_dict(cls, d: Dict) -> 'FeatureDefinition':
        d = dict(d)  # Copy so the caller's dict is not mutated
        d['created_at'] = datetime.fromisoformat(d['created_at'])
        d['updated_at'] = datetime.fromisoformat(d['updated_at'])
        return cls(**d)


@dataclass
class FeatureGroup:
    """Collection of related features."""
    name: str
    entity: str
    description: str
    features: List[FeatureDefinition] = field(default_factory=list)
    event_timestamp_column: str = 'timestamp'
    created_at: datetime = field(default_factory=datetime.now)
    
    def add_feature(self, feature: FeatureDefinition):
        if feature.entity != self.entity:
            raise ValueError(f"Feature entity {feature.entity} doesn't match group entity {self.entity}")
        self.features.append(feature)
    
    def get_feature_names(self) -> List[str]:
        return [f.name for f in self.features]


# Example: Define financial features
price_momentum_20d = FeatureDefinition(
    name='price_momentum_20d',
    dtype='float64',
    description='20-day price momentum (return)',
    entity='ticker',
    owner='quant_team',
    tags=['momentum', 'technical', 'alpha'],
    window_size='20D',
    min_value=-1.0,
    max_value=5.0
)

volatility_20d = FeatureDefinition(
    name='volatility_20d',
    dtype='float64',
    description='20-day realized volatility (annualized)',
    entity='ticker',
    owner='quant_team',
    tags=['volatility', 'risk', 'technical'],
    window_size='20D',
    min_value=0.0,
    max_value=5.0
)

print(f"Feature: {price_momentum_20d.name}")
print(f"Description: {price_momentum_20d.description}")
print(f"Tags: {price_momentum_20d.tags}")

In [None]:
class FeatureRegistry:
    """Central registry for feature metadata."""
    
    def __init__(self, db_path: str = ':memory:'):
        self.conn = sqlite3.connect(db_path, check_same_thread=False)
        self._init_tables()
        self._lock = threading.Lock()
    
    def _init_tables(self):
        """Initialize registry tables."""
        cursor = self.conn.cursor()
        
        # Feature groups table
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS feature_groups (
                name TEXT PRIMARY KEY,
                entity TEXT NOT NULL,
                description TEXT,
                event_timestamp_column TEXT DEFAULT 'timestamp',
                created_at TEXT
            )
        ''')
        
        # Features table
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS features (
                name TEXT,
                group_name TEXT,
                dtype TEXT,
                description TEXT,
                entity TEXT,
                owner TEXT,
                tags TEXT,
                version INTEGER DEFAULT 1,
                computation_fn TEXT,
                dependencies TEXT,
                window_size TEXT,
                min_value REAL,
                max_value REAL,
                nullable INTEGER DEFAULT 0,
                created_at TEXT,
                updated_at TEXT,
                PRIMARY KEY (name, group_name, version)
            )
        ''')
        
        # Feature lineage table
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS feature_lineage (
                feature_name TEXT,
                source_feature TEXT,
                transformation TEXT,
                created_at TEXT
            )
        ''')
        
        self.conn.commit()
    
    def register_feature_group(self, group: FeatureGroup):
        """Register a feature group."""
        with self._lock:
            cursor = self.conn.cursor()
            
            # Insert group
            cursor.execute('''
                INSERT OR REPLACE INTO feature_groups 
                (name, entity, description, event_timestamp_column, created_at)
                VALUES (?, ?, ?, ?, ?)
            ''', (group.name, group.entity, group.description, 
                  group.event_timestamp_column, group.created_at.isoformat()))
            
            # Insert features
            for feature in group.features:
                cursor.execute('''
                    INSERT OR REPLACE INTO features
                    (name, group_name, dtype, description, entity, owner, tags, version,
                     computation_fn, dependencies, window_size, min_value, max_value,
                     nullable, created_at, updated_at)
                    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                ''', (feature.name, group.name, feature.dtype, feature.description,
                      feature.entity, feature.owner, json.dumps(feature.tags),
                      feature.version, feature.computation_fn,
                      json.dumps(feature.dependencies), feature.window_size,
                      feature.min_value, feature.max_value, int(feature.nullable),
                      feature.created_at.isoformat(), feature.updated_at.isoformat()))
            
            self.conn.commit()
    
    def get_feature(self, name: str, group_name: Optional[str] = None) -> Optional[FeatureDefinition]:
        """Get feature definition."""
        cursor = self.conn.cursor()
        
        if group_name:
            cursor.execute('''
                SELECT * FROM features WHERE name = ? AND group_name = ?
                ORDER BY version DESC LIMIT 1
            ''', (name, group_name))
        else:
            cursor.execute('''
                SELECT * FROM features WHERE name = ?
                ORDER BY version DESC LIMIT 1
            ''', (name,))
        
        row = cursor.fetchone()
        if row:
            return self._row_to_feature(row)
        return None
    
    def _row_to_feature(self, row) -> FeatureDefinition:
        """Convert database row to FeatureDefinition."""
        return FeatureDefinition(
            name=row[0],
            dtype=row[2],
            description=row[3],
            entity=row[4],
            owner=row[5],
            tags=json.loads(row[6]) if row[6] else [],
            version=row[7],
            computation_fn=row[8],
            dependencies=json.loads(row[9]) if row[9] else [],
            window_size=row[10],
            min_value=row[11],
            max_value=row[12],
            nullable=bool(row[13]),
            created_at=datetime.fromisoformat(row[14]),
            updated_at=datetime.fromisoformat(row[15])
        )
    
    def search_features(self, tags: Optional[List[str]] = None, 
                       entity: Optional[str] = None) -> List[FeatureDefinition]:
        """Search features by tags or entity."""
        cursor = self.conn.cursor()
        
        query = "SELECT * FROM features WHERE 1=1"
        params = []
        
        if entity:
            query += " AND entity = ?"
            params.append(entity)
        
        cursor.execute(query, params)
        rows = cursor.fetchall()
        
        features = [self._row_to_feature(row) for row in rows]
        
        if tags:
            features = [f for f in features if any(t in f.tags for t in tags)]
        
        return features
    
    def list_feature_groups(self) -> List[str]:
        """List all feature groups."""
        cursor = self.conn.cursor()
        cursor.execute("SELECT name FROM feature_groups")
        return [row[0] for row in cursor.fetchall()]


# Create registry and register features
registry = FeatureRegistry()

# Create technical features group
technical_group = FeatureGroup(
    name='technical_indicators',
    entity='ticker',
    description='Technical analysis features for equities'
)
technical_group.add_feature(price_momentum_20d)
technical_group.add_feature(volatility_20d)

# Register
registry.register_feature_group(technical_group)

# Search features
momentum_features = registry.search_features(tags=['momentum'])
print(f"Features with 'momentum' tag: {[f.name for f in momentum_features]}")

# Get specific feature
feature = registry.get_feature('volatility_20d')
print(f"\nFeature details: {feature.name} - {feature.description}")

---
## Part 2: Offline Feature Store

The offline store holds historical feature values for training. Key considerations:
- **Point-in-time correctness**: Features must be computed as they would have been at each historical point
- **Efficient storage**: Handle large time-series datasets
- **Versioning**: Track feature computation changes
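
A point-in-time join can also be expressed in pandas with `pd.merge_asof`, which matches each request to the most recent feature row at or before the request time. The block below is a minimal sketch under stated assumptions: the data values are invented, and the feature rows are stamped at 16:00 on the assumption that close-based features only become available at the close.

```python
import pandas as pd

# Hypothetical daily features, stamped at the time they became available.
features = pd.DataFrame({
    "entity_id": ["AAPL", "AAPL", "AAPL", "MSFT"],
    "feature_timestamp": pd.to_datetime([
        "2024-06-13 16:00", "2024-06-14 16:00",
        "2024-06-17 16:00", "2024-06-14 16:00",
    ]),
    "volatility_20d": [0.21, 0.22, 0.25, 0.18],
})

# Hypothetical requests (e.g. trade signals) that need features "as of" a time.
requests = pd.DataFrame({
    "entity_id": ["AAPL", "MSFT", "AAPL"],
    "timestamp": pd.to_datetime([
        "2024-06-14 10:30", "2024-06-14 15:00", "2024-06-17 09:30",
    ]),
})

# merge_asof requires both frames to be sorted by their time key.
pit = pd.merge_asof(
    requests.sort_values("timestamp"),
    features.sort_values("feature_timestamp"),
    left_on="timestamp",
    right_on="feature_timestamp",
    by="entity_id",          # match within the same entity only
    direction="backward",    # only feature rows at or before the request time
)
print(pit)
```

The MSFT request at 15:00 gets no match because its only feature row is stamped an hour later. Returning a missing value here, rather than the 16:00 row, is what prevents look-ahead bias.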

In [None]:
class OfflineFeatureStore:
    """Offline store for historical feature storage and retrieval."""
    
    def __init__(self, db_path: str = ':memory:'):
        self.conn = sqlite3.connect(db_path, check_same_thread=False)
        self._feature_tables: Dict[str, bool] = {}
        self._lock = threading.Lock()
    
    def _create_feature_table(self, group_name: str, entity_col: str = 'entity_id'):
        """Create table for a feature group."""
        table_name = f"features_{group_name}"
        
        if table_name in self._feature_tables:
            return
        
        cursor = self.conn.cursor()
        cursor.execute(f'''
            CREATE TABLE IF NOT EXISTS {table_name} (
                {entity_col} TEXT NOT NULL,
                event_timestamp TEXT NOT NULL,
                created_timestamp TEXT NOT NULL,
                feature_data TEXT NOT NULL,
                PRIMARY KEY ({entity_col}, event_timestamp)
            )
        ''')
        
        # Index for efficient point-in-time lookups
        cursor.execute(f'''
            CREATE INDEX IF NOT EXISTS idx_{table_name}_time 
            ON {table_name} (event_timestamp)
        ''')
        
        self.conn.commit()
        self._feature_tables[table_name] = True
    
    def write_features(self, group_name: str, df: pd.DataFrame,
                       entity_col: str = 'entity_id',
                       timestamp_col: str = 'timestamp'):
        """Write features to offline store."""
        self._create_feature_table(group_name, entity_col)
        table_name = f"features_{group_name}"
        
        # Identify feature columns (exclude entity and timestamp)
        feature_cols = [c for c in df.columns if c not in [entity_col, timestamp_col]]
        
        with self._lock:
            cursor = self.conn.cursor()
            created_ts = datetime.now().isoformat()
            
            for _, row in df.iterrows():
                feature_data = {col: row[col] for col in feature_cols}
                
                # Handle NaN values
                feature_data = {
                    k: (None if pd.isna(v) else v) 
                    for k, v in feature_data.items()
                }
                
                cursor.execute(f'''
                    INSERT OR REPLACE INTO {table_name}
                    ({entity_col}, event_timestamp, created_timestamp, feature_data)
                    VALUES (?, ?, ?, ?)
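                ''', (str(row[entity_col]),
                      # Store ISO 'T'-separated timestamps so string comparisons
                      # agree with the isoformat() values used in read queries
                      pd.Timestamp(row[timestamp_col]).isoformat(),
                      created_ts,
                      json.dumps(feature_data)))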
            
            self.conn.commit()
    
    def read_features(self, group_name: str, 
                      entity_ids: List[str],
                      start_time: datetime,
                      end_time: datetime,
                      feature_names: Optional[List[str]] = None) -> pd.DataFrame:
        """Read historical features."""
        table_name = f"features_{group_name}"
        
        cursor = self.conn.cursor()
        placeholders = ','.join(['?' for _ in entity_ids])
        
        cursor.execute(f'''
            SELECT entity_id, event_timestamp, feature_data
            FROM {table_name}
            WHERE entity_id IN ({placeholders})
            AND event_timestamp >= ?
            AND event_timestamp <= ?
            ORDER BY event_timestamp
        ''', (*entity_ids, start_time.isoformat(), end_time.isoformat()))
        
        rows = cursor.fetchall()
        
        if not rows:
            return pd.DataFrame()
        
        # Parse results
        data = []
        for entity_id, event_ts, feature_json in rows:
            features = json.loads(feature_json)
            features['entity_id'] = entity_id
            features['timestamp'] = pd.to_datetime(event_ts)
            data.append(features)
        
        df = pd.DataFrame(data)
        
        if feature_names:
            cols = ['entity_id', 'timestamp'] + feature_names
            df = df[[c for c in cols if c in df.columns]]
        
        return df
    
    def get_point_in_time_features(self, group_name: str,
                                    entity_timestamps: pd.DataFrame,
                                    entity_col: str = 'entity_id',
                                    timestamp_col: str = 'timestamp',
                                    feature_names: Optional[List[str]] = None) -> pd.DataFrame:
        """
        Get features as of specific timestamps (point-in-time join).
        
        This is crucial for preventing look-ahead bias in backtesting.
        For each (entity, timestamp) pair, return the most recent feature
        row whose event_timestamp is at or before the requested timestamp.
        """
        table_name = f"features_{group_name}"
        results = []
        
        cursor = self.conn.cursor()
        
        for _, row in entity_timestamps.iterrows():
            entity_id = str(row[entity_col])
            as_of_time = row[timestamp_col]
            
            # Get most recent feature before or at as_of_time
            cursor.execute(f'''
                SELECT entity_id, event_timestamp, feature_data
                FROM {table_name}
                WHERE entity_id = ?
                AND event_timestamp <= ?
                ORDER BY event_timestamp DESC
                LIMIT 1
            ''', (entity_id, as_of_time.isoformat()))
            
            result_row = cursor.fetchone()
            
            if result_row:
                features = json.loads(result_row[2])
                features['entity_id'] = entity_id
                features['request_timestamp'] = as_of_time
                features['feature_timestamp'] = pd.to_datetime(result_row[1])
                results.append(features)
            else:
                # No features available at this time
                results.append({
                    'entity_id': entity_id,
                    'request_timestamp': as_of_time,
                    'feature_timestamp': None
                })
        
        df = pd.DataFrame(results)
        
        if feature_names and not df.empty:
            cols = ['entity_id', 'request_timestamp', 'feature_timestamp'] + feature_names
            df = df[[c for c in cols if c in df.columns]]
        
        return df


# Create offline store
offline_store = OfflineFeatureStore()
print("Offline Feature Store initialized")

In [None]:
# Generate sample financial data and features
def generate_sample_features(tickers: List[str], 
                             start_date: str, 
                             end_date: str) -> pd.DataFrame:
    """Generate sample feature data for demonstration."""
    dates = pd.date_range(start=start_date, end=end_date, freq='B')
    
    all_data = []
    for ticker in tickers:
        # Simulate prices
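        # Built-in hash() of str is randomized per process, so derive a stable seed
        np.random.seed(int(hashlib.md5(ticker.encode()).hexdigest(), 16) % 2**32)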
        returns = np.random.normal(0.0005, 0.02, len(dates))
        prices = 100 * np.exp(np.cumsum(returns))
        
        for i, date in enumerate(dates):
            # Compute features
            if i >= 20:
                momentum_20d = prices[i] / prices[i-20] - 1
                volatility_20d = np.std(returns[i-20:i]) * np.sqrt(252)
                volume_ma_ratio = np.random.uniform(0.8, 1.2)
                rsi = 50 + np.random.normal(0, 15)
                rsi = np.clip(rsi, 0, 100)
            else:
                momentum_20d = np.nan
                volatility_20d = np.nan
                volume_ma_ratio = np.nan
                rsi = np.nan
            
            all_data.append({
                'entity_id': ticker,
                'timestamp': date,
                'price': prices[i],
                'price_momentum_20d': momentum_20d,
                'volatility_20d': volatility_20d,
                'volume_ma_ratio': volume_ma_ratio,
                'rsi': rsi
            })
    
    return pd.DataFrame(all_data)


# Generate and store features
tickers = ['AAPL', 'GOOGL', 'MSFT', 'AMZN', 'META']
features_df = generate_sample_features(tickers, '2024-01-01', '2024-12-31')

print(f"Generated {len(features_df)} feature records")
print(f"\nSample data:")
features_df.head(25).tail(5)

In [None]:
# Write features to offline store
offline_store.write_features(
    group_name='technical_indicators',
    df=features_df
)

# Read features for specific entities and time range
read_df = offline_store.read_features(
    group_name='technical_indicators',
    entity_ids=['AAPL', 'GOOGL'],
    start_time=datetime(2024, 6, 1),
    end_time=datetime(2024, 6, 30),
    feature_names=['price_momentum_20d', 'volatility_20d']
)

print(f"Retrieved {len(read_df)} records")
read_df.head(10)

In [None]:
# Demonstrate point-in-time join (critical for backtesting)
# Create sample entity-timestamp requests (e.g., trade signals)
trade_signals = pd.DataFrame({
    'entity_id': ['AAPL', 'GOOGL', 'AAPL', 'MSFT', 'AAPL'],
    'timestamp': pd.to_datetime([
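        '2024-06-15 10:30:00',  # Saturday - falls back to the row stamped 2024-06-14
        '2024-06-15 14:00:00',  # Saturday - same fallback for GOOGL
        '2024-06-20 09:30:00',  # Thursday morning - matches the row stamped 2024-06-20 00:00
        # Caveat: daily rows are stamped at midnight, so close-based features would
        # need a close-time stamp to avoid look-ahead on intraday requests
        '2024-06-25 16:00:00',  # End of day - matches the row stamped 2024-06-25
        '2024-02-01 12:00:00',  # Early in year - matches the row stamped 2024-02-01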
    ])
})

# Get point-in-time features
pit_features = offline_store.get_point_in_time_features(
    group_name='technical_indicators',
    entity_timestamps=trade_signals,
    feature_names=['price_momentum_20d', 'volatility_20d', 'rsi']
)

print("Point-in-Time Feature Join Result:")
print("(Note: feature_timestamp is the event timestamp of the matched feature row)")
pit_features

---
## Part 3: Online Feature Store

The online store provides low-latency feature retrieval for real-time inference. Key characteristics:
- Low read latency (sub-millisecond in memory; typically a few milliseconds over a network)
- Key-value based access
- Stores only the latest feature values
- Synchronized from offline store
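
Synchronizing from the offline store usually means copying the newest row per entity, as of some cutoff, into the key-value store. The following is a minimal sketch of that materialization step in pandas. The function name `latest_per_entity` and the sample values are hypothetical, and the column names `entity_id` and `timestamp` are assumed to match the frames used elsewhere in this notebook.

```python
import pandas as pd
from typing import Any, Dict


def latest_per_entity(history: pd.DataFrame,
                      as_of: pd.Timestamp) -> Dict[str, Dict[str, Any]]:
    """Return the newest feature row per entity with timestamp <= as_of."""
    eligible = history[history["timestamp"] <= as_of]
    # Sort by time, then keep the last complete row of each entity.
    latest = eligible.sort_values("timestamp").groupby("entity_id").tail(1)
    return (
        latest.set_index("entity_id")
        .drop(columns="timestamp")
        .to_dict(orient="index")
    )


history = pd.DataFrame({
    "entity_id": ["AAPL", "AAPL", "MSFT", "MSFT"],
    "timestamp": pd.to_datetime(["2024-06-13", "2024-06-14",
                                 "2024-06-13", "2024-06-17"]),
    "rsi": [55.0, 60.0, 40.0, 45.0],
})

snapshot = latest_per_entity(history, pd.Timestamp("2024-06-14"))
print(snapshot)
```

Using `tail(1)` after sorting keeps whole rows together. By contrast, `groupby(...).last()` takes the last non-null value of each column independently, so it can mix values from different timestamps when some features are missing.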

In [None]:
class OnlineFeatureStore:
    """Low-latency online store for real-time feature serving."""
    
    def __init__(self):
        # In production: Redis, DynamoDB, or similar
        # Here: in-memory dict with TTL support
        self._store: Dict[str, Dict[str, Any]] = defaultdict(dict)
        self._timestamps: Dict[str, datetime] = {}
        self._ttl_seconds: Dict[str, int] = {}
        self._lock = threading.Lock()
        
        # Metrics
        self._metrics = {
            'reads': 0,
            'writes': 0,
            'cache_hits': 0,
            'cache_misses': 0,
            'latencies': []
        }
    
    def _make_key(self, group_name: str, entity_id: str) -> str:
        """Create composite key."""
        return f"{group_name}:{entity_id}"
    
    def write_feature(self, group_name: str, entity_id: str,
                      features: Dict[str, Any], ttl_seconds: int = 86400):
        """Write features for an entity."""
        key = self._make_key(group_name, entity_id)
        
        with self._lock:
            self._store[key] = features.copy()
            self._timestamps[key] = datetime.now()
            self._ttl_seconds[key] = ttl_seconds
            self._metrics['writes'] += 1
    
    def write_batch(self, group_name: str, 
                    entity_features: Dict[str, Dict[str, Any]],
                    ttl_seconds: int = 86400):
        """Batch write features for multiple entities."""
        for entity_id, features in entity_features.items():
            self.write_feature(group_name, entity_id, features, ttl_seconds)
    
    def read_feature(self, group_name: str, entity_id: str,
                     feature_names: Optional[List[str]] = None) -> Optional[Dict[str, Any]]:
        """Read features for an entity with latency tracking."""
        start_time = time.perf_counter()
        key = self._make_key(group_name, entity_id)
        
        with self._lock:
            self._metrics['reads'] += 1
            
            if key not in self._store:
                self._metrics['cache_misses'] += 1
                return None
            
            # Check TTL
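            if self._is_expired(key):
                # Remove every trace of the expired key, not just its value
                del self._store[key]
                self._timestamps.pop(key, None)
                self._ttl_seconds.pop(key, None)
                self._metrics['cache_misses'] += 1
                return None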
            
            self._metrics['cache_hits'] += 1
            features = self._store[key].copy()
        
        # Track latency
        latency = (time.perf_counter() - start_time) * 1000  # ms
        self._metrics['latencies'].append(latency)
        
        if feature_names:
            features = {k: v for k, v in features.items() if k in feature_names}
        
        return features
    
    def read_batch(self, group_name: str, entity_ids: List[str],
                   feature_names: Optional[List[str]] = None) -> Dict[str, Dict[str, Any]]:
        """Batch read features for multiple entities."""
        results = {}
        for entity_id in entity_ids:
            features = self.read_feature(group_name, entity_id, feature_names)
            if features:
                results[entity_id] = features
        return results
    
    def _is_expired(self, key: str) -> bool:
        """Check if a key has expired."""
        if key not in self._timestamps:
            return True
        age = (datetime.now() - self._timestamps[key]).total_seconds()
        return age > self._ttl_seconds.get(key, 86400)
    
    def get_metrics(self) -> Dict[str, Any]:
        """Get store metrics."""
        latencies = self._metrics['latencies']
        return {
            'total_reads': self._metrics['reads'],
            'total_writes': self._metrics['writes'],
            'cache_hit_rate': (
                self._metrics['cache_hits'] / max(1, self._metrics['reads'])
            ),
            'avg_latency_ms': np.mean(latencies) if latencies else 0,
            'p50_latency_ms': np.percentile(latencies, 50) if latencies else 0,
            'p99_latency_ms': np.percentile(latencies, 99) if latencies else 0,
            'stored_keys': len(self._store)
        }


# Create online store
online_store = OnlineFeatureStore()

# Populate with latest features for each ticker
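# tail(1) keeps the whole last row; .last() would take the last non-null value per column
latest_features = features_df.sort_values('timestamp').groupby('entity_id').tail(1)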

for _, row in latest_features.iterrows():
    online_store.write_feature(
        group_name='technical_indicators',
        entity_id=row['entity_id'],
        features={
            'price': row['price'],
            'price_momentum_20d': row['price_momentum_20d'],
            'volatility_20d': row['volatility_20d'],
            'rsi': row['rsi']
        }
    )

print("Online store populated with latest features")
print(f"Stored keys: {len(online_store._store)}")

In [None]:
# Simulate real-time feature serving
def simulate_realtime_inference(online_store: OnlineFeatureStore,
                                 num_requests: int = 1000):
    """Simulate real-time model inference with feature fetching."""
    tickers = ['AAPL', 'GOOGL', 'MSFT', 'AMZN', 'META']
    feature_names = ['price_momentum_20d', 'volatility_20d']
    
    for _ in range(num_requests):
        ticker = np.random.choice(tickers)
        features = online_store.read_feature(
            'technical_indicators', ticker, feature_names
        )
    
    return online_store.get_metrics()


# Run simulation
metrics = simulate_realtime_inference(online_store, num_requests=5000)

print("Online Store Performance Metrics:")
print(f"  Total reads: {metrics['total_reads']}")
print(f"  Cache hit rate: {metrics['cache_hit_rate']:.2%}")
print(f"  Avg latency: {metrics['avg_latency_ms']:.4f} ms")
print(f"  P50 latency: {metrics['p50_latency_ms']:.4f} ms")
print(f"  P99 latency: {metrics['p99_latency_ms']:.4f} ms")

In [None]:
# Batch feature retrieval for portfolio
portfolio_tickers = ['AAPL', 'GOOGL', 'MSFT']

portfolio_features = online_store.read_batch(
    group_name='technical_indicators',
    entity_ids=portfolio_tickers,
    feature_names=['price_momentum_20d', 'volatility_20d', 'rsi']
)

print("Portfolio Features (Real-time):")
for ticker, features in portfolio_features.items():
    print(f"\n{ticker}:")
    for name, value in features.items():
        print(f"  {name}: {value:.4f}")

---
## Part 4: Feature Pipelines

Feature pipelines transform raw data into features. Key components:
- **Transformation logic**: Reusable feature computation functions
- **Scheduling**: Batch and streaming pipelines
- **Validation**: Ensure data quality
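
As an illustration of the validation step, the constraints declared on `FeatureDefinition` in Part 1 (`min_value`, `max_value`, `nullable`) can be checked against a computed column. This is a minimal sketch only: `check_feature` is a hypothetical helper, not the validator class this notebook's pipeline uses, and the sample values are invented.

```python
import pandas as pd
from typing import Dict, Optional


def check_feature(series: pd.Series,
                  min_value: Optional[float] = None,
                  max_value: Optional[float] = None,
                  nullable: bool = False) -> Dict[str, int]:
    """Count constraint violations for one feature column."""
    violations = {"nulls": 0, "below_min": 0, "above_max": 0}
    if not nullable:
        violations["nulls"] = int(series.isna().sum())
    if min_value is not None:
        # Comparisons with NaN are False, so nulls are not double-counted here.
        violations["below_min"] = int((series < min_value).sum())
    if max_value is not None:
        violations["above_max"] = int((series > max_value).sum())
    return violations


vol = pd.Series([0.18, 0.25, None, -0.1, 7.0], name="volatility_20d")
report = check_feature(vol, min_value=0.0, max_value=5.0)
print(report)
```

A pipeline could refuse to write a batch when any count is non-zero, or log the report and continue, depending on how strict the downstream models need the data to be.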

In [None]:
class FeatureTransformation(ABC):
    """Base class for feature transformations."""
    
    @abstractmethod
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Apply transformation to dataframe."""
        pass
    
    @abstractmethod
    def get_required_columns(self) -> List[str]:
        """Get columns required for transformation."""
        pass
    
    @abstractmethod
    def get_output_columns(self) -> List[str]:
        """Get columns produced by transformation."""
        pass


class MomentumFeatures(FeatureTransformation):
    """Compute momentum-based features."""
    
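    def __init__(self, windows: Optional[List[int]] = None):
        # Avoid a mutable default argument shared across instances
        self.windows = list(windows) if windows is not None else [5, 10, 20, 60]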
    
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        result = df.copy()
        
        for window in self.windows:
            # Price momentum
            result[f'momentum_{window}d'] = (
                result.groupby('entity_id')['close']
                .pct_change(window)
            )
            
            # Normalized momentum (z-score)
            result[f'momentum_{window}d_zscore'] = (
                result.groupby('entity_id')[f'momentum_{window}d']
                .transform(lambda x: (x - x.rolling(60).mean()) / x.rolling(60).std())
            )
        
        return result
    
    def get_required_columns(self) -> List[str]:
        return ['entity_id', 'close']
    
    def get_output_columns(self) -> List[str]:
        cols = []
        for w in self.windows:
            cols.extend([f'momentum_{w}d', f'momentum_{w}d_zscore'])
        return cols


class VolatilityFeatures(FeatureTransformation):
    """Compute volatility-based features."""
    
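    def __init__(self, windows: Optional[List[int]] = None):
        # Avoid a mutable default argument shared across instances
        self.windows = list(windows) if windows is not None else [10, 20, 60]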
    
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        result = df.copy()
        
        # Daily returns
        result['daily_return'] = result.groupby('entity_id')['close'].pct_change()
        
        for window in self.windows:
            # Realized volatility (annualized)
            result[f'volatility_{window}d'] = (
                result.groupby('entity_id')['daily_return']
                .transform(lambda x: x.rolling(window).std() * np.sqrt(252))
            )
            
            # Volatility ratio (short/long term)
            if window > 10:
                result[f'vol_ratio_10_{window}'] = (
                    result['volatility_10d'] / result[f'volatility_{window}d']
                )
        
        return result
    
    def get_required_columns(self) -> List[str]:
        return ['entity_id', 'close']
    
    def get_output_columns(self) -> List[str]:
        cols = ['daily_return']
        for w in self.windows:
            cols.append(f'volatility_{w}d')
            if w > 10:
                cols.append(f'vol_ratio_10_{w}')
        return cols


class TechnicalIndicators(FeatureTransformation):
    """Compute technical indicators."""
    
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        result = df.copy()
        
        # RSI
        delta = result.groupby('entity_id')['close'].diff()
        gain = delta.where(delta > 0, 0)
        loss = (-delta).where(delta < 0, 0)
        
        avg_gain = gain.groupby(result['entity_id']).transform(
            lambda x: x.rolling(14).mean()
        )
        avg_loss = loss.groupby(result['entity_id']).transform(
            lambda x: x.rolling(14).mean()
        )
        
        rs = avg_gain / avg_loss.replace(0, np.nan)
        result['rsi_14'] = 100 - (100 / (1 + rs))
        
        # MACD
        ema_12 = result.groupby('entity_id')['close'].transform(
            lambda x: x.ewm(span=12).mean()
        )
        ema_26 = result.groupby('entity_id')['close'].transform(
            lambda x: x.ewm(span=26).mean()
        )
        result['macd'] = ema_12 - ema_26
        result['macd_signal'] = result.groupby('entity_id')['macd'].transform(
            lambda x: x.ewm(span=9).mean()
        )
        result['macd_hist'] = result['macd'] - result['macd_signal']
        
        # Bollinger Bands
        sma_20 = result.groupby('entity_id')['close'].transform(
            lambda x: x.rolling(20).mean()
        )
        std_20 = result.groupby('entity_id')['close'].transform(
            lambda x: x.rolling(20).std()
        )
        result['bb_upper'] = sma_20 + 2 * std_20
        result['bb_lower'] = sma_20 - 2 * std_20
        result['bb_position'] = (
            (result['close'] - result['bb_lower']) / 
            (result['bb_upper'] - result['bb_lower'])
        )
        
        return result
    
    def get_required_columns(self) -> List[str]:
        return ['entity_id', 'close']
    
    def get_output_columns(self) -> List[str]:
        return ['rsi_14', 'macd', 'macd_signal', 'macd_hist', 
                'bb_upper', 'bb_lower', 'bb_position']


print("Feature transformations defined")
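
For reference, the indicators computed by `TechnicalIndicators` correspond to the definitions below. The notation is introduced here for illustration and does not appear in the code: $C_t$ is the close, $\Delta C_t = C_t - C_{t-1}$, $\mathrm{SMA}_n$ and $\mathrm{EMA}_n$ are the $n$-period simple and exponential moving averages, and $\sigma_{20}$ is the 20-period rolling standard deviation of the close. All quantities are computed per `entity_id`.

$$
\begin{aligned}
\mathrm{RS}_t &= \frac{\mathrm{SMA}_{14}\big(\max(\Delta C_t, 0)\big)}{\mathrm{SMA}_{14}\big(\max(-\Delta C_t, 0)\big)}, &
\mathrm{RSI}_t &= 100 - \frac{100}{1 + \mathrm{RS}_t} \\
\mathrm{MACD}_t &= \mathrm{EMA}_{12}(C_t) - \mathrm{EMA}_{26}(C_t), &
\mathrm{Signal}_t &= \mathrm{EMA}_{9}(\mathrm{MACD}_t) \\
\mathrm{Hist}_t &= \mathrm{MACD}_t - \mathrm{Signal}_t \\
\mathrm{BB}^{\text{upper}}_t &= \mathrm{SMA}_{20}(C_t) + 2\,\sigma_{20,t}, &
\mathrm{BB}^{\text{lower}}_t &= \mathrm{SMA}_{20}(C_t) - 2\,\sigma_{20,t} \\
\mathrm{BB}^{\text{pos}}_t &= \frac{C_t - \mathrm{BB}^{\text{lower}}_t}{\mathrm{BB}^{\text{upper}}_t - \mathrm{BB}^{\text{lower}}_t}
\end{aligned}
$$

A `bb_position` of 0 means the close sits on the lower band, 1 means it sits on the upper band, and values outside $[0, 1]$ indicate a close beyond the bands.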

In [None]:
class FeaturePipeline:
    """Pipeline for computing and validating features."""
    
    def __init__(self, name: str):
        self.name = name
        self.transformations: List[FeatureTransformation] = []
        self.validators: List['FeatureValidator'] = []
    
    def add_transformation(self, transformation: FeatureTransformation):
        """Add a transformation to the pipeline."""
        self.transformations.append(transformation)
        return self
    
    def add_validator(self, validator: 'FeatureValidator'):
        """Add a validator to the pipeline."""
        self.validators.append(validator)
        return self
    
    def run(self, df: pd.DataFrame, validate: bool = True) -> Tuple[pd.DataFrame, Dict]:
        """Run the pipeline on input data."""
        result = df.copy()
        metrics = {
            'input_rows': len(df),
            'transformations': [],
            'validation_results': []
        }
        
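        # Apply transformations
        for transform in self.transformations:
            # Fail early with a clear message if an input column is missing
            missing = set(transform.get_required_columns()) - set(result.columns)
            if missing:
                raise ValueError(
                    f"{transform.__class__.__name__} is missing required "
                    f"columns: {sorted(missing)}"
                )
            
            # perf_counter() is monotonic, which makes it the safer clock for timing
            start_time = time.perf_counter()
            result = transform.transform(result)
            elapsed = time.perf_counter() - start_time
            
            metrics['transformations'].append({
                'name': transform.__class__.__name__,
                'output_columns': transform.get_output_columns(),
                'elapsed_seconds': elapsed
            })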
        
        # Validate results
        if validate:
            for validator in self.validators:
                validation_result = validator.validate(result)
                metrics['validation_results'].append(validation_result)
        
        metrics['output_rows'] = len(result)
        metrics['output_columns'] = list(result.columns)
        
        return result, metrics


class FeatureValidator:
    """Validate feature quality."""
    
    def __init__(self, name: str):
        self.name = name
        self.rules: List[Dict] = []
    
    def add_null_check(self, columns: List[str], max_null_pct: float = 0.1):
        """Add null value check."""
        self.rules.append({
            'type': 'null_check',
            'columns': columns,
            'max_null_pct': max_null_pct
        })
        return self
    
    def add_range_check(self, column: str, min_val: float, max_val: float):
        """Add value range check."""
        self.rules.append({
            'type': 'range_check',
            'column': column,
            'min': min_val,
            'max': max_val
        })
        return self
    
    def add_distribution_check(self, column: str, 
                                expected_mean: float, 
                                tolerance: float = 0.5):
        """Add distribution check."""
        self.rules.append({
            'type': 'distribution_check',
            'column': column,
            'expected_mean': expected_mean,
            'tolerance': tolerance
        })
        return self
    
    def validate(self, df: pd.DataFrame) -> Dict:
        """Run all validation rules."""
        results = {
            'validator': self.name,
            'passed': True,
            'checks': []
        }
        
        for rule in self.rules:
            check_result = self._run_check(df, rule)
            results['checks'].append(check_result)
            if not check_result['passed']:
                results['passed'] = False
        
        return results
    
    def _run_check(self, df: pd.DataFrame, rule: Dict) -> Dict:
        """Run a single validation check."""
        check_type = rule['type']
        
        if check_type == 'null_check':
            null_pcts = {}
            passed = True
            for col in rule['columns']:
                if col in df.columns:
                    null_pct = df[col].isna().mean()
                    null_pcts[col] = null_pct
                    if null_pct > rule['max_null_pct']:
                        passed = False
            return {
                'type': 'null_check',
                'passed': passed,
                'details': null_pcts
            }
        
        elif check_type == 'range_check':
            col = rule['column']
            if col not in df.columns:
                return {'type': 'range_check', 'passed': False, 'error': 'column not found'}
            
            out_of_range = (
                (df[col] < rule['min']) | (df[col] > rule['max'])
            ).sum()
            total = df[col].notna().sum()
            
            # Cast NumPy scalars to built-in types so the result serializes cleanly
            return {
                'type': 'range_check',
                'column': col,
                'passed': bool(out_of_range == 0),
                'out_of_range_count': int(out_of_range),
                'out_of_range_pct': float(out_of_range / max(1, total))
            }
        
        elif check_type == 'distribution_check':
            col = rule['column']
            if col not in df.columns:
                return {'type': 'distribution_check', 'passed': False, 'error': 'column not found'}
            
            actual_mean = df[col].mean()
            diff = abs(actual_mean - rule['expected_mean'])
            passed = diff <= rule['tolerance']
            
            return {
                'type': 'distribution_check',
                'column': col,
                'passed': passed,
                'expected_mean': rule['expected_mean'],
                'actual_mean': actual_mean,
                'difference': diff
            }
        
        return {'type': check_type, 'passed': False, 'error': 'unknown check type'}


print("Feature Pipeline classes defined")

In [None]:
# Generate raw OHLCV data for pipeline
def generate_ohlcv_data(tickers: List[str], num_days: int = 500) -> pd.DataFrame:
    """Generate sample OHLCV data."""
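    # Normalize to midnight so timestamps carry no time-of-day component
    dates = pd.date_range(end=pd.Timestamp.now().normalize(), periods=num_days, freq='B')
    
    all_data = []
    for ticker in tickers:
        # Built-in hash() is salted per process for strings, so derive the seed
        # from a stable digest to keep each ticker's series reproducible
        seed = int(hashlib.md5(ticker.encode()).hexdigest(), 16) % 2**32
        np.random.seed(seed)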
        
        # Generate price series
        returns = np.random.normal(0.0005, 0.02, num_days)
        close = 100 * np.exp(np.cumsum(returns))
        
        # Generate OHLCV
        for i, date in enumerate(dates):
            daily_vol = np.random.uniform(0.005, 0.02)
            open_price = close[i] * (1 + np.random.uniform(-daily_vol, daily_vol))
            high = max(open_price, close[i]) * (1 + np.random.uniform(0, daily_vol))
            low = min(open_price, close[i]) * (1 - np.random.uniform(0, daily_vol))
            volume = np.random.randint(1000000, 10000000)
            
            all_data.append({
                'entity_id': ticker,
                'timestamp': date,
                'open': open_price,
                'high': high,
                'low': low,
                'close': close[i],
                'volume': volume
            })
    
    return pd.DataFrame(all_data)


# Create raw data
raw_data = generate_ohlcv_data(['AAPL', 'GOOGL', 'MSFT', 'AMZN', 'META'])
print(f"Raw data shape: {raw_data.shape}")
raw_data.head()

In [None]:
# Build feature pipeline
pipeline = FeaturePipeline('equity_features')

# Add transformations
pipeline.add_transformation(MomentumFeatures(windows=[5, 10, 20]))
pipeline.add_transformation(VolatilityFeatures(windows=[10, 20]))
pipeline.add_transformation(TechnicalIndicators())

# Add validators
validator = FeatureValidator('quality_checks')
validator.add_null_check(['momentum_20d', 'volatility_20d'], max_null_pct=0.15)
validator.add_range_check('rsi_14', min_val=0, max_val=100)
validator.add_distribution_check('rsi_14', expected_mean=50, tolerance=15)

pipeline.add_validator(validator)

# Run pipeline
features_result, pipeline_metrics = pipeline.run(raw_data)

print("Pipeline Execution Results:")
print(f"  Input rows: {pipeline_metrics['input_rows']}")
print(f"  Output rows: {pipeline_metrics['output_rows']}")
print(f"  Output columns: {len(pipeline_metrics['output_columns'])}")

print("\nTransformations:")
for t in pipeline_metrics['transformations']:
    print(f"  {t['name']}: {t['elapsed_seconds']:.3f}s, {len(t['output_columns'])} features")

print("\nValidation Results:")
for v in pipeline_metrics['validation_results']:
    status = '✓' if v['passed'] else '✗'
    print(f"  {status} {v['validator']}")
    for check in v['checks']:
        check_status = '✓' if check['passed'] else '✗'
        print(f"    {check_status} {check['type']}")
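
The `max_null_pct` threshold in the null check has to leave room for the warm-up period of rolling features, since the first rows of each entity cannot have a full window. The sketch below estimates that warm-up fraction for a single 500-row series. It assumes, hypothetically, that a 20-day momentum is computed with `pct_change(20)`; the actual `MomentumFeatures` implementation may differ.

```python
import numpy as np
import pandas as pd

n_rows, window = 500, 20

# Synthetic close prices (illustrative only)
rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.02, n_rows))))

# pct_change(window) needs a value `window` rows back: first `window` rows are NaN
momentum = close.pct_change(window)

# rolling(window) needs `window` observations: first `window - 1` rows are NaN
rolling_mean = close.rolling(window).mean()

null_fraction_momentum = momentum.isna().mean()
null_fraction_rolling = rolling_mean.isna().mean()

print(f"momentum null fraction: {null_fraction_momentum:.3f}")
print(f"rolling  null fraction: {null_fraction_rolling:.3f}")
```

With 500 rows per entity, the warm-up accounts for roughly 4% of the rows, so a 15% threshold tolerates it while still catching larger gaps. Shorter histories or longer windows raise the warm-up share and may need a looser threshold.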

In [None]:
# Show sample of computed features
feature_cols = ['entity_id', 'timestamp', 'close', 'momentum_20d', 'momentum_20d_zscore',
                'volatility_20d', 'rsi_14', 'macd', 'bb_position']

features_result[feature_cols].dropna().tail(15)

---
## Part 5: Complete Feature Store System

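This part integrates the registry, offline store, online store, and pipelines into a single `FeatureStore` class that serves as the entry point for both training and inference.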

In [None]:
class FeatureStore:
    """
    Complete Feature Store system integrating:
    - Feature Registry
    - Offline Store
    - Online Store
    - Feature Pipelines
    """
    
    def __init__(self, name: str = 'default'):
        self.name = name
        self.registry = FeatureRegistry()
        self.offline_store = OfflineFeatureStore()
        self.online_store = OnlineFeatureStore()
        self.pipelines: Dict[str, FeaturePipeline] = {}
        self._sync_lock = threading.Lock()
    
    def register_feature_group(self, group: FeatureGroup):
        """Register a feature group with the store."""
        self.registry.register_feature_group(group)
        print(f"Registered feature group: {group.name} with {len(group.features)} features")
    
    def register_pipeline(self, name: str, pipeline: FeaturePipeline):
        """Register a feature pipeline."""
        self.pipelines[name] = pipeline
        print(f"Registered pipeline: {name}")
    
    def materialize_features(self, pipeline_name: str, 
                             raw_data: pd.DataFrame,
                             group_name: str,
                             entity_col: str = 'entity_id',
                             timestamp_col: str = 'timestamp') -> Dict:
        """
        Run pipeline and write results to both offline and online stores.
        """
        if pipeline_name not in self.pipelines:
            raise ValueError(f"Pipeline {pipeline_name} not found")
        
        pipeline = self.pipelines[pipeline_name]
        
        # Run pipeline
        features_df, metrics = pipeline.run(raw_data)
        
        # Check validation
        if metrics['validation_results']:
            all_passed = all(v['passed'] for v in metrics['validation_results'])
            if not all_passed:
                print("Warning: Some validation checks failed")
        
        # Write to offline store
        self.offline_store.write_features(
            group_name=group_name,
            df=features_df,
            entity_col=entity_col,
            timestamp_col=timestamp_col
        )
        
        # Update online store with latest values
        self._sync_to_online(features_df, group_name, entity_col)
        
        metrics['materialization'] = {
            'offline_rows': len(features_df),
            'online_entities': features_df[entity_col].nunique()
        }
        
        return metrics
    
    def _sync_to_online(self, df: pd.DataFrame, group_name: str, entity_col: str):
        """Sync latest features to online store."""
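        # Get the most recent row per entity. Sort by time first so "last" means
        # "latest", and use tail(1) because GroupBy.last() returns the last
        # non-null value of each column, which can mix values from different rows.
        if 'timestamp' in df.columns:
            df = df.sort_values('timestamp')
        latest = df.groupby(entity_col).tail(1)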
        
        # Get feature columns (exclude entity, timestamp, and price data)
        exclude_cols = {entity_col, 'timestamp', 'open', 'high', 'low', 'close', 'volume'}
        feature_cols = [c for c in latest.columns if c not in exclude_cols]
        
        with self._sync_lock:
            for _, row in latest.iterrows():
                features = {col: row[col] for col in feature_cols 
                           if pd.notna(row[col])}
                self.online_store.write_feature(
                    group_name=group_name,
                    entity_id=str(row[entity_col]),
                    features=features
                )
    
    def get_training_data(self, group_name: str,
                          entity_ids: List[str],
                          start_time: datetime,
                          end_time: datetime,
                          feature_names: Optional[List[str]] = None) -> pd.DataFrame:
        """Get historical features for training."""
        return self.offline_store.read_features(
            group_name=group_name,
            entity_ids=entity_ids,
            start_time=start_time,
            end_time=end_time,
            feature_names=feature_names
        )
    
    def get_online_features(self, group_name: str,
                            entity_ids: List[str],
                            feature_names: Optional[List[str]] = None) -> Dict[str, Dict]:
        """Get latest features for inference."""
        return self.online_store.read_batch(
            group_name=group_name,
            entity_ids=entity_ids,
            feature_names=feature_names
        )
    
    def get_point_in_time_features(self, group_name: str,
                                    entity_timestamps: pd.DataFrame,
                                    feature_names: Optional[List[str]] = None) -> pd.DataFrame:
        """Get point-in-time features for backtesting."""
        return self.offline_store.get_point_in_time_features(
            group_name=group_name,
            entity_timestamps=entity_timestamps,
            feature_names=feature_names
        )
    
    def search_features(self, tags: Optional[List[str]] = None,
                        entity: Optional[str] = None) -> List[FeatureDefinition]:
        """Search features in registry."""
        return self.registry.search_features(tags=tags, entity=entity)
    
    def get_metrics(self) -> Dict:
        """Get store metrics."""
        return {
            'online_store': self.online_store.get_metrics(),
            'registered_pipelines': list(self.pipelines.keys()),
            'feature_groups': self.registry.list_feature_groups()
        }


# Create feature store instance
feature_store = FeatureStore('quant_features')
print(f"Feature Store '{feature_store.name}' initialized")
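
One pandas subtlety matters when selecting the "latest row per entity" for an online store: `GroupBy.last()` returns the last non-null value of each column, not the last row. If the newest row has a missing feature, an older value is silently served as if it were current. The toy frame below (hypothetical data) contrasts that with sorting by time and taking `tail(1)`.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'entity_id': ['A', 'A', 'B', 'B'],
    'timestamp': pd.to_datetime(
        ['2025-01-02', '2025-01-03', '2025-01-02', '2025-01-03']
    ),
    # The newest row for entity A has no RSI value
    'rsi_14': [55.0, np.nan, 40.0, 45.0],
})

# GroupBy.last(): last non-null value per column, so A gets the stale 55.0
last_non_null = demo.groupby('entity_id').last()

# Sort by time and keep the final row per entity: A's RSI stays missing
latest_rows = (
    demo.sort_values('timestamp')
        .groupby('entity_id')
        .tail(1)
        .set_index('entity_id')
)

print(last_non_null)
print(latest_rows)
```

Whether a stale value or a missing value is preferable depends on the model, but the choice should be explicit rather than a side effect of the aggregation.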

In [None]:
# Register feature group
equity_group = FeatureGroup(
    name='equity_technical',
    entity='ticker',
    description='Technical analysis features for equities'
)

# Define features
features_to_register = [
    FeatureDefinition('momentum_5d', 'float64', '5-day momentum', 'ticker', 'quant_team', ['momentum']),
    FeatureDefinition('momentum_10d', 'float64', '10-day momentum', 'ticker', 'quant_team', ['momentum']),
    FeatureDefinition('momentum_20d', 'float64', '20-day momentum', 'ticker', 'quant_team', ['momentum']),
    FeatureDefinition('volatility_10d', 'float64', '10-day volatility', 'ticker', 'quant_team', ['volatility']),
    FeatureDefinition('volatility_20d', 'float64', '20-day volatility', 'ticker', 'quant_team', ['volatility']),
    FeatureDefinition('rsi_14', 'float64', '14-period RSI', 'ticker', 'quant_team', ['technical'], 
                      min_value=0, max_value=100),
    FeatureDefinition('macd', 'float64', 'MACD indicator', 'ticker', 'quant_team', ['technical']),
    FeatureDefinition('bb_position', 'float64', 'Bollinger Band position', 'ticker', 'quant_team', ['technical']),
]

for feature in features_to_register:
    equity_group.add_feature(feature)

feature_store.register_feature_group(equity_group)

# Register pipeline
feature_store.register_pipeline('equity_pipeline', pipeline)

In [None]:
# Materialize features
materialize_metrics = feature_store.materialize_features(
    pipeline_name='equity_pipeline',
    raw_data=raw_data,
    group_name='equity_technical'
)

print("Materialization Results:")
print(f"  Offline rows: {materialize_metrics['materialization']['offline_rows']}")
print(f"  Online entities: {materialize_metrics['materialization']['online_entities']}")

In [None]:
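# Use feature store for training
# The sample data ends at the current date, so derive the window from the data
# instead of hard-coding calendar dates
train_end = raw_data['timestamp'].max().to_pydatetime()
train_start = train_end - timedelta(days=365)

training_data = feature_store.get_training_data(
    group_name='equity_technical',
    entity_ids=['AAPL', 'GOOGL', 'MSFT'],
    start_time=train_start,
    end_time=train_end,
    feature_names=['momentum_20d', 'volatility_20d', 'rsi_14']
)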

print(f"Training data shape: {training_data.shape}")
training_data.head(10)

In [None]:
# Use feature store for real-time inference
inference_features = feature_store.get_online_features(
    group_name='equity_technical',
    entity_ids=['AAPL', 'GOOGL'],
    feature_names=['momentum_20d', 'volatility_20d', 'rsi_14']
)

print("Real-time Inference Features:")
for ticker, features in inference_features.items():
    print(f"\n{ticker}:")
    for name, value in features.items():
        if isinstance(value, float):
            print(f"  {name}: {value:.4f}")
        else:
            print(f"  {name}: {value}")

In [None]:
# Search for features
momentum_features = feature_store.search_features(tags=['momentum'])
print("Features with 'momentum' tag:")
for f in momentum_features:
    print(f"  - {f.name}: {f.description}")

print("\n" + "="*50)

# Get store metrics
metrics = feature_store.get_metrics()
print("\nFeature Store Metrics:")
print(f"  Feature groups: {metrics['feature_groups']}")
print(f"  Pipelines: {metrics['registered_pipelines']}")
print(f"  Online store cache hit rate: {metrics['online_store']['cache_hit_rate']:.2%}")

---
## Part 6: Feature Store Best Practices

### Production Considerations for Finance

In [None]:
class FeatureMonitor:
    """
    Monitor feature quality and drift in production.
    
    Critical for financial ML systems where data drift
    can significantly impact model performance.
    """
    
    def __init__(self):
        self.baseline_stats: Dict[str, Dict] = {}
        self.alerts: List[Dict] = []
    
    def set_baseline(self, feature_name: str, df: pd.DataFrame, column: str):
        """Set baseline statistics for a feature."""
        values = df[column].dropna()
        self.baseline_stats[feature_name] = {
            'mean': values.mean(),
            'std': values.std(),
            'min': values.min(),
            'max': values.max(),
            'median': values.median(),
            'q25': values.quantile(0.25),
            'q75': values.quantile(0.75),
            'null_rate': df[column].isna().mean(),
            'timestamp': datetime.now()
        }
    
    def check_drift(self, feature_name: str, current_values: pd.Series,
                    threshold_std: float = 2.0) -> Dict:
        """Check for feature drift compared to baseline."""
        if feature_name not in self.baseline_stats:
            return {'error': 'No baseline set for feature'}
        
        baseline = self.baseline_stats[feature_name]
        current_mean = current_values.mean()
        current_std = current_values.std()
        current_null_rate = current_values.isna().mean()
        
        # Calculate drift metrics
        mean_drift = (current_mean - baseline['mean']) / baseline['std']
        std_ratio = current_std / baseline['std']
        null_rate_change = current_null_rate - baseline['null_rate']
        
        alerts = []
        
        # Check mean drift
        if abs(mean_drift) > threshold_std:
            alerts.append({
                'type': 'mean_drift',
                'feature': feature_name,
                'severity': 'high' if abs(mean_drift) > 3 else 'medium',
                'value': mean_drift,
                'threshold': threshold_std
            })
        
        # Check volatility change
        if std_ratio > 1.5 or std_ratio < 0.5:
            alerts.append({
                'type': 'volatility_change',
                'feature': feature_name,
                'severity': 'high' if std_ratio > 2 or std_ratio < 0.25 else 'medium',
                'value': std_ratio
            })
        
        # Check null rate spike
        if null_rate_change > 0.1:
            alerts.append({
                'type': 'null_rate_spike',
                'feature': feature_name,
                'severity': 'high',
                'value': null_rate_change
            })
        
        self.alerts.extend(alerts)
        
        return {
            'feature': feature_name,
            'mean_drift_std': mean_drift,
            'std_ratio': std_ratio,
            'null_rate_change': null_rate_change,
            'alerts': alerts,
            'is_healthy': len(alerts) == 0
        }
    
    def generate_report(self) -> Dict:
        """Generate monitoring report."""
        high_severity = [a for a in self.alerts if a['severity'] == 'high']
        medium_severity = [a for a in self.alerts if a['severity'] == 'medium']
        
        return {
            'timestamp': datetime.now().isoformat(),
            'total_alerts': len(self.alerts),
            'high_severity': len(high_severity),
            'medium_severity': len(medium_severity),
            'alerts_by_feature': self._group_alerts_by_feature(),
            'recent_alerts': self.alerts[-10:]
        }
    
    def _group_alerts_by_feature(self) -> Dict[str, int]:
        """Group alerts by feature."""
        grouped = defaultdict(int)
        for alert in self.alerts:
            grouped[alert['feature']] += 1
        return dict(grouped)


# Demonstrate monitoring
monitor = FeatureMonitor()

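# Set baselines from the training period. The sample data ends at the current
# date, so the cutoff is taken relative to the data rather than as a fixed
# calendar date (a fixed date can leave one of the two slices empty).
cutoff = features_result['timestamp'].max() - pd.Timedelta(days=90)
baseline_data = features_result[features_result['timestamp'] < cutoff]
monitor.set_baseline('momentum_20d', baseline_data, 'momentum_20d')
monitor.set_baseline('volatility_20d', baseline_data, 'volatility_20d')
monitor.set_baseline('rsi_14', baseline_data, 'rsi_14')

# Check current data
current_data = features_result[features_result['timestamp'] >= cutoff]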

print("Feature Drift Analysis:")
for feature in ['momentum_20d', 'volatility_20d', 'rsi_14']:
    result = monitor.check_drift(feature, current_data[feature])
    status = '✓ Healthy' if result['is_healthy'] else '⚠ Drift Detected'
    print(f"\n{feature}: {status}")
    print(f"  Mean drift (std): {result['mean_drift_std']:.2f}")
    print(f"  Std ratio: {result['std_ratio']:.2f}")

In [None]:
# Feature versioning example
class FeatureVersion:
    """Track feature computation versions."""
    
    def __init__(self):
        self.versions: Dict[str, List[Dict]] = defaultdict(list)
    
    def register_version(self, feature_name: str, version: int,
                         computation_code: str, description: str):
        """Register a new version of a feature."""
        # Create hash of computation code
        code_hash = hashlib.md5(computation_code.encode()).hexdigest()[:8]
        
        self.versions[feature_name].append({
            'version': version,
            'code_hash': code_hash,
            'description': description,
            'computation_code': computation_code,
            'created_at': datetime.now().isoformat()
        })
    
    def get_version(self, feature_name: str, 
                    version: Optional[int] = None) -> Optional[Dict]:
        """Get specific version or latest."""
        if feature_name not in self.versions:
            return None
        
        versions = self.versions[feature_name]
        if version is None:
            return versions[-1] if versions else None
        
        for v in versions:
            if v['version'] == version:
                return v
        return None
    
    def list_versions(self, feature_name: str) -> List[Dict]:
        """List all versions of a feature."""
        return self.versions.get(feature_name, [])


# Register feature versions
versioner = FeatureVersion()

versioner.register_version(
    'momentum_20d', 1,
    'price_t / price_t_minus_20 - 1',
    'Simple 20-day price momentum'
)

versioner.register_version(
    'momentum_20d', 2,
    'log(price_t / price_t_minus_20)',
    'Log 20-day momentum (more normally distributed)'
)

versioner.register_version(
    'momentum_20d', 3,
    '(price_t / price_t_minus_20 - 1) / volatility_20d',
    'Risk-adjusted 20-day momentum'
)

print("Feature Versions for 'momentum_20d':")
for v in versioner.list_versions('momentum_20d'):
    print(f"  v{v['version']} ({v['code_hash']}): {v['description']}")
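
The stored code hash is what makes a silent change to a feature's computation detectable: two definitions with the same version number but different hashes indicate that the logic changed without a version bump. A minimal sketch of that check follows. The helper `code_fingerprint` is hypothetical, and it uses SHA-256 with whitespace normalization, which is an assumption of this example rather than what `FeatureVersion` does.

```python
import hashlib


def code_fingerprint(computation_code: str, length: int = 8) -> str:
    """Short, stable fingerprint of a feature definition string."""
    # Collapse runs of whitespace so pure reformatting does not change the hash
    normalized = " ".join(computation_code.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:length]


v1 = code_fingerprint('price_t / price_t_minus_20 - 1')
v1_reformatted = code_fingerprint('price_t  /  price_t_minus_20   - 1')
v2 = code_fingerprint('log(price_t / price_t_minus_20)')

print(v1 == v1_reformatted)  # same logic, different spacing
print(v1 == v2)              # different logic
```

The normalization only collapses existing whitespace, so `a/b` and `a / b` still hash differently. Hashing a parsed representation of the code would be more robust, at the cost of extra complexity.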

---
## Summary: Feature Store Architecture

### Key Components Implemented

1. **Feature Registry**
   - Centralized metadata management
   - Feature search and discovery
   - Lineage tracking

2. **Offline Store**
   - Historical feature storage
   - Point-in-time correctness
   - Efficient time-range queries

3. **Online Store**
   - Low-latency serving
   - TTL support
   - Performance metrics

4. **Feature Pipelines**
   - Reusable transformations
   - Data validation
   - Batch and incremental processing

5. **Monitoring**
   - Feature drift detection
   - Quality alerts
   - Version tracking

### Production Considerations for Finance

| Aspect | Consideration |
|--------|---------------|
| **Look-ahead Bias** | Point-in-time joins critical for backtesting |
| **Data Latency** | Features must be computed before trading decisions |
| **Versioning** | Track feature computation changes for reproducibility |
| **Monitoring** | Detect drift that could impact model performance |
| **Consistency** | Same features for training and production |
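
To make the look-ahead bias row concrete, the sketch below shows a point-in-time join with `pd.merge_asof`: for each decision timestamp it picks the most recent feature row at or before that time, never a later one. The frames and values are hypothetical, and this is a standalone illustration rather than the implementation behind `get_point_in_time_features`.

```python
import pandas as pd

# Feature values as they became available (illustrative numbers)
features = pd.DataFrame({
    'entity_id': ['AAPL', 'AAPL', 'AAPL'],
    'timestamp': pd.to_datetime(['2025-01-02', '2025-01-03', '2025-01-06']),
    'momentum_20d': [0.01, 0.02, 0.03],
})

# Times at which a backtest needs to make a decision
decisions = pd.DataFrame({
    'entity_id': ['AAPL', 'AAPL'],
    'timestamp': pd.to_datetime(['2025-01-03 15:00', '2025-01-05 09:00']),
})

# merge_asof requires both frames to be sorted by the `on` key.
# direction='backward' selects the last feature row with timestamp <= decision time.
pit = pd.merge_asof(
    decisions.sort_values('timestamp'),
    features.sort_values('timestamp'),
    on='timestamp',
    by='entity_id',
    direction='backward',
)

print(pit)
```

Both decisions receive the 2025-01-03 value (0.02). The 2025-01-06 row is never used for the 2025-01-05 decision, which is exactly the leak a naive join on the nearest date could introduce. If a feature is only published some time after its timestamp, the feature timestamps should be shifted by that delay before joining.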

### Popular Feature Store Solutions

- **Feast**: Open-source, with pluggable offline and online stores
- **Tecton**: Enterprise, real-time focus
- **Databricks Feature Store**: Integrated with MLflow
- **Amazon SageMaker Feature Store**: AWS-native
- **Hopsworks**: Open-source, Python-centric

In [None]:
# Final summary
print("="*60)
print("Feature Store Implementation Complete")
print("="*60)

print("""
Components Built:
  ✓ FeatureRegistry - Metadata and discovery
  ✓ OfflineFeatureStore - Historical storage with PIT joins
  ✓ OnlineFeatureStore - Real-time serving
  ✓ FeaturePipeline - Transformation and validation
  ✓ FeatureStore - Integrated system
  ✓ FeatureMonitor - Drift detection
  ✓ FeatureVersion - Computation versioning

Key Concepts:
  • Point-in-time correctness prevents look-ahead bias
  • Online/offline separation optimizes for different use cases
  • Feature pipelines ensure consistency
  • Monitoring detects production issues early
  • Versioning enables reproducibility

Next Steps:
  → Integrate with production databases (Redis, PostgreSQL)
  → Add streaming feature computation (Kafka, Spark)
  → Implement feature sharing across teams
  → Set up automated monitoring dashboards
""")