# Chapter 63: Feature Stores

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand the concept of a feature store and its role in the MLOps lifecycle
- Differentiate between offline and online feature stores and their use cases
- Explain the importance of point‑in‑time correctness to avoid data leakage
- Design and implement a simple feature store using Feast for the NEPSE prediction system
- Define feature sets and register them in a feature registry
- Ingest features for training (offline) and serve them for real‑time inference (online)
- Manage feature versioning and lineage to ensure reproducibility
- Integrate a feature store with existing data pipelines and model training scripts
- Recognise the benefits of feature stores: reusability, consistency, and reduced training‑serving skew
- Evaluate when to adopt a feature store versus simpler alternatives

---

## Introduction

In the previous chapters, we engineered features for the NEPSE prediction system (lags, rolling statistics, technical indicators) and used them to train models. However, we faced a common problem: the same features had to be computed twice, once during training (on historical data) and again during inference (on live data). When the two code paths drift apart, the model sees different feature values in production than it saw in training, a problem known as **training‑serving skew**. Moreover, features were tied to specific models, making it difficult to reuse them across different projects or share them among team members.

A **feature store** solves these problems by acting as a central repository for features. It provides a consistent way to define, compute, store, and serve features for both training and inference. Feature stores have become a cornerstone of modern MLOps architectures, especially for time‑series and real‑time applications.

In this chapter, we will explore the concepts behind feature stores, their architecture, and how to implement one using the open‑source tool **Feast**. Using the NEPSE system as a concrete example, we will see how a feature store can streamline our workflow, ensure consistency, and enable collaboration.

---

## 63.1 What is a Feature Store?

A **feature store** is a data management layer for machine learning. It stores pre‑computed features and makes them available for both training (offline) and serving (online). It also manages metadata about features, such as their definitions, sources, and versions.

### 63.1.1 Core Components

A typical feature store consists of:

- **Offline Store**: A storage system (e.g., data lake, data warehouse) that holds historical feature data, used for training and batch inference. It supports large‑scale retrieval of feature data at specific points in time.
- **Online Store**: A low‑latency database (e.g., Redis, DynamoDB) that stores the latest feature values for each entity, used for real‑time inference.
- **Feature Registry**: A catalog of all defined features, including their metadata, source data, and transformation logic.
- **Serving API**: A service that provides features to models during training and inference, ensuring the correct features are retrieved efficiently.

### 63.1.2 Why Use a Feature Store?

For the NEPSE system, a feature store offers several benefits:

- **Consistency**: The same feature definitions are used for training and serving, which removes the main source of training‑serving skew.
- **Reusability**: Features like `RSI_14` or `SMA_20` can be defined once and used by multiple models (e.g., different stocks or different prediction horizons).
- **Point‑in‑time correctness**: When joining features for training, we must ensure we only use data available at the prediction time (no look‑ahead). Feature stores handle this automatically.
- **Reduced engineering effort**: Data scientists can focus on feature logic, while the feature store handles storage, serving, and scalability.
- **Feature discovery and governance**: A central registry allows teams to discover existing features, understand their meaning, and track lineage.

---

## 63.2 Feature Store Architecture

A typical feature store architecture is shown below:

```
                    ┌─────────────────┐
                    │   Data Sources  │
                    │ (e.g., Kafka,   │
                    │  historical DB) │
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │  Stream/Batch   │
                    │   Processing    │
                    │  (e.g., Spark,  │
                    │   Flink)        │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
    ┌─────────────────┐             ┌─────────────────┐
    │  Offline Store  │             │  Online Store   │
    │ (e.g., BigQuery,│             │ (e.g., Redis,   │
    │  S3 + Parquet)  │             │  DynamoDB)      │
    └────────┬────────┘             └────────┬────────┘
             │                               │
             │                               │
             ▼                               ▼
    ┌─────────────────┐             ┌─────────────────┐
    │ Training        │             │ Online Serving  │
    │ (historical     │             │ (real‑time)     │
    │  feature views) │             │                 │
    └─────────────────┘             └─────────────────┘
```

- **Data sources** can be batch (e.g., daily CSV files) or streaming (e.g., Kafka topics). For NEPSE, we might have daily CSV files and a live tick stream.
- **Batch/Stream processing** computes features from raw data. For example, a Spark job computes daily rolling statistics and stores them in the offline store, while a Flink job computes real‑time features and updates the online store.
- **Offline store** stores historical feature values with timestamps. This is used to generate training datasets.
- **Online store** stores the most recent feature values per entity (e.g., per stock) for low‑latency retrieval.
- **Training** retrieves features from the offline store, often for a specific time range, to create training data.
- **Online serving** retrieves the latest features from the online store during inference.

---

## 63.3 Point‑in‑Time Correctness

One of the most critical aspects of feature stores is ensuring **point‑in‑time correctness**. When training a model, we must join features that were available at the time of prediction. For example, if we want to predict tomorrow's return using today's RSI, we must ensure that the RSI used in training is computed from data up to today, not including tomorrow's prices.

Feature stores handle this by storing each feature value with a timestamp (event time) and an entity key (e.g., stock symbol). When creating a training dataset, we specify a set of timestamps (e.g., each trading day) and join the feature values that were valid at those times. This prevents look‑ahead bias.

**Example:** Suppose we have a feature `RSI_14` computed daily. For a training example with timestamp `2024-01-01`, we need the RSI value computed using data up to `2024-01-01` (i.e., the RSI for that day, which uses the last 14 days including `2024-01-01`). If we mistakenly used the RSI from `2024-01-02`, we would be leaking future information.
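
The rule can be made concrete with a small, dependency‑free sketch. The `rsi_history` values and the `value_as_of` helper are hypothetical and exist only for illustration; on DataFrames, `pandas.merge_asof` performs the same backward‑looking lookup.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical daily RSI_14 values for one stock, sorted by date (illustrative numbers).
rsi_history = [
    (date(2023, 12, 29), 55.0),
    (date(2024, 1, 1), 58.2),
    (date(2024, 1, 2), 61.7),
]

def value_as_of(history, as_of):
    """Return the latest value whose date is <= as_of, or None if none exists."""
    dates = [d for d, _ in history]
    idx = bisect_right(dates, as_of)  # number of entries dated on or before as_of
    return history[idx - 1][1] if idx > 0 else None

# The training row for 2024-01-01 sees the value computed that day, not the next one.
print(value_as_of(rsi_history, date(2024, 1, 1)))   # 58.2
# Before any feature value exists, nothing is returned rather than a future value.
print(value_as_of(rsi_history, date(2023, 12, 1)))  # None
```

The key design choice is that the lookup only ever searches backwards in time, so a value dated `2024-01-02` can never reach a training row dated `2024-01-01`.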

---

## 63.4 Feast: A Popular Open‑Source Feature Store

**Feast** (Feature Store) is an open‑source feature store that works with offline and online stores. It provides:

- A Python SDK to define features and feature views.
- A registry to store metadata.
- Retrieval APIs for training and serving.

Feast is designed to be cloud‑agnostic and integrates with various data sources and storage backends.

### 63.4.1 Installation

```bash
pip install feast
```

### 63.4.2 Defining Features for NEPSE

We start by creating a Feast feature repository. This is a directory with configuration and feature definitions.

```bash
feast init nepse_feast
cd nepse_feast
```

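The generated repository contains a `feature_store.yaml` configuration file and an example feature definition file. The exact layout depends on the Feast version: recent releases place both inside a `feature_repo` subdirectory, in which case run the remaining commands from there. We'll define our features in a file named `features.py` next to `feature_store.yaml`.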

**Example `features.py`:**

```python
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64
from feast.value_type import ValueType

# Define a data source (parquet files on disk or in cloud storage)
# For NEPSE, we could store daily feature values in Parquet.
nepse_source = FileSource(
    path="data/nepse_features.parquet",  # path to historical features
    # Event-time column; older Feast releases name this argument event_timestamp_column
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp",
)

# Define an entity (the thing we are modelling)
stock = Entity(
    name="stock",
    value_type=ValueType.STRING,
    description="Stock symbol",
    join_keys=["stock"],
)

# Define feature views (groupings of features)
stock_features = FeatureView(
    name="stock_daily_features",
    entities=[stock],
    ttl=timedelta(days=365),  # how far back Feast looks for a value at retrieval time
    schema=[
        Field(name="open", dtype=Float32),
        Field(name="high", dtype=Float32),
        Field(name="low", dtype=Float32),
        Field(name="close", dtype=Float32),
        Field(name="volume", dtype=Int64),
        Field(name="sma_20", dtype=Float32),
        Field(name="rsi_14", dtype=Float32),
        Field(name="volume_ratio", dtype=Float32),
    ],
    source=nepse_source,
)
```

**Explanation:**  
We define a data source (a Parquet file) containing historical features. The `Entity` represents the stock symbol. The `FeatureView` groups features that share the same source and entity. Feast will use this definition to serve features.

### 63.4.3 Ingesting Historical Features

We need to populate the Parquet file with historical features. This can be done by a separate batch pipeline (e.g., a Spark job or a Python script). The Parquet file should have the columns `stock`, `event_timestamp`, `created_timestamp`, and all feature columns, matching the source and schema defined above.

**Example Python script to create Parquet:**

```python
import pandas as pd
from datetime import datetime

# Assume we have a DataFrame with features for each stock and day
df = pd.read_csv('nepse_features.csv')
df['event_timestamp'] = pd.to_datetime(df['date'])
df['created_timestamp'] = datetime.now()
df = df.rename(columns={'symbol': 'stock'})
# Select only needed columns
feature_columns = ['stock', 'event_timestamp', 'created_timestamp',
                   'open', 'high', 'low', 'close', 'volume',
                   'sma_20', 'rsi_14', 'volume_ratio']
df[feature_columns].to_parquet('data/nepse_features.parquet')
```

### 63.4.4 Applying Feature Definitions

After defining the features, we apply them to the Feast registry.

```bash
feast apply
```

This registers the entities and feature views in the registry (by default, a local file such as `data/registry.db`). It also sets up the online store if one is configured; with the local provider, the default online store is a SQLite database.

### 63.4.5 Retrieving Features for Training

To create a training dataset, we need a list of entity rows with timestamps. Feast will join the features that were valid at those timestamps.

```python
from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path=".")

# Create an entity dataframe: for each stock and day we want features
entity_df = pd.DataFrame({
    "stock": ["NABIL", "NABIL", "NTC", "NTC"],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"])
})

# Retrieve features
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "stock_daily_features:open",
        "stock_daily_features:high",
        "stock_daily_features:low",
        "stock_daily_features:close",
        "stock_daily_features:volume",
        "stock_daily_features:sma_20",
        "stock_daily_features:rsi_14",
        "stock_daily_features:volume_ratio",
    ]
).to_df()

print(training_df.head())
```

**Explanation:**  
Feast performs a point‑in‑time join: for each row in `entity_df`, it retrieves the feature values that were valid at that timestamp (i.e., the most recent feature value with `event_timestamp <= entity timestamp`, respecting the TTL). This ensures no look‑ahead.
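
To see how the TTL interacts with the join, here is a simplified, dependency‑free model of the behaviour described above. It is a sketch, not Feast's implementation: the `point_in_time_join` function, the sample rows, and the inclusive window boundaries are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical offline rows: (stock, event_timestamp, rsi_14). Values are illustrative.
feature_rows = [
    ("NABIL", datetime(2023, 12, 20), 48.0),
    ("NABIL", datetime(2024, 1, 1), 58.2),
    ("NABIL", datetime(2024, 1, 2), 61.7),
    ("NTC", datetime(2023, 6, 1), 40.5),
]

def point_in_time_join(entity_rows, feature_rows, ttl):
    """For each (stock, ts), pick the newest feature row with ts - ttl <= event_ts <= ts."""
    joined = []
    for stock, ts in entity_rows:
        candidates = [
            (event_ts, value)
            for s, event_ts, value in feature_rows
            if s == stock and ts - ttl <= event_ts <= ts
        ]
        # max() orders the tuples by event timestamp, so it selects the newest candidate
        value = max(candidates)[1] if candidates else None
        joined.append((stock, ts, value))
    return joined

entity_rows = [("NABIL", datetime(2024, 1, 1)), ("NTC", datetime(2024, 1, 1))]
result = point_in_time_join(entity_rows, feature_rows, ttl=timedelta(days=30))
for row in result:
    print(row)
```

With a 30‑day TTL, the `NABIL` row receives the value from `2024-01-01` (never the one from `2024-01-02`), while the `NTC` row receives no value because its only feature row is older than the TTL window. A TTL that is too short therefore shows up as missing values in the training set rather than as an error.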

### 63.4.6 Online Serving

For real‑time inference, we need to push the latest feature values to the online store. Feast supports writing features to an online store (e.g., Redis) via the `materialize` command.

First, configure an online store in `feature_store.yaml`. For example, using Redis:

```yaml
project: nepse
provider: local
online_store:
  type: redis
  connection_string: "localhost:6379"
offline_store:
  type: file
registry: data/registry.db
```

Then materialize features from the offline store to the online store:

```bash
feast materialize 2024-01-01T00:00:00 2024-12-31T00:00:00
```

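This loads the most recent feature values for each entity within the time range from the offline store into the online store. The online store keeps only the latest value per entity, not the full history.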
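
Conceptually, the online store only needs the newest value per entity. The following dependency‑free sketch imitates that step with a plain dictionary standing in for Redis; the `materialize` function and the sample rows are hypothetical and do not use Feast's API.

```python
from datetime import datetime

# Hypothetical offline rows: (stock, event_timestamp, features). Values are illustrative.
offline_rows = [
    ("NABIL", datetime(2024, 1, 1), {"close": 600.0, "rsi_14": 58.2}),
    ("NABIL", datetime(2024, 1, 2), {"close": 612.0, "rsi_14": 61.7}),
    ("NTC", datetime(2024, 1, 1), {"close": 900.0, "rsi_14": 45.1}),
    ("NTC", datetime(2025, 1, 5), {"close": 950.0, "rsi_14": 50.0}),  # outside the range
]

def materialize(offline_rows, start, end):
    """Build an in-memory 'online store': the newest row per stock within [start, end]."""
    online = {}
    for stock, ts, features in offline_rows:
        if not (start <= ts <= end):
            continue  # ignore rows outside the requested time range
        if stock not in online or ts > online[stock]["event_timestamp"]:
            online[stock] = {"event_timestamp": ts, **features}
    return online

online_store = materialize(offline_rows, datetime(2024, 1, 1), datetime(2024, 12, 31))
print(online_store["NABIL"])
```

After this step, `NABIL` maps to its `2024-01-02` row and `NTC` to its `2024-01-01` row, which is why an online lookup needs only the entity key and no timestamp.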

Now, during inference, we can retrieve the latest features for a stock:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

features = store.get_online_features(
    features=[
        "stock_daily_features:open",
        "stock_daily_features:high",
        "stock_daily_features:low",
        "stock_daily_features:close",
        "stock_daily_features:volume",
        "stock_daily_features:sma_20",
        "stock_daily_features:rsi_14",
        "stock_daily_features:volume_ratio",
    ],
    entity_rows=[{"stock": "NABIL"}]
).to_dict()

print(features)
```

**Explanation:**  
`get_online_features` fetches the latest feature values for the given entity from the online store (Redis). This is a low‑latency operation suitable for real‑time prediction APIs.

---

## 63.5 Feature Versioning and Lineage

As features evolve (e.g., we change the definition of `rsi_14`), we need to version them. Feast's registry stores the current definition of each feature view under its name, and re‑running `feast apply` updates that definition. Changing a definition does not rewrite historical data already in the offline store, and Feast does not automatically handle time‑travel of feature definitions. You must manage this yourself, for example by using different feature view names (such as a `_v2` suffix) or by storing the definition alongside the data.

For lineage, Feast records metadata: which source each feature comes from, when it was created, etc. This can be exported for auditing.
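
One lightweight way to store the definition alongside the data is to record a fingerprint of it with every batch of feature values. The sketch below is an illustration under stated assumptions: the `definition_fingerprint` helper and the dictionary layout describing a feature are hypothetical and not part of Feast.

```python
import hashlib
import json

def definition_fingerprint(definition):
    """Return a short, stable hash of a feature definition dictionary."""
    # Sorting keys makes the hash independent of the order the keys were written in
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Hypothetical descriptions of two variants of the same feature.
rsi_v1 = {"name": "rsi_14", "window": 14, "smoothing": "wilder", "price": "close"}
rsi_v2 = {"name": "rsi_14", "window": 14, "smoothing": "simple", "price": "close"}

fp_v1 = definition_fingerprint(rsi_v1)
fp_v2 = definition_fingerprint(rsi_v2)
print(fp_v1, fp_v2)
```

Writing the fingerprint into an extra column, or into the model's training metadata, makes it possible to tell later which variant of `rsi_14` a given model was trained on.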

---

## 63.6 Integrating with Model Training

We can now use the feature store in our NEPSE training script:

```python
import pandas as pd
from feast import FeatureStore
import xgboost as xgb

# 1. Get training data
store = FeatureStore(repo_path=".")
entity_df = pd.DataFrame({
    "stock": ["NABIL"] * 500,  # example: 500 calendar days for NABIL
    # In practice, use the actual NEPSE trading dates rather than every calendar day
    "event_timestamp": pd.date_range(end="2024-01-01", periods=500)
})
train_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "stock_daily_features:open",
        "stock_daily_features:high",
        "stock_daily_features:low",
        "stock_daily_features:close",
        "stock_daily_features:volume",
        "stock_daily_features:sma_20",
        "stock_daily_features:rsi_14",
        "stock_daily_features:volume_ratio",
    ]
).to_df()

# Assume a separate table holds the label: 1 if the next day's return is positive, else 0
target_df = pd.read_csv('nepse_targets.csv', parse_dates=['date'])  # columns: stock, date, target

# Align the join keys: Feast may return timezone-aware (UTC) timestamps
train_df['date'] = pd.to_datetime(train_df['event_timestamp'], utc=True).dt.tz_localize(None)
train_df = train_df.merge(target_df, on=['stock', 'date'])

X = train_df[['open', 'high', 'low', 'close', 'volume', 'sma_20', 'rsi_14', 'volume_ratio']]
y = train_df['target']

# Train a classifier on the up/down label
model = xgb.XGBClassifier()
model.fit(X, y)

# Save model (as before)
```

For inference in a real‑time API, we would fetch online features and feed them to the model.
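
One way this hand‑off might look is sketched below. The `to_feature_row` helper and the sample payload are hypothetical; the payload is assumed to have the dictionary‑of‑lists shape that `get_online_features(...).to_dict()` returns for a single entity.

```python
FEATURE_ORDER = ["open", "high", "low", "close", "volume", "sma_20", "rsi_14", "volume_ratio"]

def to_feature_row(online_features, feature_order=FEATURE_ORDER):
    """Turn a dict of single-element lists into one ordered feature vector.

    Raises ValueError if a feature is missing or None, so that a stale or
    unmaterialized entity fails loudly instead of producing a bad prediction.
    """
    row = []
    for name in feature_order:
        values = online_features.get(name)
        if not values or values[0] is None:
            raise ValueError(f"missing online feature: {name}")
        row.append(float(values[0]))
    return row

# Hypothetical payload for one stock (illustrative numbers).
payload = {
    "stock": ["NABIL"],
    "open": [598.0], "high": [615.0], "low": [596.0], "close": [612.0],
    "volume": [120000], "sma_20": [590.4], "rsi_14": [61.7], "volume_ratio": [1.3],
}
row = to_feature_row(payload)
print(row)
# With the trained model in scope, a prediction could then be made with:
# model.predict(pd.DataFrame([row], columns=FEATURE_ORDER))
```

Keeping the column order in one constant shared by training and serving guards against silently feeding features to the model in the wrong order.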

---

## 63.7 Feast in Production

For production use, consider:

- **Scaling the online store**: Use managed Redis or DynamoDB.
- **Streaming updates**: Use a stream processor (e.g., Flink) to compute features and push them to the online store via Feast's `write_to_online_store` method.
- **Monitoring**: Track feature retrieval latency and error rates.
- **Access control**: Use Feast's support for different projects to separate teams.

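Feast can also run as a standalone feature server (started with `feast serve`) that exposes feature retrieval over the network and can be deployed on Kubernetes. The available server implementations and protocols (HTTP or gRPC) depend on the Feast version, so check the documentation for the release you use.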

---

## 63.8 Alternative Feature Stores

While Feast is popular and open‑source, other feature stores exist:

- **Tecton**: Commercial managed feature platform whose team has been a major contributor to Feast, with additional capabilities such as managed feature pipelines and monitoring.
- **Hopsworks**: Open‑source feature store with integrated ML platform.
- **AWS SageMaker Feature Store**: Managed feature store integrated with SageMaker.
- **Vertex AI Feature Store**: Google Cloud's managed feature store.

The choice depends on your infrastructure and requirements. For the NEPSE system, starting with Feast is a good way to learn the concepts without vendor lock‑in.

---

## 63.9 Best Practices for Feature Stores

1. **Define features early**: Involve data scientists in defining the feature schema.
2. **Use consistent naming**: Adopt a naming convention (e.g., `sma_20` for 20‑day simple moving average).
3. **Version features carefully**: When changing a feature, consider creating a new feature view rather than modifying an existing one.
4. **Monitor feature freshness**: Ensure that the online store is updated promptly; stale features can hurt model performance.
5. **Backfill historical data**: Before using a feature store, backfill all historical features to enable training on past data.
6. **Secure sensitive features**: If some features are sensitive (e.g., proprietary signals), implement access controls.
7. **Test feature retrieval**: Write tests to ensure that features are correctly retrieved for both training and serving.
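
As an example of practice 4, a freshness check can be as simple as comparing the newest feature timestamp per entity with the current time. This is a minimal sketch: the `stale_entities` helper, the two‑day threshold, and the timestamps are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

def stale_entities(last_updated, now, max_age):
    """Return the sorted list of entities whose newest feature row is older than max_age."""
    return sorted(
        entity for entity, ts in last_updated.items() if now - ts > max_age
    )

# Hypothetical timestamps of the newest online feature row per stock.
last_updated = {
    "NABIL": datetime(2024, 1, 2, 15, 0),
    "NTC": datetime(2023, 12, 28, 15, 0),
}
now = datetime(2024, 1, 3, 9, 0)
print(stale_entities(last_updated, now, max_age=timedelta(days=2)))  # ['NTC']
```

A check like this can run on a schedule and raise an alert, or it can run inside the prediction API to refuse predictions for stocks whose features are too old.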

---

## 63.10 Implementing a Simple Custom Feature Store

If Feast seems too heavy for your initial NEPSE project, you can build a simple custom feature store using a database and a caching layer. For example:

- Store historical features in a PostgreSQL table with columns `(stock, date, feature_name, value)`.
- For training, query all features for a date range and pivot to wide format.
- For online, use Redis with keys like `feature:stock:NABIL` storing a JSON of latest features.

However, you'll need to implement point‑in‑time joins and ensure consistency yourself. Feast handles these complexities, making it worth the initial setup.
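
To give a sense of what implementing it yourself involves, the sketch below builds the long‑format table and a point‑in‑time lookup. It uses Python's built‑in `sqlite3` with an in‑memory database in place of PostgreSQL so that it runs without a server; the `features_as_of` helper and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE features (stock TEXT, date TEXT, feature_name TEXT, value REAL, "
    "PRIMARY KEY (stock, date, feature_name))"
)
# Hypothetical feature values in long format (ISO dates sort correctly as text).
conn.executemany(
    "INSERT INTO features VALUES (?, ?, ?, ?)",
    [
        ("NABIL", "2024-01-01", "sma_20", 590.4),
        ("NABIL", "2024-01-01", "rsi_14", 58.2),
        ("NABIL", "2024-01-02", "sma_20", 592.1),
        ("NABIL", "2024-01-02", "rsi_14", 61.7),
    ],
)

def features_as_of(conn, stock, as_of):
    """Return {feature_name: value} using, per feature, the newest row dated <= as_of."""
    rows = conn.execute(
        """
        SELECT f.feature_name, f.value
        FROM features AS f
        JOIN (
            SELECT feature_name, MAX(date) AS latest
            FROM features
            WHERE stock = ? AND date <= ?
            GROUP BY feature_name
        ) AS m ON m.feature_name = f.feature_name AND m.latest = f.date
        WHERE f.stock = ?
        """,
        (stock, as_of, stock),
    ).fetchall()
    return dict(rows)

print(features_as_of(conn, "NABIL", "2024-01-01"))
```

The subquery is the point‑in‑time part: it restricts each feature to rows dated on or before `as_of` before picking the newest one. Even this small version omits TTL handling, batch retrieval for many timestamps, and synchronisation with the Redis cache.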

---

## Chapter Summary

In this chapter, we introduced the concept of a feature store and its role in MLOps. We covered:

- The core components: offline store, online store, feature registry, and serving API.
- The importance of point‑in‑time correctness to avoid data leakage.
- How to implement a feature store using Feast, with a concrete example for the NEPSE prediction system.
- Defining entities, feature views, and sources.
- Retrieving features for training (historical) and serving (online).
- Feature versioning, lineage, and integration with model training.
- Production considerations and best practices.

By adopting a feature store, the NEPSE system gains consistency, reusability, and reduced training‑serving skew. Features become a first‑class asset, shared across models and teams. In the next chapter, we will discuss **Experiment Tracking**, another key MLOps practice that helps manage the model development lifecycle.

---

**End of Chapter 63**