# Module 7: Feature Stores and Model Registries

**Course**: End-to-End Machine Learning (Datacamp)  
**Case Study**: CardioCare Heart Disease Prediction  
**Author**: Seif

---

## Overview

- Why feature stores matter (consistency, reuse, discovery)
- Feast basics (Entity, Field, FeatureView, sources)
- Applying and retrieving features
- Why model registries matter (versioning, lineage, governance)
- Using MLflow Model Registry (and its requirements)

## Feature stores

A feature store is a central, versioned repository of curated, transformed features. It helps you:
- Ensure the same transformations in training are used in production
- Reduce duplication and enable discovery/reuse across teams
- Provide lineage and governance for critical inputs (e.g., patient features)
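
The first bullet is easiest to see with a single transformation function shared by the training and serving paths. The sketch below is a minimal illustration using only the standard library; the function name `build_patient_features`, the derived features, and the 240 cutoff are hypothetical choices for this example, not part of the CardioCare pipeline.

```python
from typing import Dict


def build_patient_features(raw: Dict[str, float]) -> Dict[str, float]:
    """Single source of truth for feature logic (hypothetical transformations)."""
    return {
        "age_decades": raw["age"] / 10.0,
        # Illustrative cutoff only; pick thresholds that suit your data
        "high_cholesterol": 1.0 if raw["cholesterol"] >= 240 else 0.0,
    }


# Training path: transform historical records in bulk
training_rows = [
    {"age": 63, "cholesterol": 233.0},
    {"age": 41, "cholesterol": 250.0},
]
training_features = [build_patient_features(row) for row in training_rows]

# Serving path: transform one incoming request with the SAME function
request = {"age": 41, "cholesterol": 250.0}
serving_features = build_patient_features(request)

# Identical inputs give identical features in both paths
print(serving_features == training_features[1])
```

A feature store generalizes this idea: the feature logic is defined once and both paths read its output, rather than each reimplementing it.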

We'll use [Feast](https://feast.dev/), an open-source feature store, to illustrate the concepts.

## Feast concepts

- `Entity`: The primary key of your features (e.g., `patient_id`)
- `Field`: The schema of a feature (name + dtype)
- `FileSource` (or other sources): Where features come from (Parquet files, a data warehouse, etc.)
- `FeatureView`: A logical group of features tied to entities and a source

In [None]:
# Define Feast objects for the CardioCare dataset (prints install guidance if Feast is missing)
try:
    from feast import Entity, FeatureView, Field
    from feast.infra.offline_stores.file_source import FileSource
    from feast.types import Float32, Int64
    import pandas as pd

    # Entity (join key)
    patient = Entity(name="patient_id", join_keys=["patient_id"])

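    # Offline source: Parquet file under ./data (replace with your real path).
    # Feast's FileSource reads Parquet, so convert a CSV first (e.g., df.to_parquet).
    patient_source = FileSource(
        path="data/heart_disease.parquet",
        # Event-time column, required for point-in-time correctness.
        # Older Feast releases call this parameter `event_timestamp_column`.
        timestamp_field="event_timestamp",
    )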

    # Feature schema
    patient_features_schema = [
        Field(name="age", dtype=Int64),
        Field(name="sex", dtype=Int64),
        Field(name="cholesterol", dtype=Float32),
        # Add more fields as needed (e.g., blood_pressure, chest_pain_type, etc.)
    ]

    # Feature view
    patient_features_view = FeatureView(
        name="patient_features",
        entities=[patient],
        ttl=None,
        schema=patient_features_schema,
        source=patient_source,
        online=True,
        tags={"owner": "cardiocare"}
    )
    print("Defined Feast Entity, FileSource, and FeatureView.")
except ImportError:
    print(
        "Feast is not installed. To try this, install feast:\n"
        "  pip install feast\n"
        "Then initialize a repo:\n"
        "  feast init cardiocare_repo"
    )

### Applying definitions

Feast expects a repo (with `feature_store.yaml`) to manage resources. Typical flow:
1. Create or open a Feast repo (e.g., `feast init cardiocare_repo`)
2. Add your entity/feature view definitions in Python files under the repo
3. Apply them: `feast apply`
4. Build an entity DataFrame and retrieve features for training or inference
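
Step 4 relies on a point-in-time join: for each row of the entity DataFrame, the store returns the most recent feature values known at or before that row's timestamp, so no future information leaks into training. The sketch below imitates that rule in plain Python. It is a simplified illustration, not Feast's implementation (it ignores TTL, for example), and the helper names and sample values are made up.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def point_in_time_lookup(
    feature_rows: List[Row], patient_id: int, as_of: datetime
) -> Optional[Row]:
    """Return the latest feature row for `patient_id` observed at or before `as_of`."""
    candidates = [
        row
        for row in feature_rows
        if row["patient_id"] == patient_id and row["event_timestamp"] <= as_of
    ]
    if not candidates:
        return None  # no feature value was known yet at that time
    return max(candidates, key=lambda row: row["event_timestamp"])


def utc(year: int, month: int, day: int) -> datetime:
    """Build a timezone-aware UTC timestamp."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical feature history: patient 101 has two cholesterol readings
feature_rows = [
    {"patient_id": 101, "event_timestamp": utc(2024, 1, 1), "cholesterol": 210.0},
    {"patient_id": 101, "event_timestamp": utc(2024, 2, 1), "cholesterol": 245.0},
    {"patient_id": 202, "event_timestamp": utc(2024, 3, 1), "cholesterol": 190.0},
]

early = point_in_time_lookup(feature_rows, 101, utc(2024, 1, 15))
late = point_in_time_lookup(feature_rows, 101, utc(2024, 2, 10))
unknown = point_in_time_lookup(feature_rows, 202, utc(2024, 2, 10))
print(early["cholesterol"], late["cholesterol"], unknown)
```

The lookup for patient 202 returns `None` because that patient's only reading is dated after the requested timestamp.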

In [None]:
# Example: retrieving features (historical)
try:
    from feast import FeatureStore
    import pandas as pd
    # Suppose we already ran `feast apply` in ./cardiocare_repo
    store = FeatureStore(repo_path="./cardiocare_repo")

    # entity_df must include 'patient_id' and 'event_timestamp' columns
    entity_df = pd.DataFrame({
        "patient_id": [101, 202, 303],
        "event_timestamp": pd.to_datetime(["2024-01-02", "2024-02-10", "2024-03-05"])
    })

    training_df = store.get_historical_features(
        entity_df=entity_df,
        features=[
            "patient_features:age",
            "patient_features:sex",
            "patient_features:cholesterol"
        ],
    ).to_df()
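    # Print explicitly: a bare expression inside `try` is not displayed by the notebook
    print(training_df.head())
except Exception as e:
    print("(Feast demo) Unable to retrieve features:", e)
    print("Make sure you've created a Feast repo and run `feast apply`.")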

## Model registries

A model registry version-controls trained models and their metadata:
- Compare models, annotate with links to code/data/features, and track performance over time
- Promote models through stages (e.g., Staging → Production) with governance
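
To make versioning and stage promotion concrete, here is a toy in-memory registry. It is a teaching sketch only and not how MLflow is implemented: the `ToyRegistry` class, the run IDs, and the `git_commit` tags are all hypothetical, and archiving the previous Production version on promotion is a design choice of this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModelVersion:
    version: int
    source: str                      # where the model artifact lives
    stage: str = "None"              # None -> Staging -> Production -> Archived
    tags: Dict[str, str] = field(default_factory=dict)


class ToyRegistry:
    """Minimal registry: auto-incrementing versions plus stage transitions."""

    def __init__(self) -> None:
        self._models: Dict[str, List[ModelVersion]] = {}

    def register(self, name: str, source: str, **tags: str) -> ModelVersion:
        versions = self._models.setdefault(name, [])
        model_version = ModelVersion(len(versions) + 1, source, tags=dict(tags))
        versions.append(model_version)
        return model_version

    def transition(self, name: str, version: int, stage: str) -> ModelVersion:
        versions = self._models[name]
        target = versions[version - 1]
        if stage == "Production":
            # Keep at most one Production version per model
            for other in versions:
                if other.stage == "Production":
                    other.stage = "Archived"
        target.stage = stage
        return target


registry = ToyRegistry()
name = "CardioCareHeartDiseaseLR"
v1 = registry.register(name, "runs:/run-a/model", git_commit="abc123")
v2 = registry.register(name, "runs:/run-b/model", git_commit="def456")

registry.transition(name, 1, "Production")
registry.transition(name, 2, "Staging")
registry.transition(name, 2, "Production")  # v1 is archived automatically
print(v1.stage, v2.stage)
```

The tags carry lineage (here, the commit that produced each version), and the stage field records which version is currently serving.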

We'll continue with MLflow. Note that the Model Registry generally requires a tracking server backed by a database; in older MLflow releases in particular, file-based tracking alone does not support the registry or its UI.

In [None]:
# Register a model in MLflow Model Registry (illustrative)
import mlflow
from mlflow.tracking import MlflowClient

# Example: point MLflow to a tracking server with a DB backend
# e.g., SQLite (dev): mlflow server --backend-store-uri sqlite:///mlruns.db --default-artifact-root ./mlartifacts --host 127.0.0.1 --port 5000
# mlflow.set_tracking_uri("http://127.0.0.1:5000")

model_name = "CardioCareHeartDiseaseLR"
# After logging a model artifact (see Module 5), register it:
# result = mlflow.register_model("runs:/<your_run_id>/model", model_name)
# print("Registered model version:", result.version)

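# Assign a stage to a version (new versions start in "None"; a typical path is Staging → Production)
# Note: recent MLflow releases deprecate stages in favor of aliases
# (see client.set_registered_model_alias); the call below uses the stage-based API.
# client = MlflowClient()
# client.transition_model_version_stage(name=model_name, version=result.version, stage="Staging")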
print("Refer to Module 5 for logging runs, then register the best run's model here.")
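
The commented calls above identify models by URI. As a small sketch, the helpers below build the three URI forms MLflow accepts for a logged or registered model; the helper names, the run ID, and the `champion` alias are placeholders for this example, and the default `artifact_path="model"` assumes the model was logged under that path.

```python
def run_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """URI of a model artifact logged in a run (the input to register_model)."""
    return f"runs:/{run_id}/{artifact_path}"


def registry_version_uri(name: str, version: int) -> str:
    """URI of a specific registered model version."""
    return f"models:/{name}/{version}"


def registry_alias_uri(name: str, alias: str) -> str:
    """URI of whichever version currently holds the given alias."""
    return f"models:/{name}@{alias}"


model_name = "CardioCareHeartDiseaseLR"
print(run_model_uri("example-run-id"))          # placeholder run ID
print(registry_version_uri(model_name, 1))
print(registry_alias_uri(model_name, "champion"))
```

Any of these strings can be passed to a loader such as `mlflow.pyfunc.load_model`, so serving code can pin an exact version or follow an alias.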

## Practice

- Define a `patient_id` entity and add more patient features (e.g., `blood_pressure`) via `Field`
- Create a Feast repo (`feast init cardiocare_repo`), place definitions, and run `feast apply`
- Build an `entity_df` covering a time window and retrieve a training DataFrame via `get_historical_features`
- Train a model and log it to MLflow (Module 5), then register it and set a stage

## Adding more patient features with Feast

Below we define additional feature fields from the classic heart-disease schema. Choose Feast dtypes to match your stored data:
- Use Int32/Int64 for integer-coded categorical features (e.g., cp, ca, thal)
- Use Float32/Float64 for continuous features (e.g., thalach if stored as float)
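
One way to apply this rule of thumb is to inspect sample values before writing the `Field` definitions. The helper below is a hypothetical convenience, not a Feast API: it returns the name of a Feast type as a string, and it defaults to `Float64` for continuous data because sample values alone cannot tell you whether `Float32` storage is sufficient.

```python
from typing import Iterable, Optional, Union

Number = Union[int, float]
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def suggest_feast_dtype(values: Iterable[Optional[Number]]) -> str:
    """Suggest a Feast type name ('Int32', 'Int64', 'Float64') from sample values."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("need at least one non-null value")
    if any(isinstance(v, bool) or not isinstance(v, (int, float)) for v in present):
        raise TypeError("only int and float values are supported")
    if all(isinstance(v, int) for v in present):
        fits_int32 = all(INT32_MIN <= v <= INT32_MAX for v in present)
        return "Int32" if fits_int32 else "Int64"
    return "Float64"  # switch to Float32 if your data is stored as float32


print(suggest_feast_dtype([0, 1, 2, 3]))        # integer-coded category such as cp
print(suggest_feast_dtype([150.0, 162.5]))      # continuous measurement
```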



In [None]:
# Example: define entity and selected feature fields
try:
    from feast import Entity, Field
    from feast.types import Float32, Int32

    # Entity identifies the join key for your features
    patient = Entity(name="patient", join_keys=["patient_id"])  # same as 'patient_id' column in your data

    # Selected feature fields (choose types to match your stored data)
    # cp: chest pain type (usually integer-coded categories 0..3)
    cp = Field(name="cp", dtype=Int32)

    # thalach: maximum heart rate achieved (often an integer count)
    thalach = Field(name="thalach", dtype=Int32)

    # ca: number of major vessels colored by fluoroscopy (0..3)
    ca = Field(name="ca", dtype=Int32)

    # thal: thalassemia (categorical code)
    thal = Field(name="thal", dtype=Int32)

    print("Defined Entity 'patient' and fields: cp, thalach, ca, thal")
except ImportError:
    print("Feast is not installed. Install with: pip install feast")

In [None]:
# Save a DataFrame to Parquet and register it in Feast
# This demo will:
# 1) Create or reuse a heart_disease_df with required columns
# 2) Save it to heart_disease.parquet
# 3) Define a Feast FileSource pointing to that Parquet file
# 4) Define a FeatureView that uses previously defined Entity/Fields
# 5) Apply the definitions to a FeatureStore (requires a Feast repo)

import os
import pandas as pd
from datetime import datetime, timedelta

try:
    # 1) Create or reuse a DataFrame
    try:
        df = heart_disease_df.copy()
    except NameError:
        # Minimal synthetic example with required columns
        import numpy as np
        n = 50
        df = pd.DataFrame({
            "patient_id": np.arange(1000, 1000 + n),
            "cp": np.random.randint(0, 4, size=n),           # chest pain type (0..3)
            "thalach": np.random.randint(90, 200, size=n),   # max heart rate
            "ca": np.random.randint(0, 4, size=n),           # number of vessels (0..3)
            "thal": np.random.randint(0, 4, size=n),         # thalassemia code
        })
        # Add timestamps required by Feast (an event timestamp column is mandatory)
        # Use a timezone-aware UTC timestamp; datetime.utcnow() is naive and deprecated
        now = pd.Timestamp.now(tz="UTC")
        df["timestamp"] = [now - timedelta(days=i) for i in range(n)]
        df["created"] = now

    # Ensure timestamp columns exist if the user supplied heart_disease_df
    if "timestamp" not in df.columns:
        df["timestamp"] = pd.Timestamp.now(tz="UTC")
    if "created" not in df.columns:
        df["created"] = pd.Timestamp.now(tz="UTC")

    # 2) Save to Parquet (small local demo file)
    parquet_path = "heart_disease.parquet"
    df.to_parquet(parquet_path, index=False)
    print(f"Saved {len(df)} rows to {parquet_path}")

    # 3) Define Feast FileSource
    try:
        from feast.infra.offline_stores.file_source import FileSource
        from feast import FeatureView, FeatureStore
    except ImportError:
        print("Feast is not installed. Install with: pip install feast")
        raise SystemExit

    data_source = FileSource(
        path=parquet_path,
        timestamp_field="timestamp",  # `event_timestamp_column` in older Feast releases
        created_timestamp_column="created",
    )

    # 4) Define Entity/Fields if not already available
    try:
        patient
        cp; thalach; ca; thal
    except NameError:
        from feast import Entity, Field
        from feast.types import Int32
        patient = Entity(name="patient", join_keys=["patient_id"])  # join on patient_id
        cp = Field(name="cp", dtype=Int32)
        thalach = Field(name="thalach", dtype=Int32)
        ca = Field(name="ca", dtype=Int32)
        thal = Field(name="thal", dtype=Int32)

    # 5) Define FeatureView and apply to a FeatureStore
    heart_disease_fv = FeatureView(
        name="heart_disease",
        entities=[patient],
        schema=[cp, thalach, ca, thal],
        source=data_source,
    )

    # Feast requires a repo (feature_store.yaml). If not present, print guidance.
    repo_path = "."
    try:
        # Construct the store inside the try: it raises if feature_store.yaml is missing
        store = FeatureStore(repo_path=repo_path)
        store.apply([patient, heart_disease_fv])
        print("Applied Entity and FeatureView to Feast store.")
    except Exception as e:
        print("Could not apply to Feast store.")
        print("Make sure you have a Feast repo initialized in this folder (feature_store.yaml present).")
        print("Quick start:\n  pip install feast\n  feast init cardiocare_repo\n  cd cardiocare_repo\n  feast apply")
        print("Underlying error:\n", str(e))
except SystemExit:
    pass
except Exception as e:
    print("Unexpected error:", str(e))

## Explanation: what the code does and how to adapt it

- Save DataFrame to Parquet
  - We take a DataFrame (`heart_disease_df` if available, else a synthetic one) and write it to `heart_disease.parquet`.
  - Parquet is a columnar, efficient file format well-supported by feature stores.

- Required timestamp columns
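  - Feast requires an event timestamp on every row for point-in-time correctness. The `FileSource` parameter that names this column is `timestamp_field` (`event_timestamp_column` in older releases).
  - Optionally, `created_timestamp_column` records when each row was written; Feast uses it to deduplicate rows that share the same entity key and event timestamp. We added both: `timestamp` (event time) and `created` (load time).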

- FileSource
  - Points Feast to the Parquet file and tells it which columns represent the event/created timestamps.

- Entity and Fields
  - `patient` is the entity (join key) using `patient_id`.
  - `cp`, `thalach`, `ca`, `thal` are `Field` definitions which describe feature names and dtypes.
  - Use Int32/Int64 for integer-coded features; use Float32/Float64 for continuous features.

- FeatureView
  - Groups the schema (`[cp, thalach, ca, thal]`), the entity (`patient`), and the source (`data_source`).

- FeatureStore.apply
  - Applies the definitions to a Feast repo (expects a `feature_store.yaml` in the repo path).
  - If you don't have one yet, run:
    - `pip install feast`
    - `feast init cardiocare_repo`
    - `cd cardiocare_repo && feast apply`

### Example: retrieving historical features (after apply)

```python
from feast import FeatureStore
import pandas as pd
store = FeatureStore(repo_path=".")
entity_df = pd.DataFrame({
    "patient_id": [1001, 1005],
    "event_timestamp": pd.to_datetime(["2025-01-01", "2025-01-02"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "heart_disease:cp",
        "heart_disease:thalach",
        "heart_disease:ca",
        "heart_disease:thal",
    ],
).to_df()
training_df.head()
```

### Common gotchas
- Ensure your DataFrame has `patient_id`, `timestamp`, and the feature columns.
- The event timestamp column (`timestamp` in this demo) must be timezone-aware or consistently UTC.
- `FeatureStore(repo_path=...)` must point to a directory containing a `feature_store.yaml`.
- If your table is large, consider storing Parquet under a dedicated data folder and using absolute paths in `FileSource`.
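
Several of these gotchas can be caught before calling Feast at all. The checks below are a minimal sketch using only the standard library; the function names are hypothetical, and the temporary directory stands in for a real repo path.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable, List


def missing_columns(columns: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return required column names that are absent, sorted for stable output."""
    return sorted(set(required) - set(columns))


def is_timezone_aware(ts: datetime) -> bool:
    """True if the timestamp carries a usable UTC offset."""
    return ts.tzinfo is not None and ts.utcoffset() is not None


def has_feature_store_yaml(repo_path: str) -> bool:
    """True if repo_path contains the feature_store.yaml that FeatureStore expects."""
    return (Path(repo_path) / "feature_store.yaml").is_file()


# Column check against the demo schema
required = ["patient_id", "timestamp", "cp", "thalach", "ca", "thal"]
absent = missing_columns(["patient_id", "cp", "thalach", "ca"], required)

# Timestamp check: naive vs. UTC-aware
naive_ok = is_timezone_aware(datetime(2025, 1, 1))
aware_ok = is_timezone_aware(datetime(2025, 1, 1, tzinfo=timezone.utc))

# Repo check, using a throwaway directory in place of a real repo
with tempfile.TemporaryDirectory() as tmp:
    repo_missing = has_feature_store_yaml(tmp)
    (Path(tmp) / "feature_store.yaml").write_text("project: cardiocare_repo\n")
    repo_present = has_feature_store_yaml(tmp)

print(absent, naive_ok, aware_ok, repo_missing, repo_present)
```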
