# Module 7: Feature Store and MLOps

## Business Context: TechCorp HR Analytics (finale)

**Where We Are:**
We've built a complete salary prediction pipeline: EDA -> Splitting -> Imputation -> Transformation -> Engineering -> ML Model. Now HR wants to **productionize** this solution.

**Why Feature Store Matters for HR:**
1. **Consistency:** When predicting salary for a new hire, we need the EXACT same feature transformations
2. **Time-travel:** If we retrain the model later, we need features as they were at prediction time
3. **Governance:** Auditors need to trace back how each salary prediction was made
4. **Reusability:** Other teams (Recruiting, Finance) can reuse our employee features

**The Production Flow:**
```
New Hire Application -> Feature Store (lookup) -> Model (predict) -> Recommended Salary
                              |
               Same features used in training!
```

---

**Training Objective:** Master Databricks Feature Store and Model Registry for production-grade ML workflows.

**Scope:**
- Feature Store Creation: Creating and populating Feature Tables
- Feature Lookup: Creating training datasets with point-in-time correctness
- Training & Logging: Training models with feature lineage
- Model Registry: Promoting models to Production with Unity Catalog

## Context and Requirements

- **Training day:** Day 1 - Data Preparation Fundamentals (Advanced)
- **Notebook type:** Demo
- **Technical requirements:**
  - Databricks Runtime 14.x LTS or newer
  - Unity Catalog enabled
  - Feature Store enabled (default in UC)
  - Permissions: CREATE TABLE, SELECT, MODIFY, CREATE MODEL
- **Dependencies:** `05_Feature_Engineering.ipynb` (creates `customer_train_engineered` table)
- **Execution time:** ~25 minutes

> **Note:** Feature Store is essential for production ML - it ensures consistency between training and inference.

## Theoretical Introduction

**What is Feature Store?**

A centralized repository for features that enables:
- **Write Once, Use Everywhere**: Define feature logic once, reuse for training and inference
- **Training-Serving Consistency**: Same features in development and production
- **Point-in-Time Correctness**: Prevent data leakage with temporal joins
- **Feature Discovery**: Teams can find and reuse existing features

**Feature Store Workflow:**

```
Raw Data → Feature Engineering → Feature Table → Training Set → Model
                                       ↓
                              Inference (Batch/Online)
```

**Key Concepts:**

| Concept | Description |
|---------|-------------|
| **Feature Table** | Delta table with primary key and optional timestamp |
| **Feature Lookup** | Join features to training labels by key |
| **Point-in-Time Join** | Only use features available at observation time |
| **fs.log_model** | Model remembers which features it needs |
| **score_batch** | Auto-fetches features by ID for inference |

**Model Registry with Unity Catalog:**
- Models registered in `catalog.schema.model_name`
- Use aliases like "Champion" and "Challenger" for lifecycle management

## Per-User Isolation

Run the initialization script for per-user catalog and schema isolation:

In [0]:
%run ./00_Setup

**Initialize Feature Store Client:**

In [0]:
from databricks.feature_store import FeatureStoreClient
from pyspark.sql.functions import current_timestamp

fs = FeatureStoreClient()
df = spark.table("customer_train_engineered")

# Add timestamp for Feature Store
df_fs = df.withColumn("event_timestamp", current_timestamp())

## Section 1: Creating a Feature Table

**The Problem:**
In traditional ML, data scientists often rewrite feature engineering code for training, and engineers rewrite it again for production. This leads to **Training-Serving Skew** (inconsistencies).

**The Solution:**
The **Feature Store** acts as a centralized repository.
1.  **Write Once:** Define feature logic once.
2.  **Use Everywhere:** Fetch features for training (offline) or inference (online) ensuring consistency.
3.  **Discovery:** Other teams can find and reuse your features (e.g., "Customer LTV").

We register our engineered features so others can use them without re-running the engineering code.
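
To make the "write once, use everywhere" idea concrete, here is a minimal plain-Python sketch that uses no Databricks APIs. The function `compute_employee_features` and the fields `hire_date` and `birth_year` are hypothetical, chosen only for illustration. The point is that training and inference call the same function, so the two paths cannot drift apart.

```python
from datetime import date


def compute_employee_features(record: dict, as_of: date) -> dict:
    """Single source of truth for the feature logic (hypothetical fields)."""
    return {
        "tenure_days": (as_of - record["hire_date"]).days,
        "age": as_of.year - record["birth_year"],
    }


as_of = date(2024, 1, 1)

# Training path: features for historical records
history = [
    {"id": 1, "hire_date": date(2020, 1, 1), "birth_year": 1990},
    {"id": 2, "hire_date": date(2022, 6, 15), "birth_year": 1985},
]
train_features = {r["id"]: compute_employee_features(r, as_of) for r in history}

# Inference path: the very same function is reused for a new record
new_hire = {"id": 3, "hire_date": date(2023, 12, 1), "birth_year": 1995}
serving_features = compute_employee_features(new_hire, as_of)
```

A feature table plays the same role at scale: the logic runs once, and both training and inference read its stored output.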

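**Note:** Creating the feature table will probably fail at this point, because columns of type `VectorUDT` are not supported by Feature Store. The cells below inspect the schema and convert those columns to arrays.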


In [0]:
display(df_fs.summary())

In [0]:
# printSchema() prints the schema itself and returns None, so display() is not needed
df_fs.printSchema()

In [0]:
from pyspark.ml.functions import vector_to_array

# Convert all VectorUDT columns of df_fs to arrays
cols_to_convert = [
    "country_vec", "features_num", "features_std",
    "features_minmax", "features_robust", "features_final"
]

for col_name in cols_to_convert:
    # "float32" saves space; "float64" is the default precision
    df_fs = df_fs.withColumn(col_name, vector_to_array(col_name, dtype="float64"))

# df_fs now has ArrayType columns instead of VectorUDT, which Feature Store accepts

In [0]:
# The former vector columns should now be listed as array<double>
df_fs.printSchema()

Display the data again; the former vector columns should now appear as arrays:

In [0]:
display(df_fs)

In [0]:
%skip
fs.drop_table(name=f"{catalog_name}.{schema_name}.customer_features")

In [0]:
table_name = f"{catalog_name}.{schema_name}.customer_features"

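# Create Feature Table
# Note: create_table has no overwrite mode and is expected to fail if the table
# already exists. To re-run this cell, drop the table first with fs.drop_table.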
fs.create_table(
    name=table_name,
    primary_keys=["id"], 
    # timestamp_keys=["event_timestamp"],
    df=df_fs,
    description="Customer features for salary prediction: Age, Experience Years, Tenure Days, LTV Proxy"
)

print(f"Feature Table {table_name} created.")

## Section 2: Reading & Training Set (Lookup)

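One of the main benefits of Feature Store is **Point-in-Time Lookup**.
When creating a training set, we join features to our labels. If the feature table declares a timestamp key (`timestamp_keys`) and the lookup passes a `timestamp_lookup_key`, Feature Store uses, for each label (observation), only the feature values that were available *at that time*, which prevents data leakage. In this demo the timestamp key is left commented out in `create_table`, so the lookup below is a plain join on `id`.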
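
The point-in-time rule can be sketched in plain Python. This illustrates the join semantics only, not how Feature Store implements them. The integer timestamps, the `feature_history` data and the helper `point_in_time_lookup` are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical feature history: id -> list of (timestamp, value), sorted by timestamp
feature_history = {
    1: [(10, 2.0), (20, 3.5), (30, 5.0)],
}


def point_in_time_lookup(history, key, observed_at):
    """Return the latest feature value with timestamp <= observed_at, else None."""
    rows = history.get(key, [])
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, observed_at)
    return rows[idx - 1][1] if idx > 0 else None


# Label observations (the "spine"): (id, observation time)
spine = [(1, 5), (1, 20), (1, 25), (1, 99)]
joined = [(k, t, point_in_time_lookup(feature_history, k, t)) for k, t in spine]
```

The observation at time 5 gets no feature value, because nothing had been recorded yet. The observation at time 25 gets the value written at time 20, never the later one from time 30.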

In [0]:
# 1. Simple Read
df_features = fs.read_table(name=table_name)
display(df_features.limit(5))

# 2. Create Training Set with Feature Lookup
from databricks.feature_store import FeatureLookup

# Spine: lookup key + label only. The label is salary_imputed, the same column
# the training cell uses as its target.
df_spine = spark.table("customer_train_engineered").select("id", "salary_imputed")

# Features for SALARY PREDICTION. Nothing derived from the target may appear here
# (no salary_imputed, no log_salary): that would be data leakage.
feature_lookups = [
    FeatureLookup(
        table_name=table_name,
        feature_names=["age_imputed", "experience_years", "tenure_days", "ltv_proxy"],
        lookup_key="id"
    )
]

training_set = fs.create_training_set(
    df=df_spine,
    feature_lookups=feature_lookups,
    label="salary_imputed",
    exclude_columns=["id"]
)

df_training = training_set.load_df()
display(df_training.limit(5))

## Section 3: Training & Logging with Feature Store

We will now train a model and log it using `fs.log_model`.
**Why?**
When we log with Feature Store, the model remembers exactly which features it needs. At inference time, we just provide the `id`, and the model automatically fetches the features from the store!
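
Conceptually, logging with a training set packages the model together with a description of the features it needs. The sketch below mimics that idea in plain Python. `PackagedModel`, the in-memory `feature_table` and the fixed linear formula are hypothetical stand-ins, not the Feature Store implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PackagedModel:
    """Hypothetical stand-in for a model logged together with its feature spec."""
    predict_fn: Callable[[List[float]], float]
    feature_names: List[str]  # recorded at logging time
    lookup_key: str

    def score(self, rows, feature_table):
        results = []
        for row in rows:
            # Fetch the features by key, in the order recorded at logging time
            feats = feature_table[row[self.lookup_key]]
            vector = [feats[name] for name in self.feature_names]
            results.append({**row, "prediction": self.predict_fn(vector)})
        return results


feature_table: Dict[int, Dict[str, float]] = {
    1: {"experience_years": 5.0, "tenure_days": 400.0},
    2: {"experience_years": 10.0, "tenure_days": 1200.0},
}

# Toy "model": a fixed linear formula, not a trained one
packaged = PackagedModel(
    predict_fn=lambda v: 30000.0 + 2000.0 * v[0] + 1.0 * v[1],
    feature_names=["experience_years", "tenure_days"],
    lookup_key="id",
)

# The caller supplies IDs only; the packaged model does the lookup itself
predictions = packaged.score([{"id": 1}, {"id": 2}], feature_table)
```

Because the feature list travels with the model, the caller cannot accidentally pass features in a different order or from a different transformation.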

In [0]:
import mlflow
from sklearn.linear_model import LinearRegression

# Prepare data for Scikit-Learn
pdf = df_training.toPandas()
X = pdf.drop("salary_imputed", axis=1)
y = pdf["salary_imputed"]

# Train a simple model
model = LinearRegression()
model.fit(X, y)

# Log Model
# We use fs.log_model instead of mlflow.sklearn.log_model
model_name = f"{catalog_name}.{schema_name}.customer_salary_model"

with mlflow.start_run(run_name="FS_Model_Demo") as run:
    fs.log_model(
        model,
        artifact_path="model",
        flavor=mlflow.sklearn,
        training_set=training_set,
        registered_model_name=model_name # This registers the model automatically!
    )
    print(f"Model logged and registered as: {model_name}")
    print(f"Run ID: {run.info.run_id}")


## Section 4: Model Registry (Transition to Production)

The model is now registered in Unity Catalog. We can manage its lifecycle using aliases.
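
An alias is a named pointer to one model version, and promoting a model means moving that pointer. The sketch below models this with a hypothetical in-memory `AliasRegistry` class. It illustrates the lifecycle only and does not call MLflow.

```python
class AliasRegistry:
    """Hypothetical in-memory model of alias -> version pointers."""

    def __init__(self):
        self.versions = []  # registered version numbers
        self.aliases = {}   # alias name -> version number

    def register_version(self) -> int:
        version = len(self.versions) + 1
        self.versions.append(version)
        return version

    def set_alias(self, alias: str, version: int) -> None:
        if version not in self.versions:
            raise ValueError(f"unknown version {version}")
        self.aliases[alias] = version  # an alias points to exactly one version

    def resolve(self, alias: str) -> int:
        return self.aliases[alias]


registry = AliasRegistry()
v1 = registry.register_version()
registry.set_alias("Champion", v1)

v2 = registry.register_version()
registry.set_alias("Challenger", v2)

# Promote: move the Champion pointer. Consumers that resolve "Champion" now get v2.
registry.set_alias("Champion", registry.resolve("Challenger"))
```

Consumers that load the model by alias need no code change when a new version is promoted.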

In [0]:
from mlflow.tracking import MlflowClient

client = MlflowClient()

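# Get the latest version of the model we just registered.
# `latest_versions` is typically empty for Unity Catalog models (it is tied to stages),
# so we search the model versions and take the highest version number instead.
versions = client.search_model_versions(f"name='{model_name}'")
latest_version = str(max(int(mv.version) for mv in versions))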

print(f"Latest Version: {latest_version}")

# Transition to Production (Alias in Unity Catalog)
# In UC, we use Aliases like 'Champion' or 'Challenger' instead of Stages
client.set_registered_model_alias(model_name, "Champion", latest_version)

print(f"Model version {latest_version} set as 'Champion'.")


## Section 5: Batch Inference (Scoring)

The magic of Feature Store is that we don't need to assemble features manually for inference.
We just provide the **IDs**, and `fs.score_batch` automatically looks up the correct features from the store and applies the model.

In [0]:
# Simulate new data (Just IDs!)
df_new = spark.createDataFrame([(1,), (2,), (5,)], ["id"])

print("Scoring new data (IDs only)...")

# Score Batch
# Note: We use the model URI from Unity Catalog.
# An alias is referenced with "@": models:/<name>@<alias>
predictions = fs.score_batch(
    model_uri=f"models:/{model_name}@Champion",
    df=df_new
)

display(predictions)


## Best Practices

### Feature Store Strategy Guide:

| Aspect | Best Practice | Why |
|--------|--------------|-----|
| **Primary Keys** | Use stable IDs (customer_id, not row number) | Consistent lookups |
| **Timestamps** | Include `event_timestamp` | Point-in-time correctness |
| **Feature Names** | Descriptive (e.g., `customer_ltv_30d`) | Discoverability |
| **Granularity** | One table per entity type | Cleaner schema |
| **Updates** | Use `fs.write_table(..., mode="merge")` for incremental updates | Efficiency |
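
A merge update behaves like an upsert on the primary key. The plain-Python sketch below shows that behaviour. The helper `merge_features` and the sample rows are hypothetical, and this is a simplification rather than the Delta merge implementation.

```python
def merge_features(table: dict, updates: list, primary_key: str = "id") -> dict:
    """Upsert rows by primary key: update matching keys, insert new ones."""
    merged = dict(table)
    for row in updates:
        key = row[primary_key]
        # Build a new row so the input table is left untouched
        merged[key] = {**merged.get(key, {}), **row}
    return merged


existing = {
    1: {"id": 1, "tenure_days": 400, "ltv_proxy": 1.2},
    2: {"id": 2, "tenure_days": 900, "ltv_proxy": 3.4},
}
updates = [
    {"id": 2, "tenure_days": 901},                   # existing key: updated
    {"id": 3, "tenure_days": 10, "ltv_proxy": 0.1},  # new key: inserted
]
result = merge_features(existing, updates)
```

Only the rows present in the update batch are touched, which is why merging is cheaper than rewriting the whole table.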

### Common Mistakes to Avoid:

1. **No timestamp column** → Cannot do point-in-time joins
2. **Duplicating features** → Use existing tables, don't recreate
3. **Logging without training_set** → Model doesn't know its features
4. **Using the wrong model URI** → Reference the production alias, e.g. `models:/<model_name>@Champion`
5. **Not testing inference** → Always verify score_batch works

### Pro Tips:

- Use Unity Catalog for governance and lineage
- Set up an online store for real-time inference (<10 ms)
- Use aliases (Champion/Challenger) for A/B testing
- Monitor feature freshness in production
- Document features in the description field

## Summary

### What we achieved:

- **Feature Table**: Created centralized feature repository
- **Feature Lookup**: Built training set with automatic joins
- **fs.log_model**: Logged model with feature lineage
- **Model Registry**: Registered and promoted model to Champion
- **score_batch**: Demonstrated automatic feature fetching

### Key Takeaways:

| # | Principle |
|---|-----------|
| 1 | **Feature Store ensures consistency** - same features everywhere |
| 2 | **Point-in-time joins prevent leakage** - use timestamps |
| 3 | **fs.log_model links model to features** - automatic inference |
| 4 | **Unity Catalog provides governance** - lineage and access control |
| 5 | **score_batch simplifies inference** - just provide IDs |

### Artifacts Created:

| Artifact | Location | Purpose |
|----------|----------|---------|
| Feature Table | `{catalog}.{schema}.customer_features` | Centralized features |
| Registered Model | `{catalog}.{schema}.customer_salary_model` | Production model |
| Champion Alias | Applied to latest version | Production deployment |

### Next Steps:

**Next Module:** Module 8 (BONUS) - GenAI, Vector Search & AI Functions

## Cleanup

Optionally remove demo artifacts created during exercises:

In [0]:
# Cleanup - remove demo artifacts created in this notebook

# Uncomment the lines below to remove demo artifacts:

# spark.sql(f"DROP TABLE IF EXISTS {table_name}")  # Feature table
# client.delete_registered_model(model_name)  # Registered model

# print("All demo artifacts removed")

print("ℹ️ Cleanup disabled (uncomment code to remove demo artifacts)")