# Databricks Data Preparation in ML - Notebook 07
## Feature Store Practical Demo

**Part of the Databricks Data Preparation in ML Training Series**

---

## Objectives

Hands-on Feature Store demo in Databricks:

- **Feature Store Setup** - Initialize and configure Feature Store
- **Feature Tables** - Create and manage feature tables
- **Feature Publishing** - Write and update features
- **Model Training** - Use features for ML training
- **MLflow Integration** - Track features with models

## Duration: ~30 minutes | Level: Intermediate

---

## What is Feature Store?

**Centralized feature management platform:**
- **Shared Features**: Reuse features across teams
- **Point-in-Time**: Look up feature values as of each training example's timestamp, so no future data leaks into training
- **Versioning**: Track feature changes over time
- **Serving**: Online and batch feature serving
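
The point-in-time idea can be illustrated without Spark. The following is a minimal plain-Python sketch, not the Feature Store implementation: the helper `point_in_time_lookup` and the sample history are hypothetical, and it assumes each key's history is already sorted by timestamp.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_lookup(history, key, as_of):
    """Return the latest feature value for `key` with timestamp <= `as_of`.

    `history` maps each key to a list of (timestamp, value) pairs sorted
    by timestamp. Returns None when no value existed yet at `as_of`.
    """
    rows = history.get(key, [])
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, as_of)  # count of rows at or before as_of
    return rows[idx - 1][1] if idx > 0 else None


# Hypothetical history of total_orders for one customer
history = {
    "CUST_0001": [
        (datetime(2024, 1, 1), 3),
        (datetime(2024, 1, 15), 5),
        (datetime(2024, 2, 1), 8),
    ]
}

value_mid_january = point_in_time_lookup(history, "CUST_0001", datetime(2024, 1, 20))
value_before_history = point_in_time_lookup(history, "CUST_0001", datetime(2023, 12, 31))
```

A label dated January 20 sees the January 15 value rather than the later February 1 value, and a label that predates all history gets no value at all.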

## 1. Feature Store Setup

In [0]:
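# Initialize Feature Store
from databricks.feature_store import FeatureStoreClient
# Explicit imports avoid shadowing Python builtins such as sum and max,
# which a wildcard import of pyspark.sql.functions would replace
from pyspark.sql import functions as F
from pyspark.sql.types import (
    StructType, StructField, StringType,
    TimestampType, IntegerType, DoubleType,
)
import mlflow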

# Create Feature Store client
fs = FeatureStoreClient()

# Setup database
DATABASE_NAME = "feature_store_demo"
spark.sql(f"CREATE DATABASE IF NOT EXISTS {DATABASE_NAME}")
spark.sql(f"USE {DATABASE_NAME}")

# Verify setup
spark.sql("SELECT 'Feature Store ready!' as status").show()

## 2. Create Sample Data

In [0]:
# Create customer features dataset
from datetime import datetime, timedelta
import random

# Generate 500 customer records (smaller for demo)
random.seed(42)
customers = []

for i in range(1, 501):
    customer_id = f"CUST_{i:04d}"
    feature_timestamp = datetime.now() - timedelta(days=random.randint(0, 30))
    
    customers.append((
        customer_id,
        feature_timestamp,
        random.randint(1, 50),                    # total_orders
        random.uniform(100, 5000),                # total_spent
        random.randint(1, 365),                   # days_since_last_order
        random.choice(['Electronics', 'Clothing', 'Home', 'Books'])  # favorite_category
    ))

# Create DataFrame
schema = StructType([
    StructField("customer_id", StringType(), False),
    StructField("feature_timestamp", TimestampType(), False),
    StructField("total_orders", IntegerType(), True),
    StructField("total_spent", DoubleType(), True),
    StructField("days_since_last_order", IntegerType(), True),
    StructField("favorite_category", StringType(), True)
])

customer_features = spark.createDataFrame(customers, schema)

In [0]:
# Preview sample data
display(customer_features)
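
Because the generation cell calls `datetime.now()`, the timestamps change on every run even though the random seed is fixed. One possible variant, sketched here as an assumption rather than part of the original demo, wraps the loop in a hypothetical `make_customers` helper that takes a fixed reference time and uses its own `random.Random` instance.

```python
import random
from datetime import datetime, timedelta

CATEGORIES = ["Electronics", "Clothing", "Home", "Books"]


def make_customers(n, seed=42, reference_time=datetime(2024, 1, 31)):
    """Generate n customer rows in the same column order as the demo schema."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    rows = []
    for i in range(1, n + 1):
        rows.append((
            f"CUST_{i:04d}",                                      # customer_id
            reference_time - timedelta(days=rng.randint(0, 30)),  # feature_timestamp
            rng.randint(1, 50),                                   # total_orders
            rng.uniform(100, 5000),                               # total_spent
            rng.randint(1, 365),                                  # days_since_last_order
            rng.choice(CATEGORIES),                               # favorite_category
        ))
    return rows


rows = make_customers(500)
```

The rows follow the column order of `schema`, so they could be passed to `spark.createDataFrame(rows, schema)` in place of `customers`.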

## 3. Create Feature Table

In [0]:
# Create feature table in Feature Store
feature_table_name = f"{DATABASE_NAME}.customer_features"

# Create table with primary key and timestamp
fs.create_table(
    name=feature_table_name,
    primary_keys=["customer_id"],
    timestamp_keys=["feature_timestamp"],
    df=customer_features,
    description="Customer behavioral features for ML models"
)

# Display table info
fs.get_table(feature_table_name)
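
The objectives mention updating features, but the cell above only writes the table once at creation. On Databricks, later updates would typically go through `fs.write_table(name=feature_table_name, df=new_features_df, mode="merge")`, where `new_features_df` is a hypothetical DataFrame of fresh rows. The plain-Python sketch below only illustrates upsert semantics; it assumes rows are matched on the primary key plus the timestamp key, and `merge_rows` is an illustrative helper, not a Feature Store function.

```python
def merge_rows(existing, incoming, key_fields=("customer_id", "feature_timestamp")):
    """Upsert `incoming` rows into `existing`, matching on `key_fields`."""
    # Copy existing rows so the input list is not mutated
    merged = {tuple(row[f] for f in key_fields): dict(row) for row in existing}
    for row in incoming:
        key = tuple(row[f] for f in key_fields)
        # Matched rows get their supplied columns updated; unmatched rows are inserted
        merged.setdefault(key, {}).update(row)
    return list(merged.values())


existing = [
    {"customer_id": "CUST_0001", "feature_timestamp": "2024-01-01", "total_orders": 3},
    {"customer_id": "CUST_0002", "feature_timestamp": "2024-01-01", "total_orders": 7},
]
incoming = [
    {"customer_id": "CUST_0001", "feature_timestamp": "2024-01-01", "total_orders": 4},
    {"customer_id": "CUST_0001", "feature_timestamp": "2024-01-08", "total_orders": 6},
]

table = merge_rows(existing, incoming)
```

The first incoming row overwrites an existing row with the same key, while the second adds a new point in that customer's history.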

## 4. Read Features from Feature Store

In [0]:
# Read features from Feature Store
features_df = fs.read_table(feature_table_name)

# Preview the first 100 rows, ordered by customer ID
display(features_df.orderBy("customer_id").limit(100))

## 5. Training with Feature Store

In [0]:
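import random

# Create training labels for the first 100 customers
random.seed(42)  # reproducible labels
training_labels = []
for i in range(1, 101):
    customer_id = f"CUST_{i:04d}"
    # Simulate high-value customer target
    is_high_value = random.choice([0, 1])
    training_labels.append((customer_id, is_high_value))

raw_labels_df = spark.createDataFrame(training_labels, ["customer_id", "target"])

# Attach each customer's feature timestamp as the point-in-time lookup key.
# The training cells expect this DataFrame to be named labels_df and the
# timestamp column to be named lookup_timestamp.
labels_df = (
    raw_labels_df
    .join(features_df, on="customer_id")
    .select("customer_id", "feature_timestamp", "target")
    .withColumnRenamed("feature_timestamp", "lookup_timestamp")
)

# Preview sample data
display(labels_df)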

In [0]:
# Create training set with Feature Store
from databricks.feature_store import FeatureLookup

feature_lookups = [
    FeatureLookup(
        table_name=feature_table_name,
        lookup_key="customer_id",
        timestamp_lookup_key="lookup_timestamp",
        feature_names=["total_orders", "total_spent", "days_since_last_order"]
    )
]

# Create training set
training_set = fs.create_training_set(
    df=labels_df,
    feature_lookups=feature_lookups,
    label="target",
    exclude_columns=["lookup_timestamp"]
)

# Load training data
training_df = training_set.load_df()
display(training_df.limit(5))

In [0]:
display(training_df)

In [0]:

import mlflow
import mlflow.spark
from databricks.feature_store import FeatureStoreClient
from databricks.feature_store import FeatureLookup

# Initialize clients
fs = FeatureStoreClient()
mlflow.set_experiment("/Shared/feature_store_ml_experiments")

# Start MLflow run for feature store training
with mlflow.start_run(run_name="feature_store_training_pipeline"):
    
    # 1. Create feature lookups with MLflow tracking
    feature_lookups = [
        FeatureLookup(
            table_name=feature_table_name,
            lookup_key="customer_id",
            timestamp_lookup_key="lookup_timestamp",
            feature_names=["total_orders", "total_spent", "days_since_last_order"]  
        )
    ]
    
    # Log feature store configuration
    mlflow.log_param("feature_table", feature_table_name)
    mlflow.log_param("feature_count", len(feature_lookups[0].feature_names))
    mlflow.log_param("lookup_key", "customer_id")
    mlflow.log_param("timestamp_aware", True)
    
    # 2. Create training set with feature store
    print("Creating Feature Store training set...")
    training_set = fs.create_training_set(
        df=labels_df,
        feature_lookups=feature_lookups,
        label="target",
        exclude_columns=["lookup_timestamp"]
    )
    
    # Load training data
    training_df = training_set.load_df()
    training_df = training_df.dropna()
    training_count = training_df.count()
    feature_columns = [col for col in training_df.columns if col not in ["customer_id", "target"]]
    
    mlflow.log_metric("training_records", training_count)
    mlflow.log_metric("feature_count_final", len(feature_columns))
    
    # 3. Train a simple model (for demonstration)
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    
    # Split for validation
    train_split, val_split = training_df.randomSplit([0.8, 0.2], seed=42)
    
    # Bundle feature assembly and the classifier into one pipeline so the
    # logged model accepts the raw Feature Store columns at scoring time
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="target")
    model = Pipeline(stages=[assembler, lr]).fit(train_split)
    
    # Evaluate model (the evaluator reports areaUnderROC by default)
    predictions = model.transform(val_split)
    evaluator = BinaryClassificationEvaluator(labelCol="target")
    auc = evaluator.evaluate(predictions)
    
    mlflow.log_metric("validation_auc", auc)
    
    # 4. Log model with feature store dependencies
    print("📝 Logging model with Feature Store dependencies...")
    fs.log_model(
        model,
        "feature_store_model",
        flavor=mlflow.spark,
        training_set=training_set,
        registered_model_name="customer_prediction_with_features"
    )
    
    print(f"Model trained and logged with AUC: {auc:.3f}")
    print(f"Used {len(feature_columns)} features from Feature Store")
    print(f"Training set: {training_count} records")
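
The labels in this demo come from `random.choice([0, 1])`, so the reported validation AUC should hover around 0.5, and with only about 20 validation rows it can vary noticeably. To make the metric concrete, here is a small plain-Python sketch of ROC AUC as the fraction of positive/negative pairs ranked correctly. The `roc_auc` helper is illustrative only and is not how Spark computes the metric internally.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])        # every pair ordered correctly
uninformative = roc_auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])  # all scores tied
mixed = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])         # one of four pairs misordered
```

A model that cannot separate the classes scores 0.5, which is the expected outcome when the target is random.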

## Summary

### What we accomplished:
✅ **Feature Store Setup**: Initialized Feature Store client  
✅ **Feature Table**: Created table with primary keys and timestamps  
✅ **Data Management**: Stored and retrieved customer features  
✅ **Training Integration**: Used features for ML model training  
✅ **MLflow Integration**: Tracked models with feature dependencies  

### Key Benefits:
- **Feature Reuse**: Share features across teams and projects
- **Point-in-Time**: Historical feature values for training
- **Consistency**: Same features for training and serving
- **Governance**: Centralized feature management

### Next Steps:
1. **Feature Updates**: Implement regular feature refresh pipelines
2. **Online Serving**: Set up real-time feature serving
3. **Monitoring**: Add feature drift and quality monitoring
4. **Production**: Deploy models with Feature Store integration
