# End-to-End Feature Store Example

> A comprehensive guide to setting up and using the Snowflake Feature Store for ML workflows.

## Overview

This notebook demonstrates a complete ML feature engineering workflow using Snowflake's Feature Store. We'll create a permanent feature store that:
1. Persists across sessions
2. Is accessible through Snowsight UI
3. Includes feature monitoring
4. Demonstrates best practices

### Why Use a Feature Store?

Feature stores solve several critical ML challenges:

1. **Feature Consistency**
   - Ensure features are computed identically in training and production
   - Version control for feature definitions
   - Single source of truth for feature transformations

2. **Feature Reuse**
   - Share features across teams and projects
   - Reduce duplicate computation
   - Standardize common transformations

3. **Feature Monitoring**
   - Track feature drift over time
   - Monitor feature quality
   - Alert on data quality issues

4. **Feature Discovery**
   - Make features searchable and discoverable
   - Document feature definitions and usage
   - Track feature dependencies

## Prerequisites

Before running this example, you need:

1. Appropriate permissions (see below)

2. Python environment with required packages


## Setup

First, we'll import the modules used throughout this notebook:

In [None]:
import os
from typing import Optional
from pathlib import Path
import logging
from datetime import datetime, timedelta

import snowflake.snowpark.functions as F

from snowflake_feature_store.connection import get_connection
from snowflake_feature_store.manager import FeatureStoreManager
from snowflake_feature_store.config import (
    FeatureViewConfig, FeatureConfig, RefreshConfig, 
    FeatureValidationConfig
)
from snowflake_feature_store.transforms import (
    Transform, TransformConfig, moving_agg, 
    fill_na, date_diff
)
from snowflake_feature_store.examples import (
    get_example_data, create_feature_configs
)
from snowflake_feature_store.logging import logger



## Step 1: Setting Up the Feature Store

First, we'll create a permanent feature store in Snowflake. This differs from our previous examples, which used temporary schemas.

### Required Permissions

To create a permanent feature store, you need:
- `CREATE DATABASE` if creating a new database
- `CREATE SCHEMA` in the target database
- `USAGE` on the warehouse
- `CREATE TABLE` in the target schema
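
The permission list above can be turned into `GRANT` statements for an administrator to review. The following is a minimal sketch: the role name `FEATURE_STORE_ADMIN` and the helper `build_grant_statements` are hypothetical, and your account may need further privileges that are not listed here.

```python
from typing import List


def build_grant_statements(role: str, database: str, schema: str, warehouse: str) -> List[str]:
    """Return one GRANT statement per privilege in the list above."""
    return [
        f"GRANT CREATE DATABASE ON ACCOUNT TO ROLE {role}",
        f"GRANT CREATE SCHEMA ON DATABASE {database} TO ROLE {role}",
        f"GRANT USAGE ON WAREHOUSE {warehouse} TO ROLE {role}",
        f"GRANT CREATE TABLE ON SCHEMA {database}.{schema} TO ROLE {role}",
    ]


# FEATURE_STORE_ADMIN is a hypothetical role name used only for illustration.
statements = build_grant_statements(
    role="FEATURE_STORE_ADMIN",
    database="DATASCIENCE",
    schema="FEATURE_STORE_DEMO",
    warehouse="DS_XS_WH",
)
for stmt in statements:
    print(stmt + ";")
```

The statements are only printed here; running them requires a role that is itself allowed to grant these privileges.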

### Why Permanent vs Temporary?

Permanent feature stores offer several advantages:
1. Persistence across sessions
2. Accessibility through Snowsight UI
3. Ability to share with other users/roles
4. Integration with other Snowflake tools

In [None]:
# Target database, schema, and warehouse for the permanent feature store
database: str = "DATASCIENCE"
schema: str = "FEATURE_STORE_DEMO"
warehouse: str = "DS_XS_WH"

conn = get_connection()

# Create database and schema
conn.session.sql(f"CREATE DATABASE IF NOT EXISTS {database}").collect()
conn.session.sql(f"CREATE SCHEMA IF NOT EXISTS {database}.{schema}").collect()

logger.info(f"Created and using {database}.{schema}")

2025-02-17 20:31:53,306 - snowflake_feature_store - INFO - No active session found, creating new connection from environment
2025-02-17 20:31:54,142 - snowflake_feature_store - INFO - Initialized connection to "DATASCIENCE"."FEATURE_STORE_DEMO"
2025-02-17 20:31:54,559 - snowflake_feature_store - INFO - Created and using DATASCIENCE.FEATURE_STORE_DEMO


## Step 2: Data Generation and Loading

In this section, we'll create example customer data. In a real scenario, you'd load your own data, but this example shows:
1. Proper data typing for Snowflake
2. Handling temporal data correctly
3. Setting up data quality checks
4. Creating realistic patterns in the data

### Why This Structure?

This structure demonstrates common ML feature engineering challenges:
1. **Time-based Features**: Rolling averages, time windows
2. **Missing Data**: Handling sparse observations
3. **Multiple Metrics**: Combining different data types
4. **Entity Resolution**: Linking data to customers

### Why LTV Prediction?

LTV prediction is a common ML use case that demonstrates key feature store benefits:
1. **Time-based Features**: Customer spending patterns over time
2. **Multiple Data Sources**: Combining transactions, web analytics, and customer data
3. **Feature Freshness**: Regular updates as new transactions occur
4. **Point-in-Time Correctness**: Avoiding data leakage in training

### Data Structure
Our LTV example includes:
- `LIFE_TIME_VALUE`: Current customer value (target)
- `SESSION_LENGTH`: Customer engagement metric
- `TRANSACTIONS`: Number of transactions
- `TIME_ON_APP` / `TIME_ON_WEBSITE`: Time spent on each engagement channel
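
To make the column layout concrete, here is a minimal local sketch that builds rows with the same column names. It is not the implementation of `get_example_data`; the helper `make_customer_activity`, the value ranges, and the 10% missing rate are all assumptions chosen for illustration.

```python
import random
from datetime import date, timedelta
from typing import Dict, List


def make_customer_activity(num_customers: int, num_days: int, seed: int = 42) -> List[Dict]:
    """Build one row per customer per day with the columns listed above."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    start = date(2025, 1, 18)  # arbitrary start date for the illustration
    rows = []
    for c in range(num_customers):
        base_ltv = rng.uniform(100, 700)  # assumed per-customer spending level
        for d in range(num_days):
            rows.append({
                "CUSTOMER_ID": f"C{c}",
                "DATE": start + timedelta(days=d),
                "LIFE_TIME_VALUE": round(base_ltv + rng.uniform(-50, 50), 2),
                # Roughly 10% of session lengths are missing, to exercise imputation
                "SESSION_LENGTH": None if rng.random() < 0.1 else round(rng.uniform(0, 12), 2),
                "TIME_ON_APP": round(rng.uniform(0, 12), 2),
                "TIME_ON_WEBSITE": round(rng.uniform(0, 12), 2),
                "TRANSACTIONS": rng.randint(1, 7),
            })
    return rows


sample = make_customer_activity(num_customers=3, num_days=5)
print(len(sample), sample[0]["CUSTOMER_ID"], sample[0]["DATE"])
```

Each customer gets one row per day, which matches the customer-day granularity used by the rest of the example.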


In [None]:
# Number of synthetic customers to generate
num_customers = 100

# Generate data with patterns
df = get_example_data(
    conn.session,
    schema,
    num_customers,
)

# Show data profile
logger.info("\nData Profile:")
total_rows = df.count()  # count once instead of once per column
for col in df.columns:
    null_count = df.filter(F.col(col).is_null()).count()
    null_pct = null_count / total_rows * 100 if total_rows else 0.0
    logger.info(f"{col}: {null_pct:.1f}% null")

2025-02-17 20:31:58,160 - snowflake_feature_store - INFO - Generated 2400 rows of demo data in "DATASCIENCE".FEATURE_STORE_DEMO.CUSTOMER_ACTIVITY
2025-02-17 20:31:58,162 - snowflake_feature_store - INFO - 
Sample Data:
----------------------------------------------------------------------------------------------------------------------------------
|"CUSTOMER_ID"  |"DATE"      |"LIFE_TIME_VALUE"   |"SESSION_LENGTH"    |"TIME_ON_APP"      |"TIME_ON_WEBSITE"   |"TRANSACTIONS"  |
----------------------------------------------------------------------------------------------------------------------------------
|C0             |2025-01-18  |411.9831046823803   |6.101285704798359   |9.912268452627666  |7.269712691809914   |4               |
|C93            |2025-02-16  |587.2728207366863   |7.409478170285487   |9.255783785633916  |10.025373746735333  |5               |
|C22            |2025-02-16  |266.1622473753326   |7.3149273852223144  |4.897446257228738  |9.4240638710075     |2            

## Step 3: Entity Creation

Entities are the foundation of your feature store. They represent the objects you're collecting features about (e.g., customers, products, transactions).

### Why Entities Matter

Proper entity design is crucial because:
1. Entities determine how features can be joined
2. Entities define the granularity of your features
3. Entities enable point-in-time correct feature retrieval
4. Entities help organize and discover features

### Entity Best Practices

1. **Unique Keys**: Choose stable, unique identifiers
2. **Granularity**: Pick the right level (e.g., customer vs. session)
3. **Documentation**: Clearly describe what the entity represents
4. **Consistency**: Use the same keys across feature views
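
One way to check the "unique keys" and "granularity" practices before registering an entity is to look for duplicate key combinations in the source rows. This is a minimal local sketch; `find_duplicate_keys` is a hypothetical helper and is not part of the feature store library.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def find_duplicate_keys(rows: List[Dict], key_cols: Sequence[str]) -> List[Tuple]:
    """Return every key combination that appears more than once."""
    counts = Counter(tuple(row[c] for c in key_cols) for row in rows)
    return [key for key, n in counts.items() if n > 1]


rows = [
    {"CUSTOMER_ID": "C0", "DATE": "2025-01-18", "TRANSACTIONS": 4},
    {"CUSTOMER_ID": "C0", "DATE": "2025-01-19", "TRANSACTIONS": 2},
    {"CUSTOMER_ID": "C1", "DATE": "2025-01-18", "TRANSACTIONS": 5},
    {"CUSTOMER_ID": "C1", "DATE": "2025-01-18", "TRANSACTIONS": 1},  # duplicate customer-day
]

duplicates = find_duplicate_keys(rows, ["CUSTOMER_ID", "DATE"])
print(duplicates)
```

An empty result suggests the data already sits at the intended granularity; any duplicates would need to be aggregated or deduplicated first.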


### Entity Design for LTV

For LTV prediction, we need a customer entity that:
1. Has a stable identifier
2. Links to all customer interactions
3. Supports time-based feature aggregation

Key considerations for our customer entity:
1. **Identifier**: Use `CUSTOMER_ID` as stable key
2. **Temporal Aspect**: Track customer since first transaction
3. **Granularity**: Customer-level for LTV prediction
4. **Documentation**: Clear description for feature discovery

In [None]:
# Create detailed documentation
description = """
Customer Entity for LTV Prediction

This entity represents individual customers and their behavior over time.
It serves as the primary entity for customer lifetime value prediction.

Key Information:
- Primary Key: CUSTOMER_ID (stable identifier)
- Temporal Key: DATE (for point-in-time correct features)
- Granularity: One record per customer per day

Usage:
1. Base entity for customer-level features
2. Join key for transaction and session data
3. Temporal alignment for time-based features

Best Practices:
- Always join using CUSTOMER_ID
- Use DATE for point-in-time correctness
- Aggregate features to customer-day level

Example:
```sql
SELECT CUSTOMER_ID, DATE, COUNT(*) as daily_transactions
FROM transactions
GROUP BY CUSTOMER_ID, DATE
```
""".strip()

import tempfile
from pathlib import Path

metrics_dir = Path(tempfile.mkdtemp()) / "feature_store_metrics"

manager = FeatureStoreManager(
    connection=conn,
    metrics_path=metrics_dir,
    overwrite=True
)

# Create entity
manager.add_entity(
    name="CUSTOMER",
    join_keys=["CUSTOMER_ID"],
    description=description,
)

logger.info("Created CUSTOMER entity for LTV prediction")
# Verify entity creation
entity = manager.feature_store.get_entity("CUSTOMER")
logger.info("\nEntity Details:")
logger.info(f"Name: {entity.name}")
logger.info(f"Join Keys: {entity.join_keys}")


2025-02-17 20:32:04,320 - snowflake_feature_store - INFO - FeatureStoreManager initialized
2025-02-17 20:32:04,494 - snowflake_feature_store - INFO - Created entity: CUSTOMER with keys: ['CUSTOMER_ID']
2025-02-17 20:32:04,496 - snowflake_feature_store - INFO - Created CUSTOMER entity for LTV prediction



2025-02-17 20:32:05,516 - snowflake_feature_store - INFO - 
Entity Details:
2025-02-17 20:32:05,517 - snowflake_feature_store - INFO - Name: CUSTOMER
2025-02-17 20:32:05,518 - snowflake_feature_store - INFO - Join Keys: ['CUSTOMER_ID']


### Key Points About This Implementation

1. **Documentation**
   - Clear description of entity purpose
   - Usage examples included
   - Best practices documented
   - SQL example provided


2. **Validation**
   - Entity creation is verified
   - Join keys are explicitly defined
   - Logging provides creation confirmation
   - Error handling is included

3. **LTV Specific**
   - Designed for customer-level predictions
   - Supports temporal feature creation
   - Enables point-in-time correct joins
   - Facilitates customer behavior tracking


## Step 4: Feature Configuration

Feature configuration is where we define what features we want to create and how they should behave.

### Feature Configuration Concepts

A feature configuration defines:
1. **Validation Rules**: Data quality checks and thresholds
2. **Dependencies**: What other features this feature needs
3. **Metadata**: Description, tags, and ownership
4. **Refresh Settings**: How often to update the feature
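
To illustrate what a validation rule with a null threshold and a value range might check, here is a minimal local sketch. It is an assumption about the general idea, not the behaviour of `FeatureValidationConfig`; in particular, whether the library compares with `>` or `>=` is not stated in this notebook.

```python
from typing import List, Optional, Sequence


def validate_feature(
    values: Sequence[Optional[float]],
    null_threshold: float,
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
) -> List[str]:
    """Return a list of human-readable validation problems (empty if none)."""
    if not values:
        return ["no rows to validate"]
    problems = []
    # Share of missing values compared against the allowed threshold
    null_ratio = sum(v is None for v in values) / len(values)
    if null_ratio > null_threshold:
        problems.append(f"null ratio {null_ratio:.1%} exceeds threshold {null_threshold:.1%}")
    # Range checks only look at the values that are present
    present = [v for v in values if v is not None]
    if present and min_value is not None and min(present) < min_value:
        problems.append(f"minimum {min(present)} is below {min_value}")
    if present and max_value is not None and max(present) > max_value:
        problems.append(f"maximum {max(present)} is above {max_value}")
    return problems


print(validate_feature([6.1, None, 7.3, 7.4], null_threshold=0.2, min_value=0))
print(validate_feature([4, 5, -1], null_threshold=0.0, min_value=0))
```

Returning a list of problems rather than raising on the first one makes it easier to report every failed check for a feature at once.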

### Why Configuration Matters

Good feature configuration ensures:
1. Data quality is maintained
2. Features are well-documented
3. Dependencies are tracked
4. Feature freshness is appropriate

### LTV Feature Configuration

For LTV prediction, we need several types of features:
1. **Behavioral Features**
   - Session metrics
   - Engagement patterns
   - Transaction history

2. **Temporal Features**
   - Time since first purchase
   - Weekly/monthly patterns
   - Rolling aggregations

3. **Derived Features**
   - Average transaction value
   - Engagement ratios
   - Time-based metrics
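
Because derived features depend on base features, it can help to compute them in dependency order. The sketch below uses the standard library's `graphlib` (Python 3.9+) on a small, hand-written dependency map. How the feature store library itself uses the `dependencies` field is not shown in this notebook, so treat this purely as an illustration.

```python
from graphlib import TopologicalSorter

# Each feature maps to the features it depends on (a hand-written subset)
dependencies = {
    "LIFE_TIME_VALUE": [],
    "TRANSACTIONS": [],
    "SUM_TRANSACTIONS_7": ["TRANSACTIONS"],
    "AVG_TRANSACTION_VALUE": ["LIFE_TIME_VALUE", "TRANSACTIONS"],
}

# static_order() yields every feature after all of its dependencies
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

`TopologicalSorter` raises `graphlib.CycleError` if two features depend on each other, which is a cheap way to catch circular definitions early.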


In [None]:

# Base features (from source data)
feature_configs = {
    "LIFE_TIME_VALUE": FeatureConfig(
        name="LIFE_TIME_VALUE",
        description="Current customer lifetime value",
        validation=FeatureValidationConfig(
            null_threshold=0.0,
            range_check=True,
            min_value=0
        )
    ),
    "SESSION_LENGTH": FeatureConfig(
        name="SESSION_LENGTH",
        description="Session length in minutes",
        validation=FeatureValidationConfig(
            null_threshold=0.2,
            range_check=True,
            min_value=0
        )
    ),
    "TRANSACTIONS": FeatureConfig(
        name="TRANSACTIONS",
        description="Number of transactions",
        validation=FeatureValidationConfig(
            null_threshold=0.0,
            range_check=True,
            min_value=0
        )
    )
}

# Time window features (match moving_agg output names)
for window in [7, 30]:
    for metric in ['TRANSACTIONS', 'LIFE_TIME_VALUE']:
        for agg in ['SUM', 'AVG']:
            feature_name = f"{agg}_{metric}_{window}"
            feature_configs[feature_name] = FeatureConfig(
                name=feature_name,
                description=f"{agg.lower()} of {metric.lower()} over {window} days",
                validation=FeatureValidationConfig(
                    null_threshold=0.1,
                    range_check=True,
                    min_value=0
                ),
                dependencies=[metric]
            )

# Derived features
feature_configs.update({
    "ENGAGEMENT_SCORE": FeatureConfig(
        name="ENGAGEMENT_SCORE",
        description="Combined engagement metric",
        validation=FeatureValidationConfig(
            null_threshold=0.1,
            range_check=True,
            min_value=0
        ),
        dependencies=["SESSION_LENGTH", "TIME_ON_APP", "TIME_ON_WEBSITE"]
    ),
    "AVG_TRANSACTION_VALUE": FeatureConfig(
        name="AVG_TRANSACTION_VALUE",
        description="Average value per transaction",
        validation=FeatureValidationConfig(
            null_threshold=0.1,
            range_check=True,
            min_value=0
        ),
        dependencies=["LIFE_TIME_VALUE", "TRANSACTIONS"]
    )
})

feature_configs

{'LIFE_TIME_VALUE': FeatureConfig(name='LIFE_TIME_VALUE', description='Current customer lifetime value', validation=FeatureValidationConfig(null_check=True, null_threshold=0.0, range_check=True, min_value=0.0, max_value=None, unique_check=False, unique_threshold=0.9), dependencies=[]),
 'SESSION_LENGTH': FeatureConfig(name='SESSION_LENGTH', description='Session length in minutes', validation=FeatureValidationConfig(null_check=True, null_threshold=0.2, range_check=True, min_value=0.0, max_value=None, unique_check=False, unique_threshold=0.9), dependencies=[]),
 'TRANSACTIONS': FeatureConfig(name='TRANSACTIONS', description='Number of transactions', validation=FeatureValidationConfig(null_check=True, null_threshold=0.0, range_check=True, min_value=0.0, max_value=None, unique_check=False, unique_threshold=0.9), dependencies=[]),
 'SUM_TRANSACTIONS_7': FeatureConfig(name='SUM_TRANSACTIONS_7', description='sum of transactions over 7 days', validation=FeatureValidationConfig(null_check=True,

## Step 5: Feature Transformations

Feature transformations convert raw data into ML-ready features. This is a critical step in the ML pipeline.

### Why Transformations Matter

Transformations serve multiple purposes:
1. **Data Quality**: Handle missing values and outliers
2. **Feature Engineering**: Create more predictive features
3. **ML Requirements**: Format data for model consumption
4. **Business Logic**: Encode domain knowledge

### Types of Transformations

1. **Basic Transformations**
   - Missing value imputation
   - Type conversion
   - Scaling/normalization

2. **Time-Based Transformations**
   - Rolling windows
   - Time since event
   - Seasonal patterns

3. **Business Transformations**
   - Derived metrics
   - Domain-specific calculations
   - Feature combinations
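
As an illustration of a rolling-window transformation, the following local sketch computes a trailing sum and average per customer. It uses a row-based window over the last `window` rows; whether `moving_agg` uses row-based or calendar-based windows is not stated here, so this is an assumption, and `rolling_sum_avg` is a hypothetical helper.

```python
from collections import defaultdict, deque
from typing import Dict, List


def rolling_sum_avg(rows: List[Dict], value_col: str, window: int) -> List[Dict]:
    """Add trailing SUM/AVG columns over the last `window` rows per customer."""
    buffers = defaultdict(lambda: deque(maxlen=window))  # one buffer per customer
    out = []
    # Process rows per customer in date order (ISO date strings sort correctly)
    for row in sorted(rows, key=lambda r: (r["CUSTOMER_ID"], r["DATE"])):
        buf = buffers[row["CUSTOMER_ID"]]
        buf.append(row[value_col])  # the deque drops the oldest value automatically
        enriched = dict(row)
        enriched[f"SUM_{value_col}_{window}"] = sum(buf)
        enriched[f"AVG_{value_col}_{window}"] = sum(buf) / len(buf)
        out.append(enriched)
    return out


rows = [
    {"CUSTOMER_ID": "C0", "DATE": "2025-01-18", "TRANSACTIONS": 4},
    {"CUSTOMER_ID": "C0", "DATE": "2025-01-19", "TRANSACTIONS": 2},
    {"CUSTOMER_ID": "C0", "DATE": "2025-01-20", "TRANSACTIONS": 6},
]
result = rolling_sum_avg(rows, "TRANSACTIONS", window=2)
for r in result:
    print(r["DATE"], r["SUM_TRANSACTIONS_2"], r["AVG_TRANSACTIONS_2"])
```

The window only looks backwards, so each row's aggregate uses data available on or before that row's date.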

### LTV-Specific Transformations

For LTV prediction, we need several key transformations:


In [None]:
from snowflake.snowpark import DataFrame
from typing import Callable

from snowflake_feature_store.transforms import ValidationMixin

class CustomTransform(ValidationMixin):
    """Wrapper for custom transformations"""
    def __init__(
        self,
        transform_func: Callable[[DataFrame], DataFrame],
        config: TransformConfig
    ):
        self._transform = transform_func
        self._config = config
        
    @property
    def config(self) -> TransformConfig:
        return self._config
        
    def __call__(self, df: DataFrame) -> DataFrame:
        return self._transform(df)

In [None]:
# Default transform config for the LTV transforms
transform_config = TransformConfig(
    name="ltv_transforms",
    null_threshold=0.1,
    expected_types=['DECIMAL', 'DOUBLE', 'NUMBER']
)

transforms = [
    # 1. Handle Missing Values
    fill_na(
        ['SESSION_LENGTH', 'TIME_ON_APP', 'TIME_ON_WEBSITE'],
        fill_value=0,
        config=TransformConfig(
            name="engagement_imputation",
            description="Fill missing engagement metrics with 0"
        )
    ),
    
    # 2. Time-Based Features
    moving_agg(
        cols=['TRANSACTIONS', 'LIFE_TIME_VALUE'],
        window_sizes=[7, 30],  # 7 and 30 day windows
        agg_funcs=['SUM', 'AVG'],
        partition_by=['CUSTOMER_ID'],
        order_by=['DATE'],
        config=TransformConfig(
            name="time_windows",
            description="Rolling window aggregations"
        )
    ),
    # 3. Engagement Metrics
    CustomTransform(
        transform_func=lambda df: df.with_column(
            'ENGAGEMENT_SCORE',
            (F.col('SESSION_LENGTH') + 
                F.col('TIME_ON_APP') + 
                F.col('TIME_ON_WEBSITE')) / 3.0
        ),
        config=TransformConfig(
            name="engagement_score",
            description="Combined engagement metric",
            expected_types=['DOUBLE']
        )
    ),
    
    # 4. Transaction Metrics
    CustomTransform(
        transform_func=lambda df: df.with_column(
            'AVG_TRANSACTION_VALUE',
            F.col('LIFE_TIME_VALUE') / 
            F.when(F.col('TRANSACTIONS') > 0, F.col('TRANSACTIONS'))
            .otherwise(1)
        ),
        config=TransformConfig(
            name="avg_transaction_value",
            description="Average value per transaction",
            expected_types=['DOUBLE']
        )
    )
]

### Best Practices for Transformations

1. **Validation**
   - Check input data types
   - Validate output ranges
   - Monitor null ratios

2. **Performance**
   - Use vectorized operations
   - Minimize data movement
   - Leverage Snowflake optimizations

3. **Documentation**
   - Document business logic
   - Explain transformation choices
   - Track dependencies

### Example Usage

Let's apply our transformations and examine the results:

In [None]:
from snowflake_feature_store.transforms import apply_transforms

In [None]:
# Apply transforms
transformed_df = apply_transforms(df, transforms)
transformed_df.show(2)  # show() prints the rows itself and returns None

show = False
if show:
    # Show new features
    print("\nNew Features Created:")
    new_cols = set(transformed_df.columns) - set(df.columns)
    for col in sorted(new_cols):
        print(f"\n{col}:")
        transformed_df.select([
            F.min(col).alias('min'),
            F.max(col).alias('max'),
            F.avg(col).alias('mean')
        ]).show()

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"CUSTOMER_ID"  |"DATE"      |"LIFE_TIME_VALUE"  |"SESSION_LENGTH"    |"TIME_ON_APP"      |"TIME_ON_WEBSITE"  |"TRANSACTIONS"  |"SUM_TRANSACTIONS_7"  |"AVG_TRANSACTIONS_7"  |"SUM_TRANSACTIONS_30"  |"AVG_TRANSACTIONS_30"  |"SUM_LIFE_TIME_VALUE_7"  |"AVG_LIFE_TIME_VALUE_7"  |"SUM_LIFE_TIME_VALUE_30"  |"AVG_LIFE_TIME_VALUE_30"  |"ENGAGEMENT_SCORE"  |"AVG_TRANSACTION_VALUE"  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Step 6: Feature View Creation

Feature views combine configurations, transformations, and source data into production-ready features.

### What is a Feature View?

A feature view bundles four things:
1. **Source Data**: Raw data input
2. **Transformations**: Feature engineering logic
3. **Configurations**: Validation and refresh rules
4. **Metadata**: Documentation and lineage

### Why Feature Views Matter

Feature views provide:
1. **Reproducibility**: Consistent feature computation
2. **Monitoring**: Track feature health
3. **Discovery**: Make features findable
4. **Governance**: Control access and updates

### LTV Feature View Design

For LTV prediction, our feature view needs to:
1. Combine engagement and transaction data
2. Apply time-based transformations
3. Maintain point-in-time correctness
4. Enable regular refreshes


In [None]:
# 1. Name the entity and the feature view
entity_name = "CUSTOMER"
feature_view_name = "customer_ltv_features"

# 2. Create feature view config
config = FeatureViewConfig(
    name=feature_view_name,
    domain="RETAIL",
    entity=entity_name,
    feature_type="BEHAVIOR",
    refresh=RefreshConfig(
        frequency="1 day",
        mode="INCREMENTAL"
    ),
    features=feature_configs,  # dictionary of FeatureConfigs created in Step 4
    description="""
    Customer LTV prediction features combining:
    - Transaction history
    - Engagement metrics
    - Time-based patterns
    
    Updated daily with incremental processing.
    Use for LTV prediction and customer segmentation.
    """.strip(),
    timestamp_col="DATE"
)

# 3. Create transformations
# Created above

# 4. Create and register feature view
feature_view = manager.add_feature_view(
    config=config,
    df=df, # Original DataFrame above
    entity_name=entity_name,
    transforms=transforms,
    collect_stats=True  # Enable monitoring
)

# 5. Log feature view details
logger.info(f"\nCreated feature view: {feature_view_name}")
logger.info(f"Features created: {len(feature_configs)}")
logger.info(f"Transformations applied: {len(transforms)}")

# 6. Show feature statistics
logger.info("\nFeature Statistics:")
for feature_name, stats in manager.feature_stats[config.name].items():
    logger.info(f"\n{feature_name}:")
    logger.info(str(stats))

2025-02-17 20:32:18,914 - snowflake_feature_store - INFO - Validated feature LIFE_TIME_VALUE (stats: {'timestamp': '2025-02-18T04:32:17.726988', 'row_count': 2400, 'null_count': 0, 'null_ratio': 0.0, 'unique_count': 2400, 'min_value': 1.648862185296137, 'max_value': 749.9894377990938, 'mean_value': 386.7765027344634, 'std_value': 216.32557697363958})
2025-02-17 20:32:20,275 - snowflake_feature_store - INFO - Validated feature SESSION_LENGTH (stats: {'timestamp': '2025-02-18T04:32:19.402874', 'row_count': 2400, 'null_count': 0, 'null_ratio': 0.0, 'unique_count': 1950, 'min_value': 0.0, 'max_value': 12.410005497480682, 'mean_value': 5.229708861335616, 'std_value': 3.431862682684851})
2025-02-17 20:32:21,854 - snowflake_feature_store - INFO - Validated feature TRANSACTIONS (stats: {'timestamp': '2025-02-18T04:32:20.733519', 'row_count': 2400, 'null_count': 0, 'null_ratio': 0.0, 'unique_count': 7, 'min_value': 1.0, 'max_value': 7.0, 'mean_value': 3.498333, 'std_value': 2.0083186998083744})

### Best Practices for Feature Views

1. **Documentation**
   - Clear descriptions
   - Usage examples
   - Update frequency
   - Dependencies

2. **Monitoring**
   - Feature statistics
   - Data quality metrics
   - Refresh status
   - Drift detection

3. **Performance**
   - Incremental updates
   - Efficient transformations
   - Appropriate refresh schedule

4. **Governance**
   - Access controls
   - Version control
   - Audit logging
   - Data lineage


## Step 7: Feature Monitoring

Monitoring is crucial for maintaining feature quality and detecting issues early.

### Why Monitor Features?

Feature monitoring helps:
1. **Detect Data Quality Issues**: Missing values, outliers, type mismatches
2. **Track Feature Drift**: Changes in feature distributions
3. **Ensure Freshness**: Verify timely updates
4. **Validate Business Rules**: Check domain-specific constraints

### Types of Monitoring

1. **Data Quality**
   - Null ratios
   - Type consistency
   - Value ranges
   - Cardinality

2. **Statistical Monitoring**
   - Distribution shifts
   - Correlation changes
   - Seasonality patterns
   - Outlier detection

3. **Operational Monitoring**
   - Refresh status
   - Computation time
   - Resource usage
   - Error rates
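
A simple form of statistical monitoring compares summary statistics of a current sample against a baseline. The sketch below flags a feature when its mean or standard deviation changes by more than a relative threshold. The helper names and the 10% default are illustrative assumptions, and this relative-change check is a deliberately simple stand-in for fuller distribution tests.

```python
from statistics import mean, stdev
from typing import List, Optional


def relative_change(current: float, baseline: float) -> Optional[float]:
    """Relative change against the baseline, or None when the baseline is zero."""
    if baseline == 0:
        return None
    return abs(current - baseline) / abs(baseline)


def drift_alerts(baseline_values: List[float], current_values: List[float], threshold: float = 0.1) -> List[str]:
    """Return an alert for each statistic that moved by more than `threshold`."""
    alerts = []
    for name, stat in (("MEAN", mean), ("STD", stdev)):
        change = relative_change(stat(current_values), stat(baseline_values))
        if change is not None and change > threshold:
            alerts.append(f"{name} changed by {change:.1%}")
    return alerts


baseline = [100.0, 110.0, 90.0, 105.0, 95.0]
shifted = [v * 4.5 for v in baseline]  # scale values to simulate drift

print(drift_alerts(baseline, shifted))
print(drift_alerts(baseline, baseline))
```

Guarding against a zero baseline avoids a division error for features whose baseline mean or standard deviation is exactly zero.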

### LTV-Specific Monitoring


In [None]:
import decimal
import json
from typing import Any, Callable, Dict, List, Optional, Protocol, Union

In [None]:
class LTVMonitor:
    """Monitor for LTV feature quality and drift"""
    
    def __init__(
        self,
        manager: FeatureStoreManager,
        feature_view_name: str,
        metrics_path: Optional[str] = None
    ):
        self.manager = manager
        self.feature_view_name = feature_view_name
        self.metrics_path = metrics_path
        self.baseline_stats = {}
        
    def _convert_decimal(self, obj: Any) -> Any:
        """Convert Decimal objects to float for JSON serialization"""
        if isinstance(obj, decimal.Decimal):
            return float(obj)
        return obj
    
    def _process_metrics(self, metrics: Dict) -> Dict:
        """Process metrics dictionary to ensure JSON serializable values"""
        return {
            k: {
                'timestamp': v['timestamp'],
                'metrics': {
                    mk: self._convert_decimal(mv)
                    for mk, mv in v['metrics'].items()
                }
            }
            for k, v in metrics.items()
        }
    
    def compute_feature_metrics(
        self,
        df: DataFrame,
        timestamp: Optional[datetime] = None
    ) -> Dict[str, Dict]:
        """Compute comprehensive feature metrics"""
        metrics = {}
        timestamp = timestamp or datetime.now()
        
        for col in df.columns:
            # Skip identifier columns
            if col in ['CUSTOMER_ID', 'DATE']:
                continue
                
            # Basic stats
            stats = df.select([
                F.count(col).alias('count'),
                F.count_distinct(col).alias('unique'),
                F.sum(F.when(F.col(col).is_null(), 1).otherwise(0)).alias('nulls')
            ]).collect()[0].asDict()
            
            # Convert Decimal to float
            stats = {k: self._convert_decimal(v) for k, v in stats.items()}
            
            # Numeric stats for appropriate columns
            if col in ['LIFE_TIME_VALUE', 'SESSION_LENGTH', 'TRANSACTIONS']:
                numeric_stats = df.select([
                    F.min(col).alias('min'),
                    F.max(col).alias('max'),
                    F.avg(col).alias('mean'),
                    F.stddev(col).alias('std')
                ]).collect()[0].asDict()
                
                # Convert Decimal to float
                numeric_stats = {k: self._convert_decimal(v) for k, v in numeric_stats.items()}
                stats.update(numeric_stats)
            
            metrics[col] = {
                'timestamp': timestamp.isoformat(),
                'metrics': stats
            }
            
        return metrics
    
    def set_baseline(self, df: DataFrame) -> None:
        """Set baseline statistics for drift detection"""
        self.baseline_stats = self.compute_feature_metrics(df)
        logger.info("Set baseline statistics")
        
        # Save baseline if metrics path provided
        if self.metrics_path:
            baseline_file = Path(self.metrics_path) / "baseline_stats.json"
            processed_stats = self._process_metrics(self.baseline_stats)
            with open(baseline_file, 'w') as f:
                json.dump(processed_stats, f, indent=2)
    
    def check_feature_health(
        self,
        df: DataFrame,
        drift_threshold: float = 0.1
    ) -> None:
        """Check overall feature health"""
        try:
            # Compute current metrics
            current_metrics = self.compute_feature_metrics(df)
            
            # Detect drift if baseline exists
            if self.baseline_stats:
                drift_alerts = self.detect_drift(
                    current_metrics,
                    drift_threshold
                )
                
                if drift_alerts:
                    logger.warning("\nFeature Drift Detected:")
                    for feature, alerts in drift_alerts.items():
                        logger.warning(f"\n{feature}:")
                        for alert in alerts:
                            logger.warning(f"- {alert}")
            
            # Log current metrics
            logger.info("\nCurrent Feature Metrics:")
            for feature, metrics in current_metrics.items():
                logger.info(f"\n{feature}:")
                for metric, value in metrics['metrics'].items():
                    logger.info(f"  {metric}: {value}")
                    
            # Save metrics if path provided
            if self.metrics_path:
                metrics_file = Path(self.metrics_path) / f"metrics_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
                processed_metrics = self._process_metrics(current_metrics)
                with open(metrics_file, 'w') as f:
                    json.dump(processed_metrics, f, indent=2)
                    
        except Exception as e:
            logger.error(f"Error checking feature health: {str(e)}")
            raise

    def detect_drift(
        self,
        current_metrics: Dict,
        drift_threshold: float = 0.1
    ) -> Dict[str, List[str]]:
        """Detect significant changes in feature distributions"""
        drift_alerts = {}
        
        for feature, metrics in current_metrics.items():
            if feature not in self.baseline_stats:
                continue
                
            alerts = []
            baseline = self.baseline_stats[feature]['metrics']
            current = metrics['metrics']
            
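            # Check for distribution changes. Snowpark upper-cases unquoted
            # aliases, so the stats keys are 'MEAN' and 'STD', not 'mean'/'std'.
            for metric in ['MEAN', 'STD']:
                base_value = baseline.get(metric)
                curr_value = current.get(metric)
                # Skip missing stats and a zero baseline (relative change undefined)
                if base_value is None or curr_value is None or base_value == 0:
                    continue

                change = abs(curr_value - base_value) / abs(base_value)
                if change > drift_threshold:
                    alerts.append(
                        f"{metric} changed by {change:.1%}"
                    )

            # Check for data quality changes. COUNT excludes nulls, so the
            # total row count is COUNT + NULLS.
            total = current['COUNT'] + current['NULLS']
            baseline_total = baseline['COUNT'] + baseline['NULLS']
            null_ratio = current['NULLS'] / total if total else 0.0
            baseline_null_ratio = baseline['NULLS'] / baseline_total if baseline_total else 0.0
            if abs(null_ratio - baseline_null_ratio) > drift_threshold:
                alerts.append(
                    f"NULL ratio changed from {baseline_null_ratio:.1%} to {null_ratio:.1%}"
                )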
            
            if alerts:
                drift_alerts[feature] = alerts
                
        return drift_alerts


### Example Usage

Let's set up monitoring for our LTV features:


In [None]:
# Set up metrics directory
metrics_dir = Path("feature_metrics")
metrics_dir.mkdir(exist_ok=True)

In [None]:
# Set up monitoring
monitor = LTVMonitor(
    manager=manager,
    feature_view_name=feature_view.name,
    metrics_path=str(metrics_dir)
)

In [None]:
from snowflake_feature_store.examples import get_example_data

In [None]:
# Set baseline
monitor.set_baseline(feature_view.feature_df)
# Generate some drift
drift_df = get_example_data(
    conn.session,
    schema,
    num_customers=250,
    ltv_multiplier=4.5,  # Increase values to simulate drift
    table_type='TEST'
)
# Check for drift
monitor.check_feature_health(drift_df)

2025-02-17 20:33:07,458 - snowflake_feature_store - INFO - Set baseline statistics
2025-02-17 20:33:13,744 - snowflake_feature_store - INFO - Generated 6000 rows of demo data in "DATASCIENCE".FEATURE_STORE_DEMO.CUSTOMER_ACTIVITY_TEST
2025-02-17 20:33:13,747 - snowflake_feature_store - INFO - 
Sample Data:
----------------------------------------------------------------------------------------------------------------------------------
|"CUSTOMER_ID"  |"DATE"      |"LIFE_TIME_VALUE"   |"SESSION_LENGTH"    |"TIME_ON_APP"      |"TIME_ON_WEBSITE"   |"TRANSACTIONS"  |
----------------------------------------------------------------------------------------------------------------------------------
|C0             |2025-01-18  |411.9831046823803   |6.101285704798359   |9.912268452627666  |7.269712691809914   |4               |
|C93            |2025-02-16  |587.2728207366863   |7.409478170285487   |9.255783785633916  |10.025373746735333  |5               |
|C22            |2025-02-16  |266.1622

## Step 8: Training Data Generation

Training data built from a feature store must be point-in-time correct: each row may use only the feature values that were available on that row's date. Otherwise, future information leaks into training.

### Why Training Data Generation Matters

Proper training data generation:
1. **Prevents Data Leakage**: Ensures future data doesn't leak into training
2. **Maintains Consistency**: Uses the same feature computations as production
3. **Enables Reproducibility**: Training sets can be recreated exactly
4. **Supports Experimentation**: Easy to create different feature combinations
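
To make the leakage point concrete: for each spine row, only feature values dated on or before the spine date are eligible. The following is a minimal pure-Python sketch of such an as-of lookup. It is not how the Feature Store implements its joins; `as_of_lookup` and the sample history are hypothetical.

```python
from bisect import bisect_right
from datetime import date
from typing import List, Optional, Tuple


def as_of_lookup(history: List[Tuple[date, float]], as_of: date) -> Optional[float]:
    """Return the latest value whose date is <= as_of, or None if none exists.

    `history` must be sorted by date in ascending order.
    """
    dates = [d for d, _ in history]
    idx = bisect_right(dates, as_of)  # number of entries dated on or before as_of
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical feature history for one customer
history = [
    (date(2025, 1, 10), 100.0),
    (date(2025, 1, 20), 150.0),
    (date(2025, 2, 5), 300.0),
]

print(as_of_lookup(history, date(2025, 1, 25)))  # 150.0: the 2025-02-05 value is not visible yet
print(as_of_lookup(history, date(2025, 1, 1)))   # None: no feature value exists yet
```

A plain join on customer alone would have returned the 2025-02-05 value for the first lookup, which is exactly the leakage this step is meant to avoid.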

### LTV Training Data Requirements

For LTV prediction, we need to:
1. Use historical data to predict future LTV
2. Compute time-based features only from data available on the feature date
3. Handle missing values the same way in training and in production
4. Keep every row tied to its customer and date


In [None]:
import tempfile
from pathlib import Path

metrics_dir = Path(tempfile.mkdtemp()) / "feature_store_metrics"

training_start_date = '2025-01-01'
training_end_date = '2025-03-01'
prediction_window = 90  # Predict 90-day LTV
save_table = 'DATASCIENCE.FEATURE_STORE_DEMO.LTV_TRAINING_DATA'

manager = FeatureStoreManager(
    connection=conn,
    metrics_path=metrics_dir,
    overwrite=True
)

# Get existing feature view
feature_view = manager.feature_store.get_feature_view(
    name="customer_ltv_features",
    version="V1_0"  # Use the version from earlier
)

2025-02-17 20:37:43,451 - snowflake_feature_store - INFO - FeatureStoreManager initialized


In [None]:
# Get the fully qualified table name
table_name = (
    f"{manager.connection.database}."
    f"{manager.connection.schema}."
    f"CUSTOMER_ACTIVITY"
)

# 1. Create spine query for point-in-time correct features
spine_df = manager.connection.session.sql(f"""
    WITH customer_dates AS (
        -- Get all customer-date combinations
        SELECT DISTINCT
            CUSTOMER_ID,
            DATE
        FROM {table_name}
        WHERE DATE BETWEEN '{training_start_date}' AND '{training_end_date}'
    ),
    future_ltv AS (
        -- Label: highest LIFE_TIME_VALUE observed in the prediction window
        -- (the window is inclusive, so it contains the feature date itself)
        SELECT 
            cd.CUSTOMER_ID,
            cd.DATE as FEATURE_DATE,
            MAX(f.DATE) as LABEL_DATE,
            MAX(f.LIFE_TIME_VALUE) as FUTURE_LTV
        FROM customer_dates cd
        LEFT JOIN {table_name} f
            ON cd.CUSTOMER_ID = f.CUSTOMER_ID
            AND f.DATE BETWEEN cd.DATE 
                AND DATEADD(days, {prediction_window}, cd.DATE)
        GROUP BY 1, 2
    )
    -- Final spine query
    SELECT 
        CUSTOMER_ID,
        FEATURE_DATE as "DATE",
        FUTURE_LTV as "TARGET_LTV",
        LABEL_DATE as "LABEL_DATE"
    FROM future_ltv
""")
spine_df.show(2)

logger.info(f"Created spine with {spine_df.count()} rows")


-----------------------------------------------------------------
|"CUSTOMER_ID"  |"DATE"      |"TARGET_LTV"       |"LABEL_DATE"  |
-----------------------------------------------------------------
|C0             |2025-01-18  |744.4746675216771  |2025-02-15    |
|C93            |2025-02-16  |587.2728207366863  |2025-02-16    |
-----------------------------------------------------------------

2025-02-17 20:37:46,924 - snowflake_feature_store - INFO - Created spine with 2400 rows
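
In the sample output above, the row for C93 has a `LABEL_DATE` equal to its `DATE`, so its label is simply the value on the feature date. This suggests the source data ends before the prediction window closes for recent rows. One possible safeguard, sketched here as an assumption and not part of the original pipeline, is to keep only spine rows whose full prediction window is covered by the data. The function `drop_incomplete_windows` and the `last_observed` value are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List


def drop_incomplete_windows(
    spine_rows: List[Dict],
    last_observed: date,
    prediction_window: int,
) -> List[Dict]:
    """Keep rows whose prediction window ends on or before the last observed date."""
    horizon = timedelta(days=prediction_window)
    return [row for row in spine_rows if row["DATE"] + horizon <= last_observed]


# Rows mirror the two sample spine rows; last_observed is a hypothetical value
rows = [
    {"CUSTOMER_ID": "C0", "DATE": date(2025, 1, 18)},
    {"CUSTOMER_ID": "C93", "DATE": date(2025, 2, 16)},
]
kept = drop_incomplete_windows(rows, last_observed=date(2025, 2, 16), prediction_window=14)
print([r["CUSTOMER_ID"] for r in kept])  # ['C0']
```

With the 90-day window used in this notebook and data that stops in mid-February, the same filter would drop both sample rows, which is a sign that the demo data is too short for a true 90-day label.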


In [None]:
# 2. Get features using point-in-time correct joins
training_data = manager.get_features(
    spine_df=spine_df,
    feature_views=[feature_view],
    spine_timestamp_col="DATE",
    label_cols=["TARGET_LTV", "LABEL_DATE"]
)
logger.info("\nSample Data:")
training_data.show(2)
logger.info("\nSchema:")
for field in training_data.schema.fields:
    logger.info(f"{field.name}: {field.datatype}")

2025-02-17 20:37:47,252 - snowflake_feature_store - INFO - Spine DataFrame columns: ['CUSTOMER_ID', 'DATE', 'TARGET_LTV', 'LABEL_DATE']
2025-02-17 20:37:47,253 - snowflake_feature_store - INFO - Spine DataFrame schema: StructType([StructField('CUSTOMER_ID', StringType(), nullable=True), StructField('DATE', DateType(), nullable=True), StructField('TARGET_LTV', DoubleType(), nullable=True), StructField('LABEL_DATE', DateType(), nullable=True)])
2025-02-17 20:37:47,254 - snowflake_feature_store - INFO - Generating dataset with name: DATASET_20250218_043747_95577426
2025-02-17 20:37:47,255 - snowflake_feature_store - INFO - Label columns: ['"TARGET_LTV"', '"LABEL_DATE"']
2025-02-17 20:37:47,255 - snowflake_feature_store - INFO - Timestamp column: "DATE"
2025-02-17 20:37:50,789 - snowflake_feature_store - INFO - 
Sample Data:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
# 3. Add metadata columns
training_data = training_data.select(
    "*",  # Keep all existing columns
    F.lit(training_start_date).alias("TRAINING_START_DATE"),
    F.lit(training_end_date).alias("TRAINING_END_DATE"),
    F.lit(prediction_window).alias("PREDICTION_WINDOW_DAYS"),
    F.current_timestamp().alias("GENERATED_AT")
)

# 4. Save if table name provided
if save_table:
    training_data.write.mode("overwrite").save_as_table(save_table)
    logger.info(f"Saved training data to {save_table}")

# 5. Log data generation stats
logger.info("\nTraining Data Statistics:")
logger.info(f"Total rows: {training_data.count()}")
logger.info(f"Date range: {training_start_date} to {training_end_date}")
logger.info(f"Prediction window: {prediction_window} days")


2025-02-17 20:37:53,766 - snowflake_feature_store - INFO - Saved training data to DATASCIENCE.FEATURE_STORE_DEMO.LTV_TRAINING_DATA
2025-02-17 20:37:53,767 - snowflake_feature_store - INFO - 
Training Data Statistics:
2025-02-17 20:37:54,593 - snowflake_feature_store - INFO - Total rows: 2400
2025-02-17 20:37:54,594 - snowflake_feature_store - INFO - Date range: 2025-01-01 to 2025-03-01
2025-02-17 20:37:54,594 - snowflake_feature_store - INFO - Prediction window: 90 days
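
The parameters logged above could also be stored next to the dataset so that it can be regenerated later. A minimal sketch, assuming a JSON file is an acceptable place for them; `write_manifest` and the file name are hypothetical, while the parameter values are the ones used in this step.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_manifest(path: Path, **params) -> dict:
    """Write generation parameters plus a UTC timestamp to a JSON file and return them."""
    manifest = dict(params, generated_at=datetime.now(timezone.utc).isoformat())
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


manifest_path = Path(tempfile.mkdtemp()) / "ltv_training_manifest.json"
manifest = write_manifest(
    manifest_path,
    training_start_date="2025-01-01",
    training_end_date="2025-03-01",
    prediction_window_days=90,
    feature_view="customer_ltv_features",
    feature_view_version="V1_0",
    save_table="DATASCIENCE.FEATURE_STORE_DEMO.LTV_TRAINING_DATA",
)
```

Unlike the metadata columns added to the table, a manifest file also records which feature view version produced the data.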


### Best Practices for Training Data

1. **Time Windows**
   - Split training and validation sets by date rather than at random
   - Consider seasonal patterns
   - Match prediction window to business needs

2. **Feature Selection**
   - Include all relevant features
   - Document feature importance
   - Track feature dependencies

3. **Data Quality**
   - Handle missing values consistently
   - Check for data leakage
   - Validate label quality

4. **Documentation**
   - Record generation parameters
   - Track data lineage
   - Document assumptions
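
As an illustration of the Time Windows point, the sketch below splits rows by date and optionally leaves a gap between the two sets, so that training label windows do not overlap the validation period. It is one possible approach under these assumptions, not a prescribed method; `time_split` and the sample rows are hypothetical.

```python
from datetime import date
from typing import Dict, List, Tuple


def time_split(
    rows: List[Dict], cutoff: date, gap_days: int = 0
) -> Tuple[List[Dict], List[Dict]]:
    """Split rows into (train, validation) by DATE.

    Rows dated before `cutoff` go to train. Rows dated at least `gap_days`
    after `cutoff` go to validation. Rows inside the gap are dropped.
    """
    train = [r for r in rows if r["DATE"] < cutoff]
    valid = [r for r in rows if (r["DATE"] - cutoff).days >= gap_days]
    return train, valid


# Hypothetical rows; only the DATE key matters for the split
rows = [
    {"DATE": date(2025, 1, 5)},
    {"DATE": date(2025, 2, 1)},
    {"DATE": date(2025, 2, 3)},
    {"DATE": date(2025, 2, 10)},
]
train, valid = time_split(rows, cutoff=date(2025, 2, 1), gap_days=7)
print(len(train), len(valid))  # 1 1
```

Setting `gap_days` to the prediction window is a conservative choice when labels look that far ahead.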


## Conclusion and Next Steps

### What We've Built

We've created a comprehensive example of using Snowflake's Feature Store for LTV prediction that demonstrates:
1. **Feature Store Setup**: Creating and managing a feature store
2. **Entity Management**: Defining and documenting customer entities
3. **Feature Engineering**: Creating and transforming features
4. **Feature Views**: Organizing and versioning features
5. **Monitoring**: Tracking feature quality and drift
6. **Training Data**: Generating point-in-time correct datasets

### Potential Enhancements

Future versions could include:
1. **Advanced Monitoring**
   - Automated drift detection alerts
   - Custom validation rules
   - Feature quality dashboards
   - Historical metrics tracking

2. **Feature Discovery**
   - Feature search capabilities
   - Metadata management
   - Usage tracking
   - Documentation generation

3. **Production Integration**
   - CI/CD pipeline integration
   - Automated testing
   - Deployment workflows
   - Model registry integration

4. **Performance Optimization**
   - Incremental updates
   - Caching strategies
   - Query optimization
   - Resource management

### Best Practices

When using this template:
1. **Documentation**
   - Document feature definitions
   - Explain business logic
   - Track dependencies
   - Maintain version history

2. **Testing**
   - Validate feature logic
   - Check data quality
   - Test transformations
   - Verify point-in-time correctness

3. **Monitoring**
   - Set up drift detection
   - Track feature freshness
   - Monitor data quality
   - Alert on issues

4. **Governance**
   - Manage access controls
   - Track lineage
   - Enforce standards
   - Maintain audit logs

### Using This Template

To adapt this example:
1. Replace LTV-specific logic with your use case
2. Adjust feature definitions and transformations
3. Customize monitoring thresholds
4. Add domain-specific validation
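
For the last step, domain-specific validation can start as a small table of named row-level rules. This is a minimal sketch under that assumption; `RULES` and `validate_rows` are hypothetical and independent of the `snowflake_feature_store` package. The column names are taken from the sample data shown earlier.

```python
from typing import Callable, Dict, List

# Hypothetical rules: each rule name maps to a predicate over one row
RULES: Dict[str, Callable[[dict], bool]] = {
    "ltv_non_negative": lambda row: row["LIFE_TIME_VALUE"] >= 0,
    "transactions_non_negative": lambda row: row["TRANSACTIONS"] >= 0,
}


def validate_rows(
    rows: List[dict], rules: Dict[str, Callable[[dict], bool]]
) -> Dict[str, int]:
    """Count violations per rule; rules with no violations are omitted."""
    failures: Dict[str, int] = {}
    for row in rows:
        for name, check in rules.items():
            if not check(row):
                failures[name] = failures.get(name, 0) + 1
    return failures


# Hypothetical sample rows, the second one violates the LTV rule
sample = [
    {"LIFE_TIME_VALUE": 411.98, "TRANSACTIONS": 4},
    {"LIFE_TIME_VALUE": -5.0, "TRANSACTIONS": 5},
]
print(validate_rows(sample, RULES))  # {'ltv_non_negative': 1}
```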

### Resources

For more information:
1. [Snowflake Feature Store Documentation](https://docs.snowflake.com/en/user-guide/feature-store)
2. [Feature Store Best Practices](https://docs.snowflake.com/en/user-guide/feature-store-use)
3. [Snowpark ML Documentation](https://docs.snowflake.com/en/developer-guide/snowpark-ml/index)
4. [Feature Store Examples](https://github.com/Snowflake-Labs/snowpark-python-examples)
