# Databricks + Skyflow Integration with Unity Catalog

This notebook demonstrates integrating Skyflow tokenization and detokenization into Databricks using **Unity Catalog Batch Python UDFs**.

## Key Features

- ✅ **Batched execution** - High-throughput batching to Lambda, then Lambda batches to Skyflow
- ✅ **Persistent functions** - Functions stored in Unity Catalog, available across all clusters
- ✅ **Governed and shareable** - Fine-grained access control for tokenization and detokenization
- ✅ **Persistent views** - Create views that automatically tokenize/detokenize data
- ✅ **Production ready** - Perfect for ETL pipelines, BI tools, and team collaboration

## Architecture

```
Databricks → Lambda (batched) → Skyflow (batched)
```

- **Credentials**: Managed in Lambda (not in notebooks)
- **Batching to Lambda**: Configurable (default 500 rows per call for high throughput)
- **Lambda to Skyflow**: Automatic batching at 25 rows per Skyflow API call
- **Parallelization**: Spark automatically distributes across partitions
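
The two batching layers above can be reasoned about with a little arithmetic. The following is a minimal sketch, assuming the defaults stated in this notebook (500 rows per Lambda call, 25 rows per Skyflow call); `estimate_calls` is a hypothetical helper, not part of the integration, and real counts can be higher because Spark hands each partition to the UDF in separate pandas batches.

```python
import math
from typing import Tuple

LAMBDA_BATCH_SIZE = 500   # rows per Databricks -> Lambda call (notebook default)
SKYFLOW_BATCH_SIZE = 25   # rows per Lambda -> Skyflow call (as stated above)


def estimate_calls(num_rows: int,
                   lambda_batch: int = LAMBDA_BATCH_SIZE,
                   skyflow_batch: int = SKYFLOW_BATCH_SIZE) -> Tuple[int, int]:
    """Return (lambda_calls, skyflow_calls) for num_rows non-null values
    that arrive in a single pandas batch."""
    lambda_calls = math.ceil(num_rows / lambda_batch)
    skyflow_calls = 0
    remaining = num_rows
    while remaining > 0:
        rows_in_call = min(lambda_batch, remaining)
        # Each Lambda call is split again into Skyflow-sized chunks.
        skyflow_calls += math.ceil(rows_in_call / skyflow_batch)
        remaining -= rows_in_call
    return lambda_calls, skyflow_calls


print(estimate_calls(500))    # (1, 20)
print(estimate_calls(1200))   # (3, 48)
```

A full 500-row Lambda call therefore fans out into 20 Skyflow calls, which is worth keeping in mind when choosing `REQUEST_TIMEOUT`.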

## Prerequisites

1. **Unity Catalog enabled** - Modern Databricks runtime (DBR 13.3+) or SQL warehouse
2. **Lambda function deployed** - See main README for deployment instructions
3. **Skyflow credentials** - Cluster ID, Vault ID, Table name

---

# Quick Start

1. **Configure cell 1** with your Lambda URL and Skyflow credentials
2. **Run cells 2-3** to create the persistent UDFs in Unity Catalog
3. **Run cells 4+** for usage examples and testing

The functions persist across cluster restarts and are available to all authorized users!
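
Because the configuration values are embedded into the function bodies at creation time, it can help to confirm that no template placeholder is left before running cells 2-3. This is a minimal sketch, not part of the notebook; `find_placeholders` and the example catalog and schema names (`main`, `privacy`) are hypothetical.

```python
def find_placeholders(config: dict) -> list:
    """Return the names of settings that still hold template placeholder values."""
    markers = ("YOUR_", "your_", "TABLE_NAME")
    return [
        name for name, value in config.items()
        if any(marker in str(value) for marker in markers)
    ]


# Example values: two settings are filled in, two are still placeholders.
example = {
    "CATALOG": "main",
    "SCHEMA": "privacy",
    "LAMBDA_URL": "https://YOUR_API_ID.execute-api.YOUR_REGION.amazonaws.com/processDatabricks",
    "TABLE": "TABLE_NAME",
}
print(find_placeholders(example))  # ['LAMBDA_URL', 'TABLE']
```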

In [None]:
# ============================================================================
# Step 1: Configuration - UPDATE THESE VALUES
# ============================================================================

# Unity Catalog location
CATALOG = "your_catalog_name"
SCHEMA = "your_schema_name"

# Lambda API configuration
LAMBDA_URL = "https://YOUR_API_ID.execute-api.YOUR_REGION.amazonaws.com/processDatabricks"

# Skyflow configuration
CLUSTER_ID = "YOUR_CLUSTER_ID"
VAULT_ID = "YOUR_VAULT_ID"
TABLE = "TABLE_NAME"

# Performance tuning
BATCH_SIZE = 500  # Rows per Lambda API call (Lambda then batches at 25 rows per Skyflow call)
REQUEST_TIMEOUT = 30  # HTTP timeout in seconds

print("=" * 60)
print("Configuration Summary")
print("=" * 60)
print(f"Catalog:     {CATALOG}")
print(f"Schema:      {SCHEMA}")
print(f"Lambda URL:  {LAMBDA_URL}")
print(f"Cluster ID:  {CLUSTER_ID}")
print(f"Vault ID:    {VAULT_ID}")
print(f"Table:       {TABLE}")
print(f"Batch Size:  {BATCH_SIZE} (to Lambda)")
print(f"Timeout:     {REQUEST_TIMEOUT}s")
print("=" * 60)
print("\n✓ Configuration loaded")
print("\nNote: Lambda internally batches at 25 rows per Skyflow API call")
print("\nNext: Run cells 2-3 to create batch Python UDFs")

In [None]:
# ============================================================================
# Step 2: Create Tokenization Batch Python UDF
# ============================================================================
#
# This creates a persistent Unity Catalog function using PARAMETER STYLE PANDAS
# for efficient batched tokenization.
#
# IMPORTANT: Due to UC PARAMETER STYLE PANDAS limitations, you cannot pass literal
# strings directly. Use the derived column pattern shown below.
#

spark.sql(f"""
CREATE OR REPLACE FUNCTION {CATALOG}.{SCHEMA}.skyflow_tokenize_column(
    column_value STRING,
    column_name STRING
)
RETURNS STRING
LANGUAGE PYTHON
PARAMETER STYLE PANDAS
HANDLER 'tokenize_handler'
AS $$
import pandas as pd
import requests
from typing import Iterator, Tuple

# Configuration embedded at function creation time
LAMBDA_URL = "{LAMBDA_URL}"
CLUSTER_ID = "{CLUSTER_ID}"
VAULT_ID = "{VAULT_ID}"
TABLE = "{TABLE}"
BATCH_SIZE = {BATCH_SIZE}
REQUEST_TIMEOUT = {REQUEST_TIMEOUT}

def tokenize_handler(batch_iter: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    '''
    Batch tokenization handler using Skyflow Lambda API.
    
    Args:
        batch_iter: Iterator yielding tuples of (values_series, column_names_series)
    
    Yields:
        Series of tokens for each batch
    '''
    for values, column_names in batch_iter:
        if column_names.empty:
            # No rows in this batch
            yield column_names
            continue
        
        # Extract column name (constant across batch)
        col_name = column_names.iloc[0]
        
        # Build records list
        records = []
        index_map = []  # Track indices of non-null values
        for idx, v in enumerate(values):
            if v is not None and pd.notna(v):
                records.append({{col_name: v}})
                index_map.append(idx)
        
        # If all values are null, return original
        if not records:
            yield values
            continue
        
        # Initialize results
        tokenized = [None] * len(values)
        
        # Process in sub-batches
        start = 0
        while start < len(records):
            end = min(start + BATCH_SIZE, len(records))
            batch_records = records[start:end]
            batch_indices = index_map[start:end]
            
            # Call Lambda API
            resp = requests.post(
                LAMBDA_URL,
                json={{"records": batch_records}},
                headers={{
                    "Content-Type": "application/json",
                    "X-Skyflow-Operation": "tokenize",
                    "X-Skyflow-Cluster-ID": CLUSTER_ID,
                    "X-Skyflow-Vault-ID": VAULT_ID,
                    "X-Skyflow-Table": TABLE
                }},
                timeout=REQUEST_TIMEOUT
            )
            resp.raise_for_status()
            
            # Parse response
            data = resp.json().get("data", [])
            
            # Map tokens back to original indices
            for local_i, rec in enumerate(data):
                global_idx = batch_indices[local_i]
                tokenized[global_idx] = rec.get(col_name)
            
            start = end
        
        # Yield result as Series
        yield pd.Series(tokenized)
$$
""")

print("✓ Created batch Python UDF: skyflow_tokenize_column(column_value, column_name)")
print(f"  Location: {CATALOG}.{SCHEMA}.skyflow_tokenize_column")
print(f"  Type: PARAMETER STYLE PANDAS (batched execution)")
print(f"  Batch Size: {BATCH_SIZE} rows per Lambda call")
print(f"  Lambda URL: {LAMBDA_URL}")
print()
print("=" * 70)
print("USAGE PATTERN - Derived Column Required")
print("=" * 70)
print()
print("⚠️  IMPORTANT: You cannot pass literal strings directly to UC Batch Python UDFs.")
print("   Use the 'derived column' pattern to work around this limitation:")
print()
print("✗ DOESN'T WORK:")
print("  SELECT skyflow_tokenize_column(email, 'email') FROM users")
print()
print("✓ CORRECT PATTERN:")
print("  WITH prepared_data AS (")
print("    SELECT email, 'email' AS email_col")
print("    FROM users")
print("  )")
print("  SELECT skyflow_tokenize_column(email, email_col)")
print("  FROM prepared_data")
print()
print("Multiple columns in one query:")
print("  WITH prepared_data AS (")
print("    SELECT")
print("      user_id,")
print("      email, 'email' AS email_col,")
print("      phone, 'phone' AS phone_col")
print("    FROM users")
print("  )")
print("  SELECT")
print("    user_id,")
print("    skyflow_tokenize_column(email, email_col) as email_token,")
print("    skyflow_tokenize_column(phone, phone_col) as phone_token")
print("  FROM prepared_data")

In [None]:
# ============================================================================
# Step 3: Create Detokenization Batch Python UDF
# ============================================================================
#
# This creates a persistent Unity Catalog function using PARAMETER STYLE PANDAS
# for efficient batched detokenization.
#

spark.sql(f"""
CREATE OR REPLACE FUNCTION {CATALOG}.{SCHEMA}.skyflow_detokenize(
    token STRING
)
RETURNS STRING
LANGUAGE PYTHON
PARAMETER STYLE PANDAS
HANDLER 'detokenize_handler'
AS $$
import pandas as pd
import requests
from typing import Iterator

# Configuration embedded at function creation time
LAMBDA_URL = "{LAMBDA_URL}"
CLUSTER_ID = "{CLUSTER_ID}"
VAULT_ID = "{VAULT_ID}"
BATCH_SIZE = {BATCH_SIZE}
REQUEST_TIMEOUT = {REQUEST_TIMEOUT}

def detokenize_handler(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    '''
    Batch detokenization handler using Skyflow Lambda API.
    
    Args:
        batch_iter: Iterator yielding pandas Series (one per batch)
    
    Yields:
        Series of detokenized values for each batch
    
    Note: Single-argument UDF receives Iterator[pd.Series], not Iterator[Tuple].
    '''
    for tokens in batch_iter:
        # tokens is a pandas Series for this batch
        if tokens.empty:
            yield tokens
            continue
        
        # Build mask for non-null tokens
        mask = tokens.notna()
        token_list = tokens[mask].tolist()
        
        if not token_list:
            # All nulls, return original
            yield tokens
            continue
        
        # Initialize output as copy of input
        output = tokens.copy()
        
        # Process in sub-batches
        start = 0
        while start < len(token_list):
            end = min(start + BATCH_SIZE, len(token_list))
            batch_tokens = token_list[start:end]
            
            # Call Lambda API
            resp = requests.post(
                LAMBDA_URL,
                json={{"tokens": batch_tokens}},
                headers={{
                    "Content-Type": "application/json",
                    "X-Skyflow-Operation": "detokenize",
                    "X-Skyflow-Cluster-ID": CLUSTER_ID,
                    "X-Skyflow-Vault-ID": VAULT_ID
                }},
                timeout=REQUEST_TIMEOUT
            )
            resp.raise_for_status()
            
            # Parse response and create token->value mapping
            data = resp.json().get("data", [])
            token_to_value = {{r["token"]: r["value"] for r in data}}
            
            # Write back into output Series only where not null
            sub_mask_idx = mask[mask].index[start:end]
            for idx, tok in zip(sub_mask_idx, batch_tokens):
                output.at[idx] = token_to_value.get(tok)
            
            start = end
        
        # Yield result as Series
        yield output
$$
""")

print("✓ Created batch Python UDF: skyflow_detokenize(token)")
print(f"  Location: {CATALOG}.{SCHEMA}.skyflow_detokenize")
print(f"  Type: PARAMETER STYLE PANDAS (batched execution)")
print(f"  Batch Size: {BATCH_SIZE} rows per Lambda call")
print(f"  Lambda URL: {LAMBDA_URL}")
print("\nThis function is now:")
print("  - Persistent in Unity Catalog")
print("  - Shareable across workspaces/users")
print("  - Usable in SQL queries and persistent views")
print("  - Batched for optimal performance")

## Setup Complete!

Two Unity Catalog Batch Python UDFs have been created:

1. **skyflow_tokenize_column(column_value, column_name)** - Batched tokenization
2. **skyflow_detokenize(token)** - Batched detokenization

These functions are now:
- ✅ Persistent in Unity Catalog
- ✅ Governed and shareable with proper access control
- ✅ Callable from SQL queries
- ✅ Usable in persistent views
- ✅ Batched for optimal throughput (configurable batch size to Lambda)

**Key Technology:** These functions use `PARAMETER STYLE PANDAS` which enables batched, vectorized execution while maintaining persistence and governance in Unity Catalog.
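
The null-skipping and index-mapping logic inside the tokenization handler can be tried out without Spark or pandas. The sketch below mirrors that logic on plain Python lists; `tokenize_values` and `fake_lambda` are hypothetical stand-ins (the real handler calls the Lambda URL with `requests.post`), so treat it as an illustration of the technique rather than the deployed code.

```python
from typing import Callable, List, Optional


def tokenize_values(values: List[Optional[str]],
                    col_name: str,
                    call_lambda: Callable[[List[dict]], List[dict]],
                    batch_size: int = 500) -> List[Optional[str]]:
    """Tokenize non-null values in sub-batches, leaving nulls in place."""
    records = []
    index_map = []  # position of each non-null value in the original list
    for idx, value in enumerate(values):
        if value is not None:
            records.append({col_name: value})
            index_map.append(idx)

    tokenized: List[Optional[str]] = [None] * len(values)
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        data = call_lambda(batch)
        # Responses are assumed to come back in request order.
        for local_i, rec in enumerate(data):
            tokenized[index_map[start + local_i]] = rec.get(col_name)
    return tokenized


def fake_lambda(batch: List[dict]) -> List[dict]:
    # Hypothetical stand-in: real tokens come from Skyflow via Lambda.
    return [{key: f"tok_{value}" for key, value in rec.items()} for rec in batch]


print(tokenize_values(["a@x.com", None, "b@x.com"], "email", fake_lambda, batch_size=1))
# ['tok_a@x.com', None, 'tok_b@x.com']
```

Keeping a separate `index_map` means null rows never reach the API, yet every token lands back on the row it came from.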

---

# Usage Examples

The cells below demonstrate how to use the batch Python UDFs.

## Generate Test Data

Create sample data for testing the functions:

In [None]:
from pyspark.sql.functions import expr, current_timestamp

# Configure number of test rows
NUM_ROWS = 100

# Create test data
test_df = spark.range(NUM_ROWS).select(
    (expr("id + 1").alias("user_id")),
    expr("concat('user_', id)").alias("username"),
    expr("concat('user', id, '@example.com')").alias("email"),
    expr("concat('+1-555-', LPAD(id % 1000, 3, '0'), '-', LPAD((id * 7) % 10000, 4, '0'))").alias("phone"),
    current_timestamp().alias("created_at")
)

# Save as table
test_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.raw_users")

print(f"✓ Created {CATALOG}.{SCHEMA}.raw_users table with {NUM_ROWS} rows")
display(spark.table(f"{CATALOG}.{SCHEMA}.raw_users").limit(10))

## Tokenize Data with Batch Python UDF

Use the batch Python UDF to tokenize email addresses efficiently.

**Note:** Due to Unity Catalog `PARAMETER STYLE PANDAS` limitations, you must use the **derived column pattern** when passing column names. The literal string `'email'` must be converted to a column expression using a subquery.

### Pattern Explanation

```sql
-- Convert literal 'email' to a column
SELECT
    skyflow_tokenize_column(email, column_name) as email_token
FROM (
    SELECT email, 'email' AS column_name
    FROM users
) t
```

The subquery creates a `column_name` column filled with `'email'`, which UC properly converts to a pandas Series in the UDF.
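
When several columns need tokenizing, the derived-column query can be generated rather than written by hand. This is a minimal sketch under stated assumptions: `build_tokenize_query` is a hypothetical helper, the `<column>_col` and `<column>_token` naming follows the examples printed in Step 2, and column names are interpolated as-is, so pass only trusted identifiers.

```python
from typing import List


def build_tokenize_query(table: str,
                         function: str,
                         token_columns: List[str],
                         passthrough_columns: List[str]) -> str:
    """Build a CTE query that feeds each column name to the UDF as a derived column."""
    inner = passthrough_columns + [f"{c}, '{c}' AS {c}_col" for c in token_columns]
    outer = passthrough_columns + [
        f"{function}({c}, {c}_col) AS {c}_token" for c in token_columns
    ]
    inner_sql = ",\n        ".join(inner)
    outer_sql = ",\n    ".join(outer)
    return (
        "WITH prepared_data AS (\n"
        f"    SELECT\n        {inner_sql}\n"
        f"    FROM {table}\n"
        ")\n"
        f"SELECT\n    {outer_sql}\n"
        "FROM prepared_data"
    )


query = build_tokenize_query(
    table="users",
    function="skyflow_tokenize_column",
    token_columns=["email", "phone"],
    passthrough_columns=["user_id"],
)
print(query)
```

In the notebook the resulting string could be passed to `spark.sql(...)`, with `table` and `function` given as fully qualified Unity Catalog names.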

In [None]:
# Tokenize using the batch Python UDF with derived column pattern
# Using CTE to ensure column_name is a proper column expression
result = spark.sql(f"""
    WITH prepared_data AS (
        SELECT
            user_id,
            username,
            email,
            phone,
            created_at,
            'email' AS column_name
        FROM {CATALOG}.{SCHEMA}.raw_users
    )
    SELECT 
        user_id,
        username,
        email,
        {CATALOG}.{SCHEMA}.skyflow_tokenize_column(email, column_name) as email_token,
        phone,
        created_at
    FROM prepared_data
""")

# Save tokenized data
result.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.tokenized_users")

print(f"✓ Created {CATALOG}.{SCHEMA}.tokenized_users table with tokenized emails")
print(f"  Tokenization was batched at {BATCH_SIZE} rows per Lambda API call")
display(spark.table(f"{CATALOG}.{SCHEMA}.tokenized_users").limit(10))

## Create Persistent Detokenized View

Create a **persistent view** (not possible with temporary UDFs!) that automatically detokenizes data:

In [None]:
# Create a PERSISTENT view that detokenizes email tokens on-the-fly
spark.sql(f"""
    CREATE OR REPLACE VIEW {CATALOG}.{SCHEMA}.users_detokenized AS
    SELECT 
        user_id,
        username,
        {CATALOG}.{SCHEMA}.skyflow_detokenize(email_token) as email,
        phone,
        created_at
    FROM {CATALOG}.{SCHEMA}.tokenized_users
""")

print(f"✓ Created PERSISTENT view: {CATALOG}.{SCHEMA}.users_detokenized")
print("\n🎉 Key Achievement: This is a PERSISTENT view using batched detokenization!")
print("   - Not possible with temporary Pandas UDFs")
print("   - Much more efficient than scalar UC UDFs")
print(f"   - Batches at {BATCH_SIZE} rows per API call")
print("   - Accessible to all users with permissions")
print("\nQuerying the view:")
display(spark.sql(f"SELECT * FROM {CATALOG}.{SCHEMA}.users_detokenized LIMIT 10"))

## Verify Roundtrip Accuracy

Compare original values with tokenized and detokenized values:

In [None]:
# Compare original vs detokenized
verification_df = spark.sql(f"""
    SELECT
        t.user_id,
        r.email as original_email,
        t.email_token,
        d.email as detokenized_email,
        CASE
            WHEN r.email = d.email THEN 'MATCH'
            ELSE 'MISMATCH'
        END as verification
    FROM {CATALOG}.{SCHEMA}.tokenized_users t
    JOIN {CATALOG}.{SCHEMA}.raw_users r ON t.user_id = r.user_id
    JOIN {CATALOG}.{SCHEMA}.users_detokenized d ON t.user_id = d.user_id
    LIMIT 10
""")

display(verification_df)

# Check the sampled rows for mismatches (the query above is limited to 10 rows)
mismatches = verification_df.filter("verification = 'MISMATCH'").count()
if mismatches == 0:
    print("\n✓ All sampled records match! Tokenization → Detokenization working correctly.")
    print("\n🎉 Batch Python UDFs are working with batched Lambda calls!")
else:
    print(f"\n✗ Found {mismatches} mismatches - investigate!")

## Create Analytics Dashboard with Detokenization

Now let's create a simple dashboard that demonstrates using the detokenize function in SQL queries for analytics. This shows how your BI tools and dashboards can work seamlessly with Skyflow-protected data.

**Key Benefits:**
- Query tokenized data tables directly (fast, no PII exposure)
- Selectively detokenize only when needed (e.g., for display or domain analysis)
- Use standard SQL with batched detokenization (efficient API usage)
- Perfect for Databricks SQL dashboards, Tableau, PowerBI, etc.
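
The domain analysis in the next cell relies on `SUBSTRING_INDEX(email, '@', -1)`, which returns everything after the last `@` (or the whole string when there is no `@`). A plain-Python equivalent, offered as a sketch for checking expectations outside Spark, could look like this; `email_domain` is a hypothetical helper name.

```python
from collections import Counter
from typing import Optional


def email_domain(email: Optional[str]) -> Optional[str]:
    """Mimic SUBSTRING_INDEX(email, '@', -1): text after the last '@'."""
    if email is None:
        return None
    return email.rsplit("@", 1)[-1]


emails = ["user0@example.com", "user1@example.com", "ops@example.org", None]
counts = Counter(email_domain(e) for e in emails if e is not None)
print(counts.most_common())  # [('example.com', 2), ('example.org', 1)]
```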

In [None]:
# ============================================================================
# Dashboard Query 1: Email Domain Distribution
# ============================================================================
# This query detokenizes emails to analyze which email domains are most common

print("=" * 70)
print("Dashboard: Email Domain Distribution")
print("=" * 70)
print("Demonstrating: Detokenize → Extract domain → Aggregate\n")

domain_analysis = spark.sql(f"""
    WITH detokenized AS (
        SELECT
            user_id,
            username,
            {CATALOG}.{SCHEMA}.skyflow_detokenize(email_token) as email
        FROM {CATALOG}.{SCHEMA}.tokenized_users
    ),
    domains AS (
        SELECT
            SUBSTRING_INDEX(email, '@', -1) as email_domain,
            COUNT(*) as user_count
        FROM detokenized
        GROUP BY email_domain
        ORDER BY user_count DESC
    )
    SELECT * FROM domains
""")

display(domain_analysis)

print("\n✓ Email domain analysis complete")
print("  This query batched detokenization of all emails efficiently")

# ============================================================================
# Dashboard Query 2: Recent User Activity with Selective Detokenization
# ============================================================================

print("\n" + "=" * 70)
print("Dashboard: Recent User Activity")
print("=" * 70)
print("Demonstrating: Show tokens by default, detokenize only on demand\n")

recent_users = spark.sql(f"""
    SELECT
        user_id,
        username,
        email_token,
        {CATALOG}.{SCHEMA}.skyflow_detokenize(email_token) as email_plaintext,
        DATE(created_at) as registration_date
    FROM {CATALOG}.{SCHEMA}.tokenized_users
    ORDER BY created_at DESC
    LIMIT 20
""")

display(recent_users)

print("\n✓ Recent user activity dashboard complete")
print("  Shows both tokenized (for auditing) and detokenized (for display) values")

# ============================================================================
# Dashboard Query 3: User Summary Statistics
# ============================================================================

print("\n" + "=" * 70)
print("Dashboard: User Summary Statistics")
print("=" * 70)
print("Demonstrating: Aggregate analytics without detokenization\n")

summary_stats = spark.sql(f"""
    SELECT
        COUNT(*) as total_users,
        COUNT(DISTINCT email_token) as unique_emails,
        DATE(MIN(created_at)) as first_registration,
        DATE(MAX(created_at)) as last_registration
    FROM {CATALOG}.{SCHEMA}.tokenized_users
""")

display(summary_stats)

print("\n✓ Summary statistics complete")
print("  This query ran entirely on tokenized data - no detokenization needed!")
print("\n" + "=" * 70)
print("Dashboard Created Successfully!")
print("=" * 70)
print("\n💡 Key Insights:")
print("   1. Detokenization is batched automatically for efficiency")
print("   2. You can mix tokenized and detokenized columns in the same query")
print("   3. Many analytics queries don't need detokenization at all")
print("   4. Use Unity Catalog permissions to control who can see plaintext data")
print("\n🎯 BI Tool Integration:")
print("   - These queries can be saved as Databricks SQL dashboards")
print("   - Connect Tableau, PowerBI, or other tools to the detokenized view")
print("   - Access control ensures only authorized users see plaintext PII")

## Create Interactive Dashboard Visualizations

Now let's create visual charts from our queries. These visualizations can be:
- Viewed interactively in the notebook
- Scheduled to run automatically (Databricks Jobs)
- Shared with stakeholders via notebook links
- Exported to Databricks SQL Dashboards

**Note:** After running the cell below, click the chart icons in the output to configure visualization types (bar charts, pie charts, etc.).

In [None]:
# ============================================================================
# Create Dashboard Visualizations
# ============================================================================

print("=" * 70)
print("Creating Interactive Dashboard Visualizations")
print("=" * 70)
print()

# Visualization 1: Email Domain Distribution (Bar Chart)
print("📊 Visualization 1: Email Domain Distribution")
print("-" * 70)

domain_viz = spark.sql(f"""
    WITH detokenized AS (
        SELECT
            user_id,
            {CATALOG}.{SCHEMA}.skyflow_detokenize(email_token) as email
        FROM {CATALOG}.{SCHEMA}.tokenized_users
    ),
    domains AS (
        SELECT
            SUBSTRING_INDEX(email, '@', -1) as email_domain,
            COUNT(*) as user_count
        FROM detokenized
        GROUP BY email_domain
        ORDER BY user_count DESC
    )
    SELECT * FROM domains
""")

display(domain_viz)
print("💡 Tip: Click the chart icon above to visualize as a bar chart")
print("   X-axis: email_domain, Y-axis: user_count\n")

# Visualization 2: User Registration Timeline
print("📊 Visualization 2: User Registration Timeline")
print("-" * 70)

timeline_viz = spark.sql(f"""
    SELECT
        DATE(created_at) as registration_date,
        COUNT(*) as new_users
    FROM {CATALOG}.{SCHEMA}.tokenized_users
    GROUP BY DATE(created_at)
    ORDER BY registration_date
""")

display(timeline_viz)
print("💡 Tip: Click the chart icon above to visualize as a line chart")
print("   X-axis: registration_date, Y-axis: new_users\n")

# Visualization 3: Summary Metrics
print("📊 Visualization 3: Key Metrics Summary")
print("-" * 70)

metrics_viz = spark.sql(f"""
    SELECT
        'Total Users' as metric,
        CAST(COUNT(*) AS STRING) as value
    FROM {CATALOG}.{SCHEMA}.tokenized_users
    UNION ALL
    SELECT
        'Unique Emails' as metric,
        CAST(COUNT(DISTINCT email_token) AS STRING) as value
    FROM {CATALOG}.{SCHEMA}.tokenized_users
    UNION ALL
    SELECT
        'Days Active' as metric,
        CAST(DATEDIFF(MAX(created_at), MIN(created_at)) AS STRING) as value
    FROM {CATALOG}.{SCHEMA}.tokenized_users
""")

display(metrics_viz)
print("💡 This shows key performance indicators (KPIs)\n")

# Visualization 4: Sample User Records with Detokenization
print("📊 Visualization 4: Recent Users (with PII)")
print("-" * 70)
print("⚠️  This visualization detokenizes PII - ensure proper access controls!")

recent_users_viz = spark.sql(f"""
    SELECT
        user_id,
        username,
        {CATALOG}.{SCHEMA}.skyflow_detokenize(email_token) as email,
        DATE(created_at) as registration_date
    FROM {CATALOG}.{SCHEMA}.tokenized_users
    ORDER BY created_at DESC
    LIMIT 10
""")

display(recent_users_viz)

print("\n" + "=" * 70)
print("✓ Dashboard Visualizations Created Successfully!")
print("=" * 70)
print("\n📌 Next Steps:")
print("   1. Configure chart types by clicking the visualization icons")
print("   2. Save this notebook and schedule it to run periodically")
print("   3. Share the notebook URL with stakeholders")
print("   4. Export specific queries to Databricks SQL for persistent dashboards")
print("\n🔒 Security Reminder:")
print("   - Use Unity Catalog permissions to control who can run this notebook")
print("   - Only authorized users should access visualizations with detokenized data")
print("   - Consider creating separate dashboards for tokenized vs. detokenized views")

## Create Custom HTML Dashboard (Optional)

For a more polished look, you can create a custom HTML dashboard using `displayHTML()`. This is great for executive reports and stakeholder presentations.

In [None]:
# ============================================================================
# Create Custom HTML Dashboard
# ============================================================================

# Fetch metrics data
metrics_data = spark.sql(f"""
    SELECT
        COUNT(*) as total_users,
        COUNT(DISTINCT email_token) as unique_emails,
        DATE(MIN(created_at)) as first_registration,
        DATE(MAX(created_at)) as last_registration
    FROM {CATALOG}.{SCHEMA}.tokenized_users
""").collect()[0]

# Fetch domain distribution
domain_data = spark.sql(f"""
    WITH detokenized AS (
        SELECT
            user_id,
            {CATALOG}.{SCHEMA}.skyflow_detokenize(email_token) as email
        FROM {CATALOG}.{SCHEMA}.tokenized_users
    ),
    domains AS (
        SELECT
            SUBSTRING_INDEX(email, '@', -1) as email_domain,
            COUNT(*) as user_count
        FROM detokenized
        GROUP BY email_domain
        ORDER BY user_count DESC
        LIMIT 5
    )
    SELECT * FROM domains
""").collect()

# Build HTML dashboard
html = f"""
<!DOCTYPE html>
<html>
<head>
    <style>
        body {{
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            padding: 20px;
            margin: 0;
        }}
        .dashboard {{
            max-width: 1200px;
            margin: 0 auto;
        }}
        .header {{
            background: white;
            border-radius: 10px;
            padding: 30px;
            margin-bottom: 20px;
            box-shadow: 0 4px 6px rgba(0,0,0,0.1);
        }}
        .header h1 {{
            margin: 0 0 10px 0;
            color: #333;
            font-size: 32px;
        }}
        .header p {{
            margin: 0;
            color: #666;
            font-size: 16px;
        }}
        .metrics {{
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
            gap: 20px;
            margin-bottom: 20px;
        }}
        .metric-card {{
            background: white;
            border-radius: 10px;
            padding: 25px;
            box-shadow: 0 4px 6px rgba(0,0,0,0.1);
            transition: transform 0.2s;
        }}
        .metric-card:hover {{
            transform: translateY(-5px);
        }}
        .metric-label {{
            color: #888;
            font-size: 14px;
            text-transform: uppercase;
            letter-spacing: 1px;
            margin-bottom: 10px;
        }}
        .metric-value {{
            color: #333;
            font-size: 36px;
            font-weight: bold;
        }}
        .chart-card {{
            background: white;
            border-radius: 10px;
            padding: 30px;
            box-shadow: 0 4px 6px rgba(0,0,0,0.1);
            margin-bottom: 20px;
        }}
        .chart-card h2 {{
            margin: 0 0 20px 0;
            color: #333;
            font-size: 20px;
        }}
        .domain-bar {{
            display: flex;
            align-items: center;
            margin-bottom: 15px;
        }}
        .domain-label {{
            min-width: 150px;
            color: #666;
            font-size: 14px;
        }}
        .bar-container {{
            flex: 1;
            height: 30px;
            background: #f0f0f0;
            border-radius: 5px;
            overflow: hidden;
            margin: 0 15px;
        }}
        .bar-fill {{
            height: 100%;
            background: linear-gradient(90deg, #667eea 0%, #764ba2 100%);
            display: flex;
            align-items: center;
            justify-content: flex-end;
            padding-right: 10px;
            color: white;
            font-weight: bold;
            font-size: 12px;
            transition: width 0.3s ease;
        }}
        .footer {{
            background: white;
            border-radius: 10px;
            padding: 20px;
            text-align: center;
            box-shadow: 0 4px 6px rgba(0,0,0,0.1);
            color: #666;
            font-size: 14px;
        }}
        .security-badge {{
            display: inline-block;
            background: #10b981;
            color: white;
            padding: 5px 15px;
            border-radius: 20px;
            font-size: 12px;
            font-weight: bold;
            margin-top: 10px;
        }}
    </style>
</head>
<body>
    <div class="dashboard">
        <div class="header">
            <h1>Skyflow User Analytics Dashboard</h1>
            <p>Real-time analytics with secure detokenization powered by Skyflow + Databricks</p>
        </div>
        
        <div class="metrics">
            <div class="metric-card">
                <div class="metric-label">Total Users</div>
                <div class="metric-value">{metrics_data.total_users:,}</div>
            </div>
            <div class="metric-card">
                <div class="metric-label">Unique Emails</div>
                <div class="metric-value">{metrics_data.unique_emails:,}</div>
            </div>
            <div class="metric-card">
                <div class="metric-label">First Registration</div>
                <div class="metric-value" style="font-size: 24px;">{metrics_data.first_registration}</div>
            </div>
            <div class="metric-card">
                <div class="metric-label">Latest Registration</div>
                <div class="metric-value" style="font-size: 24px;">{metrics_data.last_registration}</div>
            </div>
        </div>
        
        <div class="chart-card">
            <h2>Top Email Domains</h2>
            {"".join([f'''
            <div class="domain-bar">
                <div class="domain-label">{row.email_domain}</div>
                <div class="bar-container">
                    <div class="bar-fill" style="width: {(row.user_count / domain_data[0].user_count) * 100}%">
                        {row.user_count}
                    </div>
                </div>
            </div>
            ''' for row in domain_data])}
        </div>
        
        <div class="footer">
            <strong>Powered by Skyflow Data Privacy Vault</strong>
            <br>
            All PII is tokenized at rest and detokenized on-demand with batched API calls
            <div class="security-badge">🔒 SECURE & COMPLIANT</div>
        </div>
    </div>
</body>
</html>
"""

displayHTML(html)

print("✓ Custom HTML dashboard created successfully!")
print("\n💡 Benefits of HTML dashboards:")
print("   - Professional, polished appearance")
print("   - Fully customizable styling and branding")
print("   - Can be scheduled and emailed automatically")
print("   - Great for executive reports and stakeholder updates")

## Publishing to Databricks SQL Dashboards

To create a persistent Databricks SQL Dashboard from these queries:

### Method 1: Manual Export (Recommended)
1. Navigate to **Databricks SQL** in your workspace
2. Create a new **Query** for each visualization:
   - Copy the SQL query from the cells above
   - Save as a named query (e.g., "User Email Domains")
3. Go to **Dashboards** → **Create Dashboard**
4. Add your saved queries as widgets
5. Configure visualizations (bar charts, line charts, etc.)
6. Set up automatic refresh schedules

### Method 2: Scheduled Notebook
- Schedule this notebook to run on a regular cadence (hourly, daily, etc.)
- Share the notebook URL with stakeholders
- Users can view the latest results by opening the notebook

### Method 3: Databricks Apps (New)
If you have access to Databricks Apps, you can create an interactive web application:
- Export queries as REST API endpoints
- Build a custom frontend with React/Vue/etc.
- Deploy as a Databricks App for production use

### Key Advantages of Databricks SQL Dashboards:
- ✅ **Persistent** - Dashboards survive cluster restarts
- ✅ **Scheduled refresh** - Automatic data updates
- ✅ **Access control** - Fine-grained permissions via Unity Catalog
- ✅ **Sharing** - Easy to share links with stakeholders
- ✅ **Interactive** - Click to drill down, filter, and explore
- ✅ **Alerting** - Set up alerts on metric thresholds

### Security Best Practices:
- Create separate dashboards for tokenized vs. detokenized views
- Use Unity Catalog permissions to control access to detokenization functions
- Audit who accesses dashboards with PII using Databricks audit logs
- Consider using Skyflow's column-level redaction policies for fine-grained control

## Generate Importable Dashboard JSON

Create a Databricks Lakeview dashboard file (`.lvdash.json`) that can be imported directly into Databricks SQL.
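
A dashboard definition of this shape pairs a list of `datasets` with widgets that refer to them by `datasetName`, so a misspelled name only surfaces after import. The sketch below is one possible pre-import check, not part of the Lakeview tooling: `find_dangling_dataset_refs` and `sample_dashboard` are hypothetical names, and the sample only mirrors the keys used in this notebook.

```python
def find_dangling_dataset_refs(dashboard: dict) -> list:
    """Return (widget_name, dataset_name) pairs whose dataset is not defined."""
    defined = {ds["name"] for ds in dashboard.get("datasets", [])}
    dangling = []
    for page in dashboard.get("pages", []):
        for item in page.get("layout", []):
            widget = item.get("widget", {})
            # Text widgets (such as the header) carry no "queries" key.
            for q in widget.get("queries", []):
                dataset_name = q.get("query", {}).get("datasetName")
                if dataset_name not in defined:
                    dangling.append((widget.get("name"), dataset_name))
    return dangling


# Tiny stand-in dashboard (illustrative data, one deliberate typo).
sample_dashboard = {
    "datasets": [{"name": "skyflow_summary_metrics", "queryLines": ["SELECT 1"]}],
    "pages": [{
        "name": "skyflow_analytics",
        "layout": [
            {"widget": {"name": "header",
                        "multilineTextboxSpec": {"lines": ["# Title\n"]}}},
            {"widget": {"name": "total_users_metric",
                        "queries": [{"name": "main_query",
                                     "query": {"datasetName": "skyflow_summary_metrics"}}]}},
            {"widget": {"name": "typo_widget",
                        "queries": [{"name": "main_query",
                                     "query": {"datasetName": "skyflow_sumary_metrics"}}]}},
        ],
    }],
}

print(find_dangling_dataset_refs(sample_dashboard))
```

Running the same function on the full `dashboard_json` dictionary before writing the file should return an empty list if every widget points at a defined dataset.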

In [None]:
import json
import os

# ============================================================================
# Generate Databricks Lakeview Dashboard JSON
# ============================================================================

dashboard_json = {
    "datasets": [
        {
            "name": "skyflow_users_detokenized",
            "displayName": "Skyflow Users (Detokenized)",
            "queryLines": [
                "SELECT\n",
                "    user_id,\n",
                "    username,\n",
                f"    {CATALOG}.{SCHEMA}.skyflow_detokenize(email_token) as email,\n",
                "    email_token,\n",
                "    phone,\n",
                "    DATE(created_at) as registration_date,\n",
                "    created_at\n",
                f"FROM {CATALOG}.{SCHEMA}.tokenized_users;"
            ]
        },
        {
            "name": "skyflow_email_domains",
            "displayName": "Email Domain Distribution",
            "queryLines": [
                "WITH detokenized AS (\n",
                "    SELECT\n",
                "        user_id,\n",
                f"        {CATALOG}.{SCHEMA}.skyflow_detokenize(email_token) as email\n",
                f"    FROM {CATALOG}.{SCHEMA}.tokenized_users\n",
                "),\n",
                "domains AS (\n",
                "    SELECT\n",
                "        SUBSTRING_INDEX(email, '@', -1) as email_domain,\n",
                "        COUNT(*) as user_count\n",
                "    FROM detokenized\n",
                "    GROUP BY email_domain\n",
                ")\n",
                "SELECT * FROM domains\n",
                "ORDER BY user_count DESC;"
            ]
        },
        {
            "name": "skyflow_summary_metrics",
            "displayName": "Summary Metrics",
            "queryLines": [
                "SELECT\n",
                "    COUNT(*) as total_users,\n",
                "    COUNT(DISTINCT email_token) as unique_emails,\n",
                "    DATE(MIN(created_at)) as first_registration,\n",
                "    DATE(MAX(created_at)) as last_registration,\n",
                "    DATEDIFF(MAX(created_at), MIN(created_at)) as days_active\n",
                f"FROM {CATALOG}.{SCHEMA}.tokenized_users;"
            ]
        },
        {
            "name": "skyflow_registration_timeline",
            "displayName": "Registration Timeline",
            "queryLines": [
                "SELECT\n",
                "    DATE(created_at) as registration_date,\n",
                "    COUNT(*) as new_users\n",
                f"FROM {CATALOG}.{SCHEMA}.tokenized_users\n",
                "GROUP BY DATE(created_at)\n",
                "ORDER BY registration_date;"
            ]
        }
    ],
    "pages": [
        {
            "name": "skyflow_analytics",
            "displayName": "Skyflow User Analytics",
            "layout": [
                {
                    "widget": {
                        "name": "header",
                        "multilineTextboxSpec": {
                            "lines": [
                                "\n",
                                "# Skyflow User Analytics Dashboard\n",
                                "Secure analytics with on-demand detokenization powered by Skyflow + Databricks Unity Catalog\n",
                                "🔒 All PII is tokenized at rest and detokenized in batches for optimal performance\n"
                            ]
                        }
                    },
                    "position": {
                        "x": 0,
                        "y": 0,
                        "width": 6,
                        "height": 2
                    }
                },
                {
                    "widget": {
                        "name": "total_users_metric",
                        "queries": [
                            {
                                "name": "main_query",
                                "query": {
                                    "datasetName": "skyflow_summary_metrics",
                                    "fields": [
                                        {
                                            "name": "total_users",
                                            "expression": "`total_users`"
                                        }
                                    ],
                                    "disaggregated": True
                                }
                            }
                        ],
                        "spec": {
                            "version": 2,
                            "widgetType": "counter",
                            "encodings": {
                                "value": {
                                    "fieldName": "total_users",
                                    "displayName": "Total Users"
                                }
                            },
                            "frame": {
                                "title": "Total Users",
                                "showTitle": True
                            }
                        }
                    },
                    "position": {
                        "x": 0,
                        "y": 2,
                        "width": 2,
                        "height": 2
                    }
                },
                {
                    "widget": {
                        "name": "unique_emails_metric",
                        "queries": [
                            {
                                "name": "main_query",
                                "query": {
                                    "datasetName": "skyflow_summary_metrics",
                                    "fields": [
                                        {
                                            "name": "unique_emails",
                                            "expression": "`unique_emails`"
                                        }
                                    ],
                                    "disaggregated": True
                                }
                            }
                        ],
                        "spec": {
                            "version": 2,
                            "widgetType": "counter",
                            "encodings": {
                                "value": {
                                    "fieldName": "unique_emails",
                                    "displayName": "Unique Emails"
                                }
                            },
                            "frame": {
                                "title": "Unique Email Addresses",
                                "showTitle": True
                            }
                        }
                    },
                    "position": {
                        "x": 2,
                        "y": 2,
                        "width": 2,
                        "height": 2
                    }
                },
                {
                    "widget": {
                        "name": "days_active_metric",
                        "queries": [
                            {
                                "name": "main_query",
                                "query": {
                                    "datasetName": "skyflow_summary_metrics",
                                    "fields": [
                                        {
                                            "name": "days_active",
                                            "expression": "`days_active`"
                                        }
                                    ],
                                    "disaggregated": True
                                }
                            }
                        ],
                        "spec": {
                            "version": 2,
                            "widgetType": "counter",
                            "encodings": {
                                "value": {
                                    "fieldName": "days_active",
                                    "displayName": "Days Active"
                                }
                            },
                            "frame": {
                                "title": "Days of Activity",
                                "showTitle": True
                            }
                        }
                    },
                    "position": {
                        "x": 4,
                        "y": 2,
                        "width": 2,
                        "height": 2
                    }
                },
                {
                    "widget": {
                        "name": "email_domains_bar",
                        "queries": [
                            {
                                "name": "main_query",
                                "query": {
                                    "datasetName": "skyflow_email_domains",
                                    "fields": [
                                        {
                                            "name": "email_domain",
                                            "expression": "`email_domain`"
                                        },
                                        {
                                            "name": "user_count",
                                            "expression": "`user_count`"
                                        }
                                    ],
                                    "disaggregated": True
                                }
                            }
                        ],
                        "spec": {
                            "version": 3,
                            "widgetType": "bar",
                            "encodings": {
                                "x": {
                                    "fieldName": "email_domain",
                                    "scale": {
                                        "type": "categorical",
                                        "sort": {
                                            "by": "y-reversed"
                                        }
                                    },
                                    "displayName": "Email Domain"
                                },
                                "y": {
                                    "fieldName": "user_count",
                                    "scale": {
                                        "type": "quantitative"
                                    },
                                    "displayName": "Number of Users"
                                }
                            },
                            "frame": {
                                "title": "User Distribution by Email Domain",
                                "showTitle": True,
                                "description": "Shows which email domains are most common among registered users"
                            }
                        }
                    },
                    "position": {
                        "x": 0,
                        "y": 4,
                        "width": 3,
                        "height": 4
                    }
                },
                {
                    "widget": {
                        "name": "registration_timeline",
                        "queries": [
                            {
                                "name": "main_query",
                                "query": {
                                    "datasetName": "skyflow_registration_timeline",
                                    "fields": [
                                        {
                                            "name": "registration_date",
                                            "expression": "`registration_date`"
                                        },
                                        {
                                            "name": "new_users",
                                            "expression": "`new_users`"
                                        }
                                    ],
                                    "disaggregated": True
                                }
                            }
                        ],
                        "spec": {
                            "version": 3,
                            "widgetType": "line",
                            "encodings": {
                                "x": {
                                    "fieldName": "registration_date",
                                    "scale": {
                                        "type": "temporal"
                                    },
                                    "displayName": "Registration Date"
                                },
                                "y": {
                                    "fieldName": "new_users",
                                    "scale": {
                                        "type": "quantitative"
                                    },
                                    "displayName": "New Users"
                                }
                            },
                            "frame": {
                                "title": "User Registration Timeline",
                                "showTitle": True,
                                "description": "Daily new user registrations over time"
                            }
                        }
                    },
                    "position": {
                        "x": 3,
                        "y": 4,
                        "width": 3,
                        "height": 4
                    }
                },
                {
                    "widget": {
                        "name": "recent_users_table",
                        "queries": [
                            {
                                "name": "main_query",
                                "query": {
                                    "datasetName": "skyflow_users_detokenized",
                                    "fields": [
                                        {
                                            "name": "user_id",
                                            "expression": "`user_id`"
                                        },
                                        {
                                            "name": "username",
                                            "expression": "`username`"
                                        },
                                        {
                                            "name": "email",
                                            "expression": "`email`"
                                        },
                                        {
                                            "name": "email_token",
                                            "expression": "`email_token`"
                                        },
                                        {
                                            "name": "registration_date",
                                            "expression": "`registration_date`"
                                        }
                                    ],
                                    "disaggregated": True
                                }
                            }
                        ],
                        "spec": {
                            "version": 2,
                            "widgetType": "table",
                            "encodings": {
                                "columns": [
                                    {
                                        "fieldName": "user_id",
                                        "displayName": "User ID"
                                    },
                                    {
                                        "fieldName": "username",
                                        "displayName": "Username"
                                    },
                                    {
                                        "fieldName": "email",
                                        "displayName": "Email (Detokenized)"
                                    },
                                    {
                                        "fieldName": "email_token",
                                        "displayName": "Email Token"
                                    },
                                    {
                                        "fieldName": "registration_date",
                                        "displayName": "Registration Date"
                                    }
                                ]
                            },
                            "frame": {
                                "title": "Recent Users (with Detokenized PII)",
                                "showTitle": True,
                                "description": "⚠️ Contains detokenized PII - Access controlled via Unity Catalog permissions"
                            }
                        }
                    },
                    "position": {
                        "x": 0,
                        "y": 8,
                        "width": 6,
                        "height": 5
                    }
                }
            ],
            "pageType": "PAGE_TYPE_CANVAS"
        }
    ]
}

# Convert to formatted JSON string
dashboard_json_str = json.dumps(dashboard_json, indent=2)

# Save to local /tmp first
tmp_path = "/tmp/skyflow_analytics_dashboard.lvdash.json"
with open(tmp_path, 'w') as f:
    f.write(dashboard_json_str)

print("=" * 70)
print("✓ Databricks Lakeview Dashboard JSON Generated!")
print("=" * 70)
print(f"\n📊 Dashboard Configuration:")
print(f"   - Name: Skyflow User Analytics")
print(f"   - Datasets: {len(dashboard_json['datasets'])}")
print(f"   - Widgets: {len(dashboard_json['pages'][0]['layout'])}")
print(f"   - Catalog: {CATALOG}")
print(f"   - Schema: {SCHEMA}")

print(f"\nüì• How to Use This Dashboard:")
print(f"\n   Option 1: Copy JSON from output below (manual)")
print(f"   Option 2: Automatically save to DBFS (uncomment Option 2 code)")
print(f"   Option 3: Automatically import to Workspace (uncomment Option 3 code)")

print(f"\nüìÑ Complete Dashboard JSON:")
print("=" * 70)
print(dashboard_json_str)
print("=" * 70)

# ============================================================================
# OPTION 2: Save to DBFS (Recommended for artifacts)
# ============================================================================
# Uncomment the lines below to automatically save to DBFS and get a download link

# dbfs_path = "dbfs:/FileStore/skyflow/skyflow_analytics_dashboard.lvdash.json"
# dbutils.fs.cp(f"file:{tmp_path}", dbfs_path, True)
# print(f"\n✓ OPTION 2 COMPLETE: Saved to DBFS")
# print(f"   DBFS Path: {dbfs_path}")
# print(f"   Download URL: /files/skyflow/skyflow_analytics_dashboard.lvdash.json")
# print(f"   Access from browser: <your-databricks-url>/files/skyflow/skyflow_analytics_dashboard.lvdash.json")

# ============================================================================
# OPTION 3: Import to Workspace (makes it visible alongside notebooks)
# ============================================================================
# Uncomment the lines below to automatically import into your Workspace
# Note: Update the email/path to match your Databricks username

# try:
#     # Get current user (works in most Databricks environments)
#     current_user = spark.sql("SELECT current_user()").collect()[0][0]
# except Exception:
#     # Fallback: manually specify your email
#     current_user = "your.email@company.com"
# workspace_path = f"/Users/{current_user}/skyflow_analytics_dashboard.lvdash.json"
# 
# with open(tmp_path, "r", encoding="utf-8") as f:
#     content = f.read()
# 
# # Import into Workspace (will overwrite if exists)
# # Note: this assumes dbutils.workspace.import_ is available in your runtime;
# # if it is not, the Workspace import REST API is an alternative.
# dbutils.workspace.import_(workspace_path, content, overwrite=True)
# print(f"\n✓ OPTION 3 COMPLETE: Imported to Workspace")
# print(f"   Workspace Path: {workspace_path}")
# print(f"   Navigate to Workspace → Users → {current_user} to find the file")

print(f"\nüíæ Manual Import Instructions:")
print(f"   1. Copy the JSON above (between the === lines)")
print(f"   2. Save to a local file: skyflow_analytics_dashboard.lvdash.json")
print(f"   3. Navigate to Databricks SQL ‚Üí Dashboards ‚Üí Import")
print(f"   4. Upload the .lvdash.json file")

print(f"\nüé® Dashboard Includes:")
print(f"   ‚úì Header with security description")
print(f"   ‚úì 3 metric counters (Total Users, Unique Emails, Days Active)")
print(f"   ‚úì Email domain distribution bar chart")
print(f"   ‚úì User registration timeline (line chart)")
print(f"   ‚úì Recent users table with detokenized PII")

print(f"\nüîí Security Note:")
print(f"   Only users with EXECUTE permissions on {CATALOG}.{SCHEMA}.skyflow_detokenize()")
print(f"   can view detokenized data in this dashboard")

print("\n" + "=" * 70)
print("✓ Dashboard JSON ready! Choose your preferred option above.")
print("=" * 70)

## Grant Permissions (Optional)

Control access to the batch Python UDFs:

In [None]:
# Example: Grant function execution permissions
# Uncomment and adjust as needed

# spark.sql(f"GRANT EXECUTE ON FUNCTION {CATALOG}.{SCHEMA}.skyflow_tokenize_column TO `data_engineers`")
# spark.sql(f"GRANT EXECUTE ON FUNCTION {CATALOG}.{SCHEMA}.skyflow_detokenize TO `data_engineers`")
# spark.sql(f"GRANT SELECT ON VIEW {CATALOG}.{SCHEMA}.users_detokenized TO `analysts`")

print("üí° Tip: Use Unity Catalog's access control to govern who can:")
print("   - Execute tokenization (potentially create PII tokens)")
print("   - Execute detokenization (access plaintext PII)")
print("   - Query detokenized views (read PII)")

## Cleanup (Optional)

To remove all test resources:

In [None]:
# Uncomment to drop test tables, views, and functions
# spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{SCHEMA}.raw_users")
# spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{SCHEMA}.tokenized_users")
# spark.sql(f"DROP VIEW IF EXISTS {CATALOG}.{SCHEMA}.users_detokenized")
# spark.sql(f"DROP FUNCTION IF EXISTS {CATALOG}.{SCHEMA}.skyflow_detokenize")
# spark.sql(f"DROP FUNCTION IF EXISTS {CATALOG}.{SCHEMA}.skyflow_tokenize_column")
# print("✓ Cleanup complete")

## Summary

This notebook demonstrated **Unity Catalog Batch Python UDFs** for Skyflow integration in Databricks.

### Key Achievements

✅ **Batched Execution**
- Configurable batch size to Lambda (default 500 rows per call for high throughput)
- Lambda internally batches at 25 rows per Skyflow API call
- Cuts the number of API calls compared with row-by-row processing (one call per batch instead of one per row)
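
The two batching levels can be pictured with a small chunking helper. This is a minimal sketch and not the deployed Lambda code: `chunked`, `LAMBDA_BATCH_SIZE`, and `SKYFLOW_BATCH_SIZE` are illustrative names, and the 1,200 sample rows are made up.

```python
from typing import Iterator, List, Sequence


def chunked(values: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(values), size):
        yield list(values[start:start + size])


LAMBDA_BATCH_SIZE = 500   # rows per Databricks -> Lambda call (notebook default)
SKYFLOW_BATCH_SIZE = 25   # rows per Lambda -> Skyflow call

rows = [f"token_{i}" for i in range(1200)]  # illustrative input

# Level 1: what the UDF sends to Lambda.
lambda_batches = list(chunked(rows, LAMBDA_BATCH_SIZE))

# Level 2: what each Lambda invocation forwards to Skyflow.
skyflow_calls = sum(
    len(list(chunked(batch, SKYFLOW_BATCH_SIZE))) for batch in lambda_batches
)

print(len(lambda_batches), skyflow_calls)  # prints: 3 48
```

The last Lambda batch holds only 200 rows, which is why the Skyflow call count is 20 + 20 + 8 rather than a multiple of 20.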

✅ **Persistent & Governed**
- Functions stored in Unity Catalog
- Shareable across workspaces and users
- Fine-grained access control

✅ **Production Ready**
- Usable in persistent views
- Callable from SQL queries
- Survives cluster restarts
- Perfect for BI tools and team collaboration

### The Magic: PARAMETER STYLE PANDAS

```sql
CREATE FUNCTION ... 
LANGUAGE PYTHON
PARAMETER STYLE PANDAS  -- This enables batched execution!
HANDLER 'handler_function'
```

This directive tells Databricks to:
1. Group rows into batches (pandas Series)
2. Call the handler function with batches (not individual rows)
3. Process results as batches

### Handler Signature Patterns

**Single-argument UDF** (e.g., detokenization):
```python
def handler(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for values in batch_iter:
        # Process batch
        yield pd.Series(results)
```

**Two-argument UDF** (e.g., tokenization):
```python
def handler(batch_iter: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    for arg1, arg2 in batch_iter:
        # Process batch
        yield pd.Series(results)
```
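
To make the single-argument pattern concrete, here is a minimal sketch of a handler that can be exercised locally without Spark. It assumes the remote call is injected as a plain function: `make_handler` and `fake_detokenize` are hypothetical stand-ins for the Lambda request, not the notebook's actual implementation.

```python
from typing import Callable, Iterator, List

import pandas as pd


def make_handler(call_service: Callable[[List[str]], List[str]],
                 batch_size: int = 500):
    """Build a batch handler that forwards values in chunks of `batch_size`."""
    def handler(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
        for values in batch_iter:
            items = values.tolist()
            results: List[str] = []
            for start in range(0, len(items), batch_size):
                results.extend(call_service(items[start:start + batch_size]))
            # Reuse the input index so each output row lines up with its input row.
            yield pd.Series(results, index=values.index)
    return handler


def fake_detokenize(tokens: List[str]) -> List[str]:
    """Stand-in for the Lambda call (assumption: returns one value per token)."""
    return [t.replace("tok_", "user") + "@example.com" for t in tokens]


handler = make_handler(fake_detokenize, batch_size=2)
batches = iter([pd.Series(["tok_1", "tok_2", "tok_3"]), pd.Series(["tok_4"])])
out = [series.tolist() for series in handler(batches)]
print(out)
```

Each yielded Series has the same length as the Series it was built from, which is the property the batched call contract depends on. Null handling and HTTP error handling are left out of this sketch.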

### Performance Notes

- **Batch Size to Lambda:** Default 500 rows per call (configurable via BATCH_SIZE parameter)
- **Lambda to Skyflow:** Automatic internal batching at 25 rows per Skyflow API call
- **Timeout:** 30 seconds (must be ≤ Lambda timeout)
- **Parallelization:** Spark automatically parallelizes across partitions
- **Scalability:** Large batches to Lambda reduce per-call overhead, and Lambda scales out to process concurrent batches quickly

For 100K rows with BATCH_SIZE=500:
- ~200 Lambda calls
- Lambda then makes ~4,000 Skyflow API calls (at 25 rows each)
- Compare to row-by-row scalar UDFs: 100,000 Lambda calls (500x as many invocations)
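
The arithmetic behind these figures can be reproduced for other row counts. The sketch below assumes every Lambda call except the last carries a full batch, so real numbers can be somewhat higher when Spark partitions produce smaller batches. `call_counts` is an illustrative helper, not part of the notebook.

```python
import math


def call_counts(total_rows: int,
                lambda_batch_size: int = 500,
                skyflow_batch_size: int = 25) -> dict:
    """Estimate call volumes, assuming full batches except possibly the last."""
    lambda_calls = math.ceil(total_rows / lambda_batch_size)
    full_batches, remainder = divmod(total_rows, lambda_batch_size)
    skyflow_calls = (
        full_batches * math.ceil(lambda_batch_size / skyflow_batch_size)
        + math.ceil(remainder / skyflow_batch_size)
    )
    return {
        "lambda_calls": lambda_calls,
        "skyflow_calls": skyflow_calls,
        "scalar_udf_calls": total_rows,  # one Lambda call per row
        "lambda_call_reduction": total_rows / lambda_calls if lambda_calls else 0.0,
    }


counts = call_counts(100_000)
print(counts)
```

For 100,000 rows this gives 200 Lambda calls, 4,000 Skyflow calls, and a 500x reduction in Lambda invocations relative to a scalar UDF.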

### Derived Column Pattern

Due to Unity Catalog limitations, you cannot pass literal strings directly to `PARAMETER STYLE PANDAS` functions. Use the derived column pattern:

```sql
-- Convert literal to column
WITH prepared AS (
  SELECT email, 'email' AS email_col
  FROM users
)
SELECT skyflow_tokenize_column(email, email_col)
FROM prepared
```
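
When the same pattern is needed for several columns, the query text can be generated rather than typed by hand. The helper below is a sketch under stated assumptions: `build_tokenize_query` is a hypothetical name, the `_col` and `_token` suffixes are arbitrary choices, and the identifier check is a deliberately strict guard, not a full SQL parser.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")


def build_tokenize_query(function_name: str, table_name: str, value_column: str) -> str:
    """Build a derived-column query that passes the column name as a column."""
    for identifier in (function_name, table_name, value_column):
        # Reject anything that is not a plain (optionally dotted) identifier.
        if not _IDENTIFIER.match(identifier):
            raise ValueError(f"unsafe identifier: {identifier!r}")
    literal_alias = f"{value_column}_col"
    return (
        "WITH prepared AS (\n"
        f"  SELECT {value_column}, '{value_column}' AS {literal_alias}\n"
        f"  FROM {table_name}\n"
        ")\n"
        f"SELECT {function_name}({value_column}, {literal_alias}) AS {value_column}_token\n"
        "FROM prepared"
    )


query = build_tokenize_query("skyflow_tokenize_column", "users", "email")
print(query)
```

The resulting string can be passed to `spark.sql(...)` in the same way as the other f-string queries in this notebook.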

### Next Steps

1. **Monitor Performance:** Check Lambda CloudWatch logs for batch sizes and timing
2. **Tune Batch Size:** Adjust BATCH_SIZE based on your data size, network latency, and Lambda timeout
3. **Set Up Governance:** Grant appropriate permissions to users/groups
4. **Create Views:** Build persistent views for your BI tools and analysts