# Skyflow Tokenization & Detokenization for Databricks

This notebook demonstrates how to tokenize and detokenize sensitive data in Databricks using Pandas UDFs that call the Skyflow Lambda API.

## Setup Instructions
1. Update the configuration values in the next cell
2. Run all cells to register both UDFs
3. The notebook includes a complete example:
   - Generate test data with sensitive information
   - Tokenize sensitive columns
   - Detokenize tokens to verify roundtrip

## Features
- Batch processing (up to 10,000 values per API call by default)
- NULL values are skipped in API calls and returned as NULL
- Configurable batch size (`BATCH_SIZE`); the request timeout is fixed at 10 seconds in the UDF code
- Efficient distributed processing with Pandas UDF
- Complete tokenization → detokenization roundtrip example
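
The batching behaviour listed above can be sketched in plain Python. This is a minimal illustration rather than the UDF itself: `batch_ranges` is a hypothetical helper name, and it only mirrors the `range(0, len(values), BATCH_SIZE)` loop that the UDFs in this notebook use.

```python
from typing import Iterator, Tuple


def batch_ranges(n_values: int, batch_size: int = 10_000) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs covering n_values items in batch_size chunks."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_values, batch_size):
        yield start, min(start + batch_size, n_values)


# 25,000 values with the default batch size need three API calls.
ranges = list(batch_ranges(25_000))
print(ranges)       # [(0, 10000), (10000, 20000), (20000, 25000)]
print(len(ranges))  # 3
```

Spark passes data to a pandas UDF one Arrow batch at a time (`spark.sql.execution.arrow.maxRecordsPerBatch`, 10,000 rows by default), so with default settings each UDF invocation usually results in a single API call.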

In [None]:
# Configuration - Update these values for your environment
LAMBDA_URL  = "LAMBDA_URL/processDatabricks"
CLUSTER_ID  = "CLUSTER_ID"
VAULT_ID    = "VAULT_ID"
TABLE       = "TABLE_NAME"
COLUMN_NAME = "COLUMN_NAME"
BATCH_SIZE  = 10000
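
Because the configuration cell ships with placeholder strings, it can help to check that they were replaced before registering the UDFs. The following is a rough sketch under that assumption; `unset_config_keys` and `PLACEHOLDERS` are hypothetical names, not part of the notebook or of any Skyflow API.

```python
from typing import Dict, List

# Placeholder strings as they appear in the configuration cell.
PLACEHOLDERS = {"LAMBDA_URL", "CLUSTER_ID", "VAULT_ID", "TABLE_NAME", "COLUMN_NAME"}


def unset_config_keys(config: Dict[str, str]) -> List[str]:
    """Return the names of settings that are empty or still hold a placeholder."""
    unset = []
    for name, value in config.items():
        text = str(value).strip()
        if not text or any(text == p or text.startswith(p + "/") for p in PLACEHOLDERS):
            unset.append(name)
    return unset


example = {
    "LAMBDA_URL": "LAMBDA_URL/processDatabricks",  # still a placeholder
    "VAULT_ID": "v1234",                           # made-up value for illustration
}
print(unset_config_keys(example))  # ['LAMBDA_URL']
```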

In [None]:
import pandas as pd
import requests
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def skyflow_tokenize(values: pd.Series) -> pd.Series:
    """
    Tokenize a column of sensitive data using the Lambda API.

    Args:
        values: Pandas Series containing plaintext values to tokenize (may include NULLs)

    Returns:
        Pandas Series containing Skyflow tokens
    """
    results = [None] * len(values)

    # Process in batches to avoid overwhelming the API
    for start in range(0, len(values), BATCH_SIZE):
        end = min(start + BATCH_SIZE, len(values))
        batch = values.iloc[start:end]

        # Filter out NULL values (None or NaN) and build records
        records = []
        indices = []
        for i, val in zip(range(start, end), batch):
            if pd.notna(val):
                records.append({COLUMN_NAME: val})
                indices.append(i)

        if not records:
            continue

        # Call Lambda API
        resp = requests.post(
            LAMBDA_URL,
            json={"records": records},
            headers={
                "Content-Type": "application/json",
                "X-Skyflow-Operation": "tokenize",
                "X-Skyflow-Cluster-ID": CLUSTER_ID,
                "X-Skyflow-Vault-ID": VAULT_ID,
                "X-Skyflow-Table": TABLE
            },
            timeout=10
        )
        resp.raise_for_status()

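        # Parse response; records are matched to inputs by position
        data = resp.json().get("data", [])
        if len(data) != len(records):
            raise ValueError(
                f"Expected {len(records)} records in response, got {len(data)}"
            )

        # Fill results for this batch
        for result_index, record in zip(indices, data):
            results[result_index] = record.get(COLUMN_NAME)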

    return pd.Series(results)
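
The UDF above maps response records back to rows by position, which assumes the API returns one record per submitted value, in the same order. The sketch below isolates that step in plain Python so it can be tried without Spark or network access. `scatter_tokens` is a hypothetical helper, and the response contents are made up for illustration.

```python
from typing import Any, Dict, List, Optional


def scatter_tokens(
    values: List[Optional[str]],
    response_data: List[Dict[str, Any]],
    column_name: str,
) -> List[Optional[str]]:
    """Place tokens back at the positions of the non-NULL input values."""
    indices = [i for i, v in enumerate(values) if v is not None]
    if len(response_data) != len(indices):
        raise ValueError(
            f"expected {len(indices)} records in the response, got {len(response_data)}"
        )
    results: List[Optional[str]] = [None] * len(values)
    for idx, record in zip(indices, response_data):
        results[idx] = record.get(column_name)
    return results


# Made-up response for two non-NULL inputs.
values = ["a@example.com", None, "b@example.com"]
fake_response = [{"email": "tok_1"}, {"email": "tok_2"}]
print(scatter_tokens(values, fake_response, "email"))  # ['tok_1', None, 'tok_2']
```

Raising on a length mismatch is a design choice in this sketch: a short response would otherwise leave some rows silently untokenized.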

In [None]:
# Register the UDF for SQL use
spark.udf.register("skyflow_tokenize", skyflow_tokenize)

print("✅ Skyflow tokenization UDF registered successfully!")
print(f"   Lambda URL: {LAMBDA_URL}")
print(f"   Cluster ID: {CLUSTER_ID}")
print(f"   Vault ID: {VAULT_ID}")
print(f"   Table: {TABLE}")
print(f"   Column: {COLUMN_NAME}")
print(f"   Batch Size: {BATCH_SIZE}")
print("\nUsage:")
print("  SELECT skyflow_tokenize(sensitive_column) as token FROM my_table")

## Generate Test Data

Run the next cell to create a sample `raw_users` table for testing (an existing table with that name is overwritten):

In [None]:
from pyspark.sql.functions import expr, current_timestamp

# Configure number of test rows
NUM_ROWS = 100

# Create test data with auto-generated rows
test_df = spark.range(NUM_ROWS).select(
    expr("id + 1").alias("user_id"),
    expr("concat('user_', id)").alias("username"),
    expr("concat('user', id, '@example.com')").alias("email"),
    expr("concat('555-01-', LPAD(id % 10000, 4, '0'))").alias("ssn"),
    expr("concat('+1-555-', LPAD(id % 1000, 3, '0'), '-', LPAD((id * 7) % 10000, 4, '0'))").alias("phone"),
    current_timestamp().alias("created_at")
)

# Save as table
test_df.write.mode("overwrite").saveAsTable("raw_users")

print(f"✅ Created raw_users table with {NUM_ROWS} rows")
display(spark.table("raw_users").limit(10))
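
To see what the generated rows look like without running Spark, the column expressions above can be mirrored in plain Python. This is an illustrative sketch only: `preview_row` is a hypothetical helper, and it omits `created_at`.

```python
def preview_row(row_id: int) -> dict:
    """Mirror the Spark column expressions for a single spark.range id."""
    return {
        "user_id": row_id + 1,
        "username": f"user_{row_id}",
        "email": f"user{row_id}@example.com",
        "ssn": f"555-01-{row_id % 10000:04d}",
        "phone": f"+1-555-{row_id % 1000:03d}-{(row_id * 7) % 10000:04d}",
    }


print(preview_row(7))
# {'user_id': 8, 'username': 'user_7', 'email': 'user7@example.com',
#  'ssn': '555-01-0007', 'phone': '+1-555-007-0049'}
```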

## Tokenize Single Column

In [None]:
from pyspark.sql.functions import col

# Tokenize email column only
df = spark.table("raw_users")
df_tokenized = df.withColumn("email_token", skyflow_tokenize(col("email")))

# Display results
display(df_tokenized.select("user_id", "username", "email", "email_token", "phone").limit(10))

## Create Tokenized Table

Save tokenized data to a new table. Only `email` is tokenized here; `phone` is kept as plaintext and `ssn` is left out of the new table:

In [None]:
# Create a new table with tokenized sensitive columns
tokenized_df = spark.table("raw_users").select(
    col("user_id"),
    col("username"),
    skyflow_tokenize(col("email")).alias("email_token"),
    col("phone"),  # Kept as plaintext in this example
    col("created_at")
)

# Save tokenized data
tokenized_df.write.mode("overwrite").saveAsTable("tokenized_users")

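# Count from the saved table; calling count() on tokenized_df would re-run the UDF and repeat the API calls
print(f"✅ Created tokenized_users table with {spark.table('tokenized_users').count()} rows")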
display(spark.table("tokenized_users").limit(10))

---

## Detokenization

Now let's register the detokenization UDF to convert tokens back to plaintext:

In [None]:
@pandas_udf("string")
def skyflow_detokenize(tokens: pd.Series) -> pd.Series:
    """
    Detokenize a column of Skyflow tokens using the Lambda API.

    Args:
        tokens: Pandas Series containing Skyflow tokens (may include NULLs)

    Returns:
        Pandas Series containing decrypted values
    """
    results = [None] * len(tokens)

    # Process in batches to avoid overwhelming the API
    for start in range(0, len(tokens), BATCH_SIZE):
        end = min(start + BATCH_SIZE, len(tokens))
        batch = tokens.iloc[start:end].tolist()

        # Filter out NULL values (None or NaN)
        non_null = [t for t in batch if pd.notna(t)]
        if not non_null:
            continue

        # Call Lambda API
        resp = requests.post(
            LAMBDA_URL,
            json={"tokens": non_null},
            headers={
                "Content-Type": "application/json",
                "X-Skyflow-Operation": "detokenize",
                "X-Skyflow-Cluster-ID": CLUSTER_ID,
                "X-Skyflow-Vault-ID": VAULT_ID
            },
            timeout=10
        )
        resp.raise_for_status()

        # Parse response and map tokens to values
        data = resp.json().get("data", [])
        token_to_value = {r["token"]: r["value"] for r in data}

        # Fill results for this batch
        for i in range(start, end):
            tok = tokens.iloc[i]
            results[i] = None if tok is None else token_to_value.get(tok)

    return pd.Series(results)

# Register the UDF for SQL use
spark.udf.register("skyflow_detokenize", skyflow_detokenize)

print("✅ Skyflow detokenization UDF registered successfully!")
print("\nUsage:")
print("  SELECT skyflow_detokenize(token_column) as value FROM my_table")
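
Unlike tokenization, the detokenize UDF matches results by token value rather than by position, so duplicate tokens resolve to the same value and tokens missing from the response come back as NULL. A minimal plain-Python sketch of that lookup follows. `map_tokens` is a hypothetical helper, and the tokens and values are made up.

```python
from typing import Any, Dict, List, Optional


def map_tokens(
    tokens: List[Optional[str]],
    response_data: List[Dict[str, Any]],
) -> List[Optional[str]]:
    """Look up each token in the response; NULLs and unknown tokens give None."""
    token_to_value = {r["token"]: r["value"] for r in response_data}
    return [None if t is None else token_to_value.get(t) for t in tokens]


# Made-up tokens and values, for illustration only.
tokens = ["tok_1", None, "tok_2", "tok_1", "tok_missing"]
fake_response = [
    {"token": "tok_1", "value": "a@example.com"},
    {"token": "tok_2", "value": "b@example.com"},
]
print(map_tokens(tokens, fake_response))
# ['a@example.com', None, 'b@example.com', 'a@example.com', None]
```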

## Detokenize the Tokenized Table

Now let's detokenize the tokens we created to verify the roundtrip:

In [None]:
# Detokenize the tokenized_users table
df_tokens = spark.table("tokenized_users")
df_decrypted = df_tokens.withColumn("email_decrypted", skyflow_detokenize(col("email_token")))

# Display: token vs decrypted value
display(df_decrypted.select("user_id", "username", "email_token", "email_decrypted").limit(10))

print("\n✅ Roundtrip complete! Tokens → Decrypted values")
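
The cell above shows tokens next to their decrypted values but does not compare them with the originals in `raw_users`. One way to make the check explicit is sketched here in plain Python on already-collected values. `roundtrip_mismatches` is a hypothetical helper, and in the notebook the two sequences would first need to be aligned, for example by joining on `user_id`.

```python
from typing import List, Optional, Sequence


def roundtrip_mismatches(
    originals: Sequence[Optional[str]],
    decrypted: Sequence[Optional[str]],
) -> List[int]:
    """Return the positions where the decrypted value differs from the original."""
    if len(originals) != len(decrypted):
        raise ValueError("both sequences must have the same length")
    return [i for i, (o, d) in enumerate(zip(originals, decrypted)) if o != d]


# Made-up data with one deliberate mismatch at position 1.
originals = ["user0@example.com", "user1@example.com", None]
decrypted = ["user0@example.com", "userX@example.com", None]
print(roundtrip_mismatches(originals, decrypted))  # [1]
```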

## Create Detokenized View

Create a temporary view that detokenizes `email_token` on the fly whenever it is queried. Note that the view itself does not restrict who can query it:

**Note:** The view must be temporary because `spark.udf.register` creates temporary, session-scoped functions, and a permanent view cannot reference a temporary function. The view exists only for the duration of your Spark session.

In [None]:
# Create a temporary view that detokenizes email tokens on-the-fly
spark.sql("""
    CREATE OR REPLACE TEMP VIEW users_decrypted AS
    SELECT 
        user_id,
        username,
        skyflow_detokenize(email_token) as email,
        phone,
        created_at
    FROM tokenized_users
""")

print("✅ Created users_decrypted temporary view")
print("\nQuery the view to see decrypted emails:")
display(spark.sql("SELECT * FROM users_decrypted LIMIT 10"))

---

## SQL Usage Examples

Both UDFs can be used directly in SQL queries:

### Tokenize with SQL

In [None]:
# Example 1: Tokenize a literal value
result = spark.sql("SELECT skyflow_tokenize('test@example.com') as token")
display(result)

# Example 2: Tokenize multiple rows
result = spark.sql("""
    SELECT 
        user_id,
        email,
        skyflow_tokenize(email) as email_token
    FROM raw_users
    WHERE user_id <= 5
""")
display(result)

# Example 3: Create a table of tokenized data directly
spark.sql("""
    CREATE OR REPLACE TABLE secure_contacts AS
    SELECT 
        user_id,
        username,
        skyflow_tokenize(email) as email_token,
        phone
    FROM raw_users
""")
print("✅ Created secure_contacts table with tokenized emails")

### Detokenize with SQL

In [None]:
# Example 1: Detokenize from the view
result = spark.sql("""
    SELECT 
        user_id,
        username,
        email
    FROM users_decrypted
    WHERE user_id <= 5
""")
display(result)

# Example 2: Detokenize with filtering
result = spark.sql("""
    SELECT 
        user_id,
        skyflow_detokenize(email_token) as email
    FROM tokenized_users
    WHERE phone LIKE '+1-555-000%'
    LIMIT 10
""")
display(result)

# Example 3: Join tokenized and raw data
result = spark.sql("""
    SELECT 
        t.user_id,
        t.username,
        skyflow_detokenize(t.email_token) as decrypted_email,
        r.email as original_email,
        CASE 
            WHEN skyflow_detokenize(t.email_token) = r.email THEN '✅ Match'
            ELSE '❌ Mismatch'
        END as verification
    FROM tokenized_users t
    JOIN raw_users r ON t.user_id = r.user_id
    LIMIT 10
""")
display(result)
print("\n✅ All SQL examples complete!")

### Query the Detokenized View

The view can be queried like a normal table; detokenization happens transparently, with the UDF calling the API each time the view is read:

In [None]:
# Example 1: Simple SELECT
result = spark.sql("SELECT * FROM users_decrypted LIMIT 5")
display(result)

# Example 2: Filter by email domain (the filter runs on decrypted values, so rows are detokenized before filtering)
result = spark.sql("""
    SELECT 
        user_id,
        username,
        email
    FROM users_decrypted
    WHERE email LIKE '%@example.com'
    LIMIT 10
""")
display(result)

# Example 3: Aggregate queries
result = spark.sql("""
    SELECT 
        SUBSTRING(email, LOCATE('@', email) + 1) as domain,
        COUNT(*) as user_count
    FROM users_decrypted
    GROUP BY domain
""")
display(result)

# Example 4: Use in Python DataFrame API
df_decrypted = spark.table("users_decrypted")
result = df_decrypted.filter(col("user_id") <= 10).select("user_id", "email")
display(result)

print("\n✅ View queries complete! The view provides transparent access to decrypted data.")