In [0]:
%pip install -q -r ../../requirements.txt

# Data Encryption in Databricks

This notebook provides in-depth demonstrations of data encryption techniques available in Databricks.

## Topics Covered:

1. **Server-side Encryption for Cloud Storage Services** - Automatic encryption at rest
2. **AES Encryption/Decryption** - Column-level encryption using AES algorithms
3. **Format-Preserving Encryption (FPE)** - Maintain data format while encrypting
4. **Envelope Encryption with Unity Catalog** - Multi-layer encryption approach
5. **Databricks Multi-key Protection** - Customer-managed + Databricks-managed keys

---

### Why Encryption?

Encryption is a critical component of data security that:
- **Protects data at rest** in cloud storage
- **Protects data in transit** between systems
- **Ensures compliance** with regulations (GDPR, HIPAA, PCI-DSS)
- **Prevents unauthorized access** even if storage is compromised
- **Provides customer control** with customer-managed keys (CMK)

---


In [None]:
# Setup: Import libraries and configure environment
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Configuration
CATALOG = "main"
ENCRYPTION_SCHEMA = "encryption_demo"

# Create schema for encryption demonstrations
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{ENCRYPTION_SCHEMA}")

print("✓ Environment setup complete")
print(f"→ Using catalog: {CATALOG}")
print(f"→ Using schema: {ENCRYPTION_SCHEMA}")


---

## 1. Server-side Encryption for Cloud Storage

**What is Server-side Encryption?**

Server-side encryption (SSE) automatically encrypts data when it's written to cloud storage and decrypts it when accessed. This is managed by the cloud provider and requires no application changes.

**Cloud Provider Options:**

### AWS S3
- **SSE-S3:** Amazon S3-managed keys (AES-256)
- **SSE-KMS:** AWS Key Management Service (customer-managed keys)
- **SSE-C:** Customer-provided encryption keys

### Azure Blob Storage
- **Microsoft-managed keys:** Automatic encryption with Azure-managed keys
- **Customer-managed keys:** Use Azure Key Vault for key management

### Google Cloud Storage
- **Google-managed keys:** Default encryption at rest
- **Customer-managed encryption keys (CMEK):** Use Cloud KMS

**Benefits:**
- ✓ Transparent to applications
- ✓ Negligible performance overhead for most workloads
- ✓ Enabled by default in most cloud providers
- ✓ Compliant with security standards

**Databricks Integration:**
- Data stored in Delta Lake is encrypted at rest by default through the cloud provider's server-side encryption
- Additional encryption layers can be configured via workspace settings
- Customer-managed keys provide additional control


In [None]:
# Server-side Encryption is automatic in Databricks
# This cell demonstrates that your data is already encrypted at rest

print("Server-side Encryption Status")
print("=" * 60)

# Create a sample table
spark.sql(f"""
CREATE OR REPLACE TABLE {CATALOG}.{ENCRYPTION_SCHEMA}.sensitive_data (
    id INT,
    credit_card STRING,
    customer_name STRING,
    transaction_amount DECIMAL(10, 2)
)
""")

# Insert sample data
spark.sql(f"""
INSERT INTO {CATALOG}.{ENCRYPTION_SCHEMA}.sensitive_data VALUES
    (1, '4532-1111-2222-3333', 'John Doe', 1250.00),
    (2, '5555-4444-3333-2222', 'Jane Smith', 3400.50),
    (3, '3782-123456-78901', 'Bob Johnson', 890.25)
""")

print("\n✓ Table created with sensitive data")
print("→ Data is automatically encrypted at rest by your cloud provider")
print("→ Encryption happens transparently without application changes")
print("→ Data is decrypted automatically when read by authorized users\n")

# Show the table (data appears unencrypted because we have access)
display(spark.sql(f"SELECT * FROM {CATALOG}.{ENCRYPTION_SCHEMA}.sensitive_data"))

print("\n📝 Note: While you see unencrypted data (because you're authorized),")
print("   the actual storage files are encrypted using AES-256 encryption.")
print("   Without proper credentials, the underlying files are unreadable.")


---

## 2. AES Encryption and Decryption

**What is AES Encryption?**

AES (Advanced Encryption Standard) is a symmetric block cipher: the same key encrypts and decrypts. Applying it to individual columns adds field-level protection on top of storage-level encryption, so specific sensitive fields stay encrypted even for users who can read the table.

**AES Key Sizes:**
- **AES-128:** 16-byte key (sufficient for most use cases)
- **AES-192:** 24-byte key (higher security)
- **AES-256:** 32-byte key (maximum security)

**Modes of Operation:**
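- **ECB (Electronic Codebook):** Simplest mode; deterministic, so identical plaintexts produce identical ciphertexts and patterns leak. Avoid it for real sensitive data
- **CBC (Cipher Block Chaining):** Chains blocks using an initialization vector (IV), which hides repeated plaintexts
- **GCM (Galois/Counter Mode):** Authenticated encryption that also detects tampering; the default mode of `AES_ENCRYPT`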

**Padding Schemes:**
- **PKCS:** Standard padding for block ciphers
- **NONE:** No padding (data must be block-aligned)

**SQL Functions:**
- `AES_ENCRYPT(input, key [, mode [, padding]])` - Encrypts data and returns BINARY
- `AES_DECRYPT(input, key [, mode [, padding]])` - Decrypts data and returns BINARY (cast to STRING to read text)
- `BASE64(binary)` - Encodes binary to Base64 string
- `UNBASE64(string)` - Decodes Base64 string to binary

**Use Cases:**
- Encrypt credit card numbers, SSNs, and other PII
- Store encrypted data while allowing queries on other columns
- Implement field-level encryption for compliance
- Control who can decrypt data via key management
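
The key sizes listed above can be checked before a key ever reaches `AES_ENCRYPT`. The following is a minimal sketch, not part of the original notebook: the helper names `generate_aes_key` and `is_valid_aes_key` are hypothetical, and the SQL fragment assumes the key is passed as binary via `unhex` with the GCM mode and `DEFAULT` padding. A real deployment would fetch the key from a secret store rather than generate it inline.

```python
import secrets

# Key length in bytes for each AES variant described above
AES_KEY_BYTES = {"AES-128": 16, "AES-192": 24, "AES-256": 32}


def generate_aes_key(variant: str = "AES-256") -> bytes:
    """Return a cryptographically random key of the right length (hypothetical helper)."""
    return secrets.token_bytes(AES_KEY_BYTES[variant])


def is_valid_aes_key(key: bytes) -> bool:
    """AES accepts only 16-, 24-, or 32-byte keys."""
    return len(key) in AES_KEY_BYTES.values()


key = generate_aes_key("AES-128")
key_hex = key.hex()  # hex form is convenient for passing binary keys into SQL

# Illustrative SQL fragment only; it is built as a string and not executed here
sql_expr = f"base64(aes_encrypt(credit_card, unhex('{key_hex}'), 'GCM', 'DEFAULT'))"

print(f"Key length: {len(key)} bytes, valid: {is_valid_aes_key(key)}")
```

Random bytes are preferable to a typed passphrase such as the demo key used in this notebook, because a human-readable string carries far less entropy than its length suggests.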


In [None]:
# AES-128 Encryption Example

print("AES-128 Encryption Demonstration")
print("=" * 60)

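# Define encryption key (16 bytes for AES-128)
# ⚠️ In production, NEVER hardcode keys - use a secure key management service
# ⚠️ ECB is used below for readability only: identical plaintexts yield identical ciphertexts. Prefer GCM for real data.
encryption_key_128 = "SecureKey1234567"  # 16 ASCII characters = 16 bytes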

print(f"\n→ Encryption Key Length: {len(encryption_key_128)} bytes (AES-128)")
print("→ Algorithm: AES")
print("→ Mode: ECB")
print("→ Padding: PKCS\n")

# Create a table with encrypted credit card numbers
spark.sql(f"""
CREATE OR REPLACE TABLE {CATALOG}.{ENCRYPTION_SCHEMA}.payments_encrypted AS
SELECT 
    id,
    customer_name,
    base64(aes_encrypt(credit_card, '{encryption_key_128}', 'ECB', 'PKCS')) AS credit_card_encrypted,
    transaction_amount
FROM 
    {CATALOG}.{ENCRYPTION_SCHEMA}.sensitive_data
""")

print("✓ Table created with encrypted credit card numbers\n")

# Display the encrypted data
print("Encrypted Data (credit card numbers are encrypted):")
display(spark.sql(f"SELECT * FROM {CATALOG}.{ENCRYPTION_SCHEMA}.payments_encrypted"))

print("\n📝 Note: Credit card numbers are now stored in encrypted form.")
print("   Only users with the encryption key can decrypt them.")


In [None]:
# AES Decryption Example

print("AES-128 Decryption Demonstration")
print("=" * 60)

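# Create a view that decrypts the credit card numbers
# ⚠️ The key is interpolated into the view definition, so anyone who can read the view's DDL can read the key.
#    In production, resolve the key from a secret store at query time instead.
print("\nCreating a view that decrypts credit card numbers for authorized users...\n")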

spark.sql(f"""
CREATE OR REPLACE VIEW {CATALOG}.{ENCRYPTION_SCHEMA}.v_payments_decrypted AS
SELECT 
    id,
    customer_name,
    CASE 
        WHEN IS_ACCOUNT_GROUP_MEMBER('payment_processors') THEN 
            -- aes_decrypt returns BINARY; cast so both CASE branches are STRING
            CAST(aes_decrypt(unbase64(credit_card_encrypted), '{encryption_key_128}', 'ECB', 'PKCS') AS STRING)
        ELSE '****-****-****-****'
    END AS credit_card,
    transaction_amount
FROM 
    {CATALOG}.{ENCRYPTION_SCHEMA}.payments_encrypted
""")

print("✓ View created with conditional decryption\n")

# Display the decrypted data (for authorized users)
print("Decrypted Data (for demonstration - in production, only authorized users would see this):")
decrypted_df = spark.sql(f"""
SELECT 
    id,
    customer_name,
    CAST(aes_decrypt(unbase64(credit_card_encrypted), '{encryption_key_128}', 'ECB', 'PKCS') AS STRING) AS credit_card,
    transaction_amount
FROM 
    {CATALOG}.{ENCRYPTION_SCHEMA}.payments_encrypted
""")

display(decrypted_df)

print("\n📝 Key Points:")
print("   • Encryption protects data even if someone gains database access")
print("   • Decryption requires the encryption key")
print("   • Key management is critical - use Azure Key Vault, AWS KMS, or GCP KMS")
print("   • Combine with RBAC to control who can decrypt specific columns")


---

## 3. Format-Preserving Encryption (FPE)

**What is Format-Preserving Encryption?**

FPE encrypts data while maintaining its original format. For example, a 16-digit credit card number remains a 16-digit number after encryption, making it compatible with existing systems that expect specific data formats.

**Benefits:**
- ✓ Maintains data format constraints (length, character set)
- ✓ Compatible with legacy systems and applications
- ✓ No schema changes required
- ✓ Useful for credit cards, phone numbers, IDs

**Implementation Approaches:**

1. **FF1/FF3 Algorithms:** FPE modes specified in NIST SP 800-38G (FF3 was later revised to FF3-1 after cryptanalysis)
2. **Custom UDFs:** Implement FPE logic via external libraries
3. **Third-party Services:** Use specialized FPE vendors

**Limitations:**
- More complex than standard encryption
- May require external libraries or services
- Performance overhead compared to standard AES

**Use Cases:**
- PCI-DSS compliance for credit card masking
- Phone number protection while maintaining format
- Legacy system integration
- Testing with realistic but protected data
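
To make the idea concrete, here is a minimal sketch of a format-preserving cipher built as a balanced Feistel network over digit strings, with HMAC-SHA256 as the round function. It is a toy under stated assumptions: it is not FF1 or FF3-1, it only accepts even-length ASCII digit strings, and the function names (`toy_fpe_encrypt`, `toy_fpe_decrypt`) and the round count are hypothetical choices made for illustration. It does show the two properties that matter here: the output has the same length and character set as the input, and encryption is reversible with the key.

```python
import hashlib
import hmac

ROUNDS = 10  # illustrative choice, not taken from any standard


def _round_value(key: bytes, round_no: int, half: str, modulus: int) -> int:
    """Keyed pseudo-random value derived from one half of the digit string."""
    msg = bytes([round_no]) + half.encode("ascii")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def _split(digits: str):
    if not (digits.isascii() and digits.isdigit()) or len(digits) % 2:
        raise ValueError("expected an even-length string of ASCII digits")
    n = len(digits) // 2
    return digits[:n], digits[n:], n, 10 ** n


def toy_fpe_encrypt(digits: str, key: bytes) -> str:
    left, right, n, modulus = _split(digits)
    for r in range(ROUNDS):
        # Feistel step: (L, R) -> (R, L + F(R) mod 10^n)
        mixed = (int(left) + _round_value(key, r, right, modulus)) % modulus
        left, right = right, str(mixed).zfill(n)
    return left + right


def toy_fpe_decrypt(digits: str, key: bytes) -> str:
    left, right, n, modulus = _split(digits)
    for r in reversed(range(ROUNDS)):
        # Inverse step: (L', R') -> (R' - F(L') mod 10^n, L')
        prev_left = (int(right) - _round_value(key, r, left, modulus)) % modulus
        left, right = str(prev_left).zfill(n), left
    return left + right


key = b"demo-key-not-for-production"
token = toy_fpe_encrypt("4532111122223333", key)
restored = toy_fpe_decrypt(token, key)
print(token, restored)
```

Separators such as dashes would need to be stripped before encryption and reinserted afterwards. For production data, a vetted FF1 implementation or a commercial service remains the appropriate choice.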


In [None]:
# Format-Preserving Encryption Simulation
# NOTE: True FPE requires specialized libraries (e.g., pyffx, ff3)
# This demonstrates the concept using a simplified approach

print("Format-Preserving Encryption (Simulated)")
print("=" * 60)

print("\n⚠️  Note: This is a simplified simulation for demonstration purposes.")
print("   Production FPE should use NIST-approved FF1/FF3 algorithms.")
print("   Consider using libraries like pyffx or services like Protegrity.\n")

# Create a simplified FPE function for credit card numbers
spark.sql(f"""
CREATE OR REPLACE FUNCTION {CATALOG}.{ENCRYPTION_SCHEMA}.fpe_encrypt_cc(cc STRING, key STRING)
RETURNS STRING
RETURN 
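    -- Simulate FPE by shifting each 4-digit group by a fixed offset, which preserves the format
    -- NOT real encryption: the `key` argument is unused and the first group stays in clear text
    -- In production, use proper FF1/FF3 algorithms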
    CONCAT(
        SUBSTRING(cc, 1, 4), '-',
        LPAD(CAST(MOD(CAST(SUBSTRING(cc, 6, 4) AS INT) + 1234, 10000) AS STRING), 4, '0'), '-',
        LPAD(CAST(MOD(CAST(SUBSTRING(cc, 11, 4) AS INT) + 5678, 10000) AS STRING), 4, '0'), '-',
        LPAD(CAST(MOD(CAST(SUBSTRING(cc, 16, 4) AS INT) + 9012, 10000) AS STRING), 4, '0')
    )
""")

print("✓ Simulated FPE function created\n")

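# Apply FPE to credit card numbers
# The simulated function assumes the XXXX-XXXX-XXXX-XXXX layout, so rows in other
# layouts (such as the 15-digit card in the sample data) are filtered out.
fpe_df = spark.sql(f"""
SELECT 
    id,
    customer_name,
    credit_card AS original_cc,
    {CATALOG}.{ENCRYPTION_SCHEMA}.fpe_encrypt_cc(credit_card, 'key') AS encrypted_cc,
    transaction_amount
FROM 
    {CATALOG}.{ENCRYPTION_SCHEMA}.sensitive_data
WHERE
    credit_card RLIKE '^[0-9]{{4}}(-[0-9]{{4}}){{3}}$'
""")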

print("Original vs Format-Preserving Encrypted Credit Cards:")
display(fpe_df)

print("\n📝 Key Observations:")
print("   • Format is preserved (XXXX-XXXX-XXXX-XXXX)")
print("   • Length remains the same")
print("   • Compatible with systems expecting 16-digit cards")
print("   • In production, use proper FPE libraries for security")
print("\n🔗 Production FPE Options:")
print("   • pyffx library (Python)")
print("   • Protegrity or Voltage SecureData (Commercial)")
print("   • AWS Payment Cryptography (AWS-specific)")


---

## 4. Envelope Encryption with Unity Catalog

**What is Envelope Encryption?**

Envelope encryption is a multi-layer encryption approach where data is encrypted with a Data Encryption Key (DEK), and the DEK is then encrypted with a Key Encryption Key (KEK). This provides enhanced security through key separation.

**How It Works:**
- Step 1: Data is encrypted with a DEK
- Step 2: The DEK is encrypted with a KEK
- Step 3: Only the encrypted DEK is stored with the data
- Step 4: The KEK is managed by a key management service

**Benefits:**
- Enhanced security through key separation
- Efficient key rotation (only re-encrypt DEKs, not data)
- Centralized key management
- Compliance with regulatory requirements
- Protection even if encrypted data is compromised

**Unity Catalog Implementation:**

Databricks Unity Catalog supports envelope encryption through:
- Automatic DEK generation for each table
- KEK management via the cloud provider's KMS
- Transparent encryption/decryption at runtime
- Key rotation without data re-encryption

**Architecture Components:**
- Each Delta table has its own DEK
- DEKs are encrypted with workspace KEK
- KEK stored in cloud KMS (AWS KMS, Azure Key Vault, GCP Cloud KMS)
- Transparent access for authorized users
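
The DEK/KEK flow described above can be sketched in a few lines of standard-library Python. This is an illustration under loud assumptions: the helpers `seal` and `unseal` are hypothetical, and they use an HMAC-SHA256 keystream as a stand-in for AES because the standard library ships no AES implementation. The stand-in is unauthenticated and not suitable for real data; only the key hierarchy is the point here.

```python
import hashlib
import hmac
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with an HMAC-SHA256 keystream (toy stand-in for AES)."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256)
        stream.extend(block.digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def seal(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt with a fresh random nonce, which is stored in front of the ciphertext."""
    nonce = secrets.token_bytes(16)
    return nonce + _keystream_xor(key, nonce, plaintext)


def unseal(key: bytes, blob: bytes) -> bytes:
    nonce, body = blob[:16], blob[16:]
    return _keystream_xor(key, nonce, body)


kek = secrets.token_bytes(32)  # in practice this lives in a KMS and never leaves it
dek = secrets.token_bytes(32)  # random data key

record = b"P001|Type 2 Diabetes|Metformin"
encrypted_data = seal(dek, record)  # Step 1: data encrypted with the DEK
wrapped_dek = seal(kek, dek)        # Step 2: DEK encrypted with the KEK

# Read path: unwrap the DEK with the KEK, then decrypt the data
recovered = unseal(unseal(kek, wrapped_dek), encrypted_data)

# Key rotation: only the small wrapped DEK is re-encrypted; the data is untouched
new_kek = secrets.token_bytes(32)
rewrapped_dek = seal(new_kek, unseal(kek, wrapped_dek))
recovered_after_rotation = unseal(unseal(new_kek, rewrapped_dek), encrypted_data)

print(recovered == record, recovered_after_rotation == record)
```

The rotation step shows why envelope encryption scales: rotating the KEK rewrites a 32-byte key, regardless of how much data the DEK protects.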


In [None]:
# Envelope Encryption Demonstration (Conceptual)

print("Envelope Encryption with Unity Catalog")
print("=" * 60)

print("\n📝 Note: Envelope encryption is automatically handled by Unity Catalog")
print("   when customer-managed keys (CMK) are configured.\n")

print("Encryption Layers:")
print("-" * 60)
print("Layer 1: Data Encryption")
print("  → Each table has a unique Data Encryption Key (DEK)")
print("  → Data is encrypted with DEK using AES-256")
print("  → DEK is generated automatically by Unity Catalog\n")

print("Layer 2: Key Encryption")
print("  → DEK is encrypted with Key Encryption Key (KEK)")
print("  → KEK is stored in your cloud provider's KMS")
print("  → Options: AWS KMS, Azure Key Vault, GCP Cloud KMS\n")

print("Layer 3: Access Control")
print("  → User requests data from Unity Catalog")
print("  → Unity Catalog retrieves encrypted DEK")
print("  → KEK from KMS decrypts the DEK")
print("  → DEK decrypts the data")
print("  → Data returned to authorized user\n")

# Demonstrate the concept with a simplified example
print("Simplified Example:")
print("-" * 60)

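# Simulate DEK and KEK (illustrative placeholders only; they are not used below
# because the platform manages the real keys)
dek = "DataKey123456789"  # 16 bytes; in reality, this is random and unique per table
kek = "MasterKey8765432"  # 16 bytes; in reality, this is in your KMS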

# Create a table (DEK is automatically generated by Unity Catalog)
spark.sql(f"""
CREATE OR REPLACE TABLE {CATALOG}.{ENCRYPTION_SCHEMA}.highly_sensitive_data (
    id INT,
    patient_id STRING,
    diagnosis STRING,
    treatment STRING
)
TBLPROPERTIES (
    'comment' = 'Protected with envelope encryption'
)
""")

spark.sql(f"""
INSERT INTO {CATALOG}.{ENCRYPTION_SCHEMA}.highly_sensitive_data VALUES
    (1, 'P001', 'Type 2 Diabetes', 'Metformin'),
    (2, 'P002', 'Hypertension', 'Lisinopril'),
    (3, 'P003', 'Asthma', 'Albuterol')
""")

print("\n✓ Table created with automatic envelope encryption")
print("→ Unity Catalog automatically:")
print("  1. Generated a unique DEK for this table")
print("  2. Encrypted the table data with the DEK")
print("  3. Encrypted the DEK with your workspace KEK")
print("  4. Stored the encrypted DEK with table metadata\n")

display(spark.sql(f"SELECT * FROM {CATALOG}.{ENCRYPTION_SCHEMA}.highly_sensitive_data"))

print("\n🔒 Security Benefits:")
print("   • Data encrypted at rest with unique keys per table")
print("   • Keys protected by cloud provider KMS")
print("   • Key rotation doesn't require re-encrypting all data")
print("   • Meets compliance requirements (HIPAA, GDPR, etc.)")
print("   • Audit trail for all key access via KMS logs")


In [None]:
# Multi-key Protection Overview

print("Databricks Multi-key Protection")
print("=" * 60)

print("\nKey Management Architecture:")
print("-" * 60)

print("\n1. Databricks-managed Keys (Default)")
print("   ✓ Automatic encryption for all data")
print("   ✓ No configuration required")
print("   ✓ Managed by Databricks")
print("   ✓ Provides baseline security\n")

print("2. Customer-managed Keys (CMK)")
print("   ✓ Keys stored in YOUR cloud KMS")
print("   ✓ YOU control key access and rotation")
print("   ✓ YOU can revoke Databricks access")
print("   ✓ Enhanced compliance and control\n")

print("3. Combined Approach (Recommended)")
print("   ✓ Data encrypted with Databricks DEKs")
print("   ✓ DEKs encrypted with YOUR CMK")
print("   ✓ Best of both worlds\n")

print("=" * 60)
print("Configuration Steps (High-level):")
print("=" * 60)

print("\nFor AWS:")
print("  1. Create CMK in AWS KMS")
print("  2. Grant Databricks IAM role access to CMK")
print("  3. Configure workspace to use CMK")
print("  4. Enable CMK for managed services and storage")

print("\nFor Azure:")
print("  1. Create key in Azure Key Vault")
print("  2. Configure managed identity")
print("  3. Grant Databricks access to Key Vault")
print("  4. Enable CMK in workspace settings")

print("\nFor GCP:")
print("  1. Create encryption key in Cloud KMS")
print("  2. Grant Databricks service account access")
print("  3. Configure CMEK in workspace")
print("  4. Apply to storage buckets and resources")

print("\n" + "=" * 60)
print("Security Benefits:")
print("=" * 60)
print("✓ Complete control over encryption keys")
print("✓ Ability to revoke Databricks' access to your keys")
print("✓ Separate encryption for workspace, notebooks, and jobs")
print("✓ Compliance with regulations requiring CMK")
print("✓ Audit logs in your cloud KMS")
print("✓ Key rotation without service disruption")

print("\n📖 Documentation:")
print("   AWS: https://docs.databricks.com/security/keys/customer-managed-keys-aws.html")
print("   Azure: https://docs.databricks.com/security/keys/customer-managed-keys-azure.html")
print("   GCP: https://docs.databricks.com/security/keys/customer-managed-keys-gcp.html")
