# Privacy-Preserving Record Linkage (PPRL) - Superhero Demo with PySpark

This notebook demonstrates how OpenToken enables privacy-preserving record linkage between two organizations without sharing sensitive patient data.

**Scenario**: Super Hero Hospital and Super Hero Pharmacy want to link patient records for care coordination while protecting patient privacy.

**PySpark Bridge**: This notebook uses the OpenToken PySpark bridge for distributed overlap analysis and advanced transformations.

## 1. Set Up the Environment

In [None]:
import pandas as pd
import os
import subprocess
import json
from pathlib import Path

# Set up paths
demo_dir = Path.cwd()
scripts_dir = demo_dir / 'scripts'
datasets_dir = demo_dir / 'datasets'
outputs_dir = demo_dir / 'outputs'

print(f"Demo directory: {demo_dir}")
print(f"Scripts directory: {scripts_dir}")
print(f"Datasets directory: {datasets_dir}")
print(f"Outputs directory: {outputs_dir}")

# Try to import PySpark components (optional)
try:
    from pyspark.sql import SparkSession
    from opentoken_pyspark.overlap_analyzer import OpenTokenOverlapAnalyzer
    pyspark_available = True
    print("✓ PySpark and OpenToken PySpark Bridge available")
except ImportError:
    pyspark_available = False
    print("⚠ PySpark or the OpenToken PySpark bridge is not available - will use pandas-based analysis instead")
    print("  (Install with: pip install pyspark opentoken-pyspark)")

# Initialize Spark Session if available
spark = None
if pyspark_available:
    try:
        spark = SparkSession.builder \
            .appName("PPRL-Superhero-Demo") \
            .master("local[*]") \
            .config("spark.sql.shuffle.partitions", "4") \
            .getOrCreate()
        print("✓ Spark Session initialized")
    except Exception as e:
        print(f"⚠ Could not initialize Spark: {e}")
        pyspark_available = False

## 2. Generate Superhero Datasets

Create two datasets (hospital and pharmacy) with a 40% overlap. The overlap represents patients who appear in both datasets.
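
As a rough illustration of what a fixed overlap means, the sketch below builds two ID lists that share a set number of entries. It is not the logic of `generate_superhero_datasets.py`: the helper name `make_overlapping_ids`, the ID prefixes, the dataset sizes, and the choice to measure overlap against the smaller dataset are all assumptions made for this example.

```python
import random


def make_overlapping_ids(num_hospital, num_pharmacy, overlap_fraction, seed=0):
    """Return (hospital_ids, pharmacy_ids) that share a fixed number of IDs.

    Hypothetical helper: overlap is taken as a fraction of the smaller
    dataset, which may differ from the real generator script.
    """
    rng = random.Random(seed)
    shared_count = round(min(num_hospital, num_pharmacy) * overlap_fraction)

    # Patients present in both organizations
    shared = [f"P{i:05d}" for i in range(shared_count)]
    # Patients unique to each organization
    hospital_only = [f"H{i:05d}" for i in range(num_hospital - shared_count)]
    pharmacy_only = [f"R{i:05d}" for i in range(num_pharmacy - shared_count)]

    hospital_ids = shared + hospital_only
    pharmacy_ids = shared + pharmacy_only
    # Shuffle so shared patients are not grouped at the top of each file
    rng.shuffle(hospital_ids)
    rng.shuffle(pharmacy_ids)
    return hospital_ids, pharmacy_ids


hospital_ids, pharmacy_ids = make_overlapping_ids(200, 300, 0.40)
common = set(hospital_ids) & set(pharmacy_ids)
print(len(hospital_ids), len(pharmacy_ids), len(common))
```

With these illustrative sizes, 80 of the 200 hospital IDs also appear in the pharmacy list.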

In [None]:
# Run the data generation script
result = subprocess.run(
    ['python', str(scripts_dir / 'generate_superhero_datasets.py')],
    cwd=str(demo_dir),
    capture_output=True,
    text=True
)

print(result.stdout)
if result.returncode != 0:
    print(f"Error: {result.stderr}")
    raise RuntimeError("Dataset generation failed")
else:
    print("✓ Datasets generated successfully!")

### Inspect the Generated Datasets

In [None]:
# Load and display hospital dataset
hospital_df = pd.read_csv(datasets_dir / 'hospital_superhero_data.csv')
print(f"Hospital Dataset: {len(hospital_df)} records")
print(hospital_df.head())
print()

# Load and display pharmacy dataset
pharmacy_df = pd.read_csv(datasets_dir / 'pharmacy_superhero_data.csv')
print(f"Pharmacy Dataset: {len(pharmacy_df)} records")
print(pharmacy_df.head())

## 3. Tokenize the Datasets

Each organization tokenizes its data independently using the OpenToken Java CLI. This applies:
1. HMAC-SHA256 hashing for deterministic tokens
2. AES-256-GCM encryption for secure transmission

**Important**: Both organizations use the same hashing and encryption keys to enable later comparison.
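
The role of the shared hashing key can be sketched with Python's standard `hmac` module. This is only an illustration of the deterministic hashing layer, not OpenToken's implementation: the `hash_token` helper, the signature string, and the key values are hypothetical, and the encryption layer is omitted.

```python
import hashlib
import hmac


def hash_token(signature: str, hashing_key: str) -> str:
    """Return the hex-encoded HMAC-SHA256 of a token signature."""
    return hmac.new(
        hashing_key.encode("utf-8"),
        signature.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()


# Hypothetical signature built from normalized patient attributes
signature = "DOE|JANE|1980-01-01"

hospital_hash = hash_token(signature, "shared-hashing-key")
pharmacy_hash = hash_token(signature, "shared-hashing-key")
other_key_hash = hash_token(signature, "different-hashing-key")

print(hospital_hash == pharmacy_hash)   # same key and input: hashes agree
print(hospital_hash == other_key_hash)  # different key: hashes differ
```

Because the hash depends on the key, two organizations that tokenize with different hashing keys produce hashes that never line up, even for the same patient.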

In [None]:
# Make the tokenization script executable
tokenize_script = scripts_dir / 'tokenize_datasets.sh'
os.chmod(tokenize_script, 0o755)

# Run tokenization
result = subprocess.run(
    ['bash', str(tokenize_script)],
    cwd=str(demo_dir),
    capture_output=True,
    text=True
)

print(result.stdout)
if result.returncode != 0:
    print(f"Error: {result.stderr}")
    raise RuntimeError("Tokenization failed")
else:
    print("✓ Tokenization completed successfully!")

### Inspect Tokenized Data

In [None]:
# Load tokenized hospital data
hospital_tokens = pd.read_csv(outputs_dir / 'hospital_tokens.csv')
print(f"Hospital Tokens: {len(hospital_tokens)} token rows (5 per patient)")
print(hospital_tokens.head(10))
print()

# Load tokenized pharmacy data
pharmacy_tokens = pd.read_csv(outputs_dir / 'pharmacy_tokens.csv')
print(f"Pharmacy Tokens: {len(pharmacy_tokens)} token rows (5 per patient)")
print(pharmacy_tokens.head(10))

### View Metadata

The metadata files contain information about the tokenization process, including record counts and hashes of the secrets used (not the secrets themselves).
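
One use for these secret hashes is checking that both sides tokenized with the same keys without exchanging the keys themselves. The sketch below assumes the secrets are fingerprinted with plain SHA-256 and stored under a field called `HashingSecretHash`; both the algorithm and the field name are assumptions, so check the actual metadata output before relying on them.

```python
import hashlib


def secret_fingerprint(secret: str) -> str:
    """Hex SHA-256 digest of a secret (assumed fingerprint scheme)."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()


def same_secrets(meta_a: dict, meta_b: dict, fields: list) -> bool:
    """True if every listed field is present in both dicts with equal values."""
    return all(
        meta_a.get(field) is not None and meta_a.get(field) == meta_b.get(field)
        for field in fields
    )


# Hypothetical metadata dictionaries; real field names may differ
hospital_meta = {"HashingSecretHash": secret_fingerprint("shared-hashing-key")}
pharmacy_meta = {"HashingSecretHash": secret_fingerprint("shared-hashing-key")}
mismatched_meta = {"HashingSecretHash": secret_fingerprint("some-other-key")}

print(same_secrets(hospital_meta, pharmacy_meta, ["HashingSecretHash"]))
print(same_secrets(hospital_meta, mismatched_meta, ["HashingSecretHash"]))
```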

In [None]:
# Load and display hospital metadata
with open(outputs_dir / 'hospital_tokens.csv.metadata.json') as f:
    hospital_metadata = json.load(f)
    
print("Hospital Metadata:")
print(json.dumps(hospital_metadata, indent=2))
print()

# Load and display pharmacy metadata
with open(outputs_dir / 'pharmacy_tokens.csv.metadata.json') as f:
    pharmacy_metadata = json.load(f)
    
print("Pharmacy Metadata:")
print(json.dumps(pharmacy_metadata, indent=2))

## 4. Decrypt Tokens and Perform Overlap Analysis

To compare tokens across independently tokenized datasets:
1. **Decrypt** the encrypted tokens to reveal the underlying HMAC-SHA256 hashes
2. **Compare** the decrypted hashes to find matching records

**Why decryption is needed**: OpenToken uses random IVs for encryption, so even identical patients produce different encrypted tokens. Decryption reveals the deterministic hash layer that can be compared.
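
The comparison step can be sketched in plain Python. This is a simplified stand-in for what an overlap analysis has to do once the tokens are decrypted, not the code of `OpenTokenOverlapAnalyzer` or `analyze_overlap.py`: the `find_matches` helper is hypothetical, and the placeholder strings stand in for decrypted hashes.

```python
from collections import defaultdict

REQUIRED_TOKEN_IDS = {"T1", "T2", "T3", "T4", "T5"}


def find_matches(hospital_rows, pharmacy_rows, required=REQUIRED_TOKEN_IDS):
    """Return sorted (hospital_id, pharmacy_id) pairs agreeing on all required tokens.

    Each row is a (record_id, token_id, decrypted_hash) tuple.
    """
    # Index pharmacy records by (token_id, hash) for constant-time lookup
    index = defaultdict(list)
    for record_id, token_id, token_hash in pharmacy_rows:
        index[(token_id, token_hash)].append(record_id)

    # For each candidate pair, collect the token IDs whose hashes agree
    agreed = defaultdict(set)
    for hospital_id, token_id, token_hash in hospital_rows:
        for pharmacy_id in index.get((token_id, token_hash), []):
            agreed[(hospital_id, pharmacy_id)].add(token_id)

    # Keep only pairs where every required token ID matched
    return sorted(pair for pair, ids in agreed.items() if required <= ids)


def fake_rows(record_id, person):
    """Build five placeholder token rows for one record."""
    return [(record_id, t, f"{t}:{person}") for t in sorted(REQUIRED_TOKEN_IDS)]


hospital_rows = fake_rows("H1", "alice") + fake_rows("H2", "bob")
pharmacy_rows = fake_rows("P9", "alice") + fake_rows("P7", "carol")
# Partial agreement: H2 and P7 share only T1, so they must not be linked
pharmacy_rows.append(("P7", "T1", "T1:bob"))

print(find_matches(hospital_rows, pharmacy_rows))  # [('H1', 'P9')]
```

Requiring all five token IDs to agree is the strictest rule; a looser rule would pass a smaller `required` set.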

In [None]:
# If PySpark is available, use the PySpark bridge for analysis
if pyspark_available and spark:
    print("Using OpenToken PySpark Bridge for overlap analysis...")
    print()
    
    try:
        # Load tokenized data into Spark DataFrames
        hospital_tokens_spark = spark.read.csv(
            str(outputs_dir / 'hospital_tokens.csv'),
            header=True,
            inferSchema=True
        )
        pharmacy_tokens_spark = spark.read.csv(
            str(outputs_dir / 'pharmacy_tokens.csv'),
            header=True,
            inferSchema=True
        )
        
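        # Initialize the overlap analyzer with the encryption key.
        # Demo value only: it must match the key used during tokenization. Load real keys
        # from an environment variable or secrets manager rather than hard-coding them.
        encryption_key = "Secret-Encryption-Key-Goes-Here."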
        analyzer = OpenTokenOverlapAnalyzer(encryption_key)
        
        # Analyze overlap using PySpark
        matches_spark = analyzer.analyze_overlap(
            hospital_tokens_spark,
            pharmacy_tokens_spark,
            ["T1", "T2", "T3", "T4", "T5"],  # All 5 tokens must match
            hospital_record_id_column="RecordId",
            pharmacy_record_id_column="RecordId",
            hospital_token_column="Token",
            pharmacy_token_column="Token",
            hospital_token_id_column="TokenId",
            pharmacy_token_id_column="TokenId"
        )
        
        # Count matches in Spark rather than collecting every row to the driver
        match_count = matches_spark.count()
        print("✓ Overlap analysis completed using PySpark!")
        print(f"  Found {match_count} matching record pairs")
        
        # Show results
        matches_spark.show(10)
        
    except Exception as e:
        print(f"Error during PySpark analysis: {e}")
        print("Falling back to pandas-based analysis...")
        pyspark_available = False

# Fallback to pandas-based analysis
if not pyspark_available or not spark:
    print("Using pandas-based overlap analysis...")
    print()
    
    # Locate the overlap analysis script (run with the Python interpreter, so no chmod is needed)
    analyze_script = scripts_dir / 'analyze_overlap.py'

    # Run overlap analysis
    result = subprocess.run(
        ['python', str(analyze_script)],
        cwd=str(demo_dir),
        capture_output=True,
        text=True
    )

    print(result.stdout)
    if result.returncode != 0:
        print(f"Error: {result.stderr}")
        raise RuntimeError("Overlap analysis failed")
    else:
        print("✓ Overlap analysis completed successfully!")

### View Matching Results

In [None]:
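# Load and display matching results.
# The PySpark branch above does not explicitly write this file, so check that it exists.
matches_path = outputs_dir / 'matching_records.csv'
if not matches_path.exists():
    raise FileNotFoundError(f"{matches_path} not found; run the pandas-based overlap analysis first")
matches_df = pd.read_csv(matches_path)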

print(f"Total Matching Pairs: {len(matches_df)}")
print()
print("First 10 matching records:")
print(matches_df.head(10))
print()

# Summary statistics
hospital_count = len(hospital_df)
pharmacy_count = len(pharmacy_df)
unique_hospital_matches = matches_df['HospitalRecordId'].nunique()
unique_pharmacy_matches = matches_df['PharmacyRecordId'].nunique()

print("Summary Statistics:")
print(f"- Hospital records with matches: {unique_hospital_matches} out of {hospital_count}")
print(f"- Pharmacy records with matches: {unique_pharmacy_matches} out of {pharmacy_count}")
print(f"- Overlap percentage (hospital): {(unique_hospital_matches / hospital_count * 100):.1f}%")
print(f"- Overlap percentage (pharmacy): {(unique_pharmacy_matches / pharmacy_count * 100):.1f}%")

## 6. Understand the Results

Let's look at what a match actually means by examining some matched records in detail.

In [None]:
# Get a sample matched record
if len(matches_df) > 0:
    sample_match = matches_df.iloc[0]
    hospital_record_id = sample_match['HospitalRecordId']
    pharmacy_record_id = sample_match['PharmacyRecordId']
    
    # Get the original records
    hospital_record = hospital_df[hospital_df['RecordId'] == hospital_record_id].iloc[0]
    pharmacy_record = pharmacy_df[pharmacy_df['RecordId'] == pharmacy_record_id].iloc[0]
    
    print("Sample Match:")
    print(f"Hospital Record ID: {hospital_record_id}")
    print(f"Hospital Patient: {hospital_record['FirstName']} {hospital_record['LastName']}")
    print(f"DOB: {hospital_record['BirthDate']}, SSN: {hospital_record['SocialSecurityNumber']}")
    print()
    print(f"Pharmacy Record ID: {pharmacy_record_id}")
    print(f"Pharmacy Patient: {pharmacy_record['FirstName']} {pharmacy_record['LastName']}")
    print(f"DOB: {pharmacy_record['BirthDate']}, SSN: {pharmacy_record['SocialSecurityNumber']}")
    print()
    print("✓ All 5 tokens matched, confirming this is the same patient!")
else:
    print("No matches found. This could happen if:")
    print("- Different hashing/encryption keys were used")
    print("- Data validation rejected records with invalid attributes")

## 6a. Advanced PySpark Transformations (Optional)

If PySpark is available, we can perform distributed transformations on the tokenized data for large-scale analysis.


In [None]:
# PySpark-based transformations for distributed processing
if pyspark_available and spark:
    try:
        from pyspark.sql.functions import count as spark_count
        
        print("Performing distributed token analysis with PySpark...")
        print()
        
        # Load the tokenized data
        hospital_tokens_spark = spark.read.csv(
            str(outputs_dir / 'hospital_tokens.csv'),
            header=True,
            inferSchema=True
        )
        pharmacy_tokens_spark = spark.read.csv(
            str(outputs_dir / 'pharmacy_tokens.csv'),
            header=True,
            inferSchema=True
        )
        
        # Analyze token distribution in hospital dataset
        print("Hospital Token Distribution:")
        hospital_tokens_spark.groupBy("TokenId").agg(spark_count("*").alias("count")).show()
        print()
        
        # Analyze token distribution in pharmacy dataset
        print("Pharmacy Token Distribution:")
        pharmacy_tokens_spark.groupBy("TokenId").agg(spark_count("*").alias("count")).show()
        print()
        
        # Count unique records
        hospital_unique = hospital_tokens_spark.select("RecordId").distinct().count()
        pharmacy_unique = pharmacy_tokens_spark.select("RecordId").distinct().count()
        print(f"Unique records - Hospital: {hospital_unique}, Pharmacy: {pharmacy_unique}")
        
    except Exception as e:
        print(f"Note: Advanced transformations not available - {type(e).__name__}")
        print("This is optional and does not affect the core PPRL workflow.")
else:
    print("PySpark not available for advanced transformations.")
    print("Core PPRL analysis completed successfully using pandas.")
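
When PySpark is not installed, the same distribution check can be approximated with the standard library. The sketch below assumes the token files have `RecordId`, `TokenId`, and `Token` columns, as used in the analyzer call earlier; the `token_distribution` helper and the inline sample data are illustrative only.

```python
import csv
import io
from collections import Counter


def token_distribution(csv_file):
    """Count rows per TokenId and distinct RecordId values in a token CSV."""
    counts = Counter()
    record_ids = set()
    for row in csv.DictReader(csv_file):
        counts[row["TokenId"]] += 1
        record_ids.add(row["RecordId"])
    return counts, len(record_ids)


# Inline sample standing in for a file such as outputs/hospital_tokens.csv
sample = io.StringIO(
    "RecordId,TokenId,Token\n"
    "r1,T1,aaa\n"
    "r1,T2,bbb\n"
    "r2,T1,ccc\n"
    "r2,T2,ddd\n"
)

counts, unique_records = token_distribution(sample)
print(dict(counts), unique_records)
```

To run it on a real file, pass an open file handle, for example `open(outputs_dir / 'hospital_tokens.csv', newline='')`.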


## 7. Privacy and Security Summary

This demonstration shows how OpenToken enables privacy-preserving record linkage:

### What was protected:
- ✓ Raw patient data (names, SSNs, birthdates) was never shared between organizations
- ✓ HMAC-SHA256 is a keyed one-way function, so the hashes cannot feasibly be reversed to recover the original data
- ✓ Encryption key controls who can decrypt and perform linkage

### What was shared:
- Encrypted tokens for secure transmission
- Matching statistics showing overlap counts
- Metadata with summary information (not raw data)

### Key security principles:
1. **Strong Encryption**: AES-256-GCM with random IVs prevents pattern analysis
2. **Key Management**: Secure sharing and storage of encryption/hashing keys
3. **Deterministic Hashing**: HMAC-SHA256 enables comparison without raw data
4. **Access Control**: Only authorized parties can decrypt tokens

### PySpark Bridge Benefits:
- **Distributed Processing**: Handle large datasets across multiple nodes
- **Parallel Decryption**: Efficiently decrypt millions of tokens
- **Scalable Analysis**: Perform overlap analysis on enterprise-scale data
- **Integration**: Native Spark SQL for additional transformations

## 8. Customization Examples

You can customize this demo by modifying the scripts:

### Change dataset size and overlap:
Edit `scripts/generate_superhero_datasets.py`:
```python
num_hospital = 500  # Different size
num_pharmacy = 600
overlap_percentage = 0.50  # 50% overlap instead of 40%
```

### Use different encryption keys:
Edit `scripts/tokenize_datasets.sh`:
```bash
HASH_KEY="YourCustomHashingKey"
ENCRYPTION_KEY="YourCustomEncryptionKey-32"
```

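**Important**: Both organizations must use the same keys for tokens to match! If you change `ENCRYPTION_KEY`, also update `encryption_key` in the overlap analysis cell of this notebook, which needs the same value to decrypt the tokens.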
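
If you need fresh key values, one option is to generate them with Python's `secrets` module. The sketch below assumes a 32-character alphanumeric key, chosen to match the length of the demo key `Secret-Encryption-Key-Goes-Here.` used in this notebook; confirm the required key length and allowed characters against the OpenToken documentation. The `generate_key` helper is hypothetical.

```python
import secrets
import string


def generate_key(length: int = 32) -> str:
    """Return a random alphanumeric key from a cryptographically secure source."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


hash_key = generate_key()
encryption_key = generate_key()

# Print only the lengths; avoid echoing real keys into notebook output
print(len(hash_key), len(encryption_key))
```

Generate the keys once, then share them with the partner organization over a secure channel so both sides tokenize with identical values.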

### Scale with PySpark:
For large datasets, ensure PySpark is installed:
```bash
pip install pyspark opentoken-pyspark
```

The notebook will automatically use distributed processing if available.

## 9. Next Steps

This PPRL demo can be adapted for:
- Healthcare: Hospital-to-hospital patient matching
- Insurance: Claims linkage across providers
- Research: Multi-site study participant matching
- Government: Cross-agency identity resolution
- Financial Services: Anti-fraud systems

### With PySpark Bridge:
- Scale to petabyte-level datasets
- Distribute tokenization across clusters
- Parallel overlap analysis
- Real-time record linkage pipelines

For more information, see the [README.md](./README.md) in this directory and the [main OpenToken documentation](../../README.md).