# Privacy-Preserving Record Linkage (PPRL) Demonstration

This notebook is the recommended walkthrough for the `pprl-superhero-example` demo. It now follows the **ECDH public-key exchange** flow end-to-end (no pre-shared secrets). It shows how two organizations can link records **without exchanging raw identifiers** (like names or Social Security Numbers).

**Scenario:** Super Hero Hospital and Super Hero Pharmacy want to link patient records for care coordination while protecting privacy.

**Who does what:** In this demo, the **hospital** tokenizes its records and sends the encrypted token package; the **pharmacy** decrypts that package with its private key and runs the overlap analysis (comparing tokens to find matches).

**Match policy:** OpenToken provides standard token rules (T1–T5), but it does **not** define a single, universal match policy. A match policy is what the parties agree on: **which token IDs must match (and how many)** before treating two records as the same person. This notebook shows strict matching (T1–T5).
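
A match policy of this kind can be written as a small predicate. The sketch below is illustrative only: the `is_match` helper, its parameters, and the token values are hypothetical and not part of OpenToken. It assumes each record's tokens are available as a mapping from rule ID to a comparable token string.

```python
from typing import Dict, Iterable, Optional

STRICT_RULES = ("T1", "T2", "T3", "T4", "T5")


def is_match(
    tokens_a: Dict[str, str],
    tokens_b: Dict[str, str],
    required_rules: Iterable[str] = STRICT_RULES,
    min_matches: Optional[int] = None,
) -> bool:
    """Return True if enough of the listed rule IDs carry equal tokens.

    Hypothetical helper: by default every rule in required_rules must match;
    min_matches relaxes this to "at least N of the listed rules".
    """
    rules = list(required_rules)
    needed = len(rules) if min_matches is None else min_matches
    matched = sum(
        1
        for rule in rules
        if rule in tokens_a and tokens_a[rule] == tokens_b.get(rule)
    )
    return matched >= needed


# Made-up token values; real tokens are opaque strings.
record_a = {"T1": "a1", "T2": "b2", "T3": "c3", "T4": "d4", "T5": "e5"}
record_b = {"T1": "a1", "T2": "b2", "T3": "c3", "T4": "XX", "T5": "e5"}

strict_result = is_match(record_a, record_b)                  # T4 differs
relaxed_result = is_match(record_a, record_b, min_matches=4)  # 4 of 5 agree
print(strict_result, relaxed_result)
```

Under the strict policy the pair is rejected because T4 differs, while a "4 of 5" policy accepts it. Which of the two is appropriate is a decision for the linking parties, not something the tokens themselves determine.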

**Prefer a one-command run?** From this directory you can run `./run_end_to_end.sh` to generate data, perform ECDH key exchange, tokenize, decrypt, and analyze overlap end-to-end (see `README.md`).

> **Important demo disclaimer:** This example uses fully synthetic data and demo keys that are safe for illustration only. Do not reuse these keys or patterns in production. Real deployments must use strong key management, strict access controls around decryption and matching, and clear governance over who can run linkage jobs and how results are used.

In [None]:
# Setup: imports + paths
import subprocess
import sys
from pathlib import Path

import pandas as pd


def _find_demo_dir() -> Path:
    """Find the demo directory regardless of the current working directory."""
    start = Path.cwd().resolve()
    candidates = [start, *start.parents]
    for base in candidates:
        # Case 1: notebook opened from the demo directory
        if (base / "scripts" / "generate_superhero_datasets.py").exists():
            return base
        # Case 2: notebook opened from repo root (or elsewhere)
        demo = base / "demos" / "pprl-superhero-example"
        if (demo / "scripts" / "generate_superhero_datasets.py").exists():
            return demo
    raise FileNotFoundError("Could not locate demos/pprl-superhero-example")


demo_dir = _find_demo_dir()
scripts_dir = demo_dir / "scripts"
datasets_dir = demo_dir / "datasets"
outputs_dir = demo_dir / "outputs"
keys_dir = demo_dir / "keys"

# Ensure expected folders exist
datasets_dir.mkdir(parents=True, exist_ok=True)
outputs_dir.mkdir(parents=True, exist_ok=True)
keys_dir.mkdir(parents=True, exist_ok=True)

print(f"Demo directory: {demo_dir}")
print("Using ECDH public-key exchange demo flow (no shared secrets).")


## 2. Generate Superhero Datasets

Create two datasets (hospital and pharmacy) with a 40% overlap. The overlap represents patients that appear in both datasets.
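
To make the overlap figure concrete, the sketch below builds two ID lists that share a fixed fraction of patients. It is not the logic of `generate_superhero_datasets.py`: the helper name, the ID format, the dataset sizes, and the choice to measure overlap against the smaller dataset are all assumptions for illustration.

```python
import random
from typing import List, Tuple


def build_overlapping_ids(
    num_hospital: int,
    num_pharmacy: int,
    overlap_fraction: float,
    seed: int = 42,
) -> Tuple[List[str], List[str]]:
    """Return (hospital_ids, pharmacy_ids) that share a fraction of patients.

    Assumption: overlap is measured against the smaller dataset.
    """
    rng = random.Random(seed)
    shared_count = round(min(num_hospital, num_pharmacy) * overlap_fraction)

    shared = [f"shared-{i}" for i in range(shared_count)]
    hospital_only = [f"hosp-{i}" for i in range(num_hospital - shared_count)]
    pharmacy_only = [f"pharm-{i}" for i in range(num_pharmacy - shared_count)]

    hospital_ids = shared + hospital_only
    pharmacy_ids = shared + pharmacy_only
    rng.shuffle(hospital_ids)  # hide the fact that shared IDs came first
    rng.shuffle(pharmacy_ids)
    return hospital_ids, pharmacy_ids


# Sizes are made up for this example.
hospital_ids, pharmacy_ids = build_overlapping_ids(200, 300, 0.40)
common = set(hospital_ids) & set(pharmacy_ids)
print(len(hospital_ids), len(pharmacy_ids), len(common))
```

With different dataset sizes, the same number of shared patients is a different percentage of each dataset, which is why the results section reports overlap separately for the hospital and the pharmacy.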

In [None]:
# Run the data generation script
result = subprocess.run(
    [sys.executable, str(scripts_dir / "generate_superhero_datasets.py")],
    cwd=str(demo_dir),
    capture_output=True,
    text=True,
    check=False,
)

print(result.stdout)
if result.returncode != 0:
    print(f"Error: {result.stderr}")
    raise RuntimeError("Dataset generation failed")
print("✓ Datasets generated successfully!")

### Inspect the Generated Datasets

In [None]:
# Load and display hospital dataset
hospital_df = pd.read_csv(datasets_dir / 'hospital_superhero_data.csv')
print(f"Hospital Dataset: {len(hospital_df)} records")
print(hospital_df.head())
print()

# Load and display pharmacy dataset
pharmacy_df = pd.read_csv(datasets_dir / 'pharmacy_superhero_data.csv')
print(f"Pharmacy Dataset: {len(pharmacy_df)} records")
print(pharmacy_df.head())

## 3. Tokenize the Datasets (ECDH via demo scripts)

Each organization tokenizes its data independently using **ECDH public-key exchange**. In this notebook we call the demo scripts (Java/Python under the hood) to perform three steps:
1. Generate pharmacy key pair (receiver)
2. Hospital tokenization with ECDH (sender)
3. Pharmacy decryption + overlap analysis inputs

**Note:** The end-to-end script (`run_end_to_end.sh`) runs the same flow in one command.
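
The key agreement behind these steps can be illustrated with the standard library alone. The sketch below uses classic finite-field Diffie-Hellman as a stand-in for ECDH: the algebra differs, but the idea that each side combines its own private key with the other side's public key is the same. The toy group parameters, the function names, and the SHA-256 key derivation are assumptions for illustration; they are neither secure nor taken from OpenToken.

```python
import hashlib
import secrets

# Toy group parameters (NOT secure): a Mersenne prime modulus and a small base.
P = 2**127 - 1
G = 3


def generate_key_pair() -> tuple:
    """Return (private_key, public_key) for the toy Diffie-Hellman group."""
    private_key = secrets.randbelow(P - 3) + 2  # value in [2, P - 2]
    public_key = pow(G, private_key, P)
    return private_key, public_key


def derive_shared_key(own_private: int, peer_public: int) -> bytes:
    """Combine own private key with the peer's public key, then hash to 32 bytes."""
    shared_secret = pow(peer_public, own_private, P)
    return hashlib.sha256(shared_secret.to_bytes(16, "big")).digest()


# Each party generates its own key pair and shares only the public half.
pharmacy_private, pharmacy_public = generate_key_pair()
hospital_private, hospital_public = generate_key_pair()

hospital_key = derive_shared_key(hospital_private, pharmacy_public)
pharmacy_key = derive_shared_key(pharmacy_private, hospital_public)

print("Keys agree:", hospital_key == pharmacy_key)
```

Neither private key ever leaves its owner, yet both sides end up with the same derived key. That property is what lets the demo avoid pre-shared secrets.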

In [None]:
print("Tokenizing datasets with ECDH demo scripts...")
print()

# Scripts (ECDH flow only)
pharmacy_keys_script = scripts_dir / "tokenize_pharmacy_generate_keys.sh"
hospital_script = scripts_dir / "tokenize_hospital.sh"
pharmacy_decrypt_script = scripts_dir / "tokenize_pharmacy_decrypt.sh"

# Ensure executability
for script in [pharmacy_keys_script, hospital_script, pharmacy_decrypt_script]:
    subprocess.run(["chmod", "+x", str(script)], cwd=str(demo_dir), check=False)

# Step A: Pharmacy generates key pair
result = subprocess.run(
    [str(pharmacy_keys_script)],
    cwd=str(demo_dir),
    capture_output=True,
    text=True,
    check=False,
)
print(result.stdout)
if result.returncode != 0:
    print(f"Error: {result.stderr}")
    raise RuntimeError("Key generation failed")
print("✓ Pharmacy ECDH key pair generated")
print()

# Step B: Hospital tokenizes with ECDH (outputs hospital_tokens_ecdh.zip)
result = subprocess.run(
    [str(hospital_script)],
    cwd=str(demo_dir),
    capture_output=True,
    text=True,
    check=False,
)
print(result.stdout)
if result.returncode != 0:
    print(f"Error: {result.stderr}")
    raise RuntimeError("Hospital tokenization failed")
print("✓ Hospital tokenization (ECDH) completed")
print()

# Step C: Pharmacy decrypts hospital tokens and prepares for analysis
result = subprocess.run(
    [str(pharmacy_decrypt_script)],
    cwd=str(demo_dir),
    capture_output=True,
    text=True,
    check=False,
)
print(result.stdout)
if result.returncode != 0:
    print(f"Error: {result.stderr}")
    raise RuntimeError("Pharmacy decryption failed")
print("✓ Pharmacy decryption completed; decrypted tokens ready for analysis")

### Inspect Tokenized Data

In [None]:
# Load decrypted hospital tokens (produced by ECDH flow)
pharmacy_decrypted = pd.read_csv(outputs_dir / "pharmacy_decrypted_hospital_tokens.csv")
print(f"Decrypted Hospital Tokens (from pharmacy): {len(pharmacy_decrypted)} token rows")
print(pharmacy_decrypted.head(10))

## 4. Decrypt Tokens and Perform Overlap Analysis (ECDH)

To compare tokens across independently tokenized datasets:
1. The hospital sends encrypted tokens + its public key (in `hospital_tokens_ecdh.zip`).
2. The pharmacy uses its private key + the hospital public key to derive the same keys (ECDH) and decrypts to the deterministic HMAC layer.
3. The pharmacy compares the decrypted fingerprints to find matching records.

**Why decryption is needed:** OpenToken uses random IVs for encryption, so even identical patients produce different encrypted token strings. Decryption reveals the deterministic fingerprint layer that can be compared for equality.
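
This effect can be reproduced with a toy model. The sketch below is not OpenToken's implementation: the keys, the input string, and the XOR keystream cipher are assumptions chosen only to show why ciphertexts differ while the decrypted HMAC fingerprints are equal. The XOR cipher stands in for real authenticated encryption, which the Python standard library does not provide.

```python
import hashlib
import hmac
import secrets

HASHING_KEY = b"demo-hashing-key"        # hypothetical demo value
ENCRYPTION_KEY = b"demo-encryption-key"  # hypothetical demo value


def fingerprint(token_input: str) -> bytes:
    """Deterministic layer: the same input always yields the same HMAC-SHA256."""
    return hmac.new(HASHING_KEY, token_input.encode("utf-8"), hashlib.sha256).digest()


def _keystream(iv: bytes, length: int) -> bytes:
    """Toy keystream derived from the key and IV (illustration only, not AES)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(ENCRYPTION_KEY + iv + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def toy_encrypt(plaintext: bytes) -> bytes:
    """Prefix a fresh random IV, so repeated calls give different ciphertexts."""
    iv = secrets.token_bytes(12)
    stream = _keystream(iv, len(plaintext))
    return iv + bytes(p ^ s for p, s in zip(plaintext, stream))


def toy_decrypt(ciphertext: bytes) -> bytes:
    """Recover the plaintext using the IV stored in the first 12 bytes."""
    iv, body = ciphertext[:12], ciphertext[12:]
    stream = _keystream(iv, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


# The same made-up patient attributes, tokenized by two organizations.
hospital_token = toy_encrypt(fingerprint("DOE|JANE|F|1980-01-01"))
pharmacy_token = toy_encrypt(fingerprint("DOE|JANE|F|1980-01-01"))

print("Encrypted tokens equal:", hospital_token == pharmacy_token)
print("Decrypted fingerprints equal:",
      toy_decrypt(hospital_token) == toy_decrypt(pharmacy_token))
```

Comparing the encrypted strings directly would miss every true match, so equality has to be checked on the decrypted fingerprint layer.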

**Who runs this step (in this demo):** the **pharmacy** runs the overlap analysis after decrypting (trusted environment with its private key).

In [None]:
print("Performing overlap analysis (ECDH)...")
print()

analyze_script = scripts_dir / "analyze_overlap.py"
result = subprocess.run(
    [sys.executable, str(analyze_script)],
    cwd=str(demo_dir),
    capture_output=True,
    text=True,
    check=False,
)
print(result.stdout)
if result.returncode != 0:
    print(f"Error: {result.stderr}")
    raise RuntimeError("Overlap analysis failed")
print("✓ Overlap analysis completed successfully (ECDH)")

### View Matching Results

In [None]:
# Load and display matching results (ECDH)
matches_df = pd.read_csv(outputs_dir / "matching_records_ecdh.csv")

print(f"Total Matching Pairs: {len(matches_df)}")
print()
print("First 10 matching records:")
print(matches_df.head(10))
print()

# Summary statistics
hospital_count = len(pd.read_csv(datasets_dir / "hospital_superhero_data.csv"))
pharmacy_count = len(pd.read_csv(datasets_dir / "pharmacy_superhero_data.csv"))
unique_hospital_matches = matches_df['Hospital_RecordId'].nunique()
unique_pharmacy_matches = matches_df['Pharmacy_RecordId'].nunique()

print("Summary Statistics:")
print(f"- Hospital records with matches: {unique_hospital_matches} out of {hospital_count}")
print(f"- Pharmacy records with matches: {unique_pharmacy_matches} out of {pharmacy_count}")
print(f"- Overlap percentage (hospital): {(unique_hospital_matches / hospital_count * 100):.1f}%")
print(f"- Overlap percentage (pharmacy): {(unique_pharmacy_matches / pharmacy_count * 100):.1f}%")

## 5. Alternative Analysis

For custom match rules (e.g., fewer than 5 tokens), rerun the CLI overlap analyzer with your desired `matching_rules` configuration in `scripts/analyze_overlap.py`. This notebook run uses the default strict T1–T5 policy.

In [None]:
print("Alternative analysis not run here.")
print("To try different matching rules, edit scripts/analyze_overlap.py and rerun the notebook or CLI.")

### Interpreting the Alternative Analysis

This notebook does not run an alternative analysis, but one possible relaxed policy would use only four tokens (T1, T2, T3, T5) instead of all five:
- **T1**: FirstName + LastName + Sex + BirthDate
- **T2**: FirstName + LastName + PostalCode
- **T3**: FirstName + LastName + SocialSecurityNumber
- **T4**: ❌ *EXCLUDED* - BirthDate + Sex + PostalCode
- **T5**: BirthDate + Sex + SocialSecurityNumber

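By excluding T4, the policy no longer requires the BirthDate + Sex + PostalCode combination to agree. Note that T2, as listed above, still includes PostalCode, so a policy meant to tolerate postal code differences would also need to relax T2. Relaxing rules in this way might find additional matches where: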
- Postal codes have typos or formatting differences
- People have moved between visits
- Data entry errors occurred

However, this also increases the risk of false positives.
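
The trade-off can be explored by running a strict and a relaxed policy over the same token tables. In the sketch below, the `find_matches` helper, the record IDs, and the token values are made up for illustration. It is not the logic of `scripts/analyze_overlap.py`, and it treats tokens as opaque strings keyed by rule ID.

```python
from typing import Dict, List, Sequence, Tuple

TokenTable = Dict[str, Dict[str, str]]  # record ID -> {rule ID -> token}


def find_matches(
    hospital: TokenTable,
    pharmacy: TokenTable,
    rules: Sequence[str],
) -> List[Tuple[str, str]]:
    """Return (hospital_id, pharmacy_id) pairs whose tokens agree on every rule."""
    # Index pharmacy records by the tuple of tokens for the chosen rules.
    index: Dict[Tuple[str, ...], List[str]] = {}
    for record_id, tokens in pharmacy.items():
        if all(rule in tokens for rule in rules):
            key = tuple(tokens[rule] for rule in rules)
            index.setdefault(key, []).append(record_id)

    matches = []
    for record_id, tokens in hospital.items():
        if all(rule in tokens for rule in rules):
            key = tuple(tokens[rule] for rule in rules)
            for pharmacy_id in index.get(key, []):
                matches.append((record_id, pharmacy_id))
    return matches


hospital_tokens = {
    "H1": {"T1": "a", "T2": "b", "T3": "c", "T4": "d", "T5": "e"},
    "H2": {"T1": "p", "T2": "q", "T3": "r", "T4": "s", "T5": "t"},
}
pharmacy_tokens = {
    "P1": {"T1": "a", "T2": "b", "T3": "c", "T4": "d", "T5": "e"},
    "P2": {"T1": "p", "T2": "q", "T3": "r", "T4": "DIFFERENT", "T5": "t"},
}

strict = find_matches(hospital_tokens, pharmacy_tokens, ["T1", "T2", "T3", "T4", "T5"])
relaxed = find_matches(hospital_tokens, pharmacy_tokens, ["T1", "T2", "T3", "T5"])
print("Strict:", strict)
print("Relaxed:", relaxed)
```

The relaxed rule set admits the H2/P2 pair that the strict policy rejects. Whether that extra pair is a recovered true match or a false positive cannot be decided from the tokens alone, which is why the policy should be agreed on in advance.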

## 6. Understand the Results

Let's look at what a match actually means by examining some matched records in detail.

In [None]:
# Get a sample matched record
if len(matches_df) > 0:
    sample_match = matches_df.iloc[0]
    hospital_record_id = sample_match['Hospital_RecordId']
    pharmacy_record_id = sample_match['Pharmacy_RecordId']
    
    # Get the original records
    hospital_match = pd.read_csv(datasets_dir / "hospital_superhero_data.csv")
    hospital_match = hospital_match[hospital_match['RecordId'] == hospital_record_id]
    pharmacy_match = pd.read_csv(datasets_dir / "pharmacy_superhero_data.csv")
    pharmacy_match = pharmacy_match[pharmacy_match['RecordId'] == pharmacy_record_id]
    
    if len(hospital_match) == 0 or len(pharmacy_match) == 0:
        print("Warning: Could not find matching records in original datasets")
        print(f"Hospital RecordId {hospital_record_id} found: {len(hospital_match) > 0}")
        print(f"Pharmacy RecordId {pharmacy_record_id} found: {len(pharmacy_match) > 0}")
    else:
        hospital_record = hospital_match.iloc[0]
        pharmacy_record = pharmacy_match.iloc[0]
        
        print("Sample Match:")
        print(f"Hospital Record ID: {hospital_record_id}")
        print(f"Hospital Patient: {hospital_record['FirstName']} {hospital_record['LastName']}")
        print(f"DOB: {hospital_record['BirthDate']}, SSN: {hospital_record['SocialSecurityNumber']}")
        print()
        print(f"Pharmacy Record ID: {pharmacy_record_id}")
        print(f"Pharmacy Patient: {pharmacy_record['FirstName']} {pharmacy_record['LastName']}")
        print(f"DOB: {pharmacy_record['BirthDate']}, SSN: {pharmacy_record['SocialSecurityNumber']}")
        print()
        print("✓ All 5 tokens matched; under the strict T1–T5 policy these records are treated as the same patient.")
else:
    print("No matches found. This could happen if:")
    print("- Key exchange was not run (tokenization/decryption failed)")
    print("- Data validation rejected records with invalid attributes")

## 7. Advanced PySpark Transformations (Optional)

If PySpark is available, we can perform distributed transformations on the tokenized data for large-scale analysis.


In [None]:
# Check PySpark availability
try:
    from pyspark.sql import SparkSession
    pyspark_available = True
    
    # Initialize Spark session
    spark = SparkSession.builder \
        .appName("OpenToken PPRL Demo") \
        .master("local[*]") \
        .getOrCreate()
    
    print("✓ PySpark is available")
    print(f"Spark version: {spark.version}")
except ImportError:
    pyspark_available = False
    spark = None
    print("PySpark not installed - advanced transformations will be skipped")
    print("To enable PySpark: pip install pyspark opentoken-pyspark")

In [None]:
# PySpark-based transformations for distributed processing
if not (pyspark_available and spark):
    print("PySpark not available for advanced transformations.")
    print("Core PPRL analysis completed successfully using pandas.")
elif not (outputs_dir / 'pharmacy_decrypted_hospital_tokens.csv').exists():
    print("PySpark analysis skipped: decrypted hospital tokens file not found.")
    print("ECDH produces encrypted tokens in ZIP format and decrypted tokens for analysis.")
    print("For PySpark analysis, extract tokens from ZIP or tokenize with pre-shared secrets.")
else:
    try:
        from pyspark.sql.functions import col, count as spark_count
        
        print("Performing distributed token analysis with PySpark (ECDH flow)...")
        print()
        
        # Load the ECDH-decrypted hospital tokens
        hospital_tokens_spark = spark.read.csv(
            str(outputs_dir / 'pharmacy_decrypted_hospital_tokens.csv'),
            header=True,
            inferSchema=False  # Keep all as strings
        )
        
        # Analyze token distribution in decrypted hospital dataset
        print("Hospital Token Distribution (ECDH-decrypted):")
        hospital_tokens_spark.groupBy("RuleId").agg(spark_count("*").alias("count")).orderBy("RuleId").show()
        print()
        
        # Count unique records
        hospital_unique = hospital_tokens_spark.select("RecordId").distinct().count()
        print(f"Unique hospital records: {hospital_unique}")
        print()
        print("Note: ECDH flow analyzed. For pharmacy token distribution, run pharmacy")
        print("      tokenization separately (not covered in this ECDH demo).")
        
    except Exception as e:
        print(f"Note: Advanced transformations not available - {type(e).__name__}")
        print("This is optional and does not affect the core PPRL workflow.")

## 8. Privacy and Security Summary

This demonstration shows how OpenToken enables privacy-preserving record linkage:

### What was protected:
- ✓ Raw patient data (names, SSNs, birthdates) was never shared between organizations
- ✓ HMAC-SHA256 hashes are one-way and cannot be inverted to recover the original data
- ✓ The ECDH-derived encryption key controls who can decrypt tokens and perform linkage

### What was shared:
- Encrypted tokens for secure transmission
- Matching statistics showing overlap counts
- Metadata with summary information (not raw data)

### Key security principles:
1. **Strong Encryption**: AES-256-GCM with random IVs prevents pattern analysis
2. **Key Management**: Private keys stay with their owners, only public keys are exchanged, and derived encryption/hashing keys are stored securely
3. **Deterministic Hashing**: HMAC-SHA256 enables comparison without raw data
4. **Access Control**: Only authorized parties can decrypt tokens

### PySpark Bridge Benefits:
- **Distributed Processing**: Handle large datasets across multiple nodes
- **Parallel Decryption**: Efficiently decrypt millions of tokens
- **Scalable Analysis**: Perform overlap analysis on enterprise-scale data
- **Integration**: Native Spark SQL for additional transformations

## 9. Customization Examples

You can customize this demo by modifying the scripts:

### Change dataset size and overlap:
Edit `scripts/generate_superhero_datasets.py`:
```python
num_hospital = 500  # Different size
num_pharmacy = 600
overlap_percentage = 0.50  # 50% overlap instead of 40%
```

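### Use different secrets (pre-shared-secret flow, demo only):
The ECDH flow used in this notebook derives its keys from the key exchange, so it has no pre-shared secrets to edit. If you tokenize with pre-shared secrets instead, edit **both** tokenization scripts to keep them in sync:
- `scripts/tokenize_hospital.sh`
- `scripts/tokenize_pharmacy.sh`

```bash
HASHING_SECRET="YourCustomHashingKey"
ENCRYPTION_KEY="YourCustomEncryptionKey-32chars!"  # Must be exactly 32 characters
```

**Important**: In that flow, both organizations must use the same secrets for tokens to match.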

### Scale with PySpark:
For large datasets, ensure PySpark and the OpenToken PySpark bridge are installed:
```bash
pip install pyspark opentoken-pyspark
```

When PySpark is installed, the optional PySpark cells in section 7 run automatically; otherwise they are skipped and the core pandas analysis is unaffected.

## 10. Next Steps

This PPRL demo can be adapted for:
- Healthcare: Hospital-to-hospital patient matching
- Insurance: Claims linkage across providers
- Research: Multi-site study participant matching
- Government: Cross-agency identity resolution
- Financial Services: Anti-fraud systems

### With PySpark Bridge:
- Scale to petabyte-level datasets
- Distribute tokenization across clusters
- Parallel overlap analysis
- Real-time record linkage pipelines

For more information, see the [README.md](./README.md) in this directory and the [main OpenToken documentation](../../README.md).
