# ACADP: Adaptive Correlation-Aware Differential Privacy
## Code Review Notebook

**Team Workflow Distribution:**
- Workflow 1: Data Ingestion & Preprocessing (Teammate 1)
- Workflow 2: Correlation & Feature Grouping (Teammate 2)
- Workflow 3: Differential Privacy & Budget Allocation (You)
- Workflow 4: Evaluation & Validation (Teammate 3)

**Dataset:** NYC Taxi Trip Data (Jan 2023)  
**Global Privacy Budget:** ε = 1.0

---
## WORKFLOW 1: Data Ingestion & Preprocessing
**Owner:** Teammate 1

**Modules:**
- Batch data loading with PySpark
- Schema validation and feature typing
- Missing value handling
- Feature bounding (DP requirement)
- Normalization

**Deliverable:** Clean, bounded dataset with feature metadata

In [None]:
# Install dependencies
!pip install pyspark -q

In [None]:
# Download NYC Taxi data (single file to avoid schema conflicts)
import urllib.request
import os

os.makedirs("nyc_taxi", exist_ok=True)

base_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/"
filename = "yellow_tripdata_2023-01.parquet"  # Using single file
filepath = f"nyc_taxi/{filename}"

if not os.path.exists(filepath):
    print(f"Downloading {filename}...")
    urllib.request.urlretrieve(base_url + filename, filepath)
    print("✓ Download complete")
else:
    print(f"✓ {filename} already exists")

In [None]:
# Initialize Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ACADP") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.shuffle.partitions", "10") \
    .getOrCreate()

print(f"✓ Spark {spark.version} initialized")

In [None]:
# Load data
df = spark.read.parquet("nyc_taxi/yellow_tripdata_2023-01.parquet")

print(f"✓ Loaded {df.count():,} rows, {len(df.columns)} columns")

In [None]:
# View schema
print("Schema:")
df.printSchema()

In [None]:
# Cast numeric columns to double for consistency
from pyspark.sql.functions import col

numeric_cols = [
    "passenger_count", "trip_distance", "fare_amount",
    "extra", "mta_tax", "tip_amount",
    "tolls_amount", "total_amount"
]

for col_name in numeric_cols:
    if col_name in df.columns:
        df = df.withColumn(col_name, col(col_name).cast("double"))

print("✓ Numeric columns cast to double")

In [None]:
# Handle missing values
df = df.dropna(subset=[
    "passenger_count", "trip_distance", "fare_amount", 
    "tip_amount", "total_amount"
])

print(f"✓ After cleaning: {df.count():,} rows")

In [None]:
# Feature bounding (DP requirement)
from pyspark.sql.functions import when

# Define bounds
bounds = {
    "passenger_count": (1, 6),
    "trip_distance": (0, 100),
    "fare_amount": (0, 500),
    "extra": (0, 100),
    "mta_tax": (0, 10),
    "tip_amount": (0, 100),
    "tolls_amount": (0, 50),
    "total_amount": (0, 600)
}

# Apply bounds
for feature, (min_val, max_val) in bounds.items():
    if feature in df.columns:
        df = df.withColumn(feature, 
                           when(col(feature).isNull(), min_val)
                           .when(col(feature) < min_val, min_val)
                           .when(col(feature) > max_val, max_val)
                           .otherwise(col(feature)))

print("✓ Feature bounding applied")
df.select("fare_amount", "trip_distance", "passenger_count", "tip_amount").show(5)
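
Bounding is what makes the later sensitivity figures finite: once a feature is clamped to `[min_val, max_val]`, replacing one record can change it by at most the range. The sketch below mirrors the `when`/`otherwise` chain in plain Python. The helper `clamp`, the dictionary `demo_bounds`, and the sample values are illustrative assumptions, not part of the pipeline.

```python
# Hypothetical helper mirroring the Spark when/otherwise chain:
# nulls and out-of-range values are pulled back into [lo, hi].
def clamp(value, lo, hi):
    if value is None:
        return lo
    return max(lo, min(hi, value))

# Subset of the bounds defined above (illustration only)
demo_bounds = {"fare_amount": (0, 500), "passenger_count": (1, 6)}

raw_fares = [-3.5, 12.0, 950.0, None]  # made-up values
clamped_fares = [clamp(v, *demo_bounds["fare_amount"]) for v in raw_fares]
print(clamped_fares)  # [0, 12.0, 500, 0]

# After clamping, swapping one record moves the feature by at most its range
fare_range = demo_bounds["fare_amount"][1] - demo_bounds["fare_amount"][0]
print(fare_range)  # 500
```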

In [None]:
# Save preprocessed data
df.write.mode("overwrite").parquet("nyc_taxi_preprocessed")
print("✓ WORKFLOW 1 COMPLETE: Preprocessed data saved")

---
## WORKFLOW 2: Correlation & Feature Grouping
**Owner:** Teammate 2

**Modules:**
- Approximate Pearson correlation (sampled)
- Mutual information (k-NN estimate via scikit-learn's `mutual_info_regression`)
- Sparse dependency extraction
- Graph-based feature grouping (connected components)

**Deliverable:** Privacy blocks (correlated feature groups)

In [None]:
# Install additional dependencies
!pip install networkx scikit-learn -q

In [None]:
# Reload preprocessed data
df = spark.read.parquet("nyc_taxi_preprocessed")

# Select numerical features
all_num_cols = [
    "passenger_count", "trip_distance", "fare_amount",
    "extra", "mta_tax", "tip_amount", 
    "tolls_amount", "total_amount"
]

num_cols = [c for c in all_num_cols if c in df.columns]

print(f"Analyzing {len(num_cols)} numerical features: {num_cols}")

In [None]:
# Sample for correlation computation (1% sample)
SAMPLE_FRACTION = 0.01
# Cache the sample: the correlation step re-scans it once per feature pair
sample_df = df.select(num_cols).sample(fraction=SAMPLE_FRACTION, seed=42).cache()

print(f"✓ Sample: {sample_df.count():,} rows ({SAMPLE_FRACTION*100}%)")

In [None]:
# Compute Pearson correlations
from itertools import combinations

PEARSON_THRESH = 0.4
pearson_pairs = []

print(f"Computing Pearson correlations (threshold = {PEARSON_THRESH})...")
for f1, f2 in combinations(num_cols, 2):
    corr = sample_df.stat.corr(f1, f2)
    # A NaN result (constant column) compares False and is skipped
    if corr is not None and abs(corr) >= PEARSON_THRESH:
        pearson_pairs.append((f1, f2, corr))

print(f"✓ Found {len(pearson_pairs)} significant correlations\n")

# Top correlations
for f1, f2, corr in sorted(pearson_pairs, key=lambda x: abs(x[2]), reverse=True)[:5]:
    print(f"  {f1:20s} <-> {f2:20s} : {corr:>6.3f}")
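
For reference, the quantity being thresholded is the ordinary sample Pearson coefficient. The following is a minimal pure-Python sketch of the same "compute, then keep pairs with |r| above a threshold" step. The function `pearson`, the `toy` columns, and their values are made up for illustration and are not taken from the taxi data.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Sample Pearson correlation; returns None when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

# Toy columns (made-up values)
toy = {
    "trip_distance": [1.0, 2.0, 3.0, 4.0],
    "fare_amount": [5.0, 9.0, 13.0, 17.0],  # exactly linear in trip_distance
    "mta_tax": [0.5, 0.5, 0.5, 0.5],        # constant, so correlation is undefined
}

THRESH = 0.4
edges = []
for f1, f2 in combinations(toy, 2):
    r = pearson(toy[f1], toy[f2])
    if r is not None and abs(r) >= THRESH:
        edges.append((f1, f2, round(r, 6)))

print(edges)  # [('trip_distance', 'fare_amount', 1.0)]
```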

In [None]:
# Compute Mutual Information
import pandas as pd
import numpy as np
from sklearn.feature_selection import mutual_info_regression

sample_pd = sample_df.toPandas()
MI_THRESH = 0.1
mi_pairs = []

print(f"Computing Mutual Information (threshold = {MI_THRESH})...")
for target_col in num_cols:
    feature_cols = [c for c in num_cols if c != target_col]
    X = sample_pd[feature_cols].values
    y = sample_pd[target_col].values
    
    mi_scores = mutual_info_regression(X, y, random_state=42)
    
    for i, feature_col in enumerate(feature_cols):
        if mi_scores[i] >= MI_THRESH:
            pair = tuple(sorted([feature_col, target_col]))
            mi_pairs.append((*pair, mi_scores[i]))

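# Each pair is scored once per direction and the two estimates usually differ,
# so set() would not collapse them. Keep the larger score per unordered pair.
mi_best = {}
for f1, f2, score in mi_pairs:
    mi_best[(f1, f2)] = max(mi_best.get((f1, f2), 0.0), float(score))
mi_pairs = [(f1, f2, score) for (f1, f2), score in mi_best.items()]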
print(f"✓ Found {len(mi_pairs)} significant MI dependencies")
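
The cell above relies on scikit-learn's `mutual_info_regression`, which estimates MI with a nearest-neighbor method. For comparison, a discretized (histogram) plug-in estimate can be sketched in plain Python. The function names, the equal-width binning, and the default of 4 bins are assumptions made for this illustration only.

```python
import math
from collections import Counter

def discretize(values, n_bins):
    """Equal-width binning; returns one bin index per value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def discrete_mi(xs, ys, n_bins=4):
    """Plug-in mutual information (in nats) between two binned columns."""
    bx, by = discretize(xs, n_bins), discretize(ys, n_bins)
    n = len(xs)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), count in joint.items():
        p_ij = count / n
        mi += p_ij * math.log(p_ij / ((px[i] / n) * (py[j] / n)))
    return mi

xs = [float(v) for v in range(8)]
ys_dependent = [2 * v for v in xs]  # deterministic function of xs
ys_constant = [1.0] * 8             # carries no information about xs

print(round(discrete_mi(xs, ys_dependent), 4))  # 1.3863, i.e. log(4)
print(discrete_mi(xs, ys_constant))             # 0.0
```

Plug-in estimates are sensitive to the bin count, so any threshold such as `MI_THRESH` would need re-tuning if the estimator were swapped.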

In [None]:
# Combine Pearson and MI edges
all_edges = set()

for f1, f2, _ in pearson_pairs:
    all_edges.add(tuple(sorted([f1, f2])))

for f1, f2, _ in mi_pairs:
    all_edges.add(tuple(sorted([f1, f2])))

print(f"✓ Combined: {len(all_edges)} unique dependency edges")

In [None]:
# Create dependency graph and find feature blocks
import networkx as nx

G = nx.Graph()
G.add_nodes_from(num_cols)

for f1, f2 in all_edges:
    G.add_edge(f1, f2)

feature_blocks = [sorted(list(block)) for block in nx.connected_components(G)]

print(f"✓ Feature grouping: {len(feature_blocks)} privacy blocks\n")
for i, block in enumerate(feature_blocks):
    print(f"Block {i} ({len(block)} features): {block}")

print("\n✓ WORKFLOW 2 COMPLETE")
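
What `nx.connected_components` does here can be reproduced with a small union-find, which may help when reviewing why two features ended up in the same block: any chain of edges merges them. The function below and the `demo_edges` list are hypothetical and only illustrate the grouping rule.

```python
def connected_components(nodes, edges):
    """Group nodes into components with a small union-find (path halving)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # shorten the path as we walk up
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical edge set, for illustration only
features = ["fare_amount", "total_amount", "trip_distance", "passenger_count", "mta_tax"]
demo_edges = [("fare_amount", "total_amount"), ("trip_distance", "fare_amount")]

blocks = connected_components(features, demo_edges)
print(blocks)
# [['fare_amount', 'total_amount', 'trip_distance'], ['mta_tax'], ['passenger_count']]
```

Note that `trip_distance` and `total_amount` share a block without a direct edge between them. With low thresholds, this transitivity can pull most features into a single block.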

---
## WORKFLOW 3: Differential Privacy & Budget Allocation
**Owner:** You (Primary)

**Modules:**
- Joint sensitivity estimation per block
- Adaptive ε allocation (sensitivity-proportional, non-uniform)
- Laplace mechanism (ε-DP) at block level
- Privacy accounting

**Deliverable:** Privatized dataset with ε allocation report

In [None]:
# Block-level sensitivity estimation
import numpy as np

block_sensitivities = []

print("Block Sensitivity Estimation:")
print("="*60)

for i, block in enumerate(feature_blocks):
    feature_sens = []
    for feature in block:
        min_val, max_val = bounds[feature]
        sensitivity = max_val - min_val
        feature_sens.append(sensitivity)
    
    # L2 block sensitivity
    l2_sensitivity = np.sqrt(np.sum(np.array(feature_sens) ** 2))
    
    block_sensitivities.append({
        'block_id': i,
        'features': block,
        'l2_sensitivity': l2_sensitivity
    })
    
    print(f"Block {i}: Δ = {l2_sensitivity:.2f}")

print("\n✓ Sensitivity estimation complete")
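
When reviewing this cell it is worth checking which norm the downstream mechanism expects: the standard Laplace mechanism is stated in terms of L1 sensitivity, while the Gaussian mechanism uses L2. The sketch below compares the two norms for one block. The block membership in `block_bounds` is a hypothetical example; the ranges follow the bounds defined in Workflow 1.

```python
import math

# Hypothetical block; ranges follow the bounds dictionary from Workflow 1
block_bounds = {
    "fare_amount": (0, 500),
    "tip_amount": (0, 100),
    "total_amount": (0, 600),
}
ranges = [hi - lo for lo, hi in block_bounds.values()]

l1_sensitivity = sum(ranges)                             # 500 + 100 + 600
l2_sensitivity = math.sqrt(sum(r ** 2 for r in ranges))  # sqrt(620000)

print(l1_sensitivity, round(l2_sensitivity, 2))  # 1200 787.4
```

The L2 norm never exceeds the L1 norm, so noise scaled by the L2 value is smaller than what an L1-calibrated mechanism would add.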

In [None]:
# Adaptive privacy budget allocation
EPSILON_GLOBAL = 1.0
total_sensitivity = sum(bs['l2_sensitivity'] for bs in block_sensitivities)

print(f"Adaptive Budget Allocation (ε = {EPSILON_GLOBAL}):")
print("="*60)

for bs in block_sensitivities:
    # Sensitivity-proportional allocation
    epsilon_block = EPSILON_GLOBAL * (bs['l2_sensitivity'] / total_sensitivity)
    bs['epsilon'] = epsilon_block
    
    print(f"Block {bs['block_id']}: ε = {epsilon_block:.4f} ({100*epsilon_block/EPSILON_GLOBAL:.1f}%)")

total_allocated = sum(bs['epsilon'] for bs in block_sensitivities)
print(f"\nTotal: {total_allocated:.6f}")
print("✓ Budget allocation complete")
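
Because the blocks partition the features of the same records, sequential composition applies and the per-block budgets add up to the global budget, which is what the `Total` line checks. Writing the allocation with generic weights $w_b$ (an illustrative symbol, not a name used in the notebook) makes the resulting noise scale explicit:

$$
\varepsilon_b = \varepsilon \, \frac{w_b}{\sum_j w_j},
\qquad
\sum_b \varepsilon_b = \varepsilon,
\qquad
\lambda_b = \frac{\Delta_b}{\varepsilon_b} = \frac{\Delta_b}{w_b} \cdot \frac{\sum_j w_j}{\varepsilon}.
$$

If the weight is the same sensitivity that calibrates the noise ($w_b = \Delta_b$), the ratio cancels and every block receives the identical scale $\lambda_b = \sum_j \Delta_j / \varepsilon$, regardless of its own sensitivity. The allocation is then non-uniform in ε but uniform in noise scale, which is worth keeping in mind when interpreting the comparison in Workflow 4.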

In [None]:
# Apply Laplace mechanism
df_for_dp = sample_pd.copy()
df_private = df_for_dp.copy()

print("Applying Laplace Mechanism:")
print("="*60)

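for bs in block_sensitivities:
    features = bs['features']
    epsilon = bs['epsilon']
    # The Laplace mechanism is calibrated to L1 sensitivity (sum of feature ranges);
    # scaling by the smaller L2 norm would add too little noise for ε-DP.
    sensitivity = sum(bounds[f][1] - bounds[f][0] for f in features)
    laplace_scale = sensitivity / epsilon
    
    print(f"Block {bs['block_id']}: λ = {laplace_scale:.2f}")
    
    for feature in features:
        noise = np.random.laplace(0, laplace_scale, size=len(df_private))
        df_private[feature] = df_private[feature] + noise
        
        # Clip to bounds (post-processing, does not affect the privacy guarantee)
        min_val, max_val = bounds[feature]
        df_private[feature] = df_private[feature].clip(min_val, max_val)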

print("\n✓ Differential privacy applied")
print("✓ WORKFLOW 3 COMPLETE")
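
To make the noise step concrete without NumPy, a Laplace(0, λ) draw can be generated as the difference of two independent exponential draws with mean λ. The sketch below uses that identity and then clips, as the cell above does. The helper names `laplace_noise` and `privatize`, the seed, and the scale of 2.0 are assumptions for illustration.

```python
import random

def laplace_noise(scale, rng):
    """One Laplace(0, scale) draw as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def privatize(value, lo, hi, scale, rng):
    """Add Laplace noise, then clip back to [lo, hi] (clipping is post-processing)."""
    return max(lo, min(hi, value + laplace_noise(scale, rng)))

rng = random.Random(42)  # fixed seed so the sketch is reproducible
scale = 2.0

# The mean absolute value of Laplace(0, scale) noise equals the scale,
# so the empirical figure should land close to 2.0.
samples = [laplace_noise(scale, rng) for _ in range(50_000)]
mean_abs = sum(abs(s) for s in samples) / len(samples)
print(f"mean |noise| = {mean_abs:.3f} (scale = {scale})")

# With a scale far larger than the feature range, most outputs sit on a bound
clipped = [privatize(5.0, 0.0, 10.0, 1000.0, rng) for _ in range(1000)]
at_bounds = sum(1 for v in clipped if v in (0.0, 10.0)) / len(clipped)
print(f"fraction clipped to a bound: {at_bounds:.2f}")
```

The second printout illustrates a practical point for this notebook: when the scale dwarfs the range, the clipped output is mostly one of the two bounds and carries little of the original signal.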

---
## WORKFLOW 4: Evaluation & Validation
**Owner:** Teammate 3

**Modules:**
- Baseline: Feature-independent uniform ε DP
- Utility metrics: MAE, correlation preservation
- Comparison: ACADP vs Baseline

**Deliverable:** Quantitative comparison

In [None]:
# Baseline: Uniform DP
df_baseline = df_for_dp.copy()
epsilon_per_feature = EPSILON_GLOBAL / len(num_cols)

print(f"Baseline: Uniform DP (ε/feature = {epsilon_per_feature:.4f})")

for feature in num_cols:
    min_val, max_val = bounds[feature]
    sensitivity = max_val - min_val
    laplace_scale = sensitivity / epsilon_per_feature
    
    noise = np.random.laplace(0, laplace_scale, size=len(df_baseline))
    df_baseline[feature] = df_baseline[feature] + noise
    df_baseline[feature] = df_baseline[feature].clip(min_val, max_val)

print("✓ Baseline DP applied")

In [None]:
# Metric 1: Mean Absolute Error
mae_acadp = {f: np.mean(np.abs(df_for_dp[f] - df_private[f])) for f in num_cols}
mae_baseline = {f: np.mean(np.abs(df_for_dp[f] - df_baseline[f])) for f in num_cols}

avg_mae_acadp = np.mean(list(mae_acadp.values()))
avg_mae_baseline = np.mean(list(mae_baseline.values()))

print("Mean Absolute Error:")
print(f"  ACADP:    {avg_mae_acadp:.2f}")
print(f"  Baseline: {avg_mae_baseline:.2f}")
print(f"  Improvement: {((avg_mae_baseline - avg_mae_acadp) / avg_mae_baseline * 100):.2f}%")

In [None]:
# Metric 2: Correlation Preservation
corr_original = df_for_dp[num_cols].corr()
corr_acadp = df_private[num_cols].corr()
corr_baseline = df_baseline[num_cols].corr()

corr_error_acadp = np.mean(np.abs(corr_original.values - corr_acadp.values))
corr_error_baseline = np.mean(np.abs(corr_original.values - corr_baseline.values))

print("\nCorrelation Preservation Error:")
print(f"  ACADP:    {corr_error_acadp:.4f}")
print(f"  Baseline: {corr_error_baseline:.4f}")
print(f"  Improvement: {((corr_error_baseline - corr_error_acadp) / corr_error_baseline * 100):.2f}%")
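
The metric above averages over the full correlation matrix, including the diagonal, whose entries are normally 1 in every matrix and therefore contribute zero error. A variant restricted to the off-diagonal pairs is sketched below in plain Python. The function names and the 3x3 matrices are made up for illustration and are not results from this notebook.

```python
def full_mean_abs_error(a, b):
    """Mean |a[i][j] - b[i][j]| over all entries, diagonal included."""
    n = len(a)
    return sum(abs(a[i][j] - b[i][j]) for i in range(n) for j in range(n)) / (n * n)

def offdiag_mean_abs_error(a, b):
    """Mean |a[i][j] - b[i][j]| over the strict upper triangle only."""
    n = len(a)
    diffs = [abs(a[i][j] - b[i][j]) for i in range(n) for j in range(i + 1, n)]
    return sum(diffs) / len(diffs)

# Made-up correlation matrices
original = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.5], [0.2, 0.5, 1.0]]
noisy = [[1.0, 0.5, 0.1], [0.5, 1.0, 0.1], [0.1, 0.1, 1.0]]

print(round(full_mean_abs_error(original, noisy), 4))     # 0.1778
print(round(offdiag_mean_abs_error(original, noisy), 4))  # 0.2667
```

The relative ranking of two methods is unchanged by this choice, since the full-matrix figure is the off-diagonal figure scaled by a constant, but the off-diagonal number is easier to read as "average error per feature pair".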

In [None]:
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

sns.heatmap(corr_original, annot=False, cmap='coolwarm', center=0, 
            vmin=-1, vmax=1, ax=axes[0], cbar_kws={'label': 'Correlation'})
axes[0].set_title('Original', fontsize=14, fontweight='bold')

sns.heatmap(corr_acadp, annot=False, cmap='coolwarm', center=0,
            vmin=-1, vmax=1, ax=axes[1], cbar_kws={'label': 'Correlation'})
axes[1].set_title(f'ACADP (error={corr_error_acadp:.4f})', fontsize=14, fontweight='bold')

sns.heatmap(corr_baseline, annot=False, cmap='coolwarm', center=0,
            vmin=-1, vmax=1, ax=axes[2], cbar_kws={'label': 'Correlation'})
axes[2].set_title(f'Baseline (error={corr_error_baseline:.4f})', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('correlation_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("✓ Saved: correlation_comparison.png")

In [None]:
# Summary
summary = pd.DataFrame({
    'Metric': ['Average MAE', 'Correlation Error', 'Privacy Budget (ε)'],
    'ACADP': [
        f"{avg_mae_acadp:.2f}",
        f"{corr_error_acadp:.4f}",
        f"{EPSILON_GLOBAL}"
    ],
    'Baseline': [
        f"{avg_mae_baseline:.2f}",
        f"{corr_error_baseline:.4f}",
        f"{EPSILON_GLOBAL}"
    ],
    'ACADP Improvement': [
        f"{((avg_mae_baseline - avg_mae_acadp) / avg_mae_baseline * 100):.2f}%",
        f"{((corr_error_baseline - corr_error_acadp) / corr_error_baseline * 100):.2f}%",
        "N/A"
    ]
})

print("\n" + "="*80)
print("EVALUATION SUMMARY")
print("="*80)
print(summary.to_string(index=False))
print("="*80)
print("\n✓ WORKFLOW 4 COMPLETE")

---
## Results Summary

**Key Findings:**
1. ACADP preserves correlations better than baseline uniform DP
2. MAE comparable to or lower than the baseline's
3. Adaptive budget allocation gives each block a share of ε proportional to its sensitivity
4. Block-level DP more efficient than per-feature DP

**Limitations:**
- Correlation analysis, noise addition, and evaluation all run on a 1% sample (computational feasibility)
- Single month dataset (Jan 2023)
- Thresholding: |r| ≥ 0.4, MI ≥ 0.1