This project implements an adaptive data anonymization framework that uses mutual information analysis combined with differential privacy mechanisms. The system dynamically allocates privacy budgets based on feature importance and correlation patterns, optimizing the utility-privacy trade-off for machine learning tasks.
- MI-Adaptive Differential Privacy: Dynamically allocates epsilon budget based on mutual information scores
- Correlation-Aware Noise Addition: Groups correlated features for coordinated noise injection
- K-Means Clustering Integration: Applies anonymization within data clusters for better utility preservation
- Feature Selection Integration: Supports Chi2 and ExtraTrees feature selection methods
- Multi-Model Evaluation: Tests 6 different ML models with hyperparameter optimization using Optuna
- Comprehensive Anonymization Scenarios: Evaluates all combinations of anonymized/non-anonymized training and test sets
The framework implements a novel MI-Adaptive Differential Privacy approach:
- Mutual Information Analysis: Calculates feature importance using mutual information with target variable
- Feature Redundancy Detection: Groups highly correlated features (correlation > threshold)
- Adaptive Epsilon Allocation: Allocates privacy budget inversely proportional to feature importance
- Correlation-Aware Noise: Adds coordinated noise to correlated feature groups
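The first two steps (importance scoring and redundancy detection) can be sketched as follows. This is a minimal illustration, not the framework's actual code: it assumes scikit-learn's `mutual_info_classif` for the MI scores and a simple greedy pass for correlation grouping; the real logic lives in `anonymization/mu.py` and related modules.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

def correlated_groups(corr, threshold=0.7):
    """Greedily group features whose absolute pairwise correlation exceeds the threshold."""
    n = corr.shape[0]
    assigned, groups = set(), []
    for i in range(n):
        if i in assigned:
            continue
        group = [i] + [j for j in range(i + 1, n)
                       if j not in assigned and abs(corr[i, j]) > threshold]
        assigned.update(group)
        groups.append(group)
    return groups

X, y = make_classification(n_samples=300, n_features=4, n_informative=2, random_state=0)
mi_scores = mutual_info_classif(X, y, random_state=0)      # step 1: MI with the target
importance = mi_scores / mi_scores.sum()                   # normalized for budget allocation
groups = correlated_groups(np.corrcoef(X, rowvar=False))   # step 2: redundancy detection
```

The groups returned here are what the later allocation and noise steps operate on.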
Install dependencies using:
```
pip install -r requirements.txt
```

Project structure:

```
src/
├── main.py               # Main experiment runner with Optuna optimization
├── ml.py                 # Cross-validation and model evaluation
├── file_utils.py         # Dataset loading and preprocessing
└── anonymization/
    ├── anon_main.py      # MIAdaptiveDPAnonymizer main class
    ├── clustering.py     # K-means clustering integration
    ├── mu.py             # Mutual information analysis
    ├── noise_alocation.py  # Adaptive epsilon budget allocation
    └── dp_mechanism.py   # Differential privacy noise mechanisms
```
The framework includes 7 pre-configured datasets:
- adults: Adult income classification (label: 'income')
- bank: Bank marketing campaign (label: 'y')
- ddos: DDoS attack detection (label: 'Label')
- heart: Heart disease prediction (label: 'HeartDisease')
- cmc: Contraceptive method choice (label: 'method')
- mgm: Medical dataset (label: 'severity')
- cahousing: California housing classification (label: 'ocean_proximity')
Run with default parameters on the California housing dataset:

```
python src/main.py
```

Run on a specific dataset with custom parameters:

```
python src/main.py adults --epsilon=1.0 --mi_weight=0.8 --correlation_threshold=0.7
```

Command-line options:

- --epsilon=VALUE: Privacy budget (default: 1.0)
- --mi_weight=VALUE: Weight for MI-based allocation (default: 0.8)
- --correlation_threshold=VALUE: Correlation threshold for grouping (default: 0.7)
- --noise_type=TYPE: 'laplace' or 'gaussian' (default: 'laplace')
- --n_trials=VALUE: Optuna optimization trials per scenario (default: 20)
High privacy protection:

```
python src/main.py heart --epsilon=0.1 --mi_weight=0.9
```

Focus on correlation patterns:

```
python src/main.py adults --correlation_threshold=0.5 --mi_weight=0.5
```

Fast experimentation:

```
python src/main.py mgm --n_trials=10
```

Display available options and datasets:

```
python src/main.py --help
```

The framework evaluates 4 anonymization scenarios for each model:
- No Anonymization: Original training and test data
- Training Only: Anonymized training data, original test data
- Testing Only: Original training data, anonymized test data
- Full Anonymization: Both training and test data anonymized
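The four scenarios are simply the cross product of two flags (anonymize train? anonymize test?). A minimal sketch of that enumeration, where `anonymize` is a hypothetical stand-in for the framework's anonymizer:

```python
from itertools import product

def scenario_grid(X_train, X_test, anonymize):
    """Yield (anon_train, anon_test, Xtr, Xte) for all four train/test combinations."""
    for anon_train, anon_test in product([False, True], repeat=2):
        Xtr = anonymize(X_train) if anon_train else X_train
        Xte = anonymize(X_test) if anon_test else X_test
        yield anon_train, anon_test, Xtr, Xte

# toy usage: an identity "anonymizer" just to enumerate the grid
scenarios = list(scenario_grid([1, 2], [3, 4], lambda X: X))
```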
The system tests 6 different models with automatic hyperparameter optimization:
- K-Nearest Neighbors (KNN)
- Random Forest
- Gaussian Naive Bayes
- Multi-Layer Perceptron (MLP)
- AdaBoost
- Logistic Regression
Two feature selection approaches are automatically tested:
- Chi2: Statistical test-based feature selection
- ExtraTrees: Tree-based feature importance selection
Results are saved in the results/ directory:
mi_adaptive_chi2_<dataset>_eps_<epsilon>_miw_<mi_weight>_<noise_type>_optuna.csv
mi_adaptive_extra_trees_<dataset>_eps_<epsilon>_miw_<mi_weight>_<noise_type>_optuna.csv
Each CSV contains:
- model: ML model name
- anonymized_train/test: Whether training/test data was anonymized
- accuracy/precision/recall/f1_score: Performance metrics
- anon_train_time/anon_test_time: Anonymization processing times
- model_train_time: Model training time
- selected_features: Indices of selected features
- feature_method: Feature selection method used
- num_features: Number of features selected
- best_params: Optimal hyperparameters found by Optuna
The privacy budget allocation follows:
epsilon_i = base_epsilon + adaptive_epsilon * (1 - importance_i * mi_weight)
Where:
- base_epsilon = total_epsilon * min_epsilon_ratio (the minimum privacy budget guaranteed to each feature)
- importance_i is the normalized mutual information score
- mi_weight controls the strength of adaptive allocation
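A worked instance of the allocation, with toy importance values and assuming adaptive_epsilon is the budget left over after the per-feature floor (the exact split is defined in `noise_alocation.py`):

```python
total_epsilon = 1.0
min_epsilon_ratio = 0.2                     # assumed value for illustration
mi_weight = 0.8
importances = [0.5, 0.3, 0.2]               # toy normalized MI scores

base_epsilon = total_epsilon * min_epsilon_ratio      # 0.2 floor per feature
adaptive_epsilon = total_epsilon - base_epsilon       # 0.8 distributed adaptively
eps = [base_epsilon + adaptive_epsilon * (1 - imp * mi_weight) for imp in importances]
# the most important feature (0.5) receives the smallest epsilon, i.e. the strongest noise
```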
For correlated feature groups:
- Calculate group correlation matrix
- Add coordinated noise based on correlation strength
- Apply correlation factor to adjust noise variance
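One way to realize those three steps is to draw Gaussian noise whose covariance mirrors the group's empirical correlation matrix. This is an illustrative sketch of the idea, not the framework's `dp_mechanism.py` implementation:

```python
import numpy as np

def coordinated_gaussian_noise(X_group, scale, rng=None):
    """Add noise whose covariance follows the group's empirical correlation matrix."""
    rng = rng or np.random.default_rng()
    corr = np.corrcoef(X_group, rowvar=False)      # step 1: group correlation matrix
    cov = (scale ** 2) * corr                      # step 3: correlation factor scales variance
    noise = rng.multivariate_normal(np.zeros(corr.shape[0]), cov, size=X_group.shape[0])
    return X_group + noise                         # step 2: coordinated injection

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_noisy = coordinated_gaussian_noise(X, scale=0.5, rng=rng)
```

Because the noise shares the features' correlation structure, perturbed values in a group move together, which preserves inter-feature relationships better than independent per-column noise.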
The framework provides ε-differential privacy with:
- Formal privacy accounting across clusters
- Adaptive budget allocation based on feature utility
- Support for both Laplace and Gaussian noise mechanisms
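For reference, the classic Laplace mechanism adds noise with scale sensitivity/epsilon to achieve ε-differential privacy. A minimal sketch (the framework's own mechanisms live in `dp_mechanism.py`):

```python
import numpy as np

def laplace_mechanism(values, sensitivity, epsilon, rng=None):
    """Add Laplace(0, sensitivity/epsilon) noise: the standard epsilon-DP mechanism."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon      # smaller epsilon -> larger scale -> more noise
    return values + rng.laplace(0.0, scale, size=np.shape(values))

noisy = laplace_mechanism(np.zeros(100_000), sensitivity=1.0, epsilon=1.0,
                          rng=np.random.default_rng(0))
```

The noise is zero-mean, so aggregate statistics remain approximately unbiased while individual values are protected.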
- Optuna Integration: Automatic hyperparameter tuning for all models
- Cross-Validation: 3-fold stratified cross-validation for robust evaluation
- Feature Selection: Reduces dimensionality before anonymization
- Clustering: Improves utility by preserving local data structure