PrivFusion: LLM-Powered Heterogeneous Data Consolidation

Official Implementation of "PrivFusion: A Privacy-preserving Multi-Agent Framework for Harmonizing Distributed Datasets"

📄 Paper

📋 Table of Contents

Overview
Key Features
Installation
Quick Start
Configuration
Architecture
Running Experiments
Project Structure
Citation
Contributing
License
Acknowledgments

Overview

PrivFusion is a novel framework that leverages Large Language Models (LLMs) to automatically align and consolidate heterogeneous tabular datasets with overlapping but differently structured features. This repository contains the official implementation of our paper.

Problem Statement

Organizations often need to consolidate multiple datasets from different sources that describe similar entities but use different schemas, naming conventions, and data representations. Traditional approaches require extensive manual effort and domain expertise.

Our Solution

PrivFusion uses LLMs to:

Semantically cluster similar features across datasets
Normalize feature representations to a unified schema
Generate transformation code to align data values
Validate transformations with comprehensive quality metrics
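
To make the last two steps more concrete, here is a hypothetical example of the kind of transformation code such a pipeline might generate, together with a minimal way to check it against sample values. The function name `transform`, the value mapping, and the `run_generated_code` helper are illustrative assumptions, not PrivFusion's actual output or API.

```python
# Hypothetical LLM output: source code that maps one dataset's encoding
# of a "sex" column onto a unified vocabulary. Illustrative only.
GENERATED_CODE = '''
def transform(value):
    mapping = {"M": "male", "F": "female", "m": "male", "f": "female"}
    return mapping.get(str(value).strip(), "unknown")
'''


def run_generated_code(source: str, samples: list) -> tuple[list, float]:
    """Execute generated source and apply its `transform` to each sample.

    Returns the transformed values and the fraction of samples that were
    transformed without raising an exception.
    """
    namespace: dict = {}
    exec(source, namespace)  # only run code you trust or have sandboxed
    transform = namespace["transform"]

    outputs, successes = [], 0
    for sample in samples:
        try:
            outputs.append(transform(sample))
            successes += 1
        except Exception:
            outputs.append(None)  # keep positions aligned with the inputs
    success_rate = successes / len(samples) if samples else 0.0
    return outputs, success_rate


values, rate = run_generated_code(GENERATED_CODE, ["M", "F", " f ", "X"])
print(values, rate)
```

Running generated code on a handful of sampled values before applying it to a full column is one simple way to catch code that fails to execute, which is the idea behind a "code execution success rate".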

Key Contributions

🎯 Novel LLM-based approach for automated dataset consolidation
🔄 End-to-end pipeline from feature clustering to code generation
📊 Comprehensive evaluation with fidelity, privacy, and statistical metrics
🔌 Flexible architecture supporting multiple LLM backends
🎓 Semantic type detection using DBpedia ontology

Key Features

🤖 LLM-Powered Analysis: Leverages state-of-the-art language models for semantic understanding
🔄 Automated Transformations: Generates Python code to transform data between schemas
📊 Multiple Metrics: Includes fidelity, privacy, and statistical metrics for data quality assessment
🔌 Flexible LLM Backends: Supports WatsonX, Ollama, and custom LLM endpoints
🎯 Semantic Type Detection: Automatically identifies and maps semantic types using DBpedia URIs
📝 Experiment Tracking: YAML-based configuration for reproducible experiments

Installation

Prerequisites

Python 3.11 or higher
uv (recommended) or pip

Setup

# Clone the repository
git clone https://github.com/IBM/privfusion.git
cd privfusion

# Using uv (recommended - fastest)
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv sync --extra dev  # Installs all dependencies including dev extras
pre-commit install

# Or using uv pip
uv pip install -e ".[dev]"  # Quotes keep shells such as zsh from globbing the brackets
pre-commit install

# Or using pip
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e ".[dev]"  # Quotes keep shells such as zsh from globbing the brackets
pre-commit install

Environment Configuration

Create a .env file in the project root (see .env.example):

# For WatsonX
watsonx_apikey=your_watsonx_api_key
watsonx_project_id=your_project_id

# For RITS endpoint (optional)
rits_api_key=your_rits_api_key
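
Before starting a long run it can help to confirm that the expected keys are actually set. The sketch below uses only the standard library and is an assumption about how one might check the file; the README does not say how PrivFusion itself loads `.env`, and the helpers `parse_env` and `missing_keys` are hypothetical names.

```python
import os
from pathlib import Path

# Key names taken from the .env example above.
KNOWN_KEYS = ("watsonx_apikey", "watsonx_project_id", "rits_api_key")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


env_file = Path(".env")
env = parse_env(env_file.read_text()) if env_file.exists() else {}
# Variables already exported in the shell take precedence over the file.
env.update({k: v for k, v in os.environ.items() if k in KNOWN_KEYS})
print("Missing WatsonX settings:",
      missing_keys(env, ["watsonx_apikey", "watsonx_project_id"]))
```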

Quick Start

Basic Usage

from privfusion.consolidater import Consolidator
from privfusion.agents.llms import OllamaLLM
from privfusion.dataset_analyzer import DatasetAnalyzer
import pandas as pd

# Initialize LLM
llm = OllamaLLM(model_name="llama3.2", temperature=0)

# Load each dataset once and reuse the DataFrames below
df1 = pd.read_csv("data/dataset1.csv")
df2 = pd.read_csv("data/dataset2.csv")

# Analyze datasets
analyzer = DatasetAnalyzer(llm)
dataset1_info = analyzer.analyze(df1)
dataset2_info = analyzer.analyze(df2)

# Prepare datasets: each entry pairs the raw data with its analysis
datasets = {
    "dataset1": {"data": df1, "info": dataset1_info},
    "dataset2": {"data": df2, "info": dataset2_info},
}

# Run consolidation
consolidator = Consolidator()
result = consolidator.consolidate(datasets, llm)

# View results
print(result[['dataset', 'feature_name', 'cluster_id', 'norm_feature_name']])
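
The printed columns suggest one row per source feature, with `cluster_id` tying together features judged equivalent and `norm_feature_name` giving the unified name. Assuming that layout, the rows can be turned into per-dataset rename maps. The rows below are made up for illustration and are not real output; `rename_maps` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up rows mirroring the four columns printed above.
rows = [
    {"dataset": "dataset1", "feature_name": "dob",
     "cluster_id": 0, "norm_feature_name": "birth_date"},
    {"dataset": "dataset2", "feature_name": "BirthDate",
     "cluster_id": 0, "norm_feature_name": "birth_date"},
    {"dataset": "dataset1", "feature_name": "sex",
     "cluster_id": 1, "norm_feature_name": "gender"},
]


def rename_maps(rows: list[dict]) -> dict[str, dict[str, str]]:
    """Build {dataset: {original_name: normalized_name}} from result rows."""
    maps: dict[str, dict[str, str]] = defaultdict(dict)
    for row in rows:
        maps[row["dataset"]][row["feature_name"]] = row["norm_feature_name"]
    return dict(maps)


maps = rename_maps(rows)
print(maps)
```

If `result` is a pandas DataFrame, `result.to_dict("records")` yields rows of this shape, and each map can be passed to `DataFrame.rename(columns=...)` on the matching source dataset.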

Interactive Notebooks

Explore the framework through our Jupyter notebooks:

01-getting-started.ipynb - Introduction and basic usage
02-generate-data.ipynb - Synthetic data generation
03-consolidate.ipynb - Full consolidation workflow
04-run-experiments.ipynb - Running configured experiments
05-show-experiments.ipynb - Analyzing and visualizing experiments

Configuration

Experiments are configured using YAML files. Example structure:

datasets:
  - name: dataset1
    path: data/dataset1.csv
  - name: dataset2
    path: data/dataset2.csv

cluster:
  system_prompt: >
    Analyze and cluster semantically similar features...
  llm: privfusion.agents.llms.OllamaLLM
  args:
    model_name: llama3.2
  kwargs:
    temperature: 0
    max_tokens: 5000

normalize:
  system_prompt: >
    Normalize clustered features to unified schema...
  llm: privfusion.agents.llms.OllamaLLM
  number_samples: 5

transform:
  system_prompt: >
    Generate transformation code...
  llm: privfusion.agents.llms.OllamaLLM
  number_samples: 5

experiment:
  max_iter: 3

See configs/README.md for detailed configuration options.
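
The `llm` entries in the example are dotted import paths, and `args` and `kwargs` are both written as mappings. One plausible way to turn such a section into an object is sketched below; this is an assumption about how the configuration could be consumed, not the repository's loader, and `load_class` and `build_from_section` are hypothetical names. A standard-library class stands in for an LLM class so the sketch runs without PrivFusion installed.

```python
import importlib
from typing import Any


def load_class(dotted_path: str) -> type:
    """Import `package.module.ClassName` and return the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected a dotted path, got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def build_from_section(section: dict[str, Any]) -> Any:
    """Instantiate the class named under `llm`, merging `args` and `kwargs`."""
    cls = load_class(section["llm"])
    params = {**section.get("args", {}), **section.get("kwargs", {})}
    return cls(**params)


# Stand-in section: fractions.Fraction replaces an LLM class here.
section = {
    "llm": "fractions.Fraction",
    "args": {"numerator": 3},
    "kwargs": {"denominator": 4},
}
obj = build_from_section(section)
print(obj)
```

Keeping the class path in the YAML file means a different backend can be tried by editing the configuration rather than the code.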

Architecture

Core Components

PrivFusion Pipeline
├── DatasetAnalyzer      # Extract semantic & structural information
├── AgentCluster         # Cluster similar features across datasets
├── AgentNorm            # Normalize to unified schema
├── AgentCode            # Generate transformation code
└── Consolidator         # Orchestrate the pipeline

Component Details

DatasetAnalyzer: Extracts semantic types using DBpedia, analyzes data distributions
AgentCluster: Uses LLMs to identify semantically similar features across datasets
AgentNorm: Normalizes feature names, types, and value structures
AgentCode: Generates Python transformation code with validation
Consolidator: Manages the end-to-end consolidation workflow
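
The order in which these components hand data to each other can be sketched with plain functions. Everything below is a stand-in under stated assumptions: string matching replaces the LLM calls, the function names are hypothetical, and the code-generation stage (AgentCode) is left out. It only illustrates the data flow from analysis to clustering to normalization.

```python
import re
from dataclasses import dataclass


@dataclass
class Feature:
    dataset: str
    name: str


def analyze(datasets: dict[str, list[str]]) -> list[Feature]:
    """Stand-in for DatasetAnalyzer: list every column of every dataset."""
    return [Feature(ds, col) for ds, cols in datasets.items() for col in cols]


def canonical(name: str) -> str:
    """Lowercase and drop non-alphanumerics, so 'Birth_Date' -> 'birthdate'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def cluster(features: list[Feature]) -> dict[str, list[Feature]]:
    """Stand-in for AgentCluster: group features by canonical name."""
    clusters: dict[str, list[Feature]] = {}
    for feature in features:
        clusters.setdefault(canonical(feature.name), []).append(feature)
    return clusters


def normalize(clusters: dict[str, list[Feature]]) -> list[dict]:
    """Stand-in for AgentNorm: assign one unified name per cluster."""
    rows = []
    for cluster_id, (norm_name, members) in enumerate(clusters.items()):
        for feature in members:
            rows.append({
                "dataset": feature.dataset,
                "feature_name": feature.name,
                "cluster_id": cluster_id,
                "norm_feature_name": norm_name,
            })
    return rows


def consolidate(datasets: dict[str, list[str]]) -> list[dict]:
    """Stand-in for Consolidator: run the stages in order."""
    return normalize(cluster(analyze(datasets)))


rows = consolidate({
    "dataset1": ["Birth_Date", "Sex"],
    "dataset2": ["birthdate", "country"],
})
for row in rows:
    print(row)
```

A name-matching heuristic like this misses pairs such as `sex` and `gender`, which is the kind of case the LLM-based agents are meant to handle.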

LLM Backends

WatsonXLLM: IBM WatsonX AI integration
OllamaLLM: Local Ollama models (Llama3.2, Mistral, etc.)
RITSLLM: Custom RITS endpoint support
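
Supporting several backends usually means the agents depend on a small shared interface rather than on a concrete class. The README does not document that interface, so the sketch below is hypothetical: the `TextLLM` protocol, the `CannedLLM` test double, and `same_feature` are invented for illustration, and the method name `invoke` is borrowed from LangChain's convention.

```python
from typing import Protocol


class TextLLM(Protocol):
    """Hypothetical minimal interface: prompt in, completion text out."""

    def invoke(self, prompt: str) -> str: ...


class CannedLLM:
    """Test double that returns pre-recorded answers instead of calling a model."""

    def __init__(self, answers: dict[str, str], default: str = "") -> None:
        self.answers = answers
        self.default = default
        self.calls: list[str] = []

    def invoke(self, prompt: str) -> str:
        self.calls.append(prompt)
        for keyword, answer in self.answers.items():
            if keyword in prompt:
                return answer
        return self.default


def same_feature(llm: TextLLM, a: str, b: str) -> bool:
    """Ask the backend whether two column names describe the same feature."""
    reply = llm.invoke(
        f"Do the columns '{a}' and '{b}' mean the same thing? Answer yes or no."
    )
    return reply.strip().lower().startswith("yes")


llm = CannedLLM({"dob": "Yes"}, default="No")
first = same_feature(llm, "dob", "birth_date")
second = same_feature(llm, "sex", "country")
print(first, second)
```

A canned backend of this kind lets pipeline logic be unit-tested offline, without API keys or a running Ollama server.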

Metrics

The framework includes comprehensive evaluation metrics:

Fidelity Metrics: Measure preservation of data patterns and relationships
Privacy Metrics: Assess privacy preservation during consolidation
Statistical Metrics: Compare statistical properties between original and consolidated data
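
As an illustration of what a statistical comparison between original and consolidated data can look like, the sketch below computes the total variation distance between two categorical columns and the shift in mean and standard deviation of a numeric column. These are generic measures chosen for the example; they are not necessarily the metrics implemented in this repository.

```python
from collections import Counter
from statistics import mean, pstdev


def total_variation_distance(original: list, consolidated: list) -> float:
    """Half the L1 distance between two empirical categorical distributions.

    0.0 means identical category frequencies; 1.0 means disjoint support.
    """
    if not original or not consolidated:
        raise ValueError("Both columns must be non-empty")
    p, q = Counter(original), Counter(consolidated)
    n_p, n_q = len(original), len(consolidated)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)


def summary_shift(original: list[float], consolidated: list[float]) -> dict[str, float]:
    """Absolute change in mean and population standard deviation."""
    return {
        "mean_shift": abs(mean(original) - mean(consolidated)),
        "std_shift": abs(pstdev(original) - pstdev(consolidated)),
    }


before = ["M", "F", "F", "M"]
after = ["male", "female", "female", "male"]
# Map the unified labels back to the original vocabulary before comparing.
relabelled = [{"male": "M", "female": "F"}[v] for v in after]
tvd = total_variation_distance(before, relabelled)
shift = summary_shift([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(tvd, shift)
```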

Running Experiments

Available Datasets

The framework includes several datasets for experimentation (available in data/):

COVID-19 Datasets
- covid19-dataset.csv - Global COVID-19 statistics
- covid19-indonesia.csv - Indonesia-specific COVID-19 data
- covid_19_indonesia_time_series_all.csv - Time series data
- covid19_italy_province.csv - Italy provincial data
Adult Income Dataset
- adult_dataset/adult.csv - UCI Adult dataset

Running Experiments

# Run a specific experiment configuration
python -m notebooks.04-run-experiments --config configs/experiment_1.yaml

# Run all experiments
for config in configs/experiment_*.yaml; do
    python -m notebooks.04-run-experiments --config "$config"
done

# Analyze experiments
python -m notebooks.05-show-experiments

Pre-configured Experiments

Pre-configured experiments are available in configs/:

experiment_1.yaml - COVID-19 global consolidation
experiment_2.yaml - COVID-19 regional analysis
experiment_3.yaml - Multi-source COVID-19 integration
experiment_4.yaml - Adult dataset experiments
experiment_5.yaml through experiment_7.yaml - Ablation studies

Evaluation Methodology

The framework evaluates consolidation quality through:

Clustering Quality
- Precision, recall, F1-score
- Semantic similarity scores
Normalization Accuracy
- Schema alignment correctness
- Type mapping accuracy
Transformation Quality
- Code execution success rate
- Data fidelity preservation
- Statistical distribution similarity
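
One common way to compute precision, recall, and F1 for a clustering is to compare pairs of features: a pair counts as a hit when both the predicted and the reference clustering place the two features together. The sketch below uses that pairwise definition as an assumption; the paper's exact definition may differ, and the feature names are made up.

```python
from itertools import combinations


def co_clustered_pairs(assignment: dict[str, int]) -> set[frozenset]:
    """All unordered feature pairs that share a cluster label."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(assignment), 2)
        if assignment[a] == assignment[b]
    }


def pairwise_prf(predicted: dict[str, int],
                 truth: dict[str, int]) -> tuple[float, float, float]:
    """Pairwise precision, recall, and F1 of a predicted clustering."""
    pred_pairs = co_clustered_pairs(predicted)
    true_pairs = co_clustered_pairs(truth)
    hits = len(pred_pairs & true_pairs)
    precision = hits / len(pred_pairs) if pred_pairs else 0.0
    recall = hits / len(true_pairs) if true_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truth = {"d1.dob": 0, "d2.birth_date": 0, "d1.sex": 1, "d2.gender": 1}
predicted = {"d1.dob": 0, "d2.birth_date": 0, "d1.sex": 1, "d2.gender": 2}
precision, recall, f1 = pairwise_prf(predicted, truth)
print(precision, recall, f1)
```

In this example the prediction finds one of the two true pairs and proposes no wrong ones, so precision is perfect while recall is one half.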

Project Structure

privfusion/
├── configs/              # Experiment configurations
├── data/                 # Datasets
├── notebooks/            # Jupyter notebooks
├── src/
│   ├── metrics/          # Evaluation metrics
│   ├── privfusion/       # Core framework
│   │   ├── agents/       # LLM agents
│   │   │   ├── agent_cluster.py
│   │   │   ├── agent_norm.py
│   │   │   ├── agent_code.py
│   │   │   └── llms.py
│   │   ├── consolidater.py
│   │   ├── dataset_analyzer.py
│   │   ├── data_models.py
│   │   └── utils/
│   └── tabular_data/     # Synthetic data generation
├── tests/                # Unit tests
└── pyproject.toml        # Dependencies

Citation

If you use PrivFusion in your research, please cite our paper:

@article{privfusion2026,
  title={{PrivFusion}: A Privacy-preserving Multi-Agent Framework for Harmonizing Distributed Datasets},
  author={Anisa Halimi and Liubov Nedoshivina and Kieran Fraser and Stefano Braghin},
  journal={arXiv preprint arXiv:2605.24249},
  year={2026},
}

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Development Workflow

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes
Run tests: pytest
Run linters: ruff check src/ && ruff format --check src/ && mypy src/
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run specific test
pytest tests/test_mapping.py -v

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

This project is partly supported by the Innovative Health Initiative Joint Undertaking (IHI JU) under grant agreement No. 101172997 – SEARCH.
Built with LangChain for LLM orchestration
Uses READI for semantic type detection
Powered by IBM WatsonX, Ollama, and other LLM providers
COVID-19 datasets from Our World in Data
Adult dataset from UCI Machine Learning Repository

Contact

For questions about the paper or code:

Issues: GitHub Issues

Note: This is research code. For production use, additional testing and optimization may be required.

--

Built with ❤️ by IBM Research

