A multi-agent AI pipeline to validate, curate, and safely integrate metadata into single-cell AnnData (.h5ad) datasets.
This system combines LLM-based reasoning with deterministic data processing to ensure accurate, explainable, and scalable metadata integration.
- ✅ Schema validation against the data model
- ✅ Metadata completeness checks
- ✅ Ontology validation (MeSH, BTO, PubChem)
- ✅ LLM-based integration decision with reasoning
- ✅ User-controlled override for appending metadata
- ✅ Safe metadata integration into .h5ad datasets
- ✅ Automated HTML QC report generation
- ✅ GEO accession detection and dataset summarization
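GEO accession detection can be illustrated with a regular expression. This is a minimal sketch, not the repository's implementation: the function names `find_geo_accessions` and `geo_link` are hypothetical, and it assumes series accessions (`GSE` followed by digits) appear in a file name or free-text field.

```python
import re

# GEO series accessions look like "GSE" followed by digits (e.g. GSE12345).
# The lookbehind avoids matching inside a longer alphanumeric token.
GEO_SERIES_PATTERN = re.compile(r"(?<![A-Za-z0-9])GSE\d+")


def find_geo_accessions(text: str) -> list[str]:
    """Return the unique GEO series accessions in `text`, in order of appearance."""
    return list(dict.fromkeys(GEO_SERIES_PATTERN.findall(text)))


def geo_link(accession: str) -> str:
    """Build the public GEO landing-page URL for an accession."""
    return f"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={accession}"


accessions = find_geo_accessions("data/dataset/GSE12345_counts.h5ad")
```

A plain `\b` word boundary would miss file names such as `GSE12345_counts.h5ad`, because the underscore counts as a word character; that is why the sketch uses a lookbehind instead.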
Metadata CSV
↓
Schema Agent
↓
Validation Agent
↓
Ontology Agent
↓
LLM Integration Decision
↓
Append Metadata (optional)
↓
Extract Dataset Summary
↓
Generate HTML Report
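The flow above is a linear chain of stages that share state. A minimal sketch of that idea follows; the stage functions and the `state` dictionary layout are placeholders invented for illustration, and the real agents in `agents/` may be wired together differently.

```python
from typing import Callable

# Each stage takes the shared state dict and returns it (possibly updated).
Stage = Callable[[dict], dict]


def run_pipeline(state: dict, stages: list[tuple[str, Stage]]) -> dict:
    """Run stages in order and record which ones ran under the "log" key."""
    log = []
    for name, stage in stages:
        state = stage(state)
        log.append(name)
    return {**state, "log": log}


# Placeholder stages (hypothetical): real agents would inspect files or call an LLM.
def schema_stage(state: dict) -> dict:
    state["schema_ok"] = "sample" in state["columns"]
    return state


def validation_stage(state: dict) -> dict:
    state["complete"] = all(state["columns"].values())
    return state


result = run_pipeline(
    {"columns": {"sample": True, "tissue": True}},
    [("schema", schema_stage), ("validation", validation_stage)],
)
```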
AI_Assistant/
├── agents/
│   ├── schema_agent.py
│   ├── validation_agent.py
│   ├── ontology_agent.py
│   └── integration_agent.py
│
├── tools/
│   ├── file_tools.py
│   ├── h5ad_tools.py
│   ├── report_tools.py
│   └── integration_decision.py
│
├── data/
│   ├── dataset/
│   │   └── *.h5ad
│   │
│   └── metadata/
│       └── *.csv
│
├── reports/
│   └── validation_report.html
│
├── main.py
├── config.py
├── requirements.txt
└── .env
1. Clone the repository
git clone https://github.com/ElucidataInc/curation_agent.git
cd curation_agent
2. Create environment
conda create -n metadata_agent python=3.11
conda activate metadata_agent
3. Install dependencies
pip install -r requirements.txt
4. Set up environment variables
Update the .env file:
OPENAI_API_KEY=your_api_key_here
▶️ Usage
Run the pipeline:
python main.py
🧪 Example Workflow
STEP 1: Listing metadata files
STEP 2: Schema validation
STEP 3: Metadata validation
STEP 4: Ontology validation
STEP 5: LLM integration decision
Example Decision
Metadata SHOULD NOT be appended
Reason: Drug field missing PubChem ID
Override Option
Override and append anyway? yes
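The decision and override steps can be sketched as follows. In the pipeline the decision comes from an LLM; the `decide` function here is a deterministic stand-in for illustration only, and the column names `drug` and `drug_pubchem_id` are assumptions, not the project's schema.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    should_append: bool
    reason: str


def decide(rows: list[dict]) -> Decision:
    """Stand-in for the LLM decision: block the append when a row names a
    drug but carries no PubChem ID (hypothetical column names)."""
    for row in rows:
        if row.get("drug") and not row.get("drug_pubchem_id"):
            return Decision(False, "Drug field missing PubChem ID")
    return Decision(True, "All checks passed")


def should_append(decision: Decision, override_answer: str = "") -> bool:
    """Append when the decision allows it, or when the user explicitly says yes."""
    if decision.should_append:
        return True
    return override_answer.strip().lower() in {"y", "yes"}


rows = [{"sample": "S1", "drug": "aspirin", "drug_pubchem_id": ""}]
decision = decide(rows)
```

Keeping the override as a separate, explicit step means a negative decision is never bypassed silently: the reason is shown first, and the user has to confirm.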
Append Execution
Using metadata join key: sample
Matched 31 samples between metadata and dataset
Metadata appended successfully
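The log lines above imply two small steps: picking a join key that both tables share, and counting the samples that match. A minimal sketch under those assumptions follows; the helper names and the candidate key list are hypothetical.

```python
def find_join_key(metadata_columns, obs_columns, candidates=("sample", "sample_id")):
    """Return the first candidate column present in both tables, else None."""
    for key in candidates:
        if key in metadata_columns and key in obs_columns:
            return key
    return None


def count_matched_samples(metadata_samples, dataset_samples) -> int:
    """Count distinct sample IDs that appear in both the metadata and the dataset."""
    return len(set(metadata_samples) & set(dataset_samples))


key = find_join_key(["sample", "tissue"], ["sample", "n_genes"])
# The dataset side is cell-level, so sample IDs repeat; sets remove the repeats.
matched = count_matched_samples(["S1", "S2", "S3"], ["S1", "S1", "S2", "S4"])
print(f"Using metadata join key: {key}")
print(f"Matched {matched} samples between metadata and dataset")
```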
🔄 Metadata Integration Logic
Problem
Metadata → sample-level
.h5ad → cell-level
Solution
metadata.sample
↓
adata.obs["sample"]
↓
propagate metadata to all cells
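The propagation step can be sketched with a pandas left join. This is an illustration, not the code in `tools/h5ad_tools.py`: the `obs` table below stands in for `adata.obs`, the function name `propagate_metadata` is hypothetical, and the sample and tissue values are made up.

```python
import pandas as pd

# Cell-level table, standing in for adata.obs (one row per cell).
obs = pd.DataFrame(
    {"sample": ["S1", "S1", "S2", "S3"]},
    index=["cell_1", "cell_2", "cell_3", "cell_4"],
)

# Sample-level metadata (one row per sample).
metadata = pd.DataFrame({"sample": ["S1", "S2"], "tissue": ["liver", "lung"]})


def propagate_metadata(obs: pd.DataFrame, metadata: pd.DataFrame, key: str = "sample") -> pd.DataFrame:
    """Left-join sample-level metadata onto cell-level obs, keeping the cell index."""
    # validate="many_to_one" raises if a sample appears twice in the metadata,
    # which would otherwise duplicate cells.
    merged = obs.merge(metadata, on=key, how="left", validate="many_to_one")
    merged.index = obs.index  # merge resets the index; restore the cell barcodes
    return merged


annotated = propagate_metadata(obs, metadata)
```

Cells whose sample has no metadata row (here `S3`) end up with missing values rather than being dropped. With a real dataset, the result could be assigned back with `adata.obs = propagate_metadata(adata.obs, metadata)` before writing the file; note that columns present in both tables other than the key would get pandas' `_x`/`_y` suffixes and need handling first.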
📊 Output
HTML Report
reports/validation_report.html
Includes:
Dataset summary
GEO accession link
Validation results
Ontology compliance
Integration decision
Append status
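A report with these sections can be assembled from plain strings. The sketch below uses only the standard library and is not the code in `tools/report_tools.py`; the function name `render_report` and the example values are invented for illustration.

```python
from html import escape


def render_report(summary: dict[str, str], checks: dict[str, bool]) -> str:
    """Render a small standalone HTML page from a summary and pass/fail checks."""
    summary_rows = "".join(
        f"<tr><th>{escape(k)}</th><td>{escape(v)}</td></tr>" for k, v in summary.items()
    )
    check_rows = "".join(
        f"<tr><td>{escape(name)}</td><td>{'PASS' if ok else 'FAIL'}</td></tr>"
        for name, ok in checks.items()
    )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        "<title>Validation report</title></head><body>"
        f"<h1>Validation report</h1><table>{summary_rows}</table>"
        f"<h2>Checks</h2><table>{check_rows}</table></body></html>"
    )


# Made-up example values.
html_text = render_report(
    {"Dataset": "example.h5ad", "Cells": "1200"},
    {"Schema validation": True, "Ontology compliance": False},
)
```

Escaping every value with `html.escape` matters here because metadata fields are free text and may contain characters such as `<` or `&`.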
🧠 Design Principles
1. LLM for Reasoning
Schema validation
Ontology reasoning
Integration decision
Report generation
2. Python for Execution
File operations
Dataset modification
Metadata joins
3. Safe Integration
Prevents incorrect metadata append
Requires validation + reasoning
Allows user override
⚠️ Error Handling
The system handles:
Missing .h5ad file
Missing metadata columns
Schema mismatch
Ontology violations
Dataset-metadata mismatch
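Several of these cases can be checked up front, before any agent runs. The following is a hedged sketch of such a pre-flight check; the function name, its parameters, and the error wording are assumptions, and it collects all problems instead of stopping at the first one.

```python
from pathlib import Path


def preflight_errors(h5ad_path, metadata_columns, required_columns,
                     metadata_samples, dataset_samples) -> list[str]:
    """Return a list of human-readable problems; an empty list means all clear."""
    errors = []
    if not Path(h5ad_path).is_file():
        errors.append(f"Missing .h5ad file: {h5ad_path}")
    missing = [c for c in required_columns if c not in metadata_columns]
    if missing:
        errors.append(f"Missing metadata columns: {', '.join(missing)}")
    if not set(metadata_samples) & set(dataset_samples):
        errors.append("Dataset-metadata mismatch: no shared sample IDs")
    return errors


errors = preflight_errors(
    "data/dataset/does_not_exist.h5ad",
    metadata_columns=["sample", "tissue"],
    required_columns=["sample", "tissue", "drug"],
    metadata_samples=["S1", "S2"],
    dataset_samples=["S9"],
)
```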
🔐 Configuration
Set your OpenAI API key in the .env file:
OPENAI_API_KEY=your_api_key_here
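A small guard can fail early when the key is missing or still set to a placeholder. This sketch assumes the `.env` values have already been loaded into the process environment (how `config.py` does that is not stated here); the function name `get_openai_api_key` is hypothetical.

```python
import os

# Placeholder strings used in the setup instructions; treated as "not set".
PLACEHOLDERS = {"your_api_key_here", "your_key_here"}


def get_openai_api_key(env=os.environ) -> str:
    """Return OPENAI_API_KEY, raising a clear error if it is unset or a placeholder."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key in PLACEHOLDERS:
        raise RuntimeError("OPENAI_API_KEY is not set; update the .env file")
    return key
```

Passing the environment as a parameter keeps the check easy to test without touching real credentials.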
📈 Future Improvements
🔹 Polly API Integration
Fetch datasets from Polly workspace
Load data models and curated metadata automatically
🔹 Slack Integration
Send validation notifications
Alert on failures
🔹 Google Calendar Integration
Schedule meetings for unresolved metadata issues
🔹 Auto Metadata Correction
Fix ontology mismatches automatically
Normalize metadata fields
🔹 Visualization
UMAP plots
Cell-type distributions
QC dashboards
👨‍🔬 Use Cases
Single-cell dataset curation
GEO dataset ingestion
Bioinformatics workflows
Metadata standardization
🏁 Summary
This system combines AI reasoning with deterministic data engineering
to create a safe, explainable, and scalable metadata integration pipeline.
⭐ Contributing
Pull requests and suggestions are welcome!
📜 License
MIT License