DataElf is an intelligent execution engine for large-scale data workflows. It turns natural-language goals into runnable pipelines, executes built-in tools automatically, and keeps security, traceability, and extensibility in the loop.
The open-source edition is designed for teams that want one framework for data inspection, safety checks, scoring, selection, and domain-oriented tool orchestration without exposing private data-processing infrastructure.
- Translate natural language into executable pipelines.
- Run a single-pass workflow for clear, bounded tasks with lightweight clarification when needed.
- Run a higher-autonomy pilot loop for harder tasks that require planning, execution, judging, repair, and asset derivation.
- Preserve execution artifacts such as pipelines, logs, reports, and intermediate outputs for review.
- Support reusable stable workflow assets through approval and submission.
- Let teams extend the tool layer with custom `BaseTool` implementations.
- Combine data-safety checks with broader data-processing workflows in one CLI-driven system.
DataElf currently ships with these built-in tools:
| Tool | Focus | Docs |
|---|---|---|
| `security_audit` | Safety and risk scanning for datasets | `security_audit_en.md` |
| `data_scoring` | Sample-level quality scoring | `data_scoring_en.md` |
| `data_select` | Budget-aware or cluster-based data selection | `data_select_en.md` |
| `enzyme_acquire` | Enzyme information retrieval workflows | `enzyme_acquire_en.md` |
| `protein_analyzer` | Protein analysis workflows | `protein_analyzer_en.md` |
| `skillrl_skill_extraction` | Skill extraction from trajectory-style data | `skillrl_skill_extraction_en.md` |
Clone the repository and install the package:
```bash
git clone https://github.com/<your-org>/DataElf.git
cd DataElf
pip install -e .
```

If you want all currently built-in tools available in one environment, install the optional dependency groups as well:

```bash
pip install -e ".[scitools,scoring,security_audit]"
```

After installation, the CLI is available as:

```bash
elf --help
```

The repository keeps a single public example config at `config.yaml`. It is written for open-source usage:
- LLM credentials are read from environment variables.
- The open-source build supports `local_file` and `mock` database strategies.
- The example config lists all built-in tools.
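A trimmed sketch of what such a config might contain follows; the key names below are illustrative assumptions rather than the authoritative schema, so consult the shipped `config.yaml` for the real layout:

```yaml
# Illustrative sketch only -- key names are assumptions, not the real schema.
# The shipped config.yaml is the authoritative reference.
llm:
  api_key_env: OPENAI_API_KEY      # credentials are read from the environment
  base_url_env: OPENAI_BASE_URL
database:
  strategy: local_file             # open-source build: local_file or mock
  path: ./test_data
tools:
  - security_audit
  - data_scoring
  - data_select
  - enzyme_acquire
  - protein_analyzer
  - skillrl_skill_extraction
```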
Set environment variables before running:
```bash
export OPENAI_API_KEY="<your-api-key>"
export OPENAI_BASE_URL="https://api.openai.com/v1"
export DATAELF_AGENT_MODEL="gpt-4o-mini"
export DATAELF_TOOL_LLM_MODEL="gpt-4o-mini"
```

The default config uses the local JSON datasets in `./test_data`, so you can point DataElf at the repository fixtures or replace that path with your own dataset directory.
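The record schema a dataset needs depends on the tool that consumes it; as an assumption, a dataset like `alpaca_data` follows the common Alpaca instruction-tuning layout, so a fixture record would look roughly like this (verify against the files in `./test_data`):

```python
# Assumed record shape (common Alpaca format) -- this is a guess, not a
# documented contract; check the fixtures in ./test_data before relying on it.
sample_record = {
    "instruction": "Summarize the following paragraph.",
    "input": "<source text>",
    "output": "<reference summary>",
}
```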
Run a single task:
elf run "run security_audit on security_audit_samples with a custom checker set" -c config.yaml --waitRun a scoring task:
elf run "score the alpaca_data dataset and summarize the highest-quality records" -c config.yaml --waitRun the higher-autonomy pilot loop:
elf pilot "screen alpaca_data for high-value and low-risk samples" -c config.yaml --wait --budget-steps 3Check outputs later:
elf status <job_id>
elf result <job_id> --json --artifacts
elf inspect <job_id|candidate_id|asset_id> --jsonApprove and reuse a promoted workflow asset:
```bash
elf approve <candidate_id>
elf submit <asset_id> -c config.yaml --wait
```

The full CLI surface:

```text
elf run "task" [--config PATH] [--wait] [--verbose]
elf pilot "task" [--config PATH] [--wait] [--budget-steps N] [--allow-experimental-tools]
elf status <job_id>
elf result <job_id> [--json] [--artifacts]
elf approve <candidate_id>
elf submit <asset_id> [--config PATH] [--wait]
elf inspect <job_id|candidate_id|asset_id> [--json]
```

- `run`: single-pass planning and execution for relatively clear tasks.
- `pilot`: iterative planning and repair loop with candidate asset generation.
- `submit`: execute an approved stable pipeline asset.
You can add your own tools by implementing `BaseTool`, wiring the tool into the registry, and listing it in `config.yaml`.
Minimal example:
```python
from typing import Any

from tools import BaseTool, ToolContext


class DomainRiskCheckTool(BaseTool):
    @property
    def name(self) -> str:
        return "domain_risk_check"

    @property
    def description(self) -> str:
        return "Check domain-specific risky patterns in dataset records."

    @property
    def parameters(self) -> dict[str, Any]:
        return {
            "type": "object",
            "properties": {
                "data": {
                    "type": "array",
                    "items": {"type": "object"},
                },
                "mode": {
                    "type": "string",
                    "default": "strict",
                },
            },
            "required": ["data"],
        }

    def run(self, context: ToolContext, **kwargs: Any) -> dict[str, Any]:
        data = kwargs.get("data", [])
        mode = kwargs.get("mode", "strict")
        checked = data  # placeholder: apply your domain-specific checks here
        return {
            "result": checked,
            "metadata": {
                "records_processed": len(data),
                "mode": mode,
            },
            "artifacts": {
                "report_md": "# Domain Risk Report\n",
            },
        }
```

See the developer docs for details.
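The example above defines the tool but does not register it, and the registration call is not shown in this README. The sketch below is therefore hypothetical: the `tools.registry` module and `register_tool` helper are assumptions standing in for whatever the developer docs actually prescribe.

```python
# Hypothetical wiring sketch -- tools.registry and register_tool are assumed
# names, not the confirmed API; follow the developer docs for the real hook.
from tools.registry import register_tool  # assumed location of the registry

from my_package.domain_risk import DomainRiskCheckTool  # your tool module

# Make the tool discoverable, then list "domain_risk_check" under the tool
# section of config.yaml so the planner can select it.
register_tool(DomainRiskCheckTool())
```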
```text
DataElf/
├── ai_data_pilot/   # CLI entry package
├── cli/             # Command implementations
├── agent/           # Agent adapters and prompt building
├── agentic/         # Pilot loop and asset lifecycle
├── runtime/         # Runtime execution layer
├── tools/           # Built-in tools
├── database/        # Open-source database strategies
├── config/          # Config loading
├── docs/            # User and developer documentation
├── test_data/       # Local sample datasets
└── config.yaml      # Public example config
```
