A distributed data processing framework inspired by bee colony intelligence, designed for small compute that scales to the cloud.
Apiary is a lakehouse for small compute. It's designed to run on inexpensive hardware — Raspberry Pis, NUCs, old laptops — while scaling seamlessly to cloud compute. Users write data via Python, query via SQL, and the swarm distributes work across available nodes with no central controller.
- 🏡 Runs on Small Hardware: Designed for Raspberry Pi and similar devices, not cloud-first systems squeezed onto small hardware
- ☁️ Object Storage as Truth: Data lives in S3, GCS, MinIO, or local filesystem — nodes are stateless workers
- 🤝 No Node-to-Node Communication: Nodes coordinate through the storage layer itself, eliminating distributed systems complexity
- 🔒 ACID Transactions: Ledger-based transactions using conditional writes (like Delta Lake)
- 🐝 Biology-Driven Design: Memory budgets, task sizing, and failure recovery governed by bee-inspired behavioral patterns
- 📊 SQL Queries: Apache DataFusion powers SQL queries over Parquet files
- 🐍 Python Interface: PyO3-based SDK with zero-copy Arrow interop
- 🚀 Zero Configuration Multi-Node: Add nodes by connecting to the same storage bucket — no seed nodes, no tokens
- Small-compute-first design: Not a cloud system squeezed onto a Pi, but a system designed for resource-constrained hardware that scales to the cloud.
- Object storage as coordination layer: No consensus protocol, no gossip network, no node-to-node communication. Nodes coordinate through the same storage that holds the data.
- Biology-driven resource management: Memory budgets, cell sizing, backpressure, and failure recovery are governed by bee-inspired behavioral patterns — not metaphors, but algorithms.
- Zero-configuration multi-node: A second node joins the swarm by connecting to the same storage bucket. No seed nodes, no tokens, no mesh network setup.
- Solo to swarm without code changes: A single Pi with local filesystem storage is a fully functional data platform. Switching to S3 and adding nodes is a configuration change, not an architecture change.
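To make the coordination model concrete, here is a minimal Python sketch of the idea — not Apiary's actual code or file layout. A local directory stands in for the bucket, and every node writes and polls heartbeat objects under a shared prefix; the file names and fields are illustrative assumptions.

```python
import json
import time
import uuid
from pathlib import Path

# Illustrative only: a filesystem directory stands in for the bucket.
BUCKET = Path("/tmp/apiary-demo/swarm")
BUCKET.mkdir(parents=True, exist_ok=True)

NODE_ID = uuid.uuid4().hex[:8]

def write_heartbeat():
    """Each node periodically writes its status to a well-known prefix."""
    beat = {"node_id": NODE_ID, "cores": 4, "ts": time.time()}
    (BUCKET / f"heartbeat-{NODE_ID}.json").write_text(json.dumps(beat))

def discover_nodes(dead_after_s=30.0):
    """Any node builds its world view by listing the same prefix."""
    now, alive = time.time(), []
    for f in BUCKET.glob("heartbeat-*.json"):
        beat = json.loads(f.read_text())
        if now - beat["ts"] < dead_after_s:
            alive.append(beat["node_id"])
    return alive

write_heartbeat()
print(discover_nodes())  # every node sees the same membership, no gossip
```

Because membership is derived from shared storage rather than a mesh, a node that can read the bucket already knows the whole swarm.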
```bash
# Linux / macOS (x86_64 or ARM64 / Raspberry Pi)
curl -fsSL https://raw.githubusercontent.com/ApiaryData/apiary/main/scripts/install.sh | bash
```

```powershell
# Windows (PowerShell)
irm https://raw.githubusercontent.com/ApiaryData/apiary/main/scripts/install.ps1 | iex
```

```bash
docker pull ghcr.io/apiarydata/apiary:latest  # or build locally
docker build -t apiary:latest .
```

Prerequisites: Rust 1.78+, Python 3.9+

```bash
git clone https://github.com/ApiaryData/apiary.git
cd apiary
cargo build --workspace
pip install maturin
maturin develop
```

```python
from apiary import Apiary

# Solo mode (local filesystem)
ap = Apiary("my_database")
ap.start()
# Check status
status = ap.status()
print(f"Running on {status['cores']} cores with {status['memory_gb']:.2f}GB memory")
# Shutdown when done
ap.shutdown()
```

```python
from apiary import Apiary

# All nodes connect to the same storage
ap = Apiary("production", storage="s3://my-bucket/apiary")
ap.start()
# Nodes automatically discover each other through the storage layer
# No additional configuration needed!
```

| Concept | Apiary Component | Description |
|---|---|---|
| Hive | Database | Top-level logical grouping |
| Box | Schema | Namespace within a hive |
| Frame | Table | Queryable dataset |
| Cell | Parquet file | Physical storage unit |
| Bee | CPU core | Unit of compute (1 core = 1 bee) |
| Swarm | Compute mesh | All nodes and their bees |
| Meadow | Object storage | Where all data lives |
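To see how these concepts surface in practice, here is a hypothetical end-to-end sketch. The `write` and `sql` method names are illustrative assumptions — the quick-start examples above only demonstrate `start`, `status`, and `shutdown` — so consult the Python SDK docs for the real API.

```python
from apiary import Apiary
import pyarrow as pa  # Arrow interop per the feature list

# Hypothetical sketch: `write` and `sql` are illustrative names.
ap = Apiary("my_database")
ap.start()

readings = pa.table({
    "sensor": ["pi-01", "pi-02", "pi-01"],
    "temp_c": [21.4, 19.8, 22.1],
})
ap.write("sensors.readings", readings)  # frame "readings" in box "sensors"

result = ap.sql(
    "SELECT sensor, AVG(temp_c) FROM sensors.readings GROUP BY sensor"
)
print(result)

ap.shutdown()
```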
- Tier 1 — Memory: Arrow RecordBatches in bee chambers (active computation)
- Tier 2 — Local Disk: Cache + spill + write buffer
- Tier 3 — Object Storage: S3/GCS/MinIO/filesystem (canonical truth)
- Object storage is canonical: All committed data, metadata, and coordination state lives in object storage
- Conditional writes for serialization: One write succeeds, the other retries — no Raft needed
- 1 core = 1 bee: Each virtual core is an independent unit with its own memory budget
- Biology is the runtime: Mason bee isolation, leafcutter sizing, abandonment, and colony temperature are not metaphors but algorithms
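As a rough illustration of the conditional-write principle, the sketch below simulates ledger commits with create-if-absent semantics. The ledger layout and retry policy are assumptions for demonstration, not Apiary's actual format; a local file created in `"x"` mode plays the role of an object-store conditional PUT.

```python
import json
from pathlib import Path

# Illustrative sketch of ledger commits via conditional writes.
LEDGER = Path("/tmp/apiary-demo/ledger")
LEDGER.mkdir(parents=True, exist_ok=True)

def next_version():
    versions = [int(p.stem) for p in LEDGER.glob("*.json")]
    return max(versions, default=-1) + 1

def commit(actions, max_retries=5):
    """Try to claim the next ledger slot; on conflict, re-read and retry."""
    for _ in range(max_retries):
        version = next_version()
        entry = LEDGER / f"{version:020d}.json"
        try:
            # "x" mode = create-if-absent: exactly one concurrent writer
            # wins the slot, the same guarantee a conditional PUT gives.
            with open(entry, "x") as f:
                json.dump(actions, f)
            return version
        except FileExistsError:
            continue  # another node committed first; rebase and retry
    raise RuntimeError("too many concurrent commits")

print(commit({"add": ["cell-0001.parquet"]}))
```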
Apiary is in active development. See BUILD_STATUS.md for detailed progress and BENCHMARKS.md for performance metrics.
| Step | Component | Status |
|---|---|---|
| 1 | Skeleton + StorageBackend | ✅ Complete |
| 2 | Registry + Namespace | ✅ Complete |
| 3 | Ledger + Cell Storage | ✅ Complete |
| 4 | DataFusion Integration | ✅ Complete |
| 5 | Mason Bee Isolation | ✅ Complete |
| 6 | Heartbeat + World View | ✅ Complete |
| 7 | Distributed Query Execution | ✅ Complete |
| 8 | Local Cell Cache | ✅ Complete |
| 9 | Behavioral Model | ✅ Complete |
| 10 | Testing + Hardening | ✅ Complete |
- ✅ Rust workspace with 6 crates
- ✅ Python SDK via PyO3
- ✅ LocalBackend (filesystem storage)
- ✅ S3Backend (S3-compatible object storage)
- ✅ StorageBackend trait with atomic operations
- ✅ Node configuration with resource auto-detection
- ✅ Typed identifiers (HiveId, BoxId, FrameId, TaskId, etc.)
- ✅ Registry with DDL operations (create/list hives, boxes, frames)
- ✅ Dual terminology (bee-themed and traditional database naming)
- ✅ Transaction ledger with optimistic concurrency
- ✅ Parquet cell writing with LZ4 compression
- ✅ Cell-level statistics for query pruning
- ✅ Partitioning with partition pruning on read
- ✅ Leafcutter cell sizing
- ✅ Schema validation (null partition rejection)
- ✅ Frame overwrite (atomic cell replacement)
- ✅ Ledger checkpointing
- ✅ SQL queries via Apache DataFusion
- ✅ Custom SQL commands (USE, SHOW, DESCRIBE)
- ✅ Cell pruning from WHERE predicates
- ✅ Projection pushdown via DataFusion
- ✅ Aggregation (GROUP BY, AVG, SUM, COUNT, MIN, MAX)
- ✅ DML blocking (DELETE/UPDATE with clear error messages)
- ✅ Mason bee sealed chambers (memory-budgeted isolated execution per bee)
- ✅ BeePool with task queuing and concurrent execution
- ✅ Task timeout enforcement
- ✅ Scratch directory isolation per bee
- ✅ SQL queries routed through BeePool
- ✅ Heartbeat writer (background task writing node status to storage)
- ✅ World view builder (discovers all nodes via heartbeat polling)
- ✅ Node state detection (Alive, Suspect, Dead based on heartbeat age)
- ✅ Graceful departure (heartbeat file deleted on shutdown)
- ✅ Stale heartbeat cleanup (dead nodes cleaned after threshold)
- ✅ swarm_status() Python API for swarm visibility
- ✅ Solo mode works as a swarm of one (zero special-casing)
- ✅ Distributed query planner (cache-aware, capacity-based cell assignment)
- ✅ Query coordinator (manifest writing, task execution, result merging)
- ✅ Worker task poller (background task polling for query manifests)
- ✅ Storage-based coordination (query manifests via object storage)
- ✅ Partial result exchange (Arrow IPC format)
- ✅ Transparent distribution (single-node fallback for small queries)
- ✅ Local cell cache (LRU eviction, 2GB default)
- ✅ Cache reporting in heartbeats (enables cache-aware planning)
- ✅ Cache-aware distributed query planning (preferential assignment)
- ✅ Colony temperature measurement (composite system health metric)
- ✅ Temperature regulation classification (cold/ideal/warm/hot/critical)
- ✅ Task abandonment tracker (retry logic with trial limits)
- ✅ Behavioral model Python API (colony_status method; see the sketch after this list)
- ✅ Integration tests (solo mode, multi-node, mason isolation, concurrent writes, backpressure, chaos)
- ✅ Documentation (getting started, concepts, Python SDK, SQL reference, architecture summary)
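A hypothetical peek at the two status APIs mentioned in the list above. The method names `swarm_status()` and `colony_status()` come from the list itself, but the returned fields suggested in the comments are guesses.

```python
from apiary import Apiary

ap = Apiary("my_database")
ap.start()

swarm = ap.swarm_status()
print(swarm)   # e.g. which nodes are Alive/Suspect/Dead, and their caches

colony = ap.colony_status()
print(colony)  # e.g. colony temperature and its cold/ideal/warm/hot band

ap.shutdown()
```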
```
apiary/
├── crates/
│   ├── apiary-core/     # Core types and traits
│   ├── apiary-storage/  # Storage backends
│   ├── apiary-runtime/  # Node runtime
│   ├── apiary-query/    # DataFusion SQL engine
│   ├── apiary-python/   # PyO3 bindings
│   └── apiary-cli/      # Command-line interface
├── python/              # Python package source
├── docs/                # Documentation
│   ├── architecture/    # Design documentation
│   └── development/     # Step completion records
├── tests/               # Acceptance tests
├── deploy/              # Raspberry Pi docker-compose configs
└── scripts/             # Benchmark and utility scripts
```
| Component | Technology | Why |
|---|---|---|
| Runtime | Rust | Memory safety, zero-cost abstractions, ARM64 cross-compilation |
| Python Bridge | PyO3 + maturin | Zero-copy Arrow interop, native wheels |
| SQL Engine | Apache DataFusion | Rust-native, Arrow-native, extensible |
| Storage Format | Apache Parquet | Columnar, compressed, universal |
| In-Memory | Apache Arrow | Zero-copy, columnar, cross-language |
| Object Storage | S3 API | Universal, battle-tested |
| Async Runtime | Tokio | Standard Rust async |
```bash
# Build all Rust crates
cargo build --workspace

# Run tests
cargo test --workspace

# Run linter
cargo clippy --workspace

# Build Python package
maturin develop
```

```bash
# Rust tests
cargo test --workspace

# Python acceptance tests
python tests/test_step1_acceptance.py
```

Apiary is in early development. Contributions are welcome once the v1 core is established. For now, watch the repository and join the conversation in issues.
- Single-node and multi-node operation
- Python SDK with data writing
- SQL queries via DataFusion
- ACID transactions via conditional writes
- Distributed query execution
- Mason bee isolation and leafcutter sizing
- SWIM (Scalable Weakly-consistent Infection-style Membership) gossip for sub-second failure detection
- Arrow Flight for low-latency data shuffles
- Full 20-behavior biological model
- Streaming ingestion
- Time travel queries
- Multi-apiary federation
- Regulatory compliance
- Data lineage
- Advanced access control
See docs/architecture/06-roadmap.md for details.
- Quick Deployment Guide — Fast setup for Pi and containers
- Raspberry Pi Deployment — Complete guide for edge devices
- Cloud Container Deployment — Docker and Kubernetes
- Live Benchmark Results — View latest performance metrics
- Performance Benchmarks — Documentation and baseline metrics
- Running Benchmarks — How to run performance tests
- Raspberry Pi Benchmarking — Pi-specific resource configurations
- Architecture Summary
- Architecture Overview
- Storage Engine
- Swarm Coordination
- Query Execution
- Behavioral Model
- Roadmap
Apache License 2.0 - See LICENSE for details.
Apiary draws inspiration from:
- The biological intelligence of bee colonies
- Apache Arrow ecosystem (DataFusion, Parquet, object_store)
- Delta Lake's transaction log design
- Modern lakehouse architectures
Status: v1 Release Candidate. All 10 steps complete. 🐝