
🐝 Apiary

A distributed data processing framework inspired by bee colony intelligence, designed for small compute that scales to the cloud.


What Is Apiary?

Apiary is a lakehouse for small compute. It's designed to run on inexpensive hardware — Raspberry Pis, NUCs, old laptops — while scaling seamlessly to cloud compute. Users write data via Python, query via SQL, and the swarm distributes work across available nodes with no central controller.

Key Features

  • 🏡 Runs on Small Hardware: Designed for Raspberry Pi and similar devices, not cloud-first systems squeezed onto small hardware
  • ☁️ Object Storage as Truth: Data lives in S3, GCS, MinIO, or local filesystem — nodes are stateless workers
  • 🤝 No Node-to-Node Communication: Nodes coordinate through the storage layer itself, sidestepping much of the usual distributed-systems machinery
  • 🔒 ACID Transactions: Ledger-based transactions using conditional writes (like Delta Lake)
  • 🐝 Biology-Driven Design: Memory budgets, task sizing, and failure recovery governed by bee-inspired behavioral patterns
  • 📊 SQL Queries: Apache DataFusion powers SQL queries over Parquet files
  • 🐍 Python Interface: PyO3-based SDK with zero-copy Arrow interop
  • 🚀 Zero Configuration Multi-Node: Add nodes by connecting to the same storage bucket — no seed nodes, no tokens

What Makes Apiary Different

  1. Small-compute-first design: Not a cloud system squeezed onto a Pi, but a system designed for resource-constrained hardware that scales to the cloud.

  2. Object storage as coordination layer: No consensus protocol, no gossip network, no node-to-node communication. Nodes coordinate through the same storage that holds the data.

  3. Biology-driven resource management: Memory budgets, cell sizing, backpressure, and failure recovery are governed by bee-inspired behavioral patterns — not metaphors, but algorithms.

  4. Zero-configuration multi-node: A second node joins the swarm by connecting to the same storage bucket. No seed nodes, no tokens, no mesh network setup.

  5. Solo to swarm without code changes: A single Pi with local filesystem storage is a fully functional data platform. Switching to S3 and adding nodes is a configuration change, not an architecture change.
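The solo-to-swarm transition can be sketched as nothing more than backend selection from the storage URI. The helper below is a hypothetical illustration, not Apiary's API: the point is that the same code path serves both modes, and only the configuration value changes.

```python
# Hypothetical sketch: the only difference between solo and swarm mode
# is where the storage lives, so backend selection can be a pure
# function of the URI scheme (names here are illustrative).

def select_backend(storage: str) -> str:
    """Map a storage URI to a backend kind, defaulting to local filesystem."""
    if storage.startswith("s3://"):
        return "s3"      # shared bucket -> other nodes can join the swarm
    return "local"       # local path -> solo mode

assert select_backend("./my_database") == "local"
assert select_backend("s3://my-bucket/apiary") == "s3"
```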

Quick Start

Installation

Pre-built Binaries (no build required)

# Linux / macOS (x86_64 or ARM64 / Raspberry Pi)
curl -fsSL https://raw.githubusercontent.com/ApiaryData/apiary/main/scripts/install.sh | bash
# Windows (PowerShell)
irm https://raw.githubusercontent.com/ApiaryData/apiary/main/scripts/install.ps1 | iex

Docker

docker pull ghcr.io/apiarydata/apiary:latest   # or build locally
docker build -t apiary:latest .

Build from Source

Prerequisites: Rust 1.78+, Python 3.9+

git clone https://github.com/ApiaryData/apiary.git
cd apiary
cargo build --workspace
pip install maturin
maturin develop

Hello Apiary

from apiary import Apiary

# Solo mode (local filesystem)
ap = Apiary("my_database")
ap.start()

# Check status
status = ap.status()
print(f"Running on {status['cores']} cores with {status['memory_gb']:.2f}GB memory")

# Shutdown when done
ap.shutdown()

Multi-Node Mode

from apiary import Apiary

# All nodes connect to the same storage
ap = Apiary("production", storage="s3://my-bucket/apiary")
ap.start()

# Nodes automatically discover each other through the storage layer
# No additional configuration needed!
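The discovery mechanism can be sketched with a local directory standing in for the shared bucket. This is illustrative only, not Apiary's implementation: each node publishes a heartbeat file under a shared prefix, and finds its peers simply by listing that prefix — no node ever talks to another node directly.

```python
# Illustrative sketch of storage-layer discovery: nodes write heartbeat
# files to a shared prefix and discover peers by listing it.
import json
import tempfile
import time
from pathlib import Path

def write_heartbeat(root: Path, node_id: str) -> None:
    """Publish this node's liveness as a small JSON file."""
    beats = root / "heartbeats"
    beats.mkdir(exist_ok=True)
    (beats / f"{node_id}.json").write_text(
        json.dumps({"node": node_id, "ts": time.time()})
    )

def discover(root: Path) -> list[str]:
    """List peers by scanning the shared heartbeat prefix."""
    return sorted(p.stem for p in (root / "heartbeats").glob("*.json"))

root = Path(tempfile.mkdtemp())   # stands in for s3://my-bucket/apiary
write_heartbeat(root, "pi-1")
write_heartbeat(root, "pi-2")
assert discover(root) == ["pi-1", "pi-2"]
```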

Architecture Overview

The Beekeeping Metaphor

Concept   Apiary Component   Description
Hive      Database           Top-level logical grouping
Box       Schema             Namespace within a hive
Frame     Table              Queryable dataset
Cell      Parquet file       Physical storage unit
Bee       CPU core           Unit of compute (1 core = 1 bee)
Swarm     Compute mesh       All nodes and their bees
Meadow    Object storage     Where all data lives

Storage Model

Tier 1 — Memory: Arrow RecordBatches in bee chambers (active computation)
Tier 2 — Local Disk: Cache + spill + write buffer
Tier 3 — Object Storage: S3/GCS/MinIO/filesystem (canonical truth)
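The three tiers imply a read path that falls through from fastest to canonical. The sketch below is an assumption about how such a path could look, with plain dicts standing in for each tier; the names are illustrative, not Apiary's internals.

```python
# Illustrative read path across the three tiers: check memory, then the
# local disk cache, then fall back to object storage (canonical truth),
# warming the faster tiers on the way back up.

def read_cell(cell_id, memory, disk_cache, object_store):
    if cell_id in memory:             # Tier 1: active Arrow batches
        return memory[cell_id]
    if cell_id in disk_cache:         # Tier 2: local cache / spill
        data = disk_cache[cell_id]
    else:                             # Tier 3: canonical truth
        data = object_store[cell_id]
        disk_cache[cell_id] = data    # warm the disk cache for next time
    memory[cell_id] = data
    return data

store = {"cell-42": b"parquet-bytes"}
mem, disk = {}, {}
assert read_cell("cell-42", mem, disk, store) == b"parquet-bytes"
assert "cell-42" in disk   # cached locally after the first read
```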

Key Design Principles

  • Object storage is canonical: All committed data, metadata, and coordination state lives in object storage
  • Conditional writes for serialization: One write succeeds, the other retries — no Raft needed
  • 1 core = 1 bee: Each virtual core is an independent unit with its own memory budget
  • Biology is the runtime: Mason bee isolation, leafcutter sizing, abandonment, and colony temperature are not metaphors but algorithms
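The conditional-write principle can be sketched with a dict standing in for an object store that supports put-if-absent semantics (as S3-style conditional PUTs do). This is illustrative, not Apiary's code: two writers that observed the same ledger version race for it; exactly one wins, and the loser retries at the next version — no Raft, no leader.

```python
# Sketch of optimistic ledger commits via conditional writes. The dict
# models an object store with atomic put-if-absent (e.g. a conditional PUT).

class Ledger:
    def __init__(self):
        self.entries = {}  # version -> committed transaction

    def put_if_absent(self, version: int, txn: str) -> bool:
        """Atomic conditional write: rejected if the version already exists."""
        if version in self.entries:
            return False
        self.entries[version] = txn
        return True

def commit(ledger: Ledger, txn: str, observed_version: int) -> int:
    version = observed_version
    while not ledger.put_if_absent(version, txn):
        version += 1   # lost the race: retry at the next ledger version
    return version

ledger = Ledger()
# Both writers read the ledger at version 0, then commit concurrently.
assert commit(ledger, "writer-A", observed_version=0) == 0
assert commit(ledger, "writer-B", observed_version=0) == 1  # retried once
```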

Development Status

Apiary is in active development. See BUILD_STATUS.md for detailed progress and BENCHMARKS.md for performance metrics.

Step   Component                      Status
1      Skeleton + StorageBackend      ✅ Complete
2      Registry + Namespace           ✅ Complete
3      Ledger + Cell Storage          ✅ Complete
4      DataFusion Integration         ✅ Complete
5      Mason Bee Isolation            ✅ Complete
6      Heartbeat + World View         ✅ Complete
7      Distributed Query Execution    ✅ Complete
8      Local Cell Cache               ✅ Complete
9      Behavioral Model               ✅ Complete
10     Testing + Hardening            ✅ Complete

Current Capabilities (Step 10 Complete — v1 Release Candidate)

  • ✅ Rust workspace with 6 crates

  • ✅ Python SDK via PyO3

  • ✅ LocalBackend (filesystem storage)

  • ✅ S3Backend (S3-compatible object storage)

  • ✅ StorageBackend trait with atomic operations

  • ✅ Node configuration with resource auto-detection

  • ✅ Typed identifiers (HiveId, BoxId, FrameId, TaskId, etc.)

  • ✅ Registry with DDL operations (create/list hives, boxes, frames)

  • ✅ Dual terminology (bee-themed and traditional database naming)

  • ✅ Transaction ledger with optimistic concurrency

  • ✅ Parquet cell writing with LZ4 compression

  • ✅ Cell-level statistics for query pruning

  • ✅ Partitioning with partition pruning on read

  • ✅ Leafcutter cell sizing

  • ✅ Schema validation (null partition rejection)

  • ✅ Frame overwrite (atomic cell replacement)

  • ✅ Ledger checkpointing

  • ✅ SQL queries via Apache DataFusion

  • ✅ Custom SQL commands (USE, SHOW, DESCRIBE)

  • ✅ Cell pruning from WHERE predicates

  • ✅ Projection pushdown via DataFusion

  • ✅ Aggregation (GROUP BY, AVG, SUM, COUNT, MIN, MAX)

  • ✅ DML blocking (DELETE/UPDATE with clear error messages)

  • ✅ Mason bee sealed chambers (memory-budgeted isolated execution per bee)

  • ✅ BeePool with task queuing and concurrent execution

  • ✅ Task timeout enforcement

  • ✅ Scratch directory isolation per bee

  • ✅ SQL queries routed through BeePool

  • ✅ Heartbeat writer (background task writing node status to storage)

  • ✅ World view builder (discovers all nodes via heartbeat polling)

  • ✅ Node state detection (Alive, Suspect, Dead based on heartbeat age)

  • ✅ Graceful departure (heartbeat file deleted on shutdown)

  • ✅ Stale heartbeat cleanup (dead nodes cleaned after threshold)

  • ✅ swarm_status() Python API for swarm visibility

  • ✅ Solo mode works as a swarm of one (zero special-casing)

  • ✅ Distributed query planner (cache-aware, capacity-based cell assignment)

  • ✅ Query coordinator (manifest writing, task execution, result merging)

  • ✅ Worker task poller (background task polling for query manifests)

  • ✅ Storage-based coordination (query manifests via object storage)

  • ✅ Partial result exchange (Arrow IPC format)

  • ✅ Transparent distribution (single-node fallback for small queries)

  • ✅ Local cell cache (LRU eviction, 2GB default)

  • ✅ Cache reporting in heartbeats (enables cache-aware planning)

  • ✅ Cache-aware distributed query planning (preferential assignment)

  • ✅ Colony temperature measurement (composite system health metric)

  • ✅ Temperature regulation classification (cold/ideal/warm/hot/critical)

  • ✅ Task abandonment tracker (retry logic with trial limits)

  • ✅ Behavioral model Python API (colony_status method)

  • ✅ Integration tests (solo mode, multi-node, mason isolation, concurrent writes, backpressure, chaos)

  • ✅ Documentation (getting started, concepts, Python SDK, SQL reference, architecture summary)
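One capability from the list above — the local cell cache with LRU eviction — can be sketched as follows. This is illustrative, not Apiary's implementation: the 2 GB default is shrunk to a tiny budget so eviction is visible, and sizes are tracked per cell so the cache evicts the least recently used entries once the budget is exceeded.

```python
# Illustrative LRU cell cache with a byte budget (not Apiary's code).
from collections import OrderedDict

class CellCache:
    def __init__(self, budget: int):
        self.budget = budget
        self.cells = OrderedDict()   # cell_id -> size, oldest first
        self.used = 0

    def put(self, cell_id: str, size: int) -> None:
        if cell_id in self.cells:
            self.used -= self.cells.pop(cell_id)
        self.cells[cell_id] = size
        self.used += size
        while self.used > self.budget:       # evict least recently used
            _, evicted_size = self.cells.popitem(last=False)
            self.used -= evicted_size

    def get(self, cell_id: str) -> bool:
        if cell_id in self.cells:
            self.cells.move_to_end(cell_id)  # mark as recently used
            return True
        return False

cache = CellCache(budget=100)
cache.put("a", 60)
cache.put("b", 60)          # exceeds the budget -> "a" is evicted
assert not cache.get("a")
assert cache.get("b")
```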

Project Structure

apiary/
├── crates/
│   ├── apiary-core/       # Core types and traits
│   ├── apiary-storage/    # Storage backends
│   ├── apiary-runtime/    # Node runtime
│   ├── apiary-query/      # DataFusion SQL engine
│   ├── apiary-python/     # PyO3 bindings
│   └── apiary-cli/        # Command-line interface
├── python/                # Python package source
├── docs/                  # Documentation
│   ├── architecture/      # Design documentation
│   └── development/       # Step completion records
├── tests/                 # Acceptance tests
├── deploy/                # Raspberry Pi docker-compose configs
└── scripts/               # Benchmark and utility scripts

Technology Stack

Component        Technology           Why
Runtime          Rust                 Memory safety, zero-cost abstractions, ARM64 cross-compilation
Python Bridge    PyO3 + maturin       Zero-copy Arrow interop, native wheels
SQL Engine       Apache DataFusion    Rust-native, Arrow-native, extensible
Storage Format   Apache Parquet       Columnar, compressed, universal
In-Memory        Apache Arrow         Zero-copy, columnar, cross-language
Object Storage   S3 API               Universal, battle-tested
Async Runtime    Tokio                Standard Rust async

Development

Build

# Build all Rust crates
cargo build --workspace

# Run tests
cargo test --workspace

# Run linter
cargo clippy --workspace

# Build Python package
maturin develop

Run Tests

# Rust tests
cargo test --workspace

# Python acceptance tests
python tests/test_step1_acceptance.py

Contributing

Apiary is in active development. Contributions are welcome once the v1 core is established. For now, watch the repository and join the conversation in issues.

Roadmap

v1 — Prove the Core (Release Candidate)

  • Single-node and multi-node operation
  • Python SDK with data writing
  • SQL queries via DataFusion
  • ACID transactions via conditional writes
  • Distributed query execution
  • Mason bee isolation and leafcutter sizing

v2 — Direct Communication

  • SWIM (Scalable Weakly-consistent Infection-style Membership) gossip for sub-second failure detection
  • Arrow Flight for low-latency data shuffles
  • Full 20-behavior biological model
  • Streaming ingestion
  • Time travel queries

v3 — Enterprise & Federation

  • Multi-apiary federation
  • Regulatory compliance
  • Data lineage
  • Advanced access control

See docs/architecture/06-roadmap.md for details.

Documentation

Getting Started

Deployment Guides

API & SQL

Performance & Testing

Architecture

License

Apache License 2.0 - See LICENSE for details.

Acknowledgments

Apiary draws inspiration from:

  • The biological intelligence of bee colonies
  • Apache Arrow ecosystem (DataFusion, Parquet, object_store)
  • Delta Lake's transaction log design
  • Modern lakehouse architectures

Status: v1 Release Candidate. All 10 steps complete. 🐝
