An Elixir implementation of the Amazon Dynamo (2007) paper. Built on the BEAM, which turns out to be a surprisingly natural fit for a system that needs isolated crash domains, message-passing between processes, and graceful degradation under load.
This is part of a [Paper Implementation] series. The companion blog post walks through the design decisions in detail.
DynamoNode is a distributed key-value store with the following properties:
- Always writable -- writes succeed even when nodes are down, using sloppy quorum and hinted handoff
- Eventually consistent -- all replicas converge after failures resolve
- Conflict-aware -- concurrent writes produce siblings that the client can inspect and merge
- Masterless -- any node can coordinate any request; no single point of failure
- Self-healing -- gossip-based failure detection and automatic read repair keep replicas in sync
lib/dynamo_node/
├── application.ex # OTP Application - supervision tree
├── cluster.ex # libcluster wrapper - node lifecycle events
├── client.ex # Public API for external callers
├── replicator.ex # Handles hinted handoff drain
├── ring/
│ ├── ring.ex # GenServer - consistent hash ring
│ └── config.ex # N/W/R quorum configuration
├── coordinator/
│ ├── coordinator.ex # GenServer - write/read coordination
│ └── quorum.ex # Pure - node selection, quorum logic
├── gossip/
│ ├── gossip.ex # GenServer - periodic peer exchange
│ └── membership.ex # Pure - merge + failure detection FSM
├── storage/
│ ├── storage_engine.ex # GenServer - ETS owner
│ └── hinted_write.ex # Pure - hinted write struct + ETS helpers
└── vector_clock/
├── vector_clock.ex # Pure - clock operations
└── siblings.ex # Pure - sibling reconciliation
Ring -- Consistent hash ring with virtual nodes (128 vnodes per physical node by default). Ring state is published to :persistent_term so preference list lookups require no process communication on the hot path.
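The hot-path lookup described above can be sketched roughly as follows. This is an illustration, not the actual module: the `:persistent_term` key, the sorted `{hash, node}` vnode layout, and the module name are all assumptions.

```elixir
defmodule RingLookupSketch do
  # Hypothetical sketch of a preference-list lookup backed by :persistent_term.
  @ring_key {:dynamo_node, :ring}

  # Publish the sorted list of {hash, node} vnode tuples once, on ring changes.
  def publish(vnodes), do: :persistent_term.put(@ring_key, Enum.sort(vnodes))

  # Read path: no GenServer call -- just a :persistent_term fetch plus a walk
  # around the ring collecting n distinct physical nodes.
  def preference_list(key, n) do
    ring = :persistent_term.get(@ring_key)
    hash = :erlang.phash2(key)

    ring
    |> Enum.split_while(fn {h, _node} -> h < hash end)
    |> then(fn {before, rest} -> rest ++ before end)
    |> Enum.map(fn {_h, node} -> node end)
    |> Enum.uniq()
    |> Enum.take(n)
  end
end
```

Because `:persistent_term.get/1` is a constant-time read with no copying of the term into the caller's heap, every process can resolve a preference list without touching the Ring GenServer.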
Coordinator -- Handles get/put requests, selects write/read nodes, fires concurrent RPCs, and waits for quorum acknowledgments using Task.yield_many. The coordinator for a key is always the first node in its preference list.
Vector Clocks -- Tracks causal history of writes. Compares versions to determine if one supersedes another or if they're concurrent (:after, :before, :equal, :concurrent). Clocks are pruned to 50 entries max to prevent unbounded growth.
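The four comparison outcomes can be illustrated with a minimal clock representation (a map of node to counter). This is a sketch of the general technique, not the project's actual module:

```elixir
defmodule ClockSketch do
  # Illustrative vector clock operations; clock is a map of node => counter.

  def increment(clock, node), do: Map.update(clock, node, 1, &(&1 + 1))

  # :after means `a` supersedes `b`; :concurrent means neither dominates.
  def compare(a, b) do
    nodes = Map.keys(a) ++ Map.keys(b)

    {gt, lt} =
      Enum.reduce(nodes, {false, false}, fn node, {gt, lt} ->
        ca = Map.get(a, node, 0)
        cb = Map.get(b, node, 0)
        {gt or ca > cb, lt or ca < cb}
      end)

    case {gt, lt} do
      {false, false} -> :equal
      {true, false} -> :after
      {false, true} -> :before
      {true, true} -> :concurrent
    end
  end
end
```

For example, `compare(%{a: 2}, %{a: 1})` yields `:after`, while `compare(%{a: 1}, %{b: 1})` yields `:concurrent` because each clock has an entry the other lacks.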
Gossip -- Each node increments its heartbeat every second, picks a random live peer, and exchanges full membership tables. Failure detection runs a state machine: :alive -> :suspect -> :dead. Dead nodes are evicted from the ring.
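A rough sketch of the membership merge and the failure-detection state machine. The entry shape and the timeout thresholds here are assumptions for illustration, not the project's actual values:

```elixir
defmodule MembershipSketch do
  # Illustrative membership merge + failure detection.
  @suspect_after 5   # seconds without heartbeat progress before :suspect (assumed)
  @dead_after 15     # seconds before :dead (assumed)

  # Merge two membership tables, keeping the entry with the higher heartbeat.
  def merge(local, remote) do
    Map.merge(local, remote, fn _node, a, b ->
      if a.heartbeat >= b.heartbeat, do: a, else: b
    end)
  end

  # :alive -> :suspect -> :dead, based on how stale the heartbeat is.
  def state(%{last_seen: seen}, now) do
    cond do
      now - seen >= @dead_after -> :dead
      now - seen >= @suspect_after -> :suspect
      true -> :alive
    end
  end
end
```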
Hinted Handoff -- Writes destined for unavailable nodes are stored locally in an ETS bag table with a hint identifying the intended recipient. A background Replicator drains hints when nodes recover.
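The ETS bag layout might look something like this sketch (table name and tuple shape are assumptions, not the actual implementation):

```elixir
defmodule HintSketch do
  # Illustrative hinted-handoff storage. A :bag table allows many hints per
  # intended recipient under the same key.
  @table :hinted_writes_sketch

  def init, do: :ets.new(@table, [:bag, :named_table, :public])

  # Store the write locally, tagged with the node it was meant for.
  def store_hint(intended_node, key, value, clock) do
    :ets.insert(@table, {intended_node, key, value, clock})
  end

  # When `node` comes back, the Replicator drains its hints and deletes them.
  def drain_for(node) do
    hints = :ets.lookup(@table, node)
    :ets.delete(@table, node)
    hints
  end
end
```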
Storage Engine -- ETS-backed, in-memory. Durability comes from replication, not persistence. Sibling management runs on every write: existing versions dominated by the incoming clock are discarded, concurrent versions are kept.
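The write-path sibling filtering could be sketched as below, assuming a `compare/2` with the `:after`/`:before`/`:equal`/`:concurrent` contract described earlier. This is an illustration of the rule, not the storage engine's actual code:

```elixir
defmodule SiblingSketch do
  # Illustrative sibling management on write: drop existing versions that the
  # incoming clock dominates; keep concurrent ones alongside the new version.
  def put_version(existing, {_value, incoming_clock} = incoming, compare) do
    survivors =
      Enum.reject(existing, fn {_v, clock} ->
        compare.(incoming_clock, clock) == :after
      end)

    [incoming | survivors]
  end
end
```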
All values can be set at runtime via Application.put_env/3 without recompilation.
| Key | Default | Description |
|---|---|---|
| `:n` | 3 | Replication factor -- number of replicas per key |
| `:w` | 2 | Write quorum -- acks required before returning to client |
| `:r` | 2 | Read quorum -- responses required before reconciling |
| `:vnodes_per_node` | 128 | Virtual nodes per physical node in the hash ring |
| `:put_timeout` | 5000 | Max ms to wait for W write acknowledgments |
| `:get_timeout` | 5000 | Max ms to wait for R read responses |
The critical property is R + W > N. When this holds, any read quorum and any write quorum share at least one node, so a read will always encounter the most recent write.
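The overlap follows from the pigeonhole principle: any R-node read set and any W-node write set drawn from the same N replicas must share at least R + W - N nodes. A hypothetical helper (not part of the codebase) makes the arithmetic explicit:

```elixir
defmodule QuorumCheckSketch do
  # R + W > N guarantees read/write quorum overlap.
  def overlap?(n, r, w), do: r + w > n

  # Minimum number of nodes any read quorum shares with any write quorum.
  def min_overlap(n, r, w), do: max(r + w - n, 0)
end

# With the defaults: QuorumCheckSketch.min_overlap(3, 2, 2) => 1
```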
DynamoNode requires a cluster because it uses quorum-based replication. With the default N=3, W=2, R=2, you need at least 2 live nodes for writes to succeed.
For single-node experimentation, lower the quorum first:
iex -S mix
Application.put_env(:dynamo_node, :w, 1)
Application.put_env(:dynamo_node, :r, 1)

# Blind write (empty clock -- no causal context)
{:ok, clock} = DynamoNode.Client.put("user:42", %{name: "Alice", email: "alice@example.com"}, %{})
# Read it back
{:ok, value, clock} = DynamoNode.Client.get("user:42")
# => {:ok, %{name: "Alice", email: "alice@example.com"}, %{...}}
# Causal write -- pass the clock back to establish happens-before
{:ok, new_clock} = DynamoNode.Client.put("user:42", %{name: "Alice", email: "alice@company.com"}, clock)

Always pass the clock returned from a get into subsequent put calls for the same key. This is how the system knows the new write supersedes the previous one. A blind write (%{} clock) will always create a sibling if a concurrent version exists.
# Two blind writes to the same key produce siblings
{:ok, _c1} = DynamoNode.Client.put("cart:1", %{items: ["book"]}, %{})
{:ok, _c2} = DynamoNode.Client.put("cart:1", %{items: ["shirt"]}, %{})
# Read returns all concurrent versions
{:siblings, versions, merged_clock} = DynamoNode.Client.get("cart:1")
# => {:siblings, [
# {%{items: ["shirt"]}, clock_b},
# {%{items: ["book"]}, clock_a}
# ], merged_clock}
# Merge however makes sense for your domain (cart = union of items)
merged = %{items: ["book", "shirt"]}
DynamoNode.Client.put("cart:1", merged, merged_clock)
# Clean read after merge
{:ok, %{items: ["book", "shirt"]}, _clock} = DynamoNode.Client.get("cart:1")

The system never auto-merges because automatic strategies like last-write-wins can silently discard data. Conflicts are surfaced to the caller.
DynamoNode.Client.get/1 returns one of:
{:ok, value, clock} # clean read
{:siblings, [{value, clock}], merged_clock} # concurrent versions exist
{:error, :not_found} # key doesn't exist
{:error, :insufficient_acks} # couldn't reach R replicas

DynamoNode.Client.put/3 returns one of:
{:ok, clock} # write succeeded
{:error, :insufficient_acks} # couldn't reach W replicas

The DynamoNode.TestCluster helper makes it straightforward to test cluster behavior:
# Start a 3-node cluster
{:ok, nodes} = DynamoNode.TestCluster.start(node_count: 3)
# Partition a node
DynamoNode.TestCluster.partition(node_a)
# Write during partition -- succeeds via hinted handoff
DynamoNode.TestCluster.rpc(node_b, DynamoNode.Client, :put, ["key", "value", %{}])
# Heal the partition -- triggers gossip recovery and hint drain
DynamoNode.TestCluster.heal(node_a)
# Wait for gossip to converge across all nodes
DynamoNode.TestCluster.wait_for_convergence(nodes, 10_000)

mix test

163 tests total, covering:
- Property-based tests on vector clock operations (transitivity, commutativity, antisymmetry) using stream_data
- Unit tests on all modules
- Integration tests for coordinator operations, quorum logic, storage engine behavior, and ring membership
A few things worth knowing if you're reading the code:
Coordinator identity is not optional. The coordinator for a key must always be the first node in its preference list. If any node handles a write locally without forwarding, different nodes produce different vector clocks for the same key and conflict detection breaks.
:persistent_term must be re-written in Ring.init/1. If the Ring GenServer crashes and restarts without re-publishing to :persistent_term, all other processes silently read the pre-crash ring state. Nothing errors loudly -- keys just route to wrong or dead nodes.
Gossip depends on Ring, not the other way around. Gossip calls Ring.remove_node/1 on node death. Ring must never call back into Gossip. Keeping this dependency one-directional is important for testability and avoiding cycles in the supervision tree.
Timestamps in vector clocks are for pruning only. Causal ordering uses counters exclusively. Timestamps are only used to decide which entries to evict when a clock exceeds 50 entries. Using timestamps for ordering would reintroduce the NTP drift problems vector clocks are designed to avoid.
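The pruning rule could be sketched as follows, assuming each clock entry carries a `{counter, timestamp}` pair (the entry shape is an assumption for illustration):

```elixir
defmodule ClockPruneSketch do
  # Illustrative pruning: when a clock exceeds the cap, evict the entries with
  # the oldest timestamps. Counters still decide causal order; timestamps only
  # decide who gets evicted.
  @max_entries 50

  def prune(clock) when map_size(clock) <= @max_entries, do: clock

  def prune(clock) do
    clock
    |> Enum.sort_by(fn {_node, {_counter, ts}} -> ts end, :desc)
    |> Enum.take(@max_entries)
    |> Map.new()
  end
end
```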