An Elixir implementation of the Amazon Dynamo (2007) paper. Built on the BEAM, which turns out to be a surprisingly natural fit for a system that needs isolated crash domains, message-passing between processes, and graceful degradation under load.
This is part of a [Paper Implementation] series. The companion blog post walks through the design decisions in detail.
DynamoNode is a distributed key-value store with the following properties:
- Always writable -- writes succeed even when nodes are down, using sloppy quorum and hinted handoff
- Eventually consistent -- all replicas converge after failures resolve
- Conflict-aware -- concurrent writes produce siblings that the client can inspect and merge
- Masterless -- any node can coordinate any request; no single point of failure
- Self-healing -- gossip-based failure detection and automatic read repair keep replicas in sync
lib/dynamo_node/
├── application.ex # OTP Application - supervision tree
├── cluster.ex # libcluster wrapper - node lifecycle events
├── client.ex # Public API for external callers
├── replicator.ex # Handles hinted handoff drain
├── ring/
│ ├── ring.ex # GenServer - consistent hash ring
│ └── config.ex # N/W/R quorum configuration
├── coordinator/
│ ├── coordinator.ex # GenServer - write/read coordination
│ └── quorum.ex # Pure - node selection, quorum logic
├── gossip/
│ ├── gossip.ex # GenServer - periodic peer exchange
│ └── membership.ex # Pure - merge + failure detection FSM
├── storage/
│ ├── storage_engine.ex # GenServer - ETS owner
│ └── hinted_write.ex # Pure - hinted write struct + ETS helpers
└── vector_clock/
├── vector_clock.ex # Pure - clock operations
└── siblings.ex # Pure - sibling reconciliation
Ring -- Consistent hash ring with virtual nodes (128 vnodes per physical node by default). Ring state is published to :persistent_term so preference list lookups require no process communication on the hot path.
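The hot-path lookup described above can be sketched roughly as follows. This is an illustration, not the actual module: the `:persistent_term` key, the sorted `{hash, node}` vnode layout, and the module name are all assumptions.

```elixir
defmodule RingLookupSketch do
  # Hypothetical sketch of a preference-list lookup backed by :persistent_term.
  @ring_key {:dynamo_node, :ring}

  # Publish the sorted list of {hash, node} vnode tuples once, on ring changes.
  def publish(vnodes), do: :persistent_term.put(@ring_key, Enum.sort(vnodes))

  # Read path: no GenServer call -- just a :persistent_term fetch plus a walk
  # around the ring collecting n distinct physical nodes.
  def preference_list(key, n) do
    ring = :persistent_term.get(@ring_key)
    hash = :erlang.phash2(key)

    ring
    |> Enum.split_while(fn {h, _node} -> h < hash end)
    |> then(fn {before, rest} -> rest ++ before end)
    |> Enum.map(fn {_h, node} -> node end)
    |> Enum.uniq()
    |> Enum.take(n)
  end
end
```

Because `:persistent_term.get/1` is a constant-time read with no copying of the term into the caller's heap, every process can resolve a preference list without touching the Ring GenServer.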
Coordinator -- Handles get/put requests, selects write/read nodes, fires concurrent RPCs, and waits for quorum acknowledgments using Task.yield_many. The coordinator for a key is always the first node in its preference list.
Vector Clocks -- Tracks causal history of writes. Compares versions to determine if one supersedes another or if they're concurrent (:after, :before, :equal, :concurrent). Clocks are pruned to 50 entries max to prevent unbounded growth.
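The four comparison outcomes can be illustrated with a minimal clock representation (a map of node to counter). This is a sketch of the general technique, not the project's actual module:

```elixir
defmodule ClockSketch do
  # Illustrative vector clock operations; clock is a map of node => counter.

  def increment(clock, node), do: Map.update(clock, node, 1, &(&1 + 1))

  # :after means `a` supersedes `b`; :concurrent means neither dominates.
  def compare(a, b) do
    nodes = Map.keys(a) ++ Map.keys(b)

    {gt, lt} =
      Enum.reduce(nodes, {false, false}, fn node, {gt, lt} ->
        ca = Map.get(a, node, 0)
        cb = Map.get(b, node, 0)
        {gt or ca > cb, lt or ca < cb}
      end)

    case {gt, lt} do
      {false, false} -> :equal
      {true, false} -> :after
      {false, true} -> :before
      {true, true} -> :concurrent
    end
  end
end
```

For example, `compare(%{a: 2}, %{a: 1})` yields `:after`, while `compare(%{a: 1}, %{b: 1})` yields `:concurrent` because each clock has an entry the other lacks.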
Gossip -- Each node increments its heartbeat every second, picks a random live peer, and exchanges full membership tables. Failure detection runs a state machine: :alive -> :suspect -> :dead. Dead nodes are evicted from the ring.
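A rough sketch of the membership merge and the failure-detection state machine. The entry shape and the timeout thresholds here are assumptions for illustration, not the project's actual values:

```elixir
defmodule MembershipSketch do
  # Illustrative membership merge + failure detection.
  @suspect_after 5   # seconds without heartbeat progress before :suspect (assumed)
  @dead_after 15     # seconds before :dead (assumed)

  # Merge two membership tables, keeping the entry with the higher heartbeat.
  def merge(local, remote) do
    Map.merge(local, remote, fn _node, a, b ->
      if a.heartbeat >= b.heartbeat, do: a, else: b
    end)
  end

  # :alive -> :suspect -> :dead, based on how stale the heartbeat is.
  def state(%{last_seen: seen}, now) do
    cond do
      now - seen >= @dead_after -> :dead
      now - seen >= @suspect_after -> :suspect
      true -> :alive
    end
  end
end
```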
Hinted Handoff -- Writes destined for unavailable nodes are stored locally in an ETS bag table with a hint identifying the intended recipient. A background Replicator drains hints when nodes recover.
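The ETS bag layout might look something like this sketch (table name and tuple shape are assumptions, not the actual implementation):

```elixir
defmodule HintSketch do
  # Illustrative hinted-handoff storage. A :bag table allows many hints per
  # intended recipient under the same key.
  @table :hinted_writes_sketch

  def init, do: :ets.new(@table, [:bag, :named_table, :public])

  # Store the write locally, tagged with the node it was meant for.
  def store_hint(intended_node, key, value, clock) do
    :ets.insert(@table, {intended_node, key, value, clock})
  end

  # When `node` comes back, the Replicator drains its hints and deletes them.
  def drain_for(node) do
    hints = :ets.lookup(@table, node)
    :ets.delete(@table, node)
    hints
  end
end
```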
Storage Engine -- ETS-backed, in-memory. Durability comes from replication, not persistence. Sibling management runs on every write: existing versions dominated by the incoming clock are discarded, concurrent versions are kept.
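The write-path sibling filtering could be sketched as below, assuming a `compare/2` with the `:after`/`:before`/`:equal`/`:concurrent` contract described earlier. This is an illustration of the rule, not the storage engine's actual code:

```elixir
defmodule SiblingSketch do
  # Illustrative sibling management on write: drop existing versions that the
  # incoming clock dominates; keep concurrent ones alongside the new version.
  def put_version(existing, {_value, incoming_clock} = incoming, compare) do
    survivors =
      Enum.reject(existing, fn {_v, clock} ->
        compare.(incoming_clock, clock) == :after
      end)

    [incoming | survivors]
  end
end
```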
All values can be set at runtime via Application.put_env/3 without recompilation.
| Key | Default | Description |
|---|---|---|
| `:n` | 3 | Replication factor -- number of replicas per key |
| `:w` | 2 | Write quorum -- acks required before returning to client |
| `:r` | 2 | Read quorum -- responses required before reconciling |
| `:vnodes_per_node` | 128 | Virtual nodes per physical node in the hash ring |
| `:put_timeout` | 5000 | Max ms to wait for W write acknowledgments |
| `:get_timeout` | 5000 | Max ms to wait for R read responses |
The critical property is R + W > N. When this holds, any read quorum and any write quorum share at least one node, so a read will always encounter the most recent write.
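The overlap follows from the pigeonhole principle: any R-node read set and any W-node write set drawn from the same N replicas must share at least R + W - N nodes. A hypothetical helper (not part of the codebase) makes the arithmetic explicit:

```elixir
defmodule QuorumCheckSketch do
  # R + W > N guarantees read/write quorum overlap.
  def overlap?(n, r, w), do: r + w > n

  # Minimum number of nodes any read quorum shares with any write quorum.
  def min_overlap(n, r, w), do: max(r + w - n, 0)
end

# With the defaults: QuorumCheckSketch.min_overlap(3, 2, 2) => 1
```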
DynamoNode requires a cluster because it uses quorum-based replication. With the default N=3, W=2, R=2, you need at least 2 live nodes for writes to succeed.
For single-node experimentation, lower the quorum first:
iex -S mix
Application.put_env(:dynamo_node, :w, 1)
Application.put_env(:dynamo_node, :r, 1)

# Blind write (empty clock -- no causal context)
{:ok, clock} = DynamoNode.Client.put("user:42", %{name: "Alice", email: "alice@example.com"}, %{})
# Read it back
{:ok, value, clock} = DynamoNode.Client.get("user:42")
# => {:ok, %{name: "Alice", email: "alice@example.com"}, %{...}}
# Causal write -- pass the clock back to establish happens-before
{:ok, new_clock} = DynamoNode.Client.put("user:42", %{name: "Alice", email: "alice@company.com"}, clock)

Always pass the clock returned from a get into subsequent put calls for the same key. This is how the system knows the new write supersedes the previous one. A blind write (%{} clock) will always create a sibling if a concurrent version exists.
# Two blind writes to the same key produce siblings
{:ok, _c1} = DynamoNode.Client.put("cart:1", %{items: ["book"]}, %{})
{:ok, _c2} = DynamoNode.Client.put("cart:1", %{items: ["shirt"]}, %{})
# Read returns all concurrent versions
{:siblings, versions, merged_clock} = DynamoNode.Client.get("cart:1")
# => {:siblings, [
# {%{items: ["shirt"]}, clock_b},
# {%{items: ["book"]}, clock_a}
# ], merged_clock}
# Merge however makes sense for your domain (cart = union of items)
merged = %{items: ["book", "shirt"]}
DynamoNode.Client.put("cart:1", merged, merged_clock)
# Clean read after merge
{:ok, %{items: ["book", "shirt"]}, _clock} = DynamoNode.Client.get("cart:1")

The system never auto-merges because automatic strategies like last-write-wins can silently discard data. Conflicts are surfaced to the caller.
DynamoNode.Client.get/1 returns one of:
{:ok, value, clock} # clean read
{:siblings, [{value, clock}], merged_clock} # concurrent versions exist
{:error, :not_found} # key doesn't exist
{:error, :insufficient_acks} # couldn't reach R replicas

DynamoNode.Client.put/3 returns one of:
{:ok, clock} # write succeeded
{:error, :insufficient_acks} # couldn't reach W replicas

The DynamoNode.TestCluster helper makes it straightforward to test cluster behavior:
# Start a 3-node cluster
{:ok, nodes} = DynamoNode.TestCluster.start(node_count: 3)
# Partition a node
DynamoNode.TestCluster.partition(node_a)
# Write during partition -- succeeds via hinted handoff
DynamoNode.TestCluster.rpc(node_b, DynamoNode.Client, :put, ["key", "value", %{}])
# Heal the partition -- triggers gossip recovery and hint drain
DynamoNode.TestCluster.heal(node_a)
# Wait for gossip to converge across all nodes
DynamoNode.TestCluster.wait_for_convergence(nodes, 10_000)

mix test

163 tests total, covering:
- Property-based tests on vector clock operations (transitivity, commutativity, antisymmetry) using stream_data
- Unit tests on all modules
- Integration tests for coordinator operations, quorum logic, storage engine behavior, and ring membership
A few things worth knowing if you're reading the code:
Coordinator identity is not optional. The coordinator for a key must always be the first node in its preference list. If any node handles a write locally without forwarding, different nodes produce different vector clocks for the same key and conflict detection breaks.
:persistent_term must be re-written in Ring.init/1. If the Ring GenServer crashes and restarts without re-publishing to :persistent_term, all other processes silently read the pre-crash ring state. Nothing errors loudly -- keys just route to wrong or dead nodes.
Gossip depends on Ring, not the other way around. Gossip calls Ring.remove_node/1 on node death. Ring must never call back into Gossip. Keeping this dependency one-directional is important for testability and avoiding cycles in the supervision tree.
Timestamps in vector clocks are for pruning only. Causal ordering uses counters exclusively. Timestamps are only used to decide which entries to evict when a clock exceeds 50 entries. Using timestamps for ordering would reintroduce the NTP drift problems vector clocks are designed to avoid.
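The pruning rule could be sketched as follows, assuming each clock entry carries a `{counter, timestamp}` pair (the entry shape is an assumption for illustration):

```elixir
defmodule ClockPruneSketch do
  # Illustrative pruning: when a clock exceeds the cap, evict the entries with
  # the oldest timestamps. Counters still decide causal order; timestamps only
  # decide who gets evicted.
  @max_entries 50

  def prune(clock) when map_size(clock) <= @max_entries, do: clock

  def prune(clock) do
    clock
    |> Enum.sort_by(fn {_node, {_counter, ts}} -> ts end, :desc)
    |> Enum.take(@max_entries)
    |> Map.new()
  end
end
```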