Jitesh117/dynamo_elixir

DynamoNode

An Elixir implementation of the Amazon Dynamo (2007) paper. Built on the BEAM, which turns out to be a surprisingly natural fit for a system that needs isolated crash domains, message-passing between processes, and graceful degradation under load.

This is part of a [Paper Implementation] series. The companion blog post walks through the design decisions in detail.


What it does

DynamoNode is a distributed key-value store with the following properties:

  • Always writable -- writes succeed even when nodes are down, using sloppy quorum and hinted handoff
  • Eventually consistent -- all replicas converge after failures resolve
  • Conflict-aware -- concurrent writes produce siblings that the client can inspect and merge
  • Masterless -- any node can coordinate any request; no single point of failure
  • Self-healing -- gossip-based failure detection and automatic read repair keep replicas in sync

Architecture

lib/dynamo_node/
├── application.ex              # OTP Application - supervision tree
├── cluster.ex                  # libcluster wrapper - node lifecycle events
├── client.ex                   # Public API for external callers
├── replicator.ex               # Handles hinted handoff drain
├── ring/
│   ├── ring.ex                 # GenServer - consistent hash ring
│   └── config.ex               # N/W/R quorum configuration
├── coordinator/
│   ├── coordinator.ex          # GenServer - write/read coordination
│   └── quorum.ex               # Pure - node selection, quorum logic
├── gossip/
│   ├── gossip.ex               # GenServer - periodic peer exchange
│   └── membership.ex           # Pure - merge + failure detection FSM
├── storage/
│   ├── storage_engine.ex       # GenServer - ETS owner
│   └── hinted_write.ex         # Pure - hinted write struct + ETS helpers
└── vector_clock/
    ├── vector_clock.ex         # Pure - clock operations
    └── siblings.ex             # Pure - sibling reconciliation

Key components

Ring -- Consistent hash ring with virtual nodes (128 vnodes per physical node by default). Ring state is published to :persistent_term so preference list lookups require no process communication on the hot path.
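The ring construction can be sketched as follows. This is a minimal illustration of consistent hashing with virtual nodes, not the repo's actual `Ring` module; the module name, tuple shapes, and use of `:erlang.phash2/2` are assumptions.

```elixir
# Minimal consistent-hash-ring sketch with virtual nodes.
defmodule RingSketch do
  @vnodes 128

  # Build a sorted list of {hash, physical_node} points on the ring.
  def build(nodes) do
    for node <- nodes, i <- 1..@vnodes do
      {hash({node, i}), node}
    end
    |> Enum.sort()
  end

  # Preference list: first `n` distinct physical nodes clockwise
  # from the key's position, wrapping around the ring.
  def preference_list(ring, key, n) do
    h = hash(key)
    {lower, upper} = Enum.split_while(ring, fn {point, _node} -> point < h end)

    (upper ++ lower)
    |> Enum.map(fn {_point, node} -> node end)
    |> Enum.uniq()
    |> Enum.take(n)
  end

  defp hash(term), do: :erlang.phash2(term, 4_294_967_296)
end
```

Publishing the sorted ring to `:persistent_term` then makes `preference_list/3` a read-only lookup with no GenServer call on the hot path.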

Coordinator -- Handles get/put requests, selects write/read nodes, fires concurrent RPCs, and waits for quorum acknowledgments using Task.yield_many. The coordinator for a key is always the first node in its preference list.

Vector Clocks -- Tracks causal history of writes. Compares versions to determine if one supersedes another or if they're concurrent (:after, :before, :equal, :concurrent). Clocks are pruned to 50 entries max to prevent unbounded growth.
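The pairwise comparison can be sketched like this, assuming a clock is a map of node to counter (missing entries count as zero). This is illustrative, not the repo's actual module.

```elixir
# Illustrative vector clock comparison returning one of
# :after / :before / :equal / :concurrent.
defmodule VClockSketch do
  def compare(a, b) do
    nodes = Map.keys(a) ++ Map.keys(b)

    {le, ge} =
      Enum.reduce(nodes, {true, true}, fn node, {le, ge} ->
        ca = Map.get(a, node, 0)
        cb = Map.get(b, node, 0)
        {le and ca <= cb, ge and ca >= cb}
      end)

    case {le, ge} do
      {true, true} -> :equal
      {true, false} -> :before
      {false, true} -> :after
      {false, false} -> :concurrent
    end
  end

  # Bump this node's counter on a local write.
  def increment(clock, node), do: Map.update(clock, node, 1, &(&1 + 1))
end
```

A clock is `:before` another only if every counter is less than or equal and at least one is strictly less; if each clock has a counter the other lacks, they are `:concurrent` and both versions must be kept.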

Gossip -- Each node increments its heartbeat every second, picks a random live peer, and exchanges full membership tables. Failure detection runs a state machine: :alive -> :suspect -> :dead. Dead nodes are evicted from the ring.
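The failure-detection state machine reduces to a threshold check on heartbeat staleness. The thresholds below are illustrative values, not the repo's actual configuration; a fresh heartbeat resets the age and the peer returns to `:alive`.

```elixir
# Sketch of the :alive -> :suspect -> :dead failure FSM.
defmodule FailureFsmSketch do
  @suspect_after 5   # seconds without heartbeat progress before :suspect
  @dead_after 10     # seconds before :suspect escalates to :dead

  # `age` = seconds since this peer's heartbeat counter last advanced.
  def state_for(age) when age >= @dead_after, do: :dead
  def state_for(age) when age >= @suspect_after, do: :suspect
  def state_for(_age), do: :alive
end
```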

Hinted Handoff -- Writes destined for unavailable nodes are stored locally in an ETS bag table with a hint identifying the intended recipient. A background Replicator drains hints when nodes recover.
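The ETS-bag mechanics can be sketched directly; the table name and tuple layout here are assumptions, not the repo's actual schema.

```elixir
# Sketch of parking hinted writes in an ETS bag and draining them later.
table = :ets.new(:hints_sketch, [:bag, :public])

# Writes meant for :n3 (currently down) are stored locally,
# keyed by the intended recipient.
:ets.insert(table, {:n3, {"user:42", %{name: "Alice"}, %{n1: 1}}})
:ets.insert(table, {:n3, {"cart:1", %{items: ["book"]}, %{n1: 2}}})

# When :n3 recovers, the replicator collects everything hinted for it...
hints = :ets.lookup(table, :n3)

# ...forwards each {key, value, clock} to :n3, then deletes the hints.
:ets.delete(table, :n3)
```

A bag table is the natural fit here: multiple hinted writes for the same recipient coexist under one key, and a single `:ets.lookup/2` retrieves them all.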

Storage Engine -- ETS-backed, in-memory. Durability comes from replication, not persistence. Sibling management runs on every write: existing versions dominated by the incoming clock are discarded, concurrent versions are kept.
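The sibling rule on write can be sketched as a filter over existing versions. `compare` stands in for the project's vector-clock comparison (returning `:after` / `:before` / `:equal` / `:concurrent`); the function shape is an assumption for illustration.

```elixir
# Sketch of sibling management: drop versions the incoming clock
# dominates, keep concurrent ones alongside the new write.
defmodule SiblingsSketch do
  def put(siblings, new_value, new_clock, compare) do
    survivors =
      Enum.reject(siblings, fn {_value, clock} ->
        # Discard an existing version if the incoming clock supersedes it.
        compare.(new_clock, clock) == :after
      end)

    [{new_value, new_clock} | survivors]
  end
end
```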


Configuration

All values can be set at runtime via Application.put_env/3 without recompilation.

Key               Default  Description
:n                3        Replication factor -- number of replicas per key
:w                2        Write quorum -- acks required before returning to client
:r                2        Read quorum -- responses required before reconciling
:vnodes_per_node  128      Virtual nodes per physical node in the hash ring
:put_timeout      5000     Max ms to wait for W write acknowledgments
:get_timeout      5000     Max ms to wait for R read responses

The critical property is R + W > N. When this holds, any read quorum and any write quorum share at least one node, so a read will always encounter the most recent write.
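A quick check of the invariant against the defaults:

```elixir
# The overlap guarantee: any read quorum and any write quorum intersect
# when R + W > N.
valid? = fn n, w, r -> r + w > n end

valid?.(3, 2, 2)  # => true  (the defaults: 2 + 2 > 3)
valid?.(3, 1, 1)  # => false (a read may miss the latest write)
```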


Usage

DynamoNode requires a cluster because it uses quorum-based replication. With the default N=3, W=2, R=2, you need at least 2 live nodes for writes to succeed.

For single-node experimentation, lower the quorum first:

iex -S mix

Application.put_env(:dynamo_node, :w, 1)
Application.put_env(:dynamo_node, :r, 1)

Basic get/put

# Blind write (empty clock -- no causal context)
{:ok, clock} = DynamoNode.Client.put("user:42", %{name: "Alice", email: "alice@example.com"}, %{})

# Read it back
{:ok, value, clock} = DynamoNode.Client.get("user:42")
# => {:ok, %{name: "Alice", email: "alice@example.com"}, %{...}}

# Causal write -- pass the clock back to establish happens-before
{:ok, new_clock} = DynamoNode.Client.put("user:42", %{name: "Alice", email: "alice@company.com"}, clock)

Always pass the clock returned from a get into subsequent put calls for the same key. This is how the system knows the new write supersedes the previous one. A blind write (%{} clock) will always create a sibling if a concurrent version exists.

Conflict handling

# Two blind writes to the same key produce siblings
{:ok, _c1} = DynamoNode.Client.put("cart:1", %{items: ["book"]}, %{})
{:ok, _c2} = DynamoNode.Client.put("cart:1", %{items: ["shirt"]}, %{})

# Read returns all concurrent versions
{:siblings, versions, merged_clock} = DynamoNode.Client.get("cart:1")
# => {:siblings, [
#      {%{items: ["shirt"]}, clock_b},
#      {%{items: ["book"]}, clock_a}
#    ], merged_clock}

# Merge however makes sense for your domain (cart = union of items)
merged = %{items: ["book", "shirt"]}
DynamoNode.Client.put("cart:1", merged, merged_clock)

# Clean read after merge
{:ok, %{items: ["book", "shirt"]}, _clock} = DynamoNode.Client.get("cart:1")

The system never auto-merges because automatic strategies like last-write-wins can silently discard data. Conflicts are surfaced to the caller.

Return values

DynamoNode.Client.get/1 returns one of:

{:ok, value, clock}                          # clean read
{:siblings, [{value, clock}], merged_clock}  # concurrent versions exist
{:error, :not_found}                         # key doesn't exist
{:error, :insufficient_acks}                 # couldn't reach R replicas

DynamoNode.Client.put/3 returns one of:

{:ok, clock}                  # write succeeded
{:error, :insufficient_acks}  # couldn't reach W replicas

Multi-node testing

The DynamoNode.TestCluster helper makes it straightforward to test cluster behavior:

# Start a 3-node cluster
{:ok, nodes} = DynamoNode.TestCluster.start(node_count: 3)
[node_a, node_b | _] = nodes

# Partition a node
DynamoNode.TestCluster.partition(node_a)

# Write during partition -- succeeds via hinted handoff
DynamoNode.TestCluster.rpc(node_b, DynamoNode.Client, :put, ["key", "value", %{}])

# Heal the partition -- triggers gossip recovery and hint drain
DynamoNode.TestCluster.heal(node_a)

# Wait for gossip to converge across all nodes
DynamoNode.TestCluster.wait_for_convergence(nodes, 10_000)

Tests

mix test

163 tests total, covering:

  • Property-based tests on vector clock operations (transitivity, commutativity, antisymmetry) using stream_data
  • Unit tests on all modules
  • Integration tests for coordinator operations, quorum logic, storage engine behavior, and ring membership

Design notes

A few things worth knowing if you're reading the code:

Coordinator identity is not optional. The coordinator for a key must always be the first node in its preference list. If any node handles a write locally without forwarding, different nodes produce different vector clocks for the same key and conflict detection breaks.

:persistent_term must be re-written in Ring.init/1. If the Ring GenServer crashes and restarts without re-publishing to :persistent_term, all other processes silently read the pre-crash ring state. Nothing errors loudly -- keys just route to wrong or dead nodes.
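A sketch of the shape this takes, with an illustrative key name rather than the repo's actual one:

```elixir
# Re-publishing in init/1 means every (re)start overwrites the stale
# term, so readers never see pre-crash ring state.
defmodule RingInitSketch do
  use GenServer

  def start_link(ring), do: GenServer.start_link(__MODULE__, ring)

  @impl true
  def init(ring) do
    # Runs on first start AND on every supervisor restart.
    :persistent_term.put({:dynamo_sketch, :ring}, ring)
    {:ok, ring}
  end
end
```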

Gossip depends on Ring, not the other way around. Gossip calls Ring.remove_node/1 on node death. Ring must never call back into Gossip. Keeping this dependency one-directional is important for testability and avoiding cycles in the supervision tree.

Timestamps in vector clocks are for pruning only. Causal ordering uses counters exclusively. Timestamps are only used to decide which entries to evict when a clock exceeds 50 entries. Using timestamps for ordering would reintroduce the NTP drift problems vector clocks are designed to avoid.
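The pruning step can be sketched like this; the `{counter, timestamp}` entry shape is an assumption about the internals, not the repo's actual representation.

```elixir
# Sketch of timestamp-based pruning: when a clock exceeds the cap,
# evict the entries whose timestamps are oldest. Counters are never
# consulted for eviction, and timestamps are never consulted for ordering.
defmodule PruneSketch do
  @max_entries 50

  def prune(clock) when map_size(clock) <= @max_entries, do: clock

  def prune(clock) do
    clock
    |> Enum.sort_by(fn {_node, {_counter, ts}} -> ts end, :desc)
    |> Enum.take(@max_entries)
    |> Map.new()
  end
end
```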

