This project is a framework for deploying Docker-based Atomix clusters for systems, fault injection, and load testing. The test framework supports:
- Setting up Docker-based test clusters
- Tearing down Docker-based test clusters
- Adding client/server nodes
- Removing nodes
- Killing/restarting nodes
- Creating network partitions
- Injecting network latency
- Generating load on nodes (CPU, I/O, memory, etc.)
The test framework requires that the Docker engine be running. It uses docker-py to communicate with the Docker engine to create networks and containers for test clusters. Additionally, the Atomix Python client is required for performing operations on the Atomix cluster.

To build the test Docker container, run:

```
docker build -t atomix/atomix-test .
```

To install the test framework, run:

```
python setup.py install
```
The `atomix-test` script can be used to manage Atomix clusters for use in testing. To view a list of cluster management commands, run `atomix-test cluster -h`.

To create a new test cluster, use the `setup` command, providing a named configuration for the nodes to run:

```
> atomix-test cluster my-cluster setup -c consensus
my-cluster Setting up cluster
my-cluster Creating network
my-cluster Running container my-cluster-1
my-cluster Running container my-cluster-2
my-cluster Running container my-cluster-3
my-cluster Waiting for cluster bootstrap
```
To configure the cluster with more than three nodes, pass a `--nodes` or `-n` argument:

```
> atomix-test cluster my-cluster setup -n 5
```
Running the `setup` command will set up a new Docker network and a set of containers. The network will have the same name as the cluster, and the container names will be prefixed with the cluster name as visible via the Docker engine:

```
> docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
dd03f9e3a669        my-cluster          bridge              local

> docker ps
CONTAINER ID        IMAGE     COMMAND                  CREATED              STATUS        PORTS                               NAMES
719a14f535a3        atomix    "java -jar atomix-..."   About a minute ago   Up 2 minutes  5679/tcp, 0.0.0.0:50219->5678/tcp   my-cluster-3
55d6f3188a88        atomix    "java -jar atomix-..."   About a minute ago   Up 2 minutes  5679/tcp, 0.0.0.0:50218->5678/tcp   my-cluster-2
6b2d6569c503        atomix    "java -jar atomix-..."   About a minute ago   Up 2 minutes  5679/tcp, 0.0.0.0:50217->5678/tcp   my-cluster-1
```
To get a list of the currently running test clusters, use the `list` command:

```
> atomix-test cluster list
my-cluster
my-other-cluster
```
To list the nodes in a cluster, use the `nodes` command:

```
> atomix-test cluster my-cluster nodes
ID   NAME           STATUS    IP           LOCAL PORT
1    my-cluster-1   running   172.18.0.2   61170
2    my-cluster-2   running   172.18.0.3   61171
3    my-cluster-3   running   172.18.0.4   61172
```

The port listed under `LOCAL PORT` is the `localhost` port on which the Atomix HTTP server is running.
The Docker engine is treated as the source of truth for test cluster info. The `atomix-test` script inspects running Docker containers and networks to determine which test clusters are running.
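Because cluster state lives in Docker rather than in a local database, cluster discovery reduces to grouping container names by their cluster-name prefix (e.g. `my-cluster-1`, `my-cluster-2`). A minimal sketch of that naming convention — the helper name `clusters_from_containers` is illustrative, not part of the framework:

```python
import re

def clusters_from_containers(container_names):
    """Group container names like 'my-cluster-1' into {cluster_name: [containers]}."""
    clusters = {}
    for name in container_names:
        # Cluster containers are named '<cluster>-<node id>'
        match = re.match(r'^(.*)-(\d+)$', name)
        if match:
            clusters.setdefault(match.group(1), []).append(name)
    return clusters

# Container names as they might appear in `docker ps`
names = ['my-cluster-1', 'my-cluster-2', 'my-other-cluster-1']
print(clusters_from_containers(names))
# → {'my-cluster': ['my-cluster-1', 'my-cluster-2'], 'my-other-cluster': ['my-other-cluster-1']}
```

The real script additionally filters by the test image and cross-references the Docker network named after the cluster.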
To add a node to a cluster, use the `add-node` command:

```
> atomix-test cluster my-cluster add-node -c client
```

To remove a node from a cluster, use the `remove-node` command, passing the node ID as the last argument:

```
> atomix-test cluster my-cluster remove-node 4
```
To shut down a cluster, use the `teardown` command:

```
> atomix-test cluster my-cluster teardown
my-cluster Tearing down cluster
my-cluster Stopping container my-cluster-1
my-cluster Removing container my-cluster-1
my-cluster Stopping container my-cluster-2
my-cluster Removing container my-cluster-2
my-cluster Stopping container my-cluster-3
my-cluster Removing container my-cluster-3
my-cluster Removing network
my-cluster Cleaning up cluster state
```
To create a cluster in a test, use the `create_cluster` function:

```python
from atomixtest import create_cluster

with create_cluster('consensus', nodes=3) as cluster:
    node = cluster.node(1)
    ...
```
By default, a cluster with the same name as the test function - e.g. `test_map` - will be started prior to running the test method and will be torn down afterwards. The `Cluster` object provided to tests has APIs for managing and accessing the test network, cluster, and nodes.
The same test cluster management functions are available via the `Cluster` object:

```python
def test_resize():
    with create_cluster('consensus', nodes=3) as cluster:
        # Add a node to the cluster
        cluster.add_node()
        # Add a client node to the cluster
        cluster.add_node(type='client')
        # Remove a node from the cluster
        cluster.remove_node(4)
```
Nodes can be accessed via the `Cluster` API:

```python
node = cluster.node(1)
```
Each node is also an Atomix REST client from which primitive instances can be created:

```python
with create_cluster('consensus', nodes=3) as cluster:
    node = cluster.add_node(type='client')
    map = node.map('test-map')
    assert map.get('foo') is None
    assert map.put('foo', 'Hello world!') is None
    assert map.get('foo')['value'] == 'Hello world!'
```

Tests are written using simple `assert` statements like the above.
The `Cluster`, `Network`, and `Node` objects have various methods for injecting network and node failures in tests:

```python
with create_cluster('consensus', nodes=3) as cluster:
    # Kill a node
    cluster.node(1).kill()
    # Restart the node
    cluster.node(1).start()
    # Partition a node from all other nodes
    cluster.node(1).isolate()
    # Heal the isolated node
    cluster.node(1).unisolate()
    # Partition one node from another node
    cluster.node(1).partition(cluster.node(2))
    # Inject latency in the network and attempt to write to a map
    with cluster.network.delay(latency=100):
        cluster.node(2).map('test-map').put('bar', 'baz')
```
Fault injection methods also support a context manager that can be used to encapsulate blocks of code to be executed during the failure:

```python
with create_cluster('consensus', nodes=3) as cluster:
    node1 = cluster.node(1)
    node2 = cluster.node(2)
    # Test writing to a map with the cluster under load
    with cluster.stress(cpu=4):
        node1.map('test-map').put('foo', 'bar')
    # Test that a map can be read from node 2 while node 1 is partitioned
    with node1.isolate():
        assert node2.map('test-map').get('foo')['value'] == 'bar'
```
The test tool comes with a randomized fault injection/fuzz testing framework to aid in locating problems that are difficult to reproduce. Randomized fault injection tests are referred to as entropy tests. Entropy tests randomly inject select faults into a cluster while simultaneously performing common operations on the cluster nodes. An entropy test runs for a fixed period of time, running entropy functions at semi-random intervals specified in the entropy test command. Functions supported by entropy tests include:
- Node crash
- Network partitions
- Network latency
- CPU stress
- I/O stress
- Memory stress
- Node addition/removal
- Cluster restarts
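The scheduling idea behind entropy tests — invoke a randomly chosen fault function, then wait a uniformly random interval, until the run time expires — can be sketched as follows (the loop and its parameter names are illustrative, not the framework's internals):

```python
import random
import time

def run_entropy(functions, min_delay, max_delay, duration):
    """Repeatedly invoke a random entropy function at uniform random intervals."""
    deadline = time.time() + duration
    history = []
    while time.time() < deadline:
        func = random.choice(functions)
        func()  # a real function would crash a node, cut the network, etc.
        history.append(func.__name__)
        # Uniform random delay between invocations, as in `crash 15s 30s`
        time.sleep(random.uniform(min_delay, max_delay))
    return history

# Stand-in fault functions for illustration
def crash(): pass
def partition(): pass

history = run_entropy([crash, partition], min_delay=0.01, max_delay=0.02, duration=0.1)
print(history)  # e.g. ['partition', 'crash', 'crash', ...]
```

In the real tool the load generator runs concurrently with this loop, so faults land while operations are in flight.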
To run entropy tests, use the `atomix-test entropy` command:

```
atomix-test entropy --nodes 3 --config consensus restart
```
The entropy test provides several options for configuring the timing of entropy functions, the number of operations to perform, and the duration of the test:
- `--parallelism` - the number of processes with which to submit operations to the test cluster
- `--scale` - the total number of keys on which to operate when submitting operations to the test cluster
- `--prime` - the number of operations with which to prime the cluster before introducing entropy
- `--ops` - the total number of operations per second to perform on the test cluster during the entropy test
- `--run-time` - the total amount of time for which to run the entropy test. Times are specified in human-readable format, e.g. `10m30s`
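Human-readable times like `10m30s` can be parsed with a small helper along these lines (this parser is a sketch of the format, not the framework's actual implementation):

```python
import re

# Seconds per unit; 'ms' must be matched before 'm' and 's'
UNITS = {'h': 3600, 'm': 60, 's': 1, 'ms': 0.001}

def parse_duration(text):
    """Convert a string like '10m30s', '1h', or '500ms' to seconds."""
    total = 0.0
    for value, unit in re.findall(r'(\d+)(ms|h|m|s)', text):
        total += int(value) * UNITS[unit]
    return total

print(parse_duration('10m30s'))  # → 630.0
print(parse_duration('1h'))      # → 3600.0
print(parse_duration('500ms'))   # → 0.5
```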
For example, to prime a consensus cluster with 10,000 operations on 1,000 distinct keys, running a test for 10 minutes with randomized crashes and isolating network partitions while submitting 50 operations per second, use the following command:

```
atomix-test entropy --nodes 3 --config consensus --prime 10000 --scale 1000 --ops 50 --run-time 10m crash --random 30s partition --isolate 30s 1m
```
The following entropy functions are supported by the entropy command:
- `crash`
- `partition`
- `stress`
- `restart`

Each function supports its own distinct set of options specific to the function. Multiple entropy functions can be enabled by specifying them in the same command:

```
atomix-test entropy --nodes 3 --config consensus crash --random 10s restart 30s partition --random --isolate 30s 1m
```
The `crash` function crashes a node in the cluster. The first two positional arguments to the `crash` function specify a uniform random delay between crash function invocations:

```
atomix-test entropy -n 3 -c consensus crash 15s 30s
```

The `--random` option configures the amount of time for which to crash a random node, specified as a uniform random interval:

```
atomix-test entropy -n 3 -c consensus crash --random 30s 1m
```
The `partition` function supports various types of network partitions. The first two positional arguments to the `partition` function specify a uniform random delay between partition function invocations:

```
atomix-test entropy -n 3 -c consensus partition 15s 30s
```

Additionally, types of partitions are supported by optional arguments:
- `--random` - partitions a random pair of nodes from each other
- `--isolate` - partitions a random node from all other nodes in the cluster
- `--halves` - partitions the cluster into two halves
- `--bridge` - partitions the cluster into two halves with a single node visible to each half

Each partition type supports an optional uniform random partition period:

```
atomix-test entropy -n 3 -c consensus partition --random 30s --isolate 30s 1m --halves
```
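The node selection behind `--halves` and `--bridge` can be sketched as simple list arithmetic (hypothetical helper names; the framework computes this internally):

```python
import random

def split_halves(nodes):
    """Split a node list into two halves for a --halves style partition."""
    mid = (len(nodes) + 1) // 2  # odd clusters put the extra node in the first half
    return nodes[:mid], nodes[mid:]

def pick_bridge(nodes, bridge=None):
    """Choose the bridge node for a --bridge partition; random if unspecified."""
    return bridge if bridge is not None else random.choice(nodes)

nodes = [1, 2, 3, 4, 5]
left, right = split_halves(nodes)
print(left, right)  # → [1, 2, 3] [4, 5]

bridge = pick_bridge(nodes)
rest = [n for n in nodes if n != bridge]
# In a bridge partition, `bridge` can reach both halves of `rest`,
# but the two halves cannot reach each other.
```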
The `stress` function supports various types of stress on the cluster and the network. The first two positional arguments to the `stress` function specify a uniform random delay between stress function invocations:

```
atomix-test entropy -n 3 -c consensus stress 15s 30s
```

Additionally, types of stress are supported by optional arguments:
- `--network` - injects configurable latency into the network
- `--cpu` - creates `n` processes spinning on `sqrt()`
- `--io` - creates `n` processes spinning on `sync()`
- `--memory` - creates `n` processes spinning on `malloc()`/`free()`

```
atomix-test entropy -n 3 -c consensus stress --network 500ms --cpu 2
```
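The `--cpu` style stressor — `n` processes spinning on `sqrt()` — can be sketched with `multiprocessing` (a simplified stand-in for the stress tooling the container actually runs):

```python
import math
import multiprocessing
import time

def spin(stop):
    """Burn CPU on sqrt() until told to stop."""
    while not stop.is_set():
        math.sqrt(12345.678)

def cpu_stress(n, seconds):
    """Run n sqrt-spinning worker processes for the given duration."""
    stop = multiprocessing.Event()
    workers = [multiprocessing.Process(target=spin, args=(stop,)) for _ in range(n)]
    for w in workers:
        w.start()
    time.sleep(seconds)
    stop.set()
    for w in workers:
        w.join()
    return [w.exitcode for w in workers]

if __name__ == '__main__':
    # Two workers pinning two cores for 200ms; exit code 0 means a clean stop
    print(cpu_stress(2, 0.2))
```

The `--io` and `--memory` variants follow the same pattern with `os.sync()` loops and repeated allocation/release in place of `sqrt()`.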
The `stress` function also supports optional arguments specifying the nodes to which to apply stress functions:
- `--random` - applies stress functions to a random node
- `--all` - applies stress functions to all nodes simultaneously

```
atomix-test entropy -n 3 -c consensus stress --random --network 500ms
```
Note that each stress option creates a separate entropy function that will run independently of all other functions. For example:

```
atomix-test entropy -n 3 -c consensus stress --random --network 500ms --cpu 2
```

This command creates the following two functions:
- Increase latency for a random node by 500ms
- Increase CPU load on a random node
The `restart` function simply restarts all the nodes in the cluster. The only arguments to the `restart` function specify a uniform random delay between restart function invocations.

The following command restarts the cluster every minute:

```
atomix-test entropy -n 3 -c consensus restart 1m
```

The following command restarts the cluster at a random interval between 10 minutes and 1 hour:

```
atomix-test entropy -n 3 -c consensus restart 10m 1h
```
The `Cluster` object supports the following properties and methods:
- `path` - the host path in which Atomix data is persisted
- `network` - the `Network` used to communicate across nodes in the cluster
- `node(id)` - returns a `Node` object by `int` ID or `str` name
- `nodes(type=None)` - returns a list of all nodes in the cluster of the given `type`
- `servers()` - returns a list of all `server` type nodes in the cluster
- `clients()` - returns a list of all `client` type nodes in the cluster
- `setup(nodes=3, supernet='172.18.0.0/16', subnet=None, gateway=None, cpu=None, memory=None)` - sets up a new cluster
- `add_node(type='server')` - adds a new node to the cluster
- `remove_node(id)` - removes a node by `int` ID or `str` name
- `teardown()` - tears down the cluster
- `stress(node=None, timeout=None, cpu=None, io=None, memory=None, hdd=None)` - stresses all nodes in the cluster
- `destress()` - kills stressors on all nodes in the cluster
The `Network` object can be accessed via `Cluster.network` and supports the following properties and methods:
- `setup(supernet='172.18.0.0/16', subnet=None, gateway=None)` - sets up the network, creating a new Docker network with a subnet in the given `supernet`
- `teardown()` - tears down the network, removing the Docker network
- `partition(local, remote=None)` - partitions the given `local` node from the given `remote` node or from all nodes in the network if `remote` is `None`. `local` and `remote` must be node names or IDs.
- `unipartition(local, remote=None)` - creates a uni-directional partition from the given `local` node to the given `remote` node or to all nodes in the network if `remote` is `None`. `local` and `remote` must be node names or IDs.
- `bipartition(local, remote=None)` - creates a bi-directional partition between the given `local` node and the given `remote` node or all nodes in the network if `remote` is `None`. `local` and `remote` must be node names or IDs.
- `heal(local=None, remote=None)` - heals a partition from the given `local` node to the given `remote` node. If the `remote` node is `None`, all partitions from the given `local` node are removed. If `local` is `None` then all partitions in the cluster are healed. `local` and `remote` must be node names or IDs.
- `partition_halves()` - partitions the cluster into two halves using bi-directional partitions
- `partition_random()` - partitions a random node from all other nodes in the cluster
- `partition_bridge(node=None)` - partitions the cluster into two halves with a single "bridge" node able to see each side of the partition. If the given `node` is `None` then the bridge node will be selected randomly.
- `partition_isolate(node=None)` - partitions a node from all other nodes in the cluster. If the given `node` is `None` then a random node will be selected.
- `delay(node=None, latency=50, jitter=10, correlation=.75, distribution='normal')` - delays packets to the given node or to all nodes if `node` is `None`
- `drop(node=None, probability=.02, correlation=.25)` - drops packets to the given node or to all nodes if `node` is `None`
- `reorder(node=None, probability=.02, correlation=.5)` - reorders packets to the given node or to all nodes if `node` is `None`
- `duplicate(node=None, probability=.005, correlation=.05)` - duplicates packets to the given node or to all nodes if `node` is `None`
- `corrupt(node=None, probability=.02)` - corrupts packets to the given node or to all nodes if `node` is `None`
- `restore(node=None)` - restores all traffic to the given node or to all nodes if `node` is `None`
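The `delay()`, `drop()`, `reorder()`, `duplicate()`, and `corrupt()` parameters map naturally onto Linux `tc`/`netem` options. As an illustration of that mapping (not necessarily the framework's exact invocation), the default `delay` arguments could translate to a `tc` command like this:

```python
def netem_delay_command(interface='eth0', latency=50, jitter=10,
                        correlation=.75, distribution='normal'):
    """Build a tc/netem command applying delay to a container interface."""
    return ('tc qdisc add dev {} root netem delay {}ms {}ms {:.0f}% '
            'distribution {}').format(interface, latency, jitter,
                                      correlation * 100, distribution)

print(netem_delay_command())
# → tc qdisc add dev eth0 root netem delay 50ms 10ms 75% distribution normal
```

The probability-based methods correspond to netem's `loss`, `reorder`, `duplicate`, and `corrupt` options in the same way.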
The `Node` object can be accessed via `Cluster.node(id)` and supports the following properties and methods:
- `id` - the `int` ID of the node
- `status` - the `str` status of the node provided by the underlying Docker container
- `local_port` - the `localhost` port on which the Atomix HTTP server is listening
- `logs()` - returns the Atomix logs for the node
- `setup(cpu=None, memory=None)` - sets up the node, passing the `cpu` and `memory` limits to Docker at setup
- `teardown()` - tears down the node, removing the Docker container
- `run(*command)` - runs a command in the Docker container
- `execute(*command)` - runs a command in the Docker container, detaching from the shell
- `stop()` - stops the Docker container using `docker stop`
- `start()` - starts the Docker container using `docker start`
- `kill()` - kills the Docker container using `docker kill`
- `recover()` - restarts a dead Docker container using `docker start`
- `restart()` - restarts the Docker container using `docker restart`
- `partition(node)` - partitions the node from the given `node` using a bi-directional partition
- `heal(node)` - heals a partition from this node to the given `node`
- `isolate()` - isolates this node from all other nodes in the cluster
- `unisolate()` - heals an isolation from all other nodes in the cluster
- `delay(latency=50, jitter=10, correlation=.75, distribution='normal')` - delays packets to the node with the given `latency` in milliseconds
- `drop(probability=.02, correlation=.25)` - drops packets to the node with the given `probability`
- `reorder(probability=.02, correlation=.5)` - reorders packets to the node with the given `probability`
- `duplicate(probability=.005, correlation=.05)` - duplicates packets to the node with the given `probability`
- `corrupt(probability=.02)` - corrupts packets to the node with the given `probability`
- `restore()` - restores all traffic to the node
- `stress(timeout=None, cpu=None, io=None, memory=None, hdd=None)` - stresses the node
- `destress()` - kills stressors on the node
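Per-node partitions are typically implemented with firewall rules executed inside the container. As a hedged illustration of one common mechanism (the framework's actual implementation may differ), `partition(node)` could drop traffic to and from a peer's IP with `iptables`, and `heal(node)` could delete the same rules:

```python
def partition_commands(remote_ip):
    """iptables commands dropping traffic to/from a peer (bi-directional)."""
    return [
        'iptables -A INPUT -s {} -j DROP'.format(remote_ip),
        'iptables -A OUTPUT -d {} -j DROP'.format(remote_ip),
    ]

def heal_commands(remote_ip):
    """iptables commands reversing the partition rules above."""
    return [
        'iptables -D INPUT -s {} -j DROP'.format(remote_ip),
        'iptables -D OUTPUT -d {} -j DROP'.format(remote_ip),
    ]

print(partition_commands('172.18.0.3'))
# → ['iptables -A INPUT -s 172.18.0.3 -j DROP', 'iptables -A OUTPUT -d 172.18.0.3 -j DROP']
```

A uni-directional partition would install only one of the two rules, which is the distinction between `unipartition` and `bipartition` on the `Network` object.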
To run tests, use the `run` command:

```
> atomix-test run
```

The `run` command can take any arguments supported by `pytest`.