New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SWIM #783
Comments
To support SWIM, Tarantool now has a new built-in module called @Gerold103 please add details to this ticket about the SWIM binary protocol. |
Binary protocolSWIM wire protocol is open, will be backward compatible in case of SWIM packet structure:
Initial vectorThis section is optional and appears only when any encryption All the next sections are encrypted as a one big data chunk, if an Meta sectionThis section handles routing, protocol versions compatibility. It
Protocol logic section.This section handles SWIM logical protocol steps and actions.
Refutation happens usually, when a false-positive failure When member's incarnation in a message is bigger than local one,
|
Protocol description
SWIM - Scalable Weakly-consistent Infection-style Process Group Membership
Protocol. Shortly, it is a protocol to discovery cluster members and detect
their failures, works via UDP. Next is a long version.
SWIM consists of 2 components: event dissemination and failure detection, and
stores in memory a table of known remote hosts - members. Also some SWIM
implementations have an additional component: anti-entropy - periodical
broadcast of a random subset of the member table. Tarantool has it as well,
among other extensions.
SWIM has a main operating cycle during which it randomly chooses members from a
member table and sends to them events + ping. Replies are processed out of the
main cycle, asynchronously.
When a member unacknowledged too many pings, its status is changed to
'suspected'. The SWIM paper describes and Tarantool implements a suspicion
subcomponent as a protection against false-positive detection of alive members
as dead. It happens when a member is overloaded and responds to pings too slow,
or when the network is in trouble and packets can not go through some channels.
When a member is suspected, another instance pings it indirectly via other
members. It sends a fixed number of pings to the suspected one in parallel via
additional hops selected randomly among other members.
Random selection in all the components provides even network load of ~1 message
on each member per one protocol step regardless of the cluster size - it is one
of the killer features of SWIM. Without randomness each member would receive a
network load of N messages in each protocol step, where N is the cluster size.
To speed up propagation of new information SWIM proposes and Tarantool
implements a kind of fairness: when selecting a next random member to ping, the
protocol prefers LRU members.
Tarantool splits protocol operation into rounds. At the beginning of a round all
members are randomly reordered into a queue. At each round step a member is
popped from the queue, a message is sent to it, and then it waits for the next
round. Round message contains ping, events, and anti-entropy.
Anti-entropy is a SWIM extension supported by Tarantool. Why is it needed and
even vital? Consider the example: two SWIM nodes, both are alive. Nothing
happens, so the event list is empty, only pings are being sent periodically.
Then a third node appears. It knows about one of the existing nodes. How can it
learn about the rest? Sure, its known counterpart can try to notify its peer,
but it is UDP, so this event can be lost. Anti-entropy is an extra simple
component, it just piggybacks random part of the member table with each regular
message. In the example above the new node will learn about the third one via
anti-entropy messages from the second one sooner or later.
Surprisingly, original SWIM does not describe any addressing, how to uniquely
identify a member. IP and port fallaciously could be considered as a good unique
identifier, but some arguments below demolish this belief:
if instances work in separate containers, they can have the same IP/port
inside a container NATed to a unique IP/port outside the container;
IP/port are likely to change during instance lifecycle. Once IP/port are
changed, a ghost of the old member's configuration still lives for a while
until it is suspected, dead and GC-ed. Taking into account that ACK
timeout can be tens of seconds, 'Dead Souls' can exist unpleasantly long.
Tarantool SWIM implementation uses UUIDs as unique identifiers. UUID is much
more unlikely to change than IP/port. But even if that happens, dissemination
component for a while gossips the new UUID together with the old one, and the
new UUID is learned by all other instances in a couple of rounds.
Additionally to the classical SWIM and to the anti-entropy extension, Tarantool
allows to disseminate your own events. For that with each SWIM instance a
payload can be associated. Payload is arbitrary user data limited in size down
to ~1.2Kb. A user can specify here anything, and it will be eventually
disseminated over the cluster and available at other instances. Each instance
can set out its own payload.
SWIM is supposed to work not only in closed networks, but can be used via public
Internet. Obviously, it would not be safe to use SWIM as is, because data would
be vulnerable. For this Tarantool SWIM provides optional encryption. A user can
choose an encryption algorithm, an encryption mode, a private key, and all the
messages (every ping, ack, event, payload, URI, UUID - everything) are encrypted
with that private key, and a random public key generated for each message to
prevent pattern attacks.
According to the SWIM paper, event dissemination speed over the whole cluster is
O(log(cluster_size)). For other math details see the paper:
https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf
SWIM expects 1500 bytes MTU.
API
SWIM is implemented in pure C, but has Lua API exposed as
swim
module via FFI.How to add:
swim = require('swim')
.SWIM module
The module object does not provide other methods.
SWIM instance
Once a SWIM instance is created via
swim.new()
, it providesthese methods, before first configuration.
Example:
When
swim:cfg()
is called at least once, the SWIM instance exposes a fullvariety of its methods.
Example:
How to set your payload.
Example:
How to set encryption. For brief description of encryption
algorithms see https://github.com/tarantool/tarantool/blob/master/src/lib/crypto/crypto.h#L56
and https://github.com/tarantool/tarantool/blob/master/src/lib/crypto/crypto.h#L83.
Example:
How to look at member table.
Note, all these methods caches their result. It means, that if a
member is once requested via
self()
, ormember_by_uuid()
, orpairs()
, then on a next lookup exactly the same object will bereturned. It means, that these methods are not expensive and does
not produce garbage.
Example:
SWIM member
Methods
swim:member_by_uuid()
,swim:self()
, andswim:pairs()
returnmember objects. A member object has its own API to read its attributes.
Example:
Requested by @Gerold103 in tarantool/tarantool#3234.
The text was updated successfully, but these errors were encountered: