SWIM #783

Closed
TarantoolBot opened this issue May 21, 2019 · 2 comments
Labels
feature A new functionality reference [location] Tarantool manual, Reference part server [area] Task relates to Tarantool's server (core) functionality

TarantoolBot commented May 21, 2019

Protocol description

SWIM - Scalable Weakly-consistent Infection-style Process Group Membership
Protocol. In short, it is a protocol for discovering cluster members and
detecting their failures; it works over UDP. The long version follows.

SWIM consists of 2 components: event dissemination and failure detection, and
stores in memory a table of known remote hosts - members. Also some SWIM
implementations have an additional component: anti-entropy - periodical
broadcast of a random subset of the member table. Tarantool has it as well,
among other extensions.

SWIM has a main operating cycle during which it randomly chooses members from a
member table and sends to them events + ping. Replies are processed out of the
main cycle, asynchronously.

When a member fails to acknowledge too many pings, its status is changed to
'suspected'. The SWIM paper describes and Tarantool implements a suspicion
subcomponent as a protection against false-positive detection of alive members
as dead. That happens when a member is overloaded and responds to pings too
slowly, or when the network is in trouble and packets cannot get through some
channels.
When a member is suspected, another instance pings it indirectly via other
members. It sends a fixed number of pings to the suspected one in parallel via
additional hops selected randomly among other members.

Random selection in all the components provides even network load of ~1 message
on each member per one protocol step regardless of the cluster size - it is one
of the killer features of SWIM. Without randomness each member would receive a
network load of N messages in each protocol step, where N is the cluster size.

To speed up propagation of new information SWIM proposes and Tarantool
implements a kind of fairness: when selecting a next random member to ping, the
protocol prefers LRU members.

Tarantool splits protocol operation into rounds. At the beginning of a round all
members are randomly reordered into a queue. At each round step a member is
popped from the queue, a message is sent to it, and then it waits for the next
round. Round message contains ping, events, and anti-entropy.
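
The round mechanics described above can be sketched in C as follows. This is only an illustration, not Tarantool's actual code: the names `member_id`, `struct round`, `round_queue_shuffle()` and `round_step()` are hypothetical, and `rand()` stands in for whatever random source the real implementation uses. A Fisher-Yates shuffle builds the queue, and one member is popped per step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical member handle; real members carry much more state. */
typedef int member_id;

/* State of one round: the shuffled queue and the next position in it. */
struct round {
    member_id *queue;
    size_t count;
    size_t next;
};

/* Fisher-Yates shuffle: randomly reorder all members in place. */
static void
round_queue_shuffle(member_id *queue, size_t count)
{
    for (size_t i = count; i > 1; i--) {
        size_t j = (size_t)rand() % i;
        member_id tmp = queue[i - 1];
        queue[i - 1] = queue[j];
        queue[j] = tmp;
    }
}

/*
 * One round step: return the member that should receive the next
 * round message. When the queue is exhausted, a new round starts
 * with a fresh random order, so each member is picked exactly once
 * per round.
 */
static member_id
round_step(struct round *r)
{
    assert(r->count > 0);
    if (r->next >= r->count) {
        round_queue_shuffle(r->queue, r->count);
        r->next = 0;
    }
    return r->queue[r->next++];
}
```

To start the first round, initialize `next` equal to `count`; the first call to `round_step()` then shuffles the queue before popping from it.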

Anti-entropy is a SWIM extension supported by Tarantool. Why is it needed and
even vital? Consider an example: two SWIM nodes, both are alive. Nothing
happens, so the event list is empty, only pings are being sent periodically.
Then a third node appears. It knows about one of the existing nodes. How can it
learn about the other one? Sure, its known counterpart can try to notify its
peer, but it is UDP, so this event can be lost. Anti-entropy is an extra simple
component: it just piggybacks a random part of the member table onto each
regular message. In the example above the new node will sooner or later learn
about the other existing node via anti-entropy messages from the node it
already knows.

Surprisingly, original SWIM does not describe any addressing, how to uniquely
identify a member. IP and port fallaciously could be considered as a good unique
identifier, but some arguments below demolish this belief:

  • if instances work in separate containers, they can have the same IP/port
    inside a container NATed to a unique IP/port outside the container;

  • IP/port are likely to change during instance lifecycle. Once IP/port are
    changed, a ghost of the old member's configuration still lives for a while
    until it is suspected, dead and GC-ed. Taking into account that ACK
    timeout can be tens of seconds, 'Dead Souls' can exist unpleasantly long.

Tarantool SWIM implementation uses UUIDs as unique identifiers. A UUID is much
less likely to change than IP/port. But even if that happens, the dissemination
component gossips the new UUID together with the old one for a while, and the
new UUID is learned by all other instances in a couple of rounds.

In addition to the classical SWIM and the anti-entropy extension, Tarantool
lets you disseminate your own events. For that, a payload can be associated
with each SWIM instance. Payload is arbitrary user data limited in size to
~1.2 KB. A user can put anything there, and it will eventually be
disseminated over the cluster and become available at other instances. Each
instance can set its own payload.

SWIM is meant to work not only in closed networks but also over the public
Internet. Using SWIM there as is would not be safe, because the data would be
exposed. For this reason Tarantool SWIM provides optional encryption. A user
can choose an encryption algorithm, an encryption mode, and a private key. All
the messages (every ping, ack, event, payload, URI, UUID - everything) are then
encrypted with that private key and with a random public key generated for each
message to prevent pattern attacks.

According to the SWIM paper, event dissemination speed over the whole cluster is
O(log(cluster_size)). For other math details see the paper:
https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf

SWIM expects 1500 bytes MTU.

API

SWIM is implemented in pure C, but has a Lua API exposed as the swim module via
FFI. To load it: swim = require('swim').

SWIM module

--
-- Create a new SWIM instance. SWIM instance maintains a member
-- table, interacts with other members. In one Tarantool process
-- multiple SWIM instances can be created.
--
-- @a cfg is an optional configuration parameter. When it is nil,
-- a result SWIM instance is not bound to a socket, nor has a
-- UUID, or any other settings. It can't interact with anybody,
-- and has very limited API until a first configuration. When it
-- is not nil, it is the same as calling
--
--     s = swim.new() s:cfg(cfg).
--
-- For configuration description @sa swim:cfg().
--
swim.new(cfg)

The module object does not provide other methods.

SWIM instance

Once a SWIM instance is created via swim.new(), it provides
the following methods even before the first configuration.

--
-- Delete a SWIM instance immediately. Its memory is freed, member
-- table deleted, and it can't be used anymore. Other members will
-- think, that this instance is dead. The method does not return
-- anything and can't fail.
-- Note, after that method is executed, any attempt to use the
-- deleted instance leads to a thrown exception.
--
swim:delete()

--
-- Check if a SWIM instance is configured. It is false only when
-- an instance is created via swim.new() without a configuration,
-- and swim:cfg() was not called yet. Returns boolean true or
-- false.
--
swim:is_configured()

--
-- Configure or reconfigure a SWIM instance. @a cfg is a table
-- with the following options:
-- - heartbeat_rate - rate of sending round messages, in seconds.
--   It does not mean that each member will be checked each
--   heartbeat_rate seconds. It is rather the protocol speed.
--   Protocol period depends on member count and heartbeat_rate.
--   By default it is 1 second.
--
-- - ack_timeout - time in seconds after which a ping is
--   considered to be unacknowledged. By default it is 30 seconds.
--
-- - gc_mode - dead member collection mode. When it is 'on', dead
--   members are removed from the member table after one round of
--   dissemination. When it is 'off', dead members are never
--   removed. The SWIM instance will constantly ping them as any
--   other alive member. By default it is 'on'.
--
-- - uri - string or number URI. It can be an 'ip:port' address,
--   or just a port number (then 127.0.0.1 IP is used). Port 0 is
--   supported - it means, that the kernel will choose any free
--   port for the selected IP address.
--
-- - uuid - UUID string or cdata struct tt_uuid. It should be
--   unique among other SWIM instances. Note, that it is allowed
--   to use box.cfg.instance_uuid - it is ok to intersect with
--   Tarantool instance UUID. Moreover, it is recommended. But
--   you are free to choose any UUID.
--
-- All these parameters are dynamic - they can be changed after
-- first configuration. Note, that on first configuration URI and
-- UUID are mandatory - an instance can't operate without them.
--
-- The function is atomic - either the entire configuration is
-- applied, or nothing changes in case of an error.
--
-- @retval true Successful configuration.
-- @retval nil,err An error occurred. @a err is an error object.
--
swim:cfg(cfg)

--
-- Swim.cfg when indexed provides a read-only table of
-- configuration options specified in previously called
-- swim:cfg().
--
swim.cfg.<index>

Example:

tarantool> swim = require('swim')
---
...

tarantool> s = swim.new()
---
...

tarantool> s:is_configured()
---
- false
...

tarantool> s:cfg({})
---
- null
- 'swim.cfg: UUID and URI are mandatory in a first config'
...

tarantool> s:cfg({uri = 3333, uuid = '00000000-0000-1000-8000-000000000001'})
---
- true
...

tarantool> s.cfg
---
- uri: 3333
  uuid: 00000000-0000-1000-8000-000000000001
...

tarantool> s:cfg{gc_mode = 'off'}
---
- true
...

tarantool> s.cfg
---
- gc_mode: off
  uri: 3333
  uuid: 00000000-0000-1000-8000-000000000001
...

tarantool>

When swim:cfg() is called at least once, the SWIM instance exposes a full
variety of its methods.

--
-- Size of the member table. Remember, that it is always at least
-- 1 - self member. The function never fails and always returns a
-- number.
--
swim:size()

--
-- A graceful equivalent of swim:delete() - the instance is
-- deleted, but before deletion it sends to each member in its
-- table a message, that this instance has left the cluster, and
-- should not be considered dead. Other instances mark such member
-- in their tables as 'left', and drop it after one round of
-- dissemination. Consequences to the caller are the same as after
-- swim:delete() - the instance is not usable anymore, and it
-- throws an error on any usage attempts. The function can't fail.
--
swim:quit()

--
-- Explicitly add a new member into the member table. @a cfg is a
-- table which describes its attributes: {uuid = ..., uri = ...}.
-- Both values are mandatory, and rules are the same as for URI
-- and UUID in swim:cfg(). The method is useful, when a new member is
-- just added to the cluster, and it does not know anybody. Then
-- it can start interaction explicitly with one of existing
-- members via add_member(), and learn about other members
-- automatically from the added one.
--
-- @retval true Member is added.
-- @retval nil,err An error occurred. @a err is an error object.
--
swim:add_member(cfg)

--
-- Explicitly and immediately remove a member from the member
-- table.
--
-- @param uuid UUID string or cdata struct tt_uuid.
-- @retval true Member is removed.
-- @retval nil,err An error occurred. @a err is an error object.
--
swim:remove_member(uuid)

--
-- Send a ping request to @a uri address. If another member
-- listens on that address, it will receive the ping, respond with
-- ACK containing its UUID, and the member will be added to the
-- member table. The method is similar to swim:add_member(), but
-- does not require UUID, and it is not reliable as it uses UDP.
--
-- @param uri String or number URI. Rules are the same as for
--        swim:cfg() uri.
-- @retval true Ping is sent.
-- @retval nil,err An error occurred. @a err is an error object.
--
swim:probe_member(uri)

--
-- Broadcast a ping request to all the network interfaces in the
-- system. It is like swim:probe_member(), but to many members at
-- once.
--
-- @param port Optional port argument. All the sent ping requests
--        have this port as destination port in their UDP headers.
--        By default a currently bound port is used.
-- @retval true Broadcast is sent.
-- @retval nil,err An error occurred. @a err is an error object.
--
swim:broadcast(port)

Example:

tarantool> fiber = require('fiber')
---
...

tarantool> swim = require('swim')
---
...

tarantool> s1 = swim.new({uri = 3333, uuid = '00000000-0000-1000-8000-000000000001', heartbeat_rate = 0.1})
---
...

tarantool> s2 = swim.new({uri = 3334, uuid = '00000000-0000-1000-8000-000000000002', heartbeat_rate = 0.1})
---
...

tarantool> s1:size()
---
- 1
...

tarantool> s1:add_member({uri = s2:self():uri(), uuid = s2:self():uuid()})
---
- true
...

tarantool> s1:size()
---
- 1
...

tarantool> s2:size()
---
- 1
...

tarantool> fiber.sleep(0.2)
---
...

tarantool> s1:size()
---
- 2
...

tarantool> s2:size()
---
- 2
...

tarantool> s1:remove_member(s2:self():uuid()) s2:remove_member(s1:self():uuid()) 
---
...

tarantool> s1:size()
---
- 1
...

tarantool> s2:size()
---
- 1
...

tarantool> s1:probe_member(s2:self():uri())
---
- true
...

tarantool> fiber.sleep(0.1)
---
...

tarantool> s1:size()
---
- 2
...

tarantool> s2:size()
---
- 2
...

tarantool> s1:remove_member(s2:self():uuid()) s2:remove_member(s1:self():uuid()) 
---
...

tarantool> s1:size()
---
- 1
...

tarantool> s2:size()
---
- 1
...

tarantool> s1:broadcast(3334)
---
- true
...

tarantool> fiber.sleep(0.1)
---
...

tarantool> s1:size()
---
- 2
...

tarantool> s2:size()
---
- 2
...

How to set your payload.

--
-- Payload is arbitrary user defined data up to 1200 bytes in size.
-- It is disseminated over the cluster, so that each cluster member
-- will eventually learn that payload and that it is associated
-- with that concrete member. Remember, that payload is specified
-- per member. It is not a singleton per cluster. Each cluster
-- member can set out its own payload.
--
-- @param payload Arbitrary Lua object to disseminate. Set to nil
--        to remove the payload. It will be eventually removed
--        on other instances. The object is serialized in
--        MessagePack.
-- @retval true Payload is set.
-- @retval nil,err An error occurred. @a err is an error object.
--
swim:set_payload(payload)

--
-- Sometimes a Lua object is not needed as a payload. For example,
-- a user already has well formatted MessagePack and just wants to
-- set it as a payload. Or cdata needs to be exposed. This method
-- sets a payload as is, without MessagePack serialization.
--
-- @param payload String or any cdata.
-- @param size Payload size in bytes. In case of string it is
--        optional, and if specified, then should not be bigger
--        than @a payload. If it is less, then only first @a size
--        bytes of @a payload are used. In case of cdata @a size
--        is mandatory.
-- @retval true Payload is set.
-- @retval nil,err An error occurred. @a err is an error object.
--
swim:set_payload_raw(payload, size)

Example:

tarantool> ffi = require('ffi')
---
...

tarantool> fiber = require('fiber')
---
...

tarantool> swim = require('swim')
---
...

tarantool> s1 = swim.new({uri = 0, uuid = '00000000-0000-1000-8000-000000000001', heartbeat_rate = 0.1})
---
...

tarantool> s2 = swim.new({uri = 0, uuid = '00000000-0000-1000-8000-000000000002', heartbeat_rate = 0.1})
---
...

tarantool> s1:add_member({uri = s2:self():uri(), uuid = s2:self():uuid()})
---
- true
...

tarantool> s1:set_payload({a = 100, b = 200})
---
- true
...

tarantool> s2:set_payload('any payload')
---
- true
...

tarantool> fiber.sleep(0.2)
---
...

tarantool> s1_view = s2:member_by_uuid(s1:self():uuid())
---
...

tarantool> s2_view = s1:member_by_uuid(s2:self():uuid())
---
...

tarantool> s1_view:payload()
---
- {'a': 100, 'b': 200}
...

tarantool> s2_view:payload()
---
- any payload
...

tarantool> cdata = ffi.new('char[?]', 2)
---
...

tarantool> cdata[0] = 1
---
...

tarantool> cdata[1] = 2
---
...

tarantool> s1:set_payload_raw(cdata, 2)
---
- true
...

tarantool> fiber.sleep(0.2)
---
...

tarantool> cdata, size = s1_view:payload_cdata()
---
...

tarantool> cdata[0]
---
- 1
...

tarantool> cdata[1]
---
- 2
...

tarantool> size
---
- 2
...

How to set encryption. For brief description of encryption
algorithms see https://github.com/tarantool/tarantool/blob/master/src/lib/crypto/crypto.h#L56
and https://github.com/tarantool/tarantool/blob/master/src/lib/crypto/crypto.h#L83.

--
-- Enable an encryption. When encryption is enabled, all the
-- messages are encrypted with a chosen private key, and a
-- randomly generated and updated public key. @a cfg parameter is
-- an encryption algorithm specification. It is a table with the
-- following options:
--
-- - algo - algorithm name as a string. Supports all the same
--   algorithms as crypto module. Those are 'aes128', 'aes192',
--   'aes256', 'des'. To disable encryption use 'none'.
--
-- - mode - algorithm encryption mode. Supports all the same modes
--   as crypto module. Those are 'ecb', 'cbc', 'cfb', 'ofb'. Default is 'cbc'.
--
-- - key - a private !!secret!! key. Never store it hardcoded in
--   the source code. It can be cdata or a string.
--
-- - key_size - an optional argument, size of @a key in bytes.
--   It is mandatory in case of @a key is cdata. It is optional,
--   when @a key is a string, and allows to truncate it.
--
-- Note, that the private key, algorithm, and mode should be the
-- same on all instances that need to interact with each other.
--
swim:set_codec(cfg)

Example:

tarantool> swim = require('swim')
---
...

tarantool> s1 = swim.new({uri = 0, uuid = '00000000-0000-1000-8000-000000000001'})
---
...

tarantool> s1:set_codec({algo = 'aes128', mode = 'cbc', key = '1234567812345678'})
---
- true
...

How to look at member table.

--
-- Take a self SWIM member object. Never fails.
--
swim:self()

--
-- Find a SWIM member by UUID in the member table.
-- @param uuid UUID string or cdata struct tt_uuid.
-- @retval nil Not found.
-- @retval not-nil A member object.
--
swim:member_by_uuid(uuid)

--
-- Iterator for the member table. It should be used in 'for'.
-- Note, that only one iterator can be active per SWIM instance
-- at a time: the iterator is implemented to be extra light, so
-- only one iterator object is available per SWIM instance.
-- Returns the same as pairs() - generator function, iterator
-- object, and a key before first. Keys are UUID, values are
-- member objects.
--
swim:pairs()

Note, all these methods cache their results. It means, that if a
member is once requested via self(), or member_by_uuid(), or
pairs(), then on the next lookup exactly the same object will be
returned. It means, that these methods are not expensive and do
not produce garbage.

Example:

tarantool> fiber = require('fiber')
---
...

tarantool> swim = require('swim')
---
...

tarantool> s1 = swim.new({uri = 0, uuid = '00000000-0000-1000-8000-000000000001', heartbeat_rate = 0.1})
---
...

tarantool> s2 = swim.new({uri = 0, uuid = '00000000-0000-1000-8000-000000000002', heartbeat_rate = 0.1})
---
...

tarantool> s1:add_member({uri = s2:self():uri(), uuid = s2:self():uuid()})
---
- true
...

tarantool> fiber.sleep(0.2)
---
...

tarantool> s1:self()
---
- uri: 127.0.0.1:62341
  status: alive
  incarnation: 1
  uuid: 00000000-0000-1000-8000-000000000001
  payload_size: 0
...

tarantool> s1:member_by_uuid(s1:self():uuid())
---
- uri: 127.0.0.1:62341
  status: alive
  incarnation: 1
  uuid: 00000000-0000-1000-8000-000000000001
  payload_size: 0
...

tarantool> s1:member_by_uuid(s2:self():uuid())
---
- uri: 127.0.0.1:55435
  status: alive
  incarnation: 1
  uuid: 00000000-0000-1000-8000-000000000002
  payload_size: 0
...

tarantool> t = {}
---
...

tarantool> for k, v in s1:pairs() do table.insert(t, {k, v}) end
---
...

tarantool> t
---
- - - 00000000-0000-1000-8000-000000000002
    - uri: 127.0.0.1:55435
      status: alive
      incarnation: 1
      uuid: 00000000-0000-1000-8000-000000000002
      payload_size: 0
  - - 00000000-0000-1000-8000-000000000001
    - uri: 127.0.0.1:62341
      status: alive
      incarnation: 1
      uuid: 00000000-0000-1000-8000-000000000001
      payload_size: 0
...

SWIM member

Methods swim:member_by_uuid(), swim:self(), and swim:pairs() return
member objects. A member object has its own API to read its attributes.

--
-- Member status as a string. It can be 'alive', 'suspected',
-- 'left', and 'dead'.
--
member:status()

--
-- Member UUID as cdata struct tt_uuid.
--
member:uuid()

--
-- Real member URI as a string 'ip:port'. Via this method a user
-- can learn a real assigned port, when port = 0 was specified in
-- swim:cfg().
--
member:uri()

--
-- A number incremented on each member update.
--
member:incarnation()

--
-- Member payload as 'const char *' cdata and its size in bytes.
-- Returns these two values.
--
member:payload_cdata()

--
-- Return payload as a string object. Payload is not decoded. It
-- is just returned as a string instead of cdata. If payload was
-- not specified (its size == 0), then nil is returned.
--
member:payload_str()

--
-- Since this is a Lua module, a user is likely to use Lua objects
-- as a payload - tables, numbers, strings etc. And it is natural
-- to expect that member:payload() should return the same object
-- which was passed into swim:set_payload() on another instance.
-- This member method tries to interpret payload as MessagePack,
-- and if that fails, returns the payload as a string.
--
-- This function caches its result. It means, that only first call
-- actually decodes cdata payload. All the next calls return
-- pointer to the same result, until payload is changed with a new
-- incarnation. If payload was not specified (its size == 0), then nil is
-- returned.
--
member:payload()

--
-- Returns true, if this member object is a stray reference to a
-- member, already dropped from the member table.
--
member:is_dropped()

Example:

tarantool> swim = require('swim')
---
...

tarantool> s = swim.new({uri = 0, uuid = '00000000-0000-1000-8000-000000000001'})
---
...

tarantool> self = s:self()
---
...

tarantool> self:status()
---
- alive
...

tarantool> self:uuid()
---
- 00000000-0000-1000-8000-000000000001
...

tarantool> self:uri()
---
- 127.0.0.1:56367
...

tarantool> self:incarnation()
---
- 1
...

tarantool> self:is_dropped()
---
- false
...

tarantool> s:set_payload_raw('123')
---
- true
...

tarantool> self:payload_cdata()
---
- 'cdata<const char *>: 0x0103500050'
- 3
...

tarantool> self:payload_str()
---
- '123'
...

tarantool> s:set_payload({a = 100})
---
- true
...

tarantool> self:payload_cdata()
---
- 'cdata<const char *>: 0x0103500050'
- 4
...

tarantool> self:payload_str()
---
- !!binary gaFhZA==
...

tarantool> self:payload()
---
- {'a': 100}
...

Requested by @Gerold103 in tarantool/tarantool#3234.

@lenkis lenkis added 2.1 feature A new functionality reference [location] Tarantool manual, Reference part server [area] Task relates to Tarantool's server (core) functionality labels May 22, 2019

lenkis commented May 22, 2019

To support SWIM, Tarantool now has a new built-in module called swim.

@Gerold103 please add details to this ticket about the SWIM binary protocol.

@Gerold103

Binary protocol

SWIM wire protocol is open and will stay backward compatible in case of
any changes. It can be implemented in order to simulate your own
SWIM cluster members in another language, or even in software not
related to Tarantool. The protocol is encoded as MessagePack.

SWIM packet structure:

+-----------------Public data, not encrypted------------------+
|                                                             |
|      Initial vector, size depends on chosen algorithm.      |
|                   Next data is encrypted.                   |
|                                                             |
+----------Meta section, handled by transport level-----------+
| map {                                                       |
|     0 = SWIM_META_TARANTOOL_VERSION: uint, Tarantool        |
|                                      version ID,            |
|     1 = SWIM_META_SRC_ADDRESS: uint, ip,                    |
|     2 = SWIM_META_SRC_PORT: uint, port,                     |
|     3 = SWIM_META_ROUTING: map {                            |
|         0 = SWIM_ROUTE_SRC_ADDRESS: uint, ip,               |
|         1 = SWIM_ROUTE_SRC_PORT: uint, port,                |
|         2 = SWIM_ROUTE_DST_ADDRESS: uint, ip,               |
|         3 = SWIM_ROUTE_DST_PORT: uint, port                 |
|     }                                                       |
| }                                                           |
+-------------------Protocol logic section--------------------+
| map {                                                       |
|     0 = SWIM_SRC_UUID: 16 byte UUID,                        |
|                                                             |
|                 AND                                         |
|                                                             |
|     2 = SWIM_FAILURE_DETECTION: map {                       |
|         0 = SWIM_FD_MSG_TYPE: uint, enum swim_fd_msg_type,  |
|         1 = SWIM_FD_INCARNATION: uint                       |
|     },                                                      |
|                                                             |
|               OR/AND                                        |
|                                                             |
|     3 = SWIM_DISSEMINATION: array [                         |
|         map {                                               |
|             0 = SWIM_MEMBER_STATUS: uint,                   |
|                                     enum member_status,     |
|             1 = SWIM_MEMBER_ADDRESS: uint, ip,              |
|             2 = SWIM_MEMBER_PORT: uint, port,               |
|             3 = SWIM_MEMBER_UUID: 16 byte UUID,             |
|             4 = SWIM_MEMBER_INCARNATION: uint,              |
|             5 = SWIM_MEMBER_PAYLOAD: bin                    |
|         },                                                  |
|         ...                                                 |
|     ],                                                      |
|                                                             |
|               OR/AND                                        |
|                                                             |
|     1 = SWIM_ANTI_ENTROPY: array [                          |
|         map {                                               |
|             0 = SWIM_MEMBER_STATUS: uint,                   |
|                                     enum member_status,     |
|             1 = SWIM_MEMBER_ADDRESS: uint, ip,              |
|             2 = SWIM_MEMBER_PORT: uint, port,               |
|             3 = SWIM_MEMBER_UUID: 16 byte UUID,             |
|             4 = SWIM_MEMBER_INCARNATION: uint,              |
|             5 = SWIM_MEMBER_PAYLOAD: bin                    |
|         },                                                  |
|         ...                                                 |
|     ],                                                      |
|                                                             |
|               OR/AND                                        |
|                                                             |
|     4 = SWIM_QUIT: map {                                    |
|         0 = SWIM_QUIT_INCARNATION: uint                     |
|     }                                                       |
| }                                                           |
+-------------------------------------------------------------+

Initial vector

This section is optional and appears only when encryption is
used. This section contains a public key. For example, for AES
algorithms it is a 16 byte initial vector stored as is. When no
encryption is used, the section size is 0.

All the next sections are encrypted as one big data chunk, if
encryption is enabled.

Meta section

This section handles routing and protocol version compatibility.
It works at the 'transport' level of the SWIM protocol, and is
always present. Keys:

  • SWIM_META_TARANTOOL_VERSION - mandatory field. Tarantool sets
    here its version as a 3 byte integer: 1 byte for major, 1 byte
    for minor, 1 byte for patch. For example, version 2.1.3 would
    be encoded like this: (((2 << 8) | 1) << 8) | 3;. This field
    will be used to support multiple versions of the protocol;

  • SWIM_META_SRC_ADDRESS and SWIM_META_SRC_PORT - mandatory
    fields, source IP address and port. IP is encoded as 4 bytes.
    "xxx.xxx.xxx.xxx" - each of 'xxx' is one byte. Port is encoded
    as an integer. Example of how to encode "127.0.0.1:3313":

    struct in_addr addr;
    inet_aton("127.0.0.1", &addr);
    pos = mp_encode_uint(pos, SWIM_META_SRC_ADDRESS);
    pos = mp_encode_uint(pos, addr.s_addr);
    pos = mp_encode_uint(pos, SWIM_META_SRC_PORT);
    pos = mp_encode_uint(pos, 3313);

  • SWIM_META_ROUTING - optional field. It is a subsection
    responsible for packet forwarding. It is used by the SWIM
    suspicion mechanism. Since this is a pure wire protocol
    description, read about suspicion in the SWIM paper. All the
    fields in this section are mandatory, if it is present.

    • SWIM_ROUTE_SRC_ADDRESS and SWIM_ROUTE_SRC_PORT - source
      IP address and port. It should be an address of the
      message originator, and can be different from
      SWIM_META_SRC_ADDRESS/PORT;
    • SWIM_ROUTE_DST_ADDRESS and SWIM_ROUTE_DST_PORT -
      destination IP address and port. They should be set to
      the message final destination.

    If a message was sent indirectly with the help of this section,
    the answer should be sent back by the same route. Here is an
    example of how SWIM uses routing for an indirect ping.

    Assume, there are 3 nodes: S1, S2, S3. S1 sends a message to
    S3 via S2. The following steps are executed in order to
    deliver the message:

    S1 -> S2
    { src: S1, routing: {src: S1, dst: S3}, body: ... }
    

    S2 receives the message and sees: routing.dst != S2 - it is
    a foreign packet. S2 forwards it to S3 preserving all the
    data - body and routing sections.

    S2 -> S3
    {src: S2, routing: {src: S1, dst: S3}, body: ...}
    

    S3 receives the message and sees: routing.dst == S3 - the
    message is delivered. If S3 wants to answer, it sends a
    response via the same proxy. It knows, that the message was
    delivered from S2, and sends an answer via S2.
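
The forwarding rule from the S1/S2/S3 example can be sketched in C as below. This is a simplified model, not the real implementation: all type and function names (`struct addr`, `struct packet_meta`, `route_packet()`, `reply_route()`) are hypothetical, and the shape of the reply routing section (source set to the answering node, destination set to the original sender) is an assumption derived from the rule that the answer goes back by the same route.

```c
#include <stdbool.h>
#include <stdint.h>

/* IP address and port of a node, as in the meta section. */
struct addr {
    uint32_t ip;
    uint16_t port;
};

/* Decoded SWIM_META_ROUTING subsection. */
struct route {
    struct addr src;
    struct addr dst;
};

/* Decoded meta section: direct sender plus optional routing. */
struct packet_meta {
    struct addr src;
    bool has_route;
    struct route route;
};

enum route_action { ROUTE_DELIVER, ROUTE_FORWARD };

static bool
addr_eq(const struct addr *a, const struct addr *b)
{
    return a->ip == b->ip && a->port == b->port;
}

/*
 * Decide what a node must do with a received packet. A packet
 * without routing, or with routing.dst equal to this node, is
 * delivered locally. Otherwise it is a foreign packet: it must be
 * forwarded to routing.dst with body and routing left untouched.
 */
static enum route_action
route_packet(const struct packet_meta *meta, const struct addr *self,
             struct addr *next_hop)
{
    if (!meta->has_route || addr_eq(&meta->route.dst, self))
        return ROUTE_DELIVER;
    *next_hop = meta->route.dst;
    return ROUTE_FORWARD;
}

/*
 * Build routing for an answer to an indirectly delivered packet.
 * Returns the first hop: the proxy the request came from.
 */
static struct addr
reply_route(const struct packet_meta *request, const struct addr *self,
            struct route *reply)
{
    reply->src = *self;
    reply->dst = request->route.src;
    return request->src;
}
```

With this rule the proxy needs no per-request state: when the answer reaches S2, its routing.dst is S1, so S2 treats it as one more foreign packet and forwards it.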

Protocol logic section.

This section handles SWIM logical protocol steps and actions.

  • SWIM_SRC_UUID - mandatory field. SWIM uses UUID as a unique
    identifier of a member, not IP/port. This field stores UUID of
    sender. Its type is MP_BIN. Size is always 16 bytes. UUID is
    encoded in host byte order, no bswaps are needed.

The next sections can all be present, or only some of them. A
connector should be ready for any combination. In each section
either a member or the section as a whole has an incarnation
number. This number is used to ignore old messages and to refute
false ones. If the incarnation of a member in a message is less
than the locally stored one, then the message is outdated. This
happens because UDP allows reordering and duplication.

Refutation usually happens when a false-positive failure
detection has occurred. In such a case the member thought to be
dead learns that from other members, increases its own
incarnation, and spreads the refutation saying the member is
alive.

When member's incarnation in a message is bigger than local one,
all its attributes should be updated with the received ones (IP,
port, status). Payload is a bit different. Payload can be updated
only if it is present in the message. Because of its huge size
(in comparison with UDP packet max size) it is not sent with each
member always.
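
The incarnation rules above can be sketched in C as follows. It is a minimal model under stated assumptions: `struct member_def` and `member_apply_update()` are hypothetical names, and the case of equal incarnations is not covered by the text above, so this sketch simply leaves the local record untouched in that case.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Status codes as used on the wire. */
enum member_status {
    MEMBER_ALIVE = 0,
    MEMBER_SUSPECTED = 1,
    MEMBER_DEAD = 2,
    MEMBER_LEFT = 3
};

/* One member record: either stored locally or decoded from a message. */
struct member_def {
    uint64_t incarnation;
    uint32_t ip;
    uint16_t port;
    enum member_status status;
    /* False when the message did not carry SWIM_MEMBER_PAYLOAD. */
    bool has_payload;
    const char *payload;
    size_t payload_size;
};

/*
 * Apply a received member record to the local one. Returns true
 * when the local record was updated. Records that are not newer
 * than the local one are ignored (equal incarnations are left
 * alone here, which is an assumption of this sketch).
 */
static bool
member_apply_update(struct member_def *local, const struct member_def *msg)
{
    if (msg->incarnation <= local->incarnation)
        return false;
    local->incarnation = msg->incarnation;
    local->ip = msg->ip;
    local->port = msg->port;
    local->status = msg->status;
    /* Absent payload says nothing: keep the old one. */
    if (msg->has_payload) {
        local->has_payload = true;
        local->payload = msg->payload;
        local->payload_size = msg->payload_size;
    }
    return true;
}
```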

  • SWIM_FAILURE_DETECTION - this subsection describes a ping or
    ACK;
    • SWIM_FD_MSG_TYPE - type of the message. 0 is ping, 1 is
      ACK;
    • SWIM_FD_INCARNATION - incarnation number of sender;
  • SWIM_DISSEMINATION - this subsection is a heart of SWIM. It
    lists changed cluster members. At least part of them, when
    there are too many changes to fit into one UDP packet;
    • SWIM_MEMBER_STATUS - member status. This field is
      mandatory. 0 - alive, 1 - suspected, 2 - dead, 3 - left;
    • SWIM_MEMBER_ADDRESS and SWIM_MEMBER_PORT - member IP and
      port. These fields are mandatory;
    • SWIM_MEMBER_UUID - member UUID. This field is mandatory;
    • SWIM_MEMBER_INCARNATION - member incarnation number. This
      field is mandatory;
    • SWIM_MEMBER_PAYLOAD - member payload. It is MP_BIN with
      arbitrary user data stored as is. The field is optional.
      Note, that absence of SWIM_MEMBER_PAYLOAD says nothing -
      it is not the same as 0 sized payload;
  • SWIM_ANTI_ENTROPY - this subsection is a helper of the
    dissemination. It contains all the same fields as the
    dissemination, but all of them are mandatory, including
    payload even when its size is 0. Anti-entropy eventually
    spreads changes not spread by the dissemination for any reason;
  • SWIM_QUIT - subsection saying that the sender has left the
    cluster gracefully and should not be considered dead. Sender
    should be marked 'left' then.
    • SWIM_QUIT_INCARNATION - sender incarnation number.
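
For a connector author, the numeric codes listed above could be collected in C roughly as follows. The key names and values are taken from the packet diagram and the list above; the enum type names and the helper `swim_status_name()` are hypothetical and exist only for this sketch.

```c
#include <stddef.h>

/* Keys of the protocol logic section map. */
enum swim_body_key {
    SWIM_SRC_UUID = 0,
    SWIM_ANTI_ENTROPY = 1,
    SWIM_FAILURE_DETECTION = 2,
    SWIM_DISSEMINATION = 3,
    SWIM_QUIT = 4
};

/* Values of SWIM_FD_MSG_TYPE. */
enum swim_fd_type {
    SWIM_FD_PING = 0,
    SWIM_FD_ACK = 1
};

/* Values of SWIM_MEMBER_STATUS. */
enum swim_status {
    SWIM_STATUS_ALIVE = 0,
    SWIM_STATUS_SUSPECTED = 1,
    SWIM_STATUS_DEAD = 2,
    SWIM_STATUS_LEFT = 3
};

/*
 * Translate a wire status code into the string used by the Lua
 * API (member:status()). Returns NULL for an unknown code, so a
 * connector can reject a malformed packet.
 */
static const char *
swim_status_name(unsigned code)
{
    static const char *names[] = {"alive", "suspected", "dead", "left"};
    if (code >= sizeof(names) / sizeof(names[0]))
        return NULL;
    return names[code];
}
```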

pgulutzan added a commit that referenced this issue Jun 8, 2019