To spread updates across every replica, we use an adaptation of Reliable Broadcast. The protocol is as follows:
- When a replica receives an update from a client, it sends it to every other replica in the same partition. When it gets its first ACK, it returns to the client, because it is now sure that every replica will eventually receive the message.
- When a replica `i` receives a broadcast from another replica `j`, it stores its content and sends an ACK to every replica (in the same partition). It also records that it received this message from `j` in a `retransmission_buffer` (`retbuf`). If at some point `i` learns that `j` has failed, then `i` retransmits every message it received from `j` that is stored in this `retbuf`. (In this case, the other replicas store this retransmission as a message received from `i` and do the same if they learn that `i` failed.)
- When a replica `i` receives an ACK for a message `m` from every replica that has not crashed, `i` may remove the entries for `m` from its `retbuf`, since every replica received `m` and it won't need to be retransmitted.
- Every time a replica `i` tries to contact another replica `j` and fails, it broadcasts (to every other replica or to every replica in the same partition?) that `j` has crashed.
  - In order to detect these failures quickly, every 5 seconds, each replica `k` pings the "next" replica alive, which is the replica `l` with the smallest `id` that is higher than `k.id`. If there is no such replica, it restarts from the lowest `id`. (Circular buffer scheme)
- Every client and replica keeps a version associated with each object. This version has the form `<counter, clientID>`, where `counter` is the version number, incremented by every write, and `clientID` is the ID of the last client that wrote it.
- When a client reads an object, the serving replica retrieves the stored object and its version. If the client's version is more recent than the retrieved one, the client may accept the read, discard it, or ask another replica for that object. In this client implementation, the client uses its cached value.
- When a client writes to a replica, it sends the object's version. The replica then compares it with its stored version. The new object version corresponds to `max(replica_version, client_version) + 1`. The client gets the new object version.
- When a replica `i` receives a write from another replica `j` with the same `counter` as the stored version, it keeps the version with the higher `clientID`.
- This algorithm is correct if the time between crashes in one partition is higher than the time needed for that partition to stabilize. This is the time needed to detect the crash and to send pending messages to one of the remaining replicas.
  E.g., if client `1` sends a write to replica `A` and `A` crashes after sending it to replica `B`, the time needed to stabilize is the time that `B` takes to notice that `A` crashed and send the write to some other replica. If `B` crashes after sending it to one replica, that replica is guaranteed to have enough time to notice that `B` crashed and send the pending message to another replica, and so on until every replica has crashed.
- The system is stable if, for every partition, every message is replicated at least twice.
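
The `retbuf` rules from the broadcast protocol above can be sketched as a small in-memory structure. This is only an illustration under stated assumptions, not the project's code: the class `RetransmissionBuffer`, its method names, and the message IDs are hypothetical, networking is left out, and the sender of a message is assumed to count as having acknowledged it.

```python
class RetransmissionBuffer:
    """Hypothetical per-replica retbuf; no networking, bookkeeping only."""

    def __init__(self, peers):
        # Other replicas in the same partition that are believed to be alive.
        self.alive = set(peers)
        # msg_id -> {"sender": id, "payload": obj, "acks": set of replica ids}
        self.entries = {}

    def on_broadcast(self, msg_id, sender, payload):
        # Assumption: the sender already holds the message, so it counts as acked.
        self.entries.setdefault(
            msg_id, {"sender": sender, "payload": payload, "acks": {sender}}
        )

    def on_ack(self, msg_id, from_replica):
        entry = self.entries.get(msg_id)
        if entry is not None:
            entry["acks"].add(from_replica)
            self._drop_if_fully_acked(msg_id)

    def on_crash(self, replica):
        """Mark `replica` as crashed and return the messages to retransmit."""
        self.alive.discard(replica)
        # A crash can complete the ACK set of some entries, so clean up first.
        for msg_id in list(self.entries):
            self._drop_if_fully_acked(msg_id)
        return [
            (msg_id, entry["payload"])
            for msg_id, entry in self.entries.items()
            if entry["sender"] == replica
        ]

    def _drop_if_fully_acked(self, msg_id):
        # Every replica that has not crashed has acknowledged the message.
        if self.alive <= self.entries[msg_id]["acks"]:
            del self.entries[msg_id]
```

For example, a replica with peers `B` and `C` that received `m3` from `B` and then learns that `B` crashed gets `m3` back from `on_crash("B")`, because `C` has not acknowledged it yet.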
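
The choice of the "next" replica to ping in the failure-detection rule above could look like the following minimal sketch. The function name `next_alive` and the use of integer IDs are assumptions made for illustration; the 5-second timer and the ping itself are not shown.

```python
def next_alive(my_id, alive_ids):
    """Return the ID of the replica to ping, or None if no other replica is alive."""
    others = sorted(i for i in alive_ids if i != my_id)
    if not others:
        return None
    for candidate in others:
        if candidate > my_id:
            return candidate  # smallest id higher than ours
    return others[0]  # wrap around to the lowest id
```

With alive replicas `{1, 2, 3, 5}`, replica `2` pings `3`, and replica `5` wraps around to `1`.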
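
The versioning rules above can be illustrated with a short sketch. It rests on two assumptions that the text does not state outright: the `max(...) + 1` rule is applied to the `counter` field and the writing client's ID becomes the new `clientID`, and a version with a higher `counter` always wins over a lower one. The names `Version`, `next_version`, and `merge` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Version:
    # Field order matters: comparison is by counter first, then client_id.
    counter: int
    client_id: int


def next_version(replica_version, client_version, writer_id):
    """Version assigned to a client write: max of both counters, plus one."""
    newest = max(replica_version.counter, client_version.counter)
    return Version(newest + 1, writer_id)


def merge(stored, incoming):
    """Pick which (version, value) pair a replica keeps for a replicated write."""
    # Assumption: higher counter wins; on equal counters the higher clientID wins.
    return incoming if incoming[0] > stored[0] else stored
```

For instance, if the replica stores counter 3 and the client sends counter 5, a write by client 7 gets version `<6, 7>`. Two concurrent writes with the same counter resolve to the same winner on every replica, since the tie-break depends only on the versions.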