Skip to content

Latest commit

 

History

History
356 lines (230 loc) · 20.8 KB

Lecture 22.md

File metadata and controls

356 lines (230 loc) · 20.8 KB

Distributed Systems Lecture 22

Lecture Given by Lindsey Kuper on May 29th, 2020 via YouTube

Previous Next
Lecture 21 Lecture 23

Keeping Replicas Consistent

Consider the following scenario in which we need strong consistency, but we have:

  • A datastore held across multiple replicas
  • These replicas have no leader nor coordinating process (I.E. We're not using Primary Backup or Chain Replication)
  • Any replica can receive an update

The challenge then concerns how to keep these replicas consistent with each other.

What approach should we adopt?

Well, we could use a consensus protocol to decide which updates should be processed and in which order.

If replica R1 receives update A and replica R2 receives update B, then these replicas must agree on the order in which these updates should be delivered.

Replica Consensus 1

So, even though the messages arrive in an unpredictable order, a consensus protocol can be used to decide upon the delivery order. In this case, the consensus protocol implements the concept of "delivery slots". It then determines that both replicas should place event B in delivery slot 1 and event A in delivery slot 2.

Q:   But how many messages need to be sent in order to arrive at this agreement?
A:   Lots!

Here's an example that shows the number of messages that need to be exchanged simply for replicas R1 and R2 to agree on a total delivery order for events A and B shown above.

Replica Consensus 2

Step Description Message Count
1 Replica R2 sends out a prepare(6) message to a majority of acceptors, who each respond with the corresponding promise messages 4
2 Just a little time after R2's message exchange has taken place, replica R1 sends out its prepare(5) messages to a majority of acceptors.
Acceptor A1 happily accepts this proposal number, but acceptor A2 has already promised to ignore messages with a proposal numbers less than 6, so this prepare message is ignored and replica R1 left hanging.
3
(which all turn out to be redundant)
3 Replica R2 sends out its accept(6,(slot_1,B)) messages to the acceptors who each respond with accepted(6,(slot_1,B)).
So, event B now occupies delivery slot 1.
4
4 Replica R1 still needs to get agreement on a total order for event A, so it tries the prepare/promise phase again, but now with proposal number 7.
This time, the proposal number is accepted.
4
5 Replica R1 then enters the accept/accepted phase and achieves consensus on event A occupying delivery slot 2. 4
19

So, even in this reasonably happy example where we are not sending messages to all the acceptors (only a majority), we've had to send 19 messages, 3 of which turned out to be redundant — and we haven't even accounted for sending messages out to the learners!

The point here is that in terms of the network traffic they generate, consensus algorithms are really expensive; therefore, we should only implement them in situations where it's extremely important that everybody agrees on a total order.

So far however, we have not even discussed what events A and B represent: this is because consensus algorithms do not care about a message's payload; they simply see an opaque (I.E. meaningless) byte array to which some metadata has been attached. Causal Broadcast for instance, looks simply at the message's recipient and the vector clock value, and from these values, determines the delivery order.

My Aside

A useful analogy here is to think of the people working in a mail sorting room. These people are concerned with the fact that all the letters and packages have been addressed correctly, and that the correct postage has been paid for a letter or package of that weight and dimensions.

It is quite irrelevant for these people to concern themselves with the contents of the letters and packages.

However, from the perspective of the human developer, we can see firstly that we're having to go to a lot of trouble simply to agree on the order in which a set of events are delivered, and secondly, the algorithm that determines the delivery order is completely agnostic to the real-life functionality these events trigger.

Rather than having a message-agnostic consensus algorithm then, wouldn't it be smarter to make intelligent decisions about delivery order based on our knowledge of the functionality being implemented?

Safety Properties: Do We Really Need Strong Consistency?

Let's go back to the shopping cart scenario described in lecture 17.

Amazon Shopping Cart 1

If replicas R1 and R2 simply represent shopping carts, then we certainly don't want to go to all the trouble of running a consensus algorithm. But we have arrived at this conclusion based on our knowledge of the business process being implemented. In this case, we really don't care whether a book is added to the shopping cart before or after the pair of jeans. From the perspective of the business logic, the order in which items are added is simply not important. (More technically, the order in which items are added to a shopping cart is commutative, not associative).

Of course, there will be some situations in which message delivery order is critical to the logic of your business process, but what we propose here is that strong consistency is needed only in a minority of cases; the greater majority of business scenarios will function quite happily with strong convergence.

Reminder

Strong Consistency

State equivalence between replicas can only be acheived after all replicas deliver the same messages in the same order.

Strong Convergence

State equivalence between replicas is eventually acheived after all replicas deliver the same messages.

Strong convergence might still be tricky to implement, but it will be easier than strong consistency. The bottom line here is that you should only implement strong consistency when you have no other choice.

How Do We Generalise the Requirement for Strong Convergence?

To do this, we will need to look back at the definition of a partially ordered set that we covered in lectures 3 and 4.

As a reminder, a partial order allows you to compare the members of set S using a binary relation such as (less than or equals). However, the word "partial" in the name tells us that not every pair of elements in the set is comparable.

This relation is governed by three axioms:

Property English Description Mathematical Description
Reflexivity For all a in S,
a is always to itself
∀ a ∈ S: a ≤ a
Anti-symmetry For all a and b in S,
if a ≤ b and b ≤ a, then a = b
∀ a, b ∈ S: a ≤ b, b ≤ a => a = b
Transitivity For all a, b and c in S,
if a ≤ b and b ≤ c, then a ≤ c
∀ a, b, c ∈ S: a ≤ b, b ≤ c => a ≤ c

The standard example of a partially ordered set is set inclusion — that is the set of all possible subsets of a given set.

For example, if we have a set containing:

Book A book
jeans A pair of jeans
Torch A torch

Then the inclusion set (the set of subsets) will contain the following eight members:

Set of Subsets

Our relation here is the "less than or equals" operator . Using this operator, certain members of our set of subsets can be compared as follows:

Comparable Subsets 1

However, other members are not comparable; for instance:

Noncomparable Subsets

Upper Bounds

If we select some elements from our set of subsets, say the singleton set containing the jeans {👖} and the singleton set containing the torch {🔦}, we could ask the following question:

Which elements of S are at least as big as {👖} and {🔦}?

The answer here is:

Any set that contains at least the union of {👖} and {🔦}.

So, this would be the sets {👖,🔦} and {📓,👖,🔦}. These two sets are known as the upper bounds of {👖} and {🔦}.

In more formal language, the upper bound is:

Given a partially ordered set1 (S, ) an upper bound of a,b ∈ S is an element u ∈ S such that a ≤ u and b ≤ u.

Notice that we talk of an upper bound. This means it is possible that for the members a and b there could well be multiple examples of some set u that all satisfy the upper bound requirements.

Which Upper Bounds are Going to Be the Most Interesting?

The upper bound set that contains all the members of the original set is not very interesting because this will always be a common upper bound for all its subsets, so we can ignore this one.

Generally speaking, the most interesting upper bounds are the smallest ones. But how do we define the smallest upper bound? The formal definition is:

If a, b, u and v are all members of the inclusion set S, then u is the least upper bound2 of a,b ∈ S if u ≤ v for each v

Join-semilattice

Q:   If S is the eight-member inclusion set of {📓,👖,🔦}, then is it true that every 2 elements of S will have a least upper bound (lub)? A:   Yes!

Any set for which this property is true is given the fancy name of a Join-semilattice.

A partially ordered set (poset) in which every 2 elements have a least upper bound (lub) is called a join-semilattice.

So, it follows therefore that if some partially ordered sets are join-semilattices, then there should also be some partially ordered sets that are not.

It's quite hard to think of a set within which every 2 elements do not have a least upper bound (I.E. think of a set that is not a join-semilattice), but a good example is a Boolean register. This is a tri-state variable that can be either empty, true or false

So, the poset is simply {empty, true, false}. This is a very simple set containing only three members that can be arranged as follows:

Boolean Ordering

These values can also be ordered using the operator:

empty ≤ empty
 true ≤ true
false ≤ false

empty ≤ true
empty ≤ false

So, is this a true partially-ordered set? To answer this question, we must check that all the axioms are satisfied.

Reflexivity
Since empty ≤ empty, true ≤ true and false ≤ false, then this set satisfies the requirements of reflexivity.

Anti-symmetry
Anti-symmetry requires that if a ≤ b and b ≤ a, then a = b. However, since this set contains only three members arranged in a two-layer lattice, we have already implicitly satisfied anti-symmetry by satisfying the requirements of reflexivity.

Transitivity
Transitivity requires that if a ≤ b and b ≤ c then a ≤ c. However, no members of the set can be compared this way, so this set obeys transitivity only in a vacuous sense.

So, this Boolean Register qualifies as a true partially-ordered set; however, we can see that if we picked the elements true and false and asked "What is their least upper bound?", then we can see that there isn't one. Therefore, this set is not a join-semilattice.

What's This Got to do with Distributed Systems?

Let's say we have a system with two replicas that hold a type of information that is very different to the shopping cart example. Here, these replicas hold the value of our Boolean register and they then receive conflicting updates:

Conflicting Updates

Q:   Why do these updates create a conflict?
A:   Because we have no way to combine the values true and false

Q:   Why can we not combine these values?
A:   Because, as we can see from the lattice diagram above, true and false have no upper bound, let alone a least upper bound.

In this case, the inclusion set formed from the members {empty, true, false} contains only the members:

S = { {}
    , {empty}
    , {true}
    , {false}
    , {empty, true}
    , {empty, false}
    }

The inclusion set does not contain the least upper bound member {true, false} neither does it contain the upper bound {empty, true, false}.

In order to resolve such a conflict, we would need to implement some sort of consensus algorithm. However, because consensus algorithms are expensive, we really don't want to implement one unless we need to.

So generally, if the updates your replicas are receiving can be thought of as the members of a set that is a join-semilattice, then we can resolve the requirements of strong convergence by taking the least upper bound. This also means we do not need to implement a consensus algorithm.

So, here's an informal claim:

If the states that replicas can take on can be thought of as elements of a join-semilattice, then there is a natural way of resolving conflicts between replicas without needing a consensus algorithm.

This claim is described as "informal" because it uses unqualified words such as "natural". Nonetheless, there is a lot of interesting work being done on this type of conflict resolution. If you're interested in this type of work, take a look at a topic called "Conflict-Free Replicated Datatypes" or (CRDTs). An example implementation by Martin Kleppmann and Alastair Beresford has been described in this paper for JSON datatypes.

Back to the Shopping Cart...

So far, we've been thinking about conflicts that can arise when different clients add members to a set. In the case of a shopping cart, the order in which items are added does not matter because even if the retailer offers schemes such as "Buy one, get one free", or "Buy these two together and get a 20% discount", these promotions are only applied at the time the user goes to the checkout (I.E. set membership is now fixed), not at the time the items are added to the cart. Based on our knowledge of the purchasing process, we can see that the order in which items are added to a shopping cart is immaterial. Therefore, this situation is not one in which consensus is required.

This is a particularly useful property in the event of a network partition. If communication is lost for some period of time between replicas, then the fact that the states of the shopping carts might diverge whilst the network partition exists does not create a problem.

As soon as the partition heals, the replicas can communicate with each other again, and their states will converge.

But what happens if we are allowed to remove items from the shopping cart?

Now things get more complicated.

Let's say that from your laptop, you add a book to your shopping cart, and from your phone, you add a pair of jeans.

Shopping Cart Item Deletion 1

Both replicas synchronise, so you see the same items in your shopping cart on both devices. Everything is fine...

From your laptop however, you've read some reviews of the book and decide that it doesn't look so interesting, so you remove it from your shopping cart — right at the very moment a network partition appears between the replicas.

Shopping Cart Item Deletion 2

The remove message gets through to replica 1, but not replica 2. So for the duration of the network partition, your shopping carts replicas will be out of sync with each other.

So after a while, the network partition heals and your shopping cart replicas synchronise with each other.

Now from your laptop, you look at the contents of your shopping cart and... huh!? That book has popped up again!

Shopping Cart Item Deletion 3

Maybe it's a sign that you really should read that book... or maybe it's the situation described in the Dynamo paper where, under certain circumstances, deleted items can reappear in shopping carts.

Why did this happen?

Because the contents of the shopping carts are treated as sets, and when a conflict occurs, the solution is to take the least upper bound. With Dynamo, this resolution happens on the client, but with CRDTs, it happens in the replica. Either way though, this approach takes the union of the sets and this sometimes causes a deleted item to pop up again.

So how do we avoid this problem?

We could go to all trouble of ensuring that the members of your set (I.E. the contents of your shopping cart) are always the members of a join-semilattice; however, this means that you have to throw the whole cart away as soon as any item is deleted.

But wait! There's a trick that allows us to handle this situation. Here, we will keep track of all additions to the shopping cart in one set and all the removals from the shopping cart in a different set known as the Tombstone Set. In this case, the least upper bound of two versions of a shopping cart can be calculated simply by taking the union of the sets.

R1
Additions
R1
Removals
R2
Additions
R2
Removals
📓 📓 👖
👖 📓

Even though replica R2 never found out about the removal of the book, this does not matter, because that fact has been recorded in replica R1. So now we can avoid having the deleted item reappear in the shopping cart by taking a two-step approach:

  1. Take the union of the addition sets
  2. From this, subtract the union of the removal sets

But we still haven't solved all our problems...

Let's say that even though all those bad reviews about the book caused us to delete it, we change our mind (I mean, can a book really be that bad? Let's find out). So you add the book again.

However, look at the addition set — it already contains the book, and our addition set can only hold single instances of an item. And the book is still in the tombstone set because we really did delete it. So, if we left the situation like it is, even though we added the book a second time, it would disappear from the resolved shopping cart because it's in the tombstone set...

So, under these conditions, once you remove an item, it’s never coming back!

Here is where you need to do some serious design work to decide on what behaviour you want your application to have, and then think of scenarios that could break that behaviour.

If you know that the addition of previously deleted items will be a frequently used aspect of your application's functionality, then you will need to implement some sort resolution strategy. For instance:

  • You could have of global coordination point in which all the replicas are notified that a previously deleted item is being added again and then try to ensure that everyone agrees.
  • Or you could take a simplistic approach and say that additions always win over removals.
  • Or you could keep a counter against each added or removed item so that adding the same book twice sets the addition counter to 2, and removing it sets the removal counter to -1. Now the desired total is simply the sum of additions total and the removals total.

The bottom line is this: this is hard to get right and has been an area of active research over the last 10 years or so. Quite a few interesting data structures have been proposed for resolving this problem, but it does not appear that any of them have been implemented in production systems yet.

As a developer however, it is difficult to reason about the data you are working with in this abstract manner. The question "Can I really treat my data as elements of a join-semilattice?" is difficult to answer and typically requires the help of specialist verification tools.

This is an area of research in which Lindsey Kuper is actively involved.


Previous Next
Lecture 21 Lecture 23

Endnotes

Footnotes

  1. The name "partially ordered set" is often abbreviated to "poset"

  2. The "least upper bound" is known as "lub" or "join"