"Paxos better than pussy" - Umberto Sani
Legend:
- 💩 = useless
- 😍 = useful
- images
- papers
System model
- set of processes $\Sigma = \{p_1, ..., p_n\}$
- communicate by message passing (`send(m)`, `receive(m)`)
- crash failure model
- `f` faulty processes
Consensus is used to allow a set of processes to agree on a proposed value. It ensures
- Uniform integrity : if a `p` decided on `v`, then `v` was proposed by some `p`
- Uniform agreement : no two `ps` decide different values
- Termination : every correct `p` eventually decides on exactly one value
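
The three properties can be read as checks on a finished run. The sketch below is only an illustration: `check_consensus` and its argument names are made up here, and "termination" on a finite trace can only mean "every correct process had decided by the end of the trace".

```python
def check_consensus(proposals, decisions, correct):
    """Check one finished run against the three consensus properties.

    proposals: {process: proposed value}
    decisions: {process: decided value} (processes that never decided are absent)
    correct:   set of processes that never crashed
    """
    proposed_values = set(proposals.values())
    return {
        # every decided value was proposed by some process
        "uniform_integrity": all(v in proposed_values for v in decisions.values()),
        # no two processes (correct or not) decided differently
        "uniform_agreement": len(set(decisions.values())) <= 1,
        # every correct process has decided
        "termination": all(p in decisions for p in correct),
    }


good = check_consensus(
    proposals={"p1": "a", "p2": "b", "p3": "a"},
    decisions={"p1": "a", "p2": "a"},
    correct={"p1", "p2"},
)
bad = check_consensus(
    proposals={"p1": "a", "p2": "b"},
    decisions={"p1": "a", "p2": "b"},
    correct={"p1", "p2"},
)
```

In `bad` every process decides its own proposal, so integrity and termination hold but agreement does not.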
Operations are coordinated by one, or more, centralized clock signals.
- message speed and delay are bounded
- process keeps vector of values received
- after f+1 rounds -> decide
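
A minimal simulation of the round-based idea above, assuming a flooding scheme where every process forwards all the values it knows and decides the minimum after `f+1` rounds. The function name, the `crashes` format and the "decide the minimum" rule are assumptions of this sketch, not taken from the notes.

```python
def flood_consensus(proposals, crashes, f):
    """Synchronous flooding consensus, simulated round by round.

    proposals: {pid: proposed value}
    crashes:   {pid: (round, recipients reached before crashing in that round)}
    f:         number of crashes tolerated; the run lasts f + 1 rounds
    """
    known = {p: {v} for p, v in proposals.items()}
    alive = set(proposals)
    for rnd in range(1, f + 2):
        inbox = {p: set() for p in proposals}
        for sender in sorted(alive):
            if sender in crashes and crashes[sender][0] == rnd:
                recipients = crashes[sender][1]  # partial send, then crash
            else:
                recipients = set(proposals)
            for r in recipients:
                inbox[r] |= known[sender]
        # processes that crashed in this round stop participating
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
        for p in alive:
            known[p] |= inbox[p]
    return {p: min(known[p]) for p in alive}


proposals = {1: 5, 2: 3, 3: 9}
crashes = {2: (1, {1})}  # process 2 crashes in round 1 after reaching only process 1

two_rounds = flood_consensus(proposals, crashes, f=1)
one_round = flood_consensus(proposals, crashes, f=0)  # too few rounds for one crash
```

With two rounds the value `3` that only process 1 received is relayed to process 3, so both survivors decide `3`. Stopping after a single round leaves them with different sets, which is why one extra round per tolerated crash is needed.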
No global clock, no strong assumptions about time and order of operations. (real world scenario)
- messages can take any time to arrive
- FLP impossibility: consensus cannot be solved deterministically in an asynchronous system if even one process may crash
- process keeps current status and sends msg to other processes
How to fix this shit? We can strengthen the model assumptions, weaken the problem definition, or do both
Uses failure detectors with different accuracies (strong, weak, eventually strong, eventually weak) 💩
- Strong completeness : eventually every crashed process is permanently suspected by every correct process
- Weak completeness : eventually every crashed process is permanently suspected by some correct process
Paxos is love, paxos is life - Umberto Sani

I'm the one in the room with the biggest c-rnd - Francesco Saverio Zuppichini
You know everything about it, pls ⭐ this
- uses `proposers`, `acceptors`, `learners` and `leader`
- to decide a `value` there must be a quorum of `acceptors`
- leader election to ensure that there is always a leader
- `n = 2f+1`
- latency =
  - $2\delta$ for leader
  - $3\delta$ for proposers
- To speed up
  - ballot reservation (decide in advance which process will be the leader)
  - Leader can execute `PHASE_2B` directly to the learners
  - Leader among proposers and leader among learners
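
The `n = 2f+1` sizing works because any two majority quorums share at least one acceptor. A small check of that arithmetic, assuming majority quorums (the helper names and acceptor labels are illustrative only):

```python
from itertools import combinations


def quorum_size(n):
    """Smallest majority of n acceptors."""
    return n // 2 + 1


def max_faults(n):
    """Largest f such that n >= 2f + 1."""
    return (n - 1) // 2


acceptors = ["a1", "a2", "a3", "a4", "a5"]
q = quorum_size(len(acceptors))
quorums = [set(c) for c in combinations(acceptors, q)]
# every pair of quorums has a non-empty intersection
all_intersect = all(x & y for x, y in combinations(quorums, 2))
```

For `n = 2f+1` the quorum size is `f+1`, so a quorum of live acceptors still exists after `f` crashes.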
broadcast : one
- Validity : If a correct `p` broadcasts `m` then all correct `ps` eventually deliver `m`
- Agreement : If a correct `p` delivers `m` then all correct `ps` eventually deliver `m`
- Integrity : For any `m`, every correct `p` delivers `m` at most once, and only if `m` was broadcast
- Uniform Validity : If a correct `p` broadcasts `m` then all correct `ps` eventually deliver `m`
- Uniform Agreement : If `p` delivers `m` then all correct `ps` eventually deliver `m`
- Uniform Integrity : For any `m`, every `p` delivers `m` at most once, and only if `m` was broadcast
Delivery is done in the same order as the send
- FIFO order : if a correct `p` broadcasts `m` before `m'` then no correct `p` delivers `m'` before `m`
- Uniform FIFO order : if a `p` broadcasts `m` before `m'` then no `p` delivers `m'` before `m`
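
A sketch of a checker for the FIFO property on recorded traces. The function name and trace format are assumptions; it also takes the stricter common reading in which delivering `m'` without ever delivering `m` counts as a violation.

```python
def respects_fifo(broadcasts, deliveries):
    """broadcasts: {sender: [messages in broadcast order]}
    deliveries: {process: [messages in delivery order]}"""
    for sent in broadcasts.values():
        for delivered in deliveries.values():
            pos = {m: i for i, m in enumerate(delivered)}
            for i, earlier in enumerate(sent):
                for later in sent[i + 1:]:
                    # `later` was delivered, but `earlier` is missing or comes after it
                    if later in pos and (earlier not in pos or pos[earlier] > pos[later]):
                        return False
    return True


broadcasts = {"p": ["m1", "m2"]}
ok = respects_fifo(broadcasts, {"q": ["m1", "m2"], "r": ["m1"]})
swapped = respects_fifo(broadcasts, {"q": ["m1", "m2"], "r": ["m2", "m1"]})
```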
Same order of causally related deliveries at all receivers
- Causal Order : if the broadcast of a `m` causally precedes the broadcast of `m'`, then no correct `p` delivers `m'` unless it has already done `deliver(m)`
- Uniform Causal Order : if the broadcast of a `m` causally precedes the broadcast of `m'`, then no `p` delivers `m'` unless it has already done `deliver(m)`
CHECK implementation
CHECK implementation
Order is independent from the send order
- Uniform total order : if `ps` `p` and `q` both deliver `m` and `m'`, then `p` delivers `m` before `m'` iff `q` delivers `m` before `m'` (they must do the same stuff)
Pedone docet
Order by conflict relation ~
- Generic broadcast order : if correct `ps` `p` and `q` both deliver `m` and `m'` and m ~ m', then `p` delivers `m` before `m'` iff `q` delivers `m` before `m'` (they must do the same stuff)
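
The difference between total order and generic broadcast order can be sketched with one checker that takes the conflict relation as a parameter. Everything here is illustrative: the function name, the trace format, and the rule "two messages conflict if at least one is a write" are assumptions of the example.

```python
from itertools import combinations


def same_relative_order(deliveries, conflicts=lambda a, b: True):
    """deliveries: {process: [messages in delivery order]}.
    With the default relation every pair conflicts, i.e. total order."""
    for (_, dp), (_, dq) in combinations(deliveries.items(), 2):
        pos_p = {m: i for i, m in enumerate(dp)}
        pos_q = {m: i for i, m in enumerate(dq)}
        common = [m for m in dp if m in pos_q]
        for a, b in combinations(common, 2):
            if conflicts(a, b) and (pos_p[a] < pos_p[b]) != (pos_q[a] < pos_q[b]):
                return False
    return True


def write_conflict(a, b):
    # hypothetical relation: messages named "w..." are writes, "r..." are reads
    return a.startswith("w") or b.startswith("w")


trace = {"p": ["w1", "r1", "r2"], "q": ["w1", "r2", "r1"]}
total = same_relative_order(trace)
generic = same_relative_order(trace, write_conflict)
```

The two reads are delivered in different orders, which breaks total order but is fine for generic broadcast because reads do not conflict with each other.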
Cheaper and faster delivery (like Just Eat) DISCOUNT CODE: PAXOS50
CHECK IMPLEMENTATION
multicast : one
- Define $\Gamma = \{g_1, ..., g_k\}$ as the set of process groups
- They are disjoint
- `m.dst` = set of groups `m` is multicast to
Properties
- Validity : if `p` is correct and multicasts `m`, then eventually all correct `ps` `q` in `m.dst` deliver `m`
- Uniform Agreement : if `p` delivers `m` then all correct `ps` in `m.dst` eventually deliver `m`
- Uniform Integrity : for any `m`, every `p` delivers `m` at most once, and only if `p` was in `m.dst` and `m` was multicast
- Uniform order : if `p` delivers `m` and `q` delivers `m'`, either `p` delivers `m` before `m'` or `q` delivers `m'` before `m`
Atomic multicast can be reduced to atomic broadcast 💩
TO ADD?
Every process has to commit in order to decide on action: ABORT / COMMIT
Properties
- Agreement : No two `ps` decide differently
- Termination : Every correct `p` eventually decides
- Abort-Validity : `ABORT` is the only possible decision if some `p` votes `ABORT`
- Commit-Validity : `COMMIT` is the only decision if all correct `ps` vote `COMMIT`
Basically, one for all and all for one
| | Atomic Commitment | Consensus |
|---|---|---|
| COMMIT decision | all `ps` proposed COMMIT | some `ps` proposed COMMIT |
| all `ps` proposed COMMIT | decide COMMIT or ABORT | decide COMMIT |
Uses one Transaction Manager (`TM`) and any number of Resource Managers (`RM`). Each process can be in state `PREPARED` or `COMMITTED`
- A `RM` enters `PREPARED` and sends `PREPARED` to the `TM`
- upon `receive(PREPARED)` `TM` sends `PREPARE` to all `RM`s
- upon `receive(PREPARE)` `RM` enters `PREPARED` and sends `PREPARED` to the `TM`
- upon `receive(PREPARED)` from all `RM`s, `TM` sends `COMMIT`
- upon `receive(COMMIT)` `RM` enters `COMMITTED`
easy peasy
if `TM` ☠️ then the algorithm is blocked
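
The message flow and the blocking case can be sketched as a tiny failure-free simulation. This is only an illustration: the function name, the `WORKING` initial state and the choice of sending `PREPARE` only to the `RM`s that have not prepared yet are assumptions of the sketch.

```python
def two_phase_commit(rms, tm_crashes_before_commit=False):
    """Simulate one 2PC run where every RM is willing to commit.

    Returns the final RM states and the log of (sender, receiver, message)."""
    state = {rm: "WORKING" for rm in rms}  # hypothetical initial state
    log = []

    # the first RM enters PREPARED on its own and tells the TM
    first = rms[0]
    state[first] = "PREPARED"
    log.append((first, "TM", "PREPARED"))
    prepared = {first}

    # the TM asks the remaining RMs to prepare, each one answers PREPARED
    for rm in rms[1:]:
        log.append(("TM", rm, "PREPARE"))
        state[rm] = "PREPARED"
        log.append((rm, "TM", "PREPARED"))
        prepared.add(rm)

    if tm_crashes_before_commit:
        return state, log  # nobody can decide: every RM is stuck in PREPARED

    if prepared == set(rms):
        for rm in rms:
            log.append(("TM", rm, "COMMIT"))
            state[rm] = "COMMITTED"
    return state, log


state, log = two_phase_commit(["rm1", "rm2", "rm3"])
blocked, _ = two_phase_commit(["rm1", "rm2", "rm3"], tm_crashes_before_commit=True)
```

In the second run the `TM` dies after collecting the votes, and every `RM` stays in `PREPARED` with no way to learn the outcome.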
- separate instance of Paxos for each `RM` 😍
- $2f+1$ acceptors
- `TM` is the leader
It ensures :
- Stability : every instance of paxos decides `PREPARED` or `ABORTED`
- Non-Blocking : if the leader dies, then a new one is elected (paxos' liveness)
- A `RM` sends `BEGIN_COMMIT` to the leader and `2A_PREPARED` to the `acceptors`
- The leader sends `PREPARE` to all `RM`s
- upon `receive(PREPARE)` `RM` sends `2A_PREPARED` to the `acceptors`
- upon `receive(2B_PREPARED)` from a quorum of `acceptors`, the leader sends `COMMIT`
- upon `receive(COMMIT)` `RM` enters `COMMITTED`
| | 2PC | Paxos Commit |
|---|---|---|
| Latency/Delay | 4 | 4$\delta$ |
| Messages | | |
| Disk writes | | |
from the book [chp1,3,11]
A consistency model is a property of a system design, usually presented as a condition that can be `true` or `false` for a single execution.
execution = one pattern of events
- Atomicity: all transactions are executed or none of them is
- Consistency: a transaction transforms a state correctly
- Isolation: serializability
- Durability: committed changes survive future failures
A concurrent execution is serializable if it is equivalent to a serial execution of the same transactions
Formalization of the semantics of the operations (What operations the client will do)
It is a six-tuple
History
Execution is the same as with a single site unreplicated system (the replication system gives the same functionality as the sequential data type)
DEF at pag 6 [chp1]
- Replication is hidden
- execution is linearizable
- easier for the developer
[chp11.2.1]
Has five phases
- Client Request
- Server coordination : before executing the operation, the servers may have to do some stuff
- Execution
- Agreement coordination: Servers may need to undergo a coordination phase to ensure that each executed operation has the same effect on all the servers
- Client Response
- `ps` follow specs until they crash
- a crashed `p` is eventually detected by every correct `ps`
- no process is suspected of being crashed if it is not really crashed
Fail stop failure model:
- `ps` follow specs until they crash
- crash of a `p` is detected by every correct `p`
- no `p` is suspected of being crashed if it is not really crashed
Crash Failure model:
- `ps` follow specs until they crash
- crash of a `p` is detected by every correct `p`
- `p` may be suspected erroneously

Basically, same as Fail stop, except that every `ps` may be suspected.
- let `s` be a server that executes transaction `t`. `t'` precedes `t`, denoted $t' \rightarrow t$, if `t'` committed at `s` before `t` started executing at `s`
- `t'` conflicts with `t` if `t'` and `t` access the same data item and at least one of them modifies the data item
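
Both definitions translate directly into predicates over read and write sets. A minimal sketch, where the `Txn` record and the use of integer logical times for "started" and "committed" at server `s` are assumptions made for the example:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Txn:
    name: str
    reads: frozenset
    writes: frozenset
    start: int              # logical time at which the txn started executing at s
    commit: Optional[int]   # logical commit time at s, None if not committed


def precedes(t_prime, t):
    """t' -> t: t' committed at s before t started executing at s."""
    return t_prime.commit is not None and t_prime.commit < t.start


def conflicts(t_prime, t):
    """Same data item accessed by both, and at least one of them writes it."""
    shared = (t_prime.reads | t_prime.writes) & (t.reads | t.writes)
    return any(x in t_prime.writes or x in t.writes for x in shared)


t1 = Txn("t1", frozenset({"x"}), frozenset({"y"}), start=0, commit=2)
t2 = Txn("t2", frozenset({"y"}), frozenset(), start=3, commit=None)
t3 = Txn("t3", frozenset({"x"}), frozenset(), start=1, commit=None)
```

Here `t1` precedes and conflicts with `t2` (it wrote `y`, which `t2` reads), while `t1` and `t3` only share a read of `x`, so they do not conflict.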
Crash Failure model
(each number is a phase of the generic functional model)
- client sends operations
- client operations are ordered by an ordering protocol (Atomic broadcast)
- each replica executes the operation
- None
- replies to the client
Keeping the replicas consistent requires the execution to be deterministic (given a client operation, the same state update is produced by each replica)
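
A toy illustration of why ordering plus determinism is enough: replicas that apply the same operations in the same order end in the same state, while a different order can diverge. The operation format (`"set"` / `"add"` on a key-value store) is invented for this sketch.

```python
def apply(state, op):
    """Deterministic transition function of a tiny key-value state machine."""
    kind, key, value = op
    new_state = dict(state)
    if kind == "set":
        new_state[key] = value
    elif kind == "add":
        new_state[key] = new_state.get(key, 0) + value
    return new_state


def run_replica(ordered_ops):
    state = {}
    for op in ordered_ops:
        state = apply(state, op)
    return state


ops = [("set", "x", 10), ("add", "x", 5), ("set", "y", 1)]
replica_a = run_replica(ops)
replica_b = run_replica(ops)
# same operations, different order: what atomic broadcast is there to prevent
diverged = run_replica([ops[1], ops[0], ops[2]])
```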
Fail stop failure model.
A `p` that has not crashed and has the lowest identifier is designated primary.
A primary always exists thanks to the Fail stop model (failures are detected and the primary is replaced)
- client sends operations
- None
- the primary executes the operation and sends the state update to all the replicas
- replicas, passively, apply the state updates in the order received
- replies to the client
These properties ensure linearizability
Multi is better than one - Umberto Sani
active and passive replication are good for high availability but not for high performance.
- fault-tolerance
- high performance
- suitable for databases
Similar to passive replication
- each operation executed by one machine (or a set of primary machines)
Transaction states = `EXECUTING`, `COMMITTING`, `COMMITTED`, `ABORTED`
- when it receives the update, each replica checks deterministically whether the update can be accepted $\rightarrow$ avoids mutual inconsistency
- upon transaction termination
  - if it is read-only, commit with no interaction between replicas
  - if it is an update, the transaction must be certified before it is committed or aborted
Termination must guarantee transaction atomicity (either all the servers commit it or none do) and isolation (one-copy serializability)
[chp11.3]
New state = `PRECOMMIT`
- transaction `t` is committed if all servers precommit
- a server precommits `t` if each transaction it knows in `COMMITTED` or `PRECOMMITTED` either
  - precedes `t`
  - or does not conflict with `t`. No `read`/`write` intersection
- if one server is down $\rightarrow$ protocol is blocked (all servers need to precommit)
- high abort ratio
Since atomic broadcast guarantees Agreement and Total order, all servers reach the same outcome, `COMMIT` or `ABORT`. All replicas deliver in the same order, thus the certification test is deterministic
- transaction `t` is committed if every committed transaction `t'` either precedes `t` or does not update any data item read by `t`
No need to check write-write conflicts
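
A sketch of a certification test run at delivery time, under one reading of the rule: a delivered transaction is checked only against transactions that committed after it took its snapshot, and only their write sets against its read set. The `Replica` class and the integer `snapshot` (number of commits the transaction had seen when it started) are assumptions of this example.

```python
class Replica:
    def __init__(self):
        self.committed = []  # write sets of committed transactions, in commit order

    def deliver(self, reads, writes, snapshot):
        """Certify a transaction delivered by atomic broadcast.

        snapshot: how many committed transactions it had seen when it started,
        so self.committed[snapshot:] are the ones that do not precede it."""
        concurrent = self.committed[snapshot:]
        if any(ws & reads for ws in concurrent):
            return "ABORT"  # it read an item that a concurrent transaction updated
        self.committed.append(frozenset(writes))
        return "COMMIT"


# the same totally ordered stream is delivered to every replica
stream = [
    ({"x"}, {"y"}, 0),  # t1
    ({"y"}, {"x"}, 0),  # t2 read y before t1's update was visible
    ({"y"}, {"z"}, 1),  # t3 started after t1 committed
]
r1, r2 = Replica(), Replica()
outcomes_1 = [r1.deliver(set(r), set(w), s) for r, w, s in stream]
outcomes_2 = [r2.deliver(set(r), set(w), s) for r, w, s in stream]
```

Both replicas reach the same decisions without exchanging any extra message, because the test depends only on the delivery order and the transaction's own data.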
By reordering the transactions we can lower the abort ratio.
- uses a `ReorderList` that contains committed transactions not yet seen by transactions in execution, since their order can still change
- when we reach the `Reorder Factor` (max length of the list) one transaction is removed and its updates are applied to the db
Uses Generic broadcast to take care of the conflicts between operations.
- increases performance since ordering happens only when it is needed!
- conflict is defined for write-write, write-read and read-write conflicts
Gap between existing software and research
- Open source Java library implementing robust state-machine replication
- support reconfigurations of the replica set
- provide efficient and transparent support for durable services
No Byzantine Fault-tolerant Atomic Multicast exists
- first BFT Atomic Multicast
- designed on top of existing BFT abstraction (BFT-Smart)
- scales with the number of groups
- partially genuine
- uses two kinds of groups, all implementing FIFO atomic broadcast
  - `auxiliary`, help order the msg
  - `target`, the ones that can be in `m.dst`
- uses a tree of processes to re-route/order the msg efficiently to their destination (lowest common ancestor)
Big performance degradation when there are conflicting requests for geographically replicated sites
- solves generic consensus to increase performance
- implements Multi-Leader Generic Consensus
- uses a unique time-stamp associated with every command `c` to decide if a slow decision is needed
In Atomic-Multicast distributed message ordering is challenging since each message can be multicast to all destinations
- Genuine Atomic Multicast that uses only four communication delays ($4 \delta$)
- decompose the ordering in two execution paths, `FAST` and `SLOW`
  - `FAST` speculates about the order $\rightarrow$ if okay save time
  - `SLOW` path similar to BaseCast
Coordinating geographically distributed replicas
- decouples order from execution in a state machine replication
- partial order on the execution of operations (instead of total order) $\rightarrow$ save time
- exploit geographic location
Cross-data-center coordination is done twice: once for concurrency control and once for consensus
(In a concurrent system different threads communicate with each other)
- concurrency control and consensus can be mapped to the same abstraction
- unified protocol to do both at once:
  - strict serializability for transaction consistency
  - linearizability for replication consistency

(Strict serializability guarantees that operations take place atomically)
Multi-core servers are not well exploited in fault-tolerant state-machine replication: the deterministic execution of requests translates into a single-threaded replica, leading to bad performance
- proposes early scheduling of operations. Decisions are made before the requests are ordered, to schedule operations on worker threads at replicas
- outperforms late scheduling
Build a scalable, globally-distributed DB
- first system to distribute data at global scale and support externally-consistent distributed transactions
- data is stored in schematized semi-relational tables
- replica configuration can be controlled at a fine grain
- provides externally consistent reads and writes
- globally consistent reads
- assign globally-meaningful commit timestamps for transactions
Transactional Causal Consistency (TCC) is not implemented well
- presents the first TCC system that implements non-blocking reads, achieving low latency, and allows applications to scale out by sharding.