doc/consistency.tex


\section{Consistent updates}
An update is sent to the master.
The master adds it to its log, and tries to get concensus about the update with the slaves.
Once concensus has been reached about the first log entry,
the master  adds the entry to the persistent local key-value store.
Slaves can move the updates from their log
into their local key-value store asynchronously.

\begin{figure}
\begin{verbatim}
(nicked from wikipedia)

C      M            S0  S1
|      |            |    |  --- first request ---
X----->|            |    |  Request
|      X----------->|--->|  Prepare(N)
|      |<-----------X----X  Promise(N,  I, {Va,Vb,Vc})
|      X----------->|--->|  Accept!(N,  I, Vn)
|      |<-----------X----X  Accepted(N, I)
|<-----X                    Response
                            --- other requests ---
X----->|            |    |  Request
|      X----------->|--->|  Accept(N,   I+1, W)
|      |<-----------X----X  Accepted(N, I+1)
|<-----X            |    |  Response

X----->|            |    |
|      X----------->|--->|  Accept(N, I+2,  X)
|      |<-----------X----X  Accepted(N,I+2)
...
\end{verbatim}
\end{figure}
The first request with M as Master (= leader) needs a full paxos round,
while subsequent updates with the same leader skip the first phase.
This boils down to a single roundtrip from master to slaves per update.
If the different nodes have failure modes independent of each other (independent power supplies, different disks, \ldots),
one needs not await the push-through to disk and the message can be pushed asynchronously to the local key-value store.
This optimistic behaviour needs to be a configuration option, since the application cannot assess this by itself.
One can also go below 1 roundtrip per update by stuffing multiple updates together.
This increases throughput.


\subsection{Individual Slave failure}
If a slave dies, the master is not affected.
When a slave comes up, there are three possibilities. 
The first is not very interesting. If the slave's log matches that of the master, nothing happened meanwhile and the slave is \emph{in sync}. 
The other cases are \emph{small lapse} and \emph{big lapse}.
\paragraph{small lapse}
Its replication counter I is still within the log of the master
(or other any other slave that has a more recent state) . 
So the slave first downloads the missing part of the log.
Then it iterates over the tlog while adding the missing updates to the store.
When finished the client again compares its log state with that of the other nodes. 
It's either in back in sync or again within in a small lapse.

\paragraph{big lapse}
Small lapse would be enough if one keeps the log files. 
The only problem with this is that it wastes more than half of the available diskspace. 
The good news is that these log files compress quite well, 
so the only thing we want to be able to do is compress log files when rotate, 
and make sure we can still read the compressed logs. 
We also plan to implement the \emph{collapsing} of logs: 
Several updates for the same key can be replaced with the last update under certain conditions.
Applications that only manage a limited set of keys, but update the values frequently will benefit from this a lot.
\section{Electing a master}
Master election should happen using paxos. 
A master choice has a timeout, and a master tries to relect itself before the lease expires.
Details are described in the PaxosLease paper.