# erasure-coding-durability

## Overview

This is a simple statistical model for calculating the probability of losing
data that is stored using an erasure coding system, such as Reed-Solomon.

In erasure coding, each file stored is divided into D shards of the same length.
Then, P parity shards are computed using erasure coding, resulting in D+P shards.
Even if shards are lost, the original file can be recomputed from any D of the
D+P shards stored.  In other words, you can lose any P of the shards and still 
reconstruct the original file.

What we would like to compute is the durability of data stored with erasure coding
based on the durability of the individual shards.
The durability is the probability of not losing the data over a period of time.
The period of time we use here is one year, resulting in annual durability.
  
Systems that use erasure coding to store data will replace shards that are lost.
Once a shard is replaced, the data is fully intact again.  Data is lost only when
P+1 shards are all lost at the same time, before they are replaced.

## Assumptions

To calculate the probability of loss, we need to make some assumptions:

1. Data is stored using $D$ data shards and $P$ parity shards, and is lost when $P+1$ shards are lost.
1. The annual failure rate of each shard is $A$.
1. The number of days it takes to replace a failed shard is $R$.
1. The failures of individual shards are independent.

## Calculation

Let's look at one period of $R$ days.  What are the chances of losing
data in that period?  For that to happen, $P+1$ shards (or more) would have to fail
in that period.

Over one year, the chance that a shard will fail is evenly distributed over
all of the $R$-day periods in the year.  We will use $F$ to denote the failure
rate of one shard in an $R$-day period:

> $F = A\frac{R}{365}$
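
As a quick illustration of the formula above, here is a minimal sketch. The function name `period_failure_rate` is hypothetical and is not taken from `durability.py`.

```python
DAYS_PER_YEAR = 365


def period_failure_rate(annual_failure_rate: float, replacement_days: float) -> float:
    """Return F = A * R / 365, the failure rate of one shard in one R-day period."""
    return annual_failure_rate * replacement_days / DAYS_PER_YEAR


# Example: A = 0.10 (10% annual failure rate), R = 1 day.
print(period_failure_rate(0.10, 1))  # roughly 2.74e-04
```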

To precisely compute the probability of failure during one period, you need to
translate the failure rate into a failure probability.  Usually, we use a [Poisson
distribution](https://en.wikipedia.org/wiki/Poisson_distribution) for failures to
compute the probability.  Given a failure rate $F$, the probability of one or more
failures is $1 - e^{-F}$.  For small $F$, this is very close to $F$, which is what we'll
use here for simplicity.  So we will be a little sloppy and use $F$ both for the 
failure rate, and for the probability of failure.
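
To see how small the error of this shortcut is, the following sketch compares $F$ with $1 - e^{-F}$ for a few rates. The function name and the sample rates are illustrative assumptions, not part of `durability.py`.

```python
import math


def prob_at_least_one_failure(rate: float) -> float:
    """Poisson probability of one or more failures given an expected count `rate`."""
    # -expm1(-x) evaluates 1 - e^(-x) without cancellation error for tiny x.
    return -math.expm1(-rate)


for rate in (0.10 / 365, 0.01, 0.1, 1.0):
    p = prob_at_least_one_failure(rate)
    gap = (rate - p) / rate  # relative overestimate from using F directly
    print(f"F={rate:.6f}  1-exp(-F)={p:.6f}  relative gap={gap:.2%}")
```

For the rate used in the sample run in this README ($A = 0.10$, $R = 1$), the relative gap is on the order of a hundredth of a percent. It grows as $F$ grows, so the shortcut is only reasonable when $F$ is small.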

Given the probability of one shard failing, the probability of $n$ specific shards
failing is obtained by multiplying their probabilities together, because we have
assumed that their failures are independent:

> $F^n$
    
That was the probability for *n* specific shards.  What we care about is the
probability of losing any *n* shards.  For that we multiply the probability above
by the number of ways to choose *n* shards from the full set of D+P shards:
 
> $\binom{D+P}{n} F^n$
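
A small sketch of this counting step, assuming Python 3.8+ for `math.comb`. The function name `prob_any_n_shards` is hypothetical.

```python
import math


def prob_any_n_shards(total_shards: int, n: int, f: float) -> float:
    """Approximate probability that some set of n shards fails: C(total, n) * f**n."""
    return math.comb(total_shards, n) * f ** n


total = 4 + 2    # D = 4 data shards, P = 2 parity shards
f = 0.10 / 365   # per-period failure probability for A = 0.10, R = 1

print(math.comb(total, 3))             # 20 ways to pick 3 of 6 shards
print(prob_any_n_shards(total, 3, f))  # roughly 4.11e-10
```

With 4 data shards and 2 parity shards, there are 20 distinct sets of 3 shards whose loss would destroy the data, which is why the result is 20 times larger than $F^3$.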
    
We also lose data if more than n shards fail in the period.  To include those,
we can sum the above formula for n through D+P shards:

> $\sum_{i=n}^{D+P} \binom{D+P}{i} F^i$
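
The sum can be sketched as below. One assumption to flag: each term here carries an extra factor $(1-F)^{D+P-i}$, so that it is the probability of exactly $i$ failures and the terms do not overlap. For small $F$ that factor is very close to 1, so the result stays close to the simpler formula above. The function names are hypothetical.

```python
import math


def prob_exactly(total_shards: int, k: int, f: float) -> float:
    """Binomial probability that exactly k of total_shards fail in one period."""
    return math.comb(total_shards, k) * f ** k * (1 - f) ** (total_shards - k)


def prob_at_least(total_shards: int, n: int, f: float) -> float:
    """Probability that n or more shards fail in one period."""
    # math.fsum tracks partial sums exactly, so tiny tail terms are not lost to rounding.
    return math.fsum(prob_exactly(total_shards, i, f) for i in range(n, total_shards + 1))


f = 0.10 / 365
print(prob_at_least(6, 3, f))  # roughly 4.11e-10
```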
    
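The durability in each period is the complement of that, where the loss
threshold is $n = P+1$.  Durability over the full year requires durability in
every one of the $365/R$ periods, so it is the product of the per-period
probabilities:

> $\prod_{k=1}^{365/R} \left(1 - \sum_{i=n}^{D+P} \binom{D+P}{i} F^i\right) = \left(1 - \sum_{i=n}^{D+P} \binom{D+P}{i} F^i\right)^{365/R}$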
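
Evaluating this product directly in floating point loses all precision once the per-period loss probability drops below about $10^{-16}$, because $1 - p$ rounds to exactly 1. The sketch below shows one common way around that, using `math.log1p` and `math.expm1`. It is an assumption about how such a calculation can be kept stable, not a description of what `durability.py` actually does, and the function name is hypothetical.

```python
import math

DAYS_PER_YEAR = 365


def annual_loss_probability(period_loss_prob: float, replacement_days: float) -> float:
    """Return 1 - (1 - p)**(365/R), computed so that tiny p is not rounded away."""
    if period_loss_prob >= 1.0:
        return 1.0  # log1p(-1) is undefined; loss is certain
    periods = DAYS_PER_YEAR / replacement_days
    # (1 - p)**periods == exp(periods * log1p(-p)); expm1 then gives the complement.
    return -math.expm1(periods * math.log1p(-period_loss_prob))


p = 4.229e-22                   # a per-period loss probability far below 1e-16
naive = 1 - (1 - p) ** 365      # 1 - p rounds to 1.0, so this collapses to zero
stable = annual_loss_probability(p, 1)

print(naive)   # 0.0
print(stable)  # roughly 1.54e-19
```

Annual durability is then one minus the annual loss probability. For very small loss probabilities it is more informative to report the loss probability itself, since the durability prints as 1.0.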

## Python code

The Python code in `durability.py` does the calculations above, with a few tweaks
to maintain precision when dealing with tiny numbers, and prints out the results
for a given set of assumptions:

```
$ python durability.py
usage: durability.py [-h]
                     data_shards parity_shards annual_shard_failure_rate
                     shard_replacement_days
durability.py: error: too few arguments
$ python durability.py 4 2 0.10 1

#
# total shards: 6
# replacement period (days):  1.0
# annual shard failure rate: 0.10
#

|==================================================================================================================================|
| failure_threshold | individual_prob | cumulative_prob | annual_loss_rate |         annual_odds |        durability |       nines | 
|----------------------------------------------------------------------------------------------------------------------------------|
|                 6 |       4.229e-22 |       4.229e-22 |        1.544e-19 | 154 in a sextillion | 1.000000000000000 |    18 nines | 
|                 5 |       9.259e-18 |       9.260e-18 |        3.380e-15 |  3 in a quadrillion | 0.999999999999997 |    14 nines | 
|                 4 |       8.447e-14 |       8.448e-14 |        3.083e-11 |    31 in a trillion | 0.999999999969167 |    10 nines | 
|                 3 |       4.110e-10 |       4.110e-10 |        1.500e-07 |    150 in a billion | 0.999999849970558 | --> 6 nines | 
|                 2 |       1.125e-06 |       1.125e-06 |        4.106e-04 |    411 in a million | 0.999589425325156 |     3 nines | 
|                 1 |       1.642e-03 |       1.643e-03 |        4.512e-01 |            5 in ten | 0.548766521902091 |     0 nines | 
|                 0 |       9.984e-01 |       1.000e+00 |        1.000e+00 |              always | 0.000000000000000 |     0 nines | 
|==================================================================================================================================|
```
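
The `nines` column counts the leading nines in the annual durability. A rule that reproduces the values in the sample output above is the floor of $-\log_{10}$ of the annual loss rate. This is inferred from the table, so treat it as an assumption; `durability.py` may compute the column differently, and the function name below is hypothetical.

```python
import math


def count_nines(annual_loss_rate: float) -> int:
    """Number of leading nines in the durability 1 - annual_loss_rate."""
    if annual_loss_rate <= 0:
        raise ValueError("loss rate must be positive to count nines")
    return max(0, math.floor(-math.log10(annual_loss_rate)))


# Annual loss rates copied from the sample output above.
for loss in (1.544e-19, 3.380e-15, 3.083e-11, 1.500e-07, 4.106e-04, 4.512e-01):
    print(f"{loss:.3e} -> {count_nines(loss)} nines")
```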