# Probability and Statistics

Before we dive into the durability calculation, let's set up some well-known
starting points.

First, the [Binomial Coefficient](https://en.wikipedia.org/wiki/Binomial_coefficient) tells you
the number of ways to choose $k$ things from a set of $n$ things.  You say "n choose k" and write: $\binom{n}{k}$

Related to this is the [Probability Mass Function](https://en.wikipedia.org/wiki/Binomial_distribution#Probability_mass_function) for a binomial.
This is used to calculate the probability of exactly $k$ things happening out of $n$ trials, if
the probability of each thing happening is $p$, and they are independent.
The probability is given by:

$$\binom{n}{k} \: p^k \: (1-p)^{n-k}$$
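
A quick way to sanity-check the formula is to compute it directly.  The sketch
below uses Python's `math.comb` for the binomial coefficient; the function name
`binomial_pmf` and the coin-flip example are only illustrations, not part of
the durability calculation itself.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k events in n independent trials,
    where each trial has probability p of producing the event."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Illustrative example: exactly 2 heads in 4 fair coin flips.
print(binomial_pmf(2, 4, 0.5))  # 0.375
```

Summing `binomial_pmf(k, n, p)` over every `k` from 0 through `n` gives 1,
which is a handy check that the pieces fit together.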


# Failure Rate vs. Probability of Failure

The distinction between failure rates and the probability of failure causes
lots of confusion.  I have to think about it carefully each time I come back
to the topic.

The *failure rate* over a period of time is the average number of failures
in that time.  It's usually expressed as a percentage.  An annual failure
rate of 25% means that the average number of failures in a year is $0.25$.

Failure rates for unreliable things can be over 100%.  This is counter-intuitive.

An annual failure rate of 1200% would mean that on average you would see
12 failures per year.  An annual failure rate of 1200% is the same thing
as a monthly failure rate of 100%, and is the same thing as a daily
failure rate of 3.33% (assuming 30-day months).
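
Because a failure rate is just an average count of failures, converting between
time periods is plain proportional scaling.  Here is a minimal sketch of the
arithmetic above; the helper name `rescale_failure_rate` is made up for this
example, and the 360-day year follows from the 30-day-month assumption.

```python
def rescale_failure_rate(rate: float, from_days: float, to_days: float) -> float:
    """Convert a failure rate (average failures per period) from a period
    of from_days to a period of to_days."""
    return rate * to_days / from_days


annual_rate = 12.0  # 1200% per year, i.e. 12 failures per year on average

# With 30-day months, a year is treated as 12 * 30 = 360 days.
monthly_rate = rescale_failure_rate(annual_rate, 360, 30)  # 1.0, i.e. 100%
daily_rate = rescale_failure_rate(annual_rate, 360, 1)     # about 0.0333, i.e. 3.33%

print(f"monthly: {monthly_rate:.2%}, daily: {daily_rate:.2%}")
```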

The probability of failure over a period of time is modeled with the
[Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution).
I won't go over all the details here (see Wikipedia for that), but the
probability of having one or more failures over a time interval, when
the failure rate over that interval is $\lambda$, is:

$$1 \: - \: e^{-\lambda}$$
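
To see how the rate and the probability differ, it helps to plug in a few
numbers.  This is a small sketch, with `prob_at_least_one_failure` as a
made-up helper name.  It uses `math.expm1`, which computes $e^x - 1$
accurately for tiny $x$, so the result stays precise when the failure rate is
very small.

```python
from math import expm1


def prob_at_least_one_failure(rate: float) -> float:
    """Poisson probability of one or more failures in an interval,
    given the failure rate (average number of failures) for that interval."""
    # 1 - e^(-rate), written as -(e^(-rate) - 1) to avoid rounding error.
    return -expm1(-rate)


for rate in (0.0001, 0.25, 1.0, 12.0):
    print(f"rate {rate:>8.2%} -> probability {prob_at_least_one_failure(rate):.6f}")
```

A rate of 25% gives a probability of about 22%, and even a rate of 1200% gives
a probability just under 1: the probability never exceeds 100%, while the rate
can.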


# Data Durability

First, some naming.  We will use these names in the calculations:

* $S$ is the total number of shards (data plus parity)
* $R$ is the repair time for a shard in days: how long it takes to replace a shard after it fails
* $A$ is the annual failure rate of one shard
* $F$ is the failure rate of a shard in $R$ days
* $P$ is the probability of a shard failing at least once in $R$ days

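One of the assumptions we make is that it takes $R$ days to repair a failed
shard.  Let's start with a simpler problem and look at the data durability
over a period of $R$ days.  For a data loss to happen in this time period,
more shards would have to fail than there are parity shards.  We will use $n$
to denote the smallest number of failed shards that loses data: the number of
parity shards plus one.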

We will use $A$ to denote the annual failure rate of individual shards.
Over one year, the chance that a shard will fail is spread evenly over
all of the $R$-day periods in the year.  We will use $F$ to denote the failure
rate of one shard in an $R$-day period:

$$F = A\frac{R}{365}$$

The probability of failure of a single shard in $R$ days is approximately $F$, when $F$ is small.
The exact value, from the Poisson distribution, is:

$$P = 1 \: - \: e^{-F}$$
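
As a worked example of these two steps, here is a short sketch.  The values of
$A$ (1%) and $R$ (7 days) are hypothetical numbers picked only to show the
arithmetic; substitute your own.

```python
from math import expm1

annual_failure_rate = 0.01  # A: hypothetical 1% annual failure rate
repair_days = 7             # R: hypothetical repair time in days

# F: failure rate of one shard in an R-day period.
period_failure_rate = annual_failure_rate * repair_days / 365

# P: probability of one shard failing at least once in R days, 1 - e^(-F).
period_failure_prob = -expm1(-period_failure_rate)

print(f"F = {period_failure_rate:.6e}")
print(f"P = {period_failure_prob:.6e}")
```

With numbers this small, $F$ and $P$ agree to several significant digits,
which is why $F$ works as an approximation.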

Given the probability of one shard failing, we can use the binomial distribution's
probability mass function to calculate the probability of exactly $n$ of the $S$
shards failing:

$$\binom{S}{n} \: P^n \: (1-P)^{S-n}$$
        
We also lose data if more than $n$ shards fail in the period.  To include those,
we can sum the above formula over every failure count $k$ from $n$ through $S$,
to get the probability of data loss in $R$ days:

$$\sum_{k=n}^{S} \binom{S}{k} \: P^k \: (1-P)^{S-k}$$
    
The durability in each period is the complement of that:

$$1 \: - \: \sum_{k=n}^{S} \binom{S}{k} \: P^k \: (1-P)^{S-k}$$

Durability over the full year happens when there's durability in all of the
$365/R$ periods in the year.  Treating the periods as independent, that is the
product of the per-period probabilities:

$$\Big( 1 - \sum_{k=n}^{S} \: \binom{S}{k} \: P^k \: (1-P)^{S-k} \Big) ^ {365/R}$$

And that's the answer!
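
The whole calculation can be sketched in a few lines of Python.  This is a
minimal illustration, not a definitive implementation: the function names and
the example configuration (20 shards, data loss at 4 failures, 1% annual
failure rate, 7-day repair) are assumptions chosen for the example.  Here
`min_failures_for_loss` is the smallest number of failed shards that loses
data, and the binomial probability mass function is summed over every failure
count from that number through $S$.  The code returns the annual probability
of *loss* rather than the durability, because a durability like
0.99999999999 is easy to round to 1.0 in floating point.

```python
from math import comb, expm1, log1p


def period_loss_probability(total_shards: int, min_failures_for_loss: int,
                            shard_failure_prob: float) -> float:
    """Probability that at least min_failures_for_loss of the total_shards
    shards fail within one repair period."""
    s, p = total_shards, shard_failure_prob
    return sum(comb(s, k) * p**k * (1 - p) ** (s - k)
               for k in range(min_failures_for_loss, s + 1))


def annual_loss_probability(total_shards: int, min_failures_for_loss: int,
                            annual_failure_rate: float, repair_days: float) -> float:
    """One minus the annual durability, for shards that fail independently."""
    period_rate = annual_failure_rate * repair_days / 365  # F
    shard_failure_prob = -expm1(-period_rate)              # P = 1 - e^(-F)
    loss = period_loss_probability(total_shards, min_failures_for_loss,
                                   shard_failure_prob)
    periods_per_year = 365 / repair_days
    # 1 - (1 - loss)^(365/R), rearranged so tiny values are not rounded away.
    return -expm1(periods_per_year * log1p(-loss))


# Hypothetical configuration: 20 shards, data is lost once 4 have failed.
loss = annual_loss_probability(20, 4, annual_failure_rate=0.01, repair_days=7)
print(f"annual probability of data loss: {loss:.3e}")
print(f"annual durability: 1 - {loss:.3e}")
```

A useful check on the formula: with a single shard and no parity, the annual
loss probability collapses to $1 - e^{-A}$, the Poisson probability of at
least one failure in a year.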
