# Jackknife

See,
* http://people.bu.edu/aimcinto/jackknife.pdf

## Imports

In [3]:
%run /home/christopher/.jupyter/config.ipy
import numpy as np

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Simplest Example

Let's say we have some data, $x = {x_1, ..., x_n}$, and we want to know both the mean and the uncertainty (we'll use variance here) on the mean. Now we can do this very easily as we know that the uncertainty on the mean is $\frac{\sigma_x^2}{n}$ (if we were using SD this would be the normal $\frac{\sigma_x}{\sqrt{n}}$).

In [89]:
n = 10_000
x = np.random.randn(n)
var_mean = np.var(x, ddof=1) / n
print(var_mean)
assert np.isclose(var_mean, 1e-4, rtol=1e-1)
assert np.isclose(np.sqrt(var_mean), 0.01, rtol=1e-1)

9.940065936804702e-05


However, let's imagine that we don't know how to do this. This is somewhat important as there are some cases where we can't. Consider for example trying to get the uncertainty on a two point correlation function. How do we get that analytically?

Let's try some resampling. Break $x$ into $n$ subsamples, where in the $i$'th subsample we leave out $x_i$. From this, we can compute:

$$
\overline{x_i} = \frac{1}{n-1} \sum_{j=1, j \ne i}^{n} x_j
$$

we now have a vector of means. Obviously the overall mean is the mean of these means, $\overline{x}$.

From these, we can compute the variance,
$$
\sigma_\overline{x}^2 = \frac{n-1}{n} \sum_{i=1}^{n} (\overline{x_i} - \overline{x})^2
$$

and standard deviations

$$
\sigma_\overline{x} = \bigg( \frac{n-1}{n} \sum_{i=1}^{n} (\overline{x_i} - \overline{x})^2 \bigg)^\frac{1}{2}
$$

This looks a lot like the normal calculation of the spread of the samples, $\sigma_x = \bigg( \frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{x})^2 \bigg)^\frac{1}{2}$, but where does the $n - 1$ in the numerator come from?

I will skip the proof for now but show that it works below.

In [97]:
leave_out_size = 1
means, i = [], 0
while leave_out_size * (i+1) < n:
    s, e = leave_out_size * i, leave_out_size * (i+1)
    means.append(np.mean(np.append(x[:s], x[e:])))
    i += 1
var_mean = np.var(means) * (n-1)
print(var_mean)
assert np.isclose(var_mean, 1e-4, rtol=1e-1)

9.939806095271411e-05


What happens now if we don't leave out a single observation each time, but rather a group. This might be for performance reasons - we don't have time to compute the statistic for all $n$ subsamples.

In [128]:
leave_out_size = 100
means, i = [], 0
while leave_out_size * (i+1) < n:
    s, e = leave_out_size * i, leave_out_size * (i+1)
    means.append(np.mean(np.append(x[:s], x[e:])))
    i += 1

n_groups = n / leave_out_size
var_mean = np.var(means, ddof=1) * (n_groups - 1)
print(var_mean)
assert np.isclose(var_mean, 1e-4, rtol=1e-1)

8.942397398573991e-05


AssertionError: 

What if these groups aren't of equal size? It is actually pretty much the same.

In [127]:
group_size = [0] + [10 for i in range(200)] + [100 for i in range(50)] + [50 for i in range(60)]
groups = np.cumsum(group_size)
assert groups[-1] == n

means = []
for i in range(len(groups) - 1):
    s, e = groups[i], groups[i+1]
    means.append(np.mean(np.append(x[:s], x[e:])))

n_groups = len(group_size) - 1
var_mean = np.var(means, ddof=1) * (n_groups - 1)
print(var_mean)
assert np.isclose(var_mean, 1e-4, rtol=1e-1)

9.024547152240341e-05
