# Exercise 8 - Bootstrap

In [None]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import math
import random

## Task 1 - exercise 13 from chapter 8

In the bootstrap method we are given a sequence of $n$ data points, and we sample from this sequence with replacement to generate a bootstrap set $B_i$. In practice we generate a large number of bootstrap sets, which we use to infer statistical properties of the original data sequence.
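
As a minimal sketch of the resampling step described above, the snippet below draws one bootstrap set and then several at once. The data values, the seed, and the names `bootstrap_set` and `bootstrap_sets` are illustrative assumptions, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

data = np.array([56, 101, 78, 67, 93])  # illustrative values
n = len(data)

# One bootstrap set: n draws from the data, with replacement
bootstrap_set = rng.choice(data, size=n, replace=True)

# K bootstrap sets at once, one set per row
K = 4
bootstrap_sets = rng.choice(data, size=(K, n), replace=True)

print(bootstrap_set)
print(bootstrap_sets.shape)
```

Because sampling is done with replacement, a bootstrap set usually contains some data points more than once and omits others.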

In the exercise we have $n$ iid random variables $X_1,\dots,X_n$ with unknown mean $\mu$ and two constants $a<b$. We aim to estimate $$p=P\left\{a<\sum_{i=1}^n X_i / n-\mu<b\right\}$$

We do that by generating $K$ bootstrap sets, where each set has $n$ values. For each bootstrap set $B_i, i \in \{1,..,K\}$ we compute its mean, $B_{means} := \mathbb{E}[B_i]$. Then we define a new variable
$$
Z := B_{means} - \mathbb{E}[{B_{means}}]
$$
where $\mathbb{E}[{B_{means}}]$, the average of the $K$ bootstrap means, is our estimate of $\mu$.

Hence, $Z$ plays the role of $\sum_{i=1}^n X_i / n-\mu$.

Finally, we evaluate $p_m = P(a < Z< b) \approx \frac{1}{K}\sum_{i=1}^{K}\mathbb{I}(a<z_i<b)$, where $z_i$ is the value of $Z$ for bootstrap set $B_i$. To ensure a robust estimate, we repeat this procedure $M$ times, such that our final estimate becomes

$$
p = \mathbb{E}[p_m]
$$
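
The steps above can be sketched for a single repetition as a plain function. This is only an illustration under the definitions given here; the function name `bootstrap_probability`, its default `K`, and the seed are assumptions made for the example.

```python
import numpy as np

def bootstrap_probability(x, a, b, K=1000, rng=None):
    """Estimate P(a < Z < b) from K bootstrap sets of x (one repetition)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n = len(x)
    # K bootstrap sets of size n, one per row
    sets = rng.choice(x, size=(K, n), replace=True)
    # Mean of every bootstrap set
    means = sets.mean(axis=1)
    # Center on the average of the bootstrap means (the estimate of mu)
    z = means - means.mean()
    # Fraction of bootstrap sets with a < z_i < b
    return np.mean((a < z) & (z < b))

x = [56, 101, 78, 67, 93, 87, 64, 72, 80, 69]
p_m = bootstrap_probability(x, a=-5, b=5, K=1000, rng=np.random.default_rng(1))
print(p_m)
```

Calling the function $M$ times and averaging the results gives the final estimate $p$.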


In [None]:
X = np.array([56,101,78,67,93,87,64,72,80,69])
a,b=-5,5

n = len(X)  # size of each bootstrap set
K = 100  # number of bootstrap sets to produce
M = 100  # repetitions of the bootstrap procedure

bootstrap_samples = np.random.choice(X,replace=True,size=n*K*M).reshape(M,K,n)
B_i_means = bootstrap_samples.mean(-1)
# keepdims=True keeps shape (M, 1), so each repetition is centered on its own mean
Z = B_i_means - B_i_means.mean(-1, keepdims=True)
prop_i = ((a<Z)&(Z<b)).mean(-1)
p = prop_i.mean()
print(f"Estimate of p using n={n}, K={K}, M={M} is: {p}")


An approximate 95% confidence interval for this estimate, based on the spread of the $M$ repetitions, is then

In [None]:
p+np.array([-1,1])*1.96*np.sqrt(np.var(prop_i)/M)
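
The interval above is a normal approximation for the mean of the $M$ per-repetition estimates. A small helper makes the calculation explicit; the name `normal_ci` and the input values are made up for illustration, and the sketch uses `ddof=1` for the standard deviation, whereas the cell above uses NumPy's default `ddof=0`.

```python
import numpy as np

def normal_ci(estimates, z=1.96):
    """Normal-approximation confidence interval for the mean of `estimates`."""
    estimates = np.asarray(estimates, dtype=float)
    m = len(estimates)
    center = estimates.mean()
    # Standard error of the mean across the m repetitions
    half_width = z * estimates.std(ddof=1) / np.sqrt(m)
    return center - half_width, center + half_width

# Hypothetical per-repetition estimates, not results from the exercise
lo, hi = normal_ci([0.74, 0.78, 0.76, 0.80, 0.72])
print(lo, hi)
```

The interval only reflects the Monte Carlo variability between repetitions; it does not account for the uncertainty that comes from having just $n$ original data points.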

## Task 2 - Exercise 15 from chapter 8

We are now given some new data with $n=15$, and we want to estimate $Var(S^2)$, where $S^2$ is the sample variance.
We follow the same procedure as above, but for each $B_i$ we instead compute the sample variance, $B_{var}:=Var[B_i]$, then take the variance of $B_{var}$ across the $K$ bootstrap sets, $Var[B_{var}]$, and finally return the average over the $M$ repetitions, $\mathbb{E}[Var[B_{var}]]$.
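
For reference, the sample variance $S^2$ divides by $n-1$, which in NumPy corresponds to `ddof=1`. The short check below uses made-up values to show the difference from the default `ddof=0`.

```python
import numpy as np

x = np.array([5.0, 4.0, 9.0, 6.0])  # illustrative values
n = len(x)

# Sample variance S^2 written out: squared deviations divided by n - 1
s2_manual = np.sum((x - x.mean()) ** 2) / (n - 1)

# The same quantity via NumPy
s2_numpy = np.var(x, ddof=1)

# NumPy's default (ddof=0) divides by n instead
s2_biased = np.var(x)

print(s2_manual, s2_numpy, s2_biased)
```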


In [None]:
X = np.array([5, 4, 9, 6, 21, 17, 11, 20, 7, 10, 21, 15, 13, 16, 8])

n = len(X)  # size of each bootstrap set
K = 100  # number of bootstrap sets to produce
M = 100  # repetitions of the bootstrap procedure

bootstrap_samples = np.random.choice(X,replace=True,size=n*K*M).reshape(M,K,n)
B_i_var = bootstrap_samples.var(-1,ddof=1)
var_i = np.var(B_i_var,-1,ddof=1)
var_s2 = np.mean(var_i)

print(f"Var(S^2) = {var_s2}")


print(f"CI: {var_s2+np.array([-1,1])*1.96*np.sqrt(np.var(var_i)/M)}")



## Task 3

We now draw 200 samples from a Pareto distribution with $\beta=1$ and $k=1.05$ and generate bootstrap estimates of the variance of the median and of the mean.

In [None]:
def bootstrap_sim(X,n,M,K,type_:str='median'):
    bootstrap_samples = np.random.choice(X,replace=True,size=n*K*M).reshape(M,K,n)
    if type_ == 'median':
        b_i_agg = np.median(bootstrap_samples,-1)
        b_agg = np.var(b_i_agg,-1,ddof=1)
        estimate = np.mean(b_agg)
    elif type_ == 'mean':
        b_i_agg = np.mean(bootstrap_samples,-1)
        b_agg = np.var(b_i_agg,-1,ddof=1)
        estimate = np.mean(b_agg)
    else:
        raise NotImplementedError(f"type = {type_} is not implemented")

    print(f"Estimate of the variance of the {type_} is: {estimate}")
    print(f"CI: {estimate + np.array([-1,1])*1.96*np.sqrt(np.var(b_agg)/M)}")
    #return estimate




In [None]:
N = 200
pareto_samples = stats.pareto.rvs(1.05,scale=1,size=N)

print(f"Sample mean of pareto samples is: {np.mean(pareto_samples)}")
print(f"Sample median of pareto samples is: {np.median(pareto_samples)}\n")

bootstrap_sim(X=pareto_samples,n=N,M=100,K=100,type_='median')
print("")
bootstrap_sim(X=pareto_samples,n=N,M=100,K=100,type_='mean')

We observe that the bootstrap estimate of the variance of the median is more stable than that of the mean: the confidence interval is tighter for the median. For $k$ close to 1 the Pareto distribution has a heavy tail and can produce very large outliers, which pull the mean towards higher values. The median is more robust to such outliers, which explains the phenomenon above.
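
A rough way to see the heavy tail is to compare the closed-form mean and median with a large simulated sample. The sketch assumes the standard Pareto parameterisation with shape $k$ and scale $\beta$, for which the mean is $k\beta/(k-1)$, the median is $\beta\,2^{1/k}$, and the variance is infinite when $k \le 2$. The sample size and seed are arbitrary choices.

```python
import numpy as np

k, beta = 1.05, 1.0

# Closed-form values for a Pareto(k, beta) distribution with k > 1
theoretical_mean = k * beta / (k - 1)
theoretical_median = beta * 2 ** (1 / k)

# Inverse-transform sampling: X = beta * U^(-1/k), with U uniform on (0, 1]
rng = np.random.default_rng(0)
u = 1.0 - rng.random(100_000)
samples = beta * u ** (-1 / k)

print(f"mean:   theory {theoretical_mean:.2f}, sample {samples.mean():.2f}")
print(f"median: theory {theoretical_median:.3f}, sample {np.median(samples):.3f}")
print(f"largest sample: {samples.max():.1f}")
```

The sample median typically lands close to its theoretical value, while the sample mean depends strongly on the few largest draws.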
