In [2]:
import numpy as np
import scipy.stats as stats

# Exercise 8
#### Bootstrapping
---


### 1.

The bootstrap is a resampling technique for estimating the sampling distribution of a statistic. It repeatedly draws samples of the same size from the dataset, with replacement, and computes the statistic of interest on each resample. Here we generate, e.g., 1000 bootstrapped datasets and compute the mean of each one. Because the bootstrap treats the empirical distribution of the data as a stand-in for the unknown true distribution, the spread of the bootstrap means around the original sample mean approximates the spread of the sample mean around the true mean.
We then check, for each bootstrapped dataset, whether the condition is met; the fraction of resamples that satisfy it is the estimated probability.

In [11]:
def bootstrap(data):
    size = len(data)
    return np.random.choice(data, size, replace=True)

number_of_repetitions = 1000
data = np.array([56, 101, 78, 67, 93, 87, 64, 72, 80, 69])
a = -5
b = 5
counter = 0

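sample_mean = np.mean(data)  # stands in for the unknown true mean

for i in range(number_of_repetitions):
    bootstraped_data = bootstrap(data)
    mean = np.mean(bootstraped_data)
    # Bootstrap analogue of (sample mean - true mean).
    # Since a = -b here, the sign of the difference does not change the result.
    difference = mean - sample_mean
    if a < difference < b:
        counter += 1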

p = counter/number_of_repetitions

print("Probability of the condition being fulfilled is:",p)

Probability of the condition being fulfilled is: 0.77
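
The estimate above changes from run to run because the resampling is unseeded. A minimal sketch of a reproducible, vectorized variant follows; the function name `bootstrap_interval_probability`, the seed, and the larger number of resamples are assumptions made for illustration and are not part of the original solution.

```python
import numpy as np

def bootstrap_interval_probability(data, a, b, n_boot=10_000, seed=0):
    """Estimate P(a < bootstrap mean - sample mean < b) by resampling."""
    rng = np.random.default_rng(seed)  # seeded generator: reruns give the same estimate
    data = np.asarray(data, dtype=float)
    # Draw all resamples at once: one row of indices per bootstrap dataset
    idx = rng.integers(0, data.size, size=(n_boot, data.size))
    deviations = data[idx].mean(axis=1) - data.mean()
    # Fraction of resamples whose deviation falls strictly inside (a, b)
    return np.mean((a < deviations) & (deviations < b))

data = np.array([56, 101, 78, 67, 93, 87, 64, 72, 80, 69])
p_hat = bootstrap_interval_probability(data, a=-5, b=5)
print("Estimated probability:", p_hat)
```

With more resamples the Monte Carlo noise of the estimate shrinks, while the fixed seed makes the printed value repeatable.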


### 2.

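We can estimate Var(S^2), the variance of the sample variance, with the bootstrap in the same fashion as in the previous exercise: compute the sample variance S^2 of each bootstrapped dataset, then take the variance of those values.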
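
Written out, and matching the $n-1$ and $B-1$ divisors used in the code for this exercise, the estimator can be sketched as follows. The notation is introduced here for illustration only: $B$ is the number of repetitions, $X^{*}_{b,i}$ is the $i$-th value of the $b$-th bootstrapped dataset, and $S^{2*}_b$ is its sample variance.

$$
S^{2*}_b = \frac{1}{n-1}\sum_{i=1}^{n}\left(X^{*}_{b,i} - \bar{X}^{*}_b\right)^2,
\qquad
\overline{S^{2*}} = \frac{1}{B}\sum_{b=1}^{B} S^{2*}_b,
\qquad
\widehat{\mathrm{Var}}(S^2) = \frac{1}{B-1}\sum_{b=1}^{B}\left(S^{2*}_b - \overline{S^{2*}}\right)^2
$$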

In [32]:
data = np.array([5,4,9,6,21,17,11,20,7,10,21,15,13,16,8])
number_of_repetitions = 1000
sample_variance = []

for i in range(number_of_repetitions):
    bootstraped_data = bootstrap(data)
    # Sample variance S^2 of this resample (ddof=1 gives the n-1 divisor)
    sample_variance.append(np.var(bootstraped_data, ddof=1))

# Variance of the bootstrapped S^2 values
print("The bootstrap estimate of Var(S^2) is:", np.var(sample_variance, ddof=1))

The bootstrap estimate of Var(S^2) is: 59.90100593436975


### 3.

#### a) Mean and Median of the original dataset

In [35]:
N = 200
data = stats.pareto.rvs(1.05, scale=1, size=N)

print("mean:",np.mean(data))
print("median:",np.median(data))

mean: 5.719887959193138
median: 1.9921354846408907


#### b) Bootstrap estimation of Variance of sample mean

To get the bootstrap estimate of the variance of the sample mean, we generate bootstrapped datasets, compute the mean of each one, and then calculate the variance of those means.

In [52]:
number_of_repetitions = 100
bootstraped_means = []

for _ in range(number_of_repetitions):
    bootstraped_data = bootstrap(data)
    bootstraped_means.append(np.mean(bootstraped_data))

bootstraped_variance_of_mean = np.var(bootstraped_means, ddof=1)

print("Bootstrapped variance of sample mean is:", bootstraped_variance_of_mean)

Bootstrapped variance of sample mean is: 2.3323383273053158


#### c) Bootstrap estimation of Variance of sample median

Similarly, we can calculate the bootstrap estimate of the variance of the sample median.

In [54]:
number_of_repetitions = 100
bootstraped_medians = []

for _ in range(number_of_repetitions):
    bootstraped_data = bootstrap(data)
    bootstraped_medians.append(np.median(bootstraped_data))

bootstraped_variance_of_median = np.var(bootstraped_medians, ddof=1)

print("Bootstrapped variance of sample median is:", bootstraped_variance_of_median)

Bootstrapped variance of sample median is: 0.02291679046407597
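
Parts b) and c) differ only in the statistic applied to each resample, so the loop can be factored into one helper. A minimal sketch, assuming a hypothetical function `bootstrap_variance` (not part of the original notebook) and a seeded `np.random.default_rng` generator for reproducibility; it is demonstrated on the fixed data from exercise 2 rather than on the random Pareto sample.

```python
import numpy as np

def bootstrap_variance(data, statistic, n_boot=1000, seed=0):
    """Bootstrap estimate of the variance of `statistic` over resamples of `data`."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # One replicate of the statistic per resample drawn with replacement
    replicates = [
        statistic(rng.choice(data, size=data.size, replace=True))
        for _ in range(n_boot)
    ]
    return np.var(replicates, ddof=1)

# Illustrative usage on the exercise 2 data
sample = np.array([5, 4, 9, 6, 21, 17, 11, 20, 7, 10, 21, 15, 13, 16, 8])
var_of_mean = bootstrap_variance(sample, np.mean)
var_of_median = bootstrap_variance(sample, np.median)
print("Var(mean):", var_of_mean)
print("Var(median):", var_of_median)
```

Passing the statistic as a function keeps the resampling logic in one place, so any other statistic (for example a sample variance) can be plugged in without rewriting the loop.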


#### d) Precision of estimated median and mean

In [55]:
print("Sample mean:",np.mean(data))
print("Sample median:",np.median(data))

print("Bootstrapped variance of sample mean is:", bootstraped_variance_of_mean)
print("Bootstrapped variance of sample median is:", bootstraped_variance_of_median)

Sample mean: 5.719887959193138
Sample median: 1.9921354846408907
Bootstrapped variance of sample mean is: 2.3323383273053158
Bootstrapped variance of sample median is: 0.02291679046407597


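The much lower bootstrap variance of the median (0.0229) compared with that of the mean (2.3323) indicates that the sample median is the more precise estimator for this dataset. The median is less sensitive to outliers and extreme values, which can shift the mean substantially and inflate its variance. This outcome depends on the distribution: the Pareto distribution used here is heavy-tailed and occasionally generates very large values on the right side, whereas for a light-tailed distribution such as the normal, the mean can be the more precise of the two.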
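
As a cross-check, the sample statistics can be compared with the closed-form values for a Pareto distribution with shape 1.05 and scale 1, the parameters passed to `stats.pareto.rvs` in part a). This is a minimal sketch using the standard Pareto formulas; the variable names are chosen here for illustration.

```python
# Closed-form values for a Pareto distribution with shape alpha and scale x_m
alpha, x_m = 1.05, 1.0

theoretical_mean = alpha * x_m / (alpha - 1)   # finite only for alpha > 1
theoretical_median = x_m * 2 ** (1 / alpha)
variance_is_finite = alpha > 2                 # the variance is infinite for alpha <= 2

print("Theoretical mean:", theoretical_mean)
print("Theoretical median:", theoretical_median)
print("Finite variance:", variance_is_finite)
```

The theoretical mean is about 21, far from the sample mean of 5.72, while the theoretical median is about 1.94, close to the sample median of 1.99. Because the distribution has infinite variance for this shape parameter, the bootstrap variance of the sample mean should itself be read with caution.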