# Resampling 


Once we have a data sample, it can be used to estimate the population parameter. The
problem is that we only have a single estimate of the population parameter, with little idea
of the variability or uncertainty in the estimate. One way to address this is by estimating the
population parameter multiple times from our data sample. This is called resampling.


Resampling requires the sampling procedure to be repeated multiple times. The drawback is
that the subsamples are related to one another, because some observations will be shared
across multiple subsamples. This means that the subsamples and the estimated population
parameters are not strictly independent and identically distributed (i.i.d.). This has implications for
statistical tests performed downstream on the sample of estimated population parameters.
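
The overlap between resamples can be made concrete with a small simulation. The sketch below is an illustration, not part of the source material: it assumes a hypothetical dataset of 1,000 observations identified by index, draws two bootstrap samples, and reports how many distinct observations a sample contains and how many observations the two samples share. For a sample of size n drawn with replacement from n observations, the expected fraction of distinct observations is 1 - (1 - 1/n)^n, roughly 0.632 for large n.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
n = 1000                        # hypothetical dataset size

# Two bootstrap samples, represented by the indices of the drawn observations
sample_a = rng.integers(0, n, size=n)
sample_b = rng.integers(0, n, size=n)

unique_a = set(sample_a.tolist())
unique_b = set(sample_b.tolist())
shared = unique_a & unique_b

# Expected fraction of distinct observations in one bootstrap sample
expected_unique_fraction = 1 - (1 - 1 / n) ** n

print(f"Distinct fraction in sample A: {len(unique_a) / n:.3f}")
print(f"Expected distinct fraction:    {expected_unique_fraction:.3f}")
print(f"Observations shared by A and B: {len(shared)}")
```

Because the two samples share many observations, statistics computed on them are correlated rather than independent.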

Two commonly used resampling methods are:

* **Bootstrap**. Samples are drawn from the dataset with replacement (allowing the same
observation to appear more than once in a sample), and the observations not drawn into
a sample may be used as the test set.

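* **k-fold Cross-Validation**. The dataset is partitioned into k groups (folds); each fold is
used once as the test set while the remaining folds form the training set.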
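
A minimal sketch of k-fold splitting with scikit-learn's `KFold` follows. The six-value dataset, `n_splits=3`, and the shuffling seed are assumptions made for illustration. The final line collects the test indices from all folds so that it is easy to check that every observation is held out exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])  # illustrative data

# Three folds of two observations each; shuffle before splitting
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(data), start=1):
    test_folds.append(test_idx)
    print(f"Fold {fold}: train={data[train_idx]}, test={data[test_idx]}")

# Across all folds, each index appears in a test set exactly once
all_test_idx = np.sort(np.concatenate(test_folds))
```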

## Bootstrap Method

The bootstrap method can be used to estimate a quantity of a population. This is done
by repeatedly drawing samples with replacement from the data, calculating the statistic on
each sample, and taking the average of the calculated statistics.

```
# ----- start bootstrap -----

1. Choose a number of bootstrap samples to perform
2. Choose a sample size
3. For each bootstrap sample
    3.1 Draw a sample with replacement with the chosen size
    3.2 Calculate the statistic on the sample
4. Calculate the mean of the calculated sample statistics

# ----- end bootstrap -----
```

The samples not selected are usually referred to as the “out-of-bag” samples. For
a given iteration of bootstrap resampling, a model is built on the selected samples
and is used to predict the out-of-bag samples.
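
The out-of-bag evaluation described above can be sketched as follows. This is an illustration under stated assumptions rather than the source's implementation: the synthetic linear dataset, the choice of `LinearRegression`, the mean squared error metric, and the 50 iterations are all hypothetical stand-ins.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

# Synthetic data (assumption): y = 3x + 2 plus Gaussian noise
n = 200
X = rng.uniform(0, 10, size=(n, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=n)

n_iterations = 50
oob_scores = []
for _ in range(n_iterations):
    # Draw row indices with replacement to form the bootstrap sample
    boot_idx = rng.integers(0, n, size=n)

    # Rows that were never drawn form the out-of-bag set
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[boot_idx] = False

    # Fit on the bootstrap sample, score on the out-of-bag rows
    model = LinearRegression().fit(X[boot_idx], y[boot_idx])
    predictions = model.predict(X[oob_mask])
    oob_scores.append(mean_squared_error(y[oob_mask], predictions))

print(f"Mean OOB MSE: {np.mean(oob_scores):.3f} (std {np.std(oob_scores):.3f})")
```

Each iteration scores the model only on rows it did not see during fitting, so the averaged score acts as a held-out estimate without setting aside a separate test set.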



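```
import numpy as np
np.random.seed(1)

data = np.arange(1, 101) / 10  # 0.1, 0.2, ..., 10.0
repeat = 100                   # number of bootstrap samples
subsample_size = 100           # size of each bootstrap sample

means = []
for _ in range(repeat):
    # draw indices with replacement
    ind_subsample = np.random.randint(0, len(data), subsample_size)
    # indices that were never drawn are out of bag (OOB)
    ind_oob = set(range(len(data))) - set(ind_subsample)

    sub_sample = data[ind_subsample]
    oob = data[sorted(ind_oob)]  # OOB observations, a cheap held-out set for predictions

    means.append(np.mean(sub_sample))

# the values are evenly spaced, so the true mean is the midpoint of the first and last value
print(f'Theoretical mean: {(data[0] + data[-1]) / 2}, Empirical mean: {np.mean(means)}')
```

```
Theoretical mean: 5.05, Empirical mean: 5.053619999999999
```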
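
Since the motivation for resampling is to gauge the uncertainty of an estimate, the spread of the bootstrap statistics is often as useful as their average. The sketch below uses the same 0.1 to 10.0 dataset and additionally reports the standard deviation of the bootstrap means (an estimate of the standard error of the mean) and a 95% percentile interval. The 1,000 repetitions, the interval level, and the use of `np.random.default_rng` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

data = np.arange(1, 101) / 10  # 0.1, 0.2, ..., 10.0
n_repeats = 1000               # number of bootstrap samples (assumption)

# Mean of each bootstrap sample, drawn with replacement at the original size
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(n_repeats)
])

# Spread of the bootstrap means approximates the standard error of the mean
standard_error = boot_means.std(ddof=1)
# Percentile interval: the middle 95% of the bootstrap means
lower, upper = np.percentile(boot_means, [2.5, 97.5])

print(f"Sample mean: {data.mean():.3f}")
print(f"Bootstrap standard error: {standard_error:.3f}")
print(f"95% percentile interval: [{lower:.3f}, {upper:.3f}]")
```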


```
from sklearn.utils import resample

# data sample
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
print('Data: ', data)

# prepare bootstrap sample
boot = resample(data, replace=True, n_samples=4, random_state=1)
print('Bootstrap Sample: %s' % boot)

# out of bag observations
oob = [x for x in data if x not in boot]
print('OOB Sample: %s' % oob)
```

```
Data:  [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
Bootstrap Sample: [0.6, 0.4, 0.5, 0.1]
OOB Sample: [0.2, 0.3]
```
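
Selecting the out-of-bag observations by value, as above, works only when all values are distinct: if the data contained duplicates, drawing one copy would exclude every copy from the OOB set. A more robust sketch resamples the positions instead of the values. The dataset with a repeated value is an assumption made to illustrate the point.

```python
import numpy as np
from sklearn.utils import resample

# Illustrative data containing a duplicate value (0.2 appears twice)
data = np.array([0.1, 0.2, 0.2, 0.4, 0.5, 0.6])
indices = np.arange(len(data))

# Resample positions rather than values
boot_idx = resample(indices, replace=True, n_samples=4, random_state=1)
# Positions that were never drawn are out of bag
oob_idx = np.setdiff1d(indices, boot_idx)

print("Bootstrap Sample:", data[boot_idx])
print("OOB Sample:", data[oob_idx])
```

Working with indices also keeps features and targets aligned when the same rows must be selected from several arrays.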


## Resources used:

Brownlee, J. (2019). Statistical Methods for Machine Learning: Discover How to Transform Data into Knowledge with Python, ed. 1.4.