# Reservoir sampling

Reservoir sampling is a method for drawing a fixed-size, uniform random sample (without replacement) from an unbounded stream of data.

The concept is that you have a stream of values coming in, starting at index 0, and the number of values seen so far, n, keeps increasing forever. You maintain a sample of fixed size k from this stream such that, at any point, each of the n values seen so far has an equal probability (k/n) of being in the sample.

The algorithm:
    For the first k values, collect them all and put them in your sample array.
    For every later value, i.e. the nth value in the stream with n > k,
        keep the new item with probability k/n,
        if we choose to keep the new item, it replaces one of the current samples, chosen uniformly (probability 1/k each),
        This means that each item already in the sample is replaced with probability (k/n) * (1/k), or (1/n), and survives with probability (n-1)/n. An item that was in the sample with probability k/(n-1) is therefore still in it with probability k/(n-1) * (n-1)/n, or (k/n), which is what we want.
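
The survival argument can be written out as a short induction. This is only a sketch: the symbols $x_i$ (the $i$-th value in the stream) and $S_n$ (the sample after $n$ values have been seen) are notation introduced here, not part of the original description.

$$
\Pr[x_i \in S_k] = 1 = \frac{k}{k} \qquad \text{(base case: the first } k \text{ values are all kept)}
$$

$$
\Pr[x_n \in S_n] = \frac{k}{n} \qquad \text{(the new value, for } n > k\text{)}
$$

$$
\Pr[x_i \in S_n] = \Pr[x_i \in S_{n-1}] \cdot \left(1 - \frac{k}{n} \cdot \frac{1}{k}\right) = \frac{k}{n-1} \cdot \frac{n-1}{n} = \frac{k}{n} \qquad \text{(an earlier value, } i < n\text{)}
$$

So if every value has probability $k/(n-1)$ of being in the sample after $n-1$ values, every value has probability $k/n$ after $n$ values.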

from random import randint

class ReservoirSampler:
    
    def __init__(self, sample_size):
        self.samples = []
        self.sample_size = sample_size
        self.seen_values = 0
    
    def is_full(self):
        return len(self.samples) == self.sample_size
    
    def update_sample(self, new_value):
        self.seen_values += 1
        if not self.is_full():
            # collect the first k
            self.samples.append(new_value)
        else:
            # select an index, j, uniformly from [0, n-1]
            # (randint includes both endpoints, so the upper bound is n-1)
            # it has a k/n probability of being in range [0, k-1]
            # if it is, replace samples[j] with the new value
            j = randint(0, self.seen_values - 1)
            if j < self.sample_size:
                self.samples[j] = new_value
        

r_sampler = ReservoirSampler(25)
for i in range(1000):
    r_sampler.update_sample(i)

print(r_sampler.samples)

[953, 955, 2, 414, 210, 916, 690, 750, 902, 64, 947, 11, 44, 545, 688, 38, 245, 45, 776, 211, 326, 599, 905, 362, 730]
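
A single printed sample cannot show whether the sampling is uniform, so one way to sanity-check the k/n claim is to repeat the experiment many times and count how often each value ends up in the sample. The sketch below is an illustration rather than part of the original notebook: `reservoir_sample` and `inclusion_frequencies` are hypothetical helper names, and the stream length (20), sample size (5), trial count and seed are arbitrary choices.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng):
    """Return up to k items sampled uniformly from an iterable of unknown length."""
    sample = []
    for n, value in enumerate(stream, start=1):  # n = number of values seen so far
        if len(sample) < k:
            # collect the first k values
            sample.append(value)
        else:
            # j is uniform over 0..n-1, so j < k happens with probability k/n
            j = rng.randrange(n)
            if j < k:
                sample[j] = value
    return sample


def inclusion_frequencies(n, k, trials, seed=0):
    """Estimate, for each value in range(n), how often it lands in the sample."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    counts = Counter()
    for _ in range(trials):
        counts.update(reservoir_sample(range(n), k, rng))
    return {value: counts[value] / trials for value in range(n)}


freqs = inclusion_frequencies(n=20, k=5, trials=20000)
# every frequency should be close to k/n = 5/20 = 0.25
print(min(freqs.values()), max(freqs.values()))
```

If the smallest and largest frequencies both sit near 0.25, early and late values in the stream are being kept at about the same rate, which is the behaviour the algorithm is meant to have.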
