Skip to content

Optimal implementation of reservoir sampling algorithm in Julia.

Notifications You must be signed in to change notification settings

AlexZasorin/fastshuf.jl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 

Repository files navigation

fastshuf.jl

Optimal implementation of reservoir sampling algorithm in Julia.

Usage

This application functions similarly to the shuf command-line utility. It does not take any command-line flags, only a single argument to specify reservoir size.

It will accept a stream of elements and randomly sample these elements into a reservoir of the specified size.

Example:

    alex@PC:/home/alex/julia-projects$ seq 100 | julia fastshuf.jl 10
    53
    82
    6
    22
    55
    39
    77
    95
    48
    72

Requirements

Earlier versions may function but have not been tested.

Reservoir Sampling Proof

As part of my Big Data course, I wrote an extra credit proof that proves that the standard implementation of the reservoir sampling algorithm results in a random sample.

Let $r$ be our reservoir size and $n$ be the number of input elements we have seen so far. Whenever $n \le r$, we set $r = n$, as per the reservoir sampling algorithm. As our base case, let's use $n = r$. Each item chosen to be in the reservoir will have had a $$\frac{r}{n} = 1$$ chance to be chosen. By definition, our sample is a simple random sample.

For our inductive hypothesis, we will say that the algorithm produces a simple random sample for some $k = r$. We will try to show that this is true for $k+1$.

Consider the step where we read in the $(k+1)$-th element. According to the algorithm, we will pick a number $s$ between 1 and our current iteration which is $k+1$. If $s$ is less than or equal to $r$, we will replace the element at the index $s$ with the $(k+1)$-th element of the stream. Therefore, the $(k+1)$-th element has a $$\frac{r}{k+1}$$ probability of being chosen to be in the reservoir. Given that the $(k+1)$-th element has been chosen to replace an element, the probability that any element already in the reservoir gets removed is $$\frac{1}{r}$$ This makes the probability of an element to be removed at this step $$\frac{r}{k+1}\cdot\frac{1}{r}$$ The total probabilty of having been chosen initially and not being removed at this step becomes $$\frac{r}{k}\cdot(1-\frac{r}{k+1}\cdot\frac{1}{r})$$ Remember that $$\frac{r}{k} = 1$$ This simplifies to $$1\cdot(1-\frac{1}{k+1}) = \frac{k}{k+1}$$ Therefore, all elements in the reservoir, including the $(k+1)$-th element, have the same probability to be chosen and the sample is a simple random sample.

This proves that the reservoir sampling algorithm yields a simple random sample for all $n$.

Releases

No releases published

Packages

No packages published

Languages