An optimal implementation of the reservoir sampling algorithm in Julia.
This application functions similarly to the shuf command-line utility. It takes no command-line flags, only a single argument that specifies the reservoir size.
It reads a stream of elements from standard input and randomly samples them into a reservoir of the specified size.
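The sampling step described above can be sketched roughly as follows. This is a minimal illustration of the textbook approach (often called Algorithm R), not the code in fastshuf.jl; the function name `reservoir_sample` and its arguments are hypothetical.

```julia
# Hypothetical helper, not taken from fastshuf.jl: textbook reservoir
# sampling (Algorithm R) over any iterable with a known element type.
function reservoir_sample(items, k::Integer)
    reservoir = Vector{eltype(items)}()
    for (i, item) in enumerate(items)
        if i <= k
            # The first k items fill the reservoir directly.
            push!(reservoir, item)
        else
            # Draw a uniform index in 1..i; the new item is kept only if the
            # index lands inside the reservoir, which happens with probability k/i.
            j = rand(1:i)
            if j <= k
                reservoir[j] = item
            end
        end
    end
    return reservoir
end

# Sample 10 values from 1..100, similar in spirit to `seq 100 | julia fastshuf.jl 10`.
foreach(println, reservoir_sample(1:100, 10))
```

Each item is touched once and only the reservoir is held in memory, so the input can be arbitrarily long. If the stream has fewer elements than the reservoir size, this sketch simply returns all of them in input order.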
Example:
alex@PC:/home/alex/julia-projects$ seq 100 | julia fastshuf.jl 10
53
82
6
22
55
39
77
95
48
72
Earlier versions may function but have not been tested.
As part of my Big Data course, I wrote an extra-credit proof that the standard implementation of the reservoir sampling algorithm yields a simple random sample.
Let
For our inductive hypothesis, we will say that the algorithm produces a simple random sample for some
Consider the step where we read in the
This proves that the reservoir sampling algorithm yields a simple random sample for all
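One way to write out the inductive step is sketched below. The notation is assumed for illustration: $k$ is the reservoir size and $n$ is the number of elements read so far. The sketch tracks the probability that a given element is in the reservoir, which is the usual textbook form of the argument and a weaker statement than uniformity over all $k$-subsets.

```latex
% Assumed notation: k = reservoir size, n = number of elements read so far.
% Base case (n = k): every element is in the reservoir with probability 1 = k/k.
%
% Inductive hypothesis: after n >= k elements, each of them is in the
% reservoir with probability k/n.
%
% Inductive step: element n+1 is accepted with probability k/(n+1) and,
% if accepted, evicts one of the k slots uniformly at random.
\begin{align*}
P(\text{element } n+1 \text{ is in the reservoir})
  &= \frac{k}{n+1}, \\
P(\text{element } i \le n \text{ is in the reservoir after step } n+1)
  &= \frac{k}{n}\left(1 - \frac{k}{n+1}\cdot\frac{1}{k}\right) \\
  &= \frac{k}{n}\cdot\frac{n}{n+1}
   = \frac{k}{n+1}.
\end{align*}
```

In both cases the inclusion probability after reading $n+1$ elements is $k/(n+1)$, which is the hypothesis with $n$ replaced by $n+1$.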