Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: package for sampling without replacement #1787

Open
josharian opened this issue Mar 14, 2022 · 2 comments
Open

proposal: package for sampling without replacement #1787

josharian opened this issue Mar 14, 2022 · 2 comments

Comments

@josharian
Copy link

Background

Sampling without replacement is a core operation. There are a handful of useful algorithms, depending on the shape of the problem, some of which are not easy to implement.

I recently was in a situation for which Vitter's Algorithm D was a perfect fit and was surprised not to find any Go implementations. Gonum seems like a great home for a suite of sampling-without-replacement algorithms.

Proposal

I propose that gonum add a package containing sampling-without-replacement algorithms. It could grow over time. A great start would be Vitter's Algorithm D. I would not expect that they would necessarily share a common API, because the algorithms themselves have different needs.

Potential impact of proposal

I have a straightforward port of the code at https://getkerf.wordpress.com/2016/03/30/the-best-algorithm-no-one-knows-about/ that I would be happy to contribute. (It uses an ISC license, which should be compatible with appropriate attribution, but IANAL.) I will note that I did not make any attempt to understand it, I merely ported it.

The API I chose for my purposes is as follows, but I'd be happy to change it.

// VitterD calls choose want times with integers randomly sampled without replacement from the range [0, max].
func VitterD(prng *rand.Rand, want, max int, choose func(n int))

There is a good writeup of the state of the art at https://arxiv.org/pdf/2104.05091.pdf that would be a useful resource if other folks wanted to add other algorithms. Note that that paper indicates that Vitter's Algorithm D might be implementable more simply in terms of other primitives that (I believe) gonum provides.

I'm not deeply invested in this and can also happily carry around my little package vitter locally. This proposal is more of an "if it seems exciting to y'all". :)

@btracey
Copy link
Member

btracey commented Mar 14, 2022

Note that we already have this capability here:
https://pkg.go.dev/gonum.org/v1/gonum/stat/sampleuv#WithoutReplacement

Maybe we can just improve the implementation? I haven't read about the algorithm yet, so am unsure how vital the choose function is in practice.

@josharian
Copy link
Author

Maybe we can just improve the implementation?

Perhaps.

The reason I went with a callback is that this is in some quite performance sensitive code. Writing everything into a slice and then turning around and reading back out that slice to process each of the elements does unnecessary work, plus it requires that I arrange for a pool of slices to populate, which is challenging when the lengths are heterogenous. But those are things I could work around.

Here are benchmark results for len(idx)=want, n=1000000, going from the current implementation to my Vitter Algorithm D port:

name                             old time/op    new time/op    delta
WithoutReplacement/want=1-8        36.6ns ± 0%    27.1ns ± 0%   -25.88%  (p=0.000 n=10+10)
WithoutReplacement/want=100-8      74.8µs ± 0%     4.5µs ± 0%   -93.99%  (p=0.000 n=9+10)
WithoutReplacement/want=10000-8    4.17ms ± 1%    0.58ms ± 1%   -86.02%  (p=0.000 n=10+10)

name                             old alloc/op   new alloc/op   delta
WithoutReplacement/want=1-8         32.0B ± 0%      0.0B       -100.00%  (p=0.000 n=10+10)
WithoutReplacement/want=100-8      3.30kB ± 0%    0.00kB       -100.00%  (p=0.000 n=10+10)
WithoutReplacement/want=10000-8    8.00MB ± 0%    0.00MB       -100.00%  (p=0.000 n=9+10)

name                             old allocs/op  new allocs/op  delta
WithoutReplacement/want=1-8          2.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
WithoutReplacement/want=100-8         101 ± 0%         0       -100.00%  (p=0.000 n=10+10)
WithoutReplacement/want=10000-8      1.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)

The arxiv paper I referenced above has some nice tables showing different approaches and their trade-offs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants