-
Notifications
You must be signed in to change notification settings - Fork 525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proposal: package for sampling without replacement #1787
Comments
Note that we already have this capability here: Maybe we can just improve the implementation? I haven't read about the algorithm yet, so am unsure how vital the choose function is in practice. |
Perhaps. The reason I went with a callback is that this is in some quite performance sensitive code. Writing everything into a slice and then turning around and reading back out that slice to process each of the elements does unnecessary work, plus it requires that I arrange for a pool of slices to populate, which is challenging when the lengths are heterogenous. But those are things I could work around. Here are benchmark results for len(idx)=want, n=1000000, going from the current implementation to my Vitter Algorithm D port:
The arxiv paper I referenced above has some nice tables showing different approaches and their trade-offs. |
Background
Sampling without replacement is a core operation. There are a handful of useful algorithms, depending on the shape of the problem, some of which are not easy to implement.
I recently was in a situation for which Vitter's Algorithm D was a perfect fit and was surprised not to find any Go implementations. Gonum seems like a great home for a suite of sampling-without-replacement algorithms.
Proposal
I propose that gonum add a package containing sampling-without-replacement algorithms. It could grow over time. A great start would be Vitter's Algorithm D. I would not expect that they would necessarily share a common API, because the algorithms themselves have different needs.
Potential impact of proposal
I have a straightforward port of the code at https://getkerf.wordpress.com/2016/03/30/the-best-algorithm-no-one-knows-about/ that I would be happy to contribute. (It uses an ISC license, which should be compatible with appropriate attribution, but IANAL.) I will note that I did not make any attempt to understand it, I merely ported it.
The API I chose for my purposes is as follows, but I'd be happy to change it.
There is a good writeup of the state of the art at https://arxiv.org/pdf/2104.05091.pdf that would be a useful resource if other folks wanted to add other algorithms. Note that that paper indicates that Vitter's Algorithm D might be implementable more simply in terms of other primitives that (I believe) gonum provides.
I'm not deeply invested in this and can also happily carry around my little
package vitter
locally. This proposal is more of an "if it seems exciting to y'all". :)The text was updated successfully, but these errors were encountered: