Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider adding a simple subsets / subsequences strategy? #1115

Closed
PiDelport opened this issue Feb 15, 2018 · 16 comments · Fixed by #1891
Closed

Consider adding a simple subsets / subsequences strategy? #1115

PiDelport opened this issue Feb 15, 2018 · 16 comments · Fixed by #1891
Labels
new-feature entirely novel capabilities or strategies

Comments

@PiDelport
Copy link
Contributor

PiDelport commented Feb 15, 2018

I find myself implementing this every now and then, to draw subsets of some fixed set of elements:

@st.composite
def subsets(draw, elements):
    return {e for e in elements if draw(st.booleans())}

(This is essentially equivalent to st.sets(st.sampled_from(elements), max_size=len(elements)), but simpler and presumably more efficient. (?))

Would this, or something like it, be useful enough to consider adding to the core Hypothesis strategies?

@Zac-HD Zac-HD added question not sure it's a bug? questions welcome new-feature entirely novel capabilities or strategies labels Feb 15, 2018
@DRMacIver
Copy link
Member

I'd definitely be fine with adding this to the API!

There are a couple things to consider though:

  • Do we want it to have the usual collection size parameters of min_size, average_size, max_size? (I think yes probably, though they don't have to be in the initial PR if you'd rather do something minimal)
  • What do we do about the ordering problem? Hypothesis kinda relies on having a stable order that things happen in across runs, and sets are in a potentially randomized order between runs. 😢

One option for the latter would be to only support this for orderable types and to sort the values up front. Would that be acceptable for your use case?

@PiDelport
Copy link
Contributor Author

Yup, those are good considerations!

The collection size parameters might be useful, yeah. If there's an elegant way to implement them without taking away from the simplicity of the base strategy, that would be cool (though I'd personally also be happy without them).

Regarding the ordering problem, my first instinct would be that the strategy should simply assume / require that the input is given in some reliable order, and warn or error if it isn't. This is what the sampled_from strategy does with its check_sample check: this strategy could probably just call the same check_sample and be fine?

@Zac-HD
Copy link
Member

Zac-HD commented Feb 15, 2018

I thought we were trying to get rid of average_size eventually? Apart from that, collection size controls make sense.

Perhaps this could be implemented as an upgrade to permutations()? Currently permutations() just casts to a list, which is nondeterministic if the input collection is unordered...

n-choose-k is at least related, and equivalent to taking a subset once you discard the ordering. It would make sense to me to use it as a general strategy, and just have a more efficient path for unordered operations internally.

@DRMacIver
Copy link
Member

Regarding the ordering problem, my first instinct would be that the strategy should simply assume / require that the input is given in some reliable order, and warn or error if it isn't. This is what the sampled_from strategy does with its check_sample check: this strategy could probably just call the same check_sample and be fine?

I'm definitely fine with doing this is as a first pass! The problem is that it results in making it invalid to call subsets with a set argument, which I think is going to be a bit weird and trip people up a lot.

@PiDelport
Copy link
Contributor Author

I'm definitely fine with doing this is as a first pass! The problem is that it results in making it invalid to call subsets with a set argument, which I think is going to be a bit weird and trip people up a lot.

I don't think this is much different to sampled_from. When I started, I also intuitively called sampled_from with set arguments, but the warning sorted me out. (And the check_sample call makes sure that it works anyway, so I'd assume the same would hold for subsets.)

@Zac-HD
Copy link
Member

Zac-HD commented Feb 15, 2018

IIRC sampled_from only accepts unordered collections to preserve backwards compatibility - for a new function we should make a strict error (though of course with a helpful message).

@PiDelport
Copy link
Contributor Author

PiDelport commented Feb 15, 2018

Hmm, just thinking out loud, how's this for a strategy that also supports min_size / max_size?

@st.composite
def subsets(draw, elements, *, min_size, max_size):
    def choices():
        always_size = min_size
        never_size = len(elements) - max_size
        maybe_size = len(elements) - always_size - never_size
        choices = (
            [True] * always_size +
            [draw(st.booleans()) for _ in range(maybe_size)] +
            [False] * never_size
        )
        assert len(elements) == len(choices)
        return draw(st.permutations(choices))

    return {element for (element, choose) in zip(elements, choices()) if choose}

If I understand it right, this variation should have the bonus feature of shrinking toward selecting earlier elements in the presence of min_size or max_size, due to how permutations works.

@PiDelport
Copy link
Contributor Author

Thinking out loud some more, the above might actually be more generally useful as a list-based n-choose-k strategy for subsequences, simply by changing the final set comprehension to a list comprehension. It could be named subsequences then instead?

In that case, getting actual subsets of a set would simply be a special case of doing subsequences(elements).map(set), which I could certainly live with.

@Zac-HD
Copy link
Member

Zac-HD commented Feb 15, 2018

Simpler: add the min and max size arguments to permutations, along with the sorting/warning we discussed, then (psudeocode) return draw(permutations(values))[:draw(integers(min_size, max_size))]. And yep, optionally .map(set).

@PiDelport
Copy link
Contributor Author

@Zac-HD: That wouldn't be the same thing as subsequences; it always permutes, rather than generating an ordered subsequence of the original.

It also "wastes" the entropy spent on the elements that get discarded: I'm not sure how much effect that has on the effectiveness of shrinking, but my intuition tells me it might? (With the subsequences approach, all the entropy ends up in the output.)

@Zac-HD
Copy link
Member

Zac-HD commented Feb 15, 2018

True! (assuming we make it not a set comprehension 😉) I see that as a feature though; and it shrinks towards shortest prefix in input order.

The shrinker would be fine; worst case it lowers that value alone and notices no change in status. The most efficient way is probably to loop over sampling an unused index, then including it in the output if drawing from an appropriately biased coin is True - this would allow the shrinker to delete pairs of draw calls. However at this higher level I wouldn't worry about the shrinker at all!

@Zac-HD Zac-HD added good first issue and removed question not sure it's a bug? questions welcome labels Mar 14, 2018
@PiDelport
Copy link
Contributor Author

PiDelport commented May 30, 2018

Here's the updated straw implementation, just for easier reference:

@st.composite
def subsequences(draw, elements, *, min_size, max_size):
    def choices():
        always_size = min_size
        never_size = len(elements) - max_size
        maybe_size = len(elements) - always_size - never_size
        choices = (
            [True] * always_size +
            [draw(st.booleans()) for _ in range(maybe_size)] +
            [False] * never_size
        )
        assert len(elements) == len(choices)
        return draw(st.permutations(choices))

    return [element for (element, choose) in zip(elements, choices()) if choose]

@PiDelport PiDelport changed the title Consider adding a simple "subsets" strategy? Consider adding a simple subsets / subsequences strategy? May 30, 2018
@tkcranny
Copy link

tkcranny commented Aug 27, 2018

Hi, this looks like a good one to implement for the PyCon sprints; if no-one has any objections, I'll get started implementing this 😁 (see #1513)

@tkcranny
Copy link

Ok, I have a proof-of-concept implemented by tkcranny@e49a018

I'd love some feedback on how the argument validation should be done, and perhaps on more/better testing.

@PiDelport
Copy link
Contributor Author

@tkcranny: Thanks for taking this to implementation! :)

@Zac-HD
Copy link
Member

Zac-HD commented Oct 23, 2018

Future work on this issue should build on #1533, with appropriate credit for that work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new-feature entirely novel capabilities or strategies
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants