
Implement characters strategy #155

Merged
merged 1 commit into master from kxepal:filtered-characters on Nov 23, 2015

Conversation

@kxepal
Member

commented Sep 20, 2015

This strategy allows producing single Unicode characters according to specified rules: you may have exact characters or characters that belong to some Unicode category. Additionally, you may reverse your rule to specify which characters you don't want to get.

The characters strategy preserves the existing OneCharStringStrategy behaviour: with no arguments it produces all characters except surrogates.

characters with only the exact argument is implicitly converted into a sampled_from strategy.
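
As a rough illustration, usage of the proposed API might look like this (a sketch based on the description above; the `exact` and `exclude` parameter names are taken from the review discussion below, not from final code):

    from hypothesis import strategies as st

    st.characters()             # all characters except surrogates (default)
    st.characters(exact='abc')  # implicitly becomes sampled_from('abc')
    # category-based rules, reversed via the `exclude` flag, are also proposed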

Generally the tests pass, but Travis fails for unrelated reasons. What I can fix I will send in additional PRs.

Feedback is welcome (:

@DRMacIver

Member

commented Sep 20, 2015

Thanks for doing this! It looks pretty good, but I'm not a huge fan of some of the specifics of the API design.

In particular I think having a parameter that changes the meaning of the other parameters is a bad idea, so I really don't like the exclude parameter. Sorry.

I'd suggest either moving it to two parameters, one that is a strict inclusion and the other a strict exclusion, e.g. whitelist_categories=..., blacklist_categories=...

Another option would simply be to have categories=... which may be either a collection of allowed categories or a function that takes a category and returns True/False.

I don't have a strong preference between the two, and I'm also happy to take other suggestions.
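
For concreteness, the two options might look like this (hypothetical signatures sketching the suggestion, not code from the PR):

    # Option 1: strict inclusion and strict exclusion as separate parameters
    characters(whitelist_categories={'Lu', 'Ll'})
    characters(blacklist_categories={'Cs'})

    # Option 2: a single parameter taking either a collection or a predicate
    characters(categories={'Nd', 'Nl', 'No'})
    characters(categories=lambda cat: cat != 'Cs')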

I'm also not totally convinced adding an 'exact' parameter is that useful because it duplicates functionality that sampled_from provides almost exactly and doesn't seem to combine very interestingly with the category choice. Do you have an example use case where it makes life a lot better?

@kxepal

Member Author

commented Sep 22, 2015

@DRMacIver Sorry for the delayed response!

I took some time to think about it and... you may be right: I cannot come up with any case where you would need a whole Unicode group plus some extra chars (say, Nd (numbers) plus only ASCII letters). My initial intention in providing the "exact" argument was semantic: characters(ascii_letters) looks more friendly than sampled_from(ascii_letters).

So, let's limit ourselves to Unicode groups for now. It would be easy to bring this feature back if someone finds a need for it.

As for the second argument, I like your idea of having just categories. Having both a whitelist and a blacklist is a bit strange: while it's possible to specify both, there is no case in which that would be valid.

So, summary:

  1. Change the characters signature to categories, which accepts either a container or a callable;
  2. For a container argument, characters emits only those chars which belong to the specified Unicode categories;
  3. If a callable is provided, it must return true to produce the character and false otherwise;
  4. By default, all chars are produced except surrogates.

Which leads to:

    characters(lambda _: True)      -> all categories
    characters()                    -> all categories except Cs
    characters({'Nd', 'Nl', 'No'})  -> all numbers
    characters(lambda category: category not in {'Nd', 'Nl', 'No'})  -> all not numbers

I don't much like the blacklisting case here, but I find it rare enough not to really get annoyed about it.

Sounds good?
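
A minimal sketch of the container-or-callable normalization this implies (hypothetical helper, not code from the PR):

    def as_category_predicate(categories):
        # Normalize the `categories` argument into a predicate on category names.
        if categories is None:
            return lambda category: category != 'Cs'  # default: no surrogates
        if callable(categories):
            return categories
        allowed = frozenset(categories)
        return lambda category: category in allowed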

@kxepal kxepal force-pushed the kxepal:filtered-characters branch 2 times, most recently from e79211d to 2631a5e Sep 22, 2015
@kxepal

Member Author

commented Sep 22, 2015

PR updated per the last comment.

@kxepal

Member Author

commented Sep 22, 2015

Found another issue: it's really hard for hypothesis to produce characters for a specific Unicode group that doesn't cover the ASCII range. For instance, for the currency sign group (Sc), no examples may be found for over a minute.

I'll think about a better generation/simplification algorithm, especially since we know which categories we are interested in.

@kxepal kxepal force-pushed the kxepal:filtered-characters branch from 2631a5e to 6fa5d8f Sep 22, 2015
@DRMacIver

Member

commented Sep 22, 2015

I'm happy with the choice of API.

RE difficulty of generation: one option would be to precompute some sort of table of where different categories live.
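
One way such a table could be precomputed (a sketch using only the stdlib; unicodedata.category is a real API, the helper itself is hypothetical):

    import sys
    import unicodedata
    from collections import defaultdict

    def build_category_table(max_codepoint=sys.maxunicode):
        # One linear pass over all code points, merging consecutive code
        # points of the same category into (start, end) intervals.
        table = defaultdict(list)
        for cp in range(max_codepoint + 1):
            intervals = table[unicodedata.category(chr(cp))]
            if intervals and intervals[-1][1] == cp - 1:
                intervals[-1] = (intervals[-1][0], cp)  # extend the current run
            else:
                intervals.append((cp, cp))  # start a new run
        return dict(table)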

@DRMacIver

Member

commented Sep 22, 2015

A use case you might want to think about but should feel free not to implement: would max and min codepoint arguments be useful?

@kxepal

Member Author

commented Sep 22, 2015

one option would be to precompute some sort of table of where different categories live

Yea, I thought about the same. Will try this.

would max and min codepoint arguments be useful?

I think not, because it's basically st.integers(min_value=locp, max_value=hicp).map(chr). But what would be good is support for scripts and languages, since you may not be interested in all letters, but only in the letters of some specific language group.

However, I don't think Python provides any API to guess/look them up. I'll look into it, but I have a feeling it will require an extra dependency. For now, it could simply be done with .filter(). Not great, but better than nothing.
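
A rough sketch of that .filter() workaround (slow, since it rejects most characters, but workable):

    import unicodedata
    from hypothesis import strategies as st

    # Keep only characters whose Unicode name mentions the desired script.
    cyrillic_letters = st.characters().filter(
        lambda c: 'CYRILLIC' in unicodedata.name(c, ''))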

@DRMacIver

Member

commented Sep 22, 2015

My reasoning for max in particular was that it lets you do things like handle MySQL's 3-byte UTF-8. It's the same as integers, but it would be nice to be able to combine it with the categories feature.

@kxepal

Member Author

commented Sep 22, 2015

Hm... how should max and min codepoint arguments then cooperate with categories? Do they extend or reduce the set of characters provided by categories?

In other words: is characters({'Nd', 'Nl', 'No'}, min_codepoint=48, max_codepoint=57) about "give me all numbers and also characters with codepoints in the range 48-57", or "give me all numbers, but only those whose codepoint is in the range 48-57"?

@DRMacIver

Member

commented Sep 22, 2015

Intersection (give me all numbers whose codepoint is in this range) is definitely the correct behaviour. It's both necessary for the use case of "MySQL can't handle codepoints high enough to require 4 bytes in utf-8" and also seems like the one that makes most logical sense.

One complication is that this can result in an empty set of valid code points. If I ask for all surrogates in the range [0, 127] this has no valid characters. This should raise an InvalidArgument error (I think it's fine if any individual category misses the range, just not if all of them do).

Depending on the representation of our precomputed table and/or the size of the range this might be easy or hard to detect, which is why I thought it worth mentioning here. You definitely should not feel in any way obliged to do this unless you actually want to.
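
A sketch of the intersection plus the emptiness check (hypothetical helper; InvalidArgument is Hypothesis's real error type):

    from hypothesis.errors import InvalidArgument

    def clip_intervals(intervals, min_codepoint, max_codepoint):
        # Intersect each (lo, hi) category interval with the requested range.
        clipped = [(max(lo, min_codepoint), min(hi, max_codepoint))
                   for lo, hi in intervals]
        clipped = [(lo, hi) for lo, hi in clipped if lo <= hi]
        if not clipped:
            # e.g. all surrogates restricted to the range [0, 127]
            raise InvalidArgument('categories and codepoint range do not overlap')
        return clipped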


@kxepal

Member Author

commented Sep 22, 2015

Good case with MySQL. Then it makes sense to have them.

OK, I'll take care of generation first, and then see how the codepoint constraints fit in; I don't see any issues with that.

Thanks for the help and the ideas! (:

@kxepal kxepal force-pushed the kxepal:filtered-characters branch 8 times, most recently from 6d46c8c to e8aac45 Sep 30, 2015
@kxepal

Member Author

commented Sep 30, 2015

@DRMacIver PR updated. See the commit message for details. I failed to bake a good replacement for your try_shrink implementation: it's too good. However, maybe you can suggest something better using the new implementation.

@DRMacIver

Member

commented Sep 30, 2015

I need to sit down and properly read through this. I'm not yet 100% sure I understand the tree code. I'll try to do that tomorrow morning.

On the shrinking front, here's an idea that has just occurred to me: I think you should be able to treat your alphabet of available characters as a sorted indexable sequence with a little bit of preprocessing. So you should be able to say 'I have an n character alphabet. Give me the character at position i'

What you should then be able to do when shrinking is map the character you're shrinking to its index in the sequence, and then probe to earlier points in the sequence using something along the lines of the old implementation, just replacing codepoints with indexes into the alphabet. Does that make sense?
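
A minimal sketch of that implicit sorted-sequence idea (hypothetical helper, not the PR's code):

    import bisect

    class IndexedAlphabet(object):
        # Treat sorted, disjoint (start, end) codepoint intervals as one
        # sorted indexable sequence without materializing the characters.
        def __init__(self, intervals):
            self.intervals = intervals
            self.starts = [lo for lo, _ in intervals]
            self.offsets, total = [], 0  # running count of sizes so far
            for lo, hi in intervals:
                self.offsets.append(total)
                total += hi - lo + 1
            self.size = total

        def char_at(self, i):
            # Map a sequence index to its character.
            k = bisect.bisect_right(self.offsets, i) - 1
            return chr(self.intervals[k][0] + (i - self.offsets[k]))

        def index_of(self, char):
            # Map a character back to its index in the implicit sequence.
            k = bisect.bisect_right(self.starts, ord(char)) - 1
            return self.offsets[k] + (ord(char) - self.intervals[k][0])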

should (`whitelist_categories`) or should not (`blacklist_categories`)
be produced.

Also there could be applied limitation by minimal and maximal produced


DRMacIver Sep 30, 2015

Member

I think you mean "implied" instead of "applied".


kxepal Sep 30, 2015

Author Member

No, they are not implied, but applicable.

@kxepal

Member Author

commented Sep 30, 2015

I need to sit down and properly read through this. I'm not yet 100% sure I understand the tree code. I'll try to do that tomorrow morning.

Sure. Please note that this isn't the second iteration but the fourth, I guess, so some decisions may be OK for me but not clear for you.

On the shrinking front, here's an idea that has just occurred to me: I think you should be able to treat your alphabet of available characters as a sorted indexable sequence with a little bit of preprocessing. So you should be able to say 'I have an n character alphabet. Give me the character at position i'

What you should then be able to do when shrinking is map the character you're shrinking to its index in the sequence, and then probe to earlier points in the sequence using something along the lines of the old implementation, just replacing codepoints with indexes into the alphabet. Does that make sense?

Yes, it does, and I tried this approach. It works fine when you have a very limited character scope (e.g. you blacklisted all private codepoints and letters), but otherwise the cost of keeping a sorted indexable sequence for the whole range is much too high compared to what we have now.

I keep thinking about another idea, similar to this one: have a sorted sequence of character ranges (the leaves of the current tree) and combine your current shrinking algorithm with bisect to iterate over this sequence. That means a sequence of ~3K length in the worst case and O(log N) lookups, with N getting reduced on each iteration. Well, it's hard to explain; I'll try to code it (:

@DRMacIver

Member

commented Sep 30, 2015

Sorry, I didn't mean keeping the entire list in memory, just that you can implicitly regard it that way. You can build the data structure with a sorted list of intervals plus a running count of sizes so far.
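
With such a structure, probing earlier points might look like this (sketch, reusing the hypothetical IndexedAlphabet above):

    def shrink_candidates(alphabet, char):
        # Probe progressively smaller indexes, mirroring the old
        # codepoint-halving approach but in alphabet-index space.
        i = alphabet.index_of(char)
        while i > 0:
            i //= 2
            yield alphabet.char_at(i)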

@kxepal

Member Author

commented Sep 30, 2015

@DRMacIver A data structure with sorted intervals indeed helps with shrinking, but it will bite us on template draw: we won't be able to maintain equal chances for each category during the random choice because their distribution is not normalized. This will force us to do more processing work, and we'll just move the issue to another place.

The compromise here is to have both the tree and the sorted sequence at the same time. That would solve both issues, but are you OK with that?
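
To illustrate the normalization concern: drawing via a per-category table can pick a category uniformly first, which a flat interval list alone would not give you (sketch; assumes a flattened {category: [(lo, hi), ...]} view rather than the PR's nested tree):

    import random

    def draw_codepoint(table, rnd=random):
        # Uniform over categories first, so rare categories like Sc are
        # drawn as often as huge ones like Lo. (Picking intervals uniformly
        # still skews within a category; a real implementation would weight
        # intervals by their size.)
        category = rnd.choice(sorted(table))
        lo, hi = rnd.choice(table[category])
        return rnd.randint(lo, hi)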

@DRMacIver

Member

commented Sep 30, 2015

I don't think the lists of intervals are actually going to be very large, are they?

@kxepal kxepal force-pushed the kxepal:filtered-characters branch from e8aac45 to 1173b09 Oct 12, 2015
@kxepal kxepal force-pushed the kxepal:filtered-characters branch 15 times, most recently from 398f9e6 to 1d96d44 Nov 19, 2015
@kxepal

Member Author

commented Nov 21, 2015

@DRMacIver I think this strategy is ready for the first iteration now.
It implements:

  • Filter by Unicode categories
  • Filter by code point range
  • Filter by exact characters

I spent a lot of time trying to make shrinking reuse the charstree structure instead of doing blind random generation. However, I ended up with a tradeoff: I can greatly optimize the cases where a specific Unicode category or a character in a specific code point range causes the failure, but it hurts generic cases like find(text(), lambda x: x >= u'☃'), where I have to spend more time and rounds to shrink toward the '☃' value. So the current Unicode shrinking is good enough and well balanced. The ASCII one improved a bit, though given the code point range there it will not make a big difference.

However, the general problem is solved well, and the characters strategy is able to quickly produce chars for any code point, even in quite rare cases.

P.S. You may be interested in: https://travis-ci.org/DRMacIver/hypothesis/jobs/92345632#L889



def dump_tree(tree):  # pragma: no cover
    dump = marshal.dumps(tree)


DRMacIver Nov 22, 2015

Member

I think I might have been the one who suggested it, but I'm now wondering about the validity of using marshal here. The marshal format is not guaranteed and I think not compatible between Python 2 and 3. This could cause problems for e.g. cross-version libraries.

I wonder if using something more portable would make sense, e.g. json, which is used elsewhere. It would perform worse, but the size isn't large enough for that to be a major problem.


kxepal Nov 22, 2015

Author Member

We can explicitly set version 2 here, which is supported by both 2.x and 3.x so far.
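
That is, something like the following (marshal.dumps does accept a version argument):

    dump = marshal.dumps(tree, 2)  # format version 2 is readable on both 2.x and 3.x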

def codepoints_for_category(tree, category):
    """Iterates over all code points of the specified category in the tree."""
    for value in iter_values(tree[category]):
        for cp in range(value[0], value[1] + 1):


DRMacIver Nov 22, 2015

Member

This should use hypothesis.internal.compat.hrange - in Python 2 range is a list.


kxepal Nov 22, 2015

Author Member

Nice find. BTW, is it possible to add some linter rules to detect such things?
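
The suggested fix, for reference (hrange is the compat shim named above; the trailing yield is presumed from the docstring):

    from hypothesis.internal.compat import hrange

    def codepoints_for_category(tree, category):
        """Iterates over all code points of the specified category in the tree."""
        for value in iter_values(tree[category]):
            for cp in hrange(value[0], value[1] + 1):  # lazy on Python 2 too
                yield cp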

if whitelist_categories:
    categories &= set(whitelist_categories)
if blacklist_categories:
    categories ^= set(blacklist_categories)


DRMacIver Nov 22, 2015

Member

I don't think this is the right operator. Do you mean -=? This will add in any categories that are blacklisted but didn't pass the whitelist.


kxepal Nov 22, 2015

Author Member

Hm, right. However, it would only add them if you made a mistake in the blacklist categories, and this indeed shouldn't cause further failures. Thanks!
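
The corrected version per the review (sketch):

    if whitelist_categories:
        categories &= set(whitelist_categories)
    if blacklist_categories:
        categories -= set(blacklist_categories)  # subtract, not symmetric difference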

@DRMacIver

Member

commented Nov 22, 2015

Other than the above comments, this LGTM.

@DRMacIver

Member

commented Nov 22, 2015

Oh, re the PyPy segfault: it's a thing that happens sometimes. I haven't yet been able to get a reliable reproduction for the current intermittent one.

@kxepal kxepal force-pushed the kxepal:filtered-characters branch from 1d96d44 to ae4147f Nov 22, 2015
@kxepal

Member Author

commented Nov 22, 2015

PR rebased and updated to address the discussed issues.

This strategy is used to produce Unicode characters in the whole range up to
`sys.maxunicode`. Additionally, this range can be limited by exact
minimal and maximal code points and by Unicode groups (white and black
lists).

This implementation adds a new module `charstree` which operates on the
underlying data structure used by OneCharacterStrategy. The chosen data
structure is a tree-like dict primarily indexed by Unicode category
names. The next three levels are keyed by intermediate values produced
by applying masks (0xfff000, 0xffff00 and 0xfffff0 respectively) to the
code point value. For instance, for the character with code point 32433
(0x7eb1) the path would be:

    {'Lo': {0x7000: {0x0e00: {0x00b0: (1, 1)}}}}
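
One plausible reading of how the masks map a code point onto that path (inferred from the example above, not the module's actual code):

    cp = 0x7eb1                         # 32433
    k1 = cp & 0xfff000                  # 0x7000
    k2 = (cp & 0xffff00) - k1           # 0x0e00
    k3 = (cp & 0xfffff0) - (k1 + k2)    # 0x00b0
    # path: tree['Lo'][k1][k2][k3]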

This kind of structure serves two purposes:
1. Reduce the cost of code point lookups for a specific category
2. Reduce the number of big numbers stored, which helps to keep the tree
more compact (~9KiB compressed).

Yes, the size of the tree matters, since the characters tree isn't
generated on each run: it is produced once and cached in a special file
within the .hypothesis directory. This cache is optional (it's OK if the
write fails due to a read-only FS), but it reduces the first-call time
from ~2s to ~0.001s.

`OneCharacterStrategy` behaviour has changed: it no longer excludes
surrogates by default (the `text()` strategy keeps following the old
behaviour). It also avoids generating random characters and then
checking whether they are good before yielding. By generating only good
characters, the performance of the strategy improves with the number of
applied constraints: the narrower the code point range, the better the
performance.
@kxepal kxepal force-pushed the kxepal:filtered-characters branch from ae4147f to 156d93e Nov 22, 2015
DRMacIver added a commit that referenced this pull request Nov 23, 2015
Implement characters strategy
@DRMacIver DRMacIver merged commit 487b736 into HypothesisWorks:master Nov 23, 2015
2 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@DRMacIver

Member

commented Nov 23, 2015

👍 Thanks for the great work.

@kxepal

Member Author

commented Nov 23, 2015

\o/ Thank you for the guidance and the help with it!
