Implement characters strategy #155
Conversation
|
Thanks for doing this! It looks pretty good, but I'm not a huge fan of some of the specifics of the API design. In particular, I think having a parameter that changes the meaning of the other parameters is a bad idea, so I really don't like the exclude parameter. Sorry. I'd suggest either splitting it into two parameters, one that is a strict inclusion and the other a strict exclusion, e.g. whitelist_categories=..., blacklist_categories=... Another option would simply be to have categories=..., which may be either a collection of allowed categories or a function that takes a category and returns True/False. I don't have a strong preference between the two, and I'm also happy to take other suggestions. I'm also not totally convinced that adding an 'exact' parameter is that useful, because it duplicates functionality that sampled_from provides almost exactly and doesn't seem to combine very interestingly with the category choice. Do you have an example use case where it makes life a lot better? |
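For illustration only (this is editorial, not code from the PR), here is a minimal sketch of how the two proposed option styles could be interpreted as a predicate over code points; the helper names are invented for this example:

```python
import unicodedata

def allowed_by_lists(codepoint, whitelist_categories=None, blacklist_categories=None):
    """Option 1: strict include/exclude lists of Unicode category names."""
    category = unicodedata.category(chr(codepoint))
    if whitelist_categories is not None and category not in whitelist_categories:
        return False
    if blacklist_categories is not None and category in blacklist_categories:
        return False
    return True

def allowed_by_categories(codepoint, categories):
    """Option 2: `categories` is either a collection of allowed category
    names or a predicate that takes a category name and returns True/False."""
    category = unicodedata.category(chr(codepoint))
    return categories(category) if callable(categories) else category in categories

# Both accept 'A' (category 'Lu') and reject a surrogate (category 'Cs'):
assert allowed_by_lists(ord('A'), whitelist_categories={'Lu', 'Ll'})
assert allowed_by_categories(ord('A'), lambda c: c.startswith('L'))
assert not allowed_by_lists(0xD800, blacklist_categories={'Cs'})
```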
|
@DRMacIver Sorry for the delay with the response! I took some time to think it over and... you may be right: I cannot think of any case where you would need a whole Unicode group plus some extra characters. So, let's limit ourselves to Unicode groups for now; it would be easy to bring this feature back if someone finds a need for it. As for the second argument, I like your idea. So, in summary: whitelist and blacklist parameters for the categories.
While I don't much like the blacklisting here, I find this case rare enough not to get really annoyed by it. Sounds good? |
Force-pushed from e79211d to 2631a5e
|
PR updated as per the last comment. |
|
Found another issue: it's really hard for Hypothesis to produce characters for a specific Unicode group that doesn't cover the ASCII range. For instance, for currency signs (the Sc group) it may fail to find any example for over a minute. I'll think about a better generation/simplification algorithm, especially since we know which categories we are interested in. |
Force-pushed from 2631a5e to 6fa5d8f
|
I'm happy with the choice of API. RE the difficulty of generation: one option would be to precompute some sort of table of where the different categories live. |
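As a rough sketch of what such a precomputed table might look like (editorial illustration, not the PR's implementation): group consecutive code points that share a Unicode category into intervals, keyed by category name.

```python
import sys
import unicodedata
from collections import defaultdict

def build_category_table(max_codepoint=sys.maxunicode):
    """Map each Unicode category name to a list of (start, end) codepoint
    intervals, merging runs of consecutive codepoints with the same category."""
    table = defaultdict(list)
    for codepoint in range(max_codepoint + 1):
        category = unicodedata.category(chr(codepoint))
        intervals = table[category]
        if intervals and intervals[-1][1] == codepoint - 1:
            intervals[-1] = (intervals[-1][0], codepoint)  # extend the current run
        else:
            intervals.append((codepoint, codepoint))       # start a new run
    return dict(table)

# table['Sc'] then says exactly where currency signs live, so generation
# never has to search blindly for them.
```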
|
A use case you might want to think about but should feel free not to implement: would max and min codepoint arguments be useful? |
Yeah, I thought about the same thing. I'll try it.
I think not, because it's basically the same thing. However, I don't think Python provides any API to guess or look them up. I'll look into it, but I have a feeling it will require an extra dependency. For now, it could simply be done with a filter. |
|
My reasoning for max in particular was that it lets you do things like handle MySQL's 3-byte UTF-8. It's the same as integers, but it would be nice to be able to combine it with the categories feature. |
|
Hm... how then should the max and min codepoint arguments cooperate with categories? Do they extend or reduce the set of characters provided by categories? In other words: |
|
Intersection (give me all numbers whose codepoint is in this range) is definitely the correct behaviour. It's both necessary for the use case of "MySQL can't handle codepoints high enough to require 4 bytes in utf-8" and also seems like the one that makes most logical sense. One complication is that this can result in an empty set of valid code points. If I ask for all surrogates in the range [0, 127] this has no valid characters. This should raise an InvalidArgument error (I think it's fine if any individual category misses the range, just not if all of them do). Depending on the representation of our precomputed table and/or the size of the range this might be easy or hard to detect, which is why I thought it worth mentioning here. You definitely should not feel in any way obliged to do this unless you actually want to. |
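A minimal sketch of that intersection-and-validation step (editorial; using a plain ValueError where Hypothesis would raise InvalidArgument, and assuming a category-to-intervals table like the one sketched earlier):

```python
def intersect_with_range(category_intervals, min_codepoint, max_codepoint):
    """Clip each category's (start, end) intervals to [min_codepoint,
    max_codepoint]; raise only if *no* category survives the intersection."""
    clipped = {}
    for category, intervals in category_intervals.items():
        kept = [(max(start, min_codepoint), min(end, max_codepoint))
                for start, end in intervals
                if start <= max_codepoint and end >= min_codepoint]
        if kept:  # an individual category is allowed to miss the range
            clipped[category] = kept
    if not clipped:
        # Hypothesis would raise InvalidArgument here.
        raise ValueError('no requested category has code points in [%d, %d]'
                         % (min_codepoint, max_codepoint))
    return clipped

# e.g. intersect_with_range({'Cs': [(0xD800, 0xDFFF)]}, 0, 127) raises,
# because surrogates restricted to [0, 127] leave nothing to generate.
```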
|
Good case with MySQL; then it makes sense to have it. OK, I'll first take care of generation, and then we'll see how the codepoint constraint fits in, but I don't see any issues with it. Thanks for the help and the ideas! (: |
Force-pushed from 6d46c8c to e8aac45
|
@DRMacIver PR updated. See the commit message for details. I failed to come up with a good replacement for yours. |
|
I need to sit down and properly read through this; I'm not yet 100% sure I understand the tree code. I'll try to do that tomorrow morning. On the shrinking front, here's an idea that has just occurred to me: I think you should be able to treat your alphabet of available characters as a sorted indexable sequence with a little bit of preprocessing, so you should be able to say 'I have an n-character alphabet. Give me the character at position i.' What you should then be able to do when shrinking is map the character you're shrinking to its index in the sequence, and then probe earlier points in the sequence using something along the lines of the old implementation, just replacing codepoints with indexes into the alphabet. Does that make sense? |
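Roughly, that shrinking scheme might look like the following sketch (editorial; it assumes the alphabet is available as a sorted list of codepoints, and the helper name is invented here):

```python
from bisect import bisect_left

def shrink_candidates(alphabet, char):
    """Treat `alphabet` (a sorted list of codepoints) as an indexable
    sequence: map the character to its index, then probe earlier indexes,
    much like the old implementation probed smaller codepoints."""
    i = bisect_left(alphabet, ord(char))
    probes = [0]          # always try the very first character first
    step = i
    while step > 1:       # then binary-search back towards the original
        step //= 2
        probes.append(i - step)
    return [chr(alphabet[j]) for j in probes if 0 <= j < i]

# With alphabet = sorted codepoints of the Sc category and char = '₤',
# the first candidate is '$' (index 0), then progressively closer signs.
```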
| should (`whitelist_categories`) or should not (`blacklist_categories`)
| be produced.
|
| Also there could be applied limitation by minimal and maximal produced
I think you mean "implied" instead of "applied".
No, they are not implied, but applicable.
Sure. Please note that this isn't even the second iteration, more like the fourth I guess, so some decisions may seem OK to me but not clear to you.
Yes, it does, and I tried this approach. It works fine when you have a very limited character scope (e.g. you've blacklisted all private codepoints and letters), but otherwise the cost of having a sorted indexable sequence for the whole range is much too high compared to what we have now. I keep thinking about another, similar idea: have a sorted sequence of character ranges (the leaves of the current tree) and combine your current shrinking algorithm with bisect to iterate over this sequence. That would mean a sequence of ~3K length in the worst case with O(log N) lookups, and N gets reduced with each iteration. Well, it's hard to explain; I'll try to code it (: |
|
Sorry, I didn't mean keeping the entire list in memory, just that you can implicitly regard it that way. You can build the data structure with a sorted list of intervals plus a running count of sizes so far. |
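A compact sketch of that implicit indexable sequence (sorted intervals plus cumulative sizes); this is an editorial illustration, not code from the PR:

```python
from bisect import bisect_right
from itertools import accumulate

class IntervalAlphabet:
    """Regard a sorted list of (start, end) codepoint intervals as one big
    indexable sequence without ever materialising it."""

    def __init__(self, intervals):
        self.intervals = sorted(intervals)
        sizes = [end - start + 1 for start, end in self.intervals]
        # cumulative[k] == number of codepoints in the first k intervals
        self.cumulative = [0] + list(accumulate(sizes))

    def __len__(self):
        return self.cumulative[-1]

    def codepoint_at(self, i):
        """Return the i-th codepoint of the implicit sequence."""
        k = bisect_right(self.cumulative, i) - 1
        return self.intervals[k][0] + (i - self.cumulative[k])

    def index_of(self, codepoint):
        """Inverse lookup: map a codepoint back to its index."""
        for k, (start, end) in enumerate(self.intervals):
            if start <= codepoint <= end:
                return self.cumulative[k] + (codepoint - start)
        raise ValueError('codepoint not in alphabet')

# alphabet = IntervalAlphabet([(0x24, 0x24), (0x20A0, 0x20BF)])
# alphabet.codepoint_at(0) == 0x24 and alphabet.index_of(0x20A4) == 5
```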
|
@DRMacIver A data structure with sorted intervals indeed helps with shrinking, but it will bite us on template draw: we won't be able to maintain equal chances for each category during the random choice, because their distribution is not normalized. This will lead us to do more processing work, and we'll just move the issue somewhere else. The compromise here is to keep both the tree and the sorted sequence at the same time. This solves both issues, but are you OK with that? |
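To make the normalization concern concrete, here is a small editorial sketch (not the PR's code): a uniform draw over a flat pool is biased towards big categories, whereas a two-stage draw keeps categories equally likely.

```python
import random

def draw_flat(alphabets):
    """`alphabets` maps a category name to a sorted list of codepoints.
    A uniform draw over the concatenation favours huge categories
    (e.g. 'Lo' dwarfs 'Sc' by several orders of magnitude)."""
    pool = [cp for codepoints in alphabets.values() for cp in codepoints]
    return random.choice(pool)

def draw_per_category(alphabets):
    """Two-stage draw: pick the category uniformly first, then a codepoint
    within it, so every category is equally likely regardless of its size."""
    category = random.choice(sorted(alphabets))
    return random.choice(alphabets[category])
```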
|
I don't think the lists of intervals are actually going to be very large, are they? |
Force-pushed from e8aac45 to 1173b09
Force-pushed from 398f9e6 to 1d96d44
|
@DRMacIver I think this strategy is ready for a first iteration now.
I spent a lot of time trying to make shrinking reuse the charstree structure instead of doing blind random generation. However, I ended up with a tradeoff: I can greatly optimize cases where a specific Unicode category, or a character in a specific code point range, causes a failure, but it hurts more generic cases. Still, the general problem is solved well. P.S. You may be interested in: https://travis-ci.org/DRMacIver/hypothesis/jobs/92345632#L889 |
|
| def dump_tree(tree):  # pragma: no cover
|     dump = marshal.dumps(tree)
I think I might have been the one who suggested it, but I'm now wondering about the validity of using marshal here. The marshal format is not guaranteed and I think not compatible between Python 2 and 3. This could cause problems for e.g. cross-version libraries.
I wonder if using something more portable would make sense, e.g. json, which is used elsewhere. It would perform worse, but the size isn't large enough for that to be a major problem.
We can explicitly set version 2 here, which is supported by both 2.x and 3.x so far.
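For illustration, pinning marshal to version 2 would look roughly like this (an editorial sketch of the idea, not the PR's exact code); json, as suggested above, would be the more portable alternative:

```python
import marshal

def dump_tree(tree):
    # Version 2 of the marshal format is understood by both Python 2.x and
    # 3.x, so a cache written by one interpreter can be read by the other.
    return marshal.dumps(tree, 2)

def load_tree(data):
    return marshal.loads(data)
```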
|
Other than the above comments, this LGTM. |
|
Oh, RE the pypy segfault: It's a thing that happens sometimes. I haven't yet been able to get a reliable reproduction for the current intermittent one. |
Force-pushed from 1d96d44 to ae4147f
|
PR rebased and updated to address the issues discussed.
This strategy produces Unicode characters across the whole range up to
`sys.maxunicode`. Additionally, this range can be limited by exact
minimal and maximal code points and by Unicode groups (white and black
lists).
This implementation adds a new module, `charstree`, which operates on the
underlying data structure used by OneCharacterStrategy. The chosen data
structure is a tree-like dict primarily indexed by Unicode category names.
The next three levels are keyed by intermediate values produced by applying
special masks (0xfff000, 0xffff00 and 0xfffff0 respectively) to the code
point value. For instance, for the character with code point 32433 (0x7eb1)
the path would be:
{'Lo': {0x7000: {0x0e00: {0x00b0: (1, 1)}}}}
Such a structure serves two purposes:
1. Reduce the cost of code point lookup for a specific category.
2. Reduce the number of large numbers stored, which helps to keep the tree
more compact (~9KiB compressed).
Yes, the size of the tree matters, since the characters tree isn't generated
on each run: it is produced once and cached in a special file within the
.hypothesis directory. This cache is optional (if the write fails due to a
read-only FS, that's OK), but it reduces the first call time from ~2s to
~0.001s.
`OneCharacterStrategy` behaviour has changed: it no longer excludes
surrogates by default (the `text()` strategy keeps the old behaviour). It
also avoids generating random characters and only afterwards checking
whether they are acceptable before yielding. By always generating acceptable
characters, the strategy's performance improves with the amount of applied
constraints: the narrower the code point range, the better the performance.
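As a rough editorial illustration of the masking scheme described in the commit message (not the actual charstree code), the three intermediate keys for a code point can be derived like this:

```python
def tree_path(codepoint):
    """Split a code point into the three intermediate keys using the masks
    0xfff000, 0xffff00 and 0xfffff0. The leaf value the real tree stores
    under the last key is not reproduced here."""
    level1 = codepoint & 0xfff000
    level2 = (codepoint & 0xffff00) - level1
    level3 = (codepoint & 0xfffff0) - level1 - level2
    return level1, level2, level3

# tree_path(0x7eb1) == (0x7000, 0x0e00, 0x00b0), matching the example path
# {'Lo': {0x7000: {0x0e00: {0x00b0: ...}}}} for code point 32433.
```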
Force-pushed from ae4147f to 156d93e
Implement characters strategy
|
\o/ Thank you for the guidance and the help!
This strategy allows producing single Unicode characters by specified rules: you may request exact characters or characters that belong to some Unicode category. Additionally, you may reverse your rule to specify which characters you don't want to get.
The characters strategy preserves the existing OneCharStringStrategy behaviour: with no arguments it produces all characters except surrogates.
characters with only the exact argument is implicitly converted into a sampled_from strategy.
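As an illustration of how the strategy described above might be used (the keyword names follow the discussion in this thread and may differ from the final API):

```python
from hypothesis import given, strategies as st

# Currency signs only, restricted to code points that MySQL's 3-byte
# utf-8 can store (the use case discussed above).
@given(st.characters(whitelist_categories=('Sc',), max_codepoint=0xFFFF))
def test_currency_signs_fit_in_three_utf8_bytes(c):
    assert len(c.encode('utf-8')) <= 3

# Any character except surrogates and control characters.
anything_printable = st.characters(blacklist_categories=('Cs', 'Cc'))
```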
Generally the tests pass, but Travis fails for unrelated reasons. What I can fix, I will send as additional PRs.
Feedback is welcome (: