Add `codec=` argument to restrict `st.characters()` to codepoints representable in that encoding #1664
For many encodings, this can be done using `st.binary` and decoding:

```python
from hypothesis import given, reject, strategies as some

@given(some.binary())
def test_latin3_specifically(encoded):
    try:
        text = encoded.decode("iso-8859-3")
    except ValueError:
        reject()
    ...
```

With encodings like `utf-8` the percentage rejected could get high, but for the likes of `iso-8859-*`, where most bytes map to characters and there aren't any multi-byte sequences, it should work fine?
It does work for some encodings, but rejection sampling is almost always less efficient than getting it right by construction. It also skews the distribution of sizes: because long strings are more likely to include a non-decodable sequence, this will be biased towards small strings and thus less effective at finding bugs 😕
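The size bias described above is easy to demonstrate with a stdlib-only sketch (the `acceptance_rate` helper below is my own illustration, not part of Hypothesis): decode uniformly random byte strings and measure what fraction survive at different lengths.

```python
import random

def acceptance_rate(codec, length, trials=2000, seed=0):
    # Fraction of uniformly random byte strings of the given length
    # that decode successfully under the codec.
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(length))
        try:
            data.decode(codec)
            ok += 1
        except UnicodeDecodeError:
            pass
    return ok / trials
```

For `utf-8`, the acceptance rate collapses as length grows (so rejection sampling rarely yields long strings), while for a mostly-dense single-byte codec like `iso-8859-3` it stays high.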
In addition to …
Sounds interesting, but I don't think Hypothesis is the right place to put a policy on which codepoints should be generated for a given script - at least initially - and if a third-party extension worked out that mapping and demonstrated that people found it useful, I'd be happy to merge it back in later! Just cautious about making opinionated decisions under our backwards-compatibility constraints without first exploring some concrete use-cases 🙂
Thanks for the heads-up, @Zac-HD. It won’t happen this week or next, but I’ll look into finding (or creating) a dataset that we can use to map scripts to codepoints, and then wrap a …
After some poking around, I think a first stab at building such a strategy would be to sort through the data. Unfortunately, Python’s own `unicodedata` module gives access to only limited character properties. However, looking at the Unicode Character Database (UCD) v14, we have `ucd/Scripts.txt`, which correlates codepoints and codepoint ranges with Script, Category, and actual Name.
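As a rough illustration of what parsing that file could look like, here is a stdlib-only sketch (the sample lines mimic the `Scripts.txt` format, and the `parse_scripts` helper is my own, not part of the UCD tooling or Hypothesis):

```python
import re

# Sample lines in the format used by UCD Scripts.txt:
#   "start..end ; Script # Category [count] Name" or "codepoint ; Script # ...".
SAMPLE = """\
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
0370..0373    ; Greek # L&   [4] GREEK CAPITAL LETTER HETA..GREEK SMALL LETTER ARCHAIC SAMPI
"""

LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

def parse_scripts(text):
    # Map script name -> list of (start, end) codepoint ranges, inclusive.
    ranges = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            start = int(m.group(1), 16)
            end = int(m.group(2), 16) if m.group(2) else start
            ranges.setdefault(m.group(3), []).append((start, end))
    return ranges
```

The resulting ranges could then feed something like `st.characters()` via its codepoint limits, or a `st.sampled_from`-based alphabet.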
No worries - we already vendor …
It occurs to me that the most common use-cases will be …
May I give this a try, @Zac-HD?
Certainly! I'm heading out on a two-week camping trip tomorrow, so don't worry if it takes me a long time to respond or review 😅
Enjoy @Zac-HD! @Cheukting I’m also interested in this, so please let me know if there’s anything I can contribute. |
@jenstroeger @Zac-HD Hey sorry it has been silent for a while. I started a new job last week and was a bit busy. I am diving back into this and will let you know if I have any updates :-) |
Best wishes for the new job, and no worries - the issue is almost five years old, not especially urgent 😉 |
Reflecting on chardet/chardet#167 made me realise that it may often be useful to restrict generated characters to those valid for a particular encoding. In fact, that's exactly why we exclude surrogate characters from the default `text()` strategy - because they're invalid in `utf-8`!

This is obviously going to be much more efficient by construction than by filtering, and not too difficult - conceptually it's just an addition to the `blacklist_characters` argument, and can be implemented as such (though with some more attention to error messages for invalidly intersecting arguments, and some memoisation for performance).

Finally, the eternal question: what should we call this argument? I think the best option is `codec=None`, and we should ensure the error message asks for a codec name if we get a non-string value.
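A minimal sketch of the by-construction idea, with the memoisation mentioned above (the `encodable` helper is hypothetical, not Hypothesis API): cache a per-codepoint encodability check, then use it to build the allowed character set up front rather than filtering after generation.

```python
import functools

@functools.lru_cache(maxsize=None)
def encodable(codepoint, codec):
    # True if this codepoint can be round-tripped through the given codec.
    # Surrogates fail for utf-8, which matches why text() excludes them.
    try:
        chr(codepoint).encode(codec)
        return True
    except UnicodeEncodeError:
        return False
```

A real implementation inside `st.characters()` would likely memoise whole codepoint intervals per codec rather than individual codepoints, but the check itself is this simple.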