
Add codec= argument to restrict st.characters() to codepoints representable in that encoding #1664

Closed
Zac-HD opened this issue Nov 1, 2018 · 13 comments · Fixed by #3732
Labels
enhancement it's not broken, but we want it to be better

Comments

@Zac-HD
Member

Zac-HD commented Nov 1, 2018

Reflecting on chardet/chardet#167 made me realise that it may often be useful to restrict generated characters to those valid for a particular encoding. In fact, that's exactly why we exclude surrogate characters from the default text() strategy - because they're invalid in utf-8!

This is obviously going to be much more efficient by construction than by filtering, and not too difficult - conceptually it's just an addition to the blacklist_characters argument, and can be implemented as such (though with some more attention to error messages for invalidly intersecting arguments, and some memoisation for performance).

Finally, the eternal question: what should we call this argument? I think the best option is codec=None, and ensure the error message asks for a codec name if we get a non-string value.

@Zac-HD Zac-HD added the enhancement it's not broken, but we want it to be better label Nov 1, 2018
@sabik

sabik commented Dec 17, 2019

For many encodings, this can be done using st.binary and decoding?

from hypothesis import given, reject, strategies as some

@given(some.binary())
def test_latin3_specifically(encoded):
    try:
        text = encoded.decode('iso-8859-3')
    except UnicodeDecodeError:
        reject()  # raises, so execution never falls through to the test body

    ...

With encodings like utf-8 the percentage rejected could get high, but for the likes of iso-8859-* where most bytes map to characters and there are no multi-byte sequences, it should work fine?

@Zac-HD
Member Author

Zac-HD commented Dec 17, 2019

It does work for some encodings, but rejection sampling is almost always less efficient than getting it right by construction. It also skews the distribution of sizes - because long strings are more likely to include a non-decodable sequence, this will be biased towards small strings and thus less effective at finding bugs 😕
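For single-byte codecs, the by-construction approach is easy to sketch: decode each of the 256 possible bytes once, keep the ones that succeed, and sample only from those. The helper name below is hypothetical, but it avoids both the rejection overhead and the size bias described above.

```python
def codec_alphabet(codec: str) -> list[str]:
    """All characters of a single-byte codec, found by decoding each byte."""
    chars = []
    for b in range(256):
        try:
            chars.append(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            pass  # byte is unassigned in this codec, e.g. 0xA5 in iso-8859-3
    return chars

# Usage with Hypothesis would look something like:
#   st.text(alphabet=st.sampled_from(codec_alphabet("iso-8859-3")))
```

Every generated string then round-trips through the codec by construction, and string length is independent of the codec.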

@Zac-HD Zac-HD changed the title Argument to restrict st.characters to a given encoding Add charset= argument to restrict st.characters() to codepoints representable in that encoding Jun 25, 2022
@jenstroeger
Contributor

In addition to charset perhaps using Unicode Scripts would be a useful feature here?

@Zac-HD
Member Author

Zac-HD commented Jan 11, 2023

In addition to charset perhaps using Unicode Scripts would be a useful feature here?

Sounds interesting, but I don't think Hypothesis is the right place to put a policy on which codepoints should be generated for a given script - at least initially - and st.characters() and st.text() are pretty tightly tied to codepoints. For example, should "uncoded" be included for a particular script? What about "inherited"?

If a third-party extension wanted to work out that mapping and demonstrated that people found it useful, I'd be happy to merge it back in later! Just cautious about making opinionated decisions under our backwards-compatibility constraints without first exploring some concrete use-cases 🙂

@jenstroeger
Contributor

Thanks for the heads-up, @Zac-HD. It won’t happen this week or next but I’ll look into finding (or creating) a dataset that we can use to map scripts to codepoints, and then wrap st.characters() around it. Something something… 🤔

@jenstroeger
Contributor

After some poking around, I think a first stab at building such a strategy would be to sort out the data.

Unfortunately, Python’s own unicodedata module gives access to only limited character properties. However, looking at the Unicode Character Database (UCD) v14 we have the ucd/Scripts.txt which correlates codepoints and codepoint ranges with Script, Category, and actual Name.
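The Scripts.txt file uses a simple line format (a codepoint or codepoint range, a semicolon, the script name, then a comment), so a script-to-codepoints mapping can be sketched with a small parser. This is an illustrative sketch only; the function name is made up and the sample lines below follow the UCD format.

```python
# Hypothetical parser for UCD Scripts.txt lines such as:
#   0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
#   00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
import re
from collections import defaultdict

_LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

def parse_scripts(text: str) -> dict:
    """Map each script name to the list of codepoint ranges assigned to it."""
    scripts = defaultdict(list)
    for line in text.splitlines():
        m = _LINE.match(line)
        if m:  # comment and blank lines simply don't match
            lo = int(m.group(1), 16)
            hi = int(m.group(2) or m.group(1), 16)
            scripts[m.group(3)].append(range(lo, hi + 1))
    return dict(scripts)
```

A strategy could then sample from the union of ranges for the requested script(s).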

@Zac-HD
Member Author

Zac-HD commented Jan 20, 2023

No worries - we already vendor tlds-alpha-by-domain.txt, so all the machinery (including auto-updates, which should be rare for Unicode!) is in place 🙂

@Zac-HD Zac-HD changed the title Add charset= argument to restrict st.characters() to codepoints representable in that encoding Add codec= argument to restrict st.characters() to codepoints representable in that encoding Feb 19, 2023
@Zac-HD
Member Author

Zac-HD commented Feb 20, 2023

It occurs to me that the most common use-cases will be codec="ascii", codec="utf-8", and codec=None (ie no restriction). These are also pretty easy to implement, so we might support those first and come back for other codecs later.
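Those three cases reduce to very cheap constraints, which this sketch makes concrete. The function and keyword names here are hypothetical stand-ins (only `max_codepoint` is an actual st.characters() parameter today), not the final API.

```python
# Sketch of the "easy codecs first" idea: ascii is just a codepoint cap,
# utf-8 only needs the surrogate range excluded, None means no restriction.
def codec_constraints(codec):
    if codec is None:
        return {}                               # no restriction at all
    if codec == "ascii":
        return {"max_codepoint": 0x7F}          # ascii = codepoints 0..127
    if codec == "utf-8":
        # utf-8 can encode every codepoint except lone surrogates
        return {"exclude_range": (0xD800, 0xDFFF)}
    raise NotImplementedError(f"codec={codec!r} not handled yet")
```

Other codecs would fall back to a slower computed-exclusion path once supported.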

@Cheukting
Contributor

May I give this a try, @Zac-HD?

@Zac-HD
Member Author

Zac-HD commented Jul 23, 2023

Certainly!

I'm heading out on a two-week camping trip tomorrow, so don't worry when it takes me a long time to respond or review 😅

@jenstroeger
Contributor

Enjoy @Zac-HD!

@Cheukting I’m also interested in this, so please let me know if there’s anything I can contribute.

@Cheukting
Contributor

@jenstroeger @Zac-HD Hey, sorry it has been silent for a while. I started a new job last week and was a bit busy. I am diving back into this and will let you know if I have any updates :-)

@Zac-HD
Member Author

Zac-HD commented Aug 6, 2023

Best wishes for the new job, and no worries - the issue is almost five years old, not especially urgent 😉
