
Add codec= argument to restrict st.characters() to codepoints representable in that encoding #1664

Closed
Zac-HD opened this issue Nov 1, 2018 · 13 comments · Fixed by #3732
Labels
enhancement it's not broken, but we want it to be better

Comments

@Zac-HD
Member

Zac-HD commented Nov 1, 2018

Reflecting on chardet/chardet#167 made me realise that it may often be useful to restrict generated characters to those valid for a particular encoding. In fact, that's exactly why we exclude surrogate characters from the default text() strategy - because they're invalid in utf-8!

This is obviously going to be much more efficient by construction than by filtering, and not too difficult - conceptually it's just an addition to the blacklist_characters argument, and can be implemented as such (though with some more attention to error messages for invalidly intersecting arguments, and some memoisation for performance).

Finally, the eternal question: what should we call this argument? I think the best option is codec=None, and ensure the error message asks for a codec name if we get a non-string value.

@Zac-HD Zac-HD added the enhancement it's not broken, but we want it to be better label Nov 1, 2018
@sabik

sabik commented Dec 17, 2019

For many encodings, this can be done using st.binary and decoding?

from hypothesis import given, reject, strategies as some

@given(some.binary())
def test_latin3_specifically(encoded):
    try:
        text = encoded.decode('iso-8859-3')
    except UnicodeDecodeError:
        reject()  # raises, so execution never falls through to the test body

    ...

With encodings like utf-8 the percentage rejected could get high, but for the likes of iso-8859-* where most bytes map to characters and there are no multi-byte sequences, it should work fine?

@Zac-HD
Member Author

Zac-HD commented Dec 17, 2019

It does work for some encodings, but rejection sampling is almost always less efficient than getting it right by construction. It also skews the distribution of sizes - because long strings are more likely to include a non-decodable sequence, this will be biased towards small strings and thus less effective at finding bugs 😕
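For single-byte codecs, the by-construction approach is easy to sketch: decode each of the 256 possible bytes once, keep the ones that succeed, and sample only from those. The helper name below is hypothetical, but it avoids both the rejection overhead and the size bias described above.

```python
def codec_alphabet(codec: str) -> list[str]:
    """All characters of a single-byte codec, found by decoding each byte."""
    chars = []
    for b in range(256):
        try:
            chars.append(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            pass  # byte is unassigned in this codec, e.g. 0xA5 in iso-8859-3
    return chars

# Usage with Hypothesis would look something like:
#   st.text(alphabet=st.sampled_from(codec_alphabet("iso-8859-3")))
```

Every generated string then round-trips through the codec by construction, and string length is independent of the codec.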

@Zac-HD Zac-HD changed the title Argument to restrict st.characters to a given encoding Add charset= argument to restrict st.characters() to codepoints representable in that encoding Jun 25, 2022
@jenstroeger
Contributor

In addition to charset perhaps using Unicode Scripts would be a useful feature here?

@Zac-HD
Member Author

Zac-HD commented Jan 11, 2023

In addition to charset perhaps using Unicode Scripts would be a useful feature here?

Sounds interesting, but I don't think Hypothesis is the right place to put a policy on which codepoints should be generated for a given script - at least initially - and st.characters() and st.text() are pretty tightly tied to codepoints. For example, should "uncoded" be included for a particular script? What about "inherited"?

If a third-party extension wanted to work out that mapping and demonstrated that people found it useful, I'd be happy to merge it back in later! Just cautious about making opinionated decisions under our backwards-compatibility constraints without first exploring some concrete use-cases 🙂

@jenstroeger
Contributor

Thanks for the heads-up, @Zac-HD. It won’t happen this week or next but I’ll look into finding (or creating) a dataset that we can use to map scripts to codepoints, and then wrap st.characters() around it. Something something… 🤔

@jenstroeger
Contributor

After some poking around, I think a first stab at building such a strategy would be to sort out the data.

Unfortunately, Python’s own unicodedata module gives access to only limited character properties. However, looking at the Unicode Character Database (UCD) v14 we have the ucd/Scripts.txt which correlates codepoints and codepoint ranges with Script, Category, and actual Name.
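The Scripts.txt file uses a simple line format (a codepoint or codepoint range, a semicolon, the script name, then a comment), so a script-to-codepoints mapping can be sketched with a small parser. This is an illustrative sketch only; the function name is made up and the sample lines below follow the UCD format.

```python
# Hypothetical parser for UCD Scripts.txt lines such as:
#   0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
#   00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
import re
from collections import defaultdict

_LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

def parse_scripts(text: str) -> dict:
    """Map each script name to the list of codepoint ranges assigned to it."""
    scripts = defaultdict(list)
    for line in text.splitlines():
        m = _LINE.match(line)
        if m:  # comment and blank lines simply don't match
            lo = int(m.group(1), 16)
            hi = int(m.group(2) or m.group(1), 16)
            scripts[m.group(3)].append(range(lo, hi + 1))
    return dict(scripts)
```

A strategy could then sample from the union of ranges for the requested script(s).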

@Zac-HD
Member Author

Zac-HD commented Jan 20, 2023

No worries - we already vendor tlds-alpha-by-domain.txt, so all the machinery (including auto-updates, which should be rare for Unicode!) is in place 🙂

@Zac-HD Zac-HD changed the title Add charset= argument to restrict st.characters() to codepoints representable in that encoding Add codec= argument to restrict st.characters() to codepoints representable in that encoding Feb 19, 2023
@Zac-HD
Member Author

Zac-HD commented Feb 20, 2023

It occurs to me that the most common use-cases will be codec="ascii", codec="utf-8", and codec=None (ie no restriction). These are also pretty easy to implement, so we might support those first and come back for other codecs later.
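Those three cases reduce to very cheap constraints, which this sketch makes concrete. The function and keyword names here are hypothetical stand-ins (only `max_codepoint` is an actual st.characters() parameter today), not the final API.

```python
# Sketch of the "easy codecs first" idea: ascii is just a codepoint cap,
# utf-8 only needs the surrogate range excluded, None means no restriction.
def codec_constraints(codec):
    if codec is None:
        return {}                               # no restriction at all
    if codec == "ascii":
        return {"max_codepoint": 0x7F}          # ascii = codepoints 0..127
    if codec == "utf-8":
        # utf-8 can encode every codepoint except lone surrogates
        return {"exclude_range": (0xD800, 0xDFFF)}
    raise NotImplementedError(f"codec={codec!r} not handled yet")
```

Other codecs would fall back to a slower computed-exclusion path once supported.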

@Cheukting
Contributor

May I give this a try, @Zac-HD?

@Zac-HD
Member Author

Zac-HD commented Jul 23, 2023

Certainly!

I'm heading out on a two-week camping trip tomorrow, so don't worry when it takes me a long time to respond or review 😅

@jenstroeger
Contributor

Enjoy @Zac-HD!

@Cheukting I’m also interested in this, so please let me know if there’s anything I can contribute.

@Cheukting
Contributor

@jenstroeger @Zac-HD Hey, sorry it has been silent for a while. I started a new job last week and was a bit busy. I am diving back into this and will let you know if I have any updates :-)

@Zac-HD
Member Author

Zac-HD commented Aug 6, 2023

Best wishes for the new job, and no worries - the issue is almost five years old, not especially urgent 😉
