Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"utf8" module supporting matching UTF-8/returning &str #59

Merged
merged 13 commits into from Jan 3, 2023

Conversation

mcclure
Copy link
Contributor

@mcclure mcclure commented Dec 28, 2022

I find "pom" very enjoyable to use but I find I have frustration around converting inputs and match-strings to/from UTF-8 &str (see #53). I think pom adding explicit support for UTF-8 would bring important advantages:

  • UX improvements when working with Rust strings and chars
  • Match primitives that guarantee at each step a valid UTF-8 string is being matched, for example an any() that matches UTF-8 chars only (yes, I know I can .convert(str::from_utf8) and it will correctly reject invalid UTF-8, but that bails out relatively late)

This is a draft/first attempt at a utf8 module. (The regular parser is unchanged, utf8 is opt-in.) You can see what using it is like in the example examples/utf8.rs but it's much like normal pom. (.parse() still requires the input to be :as_bytes()ed, but seq() accepts normal Rust strings). The basic approach is

  • use pom::utf8::* contains functions that have the same names and usage as the ones in pom::parser::* (so it is mostly a drop-in replacement), but any returns or arguments that are &[I] in parser::Parser are &str in utf8::Parser.
  • pom::utf8::Parser<'a, O> is implemented as a thin wrapper around pom::parser::Parser<'a, u8, O>— it is a separate type because by keeping track of which patterns are pure utf8, collect() over a tree of utf8::Parsers can return a &str safely. But because at core it's still just parser::Parser<_, u8,_>, it can be combined into a single pattern with non-UTF8 parser::Parser (at the cost of no longer being able to do a collect() without re-verifying UTF-8).

This prototype has just enough functions to implement the examples/utf8.rs example. It implements UTF-8 aware seq() and any() combinators, has the UTF-8 aware collect and convert, you can turn a utf8 Parser into parser::Parser with from/into, and it so far has methods passing discard, map, parse, repeat, | and * on to the underlying parser::Parser implementation.

Next steps are:

  • Implement rest of parser:: functions/methods (I may do this with a macro? I think I would have to write the macro myself. There are some delegation macro crates but none of them seem exactly fit to this situation.)
  • sym needs to be special because this is the one function I intend to use a slightly different interface from parser::Parser:
    • pub fn sym<'a>(tag: char) -> Parser<'a, &'a str> will return a single-char string
    • pub fn sym_char<'a>(tag: char) -> Parser<'a, char> will return a parsed char, to make constructions like sym_char(ch).is_a(str::is_alphabetic) possible
  • The utf8 module uses unsafe {} because it calls str::from_utf8_unchecked on slices it has already confirmed contain complete UTF-8 characters. I would like to introduce a Cargo "feature" to remove use of unsafe from utf8, at the cost of a redundant str::from_utf8 check in places.
  • May create a utf8::Parser.parse_str(input:&str) that just calls parse(input:as_bytes()), for convenience (?)
  • Versions of +, - etc that take one parser::Parser and one utf8::Parser and return a parser::Parser, to make it easy to mix them; also I want to create an examples/utf8_mixed.rs demonstrating using parser::Parser and utf8::Parser in the same pattern (EG a simple MsgPack parser or something).

Long term additions I'd be interested in attempting are:

  • Possibly a Unicode version of pom::char_class?
  • Possibly glyph support, or support for normalization forms (this would make possible things like seq_case_insensitive which would be very useful to me)

What I need to know from @J-F-Liu:

  • Are you interested in this PR, in theory? If you do not think this belongs in pom, I will probably publish it as a second supplementary crate.
  • Should utf-8 support be behind a "feature" so it can be disabled? It does introduce complexities such as external crate support (it uses bstr) and unsafe.
  • The type of utf8::Parser is Parser<'a, O>. This makes sense because by definition it can only ever work on u8, but means mixing fns that define utf8::Parsers and parser::Parsers in the same file would be a little confusing because some functions would have 2 generic arguments and some would have 3. Would it make sense to put the I type argument back in with a where I=u8, and require the user to type the u8 generic argument every time? (My vote is no, it's fine the way it is now, but I wanted to ask.)

Thank you for this neat library! I have used it a lot this month.

@J-F-Liu
Copy link
Owner

J-F-Liu commented Dec 29, 2022

It's a good idea to use UTF-8 string directly as the input, then advance the input position char by char.
An efficient implementation is something like core::str::next_code_point, we can modify the code to return both the decoded char, and the number of bytes of this char.

@mcclure
Copy link
Contributor Author

mcclure commented Dec 29, 2022

It's a good idea to use UTF-8 string directly as the input, then advance the input position char by char. An efficient implementation is something like core::str::next_code_point, we can modify the code to return both the decoded char, and the number of bytes of this char.

I am currently using use bstr::decode_utf8; for this purpose and it seems to work very well, it returns both size and char and it works on slices (next_code_point requires an iterator). It is also safe (although probably the bstr-internal implementation makes use of unsafe). It did mean bringing in bstr.

I think even if we take a utf-8 string directly as the input, it is adequate to use bytes internally (IE take utf-8 string as input and call as_bytes immediately). Although if we operated on &str throughout it might allow removing some or all of the unsafes if that matters.

In my research it appears the number of bytes in a char is predictable because Rust rejects overlong-encoded UTF-8 characters as invalid. But I still feel safer that bstr::decode_utf8 returns a byte count.

@J-F-Liu
Copy link
Owner

J-F-Liu commented Dec 29, 2022

Well, bstr::decode_utf8 already does this.
It's ok to define the type of utf8::Parser as Parser<'a, O>.
The type of any should be pub fn any<'a>() -> Parser<'a, char>.

@mcclure
Copy link
Contributor Author

mcclure commented Dec 29, 2022

Well, bstr::decode_utf8 already does this. It's ok to define the type of utf8::Parser as Parser<'a, O>. The type of any should be pub fn any<'a>() -> Parser<'a, char>.

Do you have an opinion about the return type of sym()? Also char then?


By the way, here is something I am still confused about. Let's say I run

any().repeat(1..).collect()
or
any().repeat(1..).discard()

Say it matches 864 characters. In either case, will the repeat() wind up creating a vec of 864 chars and then return them, only for them to immediately be thrown away?
Is this a potential performance issue?
Or will the compiler notice the result is thrown away on the .collect() chain and eliminate that code?

@J-F-Liu
Copy link
Owner

J-F-Liu commented Dec 29, 2022

Yes, sym() also return char.
I'm not sure about compiler optimization, take(n) or skip(n) maybe better in this case.

@mcclure
Copy link
Contributor Author

mcclure commented Dec 29, 2022

Thanks. I will update when I have a fuller implementation.

I will not worry about the compiler optimization question further for now because this problem is also present in the parser::Parser version anyhow.

@mcclure
Copy link
Contributor Author

mcclure commented Jan 2, 2023

Hm, "parser::tag" is not documented in https://crates.io/crates/pom and I'm a little unclear what its function is... Am I correct it matches only on inputs which are slices of char arrays, IE, I=char?

I think there is no relevant way to implement this function in the utf8 module because it's a special function for a special case the utf8 function will not hit, and I should just skip it. Is this correct?

@mcclure
Copy link
Contributor Author

mcclure commented Jan 2, 2023

I am also a little bit confused by the "shr" operation. The doc on crates.io says "Parse p and get result P, then parse q and return result of q(P).", which implies to me that parsers p and q both parse as-is, but from reading the code it looks like q is a function that returns a parser, which I guess then runs. Which of these is correct?

(If that second thing is how it works (the parser is result_of_p(q), not q) that's very useful because it makes it possible to do things like have "p" return a number of bytes and that get passed to take().)

@mcclure mcclure changed the title "utf8" module supporting matching UTF-8/returning &str [WIP] "utf8" module supporting matching UTF-8/returning &str Jan 2, 2023
}

/// Read n chars.
pub fn take<'a>(n: usize) -> Parser<'a, &'a str> {
Copy link
Contributor Author

@mcclure mcclure Jan 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice: read n chars not n bytes, this is a difference from the base parser behavior/interface

@mcclure
Copy link
Contributor Author

mcclure commented Jan 3, 2023

The PR now has feature parity with parser::Parser, the only thing I think holding this back from a potential merge is writing some test cases (I have not tested it other than the utf8 and utf8_mixed examples, and a quick test with shr).

Other than that, I think the following are good ideas, but I would suggest leaving them to a followup PR:

  • A take_bytes / skip_bytes function EDIT: I just did this one
  • As mentioned above, glyphs graphemes and normalized forms
  • As mentioned above, utf8::char_class
  • As mentioned above, a "super_safe" feature that eliminates use of unsafe
  • pos() gives a byte offset; a pos_char() giving a character position would be very interesting (this might be a little bit difficult)

@J-F-Liu J-F-Liu merged commit 7efec98 into J-F-Liu:master Jan 3, 2023
@J-F-Liu
Copy link
Owner

J-F-Liu commented Jan 3, 2023

There is a usage of shr in https://github.com/J-F-Liu/lopdf/blob/master/src/parser.rs#L131

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants