"utf8" module supporting matching UTF-8/returning &str #59

mcclure · 2022-12-28T05:20:17Z

I find "pom" very enjoyable to use, but I keep running into friction when converting inputs and match strings to and from UTF-8 &str (see #53). I think adding explicit UTF-8 support to pom would bring important advantages:

UX improvements when working with Rust strings and chars
Match primitives that guarantee at each step a valid UTF-8 string is being matched, for example an any() that matches UTF-8 chars only (yes, I know I can .convert(str::from_utf8) and it will correctly reject invalid UTF-8, but that bails out relatively late)

This is a draft/first attempt at a utf8 module. (The regular parser is unchanged; utf8 is opt-in.) You can see what using it is like in the example examples/utf8.rs, but it's much like normal pom. (.parse() still requires the input to be converted with .as_bytes(), but seq() accepts normal Rust strings.) The basic approach is:

pom::utf8 (brought in with use pom::utf8::*) contains functions that have the same names and usage as the ones in pom::parser (so it is mostly a drop-in replacement), but any return values or arguments that are &[I] in parser::Parser are &str in utf8::Parser.
pom::utf8::Parser<'a, O> is implemented as a thin wrapper around pom::parser::Parser<'a, u8, O>. It is a separate type because, by keeping track of which patterns are pure UTF-8, collect() over a tree of utf8::Parsers can return a &str safely. But because at its core it's still just a parser::Parser<_, u8, _>, it can be combined into a single pattern with a non-UTF-8 parser::Parser (at the cost of no longer being able to do a collect() without re-verifying UTF-8).
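
To make the newtype idea concrete, here is a minimal std-only sketch. It does not use pom's real types: ByteParser, Utf8Parser and their methods are simplified stand-ins invented for illustration, and collect() here re-checks UTF-8 rather than using the unchecked conversion the PR describes.

```rust
/// Toy byte-level parser (a stand-in for parser::Parser<'a, u8, O>, not pom's type):
/// given the input and a start offset, return the output and the end offset.
pub struct ByteParser<'a, O>(Box<dyn Fn(&'a [u8], usize) -> Option<(O, usize)> + 'a>);

/// Newtype marking a parser known to consume only whole UTF-8 characters.
pub struct Utf8Parser<'a, O>(ByteParser<'a, O>);

/// Match a literal string. The matched span equals `tag`, so it is valid UTF-8.
pub fn seq<'a>(tag: &'a str) -> Utf8Parser<'a, &'a str> {
    Utf8Parser(ByteParser(Box::new(move |input: &'a [u8], start: usize| {
        let end = start.checked_add(tag.len())?;
        if input.get(start..end)? == tag.as_bytes() {
            Some((tag, end))
        } else {
            None
        }
    })))
}

impl<'a, O: 'a> Utf8Parser<'a, O> {
    /// Drop the inner output and return the matched span as a &str.
    pub fn collect(self) -> Utf8Parser<'a, &'a str> {
        let Utf8Parser(ByteParser(run)) = self;
        Utf8Parser(ByteParser(Box::new(move |input: &'a [u8], start: usize| {
            let (_, end) = run(input, start)?;
            // Checked conversion keeps this sketch free of unsafe; the newtype is
            // what would justify skipping the check in an unchecked variant.
            let span = std::str::from_utf8(input.get(start..end)?).ok()?;
            Some((span, end))
        })))
    }

    /// Convenience entry point: parse a &str starting at its first byte.
    pub fn parse_str(&self, input: &'a str) -> Option<O> {
        let Utf8Parser(ByteParser(run)) = self;
        run(input.as_bytes(), 0).map(|(out, _)| out)
    }
}

/// Degrading to the byte parser is free, but the UTF-8 guarantee is forgotten.
impl<'a, O> From<Utf8Parser<'a, O>> for ByteParser<'a, O> {
    fn from(p: Utf8Parser<'a, O>) -> Self {
        p.0
    }
}
```

Under these assumptions, seq("héllo").collect().parse_str("héllo world") yields the matched &str, while converting with into() gives back a plain byte parser that can be mixed with non-UTF-8 patterns.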

This prototype has just enough functions to implement the examples/utf8.rs example. It implements UTF-8 aware seq() and any() combinators, has the UTF-8 aware collect and convert, you can turn a utf8 Parser into parser::Parser with from/into, and it so far has methods passing discard, map, parse, repeat, | and * on to the underlying parser::Parser implementation.

Next steps are:

Implement rest of parser:: functions/methods (I may do this with a macro? I think I would have to write the macro myself. There are some delegation macro crates but none of them seem exactly fit to this situation.)
sym needs to be special because this is the one function I intend to use a slightly different interface from parser::Parser:
- pub fn sym<'a>(tag: char) -> Parser<'a, &'a str> will return a single-char string
- pub fn sym_char<'a>(tag: char) -> Parser<'a, char> will return a parsed char, to make constructions like sym_char(ch).is_a(str::is_alphabetic) possible
The utf8 module uses unsafe {} because it calls str::from_utf8_unchecked on slices it has already confirmed contain complete UTF-8 characters. I would like to introduce a Cargo "feature" to remove use of unsafe from utf8, at the cost of a redundant str::from_utf8 check in places.
May create a utf8::Parser.parse_str(input: &str) that just calls parse(input.as_bytes()), for convenience (?)
Versions of +, - etc that take one parser::Parser and one utf8::Parser and return a parser::Parser, to make it easy to mix them; also I want to create an examples/utf8_mixed.rs demonstrating using parser::Parser and utf8::Parser in the same pattern (EG a simple MsgPack parser or something).
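
The unsafe trade-off mentioned in the list above can be sketched with the standard library alone. The two function names below are hypothetical; the idea is that a Cargo feature would pick one or the other, with the checked version paying for a second validation pass.

```rust
/// Checked conversion: safe, but re-validates bytes the parser already verified.
fn span_to_str_checked(span: &[u8]) -> Option<&str> {
    std::str::from_utf8(span).ok()
}

/// Unchecked conversion: skips the redundant validation.
///
/// # Safety
/// `span` must be valid UTF-8, i.e. it must start and end on character
/// boundaries of text that has already been verified.
unsafe fn span_to_str_unchecked(span: &[u8]) -> &str {
    // SAFETY: upheld by the caller, as documented above.
    unsafe { std::str::from_utf8_unchecked(span) }
}
```

Both return the same string for a verified span; only the checked version can report a span that was cut in the middle of a character.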

Long term additions I'd be interested in attempting are:

Possibly a Unicode version of pom::char_class?
Possibly glyph support, or support for normalization forms (this would make possible things like seq_case_insensitive which would be very useful to me)

What I need to know from @J-F-Liu:

Are you interested in this PR, in theory? If you do not think this belongs in pom, I will probably publish it as a second supplementary crate.
Should utf-8 support be behind a "feature" so it can be disabled? It does introduce complexities such as external crate support (it uses bstr) and unsafe.
The type of utf8::Parser is Parser<'a, O>. This makes sense because by definition it can only ever work on u8, but means mixing fns that define utf8::Parsers and parser::Parsers in the same file would be a little confusing because some functions would have 2 generic arguments and some would have 3. Would it make sense to put the I type argument back in with a where I=u8, and require the user to type the u8 generic argument every time? (My vote is no, it's fine the way it is now, but I wanted to ask.)

Thank you for this neat library! I have used it a lot this month.

J-F-Liu · 2022-12-29T03:58:29Z

It's a good idea to use UTF-8 string directly as the input, then advance the input position char by char.
An efficient implementation is something like core::str::next_code_point, we can modify the code to return both the decoded char, and the number of bytes of this char.

mcclure · 2022-12-29T04:07:21Z

> It's a good idea to use UTF-8 string directly as the input, then advance the input position char by char. An efficient implementation is something like core::str::next_code_point, we can modify the code to return both the decoded char, and the number of bytes of this char.

I am currently using use bstr::decode_utf8; for this purpose and it seems to work very well, it returns both size and char and it works on slices (next_code_point requires an iterator). It is also safe (although probably the bstr-internal implementation makes use of unsafe). It did mean bringing in bstr.

I think even if we take a utf-8 string directly as the input, it is adequate to use bytes internally (IE take utf-8 string as input and call as_bytes immediately). Although if we operated on &str throughout it might allow removing some or all of the unsafes if that matters.

In my research it appears the number of bytes in a char is predictable because Rust rejects overlong-encoded UTF-8 characters as invalid. But I still feel safer that bstr::decode_utf8 returns a byte count.
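
For readers without bstr at hand, the "decode one char and report its size" step can be approximated with the standard library. This is a sketch, not bstr's API: decode_first_char is a hypothetical helper, and it returns None on invalid input instead of reporting how many bytes to skip.

```rust
/// Decode the first UTF-8 character of `bytes`, returning it with its byte length.
/// Returns None for empty input or when the input does not start with valid UTF-8.
fn decode_first_char(bytes: &[u8]) -> Option<(char, usize)> {
    // A UTF-8 character is at most 4 bytes, so only that prefix needs validating.
    let prefix = &bytes[..bytes.len().min(4)];
    let valid = match std::str::from_utf8(prefix) {
        Ok(s) => s,
        // The prefix may end mid-character; keep the part that did validate.
        Err(e) => std::str::from_utf8(&prefix[..e.valid_up_to()]).ok()?,
    };
    let ch = valid.chars().next()?;
    // Overlong encodings are rejected above, so the length follows from the char.
    Some((ch, ch.len_utf8()))
}
```

Because overlong encodings never validate, ch.len_utf8() always agrees with the number of bytes actually consumed, which is the predictability described above.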

J-F-Liu · 2022-12-29T04:19:26Z

Well, bstr::decode_utf8 already does this.
It's ok to define the type of utf8::Parser as Parser<'a, O>.
The type of any should be pub fn any<'a>() -> Parser<'a, char>.

mcclure · 2022-12-29T04:34:46Z

> Well, bstr::decode_utf8 already does this. It's ok to define the type of utf8::Parser as Parser<'a, O>. The type of any should be pub fn any<'a>() -> Parser<'a, char>.

Do you have an opinion about the return type of sym()? Should it also be char, then?

By the way, here is something I am still confused about. Let's say I run

any().repeat(1..).collect()
or
any().repeat(1..).discard()

Say it matches 864 characters. In either case, will the repeat() wind up creating a vec of 864 chars and then return them, only for them to immediately be thrown away?
Is this a potential performance issue?
Or will the compiler notice the result is thrown away on the .collect() chain and eliminate that code?

J-F-Liu · 2022-12-29T05:50:42Z

Yes, sym() also returns char.
I'm not sure about compiler optimization; take(n) or skip(n) may be better in this case.

mcclure · 2022-12-29T05:52:47Z

Thanks. I will update when I have a fuller implementation.

I will not worry about the compiler optimization question further for now because this problem is also present in the parser::Parser version anyhow.

mcclure · 2023-01-02T04:02:33Z

Hm, "parser::tag" is not documented in https://crates.io/crates/pom and I'm a little unclear what its function is... Am I correct it matches only on inputs which are slices of char arrays, IE, I=char?

I think there is no relevant way to implement this function in the utf8 module because it's a special function for a special case the utf8 function will not hit, and I should just skip it. Is this correct?

mcclure · 2023-01-02T22:23:50Z

I am also a little bit confused by the "shr" operation. The doc on crates.io says "Parse p and get result P, then parse q and return result of q(P).", which implies to me that parsers p and q both parse as-is, but from reading the code it looks like q is a function that returns a parser, which I guess then runs. Which of these is correct?

(If the second reading is how it works, i.e. the parser that runs is q(result_of_p) rather than q itself, that's very useful because it makes it possible to do things like have "p" return a number of bytes and have that passed to take().)
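
The second reading of shr can be illustrated with a toy combinator. This is a sketch, not pom's implementation: Parser here is a simplified type alias, and then_with is a hypothetical name standing in for the >> operator.

```rust
/// Toy parser: given the input and a start offset, return (output, next offset).
type Parser<'a, O> = Box<dyn Fn(&'a [u8], usize) -> Option<(O, usize)> + 'a>;

/// Read a single byte.
fn byte<'a>() -> Parser<'a, u8> {
    Box::new(|input: &'a [u8], pos: usize| input.get(pos).map(|&b| (b, pos + 1)))
}

/// Read exactly n bytes.
fn take<'a>(n: usize) -> Parser<'a, &'a [u8]> {
    Box::new(move |input: &'a [u8], pos: usize| {
        let end = pos.checked_add(n)?;
        input.get(pos..end).map(|span| (span, end))
    })
}

/// Run p, feed its output to f to build a second parser, then run that parser
/// from where p stopped.
fn then_with<'a, O: 'a, U: 'a>(
    p: Parser<'a, O>,
    f: impl Fn(O) -> Parser<'a, U> + 'a,
) -> Parser<'a, U> {
    Box::new(move |input: &'a [u8], pos: usize| {
        let (out, next) = p(input, pos)?;
        f(out)(input, next)
    })
}

/// A length-prefixed field: one length byte, then that many bytes of payload.
fn length_prefixed<'a>() -> Parser<'a, &'a [u8]> {
    then_with(byte(), |n| take(n as usize))
}
```

Here the second parser does not exist until the first one has produced its result, which is what makes the length-prefix pattern possible.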

mcclure · 2023-01-02T23:57:50Z

src/utf8.rs

+}
+
+/// Read n chars.
+pub fn take<'a>(n: usize) -> Parser<'a, &'a str> {


Note: this reads n chars, not n bytes, which differs from the base parser's behavior/interface.
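
The chars-versus-bytes distinction can be shown with a std-only sketch. take_chars is a hypothetical free function, not the PR's take(); it splits a &str after n characters without collecting them into a Vec.

```rust
/// Split `input` after its first n characters (not bytes).
/// Returns None if the input holds fewer than n characters.
fn take_chars(input: &str, n: usize) -> Option<(&str, &str)> {
    let end = input
        .char_indices()
        .map(|(offset, _)| offset)
        // The end of the string is also a valid split point.
        .chain(std::iter::once(input.len()))
        .nth(n)?;
    Some(input.split_at(end))
}
```

For "héllo", taking 2 chars consumes 3 bytes, because 'é' occupies two bytes in UTF-8.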

mcclure · 2023-01-03T00:04:51Z

The PR now has feature parity with parser::Parser. The only thing I think is holding this back from a potential merge is writing some test cases (I have not tested it other than the utf8 and utf8_mixed examples, and a quick test with shr).

Other than that, I think the following are good ideas, but I would suggest leaving them to a followup PR:

~~A take_bytes / skip_bytes function~~ EDIT: I just did this one
As mentioned above, ~~glyphs~~ graphemes and normalized forms
As mentioned above, utf8::char_class
As mentioned above, a "super_safe" feature that eliminates use of unsafe
pos() gives a byte offset; a pos_char() giving a character position would be very interesting (this might be a little bit difficult)
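
One simple (if linear-time) way to derive a character position from a byte offset is to count the characters in the prefix. char_pos below is a hypothetical helper for illustration, not something the PR implements.

```rust
/// Convert a byte offset into a character position by counting the chars before it.
/// Returns None if the offset is out of range or falls inside a character.
fn char_pos(input: &str, byte_pos: usize) -> Option<usize> {
    input.get(..byte_pos).map(|prefix| prefix.chars().count())
}
```

The difficulty hinted at above is cost: this rescans the prefix on every call, so a real pos_char() would likely need to track the count incrementally.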

J-F-Liu · 2023-01-03T12:58:28Z

There is a usage of shr in https://github.com/J-F-Liu/lopdf/blob/master/src/parser.rs#L131

mcclure added 2 commits December 27, 2022 23:10

Create a utf8::Parser newtype so that collect() can safely return a str

32bc469

Replace utf8 parser.as_bytes() with parser.into<parser::Parser>()

f364003

mcclure added 2 commits December 29, 2022 22:10

Example program mixing u8 parser and UTF-8 parser

b6420f6

utf8 sym tag / fix any tag return type

0654ef7

mcclure added 4 commits January 2, 2023 13:53

utf8 one_of/none_of

268e29c

is_a/not_a

241eeca

Remaining parser constructors for utf8

0e9df5a

More operator overloads: And, Sub, degrade versions for BitOr and Mul

fd58c71

mcclure added 2 commits January 2, 2023 18:24

Doc comments on utf8 macro expansions, utf8 Shr

b8088ba

Doc comments on utf8 macro expansions, rest of parser functions in utf8

7311346

mcclure changed the title ~~"utf8" module supporting matching UTF-8/returning &str [WIP]~~ "utf8" module supporting matching UTF-8/returning &str Jan 2, 2023

mcclure added 2 commits January 2, 2023 18:49

parse_str convenience calls .as_bytes for you

1a70751

Feature for utf8, enabled by default

d15ff7c

mcclure commented Jan 2, 2023

utf::take_bytes() and utf::skip_bytes()

2ce65a2

J-F-Liu approved these changes Jan 3, 2023

J-F-Liu merged commit 7efec98 into J-F-Liu:master Jan 3, 2023
