Conversation
Sure, I'll try to review in the next few days (when time allows) |
I assume this is an exhaustive list of (tests demonstrating) the various issues with the previous parsing algorithm? Are there any non-covered cases that you're aware of (ignoring the attribute value normalization work that I'll eventually get back to)
|
Yes, it is. If you checkout this commit you will see what exactly errors present. Particularly, in some cases The attributes parser itself is very rough and do not follow specification. It doesn't check for illegal symbols in a key and unquoted value, entities does not expanded (but that is fine at that level -- you anyway needs an entity manager to get a list of defined entities). For now I left that as "by design" -- correctness of each attribute should be checked after getting it. But I have plans to try to replace manually written parser with one generated by grammar using rust-peg with more robust parsing. If you want to work on attribute normalization, some my minds: first, you need to introduce an entity manager. That manager should collect entities defined in a document and allow user to define external entities. Collecting entities requires, in turn, DTD parser. There already exist crates for that, I did not look deeply. May be we can use some of them. Then normalization would require reference to the entity manager. No need to store it inside an attribute, it is better to create a dedicated type for a well-formed attribute that would created by normalization. In fact, that PR already did that differentiation: Introducing entity manager and doing attribute normalization could be done in different PRs, because entity manager should be used everywhere |
7d4131b
to
e090006
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, new algorithm around 5 us (~10%) slower than previous, which, I believe, because of recovery checks. Not a significance difference
None => return attr!(self.position..len), | ||
pub fn next(&mut self, slice: &[u8]) -> Option<AttrResult> { | ||
let mut iter = match self.recover(slice) { | ||
Some(offset) => (offset..).zip(slice[offset..].iter()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using this instead of
slice.iter().enumerate().skip(offset)
really improves performance around by 10 us (~15%), according to Criterion, so applied it
} | ||
}; | ||
|
||
match iter.find(|(_, &b)| b == quote) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also tried to replace this by
match memchr::memchr(quote, &slice[start_value..]) {
...
}
but that does not give a significance difference
src/events/attributes.rs
Outdated
/// That error can be raised only if iterator in XML mode. | ||
UnquotedValue(usize), | ||
/// Attribute value not finished with a matching quote, position relative to | ||
/// start of owning tag and a quote is provided. That position is always a last |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand this to mean that "the position will be after a quote character" but to say that the position is relative to the "start of the owning tag and a quote" is a little bit confusing. It sounds like it could mean either of
A) the position will be on a quote character, which isn't true
B) the position is relative to both, which isn't possible
I think "and a quote" could be removed and it would still make sense (given the examples).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to rephrase it and I'll add it to the other your suggestions.
@Mingun I'm going through commit-by-commit so this isn't a final review yet. I'll continue in a few hours. The code looks great so far, I only have a few minor grammatical / wording nitpicks. |
Ok. Yeah, grammar should be checked because I'm not a native speaker :) I tried, but native speaker always can write it better |
@dralley, tell me when you finish with review. I'll plan to make a 0.23 release after merging this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Mingun Here's another round of review, I have some questions and some code suggestions
@@ -41,7 +42,7 @@ where | |||
/// do not store reference to `Attributes` itself but instead create | |||
/// a new object on each advance of `Attributes` iterator, so we need | |||
/// to restore last position before advance. | |||
position: usize, | |||
iter: IterState, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some other questions about this (below) but the docstring needs to be updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow forgot to save changes in the editor, so comment is changed outside of this PR in f9453b9
/// if `with_checks`, contains the ranges corresponding to the | ||
/// attribute names already parsed in this `Element` | ||
consumed: Vec<Range<usize>>, | ||
/// Iterator state, independent from the actual source of bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale behind the change?
It leads to some weird interfaces below - e.g. all the methods of IterState
take an external slice, and the caller could theoretically call iter.next(...)
or other methods with two entirely different slices. The slice has to get passed around more often, the methods here are now delegating directly to self.state
, etc.
Intuitively leaving everything in the Attributes
struct as it was makes more sense to me, but I might be missing some details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is required for another iterator implementation to share the code. The problem is described in tafia/quick-xml#337 (comment): we have a borrowing iterator, but doesn't have an owned iterator. For example, for MapAccess
we need an owned iterator and in this PR MapAccess
de facto implements it. In the future we should expose that iterator as public API
In any case IterState
is supposed to stay an internal detail and will never be exposed in public API. All internal usages will guaranties that they will pass correct slices
} | ||
Some((i, _)) => err!(AttrError::UnquotedValue(i)), | ||
None => return attr!(start_key..end_key), | ||
match self.state.next(self.bytes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels awkward because it's not the signature you'd expect from a next()
method (even though this isn't trying to implement Iterator::next()
, it's still the same thing conceptually). Usually iterators know what they're iterating over.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is required for lifetime correctness. Owned iterator will store a Cow<'input, [u8]>
, borrowing iterator stores a &'input [u8]
. If you will try to move slice
argument of next
into IterState
and store in a Cow<'input, [u8]>
, even when borrowing iterator always will pass a Cow::Borrowed
, borrow checker will not allow you to return slices with the 'input
lifetime, because Cow
potentially can be owned.
You also cannot store a &'input [u8]
in the IterState
, because owned iterator is supposed to be
struct OwnedAttributes<'input> {
slice: Cow<'input, [u8]>,
state: IterState,
}
and you cannot store in that struct reference to a field of this struct (state
would be referenced to a slice
)
src/events/attributes.rs
Outdated
/// [events]: crate::events::Event::Start | ||
Unquoted(T, T), | ||
/// Attribute without value. Attribute key provided. This is HTML-style attribute, | ||
/// it can be returned only in HTML-mode parsing only. In XML mode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"only in HTML-mode parsing only" => I think you meant to remove one of the "only"s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch
impl<T> Attr<T> { | ||
/// Maps an `Attr<T>` to `Attr<U>` by applying a function to a contained key and value. | ||
#[inline] | ||
pub fn map<U, F>(self, mut f: F) -> Attr<U> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Regarding the whole implementation of Attr
, not this specific method)
I see where you're coming from with all of this, but there are some downsides to doing it this way.
- it would add a lot more branches to the generated code, because every access has to go through
match
- also general code bloat when it gets inlined everywhere that a dev uses
.key()
,.value()
, etc. - it feels like a detail few users would care too much about, even ones parsing HTML
As far as I can tell, HTML doesn't have any attributes that are required to not have a value - attributes with no value, and attributes with an empty-string value are supposed to be treated identically.
A number of attributes are boolean attributes. The presence of a boolean attribute on an element represents the true value, and the absence of the attribute represents the false value.
If the attribute is present, its value must either be the empty string or a value that is an ASCII case-insensitive match for the attribute's canonical name, with no leading or trailing whitespace.
https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#boolean-attributes
Just the attribute name. The value is implicitly the empty string.
https://html.spec.whatwg.org/multipage/syntax.html#syntax-attr-empty
So given that the spec says that all of these syntaxes are equivalent ways to express the attribute, I'm not sure that all of this is worth it, compared to just having a struct with key
+ value
fields. I wouldn't have a problem with a third field to denote unquoted, empty, single-quote, double-quote though. It's just not important enough to treat like Option
or Result
and force dealing with specific variants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That benchmarks shows only small 5 us regression (from ~75 to ~80 with checks, from ~44 to ~49 without checks, from ~155 to ~159 for try_get_attribute
) in parsing which, I think, not a big pay for an additional info, that could be useful:
Lines 256 to 333 in 6a3fd90
/// Benchmarks parsing attributes from events | |
fn attributes(c: &mut Criterion) { | |
let mut group = c.benchmark_group("attributes"); | |
group.bench_function("with_checks = true", |b| { | |
b.iter(|| { | |
let mut r = Reader::from_reader(PLAYERS); | |
r.check_end_names(false).check_comments(false); | |
let mut count = criterion::black_box(0); | |
let mut buf = Vec::new(); | |
loop { | |
match r.read_event(&mut buf) { | |
Ok(Event::Empty(e)) => { | |
for attr in e.attributes() { | |
let _attr = attr.unwrap(); | |
count += 1 | |
} | |
} | |
Ok(Event::Eof) => break, | |
_ => (), | |
} | |
buf.clear(); | |
} | |
assert_eq!(count, 1041); | |
}) | |
}); | |
group.bench_function("with_checks = false", |b| { | |
b.iter(|| { | |
let mut r = Reader::from_reader(PLAYERS); | |
r.check_end_names(false).check_comments(false); | |
let mut count = criterion::black_box(0); | |
let mut buf = Vec::new(); | |
loop { | |
match r.read_event(&mut buf) { | |
Ok(Event::Empty(e)) => { | |
for attr in e.attributes().with_checks(false) { | |
let _attr = attr.unwrap(); | |
count += 1 | |
} | |
} | |
Ok(Event::Eof) => break, | |
_ => (), | |
} | |
buf.clear(); | |
} | |
assert_eq!(count, 1041); | |
}) | |
}); | |
group.bench_function("try_get_attribute", |b| { | |
b.iter(|| { | |
let mut r = Reader::from_reader(PLAYERS); | |
r.check_end_names(false).check_comments(false); | |
let mut count = criterion::black_box(0); | |
let mut buf = Vec::new(); | |
loop { | |
match r.read_event(&mut buf) { | |
Ok(Event::Empty(e)) if e.name() == b"player" => { | |
for name in ["num", "status", "avg"] { | |
if let Some(_attr) = e.try_get_attribute(name).unwrap() { | |
count += 1 | |
} | |
} | |
assert!(e | |
.try_get_attribute("attribute-that-doesn't-exist") | |
.unwrap() | |
.is_none()); | |
} | |
Ok(Event::Eof) => break, | |
_ => (), | |
} | |
buf.clear(); | |
} | |
assert_eq!(count, 150); | |
}) | |
}); | |
group.finish(); | |
} |
Besides, I'll also have a next PR with Span
support and I want to expose the information from where that or another piece of data come. That is dependent of the attribute shape and useful for reporting errors.
bq. I wouldn't have a problem with a third field to denote unquoted, empty, single-quote, double-quote though. It's just not important enough to treat like Option
or Result
and force dealing with specific variants.
That wouldn't be idiomatic Rust. Usually you will use .key()
and .value()
methods, so you hardly likely will deal with the concrete variant. Moreover, I've introduced a new Attr
type instead of modifying Attribute
to allow users to not deal with attribute shape when it do not matter. There will only one conversion from Attr
to Attribute
that just moves shape erasing from internals of iterator to a user-defined place. That should not bloat code well and likely will be optimized by the compiler if you did that just after getting the attribute from the iterator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That wouldn't be idiomatic Rust.
There are cases where it actually is idiomatic, or at least a common pattern. std::io::ErrorKind is used because all variants have the same signature (like is the case here). nom::error::ErrorKind is the same.
The problem with going purely off the benchmarks here is that depending on the consistency of the input, the branch predictor is going to get it right most of the time (and if it's a small enough input, the the branch predictor is going to learn it over the course of many iterations even if it's not consistent), so it's not going to look that expensive.
But it's very dependent on the input, and how much the extra bloat hurts depends a lot on what code the user is writing and how much cache their CPU has. It's just a lot of extra stuff to put in such a hot path.
Maybe it could look like this?
struct Attr {
key: ...,
value: ...,
shape: AttrShape,
}
enum AttrShape {
Unqoted(..span stuff..),
DoubleQuoted(..span stuff..),
SingleQuoted(..span stuff..),
Empty(..span stuff..),
}
And then the user could avoid the cost completely if they don't care about it, and the Span work still has an obvious place to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, that what you suggesting, is implemented now:
- your
AttrShape
isAttr<Range<usize>>
-- that is the type that returned byIterState::next()
- your
Attr
is the futureAttribute
Because we in any case need to get spans for key
and value
, and because value
can be missed, and because I want to keep only one parser implementation and not copy it for different types, the current IterState
design seems to me the almost only possible (the other way is return (Range<usize>, Option<Range<usize>>)
instead but I don't think that it gives any benefits while reducing code clarity).
I see that your point in possible code bloat and/or inconveniences to use match
constantly, but that is impossible to trigger from the user code now, because the only way to create an Attr
instance in the user code is to create it by yourself. IterState
is a private type. The only place where conversion from Attr
to Attribute
is performed now is inside Attributes::next()
:
fast-xml/src/events/attributes.rs
Lines 335 to 346 in eaebf45
impl<'a> Iterator for Attributes<'a> { | |
type Item = Result<Attribute<'a>, AttrError>; | |
#[inline] | |
fn next(&mut self) -> Option<Self::Item> { | |
match self.state.next(self.bytes) { | |
None => None, | |
Some(Ok(a)) => Some(Ok(a.map(|range| &self.bytes[range]).into())), | |
Some(Err(e)) => Some(Err(e)), | |
} | |
} | |
} |
I still believe that compiler is able to optimize all right there. Maybe that place seems too generic, because I hide conversion under map() + Into<Attribute>
, but it is fairly straight-ford
Attr::Unquoted(key, value) => (key, Some(value)), | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically all this code would be deleted up to my previous comment. key()
and value()
methods would stay but it would just be a simple accessor.
@Mingun Barring the two docstrings that need to be fixed, let's call this approved - I'd like to experiment for myself to either better understand the problems you've discussed and maybe gather some concrete data about the costs or lack thereof of the extra branching - but since that will take some time I won't block the PR on it. The tests pass at least. |
Also fix incorrect references to `Attributes` instead of `Attribute`
.find(|&(_, &b)| b == b'=' || is_whitespace(b)) cannot return Some((_, b'"')) or Some((_, b'\'')) because `"` and `'` are not whitespaces
Because iterator over attributes is recovering, user can want to inspect possible errors and filtering out irrelevant errors should help him Co-authored-by: Daniel Alley <dalley@redhat.com>
Because now error type is PartialEq, we can use `assert_eq!` directly, which is especially nice when used with pretty_assertions crate
failures (24): events::attributes::html::duplicated::with_check::double_quoted events::attributes::html::duplicated::with_check::key_only events::attributes::html::duplicated::with_check::single_quoted events::attributes::html::duplicated::with_check::unquoted events::attributes::html::single::missed_value events::attributes::html::sparsed::double_quoted events::attributes::html::sparsed::key_contains_invalid events::attributes::html::sparsed::key_only events::attributes::html::sparsed::key_start_invalid events::attributes::html::sparsed::missed_value events::attributes::html::sparsed::single_quoted events::attributes::html::sparsed::unquoted events::attributes::xml::duplicated::with_check::double_quoted events::attributes::xml::duplicated::with_check::key_only events::attributes::xml::duplicated::with_check::single_quoted events::attributes::xml::duplicated::with_check::unquoted events::attributes::xml::duplicated::without_check::key_only events::attributes::xml::duplicated::without_check::unquoted events::attributes::xml::first::key_only events::attributes::xml::first::missed_value events::attributes::xml::first::unquoted events::attributes::xml::single::key_only events::attributes::xml::single::missed_value events::attributes::xml::sparsed::missed_value
Introduce new `Attr` type that stores not only an attribute content, but also its shape
@dralley, thanks for your work on reviewing this PR! |
This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [quick-xml](https://github.com/tafia/quick-xml) | dependencies | patch | `0.23.0-alpha3` -> `0.23.0` | | [quick-xml](https://github.com/tafia/quick-xml) | dependencies | minor | `0.22.0` -> `0.23.0` | --- ### Release Notes <details> <summary>tafia/quick-xml</summary> ### [`v0.23.0`](https://github.com/tafia/quick-xml/blob/HEAD/Changelog.md#​0230----2022-05-08) - feat: add support for `i128` / `u128` in attributes or text/CDATA content - test: add tests for malformed inputs for serde deserializer - fix: allow to deserialize `unit`s from any data in attribute values and text nodes - refactor: unify errors when EOF encountered during serde deserialization - test: ensure that after deserializing all XML was consumed - feat: add `Deserializer::from_str`, `Deserializer::from_slice` and `Deserializer::from_reader` - refactor: deprecate `from_bytes` and `Deserializer::from_borrowing_reader` because they are fully equivalent to `from_slice` and `Deserializer::new` - refactor: reduce number of unnecessary copies when deserialize numbers/booleans/identifiers from the attribute and element names and attribute values - fix: allow to deserialize `unit`s from text and CDATA content. `DeError::InvalidUnit` variant is removed, because after fix it is no longer used - fix: `ElementWriter`, introduced in [#​274](tafia/quick-xml#274) (0.23.0-alpha2) now available to end users - fix: allow lowercase `<!doctype >` definition (used in HTML 5) when parse document from `&[u8]` - test: add tests for consistence behavior of buffered and borrowed readers - fix: produce consistent error positions in buffered and borrowed readers - feat: `Error::UnexpectedBang` now provide the byte found - refactor: unify code for buffered and borrowed readers - fix: fix internal panic message when parse malformed XML ([#​344](tafia/quick-xml#344)) - test: add tests for trivial documents (empty / only comment / `<root>...</root>` -- one tag with content) - fix: CDATA was not handled in many cases where it should - fix: do not unescape CDATA content because it never escaped by design. CDATA event data now represented by its own `BytesCData` type ([quick-xml#​311](tafia/quick-xml#311)) - feat: add `Reader::get_ref()` and `Reader::get_mut()`, rename `Reader::into_underlying_reader()` to `Reader::into_inner()` - refactor: now `Attributes::next()` returns a new type `AttrError` when attribute parsing failed ([#​4](Mingun/fast-xml#4)) - test: properly test all paths of attributes parsing ([#​4](Mingun/fast-xml#4)) - feat: attribute iterator now implements `FusedIterator` ([#​4](Mingun/fast-xml#4)) - fix: fixed many errors in attribute parsing using iterator, returned from `attributes()` or `html_attributes()` ([#​4](Mingun/fast-xml#4)) </details> --- ### Configuration📅 **Schedule**: At any time (no schedule defined).🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.🔕 **Ignore**: Close this PR and you won't be reminded about these updates again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, click this checkbox. --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). Co-authored-by: cabr2-bot <cabr2.help@gmail.com> Co-authored-by: crapStone <crapstone01@gmail.com> Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1377 Reviewed-by: crapStone <crapstone@noreply.codeberg.org> Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org> Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
This PR adds tests for attribute parsing and fixes bugs found. It introduces new
AttrError
type that now returned byAttributes::next()
and contains all possible errors when attributes is parsed.Also it introduces a new type
Attr
which is by design is borrowed-only. In version 0.24 I plan to changeAttributes
iterator and return that new type.@dralley, it would be great to run performance tests and see the results and maybe make optimizations like that you've done in #3