Tokenizer corner cases #2

Closed
SimonSapin opened this Issue April 18, 2012 · 2 comments

Simon Sapin
Owner

Now that the descendant selector bugs are fixed (unless I missed
something), the remaining issues that I see are:

  1. The current tokenizer for Symbol uses something like the '\w' regex,
    while a CSS IDENT token can contain any non-ASCII character (including
    U+00A0 no-break space, for example) and backslash-escapes, but cannot
    start with a digit.

  2. Unicode white space (like U+00A0) is treated as white space (either
    ignored or taken as a descendant combinator) but should not be
    (related to 1).

  3. 2n+1 or similar strings (arguments to :nth-child()) are tokenized as
    Symbol objects, and are then accepted by the parser as element types,
    class names, IDs, etc.
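
To make the three points concrete, here is a minimal sketch of an IDENT pattern and a white space pattern, loosely following the CSS 2.1 tokenization rules. The names `IDENT_RE`, `CSS_WHITESPACE_RE`, `is_ident` and `is_css_whitespace` are hypothetical and not part of this project's code, and the choice of U+00A0 as the lower bound for non-ASCII characters is an assumption taken from the description above.

```python
import re

# Building blocks in the style of the CSS 2.1 grammar. This is an
# illustrative assumption, not the tokenizer used by this project.
NONASCII = r'[^\x00-\x9f]'  # any character from U+00A0 upwards
UNICODE_ESCAPE = r'\\[0-9a-fA-F]{1,6}(?:\r\n|[ \n\r\t\f])?'
ESCAPE = '(?:' + UNICODE_ESCAPE + r'|\\[^\n\r\f0-9a-fA-F])'
NMSTART = '(?:[_a-zA-Z]|' + NONASCII + '|' + ESCAPE + ')'
NMCHAR = '(?:[_a-zA-Z0-9-]|' + NONASCII + '|' + ESCAPE + ')'

# An identifier: optional hyphen, a start character, then name characters.
IDENT_RE = re.compile('-?' + NMSTART + NMCHAR + '*')

# CSS white space is only space, tab, CR, LF and form feed,
# unlike Python's \s, which also matches U+00A0 on str patterns.
CSS_WHITESPACE_RE = re.compile(r'[ \t\r\n\f]+')


def is_ident(text):
    """Return True if the whole string is a single identifier."""
    return IDENT_RE.fullmatch(text) is not None


def is_css_whitespace(text):
    """Return True if the whole string is CSS white space."""
    return CSS_WHITESPACE_RE.fullmatch(text) is not None


print(is_ident('test\u201d'))       # True: non-ASCII is allowed
print(is_ident('2n+1'))             # False: cannot start with a digit
print(is_css_whitespace('\u00a0'))  # False: U+00A0 is not CSS white space
```

With a pattern like this, `2n+1` would no longer be tokenized as an identifier, and U+00A0 would be treated as part of a name rather than as a separator.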

I think that any valid (for CSS) selector that only uses ASCII without
backslash-escapes should be fine now, so maybe this is not really a
problem ...

Simon Sapin
Owner

https://bugs.launchpad.net/lxml/+bug/754636 is a special case of this bug. parse(u'.test\u201d') fails but should not. The Unicode character should be part of the class.

Simon Sapin
Owner

Actually, the valid CSS escaping would be .test\201d. The u in \u201d is Python string syntax.
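
The relation between the CSS escape and the Python literal can be checked with a small decoder. This is only a sketch: `unescape_css` is a hypothetical helper, not a function from this project, and mapping out-of-range code points to U+FFFD is an assumption made here to keep the example safe.

```python
import re

# A backslash followed by 1 to 6 hex digits (plus one optional trailing
# white space character), or a backslash followed by any other character
# that is not a line break.
_ESCAPE_RE = re.compile(
    r'\\(?:([0-9a-fA-F]{1,6})(?:\r\n|[ \n\r\t\f])?|([^\n\r\f]))')


def unescape_css(text):
    """Decode CSS backslash-escapes in text (hypothetical helper)."""
    def replace(match):
        hex_digits, literal = match.groups()
        if hex_digits is not None:
            codepoint = int(hex_digits, 16)
            # Assumption: out-of-range values become U+FFFD.
            if codepoint > 0x10FFFF:
                return '\ufffd'
            return chr(codepoint)
        return literal
    return _ESCAPE_RE.sub(replace, text)


# The CSS source text .test\201d names the same class as the
# Python string '.test\u201d'.
print(unescape_css(r'.test\201d') == '.test\u201d')
```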

Simon Sapin Add tests for series with whitespace
Together with the previous 2 commits, this fixes #2 and #7
d405f89
Simon Sapin SimonSapin closed this in d405f89 June 14, 2012