Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

Tokenizer corner cases #2

Closed
SimonSapin opened this Issue · 2 comments

1 participant

Simon Sapin
Simon Sapin
Owner

Now that the descendant selector bug are fixed (unless I missed
something) the remaining issues that I see are:

  1. The current tokenizer for Symbol uses something like the '\w' regex,
    while a CSS IDENT token can contain any non-ASCII character (including
    U+00A0 no-break space, for example), can have backslash-escapes but can
    not start with a digit.

  2. Unicode white space (like U+00A0) counts as white space (either
    ignored or a descendant combinator) but should not (related to 1)

  3. 2n+1 or similar strings (arguments to :nth-child()) are tokenized as
    Symbol objects, and are then accepted by the parser as element types,
    class names, IDs, etc.

I think that any valid (for CSS) selector that only uses ASCII without
backslash-escapes should be fine now, so maybe this is not really a
problem ...

Simon Sapin
Owner

https://bugs.launchpad.net/lxml/+bug/754636 is a special case of this bug. parse(u'.test\u201d') fails but should not. The Unicode character should be part of the class.

Simon Sapin
Owner

Actually, the valid escaping would be .test\201d. The u is a Python thing.

Simon Sapin SimonSapin closed this issue from a commit
Simon Sapin Add tests for series with whitespace
Together with the previous 2 commits, this fixes #2 and #7
d405f89
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.