Tokenizer corner cases #2

SimonSapin opened this Issue Apr 18, 2012 · 2 comments


Now that the descendant selector bugs are fixed (unless I missed
something), the remaining issues that I see are:

  1. The current tokenizer for Symbol uses something like the '\w' regex,
    while a CSS IDENT token can contain any non-ASCII character (including
    U+00A0 no-break space, for example), can contain backslash-escapes, but
    cannot start with a digit.

  2. Unicode white space (like U+00A0) is treated as white space (either
    ignored or taken as a descendant combinator) but should not be
    (related to 1).

  3. 2n+1 or similar strings (arguments to :nth-child()) are tokenized as
    Symbol objects, and are then accepted by the parser as element types,
    class names, IDs, etc.
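
To make points 1 to 3 concrete, here is a rough sketch of what a stricter identifier check could look like. It loosely follows the CSS 2.1 tokenizer grammar and is not the library's actual code: the names `IDENT_RE`, `CSS_WHITESPACE_RE` and `is_ident` are made up for this example, and treating every code point above U+007F as "non-ASCII" is an assumption taken from the wording of point 1.

```python
import re

# Hypothetical building blocks, loosely modelled on the CSS 2.1 grammar.
# Assumption: "non-ASCII" means any code point above U+007F.
NONASCII = r'[^\x00-\x7f]'
# A backslash, 1 to 6 hex digits, then one optional white space character.
UNICODE_ESCAPE = r'\\[0-9a-fA-F]{1,6}(?:\r\n|[ \t\r\n\f])?'
# Either a hex escape or a backslash followed by any other single character.
ESCAPE = r'(?:%s|\\[^\n\r\f0-9a-fA-F])' % UNICODE_ESCAPE
NMSTART = r'(?:[_a-zA-Z]|%s|%s)' % (NONASCII, ESCAPE)
NMCHAR = r'(?:[_a-zA-Z0-9-]|%s|%s)' % (NONASCII, ESCAPE)

# An identifier may start with a hyphen but never with a digit.
IDENT_RE = re.compile(r'-?%s%s*' % (NMSTART, NMCHAR))

# CSS white space is only space, tab, CR, LF and form feed.
# U+00A0 is deliberately not included.
CSS_WHITESPACE_RE = re.compile(r'[ \t\r\n\f]+')


def is_ident(text):
    """Return True if the whole string is a single CSS identifier."""
    return IDENT_RE.fullmatch(text) is not None
```

With this sketch, `is_ident('2n+1')` is false, so such a string would no longer pass as an element type or class name, while `is_ident('a\u00a0b')` is true because U+00A0 counts as part of the identifier rather than as white space.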

I think that any valid (for CSS) selector that only uses ASCII without
backslash-escapes should be fine now, so maybe this is not really a
problem ...

Here is a special case of this bug: parse(u'.test\u201d') fails but should not. The Unicode character should be part of the class name.


Actually, the valid CSS escape would be .test\201d. The u in \u201d is a Python thing.
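
For illustration, the difference between the two notations can be shown by decoding the CSS form by hand. This is only a sketch under the CSS 2.1 escape rules; `unescape_ident` is a hypothetical helper, not a function of the library, and mapping out-of-range code points to U+FFFD is an assumption.

```python
import re

# A backslash followed by 1 to 6 hex digits (plus one optional white space
# character), or a backslash followed by any other single character.
_ESCAPE_RE = re.compile(
    r'\\(?:([0-9a-fA-F]{1,6})(?:\r\n|[ \t\r\n\f])?|([^\n\r\f0-9a-fA-F]))')


def unescape_ident(text):
    """Replace CSS backslash-escapes in text with the characters they denote."""
    def replace(match):
        hex_digits, literal = match.groups()
        if hex_digits is None:
            # Simple escape such as "\." : keep the character itself.
            return literal
        codepoint = int(hex_digits, 16)
        if codepoint > 0x10FFFF:
            # Assumption: out-of-range values become U+FFFD.
            return '\ufffd'
        return chr(codepoint)
    return _ESCAPE_RE.sub(replace, text)
```

Here `unescape_ident(r'.test\201d')` gives the same string as the Python literal `'.test\u201d'`: the first is what a style sheet would contain, the second is how Python source spells the resulting character.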

SimonSapin added a commit that closed this issue on Jun 14, 2012:
Add tests for series with whitespace
Together with the previous 2 commits, this fixes #2 and #7