Now that the descendant selector bug is fixed (unless I missed
something), the remaining issues that I see are:
1. The current tokenizer for Symbol uses something like the '\w' regex,
   while a CSS IDENT token can contain any non-ASCII character (including
   U+00A0 no-break space, for example), can have backslash-escapes but
   cannot start with a digit.
2. Unicode white space (like U+00A0) counts as white space (either
   ignored or a descendant combinator) but should not (related to 1).
3. 2n+1 or similar strings (arguments to :nth-child()) are tokenized as
   Symbol objects, and are then accepted by the parser as element types,
   class names, IDs, etc.
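
To make the IDENT rules above concrete, here is a minimal sketch of a regex that follows them: non-ASCII characters and backslash-escapes are allowed, a leading digit is not. The names (IDENT_RE, match_ident) are made up for illustration and are not the library's tokenizer. The pattern is loosely modelled on the CSS grammar as I understand it, so treat it as a starting point rather than a finished tokenizer.

```python
import re

# Hypothetical building blocks, loosely following the CSS IDENT grammar.
NONASCII = r'[^\x00-\x7f]'  # any non-ASCII character, U+00A0 included
UNICODE_ESCAPE = r'\\[0-9a-fA-F]{1,6}(?:\r\n|[ \n\r\t\f])?'
ESCAPE = r'(?:%s|\\[^\n\r\f0-9a-fA-F])' % UNICODE_ESCAPE
NMSTART = r'(?:[_a-zA-Z]|%s|%s)' % (NONASCII, ESCAPE)  # no digits here
NMCHAR = r'(?:[_a-zA-Z0-9-]|%s|%s)' % (NONASCII, ESCAPE)
IDENT_RE = re.compile(r'-?%s%s*' % (NMSTART, NMCHAR))


def match_ident(text, pos=0):
    """Return the IDENT starting at `pos`, or None if there is none."""
    match = IDENT_RE.match(text, pos)
    return match.group() if match else None
```

With this pattern, match_ident('test\u201d') returns the whole string, match_ident('a\u00a0b') keeps the no-break space inside the identifier, and match_ident('2n+1') returns None, so an :nth-child() argument would no longer be picked up as a Symbol.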
I think that any valid (for CSS) selector that only uses ASCII without
backslash-escapes should be fine now, so maybe this is not really a
https://bugs.launchpad.net/lxml/+bug/754636 is a special case of this bug. parse(u'.test\u201d') fails but should not. The Unicode character should be part of the class.
Actually, the valid escaping would be .test\201d. The u is a Python thing.
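
For reference, a rough sketch of how CSS backslash-escapes such as \201d could be decoded once an identifier has been tokenized. The function name unescape_ident is hypothetical, Python 3 is assumed, and mapping out-of-range code points to U+FFFD is my own choice, not something taken from the library.

```python
import re

# One pass over both escape forms, so an escaped backslash is never
# re-read as the start of a hex escape.
_ESCAPE_RE = re.compile(
    r'\\(?:([0-9a-fA-F]{1,6})(?:\r\n|[ \n\r\t\f])?'  # \201d plus optional space
    r'|([^\n\r\f0-9a-fA-F]))'                        # \. or \\ and similar
)


def _replace_escape(match):
    hex_digits = match.group(1)
    if hex_digits is None:
        # Simple escape: the character stands for itself.
        return match.group(2)
    codepoint = int(hex_digits, 16)
    # Assumption: out-of-range values become U+FFFD instead of raising.
    return chr(codepoint) if codepoint <= 0x10FFFF else '\ufffd'


def unescape_ident(value):
    """Decode CSS backslash-escapes in an identifier (illustrative only)."""
    return _ESCAPE_RE.sub(_replace_escape, value)
```

So the CSS source .test\201d and the Python literal u'.test\u201d' would end up naming the same class.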
Add tests for series with whitespace
Together with the previous 2 commits, this fixes #2 and #7