Refactor HTML Parser #803
I was considering rewriting this myself to fix some of this stuff once and for all, so I'm glad to see the work done here. I'll have to take a look at this and see how it handles things, or whether there are additional things we need to address. It seems you simplified a lot of the code, which is a good thing. |
Added some tests using the new framework. The new tests include comments on things we might want to reevaluate. Also, one of the new tests is failing and needs to be fixed. The plan is to continue adding new tests and fixing various edge cases until all the old tests pass. At that point, the old tests should be redundant and can be removed. |
Here's a question. Why don't we build |
Maybe the new |
I've rounded out the tests of valid raw HTML blocks. I have yet to address any of the failing tests, and I haven't even looked at invalid raw HTML. However, I have left questions about what the correct behavior should be in comments next to various tests in the test file. Any feedback on those would be appreciated. |
1caf92b to 3b05be9
Sigh. I finally looked at the failing processing instruction tests. I never thought this would be what breaks the standard lib HTML parser:
>>> from html.parser import HTMLParser
>>> class TestParser(HTMLParser):
...     def handle_pi(self, data):
...         print('PI:', '"{}"'.format(data))
...     def handle_data(self, data):
...         print('TEXT:', '"{}"'.format(data))
...
>>> parser = TestParser()
>>> parser.feed("<?php echo '>'; ?>")
PI: "php echo '"
TEXT: "'; ?>"
Note that the output should be:
However, the angle bracket in the quote is mistaken for the closing bracket and everything after it is considered text outside of a tag. In fact, a look at the source code reveals that it really is only looking for the next And then there is this comment in another section of the code (for finding the closing bracket of an end tag):
I would guess that Python-Markdown doesn't currently handle that case either, but it seems that these sorts of edge cases are not handled very well. In the processing instruction case, that is a very reasonable sort of thing to expect. Although, I suppose it would be weird for PHP code to be embedded in a Markdown file outside of a code block. The only purpose for that would be to use PHP as a template engine for generating Markdown. But in that case, the document would be run through PHP first, before being run through the Markdown parser, and the PHP would no longer be present. That said, it should work correctly. I suppose we could override the |
All things considered, whether or not we use the HTML parsing implementation in this PR, I created a pretty good set of tests which we can use however HTML parsing ends up being implemented. |
I did some research on processing instructions and it seems that, as Wikipedia summarizes:
Wondering which applies to HTML, I checked the HTML5 spec, and under Dependencies, it specifically states that the Associating Style Sheets with XML documents specification is where processing instructions are defined. From that it seems clear that valid processing instructions in HTML5 should be enclosed within |
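As a rough illustration of the override idea floated above, here is a minimal sketch that ends a processing instruction at `?>` instead of at the first `>`. It assumes that replacing `parse_pi`, an undocumented internal method of the standard library's `HTMLParser`, is acceptable; the class name `PIAwareParser` and the `events` list are hypothetical and are not taken from this PR.

```python
from html.parser import HTMLParser


class PIAwareParser(HTMLParser):
    """Hypothetical subclass that closes a processing instruction at '?>'."""

    def __init__(self):
        super().__init__()
        self.events = []  # collected (kind, data) tuples, for illustration only

    def parse_pi(self, i):
        # Assumption: parse_pi is an internal hook of HTMLParser, not a
        # documented API, so this override may need adjusting between
        # Python versions.
        rawdata = self.rawdata
        end = rawdata.find('?>', i + 2)
        if end == -1:
            return -1  # no closing '?>' yet; report the PI as incomplete
        # Keep the trailing '?' in the data, matching what the stdlib
        # passes to handle_pi for XHTML-style processing instructions.
        self.handle_pi(rawdata[i + 2:end + 1])
        return end + 2  # index just past the closing '?>'

    def handle_pi(self, data):
        self.events.append(('pi', data))

    def handle_data(self, data):
        self.events.append(('data', data))


parser = PIAwareParser()
parser.feed("<?php echo '>'; ?>")
parser.close()
print(parser.events)
```

With this sketch the quoted `>` stays inside the processing instruction rather than terminating it. How an unclosed `<?` should be treated is left open here.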
This is experimental. More of the HTMLParser methods need to be fleshed out. So far the basic stuff works as long as there is no invalid HTML in the document.
There are some design decisions to make as noted in comments.
I added the following to the markdown/tests/test_syntax/extensions/test_md_in_html.py Lines 24 to 30 in 2d8ce54
It works well to ensure that the extension doesn't break the default behavior. The problem is that While running the tests an extra time makes the test run slightly longer, that's only a minor inconvenience. The real problem is that it is confusing that the same test failure gets reported three times, two of them being the exact same test. It doesn't even report the two failures as being from different locations. It reports both as being in |
We now have 100% patch coverage with all tests passing. 💯 🎉 We just need to address the following items and this should be ready to go. 🚀
|
One option would be adding a Maybe it will also work (not tested) if you split
class HTMLBlocks:
    # Test functions go here.
    pass

class TestHTMLBlocks(HTMLBlocks, TestCase):
    # No functions here, just deriving from HTMLBlocks and TestCase.
    pass

and in the other file:

class TestDefaultwMdInHTML(HTMLBlocks, TestCase):
    default_kwargs = {'extensions': ['md_in_html']} |
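The split suggested above can be sketched in a self-contained form as follows. The `render` function is a hypothetical stand-in for the code under test (it is not the Markdown API), and `test_simple` is an invented example test. The point is only that the shared base class does not derive from `TestCase`, so the test runner collects each test once per concrete subclass and never from the base class itself.

```python
import unittest


def render(text, extensions=()):
    # Hypothetical stand-in for the function under test; it simply
    # upper-cases its input so the example has something to assert on.
    return text.upper()


class HTMLBlocks:
    """Shared tests. Not a TestCase, so it is never collected on its own."""

    default_kwargs = {}

    def test_simple(self):
        self.assertEqual(
            render('<p>foo</p>', **self.default_kwargs),
            '<P>FOO</P>'
        )


class TestHTMLBlocks(HTMLBlocks, unittest.TestCase):
    # No tests here; everything is inherited from HTMLBlocks.
    pass


class TestDefaultwMdInHTML(HTMLBlocks, unittest.TestCase):
    # Same tests, run again with the extension enabled.
    default_kwargs = {'extensions': ['md_in_html']}
```

Because each failure is then reported under the name of the concrete class that ran it, the two runs can be told apart in the test output.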
This is fantastic news. |
@facelessuser @mitya57 I have done a final read-through of this and am ready to merge. However, if either of you expect to have time to take a look soon, I'll happily wait. Let me know and I'll proceed accordingly. |
I'd like an opportunity to try it out this weekend, but if I don't get to it by then, feel free to move ahead without me. |
Sorry, I didn't have time to review this properly. But from a quick look it is good. Thanks for your work! |
x-ref: https://gitlab.com/mbarkhau/markdown-katex/-/issues/14 Just in case anybody else runs into this. Since 3.3 the new intermediate processing of html may change attributes with single quotes to double quotes. This broke some extensions that used naive string replacement rather than html parsing. Regardless, thanks for the continued work. |
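For extension authors hitting the quoting change described above, one way to avoid depending on quote style is to read attributes through an HTML parser rather than matching literal strings. The following is a minimal sketch using only the standard library; `AttrCollector` and `first_attr` are hypothetical helper names, not part of Python-Markdown or of the linked extension.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper that records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs with quotes removed.
        self.tags.append((tag, dict(attrs)))


def first_attr(html, tag, name):
    """Return the value of `name` on the first matching `tag`, or None."""
    collector = AttrCollector()
    collector.feed(html)
    collector.close()
    for found_tag, attrs in collector.tags:
        if found_tag == tag and name in attrs:
            return attrs[name]
    return None


# Both quoting styles yield the same value.
print(first_attr("<span class='katex'>x</span>", 'span', 'class'))
print(first_attr('<span class="katex">x</span>', 'span', 'class'))
```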
This is experimental. More of the HTMLParser methods need to be fleshed out. So far the basic stuff works as long as there is no invalid HTML in the document (which is untested at this point).
Input:
Output:
... which exactly matches the existing behavior.
I haven't actually run the tests on this yet, so I'm curious to see what Travis says...