Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor HTML Parser #803

Merged
merged 67 commits into from Sep 22, 2020
Merged

Refactor HTML Parser #803

merged 67 commits into from Sep 22, 2020

Conversation

waylan
Copy link
Member

@waylan waylan commented Mar 7, 2019

This is experimental. More of the HTMLParser methods need to be fleshed out. So far the basic stuff works as long as there is no invalid HTML in the document (which is untested at this point).

Input:

Some *Markdown* text.

<p>Some *raw* HTML</p>

<span>*inline*</span>

    <p>code block</p>

`<em>code span</em>`

<div>

foo *bar*

* baz bar

blah blah

</div>

More *Markdown*.

Output:

<p>Some <em>Markdown</em> text.</p>
<p>Some *raw* HTML</p>

<p><span><em>inline</em></span></p>
<pre><code>&lt;p&gt;code block&lt;/p&gt;
</code></pre>
<p><code>&lt;em&gt;code span&lt;/em&gt;</code></p>
<div>

foo *bar*

* baz bar

blah blah

</div>

<p>More <em>Markdown</em>.</p>

... which exactly matches the existing behavior.

I havn't actually run the tests on this yet, so I'm curious to see what Travis says...

@waylan waylan added the work-in-progress A partial solution. More changes will be coming. label Mar 7, 2019
@facelessuser
Copy link
Collaborator

I was considering rewriting this myself to fix some of this stuff once and for all, so I'm glad to see the work done here. I'll have to take a look at this and see how it handles or see if there are additional things we need to address. It seems you simplified a lot of the code which is a good thing.

@waylan
Copy link
Member Author

waylan commented Mar 8, 2019

Added some tests using the new framework. The new tests include comments on things we might want to reevaluate.

Also, one of the new tests is failing and needs fixed. The plan is to continue adding new tests and fix various edge cases until all the old tests pass At that point, the old tests should be redundant and can be removed.

@facelessuser
Copy link
Collaborator

Here's a question. Why don't we build markdown="1" functionality right into the HTML parser? I guess you could still have an extension that enables the functionality, but it seems it would be simpler and more straight forward to just build it right in with some kind of toggle.

@facelessuser
Copy link
Collaborator

Maybe the new extra would simply just replace the default HTML parser instance with a new instance that enables markdown=1 feature. No need to add a separate plugin that has to intercept blocks and then modify a difficult to understand tracking object. I think that would be easier to maintain.

@waylan
Copy link
Member Author

waylan commented Mar 11, 2019

I've rounded out the tests of valid raw HTML blocks. I have yet to address any of the failing tests. And I haven't even looked at invalid raw HTML.

However, I have questions regarding what the correct behavior should be in comments associated with various tests in the test file. Any feedback on those would be appreciated.

tests/test_syntax/blocks/test_html_blocks.py Outdated Show resolved Hide resolved
tests/test_syntax/blocks/test_html_blocks.py Outdated Show resolved Hide resolved
tests/test_syntax/blocks/test_html_blocks.py Outdated Show resolved Hide resolved
tests/test_syntax/blocks/test_html_blocks.py Outdated Show resolved Hide resolved
tests/test_syntax/blocks/test_html_blocks.py Outdated Show resolved Hide resolved
tests/test_syntax/blocks/test_html_blocks.py Outdated Show resolved Hide resolved
@waylan waylan force-pushed the html branch 2 times, most recently from 1caf92b to 3b05be9 Compare March 14, 2019 21:24
@waylan
Copy link
Member Author

waylan commented Mar 16, 2019

Sigh. I finally looked at the failing processing instruction tests. I never thought this would be what breaks the standard lib HTML parser:

>>> class TestParser(HTMLParser):
...     def handle_pi(self, data):
...             print ('PI:', '"{}"'.format(data))
...     def handle_data(self, data):
...             print ('TEXT:', '"{}"'.format(data))
...
>>> parser = TestParser()
>>> parser.feed("<?php echo '>'; ?>")
PI: "php echo '"
TEXT: "'; ?>"

Note that the output should be:

PI: "php echo '>'; ?"

However, the angle bracket in the quote is mistaken for the closing bracket and everything after it is considered text outside of a tag. In fact, a look at the source code revels that it really is only looking for the next > irrespective of the context. I could file a bug, but this would need to get fixed and backported to all supported versions before it would be useful.

And then there is this comment in another section of the code (for finding the closing bracket of an end tag):

# consume and ignore other stuff between the name and the >
# Note: this is not 100% correct, since we might have things like
# </tag attr=">">, but looking for > after tha name should cover
# most of the cases and is much simpler

I would guess that Python-Markdown doesn't currently handle that case either, but is seems that these sorts of edge cases are not handled very well. In the processing instruction case, that is a very reasonable sort of thing to expect.

Although, I suppose it would be weird for PHP code to be embedded in a Markdown file which is not in a code block. The only purpose for that would be to use PHP an a template engine for generating Markdown. But in that case, the document would be run through PHP first, before being run through the Markdown parser and the PHP would no longer be present. That said, it should work correctly.

I suppose we could override the parse_pi method and fix it ourselves. Technically, the method is not private (it does not start with an underscore), although it is labeled "internal" in a comment. However, I'm not really interested in fixing underlying bugs in HTMLParser.

@waylan
Copy link
Member Author

waylan commented Mar 16, 2019

All things considered, regardless of whether we use the HTML parsing implementation in this PR, I created a pretty good set of tests which we can use regardless of how HTML parsing is implemented.

@waylan
Copy link
Member Author

waylan commented Mar 16, 2019

I did some research on processing instructions and it seems that, as Wikipedia summarizes:

An SGML processing instruction is enclosed within <? and >.

An XML processing instruction is enclosed within <? and ?>.

Wondering which applies to HTML, I checked the HTML5 spec, and under Dependencies, it specifically states that the Associating Style Sheets with XML documents specification is where processing instructions are defined.

From that it seems clear that valid processing instructions in HTML5 should be enclosed within <? and ?>. Of course, XHTML is XML, so the same applies there as well. That being the case, it shouldn't be difficult to monkeypatch HTMLParser to search for a closing ?>, which will mostly resolve the issue. Any thoughts?

@waylan
Copy link
Member Author

waylan commented Sep 15, 2020

I added the following to the md_in_html tests in 2d8ce54:

from ..blocks.test_html_blocks import TestHTMLBlocks
class TestDefaultwMdInHTML(TestHTMLBlocks):
""" Ensure the md_in_html extension does not break the default behavior. """
default_kwargs = {'extensions': ['md_in_html']}

It works well to ensure that the extension doesn't break the default behavior. The problem is that unittest discover sees the TestHTMLBlocks class which is imported and runs it as well. However, it is also run from its original location. Therefore, all of the testcases get run 3 times, once in TestDefaultwMdInHTML (with the extension enabled) and twice in TestHTMLBlocks (without the extension enabled). Any idea how to avoid the extra, unnecessary run of those tests?

While running the tests an extra time makes the test run slightly longer, that's only a minor inconvenience. The real problem is that it is confusing that the same test failure gets reported three times, two of them being the exact same test. It doesn't even report the two failures as being from different locations. It reports both as being in tests.test_syntax.blocks.test_html_blocks.TestHTMLBlocks.

@waylan
Copy link
Member Author

waylan commented Sep 15, 2020

We now have 100% patch coverage with all tests passing. 💯 🎉

We just need to address the following items and this should be ready to go. 🚀

  • 1. The issue raised in my previous comment.
  • 2. Release notes.
  • 3. A final review (preferably by someone other than myself).

@mitya57
Copy link
Collaborator

mitya57 commented Sep 15, 2020

Any idea how to avoid the extra, unnecessary run of those tests?

One option would be adding a load_tests function to the module, like described here: https://docs.python.org/3/library/unittest.html#load-tests-protocol.

Maybe it will also work (not tested) if you split TestHTMLBlocks into two classes: a base class that will have the actual tests and a subclass that will inherit from TestCase, like this:

class HTMLBlocks:
    # Test functions go here.

class TestHTMLBlocks(HTMLBlocks, TestCase):
    # No functions here, just deriving from HTMLBlocks and TestCase.

and in the other file:

class TestDefaultwMdInHTML(HTMLBlocks, TestCase):
    default_kwargs = {'extensions': ['md_in_html']}

@waylan
Copy link
Member Author

waylan commented Sep 15, 2020

I just confirmed that this PR fixes all of the following open issues: #595, #780, #830, and #1012. It appears that we have addressed all open issues related to HTML parsing. 👍 🎉

@waylan
Copy link
Member Author

waylan commented Sep 15, 2020

One option would be adding a load_tests function to the module

@mitya57 that did the trick (1eb9fd3 ). Thanks.

@waylan waylan added needs-review Needs to be reviewed and/or approved. and removed work-in-progress A partial solution. More changes will be coming. labels Sep 15, 2020
@facelessuser
Copy link
Collaborator

This is fantastic news.

@waylan
Copy link
Member Author

waylan commented Sep 16, 2020

@facelessuser @mitya57 I have done a final read-through of this and am ready to merge. However, if either of you expect to have time to take a look soon, I'll happily wait. Let me know and I'll proceed accordingly.

@facelessuser
Copy link
Collaborator

I'd like an opportunity to try it out this weekend, but if I don't get to it by then, if move ahead without me.

@waylan waylan merged commit b701c34 into Python-Markdown:master Sep 22, 2020
@waylan waylan added approved The pull request is ready to be merged. and removed needs-review Needs to be reviewed and/or approved. labels Sep 22, 2020
@mitya57
Copy link
Collaborator

mitya57 commented Sep 22, 2020

Sorry, I didn't have time to review this properly. But from a quick look it is good. Thanks for your work!

@mbarkhau
Copy link

x-ref: https://gitlab.com/mbarkhau/markdown-katex/-/issues/14

Just in case anybody else runs into this. Since 3.3 the new intermediate processing of html may change attributes with single quotes to double quotes. This broke some extensions that used naive string replacement rather than html parsing.

Regardless, thanks for the continued work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved The pull request is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants