Refactor HTML Parser #803

waylan · 2019-03-07T20:53:52Z

This is experimental. More of the HTMLParser methods need to be fleshed out. So far the basic stuff works as long as there is no invalid HTML in the document (which is untested at this point).

Input:

Some *Markdown* text.

<p>Some *raw* HTML</p>

<span>*inline*</span>

    <p>code block</p>

`<em>code span</em>`

<div>

foo *bar*

* baz bar

blah blah

</div>

More *Markdown*.

Output:

<p>Some <em>Markdown</em> text.</p>
<p>Some *raw* HTML</p>

<p><span><em>inline</em></span></p>
<pre><code>&lt;p&gt;code block&lt;/p&gt;
</code></pre>
<p><code>&lt;em&gt;code span&lt;/em&gt;</code></p>
<div>

foo *bar*

* baz bar

blah blah

</div>

<p>More <em>Markdown</em>.</p>

... which exactly matches the existing behavior.

I havn't actually run the tests on this yet, so I'm curious to see what Travis says...

facelessuser · 2019-03-07T21:15:03Z

I was considering rewriting this myself to fix some of this stuff once and for all, so I'm glad to see the work done here. I'll have to take a look at this and see how it handles or see if there are additional things we need to address. It seems you simplified a lot of the code which is a good thing.

waylan · 2019-03-08T01:22:46Z

Added some tests using the new framework. The new tests include comments on things we might want to reevaluate.

Also, one of the new tests is failing and needs fixed. The plan is to continue adding new tests and fix various edge cases until all the old tests pass At that point, the old tests should be redundant and can be removed.

facelessuser · 2019-03-08T23:22:54Z

Here's a question. Why don't we build markdown="1" functionality right into the HTML parser? I guess you could still have an extension that enables the functionality, but it seems it would be simpler and more straight forward to just build it right in with some kind of toggle.

facelessuser · 2019-03-09T00:31:52Z

Maybe the new extra would simply just replace the default HTML parser instance with a new instance that enables markdown=1 feature. No need to add a separate plugin that has to intercept blocks and then modify a difficult to understand tracking object. I think that would be easier to maintain.

waylan · 2019-03-11T14:16:47Z

I've rounded out the tests of valid raw HTML blocks. I have yet to address any of the failing tests. And I haven't even looked at invalid raw HTML.

However, I have questions regarding what the correct behavior should be in comments associated with various tests in the test file. Any feedback on those would be appreciated.

tests/test_syntax/blocks/test_html_blocks.py

waylan · 2019-03-16T01:36:07Z

Sigh. I finally looked at the failing processing instruction tests. I never thought this would be what breaks the standard lib HTML parser:

>>> class TestParser(HTMLParser):
...     def handle_pi(self, data):
...             print ('PI:', '"{}"'.format(data))
...     def handle_data(self, data):
...             print ('TEXT:', '"{}"'.format(data))
...
>>> parser = TestParser()
>>> parser.feed("<?php echo '>'; ?>")
PI: "php echo '"
TEXT: "'; ?>"

Note that the output should be:

PI: "php echo '>'; ?"

However, the angle bracket in the quote is mistaken for the closing bracket and everything after it is considered text outside of a tag. In fact, a look at the source code revels that it really is only looking for the next > irrespective of the context. I could file a bug, but this would need to get fixed and backported to all supported versions before it would be useful.

And then there is this comment in another section of the code (for finding the closing bracket of an end tag):

# consume and ignore other stuff between the name and the >
# Note: this is not 100% correct, since we might have things like
# </tag attr=">">, but looking for > after tha name should cover
# most of the cases and is much simpler

I would guess that Python-Markdown doesn't currently handle that case either, but is seems that these sorts of edge cases are not handled very well. In the processing instruction case, that is a very reasonable sort of thing to expect.

Although, I suppose it would be weird for PHP code to be embedded in a Markdown file which is not in a code block. The only purpose for that would be to use PHP an a template engine for generating Markdown. But in that case, the document would be run through PHP first, before being run through the Markdown parser and the PHP would no longer be present. That said, it should work correctly.

I suppose we could override the parse_pi method and fix it ourselves. Technically, the method is not private (it does not start with an underscore), although it is labeled "internal" in a comment. However, I'm not really interested in fixing underlying bugs in HTMLParser.

waylan · 2019-03-16T01:38:24Z

All things considered, regardless of whether we use the HTML parsing implementation in this PR, I created a pretty good set of tests which we can use regardless of how HTML parsing is implemented.

waylan · 2019-03-16T18:10:17Z

I did some research on processing instructions and it seems that, as Wikipedia summarizes:

An SGML processing instruction is enclosed within <? and >.

An XML processing instruction is enclosed within <? and ?>.

Wondering which applies to HTML, I checked the HTML5 spec, and under Dependencies, it specifically states that the Associating Style Sheets with XML documents specification is where processing instructions are defined.

From that it seems clear that valid processing instructions in HTML5 should be enclosed within <? and ?>. Of course, XHTML is XML, so the same applies there as well. That being the case, it shouldn't be difficult to monkeypatch HTMLParser to search for a closing ?>, which will mostly resolve the issue. Any thoughts?

This is experimental. More of the HTMLParser methods need to be fleshed out. So far the basic stuff works as long as there is no invalid HTML in the document.

There are some design desisions to make as noted in comments.

waylan · 2020-09-15T13:35:47Z

I added the following to the md_in_html tests in 2d8ce54:

markdown/tests/test_syntax/extensions/test_md_in_html.py

Lines 24 to 30 in 2d8ce54

    
           from ..blocks.test_html_blocks import TestHTMLBlocks 
        
           class TestDefaultwMdInHTML(TestHTMLBlocks): 
        
               """ Ensure the md_in_html extension does not break the default behavior. """ 
        
               default_kwargs = {'extensions': ['md_in_html']}

It works well to ensure that the extension doesn't break the default behavior. The problem is that unittest discover sees the TestHTMLBlocks class which is imported and runs it as well. However, it is also run from its original location. Therefore, all of the testcases get run 3 times, once in TestDefaultwMdInHTML (with the extension enabled) and twice in TestHTMLBlocks (without the extension enabled). Any idea how to avoid the extra, unnecessary run of those tests?

While running the tests an extra time makes the test run slightly longer, that's only a minor inconvenience. The real problem is that it is confusing that the same test failure gets reported three times, two of them being the exact same test. It doesn't even report the two failures as being from different locations. It reports both as being in tests.test_syntax.blocks.test_html_blocks.TestHTMLBlocks.

waylan · 2020-09-15T15:54:10Z

We now have 100% patch coverage with all tests passing. 💯 🎉

We just need to address the following items and this should be ready to go. 🚀

1. ~~The issue raised in my previous comment.~~
2. ~~Release notes.~~
3. A final review (preferably by someone other than myself).

mitya57 · 2020-09-15T17:45:04Z

Any idea how to avoid the extra, unnecessary run of those tests?

One option would be adding a load_tests function to the module, like described here: https://docs.python.org/3/library/unittest.html#load-tests-protocol.

Maybe it will also work (not tested) if you split TestHTMLBlocks into two classes: a base class that will have the actual tests and a subclass that will inherit from TestCase, like this:

class HTMLBlocks:
    # Test functions go here.

class TestHTMLBlocks(HTMLBlocks, TestCase):
    # No functions here, just deriving from HTMLBlocks and TestCase.

and in the other file:

class TestDefaultwMdInHTML(HTMLBlocks, TestCase):
    default_kwargs = {'extensions': ['md_in_html']}

waylan · 2020-09-15T17:46:31Z

I just confirmed that this PR fixes all of the following open issues: #595, #780, #830, and #1012. It appears that we have addressed all open issues related to HTML parsing. 👍 🎉

waylan · 2020-09-15T18:09:26Z

One option would be adding a load_tests function to the module

@mitya57 that did the trick (1eb9fd3 ). Thanks.

facelessuser · 2020-09-15T23:24:38Z

This is fantastic news.

waylan · 2020-09-16T18:39:14Z

@facelessuser @mitya57 I have done a final read-through of this and am ready to merge. However, if either of you expect to have time to take a look soon, I'll happily wait. Let me know and I'll proceed accordingly.

facelessuser · 2020-09-16T18:48:14Z

I'd like an opportunity to try it out this weekend, but if I don't get to it by then, if move ahead without me.

mitya57 · 2020-09-22T15:11:50Z

Sorry, I didn't have time to review this properly. But from a quick look it is good. Thanks for your work!

mbarkhau · 2021-03-14T19:49:33Z

x-ref: https://gitlab.com/mbarkhau/markdown-katex/-/issues/14

Just in case anybody else runs into this. Since 3.3 the new intermediate processing of html may change attributes with single quotes to double quotes. This broke some extensions that used naive string replacement rather than html parsing.

Regardless, thanks for the continued work.

waylan added the work-in-progress A partial solution. More changes will be coming. label Mar 7, 2019

mitya57 reviewed Mar 12, 2019

View reviewed changes

waylan force-pushed the html branch 2 times, most recently from 1caf92b to 3b05be9 Compare March 14, 2019 21:24

waylan force-pushed the html branch from 83894f2 to 182853a Compare March 25, 2019 23:43

waylan mentioned this pull request Mar 26, 2019

Bump version to 3.1 #809

Merged

waylan added this to the Version 3.2 milestone Mar 26, 2019

waylan force-pushed the html branch from 182853a to 0b96cfb Compare March 26, 2019 00:26

waylan mentioned this pull request Sep 30, 2019

Python Version Support Timeline #760

Open

waylan removed this from the Version 3.2 milestone Feb 6, 2020

waylan added 12 commits July 1, 2020 13:07

Refactor HTML Parser

585f207

This is experimental. More of the HTMLParser methods need to be fleshed out. So far the basic stuff works as long as there is no invalid HTML in the document.

fix silly error

77baade

Add some new tests

d4c8951

More tests.

356f5c3

Round out tests of valid markup.

ff0f8f2

There are some design desisions to make as noted in comments.

Some cleanup and bugfixes

6efe8d5

Some minor tweaks

e5f9ca4

comments partially fixed.

95e8498

Support 0-3 spaces of indent for raw HTML blocks

ea98546

Remove need to wrap raw in blank lines

23e41d3

More tests passing

46b3a1b

All handle_* methods are now defined and tested

8a17794

Ensure extension doesn't break default behavior.

2d8ce54

waylan added 4 commits September 15, 2020 10:18

update abbr tests

4856e86

add basic link ref tests.

07c9267

flake8 cleanup

82b97e5

footnote tests. 100% patch coverage

1a0a893

Add test for case in Python-Markdown#1012.

46ac436

waylan added 2 commits September 15, 2020 13:54

Add release notes.

9cfbf20

Avoid duplicate tests.

1eb9fd3

Fix a broken link

6f3b417

waylan added needs-review Needs to be reviewed and/or approved. and removed work-in-progress A partial solution. More changes will be coming. labels Sep 15, 2020

Final cleanup.

15b431a

waylan merged commit b701c34 into Python-Markdown:master Sep 22, 2020

waylan added approved The pull request is ready to be merged. and removed needs-review Needs to be reviewed and/or approved. labels Sep 22, 2020

dbader mentioned this pull request Oct 14, 2020

markdown.extensions.md_in_html fails with TypeError: expected string or bytes-like object #1040

Closed

waylan mentioned this pull request Nov 19, 2020

unclosed tag in code span #1066

Closed

remyroy mentioned this pull request May 12, 2021

Monkeypatching html.parser fails when importing with zipimporter #1132

Open

oprypin mentioned this pull request Jul 8, 2021

Different generated markup from serve and gh-deploy mkdocs/mkdocs#2488

Closed

kernc mentioned this pull request Nov 24, 2022

<html> tag wrapped in <p> #1308

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor HTML Parser #803

Refactor HTML Parser #803

waylan commented Mar 7, 2019 •

edited

facelessuser commented Mar 7, 2019

waylan commented Mar 8, 2019

facelessuser commented Mar 8, 2019

facelessuser commented Mar 9, 2019

waylan commented Mar 11, 2019 •

edited

waylan commented Mar 16, 2019

waylan commented Mar 16, 2019

waylan commented Mar 16, 2019 •

edited

waylan commented Sep 15, 2020 •

edited

waylan commented Sep 15, 2020 •

edited

mitya57 commented Sep 15, 2020

waylan commented Sep 15, 2020

waylan commented Sep 15, 2020

facelessuser commented Sep 15, 2020

waylan commented Sep 16, 2020

facelessuser commented Sep 16, 2020

mitya57 commented Sep 22, 2020

mbarkhau commented Mar 14, 2021

Refactor HTML Parser #803

Refactor HTML Parser #803

Conversation

waylan commented Mar 7, 2019 • edited

Input:

Output:

facelessuser commented Mar 7, 2019

waylan commented Mar 8, 2019

facelessuser commented Mar 8, 2019

facelessuser commented Mar 9, 2019

waylan commented Mar 11, 2019 • edited

waylan commented Mar 16, 2019

waylan commented Mar 16, 2019

waylan commented Mar 16, 2019 • edited

waylan commented Sep 15, 2020 • edited

waylan commented Sep 15, 2020 • edited

mitya57 commented Sep 15, 2020

waylan commented Sep 15, 2020

waylan commented Sep 15, 2020

facelessuser commented Sep 15, 2020

waylan commented Sep 16, 2020

facelessuser commented Sep 16, 2020

mitya57 commented Sep 22, 2020

mbarkhau commented Mar 14, 2021

waylan commented Mar 7, 2019 •

edited

waylan commented Mar 11, 2019 •

edited

waylan commented Mar 16, 2019 •

edited

waylan commented Sep 15, 2020 •

edited

waylan commented Sep 15, 2020 •

edited