Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<hr> tags break markup processing in 3.3.0+? #1053

Closed
dbader opened this issue Oct 23, 2020 · 9 comments
Closed

<hr> tags break markup processing in 3.3.0+? #1053

dbader opened this issue Oct 23, 2020 · 9 comments
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request. core Related to the core parser code.

Comments

@dbader
Copy link
Contributor

dbader commented Oct 23, 2020

<hr> elements seem to interrupt markdown text processing:

  • Markdown Input: '*emphasis1*\n<hr>\n*emphasis2*'
  • HTML Output (actual): '<p><em>emphasis1</em></p>\n<hr>\n*emphasis2*'
  • HTML Output (expected): '<p><em>emphasis1</em>\n<hr>\n<em>emphasis2</em></p>'

*emphasis1*\n<hr/>\n*emphasis2* (closing the hr tag) causes *emphasis2* to render correctly.

Naked <hr> tags should be valid HTML: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/hr

Having the md_in_html extension enabled or disabled doesn't seem to make a difference.

Versions

  • Markdown==3.3.2 (affects 3.3.0 and above)
  • CPython 3.8.6

Steps to reproduce

mderror.py:

import markdown


def test(md_text):
    html = markdown.markdown(
        md_text,
        # Same result with or without md_in_html
        # extensions=["markdown.extensions.md_in_html"],
    )
    okay = "*emphasis" not in html
    print(f"{'OK ' if okay else 'ERR'} {md_text!r}{html!r}")


# ERR
test("*emphasis1*\n<hr>\n*emphasis2*")

# OK
test("*emphasis1*\n<hr/>\n*emphasis2*")
test("*emphasis1*\n<hr></hr>\n*emphasis2*")
test("*emphasis1*\n<img>\n*emphasis2*")

Output:

897c854 also fails:

pip install git+https://github.com/Python-Markdown/markdown@897c8541998cdbd0499b04942244f7fdf1191a6c && python mderror.py
ERR '*emphasis1*\n<hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr>\n*emphasis2*'
OK  '*emphasis1*\n<hr/>\n*emphasis2*' → '<p><em>emphasis1</em>\n<hr/></p>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<hr></hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr></hr>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<img>\n*emphasis2*' → '<p><em>emphasis1</em>\n<img>\n<em>emphasis2</em></p>'
$ pip install Markdown==3.3.2 && python mderror.py
ERR '*emphasis1*\n<hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr>\n*emphasis2*'
OK  '*emphasis1*\n<hr/>\n*emphasis2*' → '<p><em>emphasis1</em>\n<hr/></p>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<hr></hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr></hr>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<img>\n*emphasis2*' → '<p><em>emphasis1</em>\n<img>\n<em>emphasis2</em></p>'
$ pip install Markdown==3.3.1 && python mderror.py
ERR '*emphasis1*\n<hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr>\n*emphasis2*'
OK  '*emphasis1*\n<hr/>\n*emphasis2*' → '<p><em>emphasis1</em>\n<hr/></p>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<hr></hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr></hr>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<img>\n*emphasis2*' → '<p><em>emphasis1</em>\n<img>\n<em>emphasis2</em></p>'

3.3.0 starts failing:

→ pip install Markdown==3.3.0 && python mderror.py
ERR '*emphasis1*\n<hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr>\n*emphasis2*'
OK  '*emphasis1*\n<hr/>\n*emphasis2*' → '<p><em>emphasis1</em>\n<hr/></p>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<hr></hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr></hr>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<img>\n*emphasis2*' → '<p><em>emphasis1</em>\n<img>\n<em>emphasis2</em></p>'

3.2.2 passes all tests:

→ pip install Markdown==3.2.2 && python mderror.py
OK  '*emphasis1*\n<hr>\n*emphasis2*' → '<p><em>emphasis1</em>\n<hr>\n<em>emphasis2</em></p>'
OK  '*emphasis1*\n<hr/>\n*emphasis2*' → '<p><em>emphasis1</em>\n<hr/>\n<em>emphasis2</em></p>'
OK  '*emphasis1*\n<hr></hr>\n*emphasis2*' → '<p><em>emphasis1</em>\n<hr></hr>\n<em>emphasis2</em></p>'
OK  '*emphasis1*\n<img>\n*emphasis2*' → '<p><em>emphasis1</em>\n<img>\n<em>emphasis2</em></p>'
@dbader dbader changed the title <hr> tags rendering differently in 3.3.0+ <hr> tags break markup processing in 3.3.0+? Oct 23, 2020
@facelessuser
Copy link
Collaborator

I'll have to test the tip. We have some unreleased fixes as there were some inconsistencies with certain tags.

@facelessuser
Copy link
Collaborator

Yeah, it looks like new behavior introduced by the new parser. I would think <hr> would get handled similar to <br>, but it isn't.

On the one hand, as it is a block item, I'm not sure it should render inside a paragraph. But on the other hand, it seems to treat it as a block (which it is) and just keep searching for an end tag. It doesn't find the end, but assumes all the trailing text is inside the <hr> and since blocks are not parsed for Markdown, the emphasis isn't parsed.

Interestingly, if we add markdown="1" to it, we get data loss:

OK  '*emphasis1*\n<hr markdown="1">\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr />'
OK  '*emphasis1*\n<hr/>\n*emphasis2*' → '<p><em>emphasis1</em>\n<hr/></p>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<hr></hr>\n*emphasis2*' → '<p><em>emphasis1</em></p>\n<hr></hr>\n<p><em>emphasis2</em></p>'
OK  '*emphasis1*\n<img>\n*emphasis2*' → '<p><em>emphasis1</em>\n<img>\n<em>emphasis2</em></p>'

Seems we need to handle unclosed tags a bit better:

  1. To not have data loss
  2. Maybe to recognize and handle tags that shouldn't have content.

@waylan thoughts?

@facelessuser
Copy link
Collaborator

Here is a potential fix. Interested to hear your opinion on approach @waylan or if you have a different suggestion:

diff --git a/markdown/extensions/md_in_html.py b/markdown/extensions/md_in_html.py
index a2137c7..4bfc6c9 100644
--- a/markdown/extensions/md_in_html.py
+++ b/markdown/extensions/md_in_html.py
@@ -36,9 +36,10 @@ class HTMLExtractorExtra(HTMLExtractor):
         # Block-level tags which never get their content parsed.
         self.raw_tags = ['canvas', 'math', 'option', 'pre', 'script', 'style', 'textarea']
         # Block-level tags in which the content gets parsed as blocks
-        self.block_tags = [tag for tag in self.block_level_tags if tag not in self.span_tags + self.raw_tags]
         super().__init__(md, *args, **kwargs)
 
+        self.block_tags = [tag for tag in self.block_level_tags if tag not in self.span_tags + self.raw_tags]
+
     def reset(self):
         """Reset this instance.  Loses all unprocessed data."""
         self.mdstack = []  # When markdown=1, stack contains a list of tags
@@ -119,6 +120,9 @@ class HTMLExtractorExtra(HTMLExtractor):
                 else:
                     self.handle_data(text)
 
+        if tag in self.empty_tags:
+            self.handle_endtag(tag)
+
     def handle_endtag(self, tag):
         if tag in self.block_level_tags:
             if self.inraw:
diff --git a/markdown/htmlparser.py b/markdown/htmlparser.py
index 6776d34..bd75368 100644
--- a/markdown/htmlparser.py
+++ b/markdown/htmlparser.py
@@ -56,6 +56,10 @@ class HTMLExtractor(htmlparser.HTMLParser):
     def __init__(self, md, *args, **kwargs):
         if 'convert_charrefs' not in kwargs:
             kwargs['convert_charrefs'] = False
+
+        # Block tags that should contain no content (self closing)
+        self.empty_tags = ['hr']
+
         # This calls self.reset
         super().__init__(*args, **kwargs)
         self.md = md
@@ -135,6 +139,9 @@ class HTMLExtractor(htmlparser.HTMLParser):
                 # This is presumably a standalone tag in a code span (see #1036).
                 self.clear_cdata_mode()
 
+        if tag in self.empty_tags:
+            self.handle_endtag(tag)
+
     def handle_endtag(self, tag):
         text = self.get_endtag_text(tag)
 

@facelessuser
Copy link
Collaborator

Should be noted that this doesn't universally fix potential data loss in md_in_html if ending tag is missing.

@facelessuser
Copy link
Collaborator

This will have to be tweaked as well as it isn't perfect and doubles <hr> tags, but the general direction is what I'm trying to highlight.

@facelessuser
Copy link
Collaborator

This is a bit more sane:

diff --git a/markdown/extensions/md_in_html.py b/markdown/extensions/md_in_html.py
index a2137c7..1fc4ab8 100644
--- a/markdown/extensions/md_in_html.py
+++ b/markdown/extensions/md_in_html.py
@@ -36,9 +36,10 @@ class HTMLExtractorExtra(HTMLExtractor):
         # Block-level tags which never get their content parsed.
         self.raw_tags = ['canvas', 'math', 'option', 'pre', 'script', 'style', 'textarea']
         # Block-level tags in which the content gets parsed as blocks
-        self.block_tags = [tag for tag in self.block_level_tags if tag not in self.span_tags + self.raw_tags]
         super().__init__(md, *args, **kwargs)
 
+        self.block_tags = [tag for tag in self.block_level_tags if tag not in self.span_tags + self.raw_tags]
+
     def reset(self):
         """Reset this instance.  Loses all unprocessed data."""
         self.mdstack = []  # When markdown=1, stack contains a list of tags
diff --git a/markdown/htmlparser.py b/markdown/htmlparser.py
index 6776d34..2e14038 100644
--- a/markdown/htmlparser.py
+++ b/markdown/htmlparser.py
@@ -56,6 +56,10 @@ class HTMLExtractor(htmlparser.HTMLParser):
     def __init__(self, md, *args, **kwargs):
         if 'convert_charrefs' not in kwargs:
             kwargs['convert_charrefs'] = False
+
+        # Block tags that should contain no content (self closing)
+        self.empty_tags = ['hr']
+
         # This calls self.reset
         super().__init__(*args, **kwargs)
         self.md = md
@@ -264,6 +268,13 @@ class HTMLExtractor(htmlparser.HTMLParser):
         if end.endswith('/>'):
             # XHTML-style empty tag: <span attr="value" />
             self.handle_startendtag(tag, attrs)
+        elif tag in self.empty_tags:
+            if re.match(r'\s*</\s*{}\s*>'.format(tag), self.rawdata[self.line_offset + self.offset + len(self.__starttag_text):]):
+                if tag in self.CDATA_CONTENT_ELEMENTS:
+                    self.set_cdata_mode(tag)
+                self.handle_starttag(tag, attrs)
+            else:
+                self.handle_startendtag(tag, attrs)
         else:
             # *** set cdata_mode first so we can override it in handle_starttag (see #1036) ***
             if tag in self.CDATA_CONTENT_ELEMENTS:

@facelessuser
Copy link
Collaborator

Still not happy with how the <hr> is still included in the previous <p>. It's probably due to the fact that handle_startendtag was more geared to handling inline tags. I don't have time to investigate that tonight.

@waylan
Copy link
Member

waylan commented Oct 24, 2020

This was an intentional change in behavior. Previously, we strictly enforced the rule which required that an HTML block must start with a blank line. However, in practice, in most cases, even the reference implementation doesn't follow that rule (strangely it does with <hr>). Therefore, as of version 3.3, Python-Markdown ends a Markdown paragraph (or any other block) upon finding an opening tag (or a self-closing tag) of a block-level element at the start of a line.

I should also note that it is invalid HTML to have an <hr> within a <p>. In fact, browsers will interpret

<p>foo
<hr>
bar
</p>

as

<p>foo
</p><hr>
bar
<p></p>

Note that bar is not even in a <p> tag as far as the browser is concerned. Therefore, it is nonsensical to output that HTML. The correct way to parse the following Markdown

foo
<hr>
bar

would be

<p>foo</p>
<hr>
<p>bar</p>

And that will provide a better rendering in the browser.

Which means that we do have a bug in that the line after the <hr> is not being wrapped in a <p>. Presumably a blank line needs to be inserted after block-level self-closing tags in the htmlparser.

@facelessuser
Copy link
Collaborator

I have some work here: #1054. We can discuss changes there and change direction if I'm going in bad direction.

This case:

foo
<hr>
bar

Should render as

<p>foo</p>
<hr>
<p>bar</p>

@waylan waylan added bug Bug report. confirmed Confirmed bug report or approved feature request. core Related to the core parser code. labels Oct 24, 2020
@waylan waylan closed this as completed in 11c9e17 Oct 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request. core Related to the core parser code.
Projects
None yet
Development

No branches or pull requests

3 participants