<html> tag wrapped in <p> #1308

Closed
kernc opened this issue Nov 24, 2022 · 4 comments · Fixed by #1309
Labels
3rd-party Should be implemented as a third party extension. invalid Invalid report (user error, upstream issue, etc).

Comments

@kernc
Contributor

kernc commented Nov 24, 2022

Since v3.3 (more precisely, #803 / b701c34), converting the following test document bug-test.txt:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8"/>
    <link rel="stylesheet" href="style.css"/>
  </head>
  <body>

!!! header "foo"
    bar

by running e.g.

markdown_py -o html5 -x extra -x admonition bug-test.txt

results in

<!DOCTYPE html>
<p><html>
  <head>
    <meta charset="utf-8"/>
    <link rel="stylesheet" href="style.css"/>
  </head>
  <body></p>
<div class="admonition header">
<p class="admonition-title">foo</p>
<p>bar</p>
</div>

Prior to v3.3, the result was as expected:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8"/>
    <link rel="stylesheet" href="style.css"/>
  </head>
  <body>

<div class="admonition header">
<p class="admonition-title">foo</p>
<p>bar</p>
</div>

Does the new HTML parser added in #803 perhaps not account for the <html> tag?

@waylan
Member

waylan commented Nov 28, 2022

The syntax rules state:

Markdown formatting syntax is not processed within block-level HTML tags. E.g., you can’t use Markdown-style emphasis inside an HTML block.

In fact, the current behavior exactly matches the reference implementation. Therefore, I'm surprised this ever worked for you. If anything, I would consider the old behavior a bug.

You may want to consider using the md_in_html extension. However, note that html and body tags are not listed among block-level tags there. Markdown content is considered an HTML fragment, which is to be inserted within an HTML document. Therefore, the entire document is expected to be within a body tag. That being the case, html and body tags should not be within Markdown content.
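A minimal sketch of the fragment model described above, using only the standard library: the converted Markdown is treated as a fragment and dropped into a page skeleton. The names `PAGE` and `wrap_fragment` are hypothetical, the skeleton simply mirrors the one in the report, and the fragment is hard-coded here where it would normally come from the converter.

```python
from string import Template

# Hypothetical page skeleton, mirroring the document in the report.
PAGE = Template("""<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8"/>
    <link rel="stylesheet" href="$stylesheet"/>
  </head>
  <body>
$fragment
  </body>
</html>
""")


def wrap_fragment(fragment: str, stylesheet: str = "style.css") -> str:
    """Insert an already-converted HTML fragment into the page skeleton."""
    return PAGE.substitute(fragment=fragment, stylesheet=stylesheet)


# Stand-in for converter output; in practice this would be the result
# of converting the Markdown source.
fragment = (
    '<div class="admonition header">\n'
    '<p class="admonition-title">foo</p>\n'
    '<p>bar</p>\n'
    '</div>'
)
page = wrap_fragment(fragment)
```

This keeps the html and body tags out of the Markdown source entirely, at the cost of one small wrapper step.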

In fact, the html and body tags have never been included in the list of block-level tags in our codebase (including prior to #803). I can only guess as to why you were getting the behavior you were before. Presumably, the lack of a closing tag was confusing the parser as the old implementation expected XHTML (which required closing tags to be valid). The new implementation parses HTML5 which is much more forgiving. Why that was resulting in the html not being wrapped in a p I have no idea. But it does explain why the Markdown content was previously being parsed and no longer is. In any event, as the current behavior is intentional, I have no inclination to dig through the old code to figure out what it was doing.

Regardless, you can override/alter the list of block-level tags via an extension. Feel free to create your own third-party extension which adds html, body, and any other tags you want to be considered block-level. For that matter, you don't even need an extension. Just alter an instance of the markdown.Markdown class:

import markdown

md = markdown.Markdown()
# Treat <html> as block-level so it is not wrapped in <p>.
md.block_level_elements.append('html')
html = md.convert(src)  # src: the Markdown source text

Of course, your Markdown content still won't be parsed as Markdown, so that doesn't really help you. Perhaps a better approach would be to use the md_in_html extension and set markdown=block on the html and body tags. That should force them to be parsed as block level tags and allow the Markdown content within to be parsed as Markdown.
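The third-party extension route mentioned above might look roughly like the following. This is only a sketch: `HtmlBlockExtension` is a made-up name, and the only thing it does is add html to the instance's block-level list, so the caveat about the Markdown content inside still applies.

```python
import markdown
from markdown.extensions import Extension


class HtmlBlockExtension(Extension):
    """Hypothetical extension: treat <html> as a block-level element."""

    def extendMarkdown(self, md):
        # Skip the append if this release already lists 'html'.
        if 'html' not in md.block_level_elements:
            md.block_level_elements.append('html')


def makeExtension(**kwargs):
    # Entry point looked up when the extension is loaded by module name.
    return HtmlBlockExtension(**kwargs)


md = markdown.Markdown(extensions=[HtmlBlockExtension()])
```

Saved as an importable module (say html_block.py, a placeholder name), it should be loadable from the command line with something like markdown_py -x html_block bug-test.txt.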

As there is no bug, I am closing this as wontfix.

@waylan waylan added invalid Invalid report (user error, upstream issue, etc). 3rd-party Should be implemented as a third party extension. labels Nov 28, 2022
@kernc
Contributor Author

kernc commented Nov 28, 2022

Perhaps a better approach would be to use the md_in_html extension and set markdown=block on the html and body tags. That should force them to be parsed as block level tags and allow the Markdown content within to be parsed as Markdown.

If you meant it like this, with v3.4, it doesn't seem to work:

$ cat /tmp/test_case.md
<!DOCTYPE html>
<html markdown=block>
<body markdown=block>

Should be **bold**

(Note, <body> needs to be unindented to pick 0-column markdown.)

$ markdown_py -x md_in_html /tmp/test_case.md

<!DOCTYPE html>
<p><html markdown=block></p>
<body>
<p>Should be <strong>bold</strong></p>
</body>

It does work after also setting:

md.block_level_elements.append('html')

but this then can't be used from the command line.

Therefore, the entire document is expected to be within a body tag. That being the case, html and body tags should not be within Markdown content.

Wouldn't you add html to default BLOCK_LEVEL_ELEMENTS like body is?

# Other elements which Markdown should not be mucking up the contents of.
'canvas', 'colgroup', 'dd', 'body', 'dt', 'group', 'iframe', 'li', 'legend',

# Other elements which Markdown should not be mucking up the contents of.
'canvas', 'colgroup', 'dd', 'body', 'dt', 'group', 'iframe', 'li', 'legend',

(These are two separate lists? 🤔)

My use case is a simple .md file that compiles directly into the whole HTML page, circumventing the need for a separate templating engine.

@facelessuser
Collaborator

facelessuser commented Nov 28, 2022

I am curious if there is a reason to not just specify html by default? I do understand that it is stated to not work, but I wonder if there is a strong justification to not just add html?

@waylan
Member

waylan commented Nov 28, 2022

I wonder if there is a strong justification to not just add html?

The reason is as I stated earlier...

Markdown content is considered an HTML fragment, which is to be inserted within an HTML document. Therefore, the entire document is expected to be within a body tag. That being the case, html and body tags should not be within Markdown content.

Truth be told, I didn't remember that the body tag was included. I question whether it should be.

On the other hand, I agree that html and body tags are "elements which Markdown should not be mucking up the contents of" as noted in a code comment. So maybe we should add it.

(These are two separate lists? 🤔)

Yes, the old list was left at markdown/util.py to maintain backward compatibility as various older third party extensions have used it. Truth be told, we can probably remove it by now. I forgot it was still there.
