Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unclosed script and style tags cause data loss #1036

Closed
antirais opened this issue Oct 9, 2020 · 7 comments · Fixed by #1038
Closed

Unclosed script and style tags cause data loss #1036

antirais opened this issue Oct 9, 2020 · 7 comments · Fixed by #1038
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request. core Related to the core parser code.

Comments

@antirais
Copy link

antirais commented Oct 9, 2020

Markdown 3.3 does not properly generate HTML for `<script>`
markdown:

Current output:

Python 3.8.6 (default, Sep 30 2020, 04:00:38)

In [1]: import markdown

In [2]: markdown.__version__
Out[2]: '3.3'

In [3]: markdown.markdown('test: `test`')
Out[3]: '<p>test: <code>test</code></p>'

In [4]: markdown.markdown('test: `<script>`')
Out[4]: '<p>test: `<script></p>'

Expected output:

In [1]: import markdown

In [2]: markdown.__version__
Out[2]: '3.2.2'

In [3]: markdown.markdown('test: `test`')
Out[3]: '<p>test: <code>test</code></p>'

In [4]: markdown.markdown('test: `<script>`')
Out[4]: '<p>test: <code>&lt;script&gt;</code></p>'
@waylan waylan added bug Bug report. confirmed Confirmed bug report or approved feature request. core Related to the core parser code. labels Oct 9, 2020
@waylan
Copy link
Member

waylan commented Oct 9, 2020

Thanks for the report. I didn't check every block-level tag, but of those I did test, this only occurs for <script> and <style> tags.

@waylan
Copy link
Member

waylan commented Oct 9, 2020

The issue is different than I originally expected. It turns out that unless there is a closing <script> or <style> tag, the HTML parser eats all content after the opening tag. Of course, for the simple examples provided, that means the closing backtick is lost. But in more complex examples we get complete loss of the rest of the document.

>>> src = '''
... Some text `<script>` more text.
...
... A separate paragraph.
... '''
>>> markdown.markdown(src)
'<p>Some text `<script></p>'

@waylan waylan changed the title Backtick does not handle script tags properly Unclosed script and style tags cause data loss Oct 9, 2020
@facelessuser
Copy link
Collaborator

Oh, that's not good :(.

@waylan
Copy link
Member

waylan commented Oct 9, 2020

For comparison, see this example which includes a closing tag:

>>> src = '''
... Some text `<script>` more text.
...
... A separate paragraph with a closing `</script>` tag.
... '''
>>> markdown.markdown(src)
'<p>Some text <code>&lt;script&gt;</code> more text.</p>\n<p>A separate paragraph with a closing <code>&lt;/script&gt;</code> tag.</p>'

It appears that the underlying parser is withholding what is perceives to be content of the script tag until it finds the closing tag. As long as it never finds one, it never passes the content on. Another block-level tag doesn't even alter that behavior.

>>> markdown.markdown('foo `<script>` bar `<div>`')
'<p>foo `<script></p>'

I'm afraid we may need to override some of the parent class to fix this one. 😬

@waylan
Copy link
Member

waylan commented Oct 9, 2020

I went spelunking in the code and it appears that the problem is related to html/parser.py#L345-L346

if tag in self.CDATA_CONTENT_ELEMENTS:
    self.set_cdata_mode(tag)

CDATA_CONTENT_ELEMENTS is defined in html/parser.py#L84

CDATA_CONTENT_ELEMENTS = ("script", "style")

which confirms that this only applies to script and style tags.

It appears that the primary difference with cdata_mode is that escaping is different. However, when the end of the document is reached (EOF) the parser is supposed to output all remaining rawdata. However, the if statement for that doesn't include a branch for cdata_mode. See html/parser.py#L244-L250

if end and i < n and not self.cdata_elem:
    if self.convert_charrefs and not self.cdata_elem:
        self.handle_data(unescape(rawdata[i:n]))
   else:
       self.handle_data(rawdata[i:n])
   i = self.updatepos(i, n)
self.rawdata = rawdata[i:]

I suspect that this is an upstream bug, After all, the nested check for not self.cdata_elem (on L245) is completely unnecessary. Looks like someone overlooked a scenario. However, it is not (yet) clear to me if the check should be removed from the outer if statement or if an else should be tacked on the end of that outer if statement, or something else completely. I expect some testing is in order.

@waylan
Copy link
Member

waylan commented Oct 9, 2020

There are two related issues here:

  1. The upstream htmlparser is eating any content after the opening script or style tag which is unclosed.
  2. Even when a closing tag exists, the parser is handling escaping differently for all content between the tags.

Note that the current behavior in issue 2 is desired when the tag is part of an HTML block. However, when we have standalone tags in code spans (for example: `<script>` and `</script>`) we don't want the default escaping behavior. That should be easy enough to address in our parser subclass. When we detect if a starttag actually starts a new HTML block (at start of line) we can 'turn off' cdata_mode (with a call to self.clear_cdata_mode()) if the tag does not start a new HTML block or is not a child of an HTML block.

I expect that 'fix' will have the effect of avoiding issue 1, except when the document contains an actual HTML block with a missing endtag. We don't guarantee good output from invalid HTML, but we generally try to avoid data loss. So, in the end, issue 1 needs to be fixed as well.

@waylan
Copy link
Member

waylan commented Oct 10, 2020

I opened an issue upstream at https://bugs.python.org/issue41989 related to the dataloss and a PR has been submitted at python/cpython#22658

waylan added a commit to waylan/markdown that referenced this issue Oct 12, 2020
waylan added a commit to waylan/markdown that referenced this issue Oct 12, 2020
waylan added a commit that referenced this issue Oct 12, 2020
* Ensure unclosed script tags are parsed correctly by providing a workaround for https://bugs.python.org/issue41989.
* Avoid cdata_mode outside of HTML blocks, such as in inline code spans.

Fixes #1036.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request. core Related to the core parser code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants