Placeholder in output for fenced code blocks nested in raw HTML blocks #458

mDuo13 · 2016-03-09T22:40:31Z

When using the markdown.extensions.extra, syntax looking like a code fence inside of HTML block elements gets rendered as gibberish instead of text, unless the HTML block has markdown=1 applied.

This test case demonstrates what I mean:
test.md.txt

The output I get from parsing that page includes a section like this:

<div>

Save this file to `test.md` and then try this:


�wzxhzdk:0�


</div>

The exact contents of the gibberish change, but it typically seems to start with wz... which suggests to me that it's rendering something that's not meant to be a string as a string.

Same behavior on Python 3.5.1, using Python-Markdown 2.6.2 and Python-Markdown 2.6.5, as well as Python 2.7.11 using Python-Markdown 2.6.5. (On two different Arch Linux machines.)

The text was updated successfully, but these errors were encountered:

ryneeverett · 2016-03-10T06:17:52Z

Tried on markdown 2.6.5 with python 2.7 and 3.4 and couldn't reproduce.

waylan · 2016-03-10T15:28:33Z

Those "gibberish" strings are (internal) raw HTML placeholders. While they would ideally never appear in the output, in this case I don't really care much as the docs clearly state that Markdown is not processed inside raw HTML. Therefore you shouldn't be putting Markdown inside raw HTML to begin with (without `markdown=1').

That said, my best guess would be that the placeholder for the fenced code block is looked up first and not found in the document (as it isn't there). Then the placeholder for the raw HTML block is looked up, found and swapped out which contains the fenced code block placeholder. However, as that placeholder was previously not found it is't lookup up again, leaving it in the output. The fix would be to change the lookup order: perhaps the reserve of the order in which they were added to the stash (presumably @markdown/postprocessors.py#L55).

Although, if @ryneeverett can't reproduce the problem (I haven't checked myself), its not clear that that would solve the problem. Although, I see that after all the placeholders are processed they are passed to a re.sub call which uses a single regex to search for and replace all of them @markdown/postprocessors.py#L73). While certainly more performant, in this edge case, that may be a problem. Can't say I want to decrease performance for this edge case.

mitya57 · 2016-03-10T18:24:32Z

I also cannot reproduce this with Python-Markdown 2.6.5 or with current master (tested with Python 2.7 and 3.5).

waylan · 2016-03-11T22:57:25Z

I was just able to reproduce this on LinuxMint, Python 2.7.6 and Master. I noticed that @mDuo13 was using Arch Linux. Wonder if Linux has something to do with it...

>>> t = """
... <div>
... ```foo
... bar
... ```
... </div>
... """
>>> markdown.markdown(t, extensions=['fenced_code'])
u'<div>\n\n\x02wzxhzdk:0\x03\n\n</div>'

waylan · 2016-03-11T23:12:34Z

I forgot out 429cc98 but the problem persists before that commit also.

waylan · 2016-03-11T23:24:12Z

And reversing the placeholders makes no difference before or after 429cc98...

All I can say is: Don't do that!

As I stated before, I don't really care about this as it is clearly invalid input. I'm willing to accept a PR if someone wants to work through it, but I'm not motivated to address it. Especially when the entire mechanism for dealing with both raw HTML and fenced code blocks will be different in version 3.0. My time will be better spent there.

facelessuser · 2017-01-24T06:50:06Z

The only thing you could really do here is save away fenced block original text on conversion and as you parse raw HTML, replace any placeholders you come across with the original text.

I did this in SuperFences to restore fenced syntax found in indented code blocks as it is hard to prevent that in the preprocessor. I never bothered with HTML in this scenario because I just don't put fenced block syntax in HTML unless I enable markdown="1". But fenced content in an indented block seemed more likely, even if only by accidentally indenting too much.

I'm not sure if it's worth the work, but it is possible to fix this. Out of all the syntax out of the box, it is really only fenced blocks that currently cause this to happen. If we actually want a solution for this, I can put one together, but basically, it would involve having a stash of pre-converted fenced block text, and as a block of HTML is identified, you would have to search that HTML block for placeholders and replace them with the original text.

waylan · 2017-01-25T14:09:17Z

So I was thinking about this problem on my walk into the office this morning and realized there was a simple solution to the general problem (the output contains placeholders). In the postprocessor we iterate over the list of placeholders in the stash and do a search-and-replace on the output text. If an earlier placeholder is contained in the stashed text of a later placeholder (the case here), then the placeholder is never found. Simply reversing the list of placeholders before iteration would resolve this problem. As the list retains the order in which items are stashed, this should eliminate the general problem of outputing placeholders.

But that doesn't solve the specific problem reported here.

The only thing you could really do here is save away fenced block original text on conversion and as you parse raw HTML, replace any placeholders you come across with the original text.

Exactly. As we don't do that, then we would have parsed Markdown inside raw HTML. Given it is an edge case we could ignore the inconsistency, but what if the raw HTML is something like this:

<pre><code>
```
# A fenced code block in a raw code block
```
</code></pre>

Regardless, the general problem should still be fixed. We should never be returning output with placeholders. The internal implementation details should be transparent to the end user.

facelessuser · 2017-01-25T14:33:05Z

As long as we don't throw away the original text and cache it, we should always be able to restore it. I figure fences could cache the text, and when raw html is being processed, if markdown="1" isn't enabled, we can just scan for placeholders and replace the greedily converted regions before we store the HTML.

This is all just a bandaid, but given the current state of fenced code, I don't see any more eloquent solutions.

facelessuser · 2017-01-25T14:48:38Z

One thing to note, if the false fence contains tags, I imagine this can throw off raw html parsing. So I don't think we can ever 100% fix all issues unless preprocessor events serialized things differently. Like each preprocessor is fed a line to determine if it wants to perform anything on the document from that line on. Then that chunk is given to it and processed.

facelessuser · 2017-01-25T14:52:36Z

Then you could essentially make raw HTML parsing and fences both be preprocessors that could avoid stepping all over each other. But this is beyond the scope of 2.6. I also don't know if that would when be ideal as I haven't considered it more than in passing.

waylan · 2017-01-25T14:57:21Z

Right. In an ideal world, both raw HTML and Fenced Code Blocks would be parsed by blockprocessors. With the refactor planned for version 3.0, this is what I had intended. Except for the problems with fenced code blocks I see 2.x as stable enough that I would be happy leaving it in maintenance mode indefinitely (fixing minor bugs only). But we really should have better support for fenced code blocks even if the 3.0 refactor is a lot of work. But my time is way to limited to work on 3.0 in earnest so the ETA is unknown.

facelessuser · 2017-01-25T16:22:53Z

Understood. I feel like we could maybe pull it off if the raw HTML parser was rewritten (for performance reasons, I think this would be good anyways). It needs to process the block line by line and unpack stashed text as the original text as it encounters them (if not within a markdown="1" tag). Ideally it would track HTML tag nesting depth to prevent it from back tracking and re-evaluating text it as already processed. It is a bit of work. But with time, I think it can be pulled off.

I probably couldn't get to it for for a while, but it sounds like 3.0 might be further off than that, so as I get a chance, I can try and tackle this. I won't commit to anything, but I can probably resolve this without affecting the existing API. At most we would need to maybe have a general cache that an extension can store original text in. That cache could take a placeholder and find the original text in the cache. Out of the box only fenced blocks would use this, but if someone had some syntax like fenced blocks, they could probably leverage it as well, but mainly we would just be trying to finish getting 2.6 solid. If another ambitious person gets to it before me, that is fine too :).

waylan · 2017-01-25T18:14:57Z

Some years ago I built a preprocessor which used a proper HTML parser under the hood. However, that HTML parser was the one in the python standard library which (at the time) crashed hard on invalid HTML with no way to recover. It solved a bunch of problems with the current approach, but I abandoned it because of the inability to handle invalid HTML (not to mention valid Markdown like <http://example.com>). More recently the standard library HTML parser has been updated and can handle those sorts of things relatively gracefully. Unfortunately, my old work is lost to time. I did try to recreate it and have just enough to convey the concept here: https://gist.github.com/waylan/84eadbf6873965886a16. It is very rough and has some obvious issues, but I suspect it could be made to work with less work that rebuilding the current preprocessor from scratch. If anyone is considering refactoring the 2.6 raw HTML handling, this may be a place to start.

facelessuser · 2017-01-25T18:20:17Z

Thanks for the heads up. They may help.

waylan · 2017-01-25T22:41:42Z

Back to the issue reported here...

Simply reversing the list of placeholders before iteration would resolve this problem. As the list retains the order in which items are stashed, this should eliminate the general problem of outputing placeholders.

Except that I forgot about 429cc98 (discussed in #454). That made a big performance improvement but it created this bug (leaving placeholders in the output). Before 429cc98, each placeholder was replaced as a separate string replacement. But now we generate a single regex and run it with re.sub, which only finds non-overlapping matches. The failure happens when one placeholder is in another (overlapping).

Working out a performat way to handle overlapping matches would at least prevent any placeholders from sneaking through to the final output.

Would be good to upgrade much further, but there are various regressions, mostly with markdown internal placeholders showing up in the output, which is no good. This starts at 2.3 and perhaps is because of our own extensions needing update, but also I think due to handling invalid markdown (e.g. markdown within html) differently -- but I don't want that to break. More info at Python-Markdown/markdown#458 and Python-Markdown/markdown#222

mDuo13 · 2019-02-08T23:16:24Z

🎉 Looking forward to trying the fix in the next release.

waylan added the someday-maybe Approved low priority request. label May 25, 2016

facelessuser mentioned this issue Aug 9, 2017

Strange data corruption with markdown.extensions.extra #576

Closed

waylan changed the title ~~markdown.extensions.extra - Code fence syntax turns into gibberish when nested in non-Markdown HTML blocks~~ Placeholder in output for fenced code blocks nested in raw HTML blocks Jul 25, 2018

waylan added bug Bug report. confirmed Confirmed bug report or approved feature request. core Related to the core parser code. and removed confirmed Confirmed bug report or approved feature request. labels Oct 23, 2018

PhilipTrauner mentioned this issue Dec 28, 2018

Correctly handle overlapping raw HTML matches #769

Closed

waylan closed this as completed in 12864d2 Feb 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Placeholder in output for fenced code blocks nested in raw HTML blocks #458

Placeholder in output for fenced code blocks nested in raw HTML blocks #458

mDuo13 commented Mar 9, 2016

ryneeverett commented Mar 10, 2016

waylan commented Mar 10, 2016

mitya57 commented Mar 10, 2016

waylan commented Mar 11, 2016

waylan commented Mar 11, 2016

waylan commented Mar 11, 2016

facelessuser commented Jan 24, 2017

waylan commented Jan 25, 2017 •

edited

Loading

facelessuser commented Jan 25, 2017

facelessuser commented Jan 25, 2017

facelessuser commented Jan 25, 2017

waylan commented Jan 25, 2017

facelessuser commented Jan 25, 2017

waylan commented Jan 25, 2017

facelessuser commented Jan 25, 2017

waylan commented Jan 25, 2017 •

edited

Loading

mDuo13 commented Feb 8, 2019

Placeholder in output for fenced code blocks nested in raw HTML blocks #458

Placeholder in output for fenced code blocks nested in raw HTML blocks #458

Comments

mDuo13 commented Mar 9, 2016

ryneeverett commented Mar 10, 2016

waylan commented Mar 10, 2016

mitya57 commented Mar 10, 2016

waylan commented Mar 11, 2016

waylan commented Mar 11, 2016

waylan commented Mar 11, 2016

facelessuser commented Jan 24, 2017

waylan commented Jan 25, 2017 • edited Loading

facelessuser commented Jan 25, 2017

facelessuser commented Jan 25, 2017

facelessuser commented Jan 25, 2017

waylan commented Jan 25, 2017

facelessuser commented Jan 25, 2017

waylan commented Jan 25, 2017

facelessuser commented Jan 25, 2017

waylan commented Jan 25, 2017 • edited Loading

mDuo13 commented Feb 8, 2019

waylan commented Jan 25, 2017 •

edited

Loading

waylan commented Jan 25, 2017 •

edited

Loading