Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Placeholder in output for fenced code blocks nested in raw HTML blocks #458

Closed
mDuo13 opened this issue Mar 9, 2016 · 17 comments
Closed
Labels
bug Bug report. core Related to the core parser code. someday-maybe Approved low priority request.

Comments

@mDuo13
Copy link

mDuo13 commented Mar 9, 2016

When using the markdown.extensions.extra, syntax looking like a code fence inside of HTML block elements gets rendered as gibberish instead of text, unless the HTML block has markdown=1 applied.

This test case demonstrates what I mean:
test.md.txt

The output I get from parsing that page includes a section like this:

<div>

Save this file to `test.md` and then try this:


�wzxhzdk:0�


</div>

The exact contents of the gibberish change, but it typically seems to start with wz... which suggests to me that it's rendering something that's not meant to be a string as a string.

Same behavior on Python 3.5.1, using Python-Markdown 2.6.2 and Python-Markdown 2.6.5, as well as Python 2.7.11 using Python-Markdown 2.6.5. (On two different Arch Linux machines.)

@ryneeverett
Copy link
Contributor

Tried on markdown 2.6.5 with python 2.7 and 3.4 and couldn't reproduce.

@waylan
Copy link
Member

waylan commented Mar 10, 2016

Those "gibberish" strings are (internal) raw HTML placeholders. While they would ideally never appear in the output, in this case I don't really care much as the docs clearly state that Markdown is not processed inside raw HTML. Therefore you shouldn't be putting Markdown inside raw HTML to begin with (without `markdown=1').

That said, my best guess would be that the placeholder for the fenced code block is looked up first and not found in the document (as it isn't there). Then the placeholder for the raw HTML block is looked up, found and swapped out which contains the fenced code block placeholder. However, as that placeholder was previously not found it is't lookup up again, leaving it in the output. The fix would be to change the lookup order: perhaps the reserve of the order in which they were added to the stash (presumably @markdown/postprocessors.py#L55).

Although, if @ryneeverett can't reproduce the problem (I haven't checked myself), its not clear that that would solve the problem. Although, I see that after all the placeholders are processed they are passed to a re.sub call which uses a single regex to search for and replace all of them @markdown/postprocessors.py#L73). While certainly more performant, in this edge case, that may be a problem. Can't say I want to decrease performance for this edge case.

@mitya57
Copy link
Collaborator

mitya57 commented Mar 10, 2016

I also cannot reproduce this with Python-Markdown 2.6.5 or with current master (tested with Python 2.7 and 3.5).

@waylan
Copy link
Member

waylan commented Mar 11, 2016

I was just able to reproduce this on LinuxMint, Python 2.7.6 and Master. I noticed that @mDuo13 was using Arch Linux. Wonder if Linux has something to do with it...

>>> t = """
... <div>
... ```foo
... bar
... ```
... </div>
... """
>>> markdown.markdown(t, extensions=['fenced_code'])
u'<div>\n\n\x02wzxhzdk:0\x03\n\n</div>'

@waylan
Copy link
Member

waylan commented Mar 11, 2016

I forgot out 429cc98 but the problem persists before that commit also.

@waylan
Copy link
Member

waylan commented Mar 11, 2016

And reversing the placeholders makes no difference before or after 429cc98...

All I can say is: Don't do that!

As I stated before, I don't really care about this as it is clearly invalid input. I'm willing to accept a PR if someone wants to work through it, but I'm not motivated to address it. Especially when the entire mechanism for dealing with both raw HTML and fenced code blocks will be different in version 3.0. My time will be better spent there.

@waylan waylan added the someday-maybe Approved low priority request. label May 25, 2016
@facelessuser
Copy link
Collaborator

The only thing you could really do here is save away fenced block original text on conversion and as you parse raw HTML, replace any placeholders you come across with the original text.

I did this in SuperFences to restore fenced syntax found in indented code blocks as it is hard to prevent that in the preprocessor. I never bothered with HTML in this scenario because I just don't put fenced block syntax in HTML unless I enable markdown="1". But fenced content in an indented block seemed more likely, even if only by accidentally indenting too much.

I'm not sure if it's worth the work, but it is possible to fix this. Out of all the syntax out of the box, it is really only fenced blocks that currently cause this to happen. If we actually want a solution for this, I can put one together, but basically, it would involve having a stash of pre-converted fenced block text, and as a block of HTML is identified, you would have to search that HTML block for placeholders and replace them with the original text.

@waylan
Copy link
Member

waylan commented Jan 25, 2017

So I was thinking about this problem on my walk into the office this morning and realized there was a simple solution to the general problem (the output contains placeholders). In the postprocessor we iterate over the list of placeholders in the stash and do a search-and-replace on the output text. If an earlier placeholder is contained in the stashed text of a later placeholder (the case here), then the placeholder is never found. Simply reversing the list of placeholders before iteration would resolve this problem. As the list retains the order in which items are stashed, this should eliminate the general problem of outputing placeholders.

But that doesn't solve the specific problem reported here.

The only thing you could really do here is save away fenced block original text on conversion and as you parse raw HTML, replace any placeholders you come across with the original text.

Exactly. As we don't do that, then we would have parsed Markdown inside raw HTML. Given it is an edge case we could ignore the inconsistency, but what if the raw HTML is something like this:

<pre><code>
```
# A fenced code block in a raw code block
```
</code></pre>

Regardless, the general problem should still be fixed. We should never be returning output with placeholders. The internal implementation details should be transparent to the end user.

@facelessuser
Copy link
Collaborator

As long as we don't throw away the original text and cache it, we should always be able to restore it. I figure fences could cache the text, and when raw html is being processed, if markdown="1" isn't enabled, we can just scan for placeholders and replace the greedily converted regions before we store the HTML.

This is all just a bandaid, but given the current state of fenced code, I don't see any more eloquent solutions.

@facelessuser
Copy link
Collaborator

One thing to note, if the false fence contains tags, I imagine this can throw off raw html parsing. So I don't think we can ever 100% fix all issues unless preprocessor events serialized things differently. Like each preprocessor is fed a line to determine if it wants to perform anything on the document from that line on. Then that chunk is given to it and processed.

@facelessuser
Copy link
Collaborator

Then you could essentially make raw HTML parsing and fences both be preprocessors that could avoid stepping all over each other. But this is beyond the scope of 2.6. I also don't know if that would when be ideal as I haven't considered it more than in passing.

@waylan
Copy link
Member

waylan commented Jan 25, 2017

Right. In an ideal world, both raw HTML and Fenced Code Blocks would be parsed by blockprocessors. With the refactor planned for version 3.0, this is what I had intended. Except for the problems with fenced code blocks I see 2.x as stable enough that I would be happy leaving it in maintenance mode indefinitely (fixing minor bugs only). But we really should have better support for fenced code blocks even if the 3.0 refactor is a lot of work. But my time is way to limited to work on 3.0 in earnest so the ETA is unknown.

@facelessuser
Copy link
Collaborator

Understood. I feel like we could maybe pull it off if the raw HTML parser was rewritten (for performance reasons, I think this would be good anyways). It needs to process the block line by line and unpack stashed text as the original text as it encounters them (if not within a markdown="1" tag). Ideally it would track HTML tag nesting depth to prevent it from back tracking and re-evaluating text it as already processed. It is a bit of work. But with time, I think it can be pulled off.

I probably couldn't get to it for for a while, but it sounds like 3.0 might be further off than that, so as I get a chance, I can try and tackle this. I won't commit to anything, but I can probably resolve this without affecting the existing API. At most we would need to maybe have a general cache that an extension can store original text in. That cache could take a placeholder and find the original text in the cache. Out of the box only fenced blocks would use this, but if someone had some syntax like fenced blocks, they could probably leverage it as well, but mainly we would just be trying to finish getting 2.6 solid. If another ambitious person gets to it before me, that is fine too :).

@waylan
Copy link
Member

waylan commented Jan 25, 2017

Some years ago I built a preprocessor which used a proper HTML parser under the hood. However, that HTML parser was the one in the python standard library which (at the time) crashed hard on invalid HTML with no way to recover. It solved a bunch of problems with the current approach, but I abandoned it because of the inability to handle invalid HTML (not to mention valid Markdown like <http://example.com>). More recently the standard library HTML parser has been updated and can handle those sorts of things relatively gracefully. Unfortunately, my old work is lost to time. I did try to recreate it and have just enough to convey the concept here: https://gist.github.com/waylan/84eadbf6873965886a16. It is very rough and has some obvious issues, but I suspect it could be made to work with less work that rebuilding the current preprocessor from scratch. If anyone is considering refactoring the 2.6 raw HTML handling, this may be a place to start.

@facelessuser
Copy link
Collaborator

Thanks for the heads up. They may help.

@waylan
Copy link
Member

waylan commented Jan 25, 2017

Back to the issue reported here...

Simply reversing the list of placeholders before iteration would resolve this problem. As the list retains the order in which items are stashed, this should eliminate the general problem of outputing placeholders.

Except that I forgot about 429cc98 (discussed in #454). That made a big performance improvement but it created this bug (leaving placeholders in the output). Before 429cc98, each placeholder was replaced as a separate string replacement. But now we generate a single regex and run it with re.sub, which only finds non-overlapping matches. The failure happens when one placeholder is in another (overlapping).

Working out a performat way to handle overlapping matches would at least prevent any placeholders from sneaking through to the final output.

asfgit pushed a commit to apache/allura that referenced this issue Oct 12, 2017
Would be good to upgrade much further, but there are various regressions,
mostly with markdown internal placeholders showing up in the output, which is
no good.  This starts at 2.3 and perhaps is because of our own extensions
needing update, but also I think due to handling invalid markdown (e.g.
markdown within html) differently -- but I don't want that to break.  More info
at Python-Markdown/markdown#458 and
Python-Markdown/markdown#222
@waylan waylan changed the title markdown.extensions.extra - Code fence syntax turns into gibberish when nested in non-Markdown HTML blocks Placeholder in output for fenced code blocks nested in raw HTML blocks Jul 25, 2018
@waylan waylan added bug Bug report. confirmed Confirmed bug report or approved feature request. core Related to the core parser code. and removed confirmed Confirmed bug report or approved feature request. labels Oct 23, 2018
@waylan waylan closed this as completed in 12864d2 Feb 7, 2019
@mDuo13
Copy link
Author

mDuo13 commented Feb 8, 2019

🎉 Looking forward to trying the fix in the next release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report. core Related to the core parser code. someday-maybe Approved low priority request.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants