-
Notifications
You must be signed in to change notification settings - Fork 862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Placeholder in output for fenced code blocks nested in raw HTML blocks #458
Comments
Tried on markdown 2.6.5 with python 2.7 and 3.4 and couldn't reproduce. |
Those "gibberish" strings are (internal) raw HTML placeholders. While they would ideally never appear in the output, in this case I don't really care much as the docs clearly state that Markdown is not processed inside raw HTML. Therefore you shouldn't be putting Markdown inside raw HTML to begin with (without `markdown=1'). That said, my best guess would be that the placeholder for the fenced code block is looked up first and not found in the document (as it isn't there). Then the placeholder for the raw HTML block is looked up, found and swapped out which contains the fenced code block placeholder. However, as that placeholder was previously not found it is't lookup up again, leaving it in the output. The fix would be to change the lookup order: perhaps the reserve of the order in which they were added to the stash (presumably @markdown/postprocessors.py#L55). Although, if @ryneeverett can't reproduce the problem (I haven't checked myself), its not clear that that would solve the problem. Although, I see that after all the placeholders are processed they are passed to a |
I also cannot reproduce this with Python-Markdown 2.6.5 or with current master (tested with Python 2.7 and 3.5). |
I was just able to reproduce this on LinuxMint, Python 2.7.6 and Master. I noticed that @mDuo13 was using Arch Linux. Wonder if Linux has something to do with it... >>> t = """
... <div>
... ```foo
... bar
... ```
... </div>
... """
>>> markdown.markdown(t, extensions=['fenced_code'])
u'<div>\n\n\x02wzxhzdk:0\x03\n\n</div>' |
I forgot out 429cc98 but the problem persists before that commit also. |
And reversing the placeholders makes no difference before or after 429cc98... All I can say is: Don't do that! As I stated before, I don't really care about this as it is clearly invalid input. I'm willing to accept a PR if someone wants to work through it, but I'm not motivated to address it. Especially when the entire mechanism for dealing with both raw HTML and fenced code blocks will be different in version 3.0. My time will be better spent there. |
The only thing you could really do here is save away fenced block original text on conversion and as you parse raw HTML, replace any placeholders you come across with the original text. I did this in SuperFences to restore fenced syntax found in indented code blocks as it is hard to prevent that in the preprocessor. I never bothered with HTML in this scenario because I just don't put fenced block syntax in HTML unless I enable I'm not sure if it's worth the work, but it is possible to fix this. Out of all the syntax out of the box, it is really only fenced blocks that currently cause this to happen. If we actually want a solution for this, I can put one together, but basically, it would involve having a stash of pre-converted fenced block text, and as a block of HTML is identified, you would have to search that HTML block for placeholders and replace them with the original text. |
So I was thinking about this problem on my walk into the office this morning and realized there was a simple solution to the general problem (the output contains placeholders). In the postprocessor we iterate over the list of placeholders in the stash and do a search-and-replace on the output text. If an earlier placeholder is contained in the stashed text of a later placeholder (the case here), then the placeholder is never found. Simply reversing the list of placeholders before iteration would resolve this problem. As the list retains the order in which items are stashed, this should eliminate the general problem of outputing placeholders. But that doesn't solve the specific problem reported here.
Exactly. As we don't do that, then we would have parsed Markdown inside raw HTML. Given it is an edge case we could ignore the inconsistency, but what if the raw HTML is something like this:
Regardless, the general problem should still be fixed. We should never be returning output with placeholders. The internal implementation details should be transparent to the end user. |
As long as we don't throw away the original text and cache it, we should always be able to restore it. I figure fences could cache the text, and when raw html is being processed, if This is all just a bandaid, but given the current state of fenced code, I don't see any more eloquent solutions. |
One thing to note, if the false fence contains tags, I imagine this can throw off raw html parsing. So I don't think we can ever 100% fix all issues unless preprocessor events serialized things differently. Like each preprocessor is fed a line to determine if it wants to perform anything on the document from that line on. Then that chunk is given to it and processed. |
Then you could essentially make raw HTML parsing and fences both be preprocessors that could avoid stepping all over each other. But this is beyond the scope of 2.6. I also don't know if that would when be ideal as I haven't considered it more than in passing. |
Right. In an ideal world, both raw HTML and Fenced Code Blocks would be parsed by blockprocessors. With the refactor planned for version 3.0, this is what I had intended. Except for the problems with fenced code blocks I see 2.x as stable enough that I would be happy leaving it in maintenance mode indefinitely (fixing minor bugs only). But we really should have better support for fenced code blocks even if the 3.0 refactor is a lot of work. But my time is way to limited to work on 3.0 in earnest so the ETA is unknown. |
Understood. I feel like we could maybe pull it off if the raw HTML parser was rewritten (for performance reasons, I think this would be good anyways). It needs to process the block line by line and unpack stashed text as the original text as it encounters them (if not within a I probably couldn't get to it for for a while, but it sounds like 3.0 might be further off than that, so as I get a chance, I can try and tackle this. I won't commit to anything, but I can probably resolve this without affecting the existing API. At most we would need to maybe have a general cache that an extension can store original text in. That cache could take a placeholder and find the original text in the cache. Out of the box only fenced blocks would use this, but if someone had some syntax like fenced blocks, they could probably leverage it as well, but mainly we would just be trying to finish getting 2.6 solid. If another ambitious person gets to it before me, that is fine too :). |
Some years ago I built a preprocessor which used a proper HTML parser under the hood. However, that HTML parser was the one in the python standard library which (at the time) crashed hard on invalid HTML with no way to recover. It solved a bunch of problems with the current approach, but I abandoned it because of the inability to handle invalid HTML (not to mention valid Markdown like |
Thanks for the heads up. They may help. |
Back to the issue reported here...
Except that I forgot about 429cc98 (discussed in #454). That made a big performance improvement but it created this bug (leaving placeholders in the output). Before 429cc98, each placeholder was replaced as a separate string replacement. But now we generate a single regex and run it with re.sub, which only finds non-overlapping matches. The failure happens when one placeholder is in another (overlapping). Working out a performat way to handle overlapping matches would at least prevent any placeholders from sneaking through to the final output. |
Would be good to upgrade much further, but there are various regressions, mostly with markdown internal placeholders showing up in the output, which is no good. This starts at 2.3 and perhaps is because of our own extensions needing update, but also I think due to handling invalid markdown (e.g. markdown within html) differently -- but I don't want that to break. More info at Python-Markdown/markdown#458 and Python-Markdown/markdown#222
🎉 Looking forward to trying the fix in the next release. |
When using the
markdown.extensions.extra
, syntax looking like a code fence inside of HTML block elements gets rendered as gibberish instead of text, unless the HTML block hasmarkdown=1
applied.This test case demonstrates what I mean:
test.md.txt
The output I get from parsing that page includes a section like this:
The exact contents of the gibberish change, but it typically seems to start with wz... which suggests to me that it's rendering something that's not meant to be a string as a string.
Same behavior on Python 3.5.1, using Python-Markdown 2.6.2 and Python-Markdown 2.6.5, as well as Python 2.7.11 using Python-Markdown 2.6.5. (On two different Arch Linux machines.)
The text was updated successfully, but these errors were encountered: