Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug?] CommonMark Markdown Table cells are not CommonMark leaf blocks #66

Open
schlichtanders opened this issue Nov 9, 2023 · 23 comments

Comments

@schlichtanders
Copy link

schlichtanders commented Nov 9, 2023

better example

it turned out the actual confusion is better grasped by different examples
see #66 (comment)

original example

given the following text

html_str = "<bond def=\"mu\" unique_id=\"zOnak3/Y4WZl\"><input type='range' min='1' max='5' value='3'/><script>const input_el = currentScript.previousElementSibling; const output_el = currentScript.nextElementSibling; const displays = [\"-3\", \"-1\", \"0\", \"1\", \"3\"]; input_el.addEventListener(\"input\", () => {output_el.value = displays[input_el.valueAsNumber - 1]})</script><output style=' font-family: system-ui; font-size: 15px; margin-left: 3px; transform: translateY(-4px); display: inline-block;'>0</output></bond>"

it works perfectly in HTML, but CommonMark exchanges symbols such that the script does no longer work

"<p><bond def=\"mu\" unique_id=\"zOnak3/Y4WZl\"><input type='range' min='1' max='5' value='3'/><script>const input_el = currentScript.previousElementSibling; const output_el = currentScript.nextElementSibling; const displays = [&quot;-3&quot;, &quot;-1&quot;, &quot;0&quot;, &quot;1&quot;, &quot;3&quot;]; input_el.addEventListener(&quot;input&quot;, () =&gt; {output_el.value = displays[input_el.valueAsNumber - 1]});</script><output style=' font-family: system-ui; font-size: 15px; margin-left: 3px; transform: translateY(-4px); display: inline-block;'>0</output></bond></p> "

EDIT: the above does not parse because the html tag <bond> is non-standard. It is confusing example. Better look at #66 (comment)

I am using the following parser

import CommonMark
parser = CommonMark.Parser()
CommonMark.enable!(parser, CommonMark.DollarMathRule())
CommonMark.enable!(parser, CommonMark.TableRule())
parser(html_str)

Motivation

I am using CommonMark to create Pluto interface for Python. There the most intuitive way for interpolation is to interpolate valid html text into the plain markdown string. Hence it would be really really great, if <script> tags could be supported (they appear already in PlutoUI.Slider for instance)

@MichaelHatherly
Copy link
Owner

CommonMark (the spec) has got some specific rules for handling of blocks of HTML embedded in markdown, which is what I'vve followed in this implementation. I'd suggest splitting the embedded HTML over several lines, e.g.

julia> html_str_2 = """
       <bond def="mu" unique_id="zOnak3/Y4WZl">
       <input type='range' min='1' max='5' value='3'/>
       <script>
       const input_el = currentScript.previousElementSibling;
       const output_el = currentScript.nextElementSibling;
       const displays = ["-3", "-1", "0", "1", "3"];
       input_el.addEventListener("input", () => {output_el.value = displays[input_el.valueAsNumber - 1]})
       </script>
       <output style=' font-family: system-ui; font-size: 15px; margin-left: 3px; transform: translateY(-4px); display: inline-block;'>0</output>
       </bond>
       """
"<bond def=\"mu\" unique_id=\"zOnak3/Y4WZl\">\n<input type='range' min='1' max='5' value='3'/>\n<script>\nconst input_el = currentScript.previousElementSibling;\nconst output_el = currentScript.nextElementSibling;\nconst displays = [\"-3\", \"-1\", \"0\", \"1\", \"3\"];\ninput_el.addEventListener(\"input\", () => {output_el.value = displays[input_el.valueAsNumber - 1]})\n</script>\n<output style=' font-family: system-ui; font-size: 15px; margin-left: 3px; transform: translateY(-4px); display: inline-block;'>0</output>\n</bond>\n"

julia> parser(html_str_2)
 <bond def="mu" unique_id="zOnak3/Y4WZl">
 <input type='range' min='1' max='5' value='3'/>
 <script>
 const input_el = currentScript.previousElementSibling;
 const output_el = currentScript.nextElementSibling;
 const displays = ["-3", "-1", "0", "1", "3"];
 input_el.addEventListener("input", () => {output_el.value = displays[input_el.valueAsNumber - 1]})
 </script>
 <output style=' font-family: system-ui; font-size: 15px; margin-left: 3px; transform: translateY(-4px); display: inline-block;'>0</output>
 </bond>


julia> html(ans)
"<bond def=\"mu\" unique_id=\"zOnak3/Y4WZl\">\n<input type='range' min='1' max='5' value='3'/>\n<script>\nconst input_el = currentScript.previousElementSibling;\nconst output_el = currentScript.nextElementSibling;\nconst displays = [\"-3\", \"-1\", \"0\", \"1\", \"3\"];\ninput_el.addEventListener(\"input\", () => {output_el.value = displays[input_el.valueAsNumber - 1]})\n</script>\n<output style=' font-family: system-ui; font-size: 15px; margin-left: 3px; transform: translateY(-4px); display: inline-block;'>0</output>\n</bond>\n"

julia> print(ans)
<bond def="mu" unique_id="zOnak3/Y4WZl">
<input type='range' min='1' max='5' value='3'/>
<script>
const input_el = currentScript.previousElementSibling;
const output_el = currentScript.nextElementSibling;
const displays = ["-3", "-1", "0", "1", "3"];
input_el.addEventListener("input", () => {output_el.value = displays[input_el.valueAsNumber - 1]})
</script>
<output style=' font-family: system-ui; font-size: 15px; margin-left: 3px; transform: translateY(-4px); display: inline-block;'>0</output>
</bond>

which, I think, might be closer to what you want.

@schlichtanders
Copy link
Author

This is surprising indeed!

I just made everything into a single line to be able to interpolate it inside a Markdown Table (which will obviously break with newlines)

@MichaelHatherly
Copy link
Owner

Yeah, behaviour is looking pretty consistent with https://spec.commonmark.org/dingus/, so there's not much that we can really change in this package since it appears (at least from what I've tried) to be the same as the spec.

I just made everything into a single line to be able to interpolate it inside a Markdown Table (which will obviously break with newlines)

Hmm, yeah, tricky. There's really just a limit to the complexity that commonmark can really handle before it ends up not really being commonmark any more.

@MichaelHatherly
Copy link
Owner

https://spec.commonmark.org/0.30/#html-blocks for the link to the HTML block spec that is causing this.

@MichaelHatherly
Copy link
Owner

which will obviously break with newlines

Only solution I see really is multiline cells for markdown tables. I don't really have the need for it, but I would accept any PRs that can manage to implement it as a parser plugin, which I reckon is probably doable, if a bit complex.

@schlichtanders
Copy link
Author

Multiline cells would not work, as they also would get destroyed by a interpolated newline.

I still do not understand why this is not treated as html from CommonMark: Is it because inside markdown tables html tags are never intrpreted as html, because the entire line does not start with an html tag?

@schlichtanders
Copy link
Author

My current understanding of the specs:

  • html leaf blocks need to start with the tag and end with the tag (and a new line sometimes)
  • leaf blocks can be nested inside container blocks
  • markdown table is not specified inside CommonMark, but I guess it is best understood as a container block and not a leaf block

Hence every cell in a markdown table is a leaf block (what else should it be?). But then, the html in the cell should be recognized as html, shouldn't it?

@schlichtanders
Copy link
Author

I just tried wrapping the above into <div></div> and indeed CommonMark does intrprete everything correctly.

but inside the markdown table it breaks

@schlichtanders
Copy link
Author

Here new minimal examples:

Parsed Correctly

"<div> => </div>"

is parsed as

"<div> => </div> "

Parsed Wrongly (?)

"""
| title    |
| -------- |
|<div> => </div>|
"""

is parsed as

"<table><thead><tr><th align=\"left\">title</th></tr></thead><tbody><tr><td align=\"left\"><div> =&gt; </div></td></tr></tbody></table>"

the &gt; is different

@schlichtanders schlichtanders changed the title [Bug?] CommonMark changes text inside <script> tags [Bug?] CommonMark Markdown Table cells are not CommonMark leaf blocks Nov 9, 2023
@MichaelHatherly
Copy link
Owner

The table parser was modeled on the github_markdown one from pandoc, so best check whether that behaviour you're seeing there matches, or not, with it. https://pandoc.org/try/

@schlichtanders
Copy link
Author

As you probably already know, the pandoc github_markdown parser has the same unfortunate behaviour.

Summary:

  • CommonMark allows for markdown table cells to behave like leaf blocks
  • intuitively it makes sense that markdown table cells behave like leaf blocks (this interpretation seems most apt when considering CommonMark spec overall)
  • currently markdown table cells do not behave like leaf blocks (because the parser is modelled after pandoc, which has this behaviour)

Hence I would argue that it makes sense to treat markdown table cells as leaf blocks in CommonMark.jl

@schlichtanders
Copy link
Author

schlichtanders commented Nov 9, 2023

title
=>

Github markdown seems different from pandocs implementation - you can inspect this very webpage - it shows => and no &gt;. (also the div is rendered and not escaped)

@MichaelHatherly
Copy link
Owner

Feel free to submit a PR that adjusts that behaviour. I'm unlikely to get around to it myself since I'm not really in need of embedded HTML inside table cells myself.

@schlichtanders
Copy link
Author

I just tested about 4 other python markdown parsers - and they all have the same problem... probably they all copy pandoc's implementation. Hence I have no easy workaround 😅

I indeed may find resources to create the PullRequest soon. Let's see.

I am glad you agree that this would be a good addition.
Looking forward to contribute.

@schlichtanders
Copy link
Author

@MichaelHatherly a short question - the code mentions that emphasis and links necesarily need to be determined before the TableCells are build up.

Do you know why this is a *must*?

# Low priority since this *must* happen after nested structure of emphasis and
# links is determined. 100 should do fine.
inline_modifier(rule::TableRule) = Rule(100) do parser, block

@MichaelHatherly
Copy link
Owner

In the case where there are | characters inside of emphasis (or other inline syntax), since it needs to ignore those otherwise a | inside of them would be treated as a column separator which isn't the behaviour that is needed.

@schlichtanders
Copy link
Author

Thank you for the explanation.

I read through the code and understood that a leaf block in CommonMark is kind of hardcoded to be multiline

  • leaf blocks consume full lines, sometimes skipping the start (like in lists) but never skipping the end

Hence I gave up 🙂 and instead build a tiny workaround which is much easier to maintain and should be enough for my case currently

using CommonMark
using Crayons

struct HtmlFragmentInlineRule end
struct HtmlFragmentInline <: CommonMark.AbstractInline end
function parse_html_fragment(parser::CommonMark.InlineParser, block::CommonMark.Node)
    m = CommonMark.consume(parser, match(r"<>.*</>", parser))
    m === nothing && return false
    node = CommonMark.Node(HtmlFragmentInline())
    node.literal = @views m.match[begin+length("<>"):end-length("</>")]
    CommonMark.append_child(block, node)
    return true
end
CommonMark.inline_rule(::HtmlFragmentInlineRule) = CommonMark.Rule(parse_html_fragment, 1.5, "<")

function CommonMark.write_term(::HtmlFragmentInline, render, node, enter)
    style = crayon"dark_gray"
    CommonMark.print_literal(render, style)
    CommonMark.push_inline!(render, style)
    CommonMark.print_literal(render, node.literal)
    CommonMark.pop_inline!(render)
    CommonMark.print_literal(render, inv(style))
end
CommonMark.write_html(::HtmlFragmentInline, r, n, ent) = CommonMark.literal(r, r.format.safe ? "<!-- raw HTML omitted -->" : n.literal)
CommonMark.write_latex(::HtmlFragmentInline, w, node, ent) = nothing
CommonMark.write_markdown(::HtmlFragmentInline, w, node, ent) = CommonMark.literal(w, node.literal)

with this I can now do

using CommonMark
parser = CommonMark.Parser()
CommonMark.enable!(parser, CommonMark.TableRule())
CommonMark.enable!(parser, HtmlFragmentInlineRule())

ast = parser("""
| title    |
| -------- |
|<><div> => </div></>|
""")
html(ast)

and it returns the html unchanged

"<table><thead><tr><th align=\"left\">title</th></tr></thead><tbody><tr><td align=\"left\"><div> => </div></td></tr></tbody></table>"

@schlichtanders
Copy link
Author

@MichaelHatherly do you think this is worth adding to the extensions?

@MichaelHatherly
Copy link
Owner

and instead build a tiny workaround which is much easier to maintain and should be enough for my case currently

Glad you've found a solution.

do you think this is worth adding to the extensions?

For the time being I would suggest not making it an official extension and see how it evolves for your usecase. As I'm sure you can tell this package does not change much now so you're not going to run into any issues using some of those internals you needed. Let's just see whether it turns out to be something that a lot of users end up asking for and re-evaluate it in a while.

@MichaelHatherly
Copy link
Owner

m.match[begin+length("<>"):end-length("</>")]

Curious whether using a regex group rather than this work?

@schlichtanders
Copy link
Author

schlichtanders commented Nov 10, 2023

m.match[begin+length("<>"):end-length("</>")]

Curious whether using a regex group rather than this work?

I was cautious not to interfere with the matching and consumption logic which is one reason why I am not using regex groups.
Do you think regex groups would have an advantage?

@MichaelHatherly
Copy link
Owner

I was just curious :)

@schlichtanders
Copy link
Author

schlichtanders commented Nov 10, 2023

I guess they work to and are slightly faster :)

it does not work :D
the consume will return a string as a match and not the regex match itself. The string contains everything (as it probably should).

So the substring seems like the easiest solution

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants