autolink for non-HTTP URIs, and other non-tag content, produces invalid XML #1244

nxg · 2022-04-09T18:15:46Z

Consider the following:

import markdown

md = '''
Here are some elements:

  * url <http://example.org>
  * repo url <ssh://example.org>, which is a non-HTTP URL
  * and <urn:foo> is something else
  * ssh url2 <ssh:me@example.org>, handled as an email address
  * misc element <em>boo!</em>
'''

converter = markdown.Markdown()
print(converter.convert(md))

This renders as

<p>Here are some elements:</p>
<ul>
<li>url <a href="http://example.org">http://example.org</a></li>
<li>repo url <ssh://example.org>, which is a non-HTTP URL</li>
<li>and <urn:foo> is something else</li>
<li>ssh url2 <a href="&#109;&#97;&#105;&#108;&#116;&#111;&#58;&#115;&#115;&#104;&#58;&#109;&#101;&#64;&#101;&#120;&#97;&#109;&#112;&#108;&#101;&#46;&#111;&#114;&#103;">&#115;&#115;&#104;&#58;&#109;&#101;&#64;&#101;&#120;&#97;&#109;&#112;&#108;&#101;&#46;&#111;&#114;&#103;</a>, handled as an email address</li>
<li>misc element <em>boo!</em></li>
</ul>

I think items number 2 and 3 are incorrect, (a) because the behaviour doesn't match two significant Markdown specs, and (b) because they are both invalid XML (yes, <urn:foo> looks like an XML element with a namespace prefix; let's not go there...).

The autolink feature in the Daring Fireball spec is ‘for URLs and email addresses’ (though the only URL in that example is an HTTP URL). The corresponding section in the CommonMark spec says that the autolink should happen for an absolute URI. So the second case should be turned into <a href='ssh://example.org'>ssh://example.org</a>.

What appears to be happening, instead, is that this is being interpreted as literal HTML. The relevant section of Gruber's spec is rather vague, but the corresponding part of the CommonMark spec says that this should happen only to ‘[t]ext between < and > that looks like an HTML tag’, which of course <ssh://example.org> doesn't (CommonMark: ‘A tag name consists of an ASCII letter followed by zero or more ASCII letters, digits, or hyphens (-)’).

Independently of any spec, however, having <ssh://example.org> appear in the output means that that output is syntactically invalid, and I feel this shouldn't happen for any input, however insane.

Suggestion:

When <starttag> consists of something other than [a-zA-Z][a-zA-Z0-9-]*, then it is either a URI, in which case it should be turned into an <a> element, or it is not, in which case it should be included literally in the output, as if the content were instead enclosed in backticks.

This would imply that item 3 should render as <code>urn:foo</code>.

The text was updated successfully, but these errors were encountered:

waylan · 2022-04-09T18:32:29Z

First of all, Python-Markdown is NOT a Commonmark implementation, so I will ignore all references to that spec.

The autolink feature in the Daring Fireball spec is ‘for URLs and email addresses’ (though the only URL in that example is an HTTP URL).

And in fact, according to Babelmark, Markdown.pl only recognizes HTTP URLs. Seems to me we are mostly inline with the reference implementation.

Independently of any spec, however, having <ssh://example.org> appear in the output means that that output is syntactically invalid, and I feel this shouldn't happen for any input, however insane.

This seems to be the motivation behind your report. However, I have a different view. We cannot be responsible for invalid input. That is the responsibility of the document author. In fact, we do not validate any raw HTML for this reason. Any contact wrapped in angle brackets which is not recognized as a auto link is just treated as raw HTML Everything between (and including) the angle brackets is passed thorough unaltered in any way. I have no intention of changing that.

That said, if there are some additional forms of URLs would should clearly be recognized as auto links which are not currently, then I would be willing to review a PR. However, if those URLs are not also recognized by the reference implementation, then they it would be better for support to be provided through a third party extension.

Speaking of third party extensions, you can always modify any part of the parser with an extension. If you want to avoid invalid output, then you are always free to use an extension to do that. Although, I would think a post processor would be better suited to the task. Simply pass the output from Python-Markdown into some HTML tidy library and then use the output of that on your HTML templates (or whatever you are going with it).

nxg · 2022-04-10T13:28:43Z

Thanks for looking at this.

First of all, Python-Markdown is NOT a Commonmark implementation, so I will ignore all references to that spec.

And the wonderful thing about Markdown implementations is that there are so many to choose from! I only mentioned Commonmark because it was one that came to mind, and it seemed useful to mention some spec on top of Gruber's original. I didn't intend to suggest that python-markdown was, or should be, specifically a Commonmark implementation.

And in fact, according to Babelmark, Markdown.pl only recognizes HTTP URLs. Seems to me we are mostly inline with the reference implementation.

True, but I notice that, if we look at the other parsers in that useful list, which produce a result, it's only Markdown.pl and python-markdown that don't do at least something with <ssh://example.org>, and instead pass it through verbatim. Some turn it into a <a> link, others give up and escape the < with < – either of those seems a reasonable strategy.

Independently of any spec, however, having ssh://example.org appear in the output means that that output is syntactically invalid, and I feel this shouldn't happen for any input, however insane.

This seems to be the motivation behind your report.

That is indeed the motivation. I want to consume what python-markdown produces, so I naturally have an interest in it being consumable.

However, I have a different view. We cannot be responsible for invalid input. That is the responsibility of the document author. In fact, we do not validate any raw HTML for this reason. Any contact wrapped in angle brackets which is not recognized as a auto link is just treated as raw HTML Everything between (and including) the angle brackets is passed thorough unaltered in any way.

That's an interesting point of view. Myself, I feel that if there's a ‘spirit of Markdown’, it's that the user is never wrong, and there is no such thing as ‘invalid input’, only input which the processor doesn't recognise, where the user therefore fails to communicate their intention. A Markdown processor is a fairly heuristic tool, and should always have sensible defaults.

That is, Markdown is a DWIM application, and I think we can say, with some degree of confidence, that if the user types <ssh://example.org> then their intention was something other than creating inline HTML. All I'm suggesting is that if python-markdown spots something like that, which is not a mailto, and which is clearly not intended as inline HTML, that it has a third strategy.

That said, if there are some additional forms of URLs would should clearly be recognized as auto links which are not currently, then I would be willing to review a PR. However, if those URLs are not also recognized by the reference implementation, then they it would be better for support to be provided through a third party extension.

I think it would be reasonable to regard essentially any URL inside angle brackets as an autolink candidate -- there's nothing special about HTTP and FTP from Markdown's point of view. That would pass a DWI(probably)M test, for me. In the case of ssh://example.org I don't think there's much useful that could happen if that appears in a web page, but if that turned into an <a> link, then I as a user would not feel surprised or cheated.

Thus, in inlinepatterns.py, change AUTOLINK_RE to

AUTOLINK_RE = r'<([a-zA-Z]+://[^<>]*)>'

Additionally, and whether or not you though the above was a good idea (ie, I can see at least some sort of case for restricting autolink to HTTP/FTP), there might be milage in something like

# According to <https://www.w3.org/TR/REC-xml/#NT-Name>,
# an element start tag is
#
#    '<' Name (S Attribute)* S? '>'
# where
# [4]   	NameStartChar	   ::=   	":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
# [4a]   	NameChar	   ::=   	NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
#
# but we don't have to be so general, if what we're aiming to spot is (the complement of) HTML element names.
# So match an element-start-tag or end-tag
# which has something other than [a-z]+ for the element name
NONHTML_RE = re.compile('(<\/?[a-zA-Z][a-zA-Z0-9.-]*[^ a-zA-Z0-9.-]([^ <>]*)?>)')

That (along with a suitable NonhtmlInlineProcessor class) would identify those <something...> which are obviously not intended to be inline HTML, and could escape the leading < (or put the content in <code>...</code>, if one wanted to be fancy).

Although, I would think a post processor would be better suited to the task. Simply pass the output from Python-Markdown into some HTML tidy library and then use the output of that on your HTML templates (or whatever you are going with it).

Indeed, but element <ssh://example.org> manages to confuse lxml's HTMLParser (and I can't say I really blame it, but a bug report is heading their way...). There's ‘tag soup’ and there's ‘what...!’.

waylan · 2022-04-10T19:28:32Z

Let me restate that everything in angle brackets is considered to be raw HTML unless it is clearly an auto link. That is how the reference implementation works and that is how Python-Markdown will always work by default. If you want to alter that behavior, then you can do so in a third party extension.

However, if you want to explore expanding auto links, then that may be worth considering. However, as with everything else in Markdown, I don't think we want to be super strict and encode the entire requirements of some spec. Something simple like AUTOLINK_RE = r'<([a-zA-Z]+://[^<>]*)>' is probably the right direction. My only concern is whether that could be to general and match things it shouldn't. A good set of tests would likely address any such concern. Therefore, a PR with such and change and a full set of tests would be given serious consideration. However, anything more complex can be handled by a third party extension.

waylan added the more-info-needed More information needs to be provided. label Apr 9, 2022

waylan added feature Feature request. core Related to the core parser code. confirmed Confirmed bug report or approved feature request. someday-maybe Approved low priority request. and removed more-info-needed More information needs to be provided. labels Apr 10, 2022

jwortmann mentioned this issue Feb 23, 2023

Invalid HTML tag in hover response astoff/digestif#60

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

autolink for non-HTTP URIs, and other non-tag content, produces invalid XML #1244

autolink for non-HTTP URIs, and other non-tag content, produces invalid XML #1244

nxg commented Apr 9, 2022

waylan commented Apr 9, 2022 •

edited

nxg commented Apr 10, 2022

waylan commented Apr 10, 2022

autolink for non-HTTP URIs, and other non-tag content, produces invalid XML #1244

autolink for non-HTTP URIs, and other non-tag content, produces invalid XML #1244

Comments

nxg commented Apr 9, 2022

waylan commented Apr 9, 2022 • edited

nxg commented Apr 10, 2022

waylan commented Apr 10, 2022

waylan commented Apr 9, 2022 •

edited