Span rules + br can break commonmark standard #405

zombiecalypse · 2022-03-22T04:04:53Z

Implementation here:

Line 209 in 4499b5c

rules.emphasis = {

Reproducing example: turndown("<em>foo<br></em>") == "_foo  \n_"

https://spec.commonmark.org/0.30/#emphasis-and-strong-emphasis

A single _ character can close emphasis iff it is part of a right-flanking delimiter run and either (a) not part of a left-flanking delimiter run or (b) part of a left-flanking delimiter run followed by a Unicode punctuation character.

and

A right-flanking delimiter run is a delimiter run that is (1) not preceded by Unicode whitespace, and either (2a) not preceded by a Unicode punctuation character, or (2b) preceded by a Unicode punctuation character and followed by Unicode whitespace or a Unicode punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

This means that commonmark2html("_foo  \n_") = "<p>_foo<br />\n_</p>", i.e. the <em> is lost.
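
The right-flanking condition quoted above can be checked mechanically. The following is a minimal sketch, not the CommonMark reference implementation: the helper names (isWhitespace, isPunctuation, isRightFlanking) are made up for illustration, and indexing by UTF-16 code unit is a simplification that ignores astral characters.

```javascript
// Beginning and end of the text count as whitespace, per the quoted definition.
const isWhitespace = (ch) => ch === undefined || /\s/u.test(ch);

// ASCII punctuation plus the Unicode P categories (assumed reading of spec 0.30).
const ASCII_PUNCT = /[!-\/:-@\[-\x60{-~]/;
const isPunctuation = (ch) =>
  ch !== undefined && (ASCII_PUNCT.test(ch) || /\p{P}/u.test(ch));

// `start` is the index of the first delimiter character, `end` the index
// just past the last one (so the run is text.slice(start, end)).
function isRightFlanking(text, start, end) {
  const before = text[start - 1];
  const after = text[end];
  if (isWhitespace(before)) return false;   // condition (1)
  if (!isPunctuation(before)) return true;  // condition (2a)
  return isWhitespace(after) || isPunctuation(after); // condition (2b)
}

// The closing "_" in the reproducing example follows a line break,
// so it is not right-flanking and cannot close the emphasis.
console.log(isRightFlanking("_foo  \n_", 7, 8)); // false
console.log(isRightFlanking("_foo_", 4, 5));     // true
```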

The same is true for the other span delimiters (*, __, **) and for a leading <br> in a span element.

As far as I can tell, only <br> is affected. While foobar and similar abominations do trip up the context-free replacement, they are fortunately not valid HTML.

zombiecalypse · 2022-03-26T14:18:35Z

Added a pull request that demonstrates this and other corner cases:

#406

Flashwalker · 2023-01-14T04:16:59Z

Can we avoid it somehow???

1. Zero-width space and/or non-breaking space:

<a href="https://bla-bla-bla">&ZeroWidthSpace;&ZeroWidthSpace;</a>text-text-text produces:

[](https://bla-bla-bla)text-text-text

Is there any way to filter out (remove) HTML with zero visual content?
Something like:

turndownService.addRule('al_spaces', {
    regexFilter: '<[^<>]+?>[[:space:]]<\/.+?>',
    replacement: function (content) {
        return ''
    }
})
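
As far as I know, addRule takes a filter given as a tag name, an array of tag names, or a function, rather than a regular expression over the raw HTML. A rough sketch of the "zero visual content" test as a plain function follows; isVisuallyEmpty is a hypothetical helper, and the commented-out rule assumes an existing turndownService instance and has not been tested against the library. Note that \s in JavaScript covers the non-breaking space but not the zero-width space, so the latter is listed explicitly.

```javascript
// Whitespace (including U+00A0) plus common zero-width characters.
const INVISIBLE = /[\s\u200B\u200C\u200D\u2060\uFEFF]/gu;

// True when the text has nothing left after removing invisible characters.
function isVisuallyEmpty(text) {
  return text.replace(INVISIBLE, "") === "";
}

console.log(isVisuallyEmpty("\u200B\u200B")); // true
console.log(isVisuallyEmpty("text"));         // false

// Possible hook (assumption: `turndownService` is a TurndownService instance).
// Links that wrap only an image also have empty text content, so a real
// rule would need to exclude them.
//
// turndownService.addRule('emptyLink', {
//   filter: (node) => node.nodeName === 'A' && isVisuallyEmpty(node.textContent),
//   replacement: () => ''
// })
```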

2. Line break that breaks the Markdown markup:

bla-bla-bla   text-text-text produces:

**bla-bla-bla
** 
text-text-text

Is there any way to filter out (remove) all line breaks that precede the closing tag?
Something like:

turndownService.removeAllBefore('<br>', '</*>')

#423

SARAsBooks · 2023-04-19T12:08:02Z

As far as I can tell only <br> is affected.

Good to hear that, @zombiecalypse. Maybe this single exception could be handled in the rules by adding the span delimiters (_, *, __, **) before and after a <br> element? @Flashwalker, removing <br> is no good because it should be preserved in the Markdown.
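
One way to keep the line break while still producing valid emphasis is to move leading and trailing whitespace, including the hard-break sequence, outside the delimiters. This is only a sketch of that idea, not turndown's actual implementation: wrapSpan is a hypothetical helper, and the commented-out rule assumes an existing turndownService instance.

```javascript
// Wrap only the non-whitespace core of `content` in `delimiter`,
// leaving leading/trailing whitespace (e.g. "  \n") outside of it.
function wrapSpan(content, delimiter) {
  const core = content.trim();
  if (core === "") return content; // nothing to emphasize, keep as-is
  const leading = content.slice(0, content.length - content.trimStart().length);
  const trailing = content.slice(content.trimEnd().length);
  return leading + delimiter + core + delimiter + trailing;
}

console.log(JSON.stringify(wrapSpan("foo  \n", "_"))); // "_foo_  \n"

// Possible hook (assumption: `turndownService` is a TurndownService instance;
// untested against the library):
//
// turndownService.addRule('emphasisKeepBreak', {
//   filter: ['em', 'i'],
//   replacement: (content, node, options) => wrapSpan(content, options.emDelimiter)
// })
```

With this, the closing delimiter directly follows "foo" and is therefore right-flanking, while the two spaces and newline still follow it as a hard line break.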

My code uses const markdown = convertToMarkdown( article.content.replaceAll(' ', ' ') );, but that is specific to the formatting I encountered in one article:

https://github.com/SARAsBooks/html-to-markdown/blob/04e64d6074bd95903c331d167bb6edc869977986/automationWorkflow.js#L45
