Span rules + br can break commonmark standard #405

Open
zombiecalypse opened this issue Mar 22, 2022 · 3 comments


zombiecalypse commented Mar 22, 2022

Implementation here:

rules.emphasis = {

Reproducing example: turndown("<em>foo<br/></em>") == "_foo \n_"

https://spec.commonmark.org/0.30/#emphasis-and-strong-emphasis

A single _ character can close emphasis iff it is part of a right-flanking delimiter run and either (a) not part of a left-flanking delimiter run or (b) part of a left-flanking delimiter run followed by a Unicode punctuation character.

and

A right-flanking delimiter run is a delimiter run that is (1) not preceded by Unicode whitespace, and either (2a) not preceded by a Unicode punctuation character, or (2b) preceded by a Unicode punctuation character and followed by Unicode whitespace or a Unicode punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

This means that commonmark2html("_foo \n_") = "<p>_foo<br/>_</p>", i.e. the <em> is lost.
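The two quoted rules can be checked mechanically. The following is a minimal sketch, not the spec's reference implementation: the function names are made up for illustration, it only looks at a single-character delimiter run, and it works on UTF-16 code units rather than full code points.

```javascript
// Unicode whitespace per CommonMark 0.30: Zs category, tab, LF, FF, CR.
// A missing neighbour (start or end of the input) also counts as whitespace.
function isWhitespace(ch) {
  return ch === undefined || /[\p{Zs}\t\n\f\r]/u.test(ch);
}

// ASCII punctuation, or any character in a Unicode P* category.
function isPunctuation(ch) {
  if (ch === undefined) return false;
  return /[!-\/:-@\[-`{-~]/.test(ch) || /\p{P}/u.test(ch);
}

// Flanking status of the delimiter run occupying text[start..end).
function flanking(text, start, end) {
  const before = text[start - 1];
  const after = text[end];
  const left = !isWhitespace(after) &&
    (!isPunctuation(after) || isWhitespace(before) || isPunctuation(before));
  const right = !isWhitespace(before) &&
    (!isPunctuation(before) || isWhitespace(after) || isPunctuation(after));
  return { left, right, after };
}

// Rule for a single closing "_": right-flanking, and either not
// left-flanking or followed by punctuation.
function underscoreCanClose(text, index) {
  const { left, right, after } = flanking(text, index, index + 1);
  return right && (!left || isPunctuation(after));
}

const broken = "_foo \n_";
const fine = "_foo_";
console.log(underscoreCanClose(broken, broken.length - 1)); // false
console.log(underscoreCanClose(fine, fine.length - 1));     // true
```

In the broken case the closing `_` is preceded by the newline of the hard break, so it is not right-flanking and cannot close the emphasis.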

The same is true for the other possible span delimiters (*, __, **) and for a leading <br/> in a span element.

As far as I can tell, only <br/> is affected. While <em><p>foo<p></em>bar and similar abominations do trip up the context-free replacement, they are fortunately not valid HTML.

@zombiecalypse

Added a pull request that demonstrates this and other corner cases:

#406


Flashwalker commented Jan 14, 2023

Can we avoid it somehow???

1.

Zero width space and/or Non-breaking space:
<a href="https://bla-bla-bla">&ZeroWidthSpace;&ZeroWidthSpace;</a>text-text-text produces:

[​​](https://bla-bla-bla)text-text-text

Is there any way to filter out (remove) HTML elements with zero visual content?
Something like:

turndownService.addRule('al_spaces', {
    regexFilter: '<[^<>]+?>[[:space:]]<\/.+?>',
    replacement: function (content) {
        return ''
    }
})
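As far as I know, Turndown rules select nodes through `filter` (a tag name, an array of tag names, or a function of the node), not through a regex over the HTML source, so something close to the request above can be sketched with a function filter. The helper name `hasNoVisibleContent`, the rule name `invisibleLinks`, and the set of characters treated as invisible are assumptions for illustration, and the rule has not been tested against every Turndown version.

```javascript
// Strings made only of whitespace (including U+00A0) and zero-width characters.
const INVISIBLE_ONLY = /^[\s\u200B\u200C\u200D\u2060\uFEFF]*$/;

// True when the node has only text children and none of that text is visible.
// Elements that wrap an <img> or any other child element are left alone.
function hasNoVisibleContent(node) {
  const onlyText = Array.from(node.childNodes)
    .every((child) => child.nodeType === 3); // 3 === Node.TEXT_NODE
  return onlyText && INVISIBLE_ONLY.test(node.textContent);
}

// Hypothetical rule object: drop links that render as nothing.
const invisibleLinksRule = {
  filter: (node) => node.nodeName === 'A' && hasNoVisibleContent(node),
  replacement: () => ''
};

// Assumed usage with an existing TurndownService instance:
// turndownService.addRule('invisibleLinks', invisibleLinksRule);
```

Restricting the filter to `A` keeps the rule from swallowing other elements; widening it to further tags is a judgment call, since an empty-looking element can still carry meaning (an anchor target, for instance).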

2.

Line break which breaks markdown's markup:
<strong>bla-bla-bla<br></strong>&nbsp;<br>text-text-text produces:

**bla-bla-bla
** 
text-text-text

Is there any way to filter out (remove) all line breaks that precede the closing tag?
Something like:

turndownService.removeAllBefore('<br>', '</*>')

#423

@SARAsBooks

As far as I can tell only <br/> is affected.

Good to hear that, @zombiecalypse. Maybe this single exception could be handled in the rules by adding the span delimiters (_, *, __, **) before and after <br> or <br/>? @Flashwalker, removing <br> is no good because it should be preserved in the Markdown.

My code uses const markdown = convertToMarkdown( article.content.replaceAll('<br></em>', '</em><br>') );, but that is specific to the formatting I encountered in one article:

https://github.com/SARAsBooks/html-to-markdown/blob/04e64d6074bd95903c331d167bb6edc869977986/automationWorkflow.js#L45
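The `replaceAll` workaround can be generalized a little by moving leading and trailing line breaks out of inline elements before the HTML reaches the converter. This is only a sketch under stated assumptions: `hoistBreaks` is a made-up name, the tag list is a guess at the affected elements, and regex-based rewriting does not understand `<pre>`, `<code>`, comments, or malformed markup, so a DOM-based pass would be more robust.

```javascript
// Inline tags whose Markdown delimiters break next to a line break.
// This list is an assumption; extend it as needed.
const INLINE = '(?:em|strong|i|b)';

// One or more <br> (with optional whitespace) directly before a closing tag.
const BREAKS_BEFORE_CLOSE =
  new RegExp(`((?:<br\\s*/?>\\s*)+)(</${INLINE}>)`, 'gi');

// One or more <br> directly after an opening tag (attributes allowed).
const BREAKS_AFTER_OPEN =
  new RegExp(`(<${INLINE}(?:\\s[^>]*)?>)((?:\\s*<br\\s*/?>)+)`, 'gi');

// Swap each break with the adjacent tag until nothing changes, so that
// breaks also escape from nested inline elements.
function hoistBreaks(html) {
  let previous;
  do {
    previous = html;
    html = html
      .replace(BREAKS_BEFORE_CLOSE, '$2$1')
      .replace(BREAKS_AFTER_OPEN, '$2$1');
  } while (html !== previous);
  return html;
}

console.log(hoistBreaks('<em>foo<br/></em>'));
// <em>foo</em><br/>
console.log(hoistBreaks('<strong>bla-bla-bla<br></strong>&nbsp;<br>text-text-text'));
// <strong>bla-bla-bla</strong><br>&nbsp;<br>text-text-text
```

The result would then be passed to the converter in place of the raw HTML, which keeps the `<br>` in the output while letting the closing delimiter sit directly after the text.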
