Filter or remove rules to filter/remove by regexp/wildcard #423

Open
Flashwalker opened this issue Jan 14, 2023 · 0 comments

Flashwalker commented Jan 14, 2023

Could we have filter or remove rules that match by regexp or wildcard?

E.g.:

1.

Zero-width space and/or non-breaking space:
<a href="https://bla-bla-bla">&ZeroWidthSpace;&ZeroWidthSpace;</a>text-text-text produces:

[​​](https://bla-bla-bla)text-text-text

Is there any way to filter out (remove) HTML elements with no visible content?
Something like:

turndownService.addRule('al_spaces', {
    regexFilter: '<[^<>]+?>[[:space:]]<\/[^<>]+?>',
    replacement: function (content) {
        return ''
    }
})
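As far as I can tell, Turndown has no `regexFilter` option, but `addRule` does accept a function as `filter`, which gets close to the rule above. Here is a minimal sketch, assuming rules added with `addRule` are consulted before the built-in link rule. The names `INVISIBLE_ONLY`, `hasOnlyInvisibleText` and `dropInvisibleRule` are made up for illustration and are not part of Turndown:

```javascript
// Characters from the reference list below, written as escapes so they stay visible.
const INVISIBLE_ONLY = /^[\u0020\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF\uFFFC]+$/

// True when an element contains nothing but text nodes made of those characters.
// Works on any DOM-like node exposing `childNodes`, `nodeType` and `textContent`.
function hasOnlyInvisibleText (node) {
  const onlyTextChildren = Array.prototype.every.call(
    node.childNodes,
    function (child) { return child.nodeType === 3 } // 3 === Node.TEXT_NODE
  )
  return onlyTextChildren && INVISIBLE_ONLY.test(node.textContent)
}

// Rule object in the shape accepted by turndownService.addRule(name, rule).
const dropInvisibleRule = {
  filter: function (node) {
    return hasOnlyInvisibleText(node)
  },
  replacement: function () {
    return ''
  }
}

// Usage (assumes `turndownService` is an existing TurndownService instance):
// turndownService.addRule('drop_invisible', dropInvisibleRule)
```

The filter only looks at direct children, so an anchor wrapping an image or another element is left alone. It still cannot express the regexp/wildcard matching on raw HTML that this issue asks for.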

List of spaces for reference:

Number Character name
\u0020 space
\u00A0 no-break space
\u1680 Ogham space mark
\u180E Mongolian vowel separator
\u2000 en quad
\u2001 em quad
\u2002 en space (nut)
\u2003 em space (mutton)
\u2004 three-per-em space (thick space)
\u2005 four-per-em space (mid space)
\u2006 six-per-em space
\u2007 figure space
\u2008 punctuation space
\u2009 thin space
\u200A hair space
\u200B zero width space
\u202F narrow no-break space
\u205F medium mathematical space
\u3000 ideographic space
\uFEFF zero width no-break space
\uFFFC object replacement character

2.

A line break that breaks Markdown markup:
<strong>bla-bla-bla<br></strong>&nbsp;<br>text-text-text produces:

**bla-bla-bla
** 
text-text-text

Is there any way to filter out (remove) all line breaks that immediately precede a closing tag?
Something like:

turndownService.removeAllBefore('<br>', '</*>')

Here are some regex examples:

Remove an anchor that contains only zero-width spaces (you can't see them until you paste the string into the dev console):

selectedHTML='<i>bla</i><b><a href="https://bla-bla-bla">​​​​​​​</a>text-text-text</b><i>bla</i>'
// (?!\/) stops the first tag from matching a closing tag such as </i>
selectedHTML.replace(/<(?!\/)[^<>]+?>[\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF\u0020\uFFFC]+<\/[^<>]+?>/g, '')

Remove a line break that precedes a closing tag:

selectedHTML='<i>bla</i><strong>bla-bla-bla<br></strong>&nbsp;<br>text-text-text<i>bla</i>'
selectedHTML.replace(/(<br ?\/?>)+(<\/[^<>]+?>)/gi, '$2')

Or swap the line break and the closing tag that follows it:

selectedHTML='<i>bla</i><strong>bla-bla-bla<br></strong>&nbsp;<br>text-text-text<i>bla</i>'
selectedHTML.replace(/((<br ?\/?>)+)(<\/[^<>]+?>)/gi, '$3$1')

It would be nice if the regex filter skipped the content of code and pre tags.
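
Until something like that exists, the skipping can be approximated on the HTML string before it is passed to Turndown. This is a rough sketch, not a Turndown feature: `replaceOutsideCode` is a hypothetical helper, and it assumes pre and code elements are never nested inside an element of the same name.

```javascript
// Applies `pattern` only to the parts of `html` that lie outside <pre>/<code> elements.
// Assumption: no <pre> inside <pre> and no <code> inside <code>.
function replaceOutsideCode (html, pattern, replacement) {
  // Matches a whole <pre>...</pre> or <code>...</code> element, lazily up to its first closing tag.
  const codeBlock = /<(pre|code)\b[^<>]*>[\s\S]*?<\/\1\s*>/gi
  let result = ''
  let lastIndex = 0
  let match
  while ((match = codeBlock.exec(html)) !== null) {
    // Text before the code block gets the replacement applied.
    result += html.slice(lastIndex, match.index).replace(pattern, replacement)
    // The code block itself is copied through untouched.
    result += match[0]
    lastIndex = codeBlock.lastIndex
  }
  // Remaining text after the last code block.
  return result + html.slice(lastIndex).replace(pattern, replacement)
}

// Usage with the "remove the line break that precedes a closing tag" regex:
const sample = '<b>x<br></b><pre>a<br></pre><i>y<br></i>'
const cleaned = replaceOutsideCode(sample, /(<br ?\/?>)+(<\/[^<>]+?>)/gi, '$2')
// cleaned === '<b>x</b><pre>a<br></pre><i>y</i>'
```

Because each chunk is processed separately, a pattern that would need to match across the edge of a pre or code element will not match there.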

P.S.
And also:

// Drop anchor tags that contain only dots or commas
selectedHTML = '<a href="#">,</a>'
selectedHTML.replace(/<a [^<>]+?>[.,]+<\/a>/gim, '')

And

// Drop emoji images, keep emoji unicode (from alt attr)
selectedHTML = '<img src="img-apple-64/1f914.png" class="emoji" alt="🤔">'
selectedHTML.replace(/<img [^<>]+?alt=['"]([\p{Emoji}\u200d]+)['"][^<>]*?\/?>/gimu, '$1')