Character reference replacement results in raw HTML #383

Open
kevinoid opened this issue Mar 27, 2022 · 0 comments

As a result of #109, character and entity references are unconditionally dereferenced. Consequently, html2text 2017.10.4 and later converts HTML containing character references that represent HTML-like text into Markdown containing raw HTML:

$ echo "<p>Horizontal rule is &lt;hr&gt;</p>" | html2markdown
Horizontal rule is <hr>

To make the problem clearer, consider round-tripping from HTML to Markdown back to HTML:

$ echo "<p>Horizontal rule is &lt;hr&gt;</p>" | html2markdown | cmark
<p>Horizontal rule is <!-- raw HTML omitted --></p>

$ echo "<p>Horizontal rule is &lt;hr&gt;</p>" | html2markdown | cmark --unsafe
<p>Horizontal rule is <hr></p>

The conversion to Markdown changes the meaning of the content: by dereferencing the character references, it turns the literal text "<hr>" into an actual HTML element.

To satisfy the request in #109, I suggest preserving character and entity references which would be interpreted as Raw HTML if dereferenced. That would avoid producing unnecessary character references (as requested in #109) and also avoid changing the meaning of the content when it contains HTML-like text.
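
A minimal sketch of that idea in Python, using only the standard library. The helper name `dereference_safely` is hypothetical and is not part of html2text; the rule it applies (keep "<" escaped whenever it is followed by an ASCII letter, "/", "!" or "?") is a deliberately conservative approximation of where raw HTML can start in CommonMark, not an exact implementation of the spec:

```python
import html
import re

# Assumption: "<" can only begin CommonMark raw HTML (a tag, comment,
# processing instruction, declaration or CDATA section) when it is followed
# by an ASCII letter, "/", "!" or "?". This over-matches slightly, which is
# safe: the worst case is an unnecessary "&lt;".
_HTML_LIKE_OPEN = re.compile(r"<(?=[A-Za-z/!?])")


def dereference_safely(text: str) -> str:
    """Dereference character references, but keep "<" escaped wherever
    the result could otherwise be parsed as raw HTML.

    Hypothetical helper for illustration; not html2text's actual code.
    """
    unescaped = html.unescape(text)
    return _HTML_LIKE_OPEN.sub("&lt;", unescaped)


if __name__ == "__main__":
    print(dereference_safely("Horizontal rule is &lt;hr&gt;"))
    print(dereference_safely("1 &lt; 2"))
```

With this approach "&lt;hr&gt;" would come out as "&lt;hr>", which a Markdown renderer displays as literal text, while "1 &lt; 2" would still become "1 < 2" with no unnecessary reference. The sketch does not handle a dereferenced "&" that would itself start a new entity reference (for example "&amp;lt;"); that case would need a similar check.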

Thanks for considering,
Kevin
