Process Unicode scalar values over 255 to HTML entities #32

Closed
@gavineadie gavineadie commented Dec 10, 2019

Characters outside the single-byte range (Unicode scalar values above 255) are not rendered properly in the output generated by Ink. Characters like the em dash (used in the Ink README.md file) are passed through unchanged to the output HTML. There may be browser settings to cope with such characters, but Safari, by default, does not.

This code intercepts such characters and converts them to HTML entities, so an em dash, for example, is converted to &#8212;. I'm far from certain that I put that intercept in an optimal place, and it is a pretty brute-force approach, but I'm sure discussion here will improve that.
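
For readers following along, the conversion described here can be sketched roughly as follows. This is not the code from the pull request and not part of Ink's API; the function name `escapingNonLatin1` is hypothetical, and the sketch assumes a per-scalar check that emits decimal numeric character references.

```swift
// Hypothetical helper, not Ink's API: replaces every Unicode scalar above 255
// with a decimal numeric character reference such as "&#8212;".
func escapingNonLatin1(_ text: String) -> String {
    var result = ""
    for scalar in text.unicodeScalars {
        if scalar.value > 255 {
            // Outside the single-byte range: emit "&#NNNN;".
            result += "&#\(scalar.value);"
        } else {
            // Within the single-byte range: keep the scalar unchanged.
            result.unicodeScalars.append(scalar)
        }
    }
    return result
}

// U+2014 is the em dash and U+2019 the right single quotation mark.
let escaped = escapingNonLatin1("embedded \u{2014} and that\u{2019}s all")
// escaped == "embedded &#8212; and that&#8217;s all"
```

Working on `unicodeScalars` rather than on `Character` values means each scalar is examined individually, whatever grapheme cluster it belongs to.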

Note the correct rendering of a line from the Ink README:

[Image: Ink+Unicode]

@gavineadie
Author

I note that this intercept doesn't cope with extended grapheme clusters.
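
To illustrate the limitation, here is a small sketch of why grapheme clusters are awkward for this kind of check. It assumes, hypothetically, an intercept that looks at a single scalar per `Character`; the actual code in the pull request may differ.

```swift
// "e" followed by U+0301 (combining acute accent) forms a single Character
// made of two Unicode scalars.
let cluster: Character = "e\u{301}"
let scalarValues = cluster.unicodeScalars.map { $0.value }
// scalarValues == [101, 769]

// A check on the first scalar alone (101) would not flag this Character,
// while a check across all of its scalars does.
let needsEscaping = cluster.unicodeScalars.contains { $0.value > 255 }
// needsEscaping == true
```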

@steve-h
Contributor

steve-h commented Dec 13, 2019

I'm not sure we should escape these characters. See https://www.w3.org/International/techniques/authoring-html.en?open=charset#charset

Swift now uses UTF-8 as the native encoding for strings. HTML4 and HTML5 both define documents in terms of Unicode. This tool is intended for applications in these newer, better-defined domains.

It is nicer for users working in languages that use these higher code points to be able to read the result, so it should be left intact as UTF-8.
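
As a small illustration of what "left intact as UTF-8" means in practice, the sketch below inspects the bytes Swift produces for an em dash. It uses only the standard library's `utf8` view; the variable names are illustrative.

```swift
// U+2014 (em dash) is a single Unicode scalar...
let emDash = "\u{2014}"
let scalarCount = emDash.unicodeScalars.count
// scalarCount == 1

// ...that is encoded as three bytes in UTF-8.
let utf8Bytes = Array(emDash.utf8)
// utf8Bytes == [0xE2, 0x80, 0x94]
```

A browser that is told the document is UTF-8 decodes those three bytes back into one em dash; one that guesses a single-byte encoding shows three unrelated characters instead.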

@steve-h
Contributor

steve-h commented Dec 13, 2019

A further statement from the CommonMark spec:

Any sequence of characters is a valid CommonMark document.

A character is a Unicode code point. Although some code points (for example, combining accents) do not correspond to characters in an intuitive sense, all code points count as characters for purposes of this spec.

This spec does not specify an encoding; it thinks of lines as composed of characters rather than bytes. A conforming parser may be limited to a certain encoding.

So Ink is conforming if it restricts the encoding to UTF-8, and I see no current need to cover more encodings.

@gavineadie
Author

Are you arguing that the line in the Ink README.md, that I quoted above, should be rendered in HTML as follows?

That’s it! The resulting HTML can then be displayed as-is, or embedded into some other context — and if that’s all you need Ink for, then no more code is required.

@steve-h
Contributor

steve-h commented Dec 13, 2019

Yes, it works if placed in a proper HTML document:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
  </head>
  <body>

Inspect the README in Safari as rendered by GitHub and you will see the em dash etc. in native UTF-8.

Put the header fragment in front of the Ink output and it will render fine in Safari.
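
A minimal sketch of that wrapping step, assuming the HTML fragment is already available as a Swift string. The helper name `wrapInHTMLDocument` is hypothetical and not part of Ink; it also closes the `body` and `html` elements, which the header fragment above leaves open.

```swift
// Hypothetical helper: prefixes an HTML fragment with a minimal document
// header that declares UTF-8, then closes the body and html elements.
func wrapInHTMLDocument(_ fragment: String) -> String {
    return """
    <!DOCTYPE html>
    <html lang="en">
      <head>
        <meta charset="utf-8">
      </head>
      <body>
    \(fragment)
      </body>
    </html>
    """
}

// The fragment keeps its non-ASCII characters (U+2014, U+2019) unescaped.
let page = wrapInHTMLDocument("<p>as-is \u{2014} and that\u{2019}s all</p>")
```

With the `meta charset` declaration in place, the browser has no need to guess the encoding, so the unescaped characters display as intended.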

@john-mueller
Contributor

I agree that this is unnecessary, since the generated HTML renders properly when wrapped as steve-h mentions.

@JohnSundell
Owner

I agree with @steve-h and @john-mueller - will close this one. Thanks anyway @gavineadie.

@JohnSundell JohnSundell closed this May 2, 2020