feat: Support Unicode (sub|super)script characters #3633

ronkok · 2022-05-16T17:47:33Z

Math-mode Unicode (sub|super)script characters will now render as if you had written regular characters in a subscript or superscript. For instance, A²⁺³ will render the same as A^{2+3}.
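To illustrate the intended equivalence, here is a minimal sketch that rewrites runs of Unicode script characters into explicit `^{...}` and `_{...}` groups. It works on strings rather than inside the Parser, so it is not the PR's implementation. The helper names (`makeTable`, `expandUnicodeScripts`) and the partial character tables are assumptions made for this example.

```javascript
// Hypothetical, partial lookup tables: each script character maps to the
// ordinary character it stands for. The table in the PR is larger.
function makeTable(scripts, plain) {
    const s = Array.from(scripts);
    const p = Array.from(plain);
    return new Map(s.map((ch, i) => [ch, p[i]]));
}

const superscripts = makeTable("⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿⁱ", "0123456789+-=()ni");
const subscripts = makeTable("₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎", "0123456789+-=()");

// Rewrite each maximal run of script characters as ^{...} or _{...}.
function expandUnicodeScripts(tex) {
    const chars = Array.from(tex); // iterate by code point
    let out = "";
    let i = 0;
    while (i < chars.length) {
        const table = superscripts.has(chars[i]) ? superscripts
            : subscripts.has(chars[i]) ? subscripts
            : null;
        if (table === null) {
            out += chars[i];
            i += 1;
            continue;
        }
        // Collect the whole run that belongs to the same table.
        let body = "";
        while (i < chars.length && table.has(chars[i])) {
            body += table.get(chars[i]);
            i += 1;
        }
        out += (table === superscripts ? "^" : "_") + "{" + body + "}";
    }
    return out;
}

console.log(expandUnicodeScripts("A²⁺³")); // A^{2+3}
```

A subscript run followed directly by a superscript run produces two separate groups, for example `x₁²` becomes `x_{1}^{2}`.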

Resolves #1281.

ronkok · 2022-05-16T18:13:51Z

I'm having some issues.

yarn start will cause my browser tab to hang. It's waiting for some msn.assets, which seems odd.
Screenshotter is failing. This is suspicious because I have not altered any screenshotter tests and the test definitions do not contain any (sub|super)script characters. The screenshotter tests should be unchanged.
Netlify preview ~~is also hanging~~ has timed out.

This PR changes only the Parser and the unit tests indicate that the Parser is working just fine. So I don't know what is causing these errors.

Any suggestions would be welcome.

edemaine · 2022-05-17T14:46:46Z

Curious. I can reproduce the behavior you're getting too. Commenting out the changes to Lexer.js makes the problem go away. I would assume this has to do with some weird edge cases around UTF-8 encoding, but I'm not sure whether they're an inherent JavaScript issue or a Webpack bug or what.

I have a different concern, which might also end up helping. I worry that the PR as it is now will have a performance penalty on all of KaTeX, as every lexed symbol needs to be checked against the new regular expression. Here's an alternate proposal:

Don't change the lexer; these characters will be caught by the existing "single codepoint" rule. But now they'll be individual instead of bulk.
Detect such a character using a Set or object, instead of a RegExp. (Actually I don't know which is fastest; maybe worth a benchmark.)
When such a character is detected by the Parser, repeatedly call fetch() until all such characters have been exhausted, and then process them as you currently do.
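For the open question of which detection method is fastest, a rough micro-benchmark along these lines could be a starting point. It is only a sketch: the character list is a small assumed subset, `timeDetector` is a hypothetical helper, and timings will vary by engine and machine, so it does not settle the question.

```javascript
// Assumed subset of script characters, used only for this comparison.
const scriptChars = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎";

const asSet = new Set(scriptChars);
const asObject = Object.create(null); // no prototype, so no inherited keys
for (const ch of scriptChars) {
    asObject[ch] = true;
}
const asRegExp = new RegExp("^[" + scriptChars + "]$");

// Three ways to ask "is this single character a script character?"
const detectors = {
    set: (ch) => asSet.has(ch),
    object: (ch) => asObject[ch] === true,
    regexp: (ch) => asRegExp.test(ch),
};

// Run one detector over the tokens `rounds` times and report hits and time.
function timeDetector(detect, tokens, rounds) {
    const start = performance.now();
    let hits = 0;
    for (let r = 0; r < rounds; r++) {
        for (const t of tokens) {
            if (detect(t)) {
                hits++;
            }
        }
    }
    return {hits, ms: performance.now() - start};
}

// Single characters standing in for lexed symbols.
const tokens = Array.from("x²+y₁=\\alpha⁻¹z");
for (const [name, detect] of Object.entries(detectors)) {
    const {hits, ms} = timeDetector(detect, tokens, 100000);
    console.log(name, hits, ms.toFixed(1) + " ms");
}
```

All three detectors should report the same number of hits; only the elapsed time is expected to differ.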

This will have an additional benefit, which is better interaction with macros. For example:

\def\SupTwo{²}
\def\SupThree{³}
\SupTwo⁺\SupThree

These won't coalesce with the current approach, but should with repeated calls to fetch(). I haven't tested, but I would guess this is how unicode-math would work (as it's difficult to work any other way in TeX).
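The difference can be sketched with a toy token source whose fetch() expands macros before handing back a token. Everything here (`TokenSource`, `collectSuperscript`, the tiny macro and character tables) is a hypothetical stand-in written for this comment, not KaTeX's actual Parser or MacroExpander.

```javascript
// Hypothetical macro definitions, mirroring the \def lines above.
const macros = new Map([
    ["\\SupTwo", ["²"]],
    ["\\SupThree", ["³"]],
]);
// Assumed, minimal superscript table.
const superscripts = new Map([["²", "2"], ["³", "3"], ["⁺", "+"]]);

class TokenSource {
    constructor(tokens) {
        // Keep tokens on a stack with the next token on top.
        this.stack = [...tokens].reverse();
    }

    // Return the next token after macro expansion, without consuming it.
    // Returns null at end of input.
    fetch() {
        while (this.stack.length > 0) {
            const top = this.stack[this.stack.length - 1];
            if (!macros.has(top)) {
                return top;
            }
            // Replace the macro with its expansion, first token on top.
            this.stack.pop();
            this.stack.push(...[...macros.get(top)].reverse());
        }
        return null;
    }

    consume() {
        this.stack.pop();
    }
}

// Keep fetching until the next token is no longer a superscript character.
function collectSuperscript(source) {
    let body = "";
    let token = source.fetch();
    while (token !== null && superscripts.has(token)) {
        body += superscripts.get(token);
        source.consume();
        token = source.fetch();
    }
    return body;
}

const source = new TokenSource(["\\SupTwo", "⁺", "\\SupThree"]);
console.log(collectSuperscript(source)); // 2+3

// Matching on the raw source text instead finds only the literal "⁺".
const raw = "\\SupTwo⁺\\SupThree";
console.log(raw.match(/[²³⁺]+/g));
```

Because each fetch() sees the token stream after expansion, the three characters end up in one group, while a pattern applied to the unexpanded source cannot see inside the macros.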

What do you think, @ronkok? If you think this is a reasonable idea, but you'd rather not implement it, I could take a stab.

ronkok · 2022-05-17T15:42:32Z

Agree completely regarding the performance penalty to the Lexer. That concerned me, too. And now that I think about it, the fetch and repeat approach would fit nicely into the proposed code block that begins with } else if (unicodeSubsAndSups[lex.text[0]]) {.

That approach would not have fit so nicely in an earlier draft of my code. Hence the revision to the Lexer.

I'll have a go at the fetch and repeat approach. It might be a few days until I get to it.

codecov · 2022-05-17T17:48:22Z

Codecov Report

Merging #3633 (7c7a338) into main (c31256f) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #3633      +/-   ##
==========================================
+ Coverage   93.48%   93.50%   +0.01%     
==========================================
  Files          89       90       +1     
  Lines        6619     6636      +17     
  Branches     1538     1543       +5     
==========================================
+ Hits         6188     6205      +17     
  Misses        400      400              
  Partials       31       31

Impacted Files	Coverage Δ
src/Parser.js	`96.14% <100.00%> (+0.14%)`	⬆️
src/unicodeSupOrSub.js	`100.00% <100.00%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c31256f...7c7a338.

ronkok · 2022-05-17T17:48:45Z

This PR now acquires tokens by repeated fetch. The Lexer is reverted to its previous state.

I now get good behavior from yarn start, but I still get no response from yarn run test:jest. That Jest problem occurs both in this branch and in my local main branch. A yarn install did not help. Odd.

ronkok · 2022-05-17T17:53:46Z

The Jest test has apparently run successfully in GitHub and the code is executing well in the Netlify preview. I think this PR is ready to go.

universemaster · 2022-05-19T10:17:21Z

This question should not hold up any pending reviews, but unicodeSupOrSub.js does not contain the subscripts ₕ, ₖ, ₗ, ₘ, ₙ, ₚ, ₛ, ₜ.
Is there a reason for that? They are in the same U+209x range as ₐ, ₑ, ₒ, ₓ.

ₔ, the Latin Extended-C character ⱼ, the Phonetic Extensions characters ᵢ ᵣ ᵤ ᵥ, and the Greek ᵦ ᵧ ᵨ ᵩ ᵪ are also missing, but that's more understandable.

I'm getting these from the Wikipedia page. There are some other ones on that page, too.
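A quick way to check a range like this is to enumerate the code points and report the ones a table lacks. The sketch below assumes a table holding only the four U+209x letters named as present; `missingInRange` is a hypothetical helper, not something in unicodeSupOrSub.js.

```javascript
// Assumed table: only U+2090..U+2093 (the four letters named as present).
const supported = new Set(
    [0x2090, 0x2091, 0x2092, 0x2093].map((cp) => String.fromCodePoint(cp))
);

// List every code point in [first, last] that the table does not contain.
function missingInRange(first, last, table) {
    const missing = [];
    for (let cp = first; cp <= last; cp++) {
        const ch = String.fromCodePoint(cp);
        if (!table.has(ch)) {
            missing.push("U+" + cp.toString(16).toUpperCase() + " " + ch);
        }
    }
    return missing;
}

// U+2090..U+209C are the Latin subscript letters in the
// Superscripts and Subscripts block.
console.log(missingInRange(0x2090, 0x209c, supported));
```

With this assumed table the list has nine entries, U+2094 through U+209C, which are the eight letters listed above plus the subscript schwa.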

universemaster · 2022-05-19T10:28:30Z

unicode-math has a much larger list with the Unicode values here.

ronkok · 2022-05-19T16:04:35Z

Characters such as ₕ ₖ ᵦ ᵧ will not render properly in Notepad++ with the Consolas font. Yes, there are many editors in which these characters will appear fine. Even in Notepad++, they will render properly if the JuliaMono font is installed and used.

But I think that the Notepad++ / Consolas combination is a frequent entry point for beginning coders, and I am a little reluctant to encourage the creation of documents that some authors will not be able to read.

edemaine · 2022-05-19T16:09:42Z

I feel like matching unicode-math's list makes sense. Presumably those who want to use this will use appropriate fonts.

ronkok · 2022-05-19T16:18:46Z

> Presumably those who want to use this will use appropriate fonts.

I have in mind the beginner who looks at existing code as a means to learn something.

I do not hold this opinion strongly. I'm going to wait a day before I take any action.

universemaster · 2022-05-19T16:59:28Z

In my experience, people generally understand what's going on when a character doesn't display correctly and know to switch to a font that supports it.

edemaine · 2022-05-19T19:45:50Z

I understand your concern, but I'd lean heavily toward LaTeX compatibility here. Also, the characters render in the web browser (at least my Chrome on Windows), so it's natural to try to copy/paste them. Even if they don't render in the editor, I would expect them to render on the website.

ronkok · 2022-05-19T21:10:40Z

Okay, I've matched the list in unicode-math. This should be ready to go.

edemaine

This is looking great now! I found what I think is one typo, and a possible code simplification.

src/unicodeSupOrSub.js

src/Parser.js


KaTeX-bot · 2022-05-20T13:54:53Z

🎉 This PR is included in version 0.15.4 🎉

The release is available on:

Your semantic-release bot 📦🚀

feat: Support Unicode (sub|super)script characters

6e507c4

ronkok mentioned this pull request May 16, 2022

Unicode superscript and subscript blocks #1281

Open

Acquire tokens via repeated fetch()

1a0e8b5

ronkok added 2 commits May 19, 2022 13:48

Match more Unicode (sub|super)script characters

81c4deb

Update docs with new characters

ee58b7c

ronkok and others added 2 commits May 19, 2022 14:12

Add Greek characters to RegEx

dbde7d2

Merge branch 'main' into unicodeSubSup

6746f8d

edemaine requested changes May 19, 2022

src/unicodeSupOrSub.js (outdated, resolved)

src/Parser.js (outdated, resolved)

ronkok added 2 commits May 19, 2022 20:00

Pick up review comments

7715c33

Merge branch 'unicodeSubSup' of https://github.com/ronkok/KaTeX into unicodeSubSup

7c7a338

edemaine approved these changes May 20, 2022


edemaine merged commit d8fc35e into KaTeX:main May 20, 2022

KaTeX-bot added a commit that referenced this pull request May 20, 2022

chore(release): 0.15.4 [ci skip]

575dab7

## [0.15.4](v0.15.3...v0.15.4) (2022-05-20)

### Features

* Support Unicode (sub|super)script characters ([#3633](#3633)) ([d8fc35e](d8fc35e))

KaTeX-bot added the released label May 20, 2022

ronkok deleted the unicodeSubSup branch May 20, 2022 15:50

edemaine mentioned this pull request Feb 11, 2024

No character metrics for '³' in style 'Main-Regular' and mode 'text' #3927

Closed
