Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support Unicode (sub|super)script characters #3633

Merged
merged 8 commits into from May 20, 2022

Conversation

ronkok
Copy link
Collaborator

@ronkok ronkok commented May 16, 2022

Math-mode Unicode (sub|super)script characters will now render as if you had written regular characters in a subscript or superscript. For instance, A²⁺³ will render the same as A^{2+3}.

Resolves #1218.

@ronkok
Copy link
Collaborator Author

ronkok commented May 16, 2022

I'm having some issues.

  1. yarn start will cause my browser tab to hang. It's waiting for some msn.assets, which seems odd.
  2. Screenshotter is failing. This is suspicious because I have not altered any screenshotter tests and the test definitions do not contain any (sub|super)script characters. The screenshotter tests should be unchanged.
  3. Netlfy preview is also hanging has timed out.

This PR changes only the Parser and the unit tests indicate that the Parser is working just fine. So I don't know what is causing these errors.

Any suggestions would be welcome.

@edemaine
Copy link
Member

edemaine commented May 17, 2022

Curious. I can reproduce the behavior you're getting too. Commenting out the changes to Lexer.js makes the problem go away. I would assume this has to do with some weird edge-cases around UTF8 encoding, but I'm not sure whether they're an inherent JavaScript issue or a Webpack bug or what.

I have a different concern, which might also end up helping. I worry that the PR as it is now will have a performance penalty on all of KaTeX, as every lexed symbol needs to be checked against the new regular expression. Here's an alternate proposal:

  • Don't change the lexer; these characters will be caught by the existing "single codepoint" rule. But now they'll be individual instead of bulk.
  • Detect such a character using a Set or object, instead of a RegExp. (Actually I don't know which is fastest; maybe worth a benchmark.)
  • When such a character is detected by the Parser, repeated call fetch() until all such characters have been exhausted, and then process them as you currently do.

This will have an additional benefit, which is better interaction with macros. For example:

\def\SupTwo{²}
\def\SupThree{³}
\SupTwo\SupThree

These won't coalesce with the current approach, but should with repeated calls to fetch(). I haven't tested, but I would guess this is how unicode-math would work (as it's difficult to work any other way in TeX).

What do you think, @ronkok? If you think this is a reasonable idea, but you'd rather not implement it, I could take a stab.

@ronkok
Copy link
Collaborator Author

ronkok commented May 17, 2022

Agree completely regarding the performance penalty to the Lexer. That concerned me, too. And now that I think about it, the fetch and repeat approach would fit nicely into the proposed code block that begins with } else if (unicodeSubsAndSups[lex.text[0]]) {.

That approach would not have fit so nicely in an earlier draft of my code. Hence the revision to the Lexer.

I'll have a go at the fetch and repeat approach. It might be a few days until I get to it.

@codecov
Copy link

codecov bot commented May 17, 2022

Codecov Report

Merging #3633 (7c7a338) into main (c31256f) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3633      +/-   ##
==========================================
+ Coverage   93.48%   93.50%   +0.01%     
==========================================
  Files          89       90       +1     
  Lines        6619     6636      +17     
  Branches     1538     1543       +5     
==========================================
+ Hits         6188     6205      +17     
  Misses        400      400              
  Partials       31       31              
Impacted Files Coverage Δ
src/Parser.js 96.14% <100.00%> (+0.14%) ⬆️
src/unicodeSupOrSub.js 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c31256f...7c7a338. Read the comment docs.

@ronkok
Copy link
Collaborator Author

ronkok commented May 17, 2022

This PR now acquires tokens by repeated fetch. The Lexer is reverted to its previous state.

I now get good behavior from yarn start, but I still get no response from yarn run test:jest. That Jest problem occurs both in this branch and in my local main branch. A yarn install did not help. Odd.

@ronkok
Copy link
Collaborator Author

ronkok commented May 17, 2022

The Jest test has apparently run successfully in GitHub and the code is executing well in the Netlify preview. I think this PR is ready to go.

@universemaster
Copy link

This question should not hold up any pending reviews but unicodeSupOrSub.js does not contain the subscripts ₕ , ₖ , ₗ , ₘ , ₙ , ₚ , ₛ , ₜ.
Is there a reason for that? They are in the same U+209x range as

ₐ,ₑ ,ₒ, ₓ.

and latin-extended-C block ⱼ, and Phonetic Extensions ᵢ ᵣ ᵤ ᵥ and Greek ᵦ ᵧ ᵨ ᵩ ᵪ are also missing, but that's more understandable.

I'm getting these from the wikipedia page. There are some other ones on that page, too.

@universemaster
Copy link

unicode-math has a much larger list with the unicode values here.

@ronkok
Copy link
Collaborator Author

ronkok commented May 19, 2022

Characters such as ₕ ₖ ᵦ ᵧ will not render properly in NotePad++ with the Consolas font. Yes, there are many editors in which these characters will appear fine. Even in NotePad++, they will render properly if the JuliaMono font is installed and used.

But I think that the NotePad++ / Consolas combination is a frequent entry point for beginning coders and I am a little reluctant to encourage the creation of documents that some authors will not be able to read.

@edemaine
Copy link
Member

I feel like matching unicode-math's list makes sense. Presumably those who want to use this will use appropriate fonts.

@ronkok
Copy link
Collaborator Author

ronkok commented May 19, 2022

Presumably those who want to use this will use appropriate fonts.

I have in mind the beginner who looks at existing code as a means to learn something.

I do not hold this opinion strongly. I'm going to wait a day before I take any action.

@universemaster
Copy link

In my experience, people generally understand what's going on when a character doesn't display correctly and know to switch to a font that supports it.

@edemaine
Copy link
Member

I understand your concern, but I'd lean heavily toward LaTeX compatibility here. Also, the characters render in the web browser (at least my Chrome on Windows), so it's natural to try to copy/paste them. Even if they don't render in the editor, I would expect them to render on the website.

@ronkok
Copy link
Collaborator Author

ronkok commented May 19, 2022

Okay, I've matched the list in unicode-math. This should be ready to go.

Copy link
Member

@edemaine edemaine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great now! I found what I think is one typo, and a possible code simplification.

src/unicodeSupOrSub.js Outdated Show resolved Hide resolved
src/Parser.js Outdated Show resolved Hide resolved
@edemaine edemaine merged commit d8fc35e into KaTeX:main May 20, 2022
KaTeX-bot added a commit that referenced this pull request May 20, 2022
## [0.15.4](v0.15.3...v0.15.4) (2022-05-20)

### Features

* Support Unicode (sub|super)script characters ([#3633](#3633)) ([d8fc35e](d8fc35e))
@KaTeX-bot
Copy link
Member

🎉 This PR is included in version 0.15.4 🎉

The release is available on:

Your semantic-release bot 📦🚀

@ronkok ronkok deleted the unicodeSubSup branch May 20, 2022 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants