Symbol unicode replacement doesn’t work #243

Closed
flying-sheep opened this issue Jun 9, 2015 · 28 comments · Fixed by #1123

@flying-sheep

test case

the trivial katex.renderToString('σ') throws ParseError: KaTeX parse error: Unexpected character: 'σ' at position 0: ̲σ

@sophiebits
Contributor

I thought we had a tracking issue for this already, but I guess not. Brief discussion on #73 and #59.

@flying-sheep
Author

well, i didn’t even try to understand this code, but wouldn’t it be possible to loop over symbols, and create an index for all literals that maps back to the relevant part of the structure?

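// index every unicode literal back to the font and group of its macro
const unicodeSymbols = {}
for (const mode of ['math', 'text']) {
    unicodeSymbols[mode] = {}
    for (const macro in symbols[mode]) {
        const spec = symbols[mode][macro]
        // only entries with a “.replace” character get an index entry
        if (spec.replace) {
            unicodeSymbols[mode][spec.replace] = { font: spec.font, group: spec.group }
        }
    }
}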

and then, using this, we could search for the symbol before the end of the lexer tests:

if (input[pos] in unicodeSymbols.text || input[pos] in unicodeSymbols.math)
    return new Token(input[pos], ...)

and later we make sure that parseSymbol is called for those tokens (if it wouldn’t already) and extend the lines

} else if (symbols[mode][nucleus.text]) {
...
    new ParseNode(symbols[mode][nucleus.text].group, nucleus.text, mode),

to

let spec = null
...
} else if ((spec = symbols[mode][nucleus.text]) || (spec = unicodeSymbols[mode][nucleus.text])) {
...
    new ParseNode(spec.group, nucleus.text, mode),

as an alternative to unicodeSymbols, we could simply alias the elements in symbols to be both keyed by the macro (e.g. "\\sigma") and their “.replace” (e.g. "σ"):

let symbols = {
    ...
    "\\sigma": { ... },
    "σ": { ...(same as above)... },
    ...
}

@sophiebits
Contributor

It's not quite that simple because (I believe) the character used in the math fonts we use might not necessarily align with the character that's appropriate in the input string.

@flying-sheep
Author

you believe? how to be sure?

@qbolec

qbolec commented Jun 13, 2015

I have problems with an existing codebase, which contains the following symbols:
'−'
'⟨'
'⋅'
'≤'
'≥'
'α'

@kevinbarabash
Member

@flying-sheep I like the idea of auto-generating symbols based on existing "replace" symbols and just adding them to the symbols dictionary. It doesn't cover every unicode char, but it's a good start. Would you be able to create a pull request that automatically adds entries to symbols based on existing entries? I think it's more maintainable than adding aliases manually, people don't have to grab the unicode character when adding a new entry.
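
The auto-generation suggested here could look roughly like the following. This is a minimal sketch, not the code that was eventually merged: the `{ font, group, replace }` shape is taken from the snippet earlier in the thread, while the tiny `symbols` table and the `addUnicodeAliases` helper are assumptions made up for illustration.

```javascript
// Hypothetical miniature of the symbols table; the real table in symbols.js
// is much larger. The { font, group, replace } shape follows the snippet
// earlier in this thread.
const symbols = {
    math: {
        "\\sigma": { font: "main", group: "mathord", replace: "σ" },
        "\\leq": { font: "main", group: "rel", replace: "≤" },
        "\\alpha": { font: "main", group: "mathord", replace: "α" },
    },
    text: {},
};

// For every macro that has a `replace` character, register that character as
// an additional key pointing at the same spec object. Existing keys are never
// overwritten. Object.entries() returns a snapshot, so adding keys while
// looping is safe.
function addUnicodeAliases(table) {
    for (const mode of Object.keys(table)) {
        for (const [, spec] of Object.entries(table[mode])) {
            const ch = spec.replace;
            if (ch && !Object.prototype.hasOwnProperty.call(table[mode], ch)) {
                table[mode][ch] = spec;
            }
        }
    }
    return table;
}

addUnicodeAliases(symbols);
console.log(symbols.math["σ"].group); // "mathord"
```

Because the alias shares the spec object with the macro entry, nobody has to copy the Unicode character by hand when a new macro is added.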

@flying-sheep
Author

you’ll first have to check @spicyj’s claim that it may not always be a bidirectional mapping

@kevinbarabash
Member

I created a jsfiddle that displays the glyphs for all of the "replace" symbols using the unicode character specified in symbols.js: https://jsfiddle.net/047yzexz/1/. It seems like most are the correct character, but a few are showing up as boxes which is probably because the default font doesn't support those.

@flying-sheep
Author

umm, that’s called textContent, not innerText

also i added some names… is there some better unicodedata out there that has everything in JSON or so?

@kevinbarabash
Member

http://www.unicode.org/Public/UCD/latest/ucd/Index.txt should contain all of the names for every unicode entry. Did you want the data to verify those glyphs are being displayed as missing? I'm confident that everything will check out, but it's good to be sure.

@flying-sheep
Author

Did you want the data to verify those glyphs are being displayed as missing

yeah, to check @spicyj’s claim. and i’m already using that list.

@flying-sheep
Author

about @qbolec’s case: we have all those symbols, except “−”, which is the real mathematical minus, and is encoded in TeX as the “-” aka hyphen-minus, the thing everyone has on the keyboard.

@kevinbarabash
Member

@flying-sheep the hyphen is in the jsfiddle and maps correctly to the minus sign. In terms of verifying that everything maps correctly, including the missing glyphs, it might be easier to just add the code to do the mapping and then programmatically create KaTeX layouts for each glyph in that list, add them to a page, and check that each pair has identical symbols.
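
One way to automate that pairwise check is sketched below. The `findMismatches` helper and its `render` parameter are hypothetical names; `render` stands for any function from a TeX string to comparable markup, for example a thin wrapper around `katex.renderToString`. Note that KaTeX output can embed the original source string, so the wrapper would have to return only the part worth comparing.

```javascript
// Compare the rendering of each macro with the rendering of its `replace`
// character. `render` is supplied by the caller so that this sketch does not
// depend on KaTeX being loaded.
function findMismatches(modeTable, render) {
    const mismatches = [];
    for (const [macro, spec] of Object.entries(modeTable)) {
        if (!spec.replace) continue; // nothing to compare against
        let viaMacro;
        let viaChar;
        try {
            viaMacro = render(macro);
            viaChar = render(spec.replace);
        } catch (err) {
            // A parse error on either side also counts as a mismatch.
            mismatches.push({ macro, char: spec.replace, reason: String(err) });
            continue;
        }
        if (viaMacro !== viaChar) {
            mismatches.push({ macro, char: spec.replace, reason: "different output" });
        }
    }
    return mismatches;
}
```

An empty result would mean every macro and its Unicode counterpart produced the same markup under the chosen `render` function.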

@flying-sheep
Author

seems that i overlooked it. great!

@gagern
Collaborator

gagern commented Jun 20, 2015

The tool I recently committed in gagern@5e127ba can be used to display KaTeX fonts in browser, together with the corresponding rendering in system default fonts. As far as I could tell from skimming the lists, the symbols all match up except for symbols from the private use area. There are a few of these in the Size1 and Size4 fonts, apparently for horizontal braces or something like that. And the Typewriter font has typographic single quotation marks at \u07E2 and \u07e3 which is incorrect.

@gagern
Collaborator

gagern commented Jun 20, 2015

@spicyj wrote:

I thought we had a tracking issue for this already, but I guess not.

Were you perhaps referring to #16? That bug has no discussion to it, but it does have an assignee.

In trying to sort through the various unicode bugs, this one here seems to have the most momentum to it just now. I'm conducting a short survey, trying to see what else might be useful.

  • Apparently pull requests "Add unicode support" (#59) and "unicode support (including ∑∏∐ ∫ and √)" (#73) were closed by the contributor, perhaps due to lack of positive feedback? There the approach was to list unicode symbols with common characteristics (same font, same group type) in a string literal, which is an alternative approach to the auto-generation discussed above. I guess they might still be used to cross-check anything that's been done here, and identify contradictions to be investigated further.
  • "Incorporate existing symbol mappings" (#49) suggests a tool which maps unicode to LaTeX. We might use that to check any results here for completeness, although I think it would be easier to check whether there are any symbols, particularly in the Main and AMS fonts, which are not accessible by one of our symbol definitions.
  • I just opened "Accept unicode from mathematical alphabets on input" (#260) to cover unicode symbols which we map to latin letters from some special font.
  • "problems with unicode" (#65) mentions accented characters in the original statement. I believe that there might be a distinction there for text vs. math. For text it could have been handled by allowing unicode in text fragments, as "Support unicode in text" (#15) suggests. For math mode we'd probably need more information, and a font to render it, so that's way more complicated.

flying-sheep added a commit to flying-sheep/KaTeX that referenced this issue Jun 20, 2015
@flying-sheep
Author

fixed in #261

@kevinbarabash
Member

I think we just need to exclude those from the list because they have special meaning. If we come across a _ or ^ it should not be parsed as a symbol.
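
A small sketch of that exclusion, assuming the `{ font, group, replace }` entry shape from earlier in the thread. The `collectAliases` helper, the `RESERVED_CHARS` set, and the macro names in the sample table are placeholders invented for illustration, not names from symbols.js.

```javascript
// Characters that act as syntax in TeX input and must never become symbol keys.
const RESERVED_CHARS = new Set(["^", "_"]);

// Collect { character: spec } aliases from one mode's table, skipping any
// character that is reserved.
function collectAliases(modeTable, reserved = RESERVED_CHARS) {
    const aliases = {};
    for (const spec of Object.values(modeTable)) {
        if (spec.replace && !reserved.has(spec.replace)) {
            aliases[spec.replace] = spec;
        }
    }
    return aliases;
}

// Placeholder macro names, for illustration only.
const sampleTable = {
    "\\exampleCaret": { font: "main", group: "textord", replace: "^" },
    "\\exampleUnderscore": { font: "main", group: "textord", replace: "_" },
    "\\sigma": { font: "main", group: "mathord", replace: "σ" },
};

console.log(Object.keys(collectAliases(sampleTable))); // [ 'σ' ]
```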

@flying-sheep
Author

done. at first i was confused since ^ didn’t appear to be in the symbols, but then i remembered that my texteditor’s search interprets things as regex, so ofc it was there!

@sophiebits
Contributor

Were you perhaps referring to #16?

@gagern Yes, thanks.

flying-sheep added a commit to flying-sheep/KaTeX that referenced this issue Jun 21, 2015
@qbolec

qbolec commented Oct 13, 2015

The problem still exists. It also concerns angle brackets: '⟩' and '⟨' .

@kevinbarabash
Member

@qbolec I was going to create a pull request containing those symbols which appear in only a single family, but I haven't gotten around to it. Unfortunately '⋅' appears in a few, see #261 (comment) for details.
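
Finding the characters that are safe to alias could be sketched as follows. It reads "family" as the font/group pair of an entry, which is an assumption, and the `ambiguousChars` helper and the sample entries are illustrative rather than copied from symbols.js.

```javascript
// Return the `replace` characters in one mode's table that map to more than
// one distinct font/group combination, and therefore cannot be aliased
// unambiguously.
function ambiguousChars(modeTable) {
    const familiesByChar = new Map();
    for (const spec of Object.values(modeTable)) {
        if (!spec.replace) continue;
        if (!familiesByChar.has(spec.replace)) {
            familiesByChar.set(spec.replace, new Set());
        }
        familiesByChar.get(spec.replace).add(spec.font + "/" + spec.group);
    }
    const result = [];
    for (const [ch, families] of familiesByChar) {
        if (families.size > 1) result.push(ch);
    }
    return result;
}

// Illustrative entries: the same dot character used as a binary operator and
// as punctuation.
const sampleMathTable = {
    "\\cdot": { font: "main", group: "bin", replace: "⋅" },
    "\\cdotp": { font: "main", group: "punct", replace: "⋅" },
    "\\leq": { font: "main", group: "rel", replace: "≤" },
};

console.log(ambiguousChars(sampleMathTable)); // [ '⋅' ]
```

Every character not in the returned list belongs to a single family and could be added without deciding between competing meanings.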

@gagern
Collaborator

gagern commented Oct 14, 2015

@kevinbarabash Adding the extra symbols would be easy, now that we have d423bec to get them past the lexer without hassle.

What has me worried is the opt-in nature of this feature, which we agreed upon in response to #261 (comment). We need that configuration information available and taken into account in all the places where we make use of the symbols table, and I haven't yet decided on the most elegant way to achieve that.

@whykushal93

I found one resource containing the mappings between Unicode characters and the corresponding macros; it could probably be useful. http://ctan.math.washington.edu/tex-archive/macros/latex/contrib/unicode-math/unimath-symbols.pdf

@kalvdans

kalvdans commented May 2, 2024

I'd like to reopen this issue since simple greek letters like µ generate two warnings:

LaTeX-incompatible input and strict mode is set to 'warn': Unrecognized Unicode character "µ" (181) [unknownSymbol] katex.min.js:1:5587
No character metrics for 'µ' in style 'Main-Regular' and mode 'text'

According to the documentation at https://katex.org/docs/supported.html#letters-and-unicode , unicode versions of greek letters "will render properly in any KaTeX rendering mode".

@edemaine
Member

edemaine commented May 2, 2024

This sounds like a documentation issue. I believe they are only supported (and supposed to work) in math mode. I believe this is how LaTeX behaves as well.

@kalvdans

kalvdans commented May 2, 2024

@edemaine thanks for your reply. I dug deeper and found out that I've used µ U+00B5 MICRO SIGN, instead of μ U+03BC GREEK SMALL LETTER MU that katex handles just fine. Mystery solved! Stupid Unicode with its duplicates.
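
For anyone hitting the same warning, the two look-alikes can be told apart and fixed up before the string reaches KaTeX. This is a sketch of caller-side preprocessing, not something KaTeX does itself as far as this thread shows; `describeCodePoints` and `replaceMicroSign` are made-up helper names.

```javascript
const MICRO_SIGN = "\u00B5";      // µ, the character that triggered the warnings
const GREEK_SMALL_MU = "\u03BC";  // μ, the character KaTeX handled fine

// List the code points of a string, to spot look-alike characters.
function describeCodePoints(str) {
    return Array.from(str, (ch) =>
        "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

// Targeted fix: swap only the micro sign for the Greek letter.
function replaceMicroSign(tex) {
    return tex.split(MICRO_SIGN).join(GREEK_SMALL_MU);
}

console.log(describeCodePoints(MICRO_SIGN + GREEK_SMALL_MU)); // [ 'U+00B5', 'U+03BC' ]

// Unicode compatibility normalization performs the same mapping, but it also
// rewrites many other characters, so the targeted replacement is the more
// conservative choice.
console.log(MICRO_SIGN.normalize("NFKC") === GREEK_SMALL_MU); // true
```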

@flying-sheep
Author

flying-sheep commented May 3, 2024

Stupid Unicode trying to encode all of human written language in an interoperable way 😉
