Should len(string) return the number of graphemes, or the number of codepoints? #967

christianp · 2022-11-21T12:12:52Z

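In Unicode, graphemes might be represented by a sequence of several codepoints, and a single codepoint can take more than one UTF-16 code unit. For example, é can be written as e followed by the combining accent U+0301, which is two codepoints. The emoji 🫶 is one codepoint, U+1FAF6, but two UTF-16 code units: \ud83e\udef6.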

Should the length of a string in JME count graphemes or codepoints? I think the least-surprising answer from a human's perspective is graphemes, but that means that all the methods for indexing and slicing strings need to be grapheme-aware.
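To make the difference concrete, here is a minimal sketch in JavaScript comparing two ways of measuring a string. The helper names `codeUnitLength` and `codepointLength` are made up for illustration and are not part of JME.

```javascript
// Hypothetical helpers, for illustration only; not JME functions.

// Counts UTF-16 code units, which is what String.prototype.length reports.
function codeUnitLength(s) {
  return s.length;
}

// Counts codepoints: iterating a string yields one item per codepoint.
function codepointLength(s) {
  return Array.from(s).length;
}

const heartHands = "\u{1FAF6}"; // 🫶, a single codepoint outside the BMP
const eAcute = "e\u0301";       // "e" + combining acute accent, shown as one character

console.log(codeUnitLength(heartHands), codepointLength(heartHands)); // 2 1
console.log(codeUnitLength(eAcute), codepointLength(eAcute));         // 2 2
```

Neither count gives 1 for the accented e, which is why a grapheme-aware length needs a segmentation step on top of codepoint iteration.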

christianp · 2023-01-17T15:55:05Z

Blog posts on how this is dealt with in different languages:

Libraries to deal with grapheme clusters:

Python - https://pypi.org/project/grapheme/
JavaScript - https://github.com/orling/grapheme-splitter

There is a proposal to add an Intl.Segmenter interface to JS to deal with this.
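As a rough sketch of what grapheme-aware length and slicing could look like, assuming a runtime that already provides `Intl.Segmenter`: the functions `graphemes`, `graphemeLength` and `graphemeSlice` are hypothetical names for illustration, not existing JME or Numbas functions.

```javascript
// Assumes Intl.Segmenter is available in the runtime.
// All function names below are hypothetical, for illustration only.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Split a string into an array of grapheme clusters.
function graphemes(s) {
  return Array.from(segmenter.segment(s), (part) => part.segment);
}

// Length measured in graphemes rather than code units or codepoints.
function graphemeLength(s) {
  return graphemes(s).length;
}

// Slice by grapheme index, so a cluster is never cut in half.
function graphemeSlice(s, start, end) {
  return graphemes(s).slice(start, end).join("");
}

// "é" (e + combining accent), "a", then 🫶 with a skin-tone modifier.
const s = "e\u0301a\u{1FAF6}\u{1F3FD}";

console.log(s.length);               // 7 UTF-16 code units
console.log(graphemeLength(s));      // 3 graphemes
console.log(graphemeSlice(s, 0, 1)); // the accented e, both codepoints kept together
```

Splitting the whole string on every call is linear in its length, so indexing and slicing implemented this way would cost more than the native code-unit operations.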

christianp added the Needs thinking about label Nov 21, 2022
