Replacing lodash string functions with native one requires special care for Unicode strings with non-BMP symbols #321

Open
laithshadeed opened this issue Aug 19, 2021 · 1 comment

laithshadeed commented Aug 19, 2021

For example, the native call '😀-hi-🐅'.split('') breaks the string, whereas lodash's _.split('😀-hi-🐅', '') does not. The native method does not recognize an emoji as a single symbol and instead splits its surrogate pair into two pieces. This is also why '😀'.length returns 2 instead of 1.
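To make the surrogate-pair point concrete, here is a small sketch using only built-in string methods. The variable names are illustrative and not part of the original report.

```javascript
// '😀' is U+1F600, which lies outside the BMP, so UTF-16 stores it as two code units.
const grin = '😀';

const lengthInCodeUnits = grin.length;  // counts UTF-16 code units, not symbols
const codeUnits = grin.split('');       // cuts between the two surrogates
const codePoint = grin.codePointAt(0);  // reads the whole code point instead

console.log(lengthInCodeUnits);                                          // 2
console.log(codeUnits.map((u) => u.charCodeAt(0).toString(16)).join(' ')); // d83d de00
console.log(codePoint.toString(16));                                     // 1f600
```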

Lodash takes special care of strings that contain non-BMP symbols such as emojis. To split '😀-hi-🐅' correctly with native code, you can use the spread operator: [...'😀-hi-🐅']
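The spread operator works here because the string iterator walks the string by code point rather than by UTF-16 code unit. A minimal sketch of the difference, with illustrative variable names:

```javascript
const text = '😀-hi-🐅';

// The string iterator yields whole code points, so surrogate pairs stay together.
const byCodePoint = [...text];
// Array.from uses the same iterator and produces the same array.
const viaArrayFrom = Array.from(text);

console.log(text.length);         // 8 (UTF-16 code units)
console.log(byCodePoint.length);  // 6 (code points)
console.log(viaArrayFrom.length); // 6
```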

But even the spread operator does not handle grapheme clusters, where several code points combine into one visible symbol. For that, you need the Unicode Text Segmentation algorithm, which Chrome has shipped as Intl.Segmenter since version 87. You can use it like this:

[...(new Intl.Segmenter).segment('😀-hi-🐅')].map(x => x.segment)
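As a small illustration of where the spread operator stops being enough, the sketch below compares it with Intl.Segmenter on a letter followed by a combining accent. It assumes a runtime that provides Intl.Segmenter; the helper name splitGraphemes is made up for this example.

```javascript
// 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points, one grapheme cluster.
const accented = 'e\u0301';

// Hypothetical helper: split a string into grapheme clusters.
function splitGraphemes(input) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(input), (part) => part.segment);
}

console.log([...accented].length);            // 2
console.log(splitGraphemes(accented).length); // 1
```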

More on Unicode issues in JavaScript: https://mathiasbynens.be/notes/javascript-unicode

Happy passing emojis around 😀

@laithshadeed laithshadeed changed the title Replacing lodash string functions with native one requires special care for Unicode strings Replacing lodash string functions with native one requires special care for Unicode strings with non-BMP symbols Aug 19, 2021
@cht8687 cht8687 added the Tips label Sep 25, 2021
@mrienstra

Comparison of some methods:
https://stackblitz.com/edit/stackblitz-typescript-lrag9u?devToolsHeight=90&file=index.ts

const str = '🐅-👨‍👩‍👧-நி-깍-葛󠄀';

naive, split

str.split('');
// (20) ["\ud83d", '\udc05', '-', '\ud83d', '\udc68', '‍', '\ud83d', '\udc69', '‍', '\ud83d', '\udc67', '-', 'ந', 'ி', '-', '깍', '-', '葛', '\udb40', '\udd00']

slightly better, spread operator

[...str]
// (15) ["🐅", '-', '👨', '‍', '👩', '‍', '👧', '-', 'ந', 'ி', '-', '깍', '-', '葛', '󠄀']

In supported browsers, Intl.Segmenter

[...new Intl.Segmenter().segment(str)].map((g) => g.segment);
// (9) ["🐅", '-', '👨‍👩‍👧', '-', 'நி', '-', '깍', '-', '葛󠄀']
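Because Intl.Segmenter only exists in supported environments, one possible approach is to feature-detect it and fall back to code-point splitting otherwise. This is only a sketch: the function name splitSymbols and its second parameter are hypothetical, and the fallback still breaks grapheme clusters apart, as the spread operator result above shows.

```javascript
// Hypothetical helper: prefer Intl.Segmenter, otherwise fall back to code points.
// The constructor is a parameter so the fallback path can be exercised by passing null.
function splitSymbols(input, SegmenterCtor = globalThis.Intl?.Segmenter) {
  if (typeof SegmenterCtor === 'function') {
    const segmenter = new SegmenterCtor(undefined, { granularity: 'grapheme' });
    return Array.from(segmenter.segment(input), (part) => part.segment);
  }
  // Fallback: the string iterator keeps surrogate pairs together but not clusters.
  return [...input];
}

// Man, ZWJ, woman, ZWJ, girl: five code points that render as one family emoji.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';

console.log(splitSymbols(family, null).length); // 5 (forced fallback)
```

Passing the constructor in explicitly also makes it easy to swap in a polyfilled implementation if one is loaded.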

graphemer 1.4.0

import Graphemer from 'graphemer';
const splitter = new Graphemer();
splitter.splitGraphemes(str);
// (9) ["🐅", '-', '👨‍👩‍👧', '-', 'நி', '-', '깍', '-', '葛󠄀']

lodash 4.17.10

import _ from 'lodash';
_.split(str, '');
// (11) ["🐅", '-', '👨‍👩‍👧', '-', 'ந', 'ி', '-', '깍', '-', '葛', '󠄀']

fabric.js v6.0.0-beta10 graphemeSplit (internal function)

import { graphemeSplit } from './fabric_graphemeSplit';
graphemeSplit(str);
// (15) ["🐅", '-', '👨', '‍', '👩', '‍', '👧', '-', 'ந', 'ி', '-', '깍', '-', '葛', '󠄀']

@formatjs Intl.Segmenter 11.4.2 polyfill

await import('@formatjs/intl-segmenter/polyfill-force');
[...new Intl.Segmenter().segment(str)].map((g) => g.segment);
// (9) ["🐅", '-', '👨‍👩‍👧', '-', 'நி', '-', '깍', '-', '葛󠄀']
