Replacing lodash string functions with native one requires special care for Unicode strings with non-BMP symbols #321

laithshadeed · 2021-08-19T12:19:53Z

For example, the native call '😀-hi-🐅'.split('') breaks the string, whereas lodash's _.split('😀-hi-🐅', '') does not. The native method does not recognize an emoji as a single symbol and instead splits its surrogate pair into two pieces. For the same reason, '😀'.length returns 2 instead of 1.
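To make the surrogate-pair point concrete, here is a minimal sketch that inspects the UTF-16 code units of a single emoji. The variable names are only illustrative and not part of lodash or the issue.

```javascript
// 😀 is U+1F600, which lies outside the Basic Multilingual Plane (BMP).
const grin = '\u{1F600}';

// split('') works on UTF-16 code units, so the emoji comes apart
// into a high surrogate and a low surrogate.
const units = grin.split('').map((u) => u.charCodeAt(0).toString(16));

console.log(grin.length); // 2
console.log(units); // [ 'd83d', 'de00' ]

// codePointAt reads the whole pair back as one code point.
console.log(grin.codePointAt(0).toString(16)); // 1f600
```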

Lodash takes special care of strings that contain non-BMP symbols such as emojis. To split '😀-hi-🐅' correctly without lodash, you can use the spread operator, which iterates by code point rather than by code unit: [...'😀-hi-🐅']

But even the spread operator does not handle grapheme clusters, that is, single user-perceived characters made up of several code points. For that, you need the Unicode Text Segmentation algorithm (UAX #29). Chrome has shipped it as Intl.Segmenter since version 87. You can use it like this:

[...(new Intl.Segmenter).segment('😀-hi-🐅')].map(x => x.segment)
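Since Intl.Segmenter is not available in every runtime, one possible approach is to feature-detect it and fall back to code-point splitting. The sketch below assumes that a degraded result is acceptable when the API is missing. The helper name splitGraphemes is hypothetical and does not come from lodash or from this thread.

```javascript
// Hypothetical helper: split a string into user-perceived characters
// when Intl.Segmenter exists, otherwise into code points.
function splitGraphemes(str, locale) {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    // 'grapheme' is the default granularity; it is spelled out for clarity.
    const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
    return Array.from(segmenter.segment(str), (s) => s.segment);
  }
  // Fallback: keeps surrogate pairs together but still breaks
  // grapheme clusters such as ZWJ emoji sequences.
  return Array.from(str);
}

console.log(splitGraphemes('😀-hi-🐅'));
// [ '😀', '-', 'h', 'i', '-', '🐅' ]
```

For this particular string both branches give the same result, because it contains no multi-code-point clusters. The branches only diverge on input such as family emojis or combining marks.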

More about Unicode issues in JavaScript: https://mathiasbynens.be/notes/javascript-unicode

Happy passing emojis around 😀


mrienstra · 2023-07-27T07:14:21Z

Comparison of some methods:
https://stackblitz.com/edit/stackblitz-typescript-lrag9u?devToolsHeight=90&file=index.ts

const str = '🐅-👨‍👩‍👧-நி-깍-葛󠄀';

naive, split

str.split('');
// (20) ["\ud83d", '\udc05', '-', '\ud83d', '\udc68', '‍', '\ud83d', '\udc69', '‍', '\ud83d', '\udc67', '-', 'ந', 'ி', '-', '깍', '-', '葛', '\udb40', '\udd00']

slightly better, spread operator

[...str]
// (15) ["🐅", '-', '👨', '‍', '👩', '‍', '👧', '-', 'ந', 'ி', '-', '깍', '-', '葛', '󠄀']

In supported browsers, Intl.Segmenter

[...new Intl.Segmenter().segment(str)].map((g) => g.segment);
// (9) ["🐅", '-', '👨‍👩‍👧', '-', 'நி', '-', '깍', '-', '葛󠄀']

graphemer 1.4.0

import Graphemer from 'graphemer';
const splitter = new Graphemer();
splitter.splitGraphemes(str);
// (9) ["🐅", '-', '👨‍👩‍👧', '-', 'நி', '-', '깍', '-', '葛󠄀']

lodash 4.17.10

import _ from 'lodash';
_.split(str, '');
// (11) ["🐅", '-', '👨‍👩‍👧', '-', 'ந', 'ி', '-', '깍', '-', '葛', '󠄀']

fabric.js v6.0.0-beta10 graphemeSplit (internal function)

import { graphemeSplit } from './fabric_graphemeSplit';
graphemeSplit(str);
// (15) ["🐅", '-', '👨', '‍', '👩', '‍', '👧', '-', 'ந', 'ி', '-', '깍', '-', '葛', '󠄀']

@formatjs Intl.Segmenter 11.4.2 polyfill

await import('@formatjs/intl-segmenter/polyfill-force');
[...new Intl.Segmenter().segment(str)].map((g) => g.segment);
// (9) ["🐅", '-', '👨‍👩‍👧', '-', 'நி', '-', '깍', '-', '葛󠄀']
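The different array lengths above correspond to three levels of counting: UTF-16 code units, code points, and grapheme clusters. As a rough sketch, the hypothetical countUnits helper below (not part of any library mentioned here) reports all three for the same test string, written with escapes so that the invisible joiner and variation selector survive copy and paste.

```javascript
// Same string as `str` above: tiger, ZWJ family, Tamil "ni",
// a Hangul syllable, and a CJK character with variation selector U+E0100.
const sample =
  '\u{1F405}-\u{1F468}\u200D\u{1F469}\u200D\u{1F467}-\u0BA8\u0BBF-\uAE4D-\u845B\u{E0100}';

// Hypothetical helper: count the string at each level.
function countUnits(s) {
  const counts = {
    codeUnits: s.length, // what split('') sees
    codePoints: Array.from(s).length, // what the spread operator sees
  };
  // Grapheme count is only filled in where Intl.Segmenter is available.
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    counts.graphemes = Array.from(new Intl.Segmenter().segment(s)).length;
  }
  return counts;
}

console.log(countUnits(sample));
// codeUnits is 20 and codePoints is 15; graphemes should match the
// 9 segments reported for Intl.Segmenter above where it is supported.
```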

laithshadeed changed the title ~~Replacing lodash string functions with native one requires special care for Unicode strings~~ Replacing lodash string functions with native one requires special care for Unicode strings with non-BMP symbols Aug 19, 2021

cht8687 added the Tips label Sep 25, 2021
