emDic.csv is a UTF-8 to CESU-8 conversion file #1

Crissov · 2017-05-09T19:46:26Z

Your emDic.csv, adapted from the file of the same name by @today-is-a-good-day, has three columns: Description holds the Unicode name, Bytes holds the UTF-8 bytes in C hexadecimal notation (i.e. with a leading \x), and R-encoding holds the CESU-8 hexadecimal bytes in angle brackets. CESU-8 is (almost) identical to UTF-8 in the BMP, i.e. for U+0000 through U+FFFF, but differs for the astral planes: there, each code point is written as a UTF-16 surrogate pair, and each surrogate takes three bytes. Most emojis live in one of those planes, the SMP.
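To make the difference concrete, here is a minimal Python sketch (not R, and not how emDic.csv was actually produced) that derives both byte columns from a code point. The function names `cesu8_encode`, `c_hex` and `angle_hex` are made up for illustration, and the exact column formatting is an assumption based on the description above.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: BMP characters as in UTF-8, astral
    characters as a UTF-16 surrogate pair with 3 bytes per surrogate."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP: same bytes as UTF-8
            out += ch.encode("utf-8")
            continue
        # Astral plane: split into high and low surrogate
        v = cp - 0x10000
        high = 0xD800 + (v >> 10)
        low = 0xDC00 + (v & 0x3FF)
        for s in (high, low):
            # Each surrogate is written with the 3-byte UTF-8 pattern
            out += bytes([0xE0 | (s >> 12),
                          0x80 | ((s >> 6) & 0x3F),
                          0x80 | (s & 0x3F)])
    return bytes(out)


def c_hex(data: bytes) -> str:
    """Format bytes like the Bytes column (assumed): \\xf0\\x9f..."""
    return "".join(f"\\x{b:02x}" for b in data)


def angle_hex(data: bytes) -> str:
    """Format bytes like the R-encoding column (assumed): <ed><a0>..."""
    return "".join(f"<{b:02x}>" for b in data)


if __name__ == "__main__":
    emoji = "\U0001F600"
    print(c_hex(emoji.encode("utf-8")))
    print(angle_hex(cesu8_encode(emoji)))
```

For U+1F600 the UTF-8 form has four bytes and the CESU-8 form has six, which is why the table needs a separate column for each.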

This seems rather static, and as you acknowledge in your blog post, the emoji data is severely outdated. I assume there is a simpler and more flexible way to achieve the same result in R (but I’m new to it). I only know that R also supports \U######## and, for the BMP, \u#### notation, e.g. \U0001F600 for U+1F600 😀 (as far as I can tell, R accepts one to eight hex digits after \U, so the leading zeros are optional). Maybe the Unicode package would be helpful.

If you don't want to parse the emoji data files released by Unicode directly (most of which are not simple CSVs) to identify emojis (or, worse, emoji sequences), there is probably an automatically updated project on GitHub that provides the respective CSV files – alas, I haven’t found it yet.
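For what it's worth, parsing those files directly may be less painful than it sounds. Below is a rough Python sketch that reads lines in the style of Unicode's emoji-test.txt. The sample lines are typed in by hand, the layout (`code points ; status # emoji Eversion name`) is assumed from recent releases, and `parse_emoji_test` is a hypothetical helper, not part of any package.

```python
# Hand-written sample in the assumed emoji-test.txt layout
SAMPLE = """\
# group: Smileys & Emotion
1F600                                                  ; fully-qualified     # 😀 E1.0 grinning face
263A FE0F                                              ; fully-qualified     # ☺️ E0.6 smiling face
263A                                                   ; unqualified         # ☺ E0.6 smiling face
"""


def parse_emoji_test(text: str) -> list[dict]:
    """Return one dict per data line; comment and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # Split at the first '#': data fields before it, comment after it
        data, _, comment = line.partition("#")
        codepoints, _, status = data.partition(";")
        cps = [int(h, 16) for h in codepoints.split()]
        # Comment is assumed to be "<emoji> E<version> <name>"
        parts = comment.strip().split(" ", 2)
        name = parts[2] if len(parts) == 3 else ""
        rows.append({
            "codepoints": cps,
            "status": status.strip(),
            # Rebuild the emoji from its code points instead of trusting the comment
            "emoji": "".join(chr(cp) for cp in cps),
            "name": name,
        })
    return rows


if __name__ == "__main__":
    for row in parse_emoji_test(SAMPLE):
        print(row["status"], [f"U+{cp:04X}" for cp in row["codepoints"]], row["name"])
```

Sequences fall out naturally because every row carries a list of code points rather than a single one. Older releases of the file may not have the `E` version field, in which case the name split would need adjusting.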
