Your emDic.csv, adopted from the same file by @today-is-a-good-day, has the Unicode Name in the first column labelled Description, UTF-8 bytes in C hexadecimal notation (i.e. leading \x) in the second column labelled Bytes and CESU-8 hexadecimal bytes in angle brackets in the third column labelled R-encoding. CESU-8 is (almost) identical to UTF-8 in the BMP, i.e. for U+0000 through U+FFFF, but differs for the astral planes, in one of which (the SMP) most emojis live.
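To make the difference between the two byte columns concrete, here is a minimal sketch in Python (not R, purely to illustrate the byte layouts). The function names are made up for this example, and the upper/lower case of the hex digits is an assumption about how the columns are written.

```python
def utf8_c_hex(text: str) -> str:
    """UTF-8 bytes in C hexadecimal notation, e.g. \\xF0\\x9F\\x98\\x80."""
    return "".join(f"\\x{b:02X}" for b in text.encode("utf-8"))


def cesu8_bytes(text: str) -> bytes:
    """Encode text as CESU-8.

    BMP characters are encoded exactly as in UTF-8. Characters beyond
    U+FFFF are first split into a UTF-16 surrogate pair, and each
    surrogate is then written as a three-byte sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")
            continue
        v = cp - 0x10000
        high = 0xD800 | (v >> 10)    # high (lead) surrogate
        low = 0xDC00 | (v & 0x3FF)   # low (trail) surrogate
        for unit in (high, low):
            out += bytes([
                0xE0 | (unit >> 12),
                0x80 | ((unit >> 6) & 0x3F),
                0x80 | (unit & 0x3F),
            ])
    return bytes(out)


def cesu8_angle(text: str) -> str:
    """CESU-8 bytes in angle-bracket notation, e.g. <ed><a0><bd>..."""
    return "".join(f"<{b:02x}>" for b in cesu8_bytes(text))


grinning = "\U0001F600"
print(utf8_c_hex(grinning))   # \xF0\x9F\x98\x80
print(cesu8_angle(grinning))  # <ed><a0><bd><ed><b8><80>
```

For a BMP character such as U+263A both functions yield the same three bytes; only characters outside the BMP come out differently (four bytes in UTF-8, six in CESU-8).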
This seems rather static and, as you acknowledge in your blog post, the emoji data is severely outdated. I assume there is a simpler and more flexible way to achieve the same result in R (but I’m new at it). I only know that R also supports \U######## and, for the BMP, \u#### notation, e.g. \U0001F600 for U+1F600 😀 (R accepts one to eight hex digits after \U, so the leading zeros are optional, but zero-padding to eight digits avoids ambiguity when hex digits follow). Maybe the Unicode package would be helpful.
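If the escape notation were generated from code points instead of being stored in a static file, it could look roughly like the following Python sketch. The helper names are hypothetical; the output uses the zero-padded \u#### form for the BMP and the \U######## form for everything else.

```python
def r_escape(code_point: int) -> str:
    """Return the escape notation for a single Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    if code_point <= 0xFFFF:
        return f"\\u{code_point:04X}"   # BMP: four hex digits
    return f"\\U{code_point:08X}"       # astral planes: eight hex digits


def r_escape_sequence(code_points) -> str:
    """Escape an emoji sequence given as an iterable of code points."""
    return "".join(r_escape(cp) for cp in code_points)


print(r_escape(0x1F600))                     # \U0001F600
print(r_escape_sequence([0x1F1E9, 0x1F1EA])) # \U0001F1E9\U0001F1EA
```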
If you don't want to parse the emoji data files released by Unicode directly (most of which are not simple CSVs) to identify emojis (or, worse, emoji sequences), there is probably an automatically updated project on GitHub that provides the respective CSV files, but alas, I haven’t found it yet.
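For what it's worth, parsing the Unicode files directly may be less painful than it sounds. Below is a minimal Python sketch that turns lines in the layout of emoji-test.txt (code points, a semicolon, a status, then a comment holding the emoji, version and name) into CSV. The sample lines are only illustrative, the column names are my own choice, and a real script would read the downloaded file instead of an embedded string.

```python
import csv
import io
import re

# Illustrative sample in the layout of emoji-test.txt.
SAMPLE = """\
# group: Smileys & Emotion
1F600                                  ; fully-qualified     # 😀 E1.0 grinning face
263A FE0F                              ; fully-qualified     # ☺️ E0.6 smiling face
1F1E9 1F1EA                            ; fully-qualified     # 🇩🇪 E2.0 flag: Germany
"""

LINE = re.compile(
    r"^(?P<cps>[0-9A-F ]+?)\s*;\s*(?P<status>[\w-]+)\s*#\s*"
    r"\S+\s+E(?P<version>\d+\.\d+)\s+(?P<name>.+)$"
)


def parse_emoji_test(text: str) -> list[dict]:
    """Return one dict per data line; comment and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = LINE.match(line)
        if m is None:
            continue  # ignore lines that do not fit the expected layout
        cps = m["cps"].split()
        rows.append({
            "name": m["name"],
            "code_points": " ".join(cps),
            # Rebuild the emoji from its code points rather than
            # trusting the character in the comment.
            "emoji": "".join(chr(int(cp, 16)) for cp in cps),
            "status": m["status"],
            "version": m["version"],
        })
    return rows


def to_csv(rows: list[dict]) -> str:
    """Serialise the parsed rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = parse_emoji_test(SAMPLE)
print(to_csv(rows))
```

Sequences need no special handling here: a line with several code points simply becomes a multi-character string in the emoji column.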