emDic.csv is a UTF-8 to CESU-8 conversion file #1

Open
Crissov opened this issue May 9, 2017 · 0 comments
Crissov commented May 9, 2017

Your emDic.csv, adopted from the same file by @today-is-a-good-day, has three columns: the first, labelled Description, holds the Unicode name; the second, labelled Bytes, holds the UTF-8 bytes in C hexadecimal notation (i.e. with a leading \x); and the third, labelled R-encoding, holds the CESU-8 bytes as hexadecimal values in angle brackets. CESU-8 is (almost) identical to UTF-8 in the BMP, i.e. for U+0000 through U+FFFF, but differs for the astral planes, in one of which (the SMP) most emojis live: there, CESU-8 encodes each half of the UTF-16 surrogate pair as its own three-byte sequence instead of using a single four-byte sequence.
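To make the difference concrete, here is a minimal sketch in Python (not R, purely for illustration) that encodes U+1F600 both ways. The helper names `cesu8_encode`, `c_hex` and `angle_hex` are hypothetical, and the exact letter case and layout of the two hex formats are assumptions based on the column descriptions above, not copied from emDic.csv.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: BMP characters as in UTF-8, astral
    characters as a UTF-16 surrogate pair with each surrogate
    encoded as its own 3-byte sequence."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP: same bytes as UTF-8 (surrogatepass tolerates lone surrogates)
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Astral plane: split into high and low surrogate first
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
    return bytes(out)


def c_hex(data: bytes) -> str:
    """Format bytes in C hexadecimal notation (assumed layout of the Bytes column)."""
    return "".join(f"\\x{b:02X}" for b in data)


def angle_hex(data: bytes) -> str:
    """Format bytes as hex values in angle brackets (assumed layout of the R-encoding column)."""
    return "".join(f"<{b:02x}>" for b in data)


grinning = "\U0001F600"
print(c_hex(grinning.encode("utf-8")))     # \xF0\x9F\x98\x80
print(angle_hex(cesu8_encode(grinning)))   # <ed><a0><bd><ed><b8><80>
```

For a BMP character such as U+263A both encodings produce the same three bytes, which is why only the astral-plane rows of the table differ between the two columns.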

This seems rather static, and, as you acknowledge in your blog post, the emoji data is severely outdated. I assume there is a simpler and more flexible way to achieve the same result in R (but I’m new to it). I only know that R also supports \U######## and, for the BMP, \u#### notation, e.g. \U0001F600 for U+1F600 😀 (R accepts one to eight hex digits after \U, so the leading zeros are optional). Maybe the Unicode package would be helpful.
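As a rough illustration of how such escapes could be generated from code points rather than stored statically, here is a small Python sketch. The function name `r_escape` is hypothetical; it always zero-pads, which matches the \U0001F600 form used above.

```python
def r_escape(code_point: int) -> str:
    """Return the escape text for a code point: \\uXXXX for the BMP,
    \\UXXXXXXXX (zero-padded to eight digits) for astral planes."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0xFFFF:
        return f"\\u{code_point:04X}"
    return f"\\U{code_point:08X}"


print(r_escape(0x1F600))  # \U0001F600
print(r_escape(0x263A))   # \u263A
```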

If you don't want to parse the emoji data files released by Unicode directly (which are not simple CSVs for the most part) to identify emojis (or, worse, emoji sequences), there is probably an automatically updated project on GitHub that provides the respective CSV files; alas, I haven’t found it yet.
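For a sense of what parsing those files involves, here is a minimal Python sketch. It assumes the semicolon-separated layout with `#` comments and `..` ranges used by files such as emoji-data.txt; the sample lines are inlined stand-ins for the real file, and `parse_emoji_data` is a hypothetical helper. Emoji sequences (several code points per entry) are not handled.

```python
def parse_emoji_data(lines):
    """Yield (code_point, property) pairs from lines shaped like
    '1F601..1F606 ; Emoji # comment'. Comment-only and blank lines are skipped."""
    for line in lines:
        body = line.split("#", 1)[0].strip()   # drop trailing comment
        if not body:
            continue
        fields = [field.strip() for field in body.split(";")]
        if len(fields) < 2:
            continue                           # not a data line
        cp_field, prop = fields[0], fields[1]
        if ".." in cp_field:
            start_hex, end_hex = cp_field.split("..")
            start, end = int(start_hex, 16), int(end_hex, 16)
        else:
            start = end = int(cp_field, 16)
        for code_point in range(start, end + 1):
            yield code_point, prop


# Inlined sample in the assumed format (not the full, current data file)
sample = [
    "# sample lines in emoji-data.txt style",
    "",
    "1F600         ; Emoji   # grinning face",
    "1F601..1F606  ; Emoji   # range of six faces",
]

entries = list(parse_emoji_data(sample))
print(len(entries))                       # 7
print(f"U+{entries[0][0]:04X}", entries[0][1])  # U+1F600 Emoji
```

Rows like these could then be written out as a plain CSV and regenerated whenever a new data file is released.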
