-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add isemoji function to Unicode stdlib and export it #38458
Closed
Closed
Changes from 10 commits
Commits
Show all changes
15 commits
Select commit
Hold shift + click to select a range
3a5c8d7
Add isemoji function and export it
archermarx ca8f747
Add emoji tests
archermarx 8029e2d
Fix tests involving unexported zero-width joiner
archermarx ae86600
Update stdlib/Unicode/src/Unicode.jl
archermarx 773799d
Update stdlib/Unicode/src/Unicode.jl
archermarx 6154f0e
Update isemoji documentation
archermarx 8c6bd6c
Update stdlib/Unicode/test/runtests.jl
archermarx 3bc23ce
Redo isemoji to fix missed corner cases
archermarx 22c43ed
Add new emoji tests
archermarx 1382b12
Update documentation
archermarx cd5092e
Download when parsing emoji file
archermarx 7176501
Update isemoji function and add emoji_ranges file
archermarx f80e7ab
Remove incorrect emoji ranges
archermarx b9b54a1
Support two-character keycap pattern
archermarx b0d0156
Fix test, but one test in precompile is failing
archermarx File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t think we want to store this entire string in the compiled library. You should just download it when you parse the data, maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your old code that just looked at certain code point blocks was a lot more compact and independent of the Unicode version. I’m just not sure if it is standard conforming?
Could we just use the old code combined with checking the category code to see if the code point is assigned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I still think we should use the data file, but only for tests.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As i looked more into the emoji codepoints, i found some corner cases which the old version didn't catch. It would be possible to patch those, but the secondary issue was that there's a bunch of unassigned emoji in the blocks in the old version. If we're ok with saying "yes this is an emoji" to characters in those larger blocks which are not (yet) emoji, then we can go with the old system. If not, we would either need to restrict those ranges to those codepoints which are currently assigned manually or via parsing the emoji_data file like i do here.
i do agree the old version was a lot simpler, so i'm not sure what the best solution is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should return
false
for currently unassigned codepoints, but we can check for that simply by returningfalse
ifUnicode.category_code(char) == Unicode.UTF8PROC_CATEGORY_CN
(Or even be more restrictive: only allow category So or Sk.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh neat, didn't know about that. I'll get back with a new version later today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So looking at this, there are things in the emoji data file which fall under unicode category Sm and Po (◾ and ‼, respectively), and things in So which do not qualify as emoji (Ⓕ). To be honest, I'm not sure why some of these symbols get called emoji and some don't, but I think the only way to be complete here is to use the full list of ranges. We don't have to download it necessarily (could take the output from the file and manually include that in the unicode.jl file, but that's a 702-element array), but I'm not sure of another way to make sure we catch all emoji without false positives