Add isemoji function to Unicode stdlib and export it #38458

Closed
wants to merge 15 commits into from
84 changes: 83 additions & 1 deletion stdlib/Unicode/src/Unicode.jl
@@ -2,7 +2,7 @@

module Unicode

export graphemes
export graphemes, isemoji

"""
Unicode.normalize(s::AbstractString; keywords...)
@@ -89,4 +89,86 @@ letter combined with an accent mark is a single grapheme.)
"""
graphemes(s::AbstractString) = Base.Unicode.GraphemeIterator{typeof(s)}(s)

const emoji_data = download("https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt")
Member

I don’t think we want to store this entire string in the compiled library. You should just download it when you parse the data, maybe?

Contributor Author

Yeah, that makes sense.

Member

Your old code that just looked at certain code point blocks was a lot more compact and independent of the Unicode version. I’m just not sure whether it is standard-conforming.

Could we just use the old code combined with checking the category code to see if the code point is assigned?

Member

(I still think we should use the data file, but only for tests.)
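
To make the "data file only for tests" idea concrete, here is a minimal sketch of a parser that reads the `emoji-data.txt` layout from any `IO` rather than from a path stored in the module. The function name `parse_emoji_ranges`, the `property` keyword, and the `SAMPLE` excerpt are assumptions made for illustration; they are not part of this PR. Unlike the PR's `startswith` filter, this sketch matches the property field exactly.

```julia
# Illustrative excerpt in the emoji-data.txt layout:
# "<code point or range> ; <property> # comment"
const SAMPLE = """
# sample header comment
0023          ; Emoji                # number sign
1F600         ; Emoji                # grinning face
1F601..1F606  ; Emoji                # beaming face..squinting face
1F3FB..1F3FF  ; Emoji_Modifier       # skin tone modifiers
"""

# Hypothetical helper: collect the code point ranges for one property.
function parse_emoji_ranges(io::IO; property::AbstractString = "Emoji")
    ranges = UnitRange{UInt32}[]
    for line in eachline(io)
        data = strip(first(split(line, '#')))   # drop the trailing comment
        isempty(data) && continue               # skip blank and comment-only lines
        fields = strip.(split(data, ';'))
        (length(fields) >= 2 && fields[2] == property) || continue
        bounds = split(fields[1], "..")         # one element for a single code point
        lo = parse(UInt32, bounds[1]; base = 16)
        hi = parse(UInt32, bounds[end]; base = 16)
        push!(ranges, lo:hi)
    end
    return ranges
end

ranges = parse_emoji_ranges(IOBuffer(SAMPLE))
```

In a test suite the same function could be handed an open file handle for the downloaded data file, so nothing is downloaded or stored when the library itself is compiled.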

Contributor Author (@archermarx), Nov 17, 2020

As I looked more into the emoji code points, I found some corner cases that the old version didn't catch. It would be possible to patch those, but the secondary issue was that the blocks in the old version contain many unassigned code points. If we're OK with saying "yes, this is an emoji" for characters in those larger blocks that are not (yet) emoji, then we can go with the old system. If not, we would need to restrict those ranges to the code points that are currently assigned, either manually or by parsing the emoji_data file as I do here.

I do agree the old version was a lot simpler, so I'm not sure what the best solution is.

Member (@stevengj), Nov 17, 2020

I agree that we should return false for currently unassigned code points, but we can check for that simply by returning false if Unicode.category_code(char) == Unicode.UTF8PROC_CATEGORY_CN.

(Or even be more restrictive: only allow category So or Sk.)
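
A minimal sketch of this suggestion follows. It assumes hypothetical block ranges (`CANDIDATE_BLOCKS` is illustrative, not the list from the earlier revision of this PR) and reaches the category helpers through `Base.Unicode`, where `category_code` and the `UTF8PROC_CATEGORY_*` constants are defined. The function names are made up for the example.

```julia
# Hypothetical block ranges, for illustration only.
const CANDIDATE_BLOCKS = (0x1F300:0x1F5FF, 0x1F600:0x1F64F, 0x1F680:0x1F6FF)

# Block check plus "is this code point assigned?" check.
function isemoji_by_block(c::AbstractChar)
    u = UInt32(c)
    any(r -> u in r, CANDIDATE_BLOCKS) || return false
    # Reject code points that the bundled Unicode tables list as unassigned (Cn).
    return Base.Unicode.category_code(c) != Base.Unicode.UTF8PROC_CATEGORY_CN
end

# Stricter variant: accept only the So and Sk general categories.
function isemoji_by_block_strict(c::AbstractChar)
    u = UInt32(c)
    any(r -> u in r, CANDIDATE_BLOCKS) || return false
    code = Base.Unicode.category_code(c)
    return code == Base.Unicode.UTF8PROC_CATEGORY_SO ||
           code == Base.Unicode.UTF8PROC_CATEGORY_SK
end
```

The result for code points inside the blocks depends on the Unicode version bundled with the running Julia build, which is the trade-off being discussed here.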

Contributor Author

Oh neat, didn't know about that. I'll get back with a new version later today.

Contributor Author

Looking at this, there are characters in the emoji data file that fall under Unicode categories Sm and Po (◾ and ‼, respectively), and characters in So that do not qualify as emoji (Ⓕ). To be honest, I'm not sure why some of these symbols are called emoji and some aren't, but I think the only way to be complete here is to use the full list of ranges. We don't necessarily have to download it (we could take the parsed output of the file and include it manually in Unicode.jl, though that's a 702-element array), but I'm not sure of another way to make sure we catch all emoji without false positives.
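
The categories mentioned above can be checked with a short sketch. It uses `Base.Unicode.category_abbrev`, an unexported helper in `Base.Unicode`; the wrapper name `category_table` is an assumption made for this example.

```julia
# Hypothetical helper: pair each character with its general category abbreviation.
category_table(chars) = [c => Base.Unicode.category_abbrev(c) for c in chars]

# U+25FE (◾), U+203C (‼), and U+24BB (Ⓕ), written as escapes to avoid
# any ambiguity from variation selectors.
table = category_table(['\u25fe', '\u203c', '\u24bb'])

for (c, abbrev) in table
    println("U+", uppercase(string(UInt32(c), base = 16, pad = 4)), " ", c, " ", abbrev)
end
```

This illustrates why a filter on general category alone cannot separate emoji from non-emoji: the first two characters are listed as emoji but are not in So or Sk, while the third is in So but is not an emoji.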


"""
    extract_emoji_column(emoji_data, column = 1; type_field = "")

Read the selected column from a provided Unicode emoji data file
(e.g. https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt).
Optionally keep only the rows whose second field begins with `type_field`.
"""
function extract_emoji_column(emoji_data, column = 1; type_field = "")
    lines = readlines(emoji_data)
    # drop blank lines and full-line comments
    filter!(line -> !isempty(line) && !startswith(line, "#"), lines)
    splitlines = [strip.(split(line, ";")) for line in lines]
    return [splitline[column] for splitline in splitlines if startswith(splitline[2], type_field)]
end

# parse a string of the form "AAAA..FFFF" into 0xAAAA:0xFFFF
parse_unicode_range_str(range_str) = let s = split(range_str, "..")
    if length(s) > 2 || length(s) < 1
        return nothing
    else
        s1 = tryparse(UInt32, "0x" * s[1])
        s1 === nothing && return nothing
        if length(s) == 1
            return s1:s1
        else
            s2 = tryparse(UInt32, "0x" * s[2])
            s2 === nothing && return nothing
            return s1:s2
        end
    end
end

# Get all ranges containing valid single emoji from file
const EMOJI_RANGES = parse_unicode_range_str.(extract_emoji_column(emoji_data, 1, type_field = "Emoji"))
const ZWJ = '\u200d' # Zero-width joiner
const VAR_SELECTOR = '\uFE0F' # Variation selector
# Handle England, Scotland, Wales flags and keycaps
const SPECIAL_CASES = ["🏴󠁧󠁢󠁥󠁮󠁧󠁿", "🏴󠁧󠁢󠁳󠁣󠁴󠁿", "🏴󠁧󠁢󠁷󠁬󠁳󠁿", "#️⃣", "*️⃣", "0️⃣", "1️⃣", "2️⃣", "3️⃣", "4️⃣", "5️⃣", "6️⃣", "7️⃣", "8️⃣", "9️⃣"]

"""
    isemoji(x::Union{AbstractChar, AbstractString}) -> Bool

Test whether a character is an emoji, or whether all elements in a given string are emoji. Includes identifying composite emoji.
Empty strings return `true` as they contain no characters which aren't emoji.
Combined emoji sequences separated by the zero-width joiner character `'\u200d'`
such as 👨‍❤️‍👨 `['👨', '\u200d', '❤', '\uFE0F', '\u200d', '👨']` are supported, though this function cannot determine whether a
given sequence of emoji and zero-width joiners would result in a valid composite emoji.
"""
function isemoji(c::AbstractChar)
    u = UInt32(c)
    @inbounds for emojiset in EMOJI_RANGES
        u in emojiset && return true
    end
    return false
end

function isemoji(s::AbstractString)
    s in SPECIAL_CASES && return true
    isempty(s) && return true
    s[end] == ZWJ && return false
    ZWJ_allowed = false
    VAR_SELECTOR_allowed = false
    # make sure string follows sequence of basic emoji chars
    # separated by ZWJ and VAR_SELECTOR characters
    @inbounds for c in s
        if c == ZWJ
            !ZWJ_allowed && return false
            ZWJ_allowed = false
            VAR_SELECTOR_allowed = false
        elseif c == VAR_SELECTOR
            !VAR_SELECTOR_allowed && return false
            VAR_SELECTOR_allowed = false
        else
            !isemoji(c) && return false
            ZWJ_allowed = true
            VAR_SELECTOR_allowed = true
        end
    end
    return true
end

end
53 changes: 53 additions & 0 deletions stdlib/Unicode/test/runtests.jl
@@ -404,3 +404,56 @@ end
@test prod(["*" for i in 1:3]) == "***"
@test prod(["*" for i in 1:0]) == ""
end

@testset "Emoji tests" begin
# parse a string of the form "AAAA BBBB CCCC" into [0xAAAA, 0xBBBB, 0xCCCC]
# returns nothing if any token is not a hexadecimal code point (e.g. a range "AAAA..FFFF")
function parse_sequence_str(seq_str)
    s = split(seq_str)
    res = tryparse.(UInt32, "0x" .* s)
    if any(isnothing, res)
        return nothing
    else
        return res
    end
end

# Parse a string containing a range (e.g. AAAA..FFFF) or a sequence (AAAA BBBB CCCC) of Unicode code points into an array of strings
# Ranges are parsed as independent characters ("AAAA..AAAC") -> ["\uAAAA", "\uAAAB", "\uAAAC"]
# Sequences are parsed as a single string ("AAAA BBBB") -> ["\uAAAA\uBBBB"]
function parse_col_entry(seq_str)
    s = parse_sequence_str(seq_str)
    if s === nothing
        s = "" .* Char.(Unicode.parse_unicode_range_str(seq_str) |> collect)
    else
        s = [Char.(s) |> String]
    end
    return s
end

function extract_emoji_sequences(emoji_data)
    codepoints = Unicode.extract_emoji_column(emoji_data)
    emojis = parse_col_entry.(codepoints)
    vcat(emojis...)
end

# See if all emojis are caught by the isemoji function
emoji_sequences = download("https://www.unicode.org/Public/emoji/13.1/emoji-sequences.txt")
emoji_zwj_sequences = download("https://www.unicode.org/Public/emoji/13.1/emoji-zwj-sequences.txt")
all_emojis = [extract_emoji_sequences(emoji_sequences) ; extract_emoji_sequences(emoji_zwj_sequences)]
@test all(isemoji.(all_emojis))

@test !isemoji('A')
@test !isemoji("🔹 some text bounded by emojis 🔹")
@test !isemoji("🚍 some text after an emoji")
@test !isemoji("some text before an emoji 🚘")
@test !isemoji("😮 😥 😨 😩 😪") # There are spaces between the emojis
@test !isemoji("No emojis here")

# Test emoji sequences
@test isemoji("😈😘")
@test isemoji("🚴🏿")
@test !isemoji("👨‍👧" * Unicode.ZWJ)
@test isemoji("🛌" * Unicode.ZWJ * '😎')
@test !isemoji("🤦🏽" * Unicode.ZWJ * Unicode.ZWJ * '😎')
@test isemoji("")
end