Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

collect(graphemes("πŸ€¦πŸΌβ€β™‚οΈ")) results in ERROR: ArgumentError: destination has fewer elements than required #37680

Closed
kdheepak opened this issue Sep 21, 2020 · 16 comments Β· Fixed by #38551
Labels
domain:strings "Strings!" domain:unicode Related to unicode characters and encodings kind:bug Indicates an unexpected problem or unintended behavior

Comments

@kdheepak
Copy link
Contributor

kdheepak commented Sep 21, 2020

I believe there's an issue with the implementation of iterate on the result of graphemes("πŸ€¦πŸΌβ€β™‚οΈ").

The following works as expected:

julia> s = "πŸ€¦πŸΌβ€β™‚οΈ"
"🀦🏼\u200d♂️"

julia> length(s) == 5
true

julia> [s[i] for i in eachindex(s)]
5-element Array{Char,1}:
 '🀦': Unicode U+1F926 (category So: Symbol, other)
 '🏼': Unicode U+1F3FC (category Sk: Symbol, modifier)
 '\u200d': Unicode U+200D (category Cf: Other, format)
 'β™‚': Unicode U+2642 (category So: Symbol, other)
 '️': Unicode U+FE0F (category Mn: Mark, nonspacing)

julia> using Unicode

julia> length(graphemes(s)) == 1
true

However, this is the error I get when I try to collect the result of the Iterator in the latest stable release of Julia (v1.5.1).

julia> collect(graphemes(s))
ERROR: ArgumentError: destination has fewer elements than required
Stacktrace:
 [1] copyto!(::Array{SubString{String},1}, ::Base.Unicode.GraphemeIterator{String}) at ./abstractarray.jl:734
 [2] _collect at ./array.jl:630 [inlined]
 [3] collect(::Base.Unicode.GraphemeIterator{String}) at ./array.jl:624
 [4] top-level scope at REPL[5]:1

The implementation of length appears to be correct, but I think there's a bug in the implementation of iterate:

julia> for (i, c) in enumerate(graphemes(s))
       println(i, " ", c)
       end
1 πŸ€¦πŸΌβ€
2 ♂️

I believe the correct behavior in this case is to return just the first element. iterate seems to be returning the Symbol β™‚ as well.

@simeonschaub
Copy link
Member

I can reproduce this on master.

@simeonschaub
Copy link
Member

simeonschaub commented Sep 21, 2020

I think the problem is that the state passed to isgraphemebreak! needs to be initialized on the first iteration with the null character. The following fixes the issue for me. Is this the correct way to do it?

diff --git a/base/strings/unicode.jl b/base/strings/unicode.jl
index 235f85184d..029787b3c1 100644
--- a/base/strings/unicode.jl
+++ b/base/strings/unicode.jl
@@ -671,7 +671,7 @@ function length(g::GraphemeIterator{S}) where {S}
     return n
 end
 
-function iterate(g::GraphemeIterator, i_=(Int32(0),firstindex(g.s)))
+function iterate(g::GraphemeIterator{S}, i_=(Int32(0),firstindex(g.s))) where {S}
     s = g.s
     statei, i = i_
     state = Ref{Int32}(statei)
@@ -679,6 +679,7 @@ function iterate(g::GraphemeIterator, i_=(Int32(0),firstindex(g.s)))
     y = iterate(s, i)
     y === nothing && return nothing
     c0, k = y
+    i == firstindex(g.s) && isgraphemebreak!(state, eltype(S)(0x00000000), c0)
     while k <= ncodeunits(s) # loop until next grapheme is s[i:j]
         c, β„“ = iterate(s, k)
         isgraphemebreak!(state, c0, c) && break

@JeffBezanson JeffBezanson added kind:bug Indicates an unexpected problem or unintended behavior domain:strings "Strings!" labels Sep 21, 2020
@Keno
Copy link
Member

Keno commented Sep 21, 2020

I don't think that should be required. If it is, that potentially seems like a bug in utf8proc

@simeonschaub
Copy link
Member

No, it seems to be documented: https://juliastrings.github.io/utf8proc/doc/utf8proc_8h.html#aae83bdcabf3a97c1046c0700ba353640. Apparently, we don't need to pass the state around though, we can just always set it to zero initially.

@Keno
Copy link
Member

Keno commented Sep 21, 2020

I wrote that documentation :). Yes, we don't need to pass the state around. Setting it to zero is fine. If that's required though that's a bug in utf8proc where it should reset the state by itself.

@simeonschaub
Copy link
Member

I just thought this was intended because that's how it's done in length(::GraphemeIterator), but perhaps that was just a workaround.

@Keno
Copy link
Member

Keno commented Sep 21, 2020

It's definitely not supposed to be needed because eltype(S)(0x00000000) is not a character that occurs in the source. Something else is wrong (maybe a missing reset of the state in utf8proc). And of course it should be refactored to not need to useless carry around the state.

@kdheepak
Copy link
Contributor Author

kdheepak commented Sep 22, 2020

Thanks for the comment @Keno and @simeonschaub for looking into this!

I am going through the source of utf8proc.c and utf8proc.h to see if I can narrow it down.

UAX#29 defines the grapheme cluster boundary rules in this section. Specifically there's this rule: http://www.unicode.org/reports/tr29/#GB11

Do not break within emoji modifier 					sequences or emoji zwj sequences.
--
GB11 | \p{Extended_Pictographic} 						Extend* ZWJ | Γ— | \p{Extended_Pictographic}

These lines in grapheme_break_simple manage the rule for this:

(lbc == UTF8PROC_BOUNDCLASS_E_ZWG &&              // GB11 (requires additional handling below)
 tbc == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) ? false : // ----

My understanding is that this is implemented as a state machine.
The grapheme_break_simple function that contains the code above is called from grapheme_break_extended which manages the state machine.

I think it is this part of the "state machine" that is causing the issue.

else if (*state == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) {
  if (tbc == UTF8PROC_BOUNDCLASS_EXTEND) // fold EXTEND codepoints into emoji
    *state = UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC;
  else if (tbc == UTF8PROC_BOUNDCLASS_ZWJ)
    *state = UTF8PROC_BOUNDCLASS_E_ZWG; // state to record emoji+zwg combo
  else
    *state = tbc;
}

I did some print style debugging to figure out what is going on.

Here is the state machine transition in the following format:

tbc (tbc -> boundclass) : from_state -> to_state = break_permitted
'🏼'     (UTF8PROC_BOUNDCLASS_EXTEND)                : UTF8PROC_BOUNDCLASS_START                 -> UTF8PROC_BOUNDCLASS_EXTEND                = false
'\u200d' (UTF8PROC_BOUNDCLASS_ZWJ)                   : UTF8PROC_BOUNDCLASS_EXTEND                -> UTF8PROC_BOUNDCLASS_ZWJ                   = false
'β™‚'      (UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) : UTF8PROC_BOUNDCLASS_ZWJ                   -> UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC = true
'️'       (UTF8PROC_BOUNDCLASS_EXTEND)                : UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC -> UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC = false

You can see that when the boundclass is UTF8PROC_BOUNDCLASS_ZWJ and the state goes from UTF8PROC_BOUNDCLASS_EXTEND to UTF8PROC_BOUNDCLASS_ZWJ. I believe this state should be UTF8PROC_BOUNDCLASS_E_ZWG instead. The current implementation of the state machine "loses" the information that it was part of an extending grapheme cluster.

I made the following change

    if (*state == tbc && tbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR)
      *state = UTF8PROC_BOUNDCLASS_OTHER;
    // Special support for GB11 (emoji extend* zwj / emoji)
    else if (*state == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) {
      if (tbc == UTF8PROC_BOUNDCLASS_EXTEND) // fold EXTEND codepoints into emoji
        *state = UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC;
      else if (tbc == UTF8PROC_BOUNDCLASS_ZWJ)
        *state = UTF8PROC_BOUNDCLASS_E_ZWG; // state to record emoji+zwg combo
      else
        *state = tbc;
    }
+   else if (*state == UTF8PROC_BOUNDCLASS_EXTEND && tbc == UTF8PROC_BOUNDCLASS_ZWJ)
+       *state = UTF8PROC_BOUNDCLASS_E_ZWG;       
    else
      *state = tbc;

And now the state machine goes from UTF8PROC_BOUNDCLASS_EXTEND to UTF8PROC_BOUNDCLASS_E_ZWG.

'🏼'     (UTF8PROC_BOUNDCLASS_EXTEND)                : UTF8PROC_BOUNDCLASS_START                 -> UTF8PROC_BOUNDCLASS_EXTEND                = false
'\u200d' (UTF8PROC_BOUNDCLASS_ZWJ)                   : UTF8PROC_BOUNDCLASS_EXTEND                -> UTF8PROC_BOUNDCLASS_E_ZWG                 = false
'β™‚'      (UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) : UTF8PROC_BOUNDCLASS_E_ZWG                 -> UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC = false
'️'       (UTF8PROC_BOUNDCLASS_EXTEND)                : UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC -> UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC = false

I ran make check in the utf8proc folder after making this change and the tests pass.


Edit: Ignore the text below in this comment. See next comment for more information.

However, I am still getting behavior that doesn't make sense.

Specifically, there's this rule that says:

Break at the start and end of text, unless the text is empty.

I don't understand why the specification says "break at the start" or what "unless the text is empty" means. What "text" is being referred to here?

This is the implementation for this rule:

(lbc == UTF8PROC_BOUNDCLASS_START) ? true :       // GB1

This returns true at every UTF8PROC_BOUNDCLASS_START. Is that a correct? Maybe I'm misinterpreting the specification but that doesn't seem like what it should do, right?

I changed that line to the following:

(lbc == UTF8PROC_BOUNDCLASS_START) ? false 

This is obviously a violation of the specification and with this change make check also fails.

But with this change and with the fix to the state machine that I mentioned above I'm getting the behavior I expect. This also seems to work for every other case I experiment with. I didn't run a full test suite though and this is my first foray into the Unicode specification and I've only read this section.

Any suggestions here would be very welcome.

@kdheepak
Copy link
Contributor Author

kdheepak commented Sep 22, 2020

I took a second stab at this just to explore again what is going on. It turns out I had a bug in my implementation! (I had re-implemented the state machine in Julia for my testing and messed up one line there.)

This is the only change I needed to make to resolve this issue, exactly the same as the previous comment.

    if (*state == tbc && tbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR)
      *state = UTF8PROC_BOUNDCLASS_OTHER;
    // Special support for GB11 (emoji extend* zwj / emoji)
    else if (*state == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) {
      if (tbc == UTF8PROC_BOUNDCLASS_EXTEND) // fold EXTEND codepoints into emoji
        *state = UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC;
      else if (tbc == UTF8PROC_BOUNDCLASS_ZWJ)
        *state = UTF8PROC_BOUNDCLASS_E_ZWG; // state to record emoji+zwg combo
      else
        *state = tbc;
    }
+   else if (*state == UTF8PROC_BOUNDCLASS_EXTEND && tbc == UTF8PROC_BOUNDCLASS_ZWJ)
+       *state = UTF8PROC_BOUNDCLASS_E_ZWG;       
    else
      *state = tbc;

Ignore the part about the UTF8PROC_BOUNDCLASS_START state from the previous comment.

The only question I have is what other states should transition to UTF8PROC_BOUNDCLASS_E_ZWG?

@stevengj stevengj added the domain:unicode Related to unicode characters and encodings label Sep 22, 2020
archermarx added a commit to archermarx/utf8proc that referenced this issue Nov 19, 2020
@archermarx
Copy link
Contributor

archermarx commented Nov 19, 2020

I've been experimenting with this to solve a similar problem, and it seems that all of these cases are only broken if the emoji cluster appears at the beginning of the string

For example:

julia> collect(graphemes("πŸ€¦πŸΌβ€β™‚οΈ"))
ERROR: ArgumentError: destination has fewer elements than required
Stacktrace:
 [1] copyto!(::Array{SubString{String},1}, ::Base.Unicode.GraphemeIterator{String}) at ./abstractarray.jl:734
 [2] _collect at ./array.jl:630 [inlined]
 [3] collect(::Base.Unicode.GraphemeIterator{String}) at ./array.jl:624
 [4] top-level scope at REPL[3]:1

julia> collect(graphemes("a" * "πŸ€¦πŸΌβ€β™‚οΈ"))
2-element Array{SubString{String},1}:
 "a"
 "🀦🏼\u200d♂️"

julia> collect(graphemes("πŸ‡ΈπŸ‡ͺπŸ‡ΈπŸ‡ͺ"))
2-element Array{SubString{String},1}:
 "πŸ‡ΈπŸ‡ͺπŸ‡Έ"
 "πŸ‡ͺ"

julia> collect(graphemes("a"*"πŸ‡ΈπŸ‡ͺπŸ‡ΈπŸ‡ͺ"))
3-element Array{SubString{String},1}:
 "a"
 "πŸ‡ΈπŸ‡ͺ"
 "πŸ‡ΈπŸ‡ͺ"

julia> collect(graphemes("πŸ‘¨πŸ»β€πŸ€β€πŸ‘¨πŸ½"))
ERROR: ArgumentError: destination has fewer elements than required
Stacktrace:
 [1] copyto!(::Array{SubString{String},1}, ::Base.Unicode.GraphemeIterator{String}) at ./abstractarray.jl:734
 [2] _collect at ./array.jl:630 [inlined]
 [3] collect(::Base.Unicode.GraphemeIterator{String}) at ./array.jl:624
 [4] top-level scope at REPL[4]:1

julia> collect(graphemes("a"*"πŸ‘¨πŸ»β€πŸ€β€πŸ‘¨πŸ½"))
2-element Array{SubString{String},1}:
 "a"
 "πŸ‘¨πŸ»\u200d🀝\u200dπŸ‘¨πŸ½"

@kdheepak
Copy link
Contributor Author

kdheepak commented Nov 19, 2020

Thanks for looking into it! Does it fail even with the patch?

it seems that all of these cases are only broken if the emoji cluster appears at the beginning of the string

It probably enters the state machine into this incorrect state only in some cases.

@archermarx
Copy link
Contributor

archermarx commented Nov 19, 2020

Just did some more tests. The patch fixes "πŸ€¦πŸΌβ€β™‚οΈ" and ""πŸ‘¨πŸ»β€πŸ€β€πŸ‘¨πŸ½" but not "πŸ‡ΈπŸ‡ͺπŸ‡ΈπŸ‡ͺ", while prepending a letter fixes all three. There must be something to do with the initialization of the state

@kdheepak
Copy link
Contributor Author

kdheepak commented Nov 19, 2020

I have a patch for this that fixes the two (or multiple flags) case πŸ‡ΈπŸ‡ͺπŸ‡ΈπŸ‡ͺ:

diff --git a/utf8proc.c b/utf8proc.c
index 6591976..086f430 100644
--- a/utf8proc.c
+++ b/utf8proc.c
@@ -301,6 +301,8 @@ static utf8proc_bool grapheme_break_extended(int lbc, int tbc, utf8proc_int32_t
     // forbidden by a different rule such as GB9).
     if (*state == tbc && tbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR)
       *state = UTF8PROC_BOUNDCLASS_OTHER;
+    else if (*state == UTF8PROC_BOUNDCLASS_START && lbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR && tbc == UTF8PROC_BOUNDCLASS_REGIONAL_INDICATOR)
+      *state = UTF8PROC_BOUNDCLASS_OTHER;
     // Special support for GB11 (emoji extend* zwj / emoji)
     else if (*state == UTF8PROC_BOUNDCLASS_EXTENDED_PICTOGRAPHIC) {
       if (tbc == UTF8PROC_BOUNDCLASS_EXTEND) // fold EXTEND codepoints into emoji

This patch special cases when there's a regional indicator in the start.

@archermarx
Copy link
Contributor

Oh sick. I can add that to my fork and submit a PR. I've got some additional tests added to the graphemetest.c file

@kdheepak
Copy link
Contributor Author

Thanks for pushing it along!

@stevengj
Copy link
Member

stevengj commented Nov 23, 2020

Update: the underlying bug here was that the initial state = 0 needs to be converted internally to the bound class of the first character. That was already done by utf8proc_map, which is why utf8proc passed the Unicode grapheme tests previously, but it wasn't being done when you called isgraphemebreak! (utf8proc_grapheme_break_stateful) manually. (Our updated tests now check both APIs.)

Once the tests on the latest utf8proc PR are green, I'll merge and then post a PR to make Julia use the updated utf8proc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:strings "Strings!" domain:unicode Related to unicode characters and encodings kind:bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants