Make endof() robust to invalid UTF-8 (#17276)

When an invalid string contains only continuation bytes, endof() tried to index the underlying array at position 0. Instead of relying on bounds checking, explicitly check that the index is > 0 before reading the byte. Returning 0 when only continuation bytes were encountered is consistent with the definition of endof(), which gives the last valid index. The explicit check also covers the empty string, so the separate i == 0 early return can be removed. The new code appears to be slightly faster than the old one.
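
The backward scan described above can be sketched outside of Base as follows. This is an illustrative, stand-alone version and not the code in base/strings/string.jl: the names `is_continuation` and `last_lead_index` are hypothetical, the function takes a raw `Vector{UInt8}` rather than a `String`, and the continuation test assumes the usual UTF-8 rule that continuation bytes have the bit pattern `10xxxxxx`.

```julia
# Hypothetical helper: a UTF-8 continuation byte has the top two bits `10`.
is_continuation(b::UInt8) = (b & 0xc0) == 0x80

# Hypothetical stand-alone analogue of endof(): walk backwards from the last
# byte, skipping continuation bytes, and return the index of the first byte
# that is not a continuation byte (or 0 if there is none).
function last_lead_index(d::Vector{UInt8})
    i = length(d)
    # `i > 0` is checked before `d[i]` is read, so the loop never indexes
    # position 0; that is what makes @inbounds safe here.
    @inbounds while i > 0 && is_continuation(d[i])
        i -= 1
    end
    return i
end

last_lead_index(UInt8[])                        # empty input: 0
last_lead_index(UInt8[0x90])                    # only a continuation byte: 0
last_lead_index(UInt8[0xce])                    # lone lead byte: 1
last_lead_index(UInt8[0x61, 0xe2, 0x88, 0x80])  # "a∀": 2
```

The empty-input case needs no special handling because the `i > 0` test fails immediately, which mirrors why the `i == 0` early return could be dropped.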
JuliaLang · Jul 6, 2016 · fa5af23 · nanosoldier
1 parent 7b3f529
commit fa5af23

Showing 2 changed files with 5 additions and 2 deletions.
diff --git a/base/strings/string.jl b/base/strings/string.jl
@@ -36,8 +36,7 @@ const utf8_trailing = [
 function endof(s::String)
     d = s.data
     i = length(d)
-    i == 0 && return i
-    while is_valid_continuation(d[i])
+    @inbounds while i > 0 && is_valid_continuation(d[i])
         i -= 1
     end
     i

diff --git a/test/strings/basic.jl b/test/strings/basic.jl
@@ -477,3 +477,7 @@ foobaz(ch) = reinterpret(Char, typemax(UInt32))
 @test typeof(ascii(GenericString("Hello, world"))) == String
 @test_throws ArgumentError ascii("Hello, ∀")
 @test_throws ArgumentError ascii(GenericString("Hello, ∀"))
+
+# issue #17271: endof() doesn't throw an error even with invalid strings
+@test endof(String(b"\x90")) == 0
+@test endof(String(b"\xce")) == 1