Skip to content

Commit

Permalink
Make endof() robust to invalid UTF-8 (#17276)
Browse files Browse the repository at this point in the history
When an invalid string contains only continuation bytes, endof() tried to
index the underlying array at position 0. Instead of relying on bounds
checking, explicitly check for > 0. Returning 0 when only continuation bytes
where encountered is consistent with the definition of endof(), which gives
the last valid index.

This also allows removing the i == 0 check. The new code appears to be
slightly faster than the old one.
  • Loading branch information
nalimilan authored and StefanKarpinski committed Jul 6, 2016
1 parent 7b3f529 commit fa5af23
Show file tree
Hide file tree
Showing 2 changed files with 5 additions and 2 deletions.
3 changes: 1 addition & 2 deletions base/strings/string.jl
Original file line number Diff line number Diff line change
Expand Up @@ -36,8 +36,7 @@ const utf8_trailing = [
function endof(s::String)
d = s.data
i = length(d)
i == 0 && return i
while is_valid_continuation(d[i])
@inbounds while i > 0 && is_valid_continuation(d[i])
i -= 1
end
i
Expand Down
4 changes: 4 additions & 0 deletions test/strings/basic.jl
Original file line number Diff line number Diff line change
Expand Up @@ -477,3 +477,7 @@ foobaz(ch) = reinterpret(Char, typemax(UInt32))
@test typeof(ascii(GenericString("Hello, world"))) == String
@test_throws ArgumentError ascii("Hello, ∀")
@test_throws ArgumentError ascii(GenericString("Hello, ∀"))

# issue #17271: endof() doesn't throw an error even with invalid strings
@test endof(String(b"\x90")) == 0
@test endof(String(b"\xce")) == 1

7 comments on commit fa5af23

@nanosoldier
Copy link
Collaborator

@nanosoldier nanosoldier commented on fa5af23 Jul 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Executing the daily benchmark build, I will reply here when finished:

@nanosoldier runbenchmarks(ALL, isdaily = true)

EDIT by @jrevels: master broke JLD, which BenchmarkTools/Nanosoldier currently uses for (de)serialization of benchmark parameters/results. See JuliaCI/BenchmarkTools.jl#15.

@jrevels
Copy link
Member

@jrevels jrevels commented on fa5af23 Jul 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've deployed the recent JLD fix (JuliaIO/JLD.jl#78) to nanosoldier, let's see if things work now:

@nanosoldier runbenchmarks(ALL, isdaily = true)

@nanosoldier
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@nalimilan
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing related to strings at least.

@timholy
Copy link
Member

@timholy timholy commented on fa5af23 Jul 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The strings benchmarks do not appear to be particularly thorough. The total "test" content of BaseBenchmarks/src/string/StringBenchmarks.jl is:

str = join(samerand('a':'d', 10^4))

SUITE["replace"] = @benchmarkable replace($str, "a", "b")
SUITE["join"] = @benchmarkable join($str, $str)

It would be great to have more complete benchmarks, especially given all the churn in the world of strings and some reports (which I can't find right now) of performance regressions.

@jrevels
Copy link
Member

@jrevels jrevels commented on fa5af23 Jul 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to have more complete benchmarks

Both strings and IO are woefully under-tested by our existing suite, especially compared to our indexing/linalg benchmarks.

@nalimilan
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't take this as a proof of anything. Actually, I didn't expect any regressions, as that method only has a very well-defined use, and in my tests all scenarios were faster. Just wanted to note that among the "possible regressions" nothing was string-related (which would have been a problem).

Please sign in to comment.