Make endof() robust to invalid UTF-8 #17276

nalimilan · 2016-07-05T07:56:05Z

When an invalid string contains only continuation bytes, endof() tried to
index the underlying array at position 0. Instead of relying on bounds
checking, explicitly check that the index is > 0. Returning 0 when only
continuation bytes were encountered is consistent with the definition of
endof(), which gives the last valid index.

This also allows removing the i == 0 check. The new code appears to be
slightly faster than the old one.
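
For illustration, here is a minimal sketch of the backward scan described above. It is not the merged code: `endof_sketch` and `is_continuation` are hypothetical names, and the function works on a plain byte vector rather than on a String. It assumes only that a UTF-8 continuation byte has the bit pattern `10xxxxxx`.

```julia
# Hypothetical helper: true for UTF-8 continuation bytes (bit pattern 10xxxxxx).
is_continuation(b::UInt8) = (b & 0xc0) == 0x80

# Sketch of the scan: walk backward past continuation bytes and return the
# index of the last byte that could start a character, or 0 if there is none.
function endof_sketch(data::AbstractVector{UInt8})
    i = length(data)
    # The explicit `i > 0` test replaces a separate empty-input check and
    # prevents indexing at position 0 when every byte is a continuation byte.
    while i > 0 && is_continuation(data[i])
        i -= 1
    end
    return i
end

endof_sketch(UInt8[0x61, 0x90])  # 1: the lone continuation byte is skipped
endof_sketch(UInt8[0x90])        # 0: only continuation bytes
endof_sketch(UInt8[])            # 0: empty input
```

Because the loop condition tests `i > 0` before reading `data[i]`, the empty case and the all-continuation case share one code path and both return 0.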

nalimilan · 2016-07-05T08:18:17Z

Note that this is also more consistent: before this change, String(b"\x61\x90") printed as "a" while String(b"\x90") raised an error.

nalimilan mentioned this pull request Jul 5, 2016

bug in printing invalid char #17271

Closed

tkelman added the unicode label (Related to unicode characters and encodings) Jul 5, 2016

StefanKarpinski merged commit fa5af23 into master Jul 6, 2016

StefanKarpinski deleted the nl/endof branch July 6, 2016 00:27

nalimilan mentioned this pull request Sep 14, 2016

checkstring() accepts invalid UTF-8 string #14557

Closed
