Make endof() robust to invalid UTF-8 #17276

Merged
merged 1 commit into from
Jul 6, 2016

Conversation

nalimilan
Member

When an invalid string contains only continuation bytes, endof() tried to
index the underlying array at position 0. Instead of relying on bounds
checking, explicitly check that the index is > 0. Returning 0 when only
continuation bytes were encountered is consistent with the definition of
endof(), which gives the last valid index.

This also allows removing the separate i == 0 check. The new code appears
to be slightly faster than the old one.
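
For readers who want to see the difference concretely, here is a minimal sketch of the two scanning strategies as described above. It is written in Python rather than Julia and is a reconstruction from the description, not the actual diff: the names `byte_at`, `endof_unguarded` and `endof_guarded` are hypothetical, and `byte_at` only emulates a 1-based, bounds-checked array access.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def byte_at(data: bytes, i: int) -> int:
    """Hypothetical helper: 1-based, bounds-checked access to a byte array."""
    if not 1 <= i <= len(data):
        raise IndexError(f"index {i} out of bounds for length {len(data)}")
    return data[i - 1]


def endof_unguarded(data: bytes) -> int:
    """Sketch of the old behaviour: special-case empty input, then scan
    backwards with no lower bound on the index."""
    i = len(data)
    if i == 0:
        return 0
    while is_continuation(byte_at(data, i)):
        i -= 1  # reaches 0 if every byte is a continuation byte
    return i


def endof_guarded(data: bytes) -> int:
    """Sketch of the new behaviour: the i > 0 test stops the scan and also
    covers the empty case, so no separate i == 0 check is needed."""
    i = len(data)
    while i > 0 and is_continuation(byte_at(data, i)):
        i -= 1
    return i
```

Under these assumptions, `endof_guarded(b"\x90")` returns 0, whereas `endof_unguarded(b"\x90")` raises an `IndexError` when it tries to read position 0. Both return 1 for `b"\x61\x90"`.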

@nalimilan
Member Author

Note this is also more consistent, as before this change String(b"\x61\x90") printed as "a" while String(b"\x90") raised an error.

@tkelman tkelman added the unicode Related to unicode characters and encodings label Jul 5, 2016
@StefanKarpinski StefanKarpinski merged commit fa5af23 into master Jul 6, 2016
@StefanKarpinski StefanKarpinski deleted the nl/endof branch July 6, 2016 00:27
mfasi pushed a commit to mfasi/julia that referenced this pull request Sep 5, 2016

3 participants