
Speed up UTF-8 validation checking on modern perls

Perl 5.26 introduced infrastructure in the core that can be used by
Encode to check UTF-8 stream validity much faster than before.

It is not clear when or if this functionality will be backported into
Devel::PPPort, in part because no one currently available knows how to
do it, and in part because most other code may simply rely on Encode,
in which case a general backport isn't needed.

This commit replaces the current scheme for checking UTF-8 validity, if
the infrastructure is available, with one in which normal processing
doesn't have to decode the UTF-8 into code points.  The copying of
characters individually from the input to the output is changed to be
a single operation for each entire span of valid input.

Thus in the normal case, what ends up happening is a tight loop to
check the validity, and then a memmove of the entire input to the
output, then return.

If an error is found, it copies all the valid input before the error,
then handles the character in error, then positions to the next input
position, and repeats the whole process starting from there.

Thus, this does not need to know about the intricacies of UTF-8
malformations, relying on the core to handle this.
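
To illustrate the control flow described above, here is a minimal
standalone sketch in plain C.  It is not the Encode.xs code: the names
seq_len, valid_prefix and copy_valid_spans are hypothetical, and
valid_prefix is only a simplified stand-in for the core's
is_utf8_string_loc_flags (it checks lead and continuation bytes, but not
overlongs or surrogates).  The sketch also assumes every error is
handled by substituting U+FFFD and skipping one byte, whereas the real
code asks the core how many bytes to consume and honors the CHECK flags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the sequence starting at s, or 0 if the
 * lead byte is bad, the sequence is truncated, or a continuation byte is
 * wrong.  Simplified: overlongs and surrogates are NOT rejected. */
static size_t seq_len(const unsigned char *s, const unsigned char *e)
{
    size_t need, i;

    if (s[0] < 0x80) return 1;
    if (s[0] >= 0xC2 && s[0] <= 0xDF) need = 2;
    else if (s[0] >= 0xE0 && s[0] <= 0xEF) need = 3;
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) need = 4;
    else return 0;

    if ((size_t)(e - s) < need) return 0;
    for (i = 1; i < need; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;
    return need;
}

/* Hypothetical stand-in for the core check: returns 1 if all of s[0..len)
 * passes; otherwise 0.  *ep is set to the end of the valid prefix, which
 * is the end of input on success or the first failing byte on error. */
static int valid_prefix(const unsigned char *s, size_t len,
                        const unsigned char **ep)
{
    const unsigned char *e = s + len;

    while (s < e) {
        size_t n = seq_len(s, e);
        if (n == 0) { *ep = s; return 0; }
        s += n;
    }
    *ep = e;
    return 1;
}

/* Copy s[0..len) to dst, one bulk move per valid span.  dst needs room
 * for 3 * len bytes, since each bad byte may expand to U+FFFD. */
size_t copy_valid_spans(unsigned char *dst, const unsigned char *s, size_t len)
{
    static const unsigned char repl[3] = { 0xEF, 0xBF, 0xBD };
    const unsigned char *e = s + len;
    unsigned char *d = dst;

    assert(dst != NULL && s != NULL);

    while (s < e) {
        const unsigned char *stop;
        int ok = valid_prefix(s, (size_t)(e - s), &stop);
        size_t span = (size_t)(stop - s);

        memmove(d, s, span);        /* copy as far as was successful */
        d += span;
        s = stop;
        if (ok) break;              /* normal case: one check, one move */

        memcpy(d, repl, sizeof repl);   /* assumed error policy */
        d += sizeof repl;
        s += 1;                     /* resume after the offending byte */
    }
    return (size_t)(d - dst);
}
```

With fully valid input the loop body runs once: a single validity scan
followed by a single memmove.  Only malformed input causes further
iterations, each restarting just past the byte that failed.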

There are currently some problems with Encode on EBCDIC platforms.  The
infrastructure is known to work correctly there, so I'm hopeful this
will solve these portability issues.
khwilliamson committed Dec 28, 2017
1 parent b7d2d47 commit bb0ee14d508c60aec474882c9de7b9366e9b9fd1
Showing with 55 additions and 0 deletions.
  1. +55 −0 cpan/Encode/Encode.xs
@@ -379,6 +379,13 @@ strict_utf8(pTHX_ SV* sv)
return SvTRUE(*svp);
/* Modern perls have the capability to do this more efficiently and portably */
#ifdef is_utf8_string_loc_flags
@@ -463,6 +470,9 @@ process_utf8(pTHX_ SV* dst, U8* s, U8* e, SV *check_sv,
STRLEN dlen;
char esc[UTF8_MAXLEN * 6 + 1];
const U32 flags = (strict)
: 0;
if (SvROK(check_sv)) {
/* croak("UTF-8 decoder doesn't support callback CHECK"); */
@@ -483,6 +493,41 @@ process_utf8(pTHX_ SV* dst, U8* s, U8* e, SV *check_sv,
stop_at_partial = stop_at_partial || (check & ENCODE_STOP_AT_PARTIAL);
while (s < e) {
#ifdef CAN_USE_BASE_PERL /* Use the much faster, portable implementation if
available */
/* If there were no errors, this will be 'e'; otherwise it will point
* to the first byte of the erroneous input */
const U8* e_or_where_failed;
bool valid = is_utf8_string_loc_flags(s, e - s, &e_or_where_failed, flags);
STRLEN len = e_or_where_failed - s;
/* Copy as far as was successful */
Move(s, d, len, U8);
d += len;
s = (U8 *) e_or_where_failed;
/* Are done if it was valid, or we are accepting partial characters and
* the only error is that the final bytes form a partial character */
if ( LIKELY(valid)
|| ( stop_at_partial
&& is_utf8_valid_partial_char_flags(s, e, flags)))
/* Here, was not valid. If is 'strict', and is legal extended UTF-8,
* we know it is a code point whose value we can calculate, just not
* one accepted under strict. Otherwise, it is malformed in some way.
* In either case, the system function can calculate either the code
* point, or the best substitution for it */
uv = utf8n_to_uvchr(s, e - s, &ulen, UTF8_ALLOW_ANY);
#else /* Use code for earlier perls */
if (UTF8_IS_INVARIANT(*s)) {
*d++ = *s++;
@@ -532,6 +577,16 @@ process_utf8(pTHX_ SV* dst, U8* s, U8* e, SV *check_sv,
ulen = 1;
#endif /* The two versions for processing come back together here, for the
* error handling code.
* Here, we are looping through the input and found an error.
* 'uv' is the code point in error if calculable, or the REPLACEMENT
* CHARACTER if not.
* 'ulen' is how many bytes of input this iteration of the loop
* consumes */
for (i=0; i<ulen; ++i) sprintf(esc+4*i, "\\x%02X", s[i]);
if (check & ENCODE_DIE_ON_ERR){
