Commit
Add core function valid_utf8_length()
This function assumes that the input is valid UTF-8. It uses a
different algorithm, which counts a word at a time, much like
variant_under_utf8_count(). This leads to significant performance
improvements, with longer strings getting more relative improvement.

In the tables below, the first column is the baseline at 100.00, and
higher numbers are better.

On a 32-bit system, the number of failed branch predictions declines to
half as many, with everything else staying about equal.

    32-bit UVs; string length 24 characters; 2 bytes per character

            bytecount wordcount
            --------- ---------
        Ir     100.00    100.72
        Dr     100.00    100.82
        Dw     100.00    101.10
      COND     100.00    100.00
       IND     100.00    100.00

    COND_m     100.00    200.00
     IND_m     100.00    100.00

     Ir_m1     100.00    100.00
     Dr_m1     100.00    100.00
     Dw_m1     100.00    100.00

     Ir_mm     100.00    100.00
     Dr_mm     100.00    100.00
     Dw_mm     100.00    100.00

The results are similar for longer strings, and for code points
represented by different numbers of bytes.

On a 64-bit platform, the number of failed branch predictions is also
halved (COND_m of 200.00), but at some short string lengths, the number
of branches worsens slightly:

    64-bit UVs; string length 4 characters; 3 bytes per character

            byteutf8_length wordutf8_length
            --------------- ---------------
        Ir           100.00           96.08
        Dr           100.00          100.88
        Dw           100.00          100.00
      COND           100.00           97.24
       IND           100.00          100.00

    COND_m           100.00          200.00
     IND_m           100.00          100.00

     Ir_m1           100.00          100.00
     Dr_m1           100.00          100.00
     Dw_m1           100.00          100.00

     Ir_mm           100.00          100.00
     Dr_mm           100.00          100.00
     Dw_mm           100.00          100.00

For longer strings, things improve:

    64-bit UVs; string length 24 characters; 2 bytes per character

            byteutf8_length wordutf8_length
            --------------- ---------------
        Ir           100.00          103.97
        Dr           100.00          112.35
        Dw           100.00          100.00
      COND           100.00          110.27
       IND           100.00          100.00

    COND_m           100.00          300.00
     IND_m           100.00          100.00

     Ir_m1           100.00          100.00
     Dr_m1           100.00          100.00
     Dw_m1           100.00          100.00

     Ir_mm           100.00          100.00
     Dr_mm           100.00          100.00
     Dw_mm           100.00          100.00

    64-bit UVs; string length 24 characters; 3 bytes per character

            byteutf8_length wordutf8_length
            --------------- ---------------
        Ir           100.00           99.73
        Dr           100.00          111.37
        Dw           100.00          100.00
      COND           100.00          108.05
       IND           100.00          100.00

    COND_m           100.00          150.00
     IND_m           100.00          100.00

     Ir_m1           100.00          100.00
     Dr_m1           100.00          100.00
     Dw_m1           100.00          100.00

     Ir_mm           100.00          100.00
     Dr_mm           100.00          100.00
     Dw_mm           100.00          100.00

At very long string lengths, the improvement is larger still:

    64-bit UVs; string length 10000000 characters; 2 bytes per character

            byteutf8_length wordutf8_length
            --------------- ---------------
        Ir           100.00          160.00
        Dr           100.00          799.91
        Dw           100.00          100.00
      COND           100.00          399.98
       IND           100.00          100.00

    COND_m           100.00          150.00
     IND_m           100.00          100.00

     Ir_m1           100.00          100.00
     Dr_m1           100.00          100.00
     Dw_m1           100.00          100.00

     Ir_mm           100.00          100.00
     Dr_mm           100.00          100.00
     Dw_mm           100.00          100.00

Performance actually worsens on strings whose code points occupy 7 or
13 bytes each. These are not in common use, as the largest code point
that Unicode recognizes occupies 4 bytes.
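
The word-at-a-time counting described above can be sketched in C roughly
as follows. This is only an illustration under stated assumptions, not
the code added by this commit: the name sketch_utf8_length() is
hypothetical, the word size is fixed at 64 bits, and, like the function
described here, it assumes the input is already valid UTF-8. It relies on
every character having exactly one non-continuation byte, so the
character count is the byte count minus the number of continuation bytes
(those of the form 10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, NOT the Perl core function: returns the number of
 * characters in a buffer that the caller guarantees is valid UTF-8. */
static size_t sketch_utf8_length(const unsigned char *s, size_t len)
{
    /* 0x01 in every byte lane of the word */
    const uint64_t low_bits = UINT64_C(0x0101010101010101);
    size_t continuations = 0;
    size_t i = 0;

    assert(s != NULL || len == 0);

    /* Word-at-a-time part: examine 8 bytes per iteration. */
    for (; i + sizeof(uint64_t) <= len; i += sizeof(uint64_t)) {
        uint64_t w;
        uint64_t is_cont;

        /* memcpy sidesteps alignment requirements on the input pointer */
        memcpy(&w, s + i, sizeof w);

        /* A continuation byte has bit 7 set and bit 6 clear.  Move the
         * result of that test into the low bit of each byte lane. */
        is_cont = (w >> 7) & ~(w >> 6) & low_bits;

        /* Multiplying by low_bits sums the eight 0/1 lanes into the top
         * byte; the sum is at most 8, so it cannot overflow the lane. */
        continuations += (size_t) ((is_cont * low_bits) >> 56);
    }

    /* Byte-at-a-time tail for the final partial word. */
    for (; i < len; i++) {
        continuations += (s[i] & 0xC0) == 0x80;
    }

    return len - continuations;
}
```

The loop body has no data-dependent branch per character, which is one
plausible reason for the drop in failed branch predictions reported
above. For instance, passing the 12 bytes of "\xE2\x82\xAC" repeated four
times should yield 4, matching the "4 characters; 3 bytes per character"
case.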