Add variant_under_utf8_count() core function
This function takes a string that isn't encoded in UTF-8 (and hence is assumed to be in Latin1), and counts how many of its bytes would change if it were translated into UTF-8. Each such byte will occupy two UTF-8 bytes.

This function is useful for calculating the expansion precisely when converting to UTF-8, so as to know how much to malloc.

This function uses a non-obvious method to do the calculations word-at-a-time, as opposed to the byte-at-a-time method used now, and hence should be much faster than the current methods.

The function is slightly more costly for strings shorter than one word, with approximately 1.5% more conditionals. But once the string is at least one word long, there is a savings that grows with the length of the string. On a 64-bit machine, the number of conditionals approaches 1/8 of that of the per-byte algorithm (in other words, roughly an eightfold improvement). Here are results from Porting/bench.pl for a 10,000-byte string (figures are percentages relative to the per-byte version; higher is better):

            per-byte  per-word
            --------  --------
    Ir        100.00    434.45
    Dr        100.00    785.11
    Dw        100.00    102.22
    COND      100.00    793.81
    IND       100.00    100.00
    COND_m    100.00    100.00
    IND_m     100.00    100.00
    Ir_m1     100.00    100.00
    Dr_m1     100.00     99.81
    Dw_m1     100.00    133.33
    Ir_mm     100.00    100.00
    Dr_mm     100.00    100.00
    Dw_mm     100.00    100.00

whereas the savings are smaller for a 24-byte string:

            per-byte  per-word
            --------  --------
    Ir        100.00    112.50
    Dr        100.00    108.36
    Dw        100.00    102.22
    COND      100.00    115.79
    IND       100.00    100.00
    COND_m    100.00         -
    IND_m     100.00    100.00
    Ir_m1     100.00    100.00
    Dr_m1     100.00    100.00
    Dw_m1     100.00    100.00

but rise with the length of the string. Here are the 96-byte results:

            per-byte  per-word
            --------  --------
    Ir        100.00    147.54
    Dr        100.00    130.28
    Dw        100.00    102.22
    COND      100.00    165.85
    IND       100.00    100.00
    COND_m    100.00    100.00
    IND_m     100.00    100.00
    Ir_m1     100.00    100.00
    Dr_m1     100.00    100.00
    Dw_m1     100.00    100.00
    Ir_mm     100.00    100.00
    Dr_mm     100.00    100.00
    Dw_mm     100.00    100.00

The timings are slightly worse for strings whose lengths aren't multiples of the word length, but not appreciably so.
I found this trick on the internet many years ago, but I can't seem to find it again to give them credit.
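
The word-at-a-time idea described in this message can be sketched roughly as follows. This is an illustrative sketch only, not the code added by the commit: the names `count_variants_bytewise`, `count_variants_wordwise`, and `latin1_to_utf8` are hypothetical, a 64-bit word is assumed, and the mask-and-multiply step is one well-known way to sum the high bits of eight bytes at once, which may or may not match the method actually used.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Reference version: one conditional per byte. */
static size_t count_variants_bytewise(const unsigned char *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] & 0x80)        /* high bit set: needs two bytes in UTF-8 */
            count++;
    }
    return count;
}

/* Sketch of a word-at-a-time count (assumes 64-bit words). */
static size_t count_variants_wordwise(const unsigned char *s, size_t len)
{
    const uint64_t ones = UINT64_C(0x0101010101010101);
    size_t count = 0;

    while (len >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s, sizeof w);    /* load without alignment assumptions */
        /* Shift each byte's high bit down to its low bit and mask, leaving
         * eight bytes that are each 0 or 1. Multiplying by 'ones' adds
         * those eight bytes together into the top byte of the product. */
        count += (size_t)((((w >> 7) & ones) * ones) >> 56);
        s += sizeof w;
        len -= sizeof w;
    }
    while (len--)                   /* leftover bytes, fewer than one word */
        count += (size_t)(*s++ >> 7);
    return count;
}

/* Hypothetical caller: use the count to malloc exactly the right size.
 * Returns a NUL-terminated buffer the caller must free, or NULL. */
static unsigned char *latin1_to_utf8(const unsigned char *s, size_t len,
                                     size_t *out_len)
{
    size_t n = len + count_variants_wordwise(s, len);
    unsigned char *out = malloc(n + 1);
    if (out == NULL)
        return NULL;

    unsigned char *d = out;
    for (size_t i = 0; i < len; i++) {
        if (s[i] < 0x80) {
            *d++ = s[i];                                  /* unchanged */
        } else {
            *d++ = (unsigned char)(0xC0 | (s[i] >> 6));   /* lead byte */
            *d++ = (unsigned char)(0x80 | (s[i] & 0x3F)); /* continuation */
        }
    }
    *d = '\0';
    *out_len = n;
    return out;
}
```

In this sketch the main loop performs one loop test per eight bytes rather than one test per byte, which is where a reduction in conditionals of the kind reported above would come from; the trailing loop handles strings whose length is not a multiple of the word size.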
1 parent 7a4b369, commit 47c620c
Showing 6 changed files with 222 additions and 4 deletions.