Refactor UTF_START_MASK()
A slight change to this very low level macro (hence called a lot)
removes the need for a conditional, and causes it to work on single-byte
UTF-8 characters on ASCII platforms.
khwilliamson committed Jun 14, 2021
1 parent 78d5655 commit 6d12be3
Showing 1 changed file with 13 additions and 4 deletions.
17 changes: 13 additions & 4 deletions utf8.h
@@ -461,10 +461,19 @@ uppercase/lowercase/titlecase/fold into.
  */
 #define UTF_START_MARK(len) ((U8) ~(0xFF >> (len)))

-/* Masks out the initial one bits in a start byte, leaving the real data ones.
- * Doesn't work on an invariant byte. 'len' is the number of bytes in the
- * multi-byte sequence that comprises the character. */
-#define UTF_START_MASK(len) (UNLIKELY((len) >= 7) ? 0x00 : (0x1F >> ((len)-2)))
+/* Masks out the initial one bits in a start byte, leaving the following 0 bit
+ * and the real data bits. 'len' is the number of bytes in the multi-byte
+ * sequence that comprises the character.
+ *
+ * To illustrate: len = 2 => 0b0011_1111 works on start byte 110xxxxx
+ *                      6 => 0b0000_0011 works on start byte 1111110x
+ *                   >= 7 => There are no data bits in the start byte
+ * Note that on ASCII platforms, this can be passed a len=1 byte; and all the
+ * real data bits will be returned:
+                  len = 1 => 0b0111_1111
+ * This isn't true on EBCDIC platforms, where some len=1 bytes are of the form
+ * 0b101x_xxxx, so this can't be used there on single-byte characters. */
+#define UTF_START_MASK(len) (0xFF >> (len))

 /* Adds a UTF8 continuation byte 'new' of information to a running total code
  * point 'old' of all the continuation bytes so far. This is designed to be
