The previous fix for CVE-2024-46954 was still failing to
spot a certain subset of 2 byte sequences as being overlong.
1 byte sequences (0xxxxxxx) encode 7 payload bits.
2 byte sequences (110xxxxx 10xxxxxx) only manage to encode 6
payload bits in the second (lowest) byte.
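A 2 byte sequence is overlong if its value fits in the 7 bits
that a 1 byte sequence can hold, and 6 of those bits come from
the second byte. Thus the test for an overlong 2 byte encoding
is not "is the value of the payload bits in the first byte 0",
but rather "is the value of the payload bits in the first byte
smaller than 2" (that is, a first byte of 0xC0 or 0xC1).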
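
As a rough illustration of that test (a minimal sketch only, not
the code from this change; the helper name
is_overlong_2byte_lead is made up for the example):

```c
/* Hypothetical helper, not taken from the actual fix.
 * Returns 1 if 'lead' is a 2 byte lead byte (110xxxxx) whose 5
 * payload bits hold 0 or 1, so that the decoded value would fit
 * in 7 bits and should have been a 1 byte sequence. */
static int is_overlong_2byte_lead(unsigned char lead)
{
    if ((lead & 0xE0) != 0xC0)
        return 0;               /* not a 2 byte lead byte */
    return (lead & 0x1F) < 2;   /* true only for 0xC0 and 0xC1 */
}
```

Testing the payload bits against 0 alone would catch 0xC0 but
let 0xC1 through, which is the gap described above.
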
Credit for spotting the problem and the initial version of the
fix is due to truff (https://x.com/truffzor).
Another issue spotted, and fixed here, is that it's illegal to
encode high/low surrogates (U+D800 to U+DFFF) within UTF-8 (as
the code points that surrogate pairs represent should be
encoded directly).
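
A sketch of what rejecting surrogates can look like for a 3 byte
sequence, assuming the three bytes have already been checked to
be of the form 1110xxxx 10xxxxxx 10xxxxxx (both function names
are hypothetical and are not the ones used in the fix):

```c
/* Hypothetical helpers for illustration only. */

/* Combine the payload bits of a 3 byte sequence (4 + 6 + 6 bits).
 * Assumes the caller has already validated the byte patterns. */
static unsigned long decode_3byte(unsigned char b0, unsigned char b1,
                                  unsigned char b2)
{
    return ((unsigned long)(b0 & 0x0F) << 12) |
           ((unsigned long)(b1 & 0x3F) << 6) |
            (unsigned long)(b2 & 0x3F);
}

/* Returns 1 if cp lies in the surrogate range U+D800..U+DFFF. */
static int is_surrogate(unsigned long cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}
```

For example, the bytes ED A0 80 decode to 0xD800, which
is_surrogate() flags, whereas ED 9F BF decodes to 0xD7FF, which
it does not.
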
Finally, we need 21 bits of coverage to get all possible
Unicode values. 4 byte UTF-8 encodings give us 21 bits of data
as required, but there are values within this 21 bit range
(those above U+10FFFF) that are not valid Unicode code points.
So spot these and reject them too.
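
A sketch of the range check for 4 byte sequences, assuming the
bytes have already been checked to be of the form
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (the function names are
hypothetical, and this is not the code from the fix itself):

```c
/* Hypothetical helpers for illustration only. */

/* Combine the payload bits of a 4 byte sequence (3 + 6 + 6 + 6
 * = 21 bits). Assumes the caller has validated the byte patterns. */
static unsigned long decode_4byte(unsigned char b0, unsigned char b1,
                                  unsigned char b2, unsigned char b3)
{
    return ((unsigned long)(b0 & 0x07) << 18) |
           ((unsigned long)(b1 & 0x3F) << 12) |
           ((unsigned long)(b2 & 0x3F) << 6) |
            (unsigned long)(b3 & 0x3F);
}

/* Returns 1 if cp is beyond the last Unicode code point. */
static int is_beyond_unicode(unsigned long cp)
{
    return cp > 0x10FFFF;
}
```

For example, F4 8F BF BF decodes to 0x10FFFF and is accepted by
this check, while F4 90 80 80 decodes to 0x110000 and is
rejected.
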