utf8.h: Remove an EBCDIC dependency
The symbol introduced in the previous commit allows this internal macro
to only need a single version, suitable for either EBCDIC or ASCII.
khwilliamson committed Jun 14, 2021
1 parent bd78ed1 commit 259cba4
Showing 2 changed files with 17 additions and 4 deletions.
19 changes: 17 additions & 2 deletions utf8.h
@@ -270,15 +270,30 @@ are in the character. */
* some #ifdefs. */
# define ONE_IF_EBCDIC_ZERO_IF_NOT 0

-#define UNICODE_IS_PERL_EXTENDED(uv) UNLIKELY((UV) (uv) > 0x7FFFFFFF)
-
#endif /* EBCDIC vs ASCII */

/* Since the significant bits in a continuation byte are stored in the
* least-significant positions, we often find ourselves shifting by that
* amount. This is a clearer name in such situations */
#define UTF_ACCUMULATION_SHIFT UTF_CONTINUATION_BYTE_INFO_BITS

+/* Perl extends Unicode so that it is possible to encode (as extended UTF-8 or
+ * UTF-EBCDIC) any 64-bit value. No standard known to khw ever encoded higher
+ * than a 31 bit value. On ASCII platforms this just meant arbitrarily saying
+ * nothing could be higher than this. On these the start byte FD gets you to
+ * 31 bits, and FE and FF are forbidden as start bytes. On EBCDIC platforms,
+ * FD gets you only to 26 bits; adding FE to mean 7 total bytes gets you to 30
+ * bits. To get to 31 bits, they treated an initial FF byte idiosyncratically.
+ * It was considered to be the start byte FE meaning it had 7 total bytes, and
+ * the final 1 was treated as an information bit, getting you to 31 bits.
+ *
+ * Perl used to accept this idiosyncratic interpretation of FF, but now rejects
+ * it in order to get to being able to encode 64 bits. The bottom line is that
+ * anything that requires more than 31 bits to represent on ASCII platforms
+ * uses a Perl extension; 30 bits on EBCDIC. */
+#define UNICODE_IS_PERL_EXTENDED(uv) \
+    UNLIKELY((UV) (uv) > nBIT_UMAX(31 - ONE_IF_EBCDIC_ZERO_IF_NOT))
+
/* 2**info_bits - 1. This masks out all but the bits that carry
* real information in a continuation byte. This turns out to be 0x3F in
* UTF-8, 0x1F in UTF-EBCDIC. */
2 changes: 0 additions & 2 deletions utfebcdic.h
@@ -226,8 +226,6 @@ explicitly forbidden, and the shortest possible encoding should always be used
* for more */
#define QUESTION_MARK_CTRL LATIN1_TO_NATIVE(0x9F)

-#define UNICODE_IS_PERL_EXTENDED(uv) UNLIKELY((UV) (uv) > 0x3FFFFFFF)
-
#define ONE_IF_EBCDIC_ZERO_IF_NOT 1

/*