Dumper.xs: Output orphaned EBCDIC control as octal
This simplifies the code and removes the need for EBCDIC-specific tests
and the comments explaining them.

On ASCII machines there are the C0 controls, the C1 controls, and DEL,
which isn't technically in either set.  The C0 and DEL controls are
treated as low ordinal, and output using octal notation.  This commit
has no behavior changes on ASCII platforms.
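
The claim of no behavior change on ASCII platforms can be checked with a small sketch. The two macros below are stand-ins that assume how isASCII() and UTF8_IS_INVARIANT() behave on an ASCII platform (both true exactly for code points below 0x80). They are illustrative assumptions, not perl's actual definitions, and the function names are hypothetical.

```c
#include <limits.h>

/* Assumed ASCII-platform behavior of perl's isASCII() and
 * UTF8_IS_INVARIANT(); stand-ins for illustration only. */
#define SKETCH_IS_ASCII(k)          ((k) < 0x80)
#define SKETCH_UTF8_IS_INVARIANT(k) ((k) < 0x80)

/* The test this commit removes.  On an ASCII platform ' ' is 0x20, so the
 * second clause is always true once the first one is. */
int old_is_high_ordinal(unsigned int k)
{
    return !SKETCH_IS_ASCII(k) && k > ' ';
}

/* The test this commit adds. */
int new_is_high_ordinal(unsigned int k)
{
    return !SKETCH_UTF8_IS_INVARIANT(k);
}

/* Count the byte values on which the two tests disagree. */
int count_mismatches(void)
{
    int mismatches = 0;
    unsigned int k;

    for (k = 0; k <= UCHAR_MAX; k++) {
        if (old_is_high_ordinal(k) != new_is_high_ordinal(k)) {
            mismatches++;
        }
    }
    return mismatches;
}
```

Under these assumptions count_mismatches() returns 0: DEL (0x7F) stays low ordinal under both tests, and every byte from 0x80 up stays high ordinal.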

On EBCDIC machines, there are 1-1 mappings to the entire set of 65 ASCII
controls.  All but one are in a single block and have been output using
octal.  This commit doesn't change the behavior of the 64 single-block
controls.

One control is orphaned: it isn't adjacent to the others.  This commit's
only effect is to cause it to be displayed using octal instead of hex.
I believe the simplification of the code warrants this change.

On extant EBCDIC platforms that Perl supports, this control is 0xFF,
named EO or EIGHT ONES.  It is somewhat like DEL on ASCII platforms,
which we already display as octal even though DEL has a much higher
ordinal than any other control displayed as octal.
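
As a rough illustration of the visible difference for 0xFF, the sketch below formats the same code point both ways. The helper names are hypothetical, and the format strings are assumptions based on the "\x{}" notation mentioned in the diff comments. Dumper.xs builds its output differently and may pad the octal form.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helpers, not taken from Dumper.xs. */

/* Write k as a backslash followed by its octal digits, e.g. \377. */
int escape_octal(unsigned int k, char *buf, size_t len)
{
    return snprintf(buf, len, "\\%o", k);
}

/* Write k in \x{...} hex notation, e.g. \x{ff}. */
int escape_hex(unsigned int k, char *buf, size_t len)
{
    return snprintf(buf, len, "\\x{%x}", k);
}
```

Called with 0xFF, escape_octal() produces \377 (the form used after this commit) and escape_hex() produces \x{ff} (the form used before it).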
khwilliamson committed May 25, 2021
1 parent 5334c4b commit 030107e
Showing 1 changed file with 11 additions and 20 deletions.
31 changes: 11 additions & 20 deletions dist/Data-Dumper/Dumper.xs
@@ -254,13 +254,10 @@ esc_q_utf8(pTHX_ SV* sv, const char *src, STRLEN slen, I32 do_utf8, I32 useqq)
 normal++;
 }
 }
-else if (! isASCII(k) && k > ' ') {
-/* High ordinal non-printable code point. (The test that k is
-* above SPACE should be optimized out by the compiler on
-* non-EBCDIC platforms; otherwise we could put an #ifdef around
-* it, but it's better to have just a single code path when
-* possible. All but one of the non-ASCII EBCDIC controls are low
-* ordinal; that one is the only one above SPACE.)
+else if (! UTF8_IS_INVARIANT(k)) {
+/* We treat as low ordinal any code point whose representation is
+* the same under UTF-8 as not. Thus, this is a high ordinal code
+* point.
 *
 * If UTF-8, output as hex, regardless of useqq. This means there
 * is an overhead of 4 chars '\x{}'. Then count the number of hex
@@ -329,18 +326,10 @@ esc_q_utf8(pTHX_ SV* sv, const char *src, STRLEN slen, I32 do_utf8, I32 useqq)
 U8 c0 = *(U8 *)s;
 UV k;
 
-if (do_utf8
-&& ! isASCII(c0)
-/* Exclude non-ASCII low ordinal controls. This should be
-* optimized out by the compiler on ASCII platforms; if not
-* could wrap it in a #ifdef EBCDIC, but better to avoid
-* #if's if possible */
-&& c0 > ' '
-) {
-
-/* When in UTF-8, we output all non-ascii chars as \x{}
-* reqardless of useqq, except for the low ordinal controls on
-* EBCDIC platforms */
+if (do_utf8 && ! UTF8_IS_INVARIANT(c0)) {
+
+/* In UTF-8, we output as \x{} all chars that require more than
+* a single byte in UTF-8 to represent. */
 k = utf8_to_uvchr_buf((U8*)s, (U8*) send, NULL);
 
 /* treat invalid utf8 byte by byte. This loop iteration gets the
@@ -602,7 +591,9 @@ dump_regexp(pTHX_ SV *retval, SV *val)
 k = *p;
 }
 
-if ((k == '/' && !saw_backslash) || (do_utf8 && ! isASCII(k) && k > ' ')) {
+if ((k == '/' && !saw_backslash) || ( do_utf8
+&& ! UTF8_IS_INVARIANT(k)))
+{
 STRLEN to_copy = p - (U8 *) rval;
 if (to_copy) {
 /* If saw_backslash is true, this will copy the \ for us too. */