tr''' does not work properly #15853

p5pRT · 2017-01-31T02:24:38Z

Migrated from rt.perl.org#130679 (status was 'rejected')

Searchable as RT130679$

p5pRT · 2017-01-31T02:24:38Z

From @khwilliamson

This is a bug report for perl from khw@cpan.org,
generated with the help of perlbug 1.40 running under perl 5.25.7.

Using a single quote as a delimiter in tr or y does not work properly.

The parsing for these commands is split between toke.c and op.c, with
the portion in toke.c mostly in S_scan_const. This function does not
get called when the delimiter is a single quote; presumably it is
expected to be only a double-quotish thing to do. scan_const creates an
intermediate form that op.c uses to finish things up. But in this case,
there is no intermediate form, and op.c is getting the raw input. This
is part of the underlying bug in
#130675. op.c is expecting the intermediate form; and the first part of
its input matches the intermediate form, right down to the single
reserved byte used by that form (which the fuzzer has managed to find).
But that reserved byte signals that there should be more to come, and
this defective input has none.

It so happens that the intermediate form is unchanged from the input for
ASCII literal characters that don't have ranges, so this works

tr'abc'def'

but ranges don't work

tr'a-z'A-Z' # translates only the three chars 'a', '-', 'z'

I haven't investigated \t, etc.

My guess is that it has been broken from the beginning, and it's amazing
to me that it has gone undiscovered for so long, requiring a fuzzer,
with the red herring of UTF-8 being involved.

One possibility to fix this is to simply prohibit single-quotish
behavior with tr, by putting in a check somewhere in the parsing. It
clearly isn't something that is affecting many, if any, people.

Otherwise, the portion of scan_const that deals with this would have to
be pulled out into a separate function, called as well from whatever
part of toke deals with single quotish. I believe that

tr '\xDF'\xFE'

should not be evaluated as double-quotish, for example.

Since I don't have much knowledge of the parser's overall operation,
suggestions are welcome. Or volunteers.

Flags:
category=core
severity=low

This perlbug was built using Perl 5.25.9 - Thu Jan 19 08:46:21 MST 2017
It is being executed now by Perl 5.25.7 - Fri Nov 18 08:29:04 MST 2016.

Site configuration information for perl 5.25.7:

Configured by khw at Fri Nov 18 08:29:04 MST 2016.

Summary of my perl5 (revision 5 version 25 subversion 7) configuration:
Commit id: 51d89e3
Platform:
osname=linux
osvers=4.8.0-27-generic
archname=x86_64-linux-thread-multi-ld
uname='linux khw-inspiron-3558 4.8.0-27-generic #29-ubuntu smp thu
oct 20 21:03:13 utc 2016 x86_64 x86_64 x86_64 gnulinux '
config_args='-des -Uversiononly -Dprefix=/home/khw/blead -Dusedevel
-D'optimize=-ggdb3' -A'optimize=-ggdb3' -A'optimize=-O0'
-Accflags='-DPERL_BOOL_AS_CHAR' -Accflags='-DPERL_EXTERNAL_GLOB'
-Dman1dir=none -Dman3dir=none -DDEBUGGING -Dcc=g++ -Dusemorebits
-Dusethreads -Accflags='-DNO_MATHOMS''
hint=recommended
useposix=true
d_sigaction=define
useithreads=define
usemultiplicity=define
use64bitint=define
use64bitall=define
uselongdouble=define
usemymalloc=n
bincompat5005=undef
Compiler:
cc='g++'
ccflags ='-D_REENTRANT -D_GNU_SOURCE -DPERL_BOOL_AS_CHAR
-DPERL_EXTERNAL_GLOB -DNO_MATHOMS -fwrapv -DDEBUGGING
-fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
optimize=' -ggdb3 -O0'
cppflags='-D_REENTRANT -D_GNU_SOURCE -DPERL_BOOL_AS_CHAR
-DPERL_EXTERNAL_GLOB -DNO_MATHOMS -fwrapv -DDEBUGGING
-fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
ccversion=''
gccversion='6.2.0 20161005'
gccosandvers=''
intsize=4
longsize=8
ptrsize=8
doublesize=8
byteorder=12345678
doublekind=3
d_longlong=define
longlongsize=8
d_longdbl=define
longdblsize=16
longdblkind=3
ivtype='long'
ivsize=8
nvtype='long double'
nvsize=16
Off_t='off_t'
lseeksize=8
alignbytes=16
prototype=define
Linker and Libraries:
ld='g++'
ldflags =' -fstack-protector-strong -L/usr/local/lib'
libpth=/usr/include/c++/6 /usr/include/x86_64-linux-gnu/c++/6
/usr/include/c++/6/backward /usr/local/lib
/usr/lib/gcc/x86_64-linux-gnu/6/include-fixed
/usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib
/usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
libs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
libc=libc-2.24.so
so=so
useshrplib=false
libperl=libperl.a
gnulibc_version='2.24'
Dynamic Linking:
dlsrc=dl_dlopen.xs
dlext=so
d_dlsymun=undef
ccdlflags='-Wl,-E'
cccdlflags='-fPIC'
lddlflags='-shared -ggdb3 -ggdb3 -O0 -L/usr/local/lib
-fstack-protector-strong'

@INC for perl 5.25.7:
/home/khw/blead/lib/perl5/site_perl/5.25.7/x86_64-linux-thread-multi-ld
/home/khw/blead/lib/perl5/site_perl/5.25.7
/home/khw/blead/lib/perl5/5.25.7/x86_64-linux-thread-multi-ld
/home/khw/blead/lib/perl5/5.25.7
/home/khw/blead/lib/perl5/site_perl

Environment for perl 5.25.7:
HOME=/home/khw
LANG=en_US.UTF-8
LANGUAGE=en_US
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=/usr/lib/ccache:/home/khw/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/usr/local/games
PERL5OPT=-w
PERL_BADLANG (unset)
PERL_DIFF_TOOL=wgdiff
PERL_POD_PEDANTIC=1
SHELL=/bin/ksh

p5pRT · 2017-01-31T04:04:28Z

From @tonycoz

On Mon, 30 Jan 2017 18:24:38 -0800, khw wrote:

It so happens that the intermediate form is unchanged from the input
for
ASCII literal characters that don't have ranges, so this works

tr'abc'def'

but ranges don't work

tr'a-z'A-Z' # translates only the three chars 'a', '-', 'z'

I haven't investigated \t, etc.

My guess is that it has been broken from the beginning, and it's
amazing
to me that it has gone undiscovered for so long, requiring a fuzzer,
with the red herring of UTF-8 being involved.

One possibility to fix this is to simply prohibit single-quotish
behavior with tr, by putting in a check somewhere in the parsing. It
clearly isn't something that is affecting many, if any, people.

Otherwise, the portion of scan_const that deals with this would have
to
be pulled out into a separate function, called as well from whatever
part of toke deals with single quotish. I believe that

tr '\xDF'\xFE'

should not be evaluated as double-quotish, for example.

Since I don't have much knowledge of the parser's overall operation,
suggestions are welcome. Or volunteers.

We'd need to modify S_tokeq() to handle ranges in single-quotish tr'''.

Possibly it could check for broken unicode to avoid #130675.

I don't think tr''' should support escapes beyond \'

Tony

p5pRT · 2017-01-31T04:04:28Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2017-02-07T01:05:19Z

From @khwilliamson

On Mon, 30 Jan 2017 20:04:28 -0800, tonyc wrote:

On Mon, 30 Jan 2017 18:24:38 -0800, khw wrote:

It so happens that the intermediate form is unchanged from the input
for
ASCII literal characters that don't have ranges, so this works

tr'abc'def'

but ranges don't work

tr'a-z'A-Z' # translates only the three chars 'a', '-', 'z'

I haven't investigated \t, etc.

My guess is that it has been broken from the beginning, and it's
amazing
to me that it has gone undiscovered for so long, requiring a fuzzer,
with the red herring of UTF-8 being involved.

One possibility to fix this is to simply prohibit single-quotish
behavior with tr, by putting in a check somewhere in the parsing. It
clearly isn't something that is affecting many, if any, people.

Otherwise, the portion of scan_const that deals with this would have
to
be pulled out into a separate function, called as well from whatever
part of toke deals with single quotish. I believe that

tr '\xDF'\xFE'

should not be evaluated as double-quotish, for example.

Since I don't have much knowledge of the parser's overall operation,
suggestions are welcome. Or volunteers.

We'd need to modify S_tokeq() to handle ranges in single-quotish tr'''.

Possibly it could check for broken unicode to avoid #130675.

Agreed

I don't think tr''' should support escapes beyond \'

Agreed

Tony

I don't know why the handling of tr/// is split between op.c and toke.c. I'm now thinking that most of it should be moved to op.c, with the toke.c part merely expanding the single- or double-quotish things. Then there would be no need to have an intermediate form, and I believe the code would be cleaner.
--
Karl Williamson

p5pRT · 2017-02-08T17:26:58Z

From @khwilliamson

On Mon, 06 Feb 2017 17:05:19 -0800, khw wrote:

On Mon, 30 Jan 2017 20:04:28 -0800, tonyc wrote:

On Mon, 30 Jan 2017 18:24:38 -0800, khw wrote:

It so happens that the intermediate form is unchanged from the
input
for
ASCII literal characters that don't have ranges, so this works

tr'abc'def'

but ranges don't work

tr'a-z'A-Z' # translates only the three chars 'a', '-', 'z'

I haven't investigated \t, etc.

My guess is that it has been broken from the beginning, and it's
amazing
to me that it has gone undiscovered for so long, requiring a
fuzzer,
with the red herring of UTF-8 being involved.

One possibility to fix this is to simply prohibit single-quotish
behavior with tr, by putting in a check somewhere in the parsing.
It
clearly isn't something that is affecting many, if any, people.

Otherwise, the portion of scan_const that deals with this would
have
to
be pulled out into a separate function, called as well from
whatever
part of toke deals with single quotish. I believe that

tr '\xDF'\xFE'

should not be evaluated as double-quotish, for example.

Since I don't have much knowledge of the parser's overall
operation,
suggestions are welcome. Or volunteers.

We'd need to modify S_tokeq() to handle ranges in single-quotish
tr'''.

Possibly it could check for broken unicode to avoid #130675.

Agreed

I don't think tr''' should support escapes beyond \'

Agreed

Tony

I don't know why the handling of tr/// is split between op.c and
toke.c. I'm now thinking that most of it should be moved to op.c,
with the toke.c part merely expanding the single- or double-quotish
things. Then there would be no need to have an intermediate form,
and I believe the code would be cleaner.

I have figured out why it is split between op.c and tr.c. If op.c didn't get it before it was fully parsed, it wouldn't be able to distinguish between things like an interior hyphen was input as a literal, and hence should be treated as a metacharacter; or if it was the expansion of something like \x2D, in which case it isn't a metacharacter. We have the same problem, and have (mostly) solved it, for regex patterns that are compiled separately, but it's work, and has presented a bunch of bugs over the years
--
Karl Williamson

p5pRT · 2018-07-17T14:11:59Z

From @khwilliamson

This is actually working as designed. I added doc clarifications in
8efc02f.

See also http://nntp.perl.org/group/perl.perl5.porters/251544

--
Karl Williamson

p5pRT · 2018-07-17T14:11:59Z

@khwilliamson - Status changed from 'open' to 'rejected'

p5pRT closed this as completed Jul 17, 2018

p5pRT added the Severity Low label Oct 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tr''' does not work properly #15853

tr''' does not work properly #15853

p5pRT commented Jan 31, 2017

p5pRT commented Jan 31, 2017

p5pRT commented Jan 31, 2017

p5pRT commented Jan 31, 2017

p5pRT commented Feb 7, 2017

p5pRT commented Feb 8, 2017

p5pRT commented Jul 17, 2018

p5pRT commented Jul 17, 2018

tr''' does not work properly #15853

tr''' does not work properly #15853

Comments

p5pRT commented Jan 31, 2017

p5pRT commented Jan 31, 2017

From @khwilliamson

p5pRT commented Jan 31, 2017

From @tonycoz

p5pRT commented Jan 31, 2017

p5pRT commented Feb 7, 2017

From @khwilliamson

p5pRT commented Feb 8, 2017

From @khwilliamson

p5pRT commented Jul 17, 2018

From @khwilliamson

p5pRT commented Jul 17, 2018