Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected tainting via regex using locale #13452

Closed
p5pRT opened this issue Dec 3, 2013 · 31 comments
Closed

Unexpected tainting via regex using locale #13452

p5pRT opened this issue Dec 3, 2013 · 31 comments
Labels

Comments

@p5pRT
Copy link
Collaborator

@p5pRT p5pRT commented Dec 3, 2013

Migrated from rt.perl.org#120675 (status was 'resolved')

Searchable as RT120675$

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 3, 2013

From Martin.vGagern@gmx.net

Created by Martin.vGagern@gmx.net

When "use locale" is in effect, then the regular expression match
m/^(.*)[._](.*?)$/ taints its first capture group even if the input is not
tainted. The perllocale manual mentions locale-dependent regular
expressions which taint their capture groups, but the way I read that
documentation, only regular expressions using one of \w, \W, \s or \S should
be affected. So either the above regular expression should not taint the
group, or the documentation is incomplete and should be more precise.

Here is a small self-contained complete example​:

#!/usr/bin/perl -T
use strict;
use warnings;
use locale;
use Scalar​::Util qw(tainted);
my $var = "foo.bar_baz";
$var =~ m/^(.*)[._](.*?)$/;
print(tainted($1) ? "tainted\n" : "untainted\n");

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.18.1:

Configured by Gentoo at Fri Sep  6 12:56:43 CEST 2013.

Summary of my perl5 (revision 5 version 18 subversion 1) configuration:
   
  Platform:
    osname=linux, osvers=3.10.10-gentoo, archname=x86_64-linux-thread-multi
    uname='linux server 3.10.10-gentoo #1 smp preempt sat aug 31 14:25:33 cest 2013 x86_64 amd phenom(tm) ii x4 945 processor authenticamd gnulinux '
    config_args='-des -Duseshrplib -Darchname=x86_64-linux-thread -Dcc=x86_64-pc-linux-gnu-gcc -Doptimize=-march=amdfam10 -O2 -ggdb -pipe -Dldflags=-Wl,--as-needed -Dprefix=/usr -Dinstallprefix=/usr -Dsiteprefix=/usr/local -Dvendorprefix=/usr -Dscriptdir=/usr/bin -Dprivlib=/usr/lib64/perl5/5.18.1 -Darchlib=/usr/lib64/perl5/5.18.1/x86_64-linux-thread-multi -Dsitelib=/usr/local/lib64/perl5/5.18.1 -Dsitearch=/usr/local/lib64/perl5/5.18.1/x86_64-linux-thread-multi -Dvendorlib=/usr/lib64/perl5/vendor_perl/5.18.1 -Dvendorarch=/usr/lib64/perl5/vendor_perl/5.18.1/x86_64-linux-thread-multi -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3 -Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.18.1 -Dlocincpth=/usr/include  -Dglibpth=/lib64 /usr/lib64  -Duselargefiles -Dd_semctl_semun -Dcf_by=Gentoo -Dmyhostname=localhost -Dperladmin=root@localhost -Di
 nstallusrbinperl=n -Ud_csh -Uusenm -Di_ndbm -Di_gdbm -Di_db -Dusethreads -DDEBUGGING=-g -Dinc_version_list=5.18.0/x86_64-linux-thread-multi 5.18.0  -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Dnoextensions=ODBM_File'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='x86_64-pc-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-march=amdfam10 -O2 -ggdb -pipe',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector'
    ccversion='', gccversion='4.7.3', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='x86_64-pc-linux-gnu-gcc', ldflags ='-Wl,--as-needed -fstack-protector'
    libpth=/usr/local/lib64 /lib64 /usr/lib64
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc-2.17.so, so=so, useshrplib=true, libperl=libperl.so.5.18.1
    gnulibc_version='2.17'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib64/perl5/5.18.1/x86_64-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -march=amdfam10 -O2 -ggdb -pipe -fstack-protector'

Locally applied patches:
    


@INC for perl 5.18.1:
    /home/mvg/perl/lib/perl5/site_perl
    /home/mvg/perl/lib/perl5
    /home/mvg/perl/lib/perl5/site_perl
    /home/mvg/perl/lib/perl5
    /usr/local/lib64/perl5/5.18.1/x86_64-linux-thread-multi
    /usr/local/lib64/perl5/5.18.1
    /usr/lib64/perl5/vendor_perl/5.18.1/x86_64-linux-thread-multi
    /usr/lib64/perl5/vendor_perl/5.18.1
    /usr/lib64/perl5/5.18.1/x86_64-linux-thread-multi
    /usr/lib64/perl5/5.18.1
    /usr/local/lib64/perl5
    /usr/lib64/perl5/vendor_perl
    .


Environment for perl 5.18.1:
    HOME=/home/mvg
    LANG=de_DE.utf8
    LANGUAGE=
    LC_CTYPE=de_DE.utf8
    LC_MESSAGES=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/mvg/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/x86_64-pc-linux-gnu/gcc-bin/4.7.3:/usr/x86_64-pc-linux-gnu/mingw32/gcc-bin/4.7.3:/opt/stuffit/bin:/usr/x86_64-pc-linux-gnu/gnat-gcc-bin/4.5:/usr/libexec/gnat-gcc/x86_64-pc-linux-gnu/4.5:/opt/android-sdk-update-manager/tools:/opt/android-sdk-update-manager/platform-tools:/opt/nvidia-cg-toolkit/bin:/usr/games/bin:/opt/vmware/bin
    PERL5LIB=/home/mvg/perl/lib/perl5/site_perl:/home/mvg/perl/lib/perl5:/home/mvg/perl/lib/perl5/site_perl:/home/mvg/perl/lib/perl5
    PERLDOC_PAGER=less -R
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 4, 2013

From @jkeenan

On Tue Dec 03 11​:54​:14 2013, Martin.vGagern@​gmx.net wrote​:

This is a bug report for perl from <Martin.vGagern@​gmx.net>,
generated with the help of perlbug 1.39 running under perl 5.18.1.

-----------------------------------------------------------------
[Please describe your issue here]

When "use locale" is in effect, then the regular expression match
m/^(.*)[._](.*?)$/ taints its first capture group even if the input is
not
tainted. The perllocale manual mentions locale-dependent regular
expressions which taint their capture groups, but the way I read that
documentation, only regular expressions using one of \w, \W, \s or \S
should
be affected. So either the above regular expression should not taint
the
group, or the documentation is incomplete and should be more precise.

Here is a small self-contained complete example​:

#!/usr/bin/perl -T
use strict;
use warnings;
use locale;
use Scalar​::Util qw(tainted);
my $var = "foo.bar_baz";
$var =~ m/^(.*)[._](.*?)$/;
print(tainted($1) ? "tainted\n" : "untainted\n");

I have confirmed that this is present in blead (5ea8618).

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 4, 2013

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 4, 2013

From @iabyn

On Tue, Dec 03, 2013 at 11​:54​:15AM -0800, Martin.vGagern@​gmx.net wrote​:

When "use locale" is in effect, then the regular expression match
m/^(.*)[._](.*?)$/ taints its first capture group even if the input is not
tainted. The perllocale manual mentions locale-dependent regular
expressions which taint their capture groups, but the way I read that
documentation, only regular expressions using one of \w, \W, \s or \S should
be affected. So either the above regular expression should not taint the
group, or the documentation is incomplete and should be more precise.

Here is a small self-contained complete example​:

#!/usr/bin/perl -T
use strict;
use warnings;
use locale;
use Scalar​::Util qw(tainted);
my $var = "foo.bar_baz";
$var =~ m/^(.*)[._](.*?)$/;
print(tainted($1) ? "tainted\n" : "untainted\n");

It's the docs that were unclear, which I've amened with commit

  9fc477b

Locale info taints the whole regex, not just individual captures, since it
may affect other captures. As a trivial example, in

  /(\w+)(.)/

$2 needs to be tainted, since the tainted \w+ may eat fewer or more
characters than you expect, altering what $2 captures.

PS - your example appears to succeed in 5.16.0 and earlier, but this was
due to a bug in Scalar​::Util​::tainted(). Using quotes​: tainted("$1")
makes it fail "properly".

--
"Foul and greedy Dwarf - you have eaten the last candle."
  -- "Hordes of the Things", BBC Radio.

@p5pRT p5pRT closed this Dec 4, 2013
@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 4, 2013

@iabyn - Status changed from 'open' to 'resolved'

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 4, 2013

From @khwilliamson

On 12/04/2013 10​:14 AM, Dave Mitchell wrote​:

On Tue, Dec 03, 2013 at 11​:54​:15AM -0800, Martin.vGagern@​gmx.net wrote​:

When "use locale" is in effect, then the regular expression match
m/^(.*)[._](.*?)$/ taints its first capture group even if the input is not
tainted. The perllocale manual mentions locale-dependent regular
expressions which taint their capture groups, but the way I read that
documentation, only regular expressions using one of \w, \W, \s or \S should
be affected. So either the above regular expression should not taint the
group, or the documentation is incomplete and should be more precise.

Here is a small self-contained complete example​:

#!/usr/bin/perl -T
use strict;
use warnings;
use locale;
use Scalar​::Util qw(tainted);
my $var = "foo.bar_baz";
$var =~ m/^(.*)[._](.*?)$/;
print(tainted($1) ? "tainted\n" : "untainted\n");

It's the docs that were unclear, which I've amened with commit

 9fc477bf4a1f21e94d5dfe8e99d8a93308d5388c

Locale info taints the whole regex, not just individual captures, since it
may affect other captures. As a trivial example, in

 /\(\\w\+\)\(\.\)/

$2 needs to be tainted, since the tainted \w+ may eat fewer or more
characters than you expect, altering what $2 captures.

PS - your example appears to succeed in 5.16.0 and earlier, but this was
due to a bug in Scalar​::Util​::tainted(). Using quotes​: tainted("$1")
makes it fail "properly".

There is a bug in regexec.c that I'm fixing that fixes this sample
program. Details to follow

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 5, 2013

From @khwilliamson

On 12/04/2013 10​:23 AM, Karl Williamson wrote​:

On 12/04/2013 10​:14 AM, Dave Mitchell wrote​:

On Tue, Dec 03, 2013 at 11​:54​:15AM -0800, Martin.vGagern@​gmx.net wrote​:

When "use locale" is in effect, then the regular expression match
m/^(.*)[._](.*?)$/ taints its first capture group even if the input
is not
tainted. The perllocale manual mentions locale-dependent regular
expressions which taint their capture groups, but the way I read that
documentation, only regular expressions using one of \w, \W, \s or \S
should
be affected. So either the above regular expression should not taint
the
group, or the documentation is incomplete and should be more precise.

Here is a small self-contained complete example​:

#!/usr/bin/perl -T
use strict;
use warnings;
use locale;
use Scalar​::Util qw(tainted);
my $var = "foo.bar_baz";
$var =~ m/^(.*)[._](.*?)$/;
print(tainted($1) ? "tainted\n" : "untainted\n");

It's the docs that were unclear, which I've amened with commit

 9fc477bf4a1f21e94d5dfe8e99d8a93308d5388c

Locale info taints the whole regex, not just individual captures,
since it
may affect other captures. As a trivial example, in

 /\(\\w\+\)\(\.\)/

$2 needs to be tainted, since the tainted \w+ may eat fewer or more
characters than you expect, altering what $2 captures.

PS - your example appears to succeed in 5.16.0 and earlier, but this was
due to a bug in Scalar​::Util​::tainted(). Using quotes​: tainted("$1")
makes it fail "properly".

There is a bug in regexec.c that I'm fixing that fixes this sample
program. Details to follow

This is now fixed by commit b99851e

What was happening before is that in any bracketed character class, if a
target character did not match exactly one of those characters
explicitly in the class, the result was tainted. This tainting should
not have been turned on unless there was an attempt to match
case-insensitively or against a POSIX class, both of which are affected
by locale. But if the character class doesn't include those
possibilities, as in the sample program from the ticket​:

  "foo.bar_baz" =~ /^(.*)[._](.*?)$/

then tainting shouldn't be turned on. The point is that this matches
regardless of whatever locale is in effect. Thus a defective or
malicious locale plays no part in determining if this matches or not,
and so the result should not be tainted. (And it now isn't in blead.)

Having now looked at the taint code in regexec.c, I see some other bugs
and inconsistencies with the documentation.

First, the documentation says that tainting is only done for things that
use \w, \s, and their complements. In fact, tainting is done for any
POSIX class. That seems correct to me, and I think the documentation
should change. Why taint on \w and not on [​:word​:], for example; they
mean the exact same thing. And why not taint on \d or [​:digit​:]? It's
true that most locales don't define them other than [0-9], but it's
legal to do so, some locales do define them otherwise, and Perl relies
on those definitions. So I think tainting should happen for all POSIX
classes; this is what is currently implemented. Any disagreement?

Second, this is tainted​:

  "abc" =~ /(abc)/i

simply because of the /i. I don't think it should be. Similarly
backreferences like \1 will always be tainted under /i even if they
match regardless of the effective locale. Do people think these are
worth fixing?

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 5, 2013

From Martin.vGagern@gmx.net

On 05.12.2013 04​:12, karl williamson via RT wrote​:

This tainting should
not have been turned on unless there was an attempt to match
case-insensitively or against a POSIX class, both of which are affected
by locale.

I guess character ranges depend on collation and therefore on locale.

First, the documentation says that tainting is only done for things that
use \w, \s, and their complements. In fact, tainting is done for any
POSIX class. That seems correct to me, and I think the documentation
should change.

I agree. The doc should reliably describe actual behaviour. I even
wonder whether you should add a note that up to the current version, any
bracketed character list would taint, to help portability across
versions. Not sure whether I've ever seen anything like this i perl
docs, but I know some projects do this kind of thing.

Second, this is tainted​:

"abc" =~ /(abc)/i

simply because of the /i. I don't think it should be. Similarly
backreferences like \1 will always be tainted under /i even if they
match regardless of the effective locale. Do people think these are
worth fixing?

I'd say keep it as is. If I were to write such a /(abc)/i line, I'd
probably think that users may well enter "ABC", but during the
development I'd only test with "abc". So I wouldn't notice the taint
during development, and would be surprised if users reported problems.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 7, 2013

From @rjbs

* Martin von Gagern <Martin.vGagern@​gmx.net> [2013-12-05T01​:19​:19]

Second, this is tainted​:

"abc" =~ /(abc)/i

simply because of the /i. I don't think it should be. Similarly
backreferences like \1 will always be tainted under /i even if they
match regardless of the effective locale. Do people think these are
worth fixing?

I'd say keep it as is. If I were to write such a /(abc)/i line, I'd
probably think that users may well enter "ABC", but during the
development I'd only test with "abc". So I wouldn't notice the taint
during development, and would be surprised if users reported problems.

I'm not sure I understand this line of argument.

--
rjbs

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 7, 2013

From @dmcbride

On Friday December 6 2013 9​:06​:15 PM Ricardo Signes wrote​:

* Martin von Gagern <Martin.vGagern@​gmx.net> [2013-12-05T01​:19​:19]

Second, this is tainted​:
"abc" =~ /(abc)/i

simply because of the /i. I don't think it should be. Similarly
backreferences like \1 will always be tainted under /i even if they
match regardless of the effective locale. Do people think these are
worth fixing?

I'd say keep it as is. If I were to write such a /(abc)/i line, I'd
probably think that users may well enter "ABC", but during the
development I'd only test with "abc". So I wouldn't notice the taint
during development, and would be surprised if users reported problems.

I'm not sure I understand this line of argument.

If "abc" =~ /(abc)/i leaves $1 not tainted, but "ABC" =~ /(abc)/i does leave
$1 tainted, this may be a surprise. Since most of us will test with lower
case abc, but fewer of us will remember to test some/all upper case, this
surprise can result in tainting at runtime that we're not prepared for.

If, however, the /i makes $1 always tainted under locale, even if the input is
a raw match without requiring i, then all tests will get the tainting and the
developer will be forced to deal with tainting all the time, including their
own tests. This makes it harder to miss the tainting of /i if it always
taints under locale rather than just sometimes.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 7, 2013

From @ap

* Darin McBride <dmcbride@​cpan.org> [2013-12-07 03​:45]​:

* Ricardo Signes <perl.p5p@​rjbs.manxome.org> [2013-12-07 03​:10]​:

* Martin von Gagern <Martin.vGagern@​gmx.net> [2013-12-05 19​:00]​:

* Karl Williamson <public@​khwilliamson.com> [2013-12-05 04​:15]​:

Second, this is tainted​:
"abc" =~ /(abc)/i

simply because of the /i. I don't think it should be. Similarly
backreferences like \1 will always be tainted under /i even if
they match regardless of the effective locale. Do people think
these are worth fixing?

I'd say keep it as is. If I were to write such a /(abc)/i line,
I'd probably think that users may well enter "ABC", but during the
development I'd only test with "abc". So I wouldn't notice the
taint during development, and would be surprised if users reported
problems.

I'm not sure I understand this line of argument.

If "abc" =~ /(abc)/i leaves $1 not tainted, but "ABC" =~ /(abc)/i does
leave $1 tainted, this may be a surprise.

Err. Except the only way I can read Karl’s proposal is that he thinks
that using /i itself should never taint captures, only locale-dependent
classes should do that. Nor did I see him mention input string casing at
all, let alone propose that tainting should depend on it. I cannot even
imagine how it would seem plausible to anyone that he proposed that… but
maybe I am therefore misreading him? Because if not,

Since most of us will test with lower case abc, but fewer of us will
remember to test some/all upper case, this surprise can result in
tainting at runtime that we're not prepared for.

If, however, the /i makes $1 always tainted under locale, even if the
input is a raw match without requiring i, then all tests will get the
tainting and the developer will be forced to deal with tainting all
the time, including their own tests. This makes it harder to miss the
tainting of /i if it always taints under locale rather than just
sometimes.

… this argument is all quite moot, no?

--
Aristotle Pagaltzis // <http​://plasmasturm.org/>

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 7, 2013

From @khwilliamson

On 12/07/2013 01​:07 AM, Aristotle Pagaltzis wrote​:

* Darin McBride <dmcbride@​cpan.org> [2013-12-07 03​:45]​:

* Ricardo Signes <perl.p5p@​rjbs.manxome.org> [2013-12-07 03​:10]​:

* Martin von Gagern <Martin.vGagern@​gmx.net> [2013-12-05 19​:00]​:

* Karl Williamson <public@​khwilliamson.com> [2013-12-05 04​:15]​:

Second, this is tainted​:
"abc" =~ /(abc)/i

simply because of the /i. I don't think it should be. Similarly
backreferences like \1 will always be tainted under /i even if
they match regardless of the effective locale. Do people think
these are worth fixing?

I'd say keep it as is. If I were to write such a /(abc)/i line,
I'd probably think that users may well enter "ABC", but during the
development I'd only test with "abc". So I wouldn't notice the
taint during development, and would be surprised if users reported
problems.

I'm not sure I understand this line of argument.

If "abc" =~ /(abc)/i leaves $1 not tainted, but "ABC" =~ /(abc)/i does
leave $1 tainted, this may be a surprise.

Err. Except the only way I can read Karl’s proposal is that he thinks
that using /i itself should never taint captures, only locale-dependent
classes should do that. Nor did I see him mention input string casing at
all, let alone propose that tainting should depend on it. I cannot even
imagine how it would seem plausible to anyone that he proposed that…

And I wasn't proposing that. What I was suggesting is that​: only if
folding were actually needed to do the match should there be tainting.
How it would work would be to try an exact match, and only if that
failed would a fold match be tried and tainting turned on (regardless of
the outcome of that fold match). I hope this is clearer.
  but

maybe I am therefore misreading him? Because if not,

Since most of us will test with lower case abc, but fewer of us will
remember to test some/all upper case, this surprise can result in
tainting at runtime that we're not prepared for.

If, however, the /i makes $1 always tainted under locale, even if the
input is a raw match without requiring i, then all tests will get the
tainting and the developer will be forced to deal with tainting all
the time, including their own tests. This makes it harder to miss the
tainting of /i if it always taints under locale rather than just
sometimes.

… this argument is all quite moot, no?

So the argument is not moot, and I think has much to recommend it. To
put it more succinctly (in my understanding)​: If there exists a target
string for which a given pattern match requires locale information in
order to determine if that string matches or not, then the results are
tainted for every string regardless of whether or not this particular
string actually invoked the portions of the pattern where locales make a
difference.

But that's not how it works now. Taken to its logical conclusion,
tainting pattern matches becomes much simpler. It's turned on at the
beginning of matching even if we don't actually attempt the match
because the optimizer says the match cannot succeed because of, say, we
know that no string shorter than 15 bytes can match and the current
string is 14.

It has worked for some time (perhaps forever; I haven't done the
digging) that if a string matches a character in a bracketed character
class, tainting is not turned on. My proposal was to have the same
behavior for matches outside character classes, so that things were
consistent. However it is simpler and faster to just say that if the /l
pattern contains any POSIX classes (including things like \w) or /i
matching, that tainting is turned on for any attempted match of it.

Because the behavior is currently inconsistent, something should be
changed to make it consistent. I think Martin's approach is the one
that leads to the fewest surprises; but will change the behavior that
existing programs might have come to rely on.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 7, 2013

From @ap

* Karl Williamson <public@​khwilliamson.com> [2013-12-07 17​:30]​:

What I was suggesting is that​: only if folding were actually needed to
do the match should there be tainting. How it would work would be to
try an exact match, and only if that failed would a fold match be
tried and tainting turned on (regardless of the outcome of that fold
match). I hope this is clearer.

Ah. Then I am surprised you would propose that, and I have to agree with
the others.

So the argument is not moot, and I think has much to recommend it. To
put it more succinctly (in my understanding)​: If there exists a target
string for which a given pattern match requires locale information in
order to determine if that string matches or not, then the results are
tainted for every string regardless of whether or not this particular
string actually invoked the portions of the pattern where locales make
a difference.

Yes. Tainting of captures would be a property of the pattern determined
at (pattern) compile time, and would then apply to each and every match
the regexp is used in.

But that's not how it works now. Taken to its logical conclusion,
tainting pattern matches becomes much simpler. It's turned on at the
beginning of matching even if we don't actually attempt the match
because the optimizer says the match cannot succeed because of, say,
we know that no string shorter than 15 bytes can match and the current
string is 14.

Indeed. It’s a property of the regexp – nothing about the input matters.

It has worked for some time (perhaps forever; I haven't done the
digging) that if a string matches a character in a bracketed character
class, tainting is not turned on. My proposal was to have the same
behavior for matches outside character classes, so that things were
consistent. However it is simpler and faster to just say that if the
/l pattern contains any POSIX classes (including things like \w) or /i
matching, that tainting is turned on for any attempted match of it.

(I was thinking /i didn’t necessarily have to cause tainting because for
some reason I was thinking there are characters whose casing rules does
not change between locales, e.g. “a”. Then I realised, this may be true
in practice (among sane locales) but it is not *true by definition*, so
patterns must always taint under /li.)

Because the behavior is currently inconsistent, something should be
changed to make it consistent. I think Martin's approach is the one
that leads to the fewest surprises; but will change the behavior that
existing programs might have come to rely on.

If a program depends on the current behaviour, can it be correct?

If not, how much of the time is it likely to exhibit its brokenness?
(Or conversely​: how much of the time is it likely to accidentally do
the right thing regardless?)

How widespread is the breakage likely to be?

Regards,
--
Aristotle Pagaltzis // <http​://plasmasturm.org/>

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 11, 2013

From @khwilliamson

On 12/07/2013 10​:17 AM, Aristotle Pagaltzis wrote​:

(I was thinking /i didn’t necessarily have to cause tainting because for
some reason I was thinking there are characters whose casing rules does
not change between locales, e.g. “a”. Then I realised, this may be true
in practice (among sane locales) but it is not*true by definition*, so
patterns must always taint under /li.)

http​://en.wikipedia.org/wiki/ISO/IEC_646

lists several locales which exist where casing rules do change Here is
their text​:

"There are also some 7-bit character sets that are not officially part
of the ISO 646 standard. Examples include​:

  7-bit Greek, ELOT 927. The Greek alphabet is mapped to positions
0x61–0x71 and 0x73–0x79, on top of the Latin lowercase letters.
  7-bit Cyrillic, KOI-7 or Short KOI. The Cyrillic characters are
mapped to positions 0x60–0x7E, on top of the Latin lowercase letters.
Superseded by the KOI-8 variants.
  7-bit Hebrew, SI 960. The Hebrew alphabet is mapped to positions
0x60–0x7A, on top of the lowercase Latin letters (and grave accent for
aleph). 7-bit Hebrew was always stored in visual order. This mapping
with the high bit set, i.e. with the Hebrew letters in 0xE0–0xFA, is ISO
8859-8.
  7-bit Arabic, ASMO 449. The Arabic alphabet is mapped to positions
0x41–0x5A and 0x60–0x6A, on top of both uppercase and lowercase Latin
letters. This mapping with the high bit set is ISO 8859-6."

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jan 15, 2014

From @khwilliamson

On 12/07/2013 10​:17 AM, Aristotle Pagaltzis wrote​:

Yes. Tainting of captures would be a property of the pattern determined
at (pattern) compile time, and would then apply to each and every match
the regexp is used in.

But that's not how it works now. Taken to its logical conclusion,
tainting pattern matches becomes much simpler. It's turned on at the
beginning of matching even if we don't actually attempt the match
because the optimizer says the match cannot succeed because of, say,
we know that no string shorter than 15 bytes can match and the current
string is 14.
Indeed. It’s a property of the regexp – nothing about the input matters.

I am now uncertain about how tainting should behave. What Aristotle
(and others) say makes sense to me, but it isn't the way I've coded a
bunch of stuff, which was based on how it looked to me how things
behaved. I have now reexamined some of the source code for 5.8.9, and
tainting in regexes in the places I looked at is turned on based on the
target string contents, rather than being a property of the regex.
Also, a regex can have portions where locale matters, and portions where
it doesn't.

So I'd like to get more of a consensus before changing how things work.

Here's my current quandary. It used to be that if you had a string
encoded in some locale, say Latin7 -- Greek, and somehow, say through
calling a function in some module, the UTF-8 flag gets turned on.
Suddenly everything changes, your SMALL CHI becomes a DIVISION SIGN, for
example, without notice, even though the underlying code point hasn't
budged.

A UTF-8 encoded string was not considered locale, and upper or
lowercasing such a string never tainted, even if the string was
originally derived from locale rules, hence taintable.

I mostly solved this problem by partitioning the code point space into
two ranges​: 0-255 and 256-infinity. Hence now, under locale rules, the
code points 0-255 are considered to be what the current locale says they
should be, regardless of the string's UTF8ness. The code points 256 and
up represent what Unicode says they should represent. (The POSIX locale
system does not have names for code points, so Perl doesn't know that
your SMALL CHI is that character, but it does know from the libc
functions that it is a lowercase letter, and what the code point of its
uppercase version is, etc.) This solution isn't perfect, but if your
application is well-behaved and isn't mixing Unicode and non-Unicode
characters, it does completely solve the issue of your strings utf8 flag
unwittingly getting set.

The reason I bring this up is the way I made tainting work on these
strings. When operating on a string, the result is tainted if and only
if it contains a code point in the 0-255 range. And that is
content-related. Previously no UTF-8 encoded string was checked, and
that is also effectively content related.

So now, I'm working on extending Perl to work with UTF-8 locales, and as
long as it thinks the locale is UTF-8, it won't actually look at the
locale's rules, which could therefore be malicious or just plain screwed
up. Instead, it uses the known Unicode rules built into Perl, so one
could argue that it shouldn't taint. On the other hand, it should taint
in any other locale type, so tainting or not becomes dependent on the
underlying locale. (This is really no different than the situation I
inherited where tainting or not was dependent on the utf8 flag). Or one
could say that tainting should happen for every locale, for consistency
and for developers to have tests that are not locale-dependent.

Opinions?

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jan 16, 2014

From zefram@fysh.org

Karl Williamson wrote​:

                                     Hence now\, under locale

rules, the code points 0-255 are considered to be what the current
locale says they should be, regardless of the string's UTF8ness. The
code points 256 and up represent what Unicode says they should
represent.

This sounds broken​: in your example Latin-7 locale you now have two
representations of U+3c7 Greek small letter chi (at 0xf7 and 0x3c7)
and no representations of U+f7 division sign. I think we should have a
big notice in the documentation to the effect that non-UTF-8 locales and
Unicode don't mix. Possibly more sensible behaviour for this situation
would be that 0x0 to 0xff get locale behaviour and 0x100 upwards get
null behaviour (don't match any properties, case convert to self, as if
unassigned by Unicode).

                                              On the other hand\,

it should taint in any other locale type, so tainting or not becomes
dependent on the underlying locale.

What determined the choice of locale? An environment variable? Taint!

Fleshing out a bit​: logically the choice of locale may be tainted.
Normally the choice will come from an environment variable, which
is itself tainted, so the locale selection is tainted. Then any
decision that depends on the choice of locale, including whether to
use Unicode character semantics based on whether it's a UTF-8 locale,
is itself tainted. So in this system, with a tainted locale every
string operation that depends on character properties introduces taint,
even if the properties themselves are clean. But in theory a program
could set the locale environment variables itself from a clean source,
get an untainted choice of locale, and then use character properties
without tainting otherwise-clean data.

This would lead to a lot of tainting. You might think it looks like too
much tainting. But any reduction, such as treating Unicode character
properties used due to a UTF-8 locale as untainted, amounts to adding
an explicit exception to the taint rules. Perl has a small number of
well-defined exceptions to tainting, and increasing the set sounds like a
recipe for security problems. Programmers can always explicitly untaint
the locale choice if they want that behaviour, so the downside of being
surprised by broader tainting is limited.

            Or one could say that tainting should happen for

every locale, for consistency and for developers to have tests that
are not locale-dependent.

This is effectively the behaviour that falls out of my analysis.
Locale-dependent character properties are tainted (if the choice of
locale is tainted) regardless of which properties those actually are.

-zefram

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jan 16, 2014

From perl5-porters@perl.org

Zefram wrote​:

This sounds broken​: in your example Latin-7 locale you now have two
representations of U+3c7 Greek small letter chi (at 0xf7 and 0x3c7)
and no representations of U+f7 division sign.

But that is the only sane model that sufficiently preserves backward
compatibility.

I think we should have a
big notice in the documentation to the effect that non-UTF-8 locales and
Unicode don't mix.

Agreed. But we cannot expect people to prevent strings from being
upgraded, since perl does that transparently. (E.g., put your locale
and Unicode string in the same data store, and then extract them.
Flags could flip either way depending on how the data are stored.)

Possibly more sensible behaviour for this situation
would be that 0x0 to 0xff get locale behaviour and 0x100 upwards get
null behaviour (don't match any properties, case convert to self, as if
unassigned by Unicode).

That would be something completely new, which would likely break
existing programs.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jan 16, 2014

From @tsee

On 01/16/2014 03​:27 PM, Father Chrysostomos wrote​:

Zefram wrote​:

This sounds broken​: in your example Latin-7 locale you now have two
representations of U+3c7 Greek small letter chi (at 0xf7 and 0x3c7)
and no representations of U+f7 division sign.

But that is the only sane model that sufficiently preserves backward
compatibility.

Agreed as far as I can tell. There's a sane model that doesn't preserve
backward compatibility, though. Lose taint support.

--Steffen

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jan 16, 2014

From @Abigail

On Wed, Dec 04, 2013 at 05​:14​:18PM +0000, Dave Mitchell wrote​:

On Tue, Dec 03, 2013 at 11​:54​:15AM -0800, Martin.vGagern@​gmx.net wrote​:

When "use locale" is in effect, then the regular expression match
m/^(.*)[._](.*?)$/ taints its first capture group even if the input is not
tainted. The perllocale manual mentions locale-dependent regular
expressions which taint their capture groups, but the way I read that
documentation, only regular expressions using one of \w, \W, \s or \S should
be affected. So either the above regular expression should not taint the
group, or the documentation is incomplete and should be more precise.

Here is a small self-contained complete example​:

#!/usr/bin/perl -T
use strict;
use warnings;
use locale;
use Scalar​::Util qw(tainted);
my $var = "foo.bar_baz";
$var =~ m/^(.*)[._](.*?)$/;
print(tainted($1) ? "tainted\n" : "untainted\n");

It's the docs that were unclear, which I've amened with commit

9fc477bf4a1f21e94d5dfe8e99d8a93308d5388c

Locale info taints the whole regex, not just individual captures, since it
may affect other captures. As a trivial example, in

/\(\\w\+\)\(\.\)/

$2 needs to be tainted, since the tainted \w+ may eat fewer or more
characters than you expect, altering what $2 captures.

In fact, whether the expression matches or fails may depend on the locale.

And hence, there's something to be said that despite it not having
capturing parenthesis, /\w/ should taint $1 and friends. After all, an
unexpected locale may cause the pattern to fail, leaving $1 to whatever
it was, instead of setting it to undefined.

Abigail

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jan 22, 2014

From @khwilliamson

On 01/16/2014 11​:27 AM, Abigail wrote​:

On Wed, Dec 04, 2013 at 05​:14​:18PM +0000, Dave Mitchell wrote​:

On Tue, Dec 03, 2013 at 11​:54​:15AM -0800, Martin.vGagern@​gmx.net wrote​:

When "use locale" is in effect, then the regular expression match
m/^(.*)[._](.*?)$/ taints its first capture group even if the input is not
tainted. The perllocale manual mentions locale-dependent regular
expressions which taint their capture groups, but the way I read that
documentation, only regular expressions using one of \w, \W, \s or \S should
be affected. So either the above regular expression should not taint the
group, or the documentation is incomplete and should be more precise.

Here is a small self-contained complete example​:

#!/usr/bin/perl -T
use strict;
use warnings;
use locale;
use Scalar​::Util qw(tainted);
my $var = "foo.bar_baz";
$var =~ m/^(.*)[._](.*?)$/;
print(tainted($1) ? "tainted\n" : "untainted\n");

It's the docs that were unclear, which I've amened with commit

 9fc477bf4a1f21e94d5dfe8e99d8a93308d5388c

Locale info taints the whole regex, not just individual captures, since it
may affect other captures. As a trivial example, in

 /\(\\w\+\)\(\.\)/

$2 needs to be tainted, since the tainted \w+ may eat fewer or more
characters than you expect, altering what $2 captures.

In fact, whether the expression matches or fails may depend on the locale.

And hence, there's something to be said that despite it not having
capturing parenthesis, /\w/ should taint $1 and friends. After all, an
unexpected locale may cause the pattern to fail, leaving $1 to whatever
it was, instead of setting it to undefined.

Abigail

In looking at this in more detail, I see more issues. I think the best
way to do things, is to follow something like​:

"If there exists input to an operation which would have locale-dependent
results, that operation taints regardless of the current input"

But as I said before, I haven't coded things that way, because the way
it worked often was based on the particular input. In looking at what
it would take to get the casing operations to change to this rule, I ran
across another example. We don't change if we try to change the case of
an empty string. But the rule I just stated would indicate we should.

Also, we don't taint TRUE/FALSE results. I don't know the logic behind
that decision, except probably implementation issues, both in the core
and the perl program.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jan 22, 2014

From perl5-porters@perl.org

Karl Williamson wrote​:

Also, we don't taint TRUE/FALSE results. I don't know the logic behind
that decision, except probably implementation issues, both in the core
and the perl program.

AFAIK, Perl has never had tainted booleans. (I have some code that
assumes that.)

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jan 25, 2014

From @rjbs

* Karl Williamson <public@​khwilliamson.com> [2014-01-22T15​:05​:57]

In looking at this in more detail, I see more issues. I think the
best way to do things, is to follow something like​:

"If there exists input to an operation which would have
locale-dependent results, that operation taints regardless of the
current input"

I think I agree with you, but want to clarify.

Also, I should state that I'm a not a heavy user of locales *or* taint mode, so
I'm open to having my underlying assumptions corrected!

This means, in part, that this only happens in the scope of the locale pragma.
That is, because the operation can't rely on locale without that pragma in
effect, it can't taint. (I *think* that the /l modifier would also make this
condition true, even outside `use locale`, right?)

In other words, if you use locale, you're likely to start seeing a lot more
taint. Then again, taint mode looks for conditions resulting from user input,
not any ol' thing. If I enable locales and set the local by hand with a
constant string, then the locale should not introduce taint, should it?

Perhaps the best paradigm for this is that the locale itself can be tainted​: if
our locale comes from ENV, it is tainted. If it's set, with setlocale, *with a
tainted parameter*, it is tainted. Then, when the locale setting can affect
the expression, the result is tainted.

This is my intuition on the whole thing.

--
rjbs

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jan 25, 2014

From @khwilliamson

On 01/25/2014 12​:46 PM, Ricardo Signes wrote​:

* Karl Williamson <public@​khwilliamson.com> [2014-01-22T15​:05​:57]

In looking at this in more detail, I see more issues. I think the
best way to do things, is to follow something like​:

"If there exists input to an operation which would have
locale-dependent results, that operation taints regardless of the
current input"

I think I agree with you, but want to clarify.

Also, I should state that I'm a not a heavy user of locales *or* taint mode, so
I'm open to having my underlying assumptions corrected!

This means, in part, that this only happens in the scope of the locale pragma.
That is, because the operation can't rely on locale without that pragma in
effect, it can't taint. (I *think* that the /l modifier would also make this
condition true, even outside `use locale`, right?)

Yes.

In other words, if you use locale, you're likely to start seeing a lot more
taint. Then again, taint mode looks for conditions resulting from user input,
not any ol' thing. If I enable locales and set the local by hand with a
constant string, then the locale should not introduce taint, should it?

Perhaps the best paradigm for this is that the locale itself can be tainted​: if
our locale comes from ENV, it is tainted. If it's set, with setlocale, *with a
tainted parameter*, it is tainted. Then, when the locale setting can affect
the expression, the result is tainted.

This is my intuition on the whole thing.

rjbs, and I clarified things on irc. This quote added in 1998 from
perllocale gives the rationale

"Locales--particularly on systems that allow unprivileged users to build
their own locales--are untrustworthy. A malicious (or just plain broken)
locale can make a locale-aware application give unexpected results."

The bottom line is we are moving to the policy that tainting is based on
the operation and being in locale, without regard to the particular
operand's contents passed this time to the operation. This means
simpler core code and more consistent tainting results. And it lessens
the likelihood that there are paths in the core that should taint but don't

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jan 25, 2014

From @ap

* Father Chrysostomos <sprout@​cpan.org> [2014-01-16 15​:30]​:

Zefram wrote​:

This sounds broken​: in your example Latin-7 locale you now have two
representations of U+3c7 Greek small letter chi (at 0xf7 and 0x3c7)
and no representations of U+f7 division sign.

But that is the only sane model that sufficiently preserves backward
compatibility.

So far, I agree.

I think we should have a big notice in the documentation to the
effect that non-UTF-8 locales and Unicode don't mix.

Agreed. But we cannot expect people to prevent strings from being
upgraded, since perl does that transparently. (E.g., put your locale
and Unicode string in the same data store, and then extract them.
Flags could flip either way depending on how the data are stored.)

No, we cannot expect people to prevent strings from being upgraded. But
*can* expect them not to perform locale-based processing on character
strings (rather than octet strings). Now we cannot know what strings are
octet strings, we do have at least one class of strings that cannot
possibly be octet strings​: those that contain elements > 0xFF. So when
a locale-based regexp sees one of those, something is unambiguously
broken.

Possibly more sensible behaviour for this situation would be that
0x0 to 0xff get locale behaviour and 0x100 upwards get null
behaviour (don't match any properties, case convert to self, as if
unassigned by Unicode).

That would be something completely new, which would likely break
existing programs.

I agree this isn’t feasible. But maybe we can follow the precedent of
other octet-based functions​: have locale-based regexps throw a “wide
character” warning any time they encounter one.

Regards,
--
Aristotle Pagaltzis // <http​://plasmasturm.org/>

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jan 27, 2014

From @khwilliamson

On 01/25/2014 04​:06 PM, Aristotle Pagaltzis wrote​:

* Father Chrysostomos <sprout@​cpan.org> [2014-01-16 15​:30]​:

Zefram wrote​:

This sounds broken​: in your example Latin-7 locale you now have two
representations of U+3c7 Greek small letter chi (at 0xf7 and 0x3c7)
and no representations of U+f7 division sign.

But that is the only sane model that sufficiently preserves backward
compatibility.

So far, I agree.

I think we should have a big notice in the documentation to the
effect that non-UTF-8 locales and Unicode don't mix.

Agreed. But we cannot expect people to prevent strings from being
upgraded, since perl does that transparently. (E.g., put your locale
and Unicode string in the same data store, and then extract them.
Flags could flip either way depending on how the data are stored.)

No, we cannot expect people to prevent strings from being upgraded. But
*can* expect them not to perform locale-based processing on character
strings (rather than octet strings). Now we cannot know what strings are
octet strings, we do have at least one class of strings that cannot
possibly be octet strings​: those that contain elements > 0xFF. So when
a locale-based regexp sees one of those, something is unambiguously
broken.

Possibly more sensible behaviour for this situation would be that
0x0 to 0xff get locale behaviour and 0x100 upwards get null
behaviour (don't match any properties, case convert to self, as if
unassigned by Unicode).

That would be something completely new, which would likely break
existing programs.

I agree this isn’t feasible. But maybe we can follow the precedent of
other octet-based functions​: have locale-based regexps throw a “wide
character” warning any time they encounter one.

That seems reasonable to me.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jan 27, 2014

From @demerphq

On 16 January 2014 22​:27, Father Chrysostomos <sprout@​cpan.org> wrote​:

Agreed. But we cannot expect people to prevent strings from being
upgraded, since perl does that transparently.

FWIW Figuring out a way to mark a string as not being upgradable, and
causing Perl to die if one tries is on my todo list right now.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jan 27, 2014

From zefram@fysh.org

demerphq wrote​:

FWIW Figuring out a way to mark a string as not being upgradable,

Sounds like a bad idea. What exactly do you mean by "a string" here?
Is this a status that would be preserved across copying such as scalar
assignment? If so, adding any new hidden flag is a bad idea, but one
that prevents using what is otherwise always a legitimate representation
of a string is especially bad.

-zefram

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 29, 2014

From @khwilliamson

On 01/27/2014 11​:30 AM, Karl Williamson wrote​:

On 01/25/2014 04​:06 PM, Aristotle Pagaltzis wrote​:

* Father Chrysostomos <sprout@​cpan.org> [2014-01-16 15​:30]​:

Zefram wrote​:

This sounds broken​: in your example Latin-7 locale you now have two
representations of U+3c7 Greek small letter chi (at 0xf7 and 0x3c7)
and no representations of U+f7 division sign.

But that is the only sane model that sufficiently preserves backward
compatibility.

So far, I agree.

I think we should have a big notice in the documentation to the
effect that non-UTF-8 locales and Unicode don't mix.

Agreed. But we cannot expect people to prevent strings from being
upgraded, since perl does that transparently. (E.g., put your locale
and Unicode string in the same data store, and then extract them.
Flags could flip either way depending on how the data are stored.)

No, we cannot expect people to prevent strings from being upgraded. But
*can* expect them not to perform locale-based processing on character
strings (rather than octet strings). Now we cannot know what strings are
octet strings, we do have at least one class of strings that cannot
possibly be octet strings​: those that contain elements > 0xFF. So when
a locale-based regexp sees one of those, something is unambiguously
broken.

Possibly more sensible behaviour for this situation would be that
0x0 to 0xff get locale behaviour and 0x100 upwards get null
behaviour (don't match any properties, case convert to self, as if
unassigned by Unicode).

That would be something completely new, which would likely break
existing programs.

I agree this isn’t feasible. But maybe we can follow the precedent of
other octet-based functions​: have locale-based regexps throw a “wide
character” warning any time they encounter one.

That seems reasonable to me.

Now done in 613abc6

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 29, 2014

From @iabyn

On Mon, Dec 29, 2014 at 01​:53​:52PM -0700, Karl Williamson wrote​:

Now done in 613abc6

Karl, that commit's failing very badly on my system on debugging builds.
Several test are dying with assertion failures. I can reduce charset.t
to this code​:

  use POSIX 'locale_h';

  my $l = 'zh_HK.big5hkscs';
  setlocale(&POSIX​::LC_CTYPE(), $l) or die;
  use locale;
  no warnings 'locale'; # We may be trying out a weird locale

  setlocale(&POSIX​::LC_CTYPE(), $l) or die;
  my $c = chr utf8​::unicode_to_native(0xdf);
  CORE​::fc($c);

which dies like this​:

  perl​: util.c​:1900​: Perl_ck_warner​: Assertion `pat' failed.

Its dying while doing the fc().

pp_fc() does this​:

  _CHECK_AND_WARN_PROBLEMATIC_LOCALE;

which is calling Perl_ck_warner() with a null pat. This macro is defined as​:

# define _CHECK_AND_WARN_PROBLEMATIC_LOCALE \
  STMT_START { \
  if (PL_warn_locale) { \
  /*GCC_DIAG_IGNORE(-Wformat-security); Didn't work */ \
  Perl_ck_warner(aTHX_ packWARN(WARN_LOCALE), \
  SvPVX(PL_warn_locale), \
  0 /* dummy to avoid comp warning */ ); \
  /* GCC_DIAG_RESTORE; */ \
  SvREFCNT_dec_NN(PL_warn_locale); \
  PL_warn_locale = NULL; \
  } \
  } STMT_END

and at this point, PL_warn_locale looks like​:

SV = PVNV(0xaa2ee8) at 0xac42e0
  REFCNT = 1
  FLAGS = (TEMP,IOK,NOK,pIOK,pNOK)
  IV = 223
  NV = 223
  PV = 0

so SvPVX(PL_warn_locale) is null.

I don't understand the locale system well enough for the fix to be obvious
to me.

--
Any [programming] language that doesn't occasionally surprise the
novice will pay for it by continually surprising the expert.
  -- Larry Wall

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 29, 2014

From @khwilliamson

On 12/29/2014 03​:15 PM, Dave Mitchell wrote​:

On Mon, Dec 29, 2014 at 01​:53​:52PM -0700, Karl Williamson wrote​:

Now done in 613abc6

Karl, that commit's failing very badly on my system on debugging builds.
Several test are dying with assertion failures. I can reduce charset.t
to this code​:

 use POSIX 'locale\_h';

 my $l = 'zh\_HK\.big5hkscs';
 setlocale\(&POSIX&#8203;::LC\_CTYPE\(\)\, $l\) or die;
 use locale;
 no warnings 'locale'; \# We may be trying out a weird locale

 setlocale\(&POSIX&#8203;::LC\_CTYPE\(\)\, $l\) or die;
 my $c = chr utf8&#8203;::unicode\_to\_native\(0xdf\);
 CORE&#8203;::fc\($c\);

which dies like this​:

 perl&#8203;: util\.c&#8203;:1900&#8203;: Perl\_ck\_warner&#8203;: Assertion \`pat' failed\.

Its dying while doing the fc().

pp_fc() does this​:

         \_CHECK\_AND\_WARN\_PROBLEMATIC\_LOCALE;

which is calling Perl_ck_warner() with a null pat. This macro is defined as​:

# define _CHECK_AND_WARN_PROBLEMATIC_LOCALE \
STMT_START { \
if (PL_warn_locale) { \
/*GCC_DIAG_IGNORE(-Wformat-security); Didn't work */ \
Perl_ck_warner(aTHX_ packWARN(WARN_LOCALE), \
SvPVX(PL_warn_locale), \
0 /* dummy to avoid comp warning */ ); \
/* GCC_DIAG_RESTORE; */ \
SvREFCNT_dec_NN(PL_warn_locale); \
PL_warn_locale = NULL; \
} \
} STMT_END

and at this point, PL_warn_locale looks like​:

SV = PVNV(0xaa2ee8) at 0xac42e0
REFCNT = 1
FLAGS = (TEMP,IOK,NOK,pIOK,pNOK)
IV = 223
NV = 223
PV = 0

so SvPVX(PL_warn_locale) is null.

I don't understand the locale system well enough for the fix to be obvious
to me.

I'll look into it. I can reproduce it on dromedary. I note that
smoke-me did not find any problem.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Dec 30, 2014

From @khwilliamson

On 12/29/2014 03​:15 PM, Dave Mitchell wrote​:

On Mon, Dec 29, 2014 at 01​:53​:52PM -0700, Karl Williamson wrote​:

Now done in 613abc6

Karl, that commit's failing very badly on my system on debugging builds.
Several test are dying with assertion failures. I can reduce charset.t
to this code​:

 use POSIX 'locale\_h';

 my $l = 'zh\_HK\.big5hkscs';
 setlocale\(&POSIX&#8203;::LC\_CTYPE\(\)\, $l\) or die;
 use locale;
 no warnings 'locale'; \# We may be trying out a weird locale

 setlocale\(&POSIX&#8203;::LC\_CTYPE\(\)\, $l\) or die;
 my $c = chr utf8&#8203;::unicode\_to\_native\(0xdf\);
 CORE&#8203;::fc\($c\);

which dies like this​:

 perl&#8203;: util\.c&#8203;:1900&#8203;: Perl\_ck\_warner&#8203;: Assertion \`pat' failed\.

Its dying while doing the fc().

pp_fc() does this​:

         \_CHECK\_AND\_WARN\_PROBLEMATIC\_LOCALE;

which is calling Perl_ck_warner() with a null pat. This macro is defined as​:

# define _CHECK_AND_WARN_PROBLEMATIC_LOCALE \
STMT_START { \
if (PL_warn_locale) { \
/*GCC_DIAG_IGNORE(-Wformat-security); Didn't work */ \
Perl_ck_warner(aTHX_ packWARN(WARN_LOCALE), \
SvPVX(PL_warn_locale), \
0 /* dummy to avoid comp warning */ ); \
/* GCC_DIAG_RESTORE; */ \
SvREFCNT_dec_NN(PL_warn_locale); \
PL_warn_locale = NULL; \
} \
} STMT_END

and at this point, PL_warn_locale looks like​:

SV = PVNV(0xaa2ee8) at 0xac42e0
REFCNT = 1
FLAGS = (TEMP,IOK,NOK,pIOK,pNOK)
IV = 223
NV = 223
PV = 0

so SvPVX(PL_warn_locale) is null.

I don't understand the locale system well enough for the fix to be obvious
to me.

This should now be fixed by 215c513.

@p5pRT p5pRT added the documentation label Oct 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant
You can’t perform that action at this time.