Taint mode still breaks utf8 handling #8114

Closed
p5pRT opened this issue Sep 15, 2005 · 6 comments

@p5pRT p5pRT commented Sep 15, 2005

Migrated from rt.perl.org#37170 (status was 'resolved')

Searchable as RT37170$

@p5pRT p5pRT commented Sep 15, 2005

From christian@pflanze.mine.nu

Created by christian.jaeger@ethlife.ethz.ch

I'm in the process of "porting" a perl web app (fastcgi, running with
-T flag) from perl 5.005_03 to current releases.

I first had problems with 5.8.4: when I read in a block of data using
read, roughly like this:

  use Encode;
  open F, "some/file/containing_utf8_text" or die $!;
  my $buf;
  read F, $buf, 1000 or die $!;
  my $str = Encode::decode_utf8($buf);

I got a $str that still had the UTF-8 byte sequences as individual characters

(and
  print "utf8?: ", Encode::is_utf8($str) ? "yes" : "no", "\n";
gave "no", IIRC)

(I'm actually using my own wrappers around open and read, so I didn't
test the exact code as above).

I narrowed the problem down to the use of the -T flag. I found that any
one of the following makes the decoding work correctly:

- switching off tainting mode
- detainting $buf before decoding it, like:
  $buf =~ /(.*)/s or die;
  my $str = Encode::decode_utf8($1);
- upgrading to perl 5.8.7 (5.8.7-3 from Debian testing)
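
The second workaround in the list above can be wrapped in a small helper. This is only a sketch: `decode_untainted` is a hypothetical name, not something from the report or from Encode, and it launders the whole buffer unconditionally, which gives up the protection taint mode is meant to provide. Real code should match only the input it actually expects.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of the original report): copy the buffer
# through a regex capture, which yields an untainted value, then decode it.
sub decode_untainted {
    my ($buf) = @_;
    # /(.*)/s matches any input, including newlines and the empty string.
    my ($clean) = $buf =~ /\A(.*)\z/s
        or die "unexpected match failure";
    return Encode::decode_utf8($clean);
}

my $bytes = "Z\xC3\xBCrich";    # UTF-8 encoded bytes, 7 of them
my $str   = decode_untainted($bytes);
printf "%d bytes in, %d characters out\n", length($bytes), length($str);
# prints "7 bytes in, 6 characters out"
```

Note that captures stay tainted while `use re 'taint'` is in effect, so the helper assumes that pragma is not active.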

"Fine, it has been fixed" I thought.

But now I've realized that something else still doesn't work under taint
mode. Sorry that I'm a bit vague below; I'm under pressure to finish
the project. Please contact me if you need more information. For now
I'm simply turning off taint mode.

(What I'm doing is writing a list of strings to one file, first
writing the length of each, so that I know how to split the file
contents into the strings again when reading them back in:

  my $d= [ list of strings or string refs ];
  my $f= ... filehandle to new output file, blessed to a class which has an xprint method.

  my @is_utf8;
  for (@$d) {
      my $rft;
      my $is_utf8;
      #
      if (defined($rft=Scalar::Util::reftype($_)) and $rft eq "SCALAR") {
          $is_utf8= Encode::is_utf8($$_);
          Eile->log("reference ".($is_utf8 ? "is" : "is not")." utf8");
          Encode::_utf8_off($$_) if $is_utf8;
          $f->xprint(pack('l',length($$_)),
                     ($is_utf8 ? "1" : "0")
                    );
      } else {
          $is_utf8= Encode::is_utf8($_);
          Eile->log("string ".($is_utf8 ? "is" : "is not")." utf8");
          Encode::_utf8_off($_) if $is_utf8;
          $f->xprint(pack('l',length($_)),
                     ($is_utf8 ? "1" : "0")
                    );
      }
      push @is_utf8,$is_utf8;
  }
  $f->xprint(pack('l',-1),"|"); # "|" is chosen arbitrarily; it's not used anywhere.
  for (@$d) {
      my $is_utf8= shift @is_utf8;
      my $rft;
      if (defined($rft=Scalar::Util::reftype($_)) and $rft eq "SCALAR") {
          $f->xprint($$_);
          Encode::_utf8_on($$_) if $is_utf8;
      } else {
          $f->xprint($_);
          Encode::_utf8_on($_) if $is_utf8;
      }
  }

)

The problem is that sometimes Encode::is_utf8 reports false on a
string, even when I know it must contain unicode characters:

- the file being written to disk *does* contain utf8 sequences.
- the flag being written to disk is false. (Encode::is_utf8 gave false)
- the length being written into the header is too short (which
  means that the length builtin reported the length in unicode code
  points, not bytes -- how can this be if Encode::is_utf8 is false?).
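
To make the last point concrete, here is a small sketch of how length() depends on whether a string holds characters or encoded bytes. The sample string is made up for illustration; it is not data from the application.

```perl
use strict;
use warnings;
use Encode ();

my $chars = "\x{20AC}uro";                  # 4 characters, the first is U+20AC
my $bytes = Encode::encode_utf8($chars);    # 6 bytes: 3 for U+20AC plus 3 ASCII

printf "characters: %d, flag: %s\n",
    length($chars), utf8::is_utf8($chars) ? "on" : "off";
printf "bytes:      %d, flag: %s\n",
    length($bytes), utf8::is_utf8($bytes) ? "on" : "off";
# prints:
#   characters: 4, flag: on
#   bytes:      6, flag: off
```

A header that records length($chars) while the file receives the six encoded bytes would be too short, which matches the symptom described in the list above.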

As I said, switching off taint mode again seems to make it work fine.
(The strings being written above came from LWP (from HTTP GET
requests) -- maybe they were tainted for this reason.)

Thanks for your work,
Christian.

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl v5.8.7:

Configured by Debian Project at Thu Jun  9 00:28:22 EST 2005.

Summary of my perl5 (revision 5 version 8 subversion 7) configuration:
  Platform:
    osname=linux, osvers=2.4.27-ti1211, archname=i386-linux-thread-multi
    uname='linux kosh 2.4.27-ti1211 #1 sun sep 19 18:17:45 est 2004 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i386-linux -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.7 -Dsitearch=/usr/local/lib/perl/5.8.7 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.7 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='3.3.6 (Debian 1:3.3.6-6)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.3.2.so, so=so, useshrplib=true, libperl=libperl.so.5.8.7
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    


@INC for perl v5.8.7:
    /etc/perl
    /usr/local/lib/perl/5.8.7
    /usr/local/share/perl/5.8.7
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.8
    /usr/share/perl/5.8
    /usr/local/lib/site_perl
    /usr/local/lib/perl/5.8.4
    /usr/local/share/perl/5.8.4
    /usr/local/lib/perl/5.8.3
    /usr/local/share/perl/5.8.3
    .


Environment for perl v5.8.7:
    HOME=/home/chris
    LANG=de_CH
    LANGUAGE (unset)
    LC_CTYPE=de_CH
    LC_NUMERIC=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/Gambit-C/bin:/opt/j2sdk_nb/j2sdk1.4.2/bin/:/home/chris/local/bin:/home/chris/bin:/root/local/bin:/root/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/bin/X11:/usr/local/sbin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT p5pRT commented Sep 15, 2005

From dankogai@dan.co.jp

Christian and Porters,

Thanks for your report.

On Sep 15, 2005, at 18:53, Christian Jaeger (via RT) wrote:

> - the file being written to disk *does* contain utf8 sequences.
> - the flag being written to disk is false. (Encode::is_utf8 gave
>   false)
> - the length being written into the header is too short (which
>   means that the length builtin reported the length in unicode code
>   points, not bytes -- how can this be if Encode::is_utf8 is false?).

I could not reproduce the symptom on perl 5.8.7, but I could on 5.8.6.

#
use strict;
use Encode;
my $fn = 'test.txt';
sub readwrite{
  my $str = shift;
  open my $fh, ">:utf8", $fn or die "$fn : $!";
  print $fh $str;
  close $fh;
  open $fh, "<:raw", $fn or die "$fn : $!"; # reuse the lexical handle
  read $fh, my $buf, -s $fn;
  close $fh; unlink $fn;
  return $buf;
}
sub checkstr{
  my $str = shift;
  print "Encode::is_utf8(\$str) = ", Encode::is_utf8($str), "\n";
  print "utf8::is_utf8(\$str) = ", utf8::is_utf8($str), "\n";
}
my $ascii = join '', map { chr $_ } 0x20..0x7e; # only ascii
my $utf8 = join '', map { chr $_ } 0x2020..0x207e; # now Unicode;
checkstr(decode_utf8(readwrite $ascii));
checkstr(decode_utf8(readwrite $utf8));
__END__

Running the code gives the following (on my Mac OS X v10.4.2):

% /usr/bin/perl utf8flag.pl
Perl Version is 5.008006, Encode Version is 2.08
Encode::is_utf8($str) =
utf8::is_utf8($str) =
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1
% /usr/bin/perl -T utf8flag.pl
Perl Version is 5.008006, Encode Version is 2.08
Encode::is_utf8($str) =
utf8::is_utf8($str) =
Encode::is_utf8($str) =
utf8::is_utf8($str) = 1
% perl utf8flag.pl
Perl Version is 5.008007, Encode Version is 2.10
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1
% perl -T utf8flag.pl
Perl Version is 5.008007, Encode Version is 2.10
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1

As you see, on 5.8.6 utf8::is_utf8() works fine while Encode::is_utf8()
does not. Also note that on 5.8.7 the flag is set UNCONDITIONALLY,
whether the string contains U+100 and above or not.
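
For anyone who wants to repeat the comparison without temporary files, here is a minimal sketch. It assumes the usual idiom that `substr($^X, 0, 0)` yields a zero-length string that is tainted under -T; `flag_report` and `$taint` are hypothetical names introduced for illustration. The output depends on the perl and Encode versions, so none is shown.

```perl
use strict;
use warnings;
use Encode ();
use Scalar::Util qw(tainted);

# Zero-length string: tainted when perl runs with -T, ordinary otherwise.
my $taint = substr($^X, 0, 0);

# Hypothetical helper: decode (possibly tainted) octets and report what
# both flag checks say about the result.
sub flag_report {
    my ($label, $octets) = @_;
    my $str = Encode::decode_utf8($octets . $taint);
    return sprintf "%-8s tainted=%d Encode::is_utf8=%d utf8::is_utf8=%d",
        $label,
        tainted($str)         ? 1 : 0,
        Encode::is_utf8($str) ? 1 : 0,
        utf8::is_utf8($str)   ? 1 : 0;
}

print "taint mode: ", (${^TAINT} ? "on" : "off"), "\n";
print flag_report('ascii', join '', map { chr } 0x20 .. 0x7e), "\n";
print flag_report('unicode',
    Encode::encode_utf8(join '', map { chr } 0x2020 .. 0x207e)), "\n";
```

Run it once as `perl script.pl` and once as `perl -T script.pl`; a disagreement between the two flag columns on the `unicode` line would correspond to the 5.8.6 behaviour shown above.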

/* universal.c */
XS(XS_utf8_is_utf8)
{
    dXSARGS;
    if (items != 1)
        Perl_croak(aTHX_ "Usage: utf8::is_utf8(sv)");
    {
        SV * sv = ST(0);
        {
            if (SvUTF8(sv))
                XSRETURN_YES;
            else
                XSRETURN_NO;
        }
    }
    XSRETURN_EMPTY;
}
/* end of code */

/* ext/Encode/Encode.xs */
bool
is_utf8(sv, check = 0)
SV * sv
int check
CODE:
{
    if (SvGMAGICAL(sv)) /* it could be $1, for example */
        sv = newSVsv(sv); /* GMAGIG will be done */
    if (SvPOK(sv)) {
        RETVAL = SvUTF8(sv) ? TRUE : FALSE;
        if (RETVAL &&
            check  &&
            !is_utf8_string((U8*)SvPVX(sv), SvCUR(sv)))
            RETVAL = FALSE;
    } else {
        RETVAL = FALSE;
    }
    if (sv != ST(0))
        SvREFCNT_dec(sv); /* it was a temp copy */
}
OUTPUT:
    RETVAL

/* end of code */

Though not harmful, the behavior of 5.8.7 does not match what the Encode
documentation says. Should I fix the pod accordingly, or did this just
reveal an undocumented bug?

Dan the Encode Maintainer

@p5pRT p5pRT commented Sep 15, 2005

The RT System itself - Status changed from 'new' to 'open'

@p5pRT p5pRT commented Sep 15, 2005

From christian@pflanze.mine.nu

Hello

Thanks for your reply.

At 5:29 -0700 on 15.09.2005, Dan Kogai via RT wrote:

> I could not duplicate the symptom on perl 5.8.7 but on 5.8.6 I did.
> ...
> you run the code as follows (on my Mac OS X v10.4.2);

With my perl 5.8.7 I'm getting:
chris@elvis-5 chris > perl ./bugreport-test1
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1
chris@elvis-5 chris > perl -T ./bugreport-test1
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1
Encode::is_utf8($str) = 1
utf8::is_utf8($str) = 1

(thus the same as you with that version)

> As you see, on 5.8.6 utf8::is_utf8() works fine while Encode::is_utf8()
> does not.

Interesting, I will try my app with -T again with utf8::is_utf8.

> Also note on 5.8.7 the flag is set UNCONDITIONALLY,
> whether the string contains U+100 and above or not.

Yes, but that's fine for me.

Your test case can't explain the second problem I mentioned: somehow I
had a case where, before writing to the file, a string (originating
from LWP) gave false from Encode::is_utf8 but still gave a shorter
length() than its byte length in the file it was then written to (so I
would have guessed that the utf8 flag was on).

One thing to note: I'm not opening files with ">:utf8" or "<:raw", but with
  sysopen($fh,$path, O_EXCL|O_CREAT|O_RDWR, $mode) for writing,
and
  open $fh,"<",$path for reading.

That's why I manually turn off the utf8 flag of strings that have it
(as shown in the code I pasted in my bug report) for the duration of
the write, and use decode_utf8 later. I think I can't use ">:utf8",
because not all strings I write have the utf8 flag on (some of the
strings are binary data).
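
One way to handle mixed text and binary strings without toggling the flag in place is to encode each text string to bytes explicitly and record a per-string marker. The sketch below works in memory and mirrors the header layout described in the bug report (pack('l', length), a "1"/"0" marker, a -1 terminator followed by "|"). `freeze_strings` and `thaw_strings` are hypothetical names, and this is an illustration under those assumptions, not the application's actual code.

```perl
use strict;
use warnings;
use Encode ();

# Build one byte string: headers, terminator, then all payloads.
sub freeze_strings {
    my @strings = @_;
    my ($header, $body) = ('', '');
    for my $s (@strings) {
        my $is_text = utf8::is_utf8($s) ? 1 : 0;
        # Text strings are encoded to UTF-8 bytes; byte strings pass through.
        my $octets = $is_text ? Encode::encode_utf8($s) : $s;
        $header .= pack('l', length $octets) . $is_text;   # byte length + marker
        $body   .= $octets;
    }
    return $header . pack('l', -1) . '|' . $body;
}

# Reverse of freeze_strings: read headers up to the -1 terminator,
# then slice the payloads and decode the ones marked as text.
sub thaw_strings {
    my ($blob) = @_;
    my (@meta, @out);
    my $pos = 0;
    while (1) {
        # Each header entry is 4 bytes of length plus 1 marker byte.
        my ($len, $flag) = unpack('l a', substr($blob, $pos, 5));
        $pos += 5;
        last if $len < 0;
        push @meta, [$len, $flag];
    }
    for my $m (@meta) {
        my ($len, $flag) = @$m;
        my $octets = substr($blob, $pos, $len);
        $pos += $len;
        push @out, $flag ? Encode::decode_utf8($octets) : $octets;
    }
    return @out;
}

my @in  = ("\x{263A} hi", "\xFF\x00bin");   # one text string, one binary string
my @out = thaw_strings(freeze_strings(@in));
print "round trip ", ("@in" eq "@out" ? "ok" : "FAILED"), "\n";
```

Because the header stores the length of the encoded bytes, it no longer depends on how length() treats a flagged string. A string that is text but happens to have the flag off is stored unchanged as bytes, so callers that care about that distinction would need to pass the text/binary marker explicitly.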

Christian.

@p5pRT p5pRT commented May 3, 2006

From Mark.Martinec@ijs.si

This looks like the same bug as reported in:

  #32687: Encode::is_utf8 on tainted UTF8 string returns false

...still unresolved in 5.8.8.

  Mark

@p5pRT p5pRT commented May 22, 2008

p5p@spam.wizbit.be - Status changed from 'open' to 'resolved'
