UTF-8 (strict) Encode and Decode detect only 1/66 non-characters #9259

Closed
p5pRT opened this issue Mar 20, 2008 · 8 comments

@p5pRT p5pRT commented Mar 20, 2008

Migrated from rt.perl.org#51918 (status was 'resolved')

Searchable as RT51918$

@p5pRT p5pRT commented Mar 20, 2008

From chris.hall@highwayman.com

Created by chris.hall@highwayman.com

Encode::encode('UTF-8', $foo) and Encode::decode('UTF-8', $bar) detect the
Unicode 'non-character' U+FFFF and treat it as an error.

There are 65 other Unicode non-characters:

  U+FFFE
  U+01FFFE, U+02FFFE, U+03FFFE, ... U+10FFFE
  U+01FFFF, U+02FFFF, U+03FFFF, ... U+10FFFF
  U+FDD0..U+FDEF

which one would expect to be treated the same as U+FFFF.

They aren't. They are accepted as normal characters.

This appears to be a bug.

It's the same under Perl 5.10.0.

(Alternatively, one could argue that detecting the U+FFFF non-character
is less than useful -- it is a perfectly good character, and has uses
internally. Perhaps Encode should have an option to allow
non-characters? Whichever way you cut it, all non-characters should be
handled the same way.)
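
For reference, the full set of 66 non-characters listed above can be checked with a few lines of plain Perl. This is only a sketch: `is_noncharacter` is a hypothetical helper name, not something provided by Encode or the utf8 pragma.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $cp is one of the 66 Unicode non-characters.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp < 0 || $cp > 0x10FFFF;        # outside the Unicode range
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;   # the 32 code points U+FDD0..U+FDEF
    return ($cp & 0xFFFE) == 0xFFFE ? 1 : 0;      # U+xxFFFE and U+xxFFFF in each of the 17 planes
}

# Count them by walking every code point.
my $count = 0;
for my $cp (0 .. 0x10FFFF) {
    $count++ if is_noncharacter($cp);
}
print "$count non-characters\n";
```

The count comes out as 32 + 2 * 17 = 66, i.e. U+FFFF plus the 65 others listed in this report.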

Perl Info

Flags:
     category=library
     severity=low

This perlbug was built using Perl v5.8.8 in the Red Hat build system.
It is being executed now by Perl v5.8.8 - Mon Nov 26 14:25:50 EST 2007.

Site configuration information for perl v5.8.8:

Configured by Red Hat, Inc. at Mon Nov 26 14:25:50 EST 2007.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
   Platform:
     osname=linux, osvers=2.6.20-1.3001.fc6xen, archname=x86_64-linux-thread-multi
     uname='linux xenbuilder4.fedora.phx.redhat.com 2.6.20-1.3001.fc6xen #1 smp thu aug 9 16:18:42 edt 2007 x86_64 x86_64 x86_64 gnulinux '
     config_args='-des -Doptimize=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -Dversion=5.8.8 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Dprivlib=/usr/lib/perl5/5.8.8 -Dsitelib=/usr/lib/perl5/site_perl/5.8.8 -Dvendorlib=/usr/lib/perl5/vendor_perl/5.8.8 -Darchlib=/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi -Dsitearch=/usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi -Dvendorarch=/usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi -Darchname=x86_64-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dinc_version_list=5.8.7 5.8.6 5.8.5 -Dscriptdir=/usr/bin'
     hint=recommended, useposix=true, d_sigaction=define
     usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
     useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
     use64bitint=define use64bitall=define uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
     optimize='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic',
     cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -I/usr/include/gdbm'
     ccversion='', gccversion='4.1.2 20070925 (Red Hat 4.1.2-33)', gccosandvers=''
     intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
     ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
     alignbytes=8, prototype=define
   Linker and Libraries:
     ld='gcc', ldflags =''
     libpth=/usr/local/lib64 /lib64 /usr/lib64
     libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
     perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
     libc=, so=so, useshrplib=true, libperl=libperl.so
     gnulibc_version='2.7'
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/CORE'
     cccdlflags='-fPIC', lddlflags='-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic'

Locally applied patches:



@INC for perl v5.8.8:
     /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi
     /usr/lib64/perl5/site_perl/5.8.7/x86_64-linux-thread-multi
     /usr/lib64/perl5/site_perl/5.8.6/x86_64-linux-thread-multi
     /usr/lib64/perl5/site_perl/5.8.5/x86_64-linux-thread-multi
     /usr/lib/perl5/site_perl/5.8.8
     /usr/lib/perl5/site_perl/5.8.7
     /usr/lib/perl5/site_perl/5.8.6
     /usr/lib/perl5/site_perl/5.8.5
     /usr/lib/perl5/site_perl
     /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi
     /usr/lib64/perl5/vendor_perl/5.8.7/x86_64-linux-thread-multi
     /usr/lib64/perl5/vendor_perl/5.8.6/x86_64-linux-thread-multi
     /usr/lib64/perl5/vendor_perl/5.8.5/x86_64-linux-thread-multi
     /usr/lib/perl5/vendor_perl/5.8.8
     /usr/lib/perl5/vendor_perl/5.8.7
     /usr/lib/perl5/vendor_perl/5.8.6
     /usr/lib/perl5/vendor_perl/5.8.5
     /usr/lib/perl5/vendor_perl
     /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi
     /usr/lib/perl5/5.8.8
     .


Environment for perl v5.8.8:
     HOME=/home/GMCH
     LANG=en_GB.UTF-8
     LANGUAGE (unset)
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
     PERL_BADLANG (unset)
     SHELL=/bin/bash

-- 
Chris Hall               highwayman.com

@p5pRT p5pRT commented Mar 20, 2008

From jgmyers@proofpoint.com

This is related/duplicate to bugs 38722 and 43294. 43294 has a proposed
fix.

@p5pRT p5pRT commented Mar 20, 2008

The RT System itself - Status changed from 'new' to 'open'

@p5pRT p5pRT commented Mar 21, 2008

From chris.hall@highwayman.com

On Thu Mar 20 13:53:50 2008, jgmyers@proofpoint.com wrote:

> This is related/duplicate to bugs 38722 and 43294. 43294 has a proposed
> fix.

Related, except for the confusion between strict UTF-8 and more general
string handling.

My understanding is that the utf8::valid() and utf8::decode() functions
are related to Perl's internal character handling -- which happens to
be based on utf8. All I expect utf8::valid() to tell me is that Perl
is happy with a character string (which I may have finagled from
somewhere -- for example by fiddling with the utf8 status of the
string).

I agree there's a place for functions that are strict UTF-8. I don't
think that everything should be like that, though.

The bug I was reporting is, however, in the UTF-8 (strict) handling in
Encode.
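
To make the strict/lax distinction concrete: Encode resolves the two spellings to differently named encodings. A minimal check, assuming a stock Encode installation (the loop and variable names are only illustrative):

```perl
use strict;
use warnings;
use Encode ();

# 'UTF-8' is the strict encoding; 'utf8' is Perl's laxer internal flavour.
for my $label ('UTF-8', 'utf8') {
    my $enc = Encode::find_encoding($label);
    printf "%-5s resolves to %s\n", $label, $enc->name;
}
```

This only shows which encoding object each name selects. It does not by itself show which code points a given Encode version rejects.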
--
Chris Hall

@p5pRT p5pRT commented Mar 21, 2008

From jgmyers@proofpoint.com

My primary use of utf8::valid is to determine when it is necessary to
take the performance hit of firing up the Encode machinery to clean a
string obtained from an unreliable source:

use Encode ();  # Encode::decode below needs the module loaded

if (defined($out) && !utf8::valid($out)) {
  utf8::encode($out); # turn off utf-8 flag
  $out = Encode::decode('utf-8', $out); # replace invalid chars with U+FFFD
}

This requires utf8​::valid to do a strict check (as it does with my patch
for bug 43294).
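
The U+FFFD substitution mentioned in the comment above can be seen in isolation on raw bytes. A small sketch, assuming Encode's default CHECK behaviour (substitute rather than die); the variable names are only for illustration:

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "a\xFFb";                          # the byte 0xFF never occurs in well-formed UTF-8
my $text  = Encode::decode('UTF-8', $bytes);   # default CHECK: malformed input is substituted

# Show the resulting code points.
print join(' ', map { sprintf 'U+%04X', ord } split //, $text), "\n";
```

The malformed byte comes back as a single U+FFFD between the two ASCII characters.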

@p5pRT p5pRT commented Mar 23, 2008

From chris.hall@highwayman.com

On Fri, 21 Mar 2008 you wrote:

> My primary use of utf8::valid is to determine when it is necessary to
> take the performance hit of firing up the Encode machinery to clean a
> string obtained from an unreliable source:
>
> if (defined($out) && !utf8::valid($out)) {
>   utf8::encode($out); # turn off utf-8 flag
>   $out = Encode::decode('utf-8', $out); # replace invalid chars with U+FFFD
> }
>
> This requires utf8::valid to do a strict check (as it does with my patch
> for bug 43294).

Well, yes, for what you want that is what would be required.

The documentation says:

  $flag = utf8::valid(STRING)

  [INTERNAL] Test whether STRING is in a consistent state regarding
  UTF-8. Will return true is well-formed UTF-8 and has the UTF-8 flag on
  or if string is held as bytes (both these states are 'consistent').
  Main reason for this routine is to allow Perl's testsuite to check
  that operations have left strings in a consistent state.

which writes 'UTF-8' in capitals, a spelling that one understands from
the Encode documentation to mean 'strict' UTF-8.

So either the documentation is phouquée or the code is.

What you want is entirely reasonable.

I don't know what the performance issues are with Encode/Decode, but I
can see it is tempting to exploit the fact that Perl is using something
like UTF-8.

More generally I can see a rôle for a 'quick' scanner that might
identify strings that contain any or all of:

  1. broken sequences (probably including sequences starting 0xFE & FF)

  2. surrogates

  3. redundant sequences

  4. values > 0x10FFFF

  5. non-characters

  6. replacement characters

  7. private use characters

  8. unassigned characters

that is: a scan function that takes a second argument to indicate what
the application considers 'invalid'. Some applications might like to
filter for character blocks that are not locally supported.
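
A rough sketch of what such a scanner might look like, working on an already-decoded character string. Everything here is hypothetical: the `SCAN_*` flag names, `classify_code_point` and `scan_string` are made up for illustration and are not part of Encode or utf8. It covers only items 2 and 4 to 7 from the list above; broken and redundant sequences would need a byte-level scan, and unassigned characters would need the Unicode database.

```perl
use strict;
use warnings;

# Hypothetical flag bits, one per class of code point an application may reject.
use constant {
    SCAN_SURROGATE   => 0x01,
    SCAN_ABOVE_MAX   => 0x02,
    SCAN_NONCHAR     => 0x04,
    SCAN_REPLACEMENT => 0x08,
    SCAN_PRIVATE_USE => 0x10,
};

# Return the flag bit for the class $cp belongs to, or 0 for an ordinary code point.
sub classify_code_point {
    my ($cp) = @_;
    return SCAN_ABOVE_MAX   if $cp > 0x10FFFF;
    return SCAN_SURROGATE   if $cp >= 0xD800 && $cp <= 0xDFFF;
    return SCAN_NONCHAR     if ($cp >= 0xFDD0 && $cp <= 0xFDEF)
                            || ($cp & 0xFFFE) == 0xFFFE;
    return SCAN_REPLACEMENT if $cp == 0xFFFD;
    return SCAN_PRIVATE_USE if ($cp >= 0xE000   && $cp <= 0xF8FF)
                            || ($cp >= 0xF0000  && $cp <= 0xFFFFD)
                            || ($cp >= 0x100000 && $cp <= 0x10FFFD);
    return 0;
}

# Scan $string and return the subset of $reject flags that actually occur in it.
sub scan_string {
    my ($string, $reject) = @_;
    my $found = 0;
    for my $ch (split //, $string) {
        $found |= classify_code_point(ord $ch);
    }
    return $found & $reject;
}

# Usage: the second argument says what this application considers 'invalid'.
my $input = "ok" . chr(0xFFFF) . chr(0xFFFD);
printf "flags found: 0x%02X\n", scan_string($input, SCAN_NONCHAR | SCAN_REPLACEMENT);
```

A zero return means the string contains nothing the caller asked to reject, so the more expensive Encode pass could be skipped.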

Chris
--
Chris Hall highwayman.com

@p5pRT p5pRT commented Jan 10, 2011

From @khwilliamson

All 66 characters are now known to both Encode and Decode
--Karl Williamson

@p5pRT p5pRT commented Jan 10, 2011

@khwilliamson - Status changed from 'open' to 'resolved'
