strangeness with Unicode #2281

p5pRT · 2000-07-30T16:03:57Z

Migrated from rt.perl.org#3599 (status was 'resolved')

Searchable as RT3599$

p5pRT · 2000-07-30T16:03:57Z

From jfriedl@yahoo-inc.com

Created by jfriedl@yahoo-inc.com

This is another one where I hesitate to say it's a bug, since this is my
first venture into anything Unicode, but the action seems sufficiently
strange that I thought I'd post it.

Here's a test program that inspects the length of strings in a number
of ways:

#!/usr/local/bin/perl -w
use strict;
{ use bytes; } # just to make available later
use utf8;

my $smiley = "\x{263a}"; ## a smiley character

my $count = 0;
for my $string ("\x{263a}",             # 1
                $smiley,                # 2

                "" . $smiley,           # 3
                "" . "\x{263a}",        # 4

                $smiley . "",           # 5
                "\x{263a}" . "",        # 6

                "\x{263a}" . "\x{263a}", # 7
                $smiley . $smiley,      # 8

                "\x{263a}\x{263a}",     # 9
                "$smiley$smiley",       # 10

                "\x{263a}" x 2,         # 11
                $smiley x 2,            # 12
               )
{
    $count++;

    my $chars = length($string);        ## Unicode characters
    my $bytes = bytes::length($string); ## raw bytes

    my @regexchars = $string =~ m/(.)/g;
    my $regexchars = @regexchars;       ## chars as per the regex engine

    my @splitchars = split //, $string;
    my $splitchars = @splitchars;       ## see how split counts them

    print "$count: string [$string] has chars=$chars/$regexchars/$splitchars, bytes=$bytes\n";
}

Here's the output, piped through less (which shows hex codes for non-ASCII):

1: string [<E2><98><BA>] has chars=1/1/1, bytes=3
2: string [<E2><98><BA>] has chars=1/1/1, bytes=3
3: string [<E2><98><BA>] has chars=1/1/1, bytes=3
4: string [<E2><98><BA>] has chars=1/1/1, bytes=3
5: string [<E2><98><BA>] has chars=3/1/1, bytes=3
6: string [<E2><98><BA>] has chars=3/1/1, bytes=3
7: string [<C3><A2><C2><98><C2><BA><E2><98><BA>] has chars=4/4/4, bytes=9
8: string [<C3><A2><C2><98><C2><BA><E2><98><BA>] has chars=4/4/4, bytes=9
9: string [<E2><98><BA><E2><98><BA>] has chars=2/2/2, bytes=6
10: string [<C3><A2><C2><98><C2><BA><E2><98><BA>] has chars=4/4/4, bytes=9
11: string [<E2><98><BA><E2><98><BA>] has chars=6/2/2, bytes=6
12: string [<E2><98><BA><E2><98><BA>] has chars=6/2/2, bytes=6

The first four look fine to me, as <E2><98><BA> are the utf8 for the smiley:

% utf8-decode
Enter Unicode> <E2><98><BA>
Unicode 263A encoded in utf8 as a 3-byte sequence: <E2> <98> <BA>
WHITE SMILING FACE
So (Symbol, Other)
ON (Other Neutrals)

and indeed, when I view the output on a utf8 xterm, I see the smiley.
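For readers who want to check the decoder output by hand, here is a minimal sketch of the three-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) applied to U+263A. The helper name `utf8_bytes_3` is made up for this illustration and only handles code points in the three-byte range (U+0800 to U+FFFF); it is not part of the original report.

```perl
use strict;
use warnings;

# Hypothetical helper: encode a code point in the range U+0800..U+FFFF
# as three UTF-8 bytes by splitting it into 4 + 6 + 6 bits.
sub utf8_bytes_3 {
    my ($cp) = @_;
    return (
        0xE0 | ($cp >> 12),          # leading byte: 1110xxxx
        0x80 | (($cp >> 6) & 0x3F),  # continuation: 10xxxxxx
        0x80 | ($cp & 0x3F),         # continuation: 10xxxxxx
    );
}

# Prints: E2 98 BA
print join(' ', map { sprintf '%02X', $_ } utf8_bytes_3(0x263A)), "\n";
```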

Lines 5 and 6 seem odd, since the length() is 3 instead of the 1 I'd expect.

As for the rest, 7-12, I'd expect them all to be like #9, which shows
correctly that the two smileys are two characters.

#11 and #12 just have the length() wrong, but the other three (7, 8, and 10)
are really wild. I'd expect 6 bytes to create the two characters, but as it
is, there are nine bytes to create four Unicode characters:

% utf8-decode
Enter Unicode> <C3><A2><C2><98><C2><BA><E2><98><BA>
Unicode 00E2 encoded in utf8 as a 2-byte sequence: <C3> <A2>
LATIN SMALL LETTER A WITH CIRCUMFLEX
Ll (Letter, Lowercase)
decomp=[0061 0302]
has upper (00C2)
Unicode 0098 encoded in utf8 as a 2-byte sequence: <C2> <98>
<control>
Cc (Other, Control)
BN (Boundary Neutral)
Unicode 00BA encoded in utf8 as a 2-byte sequence: <C2> <BA>
MASCULINE ORDINAL INDICATOR
Ll (Letter, Lowercase)
decomp=[<super> 006F]
Unicode 263A encoded in utf8 as a 3-byte sequence: <E2> <98> <BA>
WHITE SMILING FACE
So (Symbol, Other)
ON (Other Neutrals)

But, at least the length() is correct for them.
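The nine-byte pattern is what you get when the three UTF-8 bytes of the first smiley are each treated as a Latin-1 character and encoded to UTF-8 a second time, followed by one correctly encoded smiley. The sketch below only reproduces that byte pattern on a current perl; it does not claim to show what 5.6.0 does internally, and the helper names `double_encode` and `hex_bytes` are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: treat every byte of the input as a Latin-1
# character and encode those characters as UTF-8, i.e. encode
# already-encoded data a second time.
sub double_encode {
    my ($bytes) = @_;
    my $copy = $bytes;
    utf8::encode($copy);   # in-place: characters -> UTF-8 bytes
    return $copy;
}

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $smiley_utf8 = "\xE2\x98\xBA";   # UTF-8 bytes of U+263A

# Prints: C3 A2 C2 98 C2 BA
print hex_bytes(double_encode($smiley_utf8)), "\n";

# Prints: C3 A2 C2 98 C2 BA E2 98 BA  (same bytes as examples 7, 8, 10)
print hex_bytes(double_encode($smiley_utf8) . $smiley_utf8), "\n";
```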

So, it seems that there are two separate problems:

* length() not working correctly (examples 5,6, 11, 12)
* string concatenation not working (examples 7, 8, 10)

But hey, I'm learning a lot about Unicode :-)
Jeffrey

Perl Info


Flags:
    category=core
    severity=medium

Site configuration information for perl v5.6.0:

Configured by jfriedl at Sat Jul 29 20:09:33 PDT 2000.

Summary of my perl5 (revision 5.0 version 6 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.2.15, archname=i686-linux
    uname='linux fummy.dsl.yahoo.com 2.2.16 #6 smp sun jul 23 11:26:16 pdt 2000 i686 unknown '
    config_args='-ds -e -A optimize=-g'
    hint=previous, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=define 
    use64bitint=undef use64bitall=undef uselongdouble=undef usesocks=undef
  Compiler:
    cc='cc', optimize='-O2 -g', gccversion=pgcc-2.91.66 19990314 (egcs-1.1.2 release)
    cppflags='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    stdchar='char', d_stdstdio=define, usevfork=false
    intsize=4, longsize=4, ptrsize=4, doublesize=8
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lc -lposix -lcrypt
    libc=/lib/libc-2.1.1.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    


@INC for perl v5.6.0:
    /home/jfriedl/lib/perl
    /home/jfriedl/lib/perl/yahoo
    /usr/local/lib/perl5/5.6.0/i686-linux
    /usr/local/lib/perl5/5.6.0
    /usr/local/lib/perl5/site_perl/5.6.0/i686-linux
    /usr/local/lib/perl5/site_perl/5.6.0
    /usr/local/lib/perl5/site_perl
    .


Environment for perl v5.6.0:
    HOME=/home/jfriedl
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/usr/local/pgsql/lib:/home/jfriedl/src/rvplayer5.0
    LOGDIR (unset)
    PATH=/home/jfriedl/bin:/home/jfriedl/common/bin:/usr/local/gcc-2.95.2/bin:.:/usr/local/pgsql/bin:/usr/local/bin:/usr/X11R6/bin:/bin:/usr/bin:/usr/sbin:/sbin:/home/jfriedl/src/rvplayer5.0
    PERLLIB=/home/jfriedl/lib/perl:/home/jfriedl/lib/perl/yahoo
    PERL_BADLANG (unset)
    SHELL=/bin/tcsh

p5pRT · 2000-07-31T14:32:24Z

From [Unknown Contact. See original ticket]

Jeffrey Friedl (lists.p5p):

> So, it seems that there are two separate problems:
>
> * length() not working correctly (examples 5,6, 11, 12)
> * string concatenation not working (examples 7, 8, 10)

Could you try these with a bleeding-edge Perl, and then contact
me directly? These should have been cleaned up.

> But hey, I'm learning a lot about Unicode :-)

Fun, isn't it?

p5pRT · 2000-10-18T19:12:06Z

From The RT System itself

Seems to have been fixed, works in the bleeding edge post-5.7.0 Perl.
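A minimal sketch for re-checking the report on a newer perl, assuming a version where the fix has landed (the thread only says post-5.7.0; this has not been run against 5.6.0). The helper name `counts` is invented here. It returns the same four numbers as the original test program but avoids printing the wide characters themselves.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling byte semantics

# Hypothetical helper: "chars/regexchars/splitchars/bytes" for a string.
sub counts {
    my ($string) = @_;
    my @regexchars = $string =~ m/(.)/g;
    my @splitchars = split //, $string;
    return join '/', length($string), scalar(@regexchars),
                     scalar(@splitchars), bytes::length($string);
}

my $smiley = "\x{263a}";

# The constructions that misbehaved in the report (examples 5-8, 10-12).
my @cases = (
    [  5, $smiley . ""             ],
    [  6, "\x{263a}" . ""          ],
    [  7, "\x{263a}" . "\x{263a}"  ],
    [  8, $smiley . $smiley        ],
    [ 10, "$smiley$smiley"         ],
    [ 11, "\x{263a}" x 2           ],
    [ 12, $smiley x 2              ],
);

for my $case (@cases) {
    my ($n, $string) = @$case;
    print "$n: ", counts($string), "\n";
}
```

On a fixed perl, examples 5 and 6 should report 1/1/1/3 and the rest 2/2/2/6.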

p5pRT closed this as completed Nov 28, 2003

p5pRT added Severity Medium distro-Linux type-library labels Oct 18, 2019
