strangeness with Unicode #2281

Closed
p5pRT opened this issue Jul 30, 2000 · 3 comments

p5pRT commented Jul 30, 2000

Migrated from rt.perl.org#3599 (status was 'resolved')

Searchable as RT3599$


p5pRT commented Jul 30, 2000

From jfriedl@yahoo-inc.com

Created by jfriedl@yahoo-inc.com

This is another one where I hesitate to say it's a bug, since this is my
first venture into anything Unicode, but the action seems sufficiently
strange that I thought I'd post it.

Here's a test program that inspects the length of strings in a number
of ways:

  #!/usr/local/bin/perl -w
  use strict;
  { use bytes; } # just to make available later
  use utf8;

  my $smiley = "\x{263a}"; ## a smiley character

  my $count = 0;
  for my $string ("\x{263a}",              # 1
                  $smiley,                 # 2

                  "" . $smiley,            # 3
                  "" . "\x{263a}",         # 4

                  $smiley . "",            # 5
                  "\x{263a}" . "",         # 6

                  "\x{263a}" . "\x{263a}", # 7
                  $smiley . $smiley,       # 8

                  "\x{263a}\x{263a}",      # 9
                  "$smiley$smiley",        # 10

                  "\x{263a}" x 2,          # 11
                  $smiley x 2,             # 12
                 )
  {
      $count++;

      my $chars = length($string);        ## Unicode characters
      my $bytes = bytes::length($string); ## raw bytes

      my @regexchars = $string =~ m/(.)/g;
      my $regexchars = @regexchars;       ## chars as per the regex engine

      my @splitchars = split //, $string;
      my $splitchars = @splitchars;       ## see how split counts them

      print "$count: string [$string] has chars=$chars/$regexchars/$splitchars, bytes=$bytes\n";
  }

Here's the output, piped through less (which shows hex codes for non-ASCII):

  1: string [<E2><98><BA>] has chars=1/1/1, bytes=3
  2: string [<E2><98><BA>] has chars=1/1/1, bytes=3
  3: string [<E2><98><BA>] has chars=1/1/1, bytes=3
  4: string [<E2><98><BA>] has chars=1/1/1, bytes=3
  5: string [<E2><98><BA>] has chars=3/1/1, bytes=3
  6: string [<E2><98><BA>] has chars=3/1/1, bytes=3
  7: string [<C3><A2><C2><98><C2><BA><E2><98><BA>] has chars=4/4/4, bytes=9
  8: string [<C3><A2><C2><98><C2><BA><E2><98><BA>] has chars=4/4/4, bytes=9
  9: string [<E2><98><BA><E2><98><BA>] has chars=2/2/2, bytes=6
  10: string [<C3><A2><C2><98><C2><BA><E2><98><BA>] has chars=4/4/4, bytes=9
  11: string [<E2><98><BA><E2><98><BA>] has chars=6/2/2, bytes=6
  12: string [<E2><98><BA><E2><98><BA>] has chars=6/2/2, bytes=6

The first four look fine to me, as <E2><98><BA> are the utf8 for the smiley:

  % utf8-decode
  Enter Unicode> <E2><98><BA>
  Unicode 263A encoded in utf8 as a 3-byte sequence: <E2> <98> <BA>
  WHITE SMILING FACE
  So (Symbol, Other)
  ON (Other Neutrals)

and indeed, when I view the output on a utf8 xterm, I see the smiley.

Lines 5 and 6 seem odd, since the length() is 3 instead of the 1 I'd expect.

As for the rest, 7-12, I'd expect them all to be like #9, which shows
correctly that the two smileys are two characters.

#11 and 12 just have the length() wrong, but the other three are really
wild. I'd expect 6 bytes to create the two characters, but as it is, there
are nine bytes to create four Unicode characters:

  % utf8-decode
  Enter Unicode> <C3><A2><C2><98><C2><BA><E2><98><BA>
  Unicode 00E2 encoded in utf8 as a 2-byte sequence: <C3> <A2>
  LATIN SMALL LETTER A WITH CIRCUMFLEX
  Ll (Letter, Lowercase)
  decomp=[0061 0302]
  has upper (00C2)
  Unicode 0098 encoded in utf8 as a 2-byte sequence: <C2> <98>
  <control>
  Cc (Other, Control)
  BN (Boundary Neutral)
  Unicode 00BA encoded in utf8 as a 2-byte sequence: <C2> <BA>
  MASCULINE ORDINAL INDICATOR
  Ll (Letter, Lowercase)
  decomp=[<super> 006F]
  Unicode 263A encoded in utf8 as a 3-byte sequence: <E2> <98> <BA>
  WHITE SMILING FACE
  So (Symbol, Other)
  ON (Other Neutrals)

But, at least the length() is correct for them.
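
The nine-byte result in examples 7, 8 and 10 is what you would get if the three UTF-8 bytes of the left-hand smiley were treated as three Latin-1 characters (U+00E2, U+0098, U+00BA) and then encoded to UTF-8 a second time. A minimal sketch that reproduces that byte sequence on a current Perl is below. It uses the core Encode module, which is an assumption made here for illustration and is not part of the original test program, and `hex_bytes` is a hypothetical helper that mimics the hex display of less.

  ```perl
  use strict;
  use warnings;
  use Encode qw(encode);

  # Hypothetical helper: render a byte string as <XX><YY>... like less does.
  sub hex_bytes {
      my ($bytes) = @_;
      return join '', map { sprintf '<%02X>', ord($_) } split //, $bytes;
  }

  my $smiley = "\x{263a}";               # one character, U+263A
  my $once   = encode('UTF-8', $smiley); # its three UTF-8 bytes

  # Encoding the byte string again treats each byte as a Latin-1 character,
  # so each of the three bytes grows into a two-byte UTF-8 sequence.
  my $twice  = encode('UTF-8', $once);

  print hex_bytes($once), "\n";          # <E2><98><BA>
  print hex_bytes($twice . $once), "\n"; # <C3><A2><C2><98><C2><BA><E2><98><BA>
  ```

The second line printed matches the bytes shown for examples 7, 8 and 10, which suggests (but does not prove) that the concatenation upgraded an operand that was already UTF-8.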

So, it seems that there are two separate problems:

  * length() not working correctly (examples 5, 6, 11, 12)
  * string concatenation not working (examples 7, 8, 10)

But hey, I'm learning a lot about Unicode :-)
  Jeffrey

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl v5.6.0:

Configured by jfriedl at Sat Jul 29 20:09:33 PDT 2000.

Summary of my perl5 (revision 5.0 version 6 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.2.15, archname=i686-linux
    uname='linux fummy.dsl.yahoo.com 2.2.16 #6 smp sun jul 23 11:26:16 pdt 2000 i686 unknown '
    config_args='-ds -e -A optimize=-g'
    hint=previous, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=define 
    use64bitint=undef use64bitall=undef uselongdouble=undef usesocks=undef
  Compiler:
    cc='cc', optimize='-O2 -g', gccversion=pgcc-2.91.66 19990314 (egcs-1.1.2 release)
    cppflags='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    stdchar='char', d_stdstdio=define, usevfork=false
    intsize=4, longsize=4, ptrsize=4, doublesize=8
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lc -lposix -lcrypt
    libc=/lib/libc-2.1.1.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    


@INC for perl v5.6.0:
    /home/jfriedl/lib/perl
    /home/jfriedl/lib/perl/yahoo
    /usr/local/lib/perl5/5.6.0/i686-linux
    /usr/local/lib/perl5/5.6.0
    /usr/local/lib/perl5/site_perl/5.6.0/i686-linux
    /usr/local/lib/perl5/site_perl/5.6.0
    /usr/local/lib/perl5/site_perl
    .


Environment for perl v5.6.0:
    HOME=/home/jfriedl
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/usr/local/pgsql/lib:/home/jfriedl/src/rvplayer5.0
    LOGDIR (unset)
    PATH=/home/jfriedl/bin:/home/jfriedl/common/bin:/usr/local/gcc-2.95.2/bin:.:/usr/local/pgsql/bin:/usr/local/bin:/usr/X11R6/bin:/bin:/usr/bin:/usr/sbin:/sbin:/home/jfriedl/src/rvplayer5.0
    PERLLIB=/home/jfriedl/lib/perl:/home/jfriedl/lib/perl/yahoo
    PERL_BADLANG (unset)
    SHELL=/bin/tcsh



p5pRT commented Jul 31, 2000

From [Unknown Contact. See original ticket]

Jeffrey Friedl (lists.p5p):

> So, it seems that there are two separate problems:
>
> * length() not working correctly (examples 5, 6, 11, 12)
> * string concatenation not working (examples 7, 8, 10)

Could you try these with a bleeding-edge Perl, and then contact
me directly? These should have been cleaned up.

> But hey, I'm learning a lot about Unicode :-)

Fun, isn't it?


p5pRT commented Oct 18, 2000

From The RT System itself

Seems to have been fixed, works in the bleeding edge post-5.7.0 Perl.
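
For anyone re-checking this on a current Perl, here is a condensed sketch of the original test, limited to the constructions that misbehaved. It assumes a modern Perl 5 and is not the check that was actually run against post-5.7.0. `use bytes ();` loads `bytes::length` without switching on byte semantics.

  ```perl
  use strict;
  use warnings;
  use bytes ();   # makes bytes::length available, no byte semantics

  my $smiley = "\x{263a}";

  # [ label, string under test, expected number of characters ]
  my @cases = (
      [ '$smiley . ""',      $smiley . "",       1 ],
      [ '"\x{263a}" . ""',   "\x{263a}" . "",    1 ],
      [ '$smiley . $smiley', $smiley . $smiley,  2 ],
      [ '"$smiley$smiley"',  "$smiley$smiley",   2 ],
      [ '$smiley x 2',       $smiley x 2,        2 ],
  );

  my @failures;
  for my $case (@cases) {
      my ($label, $string, $want) = @$case;

      my $chars      = length($string);        # characters
      my $bytes      = bytes::length($string); # bytes in Perl's internal UTF-8
      my @regexchars = $string =~ m/(.)/g;     # characters per the regex engine
      my @splitchars = split //, $string;      # characters per split

      my $ok = $chars == $want
            && @regexchars == $want
            && @splitchars == $want
            && $bytes == 3 * $want;            # U+263A is three bytes in UTF-8

      push @failures, $label unless $ok;
      printf "%-20s chars=%d/%d/%d, bytes=%d %s\n",
          $label, $chars, scalar @regexchars, scalar @splitchars, $bytes,
          $ok ? 'ok' : 'NOT OK';
  }
  ```

Only the labels are printed, not the strings themselves, so the script does not need an output encoding layer.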
