possible bug in perl "\u" string processing #15517

Closed
p5pRT opened this issue Aug 15, 2016 · 13 comments

p5pRT commented Aug 15, 2016

Migrated from rt.perl.org#128950 (status was 'rejected')

Searchable as RT128950$

p5pRT commented Aug 15, 2016

From @jmdh

The following bug report was sent to Debian. I don't believe it is specific to Debian, and have no expertise in the area, so I am forwarding without further comment. It was reported against Debian perl 5.22.2-3, for which perl -V output is available[1] and I was also able to reproduce it against our perl 5.24.0 build.

The following might be a bug in how Perl uppercases strings as in e.g.​:
print "\uf";

perlop(1) already documents that, in a Unicode context, this can result
in inserting additional characters.
Typical example is​:
U+00DF LATIN SMALL LETTER SHARP S
where the Unicode rules specify that this single character results in
"SS"[0].

Anyway, when in perl I do e.g.​:
$ perl -C -e 'print "\U\N{U+00DF}\E\n";'
SS
is returned, which is good.
However, when I do​:
$ perl -C -e 'print "\u\N{U+00DF}\E\n";'
Ss
is returned.

Now IMO that's an error, \u says "titlecase (not uppercase!) next character
only", the next character is however \N{U+00DF} (aka ß) and its
capitalisation should AFAIU result in "SS", not "Ss".

Cheers,
Chris.

[0] Not sure if this is still the case in most recent versions, as there
  is now a majuscule form of that character:
  U+1E9E LATIN CAPITAL LETTER SHARP S

[1] Summary of my perl5 (revision 5 version 22 subversion 2) configuration​:
 
  Platform​:
  osname=linux, osvers=3.16.0, archname=x86_64-linux-gnu-thread-multi
  uname='linux localhost 3.16.0 #1 smp debian 3.16.0 x86_64 gnulinux '
  config_args='-Dusethreads -Duselargefiles -Dcc=x86_64-linux-gnu-gcc -Dcpp=x86_64-linux-gnu-cpp -Dld=x86_64-linux-gnu-gcc -Dccflags=-DDEBIAN -Wdate-time -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Dldflags= -Wl,-z,relro -Dlddlflags=-shared -Wl,-z,relro -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.22 -Darchlib=/usr/lib/x86_64-linux-gnu/perl/5.22 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/x86_64-linux-gnu/perl5/5.22 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.22.2 -Dsitearch=/usr/local/lib/x86_64-linux-gnu/perl/5.22.2 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dusesitecustomize -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -Uversiononly -DDEBUGGING=-g -Doptimize=-O2 -dEs -Duseshrplib -Dlibperl=libperl.so.5.22.2'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  use64bitint=define, use64bitall=define, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='x86_64-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O2 -g',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include'
  ccversion='', gccversion='5.4.0 20160609', gccosandvers=''
  intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678, doublekind=3
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16, longdblkind=3
  ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries​:
  ld='x86_64-linux-gnu-gcc', ldflags =' -fstack-protector-strong -L/usr/local/lib'
  libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/5/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
  libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
  perllibs=-ldl -lm -lpthread -lc -lcrypt
  libc=libc-2.23.so, so=so, useshrplib=true, libperl=libperl.so.5.22
  gnulibc_version='2.23'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector-strong'

Characteristics of this binary (from libperl)​:
  Compile-time options​: HAS_TIMES MULTIPLICITY PERLIO_LAYERS
  PERL_DONT_CREATE_GVSV
  PERL_HASH_FUNC_ONE_AT_A_TIME_HARD
  PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP
  PERL_NEW_COPY_ON_WRITE PERL_PRESERVE_IVUV
  USE_64_BIT_ALL USE_64_BIT_INT USE_ITHREADS
  USE_LARGE_FILES USE_LOCALE USE_LOCALE_COLLATE
  USE_LOCALE_CTYPE USE_LOCALE_NUMERIC USE_LOCALE_TIME
  USE_PERLIO USE_PERL_ATOF USE_REENTRANT_API
  USE_SITECUSTOMIZE
  Locally applied patches​:
  DEBPKG​:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN.
  DEBPKG​:debian/db_file_ver - http​://bugs.debian.org/340047 Remove overly restrictive DB_File version check.
  DEBPKG​:debian/doc_info - Replace generic man(1) instructions with Debian-specific information.
  DEBPKG​:debian/enc2xs_inc - http​://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @​INC directories.
  DEBPKG​:debian/errno_ver - http​://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes.
  DEBPKG​:debian/libperl_embed_doc - http​://bugs.debian.org/186778 Note that libperl-dev package is required for embedded linking
  DEBPKG​:fixes/respect_umask - Respect umask during installation
  DEBPKG​:debian/writable_site_dirs - Set umask approproately for site install directories
  DEBPKG​:debian/extutils_set_libperl_path - EU​:MM​: set location of libperl.a under /usr/lib
  DEBPKG​:debian/no_packlist_perllocal - Don't install .packlist or perllocal.pod for perl or vendor
  DEBPKG​:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets.
  DEBPKG​:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor.
  DEBPKG​:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy.
  DEBPKG​:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable.
  DEBPKG​:debian/mod_paths - Tweak @​INC ordering for Debian
  DEBPKG​:debian/prune_libs - http​://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need.
  DEBPKG​:fixes/net_smtp_docs - [rt.cpan.org #36038] http​://bugs.debian.org/100195 Document the Net​::SMTP 'Port' option
  DEBPKG​:debian/perlivp - http​://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local
  DEBPKG​:debian/deprecate-with-apt - http​://bugs.debian.org/747628 Point users to Debian packages of deprecated core modules
  DEBPKG​:debian/squelch-locale-warnings - http​://bugs.debian.org/508764 Squelch locale warnings in Debian package maintainer scripts
  DEBPKG​:debian/skip-upstream-git-tests - Skip tests specific to the upstream Git repository
  DEBPKG​:debian/patchlevel - http​://bugs.debian.org/567489 List packaged patches for 5.22.2-3 in patchlevel.h
  DEBPKG​:debian/skip-kfreebsd-crash - http​://bugs.debian.org/628493 [perl #96272] Skip a crashing test case in t/op/threads.t on GNU/kFreeBSD
  DEBPKG​:fixes/document_makemaker_ccflags - http​://bugs.debian.org/628522 [rt.cpan.org #68613] Document that CCFLAGS should include $Config{ccflags}
  DEBPKG​:debian/find_html2text - http​://bugs.debian.org/640479 Configure CPAN​::Distribution with correct name of html2text
  DEBPKG​:debian/perl5db-x-terminal-emulator.patch - http​://bugs.debian.org/668490 Invoke x-terminal-emulator rather than xterm in perl5db.pl
  DEBPKG​:debian/cpan-missing-site-dirs - http​://bugs.debian.org/688842 Fix CPAN​::FirstTime defaults with nonexisting site dirs if a parent is writable
  DEBPKG​:fixes/memoize_storable_nstore - [rt.cpan.org #77790] http​://bugs.debian.org/587650 Memoize​::Storable​: respect 'nstore' option not respected
  DEBPKG​:debian/regen-skip - Skip a regeneration check in unrelated git repositories
  DEBPKG​:debian/makemaker-pasthru - http​://bugs.debian.org/758471 Pass LD settings through to subdirectories
  DEBPKG​:fixes/pod_man_reproducible_date - http​://bugs.debian.org/759405 Support POD_MAN_DATE in Pod​::Man for the left-hand footer
  DEBPKG​:debian/locale-robustness - http​://bugs.debian.org/782068 [perl #124310] Make t/run/locale.t survive missing locales masked by LC_ALL
  DEBPKG​:fixes/podman-utc - http​://bugs.debian.org/780259 Make the embedded date from Pod​::Man reproducible
  DEBPKG​:fixes/podman-utc-docs - http​://bugs.debian.org/780259 Documentation and test suite updates for UTC fix
  DEBPKG​:fixes/podman-empty-date - http​://bugs.debian.org/780259 Support an empty POD_MAN_DATE environment variable
  DEBPKG​:fixes/podman-pipe - http​://bugs.debian.org/777405 Better errors for man pages from standard input
  DEBPKG​:debian/pod2man-customized - Update porting/customized.dat for pod2man modifications
  DEBPKG​:debian/makemaker-manext - http​://bugs.debian.org/247370 Make EU​::MakeMaker honour MANnEXT settings in generated manpage headers
  DEBPKG​:debian/makemaker_customized - Update t/porting/customized.dat for files patched in Debian
  DEBPKG​:debian/do-not-record-build-date - [6baa8db] http​://bugs.debian.org/774422 [perl #125830] Allow overriding the compile time in "perl -V" output
  DEBPKG​:fixes/podman-source-date-epoch - http​://bugs.debian.org/801621 Make Pod​::Man honor the SOURCE_DATE_EPOCH environment variable
  DEBPKG​:fixes/podman-source-date-epoch-cleanups - http​://bugs.debian.org/801621 Coding style and documentation for SOURCE_EPOCH_DATE
  DEBPKG​:fixes/podman-source-date-epoch-testfix - http​://bugs.debian.org/807086 Guard for building with SOURCE_DATE_EPOCH or POD_MAN_DATE set
  DEBPKG​:debian/devel-ppport-reproducibility - http​://bugs.debian.org/801523 Sort the list of XS code files when generating RealPPPort.xs
  DEBPKG​:fixes/encode-unicode-bom - http​://bugs.debian.org/798727 [rt.cpan.org #107043] Address https://rt.cpan.org/Public/Bug/Display.html?id=107043
  DEBPKG​:debian/encode-unicode-bom-doc - http​://bugs.debian.org/798727 Document Debian backport of Encode​::Unicode fix
  DEBPKG​:debian/kfreebsd-softupdates - http​://bugs.debian.org/796798 Work around Debian Bug#796798
  DEBPKG​:fixes/autodie-scope - http​://bugs.debian.org/798096 Fix a scoping issue with "no autodie" and the "system" sub
  DEBPKG​:debian/debugperl-compat-fix - [perl #127212] http​://bugs.debian.org/810326 Disable PERL_TRACK_MEMPOOL for debugging builds
  DEBPKG​:fixes/crosscompile-no-targethost - [perl #127234] Fix the Configure escape with usecrosscompile but no targethost
  DEBPKG​:fixes/podlators-no-encode - [rt.cpan.org #111156] Degrade gracefully if utf8 is requested but Encode is not available
  DEBPKG​:debian/cross-time-hires - [rt.cpan.org #111391] Add an environment variable to skip running configuration probes
  DEBPKG​:fixes/encode-unicode-pod - Unicode.pm​: Fix POD error
  DEBPKG​:fixes/memoize-pod - [rt.cpan.org #89441] Fix POD errors in Memoize
  DEBPKG​:fixes/ok-pod - Added encoding for pod.
  DEBPKG​:debian/hurd-softupdates - http​://bugs.debian.org/822735 Fix t/op/stat.t failures on hurd
  DEBPKG:fixes/xsloader-eval - [rt.cpan.org #115808] http://bugs.debian.org/829578 Don’t let XSLoader load relative paths
  DEBPKG​:fixes/extutils-parsexs-reproducibility - [perl #128517] http​://bugs.debian.org/829296 Make the output of ExtUtils​::ParseXS reproducible
  DEBPKG​:fixes/CVE-2016-1238/remove-dot-when-loading - [perl #127834] (perl #127834) remove . from the end of @​INC if complex modules are loaded
  DEBPKG​:fixes/CVE-2016-1238/remove-dot-in-padwalker - [perl #127834] perl5db.pl​: ensure PadWalker is loaded from standard paths
  DEBPKG​:fixes/CVE-2016-1238/remove-dot-in-dist - [perl #127834] dist/​: remove . from @​INC when loading optional modules
  DEBPKG​:fixes/CVE-2016-1238/remove-dot-in-cpan - [perl #127834] cpan/​: remove . from @​INC when loading optional modules
  DEBPKG​:debian/CVE-2016-1238/test-suite-without-dot - [perl #127810] Patch unit tests to explicitly insert "." into @​INC when needed.
  DEBPKG​:debian/CVE-2016-1238/eumm-without-dot - [perl #127810] Add PERL_USE_UNSAFE_INC support to EU​::MM for fortify_inc support.
  DEBPKG​:debian/CVE-2016-1238/cpan-without-dot - [perl #127810] Set PERL_USE_UNSAFE_INC for cpan usage
  DEBPKG​:debian/CVE-2016-1238/sitecustomize-in-etc - Look for sitecustomize.pl in /etc/perl rather than sitelib on Debian systems
  DEBPKG​:debian/CVE-2016-1238/customized - Update customized.dat for CVE-2016-1238 changes
  Built under linux
  Compiled at Jul 25 2016 15​:00​:43
  @​INC​:
  /etc/perl
  /usr/local/lib/x86_64-linux-gnu/perl/5.22.2
  /usr/local/share/perl/5.22.2
  /usr/lib/x86_64-linux-gnu/perl5/5.22
  /usr/share/perl5
  /usr/lib/x86_64-linux-gnu/perl/5.22
  /usr/share/perl/5.22
  /usr/local/lib/site_perl
  /usr/lib/x86_64-linux-gnu/perl-base
  .

p5pRT commented Aug 15, 2016

From @mauke

On 15.08.2016 at 23:14, Dominic Hargreaves (via RT) wrote:

> # New Ticket Created by Dominic Hargreaves
> # Please include the string: [perl #128950]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=128950 >
>
> The following bug report was sent to Debian. I don't believe it is specific to Debian, and have no expertise in the area, so I am forwarding without further comment. It was reported against Debian perl 5.22.2-3, for which perl -V output is available[1] and I was also able to reproduce it against our perl 5.24.0 build.
>
> The following might be a bug in how Perl uppercases strings as in e.g.:
> print "\uf";
>
> perlop(1) already documents that, in a Unicode context, this can result
> in inserting additional characters.
> Typical example is:
> U+00DF LATIN SMALL LETTER SHARP S
> where the Unicode rules specify that this single character results in
> "SS"[0].
>
> Anyway, when in perl I do e.g.:
> $ perl -C -e 'print "\U\N{U+00DF}\E\n";'
> SS
> is returned, which is good.
> However, when I do:
> $ perl -C -e 'print "\u\N{U+00DF}\E\n";'
> Ss
> is returned.
>
> Now IMO that's an error, \u says "titlecase (not uppercase!) next character
> only", the next character is however \N{U+00DF} (aka ß) and its
> capitalisation should AFAIU result in "SS", not "Ss".

This is not specific to \u. ucfirst("ß") also returns "Ss", which makes
me believe it's deliberate. (It might also be a bug, but then it's a bug
in ucfirst, not just \u.)
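
For anyone who wants to check the comparison locally, here is a minimal sketch. The variable names are illustrative, and enabling `unicode_strings` is an assumption made so that Unicode casing rules apply regardless of how the string is stored internally.

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # assumption: force Unicode casing rules

my $sharp_s = "\N{U+00DF}";      # LATIN SMALL LETTER SHARP S

my $via_escape   = "\u$sharp_s";       # titlecase via the \u escape
my $via_function = ucfirst $sharp_s;   # titlecase via ucfirst()
my $upper        = uc $sharp_s;        # full uppercase, as \U would give

binmode STDOUT, ':encoding(UTF-8)';
printf "%-8s %s\n", '\u',      $via_escape;
printf "%-8s %s\n", 'ucfirst', $via_function;
printf "%-8s %s\n", 'uc',      $upper;
```

If the two titlecase results agree with each other, as reported above, then whatever is going on is a property of the titlecase mapping itself rather than of the `\u` escape.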

--
Lukas Mai <plokinom@​gmail.com>

p5pRT commented Aug 15, 2016

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Aug 15, 2016

From @cpansprout

On Mon Aug 15 14:14:57 2016, dom wrote:

> Anyway, when in perl I do e.g.:
> $ perl -C -e 'print "\U\N{U+00DF}\E\n";'
> SS
> is returned, which is good.
> However, when I do:
> $ perl -C -e 'print "\u\N{U+00DF}\E\n";'
> Ss
> is returned.
>
> Now IMO that's an error, \u says "titlecase (not uppercase!)

I.e., titlecase, not capitalization.

> next character
> only", the next character is however \N{U+00DF} (aka ß) and its
> capitalisation

But you did not ask for capitalization, but for titlecase.

> should AFAIU result in "SS", not "Ss".

I think it should result in Ss, because that is the titlecase version of ß. Karl Williamson should be able to confirm whether I am right.

--

Father Chrysostomos

p5pRT commented Aug 15, 2016

From @cpansprout

On Mon Aug 15 14:27:49 2016, sprout wrote:

> On Mon Aug 15 14:14:57 2016, dom wrote:
>
> > Anyway, when in perl I do e.g.:
> > $ perl -C -e 'print "\U\N{U+00DF}\E\n";'
> > SS
> > is returned, which is good.
> > However, when I do:
> > $ perl -C -e 'print "\u\N{U+00DF}\E\n";'
> > Ss
> > is returned.
> >
> > Now IMO that's an error, \u says "titlecase (not uppercase!)
>
> I.e., titlecase, not capitalization.
>
> > next character
> > only", the next character is however \N{U+00DF} (aka ß) and its
> > capitalisation
>
> But you did not ask for capitalization, but for titlecase.
>
> > should AFAIU result in "SS", not "Ss".
>
> I think it should result in Ss, because that is the titlecase version
> of ß. Karl Williamson should be able to confirm whether I am right.

I was a bit sloppy with my wording there, because ‘capitalization’ in English usually means what geeks call ‘titlecase’.

So​: You asked for capitalization and got exactly that.

--

Father Chrysostomos

p5pRT commented Aug 15, 2016

From @khwilliamson

On 08/15/2016 03:27 PM, Father Chrysostomos via RT wrote:

> On Mon Aug 15 14:14:57 2016, dom wrote:
>
> > Anyway, when in perl I do e.g.:
> > $ perl -C -e 'print "\U\N{U+00DF}\E\n";'
> > SS
> > is returned, which is good.
> > However, when I do:
> > $ perl -C -e 'print "\u\N{U+00DF}\E\n";'
> > Ss
> > is returned.
> >
> > Now IMO that's an error, \u says "titlecase (not uppercase!)
>
> I.e., titlecase, not capitalization.
>
> > next character
> > only", the next character is however \N{U+00DF} (aka ß) and its
> > capitalisation
>
> But you did not ask for capitalization, but for titlecase.
>
> > should AFAIU result in "SS", not "Ss".
>
> I think it should result in Ss, because that is the titlecase version of ß. Karl Williamson should be able to confirm whether I am right.

It is deliberate. Consider the name "titlecase". It means how a title
of something, like a chapter, should appear where the first character in
each important word is capitalized, but subsequent letters appear as
lower case.

One would not have a word in a title that was "SSisch".

But I'm told that this situation would never come up in natural German.
ß can only occur immediately after a vowel; never at the beginning of a
word, so will never be actually titlecased.

The Unicode Standard has not changed the capitalization of ß with the
addition of LATIN CAPITAL LETTER SHARP S U+1E9E into the standard, so
the issue is unchanged.

p5pRT commented Aug 15, 2016

From zefram@fysh.org

Dominic Hargreaves wrote:

> Now IMO that's an error, \u says "titlecase (not uppercase!) next character
> only", the next character is however \N{U+00DF}

unicore/SpecialCasing.txt has

# # The German es-zed is special--the normal mapping is to SS.
# # Note​: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))
#
# 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S

So it's deliberate, but a known funny case.

> [0] Not sure if this is still the case in most recent versions, as there
> is now a majuscule form of that character:
> U+1E9E LATIN CAPITAL LETTER SHARP S

That one's even funnier. It's so rarely used that it doesn't count as
the uppercase form of U+df, though the lowercase of U+1e9e is U+df.
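
As a rough way to see these mappings from inside Perl, the sketch below rebuilds the SpecialCasing.txt fields for U+00DF from the built-in casing functions and checks the asymmetry with U+1E9E. `hex_points` is a hypothetical helper written for this illustration, and `unicode_strings` is enabled as an assumption so that Unicode rules are used.

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # assumption: force Unicode casing rules

# Hypothetical helper: render a string as space-separated hex code points,
# in the same style as the SpecialCasing.txt fields.
sub hex_points {
    my ($str) = @_;
    return join ' ', map { sprintf '%04X', ord } split //, $str;
}

my $sharp_s     = "\N{U+00DF}";   # LATIN SMALL LETTER SHARP S
my $cap_sharp_s = "\N{U+1E9E}";   # LATIN CAPITAL LETTER SHARP S

# Field order in SpecialCasing.txt: code; lower; title; upper
my $line = join '; ', map { hex_points($_) }
    $sharp_s, lc($sharp_s), ucfirst($sharp_s), uc($sharp_s);
print "$line;\n";

# The asymmetry described above: lowercasing U+1E9E gives U+00DF,
# but uppercasing U+00DF gives "SS" rather than U+1E9E.
print "lc(U+1E9E) = ", hex_points(lc $cap_sharp_s), "\n";
print "uc(U+00DF) = ", hex_points(uc $sharp_s), "\n";
```

The first printed line can be compared directly against the `00DF; 00DF; 0053 0073; 0053 0053;` entry quoted above.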

-zefram

p5pRT commented Aug 15, 2016

From @khwilliamson

Rejecting, as Zefram pointed out the behavior is explicitly what The Unicode Standard specifies.
--
Karl Williamson

p5pRT commented Aug 15, 2016

@khwilliamson - Status changed from 'open' to 'rejected'

@p5pRT p5pRT closed this as completed Aug 15, 2016

p5pRT commented Aug 16, 2016

From eric.herman@booking.com

On 16-08-16 00:03, Karl Williamson wrote:

> But I'm told that this situation would never come up in natural German.
> ß can only occur immediately after a vowel; never at the beginning of a
> word, so will never be actually titlecased.

This may not come up in German, but it will come up at least sometimes in Dutch.

For example the IJsselmeer lake.

https://en.wikipedia.org/wiki/IJ_%28digraph%29#Capitalisation

It should be noted, however, that the Dutch U+0132 'IJ' and U+0133 'ij'
letters are falling out of use in favor of constructing the digraph by
simply typing adjacent 'i' and 'j' letters.

I have no good idea of how to correctly handle title case for words
written like this.

--
Eric Herman - mobile​: +31 620719662
Booking.com - Principal Developer - Core infra​: DB Scaling

p5pRT commented Aug 16, 2016

From zefram@fysh.org

Eric Herman via perl5-porters wrote:

> This may not come up in German, but it will come up at least sometimes in Dutch.

With different characters that have their own capitalisation rules.
In the case of the "IJ" digraph, Unicode has only uppercase U+132 "IJ"
and lowercase U+133 "ij". Thus it titlecases to the uppercase "IJ",
matching the standard Dutch usage. Another different case is the "LJ"
digraph, for which Unicode has uppercase U+1c7 "LJ", titlecase U+1c8
"Lj", and lowercase U+1c9 "lj". Those are used in Croatian.

> falling out of use in favor of constructing the digraph by simply
> typing adjacent 'i' and 'j' letters.
>
> I have no good idea of how to correctly handle title case for words written
> like this.

Unicode can't handle that automatically. One would need a
language-specific algorithm. This is akin to the casing problem that
arises with the Turkish dotted-I and dotless-I letters, both of which
come in uppercase and lowercase forms​: the standard case pair of dotless
capital "I" with dotted small "i" is not correct for Turkish.
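
To make the distinction concrete, here is a small sketch. The first part relies only on Perl's built-in Unicode mappings for the single-code-point digraphs. The second part uses `ucfirst_nl`, a hypothetical helper that is not part of Perl or any module, as one possible language-specific rule for a word typed with separate 'i' and 'j'. It ignores any exceptions a real Dutch implementation would need to consider, and `unicode_strings` is enabled as an assumption.

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # assumption: force Unicode casing rules

# Format the first character of a string as U+XXXX.
sub code_point {
    my ($str) = @_;
    return sprintf 'U+%04X', ord $str;
}

# Single-code-point digraphs: Unicode's own mappings are enough.
my $ij_title = code_point(ucfirst "\N{U+0133}");   # small ligature ij
my $lj_title = code_point(ucfirst "\N{U+01C9}");   # small letter lj
my $lj_upper = code_point(uc "\N{U+01C9}");

# Hypothetical helper, not part of Perl: titlecase a word using the Dutch
# convention that a leading "ij" written as two letters is capitalised
# as a unit. Anything else falls back to plain ucfirst.
sub ucfirst_nl {
    my ($word) = @_;
    return 'IJ' . substr($word, 2) if $word =~ /\Aij/i;
    return ucfirst $word;
}

my $generic = ucfirst 'ijsselmeer';       # language-neutral rule
my $dutch   = ucfirst_nl('ijsselmeer');   # Dutch-specific rule

print "ij ligature titlecases to $ij_title\n";
print "lj titlecases to $lj_title, uppercases to $lj_upper\n";
print "generic: $generic, Dutch-aware: $dutch\n";
```

The helper deliberately takes no locale argument; a caller would have to know the text is Dutch, which is exactly the information the language-neutral Unicode mappings do not have.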

-zefram

p5pRT commented Aug 16, 2016

From @Tux

On Tue, 16 Aug 2016 15:24:39 +0100, Zefram <zefram@fysh.org> wrote:

> Eric Herman via perl5-porters wrote:
>
> > This may not come up in German, but it will come up at least sometimes in Dutch.
>
> With different characters that have their own capitalisation rules.
> In the case of the "IJ" digraph, Unicode has only uppercase U+132 "IJ"
> and lowercase U+133 "ij". Thus it titlecases to the uppercase "IJ",
> matching the standard Dutch usage.

Note that the Dutch law has banished the use of the IJ and ij ligatures
in official registration and documents. That means that my name shall
not be written as Merĳn (with the U+0133 ligature), but with the i and
the j as separate letters: Merijn. That said, a name like "IJsbrand" is
expected to uppercase *both* I and J when titlecasing the name.

Does that help at all?

> Another different case is the "LJ" digraph, for which Unicode has
> uppercase U+1c7 "LJ", titlecase U+1c8 "Lj", and lowercase U+1c9
> "lj". Those are used in Croatian.
>
> > falling out of use in favor of constructing the digraph by simply
> > typing adjacent 'i' and 'j' letters.
> >
> > I have no good idea of how to correctly handle title case for words
> > written like this.
>
> Unicode can't handle that automatically. One would need a
> language-specific algorithm. This is akin to the casing problem that
> arises with the Turkish dotted-I and dotless-I letters, both of which
> come in uppercase and lowercase forms: the standard case pair of
> dotless capital "I" with dotted small "i" is not correct for Turkish.
>
> -zefram

--
H.Merijn Brand http​://tux.nl Perl Monger http​://amsterdam.pm.org/
using perl5.00307 .. 5.25 porting perl5 on HP-UX, AIX, and openSUSE
http​://mirrors.develooper.com/hpux/ http​://www.test-smoke.org/
http​://qa.perl.org http​://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented Aug 17, 2016

From @demerphq

On 16 August 2016 at 17:10, H.Merijn Brand <h.m.brand@xs4all.nl> wrote:

> On Tue, 16 Aug 2016 15:24:39 +0100, Zefram <zefram@fysh.org> wrote:
>
> > Eric Herman via perl5-porters wrote:
> >
> > > This may not come up in German, but it will come up at least sometimes in Dutch.
> >
> > With different characters that have their own capitalisation rules.
> > In the case of the "IJ" digraph, Unicode has only uppercase U+132 "IJ"
> > and lowercase U+133 "ij". Thus it titlecases to the uppercase "IJ",
> > matching the standard Dutch usage.
>
> Note that the Dutch law has banished the use of the IJ and ij ligatures
> in official registration and documents. That means that my name shall
> not be written as Merĳn (with the U+0133 ligature), but with the i and
> the j as separate letters: Merijn. That said, a name like "IJsbrand" is
> expected to uppercase *both* I and J when titlecasing the name.
>
> Does that help at all?

Not really. If they pass a law that means that Unicode can't do their
titlecasing rules properly then there isn't much we can do about it.

Yves
