Issue 10203 - std.string.toUpperInPlace is... not in place #1322

monarchdodra · 2013-05-29T17:12:09Z

This fixes #10203: toUpperInPlace is not in place. This new implementation is 100% in place, and, AFAIK, efficient.

This also addresses #10204: std.string does not take title case into account. It does not completely fix it though.

Algorithms are brand new, so diff view is mostly irrelevant.

Also: Issue 10204 - std.string does not take title case into account

DmitryOlshansky · 2013-05-29T20:03:22Z

std/string.d

+{
+    /*
+    * toLowerInPlace exploits the interesting fact that
+    * a letter's uppercase or lowercase utf representation will NEVER be


False

Quoting my recent answer on NG:

Okay, here you go - an UTF-8 table of cased sin Codepoints upper-case - lower-case - UTF codeunits 0x01e9e : 0x000df - 3 : 2 0x0023a : 0x02c65 - 2 : 3 0x0023e : 0x02c66 - 2 : 3 0x02c7e : 0x0023f - 3 : 2 0x02c7f : 0x00240 - 3 : 2 0x02c6f : 0x00250 - 3 : 2 0x02c6d : 0x00251 - 3 : 2 0x02c70 : 0x00252 - 3 : 2 0x0a78d : 0x00265 - 3 : 2 0x0a7aa : 0x00266 - 3 : 2 0x02c62 : 0x0026b - 3 : 2 0x02c6e : 0x00271 - 3 : 2 0x02c64 : 0x0027d - 3 : 2 0x01e9e : 0x000df - 3 : 2 0x02c62 : 0x0026b - 3 : 2 0x02c64 : 0x0027d - 3 : 2 0x0023a : 0x02c65 - 2 : 3 0x0023e : 0x02c66 - 2 : 3 0x02c6d : 0x00251 - 3 : 2 0x02c6e : 0x00271 - 3 : 2 0x02c6f : 0x00250 - 3 : 2 0x02c70 : 0x00252 - 3 : 2 0x02c7e : 0x0023f - 3 : 2 0x02c7f : 0x00240 - 3 : 2 0x0a78d : 0x00265 - 3 : 2 0x0a7aa : 0x00266 - 3 : 2 And this is only with 1:1 mapping. Generated by new and shiny std.uni: void main(){ import std.uni, std.utf, std.stdio; char buf[4]; foreach(dchar ch; unicode.Cased_Letter.byCodepoint){ dchar upper = toUpper(ch); dchar lower = toLower(ch); int uLen = encode(buf, upper); int lLen = encode(buf, lower); if(uLen != lLen) writefln("0x%05x : 0x%05x - %d : %d", upper, lower, uLen, lLen); } }

You missread me. You are comparing upper case and lower case. I said "original case" to result case.

For example: 0x01e9e uppercase is 0x000df.
But 0x000df's lower case isn't 0x01e9e. It's 0x000de.

Try this main (with your new uni, I wrote the equivalent with extracting the data from file):

import std.uni; import std.utf; import std.typetuple; void main() { foreach(Type; TypeTuple!(char, wchar)) { foreach( dchar ch; 0..0x10FFFF) { dchar upper = toUpper(ch); dchar lower = toLower(ch); int len = codeLength!Type(ch); int uLen = codeLength!Type(upper); int lLen = codeLength!Type(lower); assert(len >= uLen); assert(len >= lLen); } } }

As you can see, the encoding of any character's upper or lower is either the same, or smaller.

BTW, about your "cased letter", does it take into account characters such as 0x1c5, ǅ? They are not upper case, not lower case, but "title case"?

Okay if you don't trust me then:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:toLowercase=%C9%B1:]

So are other latters in the last row

Indeed cased letter includes all letters that have case. It's strictly superset of what we want to test.
toUpper and toLower won't report wrong values - either the char itself or UPPER case. It may match the title case sometimes but it doesn't matter here.

Hum... well poop on a stick. Why didn't my own program catch those ? God damn it.

Guys, include any interesting related links in the documentation. Thanks!

@monarchdodra
Anyhow me being selfish wants you to think your algorithm ahead in terms of proper toLowerInPlace that will take care of 1:M mappings (that are rare but present). As part of std.uni pull I have to provide both toLower/toUpper so I'd like to do as little tweaking as possible. Will update soonish as I need to weed out mysterious linker failure.

Yeah, I did take into account 1:M mappings, as well as title case. However, I just built the algorithm on a false assumption. I'll have to try to see how my program failed (don't have it here :/)

Thankyou for reviewing. As always, you have humbled but educated me in regards to Unicode :)

I know that feeling. If anything work on newer std.uni uni was full of:

Hm what's this means ?
(ffrustrating evening later)

Now I get it !

(few days after "discovery" and more tests)
No apparently not ..

Theme repeated many, many times. I just hope I didn't botched it up in some fundamental, unfixable way.
As I'm certain we will see some bugs and omissions down the road.

monarchdodra · 2013-05-29T20:55:22Z

OK, well according @blackwhale my implementation is flawed. So I'll have to fix it.

One thing it does prove though, is that there are strings that simply can't be upper cased or lower cased in place, as the result string will be longer than the original string.

Given this situation, what semantics do we want for InPlace? I can do an implementation that does at most a single GC allocation. Is this OK?

DmitryOlshansky · 2013-05-29T21:01:48Z

Given this situation, what semantics do we want for InPlace? I can do an implementation that does at most a single GC allocation. Is this OK?

I was thinking the same when I rewrote toLower/toUpper not ni place, recently see:
https://github.com/blackwhale/gsoc-bench-2012/blob/master/uni.d#L6412

I've no idea for *inPlace stuff but thought I'd go for "whenever we see it's getting bigger - move forward and calculate full size, then reallocate".

monarchdodra · 2013-05-29T21:12:55Z

OK. Should I submit a fixup, or am I going to just get in your way? I suppose your implementation of "toLower" just forwards to "toCase" anyways? As for InPlace, yeah, I can definitely do as you said.

I also wanted to fix some "Title case is not correctly taken into account" issues. However, (AFAIK), these don't exist in 5.0, so don't require fixing yet. If your new implementations are correct, then it isn't worth touching anything anyways.

Off topic, but shouldn't your new std.uni have "isTitle" and "toTitle" ?

DmitryOlshansky · 2013-05-29T21:19:05Z

OK. Should I submit a fixup, or am I going to just get in your way?

I'm not sure. If I fix the linker failure on linux/OSX then I'll be able to roll out something tweakable till weekend. That will not inlcude *inPlace as I haven't even started on that in-place algorithm.

Off topic, but shouldn't your new std.uni have "isTitle" and "toTitle"

Tables for toTitlecase are in place. Somebody asked about it just the other day during voting. isTitle is not but easy to add.

monarchdodra · 2013-05-29T21:32:47Z

Bah, I'll just wait. It's not like there is urgency or anything.

Issue 10203 - std.string.toUpperInPlace is... not in place

a824943

Also: Issue 10204 - std.string does not take title case into account

DmitryOlshansky reviewed May 29, 2013
View reviewed changes

monarchdodra closed this May 29, 2013

Uh oh!

Issue 10203 - std.string.toUpperInPlace is... not in place #1322

Issue 10203 - std.string.toUpperInPlace is... not in place #1322

Uh oh!

Conversation

monarchdodra commented May 29, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

monarchdodra commented May 29, 2013

Uh oh!

DmitryOlshansky commented May 29, 2013

Uh oh!

monarchdodra commented May 29, 2013

Uh oh!

DmitryOlshansky commented May 29, 2013

Uh oh!

monarchdodra commented May 29, 2013

Uh oh!

Uh oh!