Skip to content

Conversation

monarchdodra
Copy link
Collaborator

This fixes #10203: toUpperInPlace is not in place. This new implementation is 100% in place, and, AFAIK, efficient.

This also addresses #10204: std.string does not take title case into account. It does not completely fix it though.

Algorithms are brand new, so diff view is mostly irrelevant.

Also:
Issue 10204 - std.string does not take title case into account
{
/*
* toLowerInPlace exploits the interesting fact that
* a letter's uppercase or lowercase utf representation will NEVER be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

False

Quoting my recent answer on NG:

Okay, here you go - an UTF-8 table of cased sin

Codepoints upper-case - lower-case - UTF codeunits
0x01e9e : 0x000df - 3 : 2
0x0023a : 0x02c65 - 2 : 3
0x0023e : 0x02c66 - 2 : 3
0x02c7e : 0x0023f - 3 : 2
0x02c7f : 0x00240 - 3 : 2
0x02c6f : 0x00250 - 3 : 2
0x02c6d : 0x00251 - 3 : 2
0x02c70 : 0x00252 - 3 : 2
0x0a78d : 0x00265 - 3 : 2
0x0a7aa : 0x00266 - 3 : 2
0x02c62 : 0x0026b - 3 : 2
0x02c6e : 0x00271 - 3 : 2
0x02c64 : 0x0027d - 3 : 2
0x01e9e : 0x000df - 3 : 2
0x02c62 : 0x0026b - 3 : 2
0x02c64 : 0x0027d - 3 : 2
0x0023a : 0x02c65 - 2 : 3
0x0023e : 0x02c66 - 2 : 3
0x02c6d : 0x00251 - 3 : 2
0x02c6e : 0x00271 - 3 : 2
0x02c6f : 0x00250 - 3 : 2
0x02c70 : 0x00252 - 3 : 2
0x02c7e : 0x0023f - 3 : 2
0x02c7f : 0x00240 - 3 : 2
0x0a78d : 0x00265 - 3 : 2
0x0a7aa : 0x00266 - 3 : 2

And this is only with 1:1 mapping.

Generated by new and shiny std.uni:

void main(){
    import std.uni, std.utf, std.stdio;
    char buf[4];
    foreach(dchar ch; unicode.Cased_Letter.byCodepoint){
        dchar upper = toUpper(ch);
        dchar lower = toLower(ch);

        int uLen = encode(buf, upper);
        int lLen = encode(buf, lower);
        if(uLen != lLen)
            writefln("0x%05x : 0x%05x - %d : %d", upper, lower, uLen, lLen);
    }
} 

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You missread me. You are comparing upper case and lower case. I said "original case" to result case.

For example: 0x01e9e uppercase is 0x000df.
But 0x000df's lower case isn't 0x01e9e. It's 0x000de.

Try this main (with your new uni, I wrote the equivalent with extracting the data from file):

import std.uni;
import std.utf;
import std.typetuple;

void main()
{
    foreach(Type; TypeTuple!(char, wchar))
    {
        foreach( dchar ch; 0..0x10FFFF)
        {
            dchar upper = toUpper(ch);
            dchar lower = toLower(ch);

            int len  = codeLength!Type(ch);
            int uLen = codeLength!Type(upper);
            int lLen = codeLength!Type(lower);
            assert(len >= uLen);
            assert(len >= lLen);
        }
    }
} 

As you can see, the encoding of any character's upper or lower is either the same, or smaller.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, about your "cased letter", does it take into account characters such as 0x1c5, Dž? They are not upper case, not lower case, but "title case"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay if you don't trust me then:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:toLowercase=%C9%B1:]

So are other latters in the last row

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed cased letter includes all letters that have case. It's strictly superset of what we want to test.
toUpper and toLower won't report wrong values - either the char itself or UPPER case. It may match the title case sometimes but it doesn't matter here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum... well poop on a stick. Why didn't my own program catch those ? God damn it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guys, include any interesting related links in the documentation. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@monarchdodra
Anyhow me being selfish wants you to think your algorithm ahead in terms of proper toLowerInPlace that will take care of 1:M mappings (that are rare but present). As part of std.uni pull I have to provide both toLower/toUpper so I'd like to do as little tweaking as possible. Will update soonish as I need to weed out mysterious linker failure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I did take into account 1:M mappings, as well as title case. However, I just built the algorithm on a false assumption. I'll have to try to see how my program failed (don't have it here :/)

Thankyou for reviewing. As always, you have humbled but educated me in regards to Unicode :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that feeling. If anything work on newer std.uni uni was full of:

Hm what's this means ?
(ffrustrating evening later)

Now I get it !

(few days after "discovery" and more tests)
No apparently not ..

Theme repeated many, many times. I just hope I didn't botched it up in some fundamental, unfixable way.
As I'm certain we will see some bugs and omissions down the road.

@monarchdodra
Copy link
Collaborator Author

OK, well according @blackwhale my implementation is flawed. So I'll have to fix it.

One thing it does prove though, is that there are strings that simply can't be upper cased or lower cased in place, as the result string will be longer than the original string.

Given this situation, what semantics do we want for InPlace? I can do an implementation that does at most a single GC allocation. Is this OK?

@DmitryOlshansky
Copy link
Member

Given this situation, what semantics do we want for InPlace? I can do an implementation that does at most a single GC allocation. Is this OK?

I was thinking the same when I rewrote toLower/toUpper not ni place, recently see:
https://github.com/blackwhale/gsoc-bench-2012/blob/master/uni.d#L6412

I've no idea for *inPlace stuff but thought I'd go for "whenever we see it's getting bigger - move forward and calculate full size, then reallocate".

@monarchdodra
Copy link
Collaborator Author

OK. Should I submit a fixup, or am I going to just get in your way? I suppose your implementation of "toLower" just forwards to "toCase" anyways? As for InPlace, yeah, I can definitely do as you said.

I also wanted to fix some "Title case is not correctly taken into account" issues. However, (AFAIK), these don't exist in 5.0, so don't require fixing yet. If your new implementations are correct, then it isn't worth touching anything anyways.

Off topic, but shouldn't your new std.uni have "isTitle" and "toTitle" ?

@DmitryOlshansky
Copy link
Member

OK. Should I submit a fixup, or am I going to just get in your way?

I'm not sure. If I fix the linker failure on linux/OSX then I'll be able to roll out something tweakable till weekend. That will not inlcude *inPlace as I haven't even started on that in-place algorithm.

Off topic, but shouldn't your new std.uni have "isTitle" and "toTitle"

Tables for toTitlecase are in place. Somebody asked about it just the other day during voting. isTitle is not but easy to add.

@monarchdodra
Copy link
Collaborator Author

Bah, I'll just wait. It's not like there is urgency or anything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants