-
-
Notifications
You must be signed in to change notification settings - Fork 743
Issue 10203 - std.string.toUpperInPlace is... not in place #1322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Also: Issue 10204 - std.string does not take title case into account
{ | ||
/* | ||
* toLowerInPlace exploits the interesting fact that | ||
* a letter's uppercase or lowercase utf representation will NEVER be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
False
Quoting my recent answer on NG:
Okay, here you go - an UTF-8 table of cased sin
Codepoints upper-case - lower-case - UTF codeunits
0x01e9e : 0x000df - 3 : 2
0x0023a : 0x02c65 - 2 : 3
0x0023e : 0x02c66 - 2 : 3
0x02c7e : 0x0023f - 3 : 2
0x02c7f : 0x00240 - 3 : 2
0x02c6f : 0x00250 - 3 : 2
0x02c6d : 0x00251 - 3 : 2
0x02c70 : 0x00252 - 3 : 2
0x0a78d : 0x00265 - 3 : 2
0x0a7aa : 0x00266 - 3 : 2
0x02c62 : 0x0026b - 3 : 2
0x02c6e : 0x00271 - 3 : 2
0x02c64 : 0x0027d - 3 : 2
0x01e9e : 0x000df - 3 : 2
0x02c62 : 0x0026b - 3 : 2
0x02c64 : 0x0027d - 3 : 2
0x0023a : 0x02c65 - 2 : 3
0x0023e : 0x02c66 - 2 : 3
0x02c6d : 0x00251 - 3 : 2
0x02c6e : 0x00271 - 3 : 2
0x02c6f : 0x00250 - 3 : 2
0x02c70 : 0x00252 - 3 : 2
0x02c7e : 0x0023f - 3 : 2
0x02c7f : 0x00240 - 3 : 2
0x0a78d : 0x00265 - 3 : 2
0x0a7aa : 0x00266 - 3 : 2
And this is only with 1:1 mapping.
Generated by new and shiny std.uni:
void main(){
import std.uni, std.utf, std.stdio;
char buf[4];
foreach(dchar ch; unicode.Cased_Letter.byCodepoint){
dchar upper = toUpper(ch);
dchar lower = toLower(ch);
int uLen = encode(buf, upper);
int lLen = encode(buf, lower);
if(uLen != lLen)
writefln("0x%05x : 0x%05x - %d : %d", upper, lower, uLen, lLen);
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You missread me. You are comparing upper case and lower case. I said "original case" to result case.
For example: 0x01e9e uppercase is 0x000df.
But 0x000df's lower case isn't 0x01e9e. It's 0x000de.
Try this main (with your new uni, I wrote the equivalent with extracting the data from file):
import std.uni;
import std.utf;
import std.typetuple;
void main()
{
foreach(Type; TypeTuple!(char, wchar))
{
foreach( dchar ch; 0..0x10FFFF)
{
dchar upper = toUpper(ch);
dchar lower = toLower(ch);
int len = codeLength!Type(ch);
int uLen = codeLength!Type(upper);
int lLen = codeLength!Type(lower);
assert(len >= uLen);
assert(len >= lLen);
}
}
}
As you can see, the encoding of any character's upper or lower is either the same, or smaller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, about your "cased letter", does it take into account characters such as 0x1c5, Dž? They are not upper case, not lower case, but "title case"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay if you don't trust me then:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:toLowercase=%C9%B1:]
So are other latters in the last row
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed cased letter includes all letters that have case. It's strictly superset of what we want to test.
toUpper and toLower won't report wrong values - either the char itself or UPPER case. It may match the title case sometimes but it doesn't matter here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hum... well poop on a stick. Why didn't my own program catch those ? God damn it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guys, include any interesting related links in the documentation. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@monarchdodra
Anyhow me being selfish wants you to think your algorithm ahead in terms of proper toLowerInPlace that will take care of 1:M mappings (that are rare but present). As part of std.uni pull I have to provide both toLower/toUpper so I'd like to do as little tweaking as possible. Will update soonish as I need to weed out mysterious linker failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I did take into account 1:M mappings, as well as title case. However, I just built the algorithm on a false assumption. I'll have to try to see how my program failed (don't have it here :/)
Thankyou for reviewing. As always, you have humbled but educated me in regards to Unicode :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that feeling. If anything work on newer std.uni uni was full of:
Hm what's this means ?
(ffrustrating evening later)
Now I get it !
(few days after "discovery" and more tests)
No apparently not ..
Theme repeated many, many times. I just hope I didn't botched it up in some fundamental, unfixable way.
As I'm certain we will see some bugs and omissions down the road.
OK, well according @blackwhale my implementation is flawed. So I'll have to fix it. One thing it does prove though, is that there are strings that simply can't be upper cased or lower cased in place, as the result string will be longer than the original string. Given this situation, what semantics do we want for InPlace? I can do an implementation that does at most a single GC allocation. Is this OK? |
I was thinking the same when I rewrote toLower/toUpper not ni place, recently see: I've no idea for *inPlace stuff but thought I'd go for "whenever we see it's getting bigger - move forward and calculate full size, then reallocate". |
OK. Should I submit a fixup, or am I going to just get in your way? I suppose your implementation of "toLower" just forwards to "toCase" anyways? As for InPlace, yeah, I can definitely do as you said. I also wanted to fix some "Title case is not correctly taken into account" issues. However, (AFAIK), these don't exist in 5.0, so don't require fixing yet. If your new implementations are correct, then it isn't worth touching anything anyways. Off topic, but shouldn't your new std.uni have "isTitle" and "toTitle" ? |
I'm not sure. If I fix the linker failure on linux/OSX then I'll be able to roll out something tweakable till weekend. That will not inlcude *inPlace as I haven't even started on that in-place algorithm.
Tables for toTitlecase are in place. Somebody asked about it just the other day during voting. isTitle is not but easy to add. |
Bah, I'll just wait. It's not like there is urgency or anything. |
This fixes #10203: toUpperInPlace is not in place. This new implementation is 100% in place, and, AFAIK, efficient.
This also addresses #10204: std.string does not take title case into account. It does not completely fix it though.
Algorithms are brand new, so diff view is mostly irrelevant.