Implemented toUTFz. #123

jmdavis · 2011-06-26T09:45:24Z

I haven't made std.conv.to use it yet, and I haven't touched toUTF16z or
toStringz at all, but here's an implementation for toUTFz. After this is
in, we can make std.conv.to use it when converting to character
pointers, and we should probably make it so that we have toStringz,
toWstringz, and toDstringz which use it and return immutable character
pointers and get rid of toUTF16z.

I tried to make it so that toUTFz had a default argument for the template parameter for its return type, but I couldn't do it without forcing you to give the string type as well, since it had to reference the string type's template argument, which meant that it had to be after the string types template argument. So, that wasn't acceptable IMHO. It just gives a better reason for toStringz to stick around though, since it can be used for the default case where return an immutable character pointer (though it should be altered to use toUTFz). However, I figured that I'd deal with toStringz, toUTF16z, and std.conv.to after this gets merged in - particularly since we have outstanding pull requests both for std.string and std.conv.to, and I'd prefer not to create more merge headaches.

I haven't made std.conv.to use it yet, and I haven't touched toUTF16z or toStringz at all, but here's an implementation for toUTFz. After this is in, we can make std.conv.to use it when converting to character pointers, and we should probably make it so that we have toStringz, toWstringz, and toDstringz which use it and return immutable character pointers and get rid of toUTF16z.

jmdavis · 2011-06-26T09:54:45Z

I do have to ask whether

if((cast(size_t)p & 3) && *p == '\0')

is really the best that we can use. The cast(size_t)p & 3 portion seems to make it fail rather frequently with wstrings and dstrings, even if they'd pass *p == '\0'. I took it and the comment that goes with it from toStringz though, and unless there's another way to do that check which is less conservative but still correct, I don't know how to improve it. It does okay with strings at least, but it would be nice if it could do better. Since, every time that it fails, we have to allocate a new string instead of being able to use the current one.

ghost · 2011-06-26T16:39:35Z

Nice work. I've asked for a feature on a related pull so that conv.to can also convert from a zero-terminated char_, wchar_ and dchar* (it only does char* now) to any type of D string. Perhaps we should have a mirrored fromUTFz function which std.conv can call?

But I think someone might already be working on conv.to to have these new additions.

jmdavis · 2011-06-26T17:30:22Z

As I mentioned, the idea is that std.conv.to will be calling toUTFz when converting any of the string types to any of the character pointer types. I just didn't make those changes in this pull request. There are multple pull requests affecting std.conv.to already as it is, and I think that we can wait until after they're dealt with to change std.conv.to for this.

redstar · 2011-06-27T04:12:51Z

Great work! Thanks for this implementation. I just tried it, and it works really charming.

toUTFz no longer guarantees that the string will remain zero-terminated. If the string can be zero-terminated but isn't immutable and doesn't need to be copied to have the requested character pointer type, then it no longer copies. This means that it's possible to have a string which is zero-terminated and then stops being zero-terminated if you alter the character one passed its end, but that's not likely to be an issue in most cases, and a note in the documentation points it out so that programmers can know about it and deal with it appropriately.

jmdavis · 2011-07-13T01:42:11Z

Does anyone have anything left to say on this? Based on what's been said about it thus far, it seems like it should be okay to merge it in. So, if no one says anything within the next day or two, then I'm going to merge it in.

ghost · 2011-07-13T03:08:31Z

Nope, I'm against these pointer dereference hacks. toStringz has a bug (I don't think it's in bugzilla as I just found it), and toUTFz is going to have the same bug with the "peek for the zero" approach:

import std.string;
import std.stdio;

struct A
{
    immutable char[2] foo;
    char[2] bar;
}

void main()
{
    auto a = A("aa", "\0b");
    auto charptr = toStringz(a.foo[]);

    a.bar = "bo";
    printf(charptr);   // two chars, then garbage
}

Here's ported to your module, same bug:
import std.utf;
import std.stdio;

struct A
{
    immutable char[2] foo;
    char[2] bar;
}

void main()
{
    auto a = A("aa", "\0b");
    auto charptr = toUTFz!(immutable(char)*)(a.foo[]);

    a.bar = "bo";
    printf(charptr);   // two chars, then garbage
}

jmdavis · 2011-07-13T03:17:13Z

Without checking past the end like they do, toUTFz and toStringz would always have to reallocate. Also, note the warning in the documentation:

$(RED Warning 1:) If the result of $(D toUTFz) equals $(D str.ptr), then if
anything alters the character one past the end of $(D str) (which is the
$(D '\0') character terminating the string), then the string won't be
zero-terminated anymore. However, that should only happen when you append to
$(D str) and no reallocation takes place or when $(D str) is a slice of a
larger array, and you alter the character in the larger array which is one
character past the end of $(D str).

Granted, you found a case which I didn't anticipate, and that warning should be adjusted, but the warning does make it clear that there's a possibility that your char* will not stay null-terminated if you alter the element one past its end. You just found another way to do so.

So, I think that this is perfectly acceptable behavior (particularly considering the alternative is to always allocate a new string). However, the warning does need to be improved.

ghost · 2011-07-13T03:22:36Z

Well then I'll gladly avoid using Phobos functions. If I need all the performance I can get I'll roll out something of my own. Preferring speed over safety and then adding warnings in the docs like "this might just explode in your face if you do this or that" is just nuts.

Perhaps we should add another attribute, call it @unsafe, so I can easily avoid these types of functions. Can you spot the sarcasm in my post?

jmdavis · 2011-07-13T03:32:37Z

You're talking about a rare case where you have a character array which starts with '\0' residing immediately in memory after the string that you want to be zero-terminated and where rather than immediately using your zero-terminated string, you save it and then alter the character array which follows it in memory. That is a highly abnormal situation.

If you always reallocate rather than checking one past the end for a '\0', then every string which was initialized with a string literal will have to reallocate - a case which is very common.

So, you're asking for a performance penalty which is quite common in order to avoid a potential issue which is extremely uncommon. If you wanted to have a string be zero-terminated and you didn't care about doing it efficiently, all you have to do is append '\0' to it. So really, the primary purpose of toStringz and toUTFz is to try and get you zero-terminated string without any reallocations if possible. So, forcing it to reallocate just because of such a rare issue would pretty much defeat the purpose of having the functions in the first place.

jmdavis · 2011-07-13T03:43:12Z

If there's a way to check whether the element immediately following the array is part of another array and then choose to reallocate in such a case, then that would definitely be an improvement, but I don't know if that's possible. Barring that though, toStringz (and to a lesser extent toUTFz, since it also deals with conversion between string types) seems pretty pointless if you don't check one past the end of the string for '\0'.

@System

I also put @System on the two overloads of toUTFZ which do pointer arithmetic. They're obviously @System anyway, but tagging them with it makes it clearer.

schveiguy · 2011-07-13T18:26:47Z

Let's not forget that these aren't arrays, they are slices. They could be slices of anything, not just string literals or heap-allocated stuff

First, we can (generally) rule out other threads changing anything during the call to toStringz because the arg is not marked shared. So technically we can do the same thing for both immutable and const (but not shared).

Second, is it reasonable to expect someone to store the result of toStringz, and then possibly modify the data after the slice, then use the original stored result?

i.e. isn't the code Andrej posted overwhelmingly more likely to be:

import std.string;
import std.stdio;

struct A
{
    immutable char[2] foo;
    char[2] bar;
}

void main()
{
    auto a = A("aa", "\0b");

    a.bar = "bo";
    printf(toStringz(a.foo));
}

I mean, I don't think I've ever used toStringz except as a filter for a C function parameter. Andrej, did you find this because you had failing code, or did you find it by constructing a case after reading the toStringz (toUTFz) source? My gut says that a documentation warning is probably enough, despite the theoretical corner case.

As a possible solution, we could provide an optional template parameter to toUTFz, i.e.:

toUTFz(P, bool unsafe = false) ...

One thing that could be done is to use the array management functions to avoid appending a 0 if there's already one there (the array management code knows how big the block is and how much is allocated, so it can verify the 0 without worry), but this still incurs a performance penalty, because the heap block info must be looked up.

jmdavis · 2011-07-21T02:24:08Z

Okay. Given that this implementation of toUTFz is just as good as toStringz (the only problem with it is shared with toStringz), I think that it should be merged in as-is. We can add further improvements to it later, but people have been using toStringz for ages without running into the problem that Andej has brought up, and the warning on toUTFz should make the situation clear so that people can avoid it if it could be a problem.

Personally, I'm not at all familiar with the GC stuff, so I can't easily add checks based on that even if we were certain that that's what we wanted to do. If we decide that that is what we want to do, someone can add them later.

jmdavis · 2011-07-23T20:53:50Z

Okay. I'm merging it in.

Implemented toUTFz.

Add explicit repo name to git fetch and git fetch --tags

jmdavis added 6 commits June 29, 2011 19:29

Merge branch 'master' into utfz

aff08b3

Improvement to warning in documentation.

f7d8ca5

Updates to toUTFz per Andrei's suggestions.

6c6a493

Changed toUTFz to use ElementEncodingType instead of typeof(str[0]).

1192616

Merge branch 'master' into utfz

5e79789

Attempt at improving warning on toUTFz.

25cf1cb

I also put @System on the two overloads of toUTFZ which do pointer arithmetic. They're obviously @System anyway, but tagging them with it makes it clearer.

jmdavis added a commit that referenced this pull request Jul 23, 2011

Merge pull request #123 from jmdavis/utfz

282cd14

Implemented toUTFz.

jmdavis merged commit 282cd14 into dlang:master Jul 23, 2011

kuettler pushed a commit to kuettler/phobos that referenced this pull request Feb 6, 2018

Merge pull request dlang#123 from andralex/update

7da36ba

Add explicit repo name to git fetch and git fetch --tags

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implemented toUTFz. #123

Implemented toUTFz. #123

jmdavis commented Jun 26, 2011

jmdavis commented Jun 26, 2011

ghost commented Jun 26, 2011

jmdavis commented Jun 26, 2011

redstar commented Jun 27, 2011

jmdavis commented Jul 13, 2011

ghost commented Jul 13, 2011

jmdavis commented Jul 13, 2011

ghost commented Jul 13, 2011

jmdavis commented Jul 13, 2011

jmdavis commented Jul 13, 2011

schveiguy commented Jul 13, 2011

jmdavis commented Jul 21, 2011

jmdavis commented Jul 23, 2011

Implemented toUTFz. #123

Implemented toUTFz. #123

Conversation

jmdavis commented Jun 26, 2011

jmdavis commented Jun 26, 2011

ghost commented Jun 26, 2011

jmdavis commented Jun 26, 2011

redstar commented Jun 27, 2011

jmdavis commented Jul 13, 2011

ghost commented Jul 13, 2011

jmdavis commented Jul 13, 2011

ghost commented Jul 13, 2011

jmdavis commented Jul 13, 2011

jmdavis commented Jul 13, 2011

schveiguy commented Jul 13, 2011

jmdavis commented Jul 21, 2011

jmdavis commented Jul 23, 2011