Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented toUTFz. #123

Merged
merged 8 commits into from Jul 23, 2011
Merged

Implemented toUTFz. #123

merged 8 commits into from Jul 23, 2011

Conversation

jmdavis
Copy link
Member

@jmdavis jmdavis commented Jun 26, 2011

I haven't made std.conv.to use it yet, and I haven't touched toUTF16z or
toStringz at all, but here's an implementation for toUTFz. After this is
in, we can make std.conv.to use it when converting to character
pointers, and we should probably make it so that we have toStringz,
toWstringz, and toDstringz which use it and return immutable character
pointers and get rid of toUTF16z.

I tried to make it so that toUTFz had a default argument for the template parameter for its return type, but I couldn't do it without forcing you to give the string type as well, since it had to reference the string type's template argument, which meant that it had to be after the string types template argument. So, that wasn't acceptable IMHO. It just gives a better reason for toStringz to stick around though, since it can be used for the default case where return an immutable character pointer (though it should be altered to use toUTFz). However, I figured that I'd deal with toStringz, toUTF16z, and std.conv.to after this gets merged in - particularly since we have outstanding pull requests both for std.string and std.conv.to, and I'd prefer not to create more merge headaches.

I haven't made std.conv.to use it yet, and I haven't touched toUTF16z or
toStringz at all, but here's an implementation for toUTFz. After this is
in, we can make std.conv.to use it when converting to character
pointers, and we should probably make it so that we have toStringz,
toWstringz, and toDstringz which use it and return immutable character
pointers and get rid of toUTF16z.
@jmdavis
Copy link
Member Author

jmdavis commented Jun 26, 2011

I do have to ask whether

if((cast(size_t)p & 3) && *p == '\0')

is really the best that we can use. The cast(size_t)p & 3 portion seems to make it fail rather frequently with wstrings and dstrings, even if they'd pass *p == '\0'. I took it and the comment that goes with it from toStringz though, and unless there's another way to do that check which is less conservative but still correct, I don't know how to improve it. It does okay with strings at least, but it would be nice if it could do better. Since, every time that it fails, we have to allocate a new string instead of being able to use the current one.

@ghost
Copy link

ghost commented Jun 26, 2011

Nice work. I've asked for a feature on a related pull so that conv.to can also convert from a zero-terminated char_, wchar_ and dchar* (it only does char* now) to any type of D string. Perhaps we should have a mirrored fromUTFz function which std.conv can call?

But I think someone might already be working on conv.to to have these new additions.

@jmdavis
Copy link
Member Author

jmdavis commented Jun 26, 2011

As I mentioned, the idea is that std.conv.to will be calling toUTFz when converting any of the string types to any of the character pointer types. I just didn't make those changes in this pull request. There are multple pull requests affecting std.conv.to already as it is, and I think that we can wait until after they're dealt with to change std.conv.to for this.

@redstar
Copy link
Contributor

redstar commented Jun 27, 2011

Great work! Thanks for this implementation. I just tried it, and it works really charming.

toUTFz no longer guarantees that the string will remain zero-terminated.
If the string can be zero-terminated but isn't immutable and doesn't
need to be copied to have the requested character pointer type, then it
no longer copies. This means that it's possible to have a string which
is zero-terminated and then stops being zero-terminated if you alter the
character one passed its end, but that's not likely to be an issue in
most cases, and a note in the documentation points it out so that
programmers can know about it and deal with it appropriately.
@jmdavis
Copy link
Member Author

jmdavis commented Jul 13, 2011

Does anyone have anything left to say on this? Based on what's been said about it thus far, it seems like it should be okay to merge it in. So, if no one says anything within the next day or two, then I'm going to merge it in.

@ghost
Copy link

ghost commented Jul 13, 2011

Nope, I'm against these pointer dereference hacks. toStringz has a bug (I don't think it's in bugzilla as I just found it), and toUTFz is going to have the same bug with the "peek for the zero" approach:

import std.string;
import std.stdio;

struct A
{
    immutable char[2] foo;
    char[2] bar;
}

void main()
{
    auto a = A("aa", "\0b");
    auto charptr = toStringz(a.foo[]);

    a.bar = "bo";
    printf(charptr);   // two chars, then garbage
}

Here's ported to your module, same bug:
import std.utf;
import std.stdio;

struct A
{
    immutable char[2] foo;
    char[2] bar;
}

void main()
{
    auto a = A("aa", "\0b");
    auto charptr = toUTFz!(immutable(char)*)(a.foo[]);

    a.bar = "bo";
    printf(charptr);   // two chars, then garbage
}

@jmdavis
Copy link
Member Author

jmdavis commented Jul 13, 2011

Without checking past the end like they do, toUTFz and toStringz would always have to reallocate. Also, note the warning in the documentation:

$(RED Warning 1:) If the result of $(D toUTFz) equals $(D str.ptr), then if
anything alters the character one past the end of $(D str) (which is the
$(D '\0') character terminating the string), then the string won't be
zero-terminated anymore. However, that should only happen when you append to
$(D str) and no reallocation takes place or when $(D str) is a slice of a
larger array, and you alter the character in the larger array which is one
character past the end of $(D str).

Granted, you found a case which I didn't anticipate, and that warning should be adjusted, but the warning does make it clear that there's a possibility that your char* will not stay null-terminated if you alter the element one past its end. You just found another way to do so.

So, I think that this is perfectly acceptable behavior (particularly considering the alternative is to always allocate a new string). However, the warning does need to be improved.

@ghost
Copy link

ghost commented Jul 13, 2011

Well then I'll gladly avoid using Phobos functions. If I need all the performance I can get I'll roll out something of my own. Preferring speed over safety and then adding warnings in the docs like "this might just explode in your face if you do this or that" is just nuts.

Perhaps we should add another attribute, call it @unsafe, so I can easily avoid these types of functions. Can you spot the sarcasm in my post?

@jmdavis
Copy link
Member Author

jmdavis commented Jul 13, 2011

You're talking about a rare case where you have a character array which starts with '\0' residing immediately in memory after the string that you want to be zero-terminated and where rather than immediately using your zero-terminated string, you save it and then alter the character array which follows it in memory. That is a highly abnormal situation.

If you always reallocate rather than checking one past the end for a '\0', then every string which was initialized with a string literal will have to reallocate - a case which is very common.

So, you're asking for a performance penalty which is quite common in order to avoid a potential issue which is extremely uncommon. If you wanted to have a string be zero-terminated and you didn't care about doing it efficiently, all you have to do is append '\0' to it. So really, the primary purpose of toStringz and toUTFz is to try and get you zero-terminated string without any reallocations if possible. So, forcing it to reallocate just because of such a rare issue would pretty much defeat the purpose of having the functions in the first place.

@jmdavis
Copy link
Member Author

jmdavis commented Jul 13, 2011

If there's a way to check whether the element immediately following the array is part of another array and then choose to reallocate in such a case, then that would definitely be an improvement, but I don't know if that's possible. Barring that though, toStringz (and to a lesser extent toUTFz, since it also deals with conversion between string types) seems pretty pointless if you don't check one past the end of the string for '\0'.

I also put @System on the two overloads of toUTFZ which do pointer
arithmetic. They're obviously @System anyway, but tagging them with it
makes it clearer.
@schveiguy
Copy link
Member

Let's not forget that these aren't arrays, they are slices. They could be slices of anything, not just string literals or heap-allocated stuff

First, we can (generally) rule out other threads changing anything during the call to toStringz because the arg is not marked shared. So technically we can do the same thing for both immutable and const (but not shared).

Second, is it reasonable to expect someone to store the result of toStringz, and then possibly modify the data after the slice, then use the original stored result?

i.e. isn't the code Andrej posted overwhelmingly more likely to be:

import std.string;
import std.stdio;

struct A
{
    immutable char[2] foo;
    char[2] bar;
}

void main()
{
    auto a = A("aa", "\0b");

    a.bar = "bo";
    printf(toStringz(a.foo));
}

I mean, I don't think I've ever used toStringz except as a filter for a C function parameter. Andrej, did you find this because you had failing code, or did you find it by constructing a case after reading the toStringz (toUTFz) source? My gut says that a documentation warning is probably enough, despite the theoretical corner case.

As a possible solution, we could provide an optional template parameter to toUTFz, i.e.:

toUTFz(P, bool unsafe = false) ...

One thing that could be done is to use the array management functions to avoid appending a 0 if there's already one there (the array management code knows how big the block is and how much is allocated, so it can verify the 0 without worry), but this still incurs a performance penalty, because the heap block info must be looked up.

@jmdavis
Copy link
Member Author

jmdavis commented Jul 21, 2011

Okay. Given that this implementation of toUTFz is just as good as toStringz (the only problem with it is shared with toStringz), I think that it should be merged in as-is. We can add further improvements to it later, but people have been using toStringz for ages without running into the problem that Andej has brought up, and the warning on toUTFz should make the situation clear so that people can avoid it if it could be a problem.

Personally, I'm not at all familiar with the GC stuff, so I can't easily add checks based on that even if we were certain that that's what we wanted to do. If we decide that that is what we want to do, someone can add them later.

@jmdavis
Copy link
Member Author

jmdavis commented Jul 23, 2011

Okay. I'm merging it in.

jmdavis added a commit that referenced this pull request Jul 23, 2011
@jmdavis jmdavis merged commit 282cd14 into dlang:master Jul 23, 2011
kuettler pushed a commit to kuettler/phobos that referenced this pull request Feb 6, 2018
Add explicit repo name to git fetch and git fetch --tags
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants