Implemented toUTFz. #123
I haven't made std.conv.to use it yet, and I haven't touched toUTF16z or toStringz at all, but here's an implementation for toUTFz. After this is in, we can make std.conv.to use it when converting to character pointers, and we should probably make it so that we have toStringz, toWstringz, and toDstringz which use it and return immutable character pointers and get rid of toUTF16z.
I do have to ask whether
is really the best that we can use. The |
Nice work. I've asked for a feature on a related pull so that conv.to can also convert from a zero-terminated char*, wchar*, and dchar* (it only does char* now) to any type of D string. Perhaps we should have a mirrored fromUTFz function which std.conv can call? But I think someone might already be working on conv.to to have these new additions.
As I mentioned, the idea is that std.conv.to will be calling |
Great work! Thanks for this implementation. I just tried it, and it works like a charm.
toUTFz no longer guarantees that the string will remain zero-terminated. If the string is already zero-terminated, isn't immutable, and doesn't need to be copied to produce the requested character pointer type, then toUTFz no longer copies it. This means that a string can be zero-terminated and then stop being zero-terminated if you alter the character one past its end, but that's not likely to be an issue in most cases, and a note in the documentation points it out so that programmers can know about it and deal with it appropriately.
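The behavior described above can be modeled in a few lines of C, chosen here only because it makes the memory layout explicit. This is a sketch under stated assumptions, not the Phobos code: the `slice` struct, `to_cstr_peek`, and `length_after_neighbour_write` are hypothetical names, and the caller is assumed to guarantee that the byte one past the slice is readable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a D slice: a pointer plus a length. */
typedef struct {
    char  *ptr;
    size_t len;
} slice;

/*
 * Model of the no-copy shortcut: if the byte one past the end of the
 * slice is already 0, hand back the original pointer; otherwise copy.
 * Assumption: the caller guarantees that ptr[len] is readable memory.
 * *copied reports which path was taken.
 */
static char *to_cstr_peek(slice s, int *copied)
{
    assert(s.ptr != NULL);
    *copied = 0;
    if (s.ptr[s.len] == '\0')
        return s.ptr;               /* borrowed terminator, no allocation */

    char *out = malloc(s.len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, s.ptr, s.len);
    out[s.len] = '\0';
    *copied = 1;
    return out;
}

/*
 * Returns the C-string length seen through the stored pointer after a
 * neighbouring slice of the same buffer has been written to.
 */
static size_t length_after_neighbour_write(void)
{
    char buf[8] = { 'a', 'b', 'c', 'd', 0, 0, 0, 0 };
    slice first  = { buf, 4 };      /* "abcd" */
    slice second = { buf + 4, 4 };  /* starts where first ends */

    int copied = 0;
    char *z = to_cstr_peek(first, &copied);  /* no copy: buf[4] == 0 */
    if (z == NULL)
        return 0;

    second.ptr[0] = 'e';            /* overwrites the borrowed terminator */

    size_t n = strlen(z);           /* the stored pointer now sees "abcde" */
    if (copied)
        free(z);
    return n;
}
```

In this model the stored pointer reads as a 4-character string right after the conversion and as a 5-character string after the neighbouring write, which is the situation the documentation note warns about.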
Does anyone have anything left to say on this? Based on what's been said about it thus far, it seems like it should be okay to merge it in. So, if no one says anything within the next day or two, then I'm going to merge it in.
Nope, I'm against these pointer dereference hacks. toStringz has a bug (I don't think it's in bugzilla as I just found it), and toUTFz is going to have the same bug with the "peek for the zero" approach:
Here's ported to your module, same bug:
Without checking past the end like they do,
Granted, you found a case which I didn't anticipate, and that warning should be adjusted, but the warning does make it clear that there's a possibility that your So, I think that this is perfectly acceptable behavior (particularly considering the alternative is to always allocate a new string). However, the warning does need to be improved. |
Well then I'll gladly avoid using Phobos functions. If I need all the performance I can get I'll roll out something of my own. Preferring speed over safety and then adding warnings in the docs like "this might just explode in your face if you do this or that" is just nuts. Perhaps we should add another attribute, call it @unsafe, so I can easily avoid these types of functions. Can you spot the sarcasm in my post?
You're talking about a rare case where you have a character array which starts with If you always reallocate rather than checking one past the end for a So, you're asking for a performance penalty which is quite common in order to avoid a potential issue which is extremely uncommon. If you wanted to have a string be zero-terminated and you didn't care about doing it efficiently, all you have to do is append |
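For comparison, the always-reallocate approach weighed in the comment above can be sketched as follows. This is a C model rather than D, and `to_cstr_copy` is a hypothetical name, not a Phobos function. It shows where the cost comes from: every call allocates and copies, but the terminator lives in memory that only the result owns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Always-copy conversion: allocates len + 1 bytes, copies the data, and
 * writes the terminator into memory that only the result owns.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static char *to_cstr_copy(const char *ptr, size_t len)
{
    assert(ptr != NULL);
    char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, ptr, len);
    out[len] = '\0';
    return out;
}
```

Later writes to the source buffer, including the byte just past the copied range, cannot affect the returned string. The price is one allocation and one copy per call, even when the source already happens to be followed by a zero.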
If there's a way to check whether the element immediately following the array is part of another array and then choose to reallocate in such a case, then that would definitely be an improvement, but I don't know if that's possible. Barring that though, |
Let's not forget that these aren't arrays, they are slices. They could be slices of anything, not just string literals or heap-allocated stuff.
First, we can (generally) rule out other threads changing anything during the call to
Second, is it reasonable to expect someone to store the result of toStringz, and then possibly modify the data after the slice, then use the original stored result? i.e. isn't the code Andrej posted overwhelmingly more likely to be:
I mean, I don't think I've ever used toStringz except as a filter for a C function parameter. Andrej, did you find this because you had failing code, or did you find it by constructing a case after reading the toStringz (toUTFz) source? My gut says that a documentation warning is probably enough, despite the theoretical corner case. As a possible solution, we could provide an optional template parameter to toUTFz, i.e.:
One thing that could be done is to use the array management functions to avoid appending a 0 if there's already one there (the array management code knows how big the block is and how much is allocated, so it can verify the 0 without worry), but this still incurs a performance penalty, because the heap block info must be looked up.
Okay. Given that this implementation of Personally, I'm not at all familiar with the GC stuff, so I can't easily add checks based on that even if we were certain that that's what we wanted to do. If we decide that that is what we want to do, someone can add them later. |
Okay. I'm merging it in.
I tried to make it so that toUTFz had a default argument for the template parameter for its return type, but I couldn't do it without forcing you to give the string type as well, since it had to reference the string type's template argument, which meant that it had to come after the string type's template argument. So, that wasn't acceptable IMHO. It just gives a better reason for toStringz to stick around though, since it can be used for the default case where you return an immutable character pointer (though it should be altered to use toUTFz). However, I figured that I'd deal with toStringz, toUTF16z, and std.conv.to after this gets merged in, particularly since we have outstanding pull requests both for std.string and std.conv.to, and I'd prefer not to create more merge headaches.