Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always resize input to zero length in String(::Vector{UInt8}) #26093

Merged
merged 2 commits into from
Mar 14, 2018

Conversation

nalimilan
Copy link
Member

This makes the behavior more predictable than only resizing Vector{UInt8} inputs when they have been allocated via StringVector, as the caller may have obtained them from other code without knowing how they were created. This ensures code will not rely on the fact that a copy is made in many common cases. The behavior is also simpler to document.

This is an alternative to @stevengj's #25846 (I included a few changes from there), as discussed on the #strings Slack channel with @JeffBezanson and @StefanKarpinski.

The second commit fixes a bug I discovered when preparing the first one. It's not actually needed for the first one so better avoid squashing it.

@nalimilan nalimilan added domain:docs This change adds or pertains to documentation domain:strings "Strings!" labels Feb 17, 2018
@stevengj
Copy link
Member

stevengj commented Feb 17, 2018

The docstring should not give the misleading implication that String(vector) is zero-copy (as it used to be) except for special cases, however.

@nalimilan
Copy link
Member Author

Indeed. I've rephrased the docstring, let me know what you think.


When possible, the data in `v` will be used directly rather than making
a copy. This is the case for `Vector{UInt8}` arrays returned by [`take!`](@ref)
on a writable [`IOBuffer`](@ref) or by [`read(io, nb)`](@ref). Else, a copy is made.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Else -> Otherwise

When possible, the data in `v` will be used directly rather than making
a copy. This is the case for `Vector{UInt8}` arrays returned by [`take!`](@ref)
on a writable [`IOBuffer`](@ref) or by [`read(io, nb)`](@ref). Else, a copy is made.
In all cases, one should assume that the function takes "ownership" of the array
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"In all cases" seems overly broad here. Why not just say that this is the case for a Vector and leave it at that?

test/arrayops.jl Outdated
@testset "check that resize! is a no-op when size does not change" for T in (Int, UInt8, Int8)
x = Vector{T}(uninitialized, 1)
resize!(x, 0)
unsafe_store!(pointer(x), 1, 1)
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this test is valid --- it does not need to be possible to store an element to an empty array. In the future the array's data pointer could be NULL, or tooling could flag this as a memory error, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's pretty hacky. I guess we can leave this one out and just rely on the String-based test below. I'm not sure whether we can find other cases where this bug would have user-visible consequences we could test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the test and squashed the commits.

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Feb 17, 2018

The text about resizing to zero seems a bit misleading to me. If all it did was keep using the same buffer but truncate it to zero length, then you could still mutate the String's memory by pushing elements to the truncated array. The real point is that the memory buffer of the array is "stolen" by the String object and the array object is given a a new memory buffer, which happens to be empty since that's the most efficient thing to do. In the case of an immutable byte vector, the vector is not truncated afterwards. The real point is that if the vector is mutable, its buffer is taken and the vector is left with a new, empty buffer, preventing modification of the String through the buffer.

Incidentally, this is one of the reasons that having String(v, copy=true) didn't quite sit right with me since in the immutable case, you're not making a copy. What you're really asking for is "please don't empty the argument vector" – whatever it needs to do to accomplish that. Thus, String(v, empty=false) which would make a copy in the case of a mutable vector or just do the usual thing (jut share the memory) in the case of an immutable vector. Of course, if one writes String(copy(v)) and copy(v) is the identity for immutable v then it "just works" and the only issue is avoiding a second copy in the case where copy(v) makes a copy of a mutable vector and String then steals the buffer of that copy. That seems like a better API to me since the optimization is desirable in general.

@nalimilan
Copy link
Member Author

nalimilan commented Feb 17, 2018

Can you suggest a wording? It's hard to describe the behavior in a simple way while still giving enough details for people who are interested in the performance and copying characteristics.

EDIT: feel free to commit changes directly on GitHub if you like.


If you need to subsequently modify `v`, use `String(copy(v))` instead.
Create a new `String` from a byte vector `v` containing UTF-8 encoded
characters. If `v` is a mutable vector, it will be truncated and any future
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

truncated to zero length

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this statement seems false. It only truncates Vector, not arbitrary mutable subtypes of AbstractVector.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior for other types should be implemented to match, but yes, at the moment this may not be 100% so so I guess this docstring is aspirational.

@JeffBezanson
Copy link
Sponsor Member

I agree String(copy(v)) is a good API, but we're not going to be able to optimize it any time soon. String(copy(v)) => String(v) is of course not valid, but nor does String(copy(v)) => String_in_place(copy(v)) work, since copy(v) might not have the right memory layout to be able to support that. We would basically need a special case to replace String(copy(v)) with String(copyto!(StringVector(length(v)), v)).

@stevengj
Copy link
Member

Maybe we should have String(v) always make a copy, and have a new copy-free truncating constructor String!(v) that throws an exception for non-StringVector arrays.

String(v) could print a deprecation message in 0.7 if is called on a StringVector.

@JeffBezanson
Copy link
Sponsor Member

I don't like that so much, since (1) there's no type called String!, (2) the vast majority (pretty much 100%) of uses of String(v) do not use v afterwards, and (3) this is not necessarily a mutating operation. In practice, we would have to tell people "never use String, always use String! instead", because that's the function that does what you want.

One place where something like String! or string! would be valuable though is to replace String(take!(iobuffer)). Perhaps take!(iobuffer, String). That would allow us to avoid an extra allocation since we know the caller doesn't want the vector, just the string underneath, so the buffer can reuse the vector object (just not its storage).

@nalimilan nalimilan force-pushed the nl/String branch 2 times, most recently from e491edc to cabe8ad Compare February 18, 2018 11:47
NEWS.md Outdated
* `String(array)` now accepts an arbitrary `AbstractVector{UInt8}` and "steals" the
memory buffer of mutable arrays, leaving the byte vector with an empty buffer which
is guaranteed not to be shared with the `String` object; if the byte vector is
immutable, it simply shares memory with the string and is not truncated ([#26093]).
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This promises too much --- we don't have the ability to share arbitrary immutable vector memory with strings. It also doesn't steal the memory of arbitrary mutable arrays, only Vector{UInt8}.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you suggest a wording?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've proposed something, let me know what you think.

test/arrayops.jl Outdated
@@ -2342,4 +2342,4 @@ Base.view(::T25958, args...) = args
@test t[end,1,end] == @view(t[end,1,end]) == @views t[end,1,end]
@test t[end,end,1] == @view(t[end,end,1]) == @views t[end,end,1]
@test t[end,end,end] == @view(t[end,end,end]) == @views t[end,end,end]
end
end
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be reverted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops, those damned ending newlines... Fixed.

@stevengj
Copy link
Member

stevengj commented Feb 19, 2018

The basic problem here is that sometimes String makes a copy and sometimes not, and it is hard to determine which occurs when you look at a piece of code. If you had a String! that never makes a copy, and throws an exception if not, while String always makes a copy, it would be a lot easier to understand.

Another problem is that if you want to make a copy, then String(copy(v)) actually makes two copies of the data, since copy doesn't use a StringVector. Whereas if String(v) always made a copy, we could ensure that it only made a single copy.

@JeffBezanson
Copy link
Sponsor Member

In theory that seems nice, but in practice 99% of String(v) calls want the memory-stealing behavior.

@nalimilan nalimilan force-pushed the nl/String branch 2 times, most recently from 9d64ae9 to 2154ea6 Compare February 20, 2018 10:03
@stevengj
Copy link
Member

99% of String(v) calls may want the memory-stealing behavior, but probably less than 99% of them get it (because they allocated v the wrong way). The current inconsistency in the behavior of String (even for v::Vector{UInt8}) is sure to lead to unexpected allocation behavior.

(With String(v) changed to truncate v, that is another reason to rename it String!.)

@nalimilan
Copy link
Member Author

(With String(v) changed to truncate v, that is another reason to rename it String!.)

See also discussion about Hermitian! at #17367.

@JeffBezanson JeffBezanson added the status:triage This should be discussed on a triage call label Mar 10, 2018
@JeffBezanson JeffBezanson self-assigned this Mar 13, 2018
@StefanKarpinski
Copy link
Sponsor Member

I suspect that a very large portion of cases where people write String(v) are already getting the no-copy behavior that they want. Moreover, with this PR, if we find many cases where String(v) is forced to copy, we can fix that problem without breaking anything.

@JeffBezanson JeffBezanson added this to the 1.0 milestone Mar 14, 2018
@JeffBezanson
Copy link
Sponsor Member

At this point I think this should just be rebased and merged.

@StefanKarpinski StefanKarpinski removed the status:triage This should be discussed on a triage call label Mar 14, 2018
This makes the behavior more predictable than only resizing Vector{UInt8}
inputs when they have been allocated via StringVector, as the caller may have
obtained them from other code without knowing how they were created. This ensures
code will not rely on the fact that a copy is made in many common cases.
The behavior is also simpler to document.
Else we set the last element to zero for one-byte element types like UInt8,
which means that resizing a vector allocated with StringVector corrupts it.
Also add redundant checks in Julia code to avoid a function call.
@quinnj
Copy link
Member

quinnj commented Mar 22, 2018

Well, this led to some fun behavior because MbedTLS was calling string on an input Vector{UInt8} (for no reason really); poor web apps that were just trying to hmac their request bodies.....

Hopefully there's not too many cases where this change causes such subtle changes in behavior.

rejuvyesh added a commit to rejuvyesh/MuJoCo.jl that referenced this pull request Aug 10, 2018
Currently because `String` steals the buffer allocated,
we create a copy before conversion.
This may be suboptimal.

See JuliaLang/julia#26093
@kindlychung
Copy link
Contributor

Imagine you are just printing some debug info with String(v)...
Assuming this kind of things only happen 1% of time, it can lead to a lot of frustration.

@StefanKarpinski
Copy link
Sponsor Member

I think the way to get change here is not to comment on this long-closed issue but to open a new issue to discuss the possibility of migrating away from the current behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:docs This change adds or pertains to documentation domain:strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants