Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

restore non-copying String behavior, add unsafe_string*, ... #16731

Merged
merged 7 commits into from
Jun 8, 2016

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented Jun 3, 2016

This PR closes #16470 and #16713. It:

  • Restores the old non-copying behavior of the String(a::Vector{UInt8}) constructor, which once again takes ownership of the array. bytestring(a) is now deprecated to String(copy(a)).
  • Renames String(ptr, len) to unsafe_string(ptr, len) (this function makes a copy from a pointer).
  • Renames pointer_to_string(ptr, len) to unsafe_string_wrapper(ptr, len) unsafe_wrap(String, ptr, len), which is hopefully more descriptive and now begins with unsafe as people seem to prefer. (This function does not copy the date, but simply "wraps" a String around the data.)
  • Renames pointer_to_array to unsafe_array_wrapper unsafe_wrap(Array, ...) for consistency.

(Note that I had to work around #16730 to document the String(a::Vector{UInt8}) constructor by calling doc! manually.)

In the course of making this change, I had to manually inspect every call to String(...), and I cleaned up many of the call sites somewhat.

@stevengj stevengj added the domain:strings "Strings!" label Jun 3, 2016
@stevengj
Copy link
Member Author

stevengj commented Jun 3, 2016

Another possible name, based on @nalimilan's suggestion, would be unsafe_wrapper(String, ptr, len) and unsafe_wrapper(Array, ptr, len). I don't have strong feelings on this.

@tkelman
Copy link
Contributor

tkelman commented Jun 3, 2016

If we're going to be renaming both these things, +1 to using a more general unsafe_wrapper(T, ptr, len) which would be extensible.

@@ -159,7 +159,7 @@ function strftime(fmt::AbstractString, tm::TmStruct)
if n == 0
return ""
end
return String(pointer(timestr), n)
return unsafe_string(pointer(timestr), n)
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think String(timestr[1:n]) would be better here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the same pattern is used in several other places. Should they also be changed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably. But there are also a lot of unsafe_string(pointer(buffer)) usages that are not so easily changed, because they use nul-termination to compute the string length.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jun 3, 2016

The unsafe_wrap(String, ptr, len) suggestion sounds good to me. The rest of the PR lgtm.

end
# load file
if !isempty(ARGS) && !isempty(ARGS[1])
# program
repl = false
# remove filename from ARGS
global PROGRAM_FILE = String(shift!(ARGS))
global PROGRAM_FILE = shift!(ARGS)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a typeassert just in case?

Copy link
Member Author

@stevengj stevengj Jun 3, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARGS is a Vector{String}, so a typeassert seems superfluous. The only reason for the String call here was that this used to be UTF8String(shift!(ARGS)) and in those days the conversion was needed to avoid having the type of PROGRAM_FILE depend on the arguments. I think it just got auto search-and-replaced by String when @StefanKarpinski made the switch.

@nalimilan
Copy link
Member

+1 for the cleanup!

I wonder about unsafe_string though, since it has no parallel for arrays. Maybe we should even merge it into unsafe_wrapper via a copy argument? It would be more natural to me, since the two operations are very similar.

This function is labelled "unsafe" because it will crash if `pointer` is not
a valid memory address to data of the requested length.
"""
unsafe_array_wrapper
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cosmetic point: maybe more logical to document function unsafe_array_wrapper end before its methods definitions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't document it before it is defined.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can define it with the zero-method function end syntax though, just for the purposes of documenting

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. (It would be good if this were explained in the manual. There's an obscure comment about the function end syntax being "preferred" with no explanation.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it would. we should write down a list of new things that are poorly documented that you've been noticing.

@stevengj
Copy link
Member Author

stevengj commented Jun 3, 2016

Renamed to unsafe_wrap(T, ptr, ...) for T=Array or T=String.

@StefanKarpinski StefanKarpinski mentioned this pull request Jun 3, 2016
32 tasks
open(fname) do io
Mmap.mmap(io, Vector{UInt8}, (Int(fsz),))
fsz = filesize(input)
if use_mmap && fsz > 0 && fsz < typemax(Int)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is from the original code, but I think the fsz > 0 check is probably because of #10516, so this check can be removed now.

And I don't understand the typemax(Int) check.... is there any circumstance in which fsz > typemax(Int) and readstring(input) will succeed? And why would mmap fail if fsz == typemax(Int)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tanmaykm, you added the original version of this code in #3468, can you comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, or actually it was @timholy in dcb01ae, for #3658. Is this still relevant?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current mmap actually requires len < typemax(Int) - PAGESIZE

Copy link
Member Author

@stevengj stevengj Jun 3, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And why don't we use mmap by default on Windows? ... ah, that is #4664. In general, this code probably needs revisiting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

many people want to see this file removed from base entirely

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes I think this drive to push things out of Base is venturing too much into purism for its own sake. Julia's core market is technical users, and the "batteries included" philosophy is very appealing there. Also, because Julia is so rapidly evolving, it is a struggle to keep packages up to date, whereas including key functionality in Base makes it much easier to update it along with changes to the core language. (It also makes it easier to evaluate changes, because we immediately see their impact.)

But in any case, not a debate for this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we constantly get complaints that readcsv is slow. independent developments outside of base are already better. don't want to get into Python's situation of broken batteries that no one uses because the packages are better and the stdlib is frozen. need to limit scope if julia the language is going to get to 1.0 soon enough for it to succeed outside of academic early adopters.

things being developed in different repos is just a workflow and automation problem that is easily solvable.

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Jun 3, 2016

Thank you, @stevengj! This was a wonderful birthday present. 🎂 🎉 🎈

@stevengj
Copy link
Member Author

stevengj commented Jun 4, 2016

I can't reproduce the Travis method ambiguity error on OSX, and it's weird that this failure doesn't occur on any other platform. Maybe I should just rebase and try again?

@stevengj
Copy link
Member Author

stevengj commented Jun 5, 2016

Yay, tests are green.

@TotalVerb
Copy link
Contributor

Was the original ambiguity problem a Julia bug? That should probably be reported as a separate issue.

@stevengj
Copy link
Member Author

stevengj commented Jun 6, 2016

I'm not actually sure, since I couldn't reproduce it. I could see how the original method definitions might be ambiguous, though I didn't think about it too hard since the code needed simplifying anyway.

@TotalVerb
Copy link
Contributor

I don't understand how this is supposed to work. Is this expected?

julia> f{T}(::Union{Type{Array},Type{Array{T}}}, ::Integer) = 1
f (generic function with 1 method)

julia> f(Array, 1)
ERROR: MethodError: no method matching f(::Type{Array}, ::Int64)
Closest candidates are:
  f{T}(::Union{Type{Array{T,N}},Type{Array}}, ::Integer)
 in eval(::Module, ::Any) at ./boot.jl:225
 in macro expansion at ./REPL.jl:92 [inlined]
 in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:46

julia> f{T}(::Union{Type{Array},Type{Array{T}}}, ::Int) = 2
f (generic function with 2 methods)

julia> f(Array, 1)
ERROR: MethodError: f(::Type{Array}, ::Int64) is ambiguous. Candidates:
  svec(Tuple{#f,Type{Array},Int64},svec(T),f{T}(::Union{Type{Array{T,N<:Any}},Type{Array}}, ::Int64) at REPL[4]:1)
  svec(Tuple{#f,Type{Array},Int64},svec(T),f{T}(::Union{Type{Array{T,N<:Any}},Type{Array}}, ::Integer) at REPL[2]:1)
 in eval(::Module, ::Any) at ./boot.jl:225
 in macro expansion at ./REPL.jl:92 [inlined]
 in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:46

I think what I don't understand here is:

  • why the first method didn't match,
  • and how there can be an ambiguity when the first method didn't match in the first place.

If I'm not missing something obvious here, I'll report it as a separate issue.

@yuyichao
Copy link
Contributor

yuyichao commented Jun 6, 2016

The type parameter cannot be determined.

@yuyichao
Copy link
Contributor

yuyichao commented Jun 6, 2016

Ref #13702

@TotalVerb
Copy link
Contributor

TotalVerb commented Jun 6, 2016

Got it. Though the issue referenced there, #3738, has been closed. I suppose that was a special case, and that case has been fixed?

If these things don't work, maybe this line:

function unsafe_wrap{T,N}(::Union{Type{Array},Type{Array{T}},Type{Array{T,N}}}
                          p::Ptr{T}, dims::NTuple{N,Int}, own::Bool=false)

should be changed also? Or does the presence of T in Ptr{T} resolve the issue?

@yuyichao
Copy link
Contributor

yuyichao commented Jun 6, 2016

This is at least possible to handle and it seems that this is correctly handled on master (which is why #3738 is closed). The case you had above should not work and explaining it better in the doc is why #13702 is still open AFAICT.

Create a string from the address of a C (0-terminated) string encoded as UTF-8.
A copy is made so the pointer can be safely freed. If `length` is specified, the
string does not have to be 0-terminated.
See also [`unsafe_string_wrapper`](:func:`unsafe_string_wrapper`), which takes a pointer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not called that any more

"""
unsafe_wrap(Array, pointer, dims, own=false)

Wrap a native pointer as a Julia `Array `object. The pointer element type determines the array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

``Array object.

@JeffBezanson
Copy link
Sponsor Member

A big +1 to merging this.

…unsafe_string_wrapper, and unsafe_array_wrapper; restore non-copying behavior of String(::Vector{UInt8}) constructor (closes JuliaLang#16470, closes JuliaLang#16713)
…opefully remove occasional method ambiguity errors: overflow is already checked by convert, and sizes ≤ 0 are already checked in jl_ptr_to_array
@stevengj
Copy link
Member Author

stevengj commented Jun 8, 2016

Rebased.

@StefanKarpinski
Copy link
Sponsor Member

I'm guessing that the libgit2 error is unrelated.

@tkelman
Copy link
Contributor

tkelman commented Jun 8, 2016

#16555. merge away

@bkamins
Copy link
Member

bkamins commented Jun 6, 2019

@stevengj This PR in particular introduced https://github.com/JuliaLang/julia/blob/master/stdlib/DelimitedFiles/src/DelimitedFiles.jl#L229

This makes readdlm mutate input if it is Vector{UInt8}. Was it intended or rather String(copy(input)) should be a proper implementation here?

@stevengj
Copy link
Member Author

stevengj commented Jun 6, 2019

@bkamins, String(input) looks correct to me there — there is no need to make a copy of input, because input is never modified during the execution of the function and the string is discarded on return. So avoiding the copy is good.

@bkamins
Copy link
Member

bkamins commented Jun 6, 2019

The problem is that input is modified. See:

julia> using DelimitedFiles

julia> s = """a,b,c
       1,2,3"""
"a,b,c\n1,2,3"

julia> input = Vector{UInt8}(s)
11-element Array{UInt8,1}:
 0x61
 0x2c
 0x62
 0x2c
 0x63
 0x0a
 0x31
 0x2c
 0x32
 0x2c
 0x33

julia> readdlm(input, ',')
2×3 Array{Any,2}:
  "a"   "b"   "c"
 1     2     3

julia> input
0-element Array{UInt8,1}

@stevengj
Copy link
Member Author

stevengj commented Jun 6, 2019

Oh, right, I forgot that String(bytes) nowadays truncates the bytes array.

It seems a shame to make a copy here, since the input could potentially be quite large. We could use unsafe_string(pointer(bytes), length(bytes)) as long as we do GC.@preserve. Alternatively, maybe we should have an unsafe_string(bytes) method that makes a string without truncating bytes. No, unsafe_string makes a copy as well; what we want is more like unsafe_wrap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

String constructor can segfault
10 participants