Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New copystring function #54424

Open
nhz2 opened this issue May 9, 2024 · 13 comments
Open

New copystring function #54424

nhz2 opened this issue May 9, 2024 · 13 comments
Labels
domain:arrays [a, r, r, a, y, s] domain:strings "Strings!"

Comments

@nhz2
Copy link
Contributor

nhz2 commented May 9, 2024

Since calling String on a view of an AbstractVector{UInt8} is potential UB #54372 (comment)
I think it would be helpful to have a function.

copystring(v::AbstractVector{UInt8}) = take_string!(copyto!(StringMemory(length(v)), v)))

I could use this to fix ZipArchives.jl by replacing current uses of the String constructor on vectors of bytes with copystring.

Would this be better in a separate CopyStrings.jl package?

@nhz2 nhz2 added domain:arrays [a, r, r, a, y, s] domain:strings "Strings!" labels May 9, 2024
@nhz2 nhz2 changed the title New copy_string function New copystring function May 10, 2024
@nhz2
Copy link
Contributor Author

nhz2 commented May 10, 2024

Here is an example where using the String constructor on a Vector{UInt8} can result in mutating a String

io = IOBuffer()
write(io, [0x61,0x62,0x63])
seekstart(io)
a = read(io, 3)
b = reshape(a, 1, 3)
c = reshape(b, 3)
s = String(a)
@show s
c[1] = 0x62
@show s

This changes s from "abc" to "bbc"
This happens on Julia 1.6 as well as nightly.

@KristofferC
Copy link
Sponsor Member

KristofferC commented May 10, 2024

copystring sounds like a string is being copied which is confusing.

@nhz2
Copy link
Contributor Author

nhz2 commented May 11, 2024

Maybe bytes2string is a better name?

@jariji
Copy link
Contributor

jariji commented May 11, 2024

Python calls this function bytes.decode.

bytes2string is clear, which is the most important factor imo.

2 names in Base: bytes2hex, deg2rad, hex2bytes, hex2bytes!, rad2deg.
to names in Base: htol, hton, ltoh, ntoh.

It should maybe take a Type{AbstractString} argument so it can support Strs.jl etc.

@nhz2
Copy link
Contributor Author

nhz2 commented May 11, 2024

There is already a transcode(::Type{String}, src) function.

transcode(::Type{String}, src) = String(transcode(UInt8, src))

Maybe this can be changed to avoid mutating the input.

@jakobnissen
Copy link
Contributor

One solution could be to make take! and read on IOBuffer not use Base.StringVector, but allocate a normal vector. This will make using IOBuffer for creating strings less efficient, but that is addressed by the new takestring! function in #54372.

I'm willing to pitch a PR if that's an acceptable solution.

@oscardssmith
Copy link
Member

I think that is a good solution.

@KristofferC
Copy link
Sponsor Member

This will make using IOBuffer for creating strings less efficient,

That seems completely unacceptable? Isn't the whole reason not to make this slow pretty much the reason for the current behavior of String?

@oscardssmith
Copy link
Member

@KristofferC the point is that with this change takestring!(iobuffer) will be as efficient as String(take!(iobuffer)) is now, while String(take!(iobuffer)) will have a copy. The reason for this change is that a single takestring! fully encapsulates the sketchy part where an immutable String is created from mutable data without a copy, while our current idiom for this exposes the mutable object to the user which creates the possibility of the user accidentally holding on to a reference of the mutable data.

With this change in combination with #54372, existing users will see a regression that is fixed by replacing instances of String(take!(iobuffer)) with takestring!(iobuffer) (we will likely make a Compat PR to make takestring! available in older julia versions).

@KristofferC
Copy link
Sponsor Member

he point is that with this change takestring!(iobuffer) will be as efficient as String(take!(iobuffer)) is now, while String(take!(iobuffer)) will have a copy.

The whole reason we didn't do this X years ago was that adding this type of perf regression to code run everywhere was not acceptable. I don't see why it would be now.

@KristofferC
Copy link
Sponsor Member

This is the comment I was thinking about: #24388 (comment)

@KristofferC
Copy link
Sponsor Member

Those who don't know history are destined to repeat it.

Reading through #24388, #25241, #25846, #26093 is probably a good idea for people interested in this to know a bit about the discussing and history that has lead to the current behavior.

@nhz2
Copy link
Contributor Author

nhz2 commented May 13, 2024

String(copy(v)) still makes two copies as noted in #26093 (comment) In ZipArchives.jl I would like to avoid the extra copy. This is currently possible with unsafe_string, so an extra function in Base is not needed, but IMO it would be nice to avoid having to deal with pointers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:arrays [a, r, r, a, y, s] domain:strings "Strings!"
Projects
None yet
Development

No branches or pull requests

5 participants