Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

correctly track freed bytes in array_to_string #54309

Merged
merged 1 commit into from
May 1, 2024

Conversation

d-netto
Copy link
Member

@d-netto d-netto commented Apr 29, 2024

Should fix #54275.

@d-netto d-netto added the domain:arrays [a, r, r, a, y, s] label Apr 29, 2024
@gbaraldi
Copy link
Member

I'm a bit confused, because the array you get is empty

julia> a = rand(UInt8, 5)
5-element Vector{UInt8}:
 0xda
 0x73
 0x44
 0x3e
 0xb5

julia> str = String(a)
"\xdasD>\xb5"

julia> a
UInt8[]

So i'm a bit suspicious of how this is implemented. This is moving the memory from the Array to the string, so freeing the string is what should do the change.

@d-netto
Copy link
Member Author

d-netto commented Apr 29, 2024

Yes, that's one source of confusion for me as well. As it is, it seems like jl_array_to_string is calling jl_pchar_to_string which does copy the array into the allocated string.

For some reason, we're still resetting the indices of the Array{UInt8}, even though its buffer has not been moved and is still allocated in the Array{UInt8}.

I see two paths forward:

  • Don't reset the indices of the Array{UInt8} to ensure they actually reflect the fact that the buffer has been copied, and not moved (this PR).
  • Don't allocate a new buffer to the String and actually move the ownership of the buffer of the Array{UInt8} to the String.

@d-netto
Copy link
Member Author

d-netto commented Apr 29, 2024

Three possible paths forward, actually, the third one being what you suggested: implement a function to free the buffer which has been moved and make sure it's correctly accounted in GC live bytes.

@gbaraldi
Copy link
Member

gbaraldi commented Apr 29, 2024

I believe the issue is that ->maxsize shouldn't be reset. The others should be reset.

@d-netto
Copy link
Member Author

d-netto commented Apr 29, 2024

I believe the issue is that ->maxsize shouldn't be reset

Yes, this does solve #54275.

Still, it's a bit odd that constructing a String from an Array{UInt8} takes ownership of the corresponding buffer and resets the array. Copying the buffer and keeping the array intact seems like the less surprising/most intuitive behavior here.

If this is breaking, then yes, let's just solve the accounting bug and keep the observable behavior of String(::Array{UInt8}) as it is.

@oscardssmith
Copy link
Member

I think the current behavior of taking ownership is desired for performance. If you have a 1gb Vector{Uint8} having to copy all that data to get a string will be pretty slow (and double your memory requirement).

@d-netto
Copy link
Member Author

d-netto commented Apr 29, 2024

If you have a 1gb Vector{Uint8} having to copy all that data to get a string will be pretty slow (and double your memory requirement).

If I'm not misunderstanding something, that's precisely what we do in the C Array implementation? Link.

@oscardssmith
Copy link
Member

Right, so if you don't copy the memory, the reason we set maxsize to 0 is to prevent the original array from mutating the string afterward.

@d-netto
Copy link
Member Author

d-netto commented Apr 29, 2024

Right, so if you don't copy the memory, the reason we set maxsize to 0 is to prevent the original array from mutating the string afterward.

I meant to say that at least in String(::Array{UInt8}) we do copy the underlying buffer.

@oscardssmith
Copy link
Member

In that context, I really have no idea why we're doing this (but I want to make sure @vtjnash gets a chance to weigh in here)

@vtjnash
Copy link
Sponsor Member

vtjnash commented Apr 29, 2024

This case is only if you built the array weirdly for constructing a string (eg with pushfirst). If we have to copy the data anyways to shift it, we might as well allocate a new one of the exact right size too

@d-netto d-netto force-pushed the dcn-fix-54275 branch 2 times, most recently from 960e5bc to 8923c66 Compare April 29, 2024 18:52
@d-netto
Copy link
Member Author

d-netto commented Apr 29, 2024

In that case, to avoid changing the behavior of resetting the array, I just added a call (jl_gc_count_freed ) to track the number of freed bytes when we copy such buffer and reset the array.

Does it seem more reasonable to you folks?

@d-netto d-netto added the GC Garbage collector label Apr 29, 2024
@d-netto d-netto changed the title don't reset array length and size fields on array_to_string correctly track freed bytes in array_to_string Apr 29, 2024
@gbaraldi
Copy link
Member

I think your previous change (only maxsize) is more correct here. Because the buffer hasn't been freed yet, we've just resized the array, the memory is still there.

@d-netto d-netto force-pushed the dcn-fix-54275 branch 2 times, most recently from 960e5bc to 7b08f4c Compare April 29, 2024 19:04
@oscardssmith
Copy link
Member

@gbaraldi that fix isn't correct because with it, a push! to the array will expand back into the String's memory which would corrupt it.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Apr 29, 2024

This is moving the memory from the Array to the string, so freeing the string is what should do the change

That is not quite accurate. It is not moving memory, as that memory was always owned by the String, it is simply deleting the reference to that String that used to be part of the Array.

@gbaraldi
Copy link
Member

So we are talking about the case at the bottom where we just make a copy.

@d-netto
Copy link
Member Author

d-netto commented May 1, 2024

Bump.

@oscardssmith oscardssmith merged commit 54614fd into release-1.10 May 1, 2024
6 checks passed
@oscardssmith oscardssmith deleted the dcn-fix-54275 branch May 1, 2024 18:00
@oscardssmith
Copy link
Member

Can you also make the master version of this PR?

d-netto added a commit to RelationalAI/julia that referenced this pull request May 2, 2024
d-netto added a commit to RelationalAI/julia that referenced this pull request May 3, 2024
d-netto added a commit that referenced this pull request May 6, 2024
Master version of #54309.

Should fix #54275.

---------

Co-authored-by: Jameson Nash <vtjnash@gmail.com>
KristofferC pushed a commit that referenced this pull request May 6, 2024
Master version of #54309.

Should fix #54275.

---------

Co-authored-by: Jameson Nash <vtjnash@gmail.com>
(cherry picked from commit 92ccc74)
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request May 9, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request May 19, 2024
gbaraldi pushed a commit that referenced this pull request May 24, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request May 26, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request May 28, 2024
DelveCI pushed a commit to RelationalAI/julia that referenced this pull request May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:arrays [a, r, r, a, y, s] GC Garbage collector
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants