Memory leak in multi-threaded CSV.read #1045

Closed
bkamins opened this issue Oct 23, 2022 · 2 comments


bkamins commented Oct 23, 2022

I do not remember whether this was reported before (similar issues have been), but reading the file instagram_posts.csv, which can be found at https://www.kaggle.com/datasets/shmalex/instagram-dataset, using 4 threads leaks 15 GB of memory: the memory is not released even after destroying all visible variables that reference the data read from the file. In addition, GC.gc() is consistently very slow afterwards.

When doing the same on a single thread, everything is OK: after removing the references to the data read from the file and calling GC.gc(), memory usage returns to its previous level.

Configuration: Win11, Julia 1.8.2, CSV.jl 0.10.4.
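
A minimal sketch of how the check described above could be scripted. The helper name `retained_mib` is hypothetical, and using `Base.gc_live_bytes()` (an unexported Base function) as the measure is an assumption; the original report does not say how memory was measured, and OS-level process memory may show a different picture.

```julia
# Hypothetical helper: run `f`, discard its result, force a full collection,
# and report how many MiB of GC-tracked memory remain above the baseline.
function retained_mib(f)
    GC.gc()
    before = Base.gc_live_bytes()
    f()        # the result is not bound to any variable
    GC.gc()
    after = Base.gc_live_bytes()
    return (after - before) / 2^20
end

# Stand-in workload so the block runs without extra packages or data.
println(retained_mib(() -> zeros(10^6)))

# With CSV.jl and DataFrames.jl installed and the Kaggle file downloaded
# (the local path below is an assumption), the comparison would look like:
#   using CSV, DataFrames
#   retained_mib(() -> CSV.read("instagram_posts.csv", DataFrame))
# run once in a session started with `julia -t 4` and once with `julia -t 1`.
```

A value close to zero would match the single-thread behaviour described above; a large positive value would match the leak.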


quinnj commented Oct 25, 2022

@bkamins, can you check your script/workflow on #1046? I believe we're probably also running into a similar issue in Arrow.jl with multithreaded reading/writing. It's probably also worth considering for DataFrames.jl and any other packages that use Threads.@spawn.


bkamins commented Oct 25, 2022

#1046 resolves the issue.

CC @nalimilan regarding Threads.@spawn in DataFrames.jl
