Description
I have long been trying to find the source of what I suspected was a memory leak somewhere in my data pipeline, and I think I have found a small MWE that reproduces the issue. If the following snippet is run multiple times, the memory use of the julia process increases steadily. If I call GC.gc();GC.gc();GC.gc();GC.gc();GC.gc();GC.gc();, I get some of it back but not all. If I then call CSV.read another, say, 10 times, the memory claimed by the julia process jumps up again, and when I trigger the GC again, I get even less memory back and the julia process holds on to more of it. I can continue this process until I run out of RAM.
using CSV, DataFrames
logfile = "my_1GB_file.csv"
@time CSV.read(
    logfile,
    DataFrame;
    header = 15,
    datarow = 26,
    drop = (i, name) ->
        startswith(string(name), "Name") || startswith(string(name), "SymbolName"),
    delim = '\t',
    footerskip = 1,
);
Some concrete numbers:
With julia just started and CSV and DataFrames loaded, julia uses 167MB; after the first read, the figure is 1.7GB. Running GC multiple times brings this down to 1.2GB.
Repeating the read 10 times brings the memory usage to 5.7GB, and triggering GC brings it down to 2.1GB. Why does it not go back down to 1.2GB here?
The CSV file I'm using is about 1GB (160MB zipped). I'd be happy to share it if someone wants to reproduce this issue.
Edit: It's here https://drive.google.com/file/d/1LQSKbDIYHb_N8NqnD40Xw-13V1uTCSOk/view?usp=sharing
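For anyone who wants to reproduce the numbers, here is a minimal sketch of how the measurement could be scripted rather than read off a system monitor. It records two figures after each call and a full GC: the bytes Julia's GC considers live (Base.gc_live_bytes()) and the resident set size of the process. If the first stays flat while the second keeps growing, the memory is probably not held by reachable Julia objects. The helper names current_rss and measure_repeated are my own, and reading /proc/self/status assumes Linux.

```julia
# Current resident set size of this process in bytes.
# Assumption: Linux with /proc mounted; returns -1 elsewhere.
function current_rss()
    path = "/proc/self/status"
    isfile(path) || return -1
    for line in eachline(path)
        if startswith(line, "VmRSS:")
            # The line looks like "VmRSS:   123456 kB"
            return parse(Int, split(line)[2]) * 1024
        end
    end
    return -1
end

# Call `f` `n` times, run a full GC after each call, and record both figures.
function measure_repeated(f, n)
    stats = NamedTuple{(:live_bytes, :rss_bytes), Tuple{Int, Int}}[]
    for i in 1:n
        f()
        GC.gc()
        push!(stats, (live_bytes = Int(Base.gc_live_bytes()), rss_bytes = current_rss()))
        println("call ", i, ": live heap ", stats[end].live_bytes ÷ 2^20,
                " MiB, RSS ", stats[end].rss_bytes ÷ 2^20, " MiB")
    end
    return stats
end

# Hypothetical usage with the read from this report (drop function omitted):
# measure_repeated(() -> CSV.read(logfile, DataFrame; header = 15, datarow = 26,
#                                 delim = '\t', footerskip = 1), 10)
```

The result of each call is discarded before GC.gc() runs, so any growth that shows up in the RSS column is memory the process keeps even though nothing references the DataFrame anymore.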
I'm running
julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 4
(@v1.6) pkg> st CSV
Status `~/.julia/environments/v1.6/Project.toml`
[336ed68f] CSV v0.8.5
Edit: I've noticed that each read says something like
1.540763 seconds (793.73 k allocations: 1.238 GiB, 2.47% gc time, 19.43% compilation time)
i.e., it always claims a positive compilation time. Is it perhaps the compiled code that eventually eats up memory?
Edit2: The compilation time appears to be due to my drop function. Removing this removes the compilation time, but does not change the memory issues.
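My understanding of why the drop function causes this: an anonymous function written inline is a new function each time the top-level expression is evaluated, so code specialized on it has to be compiled again on every run. A small sketch of the variant that avoids this, where the name drop_names is just something I picked for illustration:

```julia
# Defined once; repeated reads can then reuse the already compiled specialization.
drop_names(i, name) =
    startswith(string(name), "Name") || startswith(string(name), "SymbolName")

# Hypothetical usage, same keyword arguments as in the snippet above:
# CSV.read(logfile, DataFrame; header = 15, datarow = 26, drop = drop_names,
#          delim = '\t', footerskip = 1)
```

This only addresses the reported compilation time. As noted, removing the drop function entirely did not change the memory behaviour.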