Memory used during parsing never reclaimed #850

@baggepinnen

I have long been trying to find the source of what I suspected was a memory leak somewhere in my data pipeline, and I think I have found a small MWE that reproduces my issue. If the following snippet is run multiple times, the memory use of the julia process increases steadily. If I call GC.gc();GC.gc();GC.gc();GC.gc();GC.gc();GC.gc();, I get some of it back but not all. If I then call CSV.read another, say, 10 times, the memory claimed by the julia process jumps up again, and when I trigger the GC again, I get even less memory back: the julia process now holds on to more of it. I can continue this process until I run out of RAM.

using CSV, DataFrames
logfile = "my_1GB_file.csv"
@time CSV.read(
    logfile,
    DataFrame;
    header = 15,
    datarow = 26,
    drop = (i, name) ->
        startswith(string(name), "Name") || startswith(string(name), "SymbolName"),
    delim = '\t',
    footerskip = 1,
);

Some concrete numbers:

With julia just started and CSV and DataFrames loaded, julia uses 167MB. After the first read, the figure is 1.7GB. Running GC multiple times brings this down to 1.2GB.
Repeating the reading 10 times brings the memory usage to 5.7GB, and triggering GC brings it down to 2.1GB. Why does it not go down back to 1.2GB here?
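
For anyone who wants to collect comparable numbers, here is a minimal sketch of one way to log memory before and after the GC passes. It assumes Linux (it parses /proc/self/status), and the helper names rss_bytes, report and measure are hypothetical, not part of CSV.jl or Base. The workload passed in at the bottom is only a stand-in allocation; the CSV.read call from the snippet above would go in its place.

```julia
# Resident set size of the current process in bytes.
# Assumption: Linux only, since this parses /proc/self/status.
function rss_bytes()
    status = read("/proc/self/status", String)
    m = match(r"VmRSS:\s+(\d+)\s+kB", status)
    m === nothing && error("VmRSS not found; this helper assumes Linux")
    return parse(Int, m.captures[1]) * 1024
end

# Convert bytes to MiB, rounded to one decimal.
mib(x) = round(x / 2^20; digits = 1)

# Print the OS-level RSS next to what the Julia GC considers live.
function report(label)
    println(rpad(label, 20), "RSS = ", mib(rss_bytes()), " MiB, ",
            "GC live = ", mib(Base.gc_live_bytes()), " MiB")
end

# Run `f` `reps` times, then force several full collections,
# reporting memory at each stage.
function measure(f; reps = 10, gc_passes = 5)
    report("baseline")
    for _ in 1:reps
        f()
    end
    report("after $reps runs")
    for _ in 1:gc_passes
        GC.gc(true)  # true requests a full collection
    end
    report("after GC")
end

# Stand-in workload; replace with the CSV.read call from the MWE.
measure(() -> sum(rand(10^6)); reps = 3)
```

Comparing the two columns may help tell apart memory the GC still considers live from memory that has been freed inside julia but not returned to the OS.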

The csv-file I'm using is about 1GB and 160MB zipped, I'd be happy to share it if someone wants to reproduce this issue.
Edit: It's here https://drive.google.com/file/d/1LQSKbDIYHb_N8NqnD40Xw-13V1uTCSOk/view?usp=sharing

I'm running

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4

(@v1.6) pkg> st CSV
      Status `~/.julia/environments/v1.6/Project.toml`
  [336ed68f] CSV v0.8.5

Edit: I've noticed that each read says something like

1.540763 seconds (793.73 k allocations: 1.238 GiB, 2.47% gc time, 19.43% compilation time)

i.e., it always claims a positive compilation time. Is it perhaps the compiled code that eventually eats up memory?


Edit2: The compilation time appears to be due to my drop function. Removing this removes the compilation time, but does not change the memory issues.
