
Memory leak in yajl2_c backend #28

Closed · tomplex opened this issue Apr 21, 2020 · 10 comments
Labels: bug (Something isn't working)

tomplex commented Apr 21, 2020

Opening a 1GB JSON file and iterating over it using ijson.items takes up a large amount of memory which cannot be reclaimed by running the garbage collector.

Software & environment:
ijson==3.0.1, Python 3.8, Debian buster.

Included are a Dockerfile and Python file which fully demonstrate the issue.


# Dockerfile
FROM python:3.8-buster

RUN wget https://usbuildingdata.blob.core.windows.net/usbuildings-v1-1/NewYork.zip

RUN unzip NewYork.zip -d /data
RUN rm NewYork.zip

RUN apt update && apt install -y \
    libyajl2

RUN pip3 install ijson==3.0.1 cffi

COPY main.py /

ENTRYPOINT ["python", "/main.py"]

# main.py
import resource
import gc

import ijson.backends.yajl2_c as ijson
# import ijson.backends.python as ijson
# import ijson.backends.yajl2_cffi as ijson


def memusage():
    """memory usage in GB. getrusage defaults to KB on Linux."""
    return str(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)


def iter_features(filename):
    with open(filename, 'rb') as f:
        yield from ijson.items(f, "features.item")


def main():
    print("using backend", ijson.backend)
    print("starting memory usage:", memusage(), 'GB')

    for feature in iter_features('/data/NewYork.geojson'):
        pass

    print("memory usage after reading file:", memusage(), 'GB')
    gc.collect()
    print("memory usage after garbage collection:", memusage(), 'GB')


if __name__ == '__main__':
    main()

Create a new directory with the Dockerfile and main.py and then run:

$ docker build -t test-ijson .
$ docker run -it test-ijson
using backend yajl2_c
starting memory usage: 0.010604 GB
memory usage after reading file: 5.824332 GB
memory usage after garbage collection: 5.824332 GB

If you change main.py to use the python backend, there is no issue:

using backend python
starting memory usage: 0.01056 GB
memory usage after reading file: 0.01582 GB
memory usage after garbage collection: 0.01582 GB

Same with yajl2_cffi:

using backend yajl2_cffi
starting memory usage: 0.015032 GB
memory usage after reading file: 0.016292 GB
memory usage after garbage collection: 0.016292 GB

Though they are, of course, much slower to test than yajl2_c.

rtobar added a commit that referenced this issue Apr 22, 2020
This problem was initially reported in #28. Since memory leaks are still
checked manually (i.e., there are no automated unit tests that check that
memory consumption remains constant throughout an ijson execution), this
leak was somehow missed until now.

The problem was *not* present in the 2.x series, and this Py_DECREF()
call got lost during the refactoring for version 3.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
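For reference, here is a sketch of the kind of automated check the commit message says is still missing. It is not part of ijson's test suite; it simply reuses the getrusage idea from main.py above and assumes that repeated parses of an array-heavy document should not move the process high-water mark by more than a loose threshold:

# memory_regression_sketch.py (hypothetical, not shipped with ijson)
import io
import json
import resource

import ijson


def maxrss_kb():
    """Process memory high-water mark; ru_maxrss is reported in KB on Linux."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def test_items_memory_stays_bounded():
    doc = json.dumps({"features": [{"id": i} for i in range(10000)]}).encode()
    # Warm up the allocator and any internal caches before taking the baseline.
    list(ijson.items(io.BytesIO(doc), "features.item"))
    baseline = maxrss_kb()
    for _ in range(100):
        list(ijson.items(io.BytesIO(doc), "features.item"))
    # With the leak, every pass retained the parsed items and the high-water
    # mark kept climbing; ~50 MB of slack is generous for a fixed build.
    assert maxrss_kb() - baseline < 50000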

rtobar commented Apr 22, 2020

Thanks a lot @tomplex for the report. The problem wasn't specific to items, but was actually in the underlying parse_basecoro method (so it affected not only items, but also kvitems and parse, and all their *_async and *_coro variants). As stated in the commit message, there are no tests yet to automatically check that memory consumption remains ~constant throughout a given ijson execution, and I usually center my own manual checks around basic_parse, which is why this problem wasn't caught earlier.
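(As a side note, anyone wanting to double-check the other affected entry points against the same file can use a minimal sketch along these lines; the prefixes simply follow the GeoJSON layout from the original report.)

import ijson.backends.yajl2_c as ijson

with open('/data/NewYork.geojson', 'rb') as f:
    # kvitems yields (key, value) pairs from each object under the prefix
    for key, value in ijson.kvitems(f, 'features.item'):
        pass

with open('/data/NewYork.geojson', 'rb') as f:
    # parse yields (prefix, event, value) tuples for the whole document
    for prefix, event, value in ijson.parse(f):
        pass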

I'm actually a bit surprised that this problem wasn't reported earlier, given that keeping memory usage low is the whole point of ijson. But on the other hand the leak was only triggered by array structures, so if your JSON document had few arrays you wouldn't see your memory grow much, while in array-rich documents (like yours) it became more obvious.

BTW the error wasn't present in the 2.X ijson versions, and the leak was introduced by a line of code that got lost during the code refactoring for version 3.

Please give the new code a try (it's already on the master branch), and after confirmation I'll release a new ijson patch version to PyPI.

rtobar commented Apr 22, 2020

Oh, and BTW, the garbage collector bit is unnecessary -- memory was leaked not because of circular references, but simply because reference counts were never decreased. In other words, users shouldn't ever need to call the garbage collector. I take it you added it to your tests to ensure you were doing all you could to clean up memory, but with the fix in place you shouldn't even see your memory grow in the first place.
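To make the distinction concrete, here is a generic CPython snippet (nothing to do with ijson's own code): gc.collect() reclaims objects trapped in reference cycles, whereas ordinary objects are freed the moment their reference count reaches zero, and memory held by a reference that is never released, as in this leak, is beyond the collector's reach.

import gc
import weakref


class Node:
    pass


gc.disable()  # keep the demonstration deterministic

# A reference cycle: the refcounts never reach zero on their own,
# so only the cyclic garbage collector can reclaim these objects.
a, b = Node(), Node()
a.other, b.other = b, a
cycle_ref = weakref.ref(a)
del a, b
print(cycle_ref() is None)   # False: still alive, kept by the cycle
gc.collect()
print(cycle_ref() is None)   # True: the collector broke the cycle

# A plain object is freed as soon as its refcount drops to zero;
# no garbage collection pass is needed for that.
c = Node()
plain_ref = weakref.ref(c)
del c
print(plain_ref() is None)   # True immediately

gc.enable()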

rtobar added the bug label on Apr 22, 2020

tomplex commented Apr 22, 2020

Thanks for the quick response @rtobar! I'm running my test again right now, I'll let you know how it looks.

tomplex commented Apr 22, 2020

Awesome, it looks good! Here's the output:

using backend yajl2_c
starting memory usage: 0.008796 GB
memory usage after reading file: 0.010052 GB
memory usage after garbage collection: 0.010052 GB

As for the garbage collector, you're right, it was just to make sure I was doing everything possible to clean up memory. I appreciate the clarification, though.

Again, thanks for the very quick response. I'm glad I'll be able to switch back to the yajl2_c backend for work tomorrow, because it is way faster. =) thanks for your great work on this project!!

rtobar commented Apr 22, 2020

No worries, I'll close this issue for the time being then and will leave a note once a new version is up in PyPI.

rtobar closed this as completed on Apr 22, 2020

rtobar commented Apr 22, 2020

ijson 3.0.2 is now available in PyPI.

rtobar added a commit that referenced this issue Apr 23, 2020
This problem was caused by calling Py_DECREF() more than once on the
static, global PyObject* "item" variable that we use to keep the "item"
string literal in memory. The situation only arose when the top-level
JSON value was an array, and therefore did not necessarily affect all
users.

More than one of our tests uses arrays as their top-level JSON values
that go through parse_basecoro, so in principle we should have hit the
bug, but it still went unnoticed. This might have been just luck, with
the Python memory allocator playing in our favor and placing valid
objects back into the memory previously used by the "item" variable
before we decreased its reference count again. In order to reproduce the
error more reliably, more tests need to be added that try to hit the
problem with a higher chance.

This problem was originally reported in #29, and was a direct
consequence of the fix introduced to fix #28.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
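A sketch of the kind of extra test the commit message hints at; this is not the actual test added to ijson, just one way of raising the odds that the corrupted memory gets reused and the problem becomes visible:

import io

import ijson


def test_top_level_array_parsed_repeatedly():
    doc = b'[{"n": 0}, {"n": 1}, {"n": 2}]'
    expected = [{"n": 0}, {"n": 1}, {"n": 2}]
    for _ in range(1000):
        # Each pass ends with the top-level array closing, the code path
        # where the extra Py_DECREF() on the "item" variable happened.
        assert list(ijson.items(io.BytesIO(doc), "item")) == expected
        # Churn the allocator so freed memory is more likely to be reused
        # before the next pass, making any corruption visible sooner.
        _ = [object() for _ in range(1000)]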

lopter commented Apr 24, 2020

Hello,

Thank you for maintaining ijson. I wanted to say that this also seems to fix a segfault happening when doing something like:

import ijson

with open("test.json", "rb") as fp:
    l = list(ijson.items(fp, "item"))
print(len(l))

with open("test.json", "rb") as fp:
    l = list(ijson.items(fp, "item"))
print(len(l))

The second call to ijson.items would cause a segfault.

rtobar commented Apr 24, 2020

@lopter interesting. Out of curiosity, are you sure it was fixed in 3.0.2 and not in 3.0.3? In any case it's good to hear the problem is not present anymore.

lopter commented Apr 24, 2020

> @lopter interesting. Out of curiosity, are you sure it was fixed in 3.0.2 and not in 3.0.3? In any case it's good to hear the problem is not present anymore.

Oh yes, that was fixed in 3.0.3, I am now realizing that I should have posted in #29 but got confused.

rtobar commented Apr 25, 2020

Ok, now it makes more sense :-). Yes, that's exactly the type of situation that would have triggered the segfault reported at #29, provided that the top-level JSON value was an array.
