rdflib 5.0.0 to 6.2.0: pickled graph size increased #2184

Open
cbartz opened this issue Dec 20, 2022 · 1 comment

cbartz commented Dec 20, 2022

We currently use rdflib to process our thesaurus STW. We use the Graph.parse method to parse the corresponding RDF/XML, and the resulting graph is pickled (either with pickle from the stdlib or with joblib) when saving different types of machine learning models/objects.

I noticed that the size of the models increased a lot when updating from 5.0.0 to 6.2.0. I was able to reproduce this behavior with a simple script:

from pathlib import Path
from pickle import dump

from rdflib import Graph

STW_PATH = Path("/path/to/stw_9.12.rdf")
OUTPUT = Path("/tmp/output_x.y.z.pickle")

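if __name__ == '__main__':
    # Parse the thesaurus into an in-memory graph.
    g = Graph()
    g.parse(str(STW_PATH))

    # Pickle the whole Graph object; the file name records the rdflib version.
    with OUTPUT.open("wb") as f:
        dump(g, f)

$ ll -h /tmp/*.pickle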
-rw-rw-r-- 1 1000 1000 7,2M Dez 20 09:33 /tmp/output_5.0.0.pickle
-rw-rw-r-- 1 1000 1000  21M Dez 20 09:35 /tmp/output_6.2.0.pickle

The size of the stw RDF file is 15 MB.
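To compare pickle sizes across rdflib versions without writing files to disk, a small helper along the following lines could be used. This is only a sketch: `pickled_sizes` is a hypothetical name, and the sample data is a stand-in for the parsed graph, so the numbers it prints say nothing about rdflib itself.

```python
import pickle


def pickled_sizes(obj):
    """Return {protocol: size_in_bytes} for every available pickle protocol."""
    return {
        protocol: len(pickle.dumps(obj, protocol=protocol))
        for protocol in range(pickle.HIGHEST_PROTOCOL + 1)
    }


if __name__ == "__main__":
    # Stand-in data (assumption); in the scenario above this would be the Graph.
    sample = [(f"s{i}", "p", f"o{i}") for i in range(1000)]
    for protocol, size in pickled_sizes(sample).items():
        print(f"protocol {protocol}: {size} bytes")
```

Running the same helper on the parsed graph under each rdflib version would show whether the difference depends on the pickle protocol or only on the library version.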

I wanted to report this behavior because it is clearly a step backwards in terms of disk space used and may be relevant to others as well, although I'm not sure whether you support pickling as a serialization method. I could only find one chapter in your docs, about saving RDF in human-readable formats. However, loading a pickled graph is much faster than parsing one, which is relevant when using a graph in a production system (where launch times matter).

mielvds (Contributor) commented Mar 26, 2023

We also used to pickle Graph objects indirectly through the Prefect workflow framework, and it failed to serialize some graphs. We never found out why it worked for some graphs and not for others; we now use the standard RDF serialization formats, which is indeed slow. I would be interested in contributing an optimized pickle implementation, but I'm not sure what's needed...
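As a starting point for that discussion, here is a generic sketch of one way to control what a pickle contains: persist only the flat triples and rebuild any derived indexes on load. `TinyTripleStore` is a toy, hypothetical class and not rdflib's store; whether rdflib's size increase comes from something like this has not been established in this thread.

```python
import pickle
from collections import defaultdict


class TinyTripleStore:
    """Toy triple store (hypothetical, not rdflib's implementation)."""

    def __init__(self):
        self._triples = set()
        self._by_subject = defaultdict(set)  # derived index
        self._by_object = defaultdict(set)   # derived index

    def add(self, triple):
        s, _, o = triple
        self._triples.add(triple)
        self._by_subject[s].add(triple)
        self._by_object[o].add(triple)

    def __len__(self):
        return len(self._triples)

    def __getstate__(self):
        # Persist only the flat triples; the indexes are derived data.
        return {"triples": sorted(self._triples)}

    def __setstate__(self, state):
        # Rebuild the indexes on load instead of reading them from the pickle.
        self.__init__()
        for triple in state["triples"]:
            self.add(triple)


store = TinyTripleStore()
store.add(("a", "b", "c"))
restored = pickle.loads(pickle.dumps(store))
```

The trade-off is that loading has to rebuild the indexes, so a smaller file may come at the cost of some of the load-time advantage mentioned above.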
