rdflib 5.0.0 to 6.2.0: pickled graph size increased #2184

cbartz · 2022-12-20T09:37:43Z

We currently use rdflib to process our thesaurus, STW. We parse the corresponding RDF/XML file with the Graph.parse method, and the resulting graph is pickled (either with pickle from the stdlib or with joblib) when we save different types of machine learning models/objects.

I noticed that the size of the models increased considerably when updating rdflib from 5.0.0 to 6.2.0. I could reproduce this behavior with a simple script:

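from pathlib import Path
from pickle import dump

from rdflib import Graph

STW_PATH = Path("/path/to/stw_9.12.rdf")
# x.y.z stands for the installed rdflib version (5.0.0 or 6.2.0)
OUTPUT = Path("/tmp/output_x.y.z.pickle")

if __name__ == '__main__':
    # Parse the RDF/XML thesaurus into an in-memory graph
    g = Graph()
    g.parse(str(STW_PATH))

    # Pickle the whole graph with the default pickle protocol
    with OUTPUT.open("wb") as f:
        dump(g, f)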

$ ll -h /tmp/*.pickle
-rw-rw-r-- 1 1000 1000 7,2M Dez 20 09:33 /tmp/output_5.0.0.pickle
-rw-rw-r-- 1 1000 1000  21M Dez 20 09:35 /tmp/output_6.2.0.pickle

The size of the stw RDF file is 15 MB.
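
To narrow down where the difference comes from, it may help to rule out the pickle protocol first, since the default protocol depends on the Python version rather than on rdflib. The following is only a minimal diagnostic sketch using the standard library: `pickled_sizes` is a hypothetical helper name, and the sample dictionary is a stand-in for the parsed graph `g` from the script above.

```python
import pickle


def pickled_sizes(obj):
    """Return {protocol: size_in_bytes} for pickle protocols 2 and newer.

    Hypothetical helper for illustration; protocols 0 and 1 are skipped
    because they are legacy formats.
    """
    return {
        protocol: len(pickle.dumps(obj, protocol=protocol))
        for protocol in range(2, pickle.HIGHEST_PROTOCOL + 1)
    }


if __name__ == "__main__":
    # Stand-in object; replace it with the parsed rdflib Graph to compare versions.
    sample = {"concept": ["label"] * 1000}
    for protocol, size in pickled_sizes(sample).items():
        print(f"protocol {protocol}: {size} bytes")
```

If the sizes for the same protocol still differ between the two rdflib versions, the increase would have to come from what the graph object stores rather than from the pickle format.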

I wanted to report this behavior because it is clearly a step backwards in terms of disk space used and may be relevant to others as well, although I'm not sure whether serialization by pickling is supported by you. I could only find one chapter in your docs, about saving RDF in human-readable formats. However, loading a pickled graph is much faster than parsing a graph, which is relevant when using a graph in a production system (where launch times matter).
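
As a possible stopgap for the disk-space side of the problem, the pickle can be written through gzip. This is a sketch under assumptions, not a fix for the underlying size increase: `dump_compressed` and `load_compressed` are hypothetical helper names, only the standard library is used, and how much space is saved for a real graph (and how much load time it costs) would need to be measured.

```python
import gzip
import pickle
import tempfile
from pathlib import Path


def dump_compressed(obj, path):
    """Pickle obj into a gzip-compressed file and return the file size in bytes."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    return Path(path).stat().st_size


def load_compressed(path):
    """Load an object written by dump_compressed."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # Stand-in object; in the scenario above this would be the parsed Graph.
    sample = {"concept": ["label"] * 1000}
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "graph.pickle.gz"
        size = dump_compressed(sample, target)
        print(f"compressed pickle: {size} bytes")
        assert load_compressed(target) == sample
```

Decompression adds some work at load time, so the trade-off against launch time should be checked before relying on it in production.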


mielvds · 2023-03-26T17:40:19Z

We also used to pickle Graph objects indirectly through the Prefect workflow framework, and it failed to serialize some graphs. We never found out why some graphs worked and others did not; we now use the standard RDF serialization formats, which is indeed slow. I would be interested in contributing an optimized pickle implementation, but I'm not sure what's needed...

aucampia added the performance label Mar 20, 2023
