Description
I'm interested in this. I've been playing around with the idea of implementing RDF terms with object interning to save memory and avoid copying. This issue is a continuation from #2866.
In every embarrassingly parallel, distributed ETL job where I've used RDFLib, I've seen memory usage grow over time. Object interning may fix this and could stop the growth, because the memory can be reclaimed once the objects are no longer referenced. I think this is also related to the issue described in #740.
The key is to implement RDF terms as immutable data structures. This way, we can safely reuse a reference to the same object whenever the Unicode code point sequence in the term's value is the same.
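To make the "same code point sequence" criterion concrete, here is a small standard-library sketch. It assumes the interning key is the raw `str` value with no Unicode normalization applied; the variable names are only illustrative.

```python
import unicodedata

# Two strings that render the same but use different code point sequences.
composed = "caf\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)

# Python compares str values code point by code point, so these differ and
# would map to two distinct interned terms.
assert composed != decomposed
assert len(composed) == 4 and len(decomposed) == 5

# Identical code point sequences compare equal and hash equal, so they would
# share one cache entry and therefore one object.
same = "caf" + chr(0xE9)
assert composed == same
assert hash(composed) == hash(same)

# Normalizing first would merge the two forms. This is an assumption about
# what a caller might choose to do, not something the proposal requires.
assert unicodedata.normalize("NFC", decomposed) == composed
```

Keying on the unnormalized string keeps interning a pure identity optimization: two terms share an object only when they would already compare equal.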
Here is an example of a blank node implementation that uses object interning and takes a lock when accessing the weak-reference cache, so it is thread-safe. Because the cache only holds weak references, memory should be freed once the objects are no longer in use.
```python
import threading
from dataclasses import dataclass, field
from typing import Any, Self, final
from uuid import uuid4
from weakref import WeakValueDictionary


class InternedBlankNode:
    # Weak values: an entry disappears once nothing else references the node.
    _intern_cache: WeakValueDictionary[str, "Self"] = WeakValueDictionary()
    _lock = threading.Lock()
    __slots__ = ("__weakref__",)

    def __new__(cls, value: str | None = None) -> Self:
        if value is None:
            value = str(uuid4()).replace("-", "0")
        with cls._lock:
            # Use .get() so the lookup yields a strong reference in one step;
            # a separate "in" check followed by indexing can raise KeyError if
            # the entry is garbage-collected in between.
            cached = cls._intern_cache.get(value)
            if cached is not None:
                return cached
            instance = super().__new__(cls)
            # Bypass the frozen dataclass __setattr__ to initialise the slot.
            object.__setattr__(instance, "value", value)
            cls._intern_cache[value] = instance
            return instance


@final
@dataclass(frozen=True, slots=True)
class BlankNode(InternedBlankNode):
    """
    An RDF blank node representing an anonymous resource.

    Specification: https://www.w3.org/TR/rdf12-concepts/#section-blank-nodes

    This implementation uses object interning to ensure that blank nodes
    with the same identifier reference the same object instance, optimizing
    memory usage. The class is marked final to ensure the
    :py:meth:`InternedBlankNode.__new__` implementation cannot be overridden.

    :param value:
        A blank node identifier. If :py:obj:`None` is provided, an identifier
        will be generated.
    """

    # Caveat: when no value is passed, the generated __init__ runs this
    # factory again, so the stored value differs from the key cached in
    # __new__ and the auto-generated node is not found by its own value.
    value: str = field(default_factory=lambda: str(uuid4()).replace("-", "0"))

    def __str__(self) -> str:
        return f"_:{self.value}"

    def __reduce__(self) -> str | tuple[Any, ...]:
        # Unpickling calls BlankNode(value), which goes through the cache.
        return self.__class__, (self.value,)


__all__ = ["BlankNode"]
```

And tests:
```python
import pickle

import pytest

from rdf_core.terms import BlankNode


def test_blank_node():
    bnode1 = BlankNode("123")
    bnode2 = BlankNode("123")
    bnode3 = BlankNode("222")
    assert bnode1.value == bnode2.value
    assert bnode1.value != bnode3.value
    assert bnode1 == bnode2
    assert bnode1 != bnode3
    assert bnode1 is bnode2
    assert bnode1 is not bnode3
    assert hash(bnode1) == hash(bnode2)

    bnode4 = BlankNode()
    assert len(bnode4.value) > 0


def test_blank_node_repr_str():
    bnode1 = BlankNode("123")
    assert repr(bnode1) == "BlankNode(value='123')"
    assert str(bnode1) == "_:123"


def test_blank_node_immutability():
    bnode1 = BlankNode("123")
    with pytest.raises(AttributeError):
        bnode1.value = "222"


def test_blank_node_pickling():
    bnode1 = BlankNode("123")
    pickled = pickle.dumps(bnode1)
    unpickled = pickle.loads(pickled)
    assert bnode1 is unpickled
    assert bnode1 == unpickled
```
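The tests above do not exercise the two claims from the description, namely that the cache is safe to use from several threads and that entries are released once a term is no longer referenced. Below is a rough, self-contained sketch of how that could be checked. `InternedTerm` and `intern_from_threads` are hypothetical stand-ins written for this example, not part of RDFLib or of the `rdf_core` module imported above.

```python
import gc
import threading
from weakref import WeakValueDictionary


class InternedTerm:
    """Hypothetical stand-in for an interned RDF term (not RDFLib API)."""

    __slots__ = ("value", "__weakref__")
    _cache: "WeakValueDictionary[str, InternedTerm]" = WeakValueDictionary()
    _lock = threading.Lock()

    def __new__(cls, value: str) -> "InternedTerm":
        with cls._lock:
            cached = cls._cache.get(value)
            if cached is not None:
                return cached
            instance = super().__new__(cls)
            instance.value = value
            cls._cache[value] = instance
            return instance


def intern_from_threads(value: str, n_threads: int = 8) -> list:
    """Construct the same term from several threads and collect the results."""
    results = []

    def worker() -> None:
        results.append(InternedTerm(value))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


terms = intern_from_threads("b1")
# Every thread received the very same object.
assert all(term is terms[0] for term in terms)
assert "b1" in InternedTerm._cache

# Drop the only strong references; the weak cache entry should vanish.
del terms
gc.collect()  # not strictly needed on CPython, but harmless elsewhere
assert "b1" not in InternedTerm._cache
```

On CPython the entry is removed as soon as the last strong reference goes away, thanks to reference counting. Other interpreters may only release it after a garbage collection pass, which is why the sketch calls `gc.collect()` before checking.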