RDF terms with object interning support #2972

@edmondchuc

I'm interested in this. I've been playing around with the idea of implementing RDF terms with object interning to save memory and avoid copying. This issue is a continuation of #2866.

In every embarrassingly parallel, distributed ETL job where I've used RDFLib, I've seen memory usage grow over time. Object interning may fix this and let memory be reclaimed once objects are no longer referenced. I think this is also related to the issue described in #740.

The key is to implement RDF terms as immutable data structures. This way, we can safely reuse a reference to the same object whenever the Unicode code point sequence in the term's value is the same.
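To make "same code point sequence" concrete: Python compares `str` values code point by code point, without Unicode normalization, so two strings that render identically can still be distinct keys. A small sketch, assuming term values are used as interning keys unchanged (the variable names are illustrative only):

```python
import unicodedata

# Two spellings of the same visible text: a precomposed "é" versus
# "e" followed by a combining acute accent.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# str equality is code point by code point, so these differ and would be
# interned as two separate terms.
print(precomposed == decomposed)  # False

# They only compare equal after explicit normalization, which a cache keyed
# on the raw value would not apply.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```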

Below is an example Blank Node implementation that uses object interning and takes a lock when accessing the weak-reference cache, so it is thread-safe. Because the cache holds only weak references, memory should be freed once the objects are no longer in use.

import threading
from dataclasses import dataclass, field
from typing import Any, Self, final
from uuid import uuid4
from weakref import WeakValueDictionary


class InternedBlankNode:
    _intern_cache: WeakValueDictionary[str, "Self"] = WeakValueDictionary()
    _lock = threading.Lock()

    __slots__ = ("__weakref__",)

    def __new__(cls, value: str | None = None) -> Self:
        if value is None:
            value = str(uuid4()).replace("-", "0")

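        with cls._lock:
            # Use .get() so the lookup is a single step: an entry can vanish
            # between an `in` check and indexing if its referent is collected.
            instance = cls._intern_cache.get(value)
            if instance is not None:
                return instance

            instance = super().__new__(cls)
            object.__setattr__(instance, "value", value)
            cls._intern_cache[value] = instance
            return instance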


# init=False: ``value`` is set exactly once in InternedBlankNode.__new__.
# A generated __init__ would re-run the default factory for BlankNode() and
# overwrite the value the instance was cached under.
@final
@dataclass(frozen=True, slots=True, init=False)
class BlankNode(InternedBlankNode):
    """
    An RDF blank node representing an anonymous resource.

    Specification: https://www.w3.org/TR/rdf12-concepts/#section-blank-nodes

    This implementation uses object interning to ensure that blank nodes
    with the same identifier reference the same object instance, optimizing
    memory usage. The class is marked final to ensure the
    :py:meth:`InternedBlankNode.__new__` implementation cannot be overridden.

    :param value:
        A blank node identifier. If :py:obj:`None` is provided, an identifier
        will be generated.
    """

    value: str

    def __str__(self) -> str:
        return f"_:{self.value}"

    def __reduce__(self) -> str | tuple[Any, ...]:
        return self.__class__, (self.value,)


__all__ = ["BlankNode"]

And tests:

import pickle

import pytest

from rdf_core.terms import BlankNode


def test_blank_node():
    bnode1 = BlankNode("123")
    bnode2 = BlankNode("123")
    bnode3 = BlankNode("222")

    assert bnode1.value == bnode2.value
    assert bnode1.value != bnode3.value
    assert bnode1 == bnode2
    assert bnode1 != bnode3
    assert bnode1 is bnode2
    assert bnode1 is not bnode3
    assert hash(bnode1) == hash(bnode2)

    bnode4 = BlankNode()
    assert len(bnode4.value) > 0


def test_blank_node_repr_str():
    bnode1 = BlankNode("123")
    assert repr(bnode1) == "BlankNode(value='123')"
    assert str(bnode1) == "_:123"


def test_blank_node_immutability():
    bnode1 = BlankNode("123")
    with pytest.raises(AttributeError):
        bnode1.value = "222"


def test_blank_node_pickling():
    bnode1 = BlankNode("123")
    pickled = pickle.dumps(bnode1)
    unpickled = pickle.loads(pickled)
    assert bnode1 is unpickled
    assert bnode1 == unpickled
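
The claim that the weak-reference cache does not keep interned objects alive can also be checked in isolation. The sketch below is independent of the code above and uses a hypothetical `Token` class as a stand-in for a term. It assumes CPython-style reference counting and calls `gc.collect()` so that other interpreters also get a chance to reclaim the object:

```python
import gc
from weakref import WeakValueDictionary


class Token:
    """Hypothetical stand-in for an interned term."""

    __slots__ = ("value", "__weakref__")

    def __init__(self, value: str) -> None:
        self.value = value


cache: WeakValueDictionary[str, Token] = WeakValueDictionary()

token = Token("b1")
cache["b1"] = token
assert "b1" in cache  # a strong reference (token) still exists

del token     # drop the only strong reference
gc.collect()  # not required on CPython, harmless elsewhere
assert "b1" not in cache  # the entry disappeared together with the object
```

This only shows the cache behaviour. Whether it stops the memory growth seen in real workloads would still need measuring.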
