Conversation

@jamesbraza (Collaborator)
As seen in this CI run, we are still hitting the `RuntimeError: storage has wrong byte size: expected %ld got %ld04` that #1144 was battling:

___________________________ test_parse_pdf_to_pages ____________________________
[gw0] linux -- Python 3.11.14 /home/runner/work/paper-qa/paper-qa/.venv/bin/python

    @pytest.mark.asyncio
    async def test_parse_pdf_to_pages() -> None:
        assert isinstance(parse_pdf_to_pages, PDFParserFn)
    
        filepath = STUB_DATA_DIR / "pasa.pdf"
>       parsed_text = parse_pdf_to_pages(filepath)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

packages/paper-qa-docling/tests/test_paperqa_docling.py:26: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
packages/paper-qa-docling/src/paperqa_docling/reader.py:77: in parse_pdf_to_pages
    result = converter.convert(path)
             ^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.11/site-packages/pydantic/_internal/_validate_call.py:39: in wrapper_function
    return wrapper(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.11/site-packages/pydantic/_internal/_validate_call.py:136: in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.11/site-packages/docling/document_converter.py:245: in convert
    return next(all_res)
           ^^^^^^^^^^^^^
.venv/lib/python3.11/site-packages/docling/document_converter.py:268: in convert_all
    for conv_res in conv_res_iter:
.venv/lib/python3.11/site-packages/docling/document_converter.py:340: in _convert
    for item in map(
.venv/lib/python3.11/site-packages/docling/document_converter.py:387: in _process_document
    conv_res = self._execute_pipeline(in_doc, raises_on_error=raises_on_error)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.11/site-packages/docling/document_converter.py:408: in _execute_pipeline
    pipeline = self._get_pipeline(in_doc.format)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.11/site-packages/docling/document_converter.py:370: in _get_pipeline
    self.initialized_pipelines[cache_key] = pipeline_class(
.venv/lib/python3.11/site-packages/docling/pipeline/standard_pdf_pipeline.py:49: in __init__
    ocr_model = self.get_ocr_model(artifacts_path=self.artifacts_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.11/site-packages/docling/pipeline/standard_pdf_pipeline.py:119: in get_ocr_model
    return factory.create_instance(
.venv/lib/python3.11/site-packages/docling/models/factories/base_factory.py:57: in create_instance
    return _cls(options=options, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.11/site-packages/docling/models/auto_ocr_model.py:103: in __init__
    self._engine = RapidOcrModel(
.venv/lib/python3.11/site-packages/docling/models/rapid_ocr_model.py:198: in __init__
    self.reader = RapidOCR(
.venv/lib/python3.11/site-packages/rapidocr/main.py:43: in __init__
    self._initialize(cfg)
.venv/lib/python3.11/site-packages/rapidocr/main.py:73: in _initialize
    self.text_rec = TextRecognizer(cfg.Rec)
                    ^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.11/site-packages/rapidocr/ch_ppocr_rec/main.py:37: in __init__
    self.session = get_engine(cfg.engine_type)(cfg)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.11/site-packages/rapidocr/inference_engine/torch.py:25: in __init__
    self.predictor = self._build_and_load_model(arch_config, model_path)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.11/site-packages/rapidocr/inference_engine/torch.py:69: in _build_and_load_model
    state_dict = torch.load(model_path, map_location="cpu", weights_only=False)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.11/site-packages/torch/serialization.py:1554: in load
    return _legacy_load(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

f = <_io.BufferedReader name='/home/runner/work/paper-qa/paper-qa/.venv/lib/python3.11/site-packages/rapidocr/models/ch_PP-OCRv4_rec_infer.pth'>
map_location = 'cpu'
pickle_module = <module 'pickle' from '/home/runner/work/_temp/uv-python-dir/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/pickle.py'>
pickle_load_args = {'encoding': 'utf-8'}
legacy_load = <function _legacy_load.<locals>.legacy_load at 0x7f810c0c2f20>
persistent_load = <function _legacy_load.<locals>.persistent_load at 0x7f810c0c3100>
f_should_read_directly = True, magic_number = 119547037146038801333356
protocol_version = 1001
_sys_info = {'little_endian': True, 'protocol_version': 1001, 'type_sizes': {'int': 4, 'long': 4, 'short': 2}}
unpickler = <torch.serialization._legacy_load.<locals>.UnpicklerWrapper object at 0x7f810e9df4d0>

    def _legacy_load(f, map_location, pickle_module, **pickle_load_args):
        deserialized_objects: dict[int, Any] = {}
    
        restore_location = _get_restore_location(map_location)
    
        class UnpicklerWrapper(pickle_module.Unpickler):  # type: ignore[name-defined]
            def find_class(self, mod_name, name):
                if type(name) is str and "Storage" in name:
                    try:
                        return StorageType(name)
                    except KeyError:
                        pass
                return super().find_class(mod_name, name)
    
        def _check_container_source(container_type, source_file, original_source):
            try:
                current_source = "".join(get_source_lines_and_file(container_type)[0])
            except Exception:  # saving the source is optional, so we can ignore any errors
                warnings.warn(
                    "Couldn't retrieve source code for container of "
                    "type " + container_type.__name__ + ". It won't be checked "
                    "for correctness upon loading."
                )
                return
            if original_source != current_source:
                if container_type.dump_patches:
                    file_name = container_type.__name__ + ".patch"
                    diff = difflib.unified_diff(
                        current_source.split("\n"),
                        original_source.split("\n"),
                        source_file,
                        source_file,
                        lineterm="",
                    )
                    lines = "\n".join(diff)
                    try:
                        with open(file_name, "a+") as f:
                            file_size = f.seek(0, 2)
                            f.seek(0)
                            if file_size == 0:
                                f.write(lines)
                            elif file_size != len(lines) or f.read() != lines:
                                raise OSError
                        msg = (
                            "Saved a reverse patch to " + file_name + ". "
                            "Run `patch -p0 < " + file_name + "` to revert your "
                            "changes."
                        )
                    except OSError:
                        msg = (
                            "Tried to save a patch, but couldn't create a "
                            "writable file " + file_name + ". Make sure it "
                            "doesn't exist and your working directory is "
                            "writable."
                        )
                else:
                    msg = (
                        "you can retrieve the original source code by "
                        "accessing the object's source attribute or set "
                        "`torch.nn.Module.dump_patches = True` and use the "
                        "patch tool to revert the changes."
                    )
                msg = f"source code of class '{torch.typename(container_type)}' has changed. {msg}"
                warnings.warn(msg, SourceChangeWarning)
    
        def legacy_load(f):
            deserialized_objects: dict[int, Any] = {}
    
            def persistent_load(saved_id):
                if isinstance(saved_id, tuple):
                    # Ignore containers that don't have any sources saved
                    if all(saved_id[1:]):
                        _check_container_source(*saved_id)
                    return saved_id[0]
                return deserialized_objects[int(saved_id)]
    
            with (
                closing(
                    tarfile.open(fileobj=f, mode="r:", format=tarfile.PAX_FORMAT)
                ) as tar,
                mkdtemp() as tmpdir,
            ):
                if pickle_module is _weights_only_unpickler:
                    raise RuntimeError(
                        "Cannot use ``weights_only=True`` with files saved in the "
                        "legacy .tar format. " + UNSAFE_MESSAGE
                    )
                tar.extract("storages", path=tmpdir)
                with open(os.path.join(tmpdir, "storages"), "rb", 0) as f:
                    num_storages = pickle_module.load(f, **pickle_load_args)
                    for _ in range(num_storages):
                        args = pickle_module.load(f, **pickle_load_args)
                        key, location, storage_type = args
                        dtype = storage_type._dtype
                        obj = cast(Storage, torch.UntypedStorage)._new_with_file(
                            f, torch._utils._element_size(dtype)
                        )
                        obj = restore_location(obj, location)
                        # TODO: Once we decide to break serialization FC, we can
                        # stop wrapping with TypedStorage
                        deserialized_objects[key] = torch.storage.TypedStorage(
                            wrap_storage=obj, dtype=dtype, _internal=True
                        )
    
                    storage_views = pickle_module.load(f, **pickle_load_args)
                    for target_cdata, root_cdata, offset, numel in storage_views:
                        root = deserialized_objects[root_cdata]
                        element_size = torch._utils._element_size(root.dtype)
                        offset_bytes = offset * element_size
                        # TODO: Once we decide to break serialization FC, we can
                        # stop wrapping with TypedStorage
                        deserialized_objects[target_cdata] = torch.storage.TypedStorage(
                            wrap_storage=root._untyped_storage[
                                offset_bytes : offset_bytes + numel * element_size
                            ],
                            dtype=root.dtype,
                            _internal=True,
                        )
    
                tar.extract("tensors", path=tmpdir)
                with open(os.path.join(tmpdir, "tensors"), "rb", 0) as f:
                    num_tensors = pickle_module.load(f, **pickle_load_args)
                    for _ in range(num_tensors):
                        args = pickle_module.load(f, **pickle_load_args)
                        key, storage_id, _original_tensor_type = args
                        storage = deserialized_objects[storage_id]
                        (ndim,) = struct.unpack("<i", f.read(4))
                        # skip next 4 bytes; legacy encoding treated ndim as 8 bytes
                        f.read(4)
                        numel = struct.unpack(f"<{ndim}q", f.read(8 * ndim))
                        stride = struct.unpack(f"<{ndim}q", f.read(8 * ndim))
                        (storage_offset,) = struct.unpack("<q", f.read(8))
                        tensor = torch.empty((0,), dtype=storage.dtype).set_(
                            storage._untyped_storage, storage_offset, numel, stride
                        )
                        deserialized_objects[key] = tensor
    
                pickle_file = tar.extractfile("pickle")
                unpickler = UnpicklerWrapper(pickle_file, **pickle_load_args)
                unpickler.persistent_load = persistent_load
                result = unpickler.load()
                return result
    
        deserialized_objects = {}
    
        def persistent_load(saved_id):
            assert isinstance(saved_id, tuple)
            typename = _maybe_decode_ascii(saved_id[0])
            data = saved_id[1:]
    
            if typename == "module":
                # Ignore containers that don't have any sources saved
                if all(data[1:]):
                    _check_container_source(*data)
                return data[0]
            elif typename == "storage":
                storage_type, root_key, location, numel, view_metadata = data
                location = _maybe_decode_ascii(location)
                dtype = storage_type.dtype
    
                nbytes = numel * torch._utils._element_size(dtype)
    
                if root_key not in deserialized_objects:
                    if torch._guards.active_fake_mode() is not None:
                        obj = cast(Storage, torch.UntypedStorage(nbytes, device="meta"))
                    elif _serialization_tls.skip_data:
                        obj = cast(Storage, torch.UntypedStorage(nbytes))
                        obj = restore_location(obj, location)
                    else:
                        obj = cast(Storage, torch.UntypedStorage(nbytes))
                        obj._torch_load_uninitialized = True
                        obj = restore_location(obj, location)
                    # TODO: Once we decide to break serialization FC, we can
                    # stop wrapping with TypedStorage
                    typed_storage = torch.storage.TypedStorage(
                        wrap_storage=obj, dtype=dtype, _internal=True
                    )
                    deserialized_objects[root_key] = typed_storage
                else:
                    typed_storage = deserialized_objects[root_key]
                    if typed_storage._data_ptr() == 0:
                        typed_storage = torch.storage.TypedStorage(
                            device=typed_storage._untyped_storage.device,
                            dtype=dtype,
                            _internal=True,
                        )
    
                if view_metadata is not None:
                    view_key, offset, view_size = view_metadata
                    offset_bytes = offset * torch._utils._element_size(dtype)
                    view_size_bytes = view_size * torch._utils._element_size(dtype)
                    if view_key not in deserialized_objects:
                        # TODO: Once we decide to break serialization FC, we can
                        # stop wrapping with TypedStorage
                        deserialized_objects[view_key] = torch.storage.TypedStorage(
                            wrap_storage=typed_storage._untyped_storage[
                                offset_bytes : offset_bytes + view_size_bytes
                            ],
                            dtype=dtype,
                            _internal=True,
                        )
                    res = deserialized_objects[view_key]
    
                else:
                    res = typed_storage
                return res
            else:
                raise RuntimeError(f"Unknown saved id type: {saved_id[0]}")
    
        _check_seekable(f)
        f_should_read_directly = _should_read_directly(f)
    
        if f_should_read_directly and f.tell() == 0:
            # legacy_load requires that f has fileno()
            # only if offset is zero we can attempt the legacy tar file loader
            try:
                return legacy_load(f)
            except tarfile.TarError:
                if _is_zipfile(f):
                    # .zip is used for torch.jit.save and will throw an un-pickling error here
                    raise RuntimeError(
                        f"{f.name} is a zip archive (did you mean to use torch.jit.load()?)"
                    ) from None
                # if not a tarfile, reset file offset and proceed
                f.seek(0)
    
        magic_number = pickle_module.load(f, **pickle_load_args)
        if magic_number != MAGIC_NUMBER:
            raise RuntimeError("Invalid magic number; corrupt file?")
        protocol_version = pickle_module.load(f, **pickle_load_args)
        if protocol_version != PROTOCOL_VERSION:
            raise RuntimeError(f"Invalid protocol version: {protocol_version}")
    
        _sys_info = pickle_module.load(f, **pickle_load_args)
        unpickler = UnpicklerWrapper(f, **pickle_load_args)
        unpickler.persistent_load = persistent_load
        result = unpickler.load()
    
        deserialized_storage_keys = pickle_module.load(f, **pickle_load_args)
    
        if torch._guards.active_fake_mode() is None and not _serialization_tls.skip_data:
            offset = f.tell() if f_should_read_directly else None
            for key in deserialized_storage_keys:
                assert key in deserialized_objects
                typed_storage = deserialized_objects[key]
>               typed_storage._untyped_storage._set_from_file(
                    f,
                    offset,
                    f_should_read_directly,
                    torch._utils._element_size(typed_storage.dtype),
                )
E               RuntimeError: storage has wrong byte size: expected %ld got %ld04

.venv/lib/python3.11/site-packages/torch/serialization.py:1821: RuntimeError
----------------------------- Captured stderr call -----------------------------
[INFO] 2025-10-18 03:52:58,258 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2025-10-18 03:52:58,262 [RapidOCR] download_file.py:68: Initiating download: https://www.modelscope.cn/models/RapidAI/RapidOCR/resolve/v3.4.0/torch/PP-OCRv4/det/ch_PP-OCRv4_det_infer.pth
[INFO] 2025-10-18 03:53:00,346 [RapidOCR] download_file.py:82: Download size: 13.83MB
[INFO] 2025-10-18 03:53:01,104 [RapidOCR] download_file.py:95: Successfully saved to: /home/runner/work/paper-qa/paper-qa/.venv/lib/python3.11/site-packages/rapidocr/models/ch_PP-OCRv4_det_infer.pth
[INFO] 2025-10-18 03:53:01,106 [RapidOCR] torch.py:54: Using /home/runner/work/paper-qa/paper-qa/.venv/lib/python3.11/site-packages/rapidocr/models/ch_PP-OCRv4_det_infer.pth
[INFO] 2025-10-18 03:53:01,705 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2025-10-18 03:53:01,706 [RapidOCR] download_file.py:68: Initiating download: https://www.modelscope.cn/models/RapidAI/RapidOCR/resolve/v3.4.0/torch/PP-OCRv4/cls/ch_ptocr_mobile_v2.0_cls_infer.pth
[INFO] 2025-10-18 03:53:02,639 [RapidOCR] download_file.py:82: Download size: 0.56MB
[INFO] 2025-10-18 03:53:02,712 [RapidOCR] download_file.py:95: Successfully saved to: /home/runner/work/paper-qa/paper-qa/.venv/lib/python3.11/site-packages/rapidocr/models/ch_ptocr_mobile_v2.0_cls_infer.pth
[INFO] 2025-10-18 03:53:02,714 [RapidOCR] torch.py:54: Using /home/runner/work/paper-qa/paper-qa/.venv/lib/python3.11/site-packages/rapidocr/models/ch_ptocr_mobile_v2.0_cls_infer.pth
[INFO] 2025-10-18 03:53:02,807 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2025-10-18 03:53:02,807 [RapidOCR] download_file.py:68: Initiating download: https://www.modelscope.cn/models/RapidAI/RapidOCR/resolve/v3.4.0/torch/PP-OCRv4/rec/ch_PP-OCRv4_rec_infer.pth
[INFO] 2025-10-18 03:53:04,032 [RapidOCR] download_file.py:82: Download size: 25.67MB
[INFO] 2025-10-18 03:53:05,536 [RapidOCR] download_file.py:95: Successfully saved to: /home/runner/work/paper-qa/paper-qa/.venv/lib/python3.11/site-packages/rapidocr/models/ch_PP-OCRv4_rec_infer.pth
[INFO] 2025-10-18 03:53:05,538 [RapidOCR] torch.py:54: Using /home/runner/work/paper-qa/paper-qa/.venv/lib/python3.11/site-packages/rapidocr/models/ch_PP-OCRv4_rec_infer.pth

Notice the issue: RapidOCR is not using `~/.cache/docling`. I filed docling-project/docling#2500 to report this unexpected behavior, and am manually setting `DOCLING_ARTIFACTS_PATH` in our code in the meantime.
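For reference, a minimal sketch of the interim workaround. The actual PR sets the environment variable in the CI workflow; the cache path and the idea of setting it programmatically before constructing a converter are assumptions for illustration only:

```python
import os
from pathlib import Path

# Workaround sketch (path is an assumption): point Docling's artifacts
# cache, and transitively RapidOCR's model downloads, at the expected
# directory. This must run before any DocumentConverter is constructed,
# since the OCR models are resolved at pipeline initialization time.
os.environ.setdefault(
    "DOCLING_ARTIFACTS_PATH",
    str(Path.home() / ".cache" / "docling" / "models"),
)
```

Using `setdefault` keeps any value already exported by the environment (e.g. the CI workflow) authoritative.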

@jamesbraza self-assigned this Oct 20, 2025
Copilot AI review requested due to automatic review settings October 20, 2025 17:59
@jamesbraza added the "bug (Something isn't working)" label Oct 20, 2025
@dosubot (bot) added the "size:XS (This PR changes 0-9 lines, ignoring generated files)" label Oct 20, 2025
@dosubot (bot) commented Oct 20, 2025:

Documentation Updates: Checked 1 published document(s). No updates required.

Copilot AI (Contributor) commented:
Pull Request Overview

This PR addresses a runtime error in CI tests related to RapidOCR model downloads by explicitly setting the DOCLING_ARTIFACTS_PATH environment variable to ensure models are downloaded to and loaded from the expected cache directory.

Key Changes:

  • Added DOCLING_ARTIFACTS_PATH environment variable to the test workflow to work around RapidOCR not respecting the default Docling cache location


@jamesbraza force-pushed the fixing-docling-cache-again branch from 305ffa6 to 42594a7 on October 20, 2025 18:26
@dosubot (bot) added the "lgtm (This PR has been approved by a maintainer)" label Oct 20, 2025
@jamesbraza merged commit 208d9dd into main Oct 20, 2025 (7 checks passed)
@jamesbraza deleted the fixing-docling-cache-again branch October 20, 2025 18:52