Optimize load() by avoiding redundant hash checks and unpacking by antonvnv · Pull Request #1081 · NVIDIA-BioNeMo/bionemo-framework

antonvnv · 2025-08-28T00:39:35Z

Description

Pooch by default revalidates file hashes and re-unpacks archives on every call, which is very slow for large checkpoints. This change introduces a .checked marker file that stores the resolved resource path once verification succeeds. Subsequent calls reuse this cached path instead of repeating the expensive validation and extraction steps.

Key changes:

Use a .checked file alongside the cached resource to record the verified path.
Load from the .checked file if it exists, bypassing re-validation.
Ensure .checked is written after successful retrieval/unpacking.
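
The marker-file idea in the list above can be sketched as follows. This is a minimal illustration and not the code merged in this PR: `load_with_marker` and the `fetch` callable are hypothetical names, and `fetch` stands in for the expensive `pooch.retrieve` call.

```python
from pathlib import Path
from typing import Callable


def load_with_marker(cache_dir: Path, key: str, fetch: Callable[[], Path]) -> Path:
    """Return a resource path, running fetch() only on a cache miss.

    `key` names the resource inside the cache; `fetch` is a hypothetical
    stand-in for the download + hash check + unpack step.
    """
    checked = cache_dir / f"{key}.checked"
    if checked.exists():
        # Cache hit: reuse the path recorded after the last successful fetch.
        return Path(checked.read_text().strip())

    path = fetch()  # expensive: download, verify, unpack
    # Record the resolved path only after fetch() has succeeded.
    checked.write_text(str(path))
    return path
```

The second call for the same `key` returns without invoking `fetch` at all, which is where the speedup for large checkpoints would come from.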

Type of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Refactor
Documentation update
Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

SKIP_CI - Skip all continuous integration tests
INCLUDE_NOTEBOOKS_TESTS - Execute notebook validation tests in pytest
INCLUDE_SLOW_TESTS - Execute tests labelled as slow in pytest for extensive testing

Note

By default, the notebooks validation tests are skipped unless explicitly enabled.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources.

If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Usage

# TODO: Add code snippet

Pre-submit Checklist

I have tested these changes locally
I have updated the documentation accordingly
I have added/updated tests as needed
All existing tests pass successfully

Summary by CodeRabbit

New Features
- Cache-based early exit to reuse previously verified and unpacked checkpoints, avoiding redundant downloads.
- Automatic unpacking/decompression during retrieval based on file type.
Performance
- Faster subsequent loads by skipping repeated integrity checks and extraction on cache hits.
Refactor
- Unified post-retrieval path handling across flows; no public API changes.
Chores
- Added debug logs to indicate when cached paths are used for improved traceability.

copy-pr-bot · 2025-08-28T00:39:38Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

pstjohn

LGTM.

jstjohn · 2025-08-29T19:48:55Z

/ok to test b1596ba

coderabbitai · 2025-09-06T02:39:42Z

Walkthrough

Adds a cache-hit early exit using a per-resource ".checked" marker file, integrates a processor with pooch.retrieve for unpack/decompress, unifies post-download path handling, and writes the resolved path to the marker. On cache hit, reads and returns the stored path; otherwise downloads/processes, records, and returns it.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Cache early-exit & processor integration `sub-packages/bionemo-core/src/bionemo/core/data/load.py` | Introduced per-resource `.checked` marker keyed by `<sha256>-<filename>`; on hit, read and return cached path. Added processor to `pooch.retrieve` to manage unpack/decompress. Refactored to compute final path in both unpacked and file cases, then persist it to the marker. Added debug logging for cache usage. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Load as load.py
  participant FS as Cache FS
  participant Pooch as pooch.retrieve
  participant Proc as Processor

  User->>Load: load_resource(spec)
  rect rgba(200,230,255,0.18)
    note over Load: compute fname and .checked marker
    Load->>FS: check .checked exists?
    alt Cache hit
      Load->>FS: read stored path
      Load-->>User: return cached Path
    else Cache miss
      note over Load,Pooch: prepare processor based on extension/settings
      Load->>Pooch: retrieve(url, fname, processor=Proc, ...)
      Pooch-->>Load: downloaded path or list
      Load->>Load: resolve final Path (file or unpacked)
      Load->>FS: write final Path to .checked
      Load-->>User: return final Path
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks (2 passed, 1 warning)

❌ Failed Checks (1 warning)

| Check Name | Status | Explanation | Resolution |
|---|---|---|---|
| Description Check | ⚠️ Warning | The description follows the overall template structure but omits marking the relevant change type in the “Type of changes” section and leaves the Usage snippet as a TODO rather than providing an actual example, so it does not meet the template’s requirements. | Please mark the appropriate box in the Type of changes section (e.g., new feature or refactor) and replace the TODO in the Usage section with a concrete code example showing how to use the updated load() function. |

✅ Passed Checks (2 passed)

| Check Name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title succinctly and accurately describes the primary change to optimize the load() function by skipping redundant hash checks and unpacking, making it clear and specific for reviewers. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

Poem

I cached a carrot in the moonlit heap,
A .checked note to guard my sleepy keep.
If nibble’s stored, I hop right through—no wait!
If not, I fetch, unpack, then annotate.
Burrow paths marked true and neat—now every hop’s a speedy treat! 🥕🐇


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

sub-packages/bionemo-core/src/bionemo/core/data/load.py (1)

228-252: Honor explicit unpack/decompress flags and cover more extensions

Current logic ignores explicit unpack=True/decompress=True for unsupported extensions and misses common variants (.tgz, .tar.bz2, .tar.xz). Clarify precedence and raise on unsupported combos.

Apply:

-def _get_processor(extension: str, unpack: bool | None, decompress: bool | None):
+def _get_processor(extension: str, unpack: bool | None, decompress: bool | None):
@@
-    if extension in {".gz", ".bz2", ".xz"} and decompress is None:
-        return pooch.Decompress()
-
-    elif extension in {".tar", ".tar.gz"} and unpack is None:
-        return pooch.Untar()
-
-    elif extension == ".zip" and unpack is None:
-        return pooch.Unzip()
-
-    else:
-        return None
+    ext = extension.lower()
+    # 1) Respect explicit flags
+    if unpack is True:
+        if ext == ".zip":
+            return pooch.Unzip()
+        if ext.startswith(".tar") or ext in {".tgz", ".tbz2", ".txz"}:
+            return pooch.Untar()
+        raise ValueError(f"unpack=True not supported for extension: {extension}")
+    if decompress is True:
+        # Do not "decompress" tarballs; those should be untarred.
+        if ext in {".gz", ".bz2", ".xz"} and not (ext.startswith(".tar") or ext in {".tgz", ".tbz2", ".txz"}):
+            return pooch.Decompress()
+        raise ValueError(f"decompress=True not supported for extension: {extension}")
+    # 2) Infer when neither flag provided
+    if unpack is None and decompress is None:
+        if ext == ".zip":
+            return pooch.Unzip()
+        if ext.startswith(".tar") or ext in {".tgz", ".tbz2", ".txz"}:
+            return pooch.Untar()
+        if ext in {".gz", ".bz2", ".xz"}:
+            return pooch.Decompress()
+    return None
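
The precedence this comment asks for (explicit flags first, inference second) can be tried out without pooch. In the sketch below, `choose_action` is a hypothetical helper that returns a label instead of a pooch processor, it takes a whole filename rather than the `extension` argument of `_get_processor`, and the set of recognized suffixes is an assumption.

```python
from typing import Optional

TAR_SUFFIXES = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tbz2", ".tar.xz", ".txz")
COMPRESSED_SUFFIXES = (".gz", ".bz2", ".xz")


def choose_action(
    filename: str, unpack: Optional[bool] = None, decompress: Optional[bool] = None
) -> Optional[str]:
    """Return "unzip", "untar", "decompress", or None for the given file."""
    name = filename.lower()
    is_zip = name.endswith(".zip")
    is_tar = name.endswith(TAR_SUFFIXES)
    # A tarball such as x.tar.gz also ends in .gz, so tarballs are excluded here.
    is_compressed = name.endswith(COMPRESSED_SUFFIXES) and not is_tar

    # 1) Explicit requests win, and fail loudly when they cannot be honored.
    if unpack is True:
        if is_zip:
            return "unzip"
        if is_tar:
            return "untar"
        raise ValueError(f"unpack=True not supported for {filename}")
    if decompress is True:
        if is_compressed:
            return "decompress"
        raise ValueError(f"decompress=True not supported for {filename}")

    # 2) Infer from the suffix only when neither flag was given.
    if unpack is None and decompress is None:
        if is_zip:
            return "unzip"
        if is_tar:
            return "untar"
        if is_compressed:
            return "decompress"
    return None
```

Matching on the full filename sidesteps the question of whether the extension arrives as `.gz` or `.tar.gz`, which the suffix-only version has to handle separately.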

🧹 Nitpick comments (3)

sub-packages/bionemo-core/src/bionemo/core/data/load.py (3)
210-210: Cache key changed to include SHA; consider legacy cache fallback

Using fname=f"{resource.sha256}-{filename}" is good for multi-version caches but will invalidate prior caches named only by filename. Consider an optional fallback to the legacy location or note the migration in release notes.
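
A possible shape for the suggested fallback, as a sketch only: `find_cached` is a hypothetical helper and is not part of this PR. It looks for the SHA-prefixed name first and the legacy filename second.

```python
from pathlib import Path
from typing import Optional


def find_cached(cache_dir: Path, sha256: str, filename: str) -> Optional[Path]:
    """Look for a cached file under the SHA-prefixed name, then the legacy name."""
    for name in (f"{sha256}-{filename}", filename):
        candidate = cache_dir / name
        if candidate.exists():
            return candidate
    return None
```

A file found under the legacy name was stored without the hash in its name, so it would still need its hash verified before being trusted.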

220-223: Avoid reliance on non-API attribute; add safe fallback

processor.extract_dir is accessed with # type: ignore. Guard it and fall back to the parent of the first extracted file.

Apply:
-    if isinstance(download, list):
-        path = Path(processor.extract_dir)  # type: ignore
-    else:
-        path = Path(download)
+    if isinstance(download, list):
+        if hasattr(processor, "extract_dir"):
+            path = Path(processor.extract_dir)  # type: ignore[attr-defined]
+        else:
+            path = Path(download[0]).parent
+    else:
+        path = Path(download)
198-207: Add a force/TTL bypass to periodically re-verify integrity

Early exit skips future hash checks entirely. Add a force=False (or recheck_after=timedelta(...)) parameter to load() to recompute hashes periodically or on demand.
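
One way to read this suggestion is to treat the marker as valid only while it is younger than some interval. The helper below is a hypothetical sketch: `marker_is_fresh` is an illustrative name, `force` and `recheck_after` are taken from the comment and are not parameters of the merged `load()`, and the marker's modification time is assumed to be the time of the last verification.

```python
import time
from datetime import timedelta
from pathlib import Path
from typing import Optional


def marker_is_fresh(
    checked: Path, recheck_after: Optional[timedelta] = None, force: bool = False
) -> bool:
    """Return True if the early exit may be taken for this marker."""
    if force or not checked.exists():
        return False
    if recheck_after is None:
        return True  # no TTL configured: the marker never expires
    age_seconds = time.time() - checked.stat().st_mtime
    return age_seconds < recheck_after.total_seconds()
```

When this returns False, the caller would fall through to the full retrieval path and rewrite the marker, which also resets its modification time.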

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between bc468a7 and 72c7036.

📒 Files selected for processing (1)

sub-packages/bionemo-core/src/bionemo/core/data/load.py (2 hunks)

🔇 Additional comments (1)

sub-packages/bionemo-core/src/bionemo/core/data/load.py (1)

181-194: Ignore NGC‐only filename concern; all resources define PBSS
Every YAML under sub-packages/bionemo-core/src/bionemo/core/data/resources with an ngc entry also has a pbss field, so deriving filename from resource.pbss never encounters a missing value.

Likely an incorrect or invalid review comment.

antonvnv · 2025-09-06T02:54:23Z

/ok to test 72c7036

Pooch by default revalidates file hashes and re-unpacks archives on every call, which is very slow for large checkpoints. This change introduces a `.checked` marker file that stores the resolved resource path once verification succeeds. Subsequent calls reuse this cached path instead of repeating the expensive validation and extraction steps.

Key changes:
- Use a `.checked` file alongside the cached resource to record the verified path.
- Load from the `.checked` file if it exists, bypassing re-validation.
- Ensure `.checked` is written after successful retrieval/unpacking.

Signed-off-by: Anton Vorontsov <avorontsov@nvidia.com>

Fixes manual forward unit test:

```
>       ckpt_weights: Path = load(ckpt_name) / "weights"
E       TypeError: unsupported operand type(s) for /: 'str' and 'str'

sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py:505: TypeError
```

Signed-off-by: Anton Vorontsov <avorontsov@nvidia.com>
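
The failure quoted in this commit message is what happens when a path read back from a text file is returned as a plain `str`. A small stand-alone illustration, with a made-up path value:

```python
from pathlib import Path

cached = "/tmp/cache/checkpoint"  # what read_text() yields: a str, not a Path

try:
    cached / "weights"  # str does not implement the "/" operator
except TypeError as err:
    message = str(err)

# Wrapping the string in Path restores path joining.
weights = Path(cached) / "weights"
```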

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

sub-packages/bionemo-core/src/bionemo/core/data/load.py (2)

198-207: Validate and sandbox .checked reads; handle stale/poisoned markers

Reading and trusting arbitrary text from the marker can return a non-existent or out-of-cache path. Validate, sandbox to cache_dir, and clean up bad markers. Also strip newlines and handle read errors.

Apply this diff:

-    if checked.exists():
-        path = checked.read_text()
-        logger.debug(f"Using cached {path=} from {checked=}")
-        return Path(path)
+    if checked.exists():
+        try:
+            cached = checked.read_text().strip()
+            resolved = Path(cached).resolve()
+            cache_root = cache_dir.resolve()
+            if resolved.exists() and (resolved == cache_root or cache_root in resolved.parents):
+                logger.debug(f"Using cached path={resolved} from checked={checked}")
+                return resolved
+            logger.warning(f"Ignoring stale/unsafe marker {checked}; will re-validate.")
+        except OSError as e:
+            logger.warning(f"Failed reading {checked}: {e}; will re-validate.")
+        with contextlib.suppress(OSError):
+            checked.unlink()
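
A stand-alone version of the containment check, for experimentation. `read_marker` is a hypothetical helper; it uses `Path.is_relative_to` (Python 3.9+) in place of the `parents` test in the suggestion above.

```python
from pathlib import Path
from typing import Optional


def read_marker(checked: Path, cache_dir: Path) -> Optional[Path]:
    """Return the recorded path if it exists inside cache_dir, else None.

    A marker that is missing, unreadable, stale, or pointing outside the
    cache is removed so the caller falls back to full retrieval.
    """
    try:
        resolved = Path(checked.read_text().strip()).resolve()
        if resolved.exists() and resolved.is_relative_to(cache_dir.resolve()):
            return resolved
    except OSError:
        pass  # treat read errors like a bad marker
    checked.unlink(missing_ok=True)
    return None
```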

224-225: Make .checked writes atomic to prevent torn reads across processes

Use write-to-temp + fsync + atomic replace.

Apply this diff:

-    checked.write_text(str(path))
-    return path
+    # Atomic marker write to avoid partial reads
+    tmp = checked.with_suffix(checked.suffix + ".tmp")
+    with open(tmp, "w", encoding="utf-8") as f:
+        f.write(str(path))
+        f.flush()
+        os.fsync(f.fileno())
+    os.replace(tmp, checked)
+    return path
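
The same write-then-rename pattern as a stand-alone function. `write_marker_atomically` is a hypothetical name; this variant uses `tempfile.NamedTemporaryFile` in the marker's directory so that concurrent writers do not share one `.tmp` file, which goes beyond the suggestion above.

```python
import os
import tempfile
from pathlib import Path


def write_marker_atomically(checked: Path, path: Path) -> None:
    """Write str(path) to the marker so readers never see a partial file."""
    # Create the temp file next to the marker so the rename stays on one filesystem.
    with tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", dir=checked.parent, delete=False
    ) as f:
        f.write(str(path))
        f.flush()
        os.fsync(f.fileno())
        tmp_name = f.name
    # Swap the finished file into place in a single step.
    os.replace(tmp_name, checked)
```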

🧹 Nitpick comments (1)

sub-packages/bionemo-core/src/bionemo/core/data/load.py (1)
220-223: Be robust if processor lacks extract_dir; fall back to parent of first file

Unzip/Untar return a list and set extract_dir after call. Prefer it when present; otherwise, derive from the first returned path. This guards against custom processors that return lists without extract_dir.

Apply this diff:
-    if isinstance(download, list):
-        path = Path(processor.extract_dir)  # type: ignore
-    else:
-        path = Path(download)
+    if isinstance(download, list):
+        # Prefer processor.extract_dir; else use the parent of the first extracted file.
+        if hasattr(processor, "extract_dir") and getattr(processor, "extract_dir"):
+            path = Path(processor.extract_dir)  # type: ignore[attr-defined]
+        else:
+            path = Path(download[0]).parent
+    else:
+        path = Path(download)
Context: Decompress returns a single file path, while Unzip/Untar return a list and maintain extract_dir. (fatiando.org)

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 72c7036 and 61db514.

📒 Files selected for processing (1)

sub-packages/bionemo-core/src/bionemo/core/data/load.py (2 hunks)

antonvnv · 2025-09-08T21:37:56Z

/ok to test 61db514

antonvnv requested review from DejunL, dorotat-nv, farhadrgh, guoqing-zhou, jstjohn, pstjohn and skothenhill-nv as code owners August 28, 2025 00:39

antonvnv force-pushed the av/faster-load branch from d886a65 to b1596ba Compare August 28, 2025 00:41

jstjohn approved these changes Aug 28, 2025

pstjohn approved these changes Aug 28, 2025

farhadrgh approved these changes Aug 28, 2025

jstjohn enabled auto-merge August 29, 2025 19:49

antonvnv force-pushed the av/faster-load branch from c9c07fc to 72c7036 Compare September 6, 2025 02:40

coderabbitai Bot reviewed Sep 6, 2025

antonvnv added 2 commits September 8, 2025 14:11

antonvnv force-pushed the av/faster-load branch from 72c7036 to 61db514 Compare September 8, 2025 21:11

coderabbitai Bot reviewed Sep 8, 2025

jstjohn added this pull request to the merge queue Sep 8, 2025

Merged via the queue into NVIDIA-BioNeMo:main with commit 0d162d5 Sep 9, 2025
15 checks passed

jstjohn merged 2 commits into NVIDIA-BioNeMo:main from antonvnv:av/faster-load

4 participants
