Optimize load() by avoiding redundant hash checks and unpacking#1081

Merged
jstjohn merged 2 commits into
NVIDIA-BioNeMo:mainfrom
antonvnv:av/faster-load
Sep 9, 2025
Conversation

@antonvnv
Collaborator

@antonvnv antonvnv commented Aug 28, 2025

Description

Pooch by default revalidates file hashes and re-unpacks archives on every call, which is very slow for large checkpoints. This change introduces a .checked marker file that stores the resolved resource path once verification succeeds. Subsequent calls reuse this cached path instead of repeating the expensive validation and extraction steps.

Key changes:

  • Use a .checked file alongside the cached resource to record the verified path.

  • Load from the .checked file if it exists, bypassing re-validation.

  • Ensure .checked is written after successful retrieval/unpacking.
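
The three key changes above can be sketched as a small stand-alone function. This is a minimal illustration, not the code in `load.py`: `load_with_marker`, its `key` argument, and the `fetch` callback (standing in for download, hash check, and unpacking) are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def load_with_marker(cache_dir: Path, key: str, fetch: Callable[[], Path]) -> Path:
    """Return a resource path, running the expensive fetch only on a cache miss."""
    # Hypothetical marker naming: one ".checked" file per cached resource.
    checked = cache_dir / f"{key}.checked"
    if checked.exists():
        # Cache hit: reuse the path recorded after the last successful fetch.
        return Path(checked.read_text())
    # Cache miss: fetch() stands in for download + hash check + unpacking.
    path = fetch()
    # The marker is written only after fetch() returned without raising.
    checked.write_text(str(path))
    return path
```

A second call with the same `key` returns the recorded path without invoking `fetch` again, which is the speed-up described above.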

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

Note

By default, the notebooks validation tests are skipped unless explicitly enabled.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Usage

# TODO: Add code snippet

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

Summary by CodeRabbit

  • New Features
    • Cache-based early exit to reuse previously verified and unpacked checkpoints, avoiding redundant downloads.
    • Automatic unpacking/decompression during retrieval based on file type.
  • Performance
    • Faster subsequent loads by skipping repeated integrity checks and extraction on cache hits.
  • Refactor
    • Unified post-retrieval path handling across flows; no public API changes.
  • Chores
    • Added debug logs to indicate when cached paths are used for improved traceability.

@copy-pr-bot

copy-pr-bot Bot commented Aug 28, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Collaborator

@pstjohn pstjohn left a comment


LGTM.

@jstjohn
Collaborator

jstjohn commented Aug 29, 2025

/ok to test b1596ba

@jstjohn jstjohn enabled auto-merge August 29, 2025 19:49
@coderabbitai
Contributor

coderabbitai Bot commented Sep 6, 2025

Walkthrough

Adds a cache-hit early exit using a per-resource ".checked" marker file, integrates a processor with pooch.retrieve for unpack/decompress, unifies post-download path handling, and writes the resolved path to the marker. On cache hit, reads and returns the stored path; otherwise downloads/processes, records, and returns it.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Cache early-exit & processor integration: `sub-packages/bionemo-core/src/bionemo/core/data/load.py` | Introduced per-resource `.checked` marker keyed by `<sha256>-<filename>`; on hit, read and return cached path. Added processor to `pooch.retrieve` to manage unpack/decompress. Refactored to compute final path in both unpacked and file cases, then persist it to the marker. Added debug logging for cache usage. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Load as load.py
  participant FS as Cache FS
  participant Pooch as pooch.retrieve
  participant Proc as Processor

  User->>Load: load_resource(spec)
  rect rgba(200,230,255,0.18)
    note over Load: compute fname and .checked marker
    Load->>FS: check .checked exists?
    alt Cache hit
      Load->>FS: read stored path
      Load-->>User: return cached Path
    else Cache miss
      note over Load,Pooch: prepare processor based on extension/settings
      Load->>Pooch: retrieve(url, fname, processor=Proc, ...)
      Pooch-->>Load: downloaded path or list
      Load->>Load: resolve final Path (file or unpacked)
      Load->>FS: write final Path to .checked
      Load-->>User: return final Path
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks (2 passed, 1 warning)

❌ Failed Checks (1 warning)

| Check Name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Check | ⚠️ Warning | The description follows the overall template structure but omits marking the relevant change type in the “Type of changes” section and leaves the Usage snippet as a TODO rather than providing an actual example, so it does not meet the template’s requirements. | Please mark the appropriate box in the Type of changes section (e.g., new feature or refactor) and replace the TODO in the Usage section with a concrete code example showing how to use the updated load() function. |

✅ Passed Checks (2 passed)

| Check Name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title succinctly and accurately describes the primary change to optimize the load() function by skipping redundant hash checks and unpacking, making it clear and specific for reviewers. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

Poem

I cached a carrot in the moonlit heap,
A .checked note to guard my sleepy keep.
If nibble’s stored, I hop right through—no wait!
If not, I fetch, unpack, then annotate.
Burrow paths marked true and neat—now every hop’s a speedy treat! 🥕🐇


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
sub-packages/bionemo-core/src/bionemo/core/data/load.py (1)

228-252: Honor explicit unpack/decompress flags and cover more extensions

Current logic ignores explicit unpack=True/decompress=True for unsupported extensions and misses common variants (.tgz, .tar.bz2, .tar.xz). Clarify precedence and raise on unsupported combos.

Apply:

-def _get_processor(extension: str, unpack: bool | None, decompress: bool | None):
+def _get_processor(extension: str, unpack: bool | None, decompress: bool | None):
@@
-    if extension in {".gz", ".bz2", ".xz"} and decompress is None:
-        return pooch.Decompress()
-
-    elif extension in {".tar", ".tar.gz"} and unpack is None:
-        return pooch.Untar()
-
-    elif extension == ".zip" and unpack is None:
-        return pooch.Unzip()
-
-    else:
-        return None
+    ext = extension.lower()
+    # 1) Respect explicit flags
+    if unpack is True:
+        if ext == ".zip":
+            return pooch.Unzip()
+        if ext.startswith(".tar") or ext in {".tgz", ".tbz2", ".txz"}:
+            return pooch.Untar()
+        raise ValueError(f"unpack=True not supported for extension: {extension}")
+    if decompress is True:
+        # Do not "decompress" tarballs; those should be untarred.
+        if ext in {".gz", ".bz2", ".xz"} and not (ext.startswith(".tar") or ext in {".tgz", ".tbz2", ".txz"}):
+            return pooch.Decompress()
+        raise ValueError(f"decompress=True not supported for extension: {extension}")
+    # 2) Infer when neither flag provided
+    if unpack is None and decompress is None:
+        if ext == ".zip":
+            return pooch.Unzip()
+        if ext.startswith(".tar") or ext in {".tgz", ".tbz2", ".txz"}:
+            return pooch.Untar()
+        if ext in {".gz", ".bz2", ".xz"}:
+            return pooch.Decompress()
+    return None
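
The precedence proposed in this suggestion (explicit flags first, inference only when neither flag is given) can be exercised without pooch installed. The sketch below mirrors that logic but returns plain labels instead of `pooch.Unzip()`, `pooch.Untar()`, or `pooch.Decompress()` instances; `choose_processor` and the two extension sets are hypothetical names, not part of `load.py`.

```python
from typing import Optional

# Extension sets are assumptions chosen to match the variants named in the review.
TAR_EXTENSIONS = {".tar", ".tar.gz", ".tar.bz2", ".tar.xz", ".tgz", ".tbz2", ".txz"}
COMPRESSED_EXTENSIONS = {".gz", ".bz2", ".xz"}


def choose_processor(
    extension: str,
    unpack: Optional[bool] = None,
    decompress: Optional[bool] = None,
) -> Optional[str]:
    """Return "unzip", "untar", "decompress", or None for the given extension."""
    ext = extension.lower()
    # 1) Explicit flags win, and unsupported combinations raise.
    if unpack is True:
        if ext == ".zip":
            return "unzip"
        if ext in TAR_EXTENSIONS:
            return "untar"
        raise ValueError(f"unpack=True not supported for extension: {extension}")
    if decompress is True:
        if ext in COMPRESSED_EXTENSIONS:
            return "decompress"
        raise ValueError(f"decompress=True not supported for extension: {extension}")
    # 2) Infer from the extension only when neither flag was provided.
    if unpack is None and decompress is None:
        if ext == ".zip":
            return "unzip"
        if ext in TAR_EXTENSIONS:
            return "untar"
        if ext in COMPRESSED_EXTENSIONS:
            return "decompress"
    return None
```

Note that, as in the suggested diff, passing `unpack=False` alone disables inference entirely, so a `.gz` file is then returned untouched.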
🧹 Nitpick comments (3)
sub-packages/bionemo-core/src/bionemo/core/data/load.py (3)

210-210: Cache key changed to include SHA; consider legacy cache fallback

Using fname=f"{resource.sha256}-{filename}" is good for multi-version caches but will invalidate prior caches named only by filename. Consider an optional fallback to the legacy location or note the migration in release notes.
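
One way such a fallback could look is sketched below, purely as an illustration: `find_cached` is a hypothetical helper, and it only locates a file. A file found under the legacy name has not been verified against the hash, so a real implementation would still need to validate it before trusting it.

```python
from pathlib import Path
from typing import Optional


def find_cached(cache_dir: Path, sha256: str, filename: str) -> Optional[Path]:
    """Prefer the new "<sha256>-<filename>" entry, then the legacy "<filename>" one."""
    for name in (f"{sha256}-{filename}", filename):
        candidate = cache_dir / name
        if candidate.exists():
            return candidate
    return None  # neither naming scheme has a cached copy
```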


220-223: Avoid reliance on non-API attribute; add safe fallback

processor.extract_dir is accessed with # type: ignore. Guard it and fall back to the parent of the first extracted file.

Apply:

-    if isinstance(download, list):
-        path = Path(processor.extract_dir)  # type: ignore
-    else:
-        path = Path(download)
+    if isinstance(download, list):
+        if hasattr(processor, "extract_dir"):
+            path = Path(processor.extract_dir)  # type: ignore[attr-defined]
+        else:
+            path = Path(download[0]).parent
+    else:
+        path = Path(download)

198-207: Add a force/TTL bypass to periodically re-verify integrity

Early exit skips future hash checks entirely. Add a force=False (or recheck_after=timedelta(...)) parameter to load() to recompute hashes periodically or on demand.
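
A rough sketch of how such a bypass might decide whether to trust the marker, using the marker file's modification time as the "last verified" timestamp. `marker_is_fresh`, `recheck_after_s`, and the `now` override (included only to make the function easy to test) are hypothetical; the review names `force` and `recheck_after` but does not specify their behavior.

```python
import time
from pathlib import Path
from typing import Optional


def marker_is_fresh(
    checked: Path,
    recheck_after_s: Optional[float] = None,
    force: bool = False,
    now: Optional[float] = None,
) -> bool:
    """Return True if the marker can be trusted without re-verifying the resource."""
    if force or not checked.exists():
        return False  # forced re-check, or nothing was ever verified
    if recheck_after_s is None:
        return True  # no TTL configured: trust the marker indefinitely
    current = time.time() if now is None else now
    age = current - checked.stat().st_mtime  # seconds since the marker was written
    return age < recheck_after_s
```

When this returns `False`, the caller would fall through to the normal retrieval path, which re-validates the hash and rewrites the marker.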

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between bc468a7 and 72c7036.

📒 Files selected for processing (1)
  • sub-packages/bionemo-core/src/bionemo/core/data/load.py (2 hunks)
🔇 Additional comments (1)
sub-packages/bionemo-core/src/bionemo/core/data/load.py (1)

181-194: Ignore NGC‐only filename concern; all resources define PBSS
Every YAML under sub-packages/bionemo-core/src/bionemo/core/data/resources with an ngc entry also has a pbss field, so deriving filename from resource.pbss never encounters a missing value.

Likely an incorrect or invalid review comment.

@antonvnv
Collaborator Author

antonvnv commented Sep 6, 2025

/ok to test 72c7036

Pooch by default revalidates file hashes and re-unpacks archives on
every call, which is very slow for large checkpoints. This change
introduces a `.checked` marker file that stores the resolved resource
path once verification succeeds. Subsequent calls reuse this cached path
instead of repeating the expensive validation and extraction steps.

Key changes:

- Use a `.checked` file alongside the cached resource to record the
  verified path.

- Load from the `.checked` file if it exists, bypassing re-validation.

- Ensure `.checked` is written after successful retrieval/unpacking.

Signed-off-by: Anton Vorontsov <avorontsov@nvidia.com>

Fixes manual forward unit test:

```
>           ckpt_weights: Path = load(ckpt_name) / "weights"
E           TypeError: unsupported operand type(s) for /: 'str' and 'str'

sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py:505: TypeError
```

Signed-off-by: Anton Vorontsov <avorontsov@nvidia.com>
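
The traceback above is what Python raises when the `/` operator is applied to two plain strings. One plausible cause (an assumption, the commit does not say) is a code path that returned the text read from the marker file without wrapping it in `Path`. The snippet reproduces the error with a hypothetical cached value and shows that wrapping it restores path joining.

```python
from pathlib import Path

cached = "/cache/checkpoint"  # hypothetical text read back from a marker file

try:
    cached / "weights"  # str has no "/" operator
except TypeError as err:
    message = str(err)

# Wrapping the string in Path makes "/" join path components.
weights = Path(cached) / "weights"
```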
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
sub-packages/bionemo-core/src/bionemo/core/data/load.py (2)

198-207: Validate and sandbox .checked reads; handle stale/poisoned markers

Reading and trusting arbitrary text from the marker can return a non-existent or out-of-cache path. Validate, sandbox to cache_dir, and clean up bad markers. Also strip newlines and handle read errors.

Apply this diff:

-    if checked.exists():
-        path = checked.read_text()
-        logger.debug(f"Using cached {path=} from {checked=}")
-        return Path(path)
+    if checked.exists():
+        try:
+            cached = checked.read_text().strip()
+            resolved = Path(cached).resolve()
+            cache_root = cache_dir.resolve()
+            if resolved.exists() and (resolved == cache_root or cache_root in resolved.parents):
+                logger.debug(f"Using cached path={resolved} from checked={checked}")
+                return resolved
+            logger.warning(f"Ignoring stale/unsafe marker {checked}; will re-validate.")
+        except OSError as e:
+            logger.warning(f"Failed reading {checked}: {e}; will re-validate.")
+        with contextlib.suppress(OSError):
+            checked.unlink()
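
The same validation can be written as a stand-alone helper and tried outside `load()`. This is a sketch under assumptions: `read_marker` is a hypothetical name, and treating a missing, unreadable, stale, or out-of-cache marker uniformly as a cache miss is one possible policy, not necessarily what was merged.

```python
import contextlib
from pathlib import Path
from typing import Optional


def read_marker(checked: Path, cache_dir: Path) -> Optional[Path]:
    """Return the recorded path if it exists inside cache_dir, else drop the marker."""
    try:
        resolved = Path(checked.read_text().strip()).resolve()
        cache_root = cache_dir.resolve()
        # Accept only existing paths at or below the cache root.
        if resolved.exists() and (resolved == cache_root or cache_root in resolved.parents):
            return resolved
    except (OSError, UnicodeDecodeError):
        pass  # missing or unreadable marker: fall through and treat as a miss
    # Stale, unsafe, or unreadable marker: remove it so the next call re-validates.
    with contextlib.suppress(OSError):
        checked.unlink()
    return None
```

A `None` result tells the caller to take the normal retrieval path, which re-checks the hash and writes a new marker.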

224-225: Make .checked writes atomic to prevent torn reads across processes

Use write-to-temp + fsync + atomic replace.

Apply this diff:

-    checked.write_text(str(path))
-    return path
+    # Atomic marker write to avoid partial reads
+    tmp = checked.with_suffix(checked.suffix + ".tmp")
+    with open(tmp, "w", encoding="utf-8") as f:
+        f.write(str(path))
+        f.flush()
+        os.fsync(f.fileno())
+    os.replace(tmp, checked)
+    return path
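
A self-contained variant of this write-to-temp, fsync, replace pattern is sketched below. It differs from the suggested diff in one respect, which is an assumption on my part rather than something the review asks for: the temporary file gets a unique name from `tempfile.mkstemp`, so two processes writing the same marker do not share one `.tmp` file. `write_marker_atomically` is a hypothetical name.

```python
import contextlib
import os
import tempfile
from pathlib import Path


def write_marker_atomically(checked: Path, path: Path) -> None:
    """Write str(path) to the marker so readers never see a partial file."""
    # Create the temp file in the marker's directory so os.replace stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=checked.parent, prefix=checked.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(str(path))
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before publishing
        os.replace(tmp_name, checked)  # atomic rename over any existing marker
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(OSError):
            os.unlink(tmp_name)
        raise
```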
🧹 Nitpick comments (1)
sub-packages/bionemo-core/src/bionemo/core/data/load.py (1)

220-223: Be robust if processor lacks extract_dir; fall back to parent of first file

Unzip/Untar return a list and set extract_dir after call. Prefer it when present; otherwise, derive from the first returned path. This guards against custom processors that return lists without extract_dir.

Apply this diff:

-    if isinstance(download, list):
-        path = Path(processor.extract_dir)  # type: ignore
-    else:
-        path = Path(download)
+    if isinstance(download, list):
+        # Prefer processor.extract_dir; else use the parent of the first extracted file.
+        if hasattr(processor, "extract_dir") and getattr(processor, "extract_dir"):
+            path = Path(processor.extract_dir)  # type: ignore[attr-defined]
+        else:
+            path = Path(download[0]).parent
+    else:
+        path = Path(download)

Context: Decompress returns a single file path, while Unzip/Untar return a list and maintain extract_dir. (fatiando.org)
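
The path resolution discussed in this comment can be isolated into a helper that needs no pooch import. `resolve_final_path` is a hypothetical name, and the fallback carries a caveat worth stating: the parent of the first returned file is the extraction root only when that file sits at the top level of the archive, so this is a best-effort guess rather than a guarantee.

```python
from pathlib import Path
from typing import List, Optional, Union


def resolve_final_path(download: Union[str, List[str]], processor: Optional[object] = None) -> Path:
    """Map a retrieve() result (single path or list of members) to one Path."""
    if isinstance(download, list):
        # Prefer the processor's extract_dir when it is present and non-empty.
        extract_dir = getattr(processor, "extract_dir", None)
        if extract_dir:
            return Path(extract_dir)
        # Best-effort fallback; assumes the list is non-empty.
        return Path(download[0]).parent
    return Path(download)
```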

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 72c7036 and 61db514.

📒 Files selected for processing (1)
  • sub-packages/bionemo-core/src/bionemo/core/data/load.py (2 hunks)

@antonvnv
Collaborator Author

antonvnv commented Sep 8, 2025

/ok to test 61db514

@jstjohn jstjohn added this pull request to the merge queue Sep 8, 2025
Merged via the queue into NVIDIA-BioNeMo:main with commit 0d162d5 Sep 9, 2025
15 checks passed
4 participants