Optimize load() by avoiding redundant hash checks and unpacking#1081
Conversation
d886a65 to
b1596ba
Compare
|
/ok to test b1596ba |
WalkthroughAdds a cache-hit early exit using a per-resource ".checked" marker file, integrates a processor with pooch.retrieve for unpack/decompress, unifies post-download path handling, and writes the resolved path to the marker. On cache hit, reads and returns the stored path; otherwise downloads/processes, records, and returns it. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor User
participant Load as load.py
participant FS as Cache FS
participant Pooch as pooch.retrieve
participant Proc as Processor
User->>Load: load_resource(spec)
rect rgba(200,230,255,0.18)
note over Load: compute fname and .checked marker
Load->>FS: check .checked exists?
alt Cache hit
Load->>FS: read stored path
Load-->>User: return cached Path
else Cache miss
note over Load,Pooch: prepare processor based on extension/settings
Load->>Pooch: retrieve(url, fname, processor=Proc, ...)
Pooch-->>Load: downloaded path or list
Load->>Load: resolve final Path (file or unpacked)
Load->>FS: write final Path to .checked
Load-->>User: return final Path
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Pre-merge checks (2 passed, 1 warning)❌ Failed Checks (1 warning)
✅ Passed Checks (2 passed)
Poem
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
c9c07fc to
72c7036
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
sub-packages/bionemo-core/src/bionemo/core/data/load.py (1)
228-252: Honor explicit unpack/decompress flags and cover more extensionsCurrent logic ignores explicit
unpack=True/decompress=Truefor unsupported extensions and misses common variants (.tgz,.tar.bz2,.tar.xz). Clarify precedence and raise on unsupported combos.Apply:
-def _get_processor(extension: str, unpack: bool | None, decompress: bool | None): +def _get_processor(extension: str, unpack: bool | None, decompress: bool | None): @@ - if extension in {".gz", ".bz2", ".xz"} and decompress is None: - return pooch.Decompress() - - elif extension in {".tar", ".tar.gz"} and unpack is None: - return pooch.Untar() - - elif extension == ".zip" and unpack is None: - return pooch.Unzip() - - else: - return None + ext = extension.lower() + # 1) Respect explicit flags + if unpack is True: + if ext == ".zip": + return pooch.Unzip() + if ext.startswith(".tar") or ext in {".tgz", ".tbz2", ".txz"}: + return pooch.Untar() + raise ValueError(f"unpack=True not supported for extension: {extension}") + if decompress is True: + # Do not "decompress" tarballs; those should be untarred. + if ext in {".gz", ".bz2", ".xz"} and not (ext.startswith(".tar") or ext in {".tgz", ".tbz2", ".txz"}): + return pooch.Decompress() + raise ValueError(f"decompress=True not supported for extension: {extension}") + # 2) Infer when neither flag provided + if unpack is None and decompress is None: + if ext == ".zip": + return pooch.Unzip() + if ext.startswith(".tar") or ext in {".tgz", ".tbz2", ".txz"}: + return pooch.Untar() + if ext in {".gz", ".bz2", ".xz"}: + return pooch.Decompress() + return None
🧹 Nitpick comments (3)
sub-packages/bionemo-core/src/bionemo/core/data/load.py (3)
210-210: Cache key changed to include SHA; consider legacy cache fallbackUsing
fname=f"{resource.sha256}-{filename}"is good for multi-version caches but will invalidate prior caches named only byfilename. Consider an optional fallback to the legacy location or note the migration in release notes.
220-223: Avoid reliance on non-API attribute; add safe fallback
processor.extract_diris accessed with# type: ignore. Guard it and fall back to the parent of the first extracted file.Apply:
- if isinstance(download, list): - path = Path(processor.extract_dir) # type: ignore - else: - path = Path(download) + if isinstance(download, list): + if hasattr(processor, "extract_dir"): + path = Path(processor.extract_dir) # type: ignore[attr-defined] + else: + path = Path(download[0]).parent + else: + path = Path(download)
198-207: Add a force/TTL bypass to periodically re-verify integrityEarly exit skips future hash checks entirely. Add a
force=False(orrecheck_after=timedelta(...)) parameter toload()to recompute hashes periodically or on demand.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
sub-packages/bionemo-core/src/bionemo/core/data/load.py(2 hunks)
🔇 Additional comments (1)
sub-packages/bionemo-core/src/bionemo/core/data/load.py (1)
181-194: Ignore NGC‐only filename concern; all resources define PBSS
Every YAML under sub-packages/bionemo-core/src/bionemo/core/data/resources with an ngc entry also has a pbss field, so deriving filename from resource.pbss never encounters a missing value.Likely an incorrect or invalid review comment.
|
/ok to test 72c7036 |
Pooch by default revalidates file hashes and re-unpacks archives on every call, which is very slow for large checkpoints. This change introduces a `.checked` marker file that stores the resolved resource path once verification succeeds. Subsequent calls reuse this cached path instead of repeating the expensive validation and extraction steps. Key changes: - Use a `.checked` file alongside the cached resource to record the verified path. - Load from the `.checked` file if it exists, bypassing re-validation. - Ensure `.checked` is written after successful retrieval/unpacking. Signed-off-by: Anton Vorontsov <avorontsov@nvidia.com>
Fixes manual forward unit test: ``` > ckpt_weights: Path = load(ckpt_name) / "weights" E TypeError: unsupported operand type(s) for /: 'str' and 'str' sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py:505: TypeError ``` Signed-off-by: Anton Vorontsov <avorontsov@nvidia.com>
72c7036 to
61db514
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
sub-packages/bionemo-core/src/bionemo/core/data/load.py (2)
198-207: Validate and sandbox .checked reads; handle stale/poisoned markersReading and trusting arbitrary text from the marker can return a non-existent or out-of-cache path. Validate, sandbox to cache_dir, and clean up bad markers. Also strip newlines and handle read errors.
Apply this diff:
- if checked.exists(): - path = checked.read_text() - logger.debug(f"Using cached {path=} from {checked=}") - return Path(path) + if checked.exists(): + try: + cached = checked.read_text().strip() + resolved = Path(cached).resolve() + cache_root = cache_dir.resolve() + if resolved.exists() and (resolved == cache_root or cache_root in resolved.parents): + logger.debug(f"Using cached path={resolved} from checked={checked}") + return resolved + logger.warning(f"Ignoring stale/unsafe marker {checked}; will re-validate.") + except OSError as e: + logger.warning(f"Failed reading {checked}: {e}; will re-validate.") + with contextlib.suppress(OSError): + checked.unlink()
224-225: Make .checked writes atomic to prevent torn reads across processesUse write-to-temp + fsync + atomic replace.
Apply this diff:
- checked.write_text(str(path)) - return path + # Atomic marker write to avoid partial reads + tmp = checked.with_suffix(checked.suffix + ".tmp") + with open(tmp, "w", encoding="utf-8") as f: + f.write(str(path)) + f.flush() + os.fsync(f.fileno()) + os.replace(tmp, checked) + return path
🧹 Nitpick comments (1)
sub-packages/bionemo-core/src/bionemo/core/data/load.py (1)
220-223: Be robust if processor lacks extract_dir; fall back to parent of first fileUnzip/Untar return a list and set extract_dir after call. Prefer it when present; otherwise, derive from the first returned path. This guards against custom processors that return lists without extract_dir.
Apply this diff:
- if isinstance(download, list): - path = Path(processor.extract_dir) # type: ignore - else: - path = Path(download) + if isinstance(download, list): + # Prefer processor.extract_dir; else use the parent of the first extracted file. + if hasattr(processor, "extract_dir") and getattr(processor, "extract_dir"): + path = Path(processor.extract_dir) # type: ignore[attr-defined] + else: + path = Path(download[0]).parent + else: + path = Path(download)Context: Decompress returns a single file path, while Unzip/Untar return a list and maintain extract_dir. (fatiando.org)
|
/ok to test 61db514 |
Description
Pooch by default revalidates file hashes and re-unpacks archives on every call, which is very slow for large checkpoints. This change introduces a
.checkedmarker file that stores the resolved resource path once verification succeeds. Subsequent calls reuse this cached path instead of repeating the expensive validation and extraction steps.Key changes:
Use a
.checkedfile alongside the cached resource to record the verified path.Load from the
.checkedfile if it exists, bypassing re-validation.Ensure
.checkedis written after successful retrieval/unpacking.Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:
Note
By default, the notebooks validation tests are skipped unless explicitly enabled.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.
automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
/ok to testcomment on the pull request to trigger CI. This will need to be done for each new commit.Usage
# TODO: Add code snippetPre-submit Checklist
Summary by CodeRabbit