Harden downloads and migrate packaging/docs to uv #86
📝 Walkthrough
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as Client (CLI)
    participant CLI as pridepy CLI
    participant Files as Files module
    participant Remote as Remote endpoints (ftp/aspera/globus/s3)
    participant FS as Local filesystem
    rect rgba(200,230,255,0.5)
        User->>CLI: run download command (--protocol ftp)
        CLI->>Files: call download_files(protocol="ftp", checksum_check=...)
    end
    rect rgba(200,255,200,0.5)
        Files->>Remote: batch download all files via primary protocol (ftp)
        Remote-->>Files: file artifacts (may be empty/corrupt)
        Files->>FS: write batch files to output_folder
        Files->>Files: if checksum_check -> download checksum manifest
        Files->>Files: compute/verify MD5 per file
    end
    rect rgba(255,230,200,0.5)
        Files->>Files: for invalid/missing files -> remove and attempt fallback
        Files->>Remote: retry single file via next protocol in PROTOCOL_ORDER (aspera, globus, s3...)
        Remote-->>Files: file artifact (valid or invalid)
        alt valid
            Files->>FS: write validated file
        else invalid after all attempts
            Files-->>CLI: raise RuntimeError(failing filenames)
        end
    end
    Files-->>CLI: return success (all files valid)
    CLI-->>User: exit with status/result
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
@chakrabandla all tests are failing.
Actionable comments posted: 8
🧹 Nitpick comments (4)
pridepy/tests/test_download_resilience.py (1)
56-75: Assert the fallback path actually tries both protocols.

This test can pass on final success without proving `aspera` failed before `s3` was attempted. Capture protocol attempts to lock in the fallback behavior.

🧪 Proposed stronger assertion
```diff
 local_path = os.path.join(tmp_dir, "p.raw")
+attempted_protocols = []

 def fake_download(record, output_folder, protocol, aspera_bandwidth):
+    attempted_protocols.append(protocol)
     if protocol == "aspera":
         with open(local_path, "wb") as handle:
             handle.write(b"")
     elif protocol == "s3":
         with open(local_path, "wb") as handle:
@@
 assert success is True
+assert attempted_protocols == ["aspera", "s3"]
```

🤖 Prompt for AI Agents
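The attempt-recording technique in the diff above can be shown in isolation. The following is a minimal standalone sketch (function names are illustrative, not pridepy's actual API) of asserting the protocol order rather than just the final outcome:

```python
# Sketch: a fallback loop that records every protocol it attempts, so a test
# can assert the order of attempts, not just the final success flag.
def download_with_fallback(protocols, download_fn):
    attempts = []
    for protocol in protocols:
        attempts.append(protocol)
        data = download_fn(protocol)
        if data:  # treat non-empty content as a valid download
            return True, attempts
    return False, attempts

# Fake downloader: "aspera" yields an empty (invalid) file, "s3" succeeds.
def fake_download(protocol):
    return b"" if protocol == "aspera" else b"payload"

success, attempted = download_with_fallback(["aspera", "s3"], fake_download)
```

Asserting on `attempted` catches regressions where the loop skips straight to the working protocol.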
Verify each finding against the current code and only fix it if needed. In `@pridepy/tests/test_download_resilience.py` around lines 56 - 75, The test currently doesn't prove the fallback actually attempted "aspera" before "s3"; modify the fake_download side_effect used on Files._download_one_file_by_protocol to record each protocol it receives (e.g., append to a local list) and simulate aspera producing an empty file then s3 producing the valid file, then after calling handler._download_with_fallback assert success is True and assert the recorded protocol attempts equals or begins with ["aspera", "s3"] to lock in the fallback behavior for the protocol_sequence argument.

pridepy/files/files.py (3)
62-62: `"aspera"` first in auto ordering penalizes systems without Aspera.

`PROTOCOL_ORDER` puts `aspera` first, and `_download_with_fallback` runs `max_protocol_retries=2` attempts per protocol. On any machine without the `ascp` binary (or behind a firewall blocking port 33001), every file will burn two aspera failures before falling back to s3/ftp. For a project with hundreds of files this is a noticeable startup hit and a lot of noisy error logs.

Consider either (a) probing `get_ascp_binary()` once at the start of `download_files` and dropping `aspera` from the sequence when unavailable, or (b) reordering `PROTOCOL_ORDER` to prefer `s3` (which uses HTTPS and is broadly reachable) first in auto mode.

Also applies to: 190-195
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pridepy/files/files.py` at line 62, PROTOCOL_ORDER currently lists "aspera" first which causes _download_with_fallback to attempt Aspera repeatedly on machines without the ascp binary; in download_files call get_ascp_binary() once and if it returns None (or raises), remove "aspera" from the protocol list used by _download_with_fallback (do not change the global PROTOCOL_ORDER constant), or alternatively change PROTOCOL_ORDER to put "s3" first; ensure the local code passes the adjusted protocol sequence into _download_with_fallback so unavailable aspera is skipped and retries are not wasted.
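Option (a) above can be sketched as follows. `ascp_available` stands in for a one-time `get_ascp_binary()` probe; the ordering of protocols after `aspera` is an assumption for illustration:

```python
PROTOCOL_ORDER = ["aspera", "globus", "s3", "ftp"]  # aspera-first, per the comment

def resolve_protocol_sequence(ascp_available: bool, order=PROTOCOL_ORDER):
    """Drop aspera from the auto-mode sequence when no ascp binary was found,
    so each file does not burn failed aspera attempts before falling back."""
    if not ascp_available:
        return [p for p in order if p != "aspera"]
    return list(order)

with_aspera = resolve_protocol_sequence(ascp_available=True)
without_aspera = resolve_protocol_sequence(ascp_available=False)
```

Probing once and passing the filtered sequence into the fallback keeps the global constant unchanged, as the prompt suggests.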
918-959: `download_files` is `@staticmethod` but instantiates `Files()` to call an instance method.

`_download_with_fallback` doesn't use `self` — it only calls static helpers. Either drop the `@staticmethod` decorator on `download_files` (and use `self`), or make `_download_with_fallback` a `@staticmethod` so no throwaway `Files()` instance is needed. The latter is less invasive:

```diff
-    def _download_with_fallback(
-        self,
+    @staticmethod
+    def _download_with_fallback(
         file_record: Dict,
         ...
```

and on Line 949-952:

```diff
-        file_handler = Files()
         for file_record in file_list_json:
             expected_checksum = checksum_map.get(file_record["fileName"])
-            file_handler._download_with_fallback(
+            Files._download_with_fallback(
                 file_record=file_record,
                 ...
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pridepy/files/files.py` around lines 918 - 959, The download_files method instantiates a throwaway Files() to call _download_with_fallback even though _download_with_fallback does not use self; make _download_with_fallback a `@staticmethod` (remove any self parameter or update local references accordingly), then replace the file_handler = Files() + file_handler._download_with_fallback(...) call with a direct call to Files._download_with_fallback(...); also search for other call sites of _download_with_fallback and update them to the static call if needed.
123-132: Add `usedforsecurity=False` to suppress Ruff S324 warning.

MD5 here is used for non-security integrity checks against server-provided checksums. Setting `usedforsecurity=False` makes the intent explicit and suppresses the static analyzer warning. The project's minimum Python version is 3.9+, where this parameter is supported.

```diff
-    def compute_md5(file_path: str, chunk_size: int = 4 * 1024 * 1024) -> str:
-        md5 = hashlib.md5()
+    def compute_md5(file_path: str, chunk_size: int = 4 * 1024 * 1024) -> str:
+        md5 = hashlib.md5(usedforsecurity=False)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pridepy/files/files.py` around lines 123 - 132, The compute_md5 function uses hashlib.md5() for non-security integrity checks; update the hashlib.md5() call in compute_md5 to explicitly pass usedforsecurity=False (e.g., hashlib.md5(usedforsecurity=False)) so the intent is clear and Ruff S324 is suppressed; keep the rest of the logic (file reading in chunks and md5.update) unchanged and ensure the function signature (compute_md5, file_path, chunk_size) remains the same.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/python-publish.yml:
- Around line 19-22: The workflow uses the third-party action reference
astral-sh/setup-uv@v4 which is mutable; replace that tag with the action's
specific commit SHA (e.g. astral-sh/setup-uv@<commit-sha>) to pin the dependency
and keep the existing with: enable-cache: true unchanged; update the uses entry
in the publish workflow to the SHA-pinned reference so the publish job is
reproducible and immutable.
In `@pridepy/files/files.py`:
- Around line 862-869: The current fallback assumes publicFileLocations[0]
exists which can raise IndexError when publicFileLocations is [] and also
directly indexes file_record["fileName"] which can raise KeyError; update the
block around Files._get_download_url to safely extract the public URL and
filename: after catching ValueError, read publicFileLocations using
file_record.get("publicFileLocations") and pick the first non-empty dict safely
(e.g. check truthiness or use next(...) with a default) to obtain .get("value",
"") so no IndexError occurs, and replace file_record["fileName"] with
file_record.get("fileName", "unknown") before calling Files.get_output_file_name
or os.path.join to avoid KeyError; keep usage of Files.get_output_file_name and
Files._get_download_url unchanged otherwise.
- Around line 134-149: validate_download currently treats any non-empty file as
valid when expected_checksum is None, allowing truncated downloads to pass; fix
by adding an expected_size check and by ensuring download routines surface
failures: update validate_download to accept an optional expected_size parameter
and return False if os.path.getsize(file_path) != expected_size when
expected_size is provided (keep current empty-file check and checksum behavior),
and update the protocol-specific download functions (download_files_from_ftp,
download_files_from_aspera, download_files_from_globus, download_files_from_s3)
to (a) obtain remote content length (ftp.size / S3 content_length /
Content-Length) and pass it into validate_download, and (b) stop swallowing
exceptions silently—re-raise or return an error so _download_with_fallback can
detect failures and trigger protocol fallback.
- Around line 937-940: Replace the silent log+return when an invalid protocol is
passed with a raised exception: instead of calling logging.error(...) and
returning in the protocol validation block (where protocols_supported = ["ftp",
"aspera", "globus", "s3", "auto"] and protocol is checked), raise a ValueError
containing the invalid protocol and the allowed protocols so callers (and the
CLI) receive a non-zero-failure signal; update any surrounding docstrings or
tests if they assume silent returns.
- Around line 849-916: The caller download_files currently ignores the boolean
return of _download_with_fallback so per-file failures are swallowed; update
download_files to capture the return value of Files._download_with_fallback (or
self._download_with_fallback) for each file, append failed file names to a
failures list when False is returned, and after processing the batch log a
concise summary of failed files (and total succeeded/failed). Finally propagate
failure to the CLI by returning a non-zero indicator or raising SystemExit with
a non-zero code (or returning False from download_files) so the process exits
non-zero on partial failure.
In `@pridepy/pridepy.py`:
- Around line 19-24: The --protocol click options currently accept free-form
strings; update each Click option declaration for the protocol flags (the ones
around the option block at the top, and the other two option blocks near lines
referencing the same flag) to include
type=click.Choice(["auto","ftp","aspera","globus","s3"]) so that commands using
the protocol option validate input at parse time; specifically add
type=click.Choice(["auto","ftp","aspera","globus","s3"]) to the Click option
calls that define the "-p/--protocol" parameter (the option declarations that
include default="auto" and help="Protocol to use...") so all three occurrences
reject invalid values.
In `@README.md`:
- Around line 232-234: The docker run example uses an absolute host path
(/Users/yperez/work/pridepy/paper/) which is not portable; update the example to
mount a repository-relative path instead (replace the host path in the -v option
used in the docker run command such as the entry containing "-v
/Users/yperez/...:/data") so contributors can copy/paste — e.g., use the
current-directory or repo-relative mount (like using "$(pwd)" or "./paper") for
the host side and keep the container side as ":/data", ensuring the rest of the
command (docker run ... openjournals/inara:latest paper.md -p -o pdf) remains
the same.
- Line 223: The README currently claims `pridepy` is published via GitHub
Actions using "PyPI trusted publishing/token flow" with `uv build`, but the
workflow `.github/workflows/python-publish.yml` still uses
`secrets.PYPI_API_TOKEN`; either update the README wording to state it uses the
token-based flow (`secrets.PYPI_API_TOKEN`) or modify the workflow to use OIDC
(remove the static secret and configure the `permissions`/`id-token` and PyPI
OIDC setup) so the claim about trusted publishing via OIDC is accurate.
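The `click.Choice` change requested above for `pridepy/pridepy.py` would look roughly like this; the command name and help text here are illustrative, not the actual CLI:

```python
import click

PROTOCOLS = ["auto", "ftp", "aspera", "globus", "s3"]

@click.command()
@click.option(
    "-p",
    "--protocol",
    type=click.Choice(PROTOCOLS),  # rejects invalid values at parse time
    default="auto",
    help="Protocol to use for downloads.",
)
def download(protocol):
    click.echo(f"protocol={protocol}")

# Exercise the option with click's test runner.
from click.testing import CliRunner

runner = CliRunner()
ok = runner.invoke(download, ["--protocol", "s3"])
bad = runner.invoke(download, ["--protocol", "http"])
```

With `click.Choice`, an invalid value fails at argument parsing with a usage error instead of reaching the download code.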
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ee7939da-352a-4a48-bf06-34fb23fcef42
⛔ Files ignored due to path filters (1)
`poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (9)
- .github/workflows/python-app.yml
- .github/workflows/python-package.yml
- .github/workflows/python-publish.yml
- README.md
- pridepy/files/files.py
- pridepy/pridepy.py
- pridepy/tests/test_download_resilience.py
- pyproject.toml
- recipe/meta.yaml
```python
@staticmethod
def validate_download(file_path: str, expected_checksum: Optional[str] = None) -> Tuple[bool, str]:
    """
    Validate a local file exists, is non-empty, and checksum matches when provided.
    """
    if not os.path.exists(file_path):
        return False, "file does not exist"
    if os.path.getsize(file_path) == 0:
        return False, "file is empty"
    if expected_checksum:
        actual_checksum = Files.compute_md5(file_path)
        if actual_checksum.lower() != expected_checksum.lower():
            return False, (
                f"checksum mismatch (expected={expected_checksum.lower()}, actual={actual_checksum.lower()})"
            )
    return True, "ok"
```
Partial/corrupt downloads pass validation when checksum_check=False.
`validate_download` only checks existence and non-empty size when `expected_checksum` is None. Combined with the sub-download functions (`download_files_from_ftp`/`aspera`/`globus`/`s3`) that swallow exceptions internally and don't re-raise, a partially-written file from a truncated transfer will look "valid" to `_download_with_fallback` and protocol fallback will never trigger unless the user opts into `checksum_check=True`. Users running with defaults will silently receive corrupt data.

Options: propagate failures from the sub-download functions so the outer try/except at Line 894 sees them, or compare against a known size (e.g., `ftp.size` / S3 `content_length` / `Content-Length`) in `validate_download` even when no checksum is available.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pridepy/files/files.py` around lines 134 - 149, validate_download currently
treats any non-empty file as valid when expected_checksum is None, allowing
truncated downloads to pass; fix by adding an expected_size check and by
ensuring download routines surface failures: update validate_download to accept
an optional expected_size parameter and return False if
os.path.getsize(file_path) != expected_size when expected_size is provided (keep
current empty-file check and checksum behavior), and update the
protocol-specific download functions (download_files_from_ftp,
download_files_from_aspera, download_files_from_globus, download_files_from_s3)
to (a) obtain remote content length (ftp.size / S3 content_length /
Content-Length) and pass it into validate_download, and (b) stop swallowing
exceptions silently—re-raise or return an error so _download_with_fallback can
detect failures and trigger protocol fallback.
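The size-aware validation described above can be sketched as a simplified standalone function (not the actual `Files` method; the checksum branch is elided):

```python
import os
import tempfile
from typing import Optional, Tuple

def validate_download(
    file_path: str,
    expected_checksum: Optional[str] = None,
    expected_size: Optional[int] = None,
) -> Tuple[bool, str]:
    """Reject missing, empty, or size-mismatched files even when no checksum
    is available, so truncated transfers can trigger protocol fallback."""
    if not os.path.exists(file_path):
        return False, "file does not exist"
    if os.path.getsize(file_path) == 0:
        return False, "file is empty"
    if expected_size is not None and os.path.getsize(file_path) != expected_size:
        return False, "file size mismatch"
    # checksum comparison (as in the existing code) would go here
    return True, "ok"

# Simulate a complete vs. truncated transfer with a 5-byte file on disk.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"12345")

ok_exact, _ = validate_download(tmp.name, expected_size=5)
ok_truncated, reason = validate_download(tmp.name, expected_size=10)
```

The `expected_size` would come from the remote side (`ftp.size`, S3 `content_length`, or an HTTP `Content-Length` header), as the comment suggests.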
Pull request overview
This PR hardens pridepy’s download flow by defaulting to an automatic protocol fallback strategy and adding checksum-based validation, while also migrating packaging and CI/docs from Poetry to a uv + Setuptools (PEP 621) setup.
Changes:
- Implemented protocol fallback (`auto`) and local-file validation (non-empty + optional checksum) in the downloader.
- Migrated packaging metadata/build backend from Poetry to Setuptools and updated CI to use `uv`.
- Refreshed README installation/dev instructions and added tests for the new download resilience behavior.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| `recipe/meta.yaml` | Updates conda build requirements to align with Setuptools-based builds and updates license metadata. |
| `pyproject.toml` | Converts Poetry config to PEP 621 `[project]` metadata and Setuptools build backend; adds optional dev deps. |
| `pridepy/files/files.py` | Adds checksum parsing/validation + protocol fallback download orchestration and updates default protocol. |
| `pridepy/pridepy.py` | Changes CLI defaults/docs to use auto protocol and mentions fallback + checksum options. |
| `pridepy/tests/test_download_resilience.py` | Adds tests covering checksum parsing, validation, protocol sequencing, and fallback behavior. |
| `README.md` | Updates install/dev instructions (uv), expands usage examples, and documents the new robust download mode. |
| `.github/workflows/python-publish.yml` | Switches build step to `uv build` and modernizes setup actions. |
| `.github/workflows/python-package.yml` | Updates CI to install/build/lint/test via uv. |
| `.github/workflows/python-app.yml` | Updates CI to install/lint/test via uv. |
| `poetry.lock` | Removes Poetry lockfile as part of migration away from Poetry. |
Comments suppressed due to low confidence (1)
.github/workflows/python-publish.yml:30
- This workflow publishes using an API token secret (`user: __token__` + `PYPI_API_TOKEN`). If the intent is PyPI Trusted Publishing (OIDC), this should be switched to trusted publishing and the workflow should request `permissions: id-token: write` (and avoid long-lived secrets). Otherwise, please update the README text that claims trusted publishing.
```yaml
- name: Publish package
  uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
  with:
    user: __token__
    password: ${{ secrets.PYPI_API_TOKEN }}
    verbose: true
```
```python
def compute_md5(file_path: str, chunk_size: int = 4 * 1024 * 1024) -> str:
    md5 = hashlib.md5()
    with open(file_path, "rb") as file_handle:
```
compute_md5() uses hashlib.md5() directly. On some OpenSSL/FIPS-enabled environments this can raise a ValueError (MD5 disabled), which would make --checksum-check unusable. Consider using the usedforsecurity=False flag when available and/or gracefully handling MD5-unavailable environments so downloads can still proceed (at least with the non-empty validation).
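One way to address both the Ruff warning and restrictive environments is a small constructor wrapper. This is a sketch, not pridepy's actual code:

```python
import hashlib

def new_md5():
    """Build an MD5 object for integrity checking only.

    usedforsecurity=False (Python 3.9+) tells FIPS-aware OpenSSL builds the
    hash is not used for security purposes. Interpreters or hashlib backends
    that don't accept the keyword raise TypeError, so fall back to the plain
    constructor there.
    """
    try:
        return hashlib.md5(usedforsecurity=False)
    except TypeError:
        return hashlib.md5()

digest = new_md5()
digest.update(b"hello")
hex_digest = digest.hexdigest()
```

On a strict FIPS build where even the flagged constructor raises `ValueError`, the caller could catch that and degrade to size-only validation, as the comment suggests.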
```python
output_path = os.path.join(output_folder, f"{accession}-checksum.tsv")
with open(output_path, "w") as file:
    file.write(data)
return output_path
```
save_checksum_file() writes the checksum file into output_folder without ensuring the directory exists, which can raise FileNotFoundError when --checksum-check is enabled (since checksums are fetched before any protocol-specific downloader creates the directory). Also, the checksum endpoint used in this function points at wwwdev.ebi.ac.uk; if that host isn’t intended for end users, please switch to the production www.ebi.ac.uk base (or derive from V3_API_BASE_URL).
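The missing-directory issue can be avoided with an `os.makedirs(..., exist_ok=True)` guard. A hedged standalone sketch (the real function also fetches the manifest over HTTP, which is omitted here):

```python
import os
import tempfile

def save_checksum_file(accession: str, data: str, output_folder: str) -> str:
    """Write the checksum manifest, creating output_folder first, since the
    manifest is fetched before any protocol downloader creates the directory."""
    os.makedirs(output_folder, exist_ok=True)
    output_path = os.path.join(output_folder, f"{accession}-checksum.tsv")
    with open(output_path, "w") as handle:
        handle.write(data)
    return output_path

# Writing into a directory tree that does not exist yet must succeed.
base = tempfile.mkdtemp()
target = os.path.join(base, "does", "not", "exist", "yet")
manifest = save_checksum_file("PXD000001", "a.raw\td41d8cd98f00b204e9800998ecf8427e\n", target)
```

`exist_ok=True` makes the call idempotent, so it is safe even when a downloader has already created the folder.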
Actionable comments posted: 3
♻️ Duplicate comments (3)
pridepy/files/files.py (3)
954-957: ⚠️ Potential issue | 🟡 Minor

Raise on unsupported protocols instead of returning success.

A silent `return` leaves API callers and CLI wrappers with no failure signal for an invalid protocol value.

Proposed fix

```diff
-    protocols_supported = ["ftp", "aspera", "globus", "s3", "auto"]
+    protocols_supported = {"ftp", "aspera", "globus", "s3", "auto"}
     if protocol not in protocols_supported:
-        logging.error("Protocol should be one of ftp, aspera, globus, s3, auto")
-        return
+        raise ValueError(
+            f"Unsupported protocol '{protocol}'. Must be one of: {sorted(protocols_supported)}"
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pridepy/files/files.py` around lines 954 - 957, The code currently logs an error and returns silently when an unsupported protocol is provided (check protocols_supported and protocol), which masks failures; replace the logging.error + return with raising an exception (e.g., ValueError) that includes the invalid protocol value and the allowed list (ftp, aspera, globus, s3, auto) so callers/CLI receive a proper failure signal; keep or log the error as secondary but ensure the function raises the error instead of returning.
879-886: ⚠️ Potential issue | 🟡 Minor

Guard the fallback URL extraction for empty locations.

`file_record.get("publicFileLocations", [{}])[0]` still raises `IndexError` when the API returns an empty list, aborting the batch before fallback can report a clean failure.

Proposed fix

```diff
     try:
         canonical_url = Files._get_download_url(file_record, "ftp")
     except ValueError:
-        canonical_url = file_record.get("publicFileLocations", [{}])[0].get("value", "")
+        locations = file_record.get("publicFileLocations") or []
+        canonical_url = locations[0].get("value", "") if locations else ""
     if canonical_url:
         local_path = Files.get_output_file_name(canonical_url, file_record, output_folder)
     else:
-        local_path = os.path.join(output_folder, file_record["fileName"])
+        local_path = os.path.join(output_folder, file_record.get("fileName", "unknown"))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pridepy/files/files.py` around lines 879 - 886, The fallback extraction in the except block can still raise IndexError because file_record.get("publicFileLocations", [{}])[0] assumes a non-empty list; update the fallback to safely handle empty or missing lists by retrieving publicFileLocations into a variable (e.g., locations = file_record.get("publicFileLocations") or []) and then set canonical_url = locations[0].get("value", "") only if locations is truthy, otherwise set canonical_url = "" (so Files.get_output_file_name and the subsequent else branch won't crash); modify the logic around Files._get_download_url and Files.get_output_file_name accordingly to use this guarded canonical_url.
140-155: ⚠️ Potential issue | 🟠 Major

Non-empty truncated files still pass when checksum validation is off.

With `expected_checksum=None`, validation accepts any non-empty file. Since checksum checking is optional, a partial transfer can still be treated as successful instead of triggering fallback.

Suggested direction

```diff
-    def validate_download(file_path: str, expected_checksum: Optional[str] = None) -> Tuple[bool, str]:
+    def validate_download(
+        file_path: str,
+        expected_checksum: Optional[str] = None,
+        expected_size: Optional[int] = None,
+    ) -> Tuple[bool, str]:
@@
         if os.path.getsize(file_path) == 0:
             return False, "file is empty"
+        if expected_size is not None and os.path.getsize(file_path) != expected_size:
+            return False, "file size mismatch"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pridepy/files/files.py` around lines 140 - 155, validate_download currently treats any non-empty file as valid when expected_checksum is None, allowing truncated partial downloads to pass; update Files.validate_download to accept an optional expected_size: Optional[int] parameter and change validation logic so that if expected_checksum is provided use Files.compute_md5 as before, else if expected_size is provided compare os.path.getsize(file_path) to expected_size and fail on mismatch, and only accept the file when one of these verifications passes; reference the method name Files.validate_download and Files.compute_md5 when making these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pridepy/files/files.py`:
- Around line 81-93: The parser currently replaces tabs with spaces then splits
(tokens = clean.replace("\t", " ").split()), which breaks tab-delimited
filenames containing spaces and causes checksum detection to use the wrong
basename; update the parsing in the function that sets
tokens/checksum/path_token so that if the original line contains a tab you split
on the last tab (preserving the full filename left of the tab) to produce [path,
checksum], otherwise fall back to whitespace split and the existing checksum
detection logic (use the variables tokens, checksum, path_token to locate and
set values accordingly).
- Around line 128-131: Replace the try/except fallback that creates md5 with a
single secure call: remove the try/except around hashlib.md5 and always
instantiate md5 using hashlib.md5(usedforsecurity=False); update the variable
md5 in the same scope (where md5 is currently assigned) so there is no raw
hashlib.md5() fallback and S324 warnings are eliminated.
In `@pridepy/tests/test_download_resilience.py`:
- Line 1: Replace the computed MD5 fixture with its known digest literal and
remove the now-unused hashlib import: find where the test uses hashlib.md5() to
produce the deterministic fixture (in tests in this file) and substitute the
literal hex digest string directly in the fixture/value, then delete the import
hashlib line so the Ruff S324 lint violation is resolved.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 865b02be-6ef0-412c-a426-f75abca30c43
📒 Files selected for processing (9)
- .flake8
- .github/workflows/python-app.yml
- .github/workflows/python-package.yml
- .github/workflows/python-publish.yml
- README.md
- pridepy/files/files.py
- pridepy/pridepy.py
- pridepy/tests/test_download_resilience.py
- pyproject.toml
✅ Files skipped from review due to trivial changes (2)
- .flake8
- .github/workflows/python-publish.yml
🚧 Files skipped from review as they are similar to previous changes (4)
- .github/workflows/python-package.yml
- .github/workflows/python-app.yml
- pridepy/pridepy.py
- pyproject.toml
```python
tokens = clean.replace("\t", " ").split()
if len(tokens) < 2:
    return None

checksum = None
path_token = None
for idx, token in enumerate(tokens):
    normalized = token.lower()
    if len(normalized) == 32 and all(c in "0123456789abcdef" for c in normalized):
        checksum = normalized
        remaining = [t for i, t in enumerate(tokens) if i != idx]
        if remaining:
            path_token = remaining[-1]
```
Preserve tab-delimited filenames with spaces.
Replacing tabs with spaces before splitting can turn `file name.raw<TAB><md5>` into the basename `name.raw`, so `--checksum-check` silently misses the real file.
Proposed parser adjustment
```diff
-    tokens = clean.replace("\t", " ").split()
-    if len(tokens) < 2:
-        return None
-
-    checksum = None
-    path_token = None
-    for idx, token in enumerate(tokens):
-        normalized = token.lower()
-        if len(normalized) == 32 and all(c in "0123456789abcdef" for c in normalized):
-            checksum = normalized
-            remaining = [t for i, t in enumerate(tokens) if i != idx]
-            if remaining:
-                path_token = remaining[-1]
-            break
+    def is_md5(token: str) -> bool:
+        normalized = token.lower()
+        return len(normalized) == 32 and all(c in "0123456789abcdef" for c in normalized)
+
+    if "\t" in clean:
+        fields = [field.strip() for field in clean.split("\t") if field.strip()]
+        candidates = [(fields[0], fields[-1]), (fields[-1], fields[0])] if len(fields) >= 2 else []
+    else:
+        candidates = []
+        first_split = clean.split(maxsplit=1)
+        last_split = clean.rsplit(maxsplit=1)
+        if len(first_split) == 2:
+            candidates.append((first_split[0], first_split[1]))
+        if len(last_split) == 2:
+            candidates.append((last_split[1], last_split[0]))
+
+    checksum = None
+    path_token = None
+    for checksum_token, candidate_path in candidates:
+        if is_md5(checksum_token):
+            checksum = checksum_token.lower()
+            path_token = candidate_path
+            break
```
Verify each finding against the current code and only fix it if needed.
In `@pridepy/files/files.py` around lines 81 - 93, The parser currently replaces
tabs with spaces then splits (tokens = clean.replace("\t", " ").split()), which
breaks tab-delimited filenames containing spaces and causes checksum detection
to use the wrong basename; update the parsing in the function that sets
tokens/checksum/path_token so that if the original line contains a tab you split
on the last tab (preserving the full filename left of the tab) to produce [path,
checksum], otherwise fall back to whitespace split and the existing checksum
detection logic (use the variables tokens, checksum, path_token to locate and
set values accordingly).
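To make the failure mode concrete, here is a standalone sketch of a manifest-line parser in the spirit of the suggested fix. The helper name `parse_checksum_line` is illustrative, not pridepy's actual API: it splits on the last tab when one is present, so filenames containing spaces survive.

```python
def parse_checksum_line(line: str):
    """Parse one manifest line into (checksum, path), or return None.

    Illustrative sketch: splits on tabs when present so filenames
    containing spaces are preserved; otherwise falls back to
    whitespace splitting with the checksum at either end.
    """
    def is_md5(token: str) -> bool:
        t = token.lower()
        return len(t) == 32 and all(c in "0123456789abcdef" for c in t)

    clean = line.strip()
    if "\t" in clean:
        # Split on the last tab so a "path<TAB>md5" layout keeps
        # spaces inside the filename intact.
        path, _, token = clean.rpartition("\t")
        if is_md5(token.strip()):
            return token.strip().lower(), path.strip()
        # Some manifests put the checksum first: "md5<TAB>path".
        token, _, path = clean.partition("\t")
        if is_md5(token.strip()):
            return token.strip().lower(), path.strip()
        return None
    # No tabs: whitespace split, checksum at either end.
    parts = clean.split()
    if len(parts) >= 2 and is_md5(parts[-1]):
        return parts[-1].lower(), " ".join(parts[:-1])
    if len(parts) >= 2 and is_md5(parts[0]):
        return parts[0].lower(), " ".join(parts[1:])
    return None

# A filename with an embedded space survives tab-delimited parsing:
print(parse_checksum_line("file name.raw\td41d8cd98f00b204e9800998ecf8427e"))
# → ('d41d8cd98f00b204e9800998ecf8427e', 'file name.raw')
```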
…n packaging

- files.py: split download orchestration into a batch-first primary pass (reuses one FTP/S3 connection for the whole project, as before) plus a per-file fallback for invalid/missing files. Removes the per-file reconnect regression from the initial PR version.
- files.py: make `_get_download_url` symmetric — raise cleanly when the requested protocol's location is unavailable instead of silently falling back to an arbitrary (possibly Aspera) URL for ftp/s3/globus.
- files.py: hoist the globus base URL into a class constant so the mapping is defined in one place; call static helpers via `Files.`.
- pyproject.toml: use PEP 621 table form for the license field so it works with the pinned `setuptools>=68` and re-add the `License :: OSI Approved :: Apache Software License` classifier. Bump boto3/botocore lower bounds to match meta.yaml. Add Python 3.12 to the supported classifiers.
- python-package.yml: test against 3.9-3.12 and run lint/tests before building the wheel.
- tests: update the resilience suite for the renamed helper and add a test asserting the batch path skips the per-file fallback on success.

https://claude.ai/code/session_01V9mHYkhG1pggyzJCii8do3
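The batch-first pass plus per-file fallback described in the commit message can be sketched like this. This is a minimal illustration with injected fetch/validate callables; the names and `PROTOCOL_ORDER` here are assumptions, not pridepy's exact implementation.

```python
PROTOCOL_ORDER = ["ftp", "aspera", "globus", "s3"]

def download_all(files, primary, fetch_batch, fetch_one, is_valid):
    """Batch-first download with per-file protocol fallback.

    One batch call over the primary protocol (so a single FTP/S3
    connection can be reused), then per-file retries over the
    remaining protocols for anything that fails validation.
    """
    fetch_batch(files, primary)
    failed = []
    for name in files:
        if is_valid(name):
            continue  # batch pass succeeded; no fallback needed
        for proto in (p for p in PROTOCOL_ORDER if p != primary):
            fetch_one(name, proto)
            if is_valid(name):
                break
        else:
            failed.append(name)
    if failed:
        raise RuntimeError(f"Could not download: {failed}")

# Tiny simulation: "bad.raw" is only retrievable via aspera.
state = {}

def fetch_batch(files, proto):
    for f in files:
        state[f] = proto

def fetch_one(f, proto):
    state[f] = proto

def is_valid(f):
    return state.get(f) == "aspera" or (f == "ok.raw" and state.get(f) == "ftp")

download_all(["ok.raw", "bad.raw"], "ftp", fetch_batch, fetch_one, is_valid)
print(state)  # {'ok.raw': 'ftp', 'bad.raw': 'aspera'}
```

The `for/else` construct makes the "all fallbacks exhausted" branch explicit without a sentinel flag.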
Add resilient multi-protocol download with fallback and checksum validation
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pridepy/files/files.py`:
- Around line 1001-1016: The loop indexes file_record["fileName"] directly which
can raise KeyError; change to a defensive local variable like file_name =
file_record.get("fileName", "<unknown>") and use that variable for checksum
lookup (replace checksum_map.get(file_record["fileName"]) with
checksum_map.get(file_name)), for log messages, and when calling
Files._remove_if_exists/Files.validate_download handling so the loop won't abort
on missing keys; also review Files._resolve_local_path and any calls in
_download_with_fallback that access file_record["fileName"] and apply the same
.get fallback for consistency.
In `@pyproject.toml`:
- Around line 68-81: The package discovery currently uses
[tool.setuptools.packages.find] with include = ["pridepy*"] which also picks up
test packages; update that section by adding an exclude entry to omit test
packages (e.g., add exclude = ["pridepy.tests*"] or similar pattern) so that
pridepy.tests and any subpackages are not shipped in the installed wheel; modify
the same [tool.setuptools.packages.find] block where include = ["pridepy*"] is
declared to add the exclude key.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a3d6a8ae-cb03-437c-a6fa-4db41e4580b9
📒 Files selected for processing (8)
- .github/workflows/python-package.yml
- .gitignore
- README.md
- pridepy/files/files.py
- pridepy/pridepy.py
- pridepy/tests/test_download_resilience.py
- pyproject.toml
- recipe/meta.yaml
✅ Files skipped from review due to trivial changes (2)
- .gitignore
- recipe/meta.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/workflows/python-package.yml
- pridepy/pridepy.py
```diff
         failed_files: List[str] = []
         for file_record in file_list_json:
             expected_checksum = checksum_map.get(file_record["fileName"])
             local_path = Files._resolve_local_path(file_record, output_folder)
             valid, reason = Files.validate_download(local_path, expected_checksum)
             if valid:
                 continue

             logging.warning(
                 f"{file_record['fileName']} invalid after {primary_protocol} ({reason})"
             )
-        elif protocol == "s3":
-            Files.download_files_from_s3(file_list_json, output_folder, skip_if_downloaded_already)
             Files._remove_if_exists(local_path)

             if not fallback_sequence:
                 failed_files.append(file_record.get("fileName", "<unknown>"))
                 continue
```
`file_record["fileName"]` can raise `KeyError` and abort the entire batch.

Line 1003 indexes `fileName` directly, and line 1010's log message does the same. If the API ever returns a file record without `fileName`, the whole `download_files` call aborts mid-loop, even though line 1015 already uses `.get("fileName", "<unknown>")` for the failure summary. Use the same defensive access everywhere.
🛡️ Proposed fix

```diff
-        for file_record in file_list_json:
-            expected_checksum = checksum_map.get(file_record["fileName"])
-            local_path = Files._resolve_local_path(file_record, output_folder)
-            valid, reason = Files.validate_download(local_path, expected_checksum)
-            if valid:
-                continue
-
-            logging.warning(
-                f"{file_record['fileName']} invalid after {primary_protocol} ({reason})"
-            )
+        for file_record in file_list_json:
+            file_name = file_record.get("fileName", "<unknown>")
+            expected_checksum = checksum_map.get(file_name)
+            local_path = Files._resolve_local_path(file_record, output_folder)
+            valid, reason = Files.validate_download(local_path, expected_checksum)
+            if valid:
+                continue
+
+            logging.warning(
+                f"{file_name} invalid after {primary_protocol} ({reason})"
+            )
```

Note: `_resolve_local_path` (line 217) and `_download_with_fallback` (lines 905, 919, 925, 930, 938) also index `file_record["fileName"]` directly; consider the same fallback there for consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pridepy/files/files.py` around lines 1001 - 1016, The loop indexes
file_record["fileName"] directly which can raise KeyError; change to a defensive
local variable like file_name = file_record.get("fileName", "<unknown>") and use
that variable for checksum lookup (replace
checksum_map.get(file_record["fileName"]) with checksum_map.get(file_name)), for
log messages, and when calling Files._remove_if_exists/Files.validate_download
handling so the loop won't abort on missing keys; also review
Files._resolve_local_path and any calls in _download_with_fallback that access
file_record["fileName"] and apply the same .get fallback for consistency.
```toml
[tool.setuptools.packages.find]
where = ["."]
include = ["pridepy*"]

[tool.setuptools.package-data]
pridepy = [
    "aspera/etc/*",
    "aspera/key/*",
    "aspera/linux-32/*",
    "aspera/linux-64/*",
    "aspera/mac-intel/*",
    "aspera/windows-32/*",
    "aspera/windows-64/*",
]
```
🛠️ Refactor suggestion | 🟠 Major
Exclude the test package from the installed wheel.
`include = ["pridepy*"]` also matches `pridepy.tests`, so the resilience/test modules will ship in the installed wheel and become importable as `pridepy.tests.*` by end users. Add an explicit `exclude` to keep tests out of the distribution.
🧹 Proposed fix

```diff
 [tool.setuptools.packages.find]
 where = ["."]
 include = ["pridepy*"]
+exclude = ["pridepy.tests*", "tests*"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 68 - 81, The package discovery currently uses
[tool.setuptools.packages.find] with include = ["pridepy*"] which also picks up
test packages; update that section by adding an exclude entry to omit test
packages (e.g., add exclude = ["pridepy.tests*"] or similar pattern) so that
pridepy.tests and any subpackages are not shipped in the installed wheel; modify
the same [tool.setuptools.packages.find] block where include = ["pridepy*"] is
declared to add the exclude key.
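A quick way to confirm the suggested `exclude` pattern behaves as intended is to run setuptools' discovery against a throwaway layout (the directory names below are illustrative, not the real pridepy tree):

```python
import os
import tempfile
from setuptools import find_packages

with tempfile.TemporaryDirectory() as root:
    # Fake layout: pridepy/, pridepy/files/, pridepy/tests/
    for pkg in ("pridepy", "pridepy/files", "pridepy/tests"):
        os.makedirs(os.path.join(root, pkg), exist_ok=True)
        open(os.path.join(root, pkg, "__init__.py"), "w").close()
    found = find_packages(
        where=root,
        include=["pridepy*"],
        exclude=["pridepy.tests*", "tests*"],
    )

print(sorted(found))  # ['pridepy', 'pridepy.files']
```

`find_packages` applies `include` first and then removes anything matching `exclude`, so `pridepy.tests` and its subpackages drop out while the runtime packages stay.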
Summary by CodeRabbit

- New Features
- Documentation
- Chores
- Tests