
Harden downloads and migrate packaging/docs to uv #86

Merged
ypriverol merged 13 commits into dev from feature/robust-transfer-failover-integrity on Apr 21, 2026
Conversation

@chakrabandla
Contributor

@chakrabandla chakrabandla commented Apr 21, 2026

Summary by CodeRabbit

  • New Features

    • Downloads now support S3 and use automatic multi-protocol fallback with per-file integrity checks.
  • Documentation

    • Expanded README with Quick Start, CLI examples, Python API usage, and updated install/test/build instructions.
  • Chores

    • Modernized CI/workflows, moved packaging to setuptools/PEP 621, bumped package version and updated license.
  • Tests

    • Added tests for download resilience, checksum parsing, and protocol fallback.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough


Updated CI to use uv and newer GitHub Actions; migrated packaging from Poetry to PEP 621/Setuptools; added file-download resilience with MD5 checksum validation and multi-protocol fallback (including s3); updated CLI to expose protocol choices; expanded README and tests; added Flake8 config and updated ignores.

Changes

  • CI workflows (.github/workflows/python-app.yml, .github/workflows/python-package.yml, .github/workflows/python-publish.yml): Replaced pip/poetry steps with uv setup (astral-sh/setup-uv), enabled caching keyed to pyproject.toml, upgraded actions (checkout & setup-python), and now run lint/test/build via uv run/uv build.
  • Packaging & build metadata (pyproject.toml, recipe/meta.yaml): Migrated from Poetry to PEP 621/Setuptools, bumped the version to 0.0.13, moved runtime and dev dependencies, adjusted the build backend and conda recipe, and changed the license in the recipe.
  • File download logic & tests (pridepy/files/files.py, pridepy/tests/test_download_resilience.py): Added MD5 checksum parsing/validation, compute/validate helpers, checksum manifest download/save, protocol URL mapping and PROTOCOL_ORDER, a two-phase batch download with per-file fallback across protocols (including s3), and tests covering parsing, validation, protocol sequencing, and fallback behavior.
  • CLI (pridepy/pridepy.py): Added PROTOCOL_CHOICES (ftp/aspera/globus/s3) and applied it to the --protocol options for download commands; help text now mentions automatic fallback.
  • Docs & examples (README.md): Rewrote the top-level description, updated installation (uv/pip/dev) and Quick Start examples, refreshed CLI and Python API examples, updated dev/release instructions to use uv, and fixed a missing EOF newline.
  • Linting & repo config (.flake8, .gitignore): Added a .flake8 config (excludes, max-line-length 127, max-complexity 10) and updated .gitignore to ignore poetry.lock and uv.lock.

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant User as Client (CLI)
  participant CLI as pridepy CLI
  participant Files as Files module
  participant Remote as Remote endpoints (ftp/aspera/globus/s3)
  participant FS as Local filesystem

  rect rgba(200,230,255,0.5)
    User->>CLI: run download command (--protocol ftp)
    CLI->>Files: call download_files(protocol="ftp", checksum_check=...)
  end

  rect rgba(200,255,200,0.5)
    Files->>Remote: batch download all files via primary protocol (ftp)
    Remote-->>Files: file artifacts (may be empty/corrupt)
    Files->>FS: write batch files to output_folder
    Files->>Files: if checksum_check -> download checksum manifest
    Files->>Files: compute/verify MD5 per file
  end

  rect rgba(255,230,200,0.5)
    Files->>Files: for invalid/missing files -> remove and attempt fallback
    Files->>Remote: retry single file via next protocol in PROTOCOL_ORDER (aspera, globus, s3...)
    Remote-->>Files: file artifact (valid or invalid)
    alt valid
      Files->>FS: write validated file
    else invalid after all attempts
      Files-->>CLI: raise RuntimeError(failing filenames)
    end
  end

  Files-->>CLI: return success (all files valid)
  CLI-->>User: exit with status/result
```
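The fallback phase in the diagram can be sketched as a small loop. This is an illustrative stand-in, not pridepy's actual API: the names download_with_fallback, download_one, and validate, and the exact PROTOCOL_ORDER contents, are assumptions based on the walkthrough above.

```python
import os

# Assumed ordering, per the walkthrough; the real constant lives in pridepy/files/files.py.
PROTOCOL_ORDER = ["aspera", "globus", "s3"]


def download_with_fallback(file_name, primary_protocol, download_one, validate, output_folder):
    """Try the primary protocol, then each remaining protocol, until the file validates."""
    candidates = [primary_protocol] + [p for p in PROTOCOL_ORDER if p != primary_protocol]
    local_path = os.path.join(output_folder, file_name)
    for protocol in candidates:
        try:
            download_one(file_name, protocol, output_folder)
        except Exception:
            continue  # transfer error: fall through to the next protocol
        ok, _reason = validate(local_path)
        if ok:
            return True
        if os.path.exists(local_path):
            os.remove(local_path)  # discard the invalid artifact before retrying
    raise RuntimeError(f"all protocols failed for {file_name}")
```

The key property (also exercised by the new tests) is that an invalid artifact is deleted before the next protocol is tried, so a later success never inherits stale bytes.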

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Dev #53 — modifies pridepy/files/files.py and download logic; closely related to the new fallback and validation changes.
  • move to poetry #61 — modifies CLI download commands/options; related to the --protocol option updates.
  • update DOIs #64 — modifies packaging metadata (pyproject.toml, recipe); related to the migration from Poetry to PEP 621/Setuptools.

Suggested labels

Review effort 1/5

Suggested reviewers

  • selvaebi

Poem

"A rabbit hops with bytes in paw,
checksums snug and protocols awed,
ftp first, then s3’s cheer,
retries hop until files appear —
hooray! the archive’s whole and broad 🐇"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 65.52%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title directly and concisely captures the two main objectives: hardening downloads with resilience/integrity checking and migrating the project's packaging and documentation from Poetry to uv.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@ypriverol
Contributor

@chakrabandla all tests are failing.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (4)
pridepy/tests/test_download_resilience.py (1)

56-75: Assert the fallback path actually tries both protocols.

This test can pass on final success without proving aspera failed before s3 was attempted. Capture protocol attempts to lock in the fallback behavior.

🧪 Proposed stronger assertion
```diff
             local_path = os.path.join(tmp_dir, "p.raw")
+            attempted_protocols = []

             def fake_download(record, output_folder, protocol, aspera_bandwidth):
+                attempted_protocols.append(protocol)
                 if protocol == "aspera":
                     with open(local_path, "wb") as handle:
                         handle.write(b"")
                 elif protocol == "s3":
                     with open(local_path, "wb") as handle:
@@

             assert success is True
+            assert attempted_protocols == ["aspera", "s3"]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pridepy/tests/test_download_resilience.py` around lines 56 - 75, The test
currently doesn't prove the fallback actually attempted "aspera" before "s3";
modify the fake_download side_effect used on
Files._download_one_file_by_protocol to record each protocol it receives (e.g.,
append to a local list) and simulate aspera producing an empty file then s3
producing the valid file, then after calling handler._download_with_fallback
assert success is True and assert the recorded protocol attempts equals or
begins with ["aspera", "s3"] to lock in the fallback behavior for the
protocol_sequence argument.
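Folded into a self-contained sketch, the strengthened test looks roughly like this. The Downloader class below is a minimal stand-in for pridepy's Files (whose real _download_with_fallback logic is more involved); only the protocol-recording pattern is the point:

```python
import os
import tempfile
from unittest import mock


class Downloader:
    """Minimal stand-in for pridepy's Files; method names follow the review prompt."""

    def _download_one_file_by_protocol(self, record, output_folder, protocol, aspera_bandwidth):
        raise NotImplementedError  # replaced by the test's fake below

    def _download_with_fallback(self, record, output_folder, protocol_sequence):
        local_path = os.path.join(output_folder, record["fileName"])
        for protocol in protocol_sequence:
            self._download_one_file_by_protocol(record, output_folder, protocol, None)
            # In this simplified sketch, any non-empty file counts as success.
            if os.path.exists(local_path) and os.path.getsize(local_path) > 0:
                return True
        return False


def run_fallback_test():
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, "p.raw")
        attempted_protocols = []

        def fake_download(record, output_folder, protocol, aspera_bandwidth):
            attempted_protocols.append(protocol)
            # aspera "succeeds" but leaves an empty file; s3 writes real bytes.
            payload = b"" if protocol == "aspera" else b"valid-bytes"
            with open(local_path, "wb") as handle:
                handle.write(payload)

        with mock.patch.object(Downloader, "_download_one_file_by_protocol", side_effect=fake_download):
            handler = Downloader()
            success = handler._download_with_fallback(
                {"fileName": "p.raw"}, tmp_dir, protocol_sequence=["aspera", "s3"]
            )

        assert success is True
        # The strengthened assertion: both protocols were attempted, in order.
        assert attempted_protocols == ["aspera", "s3"]
        return attempted_protocols
```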
pridepy/files/files.py (3)

62-62: "aspera" first in auto ordering penalizes systems without Aspera.

PROTOCOL_ORDER puts aspera first, and _download_with_fallback runs max_protocol_retries=2 attempts per protocol. On any machine without the ascp binary (or behind a firewall blocking port 33001), every file will burn two aspera failures before falling back to s3/ftp. For a project with hundreds of files this is a noticeable startup hit and a lot of noisy error logs.

Consider either (a) probing get_ascp_binary() once at the start of download_files and dropping aspera from the sequence when unavailable, or (b) reordering PROTOCOL_ORDER to prefer s3 (which uses HTTPS and is broadly reachable) first in auto mode.

Also applies to: 190-195

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pridepy/files/files.py` at line 62, PROTOCOL_ORDER currently lists "aspera"
first which causes _download_with_fallback to attempt Aspera repeatedly on
machines without the ascp binary; in download_files call get_ascp_binary() once
and if it returns None (or raises), remove "aspera" from the protocol list used
by _download_with_fallback (do not change the global PROTOCOL_ORDER constant),
or alternatively change PROTOCOL_ORDER to put "s3" first; ensure the local code
passes the adjusted protocol sequence into _download_with_fallback so
unavailable aspera is skipped and retries are not wasted.

918-959: download_files is @staticmethod but instantiates Files() to call an instance method.

_download_with_fallback doesn't use self — it only calls static helpers. Either drop the @staticmethod decorator on download_files (and use self), or make _download_with_fallback a @staticmethod so no throwaway Files() instance is needed. The latter is less invasive:

```diff
-    def _download_with_fallback(
-        self,
+    @staticmethod
+    def _download_with_fallback(
         file_record: Dict,
         ...
```

and on lines 949-952:

```diff
-        file_handler = Files()
         for file_record in file_list_json:
             expected_checksum = checksum_map.get(file_record["fileName"])
-            file_handler._download_with_fallback(
+            Files._download_with_fallback(
                 file_record=file_record,
                 ...
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pridepy/files/files.py` around lines 918 - 959, The download_files method
instantiates a throwaway Files() to call _download_with_fallback even though
_download_with_fallback does not use self; make _download_with_fallback a
`@staticmethod` (remove any self parameter or update local references
accordingly), then replace the file_handler = Files() +
file_handler._download_with_fallback(...) call with a direct call to
Files._download_with_fallback(...); also search for other call sites of
_download_with_fallback and update them to the static call if needed.

123-132: Add usedforsecurity=False to suppress Ruff S324 warning.

MD5 here is used for non-security integrity checks against server-provided checksums. Setting usedforsecurity=False makes the intent explicit and suppresses the static analyzer warning. The project's minimum Python version is 3.9+, where this parameter is supported.

```diff
     def compute_md5(file_path: str, chunk_size: int = 4 * 1024 * 1024) -> str:
-        md5 = hashlib.md5()
+        md5 = hashlib.md5(usedforsecurity=False)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pridepy/files/files.py` around lines 123 - 132, The compute_md5 function uses
hashlib.md5() for non-security integrity checks; update the hashlib.md5() call
in compute_md5 to explicitly pass usedforsecurity=False (e.g.,
hashlib.md5(usedforsecurity=False)) so the intent is clear and Ruff S324 is
suppressed; keep the rest of the logic (file reading in chunks and md5.update)
unchanged and ensure the function signature (compute_md5, file_path, chunk_size)
remains the same.
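Applied to compute_md5, the change is a one-argument addition. The chunked-read body here is an assumption about the elided parts of the method:

```python
import hashlib


def compute_md5(file_path: str, chunk_size: int = 4 * 1024 * 1024) -> str:
    """Stream the file in chunks so large downloads never need to fit in memory."""
    # usedforsecurity=False (Python 3.9+) flags this as an integrity check rather
    # than cryptography, which documents intent and silences Ruff S324.
    md5 = hashlib.md5(usedforsecurity=False)
    with open(file_path, "rb") as file_handle:
        for chunk in iter(lambda: file_handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```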
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/python-publish.yml:
- Around line 19-22: The workflow uses the third-party action reference
astral-sh/setup-uv@v4 which is mutable; replace that tag with the action's
specific commit SHA (e.g. astral-sh/setup-uv@<commit-sha>) to pin the dependency
and keep the existing with: enable-cache: true unchanged; update the uses entry
in the publish workflow to the SHA-pinned reference so the publish job is
reproducible and immutable.

In `@pridepy/files/files.py`:
- Around line 862-869: The current fallback assumes publicFileLocations[0]
exists which can raise IndexError when publicFileLocations is [] and also
directly indexes file_record["fileName"] which can raise KeyError; update the
block around Files._get_download_url to safely extract the public URL and
filename: after catching ValueError, read publicFileLocations using
file_record.get("publicFileLocations") and pick the first non-empty dict safely
(e.g. check truthiness or use next(...) with a default) to obtain .get("value",
"") so no IndexError occurs, and replace file_record["fileName"] with
file_record.get("fileName", "unknown") before calling Files.get_output_file_name
or os.path.join to avoid KeyError; keep usage of Files.get_output_file_name and
Files._get_download_url unchanged otherwise.
- Around line 134-149: validate_download currently treats any non-empty file as
valid when expected_checksum is None, allowing truncated downloads to pass; fix
by adding an expected_size check and by ensuring download routines surface
failures: update validate_download to accept an optional expected_size parameter
and return False if os.path.getsize(file_path) != expected_size when
expected_size is provided (keep current empty-file check and checksum behavior),
and update the protocol-specific download functions (download_files_from_ftp,
download_files_from_aspera, download_files_from_globus, download_files_from_s3)
to (a) obtain remote content length (ftp.size / S3 content_length /
Content-Length) and pass it into validate_download, and (b) stop swallowing
exceptions silently—re-raise or return an error so _download_with_fallback can
detect failures and trigger protocol fallback.
- Around line 937-940: Replace the silent log+return when an invalid protocol is
passed with a raised exception: instead of calling logging.error(...) and
returning in the protocol validation block (where protocols_supported = ["ftp",
"aspera", "globus", "s3", "auto"] and protocol is checked), raise a ValueError
containing the invalid protocol and the allowed protocols so callers (and the
CLI) receive a non-zero-failure signal; update any surrounding docstrings or
tests if they assume silent returns.
- Around line 849-916: The caller download_files currently ignores the boolean
return of _download_with_fallback so per-file failures are swallowed; update
download_files to capture the return value of Files._download_with_fallback (or
self._download_with_fallback) for each file, append failed file names to a
failures list when False is returned, and after processing the batch log a
concise summary of failed files (and total succeeded/failed). Finally propagate
failure to the CLI by returning a non-zero indicator or raising SystemExit with
a non-zero code (or returning False from download_files) so the process exits
non-zero on partial failure.

In `@pridepy/pridepy.py`:
- Around line 19-24: The --protocol click options currently accept free-form
strings; update each Click option declaration for the protocol flags (the ones
around the option block at the top, and the other two option blocks near lines
referencing the same flag) to include
type=click.Choice(["auto","ftp","aspera","globus","s3"]) so that commands using
the protocol option validate input at parse time; specifically add
type=click.Choice(["auto","ftp","aspera","globus","s3"]) to the Click option
calls that define the "-p/--protocol" parameter (the option declarations that
include default="auto" and help="Protocol to use...") so all three occurrences
reject invalid values.

In `@README.md`:
- Around line 232-234: The docker run example uses an absolute host path
(/Users/yperez/work/pridepy/paper/) which is not portable; update the example to
mount a repository-relative path instead (replace the host path in the -v option
used in the docker run command such as the entry containing "-v
/Users/yperez/...:/data") so contributors can copy/paste — e.g., use the
current-directory or repo-relative mount (like using "$(pwd)" or "./paper") for
the host side and keep the container side as ":/data", ensuring the rest of the
command (docker run ... openjournals/inara:latest paper.md -p -o pdf) remains
the same.
- Line 223: The README currently claims `pridepy` is published via GitHub
Actions using "PyPI trusted publishing/token flow" with `uv build`, but the
workflow `.github/workflows/python-publish.yml` still uses
`secrets.PYPI_API_TOKEN`; either update the README wording to state it uses the
token-based flow (`secrets.PYPI_API_TOKEN`) or modify the workflow to use OIDC
(remove the static secret and configure the `permissions`/`id-token` and PyPI
OIDC setup) so the claim about trusted publishing via OIDC is accurate.
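The click.Choice suggestion for the --protocol options above can be sketched as follows; the command and option names here are illustrative, not pridepy's exact CLI:

```python
import click

# Choices taken from the review prompt above.
PROTOCOL_CHOICES = ["auto", "ftp", "aspera", "globus", "s3"]


@click.command()
@click.option(
    "-p",
    "--protocol",
    type=click.Choice(PROTOCOL_CHOICES),  # invalid values are rejected at parse time
    default="auto",
    help="Protocol to use for downloads; auto falls back across protocols.",
)
def download_cmd(protocol):
    click.echo(f"protocol={protocol}")
```

With the Choice type, an invalid value such as `-p bogus` produces a usage error (exit code 2) before any download code runs, instead of failing deep inside the downloader.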

---

Nitpick comments:
In `@pridepy/files/files.py`:
- Line 62: PROTOCOL_ORDER currently lists "aspera" first which causes
_download_with_fallback to attempt Aspera repeatedly on machines without the
ascp binary; in download_files call get_ascp_binary() once and if it returns
None (or raises), remove "aspera" from the protocol list used by
_download_with_fallback (do not change the global PROTOCOL_ORDER constant), or
alternatively change PROTOCOL_ORDER to put "s3" first; ensure the local code
passes the adjusted protocol sequence into _download_with_fallback so
unavailable aspera is skipped and retries are not wasted.
- Around line 918-959: The download_files method instantiates a throwaway
Files() to call _download_with_fallback even though _download_with_fallback does
not use self; make _download_with_fallback a `@staticmethod` (remove any self
parameter or update local references accordingly), then replace the file_handler
= Files() + file_handler._download_with_fallback(...) call with a direct call to
Files._download_with_fallback(...); also search for other call sites of
_download_with_fallback and update them to the static call if needed.
- Around line 123-132: The compute_md5 function uses hashlib.md5() for
non-security integrity checks; update the hashlib.md5() call in compute_md5 to
explicitly pass usedforsecurity=False (e.g., hashlib.md5(usedforsecurity=False))
so the intent is clear and Ruff S324 is suppressed; keep the rest of the logic
(file reading in chunks and md5.update) unchanged and ensure the function
signature (compute_md5, file_path, chunk_size) remains the same.

In `@pridepy/tests/test_download_resilience.py`:
- Around line 56-75: The test currently doesn't prove the fallback actually
attempted "aspera" before "s3"; modify the fake_download side_effect used on
Files._download_one_file_by_protocol to record each protocol it receives (e.g.,
append to a local list) and simulate aspera producing an empty file then s3
producing the valid file, then after calling handler._download_with_fallback
assert success is True and assert the recorded protocol attempts equals or
begins with ["aspera", "s3"] to lock in the fallback behavior for the
protocol_sequence argument.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ee7939da-352a-4a48-bf06-34fb23fcef42

📥 Commits

Reviewing files that changed from the base of the PR and between b046bd0 and 7493bc8.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • .github/workflows/python-app.yml
  • .github/workflows/python-package.yml
  • .github/workflows/python-publish.yml
  • README.md
  • pridepy/files/files.py
  • pridepy/pridepy.py
  • pridepy/tests/test_download_resilience.py
  • pyproject.toml
  • recipe/meta.yaml

Comment on lines +134 to +149
```python
@staticmethod
def validate_download(file_path: str, expected_checksum: Optional[str] = None) -> Tuple[bool, str]:
    """
    Validate a local file exists, is non-empty, and checksum matches when provided.
    """
    if not os.path.exists(file_path):
        return False, "file does not exist"
    if os.path.getsize(file_path) == 0:
        return False, "file is empty"
    if expected_checksum:
        actual_checksum = Files.compute_md5(file_path)
        if actual_checksum.lower() != expected_checksum.lower():
            return False, (
                f"checksum mismatch (expected={expected_checksum.lower()}, actual={actual_checksum.lower()})"
            )
    return True, "ok"
```
Contributor

⚠️ Potential issue | 🟠 Major

Partial/corrupt downloads pass validation when checksum_check=False.

validate_download only checks existence and non-empty size when expected_checksum is None. Combined with the sub-download functions (download_files_from_ftp/aspera/globus/s3) that swallow exceptions internally and don't re-raise, a partially-written file from a truncated transfer will look "valid" to _download_with_fallback and protocol fallback will never trigger unless the user opts into checksum_check=True. Users running with defaults will silently receive corrupt data.

Options: propagate failures from the sub-download functions so the outer try/except at Line 894 sees them, or compare against a known size (e.g., ftp.size / S3 content_length / Content-Length) in validate_download even when no checksum is available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pridepy/files/files.py` around lines 134 - 149, validate_download currently
treats any non-empty file as valid when expected_checksum is None, allowing
truncated downloads to pass; fix by adding an expected_size check and by
ensuring download routines surface failures: update validate_download to accept
an optional expected_size parameter and return False if
os.path.getsize(file_path) != expected_size when expected_size is provided (keep
current empty-file check and checksum behavior), and update the
protocol-specific download functions (download_files_from_ftp,
download_files_from_aspera, download_files_from_globus, download_files_from_s3)
to (a) obtain remote content length (ftp.size / S3 content_length /
Content-Length) and pass it into validate_download, and (b) stop swallowing
exceptions silently—re-raise or return an error so _download_with_fallback can
detect failures and trigger protocol fallback.
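A size-aware variant of validate_download, per the suggestion above; the expected_size parameter and message wording are assumptions, and the checksum branch is elided here since it would mirror the existing method:

```python
import os
from typing import Optional, Tuple


def validate_download(
    file_path: str,
    expected_checksum: Optional[str] = None,
    expected_size: Optional[int] = None,
) -> Tuple[bool, str]:
    """Reject missing, empty, or wrong-sized files even when no checksum is available."""
    if not os.path.exists(file_path):
        return False, "file does not exist"
    actual_size = os.path.getsize(file_path)
    if actual_size == 0:
        return False, "file is empty"
    if expected_size is not None and actual_size != expected_size:
        return False, f"file size mismatch (expected={expected_size}, actual={actual_size})"
    # The checksum comparison (Files.compute_md5 against expected_checksum)
    # would follow here, unchanged from the original method.
    return True, "ok"
```

The size would come from the protocol layer (ftp.size, S3 content_length, or an HTTP Content-Length header), so truncated transfers trigger fallback even with checksum_check=False.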

Contributor

Copilot AI left a comment


Pull request overview

This PR hardens pridepy’s download flow by defaulting to an automatic protocol fallback strategy and adding checksum-based validation, while also migrating packaging and CI/docs from Poetry to a uv + Setuptools (PEP 621) setup.

Changes:

  • Implemented protocol fallback (auto) and local-file validation (non-empty + optional checksum) in the downloader.
  • Migrated packaging metadata/build backend from Poetry to Setuptools and updated CI to use uv.
  • Refreshed README installation/dev instructions and added tests for the new download resilience behavior.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 6 comments.

Show a summary per file
  • recipe/meta.yaml: Updates conda build requirements to align with Setuptools-based builds and updates license metadata.
  • pyproject.toml: Converts Poetry config to PEP 621 [project] metadata and the Setuptools build backend; adds optional dev dependencies.
  • pridepy/files/files.py: Adds checksum parsing/validation plus protocol-fallback download orchestration and updates the default protocol.
  • pridepy/pridepy.py: Changes CLI defaults/docs to use the auto protocol and mentions fallback + checksum options.
  • pridepy/tests/test_download_resilience.py: Adds tests covering checksum parsing, validation, protocol sequencing, and fallback behavior.
  • README.md: Updates install/dev instructions (uv), expands usage examples, and documents the new robust download mode.
  • .github/workflows/python-publish.yml: Switches the build step to uv build and modernizes setup actions.
  • .github/workflows/python-package.yml: Updates CI to install/build/lint/test via uv.
  • .github/workflows/python-app.yml: Updates CI to install/lint/test via uv.
  • poetry.lock: Removes the Poetry lockfile as part of the migration away from Poetry.
Comments suppressed due to low confidence (1)

.github/workflows/python-publish.yml:30

  • This workflow publishes using an API token secret (user: __token__ + PYPI_API_TOKEN). If the intent is PyPI Trusted Publishing (OIDC), this should be switched to trusted publishing and the workflow should request permissions: id-token: write (and avoid long-lived secrets). Otherwise, please update the README text that claims trusted publishing.
```yaml
    - name: Publish package
      uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
      with:
        user: __token__
        password: ${{ secrets.PYPI_API_TOKEN }}
        verbose: true
```


Comment on lines +124 to +126
```python
def compute_md5(file_path: str, chunk_size: int = 4 * 1024 * 1024) -> str:
    md5 = hashlib.md5()
    with open(file_path, "rb") as file_handle:
```

Copilot AI Apr 21, 2026


compute_md5() uses hashlib.md5() directly. On some OpenSSL/FIPS-enabled environments this can raise a ValueError (MD5 disabled), which would make --checksum-check unusable. Consider using the usedforsecurity=False flag when available and/or gracefully handling MD5-unavailable environments so downloads can still proceed (at least with the non-empty validation).

Comment on lines 801 to +804
```python
output_path = os.path.join(output_folder, f"{accession}-checksum.tsv")
with open(output_path, "w") as file:
    file.write(data)
return output_path
```

Copilot AI Apr 21, 2026


save_checksum_file() writes the checksum file into output_folder without ensuring the directory exists, which can raise FileNotFoundError when --checksum-check is enabled (since checksums are fetched before any protocol-specific downloader creates the directory). Also, the checksum endpoint used in this function points at wwwdev.ebi.ac.uk; if that host isn’t intended for end users, please switch to the production www.ebi.ac.uk base (or derive from V3_API_BASE_URL).

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (3)
pridepy/files/files.py (3)

954-957: ⚠️ Potential issue | 🟡 Minor

Raise on unsupported protocols instead of returning success.

A silent return leaves API callers and CLI wrappers with no failure signal for an invalid protocol value.

Proposed fix
```diff
-        protocols_supported = ["ftp", "aspera", "globus", "s3", "auto"]
+        protocols_supported = {"ftp", "aspera", "globus", "s3", "auto"}
         if protocol not in protocols_supported:
-            logging.error("Protocol should be one of ftp, aspera, globus, s3, auto")
-            return
+            raise ValueError(
+                f"Unsupported protocol '{protocol}'. Must be one of: {sorted(protocols_supported)}"
+            )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pridepy/files/files.py` around lines 954 - 957, The code currently logs an
error and returns silently when an unsupported protocol is provided (check
protocols_supported and protocol), which masks failures; replace the
logging.error + return with raising an exception (e.g., ValueError) that
includes the invalid protocol value and the allowed list (ftp, aspera, globus,
s3, auto) so callers/CLI receive a proper failure signal; keep or log the error
as secondary but ensure the function raises the error instead of returning.

879-886: ⚠️ Potential issue | 🟡 Minor

Guard the fallback URL extraction for empty locations.

file_record.get("publicFileLocations", [{}])[0] still raises IndexError when the API returns an empty list, aborting the batch before fallback can report a clean failure.

Proposed fix
```diff
         try:
             canonical_url = Files._get_download_url(file_record, "ftp")
         except ValueError:
-            canonical_url = file_record.get("publicFileLocations", [{}])[0].get("value", "")
+            locations = file_record.get("publicFileLocations") or []
+            canonical_url = locations[0].get("value", "") if locations else ""
         if canonical_url:
             local_path = Files.get_output_file_name(canonical_url, file_record, output_folder)
         else:
-            local_path = os.path.join(output_folder, file_record["fileName"])
+            local_path = os.path.join(output_folder, file_record.get("fileName", "unknown"))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pridepy/files/files.py` around lines 879 - 886, The fallback extraction in
the except block can still raise IndexError because
file_record.get("publicFileLocations", [{}])[0] assumes a non-empty list; update
the fallback to safely handle empty or missing lists by retrieving
publicFileLocations into a variable (e.g., locations =
file_record.get("publicFileLocations") or []) and then set canonical_url =
locations[0].get("value", "") only if locations is truthy, otherwise set
canonical_url = "" (so Files.get_output_file_name and the subsequent else branch
won't crash); modify the logic around Files._get_download_url and
Files.get_output_file_name accordingly to use this guarded canonical_url.

140-155: ⚠️ Potential issue | 🟠 Major

Non-empty truncated files still pass when checksum validation is off.

With expected_checksum=None, validation accepts any non-empty file. Since checksum checking is optional, a partial transfer can still be treated as successful instead of triggering fallback.

Suggested direction
```diff
-    def validate_download(file_path: str, expected_checksum: Optional[str] = None) -> Tuple[bool, str]:
+    def validate_download(
+        file_path: str,
+        expected_checksum: Optional[str] = None,
+        expected_size: Optional[int] = None,
+    ) -> Tuple[bool, str]:
@@
         if os.path.getsize(file_path) == 0:
             return False, "file is empty"
+        if expected_size is not None and os.path.getsize(file_path) != expected_size:
+            return False, "file size mismatch"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pridepy/files/files.py` around lines 140 - 155, validate_download currently
treats any non-empty file as valid when expected_checksum is None, allowing
truncated partial downloads to pass; update Files.validate_download to accept an
optional expected_size: Optional[int] parameter and change validation logic so
that if expected_checksum is provided use Files.compute_md5 as before, else if
expected_size is provided compare os.path.getsize(file_path) to expected_size
and fail on mismatch, and only accept the file when one of these verifications
passes; reference the method name Files.validate_download and Files.compute_md5
when making these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pridepy/files/files.py`:
- Around line 81-93: The parser currently replaces tabs with spaces then splits
(tokens = clean.replace("\t", " ").split()), which breaks tab-delimited
filenames containing spaces and causes checksum detection to use the wrong
basename; update the parsing in the function that sets
tokens/checksum/path_token so that if the original line contains a tab you split
on the last tab (preserving the full filename left of the tab) to produce [path,
checksum], otherwise fall back to whitespace split and the existing checksum
detection logic (use the variables tokens, checksum, path_token to locate and
set values accordingly).
- Around line 128-131: Replace the try/except fallback that creates md5 with a
single secure call: remove the try/except around hashlib.md5 and always
instantiate md5 using hashlib.md5(usedforsecurity=False); update the variable
md5 in the same scope (where md5 is currently assigned) so there is no raw
hashlib.md5() fallback and S324 warnings are eliminated.

In `@pridepy/tests/test_download_resilience.py`:
- Line 1: Replace the computed MD5 fixture with its known digest literal and
remove the now-unused hashlib import: find where the test uses hashlib.md5() to
produce the deterministic fixture (in tests in this file) and substitute the
literal hex digest string directly in the fixture/value, then delete the import
hashlib line so the Ruff S324 lint violation is resolved.

---

Duplicate comments:
In `@pridepy/files/files.py`:
- Around line 954-957: The code currently logs an error and returns silently
when an unsupported protocol is provided (check protocols_supported and
protocol), which masks failures; replace the logging.error + return with raising
an exception (e.g., ValueError) that includes the invalid protocol value and the
allowed list (ftp, aspera, globus, s3, auto) so callers/CLI receive a proper
failure signal; keep or log the error as secondary but ensure the function
raises the error instead of returning.
- Around line 879-886: The fallback extraction in the except block can still
raise IndexError because file_record.get("publicFileLocations", [{}])[0] assumes
a non-empty list; update the fallback to safely handle empty or missing lists by
retrieving publicFileLocations into a variable (e.g., locations =
file_record.get("publicFileLocations") or []) and then set canonical_url =
locations[0].get("value", "") only if locations is truthy, otherwise set
canonical_url = "" (so Files.get_output_file_name and the subsequent else branch
won't crash); modify the logic around Files._get_download_url and
Files.get_output_file_name accordingly to use this guarded canonical_url.
- Around line 140-155: validate_download currently treats any non-empty file as
valid when expected_checksum is None, allowing truncated partial downloads to
pass; update Files.validate_download to accept an optional expected_size:
Optional[int] parameter and change validation logic so that if expected_checksum
is provided use Files.compute_md5 as before, else if expected_size is provided
compare os.path.getsize(file_path) to expected_size and fail on mismatch, and
only accept the file when one of these verifications passes; reference the
method name Files.validate_download and Files.compute_md5 when making these
changes.
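The first duplicate above (unsupported protocol) boils down to replacing a silent `logging.error` + `return` with a raise; a minimal sketch, with `SUPPORTED_PROTOCOLS` and the helper name as assumptions rather than pridepy's actual code:

```python
SUPPORTED_PROTOCOLS = ("ftp", "aspera", "globus", "s3", "auto")

def require_supported_protocol(protocol: str) -> str:
    """Raise instead of returning silently on an unsupported protocol."""
    if protocol not in SUPPORTED_PROTOCOLS:
        raise ValueError(
            f"Unsupported protocol {protocol!r}; "
            f"expected one of: {', '.join(SUPPORTED_PROTOCOLS)}"
        )
    return protocol
```

Raising here lets the CLI exit non-zero with a clear message instead of masking the typo as a no-op download.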
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 865b02be-6ef0-412c-a426-f75abca30c43

📥 Commits

Reviewing files that changed from the base of the PR and between 7493bc8 and ec6c03d.

📒 Files selected for processing (9)
  • .flake8
  • .github/workflows/python-app.yml
  • .github/workflows/python-package.yml
  • .github/workflows/python-publish.yml
  • README.md
  • pridepy/files/files.py
  • pridepy/pridepy.py
  • pridepy/tests/test_download_resilience.py
  • pyproject.toml
✅ Files skipped from review due to trivial changes (2)
  • .flake8
  • .github/workflows/python-publish.yml
🚧 Files skipped from review as they are similar to previous changes (4)
  • .github/workflows/python-package.yml
  • .github/workflows/python-app.yml
  • pridepy/pridepy.py
  • pyproject.toml

Comment thread pridepy/files/files.py
Comment on lines +81 to +93
        tokens = clean.replace("\t", " ").split()
        if len(tokens) < 2:
            return None

        checksum = None
        path_token = None
        for idx, token in enumerate(tokens):
            normalized = token.lower()
            if len(normalized) == 32 and all(c in "0123456789abcdef" for c in normalized):
                checksum = normalized
                remaining = [t for i, t in enumerate(tokens) if i != idx]
                if remaining:
                    path_token = remaining[-1]

⚠️ Potential issue | 🟠 Major

Preserve tab-delimited filenames with spaces.

Replacing tabs with spaces before splitting can turn file name.raw<TAB><md5> into the basename name.raw, so --checksum-check silently misses the real file.

Proposed parser adjustment
-        tokens = clean.replace("\t", " ").split()
-        if len(tokens) < 2:
-            return None
-
-        checksum = None
-        path_token = None
-        for idx, token in enumerate(tokens):
-            normalized = token.lower()
-            if len(normalized) == 32 and all(c in "0123456789abcdef" for c in normalized):
-                checksum = normalized
-                remaining = [t for i, t in enumerate(tokens) if i != idx]
-                if remaining:
-                    path_token = remaining[-1]
-                break
+        def is_md5(token: str) -> bool:
+            normalized = token.lower()
+            return len(normalized) == 32 and all(c in "0123456789abcdef" for c in normalized)
+
+        if "\t" in clean:
+            fields = [field.strip() for field in clean.split("\t") if field.strip()]
+            candidates = [(fields[0], fields[-1]), (fields[-1], fields[0])] if len(fields) >= 2 else []
+        else:
+            candidates = []
+            first_split = clean.split(maxsplit=1)
+            last_split = clean.rsplit(maxsplit=1)
+            if len(first_split) == 2:
+                candidates.append((first_split[0], first_split[1]))
+            if len(last_split) == 2:
+                candidates.append((last_split[1], last_split[0]))
+
+        checksum = None
+        path_token = None
+        for checksum_token, candidate_path in candidates:
+            if is_md5(checksum_token):
+                checksum = checksum_token.lower()
+                path_token = candidate_path
+                break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pridepy/files/files.py` around lines 81 - 93, The parser currently replaces
tabs with spaces then splits (tokens = clean.replace("\t", " ").split()), which
breaks tab-delimited filenames containing spaces and causes checksum detection
to use the wrong basename; update the parsing in the function that sets
tokens/checksum/path_token so that if the original line contains a tab you split
on the last tab (preserving the full filename left of the tab) to produce [path,
checksum], otherwise fall back to whitespace split and the existing checksum
detection logic (use the variables tokens, checksum, path_token to locate and
set values accordingly).
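The tab-aware strategy from the prompt can be sketched as a standalone function (the name and `(path, md5)` return shape are illustrative, not pridepy's actual API):

```python
from typing import Optional, Tuple

def parse_checksum_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (path, md5) parsed from one checksum-listing line, or None."""
    def is_md5(token: str) -> bool:
        t = token.lower()
        return len(t) == 32 and all(c in "0123456789abcdef" for c in t)

    clean = line.strip()
    if "\t" in clean:
        # Split on the last tab so filenames containing spaces survive.
        path, _, checksum = clean.rpartition("\t")
        path, checksum = path.strip(), checksum.strip()
        if is_md5(checksum):
            return path, checksum.lower()
        # Columns may be reversed: <md5><TAB><path>.
        head, _, tail = clean.partition("\t")
        if is_md5(head.strip()):
            return tail.strip(), head.strip().lower()
        return None
    # No tab: fall back to whitespace splitting with checksum detection.
    tokens = clean.split()
    if len(tokens) < 2:
        return None
    for idx, token in enumerate(tokens):
        if is_md5(token):
            remaining = [t for i, t in enumerate(tokens) if i != idx]
            return remaining[-1], token.lower()
    return None
```

A line like `file name.raw<TAB><md5>` now keeps the full filename, whereas the tabs-to-spaces approach would have reported only `name.raw`.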

@ypriverol ypriverol self-requested a review April 21, 2026 15:49
ypriverol and others added 5 commits April 21, 2026 16:52
…n packaging

- files.py: split download orchestration into a batch-first primary pass
  (reuses one FTP/S3 connection for the whole project, as before) plus a
  per-file fallback for invalid/missing files. Removes the per-file
  reconnect regression from the initial PR version.
- files.py: make `_get_download_url` symmetric — raise cleanly when the
  requested protocol's location is unavailable instead of silently
  falling back to an arbitrary (possibly Aspera) URL for ftp/s3/globus.
- files.py: hoist the globus base URL into a class constant so the
  mapping is defined in one place; call static helpers via `Files.`.
- pyproject.toml: use PEP 621 table form for the license field so it
  works with the pinned `setuptools>=68` and re-add the
  `License :: OSI Approved :: Apache Software License` classifier.
  Bump boto3/botocore lower bounds to match meta.yaml. Add Python 3.12
  to the supported classifiers.
- python-package.yml: test against 3.9-3.12 and run lint/tests before
  building the wheel.
- tests: update the resilience suite for the renamed helper and add a
  test asserting the batch path skips the per-file fallback on success.

https://claude.ai/code/session_01V9mHYkhG1pggyzJCii8do3
Add resilient multi-protocol download with fallback and checksum validation

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pridepy/files/files.py`:
- Around line 1001-1016: The loop indexes file_record["fileName"] directly which
can raise KeyError; change to a defensive local variable like file_name =
file_record.get("fileName", "<unknown>") and use that variable for checksum
lookup (replace checksum_map.get(file_record["fileName"]) with
checksum_map.get(file_name)), for log messages, and when calling
Files._remove_if_exists/Files.validate_download handling so the loop won't abort
on missing keys; also review Files._resolve_local_path and any calls in
_download_with_fallback that access file_record["fileName"] and apply the same
.get fallback for consistency.

In `@pyproject.toml`:
- Around line 68-81: The package discovery currently uses
[tool.setuptools.packages.find] with include = ["pridepy*"] which also picks up
test packages; update that section by adding an exclude entry to omit test
packages (e.g., add exclude = ["pridepy.tests*"] or similar pattern) so that
pridepy.tests and any subpackages are not shipped in the installed wheel; modify
the same [tool.setuptools.packages.find] block where include = ["pridepy*"] is
declared to add the exclude key.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a3d6a8ae-cb03-437c-a6fa-4db41e4580b9

📥 Commits

Reviewing files that changed from the base of the PR and between ec6c03d and 53bf0c5.

📒 Files selected for processing (8)
  • .github/workflows/python-package.yml
  • .gitignore
  • README.md
  • pridepy/files/files.py
  • pridepy/pridepy.py
  • pridepy/tests/test_download_resilience.py
  • pyproject.toml
  • recipe/meta.yaml
✅ Files skipped from review due to trivial changes (2)
  • .gitignore
  • recipe/meta.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/python-package.yml
  • pridepy/pridepy.py

Comment thread pridepy/files/files.py
Comment on lines +1001 to +1016
        failed_files: List[str] = []
        for file_record in file_list_json:
            expected_checksum = checksum_map.get(file_record["fileName"])
            local_path = Files._resolve_local_path(file_record, output_folder)
            valid, reason = Files.validate_download(local_path, expected_checksum)
            if valid:
                continue

            logging.warning(
                f"{file_record['fileName']} invalid after {primary_protocol} ({reason})"
            )
            Files._remove_if_exists(local_path)

            if not fallback_sequence:
                failed_files.append(file_record.get("fileName", "<unknown>"))
                continue

⚠️ Potential issue | 🟡 Minor

file_record["fileName"] can KeyError and abort the entire batch.

Line 1003 indexes fileName directly, and line 1010’s log message does the same. If the API ever returns a file record without fileName, the whole download_files call aborts mid-loop — even though line 1015 already uses .get("fileName", "<unknown>") for the failure summary. Use the same defensive access everywhere.

🛡️ Proposed fix
-        for file_record in file_list_json:
-            expected_checksum = checksum_map.get(file_record["fileName"])
-            local_path = Files._resolve_local_path(file_record, output_folder)
-            valid, reason = Files.validate_download(local_path, expected_checksum)
-            if valid:
-                continue
-
-            logging.warning(
-                f"{file_record['fileName']} invalid after {primary_protocol} ({reason})"
-            )
+        for file_record in file_list_json:
+            file_name = file_record.get("fileName", "<unknown>")
+            expected_checksum = checksum_map.get(file_name)
+            local_path = Files._resolve_local_path(file_record, output_folder)
+            valid, reason = Files.validate_download(local_path, expected_checksum)
+            if valid:
+                continue
+
+            logging.warning(
+                f"{file_name} invalid after {primary_protocol} ({reason})"
+            )

Note: _resolve_local_path (line 217) and _download_with_fallback (lines 905, 919, 925, 930, 938) also index file_record["fileName"] directly; consider the same fallback there for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pridepy/files/files.py` around lines 1001 - 1016, The loop indexes
file_record["fileName"] directly which can raise KeyError; change to a defensive
local variable like file_name = file_record.get("fileName", "<unknown>") and use
that variable for checksum lookup (replace
checksum_map.get(file_record["fileName"]) with checksum_map.get(file_name)), for
log messages, and when calling Files._remove_if_exists/Files.validate_download
handling so the loop won't abort on missing keys; also review
Files._resolve_local_path and any calls in _download_with_fallback that access
file_record["fileName"] and apply the same .get fallback for consistency.

Comment thread pyproject.toml
Comment on lines +68 to +81
[tool.setuptools.packages.find]
where = ["."]
include = ["pridepy*"]

[tool.setuptools.package-data]
pridepy = [
"aspera/etc/*",
"aspera/key/*",
"aspera/linux-32/*",
"aspera/linux-64/*",
"aspera/mac-intel/*",
"aspera/windows-32/*",
"aspera/windows-64/*",
]

🛠️ Refactor suggestion | 🟠 Major

Exclude the test package from the installed wheel.

include = ["pridepy*"] also matches pridepy.tests, so the resilience/test modules will ship in the installed wheel and become importable as pridepy.tests.* by end users. Add an explicit exclude to keep tests out of the distribution.

🧹 Proposed fix
 [tool.setuptools.packages.find]
 where = ["."]
 include = ["pridepy*"]
+exclude = ["pridepy.tests*", "tests*"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 68 - 81, The package discovery currently uses
[tool.setuptools.packages.find] with include = ["pridepy*"] which also picks up
test packages; update that section by adding an exclude entry to omit test
packages (e.g., add exclude = ["pridepy.tests*"] or similar pattern) so that
pridepy.tests and any subpackages are not shipped in the installed wheel; modify
the same [tool.setuptools.packages.find] block where include = ["pridepy*"] is
declared to add the exclude key.

@ypriverol ypriverol changed the base branch from master to dev April 21, 2026 17:31
@ypriverol ypriverol merged commit d28d807 into dev Apr 21, 2026
11 checks passed