Skip to content

Improve MassIVE download. #99

Merged
ypriverol merged 4 commits into
masterfrom
dev
May 27, 2026
Merged

Improve MassIVE download. #99
ypriverol merged 4 commits into
masterfrom
dev

Conversation

@ypriverol
Copy link
Copy Markdown
Contributor

@ypriverol ypriverol commented May 27, 2026

This pull request adds support for downloading public MassIVE datasets directly using their MSV... accessions, alongside existing PRIDE dataset functionality. The changes include both user-facing command-line improvements and substantial updates to the core file handling logic to enable MassIVE dataset discovery, file listing, and downloading via anonymous FTP. Documentation is also updated to reflect the new MassIVE capabilities.

MassIVE dataset support:

  • Added logic to detect MassIVE accessions, enumerate public files from the MassIVE FTP server, and map MassIVE collections to standard file categories. New helper methods and a recursive FTP directory walker were introduced in Files to enable this functionality. [1] [2]
  • Updated all major download and file listing methods (e.g., get_all_raw_file_list, download_all_raw_files, download_file_by_name, download_files_by_list, download_all_category_files, get_all_category_file_list) to support MassIVE accessions and route them through the new MassIVE-specific logic. [1] [2] [3] [4] [5] [6] [7] [8] [9]

Command-line interface and documentation:

  • Updated CLI commands and help text to clarify support for both PRIDE and MassIVE accessions, and added new usage examples for MassIVE datasets in the README.md. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11]

File download and FTP handling improvements:

  • Improved FTP file path handling for both PRIDE and MassIVE downloads, including more robust extraction of file paths from URLs and recursive directory traversal for MassIVE datasets.
  • Added a category mapping for MassIVE collections to align with PRIDE file categories, ensuring consistent filtering and download behavior.

General code improvements:

  • Minor refactoring and docstring updates to clarify that functions now support both PRIDE and MassIVE accessions. [1] [2] [3] [4] [5]

These changes collectively allow users to seamlessly download public datasets from MassIVE in addition to PRIDE, using the same CLI commands and Python API.

ypriverol added 4 commits May 27, 2026 09:57
Resolves conflict in pridepy/files/files.py: drop the now-obsolete
GLOBUS_BASE_URL constant (renamed to PRIDE_ARCHIVE_HTTPS_URL_PREFIX on
dev in 7958d9a), and keep the MASSIVE_ARCHIVE_FTP* constants alongside.
Previously download_files_by_list always called stream_all_files_by_project
(PRIDE-only), which broke filename-based downloads for MassIVE accessions.
Branch on is_massive_accession to list via _list_massive_public_files and
download via _download_massive_file_records.

Also clarify --parallel-files help text: it applies to all protocols, not
only globus.
Add direct download support for MassIVE accessions
@qodo-code-review
Copy link
Copy Markdown
Contributor

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 27, 2026

📝 Walkthrough

Walkthrough

pridepy extends file download and listing workflows to support MassIVE dataset accessions (MSV/rmsv format) via anonymous FTP enumeration. Core entry points now detect and branch on MassIVE accessions, reusing new FTP enumeration and file-record building helpers. FTP path extraction logic is also refactored to use URL parsing.

Changes

MassIVE Dataset Download Support

Layer / File(s) Summary
MassIVE configuration and accession detection
pridepy/files/files.py
posixpath is imported and MassIVE configuration constants are defined: FTP hostname, URL prefix, and a collection-to-category mapping for normalizing file records.
MassIVE FTP enumeration and file record building
pridepy/files/files.py
Helper methods detect MassIVE accessions, build FTP root/URL paths, map collections to PRIDE-like categories, construct file records from FTP paths (handling raw/peak/wiff.scan special cases), recursively walk FTP trees, and list/download all public files.
Entry point branching and FTP integration
pridepy/files/files.py
Raw file, category, and single-file listing/download methods branch on MassIVE accessions to use FTP enumeration instead of PRIDE API paths. FTP path extraction refactored to use urlparse with unquoting instead of string replacement.
Unit tests for MassIVE operations
pridepy/tests/test_massive_files.py
New test class verifies accession detection, file record construction with category mapping and raw collection handling, raw file filtering, and download-by-name integration.
Documentation and CLI help updates
README.md, pridepy/pridepy.py
README adds MassIVE download capabilities to the feature list and Quick Start examples. CLI help text across multiple commands updated to refer to "PRIDE or MassIVE dataset" accessions.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • PRIDE-Archive/pridepy#67: Modifies FTP download plumbing (URL parsing and path handling) for downloading from different dataset sources, overlapping with the urlparse refactoring in this PR.
  • PRIDE-Archive/pridepy#93: Updates download_all_raw_files and download_files_by_list entry points with parallelism and resilience; this PR adds MassIVE branching to the same methods.
  • PRIDE-Archive/pridepy#61: Introduces download-all-public-raw-files and download-all-public-category-files CLI commands; this PR extends the underlying file logic to operate on MassIVE datasets.

Suggested labels

enhancement, feature, tests

Suggested reviewers

  • selvaebi
  • chakrabandla

Poem

🐰 A rabbit hops through data trees,
FTP nodes dance in the breeze,
MassIVE files now find their way,
Alongside PRIDE, hip-hip-hooray!
Tests ensure each path rings true. 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'Improve MassIVE download' is vague and generic, using non-descriptive language that doesn't convey meaningful information about the specific changes made. Use a more specific title that describes the actual enhancement, such as 'Add MassIVE dataset download support' or 'Support downloading files directly from MassIVE FTP'.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
README.md (1)

129-129: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the Quick Start section numbering.

This should be ### 7) instead of ### 6) to maintain sequential numbering after the new MassIVE examples were inserted above.

📝 Proposed fix
-### 6) Download a named subset of files (manifest)
+### 7) Download a named subset of files (manifest)

Also update line 151:

-### 7) Download files from raw URLs
+### 8) Download files from raw URLs
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 129, The Quick Start heading "### 6) Download a named
subset of files (manifest)" is misnumbered after inserting MassIVE examples;
update the heading text in README.md to "### 7) Download a named subset of files
(manifest)" and also adjust the subsequent Quick Start numbering (the later
heading referenced in the comment) so all section numbers remain sequential;
locate and update the literal headings (e.g., the "### 6) Download a named
subset of files (manifest)" token and the later Quick Start heading token) to
the corrected numeric values.
🧹 Nitpick comments (1)
pridepy/files/files.py (1)

300-301: 💤 Low value

Add debug logging when falling back from mlsd to LIST.

Silent exception handling here hides why the fallback path is taken. Adding a debug log would help troubleshoot FTP server compatibility issues.

💡 Proposed improvement
         except (AttributeError, ftplib.error_perm):
-            pass
+            logging.debug(f"mlsd not supported for {remote_dir}, falling back to LIST")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/files/files.py` around lines 300 - 301, The except block catching
(AttributeError, ftplib.error_perm) that falls back from using MLSD to LIST
should log a debug message instead of silently passing; update the exception
handler where MLSD -> LIST fallback occurs to call the module/logger debug
method (e.g., logger.debug or processLogger.debug) and include context like
"falling back from MLSD to LIST" plus the exception details and the FTP server
address or path if available; keep behavior the same otherwise so the fallback
continues after logging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@README.md`:
- Line 129: The Quick Start heading "### 6) Download a named subset of files
(manifest)" is misnumbered after inserting MassIVE examples; update the heading
text in README.md to "### 7) Download a named subset of files (manifest)" and
also adjust the subsequent Quick Start numbering (the later heading referenced
in the comment) so all section numbers remain sequential; locate and update the
literal headings (e.g., the "### 6) Download a named subset of files (manifest)"
token and the later Quick Start heading token) to the corrected numeric values.

---

Nitpick comments:
In `@pridepy/files/files.py`:
- Around line 300-301: The except block catching (AttributeError,
ftplib.error_perm) that falls back from using MLSD to LIST should log a debug
message instead of silently passing; update the exception handler where MLSD ->
LIST fallback occurs to call the module/logger debug method (e.g., logger.debug
or processLogger.debug) and include context like "falling back from MLSD to
LIST" plus the exception details and the FTP server address or path if
available; keep behavior the same otherwise so the fallback continues after
logging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 90f23e9a-b3a8-4362-8f79-67667310ea1e

📥 Commits

Reviewing files that changed from the base of the PR and between 6c554b5 and 79bace3.

📒 Files selected for processing (4)
  • README.md
  • pridepy/files/files.py
  • pridepy/pridepy.py
  • pridepy/tests/test_massive_files.py

@ypriverol ypriverol merged commit c09cca5 into master May 27, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant