
perf(flashtnt): load only the selected proteoform's scan-scoped data#77

Merged
t0mdavid-m merged 2 commits into develop from speedup/flashtnt-viewer-scoped-loading
May 28, 2026


Conversation

@t0mdavid-m
Member

@t0mdavid-m t0mdavid-m commented May 28, 2026

Selecting a proteoform now resolves its scan and filters the spectra, mass table, and tag table to that scan instead of shipping every scan's data to the browser. sequence_data is stored one row per proteoform (sequence_data.pq) and the Sequence View pushdown-loads only the selected proteoform's row, replacing the ~40s monolithic load.

Summary by CodeRabbit

  • New Features

    • Persistent, efficient storage for sequence data enabling faster access and selective reads.
    • Proteoform→scan mapping to power per-proteoform selections and per-scan views.
  • Refactor

    • Unified scan-scoped data loading so proteoform mapping and sequence data attach consistently.
    • Selection/filtering now drives per-scan and tag views from mapped proteoform entries for more accurate displays.


Selecting a proteoform now resolves its scan and filters the spectra, mass
table, and tag table to that scan instead of shipping every scan's data to the
browser. sequence_data is stored one row per proteoform (sequence_data.pq) and
the Sequence View pushdown-loads only the selected proteoform's row, replacing
the ~40s monolithic load.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai Bot commented May 28, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 869a53c9-f711-4e50-8c58-4bc7bcd324e7

📥 Commits

Reviewing files that changed from the base of the PR and between 0faee0a and b0bda5a.

📒 Files selected for processing (3)
  • src/parse/tag_resolution.py
  • src/parse/tnt.py
  • src/render/update.py

📝 Walkthrough

Walkthrough

This PR replaces generic sequence_data storage with a PyArrow Parquet-backed store, adds tag-space→proteoform and proteoform→scan mappings, persists sequence_data at parse time, and refactors initialization and runtime filtering to load per-proteoform entries for flashtnt.

Changes

PyArrow Sequence Data Persistence and Scan Filtering

| Layer | File(s) | Summary |
|---|---|---|
| Sequence Data Schema and Read/Write Utilities | src/render/sequence_data_store.py | New module defines an explicit PyArrow schema for one-row-per-proteoform records with nested sequence, coverage, and fragment mass lists, plus modification structs. Provides normalization helpers for numpy scalars, table builders from in-memory mappings, dataset coercion utilities, and read functions for both single-entry filtering and full reconstruction. |
| Tag-space → Proteoform Mapping | src/parse/tag_resolution.py | Adds _split_ints and build_tagspace_to_proteoform_map(...) implementing a greedy, strictly-increasing assignment from tag-space ProteoformIndex values to protein-space proteoform indices, using intersection (with union fallback) across tag indices. |
| Proteoform-to-Scan Mapping | src/render/scan_resolution.py | Adds build_proteoform_scan_map(...) to construct a lookup mapping each proteoform index to scan ID and deconvolution row index by deduplicating/indexing the scan table and joining against protein data, omitting unmapped proteoforms. |
| Parse-Time Sequence Data Persistence | src/parse/tnt.py | Imports sequence-data utilities and switches sequence_data persistence from generic store to Parquet: builds an Arrow table from sequence_data and writes it via file_manager.parquet_sink(...) with configurable row-group sizing. Also derives tagspace_to_proteoform and groups tag ranges by mapped proteoform indices. |
| Initialization Scan-Scoped Loading and Map Attachment | src/render/initialize.py | Introduces _attach_proteoform_scan_map and _load_scan_scoped to fetch per-scan cached datasets and eagerly attach the proteoform→scan map for flashtnt. Refactors branches (deconv_spectrum, combined_spectrum, anno_spectrum, mass_table, sequence_view, tag_table) to use the new loader and stores the sequence_data dataset path in additional_data['sequence_data_ds']. |
| Runtime Per-Scan Filtering and Sequence Data Loading | src/render/update.py | Extends filter_data with flashtnt-specific branches that use the proteoform scan map to filter per_scan_data and tag_table by deconvolution index and scan, and loads per-proteoform sequence entries via load_entry(...) instead of slicing a pre-loaded dataset. |
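
The proteoform→scan lookup summarized in the table can be sketched roughly as follows. This is an illustrative reading of the description, not the code in `src/render/scan_resolution.py`: the function name, the `Scan` column on both tables, and the `deconv_index` key are assumptions, and scan values are assumed to be integers with no missing entries.

```python
import pandas as pd


def sketch_proteoform_scan_map(scan_table, protein_table):
    """Map proteoform index -> {'scan': scan id, 'deconv_index': row position}."""
    scans = scan_table.reset_index(drop=True)
    # Deduplicate scans, keeping the first row position for each scan id.
    first_rows = scans.drop_duplicates(subset="Scan", keep="first")
    scan_to_row = {int(scan): int(pos) for pos, scan in first_rows["Scan"].items()}

    mapping = {}
    for proteoform_index, scan in protein_table["Scan"].items():
        row = scan_to_row.get(int(scan))
        if row is None:
            continue  # omit proteoforms whose scan is absent from the scan table
        mapping[int(proteoform_index)] = {"scan": int(scan), "deconv_index": row}
    return mapping


scan_table = pd.DataFrame({"Scan": [10, 20, 20, 30]})
protein_table = pd.DataFrame({"Scan": [20, 30, 99]})
scan_map = sketch_proteoform_scan_map(scan_table, protein_table)
```

Building the map once at initialization means each selection only needs a dictionary lookup before filtering.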

Possibly related PRs

  • OpenMS/FLASHApp#77: Implements the same flashtnt scan-scoped workflow—Parquet persistence of sequence_data, proteoform scan mapping, scan-scoped initialization loading, and per-proteoform lazy loading during filtering.

"I’m a rabbit in code, nibbling bytes with care,
Parquet tables snug, saved in tidy rows,
Maps that hop between proteoform and scan,
On selection I fetch just the entry that shows,
Data delivered quick — a carrot for pros!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: optimizing flashtnt by loading only the selected proteoform's scan-scoped data instead of all data. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/render/update.py`:
- Around line 175-183: The Tag Table branch is applying a proteoform_scan_map
filter for every tool even though proteoform_scan_map is only set for flashtnt;
change the branch so it only applies the scan-based filtering when running
flashtnt (e.g., check additional_data.get('tool') == 'flashtnt' or that
'proteoform_scan_map' exists and is non-empty) before using proteoform_scan_map,
selection_store.get('proteinIndex') and modifying data['tag_table'] so other
tools do not clear the table.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dfdfa9c4-34db-40bc-85c9-ce22e4bd3284

📥 Commits

Reviewing files that changed from the base of the PR and between b18b6d7 and 0faee0a.

📒 Files selected for processing (5)
  • src/parse/tnt.py
  • src/render/initialize.py
  • src/render/scan_resolution.py
  • src/render/sequence_data_store.py
  • src/render/update.py

Comment thread src/render/update.py Outdated
Comment on lines +175 to +183
    elif component == 'Tag Table':
        # flashtnt-only panel: ship only the selected proteoform's scan's tags.
        scan_map = additional_data.get('proteoform_scan_map', {})
        entry = scan_map.get(selection_store.get('proteinIndex'))
        if entry is None:
            data['tag_table'] = data['tag_table'].iloc[0:0, :]
        else:
            tags = data['tag_table']
            data['tag_table'] = tags[tags['Scan'] == entry['scan']]

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restrict this tag-table filter to flashtnt.

This branch currently runs for every Tag Table render, but proteoform_scan_map is only populated in src/render/initialize.py when tool == 'flashtnt'. For other tools, entry is always None, so the table gets cleared on every update.

Suggested fix
-    elif component == 'Tag Table':
+    elif (component == 'Tag Table') and (tool == 'flashtnt'):
         # flashtnt-only panel: ship only the selected proteoform's scan's tags.
         scan_map = additional_data.get('proteoform_scan_map', {})
         entry = scan_map.get(selection_store.get('proteinIndex'))
         if entry is None:
             data['tag_table'] = data['tag_table'].iloc[0:0, :]
         else:
             tags = data['tag_table']
             data['tag_table'] = tags[tags['Scan'] == entry['scan']]
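
The effect of the guard can be seen in a small standalone sketch. The function `filter_tag_table` and the tool name `other_tool` are hypothetical stand-ins; only the `Scan` column, the `scan` key, and the `iloc[0:0, :]` idiom come from the snippet under review.

```python
import pandas as pd


def filter_tag_table(tags, tool, scan_map, protein_index):
    """Scope tags to the selected proteoform's scan, for flashtnt only."""
    if tool != "flashtnt":
        return tags  # other tools have no scan map, so leave the table untouched
    entry = scan_map.get(protein_index)
    if entry is None:
        return tags.iloc[0:0, :]  # unmapped selection: empty table, same columns
    return tags[tags["Scan"] == entry["scan"]]


tags = pd.DataFrame({"Scan": [1, 1, 2], "Tag": ["AB", "CD", "EF"]})

untouched = filter_tag_table(tags, "other_tool", {}, 0)
empty = filter_tag_table(tags, "flashtnt", {}, 0)
scoped = filter_tag_table(tags, "flashtnt", {0: {"scan": 1}}, 0)
```

Without the tool check, the first call would take the `entry is None` path and return an empty table, which is the behavior the review comment flags.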

The Tag Table and on-spectrum tag overlay came up empty on large datasets.
Tags are scan (spectrum) data, so scope the feed to the selected proteoform's
scan and stamp ProteinIndex so the frontend's
tag.ProteinIndex===selectedProteinIndex filter passes the scan's tags through
to the table and the overlay.

Also correct per-proteoform coverage: tag_dfs.ProteinIndex is FLASHTagger's
tag-space index, which diverges from protein_dfs.index on large runs, so the
coverage loop associated the wrong tags. Map tag-space -> protein-space via
protein.tsv TagIndices and group coverage by protein-space. The stored
tag_dfs is unchanged, so the golden regression is unaffected.
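
One plausible reading of the tag-space → protein-space mapping is sketched below. It is an assumption-laden illustration, not the PR's `build_tagspace_to_proteoform_map`: the input shapes, the `;` separator, and the exact greedy rule (take the smallest candidate above the last assigned tag-space index) are guesses based on the summary "greedy, strictly-increasing assignment using intersection with union fallback".

```python
def split_ints(text, sep=";"):
    """Parse a delimited string such as '3;4' into a list of ints."""
    return [int(part) for part in str(text).split(sep) if part.strip()]


def sketch_tagspace_map(protein_tag_indices, tag_protein_indices):
    """Return {tag-space index: protein-space index}.

    protein_tag_indices: list where position = protein-space index and
        value = tag row ids listed for that proteoform (hypothetical shape).
    tag_protein_indices: dict of tag row id -> tag-space indices on that tag.
    """
    mapping = {}
    last = -1
    for protein_index, tag_ids in enumerate(protein_tag_indices):
        sets = [set(tag_protein_indices[t]) for t in tag_ids if t in tag_protein_indices]
        if not sets:
            continue
        # Prefer indices shared by every tag; fall back to any index seen.
        candidates = set.intersection(*sets) or set.union(*sets)
        larger = sorted(c for c in candidates if c > last)
        if not larger:
            continue
        last = larger[0]  # strictly increasing assignment
        mapping[last] = protein_index
    return mapping


protein_tag_indices = [split_ints("0;1"), split_ints("2"), split_ints("3;4")]
tag_protein_indices = {0: {0, 1}, 1: {0}, 2: {2}, 3: {3}, 4: {5}}
tagspace_to_protein = sketch_tagspace_map(protein_tag_indices, tag_protein_indices)
```

In this toy input, tag-space index 2 maps to protein-space index 1, which is the kind of divergence the commit message describes for large runs.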

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@t0mdavid-m t0mdavid-m merged commit b4b9d48 into develop May 28, 2026
4 of 5 checks passed