feat: align QPX parquet output with quantms template schema #8974
timosachsenberg merged 9 commits into develop from
Conversation
Note: Reviews paused. This branch is under active development; to avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough
Replaces the legacy ConsensusFeature export with QPX-compliant Arrow/Parquet schemas: introduces
Changes
Sequence Diagram(s)
sequenceDiagram
participant Exporter as Exporter (ConsensusMap / QPXFile / PGExport)
participant SchemaReg as ArrowSchemaRegistry
participant Arrow as Arrow Builders
participant Parquet as Parquet Writer
Exporter->>SchemaReg: request schema (QPXFeature/QPXPSM)
SchemaReg-->>Exporter: return schema definition
Exporter->>Arrow: create builders per schema
Arrow-->>Exporter: builders ready
Exporter->>Arrow: append rows (nested structs, lists, scalars)
Arrow-->>Exporter: row builders filled
Exporter->>Arrow: finish arrays
Arrow-->>Exporter: arrays finalized
Exporter->>Parquet: validate table against schema
Parquet-->>Exporter: validation OK
Exporter->>Parquet: write quantms.*.parquet
Parquet-->>Exporter: write complete / file paths
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🧹 Nitpick comments (2)
src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp (1)
1030-1033: Assert the `scan` element type too. This currently only proves that `scan` is a list. A regression to `list<int64>` would still pass even though this schema is supposed to pin `scan` to `list<int32>`.
Suggested assertion

  TEST_EQUAL(s->field(13)->type()->id(), arrow::Type::LIST)
  TEST_EQUAL(s->field(13)->nullable(), false)
+ const auto scan_type = std::static_pointer_cast<arrow::ListType>(s->field(13)->type());
+ TEST_EQUAL(scan_type->value_type()->id(), arrow::Type::INT32)

As per coding guidelines, `src/tests/class_tests/**/*_test.cpp`: "Write unit tests for new functionality and ensure existing tests pass before suggesting changes."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp` around lines 1030 - 1033, Add an assertion that the "scan" list's element type is int32: locate the existing checks on s->field(13) (name "scan", type LIST) and add an assertion that the list's element type id is arrow::Type::INT32 by retrieving the list's value type (e.g. static_cast/ptr cast to arrow::ListType and call value_type()->id() or value_field()->type()->id()) and comparing it to arrow::Type::INT32 so a regression to list<int64> will fail.
src/openms/source/FORMAT/ArrowSchemaRegistry.cpp (1)
403-437: Factor the shared QPX nested types.
`modificationsType()`, `additionalScoresType()`, and `cvParamsType()` are duplicated verbatim between `QPXPSMSchema` and `QPXFeatureSchema`. Since both classes are supposed to stay template-identical, extracting these into shared internal helpers will make the next schema update much harder to apply inconsistently.
Also applies to: 485-519
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openms/source/FORMAT/ArrowSchemaRegistry.cpp` around lines 403 - 437, Extract the duplicated Arrow nested type builders used in QPXPSMSchema::modificationsType, QPXPSMSchema::additionalScoresType, QPXPSMSchema::cvParamsType (and the identical methods in QPXFeatureSchema) into shared internal helper functions (e.g., makeScoresType(), makePositionStruct(), makeModificationsListType(), makeCvParamsListType) placed in an unnamed namespace or as static helpers in this translation unit; replace the bodies of QPXPSMSchema::modificationsType, ::additionalScoresType and ::cvParamsType (and the corresponding QPXFeatureSchema methods) to call those helpers so both classes reuse the same arrow::struct_/arrow::list constructions and avoid duplication during future schema updates.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/openms/source/FORMAT/ConsensusMapArrowExport.cpp`:
- Around line 607-669: The code currently deduplicates peptide evidences by
accession (seen_accs) which loses repeated mappings; instead, only use seen_accs
when populating the unique protein_accs vector (for anchor_protein/unique), but
collect and push an EvidenceInfo for every entry from
best_hit->getPeptideEvidences() into evidences. Concretely: in the loop over
best_hit->getPeptideEvidences(), always create and push an EvidenceInfo (ei) for
each ev, and only call seen_accs.insert(acc) / push_back into protein_accs when
the accession is first encountered; leave the export loop that appends to
pg_accessions_builder / pga_struct_b / pga_acc_b / pga_start_b / pga_end_b /
pga_pre_b / pga_post_b unchanged so every PeptideEvidence is emitted (also apply
the same change in the analogous block around lines 708-709).
- Around line 537-554: The code currently reads the spectrum reference from the
generic meta value "spectrum_reference" which misses the dedicated
PeptideIdentification member; update the block that builds the scan (the logic
around pep_ids, spec_ref, scan_builder and scan_val_b) to call
PeptideIdentification::getSpectrumReference() first (on pep_ids[0]) and assign
that to spec_ref, and only if that returns empty fall back to checking
metaValueExists("spectrum_reference") / getMetaValue("spectrum_reference"); then
pass spec_ref to extractScan and append scan_num to scan_val_b as before.
In `@src/openms/source/FORMAT/QPXFile.cpp`:
- Around line 199-203: The export currently uses
ProteinIdentification::getPrimaryMSRunPath() and drops the hit metavalue
"reference_file_name", causing every PSM to be stamped with the first
identifier-level file; change the logic in QPXFile.cpp to prefer a per-hit or
per-PeptideIdentification file reference (check hit meta
"run_file_name"/"reference_file_name" or PeptideIdentification::getMetaValue())
when available and only fall back to
ProteinIdentification::getPrimaryMSRunPath() if no per-PSM/PeptideIdentification
file info exists or if the identifier-level path is unambiguous; update the code
paths around the excluded_hit_mvs handling and where run_file_name is resolved
(the blocks that reference excluded_hit_mvs and build PSM output) to read
per-hit/PeptideIdentification meta first and preserve reference_file_name if
used at PSM granularity.
---
Nitpick comments:
In `@src/openms/source/FORMAT/ArrowSchemaRegistry.cpp`:
- Around line 403-437: Extract the duplicated Arrow nested type builders used in
QPXPSMSchema::modificationsType, QPXPSMSchema::additionalScoresType,
QPXPSMSchema::cvParamsType (and the identical methods in QPXFeatureSchema) into
shared internal helper functions (e.g., makeScoresType(), makePositionStruct(),
makeModificationsListType(), makeCvParamsListType) placed in an unnamed
namespace or as static helpers in this translation unit; replace the bodies of
QPXPSMSchema::modificationsType, ::additionalScoresType and ::cvParamsType (and
the corresponding QPXFeatureSchema methods) to call those helpers so both
classes reuse the same arrow::struct_/arrow::list constructions and avoid
duplication during future schema updates.
In `@src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp`:
- Around line 1030-1033: Add an assertion that the "scan" list's element type is
int32: locate the existing checks on s->field(13) (name "scan", type LIST) and
add an assertion that the list's element type id is arrow::Type::INT32 by
retrieving the list's value type (e.g. static_cast/ptr cast to arrow::ListType
and call value_type()->id() or value_field()->type()->id()) and comparing it to
arrow::Type::INT32 so a regression to list<int64> will fail.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6afe0f20-15eb-4ebe-8460-c8928ec91e9e
📒 Files selected for processing (9)
src/openms/include/OpenMS/FORMAT/ArrowSchemaRegistry.h
src/openms/source/FORMAT/ArrowSchemaRegistry.cpp
src/openms/source/FORMAT/ConsensusMapArrowExport.cpp
src/openms/source/FORMAT/ProteinGroupArrowExport.cpp
src/openms/source/FORMAT/QPXFile.cpp
src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp
src/tests/class_tests/openms/source/QPXFile_test.cpp
src/topp/IsobaricWorkflow.cpp
src/topp/ProteomicsLFQ.cpp
♻️ Duplicate comments (1)
src/openms/source/FORMAT/QPXFile.cpp (1)
187-195: ⚠️ Potential issue | 🟠 Major
Resolve `run_file_name` and `scan` at PSM granularity. Line 199 strips `reference_file_name`/`scan` out of the metadata path, but Lines 431-454 only rebuild those columns from the identifier-level MS run list and parsed `spectrum_reference`. In merged IDs or imports that already carry per-hit / per-`PeptideIdentification` values, this still exports the wrong raw file (first MS run) or an empty `scan` list even though the PSM already has the source metadata.
Possible fix
- std::map<String, String> id_to_filename;
+ std::map<String, StringList> id_to_filenames;
  for (const auto& prot_id : protein_identifications)
  {
    StringList ms_runs;
    prot_id.getPrimaryMSRunPath(ms_runs);
    if (!ms_runs.empty())
    {
-     id_to_filename[prot_id.getIdentifier()] = ms_runs[0];
+     id_to_filenames[prot_id.getIdentifier()] = ms_runs;
    }
  }
  static const std::unordered_set<std::string> excluded_hit_mvs = {
    "target_decoy", "predicted_RT", "predicted_rt", "ion_mobility", "IM",
-   "scan", "reference_file_name"
+   "scan", "reference_file_name", "run_file_name"
  };
...
- auto fn_it = id_to_filename.find(pep_id.getIdentifier());
+ auto fn_it = id_to_filenames.find(pep_id.getIdentifier());
...
- if (fn_it != id_to_filename.end())
+ if (hit.metaValueExists("run_file_name"))
+ {
+   (void)run_file_name_builder.Append(hit.getMetaValue("run_file_name").toString());
+ }
+ else if (hit.metaValueExists("reference_file_name"))
+ {
+   (void)run_file_name_builder.Append(hit.getMetaValue("reference_file_name").toString());
+ }
+ else if (pep_id.metaValueExists("run_file_name"))
+ {
+   (void)run_file_name_builder.Append(pep_id.getMetaValue("run_file_name").toString());
+ }
+ else if (pep_id.metaValueExists("reference_file_name"))
+ {
+   (void)run_file_name_builder.Append(pep_id.getMetaValue("reference_file_name").toString());
+ }
+ else if (fn_it != id_to_filenames.end() && fn_it->second.size() == 1)
  {
-   (void)run_file_name_builder.Append(fn_it->second);
+   (void)run_file_name_builder.Append(fn_it->second[0]);
  }
  else
  {
    (void)run_file_name_builder.Append("");
  }
...
- if (!spec_ref.empty())
+ if (hit.metaValueExists("scan"))
+ {
+   (void)scan_value_b->Append(static_cast<int32_t>(static_cast<int>(hit.getMetaValue("scan"))));
+ }
+ else if (pep_id.metaValueExists("scan"))
+ {
+   (void)scan_value_b->Append(static_cast<int32_t>(static_cast<int>(pep_id.getMetaValue("scan"))));
+ }
+ else if (!spec_ref.empty())
  {
    Int scan_num = extractScan(spec_ref);

Also applies to: 199-203, 228-229, 431-454
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openms/source/FORMAT/QPXFile.cpp` around lines 187 - 195, The export currently uses id_to_filename (built from protein_identifications via getPrimaryMSRunPath) and parsed spectrum_reference to populate reference_file_name/run_file_name and scan, which ignores per-PSM or per-PeptideIdentification metadata; update the logic in QPXFile.cpp (places around id_to_filename population and the column rebuild at the section handling spectrum_reference / scan around the existing spectrum parsing) to first check and use per-hit or per-PeptideIdentification fields (e.g., inspect PeptideHit metadata and PeptideIdentification members for run_file_name, reference_file_name, and scan) and only fall back to id_to_filename[prot_id.getIdentifier()] or parsing spectrum_reference when those per-PSM values are absent, ensuring scan numbers and source file names are resolved at PSM granularity.
🧹 Nitpick comments (2)
src/openms/source/FORMAT/QPXFile.cpp (1)
164-185: Use `'\n'` in the new error logs. The added reserve/finish error paths still use `std::endl`, which forces an unnecessary flush on each branch. Please switch the new messages to `'\n'` to stay consistent with the repo logging guidance. As per coding guidelines: "Use OpenMS logging macros and `OpenMS::LogStream`; avoid `std::cout`/`err` directly; avoid `std::endl` for performance (use `\n`)".
Also applies to: 526-555, 563-565
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openms/source/FORMAT/QPXFile.cpp` around lines 164 - 185, Replace the use of std::endl in the new Reserve/Finish error logging branches with '\n' to avoid unnecessary flushes; specifically update the error messages that reference charge_builder, pep_builder, is_decoy_builder, calculated_mz_builder, observed_mz_builder, mass_error_ppm_builder, predicted_rt_builder, run_file_name_builder, rt_builder, ion_mobility_builder, and missed_cleavages_builder in QPXFile.cpp (and the similar occurrences around the ranges noted: ~526-555 and 563-565) so each OPENMS_LOG_ERROR << ... ends with '\n' instead of std::endl. Ensure no other behavior changes are made and preserve the existing message text and ToString() usage.src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp (1)
1023-1067: Assert the child type of the new primitive-list fields. These sections only verify that `scan`, `protein_accessions`, `mz_array`, `intensity_array`, `charge_array`, `ion_type_array`, `ion_mobility_array`, `gg_accessions`, and `gg_names` are `LIST`. A regression like `list<int32>` → `list<int64>` or `list<float32>` → `list<float64>` would still pass, which weakens coverage for the schema-alignment change.
Example assertions

  TEST_EQUAL(s->field(13)->type()->id(), arrow::Type::LIST)
+ TEST_EQUAL(s->field(13)->type()->Equals(arrow::list(arrow::int32())), true)
...
  TEST_EQUAL(s->field(19)->type()->id(), arrow::Type::LIST)
+ TEST_EQUAL(s->field(19)->type()->Equals(arrow::list(arrow::float32())), true)

As per coding guidelines, "Write unit tests for new functionality and ensure existing tests pass before suggesting changes"
Also applies to: 1191-1194, 1247-1254
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp` around lines 1023 - 1067, The tests only assert that many columns (e.g. "scan", "protein_accessions", "mz_array", "intensity_array", "charge_array", "ion_type_array", "ion_mobility_array", also "gg_accessions" and "gg_names") are LISTs but do not assert the child element type; update the assertions to check the list child type for each field by inspecting s->field(N)->type()->field(0)->type()->id() (or equivalent Arrow API) and compare to the correct arrow::Type (e.g., INT32 for scan/charge_array, FLOAT for mz_array/intensity_array/ion_mobility_array, STRING/UTF8 for protein_accessions/ion_type_array/gg_names, etc.); add these child-type checks adjacent to the existing LIST assertions and apply the same fixes at the other indicated blocks (around the other ranges cited).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/openms/source/FORMAT/QPXFile.cpp`:
- Around line 187-195: The export currently uses id_to_filename (built from
protein_identifications via getPrimaryMSRunPath) and parsed spectrum_reference
to populate reference_file_name/run_file_name and scan, which ignores per-PSM or
per-PeptideIdentification metadata; update the logic in QPXFile.cpp (places
around id_to_filename population and the column rebuild at the section handling
spectrum_reference / scan around the existing spectrum parsing) to first check
and use per-hit or per-PeptideIdentification fields (e.g., inspect PeptideHit
metadata and PeptideIdentification members for run_file_name,
reference_file_name, and scan) and only fall back to
id_to_filename[prot_id.getIdentifier()] or parsing spectrum_reference when those
per-PSM values are absent, ensuring scan numbers and source file names are
resolved at PSM granularity.
---
Nitpick comments:
In `@src/openms/source/FORMAT/QPXFile.cpp`:
- Around line 164-185: Replace the use of std::endl in the new Reserve/Finish
error logging branches with '\n' to avoid unnecessary flushes; specifically
update the error messages that reference charge_builder, pep_builder,
is_decoy_builder, calculated_mz_builder, observed_mz_builder,
mass_error_ppm_builder, predicted_rt_builder, run_file_name_builder, rt_builder,
ion_mobility_builder, and missed_cleavages_builder in QPXFile.cpp (and the
similar occurrences around the ranges noted: ~526-555 and 563-565) so each
OPENMS_LOG_ERROR << ... ends with '\n' instead of std::endl. Ensure no other
behavior changes are made and preserve the existing message text and ToString()
usage.
In `@src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp`:
- Around line 1023-1067: The tests only assert that many columns (e.g. "scan",
"protein_accessions", "mz_array", "intensity_array", "charge_array",
"ion_type_array", "ion_mobility_array", also "gg_accessions" and "gg_names") are
LISTs but do not assert the child element type; update the assertions to check
the list child type for each field by inspecting
s->field(N)->type()->field(0)->type()->id() (or equivalent Arrow API) and
compare to the correct arrow::Type (e.g., INT32 for scan/charge_array, FLOAT for
mz_array/intensity_array/ion_mobility_array, STRING/UTF8 for
protein_accessions/ion_type_array/gg_names, etc.); add these child-type checks
adjacent to the existing LIST assertions and apply the same fixes at the other
indicated blocks (around the other ranges cited).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c4dcc616-c842-4d4f-87e9-b4107cedbd53
📒 Files selected for processing (3)
src/openms/source/FORMAT/QPXFile.cpp
src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp
src/tests/topp/CMakeLists.txt
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/openms/source/FORMAT/ConsensusMapArrowExport.cpp (1)
434-456: ⚠️ Potential issue | 🔴 Critical
Guard `findScoreType(...PEP)` here. Unlike `QPXFile::exportToArrow`, Line 437 is unguarded. `IDScoreSwitcherAlgorithm::findScoreType()` throws when no PEP-like score exists, which turns a missing optional score into a full export failure.
Possible fix
  if (has_id)
  {
-   auto result = idsa.findScoreType(pep_ids[0], IDScoreSwitcherAlgorithm::ScoreType::PEP);
+   IDScoreSwitcherAlgorithm::ScoreSearchResult result{};
+   try
+   {
+     result = idsa.findScoreType(pep_ids[0], IDScoreSwitcherAlgorithm::ScoreType::PEP);
+   }
+   catch (...)
+   {
+   }
    if (!result.score_name.empty() && best_hit)
    {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openms/source/FORMAT/ConsensusMapArrowExport.cpp` around lines 434 - 456, The call to idsa.findScoreType(pep_ids[0], IDScoreSwitcherAlgorithm::ScoreType::PEP) can throw when no PEP-like score exists; wrap or guard that call in a presence check or try/catch so a missing PEP does not abort export. Specifically, before calling idsa.findScoreType use the same existence check pattern used in QPXFile::exportToArrow (or catch the exception from idsa.findScoreType), and then only access result.score_name/is_main_score_type and use pep_builder.Append, pep_builder.AppendNull, best_hit and pep_ids accordingly when a valid result is returned; otherwise call pep_builder.AppendNull to preserve export.
♻️ Duplicate comments (3)
src/openms/source/FORMAT/ConsensusMapArrowExport.cpp (2)
612-618: ⚠️ Potential issue | 🟠 Major
Emit `pg_positions` from the evidences you already collected. Lines 621-636 preserve accession/start/end for every peptide mapping, but Line 717 still writes `pg_positions = null` for every row. That drops the new positional column entirely and loses repeated mappings that `pg_accessions` now preserves.
Possible fix
- // === pg_positions (empty for now — no positional data available at consensus level) ===
- (void)pg_positions_builder.AppendNull();
+ (void)pg_positions_builder.Append();
+ for (const auto& ei : evidences)
+ {
+   if (ei.start == PeptideEvidence::UNKNOWN_POSITION || ei.end == PeptideEvidence::UNKNOWN_POSITION)
+   {
+     continue;
+   }
+   (void)pgp_struct_b->Append();
+   (void)pgp_acc_b->Append(ei.acc);
+   (void)pgp_start_b->Append(ei.start);
+   (void)pgp_end_b->Append(ei.end);
+ }

Also applies to: 640-677, 716-717
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openms/source/FORMAT/ConsensusMapArrowExport.cpp` around lines 612 - 618, The code collects per-peptide mapping info into protein_accs and evidences (struct EvidenceInfo) but still emits pg_positions as null; update the export logic that writes pg_positions to build and write a positions string from evidences (one entry per protein_accs item, preserving repeats and ordering) using each EvidenceInfo's start/end (and optionally pre/post if expected), matching the pg_accessions output format (i.e., same delimiter and ordering as protein_accs) so repeated mappings are represented; locate the writer that currently sets pg_positions = null and replace it with code that iterates evidences to produce the positions field.
609-610: ⚠️ Potential issue | 🟠 Major
Populate `id_run_file_name` instead of hard-coding nulls. Line 610 makes this new column permanently null, even when the best hit or `PeptideIdentification` already carries `reference_file_name`/`run_file_name`. That drops identification-run provenance for multi-run consensus exports.
Possible fix
- // === id_run_file_name ===
- (void)id_run_file_name_builder.AppendNull();
+ // === id_run_file_name ===
+ String id_run_file_name;
+ if (best_hit && best_hit->metaValueExists("reference_file_name"))
+ {
+   id_run_file_name = best_hit->getMetaValue("reference_file_name").toString();
+ }
+ else if (best_hit && best_hit->metaValueExists("run_file_name"))
+ {
+   id_run_file_name = best_hit->getMetaValue("run_file_name").toString();
+ }
+ else if (!pep_ids.empty() && pep_ids[0].metaValueExists("reference_file_name"))
+ {
+   id_run_file_name = pep_ids[0].getMetaValue("reference_file_name").toString();
+ }
+ else if (!pep_ids.empty() && pep_ids[0].metaValueExists("run_file_name"))
+ {
+   id_run_file_name = pep_ids[0].getMetaValue("run_file_name").toString();
+ }
+ if (id_run_file_name.empty())
+ {
+   (void)id_run_file_name_builder.AppendNull();
+ }
+ else
+ {
+   (void)id_run_file_name_builder.Append(id_run_file_name);
+ }
Verify each finding against the current code and only fix it if needed. In `@src/openms/source/FORMAT/ConsensusMapArrowExport.cpp` around lines 609 - 610, The id_run_file_name column is always set to null by calling id_run_file_name_builder.AppendNull(); instead populate it from the identification provenance: extract run file name from the PeptideIdentification (meta keys "reference_file_name" or "run_file_name") or from the best PeptideHit if present, then call id_run_file_name_builder.Append(<that string>) instead of AppendNull when a value exists; keep AppendNull only as the fallback. Update the logic near ConsensusMapArrowExport.cpp where id_run_file_name_builder is used so the builder appends the actual run name taken from the PeptideIdentification/PeptideHit metadata (use the existing best-hit retrieval code paths to find the value).src/openms/source/FORMAT/QPXFile.cpp (1)
431-451: ⚠️ Potential issue | 🟠 Major
Only use the identifier-level run fallback when it is unambiguous. The new fallback at Lines 447-450 still goes through `id_to_filename`, which currently keeps only `ms_runs[0]`. When one `ProteinIdentification` covers multiple primary MS runs, unresolved PSMs get stamped with the first file, so `run_file_name` becomes wrong for merged search results.
Possible fix
- std::map<String, String> id_to_filename;
+ std::map<String, StringList> id_to_filenames;
  for (const auto& prot_id : protein_identifications)
  {
    StringList ms_runs;
    prot_id.getPrimaryMSRunPath(ms_runs);
    if (!ms_runs.empty())
    {
-     id_to_filename[prot_id.getIdentifier()] = ms_runs[0];
+     id_to_filenames[prot_id.getIdentifier()] = ms_runs;
    }
  }
...
- auto fn_it = id_to_filename.find(pep_id.getIdentifier());
+ auto fn_it = id_to_filenames.find(pep_id.getIdentifier());
...
- if (run_file.empty() && fn_it != id_to_filename.end())
+ if (run_file.empty() && fn_it != id_to_filenames.end() && fn_it->second.size() == 1)
  {
-   run_file = fn_it->second;
+   run_file = fn_it->second[0];
  }
Verify each finding against the current code and only fix it if needed. In `@src/openms/source/FORMAT/QPXFile.cpp` around lines 431 - 451, The fallback to identifier-level filenames can assign the wrong run when an identifier covers multiple MS runs; modify the block that uses id_to_filename/fn_it so you only append fn_it->second when that mapping is unambiguous: e.g., check that fn_it != id_to_filename.end() AND that the identifier maps to a single filename (or id_to_filename contains exactly one entry for this identifier / a precomputed unambiguous set) before using run_file = fn_it->second; update the logic around run_file, fn_it, id_to_filename and run_file_name_builder in the same scope as pep_id and hit to avoid defaulting to the first ms_runs entry when multiple runs exist.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/openms/source/FORMAT/ConsensusMapArrowExport.cpp`:
- Around line 434-456: The call to idsa.findScoreType(pep_ids[0],
IDScoreSwitcherAlgorithm::ScoreType::PEP) can throw when no PEP-like score
exists; wrap or guard that call in a presence check or try/catch so a missing
PEP does not abort export. Specifically, before calling idsa.findScoreType use
the same existence check pattern used in QPXFile::exportToArrow (or catch the
exception from idsa.findScoreType), and then only access
result.score_name/is_main_score_type and use pep_builder.Append,
pep_builder.AppendNull, best_hit and pep_ids accordingly when a valid result is
returned; otherwise call pep_builder.AppendNull to preserve export.
---
Duplicate comments:
In `@src/openms/source/FORMAT/ConsensusMapArrowExport.cpp`:
- Around line 612-618: The code collects per-peptide mapping info into
protein_accs and evidences (struct EvidenceInfo) but still emits pg_positions as
null; update the export logic that writes pg_positions to build and write a
positions string from evidences (one entry per protein_accs item, preserving
repeats and ordering) using each EvidenceInfo's start/end (and optionally
pre/post if expected), matching the pg_accessions output format (i.e., same
delimiter and ordering as protein_accs) so repeated mappings are represented;
locate the writer that currently sets pg_positions = null and replace it with
code that iterates evidences to produce the positions field.
- Around line 609-610: The id_run_file_name column is always set to null by
calling id_run_file_name_builder.AppendNull(); instead populate it from the
identification provenance: extract run file name from the PeptideIdentification
(meta keys "reference_file_name" or "run_file_name") or from the best PeptideHit
if present, then call id_run_file_name_builder.Append(<that string>) instead of
AppendNull when a value exists; keep AppendNull only as the fallback. Update the
logic near ConsensusMapArrowExport.cpp where id_run_file_name_builder is used so
the builder appends the actual run name taken from the
PeptideIdentification/PeptideHit metadata (use the existing best-hit retrieval
code paths to find the value).
In `@src/openms/source/FORMAT/QPXFile.cpp`:
- Around line 431-451: The fallback to identifier-level filenames can assign the
wrong run when an identifier covers multiple MS runs; modify the block that uses
id_to_filename/fn_it so you only append fn_it->second when that mapping is
unambiguous: e.g., check that fn_it != id_to_filename.end() AND that the
identifier maps to a single filename (or id_to_filename contains exactly one
entry for this identifier / a precomputed unambiguous set) before using run_file
= fn_it->second; update the logic around run_file, fn_it, id_to_filename and
run_file_name_builder in the same scope as pep_id and hit to avoid defaulting to
the first ms_runs entry when multiple runs exist.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d36fbe18-3460-40fe-b266-6606c0522728
📒 Files selected for processing (4)
src/openms/source/FORMAT/ArrowSchemaRegistry.cpp
src/openms/source/FORMAT/ConsensusMapArrowExport.cpp
src/openms/source/FORMAT/QPXFile.cpp
src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/topp/ProteinQuantifier.cpp (1)
996-1009: Extract the shared protein-annotation setup into one helper. This block duplicates the mzTab export preparation above. Pulling the annotate/insert/swap sequence into a shared helper will keep the QPX and mzTab paths from drifting the next time `inference_in_cxml` handling changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/topp/ProteinQuantifier.cpp` around lines 996 - 1009, Extract the duplicated annotate/insert/swap sequence into a single helper (e.g., annotateAndAttachProteins) and call it from both places: move the logic that obtains protein_quants via quantifier.getProteinResults(), calls quantifier.annotateQuantificationsToProteins(protein_quants, proteins_), and then either inserts proteins_ into consensus.getProteinIdentifications() or swaps with consensus.getProteinIdentifications()[0] depending on inference_in_cxml into that helper; the helper should accept the quantifier (or protein_quants), proteins_ (by reference), consensus (or a reference to consensus.getProteinIdentifications()), and the inference_in_cxml flag so callers simply invoke annotateAndAttachProteins(...) instead of duplicating the block.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5f223088-85b1-4cc3-a280-8c0fb4c4b9c0
📒 Files selected for processing (1)
src/topp/ProteinQuantifier.cpp
```cpp
if (!ConsensusMapArrowExport::exportToParquet(consensus, out_qpx + "/quantms.feature.parquet"))
{
  OPENMS_LOG_ERROR << "Failed to write features Parquet file" << std::endl;
}
```
Fail fast when a requested Parquet export cannot be written.
Lines 1013, 1031, and 1037 only log and continue, so the tool can still return EXECUTION_OK with a partial quantms.*.parquet set. Please throw on the first failed export instead of continuing. As per coding guidelines, exceptions must derive from Exception::Base; throw with file/line/OPENMS_PRETTY_FUNCTION.
Also applies to: 1031-1034, 1037-1040
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/topp/ProteinQuantifier.cpp` around lines 1013 - 1016, Replace the silent
error logs after each failed Parquet export with a thrown OpenMS exception
derived from Exception::Base: when ConsensusMapArrowExport::exportToParquet(...)
returns false (and the analogous checks at the other two locations), throw an
appropriate Exception (e.g., Exception::UnableToCreateFile or other
Exception::Base-derived) constructed with a clear message, __FILE__, __LINE__,
and OPENMS_PRETTY_FUNCTION; remove the OPENMS_LOG_ERROR-only path so the program
fails fast on the first failed export.
```diff
 if (out.empty() && peptide_out.empty() && out_qpx.empty())
 {
   throw Exception::RequiredParameterNotGiven(__FILE__, __LINE__,
                                              OPENMS_PRETTY_FUNCTION,
-                                             "out/peptide_out");
+                                             "out/peptide_out/out_qpx");
```
Don't advertise out_qpx in builds that don't compile Parquet support.
In non-Parquet builds, Line 1066 still tells users to provide out_qpx, but that option is not registered there. Please keep the exception text aligned with the compiled feature set.
Suggested fix

```diff
+#ifdef WITH_PARQUET
+  const char* required_outputs = "out/peptide_out/out_qpx";
+#else
+  const char* required_outputs = "out/peptide_out";
+#endif
   if (out.empty() && peptide_out.empty() && out_qpx.empty())
   {
     throw Exception::RequiredParameterNotGiven(__FILE__, __LINE__,
                                                OPENMS_PRETTY_FUNCTION,
-                                               "out/peptide_out/out_qpx");
+                                               required_outputs);
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/topp/ProteinQuantifier.cpp` around lines 1062 - 1066, The thrown
Exception::RequiredParameterNotGiven in the conditional that checks
out/peptide_out/out_qpx currently mentions "out_qpx" even in builds without
Parquet support; change the exception message so "out_qpx" is only included when
the same Parquet compile-time macro used to register the out_qpx option is
defined (i.e., mirror the macro used elsewhere where out_qpx is conditionally
registered), by building the message string conditionally (via `#ifdef` the
Parquet macro or by checking the same compile-time flag) before calling
Exception::RequiredParameterNotGiven and otherwise only mentioning
"out/peptide_out".
🧹 Nitpick comments (1)
src/openms/include/OpenMS/FORMAT/ArrowSchemaRegistry.h (1)
204-208: Add `@ingroup` to the new public schema docs. Both new exported structs have `@brief`, but they still need the module `@ingroup` tag so they get attached to the generated FORMAT/API docs. As per coding guidelines, `src/openms/include/**/*.h`: "Doxygen documentation with `@brief` required (not auto-generated from first line); use `@defgroup`/`@ingroup`."

Also applies to: 247-251
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openms/include/OpenMS/FORMAT/ArrowSchemaRegistry.h` around lines 204 - 208, The Doxygen for the new exported struct QPXPSMSchema (and the nearby second exported struct defined just below it) is missing an `@ingroup` tag so they don't appear in the generated FORMAT/API docs; update the Doxygen block for QPXPSMSchema and the adjacent exported struct to include the appropriate module group (e.g., add an "@ingroup FORMAT" line) so both structs are attached to the correct doxygen group per the project's documentation guidelines.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8092531c-e3f4-4553-8aa3-c952db116935
📒 Files selected for processing (1)
src/openms/include/OpenMS/FORMAT/ArrowSchemaRegistry.h
Update the QPX feature, PSM, and protein group parquet exports to match the quantms parquet exchange format specification. This ensures interoperability with downstream quantms tools and viewers.

Schema changes (all 3 files):
- Rename ConsensusFeatureExportSchema -> QPXFeatureSchema
- Add new QPXPSMSchema (keep PSMSchema for import path)
- Column renames: precursor_charge->charge, reference_file_name->run_file_name, start/stop_ion_mobility->ion_mobility_start/stop
- Type fixes: is_decoy/unique int->bool, charge int32->int16, scan scalar->list<int32>, cv_params string->list<struct>, modifications position string->int32 with amino_acid+scores, intensities field names, pg_accessions list<string>->list<struct>
- New columns: mass_error_ppm, missed_cleavages, pg_positions, id_run_file_name, cross_links, peak arrays (mz/intensity/charge/ion_type/ion_mobility arrays)
- Removed columns: quality, score, score_type, spectrum_reference, feature_metavalues, scan_reference_file_name, rank, peptide_identification_index, psm/spectrum_metavalues, run_identifier, higher_score_better
- Correct nullability on all fields and struct children
- Rename output files: features.parquet->quantms.feature.parquet, psms.parquet->quantms.psm.parquet, protein_groups.parquet->quantms.pg.parquet

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix CMakeLists.txt TOPP test references to use new filenames (quantms.feature/psm/pg.parquet instead of old names)
- Add QPXPSMSchema unit tests to ArrowSchemaRegistry_test.cpp (field count, column names, types, nullability, validation)
- Use actual finalized array length for MakeArrayOfNull in QPXFile.cpp instead of fragile pre-loop estimate

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Emit all PeptideEvidences in pg_accessions (not just first per accession), preserving repeated accessions at different positions
- Use PeptideIdentification::getSpectrumReference() before falling back to metavalue "spectrum_reference" for scan extraction
- Prefer per-hit/per-PeptideIdentification file reference for run_file_name in QPXFile, fall back to identifier-level path
- Deduplicate shared Arrow type builders: QPXFeatureSchema delegates modificationsType/additionalScoresType/cvParamsType to QPXPSMSchema
- Add scan list element type assertion (int32) in schema tests
- Fix CMakeLists.txt QPX test filenames to use quantms.* naming

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add out_qpx parameter to ProteinQuantifier for exporting QPX parquet files (quantms.feature.parquet, quantms.psm.parquet, quantms.pg.parquet). Only supported for consensusXML input; warns and skips for featureXML and idXML inputs. Reuses existing ConsensusMapArrowExport, QPXFile, and ProteinGroupArrowExport classes. Protein quantifications are annotated to protein groups before export (sharing the same preparation as the mzTab export path). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Throw InvalidParameter instead of silently returning success when out_qpx is the only output and input is featureXML/idXML. When other outputs (out/peptide_out) are also set, still warn and skip QPX. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add brief descriptions to struct declarations and all static type helper methods to satisfy docstring coverage requirements. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
QPXFile::exportToArrow now restores PSMSchema output for lossless round-trips used by FeatureMapArrowIO and ConsensusMapArrowIO. New exportPSMsToQPXArrow produces QPXPSMSchema for exchange format. exportToParquet uses the QPX-specific method. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… names Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 696c43a to 11ddd49
- Update QPX output filenames to quantms.* scheme (#8974)
- Add GenericWrapper removal (BREAKING) (#8981)
- Add ModifiedSincSmoother new algorithm (#8217)
- Add experimental BrukerTimsFile/BRUKER_TDF format support (#8975)

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
- Writes `quantms.feature.parquet`, `quantms.psm.parquet`, `quantms.pg.parquet`
- Rename `ConsensusFeatureExportSchema` to `QPXFeatureSchema` and add new `QPXPSMSchema` (existing `PSMSchema` kept unchanged for the import path)

Changes
Schema & naming
| Before | After |
| --- | --- |
| `precursor_charge` (int32) | `charge` (int16) |
| `reference_file_name` | `run_file_name` |
| `start_ion_mobility`/`stop_ion_mobility` | `ion_mobility_start`/`ion_mobility_stop` |
| `is_decoy` (int32) | `is_decoy` (bool) |
| `unique` (int32) | `unique` (bool) |
| `scan` (string/int32) | `scan` (list<int32>) |
| `cv_params` (string) | `cv_params` (list<struct>) |
| `intensities` (sample_accession/channel) | `intensities` (label/intensity) |
| `pg_accessions` (list<string>) | `pg_accessions` (list<struct>) |

New columns
- `mass_error_ppm`, `missed_cleavages`, `pg_positions`, `id_run_file_name`
- `cross_links`, `mz_array`, `intensity_array`, `charge_array`, `ion_type_array`, `ion_mobility_array`

Removed columns
- `quality`, `score`, `score_type`, `spectrum_reference`, `feature_metavalues`, `scan_reference_file_name`
- `score`, `score_type`, `higher_score_better`, `rank`, `peptide_identification_index`, `psm_metavalues`, `spectrum_metavalues`, `run_identifier`, `spectrum_reference`

Protein groups
- Nullability corrected (`not null` on label, intensity, score_name, score_value, etc.)

Files modified (9)
- `ArrowSchemaRegistry.h/.cpp` — new QPXFeatureSchema + QPXPSMSchema
- `ConsensusMapArrowExport.cpp` — rewritten feature export
- `QPXFile.cpp` — rewritten PSM export
- `ProteinGroupArrowExport.cpp` — nullability fixes
- `ProteomicsLFQ.cpp`/`IsobaricWorkflow.cpp` — filename updates
- `ArrowSchemaRegistry_test.cpp`/`QPXFile_test.cpp` — updated tests

Test plan
- `ArrowSchemaRegistry_test` passes (schema validation)
- `QPXFile_test` passes (PSM export round-trip)
- `TOPP_ProteomicsLFQ_qpx` passes (end-to-end QPX generation)
- `TOPP_ProteomicsLFQ_qpx_features` passes (feature schema check)
- `TOPP_ProteomicsLFQ_qpx_psms` passes (PSM schema check)
- `TOPP_ProteomicsLFQ_qpx_pg` passes (PG schema check)

🤖 Generated with Claude Code
Summary by CodeRabbit