
feat: align QPX parquet output with quantms template schema#8974

Merged
timosachsenberg merged 9 commits into develop from feature/qpx-quantms-schema-alignment on Mar 24, 2026

Conversation


@timosachsenberg timosachsenberg commented Mar 23, 2026

Summary

  • Align all 3 QPX parquet exports (feature, PSM, protein groups) with the quantms parquet exchange format specification
  • Rename output files to quantms.feature.parquet, quantms.psm.parquet, quantms.pg.parquet
  • Rename ConsensusFeatureExportSchema to QPXFeatureSchema and add new QPXPSMSchema (existing PSMSchema kept unchanged for the import path)
  • Fix column names, types, nullability, and struct field schemas to exactly match the quantms template

Changes

Schema & naming

| Old | New |
| --- | --- |
| precursor_charge (int32) | charge (int16) |
| reference_file_name | run_file_name |
| start_ion_mobility / stop_ion_mobility | ion_mobility_start / ion_mobility_stop |
| is_decoy (int32) | is_decoy (bool) |
| unique (int32) | unique (bool) |
| scan (string/int32) | scan (list<int32>) |
| cv_params (string) | cv_params (list<struct>) |
| intensities (sample_accession/channel) | intensities (label/intensity) |
| pg_accessions (list<string>) | pg_accessions (list<struct>) |

New columns

  • mass_error_ppm, missed_cleavages, pg_positions, id_run_file_name
  • PSM: cross_links, mz_array, intensity_array, charge_array, ion_type_array, ion_mobility_array

Removed columns

  • Feature: quality, score, score_type, spectrum_reference, feature_metavalues, scan_reference_file_name
  • PSM: score, score_type, higher_score_better, rank, peptide_identification_index, psm_metavalues, spectrum_metavalues, run_identifier, spectrum_reference

Protein groups

  • Fixed nullability on all struct child fields to match template (not null on label, intensity, score_name, score_value, etc.)

Files modified (9)

  • ArrowSchemaRegistry.h / .cpp — new QPXFeatureSchema + QPXPSMSchema
  • ConsensusMapArrowExport.cpp — rewritten feature export
  • QPXFile.cpp — rewritten PSM export
  • ProteinGroupArrowExport.cpp — nullability fixes
  • ProteomicsLFQ.cpp / IsobaricWorkflow.cpp — filename updates
  • ArrowSchemaRegistry_test.cpp / QPXFile_test.cpp — updated tests

Test plan

  • ArrowSchemaRegistry_test passes (schema validation)
  • QPXFile_test passes (PSM export round-trip)
  • TOPP_ProteomicsLFQ_qpx passes (end-to-end QPX generation)
  • TOPP_ProteomicsLFQ_qpx_features passes (feature schema check)
  • TOPP_ProteomicsLFQ_qpx_psms passes (PSM schema check)
  • TOPP_ProteomicsLFQ_qpx_pg passes (PG schema check)
  • All 3 output schemas verified identical to quantms template via pyarrow comparison

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added QPX PSM and feature Parquet/Arrow exports with charge, mass-error PPM, missed cleavages, is_decoy, ion mobility, spectral arrays, cross‑links, CV params, richer protein‑group annotations and protein-group positions.
  • Bug Fixes / Improvements
    • Tightened field nullability and numeric types for more consistent, non-null outputs; adjusted export semantics and data shapes.
  • Tests
    • Updated schema and export tests to validate the new PSM/feature/protein-group formats.
  • Chores
    • Parquet output filenames standardized to quantms.*.


coderabbitai bot commented Mar 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Replaces the legacy ConsensusFeature export with QPX-compliant Arrow/Parquet schemas: introduces QPXPSMSchema and QPXFeatureSchema, updates exporters to emit new columns and nested types, tightens nullability, revises PSM/feature export logic, and standardizes Parquet filenames to quantms.*.parquet.

Changes

Cohort / File(s) — Summary

  • Arrow Schema Definitions — src/openms/include/OpenMS/FORMAT/ArrowSchemaRegistry.h, src/openms/source/FORMAT/ArrowSchemaRegistry.cpp
    Removed ConsensusFeatureExportSchema; added QPXPSMSchema and QPXFeatureSchema with new column constants (e.g., CHARGE, MASS_ERROR_PPM, RUN_FILE_NAME, MZ_ARRAY, INTENSITY_ARRAY, ION_MOBILITY_ARRAY) and new nested type helpers (modificationsType(), additionalScoresType(), cvParamsType(), crossLinksType(), additionalIntensitiesType(), pgAccessionsType(), pgPositionsType()).
  • Feature / ConsensusMap Export — src/openms/source/FORMAT/ConsensusMapArrowExport.cpp
    Switched to QPXFeatureSchema: removed old builders, added run_file_name, charge (int16), missed_cleavages, mass_error_ppm, pg_accessions/pg_positions, ion-mobility start/stop; scan is now list<int32>; reworked modifications and protein-evidence encoding; is_decoy as boolean; introduced Reserve/Finish macros and schema validation against QPXFeatureSchema.
  • QPX PSM Export — src/openms/source/FORMAT/QPXFile.cpp
    Reworked PSM export to QPXPSMSchema: charge as int16, calculated_mz/observed_mz as float (non-null), added mass_error_ppm, missed_cleavages, cv_params, cross_links, spectral-array placeholders (mz_array, intensity_array, charge_array, ion_type_array, ion_mobility_array), scan as list<int32>, and changed default/null handling (e.g., non-null run_file_name, is_decoy).
  • Protein Group Export — src/openms/source/FORMAT/ProteinGroupArrowExport.cpp
    Tightened nullability of nested structs/lists (intensities, additional_intensities, peptides, additional_scores, cv_params); marked several top-level fields non-nullable (pg_accessions, anchor_protein, run_file_name, is_decoy, peptides); changed peptide-count emission (emit 0 instead of Null in specific cases).
  • Tests: Schema & Export Expectations — src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp, src/tests/class_tests/openms/source/QPXFile_test.cpp
    Replaced ConsensusFeatureExportSchema assertions with QPXPSMSchema (24 fields) and QPXFeatureSchema (31 fields) expectations; updated column names, ordering, types (e.g., charge INT16, mz fields FLOAT, scan list), and content checks to match the new schemas and export behavior.
  • Workflows, CLI & Test Harness — src/topp/IsobaricWorkflow.cpp, src/topp/ProteomicsLFQ.cpp, src/tests/topp/CMakeLists.txt, src/topp/ProteinQuantifier.cpp
    Updated Parquet output filenames/help text to quantms.feature.parquet, quantms.psm.parquet, quantms.pg.parquet; added optional out_qpx handling in ProteinQuantifier to emit the three QPX Parquet files and adjusted CTest expected file paths.

Sequence Diagram(s)

sequenceDiagram
    participant Exporter as Exporter (ConsensusMap / QPXFile / PGExport)
    participant SchemaReg as ArrowSchemaRegistry
    participant Arrow as Arrow Builders
    participant Parquet as Parquet Writer
    Exporter->>SchemaReg: request schema (QPXFeature/QPXPSM)
    SchemaReg-->>Exporter: return schema definition
    Exporter->>Arrow: create builders per schema
    Arrow-->>Exporter: builders ready
    Exporter->>Arrow: append rows (nested structs, lists, scalars)
    Arrow-->>Exporter: row builders filled
    Exporter->>Arrow: finish arrays
    Arrow-->>Exporter: arrays finalized
    Exporter->>Parquet: validate table against schema
    Parquet-->>Exporter: validation OK
    Exporter->>Parquet: write quantms.*.parquet
    Parquet-->>Exporter: write complete / file paths

Suggested reviewers

  • jpfeuffer
  • pjones

Poem

"I hopped through rows with carrot zeal,
New QPX fields tucked in my meal,
Parquet files named neat and bright,
Arrow nests tucked in the night,
The rabbit signs the schema — code feels real."

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — docstring coverage is 25.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — the title 'feat: align QPX parquet output with quantms template schema' accurately and concisely summarizes the main change: aligning QPX parquet exports with the quantms specification and renaming output files.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp (1)

1030-1033: Assert the scan element type too.

This currently only proves that scan is a list. A regression to list<int64> would still pass even though this schema is supposed to pin scan to list<int32>.

Suggested assertion
   TEST_EQUAL(s->field(13)->type()->id(), arrow::Type::LIST)
   TEST_EQUAL(s->field(13)->nullable(), false)
+  const auto scan_type = std::static_pointer_cast<arrow::ListType>(s->field(13)->type());
+  TEST_EQUAL(scan_type->value_type()->id(), arrow::Type::INT32)

As per coding guidelines, src/tests/class_tests/**/*_test.cpp: Write unit tests for new functionality and ensure existing tests pass before suggesting changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp` around
lines 1030 - 1033, Add an assertion that the "scan" list's element type is
int32: locate the existing checks on s->field(13) (name "scan", type LIST) and
add an assertion that the list's element type id is arrow::Type::INT32 by
retrieving the list's value type (e.g. static_cast/ptr cast to arrow::ListType
and call value_type()->id() or value_field()->type()->id()) and comparing it to
arrow::Type::INT32 so a regression to list<int64> will fail.
src/openms/source/FORMAT/ArrowSchemaRegistry.cpp (1)

403-437: Factor the shared QPX nested types.

modificationsType(), additionalScoresType(), and cvParamsType() are duplicated verbatim between QPXPSMSchema and QPXFeatureSchema. Since both classes are supposed to stay template-identical, extracting these into shared internal helpers will make it much harder to apply the next schema update inconsistently.

Also applies to: 485-519

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openms/source/FORMAT/ArrowSchemaRegistry.cpp` around lines 403 - 437,
Extract the duplicated Arrow nested type builders used in
QPXPSMSchema::modificationsType, QPXPSMSchema::additionalScoresType,
QPXPSMSchema::cvParamsType (and the identical methods in QPXFeatureSchema) into
shared internal helper functions (e.g., makeScoresType(), makePositionStruct(),
makeModificationsListType(), makeCvParamsListType) placed in an unnamed
namespace or as static helpers in this translation unit; replace the bodies of
QPXPSMSchema::modificationsType, ::additionalScoresType and ::cvParamsType (and
the corresponding QPXFeatureSchema methods) to call those helpers so both
classes reuse the same arrow::struct_/arrow::list constructions and avoid
duplication during future schema updates.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/openms/source/FORMAT/ConsensusMapArrowExport.cpp`:
- Around line 607-669: The code currently deduplicates peptide evidences by
accession (seen_accs) which loses repeated mappings; instead, only use seen_accs
when populating the unique protein_accs vector (for anchor_protein/unique), but
collect and push an EvidenceInfo for every entry from
best_hit->getPeptideEvidences() into evidences. Concretely: in the loop over
best_hit->getPeptideEvidences(), always create and push an EvidenceInfo (ei) for
each ev, and only call seen_accs.insert(acc) / push_back into protein_accs when
the accession is first encountered; leave the export loop that appends to
pg_accessions_builder / pga_struct_b / pga_acc_b / pga_start_b / pga_end_b /
pga_pre_b / pga_post_b unchanged so every PeptideEvidence is emitted (also apply
the same change in the analogous block around lines 708-709).
- Around line 537-554: The code currently reads the spectrum reference from the
generic meta value "spectrum_reference" which misses the dedicated
PeptideIdentification member; update the block that builds the scan (the logic
around pep_ids, spec_ref, scan_builder and scan_val_b) to call
PeptideIdentification::getSpectrumReference() first (on pep_ids[0]) and assign
that to spec_ref, and only if that returns empty fall back to checking
metaValueExists("spectrum_reference") / getMetaValue("spectrum_reference"); then
pass spec_ref to extractScan and append scan_num to scan_val_b as before.

In `@src/openms/source/FORMAT/QPXFile.cpp`:
- Around line 199-203: The export currently uses
ProteinIdentification::getPrimaryMSRunPath() and drops the hit metavalue
"reference_file_name", causing every PSM to be stamped with the first
identifier-level file; change the logic in QPXFile.cpp to prefer a per-hit or
per-PeptideIdentification file reference (check hit meta
"run_file_name"/"reference_file_name" or PeptideIdentification::getMetaValue())
when available and only fall back to
ProteinIdentification::getPrimaryMSRunPath() if no per-PSM/PeptideIdentification
file info exists or if the identifier-level path is unambiguous; update the code
paths around the excluded_hit_mvs handling and where run_file_name is resolved
(the blocks that reference excluded_hit_mvs and build PSM output) to read
per-hit/PeptideIdentification meta first and preserve reference_file_name if
used at PSM granularity.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6afe0f20-15eb-4ebe-8460-c8928ec91e9e

📥 Commits

Reviewing files that changed from the base of the PR and between c9a2677 and 1a3b1b8.

📒 Files selected for processing (9)
  • src/openms/include/OpenMS/FORMAT/ArrowSchemaRegistry.h
  • src/openms/source/FORMAT/ArrowSchemaRegistry.cpp
  • src/openms/source/FORMAT/ConsensusMapArrowExport.cpp
  • src/openms/source/FORMAT/ProteinGroupArrowExport.cpp
  • src/openms/source/FORMAT/QPXFile.cpp
  • src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp
  • src/tests/class_tests/openms/source/QPXFile_test.cpp
  • src/topp/IsobaricWorkflow.cpp
  • src/topp/ProteomicsLFQ.cpp


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
src/openms/source/FORMAT/QPXFile.cpp (1)

187-195: ⚠️ Potential issue | 🟠 Major

Resolve run_file_name and scan at PSM granularity.

Line 199 strips reference_file_name / scan out of the metadata path, but Lines 431-454 only rebuild those columns from the identifier-level MS run list and parsed spectrum_reference. In merged IDs or imports that already carry per-hit / per-PeptideIdentification values, this still exports the wrong raw file (first MS run) or an empty scan list even though the PSM already has the source metadata.

Possible fix
-  std::map<String, String> id_to_filename;
+  std::map<String, StringList> id_to_filenames;
   for (const auto& prot_id : protein_identifications)
   {
     StringList ms_runs;
     prot_id.getPrimaryMSRunPath(ms_runs);
     if (!ms_runs.empty())
     {
-      id_to_filename[prot_id.getIdentifier()] = ms_runs[0];
+      id_to_filenames[prot_id.getIdentifier()] = ms_runs;
     }
   }

   static const std::unordered_set<std::string> excluded_hit_mvs = {
     "target_decoy", "predicted_RT", "predicted_rt", "ion_mobility", "IM",
-    "scan", "reference_file_name"
+    "scan", "reference_file_name", "run_file_name"
   };
 ...
-  auto fn_it = id_to_filename.find(pep_id.getIdentifier());
+  auto fn_it = id_to_filenames.find(pep_id.getIdentifier());
 ...
-      if (fn_it != id_to_filename.end())
+      if (hit.metaValueExists("run_file_name"))
+      {
+        (void)run_file_name_builder.Append(hit.getMetaValue("run_file_name").toString());
+      }
+      else if (hit.metaValueExists("reference_file_name"))
+      {
+        (void)run_file_name_builder.Append(hit.getMetaValue("reference_file_name").toString());
+      }
+      else if (pep_id.metaValueExists("run_file_name"))
+      {
+        (void)run_file_name_builder.Append(pep_id.getMetaValue("run_file_name").toString());
+      }
+      else if (pep_id.metaValueExists("reference_file_name"))
+      {
+        (void)run_file_name_builder.Append(pep_id.getMetaValue("reference_file_name").toString());
+      }
+      else if (fn_it != id_to_filenames.end() && fn_it->second.size() == 1)
       {
-        (void)run_file_name_builder.Append(fn_it->second);
+        (void)run_file_name_builder.Append(fn_it->second[0]);
       }
       else
       {
         (void)run_file_name_builder.Append("");
       }
 ...
-      if (!spec_ref.empty())
+      if (hit.metaValueExists("scan"))
+      {
+        (void)scan_value_b->Append(static_cast<int32_t>(static_cast<int>(hit.getMetaValue("scan"))));
+      }
+      else if (pep_id.metaValueExists("scan"))
+      {
+        (void)scan_value_b->Append(static_cast<int32_t>(static_cast<int>(pep_id.getMetaValue("scan"))));
+      }
+      else if (!spec_ref.empty())
       {
         Int scan_num = extractScan(spec_ref);

Also applies to: 199-203, 228-229, 431-454

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openms/source/FORMAT/QPXFile.cpp` around lines 187 - 195, The export
currently uses id_to_filename (built from protein_identifications via
getPrimaryMSRunPath) and parsed spectrum_reference to populate
reference_file_name/run_file_name and scan, which ignores per-PSM or
per-PeptideIdentification metadata; update the logic in QPXFile.cpp (places
around id_to_filename population and the column rebuild at the section handling
spectrum_reference / scan around the existing spectrum parsing) to first check
and use per-hit or per-PeptideIdentification fields (e.g., inspect PeptideHit
metadata and PeptideIdentification members for run_file_name,
reference_file_name, and scan) and only fall back to
id_to_filename[prot_id.getIdentifier()] or parsing spectrum_reference when those
per-PSM values are absent, ensuring scan numbers and source file names are
resolved at PSM granularity.
🧹 Nitpick comments (2)
src/openms/source/FORMAT/QPXFile.cpp (1)

164-185: Use '\n' in the new error logs.

The added reserve/finish error paths still use std::endl, which forces an unnecessary flush on each branch. Please switch the new messages to '\n' to stay consistent with the repo logging guidance. As per coding guidelines, "Use OpenMS logging macros and OpenMS::LogStream; avoid std::cout/err directly; avoid std::endl for performance (use \n)"

Also applies to: 526-555, 563-565

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openms/source/FORMAT/QPXFile.cpp` around lines 164 - 185, Replace the use
of std::endl in the new Reserve/Finish error logging branches with '\n' to avoid
unnecessary flushes; specifically update the error messages that reference
charge_builder, pep_builder, is_decoy_builder, calculated_mz_builder,
observed_mz_builder, mass_error_ppm_builder, predicted_rt_builder,
run_file_name_builder, rt_builder, ion_mobility_builder, and
missed_cleavages_builder in QPXFile.cpp (and the similar occurrences around the
ranges noted: ~526-555 and 563-565) so each OPENMS_LOG_ERROR << ... ends with
'\n' instead of std::endl. Ensure no other behavior changes are made and
preserve the existing message text and ToString() usage.
src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp (1)

1023-1067: Assert the child type of the new primitive-list fields.

These sections only verify that scan, protein_accessions, mz_array, intensity_array, charge_array, ion_type_array, ion_mobility_array, gg_accessions, and gg_names are LIST. A regression like list<int32> → list<int64> or list<float32> → list<float64> would still pass, which weakens coverage for the schema-alignment change.

Example assertions
   TEST_EQUAL(s->field(13)->type()->id(), arrow::Type::LIST)
+  TEST_EQUAL(s->field(13)->type()->Equals(arrow::list(arrow::int32())), true)
...
   TEST_EQUAL(s->field(19)->type()->id(), arrow::Type::LIST)
+  TEST_EQUAL(s->field(19)->type()->Equals(arrow::list(arrow::float32())), true)

As per coding guidelines, "Write unit tests for new functionality and ensure existing tests pass before suggesting changes"

Also applies to: 1191-1194, 1247-1254

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp` around
lines 1023 - 1067, The tests only assert that many columns (e.g. "scan",
"protein_accessions", "mz_array", "intensity_array", "charge_array",
"ion_type_array", "ion_mobility_array", also "gg_accessions" and "gg_names") are
LISTs but do not assert the child element type; update the assertions to check
the list child type for each field by inspecting
s->field(N)->type()->field(0)->type()->id() (or equivalent Arrow API) and
compare to the correct arrow::Type (e.g., INT32 for scan/charge_array, FLOAT for
mz_array/intensity_array/ion_mobility_array, STRING/UTF8 for
protein_accessions/ion_type_array/gg_names, etc.); add these child-type checks
adjacent to the existing LIST assertions and apply the same fixes at the other
indicated blocks (around the other ranges cited).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c4dcc616-c842-4d4f-87e9-b4107cedbd53

📥 Commits

Reviewing files that changed from the base of the PR and between 1a3b1b8 and d8a9360.

📒 Files selected for processing (3)
  • src/openms/source/FORMAT/QPXFile.cpp
  • src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp
  • src/tests/topp/CMakeLists.txt


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/openms/source/FORMAT/ConsensusMapArrowExport.cpp (1)

434-456: ⚠️ Potential issue | 🔴 Critical

Guard findScoreType(...PEP) here.

Unlike QPXFile::exportToArrow, Line 437 is unguarded. IDScoreSwitcherAlgorithm::findScoreType() throws when no PEP-like score exists, which turns a missing optional score into a full export failure.

Possible fix
     if (has_id)
     {
-      auto result = idsa.findScoreType(pep_ids[0], IDScoreSwitcherAlgorithm::ScoreType::PEP);
+      IDScoreSwitcherAlgorithm::ScoreSearchResult result{};
+      try
+      {
+        result = idsa.findScoreType(pep_ids[0], IDScoreSwitcherAlgorithm::ScoreType::PEP);
+      }
+      catch (...)
+      {
+      }
       if (!result.score_name.empty() && best_hit)
       {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openms/source/FORMAT/ConsensusMapArrowExport.cpp` around lines 434 - 456,
The call to idsa.findScoreType(pep_ids[0],
IDScoreSwitcherAlgorithm::ScoreType::PEP) can throw when no PEP-like score
exists; wrap or guard that call in a presence check or try/catch so a missing
PEP does not abort export. Specifically, before calling idsa.findScoreType use
the same existence check pattern used in QPXFile::exportToArrow (or catch the
exception from idsa.findScoreType), and then only access
result.score_name/is_main_score_type and use pep_builder.Append,
pep_builder.AppendNull, best_hit and pep_ids accordingly when a valid result is
returned; otherwise call pep_builder.AppendNull to preserve export.
♻️ Duplicate comments (3)
src/openms/source/FORMAT/ConsensusMapArrowExport.cpp (2)

612-618: ⚠️ Potential issue | 🟠 Major

Emit pg_positions from the evidences you already collected.

Lines 621-636 preserve accession/start/end for every peptide mapping, but Line 717 still writes pg_positions = null for every row. That drops the new positional column entirely and loses repeated mappings that pg_accessions now preserves.

Possible fix
-    // === pg_positions (empty for now — no positional data available at consensus level) ===
-    (void)pg_positions_builder.AppendNull();
+    (void)pg_positions_builder.Append();
+    for (const auto& ei : evidences)
+    {
+      if (ei.start == PeptideEvidence::UNKNOWN_POSITION || ei.end == PeptideEvidence::UNKNOWN_POSITION)
+      {
+        continue;
+      }
+      (void)pgp_struct_b->Append();
+      (void)pgp_acc_b->Append(ei.acc);
+      (void)pgp_start_b->Append(ei.start);
+      (void)pgp_end_b->Append(ei.end);
+    }

Also applies to: 640-677, 716-717

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openms/source/FORMAT/ConsensusMapArrowExport.cpp` around lines 612 - 618,
The code collects per-peptide mapping info into protein_accs and evidences
(struct EvidenceInfo) but still emits pg_positions as null; update the export
logic that writes pg_positions to build and write a positions string from
evidences (one entry per protein_accs item, preserving repeats and ordering)
using each EvidenceInfo's start/end (and optionally pre/post if expected),
matching the pg_accessions output format (i.e., same delimiter and ordering as
protein_accs) so repeated mappings are represented; locate the writer that
currently sets pg_positions = null and replace it with code that iterates
evidences to produce the positions field.

609-610: ⚠️ Potential issue | 🟠 Major

Populate id_run_file_name instead of hard-coding nulls.

Line 610 makes this new column permanently null, even when the best hit or PeptideIdentification already carries reference_file_name / run_file_name. That drops identification-run provenance for multi-run consensus exports.

Possible fix
-    // === id_run_file_name ===
-    (void)id_run_file_name_builder.AppendNull();
+    // === id_run_file_name ===
+    String id_run_file_name;
+    if (best_hit && best_hit->metaValueExists("reference_file_name"))
+    {
+      id_run_file_name = best_hit->getMetaValue("reference_file_name").toString();
+    }
+    else if (best_hit && best_hit->metaValueExists("run_file_name"))
+    {
+      id_run_file_name = best_hit->getMetaValue("run_file_name").toString();
+    }
+    else if (!pep_ids.empty() && pep_ids[0].metaValueExists("reference_file_name"))
+    {
+      id_run_file_name = pep_ids[0].getMetaValue("reference_file_name").toString();
+    }
+    else if (!pep_ids.empty() && pep_ids[0].metaValueExists("run_file_name"))
+    {
+      id_run_file_name = pep_ids[0].getMetaValue("run_file_name").toString();
+    }
+    if (id_run_file_name.empty())
+    {
+      (void)id_run_file_name_builder.AppendNull();
+    }
+    else
+    {
+      (void)id_run_file_name_builder.Append(id_run_file_name);
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openms/source/FORMAT/ConsensusMapArrowExport.cpp` around lines 609-610:
the id_run_file_name column is always set to null by calling
id_run_file_name_builder.AppendNull(); instead populate it from the
identification provenance: extract run file name from the PeptideIdentification
(meta keys "reference_file_name" or "run_file_name") or from the best PeptideHit
if present, then call id_run_file_name_builder.Append(<that string>) instead of
AppendNull when a value exists; keep AppendNull only as the fallback. Update the
logic near ConsensusMapArrowExport.cpp where id_run_file_name_builder is used so
the builder appends the actual run name taken from the
PeptideIdentification/PeptideHit metadata (use the existing best-hit retrieval
code paths to find the value).
src/openms/source/FORMAT/QPXFile.cpp (1)

431-451: ⚠️ Potential issue | 🟠 Major

Only use the identifier-level run fallback when it is unambiguous.

The new fallback at Lines 447-450 still goes through id_to_filename, which currently keeps only ms_runs[0]. When one ProteinIdentification covers multiple primary MS runs, unresolved PSMs get stamped with the first file, so run_file_name becomes wrong for merged search results.

Possible fix
-  std::map<String, String> id_to_filename;
+  std::map<String, StringList> id_to_filenames;
   for (const auto& prot_id : protein_identifications)
   {
     StringList ms_runs;
     prot_id.getPrimaryMSRunPath(ms_runs);
     if (!ms_runs.empty())
     {
-      id_to_filename[prot_id.getIdentifier()] = ms_runs[0];
+      id_to_filenames[prot_id.getIdentifier()] = ms_runs;
     }
   }
...
-    auto fn_it = id_to_filename.find(pep_id.getIdentifier());
+    auto fn_it = id_to_filenames.find(pep_id.getIdentifier());
...
-        if (run_file.empty() && fn_it != id_to_filename.end())
+        if (run_file.empty() && fn_it != id_to_filenames.end() && fn_it->second.size() == 1)
         {
-          run_file = fn_it->second;
+          run_file = fn_it->second[0];
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openms/source/FORMAT/QPXFile.cpp` around lines 431-451: the fallback to
identifier-level filenames can assign the wrong run when an identifier covers
multiple MS runs; modify the block that uses id_to_filename/fn_it so you only
append fn_it->second when that mapping is unambiguous: e.g., check that fn_it !=
id_to_filename.end() AND that the identifier maps to a single filename (or
id_to_filename contains exactly one entry for this identifier / a precomputed
unambiguous set) before using run_file = fn_it->second; update the logic around
run_file, fn_it, id_to_filename and run_file_name_builder in the same scope as
pep_id and hit to avoid defaulting to the first ms_runs entry when multiple runs
exist.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/openms/source/FORMAT/ConsensusMapArrowExport.cpp`:
- Around lines 434-456: the call to idsa.findScoreType(pep_ids[0],
IDScoreSwitcherAlgorithm::ScoreType::PEP) can throw when no PEP-like score
exists; wrap or guard that call in a presence check or try/catch so a missing
PEP does not abort export. Specifically, before calling idsa.findScoreType use
the same existence check pattern used in QPXFile::exportToArrow (or catch the
exception from idsa.findScoreType), and then only access
result.score_name/is_main_score_type and use pep_builder.Append,
pep_builder.AppendNull, best_hit and pep_ids accordingly when a valid result is
returned; otherwise call pep_builder.AppendNull to preserve export.
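The guard described above can be sketched without OpenMS. `findScoreTypeOrThrow` below is a hypothetical stand-in for `IDScoreSwitcherAlgorithm::findScoreType`; the point is the try/catch-to-optional pattern so a missing PEP-like score yields an `AppendNull` rather than aborting the export.

```cpp
#include <optional>
#include <stdexcept>

// Stand-in for IDScoreSwitcherAlgorithm::findScoreType, which (per the
// review) can throw when no PEP-like score exists on the identification.
double findScoreTypeOrThrow(bool has_pep_score)
{
  if (!has_pep_score)
  {
    throw std::runtime_error("no PEP score found");
  }
  return 0.01; // dummy score for the sketch
}

// Guarded lookup: convert the throwing call into an optional, so the caller
// can append null to the pep column instead of letting the export die.
std::optional<double> tryFindPep(bool has_pep_score)
{
  try
  {
    return findScoreTypeOrThrow(has_pep_score);
  }
  catch (const std::exception&)
  {
    return std::nullopt; // caller does pep_builder.AppendNull()
  }
}
```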


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d36fbe18-3460-40fe-b266-6606c0522728

📥 Commits

Reviewing files that changed from the base of the PR and between d8a9360 and 3e623f1.

📒 Files selected for processing (4)
  • src/openms/source/FORMAT/ArrowSchemaRegistry.cpp
  • src/openms/source/FORMAT/ConsensusMapArrowExport.cpp
  • src/openms/source/FORMAT/QPXFile.cpp
  • src/tests/class_tests/openms/source/ArrowSchemaRegistry_test.cpp

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/topp/ProteinQuantifier.cpp (1)

996-1009: Extract the shared protein-annotation setup into one helper.

This block duplicates the mzTab export preparation above. Pulling the annotate/insert/swap sequence into a shared helper will keep the QPX and mzTab paths from drifting the next time inference_in_cxml handling changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/topp/ProteinQuantifier.cpp` around lines 996-1009: extract the
duplicated annotate/insert/swap sequence into a single helper (e.g.,
annotateAndAttachProteins) and call it from both places: move the logic that
obtains protein_quants via quantifier.getProteinResults(), calls
quantifier.annotateQuantificationsToProteins(protein_quants, proteins_), and
then either inserts proteins_ into consensus.getProteinIdentifications() or
swaps with consensus.getProteinIdentifications()[0] depending on
inference_in_cxml into that helper; the helper should accept the quantifier (or
protein_quants), proteins_ (by reference), consensus (or a reference to
consensus.getProteinIdentifications()), and the inference_in_cxml flag so
callers simply invoke annotateAndAttachProteins(...) instead of duplicating the
block.
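A sketch of the suggested helper, using simplified stand-in types (the real code works with PeptideAndProteinQuant, ProteinIdentification, and ConsensusMap); `annotateAndAttachProteins` is the reviewer's proposed name, not an existing OpenMS function.

```cpp
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for OpenMS ProteinIdentification, illustration only.
struct ProteinIdentification
{
  std::string label;
};

// After quantifications have been annotated onto `proteins`, attach them to
// the map's identification runs: swap into the existing inference run when
// inference came from the consensusXML, otherwise prepend a new run.
void annotateAndAttachProteins(ProteinIdentification& proteins,
                               std::vector<ProteinIdentification>& run_ids,
                               bool inference_in_cxml)
{
  if (inference_in_cxml && !run_ids.empty())
  {
    std::swap(run_ids[0], proteins); // reuse the inference run already in the map
  }
  else
  {
    run_ids.insert(run_ids.begin(), proteins); // prepend the annotated run
  }
}
```

With one shared helper, the mzTab and QPX export paths cannot drift apart the next time the inference_in_cxml handling changes.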

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5f223088-85b1-4cc3-a280-8c0fb4c4b9c0

📥 Commits

Reviewing files that changed from the base of the PR and between 3e623f1 and 980cf1f.

📒 Files selected for processing (1)
  • src/topp/ProteinQuantifier.cpp

Comment on lines +1013 to +1016
if (!ConsensusMapArrowExport::exportToParquet(consensus, out_qpx + "/quantms.feature.parquet"))
{
  OPENMS_LOG_ERROR << "Failed to write features Parquet file" << std::endl;
}

⚠️ Potential issue | 🟠 Major

Fail fast when a requested Parquet export cannot be written.

Lines 1013, 1031, and 1037 only log and continue, so the tool can still return EXECUTION_OK with a partial quantms.*.parquet set. Please throw on the first failed export instead of continuing. As per coding guidelines, exceptions must derive from Exception::Base; throw with file/line/OPENMS_PRETTY_FUNCTION.

Also applies to: 1031-1034, 1037-1040

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/topp/ProteinQuantifier.cpp` around lines 1013-1016: replace the silent
error logs after each failed Parquet export with a thrown OpenMS exception
derived from Exception::Base: when ConsensusMapArrowExport::exportToParquet(...)
returns false (and the analogous checks at the other two locations), throw an
appropriate Exception (e.g., Exception::UnableToCreateFile or other
Exception::Base-derived) constructed with a clear message, __FILE__, __LINE__,
and OPENMS_PRETTY_FUNCTION; remove the OPENMS_LOG_ERROR-only path so the program
fails fast on the first failed export.
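The fail-fast pattern can be sketched as follows; `UnableToCreateFile` below is a `std::runtime_error` stand-in for the OpenMS `Exception::Base`-derived exception, and `exportOrThrow` is a hypothetical wrapper, not an existing API.

```cpp
#include <stdexcept>
#include <string>

// Stand-in for OpenMS Exception::UnableToCreateFile, which carries
// file/line/function context per the coding guidelines.
struct UnableToCreateFile : std::runtime_error
{
  UnableToCreateFile(const std::string& file, int line,
                     const std::string& function, const std::string& path)
    : std::runtime_error(file + ":" + std::to_string(line) + " " + function +
                         ": could not create '" + path + "'")
  {
  }
};

// Fail-fast wrapper: throw on the first failed export instead of logging
// and continuing with a partial quantms.*.parquet set.
template <typename ExportFn>
void exportOrThrow(ExportFn exporter, const std::string& path)
{
  if (!exporter(path))
  {
    throw UnableToCreateFile(__FILE__, __LINE__, __func__, path);
  }
}
```

Calling this for each of the three exports guarantees the tool never returns EXECUTION_OK with a partial output set.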

Comment on lines +1062 to +1066
     if (out.empty() && peptide_out.empty() && out_qpx.empty())
     {
       throw Exception::RequiredParameterNotGiven(__FILE__, __LINE__,
                                                  OPENMS_PRETTY_FUNCTION,
-                                                 "out/peptide_out");
+                                                 "out/peptide_out/out_qpx");

⚠️ Potential issue | 🟡 Minor

Don't advertise out_qpx in builds that don't compile Parquet support.

In non-Parquet builds, Line 1066 still tells users to provide out_qpx, but that option is not registered there. Please keep the exception text aligned with the compiled feature set.

Suggested fix
+#ifdef WITH_PARQUET
+    const char* required_outputs = "out/peptide_out/out_qpx";
+#else
+    const char* required_outputs = "out/peptide_out";
+#endif
     if (out.empty() && peptide_out.empty() && out_qpx.empty())
     {
       throw Exception::RequiredParameterNotGiven(__FILE__, __LINE__,
                                                  OPENMS_PRETTY_FUNCTION,
-                                                 "out/peptide_out/out_qpx");
+                                                 required_outputs);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/topp/ProteinQuantifier.cpp` around lines 1062-1066: the thrown
Exception::RequiredParameterNotGiven in the conditional that checks
out/peptide_out/out_qpx currently mentions "out_qpx" even in builds without
Parquet support; change the exception message so "out_qpx" is only included when
the same Parquet compile-time macro used to register the out_qpx option is
defined (i.e., mirror the macro used elsewhere where out_qpx is conditionally
registered), by building the message string conditionally (via `#ifdef` the
Parquet macro or by checking the same compile-time flag) before calling
Exception::RequiredParameterNotGiven and otherwise only mentioning
"out/peptide_out".

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
src/openms/include/OpenMS/FORMAT/ArrowSchemaRegistry.h (1)

204-208: Add @ingroup to the new public schema docs.

Both new exported structs have @brief, but they still need the module @ingroup tag so they get attached to the generated FORMAT/API docs.

As per coding guidelines, src/openms/include/**/*.h: "Doxygen documentation with @brief required (not auto-generated from first line); use @defgroup/@ingroup."

Also applies to: 247-251

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openms/include/OpenMS/FORMAT/ArrowSchemaRegistry.h` around lines 204-208:
the Doxygen for the new exported struct QPXPSMSchema (and the nearby second
exported struct defined just below it) is missing an `@ingroup` tag so they don't
appear in the generated FORMAT/API docs; update the Doxygen block for
QPXPSMSchema and the adjacent exported struct to include the appropriate module
group (e.g., add an "@ingroup FORMAT" line) so both structs are attached to the
correct doxygen group per the project's documentation guidelines.
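A sketch of the requested documentation block. The group name `FORMAT` mirrors the reviewer's example and is an assumption, as is the placeholder struct body — the real `QPXPSMSchema` declaration in ArrowSchemaRegistry.h would simply gain the `@ingroup` line.

```cpp
// Illustration only: QPXPSMSchemaDoc is a placeholder, not the real struct,
// and "FORMAT" is assumed to match the project's existing @defgroup name.

/**
  @brief Arrow schema for the quantms PSM parquet exchange format.

  @ingroup FORMAT
*/
struct QPXPSMSchemaDoc
{
  int field_count = 0; // placeholder member for the sketch
};
```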

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8092531c-e3f4-4553-8aa3-c952db116935

📥 Commits

Reviewing files that changed from the base of the PR and between 980cf1f and 9b4b2b4.

📒 Files selected for processing (1)
  • src/openms/include/OpenMS/FORMAT/ArrowSchemaRegistry.h

timosachsenberg and others added 9 commits March 24, 2026 10:05
Update the QPX feature, PSM, and protein group parquet exports to
match the quantms parquet exchange format specification. This ensures
interoperability with downstream quantms tools and viewers.

Schema changes (all 3 files):
- Rename ConsensusFeatureExportSchema -> QPXFeatureSchema
- Add new QPXPSMSchema (keep PSMSchema for import path)
- Column renames: precursor_charge->charge, reference_file_name->
  run_file_name, start/stop_ion_mobility->ion_mobility_start/stop
- Type fixes: is_decoy/unique int->bool, charge int32->int16,
  scan scalar->list<int32>, cv_params string->list<struct>,
  modifications position string->int32 with amino_acid+scores,
  intensities field names, pg_accessions list<string>->list<struct>
- New columns: mass_error_ppm, missed_cleavages, pg_positions,
  id_run_file_name, cross_links, peak arrays (mz/intensity/charge/
  ion_type/ion_mobility arrays)
- Removed columns: quality, score, score_type, spectrum_reference,
  feature_metavalues, scan_reference_file_name, rank,
  peptide_identification_index, psm/spectrum_metavalues,
  run_identifier, higher_score_better
- Correct nullability on all fields and struct children
- Rename output files: features.parquet->quantms.feature.parquet,
  psms.parquet->quantms.psm.parquet,
  protein_groups.parquet->quantms.pg.parquet

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix CMakeLists.txt TOPP test references to use new filenames
  (quantms.feature/psm/pg.parquet instead of old names)
- Add QPXPSMSchema unit tests to ArrowSchemaRegistry_test.cpp
  (field count, column names, types, nullability, validation)
- Use actual finalized array length for MakeArrayOfNull in
  QPXFile.cpp instead of fragile pre-loop estimate

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Emit all PeptideEvidences in pg_accessions (not just first per
  accession), preserving repeated accessions at different positions
- Use PeptideIdentification::getSpectrumReference() before falling
  back to metavalue "spectrum_reference" for scan extraction
- Prefer per-hit/per-PeptideIdentification file reference for
  run_file_name in QPXFile, fall back to identifier-level path
- Deduplicate shared Arrow type builders: QPXFeatureSchema delegates
  modificationsType/additionalScoresType/cvParamsType to QPXPSMSchema
- Add scan list element type assertion (int32) in schema tests
- Fix CMakeLists.txt QPX test filenames to use quantms.* naming

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add out_qpx parameter to ProteinQuantifier for exporting QPX parquet
files (quantms.feature.parquet, quantms.psm.parquet, quantms.pg.parquet).
Only supported for consensusXML input; warns and skips for featureXML
and idXML inputs.

Reuses existing ConsensusMapArrowExport, QPXFile, and
ProteinGroupArrowExport classes. Protein quantifications are annotated
to protein groups before export (sharing the same preparation as the
mzTab export path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Throw InvalidParameter instead of silently returning success when
out_qpx is the only output and input is featureXML/idXML. When other
outputs (out/peptide_out) are also set, still warn and skip QPX.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add brief descriptions to struct declarations and all static type
helper methods to satisfy docstring coverage requirements.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
QPXFile::exportToArrow now restores PSMSchema output for lossless
round-trips used by FeatureMapArrowIO and ConsensusMapArrowIO.
New exportPSMsToQPXArrow produces QPXPSMSchema for exchange format.
exportToParquet uses the QPX-specific method.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… names

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@timosachsenberg timosachsenberg force-pushed the feature/qpx-quantms-schema-alignment branch from 696c43a to 11ddd49 Compare March 24, 2026 09:05
@timosachsenberg timosachsenberg enabled auto-merge (squash) March 24, 2026 10:53
@timosachsenberg timosachsenberg merged commit bae1edb into develop Mar 24, 2026
41 of 42 checks passed
@timosachsenberg timosachsenberg deleted the feature/qpx-quantms-schema-alignment branch March 24, 2026 10:55
github-actions bot pushed a commit that referenced this pull request Mar 25, 2026
- Update QPX output filenames to quantms.* scheme (#8974)
- Add GenericWrapper removal (BREAKING) (#8981)
- Add ModifiedSincSmoother new algorithm (#8217)
- Add experimental BrukerTimsFile/BRUKER_TDF format support (#8975)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
timosachsenberg pushed a commit that referenced this pull request Mar 25, 2026
- Update QPX output filenames to quantms.* scheme (#8974)
- Add GenericWrapper removal (BREAKING) (#8981)
- Add ModifiedSincSmoother new algorithm (#8217)
- Add experimental BrukerTimsFile/BRUKER_TDF format support (#8975)

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>