Add/datastructures by singjc · Pull Request #1 · OpenMS/pyopenms_viz

singjc · 2024-04-25T01:06:15Z

Added data structure schema for chromatogram and spectrum data frames.

singjc · 2024-04-25T12:26:38Z

@axelwalter this is the general data frame schema we use in MassDash. We generally deal with a group of traces for a specific precursor, so we have additional meta-data columns that differentiate which traces come from which ions (precursor or fragments). I'm not sure if some of this is stored in the MSChromatogram in the meta value interface, so I've split up the required columns and additional optional meta-data and feature columns.

I am working on adapting our current bokeh plotting to take data frame input in another branch, and I will upload a test tsv file of the chromatogram data frame. You could use it while implementing the to_df methods in pyOpenMS.

jcharkow · 2024-04-25T14:08:01Z

Originally I was thinking we should just be using the pyOpenMS MSChromatogram and MSSpectrum object instead of creating new objects. In that case I'm not sure if we would need these data structures

singjc · 2024-04-25T14:35:52Z

> Originally I was thinking we should just be using the pyOpenMS MSChromatogram and MSSpectrum object instead of creating new objects. In that case I'm not sure if we would need these data structures

I don't think we would be creating new objects, just adding conversion methods to the MSChromatogram and MSSpectrum objects to convert the structured data into a dataframe (similar to MSExperiment). We discussed yesterday that the core plotting should take a dataframe as input, so that it can be more usable if someone already has a dataframe derived from something other than an MSChromatogram object.

jcharkow · 2024-04-25T14:59:07Z

Hmm on first thought I do not know if I am in favor of having plotting methods from DataFrames rather than OpenMS objects. Some reasons why I am against this are:

Dataframes are so customizable, it might be difficult to ensure that a Dataframe in the correct format has been provided to the plotter.
DataFrames as input isolates the package from the OpenMS ecosystem
DataFrames are not a good structure to store metadata in.

Having said that I do think it is good to have DataFrame support possibly conversion functions from DataFrames to OpenMS objects.

@axelwalter @singjc please let me know your thoughts.

singjc · 2024-04-25T15:51:14Z

I don't know, I am partially more in favour of DataFrames rather than OpenMS objects, but this might be more of a preference since I prefer the grammar of graphics way of plotting.

> Dataframes are so customizable, it might be difficult to ensure that a Dataframe in the correct format has been provided to the plotter.

This may be true, but I don't think it's a major issue. I think the main concerns with the flexibility of DataFrames are the column names and the data types. But I guess this is why we define a schema: to ensure that at least the required columns are present.
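A minimal sketch of what such a schema check could look like, assuming for illustration that `time` and `intensity` are the only required chromatogram columns. The constant and function names here are hypothetical and are not part of pyopenms_viz.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical minimal schema (assumption for illustration only).
REQUIRED_CHROMATOGRAM_COLUMNS = ("time", "intensity")


def validate_chromatogram_df(df: pd.DataFrame) -> None:
    """Raise ValueError if a required column is missing or not numeric."""
    missing = [c for c in REQUIRED_CHROMATOGRAM_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    non_numeric = [
        c for c in REQUIRED_CHROMATOGRAM_COLUMNS if not is_numeric_dtype(df[c])
    ]
    if non_numeric:
        raise ValueError(f"columns must be numeric: {non_numeric}")


# A well-formed frame passes silently; extra columns are allowed.
validate_chromatogram_df(
    pd.DataFrame({"time": [6225.0], "intensity": [229.0], "Annotation": ["prec"]})
)
```

A plotter could call a check like this first and fail early with a readable message instead of a KeyError deep inside the plotting code.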

> DataFrames as input isolates the package from the OpenMS ecosystem

I guess if we are strictly speaking about this visualization package, then I can see how it would feel isolated from the main OpenMS ecosystem (similar to pyprophet, DIAlignR).

> DataFrames are not a good structure to store metadata in.

I'm not sure if this is entirely true. DataFrames allow you to store multiple data types in a single DataFrame, I think maybe the concern would be how much metadata you are planning to store. Then it could be an issue.

I do agree with having conversion methods for going from DataFrames to OpenMS objects, and vice-versa though. I think this would be useful for using pyOpenMS for more exploratory, data wrangling and algorithmic development stuff.

jcharkow · 2024-04-25T17:10:13Z

@singjc You make a good point that DataFrames might be a more intuitive structure to plot from than the pyopenms types, so let's stick with the DataFrames. Furthermore, with the conversion functions it should be pretty easy to use either one as input.

In that case we might need to rename MSChromatogram and MSSpectrum to MSChromatogramDf and MSSpectrumDf so they are not confused with the pyopenms types? Also, is the idea that multiple chromatograms will be in a single dataframe, or just a single chromatogram per dataframe? That affects my comments on the schema.

jcharkow · 2024-04-25T17:11:14Z

Having multiple chromatograms/spectra might make it more confusing but it allows for groupby functions which likely makes plotting easier?

singjc · 2024-04-25T17:48:10Z

Yes, I was thinking that the DataFrame can contain data for more than one ion, so multiple chromatogram ion traces per fragment and precursor. This is similar to how we represent the extracted featuremap as a dataframe in MassDash.

So something like this: test/test_data/ionMobilityTestChromatogramDf.tsv

ms_level	mz	rt	int	precursor_mz	product_mz	Annotation
1	642.3342	6225.005	229.0117	642.3295	642.3295	prec
2	504.2620	6225.111	152.0026	642.3295	504.2664	y4^1
2	591.2981	6225.111	273.0037	642.3295	591.2984	y5^1
2	704.3849	6225.111	41.0010	642.3295	704.3825	y6^1

If the metadata columns (precursor_mz, product_mz and Annotation) are not in the DataFrame, then we have to assume all of the data belongs to a single target ion, unless we try to infer it from the mz column.

The groupby makes plotting subsets of the data easier, so we don't have to extract the arrays per individual structure, which is currently also how we do it in MassDash.
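As a rough illustration of the groupby approach, the sketch below loads the example rows shown above and splits them into one (rt, int) pair of arrays per ion trace. The variable names are hypothetical; a real plotter would hand each pair to its backend instead of storing them in a dict.

```python
import io

import pandas as pd

# The example rows from above, as tab-separated text.
TSV = (
    "ms_level\tmz\trt\tint\tprecursor_mz\tproduct_mz\tAnnotation\n"
    "1\t642.3342\t6225.005\t229.0117\t642.3295\t642.3295\tprec\n"
    "2\t504.2620\t6225.111\t152.0026\t642.3295\t504.2664\ty4^1\n"
    "2\t591.2981\t6225.111\t273.0037\t642.3295\t591.2984\ty5^1\n"
    "2\t704.3849\t6225.111\t41.0010\t642.3295\t704.3825\ty6^1\n"
)
df = pd.read_csv(io.StringIO(TSV), sep="\t")

# One entry per ion trace, keyed by annotation, in order of first appearance.
traces = {
    annotation: (group["rt"].to_numpy(), group["int"].to_numpy())
    for annotation, group in df.groupby("Annotation", sort=False)
}

for annotation, (rt, intensity) in traces.items():
    print(annotation, len(rt), "point(s)")
```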

I have a branch (PR) here that I'm working on for the plotting of Chromatograms using DataFrames as the input

jcharkow

Looks good. Assuming here that FEATURE_CHROMATOGRAM columns will be present in the chromatogram table. Some suggestions for changing schema.

jcharkow · 2024-04-25T18:32:20Z

pyopenms_viz/datastructures/MSChromatogram.py

+"""
+
+REQUIRED_CHROMATOGRAM_DATAFRAME_COLUMNS = {
+    "mz": "Numeric column representing the mass-to-charge ratio (m/z) of the extracted retention time point.",


Do not know if mz is required? Maybe put in optional?

Agree, should be optional. The bare minimum should be time and intensity for chromatograms.

Even though it's called MSChromatogram, we use that to store any kind of chromatogram data, not necessarily from MS data.

jcharkow · 2024-04-25T18:32:44Z

pyopenms_viz/datastructures/MSChromatogram.py

+
+REQUIRED_CHROMATOGRAM_DATAFRAME_COLUMNS = {
+    "mz": "Numeric column representing the mass-to-charge ratio (m/z) of the extracted retention time point.",
+    "time": "Numeric column representing the retention time (in minutes) of the chromatographic peaks.",


in minutes or seconds?

Would prefer seconds, since all OpenMS datastructures use seconds for RT.

jcharkow · 2024-04-25T18:33:39Z

pyopenms_viz/datastructures/MSChromatogram.py

+
+OPTIONAL_METADATA_CHROMATOGRAM_DATAFRAME_COLUMNS = {
+    "sequence": "String column representing the peptide sequence.",
+    "modified_sequence": "String column representing the modified peptide sequence.",


Add in OpenMS format? or I guess it does not matter for plotting how the modifications are represented

I don't think it will matter how the modifications are represented (unimod or codename), unless the df's are passed through some other tool that utilizes the modified sequences. But we could add for clarity I guess

jcharkow · 2024-04-25T18:38:35Z

pyopenms_viz/datastructures/MSChromatogram.py

+    "right_width": "Numeric column representing the width of the peak on the right side of the apex.",
+    "area": "Numeric column representing the area under the peak.",
+    "q_value": "Numeric column representing the q-value of the peak."
+}


possible column additions:

Rank (rank of chromatogram feature based on pyprophet)

Ion Mobility

other column suggestion:

identification (e.g. metabolite annotation for feature)

adduct_annotation

signal_to_noise

feature_id (to match chromatograms back to original features)

jcharkow · 2024-04-25T18:42:03Z

pyopenms_viz/datastructures/MSChromatogram.py

+    "rt_apex": "Numeric column representing the retention time (in minutes) of the peak apex.",
+    "left_width": "Numeric column representing the width of the peak on the left side of the apex.",
+    "right_width": "Numeric column representing the width of the peak on the right side of the apex.",


Would it be useful to represent rt_apex, left_width and right_width as booleans rather than numeric (e.g. 1 if this point is the left width, 0 otherwise)?
Or (might be cleaner) there could be a single column named feature_characteristic (or something similar), and if the point is the left width the value is LW. This would allow for easily adding other characteristics like FWHM. (Possibly we want an array so a single point can have more than one characteristic.)

Any kind of free custom column could be nice. Just in case the user wants to add some annotation.
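A small sketch of the single-column idea, assuming a toy trace with made-up values. Only the `LW` label comes from the suggestion above; `APEX` and `RW` are hypothetical labels added for illustration.

```python
import pandas as pd

# Hypothetical chromatogram trace (time in seconds) with made-up values.
df = pd.DataFrame(
    {
        "time": [10.0, 11.0, 12.0, 13.0, 14.0],
        "intensity": [5.0, 40.0, 100.0, 35.0, 4.0],
    }
)

# Hypothetical feature boundaries and apex for this trace.
left_boundary, rt_apex, right_boundary = 11.0, 12.0, 13.0

# Label the points that carry a feature characteristic; all others stay empty.
labels = {left_boundary: "LW", rt_apex: "APEX", right_boundary: "RW"}
df["feature_characteristic"] = df["time"].map(labels).fillna("")

print(df)
```

A plain string column like this holds one label per point. Supporting several characteristics on the same point would need a list-valued or delimiter-separated column instead.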

jcharkow · 2024-04-25T18:42:25Z

pyopenms_viz/datastructures/MSChromatogram.py

+
+OPTIONAL_FEATURE_CHROMATOGRAM_DATAFRAME_COLUMNS = {
+    "rt_apex": "Numeric column representing the retention time (in minutes) of the peak apex.",
+    "left_width": "Numeric column representing the width of the peak on the left side of the apex.",


I know left_width and right_width is consistent with OpenSwath but might be better to use left_boundary and right_boundary?

Just for clarification: If the whole chromatogram is derived from one feature, will these columns just be filled with the same values in all rows?

FWHM could be a value to add here.

jcharkow · 2024-04-25T18:44:48Z

pyopenms_viz/datastructures/MSSpectrum.py

+    "product_mz": "Numeric column representing the mass-to-charge ratio (m/z) of the product ion.",
+    "product_charge": "Integer column representing the charge state of the product ion.",


What do these mean in the context of a spectrum?

In the context of a spectrum, if you are extracting spectra for a peptide precursor ion with a specific mass, you extract peaks around the targeted m/z with some ppm tolerance. So these optional columns are just a mapping to say that the extracted peak maps to the target fragment/precursor m/z for this specific peptide.

ms_level	mz	rt	int	precursor_mz	product_mz	Annotation
1	642.3342	6225.005	229.0117	642.3295	642.3295	prec
2	504.2620	6225.111	152.0026	642.3295	504.2664	y4^1
2	591.2981	6225.111	273.0037	642.3295	591.2984	y5^1
2	704.3849	6225.111	41.0010	642.3295	704.3825	y6^1

i.e. a peak was identified with an m/z of 642.3342 which maps to the target precursor m/z of 642.3295

Ok makes sense. Would these be useful in plotting though?

It depends, if you are plotting from an aggregated data frame with all target m/z's and want to color the spectra based on which signal is mapping to which target m/z or annotation, then it's useful for that.
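To make the peak-to-target mapping concrete, here is a minimal sketch that assigns an observed m/z to the closest target within a ppm tolerance. The target values come from the example table in this thread; the function names and the 20 ppm tolerance are assumptions for illustration.

```python
from typing import Dict, Optional

# Target m/z values taken from the example table (precursor and fragments).
TARGETS: Dict[str, float] = {
    "prec": 642.3295,
    "y4^1": 504.2664,
    "y5^1": 591.2984,
    "y6^1": 704.3825,
}


def ppm_error(observed_mz: float, target_mz: float) -> float:
    """Signed mass error of the observed m/z relative to the target, in ppm."""
    return (observed_mz - target_mz) / target_mz * 1e6


def match_target(
    observed_mz: float, targets: Dict[str, float], tol_ppm: float
) -> Optional[str]:
    """Return the annotation of the closest target within tol_ppm, else None."""
    best = min(targets, key=lambda name: abs(ppm_error(observed_mz, targets[name])))
    if abs(ppm_error(observed_mz, targets[best])) <= tol_ppm:
        return best
    return None


# The MS1 peak at 642.3342 is about 7.3 ppm away from the precursor target.
print(match_target(642.3342, TARGETS, tol_ppm=20.0))
print(round(ppm_error(642.3342, 642.3295), 2))
```

The returned annotation is what a plot could use as the color or legend key for each peak.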

axelwalter · 2024-04-29T12:02:21Z

> Hmm on first thought I do not know if I am in favor of having plotting methods from DataFrames rather than OpenMS objects. Some reasons why I am against this are:
>
> Dataframes are so customizable, it might be difficult to ensure that a Dataframe in the correct format has been provided to the plotter.
>
> DataFrames as input isolates the package from the OpenMS ecosystem
>
> DataFrames are not a good structure to store metadata in.
>
> Having said that I do think it is good to have DataFrame support possibly conversion functions from DataFrames to OpenMS objects.
>
> @axelwalter @singjc please let me know your thoughts.

First of all thanks for the great work already! I am in favour of DataFrames simply because they are convenient to work with and universal in the Python data science ecosystem.

Some thoughts to your points here:

We should of course specify what the df should look like and leave it to the user to adhere to that. And we can make sure all pyOpenMS data structures export to the given df schema.
I think it's a huge plus if the module is usable for anyone in the mass spec field, regardless of which data processing tool they have used. It would still be OpenMS branded, and using pyOpenMS to export your data will make it usable out of the box.
No strong opinion here, and I am no expert. I guess using efficient formats to store dfs such as parquet will be good enough.

Totally agree with you, we should offer pyOpenMS <--> DataFrames in both directions.

axelwalter · 2024-04-29T12:06:52Z

> In that case we might need to rename MSChromatogram and MSSpectrum to MSChromatogramDf and MSSpectrumDf so they are not confused with the pyopenms types?

The pyOpenMS types are also named similar to that already, such as MSExperimentDF and FeatureMapDF.

https://github.com/OpenMS/OpenMS/blob/d67a9037587db7a20bebce686c278df8da6d9faa/src/pyOpenMS/pyopenms/_dataframes.py#L299

axelwalter · 2024-04-29T12:16:42Z

pyopenms_viz/datastructures/MSChromatogram.py

+    "mz": "Numeric column representing the mass-to-charge ratio (m/z) of the extracted retention time point.",
+    "time": "Numeric column representing the retention time (in minutes) of the chromatographic peaks.",
+    "intensity": "Numeric column representing the intensity (abundance) of the signal at each time point.",
+    "ms_level": "Integer column indicating the MS level (1 for MS1, 2 for MS2, etc.)."


Same as with mz, ms_level should be optional. Often times there is chromatogram data without MS data attached.

axelwalter · 2024-04-29T12:40:45Z

pyopenms_viz/datastructures/MSChromatogram.py

+    "left_width": "Numeric column representing the width of the peak on the left side of the apex.",
+    "right_width": "Numeric column representing the width of the peak on the right side of the apex.",
+    "area": "Numeric column representing the area under the peak.",
+    "q_value": "Numeric column representing the q-value of the peak."


Is q_value a term from OpenSWATH? It could be more general, e.g. just quality.

axelwalter · 2024-04-29T12:48:10Z

pyopenms_viz/datastructures/MSSpectrum.py

+}
+
+OPTIONAL_METADATA_SPECTRUM_DATAFRAME_COLUMNS = {
+    "sequence": "String column representing the peptide sequence.",


What if there are multiple peptides associated with a spectrum? Add all just separated by ";" ?

That's a good question. I usually only deal with targeted data extraction for a single precursor. But if the spectrum is associated with multiple peptide precursors, then I think we can just separate by ";".
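If the ";" convention were adopted, a consumer could split the column back into one row per peptide when needed. This is only a sketch with hypothetical sequences and made-up peak values.

```python
import pandas as pd

# Hypothetical spectrum peaks; the first peak is associated with two peptides.
df = pd.DataFrame(
    {
        "mz": [504.2620, 591.2981],
        "intensity": [152.0026, 273.0037],
        "sequence": ["PEPTIDEK;ELVISK", "PEPTIDEK"],
    }
)

# Split on ";" and expand to one row per (peak, peptide) pair.
long_df = df.assign(sequence=df["sequence"].str.split(";")).explode(
    "sequence", ignore_index=True
)

print(long_df)
```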

Philosophy - only add columns that would be used for plotting.

not sure if this schema is still relevant though

axelwalter · 2024-05-06T07:55:00Z

pyopenms_viz/datastructures/MSChromatogram.py


 OPTIONAL_METADATA_CHROMATOGRAM_DATAFRAME_COLUMNS = {
+    "native_id" : "Chromatogram id, necessary if multiple chromatograms are in the same dataframe."
+    "chromatogram_type": "Type of chromatogram must be one of: MASS_CHROMATOGRAM, TOTAL_ION_CURRENT_CHROMATOGRAM, MASS_CHROMATOGRAM, TOTAL_ION_CURRENT_CHROMATOGRAM, SELECTED_ION_CURRENT_CHROMATOGRAM, BASEPEAK_CHROMATOGRAM, SELECTED_ION_MONITORING_CHROMATOGRAM, SELECTED_REACTION_MONITORING_CHROMATOGRAM, ELECTROMAGNETIC_RADIATION_CHROMATOGRAM, ABSORPTION_CHROMATOGRAM, EMISSION_CHROMATOGRAM, SIZE_OF_CHROMATOGRAM_TYPE"


There are some duplicates:

MASS_CHROMATOGRAM
TOTAL_ION_CURRENT_CHROMATOGRAM

Should SIZE_OF_CHROMATOGRAM_TYPE be in there?

I don't think SIZE_OF_CHROMATOGRAM_TYPE should be there, I removed it for now. @jcharkow were you thinking of having this as a separate column maybe?

> I don't think SIZE_OF_CHROMATOGRAM_TYPE should be there, I removed it for now. @jcharkow were you thinking of having this as a separate column maybe?

I added SIZE_OF_CHROMATOGRAM_TYPE to be consistent with the OpenMS Chromatogram export however I don't know if it is useful in a plotting context so it can likely be removed.
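For reference, a sketch of how the de-duplicated type list could be checked against a DataFrame column. The set below is the list from the diff with the duplicates and SIZE_OF_CHROMATOGRAM_TYPE removed; the constant and helper names are hypothetical.

```python
import pandas as pd

# Chromatogram types from the diff above, de-duplicated and without
# SIZE_OF_CHROMATOGRAM_TYPE.
ALLOWED_CHROMATOGRAM_TYPES = frozenset(
    {
        "MASS_CHROMATOGRAM",
        "TOTAL_ION_CURRENT_CHROMATOGRAM",
        "SELECTED_ION_CURRENT_CHROMATOGRAM",
        "BASEPEAK_CHROMATOGRAM",
        "SELECTED_ION_MONITORING_CHROMATOGRAM",
        "SELECTED_REACTION_MONITORING_CHROMATOGRAM",
        "ELECTROMAGNETIC_RADIATION_CHROMATOGRAM",
        "ABSORPTION_CHROMATOGRAM",
        "EMISSION_CHROMATOGRAM",
    }
)


def invalid_chromatogram_types(df: pd.DataFrame) -> list:
    """Return the sorted chromatogram_type values that are not allowed."""
    present = set(df["chromatogram_type"].dropna())
    return sorted(present - ALLOWED_CHROMATOGRAM_TYPES)


example = pd.DataFrame(
    {"chromatogram_type": ["MASS_CHROMATOGRAM", "SIZE_OF_CHROMATOGRAM_TYPE"]}
)
print(invalid_chromatogram_types(example))
```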

singjc added 2 commits April 24, 2024 21:01

add: schema for chromatogram and spectrum dataframes

7ea089b

minor

7cf8136

singjc requested a review from jcharkow April 25, 2024 01:06

jcharkow reviewed Apr 25, 2024

View reviewed changes

axelwalter reviewed Apr 29, 2024

View reviewed changes

jcharkow added 3 commits May 3, 2024 12:03

add class based implementation of dataframe schema

aa26ebc

add annotation column to spectrum

1831c57

update schema

3391b67

axelwalter reviewed May 6, 2024

View reviewed changes

singjc added 6 commits May 6, 2024 10:57

clean: remove duplicate types

e8a92b5

add: clarify modiication naming convention

7f2e37d

move: native id and ms_level to optional

7f91722

fix: annotation of peak not spectrum

623bc4c

add: optional time and ion mobility to MSSpectrum schema

63239f1

fix: missing commas

ea1dbd7

jcharkow merged commit 50e1d75 into OpenMS:main Jun 6, 2024

