
Add/datastructures #1

Merged
jcharkow merged 11 commits into OpenMS:main from singjc:add/datastructures
Jun 6, 2024

Conversation

@singjc
Collaborator

@singjc singjc commented Apr 25, 2024

Added data structure schema for chromatogram and spectrum data frames.

@singjc singjc requested a review from jcharkow April 25, 2024 01:06
@singjc
Collaborator Author

singjc commented Apr 25, 2024

@axelwalter this is the general data frame schema we use in MassDash. We generally deal with a group of traces for a specific precursor, so we have additional meta-data columns that differentiate which traces come from which ions (precursor or fragments). I'm not sure if some of this is stored in the MSChromatogram in the meta value interface, so I've split up the required columns and additional optional meta-data and feature columns.

I am working on adapting our current bokeh plotting to take data frame input in another branch. I will upload a test tsv file of the chromatogram data frame. You could use this during the implementation of the to_df methods in pyOpenMS.

@jcharkow
Collaborator

Originally I was thinking we should just be using the pyOpenMS MSChromatogram and MSSpectrum objects instead of creating new objects. In that case, I'm not sure we would need these data structures.

@singjc
Collaborator Author

singjc commented Apr 25, 2024

> Originally I was thinking we should just be using the pyOpenMS MSChromatogram and MSSpectrum object instead of creating new objects. In that case I'm not sure if we would need these data structures

I don't think we would be creating new objects, just adding conversion methods to the MSChromatogram and MSSpectrum objects to convert the structured data into a dataframe (similar to MSExperiment). We discussed yesterday that the core plotting should take a dataframe as input, so that it can be more usable if someone already has a dataframe derived from something other than an MSChromatogram object.

@jcharkow
Collaborator

Hmm on first thought I do not know if I am in favor of having plotting methods from DataFrames rather than OpenMS objects. Some reasons why I am against this are:

  1. Dataframes are so customizable, it might be difficult to ensure that a Dataframe in the correct format has been provided to the plotter.
  2. DataFrames as input isolates the package from the OpenMS ecosystem
  3. DataFrames are not a good structure to store metadata in.

Having said that, I do think it is good to have DataFrame support, possibly as conversion functions from DataFrames to OpenMS objects.

@axelwalter @singjc please let me know your thoughts.

@singjc
Collaborator Author

singjc commented Apr 25, 2024

I don't know, I am partially more in favour of DataFrames rather than OpenMS objects, but this might be more of a preference since I prefer the grammar of graphics way of plotting.

> Dataframes are so customizable, it might be difficult to ensure that a Dataframe in the correct format has been provided to the plotter.

This may be true, but I don't think it's a major issue. I think the main concern with the flexibility of DataFrames is the column names and the data types. But I guess this is why we define a schema: to ensure that at least the required columns are present.
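A minimal sketch of such a schema check, working on column names only so it can be fed `df.columns` from any DataFrame. The function name and the `REQUIRED` tuple are hypothetical; the actual required set was still under review in this PR.

```python
from typing import Iterable, List


def missing_required_columns(columns: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent, sorted alphabetically."""
    present = set(columns)
    return sorted(name for name in required if name not in present)


# Hypothetical required set, chosen only for illustration.
REQUIRED = ("time", "intensity")

# With a DataFrame this would be called as missing_required_columns(df.columns, REQUIRED).
missing = missing_required_columns(["time", "mz", "ms_level"], REQUIRED)
if missing:
    print("DataFrame is missing required columns:", missing)
```

A plotter could run this before drawing anything and raise a clear error instead of failing later on a missing key.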

> DataFrames as input isolates the package from the OpenMS ecosystem

I guess if we are strictly speaking about this visualization package, then I can see how it would feel isolated from the main OpenMS ecosystem (similar to pyprophet, DIAlignR).

> DataFrames are not a good structure to store metadata in.

I'm not sure if this is entirely true. DataFrames allow you to store multiple data types in a single DataFrame. I think the concern would be how much metadata you are planning to store; past a certain amount, it could become an issue.

I do agree with having conversion methods for going from DataFrames to OpenMS objects, and vice-versa though. I think this would be useful for using pyOpenMS for more exploratory, data wrangling and algorithmic development stuff.

@jcharkow
Collaborator

jcharkow commented Apr 25, 2024

@singjc You make a good point that DataFrames might be a more intuitive structure to plot from than the pyopenms types, so let's stick with the DataFrames. Furthermore, with the conversion functions it should be pretty easy to use either as input.

In that case we might need to rename MSChromatogram and MSSpectrum to MSChromatogramDf and MSSpectrumDf so they are not confused with the pyopenms types? Also, is the idea that multiple chromatograms will be in a single dataframe, or just a single chromatogram per dataframe? That affects my comments on the schema.

@jcharkow
Collaborator

Having multiple chromatograms/spectra might make it more confusing but it allows for groupby functions which likely makes plotting easier?

@singjc
Collaborator Author

singjc commented Apr 25, 2024

Yes, I was thinking that the DataFrame can contain data for more than one ion, so multiple chromatogram ion traces per fragment and precursor. This is similar to how we represent the extracted featuremap as a dataframe in MassDash.

So something like this: test/test_data/ionMobilityTestChromatogramDf.tsv

| native_id | ms_level | mz | rt | int | precursor_mz | product_mz | Annotation |
|---|---|---|---|---|---|---|---|
| | 1 | 642.3342 | 6225.005 | 229.0117 | 642.3295 | 642.3295 | prec |
| | 2 | 504.2620 | 6225.111 | 152.0026 | 642.3295 | 504.2664 | y4^1 |
| | 2 | 591.2981 | 6225.111 | 273.0037 | 642.3295 | 591.2984 | y5^1 |
| | 2 | 704.3849 | 6225.111 | 41.0010 | 642.3295 | 704.3825 | y6^1 |

If the metadata columns (precursor_mz, product_mz and Annotation) are not in the DataFrame, then we have to assume all of the data belongs to a single target ion, unless we try to infer it from the mz column.

The groupby makes plotting subsets of the data easier, so we don't have to extract the arrays per individual structure, which is currently also how we do it in MassDash.
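As a rough illustration of the grouping idea, here is a standard-library sketch that splits the sample rows above into one trace per Annotation. With pandas the same split would come from `df.groupby("Annotation")`. The helper name `group_traces` is hypothetical, and only the seven columns that carry values in the sample are used.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the sample above (tab-separated).
TSV = """ms_level\tmz\trt\tint\tprecursor_mz\tproduct_mz\tAnnotation
1\t642.3342\t6225.005\t229.0117\t642.3295\t642.3295\tprec
2\t504.2620\t6225.111\t152.0026\t642.3295\t504.2664\ty4^1
2\t591.2981\t6225.111\t273.0037\t642.3295\t591.2984\ty5^1
2\t704.3849\t6225.111\t41.0010\t642.3295\t704.3825\ty6^1
"""


def group_traces(tsv_text, key="Annotation"):
    """Collect (rt, int) points into one trace per distinct value of `key`."""
    traces = defaultdict(lambda: {"rt": [], "int": []})
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        trace = traces[row[key]]
        trace["rt"].append(float(row["rt"]))
        trace["int"].append(float(row["int"]))
    return dict(traces)


traces = group_traces(TSV)
for annotation, trace in traces.items():
    # Each trace would become one line in the chromatogram plot.
    print(annotation, trace["rt"], trace["int"])
```

Grouping by precursor_mz instead would collect all fragment and precursor traces of one peptide together.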

I have a branch (PR) here that I'm working on for the plotting of Chromatograms using DataFrames as the input

Collaborator

@jcharkow jcharkow left a comment


Looks good. Assuming here that FEATURE_CHROMATOGRAM columns will be present in the chromatogram table. Some suggestions for changing schema.

"""

REQUIRED_CHROMATOGRAM_DATAFRAME_COLUMNS = {
"mz": "Numeric column representing the mass-to-charge ratio (m/z) of the extracted retention time point.",
Collaborator

Do not know if mz is required? Maybe put in optional?

Collaborator

Agree, should be optional. The bare minimum should be time and intensity for chromatograms.

Collaborator

Even though it's called MSChromatogram, we use that to store any kind of chromatogram data, not necessarily from MS data.


REQUIRED_CHROMATOGRAM_DATAFRAME_COLUMNS = {
"mz": "Numeric column representing the mass-to-charge ratio (m/z) of the extracted retention time point.",
"time": "Numeric column representing the retention time (in minutes) of the chromatographic peaks.",
Collaborator

in minutes or seconds?

Collaborator

Would prefer seconds, since all OpenMS datastructures use seconds for RT.


OPTIONAL_METADATA_CHROMATOGRAM_DATAFRAME_COLUMNS = {
"sequence": "String column representing the peptide sequence.",
"modified_sequence": "String column representing the modified peptide sequence.",
Collaborator

Add in OpenMS format? or I guess it does not matter for plotting how the modifications are represented

Collaborator Author

I don't think it will matter how the modifications are represented (unimod or codename), unless the df's are passed through some other tool that utilizes the modified sequences. But we could add for clarity I guess

"right_width": "Numeric column representing the width of the peak on the right side of the apex.",
"area": "Numeric column representing the area under the peak.",
"q_value": "Numeric column representing the q-value of the peak."
}
Collaborator

possible column additions:

  1. Rank (rank of chromatogram feature based on pyprophet)
  2. Ion Mobility

Collaborator

@axelwalter axelwalter Apr 29, 2024

other column suggestion:

  • identification (e.g. metabolite annotation for feature)
  • adduct_annotation
  • signal_to_noise
  • feature_id (to match chromatograms back to original features)

Comment on lines +23 to +25
"rt_apex": "Numeric column representing the retention time (in minutes) of the peak apex.",
"left_width": "Numeric column representing the width of the peak on the left side of the apex.",
"right_width": "Numeric column representing the width of the peak on the right side of the apex.",
Collaborator

Would it be useful to represent rt_apex, left_width and right_width as booleans rather than numeric (e.g. 1 if this point is the left width, 0 otherwise)?
Or (might be cleaner) there is a single column named feature_characteristic (or something similar) and if the point is the left width the value is LW, this would allow for easily adding other characteristics like FWHM. (Possibly want an array so a single point can have more than one characteristic)
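A sketch of what such a feature_characteristic column could hold, assuming each boundary is assigned to the nearest time point. Only the `LW` tag comes from the comment above; the `APEX` and `RW` tags, the function name and the time values are hypothetical.

```python
from typing import List, Sequence


def label_feature_points(times: Sequence[float], left: float, apex: float, right: float) -> List[List[str]]:
    """Return one list of characteristic tags per time point."""

    def nearest(target: float) -> int:
        # Index of the time point closest to the target time.
        return min(range(len(times)), key=lambda i: abs(times[i] - target))

    labels: List[List[str]] = [[] for _ in times]
    for tag, target in (("LW", left), ("APEX", apex), ("RW", right)):
        labels[nearest(target)].append(tag)
    return labels


# Made-up time points, for illustration only.
times = [10.0, 11.0, 12.0, 13.0, 14.0]
labels = label_feature_points(times, left=11.0, apex=12.0, right=13.0)
print(labels)
```

Keeping a list per point is what lets a single point carry more than one characteristic, for example when the apex coincides with a boundary.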

Collaborator

Any kind of free custom column could be nice. Just in case the user wants to add some annotation.


OPTIONAL_FEATURE_CHROMATOGRAM_DATAFRAME_COLUMNS = {
"rt_apex": "Numeric column representing the retention time (in minutes) of the peak apex.",
"left_width": "Numeric column representing the width of the peak on the left side of the apex.",
Collaborator

I know left_width and right_width is consistent with OpenSwath but might be better to use left_boundary and right_boundary?

Collaborator

Just for clarification: If the whole chromatogram is derived from one feature, will these columns just be filled with the same values in all rows?

FWHM could be a value to add here.

Comment on lines +17 to +18
"product_mz": "Numeric column representing the mass-to-charge ratio (m/z) of the product ion.",
"product_charge": "Integer column representing the charge state of the product ion.",
Collaborator

What do these mean in the context of a spectrum?

Collaborator Author

In the context of a spectrum, if you are extracting spectra for a peptide precursor ion with a specific mass, you extract peaks around the targeted m/z with some ppm tolerance. So these optional columns are just a mapping to say that the extracted peak maps to the target fragment/precursor m/z for this specific peptide.

| ms_level | mz | rt | int | precursor_mz | product_mz | Annotation |
|---|---|---|---|---|---|---|
| 1 | 642.3342 | 6225.005 | 229.0117 | 642.3295 | 642.3295 | prec |
| 2 | 504.2620 | 6225.111 | 152.0026 | 642.3295 | 504.2664 | y4^1 |
| 2 | 591.2981 | 6225.111 | 273.0037 | 642.3295 | 591.2984 | y5^1 |
| 2 | 704.3849 | 6225.111 | 41.0010 | 642.3295 | 704.3825 | y6^1 |

i.e. a peak was identified with an m/z of 642.3342 which maps to the target precursor m/z of 642.3295
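The mapping can be sketched as a ppm-error lookup against the list of target m/z values. The 10 ppm tolerance and both function names are assumptions for illustration; the m/z values are the ones from the table above.

```python
from typing import Optional, Sequence


def ppm_error(observed_mz: float, target_mz: float) -> float:
    """Signed mass error of the observed peak relative to the target, in ppm."""
    return (observed_mz - target_mz) / target_mz * 1e6


def match_target(observed_mz: float, targets: Sequence[float], tol_ppm: float = 10.0) -> Optional[float]:
    """Return the closest target m/z if it lies within tol_ppm, else None."""
    best = min(targets, key=lambda t: abs(ppm_error(observed_mz, t)))
    return best if abs(ppm_error(observed_mz, best)) <= tol_ppm else None


TARGETS = [642.3295, 504.2664, 591.2984, 704.3825]

for mz in (642.3342, 504.2620, 591.2981, 704.3849):
    target = match_target(mz, TARGETS)
    if target is not None:
        print(f"{mz} -> {target} ({ppm_error(mz, target):+.2f} ppm)")
```

The matched target is what would populate precursor_mz or product_mz for each extracted peak.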

Collaborator

Ok makes sense. Would these be useful in plotting though?

Collaborator Author

It depends, if you are plotting from an aggregated data frame with all target m/z's and want to color the spectra based on which signal is mapping to which target m/z or annotation, then it's useful for that.

@axelwalter
Collaborator

> Hmm on first thought I do not know if I am in favor of having plotting methods from DataFrames rather than OpenMS objects. Some reasons why I am against this are:
>
>   1. Dataframes are so customizable, it might be difficult to ensure that a Dataframe in the correct format has been provided to the plotter.
>   2. DataFrames as input isolates the package from the OpenMS ecosystem
>   3. DataFrames are not a good structure to store metadata in.
>
> Having said that I do think it is good to have DataFrame support possibly conversion functions from DataFrames to OpenMS objects.
>
> @axelwalter @singjc please let me know your thoughts.

First of all thanks for the great work already! I am in favour of DataFrames simply because they are convenient to work with and universal in the Python data science ecosystem.

Some thoughts to your points here:

  1. We should of course specify what the df should look like and leave it to the user to adhere to that. And we can make sure all pyOpenMS data structures export to the given df schema.

  2. I think it's a huge plus if the module is usable for anyone in the mass spec field, regardless of which data processing tool they have used. It would still be OpenMS branded, and using pyOpenMS to export your data will make it usable out of the box.

  3. No strong opinion here and no expert. I guess using efficient formats to store dfs such as parquet will be good enough.

Totally agree with you, we should offer pyOpenMS <--> DataFrames in both directions.

@axelwalter
Collaborator

> In that case we might need to rename MSChromatogram and MSSpectrum to MSChromatogramDf and MSSpectrumDf so they are not confused with the pyopenms types?

The pyOpenMS types are already named similarly, such as MSExperimentDF and FeatureMapDF.

https://github.com/OpenMS/OpenMS/blob/d67a9037587db7a20bebce686c278df8da6d9faa/src/pyOpenMS/pyopenms/_dataframes.py#L299

"mz": "Numeric column representing the mass-to-charge ratio (m/z) of the extracted retention time point.",
"time": "Numeric column representing the retention time (in minutes) of the chromatographic peaks.",
"intensity": "Numeric column representing the intensity (abundance) of the signal at each time point.",
"ms_level": "Integer column indicating the MS level (1 for MS1, 2 for MS2, etc.)."
Collaborator

Same as with mz, ms_level should be optional. Often times there is chromatogram data without MS data attached.

"left_width": "Numeric column representing the width of the peak on the left side of the apex.",
"right_width": "Numeric column representing the width of the peak on the right side of the apex.",
"area": "Numeric column representing the area under the peak.",
"q_value": "Numeric column representing the q-value of the peak."
Collaborator

Is q_value a term from OpenSWATH? It could be more general, e.g. just quality.

}

OPTIONAL_METADATA_SPECTRUM_DATAFRAME_COLUMNS = {
"sequence": "String column representing the peptide sequence.",
Collaborator

What if there are multiple peptides associated with a spectrum? Add all just separated by ";" ?

Collaborator Author

That's a good question. I usually only deal with targeted data extraction for a single precursor. But if the spectrum is associated with multiple peptide precursors, then I think we can just separate by ";".

jcharkow added 3 commits May 3, 2024 12:03
Philosophy - only add columns that would be used for plotting.
not sure if this schema is still relevant though

OPTIONAL_METADATA_CHROMATOGRAM_DATAFRAME_COLUMNS = {
"native_id": "Chromatogram id, necessary if multiple chromatograms are in the same dataframe.",
"chromatogram_type": "Type of chromatogram must be one of: MASS_CHROMATOGRAM, TOTAL_ION_CURRENT_CHROMATOGRAM, MASS_CHROMATOGRAM, TOTAL_ION_CURRENT_CHROMATOGRAM, SELECTED_ION_CURRENT_CHROMATOGRAM, BASEPEAK_CHROMATOGRAM, SELECTED_ION_MONITORING_CHROMATOGRAM, SELECTED_REACTION_MONITORING_CHROMATOGRAM, ELECTROMAGNETIC_RADIATION_CHROMATOGRAM, ABSORPTION_CHROMATOGRAM, EMISSION_CHROMATOGRAM, SIZE_OF_CHROMATOGRAM_TYPE"
Collaborator

There are some duplicates:

MASS_CHROMATOGRAM
TOTAL_ION_CURRENT_CHROMATOGRAM

Should SIZE_OF_CHROMATOGRAM_TYPE be in there?

Collaborator Author

I don't think SIZE_OF_CHROMATOGRAM_TYPE should be there, I removed it for now. @jcharkow were you thinking of having this as a separate column maybe?

Collaborator

> I don't think SIZE_OF_CHROMATOGRAM_TYPE should be there, I removed it for now. @jcharkow were you thinking of having this as a separate column maybe?

I added SIZE_OF_CHROMATOGRAM_TYPE to be consistent with the OpenMS Chromatogram export; however, I don't know if it is useful in a plotting context, so it can likely be removed.
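For reference, a sketch of how the chromatogram_type column could be checked once the duplicates and SIZE_OF_CHROMATOGRAM_TYPE are dropped. The type names are taken from the schema string under review; the constant and function names are hypothetical.

```python
from typing import Iterable, List

# Type names from the schema under review, de-duplicated and without the
# SIZE_OF_CHROMATOGRAM_TYPE sentinel discussed above.
ALLOWED_CHROMATOGRAM_TYPES = frozenset({
    "MASS_CHROMATOGRAM",
    "TOTAL_ION_CURRENT_CHROMATOGRAM",
    "SELECTED_ION_CURRENT_CHROMATOGRAM",
    "BASEPEAK_CHROMATOGRAM",
    "SELECTED_ION_MONITORING_CHROMATOGRAM",
    "SELECTED_REACTION_MONITORING_CHROMATOGRAM",
    "ELECTROMAGNETIC_RADIATION_CHROMATOGRAM",
    "ABSORPTION_CHROMATOGRAM",
    "EMISSION_CHROMATOGRAM",
})


def invalid_chromatogram_types(values: Iterable[str]) -> List[str]:
    """Return the distinct values that are not allowed chromatogram types."""
    return sorted(set(values) - ALLOWED_CHROMATOGRAM_TYPES)


# With a DataFrame this would be called on df["chromatogram_type"].
print(invalid_chromatogram_types(["MASS_CHROMATOGRAM", "SIZE_OF_CHROMATOGRAM_TYPE"]))
```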

@jcharkow jcharkow merged commit 50e1d75 into OpenMS:main Jun 6, 2024

3 participants