Josh's Suggestions for rawData refactoring #56

Merged
merged 94 commits into from
Jan 4, 2024


jcharkow
Collaborator

@jcharkow jcharkow commented Dec 21, 2023

Here is my attempt at refactoring rawData a bit. The main idea of this refactor is to make the loader interface more consistent between SqMassLoader and the raw data extraction, for easier Python usage.

Note: This has not been tested and is likely very buggy, but I wanted to open the PR early so everyone is up to date.

Major Changes include:

  1. Major refactoring of loaders
    • create a new "access" folder, meant to contain more "low-level" access methods
      - e.g. direct pyopenms methods and SQL queries
    • mzMLDataLoader.py - links an mzML file, a results file and a spectral library to do the heavy lifting of targeted extraction. The main method is loadFeatureMaps(). This is more consistent with how SqMassLoader is implemented. It replaces OSWLoader, DIANNLoader, TargetedExtractionLoader and MzMLLoader.
    • Reporttsv is renamed to ResultsTSVDataAccess. Currently only the DIA-NN TSV is supported; I will work on adding the OSW .tsv.
    • GenericLoader is changed to be the parent of both mzMLDataLoader and SqMassDataLoader (and a future .d loader)
    • TransitionGroupFeature now stores more meta info so that this data structure can be used more widely
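
To make the intended shape concrete, here is a minimal sketch of the hierarchy described above. It is not the code in this PR: the constructor arguments (`rsltsFile`, `dataFiles`, `libraryFile`), the `loadFeatureMaps` signature and the return values are assumptions for illustration, and the method bodies are placeholders.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class GenericLoader(ABC):
    """Hypothetical parent: stores the files that every loader links together."""

    def __init__(self, rsltsFile: str, dataFiles: List[str],
                 libraryFile: Optional[str] = None):
        self.rsltsFile = rsltsFile
        self.dataFiles = dataFiles
        self.libraryFile = libraryFile

    @abstractmethod
    def loadFeatureMaps(self, pep_id: str, charge: int) -> Dict[str, object]:
        """Return one entry per data file for the requested peptide and charge."""


class mzMLDataLoader(GenericLoader):
    def loadFeatureMaps(self, pep_id, charge):
        # Placeholder: the real loader would run targeted extraction on the mzML
        return {f: ("extracted", pep_id, charge) for f in self.dataFiles}


class SqMassDataLoader(GenericLoader):
    def loadFeatureMaps(self, pep_id, charge):
        # Placeholder: the real loader would query the sqMass file
        return {f: ("queried", pep_id, charge) for f in self.dataFiles}
```

With a shared parent, calling code can treat either loader the same way, for example `loader.loadFeatureMaps("PEPTIDEK", 2)`, without caring which file type sits behind it.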

Server methods will have to be adjusted accordingly; I am currently working on this.

given the class functionality, mzMLDataAccess is a more appropriate name
GenericResultsAccess - abstract class outlining methods that should be
implemented for results files; current children are OSWDataAccess and
ResultsTSVDataAccess

ResultsTSVDataAccess - implementation of GenericResultsAccess for
DIA-NN .tsv file
results TSV data access does not load everything into memory, as per
Justin's implementation
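
One way to avoid holding the whole report in memory is to stream it row by row and keep only the rows for the requested precursor. The sketch below is an illustration under assumptions, not the approach taken in this PR: the helper name `iter_precursor_rows` is hypothetical, it uses the standard-library `csv` module, and it assumes the DIA-NN report columns `Modified.Sequence` and `Precursor.Charge`.

```python
import csv
import io


def iter_precursor_rows(handle, sequence, charge):
    """Yield only the report rows matching one peptide sequence and charge.

    `handle` is any open text file; rows are read lazily, one at a time.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if (row["Modified.Sequence"] == sequence
                and int(row["Precursor.Charge"]) == charge):
            yield row


# Tiny in-memory stand-in for a report file (values are made up)
demo = io.StringIO(
    "Modified.Sequence\tPrecursor.Charge\tRT\n"
    "PEPTIDEK\t2\t12.5\n"
    "OTHERK\t3\t40.1\n"
)
hits = list(iter_precursor_rows(demo, "PEPTIDEK", 2))
```

For a real file, the same function would be called with `open(path, newline="")` instead of the `StringIO` stand-in.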
loosely based on reportLoader; similar to SqMassLoader but for
mzML files
these are lower-level loaders, so they belong under access; the
high-level implementation is SpectralLibraryLoader
mzMLLoader links a spectral library, a results file and an mzML file to do
on-the-fly extraction of a given peptide and charge. Note that the peptide
must be found in the experiment to know where the feature is.
@jcharkow jcharkow requested a review from singjc December 21, 2023 16:10
@jcharkow
Collaborator Author

@singjc I know that this is still in progress but can you please have a quick look (even just at the description) to let me know if this refactoring sounds ok? E.g. not going to screw everything up?

refactor for usage with new mzMLDataLoader interface
@singjc
Collaborator

singjc commented Dec 21, 2023

@jcharkow Looks fine / makes sense. I renamed the reportLoader already in the main branch, so there may be some conflict there. I just pulled in the most recent changes from the main feature/rawdata branch, but there are some conflicts with some of the changes I was working on. I will fix those and then probably leave it for you to work on the rest of the refactoring.

I did update the oswDataAccess for the get_top_rank_feature methods to use a feature hash table to only index on indices.

@jcharkow
Collaborator Author

Thanks for looking it over and letting me refactor your code. I'm sorry that it seems we are getting a lot of conflicts but hopefully, this leads to a more unified interface overall :)

jcharkow and others added 27 commits December 31, 2023 16:11
fix crash when the annotation column is not present. The annotation
column is now generated if it is missing or NA
apply the same fix as for TransitionTSVLoader: the TransitionPQPLoader dataframe could not be accessed from the SpectrumLibraryLoader class
rename columns in returned dataframe
Also minor documentation and code linting
implement loadTopTransitionGroupFeatureDf and fix __init__ function
restructured the check for whether the annotation column needs to be
generated. The previous structure would pass the check for the annotation
column being present, but would fail the inner if statement checking for
NULL or NA assignment in the annotation column. This resulted in the stmt
variable never being defined.
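
The control-flow problem described in this commit can be sketched as follows. This is a hypothetical reconstruction, not the actual code: the function names, the `columns` and `has_null_annotation` parameters and the placeholder strings assigned to `stmt` are all assumptions for illustration.

```python
def annotation_stmt_buggy(columns, has_null_annotation):
    # Shape of the old check: stmt is only assigned on some branches
    if "Annotation" in columns:
        if has_null_annotation:
            stmt = "GENERATE_ANNOTATION"
        # column present and fully populated: nothing assigns stmt
    else:
        stmt = "GENERATE_ANNOTATION"
    return stmt  # raises UnboundLocalError on the unassigned path


def annotation_stmt_fixed(columns, has_null_annotation):
    # Decide once whether generation is needed, then always assign stmt
    needs_generation = "Annotation" not in columns or has_null_annotation
    if needs_generation:
        stmt = "GENERATE_ANNOTATION"
    else:
        stmt = "USE_EXISTING_ANNOTATION"
    return stmt
```

Collapsing the nested checks into a single boolean means every path assigns `stmt`, which is the property the restructuring is after.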
SpectralLibraryLoader is meant to be the main class for loading
transition files. Use the individual access class to retrieve the data
based on file type, then return the data as a dataframe and store
that as the `data` attribute in the SpectralLibraryLoader class. This
avoids instances of `data.data.pd.DataFrame` and avoids issues when
caching data in streamlit. On first execution, the `data.data` attribute
contains the DataFrame from the access class. However, after caching,
`data.data` no longer represents the access class, but the actual
retrieved pd.DataFrame.
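
A minimal sketch of that design, under stated assumptions: the two access classes below are hypothetical stand-ins, the extension-to-class mapping is invented for illustration, and a list of dicts stands in for the `pd.DataFrame` so the example has no dependencies.

```python
import os


class TransitionTSVAccess:
    """Hypothetical stand-in for the TSV access class."""

    def __init__(self, path):
        self.path = path

    def load(self):
        # A list of dicts stands in for the pd.DataFrame the real class returns
        return [{"source": "tsv", "path": self.path}]


class TransitionPQPAccess:
    """Hypothetical stand-in for the PQP access class."""

    def __init__(self, path):
        self.path = path

    def load(self):
        return [{"source": "pqp", "path": self.path}]


class SpectralLibraryLoader:
    # Assumed mapping from file extension to access class
    _ACCESS_BY_EXT = {".tsv": TransitionTSVAccess, ".pqp": TransitionPQPAccess}

    def __init__(self, path):
        ext = os.path.splitext(path)[1].lower()
        if ext not in self._ACCESS_BY_EXT:
            raise ValueError(f"Unsupported library type: {ext!r}")
        # Store the retrieved table itself, not the access object, so that
        # `loader.data` is the same kind of object before and after caching
        self.data = self._ACCESS_BY_EXT[ext](path).load()
```

Because `data` is always the table, callers never need to reach through an intermediate object, whether or not the loader has been cached.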
Added caching where necessary, and added checks to see if caches need to
be cleared on new input interaction
Changed time_block to MeasureBlock for performance metrics
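
For readers unfamiliar with the pattern, a block-level performance metric is commonly written as a context manager. The sketch below only illustrates that pattern: it reuses the name `MeasureBlock` from the commit message, but the `label` argument and `elapsed` attribute are assumptions, and the real class in this repository may record different metrics.

```python
import time


class MeasureBlock:
    """Illustrative context manager that times the code inside a `with` block."""

    def __init__(self, label):
        self.label = label
        self.elapsed = None  # seconds, filled in when the block exits

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self._start
        return False  # do not swallow exceptions raised inside the block


with MeasureBlock("sum") as block:
    total = sum(range(1000))
```

After the block exits, `block.elapsed` holds the wall-clock duration in seconds.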
add show() methods
add context property to config
m/z heatmap with fragments split up had wrong axis, fix this
allow for 2 columns on multiplots
@singjc singjc merged commit c1ca78e into feature/rawdata Jan 4, 2024
@singjc singjc deleted the patch/josh_rawdata_2 branch January 7, 2024 05:58

2 participants