Study Submission Process Refactor #753

hepcat72 · 2023-09-25T22:38:34Z

formerly... "Automatic stub creation"

This issue has been converted to an issue-tracking issue. Requirements and design may be modified (and this issue's contents stale), so refer to the issues linked under dependencies for the latest design/requirements.

FEATURE REQUEST

Inspiration

I noted that issue #705 was similar to my submission process proposal, so it inspired me to codify my proposal in an issue.

Description

Based on my submission process proposal from March (compiled and annotated in my full proposal), I think we should use the version 3.0 effort as an opportunity to streamline the sheets in the Excel file, as laid out in that process proposal:

New "Study" (e.g. Excel) doc

Study Sheet (all manually filled in - applies to all animals/samples in the submission)
- new: Study ID (can use as prefix)
- add: Name
- add: Description
Animal sheet (all manually filled in - all the usual columns except study cols - and the infusate will be a dropdown based on the infusate sheet contents)
- new: Study ID
- remove: Study Name (moved to Study sheet)
- remove: Study Description (moved to Study sheet)
- remove: Tracer Concentrations (moved to Tracers sheet)
Sample Sheet (partially pre-populated by the validation interface)
- pre-populated: sample name (will be pre-populated using the accucor/isocorr column headers, but with suffixes [e.g. "_pos"] removed)
- new: skip (boolean)
Infusate sheet (fully pre-populated automatically by function using the tracer sheet)
- Infusate Number (only for temporary use to associate sheets)
- Tracer Group Name (e.g. "eaas". May repeat if combined in different concentrations and/or labels)
- name (readonly, populated by function using the tracers sheet's contents)
Tracer sheet (all manually filled in) See discussion. The basic idea is to un-encode the data. Columns repeat, except the last 4 (and if they repeat, they are associated with different data).
- Tracer Number (only for temporary use to associate tracer rows at specific concentrations)
- Infusate Number (only for temporary use to associate sheets)
- tracer concentration
- compound name
- element
- mass_number
- label_count
- positions (optional)
Compound sheet (mostly pre-populated by the validation interface using the accucor/isocorr data, with user-required input for new compounds). Same contents as compounds.tsv.
Tissue sheet (mostly pre-populated by the validation interface using the database, with user-required input for new tissues)
Peak Annotation Files sheet (similar to the LCMS metadata file)
- peak annotation filename (pre-populated by the validation interface where possible)
- peak annotation filetype dropdown (accucor or isocorr) (pre-populated by the validation interface where possible)
- Sample Name Prefix
Peak Annotation Details sheet (similar to the LCMS metadata file)
- Sample Name (pre-populated by the validation interface where possible)
- Sample Data Header (pre-populated by the validation interface where possible)
- mzXML filename (pre-populated by the validation interface where possible)
- peak annotation filename (pre-populated by the validation interface where possible)
- Polarity
- Sequence Number (only for temporary use to associate sheets)
Sequences sheet
- Sequence Number (only for temporary use to associate sheets)
- operator
- date
- instrument
- LC method
- LC Run Length
- LC Description
- Notes
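To illustrate how the Tracer sheet's "un-encoded" rows could be reassembled into infusates, here is a minimal sketch. It assumes that rows sharing an Infusate Number belong to one infusate and that rows sharing a Tracer Number are labels on the same tracer. The class, the function names, and the name format are all hypothetical and are not the project's actual naming code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TracerRow:
    """One row of the proposed Tracer sheet (one label on one tracer)."""

    tracer_number: int      # temporary key: groups label rows into a tracer
    infusate_number: int    # temporary key: links the row to the Infusate sheet
    concentration: float
    compound_name: str
    element: str
    mass_number: int
    label_count: int
    positions: Optional[str] = None  # optional column


def group_by_infusate(rows):
    """Return {infusate_number: {tracer_number: [TracerRow, ...]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row.infusate_number][row.tracer_number].append(row)
    return {inf: dict(tracers) for inf, tracers in grouped.items()}


def display_name(tracers, group_name=None):
    """Build an illustrative infusate name (the format is an assumption)."""
    parts = []
    for _, labels in sorted(tracers.items()):
        label_str = ",".join(
            f"{r.mass_number}{r.element}{r.label_count}" for r in labels
        )
        first = labels[0]
        parts.append(
            f"{first.compound_name}-[{label_str}][{first.concentration:g}]"
        )
    body = ";".join(parts)
    return f"{group_name} {{{body}}}" if group_name else body


rows = [
    TracerRow(1, 1, 2.0, "leucine", "C", 13, 6),
    TracerRow(1, 1, 2.0, "leucine", "N", 15, 1),
    TracerRow(2, 1, 1.5, "valine", "C", 13, 5),
]
infusates = group_by_infusate(rows)
print(display_name(infusates[1], "eaas"))
```

Here the first four fields repeat across the two leucine rows and only the label fields differ, which matches the "columns repeat, except the last 4" idea.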

Much of the above will be pre-populated (see discussion) by the validation interface. The only input the validation interface will require is:

Accucor/isocorr files

Optional additional inputs for validation:

Study Doc
mzXML files (they will not be uploaded - their names will be compiled and submitted)

The process will go like this:

1. User submits accucor/isocorr & mzXML files and gets back errors and a "Study" Excel spreadsheet
2. User can fix errors and go back to step 1
3. Once there are no errors, the user downloads the Excel file, fills in the missing data, and submits again (this time also including the "Study" Excel file), getting back new errors associated with the correctness/completeness of the Study Excel file
4. User can fix errors and go back to step 3
5. Proceed to submission
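One way to read the process above is that the validation interface picks its behavior from which inputs were supplied. The sketch below shows that idea only. `Mode`, `determine_mode`, and `mzxml_names` are hypothetical names, not existing functions in the code base.

```python
from enum import Enum
from pathlib import Path


class Mode(Enum):
    GENERATE_STUDY_DOC = "generate"  # steps 1-2: return a pre-populated study doc
    VALIDATE_STUDY_DOC = "validate"  # steps 3-4: check the filled-in study doc


def determine_mode(peak_annotation_files, study_doc=None):
    """Choose the validation mode from the inputs that were provided."""
    if not peak_annotation_files:
        raise ValueError("At least one accucor/isocorr file is required")
    return Mode.VALIDATE_STUDY_DOC if study_doc else Mode.GENERATE_STUDY_DOC


def mzxml_names(mzxml_paths):
    """Compile mzXML file names only; the files themselves are not uploaded."""
    return sorted(Path(p).name for p in mzxml_paths)


print(determine_mode(["accucor1.xlsx"]))
print(determine_mode(["accucor1.xlsx"], study_doc="study.xlsx"))
print(mzxml_names(["runs/b_neg.mzXML", "runs/a_pos.mzXML"]))
```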

Alternatives

None

Dependencies

This is an issue-tracking issue for the following issues:

Comment

The LCMS Metadata tab should probably be broken up so that sequence notes and LC descriptions can be provided efficiently.
Validating an entire study may not be possible in a live back and forth, because the web browser would time out. Either validation would need to run on a handful of files (or 1) at a time, a celery progress bar would have to be used, or the user would have to be emailed a (link to a) report.
The process should be verbose about warnings WRT enforcing sample uniqueness and try to catch when samples appear to have different names, but come from the same sample (e.g. warn about sample1_pos being the same as sample1).
Data that needs to be manually entered should be highlighted in some way.
It may or may not be too cumbersome to actually upload the mzXML files. If we did, we could parse and use the data in the files.
Many sample duplicates already exist in the database under different "sample names". A heuristic to discern the actual sample name (e.g. by removing suffixes like "_pos") will not remove the need for the user to validate all of the sample names. The process should issue warnings when sample names look too much alike (e.g. one sample name contains another).
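The sample-name heuristic and the "looks too much alike" warning described above could be sketched roughly as follows. Only the "_pos" suffix comes from this proposal. The "_neg" suffix, the function names, and the warning text are assumptions for illustration.

```python
import re
from itertools import combinations

# Assumed suffix list: "_pos" is from the proposal, "_neg" is a guess.
SUFFIX_RE = re.compile(r"(_pos|_neg)+$", re.IGNORECASE)


def guess_sample_name(header):
    """Strip known suffixes from an accucor/isocorr sample column header."""
    return SUFFIX_RE.sub("", header)


def similar_name_warnings(names):
    """Warn about every pair of names where one name contains the other."""
    warnings = []
    for a, b in combinations(sorted(set(names)), 2):
        shorter, longer = sorted((a, b), key=len)
        if shorter in longer:
            warnings.append(
                f"'{longer}' contains '{shorter}': possibly the same sample"
            )
    return warnings


print(guess_sample_name("sample1_pos"))
for warning in similar_name_warnings(["sample1", "sample1_pos", "sample2"]):
    print(warning)
```

The containment check is deliberately noisy (for example, "sample1" is also contained in "sample10"), which fits the idea that these are warnings for the user to review rather than errors.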

I created an example version of the Study Excel doc:

animal_sample_table.xlsx

ISSUE OWNER SECTION

Assumptions

None

Limitations

None

Affected Components

change: accucor_data_loader.py
change: load_study.py
change: validation.py

Requirements

NOTE: These requirements are NOT final/complete. They were originally drafted in this issue, but then regrouped and fleshed out in the issues that were created to break up this issue (see the Dependencies section).

DESIGN

Interface Change description

The validation page will
- Additionally take a zip file of all mzXML files.
- Take an optional email address (to email results - otherwise, it will attempt to return processed results)
- Allow peak annotation files only
- Generate and auto-populate a study doc
Every loading script will
- Take either tsv or xlsx files for every tab of the study doc
- No longer take yaml files
The animal sample table loader will
- only load the Study, Animals, Samples, Infusates, Tracers, and Sequences
The accucor loader will require the Study doc (and use the tabs: Peak Annotation Files, Peak Annotation Details, Sequences, and Defaults) instead of the LCMS metadata file
The Compounds loader will load novel compounds, update the consolidated compounds file, and output a new Study doc with the compounds tab removed
The tissues loader will load novel tissues, update the consolidated tissues file, and output a new Study doc with the tissues tab removed
The treatments loader will load novel treatments, update the consolidated treatments file, and output a new Study doc with the treatments tab removed

Code Change Description

The validation page
- will determine the mode based on what's provided
- If no study doc is provided, it generates a study doc with pre-populated fields. If a field is required but cannot be pre-populated, a placeholder value will be entered. Missing and placeholder values will be highlighted to show that manual entry is required.
The existing excel parsing code will change the way it accesses tabs from index to name
The accucor data loader will
- Require the study doc to be supplied and...
  - Combine the Peak Annotation Details and Sequences tabs and re-use the LCMS Metadata code to process it
  - Use the Defaults tab's values instead of the command line options

I intend to perform the work in phases (with separate PRs):

Requirement 1.
Requirement 2.
Requirement 9., 7.1., and 8.1.
Requirements 8.6. and 8.7. (adding those tabs and implementing all sub-items, and leaving the other sheets alone)
Requirement 5.1.1., 5.2, and 8.8.
Requirements 8.11. and 8.12.
Requirement 8.10.
Requirements 8.9. (Peak Annotation Files Tab), 8.10. (Peak Annotation Details Tab), and 4. (accucor loader)
Requirement 5.1.2., 5.2, and 8.5.
Requirement 5.1.3., 5.2, and 8.4.
Requirements 7.2., 7.3., 8.2., and 8.3.
Requirement 3.

Tests

A test for each requirement


hepcat72 · 2024-08-11T18:39:31Z

All items completed, changed, or transferred to separate issues.

hepcat72 added the type:feature New feature or request label Sep 25, 2023

lparsons added this to the Streamline Submission Process milestone Sep 26, 2023

hepcat72 mentioned this issue Sep 26, 2023

Defer rollback for protocols, tissues, and compounds #758

Merged

8 tasks

hepcat72 self-assigned this Sep 29, 2023

hepcat72 mentioned this issue Nov 30, 2023

Modify the data submission form for LCMethod #759

Merged

8 tasks

hepcat72 changed the title ~~Automatic stub creation of data submission~~ Study Submission Process refactor (aka, Automatic stub creation) Dec 28, 2023

hepcat72 changed the title ~~Study Submission Process refactor (aka, Automatic stub creation)~~ Study Submission Process Refactor Dec 29, 2023

This was referenced Dec 30, 2023

Make every table-based loader take either be a tab-delimited or excel file #820

Closed

Study Data Loader #821

Closed

Tracer and infusates loader #822

Closed

Modify Compounds Loader #823

Closed

Sequences Loader #824

Closed

hepcat72 added the issue-tracking Meta-issue to coordinate/organize other issues label Dec 30, 2023

hepcat72 mentioned this issue Jan 9, 2024

Study table loader #831

Merged

8 tasks

This was referenced Jan 18, 2024

Added tests for loading compounds via excel spreadsheet #838

Merged

Modify load_study.py to account for the new excel sheets #839

Closed

hepcat72 mentioned this issue Jan 26, 2024

Load Consistency Refactor #846

Merged

8 tasks

hepcat72 pinned this issue Jan 26, 2024

hepcat72 mentioned this issue Feb 7, 2024

Load consistency refactor 2 (adding the protocols loader) #852

Merged

8 tasks

hepcat72 mentioned this issue Feb 22, 2024

Sequence load 1d (Add SequencesLoader class) #882

Merged

8 tasks

This was referenced Mar 14, 2024

TracersLoader #900

Merged

AnimalsLoader #916

Closed

hepcat72 mentioned this issue Apr 3, 2024

Remove errors #934

Merged

8 tasks

hepcat72 closed this as completed Aug 11, 2024

hepcat72 unpinned this issue Aug 11, 2024
