Study Submission Process Refactor #753

Closed
hepcat72 opened this issue Sep 25, 2023 · 1 comment
Labels: issue-tracking (Meta-issue to coordinate/organize other issues), type:feature (New feature or request)

Comments

@hepcat72
Collaborator

hepcat72 commented Sep 25, 2023

formerly... "Automatic stub creation"

This issue has been converted to an issue-tracking issue. Requirements and design may be modified (and this issue's contents stale), so refer to the issues linked under dependencies for the latest design/requirements.

FEATURE REQUEST

Inspiration

I noted that issue #705 was similar to my submission process proposal, so it inspired me to codify my proposal in an issue.

Description

Building on my submission process proposal from March, which is compiled and annotated in my full proposal, I think we should use the version 3.0 effort as an opportunity to streamline the sheets in the Excel file as laid out in that (process) proposal:

New "Study" (e.g. Excel) doc

  • Study Sheet (all manually filled in - applies to all animals/samples in the submission)
    • new: Study ID (can use as prefix)
    • add: Name
    • add: Description
  • Animal sheet (all manually filled in - all the usual columns except study cols - and the infusate will be a dropdown based on the infusate sheet contents)
    • new: Study ID
    • remove: Study Name (moved to Study sheet)
    • remove: Study Description (moved to Study sheet)
    • remove: Tracer Concentrations (moved to Tracers sheet)
  • Sample Sheet (partially pre-populated by the validation interface)
    • pre-populated: sample name (will be pre-populated using the accucor/isocorr column headers, but with suffixes [e.g. "_pos"] removed)
    • new: skip (boolean)
  • Infusate sheet (fully pre-populated automatically by function using the tracer sheet)
    • Infusate Number (only for temporary use to associate sheets)
    • Tracer Group Name (e.g. "eaas". May repeat if combined in different concentrations and/or labels)
    • name (readonly, populated by function using the tracers sheet's contents)
  • Tracer sheet (all manually filled in) See discussion. The basic idea is to un-encode the data. Columns repeat, except the last 4 (and if they repeat, they are associated with different data).
    • Tracer Number (only for temporary use to associate tracer rows at specific concentrations)
    • Infusate Number (only for temporary use to associate sheets)
    • tracer concentration
    • compound name
    • element
    • mass_number
    • label_count
    • positions (optional)
  • Compound sheet (mostly pre-populated by the validation interface using the accucor/isocorr data, with user-required input for new compounds). Same contents as compounds.tsv.
  • Tissue sheet (mostly pre-populated by the validation interface using the database, with user-required input for new tissues)
  • Peak Annotation Files sheet (similar to the LCMS metadata file)
    • peak annotation filename (pre-populated by the validation interface where possible)
    • peak annotation filetype dropdown (accucor or isocorr) (pre-populated by the validation interface where possible)
    • Sample Name Prefix
  • Peak Annotation Details sheet (similar to the LCMS metadata file)
    • Sample Name (pre-populated by the validation interface where possible)
    • Sample Data Header (pre-populated by the validation interface where possible)
    • mzXML filename (pre-populated by the validation interface where possible)
    • peak annotation filename (pre-populated by the validation interface where possible)
    • Polarity
    • Sequence Number (only for temporary use to associate sheets)
  • Sequences sheet
    • Sequence Number (only for temporary use to associate sheets)
    • operator
    • date
    • instrument
    • LC method
    • LC Run Length
    • LC Description
    • Notes
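The Infusate Number and Tracer Number columns above exist only to associate rows across sheets. A minimal sketch of that association, assuming the Tracer sheet has been read into a list of dicts keyed by column header. The function name `group_tracers_by_infusate` and the example rows are hypothetical illustrations, not existing TraceBase code or data:

```python
from collections import defaultdict


def group_tracers_by_infusate(tracer_rows):
    """Group Tracer sheet rows by their temporary Infusate Number.

    tracer_rows: list of dicts keyed by column header (an assumed in-memory
    representation of the Tracer sheet, not an actual TraceBase structure).
    """
    grouped = defaultdict(list)
    for row in tracer_rows:
        grouped[row["Infusate Number"]].append(row)
    return dict(grouped)


# Illustrative rows only: two infusates, the first containing two tracers.
tracer_rows = [
    {"Tracer Number": 1, "Infusate Number": 1, "compound name": "leucine",
     "element": "C", "mass_number": 13, "label_count": 6},
    {"Tracer Number": 2, "Infusate Number": 1, "compound name": "isoleucine",
     "element": "C", "mass_number": 13, "label_count": 6},
    {"Tracer Number": 3, "Infusate Number": 2, "compound name": "glucose",
     "element": "C", "mass_number": 13, "label_count": 6},
]

infusates = group_tracers_by_infusate(tracer_rows)
for number, rows in sorted(infusates.items()):
    print(number, [r["compound name"] for r in rows])
```

Once rows are grouped this way, the read-only infusate name could be derived from each group, and the temporary numbers discarded after loading.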

Much of the above will be pre-populated (see discussion) by the validation interface. The only input the validation interface will require is:

  • Accucor/isocorr files

Optional additional inputs for validation:

  • Study Doc
  • mzXML files (they will not be uploaded - their names will be compiled and submitted)

The process will go like this:

  1. User submits accucor/isocorr & mzXML files and gets back errors and a "Study" excel spreadsheet
  2. User can fix errors and go back to step 1
  3. Once there are no errors, the user can download the excel file, fill in the missing data, and submit again (this time also including the "Study" excel file), getting back new errors associated with the correctness/completeness of the Study excel file
  4. User can fix errors and go back to step 3
  5. Proceed to submission

Alternatives

None

Dependencies

This is an issue-tracking issue for the following issues:

Comment

  • The LCMS Metadata tab should probably be broken up so that sequence notes and LC descriptions can be provided efficiently.
  • Validating an entire study may not be possible in a live back-and-forth, because the web browser would time out. Either validation would need to run on a handful of files (or 1) at a time, a celery progress bar would have to be used, or the user would have to be emailed a (link to a) report.
  • The process should be verbose about warnings WRT enforcing sample uniqueness and try to catch when samples appear to have different names, but come from the same sample (e.g. warn about sample1_pos being the same as sample1).
  • Data that needs to be manually entered should be highlighted in some way.
  • It may or may not be too cumbersome to actually upload the mzXML files. If we did, we could parse and use the data in the files.
  • The database already contains many duplicate samples under different "sample names". A heuristic to discern the actual sample name (e.g. by removing suffixes like "_pos") will not spare the user from having to validate all of the sample names. The process should issue warnings when sample names look too much alike (e.g. one sample name contains another).
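The sample-name points above (strip suffixes such as "_pos", then warn when one name contains another) can be sketched as follows. This is only an illustration under assumptions: the suffix pattern, `guess_sample_name`, and `similar_name_warnings` are hypothetical names, and the real set of suffixes would have to come from the project:

```python
import re

# Assumed suffix pattern for illustration; the real list of suffixes
# (polarity, scan range, etc.) is not specified here.
SUFFIX_PATTERN = re.compile(r"(_pos|_neg)+$")


def guess_sample_name(header):
    """Strip trailing polarity-style suffixes from a sample data header."""
    return SUFFIX_PATTERN.sub("", header)


def similar_name_warnings(names):
    """Return (shorter, longer) pairs where one name contains the other."""
    # Sort by length so each name is only compared against longer names.
    unique = sorted(set(names), key=lambda n: (len(n), n))
    pairs = []
    for i, shorter in enumerate(unique):
        for longer in unique[i + 1:]:
            if shorter in longer:
                pairs.append((shorter, longer))
    return pairs


headers = ["sample1", "sample1_pos", "sample2_neg"]
print([guess_sample_name(h) for h in headers])
print(similar_name_warnings(headers))
```

A containment check like this will also flag unrelated names such as "sample1" and "sample10", which is why it fits a warning the user confirms rather than an automatic merge.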

I created an example version of the Study Excel doc:

animal_sample_table.xlsx


ISSUE OWNER SECTION

Assumptions

None

Limitations

None

Affected Components

  • change: accucor_data_loader.py
  • change: load_study.py
  • change: validation.py

Requirements

NOTE: These requirements are NOT final/complete. They were originally drafted in this issue, but then regrouped and fleshed out in the issues that were created to break up this issue (see the Dependencies section).

  • 1. Every table-based input file can either be a tab-delimited or excel file
  • 2. All excel tabs are accessed by name (not by index)
  • 3. Validation interface modes
  • 3.1. Peak annotation only (with optional mzXML files). See #1086, "Make load_study accept mzxml files (in addition to the study doc)".
  • 3.1.1. Runs with only accucor/isocorr file(s) (currently, it requires the sample table file, I think)
  • 3.1.2. Generates a stubbed-out study doc with the following tabs' pre-populated fields (see 8. for all changed columns)
  • 3.1.2.1. Samples Pre-populated Columns (based on peak annotation file contents)
  • 3.1.2.1.1. Sample Name (a heuristic will be used to remove _scan and _charge suffixes)
  • 3.1.2.2. Treatments (optional - required if any are new) Pre-populated Columns
  • 3.1.2.2.1. Animal Treatment (based on Study doc, Animals tab, Treatment column contents)
  • 3.1.2.2.2. Description (based on database, empty/required if not in DB)
  • 3.1.2.3. Tissues (optional - required if any are new) Pre-populated Columns
  • 3.1.2.3.1. TraceBase Tissue Name (based on Study doc, Animals tab, Tissue column contents)
  • 3.1.2.3.2. Description (based on database, empty/required if not in DB)
  • 3.1.2.4. Infusates Pre-populated Columns
  • 3.1.2.4.1. Infusate Number (based on Study doc, Tracers tab contents)
  • 3.1.2.4.2. Tracer Group Name (if exists in the database)
  • 3.1.2.4.3. Infusate Name (based on Study doc, Infusates tab's Tracer Group Name and Tracer Name columns)
  • 3.1.2.5. Tracers Pre-populated Columns
  • 3.1.2.5.1. Tracer Name
  • 3.1.2.6. Compounds (optional - required if any are new) Pre-populated Columns
  • 3.1.2.6.1. Compound (based on peak annotation file contents)
  • 3.1.2.6.2. Formula (based on peak annotation file contents)
  • 3.1.2.6.3. HMDB ID (if exists in the database)
  • 3.1.2.6.4. Synonyms (if exists in the database)
  • 3.1.2.7. Peak Annotation Files Pre-populated Columns
  • 3.1.2.7.1. Peak Annotation File Name (based on peak annotation file names)
  • 3.1.2.7.2. Peak Annotation File Type (inferred from peak annotation header contents)
  • [ ] 3.1.2.7.3. Sample Name Prefix (if not unique, uses study ID, if still not unique, uses animal ID, if still not unique, uses both. If not unique after that, it will keep both, but an error will prompt the user to manually change it.) Note: a prefix is not necessary, given that the Peak Annotation Details sheet explicitly maps each sample to its sample header.
  • 3.1.2.8. Peak Annotation Details Pre-populated Columns
  • 3.1.2.8.1. Sample Name (based on heuristically modified peak annotation file contents)
  • 3.1.2.8.2. Sample Data Header (based on peak annotation file contents)
  • [ ] 3.1.2.8.3. mzXML File Name (based on peak annotation file contents and omitted if mzXML files supplied and no match) Note: decided not to autofill this, because it could end up wrong. The default behavior would find the file anyway.
  • 3.1.2.8.4. Peak Annotation File Name (based on peak annotation file name and sample header)
  • [ ] 3.1.2.8.5. Polarity (based on mzXML file content - empty if no matching file) Note: polarity now only comes from mzXML files.
  • 3.1.2.9. Defaults (optional - required if any data is missing or generates errors/warnings, e.g. researcher name variation) Pre-populated Columns. See #1099, "Add the defaults sheet to the downloaded template".
  • [ ] 3.1.2.9.1. Researchers Confirmed (True if all are existing, empty/required if warnings/errors)
  • 3.2. Study doc only (with optional mzXML files). Note: the form cannot accept mzXML files because they are too big.
  • 3.3. Full mode: Study doc and Peak annotation (with optional mzXML files)
  • 3.4. Fields in the stub that require manual entry should be highlighted. See #1105, "Color excel sheet cells that have errors with cell locations".
  • [ ] 3.5. Each pre-population action will be a separate method or a method that takes the tab name, column header, and row
  • 4. The accucor data loader will
  • 4.1. Take the study doc instead of an LCMS Metadata file
  • [ ] 4.1.1. Merge the Peak Annotation Details and Sequences sheet
  • [ ] 4.1.2. Re-use the LCMS metadata processing code with the new merged sheets
  • 4.2. Use the defaults tab instead of command line options
  • 5. The following loaders (in 5.1.) will take the study doc and meet the requirements under 5.2.
  • 5.1. Ancillary Data Loaders
  • 5.1.1. Compounds
  • 5.1.2. Tissues
  • 5.1.3. Treatments
  • 5.2. Ancillary Data Loading Requirements
  • 5.2.1. Take either a tab-delimited file or the Study excel file
  • 5.2.2. If no errors and not in validate mode, append rows to the consolidated data file (e.g. compounds.tsv)
  • 5.2.3. If no errors and not in validate mode, remove the tab from the study doc
  • 6. New loader scripts
  • 6.1. Tracers Loader
  • 6.2. Infusates Loader
  • 7. The animal sample loader will be broken up into
  • 7.1. A study data loader
  • 7.2. An animal loader
  • 7.3. A sample loader
  • 8. New Study doc (augmenting the existing animals/samples table) with the following tabs
  • 8.1. Study Tab
  • 8.1.1. Add Columns
  • [ ] 8.1.1.1. Study ID
  • 8.1.1.2. Name
  • 8.1.1.3. Description
  • 8.2. Animals Tab
  • 8.2.1. Remove Columns
  • [ ] 8.2.1.1. Study Name (moved to Study sheet)
  • 8.2.1.2. Study Description (moved to Study sheet)
  • 8.2.1.3. Tracer Concentrations (moved to Tracers sheet)
  • [ ] 8.2.2. Add Columns
  • [ ] 8.2.2.1. Study ID
  • 8.3. Samples Tab
  • [ ] 8.3.1. Add Columns
  • [ ] 8.3.1.1. Skip (moved to Peak Annotation Details sheet)
  • 8.4. Treatments Tab (optional - required if any are new)
  • 8.5. Tissues Tab (optional - required if any are new)
  • 8.6. Infusates Tab
  • 8.6.1. Add Columns
  • 8.6.1.1. Infusate Number
  • 8.6.1.2. Tracer Group Name
  • 8.6.1.3. Infusate Name
  • 8.6.1.4. Tracer Number
  • 8.6.1.5. Tracer Concentration
  • 8.7. Tracers Tab
  • 8.7.1. Add Columns
  • 8.7.1.1. Tracer Number
  • 8.7.1.2. Compound Name
  • 8.7.1.3. Element
  • 8.7.1.4. Mass Number
  • 8.7.1.5. Label Count
  • 8.7.1.6. Label Positions
  • 8.7.1.7. Tracer Name (based on Study doc, Tracers tab contents)
  • 8.8. Compounds Tab (optional - required if any are new)
  • 8.9. Peak Annotation Files tab
  • 8.9.1. Add Columns
  • 8.9.1.1. Peak Annotation File Name
  • 8.9.1.2. Peak Annotation File Type
  • 8.9.1.3. Sample Name Prefix
  • 8.10. Peak Annotation Details Tab
  • 8.10.1. Add Columns
  • 8.10.1.1. Sample Name
  • 8.10.1.2. Sample Data Header
  • 8.10.1.3. mzXML File Name
  • 8.10.1.4. Peak Annotation File Name
  • [ ] 8.10.1.5. Polarity
  • 8.10.1.6. Sequence Number
  • 8.11. Sequences Tab
  • 8.11.1. Add Columns
  • 8.11.1.1. Sequence Number
  • 8.11.1.2. Operator
  • 8.11.1.3. Date
  • 8.11.1.4. Instrument
  • [ ] 8.11.1.5. LC Protocol (replaced with sequence name)
  • [ ] 8.11.1.6. LC Run Length (replaced with sequence name)
  • [ ] 8.11.1.7. LC Description (replaced with sequence name)
  • 8.11.1.8. Notes
  • [ ] 8.12. Defaults Tab (optional - required if any data is missing or generates errors/warnings, e.g. researcher name variation) Note: the columns were changed to sheet, header, and value.
  • [ ] 8.12.1. Add Columns
  • [ ] 8.12.1.1. Researcher
  • [ ] 8.12.1.2. Researchers Confirmed
  • [ ] 8.12.1.3. Peak Annotation Format
  • [ ] 8.12.1.4. Polarity
  • [ ] 8.12.1.5. Sequence Date
  • [ ] 8.12.1.6. LC Protocol Name
  • [ ] 8.12.1.7. Instrument
  • 9. Add a Study ID field to the Study model
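Requirement 8.12 notes that the Defaults tab columns were changed to sheet, header, and value, and requirement 4.2 has loaders read defaults from that tab instead of command line options. A minimal sketch of such a lookup, assuming the tab is available as a list of dicts. The function names and the example values are hypothetical, not the project's implementation:

```python
def build_defaults(default_rows):
    """Index Defaults tab rows by (sheet, header)."""
    defaults = {}
    for row in default_rows:
        defaults[(row["sheet"], row["header"])] = row["value"]
    return defaults


def value_or_default(row, sheet, header, defaults):
    """Return the cell value, falling back to the Defaults tab when empty."""
    value = row.get(header)
    if value is None or str(value).strip() == "":
        return defaults.get((sheet, header))
    return value


# Illustrative rows only; "Example Instrument" is a made-up value.
defaults = build_defaults([
    {"sheet": "Sequences", "header": "Instrument", "value": "Example Instrument"},
])
sequence_row = {"Sequence Number": 1, "Operator": "Example Operator", "Instrument": ""}

print(value_or_default(sequence_row, "Sequences", "Instrument", defaults))
print(value_or_default(sequence_row, "Sequences", "Operator", defaults))
```

Keying on the (sheet, header) pair lets one Defaults tab serve every other tab without a dedicated column per default.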

DESIGN

Interface Change description

  1. The validation page will
    • Additionally take a zip file of all mzXML files.
    • Take an optional email address (to email results - otherwise, it will attempt to return processed results)
    • Allow peak annotation files only
    • Generate and auto-populate a study doc
  2. Every loading script will
    • Take either tsv or xlsx files for every tab of the study doc
    • No longer take yaml files
  3. The animal sample table loader will
    • only load the Study, Animals, Samples, Infusates, Tracers, and Sequences
  4. The accucor loader will require the Study doc (and use the tabs: Peak Annotation Files, Peak Annotation Details, Sequences, and Defaults) instead of the LCMS metadata file
  5. The Compounds loader will load novel compounds, update the consolidated compounds file, and output a new Study doc with the compounds tab removed
  6. The tissues loader will load novel tissues, update the consolidated tissues file, and output a new Study doc with the tissues tab removed
  7. The treatments loader will load novel treatments, update the consolidated treatments file, and output a new Study doc with the treatments tab removed
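Items 5 through 7 share one pattern (requirements 5.2.2 and 5.2.3): after a clean, non-validate load, append the tab's rows to the consolidated file and return a study doc without that tab. A rough sketch under stated assumptions: the study doc is modeled as a dict of tab name to row dicts, the rows are assumed to already be filtered to novel records, and `finalize_ancillary_tab` is a hypothetical name:

```python
import csv
import io


def finalize_ancillary_tab(study_doc, tab, consolidated, fieldnames, errors, validate):
    """Append a loaded tab's rows to a consolidated TSV and drop the tab.

    study_doc: dict mapping tab name -> list of row dicts (assumed shape).
    consolidated: writable text file object positioned at the end of the file.
    The study doc is returned unchanged on errors or in validate mode.
    """
    if errors or validate:
        return study_doc
    writer = csv.DictWriter(
        consolidated, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    # Rows are assumed to contain exactly the consolidated file's columns.
    writer.writerows(study_doc.get(tab, []))
    return {name: rows for name, rows in study_doc.items() if name != tab}


# Illustrative data only; an in-memory buffer stands in for compounds.tsv.
consolidated = io.StringIO()
study_doc = {
    "Compounds": [{"Compound": "examplecompound", "Formula": "C6H12O6"}],
    "Animals": [],
}
new_doc = finalize_ancillary_tab(
    study_doc, "Compounds", consolidated, ["Compound", "Formula"],
    errors=[], validate=False,
)
print(sorted(new_doc))
print(repr(consolidated.getvalue()))
```

Returning a new dict rather than mutating the input keeps the original study doc available if a later step fails.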

Code Change Description

  • The validation page
    • will determine the mode based on what's provided
    • If no study doc is provided, it generates a study doc with pre-populated fields. If a field is required but cannot be pre-populated, a placeholder value will be entered. Missing or placeholder values will be highlighted for required manual entry.
  • The existing excel parsing code will change the way it accesses tabs from index to name
  • The accucor data loader will
    • Require the study doc to be supplied and...
      • Combine the Peak Annotation Details and Sequences tabs and re-use the LCMS Metadata code to process it
      • Use the Defaults tab's values instead of the command line options
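The mode selection described for the validation page ("determine the mode based on what's provided", requirements 3.1 through 3.3) could look roughly like this. The enum, the function name, and the file names in the usage lines are hypothetical, and mzXML files are left out because they are optional in every mode:

```python
from enum import Enum


class ValidationMode(Enum):
    PEAK_ANNOTATION_ONLY = "peak annotation only"
    STUDY_DOC_ONLY = "study doc only"
    FULL = "full"


def determine_mode(study_doc=None, peak_annotation_files=()):
    """Pick the validation mode from which inputs were submitted."""
    has_study = study_doc is not None
    has_peaks = len(peak_annotation_files) > 0
    if has_study and has_peaks:
        return ValidationMode.FULL
    if has_peaks:
        return ValidationMode.PEAK_ANNOTATION_ONLY
    if has_study:
        return ValidationMode.STUDY_DOC_ONLY
    raise ValueError("Provide a study doc, peak annotation files, or both.")


# Made-up file names for illustration.
print(determine_mode(peak_annotation_files=["accucor1.xlsx"]))
print(determine_mode(study_doc="study.xlsx", peak_annotation_files=["accucor1.xlsx"]))
```

In peak-annotation-only mode the page would go on to generate the stubbed-out study doc; in the other two modes it would validate the supplied one.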

I intend to perform the work in phases (with separate PRs):

  1. Requirement 1.
  2. Requirement 2.
  3. Requirements 9., 7.1., and 8.1.
  4. Requirements 8.6. and 8.7. (adding those tabs and implementing all sub-items, and leaving the other sheets alone)
  5. Requirements 5.1.1., 5.2., and 8.8.
  6. Requirements 8.11. and 8.12.
  7. Requirement 8.10.
  8. Requirements 8.9. (Peak Annotation Files Tab), 8.10. (Peak Annotation Details Tab), and 4. (accucor loader)
  9. Requirements 5.1.2., 5.2., and 8.5.
  10. Requirements 5.1.3., 5.2., and 8.4.
  11. Requirements 7.2., 7.3., 8.2., and 8.3.
  12. Requirement 3.

Tests

A test for each requirement

@hepcat72 hepcat72 added the type:feature New feature or request label Sep 25, 2023
@hepcat72 hepcat72 self-assigned this Sep 29, 2023
@hepcat72 hepcat72 changed the title Automatic stub creation of data submission Study Submission Process refactor (aka, Automatic stub creation) Dec 28, 2023
@hepcat72 hepcat72 changed the title Study Submission Process refactor (aka, Automatic stub creation) Study Submission Process Refactor Dec 29, 2023
@hepcat72 hepcat72 added the issue-tracking Meta-issue to coordinate/organize other issues label Dec 30, 2023
@hepcat72 hepcat72 mentioned this issue Jan 9, 2024
8 tasks
@hepcat72 hepcat72 mentioned this issue Jan 26, 2024
8 tasks
@hepcat72 hepcat72 pinned this issue Jan 26, 2024
This was referenced Mar 14, 2024
@hepcat72 hepcat72 mentioned this issue Apr 3, 2024
8 tasks
@hepcat72
Collaborator Author

All items completed, changed, or transferred to separate issues.

@hepcat72 hepcat72 unpinned this issue Aug 11, 2024