Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Task: Evaluate data submission process #1034

Closed
3 tasks done
lparsons opened this issue Jul 11, 2024 · 19 comments
Closed
3 tasks done

Task: Evaluate data submission process #1034

lparsons opened this issue Jul 11, 2024 · 19 comments
Assignees
Labels
change:split-issue Split this issue into multiple issues component:submission Submission process page and logic priority:1-critical Blocking issue that is critical to functionality type:task Chore not directly resulting in a code change

Comments

@lparsons
Copy link
Contributor

lparsons commented Jul 11, 2024

Goal

Act as a researcher and run through the submission process to better understand the software, file formats, etc. Also, review the process for pain points during the submission process. Identify bugs and suggestions for improvements and share those with the group.

Submission Process

  1. Gather peak annotation files and mzXML files (if they exist). mzXML files can be left on a shared network drive.
  2. Using tracebase-dev, begin creating a submission.
  3. Fill out the study doc as needed.
  4. Submit the study data
    1. Place the files on the msdata shared drive (smb://gen-iota-cifs/msdata/)
    2. Fill out the study submission form.

Our goal right now is to evaluate the submission process. At a later stage, we'll go though and process these submissions to ensure we call get the submitted data loaded.

Below are are sample data sets that we will use for our evaluations. I've selected a variety of formats and assigned some sets to different people. Please check off the datasets when you have completed the submissions.

@lparsons lparsons added the component:submission Submission process page and logic label Jul 11, 2024
@lparsons
Copy link
Contributor Author

lparsons commented Jul 12, 2024

The template generated was helpful, lots of comments explaining the fields, dropdowns for various selections. Overall, a much better experience than the empty template from earlier versions. Below are some comments regarding areas I think we could improve. I had trouble with the validation, and ultimately decided to simply submit, since I think the errors I got were not "real", but I wasn't entirely certain. The errors did change somewhat after successive submissions.

Initial template creation

  • The menu option "Upload" led me to expect to upload my data, but the next page was a bit confusing. The instructions were on that page, but when I clicked "Build a TraceBase Submission", the instructions I was hoping to follow went away.
  • The Build a Submission form asks for two things, and it wasn't straightforward to me that I should only submit peak annotation files.
  • After submission, the form just reloads and a file gets silently downloaded. I looked away and actually missed that it happened. It would be nice to get a page that confirmed what I submited and perhaps showed some stats, like number of samples found, notes about new tissues, compounds, etc. and a list of instructions of what to do next.
  • Separating validation from the part that builds a template would be easier to understand, imo. I wouldn't have to go from step 1, to step 2, and then back to step 1.

Study Doc

  • Start with a page that provides some instructions. Being met with a page that just has three headers asking for a Study ID, Name, and Description is a bit confusing. Making the middle field the only required one is also a bit confusing, perhaps move it to the front.
  • The instructions say to fill out Animals, Samples, and Treatments, but those are not the first three sheets, I suggest putting them in order so that those that need filling out are first.
  • I'd suggest that a study doc be allowed to contain only one study, and thus we could remove the study column from the Animals sheet and perhaps put the name and description fields as bold sections along with instructions.
  • Having concentration as part of the Infusate name was a bit confusing for me. It would be more intuitive to enter those in separate fields, imho. Especially since it seems likely that Infusates will be reused with different concentrations.
  • Suggestion: merge the "Peak Annotation Details" sheet with the sample sheet, putting the extra columns at the end. I submitted two peak annotation files and the samples were put into the worksheet organized by file, which was nice. However, it wasn't easy to see which came from which file. Perhaps add a column to the Samples sheet that indicates which peak annotation file the sample was parsed from. It would presumably be ignored downstream, but would have been helpful for me to get an understanding of how things were organized.
  • The drop-downs for tissue were great, would be great if they auto-filled as I was typing, but that's likely an Excel / LibreOffice limitation.
  • I wasn't sure what to put for "Default Sequence Name", so I left it blank.
  • I was confused by the Sequences sheet, thus I left it blank.
  • The various "data sheets" that filled out by the database could be made a different color. I would also suggest they be made "read only" so it's clear they shouldn't be changed. Instructions for adding new values could be added.

Validation

  • Confusing to click "Download" when what I want to do is upload and validate.
  • Feedback was initially hidden
  • Compounds sheet was unaltered, but resulted in ConflictingValueErrors
  • Tracers were selected from the dropdown, but resulted in "does not match" errors
  • Tracers sheet was unaltered, but resulted in an error that the Tracer Group Name "None" was used for multiple rows.
  • Empty lines in Study sheet resulted in errors that Study records were not found
  • Strange error: "Unable to convert column Run Length to type int64"
  • Strange error: "expected str, bytes or os.PathLike object, not NoneType"
  • Sequence Name missing value, but I don't know how or what to fill in
  • Each sample resulted in an error that the sample names were not found. "21 Sample records were not found (using search field(s): name) with values found in column [Sample Name] of sheet [Peak Annotation Details] in /tmp/study.xlsx:" These sheets were not altered.

Validation results

Study Doc submitted (along with above linked peak annotation files): study.xlsx

  • PASSED: No Files are Missing All Samples
  • FAILED: study.xlsx
    • RequiredColumnValues
      Required column values missing on the indicated rows:
      sheet [Study] in /tmp/tmpi0rgrjj7.upload.xlsx
      Column: [Study ID, Description] on rows: ['2']

    • ConflictingValueErrors
      Conflicting values encountered during loading:
      During the processing of sheet [Compounds] in /tmp/study.xlsx...
      Creation of the following Compound record(s) encountered conflicts:
      File record: {'name': 'aconitate', 'formula': 'C6H6O6', 'hmdb_id': 'HMDB0000072'} (on row(s): 3)
      Database record: {'name': 'aconitate', 'formula': 'C6H10O3', 'hmdb_id': 'HMDB0000072'}
      [formula] values differ:
      - database: [C6H10O3]
      - file: [C6H6O6]

    • InfileError
      Tracer data from columns [Tracer Name: lactate-[13C3], Tracer Concentration: 148.877] on row(s) [5] does not match any of the tracers parsed from the Infusate Name [lactate-[13C3][149]] on row [5] of sheet [Infusates] in /tmp/study.xlsx.

    • InfileError
      Tracer data from columns [Tracer Name: lactate-[13C3], Tracer Concentration: 434.631] on row(s) [13] does not match any of the tracers parsed from the Infusate Name [lactate-[13C3][435]] on row [13] of sheet [Infusates] in /tmp/study.xlsx.

    • InfileError
      sheet [Infusates] in /tmp/study.xlsx:
      Tracer Group Name: 'None' was assigned to infusates containing the following different assortments of tracers, for the indicated 'Infusate Row Group's:
      lactate-[13C3] (on rows with 'Infusate Row Group's: [1, 4, 5, 6, 11, 12, 14, 15])
      glutamine-[13C5,15N2] (on rows with 'Infusate Row Group's: [2, 13])
      alanine-[13C3,15N1] (on rows with 'Infusate Row Group's: [3])
      glutamine-[13C5] (on rows with 'Infusate Row Group's: [7, 8, 16])
      alanine-[13C3] (on rows with 'Infusate Row Group's: [9, 10])
      Suggested resolution: Use a different group name for each 'Infusate Row Group' containing different tracers.

    • MissingRecords
      7 Study records were not found (using search field(s): name) with values found in sheet [Animals] in /tmp/study.xlsx:
      tca_flux_healthy_tissues from row(s): ['2-8']

    • RequiredColumnValues
      Required column values missing on the indicated rows:
      sheet [Samples] in /tmp/tmpi0rgrjj7.upload.xlsx
      Column: [Animal] on rows: ['2-22']

    • ValueError
      Unable to convert column Run Length to type int64

    • TypeError
      expected str, bytes or os.PathLike object, not NoneType

    • RequiredColumnValues
      Required column values missing on the indicated rows:
      sheet [Peak Annotation Details] in /tmp/tmpi0rgrjj7.upload.xlsx
      Column: [Sequence Name] on rows: ['2-23']

    • MissingRecords
      21 Sample records were not found (using search field(s): name) with values found in column [Sample Name] of sheet [Peak Annotation Details] in /tmp/study.xlsx:
      bat4 from row(s): ['2']
      bat5 from row(s): ['3']
      bat6 from row(s): ['4']
      bat7 from row(s): ['5']
      int4 from row(s): ['6']
      int5 from row(s): ['7']
      int6 from row(s): ['8']
      int7 from row(s): ['9']
      sol4 from row(s): ['10']
      sol5 from row(s): ['11']
      sol6 from row(s): ['12']
      sol7 from row(s): ['13']
      bat1 from row(s): ['14']
      bat2 from row(s): ['15']
      bat3 from row(s): ['16']
      int1 from row(s): ['18']
      int2 from row(s): ['19']
      int3 from row(s): ['20']
      sol1 from row(s): ['21']
      sol2 from row(s): ['22']
      sol3 from row(s): ['23']

  • PASSED: 091221_tissues_corrected.xlsx
  • PASSED: 090521_tissues_corrected.xlsx

MSData Share and Submission Form

  • Date of Mass Spec Experiment - there were multiple, but only space for one. How could we better handle this?

@mneinast
Copy link
Collaborator

Great ideas lance! I agree with just about everything. One of the most important parts will be to make the "Samples" and/or "Peak Annotation Details" flow as cleanly as possible.

Basically, if people name their samples in a unique way (as I try to do), then the current "Samples" sheet is ideal. But if people name their samples something like "t1, t2, t3" for every experiment, then they will need to know which data sheet the "Samples" name came from. So we have to support this somehow.

I wonder if we could insert a step after the user uploads data sheets, but before they download this template sample information sheet. Can we insert a way for the user to indicate which files contain the same samples?

Imagine that I upload 6 excel sheets, where samples are all labeled non-unique: "t1, t2, t3". In reality, the first 3 sheets were from one set of samples, and the last three were from a different set:

image

If I can indicate these sample groups before downloading the template, then the Samples tab on the Sample Info sheet could look like this (new column highlighted):
image

@hepcat72
Copy link
Collaborator

The template generated was helpful, lots of comments explaining the fields, dropdowns for various selections. Overall, a much better experience than the empty template from earlier versions. Below are some comments regarding areas I think we could improve. I had trouble with the validation, and ultimately decided to simply submit, since I think the errors I got were not "real", but I wasn't entirely certain. The errors did change somewhat after successive submissions.

Initial template creation

  • The menu option "Upload" led me to expect to upload my data, but the next page was a bit confusing. The instructions were on that page, but when I clicked "Build a TraceBase Submission", the instructions I was hoping to follow went away.

That's a good insight. I think the initial design was intentionally bare-bones and conceptually clean in the initial v3 (where it just downloaded a stubbed sheet and have validation as an afterthought). Perhaps instead of an instructions page and different pages with tasks, it could be more like what you see when you book a flight or fill out a survey. Progressive steps are shown at the top, to give you a sense of how far along in the process you are... meaning, you don't have to go back to the instructions page to get to the next step.

  • The Build a Submission form asks for two things, and it wasn't straightforward to me that I should only submit peak annotation files.

This insight dovetails with my ideal above of a progressive process. (It could also save progress in a cookie.) The first step could be to submit all your peak annotation files (only) and get back the stubbed sheet. The second step could take the 1 study doc and validate it, thereby clearing up that confusion over what to submit when.

There wouldn't be a way to catch multiple representations in the peak annotation files by doing it this way, but maybe that could be a custom check when creating the stub study sheet.

  • After submission, the form just reloads and a file gets silently downloaded. I looked away and actually missed that it happened. It would be nice to get a page that confirmed what I submited and perhaps showed some stats, like number of samples found, notes about new tissues, compounds, etc. and a list of instructions of what to do next.

Yes. That is nearly exactly what I have planned. I want to add stats to the status report with exactly that info (and more).

  • Separating validation from the part that builds a template would be easier to understand, imo. I wouldn't have to go from step 1, to step 2, and then back to step 1.

I agree. It makes more sense to me when you frame it in the context as the user's confusion/first impression. I believe you had suggested this before and I hadn't fully appreciated where it was coming from.

Study Doc

  • Start with a page that provides some instructions. Being met with a page that just has three headers asking for a Study ID, Name, and Description is a bit confusing. Making the middle field the only required one is also a bit confusing, perhaps move it to the front.

Study ID, is slated for removal. I expect it will be removed at some point before this gets to production, but I like the idea of instructions on the initial page of the study doc. Maybe that's not what you meant, but that's immediately where my head went.

  • The instructions say to fill out Animals, Samples, and Treatments, but those are not the first three sheets, I suggest putting them in order so that those that need filling out are first.

Yeah, I haven't touched the stale instructions page yet. I did try to order the tabs in order of filling them out. Perhaps I was thinking top down (study name/desc, animals, samples...), but it sounds like perhaps you're suggesting bottom up? That makes some sense.

  • I'd suggest that a study doc be allowed to contain only one study, and thus we could remove the study column from the Animals sheet and perhaps put the name and description fields as bold sections along with instructions.

That's not a bad idea, conceptually, but each sheet is independent in terms of the loading. The loaders don't access other sheets (with the exception of the defaults sheet - which you haven't seen yet). It keeps everything cleaner and simpler, code-wise. But here's what I suggest. Enforcing 1 study is a decent idea. We could accomplish the same thing you suggest simply by making the study column populated by formula (i.e. read only). The formula could just grab the name on the first row of the study sheet.

  • Having concentration as part of the Infusate name was a bit confusing for me. It would be more intuitive to enter those in separate fields, imho. Especially since it seems likely that Infusates will be reused with different concentrations.

They are entered in separate fields. This is done in the Infusates sheet. No one should ever need to type out the name. I had hoped that it being a drop-down field would make it clear enough, but perhaps not.

But this brings up a peculiarity that I'm just now realizing. So having concentrations in the infusate names is how the names have been represented on Tracebase since we started supporting multiple tracers. It was necessary in tracebase for uniqueness, because we'd run into unique constraint violations in the Infusate model. But now that I think about it, unique constraint violations make no sense. The loader should never have tried to load the same Infusate again. It might have been the result of how the loaders were originally written, and that code solidified with making the name include the concentrations in the database, instead of fixing the loading code. And yep... I just checked. We do have separate infusate records for different sets of tracer concentrations. It's in the Infusate model. There's a separate record for each different set of concentrations. I think that's a schema design flaw that we let slip through.

And that problem of representing unique infusate concentration possibilities was propagated to the sheets/loaders. Now that I see it, I'm realizing that this goes pretty deep. I think that if we fix the schema, we could break those 2 fields apart again, like you're suggesting. In fact, it would make a lot of code much simpler, I think. But the change would affect a lot of things, from the model, to the templates, and the loaders.

That said, concentrations are entered individually, and selected by dropdown, which is the solution that was suggested. It could probably be made clearer that that's what the user should do. For now, that may be the way to go, and we can plan out the overhaul of the Infusate model problem later.

I'm going to break up my response in different comments. This last one here was a big one and I don't want to bury it.

@hepcat72
Copy link
Collaborator

  • Suggestion: merge the "Peak Annotation Details" sheet with the sample sheet, putting the extra columns at the end. I submitted two peak annotation files and the samples were put into the worksheet organized by file, which was nice. However, it wasn't easy to see which came from which file. Perhaps add a column to the Samples sheet that indicates which peak annotation file the sample was parsed from. It would presumably be ignored downstream, but would have been helpful for me to get an understanding of how things were organized.

I don't disagree, from the user's perspective that this might seem weird at first, though there would be a lot of problems to solve to make the merge happen. We had discussed this design decision (separate samples from what had initially been referred to as the "LCMS metadata file" - now the "Peak Annotation Details" sheet) at length a year or so ago, and I do recall that we had flipped back and forth between separate versus merged metadata.

Some of the problems we would have to solve (as we had discussed a year ago) are:

  • Sample name uniqueness. - Right now, the sample column in the samples sheet is required to be unique. Since Sample: Peak Annotation File Sample Header is a 1:M relationship, this would make the Sample column not unique.
  • User confusion over what appears to be a duplicate column (the sample header). - These are not always the same, which is why it has a separate column, but often, they are the same, which will appear redundant in the sheet.
  • Overhead of skipping blanks. - Currently, all the samples in the sample sheet are loaded into the database, which is straightforward in the loader. It is also clearer when a user is looking at say, an exported study that these are all the samples in the DB. It's a subtle/minor issue, but still an issue.
  • The currently required columns will be difficult to enforce as required. - Since we allow unanalyzed mzXML files, unassociated with a peak annotation file, we would either end up with a bunch of duplicated sample metadata or gappy appearing files in order to support unanalyzed files.
  • Repeated data. - One of the main reasons this separation was made was to prevent users from having to enter the same data over and over again, which introduces the possibility of conflicts or variant duplicate records. The entire design of the study doc was motivated by the effort to reduce this redundant data entry.
  • Separate/independent loading of mzXML files. - The separate sheet enables the ability to load the mzXML files any time. All you need is a sequence and a sample. To continue to support this, the sample loading code would have to have increased complexity to go into an mzXML file load mode. Essentially, all the code that was separated from the Accucor loader to reduce the complexity of the loaders would have to be incorporated into the samples loader. At the least, a separate command line flag would be needed to change modes.

I don't disagree that there is a problem here from a user perspective. It seems more complex than it needs to be, and I think we can fix that, just in a different way. The first thing we can take advantage of is that the user should never actually have to touch the "Peak Annotation Details" sheet. It's (or rather will be) completely automated. The only missing entries: (mzXML File Name and Sequence Name - also the skip column - but let me circle back to that later) are almost always unnecessary. The Default Sequence Name in the "Peak Annotation Files" sheet is all that's really needed. I've never seen a case where one peak annotation file utilized data from multiple sequences. And once I add the placeholder record link to handle multiple mzXML's with the same name, the only reason to use that column would be if the sample header didn't match the file name.

And as far as the skip column goes, I have a plan for users to supply skip strings from the build-a-submission page form. I'd considered putting it in the samples sheet, but the Peak AnnotationDetails are needed by the peak annotations loaders to skip loading. And the samples sheet, again, should only have samples in it we actually want to load, IMHO.

  • The drop-downs for tissue were great, would be great if they auto-filled as I was typing, but that's likely an Excel / LibreOffice limitation.

They did for me. Type one letter and the highlighted one (starting with that letter) will be entered when you hit tab. Can't speak to Libre though.

  • I wasn't sure what to put for "Default Sequence Name", so I left it blank.

It will be a dropdown based on entries in the Sequences tab. It's like the same thing as the researcher, protocol, instrument, and date entries in the yaml associated with individual peak annotation files. John suggested putting the Sequences sheet before the files sheet. Do you think that could fix the misunderstanding there?

  • I was confused by the Sequences sheet, thus I left it blank.

See above. Suggestions to make that clearer are welcome, but remember the intent is to link between sheets with a single column. Maybe rename it "Researcher, Protocol, Instrument, and Date"?

  • The various "data sheets" that filled out by the database could be made a different color. I would also suggest they be made "read only" so it's clear they shouldn't be changed. Instructions for adding new values could be added.

The goal was to actually empower the users to add to those sheets without (initially) having to interact with us to move forward, which was a complaint this was meant to address.

I'll have to respond to the rest later...

@fkang-pu
Copy link
Collaborator

For the dataset 13C-Valine_PI3Ki_in_flank-KPC_mice, the Peak Annotation Files are in isocorr format but are .csv files. Tried to build a Tracebase Submission with 2 Peak Annotation files for this set, but got the following error:

  • Peak annotation files must be either Accucor or Isocorr excel files. The file type of the following supplied files: ['13C-val tissues metabolites results 146,148,149 no D-val_cor.csv', '13C-val tissues metabolites results - No D-Val_cor.csv'] could not be validated.

Then click Download button, got the following message:

  • Either an Animal/Sample Table or Peak Annotation file (e.g. Accucor or Isocorr) is required.

@fkang-pu
Copy link
Collaborator

For the dataset 13C-Valine_PI3Ki_in_flank-KPC_mice, the Peak Annotation Files are in isocorr format but are .csv files. Tried to build a Tracebase Submission with 2 Peak Annotation files for this set, but got the following error:

  • Peak annotation files must be either Accucor or Isocorr excel files. The file type of the following supplied files: ['13C-val tissues metabolites results 146,148,149 no D-val_cor.csv', '13C-val tissues metabolites results - No D-Val_cor.csv'] could not be validated.

Then click Download button, got the following message:

  • Either an Animal/Sample Table or Peak Annotation file (e.g. Accucor or Isocorr) is required.

open those 2 .csv files in Excel and saved them as .xlsx files. Then tried to build a submission using those *.xlsx file. Go the following error:

  • Peak annotation files must be either Accucor or Isocorr excel files. The file type of the following supplied files: ['13C-val tissues metabolites results 146,148,149 no D-val_cor.xlsx', '13C-val tissues metabolites results - No D-Val_cor..xlsx'] could not be validated.

@jcmatese
Copy link
Collaborator

jcmatese commented Jul 16, 2024

Just noting that the Google form has a minor typo ("Q Excative 2") and the submission folder section references a "Information Sheet" which I assume refers to the new study workbook formatted/downloaded by https://tracebase.princeton.edu/DataRepo/validate . The link http://tracebase.princeton.edu/DataRepo/upload at the top of the form is not marked up for HTML(not an active link).

@hepcat72
Copy link
Collaborator

  • Suggestion: merge the "Peak Annotation Details" sheet with the sample sheet, putting the extra columns at the end. I submitted two peak annotation files and the samples were put into the worksheet organized by file, which was nice. However, it wasn't easy to see which came from which file.

[Whoops. I just realized I already responded to this and the next comment, but just to keep this train of thought (which I ruminated on all weekend...).] Being able to identify which files each sample is found in would be a new feature. I'm a little surprised this hasn't been asked for before, however it prompts the question: what would a researcher use this information for? The more we know about the answer to that question, the better the design of this new feature. And should we also provide (or do we already and I'm not aware of it), the same feature on the samples page of the website?

Perhaps add a column to the Samples sheet that indicates which peak annotation file the sample was parsed from. It would presumably be ignored downstream, but would have been helpful for me to get an understanding of how things were organized.

I think that's a good simple design option.

Validation

  • Confusing to click "Download" when what I want to do is upload and validate.

I thought more about the common design pattern that flight booking websites use, and I think that could really clarify the process. We could even have a "skip" button (like they often have) for optional steps [i.e. for the validation step].

  • Feedback was initially hidden

Yeah, that was an attempt to not overwhelm the user. I think that once errors are incorporated into the study doc, we won't need to hide anything - and that is where the stats can go.

  • Compounds sheet was unaltered, but resulted in ConflictingValueErrors

Hmmm. Could be a synonym as the primary compound name? Only thing I can think of without seeing the error. If you have that, could you paste it?

This is also one reason why I had considered coloring data either directly from the database and/or from the (dumb) autofill. The only potential change to the compounds sheet in this context should be new compounds appended to the bottom, but since all this data should be in the db, there should be no change.

Maybe... another possibility could be an equivalent (but differently written) formula. Hard to tell without the error.

  • Tracers were selected from the dropdown, but resulted in "does not match" errors

I'm not surprised at the existence of weird errors like this, since I haven't touched the validation aspect of the loaders yet - only the autofill and sheet metadata (i.e. comments, etc). I'm sure I'll see this and other errors that need minor fixes once I get into it. I'll probably compile a list of these minor issues.

If I recall correctly though (from reading the first half last week), you hadn't entered any row group numbers? That may lead to strange errors, although those should (have been) pre-empted by RequiredColumnValue errors and thus not occurred. I'll figure out what's going on once I start focussing on the validation, I'm sure.

  • Tracers sheet was unaltered, but resulted in an error that the Tracer Group Name "None" was used for multiple rows.

Another result of me not having worked on the validation aspect yet. Not unexpected. These should all be simple fixes, though there's one design decision I have not yet decided on. In the first version of the validation code, I skipped processing the accucor loader when there was an issue with the old samples loader. In v2, I realized that there could still be novel errors (like multiple representations), and so I allowed it to run and handled cascading errors by grouping them into their own categories (like missing compounds). In v3, there are more than just 2 (or 3 when you count the old protocols loader) loaders, and there are more possible cascading errors, so I'm sure there are likely to be errors that should be grouped like the missing compounds, tissues, treatments, etc errors - such as with the infusates/tracers.

  • Empty lines in Study sheet resulted in errors that Study records were not found

I'd thought about a feature to skip empty rows awhile back, but assigned it a low priority. I can certainly implement it.

  • Strange error: "Unable to convert column Run Length to type int64"

Was that for the "unknown" row? I thought I'd addressed that in the latest version. Do you know which branch was on dev when you did your test?

  • Strange error: "expected str, bytes or os.PathLike object, not NoneType"

That's another one I think I fixed in the latest version.

  • Sequence Name missing value, but I don't know how or what to fill in

Did you see the dropdown tab. Might again, be one of the previous branches, before I'd added the dynamic dropdowns (which are populated by other sheets - in this case, the sequences sheet).

I thought a lot over the weekend about the "Sequence Name" and "Default Sequence Name" columns. I had a number of thoughts.

  • "Sequence/Sequences" might not be the best term. The tab might be more intuitively understood if I used "Mass Spec Runs" and the column might be more aptly named "Mass Spec Run" (or event "Select Mass Spec Run").
  • The header comment probably should not describe the value, but instead immediately say to "select a mass spec run from the dropdown (as entered in the Mass Spec Runs sheet)."
  • As John had suggested, putting the sheet before the Peak Annotation Files/Details sheets could better contextualize what those columns are (if they work on the sheets in order).

And if you were working from a slightly older branch, I did make automatic comment inclusions that explain the reference to the "Sequences" sheet when it has a dynamic dropdown populated by another sheet.

  • Each sample resulted in an error that the sample names were not found. "21 Sample records were not found (using search field(s): name) with values found in column [Sample Name] of sheet [Peak Annotation Details] in /tmp/study.xlsx:" These sheets were not altered.

Yeah, this is very likely due to errors in the Samples sheet. If any samples had an error, then they don't get (temporarily) loaded, so any reference from the peak annotation files to those samples will have that error. This was part of the initial reason for skipping subsequence loader runs. I think this can be handled either by re-employing the loader skips or by tracking samples with errors from the StudyLoader. Again, I just haven't focussed on this part yet.

Validation results

  • PASSED: No Files are Missing All Samples

  • FAILED: study.xlsx

    • RequiredColumnValues
      Required column values missing on the indicated rows:
      sheet [Study] in /tmp/tmpi0rgrjj7.upload.xlsx
      Column: [Study ID, Description] on rows: ['2']

Study ID is slated for removal, but mmmm... I should probably remove study description as a required value column? That may have been an inherited behavior from the old loader, or perhaps I had just assumed we should require it.

  • ConflictingValueErrors
    Conflicting values encountered during loading:
    During the processing of sheet [Compounds] in /tmp/study.xlsx...
    Creation of the following Compound record(s) encountered conflicts:
    File record: {'name': 'aconitate', 'formula': 'C6H6O6', 'hmdb_id': 'HMDB0000072'} (on row(s): 3)
    Database record: {'name': 'aconitate', 'formula': 'C6H10O3', 'hmdb_id': 'HMDB0000072'}
    [formula] values differ:
    • database: [C6H10O3]
    • file: [C6H6O6]

Ah! Yes. This appears real. I suspect what happened here was a get-or-create from the compound name and formula from one of the peak annotation files, which prompted an autofill to the compounds sheet. And, looking at the dev compound records, my bet is that the wrong formula or compound name was assigned for: Trans-aconitic acid; trans-aconitic acid.

  • InfileError
    Tracer data from columns [Tracer Name: lactate-[13C3], Tracer Concentration: 148.877] on row(s) [5] does not match any of the tracers parsed from the Infusate Name [lactate-[13C3][149]] on row [5] of sheet [Infusates] in /tmp/study.xlsx.

Hmmm.... I had specifically fixed this issue with that specific tracer in one of the latest branches. My guess is you tested before the fix.

  • InfileError
    Tracer data from columns [Tracer Name: lactate-[13C3], Tracer Concentration: 434.631] on row(s) [13] does not match any of the tracers parsed from the Infusate Name [lactate-[13C3][435]] on row [13] of sheet [Infusates] in /tmp/study.xlsx.

Same issue as was fixed for the lactate tracer.

  • InfileError
    sheet [Infusates] in /tmp/study.xlsx:
    Tracer Group Name: 'None' was assigned to infusates containing the following different assortments of tracers, for the indicated 'Infusate Row Group's:
    lactate-[13C3] (on rows with 'Infusate Row Group's: [1, 4, 5, 6, 11, 12, 14, 15])
    glutamine-[13C5,15N2] (on rows with 'Infusate Row Group's: [2, 13])
    alanine-[13C3,15N1] (on rows with 'Infusate Row Group's: [3])
    glutamine-[13C5] (on rows with 'Infusate Row Group's: [7, 8, 16])
    alanine-[13C3] (on rows with 'Infusate Row Group's: [9, 10])
    Suggested resolution: Use a different group name for each 'Infusate Row Group' containing different tracers.

This will be an easy fix, if it's not fixed already.

  • MissingRecords
    7 Study records were not found (using search field(s): name) with values found in sheet [Animals] in /tmp/study.xlsx:
    tca_flux_healthy_tissues from row(s): ['2-8']

It would help to see the study doc that was submitted in this case.

  • RequiredColumnValues
    Required column values missing on the indicated rows:
    sheet [Samples] in /tmp/tmpi0rgrjj7.upload.xlsx
    Column: [Animal] on rows: ['2-22']

Do you think the Animal column in the study sheet should be auto-filled if there's only 1 animal on the animals sheet? I had considered that, but had decided that was a risky assumption that could lead to bad data.

  • ValueError
    Unable to convert column Run Length to type int64

I think I fixed this last week.

  • TypeError
    expected str, bytes or os.PathLike object, not NoneType

This may be fixed in the latest branch.

  • RequiredColumnValues
    Required column values missing on the indicated rows:
    sheet [Peak Annotation Details] in /tmp/tmpi0rgrjj7.upload.xlsx
    Column: [Sequence Name] on rows: ['2-23']

Was this before or after that column had drop-downs?

  • MissingRecords
    21 Sample records were not found (using search field(s): name) with values found in column [Sample Name] of sheet [Peak Annotation Details] in /tmp/study.xlsx:
    bat4 from row(s): ['2']
    bat5 from row(s): ['3']
    bat6 from row(s): ['4']
    bat7 from row(s): ['5']
    int4 from row(s): ['6']
    int5 from row(s): ['7']
    int6 from row(s): ['8']
    int7 from row(s): ['9']
    sol4 from row(s): ['10']
    sol5 from row(s): ['11']
    sol6 from row(s): ['12']
    sol7 from row(s): ['13']
    bat1 from row(s): ['14']
    bat2 from row(s): ['15']
    bat3 from row(s): ['16']
    int1 from row(s): ['18']
    int2 from row(s): ['19']
    int3 from row(s): ['20']
    sol1 from row(s): ['21']
    sol2 from row(s): ['22']
    sol3 from row(s): ['23']

This is due to the errors on the samples sheet (since there was no value in the animal column).

  • PASSED: 091221_tissues_corrected.xlsx
  • PASSED: 090521_tissues_corrected.xlsx

MSData Share and Submission Form

  • Date of Mass Spec Experiment - there were multiple, but only space for one. How could we better handle this?

There is a column for this in the Sequences sheet. Since the goal of this submission refactor was to create a one-stop-shop (so to speak) for a submission (for multiple reasons), we could add a column (to one of the sheets) for the MSData Share - probably in the Sequences sheet (where the date is). I have not yet formally addressed the design for the curator portion of the submission process, but the data share should be a part of it. It also needs to handle the consolidated load files, call the curator's attention to anything new or changed in those sheets, strip those sheets out of the study doc, and in some way address the data share and mzxml loading from that share.

@hepcat72
Copy link
Collaborator

Just noting that the Google form has a minor typo ("Q Excative 2") and the submission folder section references a "Information Sheet" which I assume refers to the new study workbook formatted/Donwloaded by https://tracebase.princeton.edu/DataRepo/validate . The link http://tracebase.princeton.edu/DataRepo/upload at the top of the form is not marked up for HTML(not an active link).

Incidentally, the end result of this submission refactor will be either the elimination of the google form entirely or a very stripped down version. The Study doc was designed to contain anything and everything that previously was entered in the google form. Part of the motivation for this was to create a one-top-shop that combines the disparate elements we have had for creating a submission, to streamline the process. All we really need is the email that the google form submission generates, and possibly just a listing of the files included. I imagined that this could actually be accomplished as a part of the build-a-submission process.

I just haven't gotten yet to the polishing portion of this effort where I update the instructions and stuff.

@hepcat72
Copy link
Collaborator

For the dataset 13C-Valine_PI3Ki_in_flank-KPC_mice, the Peak Annotation Files are in isocorr format but are .csv files. Tried to build a Tracebase Submission with 2 Peak Annotation files for this set, but got the following error:

  • Peak annotation files must be either Accucor or Isocorr excel files. The file type of the following supplied files: ['13C-val tissues metabolites results 146,148,149 no D-val_cor.csv', '13C-val tissues metabolites results - No D-Val_cor.csv'] could not be validated.

Then click Download button, got the following message:

  • Either an Animal/Sample Table or Peak Annotation file (e.g. Accucor or Isocorr) is required.

The second error seems like some sort of browser and/or caching issue. I'd be curious to see if you could reproduce it.

The first one is good to know. The loaders can handle csv, but it appears there's a separate check in the view code that looks at the extension. Should be a simple fix.

@hepcat72
Copy link
Collaborator

open those 2 .csv files in Excel and saved them as .xlsx files. Then tried to build a submission using those *.xlsx file. Go the following error:

  • Peak annotation files must be either Accucor or Isocorr excel files. The file type of the following supplied files: ['13C-val tissues metabolites results 146,148,149 no D-val_cor.xlsx', '13C-val tissues metabolites results - No D-Val_cor..xlsx'] could not be validated.

Type is primarily determined by the sheet names. It falls back to columns, but isocorr and isoautocorr are indistinguishable in that case. However, as with the csv check above, I suspect that this is another separate check in the view code that needs to be removed, because the error in this instance intructs the user to enter the type into the Peak Annotation Files tab.

@hepcat72
Copy link
Collaborator

hepcat72 commented Jul 16, 2024

Great ideas lance! I agree with just about everything. One of the most important parts will be to make the "Samples" and/or "Peak Annotation Details" flow as cleanly as possible.

Basically, if people name their samples in a unique way (as I try to do), then the current "Samples" sheet is ideal. But if people name their samples something like "t1, t2, t3" for every experiment, then they will need to know which data sheet the "Samples" name came from. So we have to support this somehow.

Let me see if I understand this by putting it in my own words and see if it jives with what it is you're conveying here...

You're concerned that samples from different animals in the same study could have the same name in different runs of the mass spec. E.g. Animal 1 could have samples t1 and t2... run on the mass spec on Monday. Then on Tuesday, they could run samples from animal 2, also named t1 and t2...

When they're looking at the sample sheet, they wouldn't know which samples were in which peak annotation files (because the peak annotation files don't have animal names associated with them).

My first question relates to what difficulty this caused in v2. In TraceBase v2, the sample column was (and still is in v3) required to be unique. If a researcher used the same sample names (e.g. t1 and t2), they would have been faced with an error about either the names being duplicated or the animal names conflicting. Is this something that was prompting users to rename their samples (and prepend the animal name)?

If we're on the same page, then this could be handled by auto-generating animal IDs. User's wouldn't be forced to make sample names unique (which I imagine could be a laborious undertaking). And I think revisiting the "Study ID as a prefix seed for ID generation" could eliminate that extra work. It doesn't however solve the next concern...

I wonder if we could insert a step after the user uploads data sheets, but before they download this template sample information sheet. Can we insert a way for the user to indicate which files contain the same samples?

Imagine that I upload 6 excel sheets, where samples are all labeled non-unique: "t1, t2, t3". In reality, the first 3 sheets were from one set of samples, and the last three were from a different set:

If I can indicate these sample groups before downloading the template, then the Samples tab on the Sample Info sheet could look like this (new column highlighted):

I'm curious about how the user would use this information in this context. And should the tracebase website present this information (i.e. what files each sample is in)?

So, in other words, if we provided (in the Samples sheet) a column showing which file(s) each sample is in, at what point is this information useful and how will they act on it? To give you an idea of what I'm asking, I'll tell you my wild guess at when they need this info and how they would use it:

  1. They're fleshing out the samples sheet and they see a sample name that they know is in multiple files and is derived from multiple animals.
  2. They realize that it's not 1 sample (as was autofilled - it's 2), and they need to manually create a new row, change one of the sample names, and enter the animals.

I have a few thoughts if this is the case.

  1. Assuming they're working on the Samples sheet alone, the Peak Annotation Details sheet will subsequently contain incorrect data: it will associate all the peak annotation files with the original sample name and will be missing the added sample name. This will need to be fixed.
  2. If item 1 is correct, the only error that will result from the data change will be that the added sample is missing from the peak annotation details sheet. (That's something I should have a test to ensure that happens.)
  3. A researcher may then attempt to manually add the new sample (instead of update the row associating the file to the wrong sample). That would then produce an error (probably a ConflictingValueError).

If this conjecture is all correct, that's an interesting problem. And it would suggest that merging the Peak Annotation Details and Samples sheets could solve this problem (though it would create a number of other annoying problems...). I'll have to think about this. Ideally, for now, a temporary solution that keeps the sheets separate would be the most expedient solution. The sheets can certainly be merged. The overall design would support that, but it's more of a retool appropriate for a subsequent effort. If we just add a column that comma-delimits the files, it wouldn't be used for loading, but it could be used for a consistency check so that a discrepancy between the sheets doesn't occur.

@hepcat72
Copy link
Collaborator

hepcat72 commented Jul 16, 2024

The following are my rough notes from the reports above. I bolded (and tagged myself in) the ones that I will work on immediately (tomorrow). I avoided most of the issues that are covered in the discussion about the build a submission page/form.

Issues from submission refactor testing:

Command line study loading

Pre-processing script (extract consolidated data and make the curator confirm new entries) #1049

Peak AnnotationsLoader issues

Sequence column/sheet confusion issues:

Build a Submission form issues

Google submission form

Study doc orienting issues

Infusate issues

Peak Annotation Details issues

Samples sheet issues

  • Add an instruction that only the samples that should be loaded should be present in this sheet. I.e. if a blank sample was autofilled here, just delete that row, and use the Peak Annotation Details sheet's Skip column to tell the peak annotations loader to skip it) -> Add a skip column to every sheet via TableLoader #1102

Validation error issues -> #1048

  • Compounds sheet was unaltered, but resulted in ConflictingValueErrors
    • This was a formula difference. This might be a real error. Get-or-create couldn't find the compound using the name and formula perhaps and added a row, causing a difference @hepcat72
  • I think these errors were fixed in the latest version, but check that they are no longer an issue:

Issues harvested from TODO lists in #829

Embedding errors in excel

V2 version support

  • Auto-detect study doc version
  • Error gracefully when version not supported
  • Write a converter for v2 to v3 @hepcat72 Add support for version 2 of the study doc #1047
    • Animal sample sheet (v2) conversion notes:
      • Rename changed column headers
      • Animals
        • Delete tracer concentrations column
        • Add concentrations to the infusate column
        • Delete Study description column
        • Copy Study name and description to study sheet
        • Renames:
          • Animal Genotype -> Genotype
          • Animal Body Weight -> Weight
          • Animal Treatment -> Treatment
          • Study Name -> Study
          • Animal ID -> Animal Name
      • Samples
        • Renames:
          • Sample Name -> Sample
          • Animal ID -> Animal
          • Researcher Name -> Researcher
      • Tissues
        • Renames:
          • TraceBase Tissue Name -> Tissue

Nice-to-haves (misc.)

  • Time- and/or date-stamp the file name -> Add an "About" sheet #1108
  • When building a study doc, if the study exists, automatically add all existing data to the spreadsheet -> Add existing study data to the template #1109
  • In addition to "errors" & "warnings" in the validation report, add a new category: either "info" or "additions" or "updates" or "autofills" or "fixes"...? - any of which could change a FAILED status to PASSED. Perhaps even PASS/FAIL can be changed to some sort of readiness category or a progress status/grid that indicates completeness of: required animal/sample data, suggested animal/samples data,dependent data (tissues/treatments/compounds/etc), run metadata, and a count of errors/warnings state. -> Add stats to the submission page status report #1110
  • Add stats to the results: number of errors, records that would be created, data added, etc. -> Add an "About" sheet #1108 & Add existing study data to the template #1109
  • Change the date format in the exported excel file to conform with our string representation:
    writer = pd.ExcelWriter("pandas_datetime.xlsx",
                            engine='xlsxwriter',
                            datetime_format='mmm d yyyy hh:mm:ss',
                            date_format='mmmm dd yyyy')
    
  • Apply data validation formulas [TBD] (e.g. integers/floats are valid numbers, dates are dates, enums have matching values, etc) -> Develop excel formulas to validate column values #1054
  • Ability to supply suffixes to remove from sample names (e.g. "_pos", "_neg", "_scan[0-9]+") -> Add scan labels input to autofill options/fields #1111
  • [ ] Checkbox (default-selected) to "repair" problems (and notate problems with comments in the excel)
  • [ ] Change the study file field to a text field with "study.xlsx" and a button next to it named something like "add to existing" that changes the text field to a file field
  • Save these things in cookies so that settings are saved and hide in an "advanced"/"settings" widget in the top-right corner -> Save submission page settings in cookies #1112

@hepcat72 hepcat72 added the type:task Chore not directly resulting in a code change label Jul 18, 2024
@lparsons lparsons added this to the Submission - Phase 1 milestone Jul 18, 2024
@lparsons lparsons added the priority:1-critical Blocking issue that is critical to functionality label Jul 18, 2024
@mneinast
Copy link
Collaborator

Rob, I think you mentioned something important.

If there are sample name conflicts within a study (ie the same name representing different samples), then the user can submit data separately (assuming Tracebase is doing something in backend that tracks sample ID upon submission?). So the solution to perform multiple submissions is viable (and acceptable to me).

Am I understanding it correctly?

@hepcat72
Copy link
Collaborator

Sample names are required to be globally unique, so whether they're submitted in 1 submission or separate submissions, 1 study, or separate studies, it doesn't matter.

@lparsons
Copy link
Contributor Author

@hepcat72 Once you are able to run through the process and create a submission, I think we can "close" this issue and create a new one for further testing once we get some of the Phase 1 flow improvements in place (#1040).

@hepcat72
Copy link
Collaborator

@hepcat72 Once you are able to run through the process and create a submission, I think we can "close" this issue and create a new one for further testing once we get some of the Phase 1 flow improvements in place (#1040).

Just as a status update, I've been using my efforts with the v2 support as a driver to fix these issues and get to a working version (in terms of the basic functionality). Let me go through and check off the items in this task that I've fixed on my current branch. I may need to break up the PR, because in order to get 1 thing working, a bunch of other things had to be fixed, so this PR addresses more issues than I'd set out to address.

@lparsons
Copy link
Contributor Author

Thanks Rob. Just to clarify, this issue is simply for keeping track of the testing and feedback from that testing. The bugs and issues uncovered are being tracked in separate issues.

@hepcat72 hepcat72 added the change:split-issue Split this issue into multiple issues label Aug 3, 2024
@hepcat72
Copy link
Collaborator

hepcat72 commented Aug 4, 2024

Now that I created an issue for every checkboxed item where I summarized everything, I will close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
change:split-issue Split this issue into multiple issues component:submission Submission process page and logic priority:1-critical Blocking issue that is critical to functionality type:task Chore not directly resulting in a code change
Projects
None yet
Development

No branches or pull requests

5 participants