Task: Evaluate data submission process #1034
Comments
The template generated was helpful: lots of comments explaining the fields, and dropdowns for various selections. Overall, a much better experience than the empty template from earlier versions. Below are some comments on areas I think we could improve. I had trouble with the validation, and ultimately decided to simply submit, since I think the errors I got were not "real", but I wasn't entirely certain. The errors did change somewhat after successive submissions.
Initial template creation
Study Doc
Validation
Validation results
Study Doc submitted (along with the above-linked peak annotation files): study.xlsx
MSData Share and Submission Form
That's a good insight. I think the initial design was intentionally bare-bones and conceptually clean in the first v3 iteration (where it just downloaded a stubbed sheet and had validation as an afterthought). Perhaps instead of an instructions page and different pages with tasks, it could be more like what you see when you book a flight or fill out a survey: progressive steps shown at the top, to give you a sense of how far along in the process you are... meaning you don't have to go back to the instructions page to get to the next step.
This insight dovetails with my ideal above of a progressive process. (It could also save progress in a cookie.) The first step could be to submit all your peak annotation files (only) and get back the stubbed sheet. The second step could take the one study doc and validate it, thereby clearing up the confusion over what to submit when. There wouldn't be a way to catch multiple representations in the peak annotation files by doing it this way, but maybe that could be a custom check when creating the stub study sheet.
Yes. That is nearly exactly what I have planned. I want to add stats to the status report with exactly that info (and more).
I agree. It makes more sense to me when you frame it in the context of the user's confusion/first impression. I believe you had suggested this before, and I hadn't fully appreciated where it was coming from.
Study ID is slated for removal. I expect it will be removed at some point before this gets to production, but I like the idea of instructions on the initial page of the study doc. Maybe that's not what you meant, but that's immediately where my head went.
Yeah, I haven't touched the stale instructions page yet. I did try to order the tabs in the order you'd fill them out. Perhaps I was thinking top down (study name/desc, animals, samples...), but it sounds like perhaps you're suggesting bottom up? That makes some sense.
That's not a bad idea, conceptually, but each sheet is independent in terms of the loading. The loaders don't access other sheets (with the exception of the defaults sheet, which you haven't seen yet). It keeps everything cleaner and simpler, code-wise. But here's what I suggest: enforcing one study is a decent idea, and we could accomplish the same thing you suggest simply by making the study column populated by formula (i.e., read-only). The formula could just grab the name on the first row of the study sheet.
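For illustration, here's a minimal sketch of that populate-by-formula idea, assuming openpyxl as the template-writing library (the actual TraceBase template code may use something else, and the sheet/cell layout here is hypothetical):

```python
# Hypothetical sketch: pre-populate the Study column of another sheet with a
# formula that mirrors the first study name, so the user never types it.
import openpyxl

wb = openpyxl.Workbook()
study = wb.active
study.title = "Study"
study.append(["Name", "Description"])  # header row; user fills row 2

animals = wb.create_sheet("Animals")
animals.append(["Animal", "Study"])
for row in range(2, 12):
    # Every row's Study cell reads the name on the first data row of the
    # Study sheet, making the column effectively read-only.
    animals.cell(row=row, column=2, value="=Study!$A$2")

wb.save("study_template.xlsx")
```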
They are entered in separate fields. This is done in the Infusates sheet. No one should ever need to type out the name. I had hoped that it being a drop-down field would make it clear enough, but perhaps not. But this brings up a peculiarity that I'm just now realizing. Having concentrations in the infusate names is how the names have been represented on TraceBase since we started supporting multiple tracers. It was necessary for uniqueness, because we'd run into unique constraint violations in the Infusate model. But now that I think about it, unique constraint violations make no sense: the loader should never have tried to load the same Infusate again. It might have been the result of how the loaders were originally written, and that code solidified by making the name include the concentrations in the database instead of fixing the loading code. And yep... I just checked. We do have separate Infusate records for different sets of tracer concentrations; there's a separate record for each different set of concentrations. I think that's a schema design flaw that we let slip through, and that problem of representing unique infusate concentration possibilities was propagated to the sheets/loaders. Now that I see it, I'm realizing this goes pretty deep. If we fix the schema, we could break those two fields apart again, like you're suggesting. In fact, I think it would make a lot of code much simpler, but the change would affect a lot of things, from the model, to the templates, to the loaders. That said, concentrations are entered individually and selected by dropdown, which is the solution that was suggested; it could probably be made clearer that that's what the user should do. For now, that may be the way to go, and we can plan out the overhaul of the Infusate model problem later. I'm going to break up my response into different comments. This last one was a big one and I don't want to bury it.
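To make the "break the two fields apart" idea concrete, here's a purely illustrative Django sketch (hypothetical model and field names, not the actual TraceBase schema): the infusate is unique per tracer mix, and concentrations live on their own linking record instead of being encoded in the name.

```python
from django.db import models


class Tracer(models.Model):
    name = models.CharField(max_length=256, unique=True)


class Infusate(models.Model):
    # Unique per tracer mix only; no concentrations baked into the name, so
    # the same mix at different concentrations is one record, not several.
    name = models.CharField(max_length=256, unique=True)
    tracers = models.ManyToManyField(Tracer, through="InfusateTracer")


class InfusateTracer(models.Model):
    infusate = models.ForeignKey(Infusate, on_delete=models.CASCADE)
    tracer = models.ForeignKey(Tracer, on_delete=models.CASCADE)


class Animal(models.Model):
    name = models.CharField(max_length=256, unique=True)
    infusate = models.ForeignKey(Infusate, null=True, on_delete=models.SET_NULL)


class AnimalTracerConcentration(models.Model):
    # Concentration as a property of a given animal's infusion, entered in
    # its own column, rather than as part of the infusate's identity.
    animal = models.ForeignKey(Animal, on_delete=models.CASCADE)
    tracer = models.ForeignKey(Tracer, on_delete=models.CASCADE)
    concentration = models.FloatField()
```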
I don't disagree that, from the user's perspective, this might seem weird at first, though there would be a lot of problems to solve to make the merge happen. We had discussed this design decision (separating samples from what had initially been referred to as the "LCMS metadata file", now the "Peak Annotation Details" sheet) at length a year or so ago, and I do recall that we flipped back and forth between separate versus merged metadata. Some of the problems we would have to solve (as we discussed a year ago) are:
I don't disagree that there is a problem here from a user perspective. It seems more complex than it needs to be, and I think we can fix that, just in a different way. The first thing we can take advantage of is that the user should never actually have to touch the "Peak Annotation Details" sheet. It is (or rather will be) completely automated. The only missing entries (mzXML File Name and Sequence Name, plus the skip column, but let me circle back to that later) are almost always unnecessary. The Default Sequence Name in the "Peak Annotation Files" sheet is all that's really needed. I've never seen a case where one peak annotation file utilized data from multiple sequences. And once I add the placeholder record link to handle multiple mzXMLs with the same name, the only reason to use that column would be if the sample header didn't match the file name. As far as the skip column goes, I have a plan for users to supply skip strings from the build-a-submission page form. I'd considered putting it in the Samples sheet, but the Peak Annotation Details are needed by the peak annotation loaders to skip loading, and the Samples sheet, again, should only contain samples we actually want to load, IMHO.
They did for me. Type one letter, and the highlighted entry (starting with that letter) will be entered when you hit tab. Can't speak to LibreOffice, though.
It will be a dropdown based on entries in the Sequences tab. It's essentially the same as the researcher, protocol, instrument, and date entries in the yaml associated with individual peak annotation files. John suggested putting the Sequences sheet before the files sheet. Do you think that could fix the misunderstanding there?
See above. Suggestions to make that clearer are welcome, but remember the intent is to link between sheets with a single column. Maybe rename it "Researcher, Protocol, Instrument, and Date"?
The goal was to actually empower the users to add to those sheets without (initially) having to interact with us to move forward, which was a complaint this was meant to address. I'll have to respond to the rest later...
For the dataset 13C-Valine_PI3Ki_in_flank-KPC_mice, the Peak Annotation Files are in
Then I clicked the Download button and got the following message:
I opened those 2 .csv files in Excel and saved them as .xlsx files, then tried to build a submission using those .xlsx files. Got the following error:
Just noting that the Google form has a minor typo ("Q Excative 2") and the submission folder section references an "Information Sheet", which I assume refers to the new study workbook formatted/downloaded by https://tracebase.princeton.edu/DataRepo/validate . The link http://tracebase.princeton.edu/DataRepo/upload at the top of the form is not marked up as HTML (not an active link).
[Whoops. I just realized I already responded to this and the next comment, but just to keep this train of thought (which I ruminated on all weekend...).] Being able to identify which files each sample is found in would be a new feature. I'm a little surprised this hasn't been asked for before, but it prompts the question: what would a researcher use this information for? The more we know about the answer to that question, the better we can design this new feature. And should we also provide the same feature on the samples page of the website (or do we already, and I'm not aware of it)?
I think that's a good simple design option.
I thought more about the common design pattern that flight booking websites use, and I think that could really clarify the process. We could even have a "skip" button (like they often have) for optional steps [i.e. for the validation step].
Yeah, that was an attempt to not overwhelm the user. I think that once errors are incorporated into the study doc, we won't need to hide anything - and that is where the stats can go.
Hmmm. Could it be a synonym entered as the primary compound name? That's the only thing I can think of without seeing the error; if you have it, could you paste it? This is also one reason why I had considered coloring data that comes either directly from the database and/or from the (dumb) autofill. The only potential change to the compounds sheet in this context should be new compounds appended to the bottom, but since all this data should be in the db, there should be no change. Maybe... another possibility could be an equivalent (but differently written) formula. Hard to tell without the error.
I'm not surprised at the existence of weird errors like this, since I haven't touched the validation aspect of the loaders yet, only the autofill and sheet metadata (i.e., comments, etc.). I'm sure I'll see this and other errors that need minor fixes once I get into it. I'll probably compile a list of these minor issues. If I recall correctly though (from reading the first half last week), you hadn't entered any row group numbers? That may lead to strange errors, although those should (have been) pre-empted by
Another result of my not having worked on the validation aspect yet; not unexpected. These should all be simple fixes, though there's one design decision I haven't settled on. In the first version of the validation code, I skipped processing the AccuCor loader when there was an issue with the old samples loader. In v2, I realized that there could still be novel errors (like multiple representations), so I allowed it to run and handled cascading errors by grouping them into their own categories (like missing compounds). In v3, there are more than just 2 loaders (or 3, counting the old protocols loader), and there are more possible cascading errors, so there are likely to be errors that should be grouped like the missing compounds, tissues, and treatments errors, such as with the infusates/tracers.
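As a rough illustration of what that grouping could look like (hypothetical names; not the actual loader code), cascading errors that share a root cause get bucketed into one category instead of repeating per row:

```python
from collections import defaultdict


class MissingRecordError(Exception):
    """Stand-in for a cascading error whose root cause is a missing record."""

    def __init__(self, model, name):
        self.model, self.name = model, name
        super().__init__(f"Missing {model}: {name}")


def group_errors(errors):
    """Bucket errors by (model, name) so each root cause is reported once."""
    groups = defaultdict(list)
    for err in errors:
        if isinstance(err, MissingRecordError):
            groups[(err.model, err.name)].append(err)
        else:
            groups[("other", type(err).__name__)].append(err)
    return groups


# e.g. 50 peak rows referencing one missing compound collapse into one group:
errs = [MissingRecordError("Compound", "glucose") for _ in range(50)]
for (model, name), bucket in group_errors(errs).items():
    print(f"{model} '{name}': {len(bucket)} affected rows")
```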
I'd thought about a feature to skip empty rows a while back, but assigned it a low priority. I can certainly implement it.
Was that for the "unknown" row? I thought I'd addressed that in the latest version. Do you know which branch was on dev when you did your test?
That's another one I think I fixed in the latest version.
Did you see the dropdown in that tab? This might, again, be one of the previous branches, before I'd added the dynamic dropdowns (which are populated by other sheets; in this case, the Sequences sheet). I thought a lot over the weekend about the "Sequence Name" and "Default Sequence Name" columns. I had a number of thoughts.
And if you were working from a slightly older branch: I did add automatic comments that explain the reference to the "Sequences" sheet when a column has a dynamic dropdown populated by another sheet.
Yeah, this is very likely due to errors in the Samples sheet. If any samples had an error, then they don't get (temporarily) loaded, so any reference from the peak annotation files to those samples will produce that error. This was part of the initial reason for skipping subsequent loader runs. I think this can be handled either by re-employing the loader skips or by tracking samples with errors from the StudyLoader. Again, I just haven't focused on this part yet.
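A sketch of the second option, tracking samples with errors (hypothetical function names; the real StudyLoader API differs): the samples loader records each failing sample name, and the peak annotation loaders consult that set so a downstream "sample not found" is reported as a cascading error rather than a fresh one.

```python
failed_samples: set[str] = set()


def load_samples(rows):
    for row in rows:
        try:
            create_sample(row)  # stand-in for the real record creation
        except ValueError:
            # The root cause is reported once, here, by the samples loader
            failed_samples.add(row["Sample"])


def create_sample(row):
    if not row.get("Animal"):
        raise ValueError(f"Sample {row['Sample']}: Animal is required")


def is_cascading_sample_error(sample_name: str) -> bool:
    """Peak annotation loaders call this before emitting 'sample not found'."""
    return sample_name in failed_samples


load_samples([{"Sample": "t1", "Animal": ""}, {"Sample": "t2", "Animal": "a1"}])
print(is_cascading_sample_error("t1"))  # True: suppress the downstream error
```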
Study ID is slated for removal, but hmmm... I should probably remove study description as a required-value column? That may have been behavior inherited from the old loader, or perhaps I had just assumed we should require it.
Ah! Yes. This appears real. I suspect what happened here was a get-or-create from the compound name and formula in one of the peak annotation files, which prompted an autofill to the compounds sheet. And, looking at the dev compound records, my bet is that the wrong formula or compound name was assigned for:
Hmmm.... I had fixed this issue with that specific tracer in one of the latest branches. My guess is you tested before the fix.
Same issue as was fixed for the lactate tracer.
This will be an easy fix, if it's not fixed already.
It would help to see the study doc that was submitted in this case.
Do you think the Animal column in the study sheet should be auto-filled if there's only 1 animal on the animals sheet? I had considered that, but had decided that was a risky assumption that could lead to bad data.
I think I fixed this last week.
This may be fixed in the latest branch.
Was this before or after that column had drop-downs?
This is due to the errors on the samples sheet (since there was no value in the animal column).
There is a column for this in the Sequences sheet. Since the goal of this submission refactor was to create a one-stop shop (so to speak) for a submission (for multiple reasons), we could add a column (to one of the sheets) for the MSData Share, probably in the Sequences sheet (where the date is). I have not yet formally addressed the design for the curator portion of the submission process, but the data share should be a part of it. It also needs to handle the consolidated load files, call the curator's attention to anything new or changed in those sheets, strip those sheets out of the study doc, and in some way address the data share and mzXML loading from that share.
Incidentally, the end result of this submission refactor will be either the elimination of the Google form entirely or a very stripped-down version of it. The study doc was designed to contain anything and everything that was previously entered in the Google form. Part of the motivation for this was to create a one-stop shop that combines the disparate elements we have had for creating a submission, to streamline the process. All we really need is the email that the Google form submission generates, and possibly just a listing of the files included. I imagine this could actually be accomplished as part of the build-a-submission process. I just haven't gotten to the polishing portion of this effort yet, where I update the instructions and such.
The second error seems like some sort of browser and/or caching issue. I'd be curious to see if you could reproduce it. The first one is good to know. The loaders can handle csv files, so that rejection is likely a separate check in the view code.
Type is primarily determined by the sheet names. It falls back to columns, but isocorr and isoautocorr are indistinguishable in that case. However, as with the csv check above, I suspect that this is another separate check in the view code that needs to be removed, because the error in this instance instructs the user to enter the type into the Peak Annotation Files tab.
Let me see if I understand this by putting it in my own words, to see if it jibes with what you're conveying here... You're concerned that samples from different animals in the same study could have the same name in different runs of the mass spec. E.g., animal 1 could have samples t1 and t2, run on the mass spec on Monday. Then on Tuesday, they could run samples from animal 2, also named t1 and t2. When they're looking at the sample sheet, they wouldn't know which samples were in which peak annotation files (because the peak annotation files don't have animal names associated with them). My first question relates to what difficulty this caused in v2. In TraceBase v2, the sample column was (and still is in v3) required to be unique. If a researcher used the same sample names (e.g. t1 and t2), they would have been faced with an error about either the names being duplicated or the animal names conflicting. Is this something that was prompting users to rename their samples (and prepend the animal name)? If we're on the same page, then this could be handled by auto-generating animal IDs. Users wouldn't be forced to make sample names unique (which I imagine could be a laborious undertaking), and I think revisiting the "Study ID as a prefix seed for ID generation" idea could eliminate that extra work. It doesn't, however, solve the next concern...
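A tiny sketch of what that "Study ID as a prefix seed" idea might look like (the ID format here is hypothetical): uniqueness comes from the generated identifier, so researchers can keep reusing t1/t2 across animals without hand-renaming anything.

```python
def make_sample_id(study_id: str, animal: str, sample_name: str) -> str:
    # Hypothetical format: study + animal disambiguate reused sample names
    return f"{study_id}_{animal}_{sample_name}"


print(make_sample_id("ST0042", "animal1", "t1"))  # ST0042_animal1_t1
print(make_sample_id("ST0042", "animal2", "t1"))  # ST0042_animal2_t1 (no clash)
```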
I'm curious about how the user would use this information in this context. And should the TraceBase website present this information (i.e., what files each sample is in)? In other words, if we provided (in the Samples sheet) a column showing which file(s) each sample is in, at what point is this information useful, and how would they act on it? To give you an idea of what I'm asking, I'll tell you my wild guess at when they need this info and how they would use it:
I have a few thoughts if this is the case.
If this conjecture is all correct, that's an interesting problem, and it would suggest that merging the Peak Annotation Details and Samples sheets could solve it (though it would create a number of other annoying problems...). I'll have to think about this. Ideally, for now, a temporary solution that keeps the sheets separate would be the most expedient. The sheets can certainly be merged; the overall design would support that, but it's more of a retool appropriate for a subsequent effort. If we just add a column that comma-delimits the files, it wouldn't be used for loading, but it could be used for a consistency check so that a discrepancy between the sheets doesn't occur (see the sketch below).
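Here's a minimal sketch of that consistency check (column names are hypothetical): the comma-delimited column is never loaded, only compared against what the Peak Annotation Details sheet actually says.

```python
def check_sample_file_consistency(samples_rows, details_rows):
    """Return (sample, listed_files, actual_files) tuples wherever the
    Samples sheet's comma-delimited file column disagrees with the
    Peak Annotation Details sheet."""
    actual = {}
    for d in details_rows:
        actual.setdefault(d["Sample Name"], set()).add(
            d["Peak Annotation File Name"]
        )
    problems = []
    for s in samples_rows:
        listed = {
            f.strip()
            for f in s.get("Peak Annotation Files", "").split(",")
            if f.strip()
        }
        found = actual.get(s["Sample"], set())
        if listed != found:
            problems.append((s["Sample"], listed, found))
    return problems
```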
The following are my rough notes from the reports above. I bolded (and tagged myself in) the ones that I will work on immediately (tomorrow). I avoided most of the issues that are covered in the discussion about the build-a-submission page/form.
Issues from submission refactor testing:
Command line study loading
Pre-processing script (extract consolidated data and make the curator confirm new entries) #1049
PeakAnnotationsLoader issues
Sequence column/sheet confusion issues:
Build a Submission form issues
Google submission form
Study doc orienting issues
Infusate issues
Peak Annotation Details issues
Samples sheet issues
Validation error issues -> #1048
Issues harvested from TODO lists in #829
Embedding errors in Excel
V2 version support
Nice-to-haves (misc.)
Rob, I think you mentioned something important. If there are sample name conflicts within a study (i.e., the same name representing different samples), then the user can submit data separately (assuming TraceBase is doing something in the backend that tracks sample IDs upon submission?). So the solution of performing multiple submissions is viable (and acceptable to me). Am I understanding it correctly?
Sample names are required to be globally unique, so whether they're submitted in 1 submission or separate submissions, 1 study or separate studies, it doesn't matter.
Just as a status update: I've been using my efforts on the v2 support as a driver to fix these issues and get to a working version (in terms of the basic functionality). Let me go through and check off the items in this task that I've fixed on my current branch. I may need to break up the PR, because in order to get one thing working, a bunch of other things had to be fixed, so this PR addresses more issues than I'd set out to address.
Thanks Rob. Just to clarify, this issue is simply for keeping track of the testing and the feedback from that testing. The bugs and issues uncovered are being tracked in separate issues.
Now that I've created an issue for every checkboxed item and summarized everything, I will close this issue.
Goal
Act as a researcher and run through the submission process to better understand the software, file formats, etc. Also, review the process for pain points. Identify bugs, suggest improvements, and share those with the group.
Submission Process
Our goal right now is to evaluate the submission process. At a later stage, we'll go through and process these submissions to ensure we can get the submitted data loaded.
Below are the sample data sets that we will use for our evaluations. I've selected a variety of formats and assigned some sets to different people. Please check off the datasets when you have completed the submissions.
- '121105_serum leuile_corrected.xlsx'
- Metadata from 'TraceBase Animal and Sample Table_exp027f.xlsx' and load_study.sh