Create a checklist for GenBank submission -- begin to automate for high throughput #106
Comments
Of course, we need to follow these best practices: https://www.nature.com/articles/nbt.4306
I think the fastest bioinformatic path is to use Prokka in virus mode and generate all of the files necessary for GenBank submission. If we want a quality annotation to go along with it (and I strongly advise this), then we should look to virus-specific annotation resources, as posted on the Wiki.
If we want to knock it out of the park, then we should lean on Robert's MUSCLE(s) when it comes to HMM design and search: build HMMs for all conserved coronavirus proteins, and use them to annotate the novel coronavirus genomes. Of course, to build the HMMs, we first need a basic systematic annotation of the known coronavirus genomes.
Genome annotations can be improved and resubmitted to GenBank, but in reality, unless there is a funded model organism database behind them, that doesn't happen too often. I'd say let's agree on a minimal quality level that we can all be happy with, and then get it done. |
Where is the image from, BTW? |
We'll have HMM models for |
Edit: Deleted premature / uninformed comment by me. |
Edit: RTFM (me). The Prokka tool mentioned by @taltman looks at first glance to be capable of high-throughput annotation with output in GenBank format. My bad. It would be fantastic if someone could volunteer to set up Prokka for this... |
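For whoever picks this up, here is a minimal sketch of what a batch Prokka setup might look like. It only builds the command lines and does not run anything. The helper names, the `*.fa` file pattern, the directory layout and the `XXX` centre ID are placeholders assumed for illustration, not anything agreed in this thread; the flags are standard Prokka options, but please check them against `prokka --help` for the installed version.

```python
from pathlib import Path
from typing import List


def prokka_command(fasta: str, outdir: str, prefix: str,
                   centre: str = "XXX", cpus: int = 4) -> List[str]:
    """Build (but do not run) a Prokka command line for one viral genome.

    All argument values are hypothetical placeholders.
    """
    return [
        "prokka",
        "--kingdom", "Viruses",  # annotate in virus mode
        "--compliant",           # request GenBank/ENA/DDBJ-compliant output
        "--centre", centre,      # sequencing centre ID (placeholder value)
        "--outdir", outdir,      # one output directory per genome
        "--prefix", prefix,      # basename for the generated files
        "--cpus", str(cpus),
        fasta,                   # input contigs in FASTA format
    ]


def batch_commands(fasta_dir: str, out_root: str) -> List[List[str]]:
    """Build one Prokka command per *.fa file found in fasta_dir."""
    commands = []
    for fasta in sorted(Path(fasta_dir).glob("*.fa")):
        name = fasta.stem
        commands.append(
            prokka_command(str(fasta), str(Path(out_root) / name), name)
        )
    return commands
```

Each command could then be handed to `subprocess.run(cmd, check=True)`. Keeping the command construction separate from execution makes it easy to review or log exactly what will be run before launching a large batch.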
Meta-data Required
Submission Category
The category of submission we fall under would be "TPA:Inferential". See: https://www.ncbi.nlm.nih.gov/genbank/tpa/
Annotation Features
We can officially submit sequences without annotation, so there is no lower requirement. We can do a first pass annotation and add the obvious/easy meta-data and note entries where we are not satisfied and that will require better annotation. This is likely to be manual and time-intensive work so I suggest if this ends up LWIA we opt to 'crowd source' it to virologists qualified to do so. We should still aim for a good high-throughput annotation pipeline. |
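As a rough sketch of the "note entries where we are not satisfied" step, a first-pass triage could flag features that need a closer look. The feature dictionaries and the function name below are hypothetical, and the only rule applied (missing product, or a product of "hypothetical protein", which is the label Prokka gives to CDS it cannot name) is an assumed starting point rather than an agreed quality threshold.

```python
from typing import Dict, List


def needs_manual_review(features: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the features whose annotation looks incomplete.

    A feature is flagged when it has no product name or when the product
    is 'hypothetical protein'. This rule is an assumption for illustration.
    """
    flagged = []
    for feature in features:
        product = feature.get("product", "").strip().lower()
        if not product or "hypothetical protein" in product:
            flagged.append(feature)
    return flagged


# Made-up example records, not real annotation output.
example = [
    {"locus_tag": "EX_0001", "product": "spike glycoprotein"},
    {"locus_tag": "EX_0002", "product": "hypothetical protein"},
    {"locus_tag": "EX_0003"},
]
review_queue = needs_manual_review(example)
```

The flagged list could then be exported as the work queue for the virologists doing manual curation.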
Note, the TPA page says:
From their FAQ page:
Our workflow is complex, and doesn't fit any of their neat bins exactly. Will reach out to my contacts at NCBI for guidance. |
Our annotations will be TPA inferential: "A database of sequences annotated by inference, where the source molecule or its product(s) have not been the subject of direct experimentation." |
Emails sent, will update as I get more guidance. |
I have received an initial email from the GenBank team. They have asserted the following:
I'm not clear what is meant by inferential evidence that is neither experimental nor computational. They pointed me to the following webpage, which gives a number of example "TPA:inferential" scenarios, but I'm still unclear about the actual definition: https://www.ncbi.nlm.nih.gov/genbank/tpa-inf/ I've sent a quick reply asking for a fuller description of what constitutes inferential evidence. My rough idea is that it involves indirect experimental evidence for a sequence or for the annotation of the sequence. |
Upon further reading of:
So however you slice it, our sequences and annotations seem to need experimental evidence of some flavor before we can submit them to GenBank as TPA:inferential. |
This issue encompasses a set of submission issues which can be merged here to close this issue.
|
We are now generating novel CoV sequences that are of high quality (complete assembled genomes or near-complete genomes). High quality sequences like Frank (Fr4NK?) and Ginger need to be deposited into the public GenBank repository ASAP.
As we expand analysis/assembly the volume of data we generate is going to explode and we will need to automate this process.
Examples of good CoV Annotation
Some questions I had on this.