Create a checklist for GenBank submission -- begin to automate for high throughput #106

ababaian · 2020-05-20T05:36:32Z

We are now generating novel CoV sequences of high quality (complete or near-complete assembled genomes). High-quality sequences like Frank (Fr4NK?) and Ginger need to be deposited into the public GenBank repository ASAP.

As we expand analysis/assembly, the volume of data we generate is going to explode, and we will need to automate this process.

- Collect the best versions of Frank and Ginger and initiate a GenBank submission for these sequences.
- Create an inventory of the annotations and meta-data which we will need to attach (and work out how we can automate this process).
- With our meta-data 'inventory' we can build an 'annotation' pipeline that generates specifically this data as a "deliverable" sequence. For our own use we can have more annotations, but we need the core set required by GenBank.
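As a rough illustration of the 'inventory' idea, here is a minimal sketch that checks a metadata record against a core field list and separates the "deliverable" fields from our internal extras. The field names in `CORE_FIELDS` are an assumption (common GenBank source qualifiers), not a confirmed list of what GenBank requires for these submissions, and the function names are hypothetical.

```python
# Hypothetical core set: replace with the inventory we actually agree on
# after checking the current GenBank submission requirements.
CORE_FIELDS = ("organism", "isolate", "host", "country", "collection_date")


def missing_core_fields(record):
    """Return the core fields that are absent or blank in a metadata record."""
    missing = []
    for name in CORE_FIELDS:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


def split_deliverable(record):
    """Split a record into (core fields for submission, extra internal fields)."""
    core = {k: record[k] for k in CORE_FIELDS if k in record}
    extra = {k: v for k, v in record.items() if k not in CORE_FIELDS}
    return core, extra
```

A record with an empty `missing_core_fields` result would be ready for the submission step; anything else goes back for curation.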

Examples of good CoV Annotation

Some questions I had on this.

Across distant CoV (e.g. Alpha vs. Delta), are all the proteins more or less conserved? If so, then we need a classifier tuned specifically for each of the ~25 ORFs in CoV.

taltman · 2020-05-20T06:07:21Z

Of course, we need to follow these best-practices:

https://www.nature.com/articles/nbt.4306

I think the fastest bioinformatic path is to use Prokka in virus mode, and generate all of the files necessary for GenBank submission.
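For reference, a minimal sketch of what a Prokka call in virus mode could look like, wrapped in Python so it can be looped over many assemblies. It only builds the command line; `--kingdom Viruses`, `--compliant`, `--outdir`, `--prefix` and `--locustag` are existing Prokka options, while the helper name, file names and locus tag below are placeholders, not anything we have settled on.

```python
import shlex


def prokka_command(fasta, outdir, prefix, locustag=None):
    """Build a Prokka command line for viral annotation (does not run it)."""
    cmd = [
        "prokka",
        "--kingdom", "Viruses",   # virus annotation mode
        "--compliant",            # ask Prokka for GenBank/ENA/DDBJ-compliant output
        "--outdir", outdir,
        "--prefix", prefix,
    ]
    if locustag:
        cmd += ["--locustag", locustag]
    cmd.append(fasta)             # the contigs FASTA goes last
    return cmd


if __name__ == "__main__":
    # Placeholder file names; print the command instead of executing it.
    cmd = prokka_command("frank.fa", "annot/frank", "frank", locustag="FRNK")
    print(" ".join(shlex.quote(part) for part in cmd))
```

To actually run it, the list can be passed to `subprocess.run(cmd, check=True)`, assuming Prokka is installed and on the `PATH`.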

If we want to have a quality annotation to go along with it (and I strongly advise this), then we should look to virus-specific annotation resources, as posted on the Wiki.

If we want to knock it out of the park, then we should lean on Robert's MUSCLE(s) when it comes to HMM design and search, to build HMMs for all coronavirus conserved proteins, and use that to annotate the novel coronavirus genomes. Of course, to build the HMMs, we need to have a basic systematic annotation of the known coronavirus genomes.
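To make the HMM route concrete, here is a minimal sketch assuming HMMER is the toolkit: one `hmmbuild` per conserved protein from a curated alignment, one `hmmsearch` against the predicted proteins of a novel genome, and a small parser for the `--tblout` table. The helper names and file paths are hypothetical; the commands are only constructed, not executed.

```python
def hmmbuild_command(hmm_out, msa_in):
    """hmmbuild [options] <hmmfile_out> <msafile>"""
    return ["hmmbuild", hmm_out, msa_in]


def hmmsearch_command(hmm, proteins_fasta, tblout):
    """hmmsearch [options] <hmmfile> <seqdb>, writing a per-target table."""
    return ["hmmsearch", "--tblout", tblout, hmm, proteins_fasta]


def parse_tblout(lines):
    """Parse hmmsearch --tblout lines into a list of hit dictionaries.

    Columns used: 0 = target name, 2 = query (model) name,
    4 = full-sequence E-value, 5 = full-sequence bit score.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        cols = line.split()
        hits.append({
            "target": cols[0],
            "query": cols[2],
            "evalue": float(cols[4]),
            "score": float(cols[5]),
        })
    return hits


def best_model_per_orf(hits):
    """Keep, for each ORF (target), the model with the highest bit score."""
    best = {}
    for hit in hits:
        current = best.get(hit["target"])
        if current is None or hit["score"] > current["score"]:
            best[hit["target"]] = hit
    return best
```

Running every model against every predicted ORF and keeping the best-scoring model per ORF gives a simple per-protein label, which is one possible way to drive the annotation; thresholds would still need to be chosen per model.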

Genome annotations can be improved and resubmitted to GenBank, but in reality, unless it is a funded model organism database, it doesn't happen too often. I'd say let's agree on a minimal quality level that we can all be happy with, and then get it done.

taltman · 2020-05-20T06:10:03Z

Where is the image from, BTW?

ababaian · 2020-05-20T06:18:02Z

Good old UCSC Genome Browser

ababaian · 2020-05-20T06:20:30Z

We'll hopefully have HMM models for Pol and Spike soon (as we need them badly). Once that procedure is hmmered out, we can hand it off and have them made for all the other proteins.

rcedgar · 2020-05-20T14:08:25Z

Edit: Deleted premature / uninformed comment by me.

rcedgar · 2020-05-20T14:52:23Z

Edit: RTFM (me). The Prokka tool mentioned by @taltman looks at first glance to be capable of high-throughput annotation with output in GenBank format. My bad. Would be fantastic if someone could volunteer to set up Prokka for this...

ababaian · 2020-05-23T07:56:39Z

taltman · 2020-07-07T18:46:57Z

Note, the TPA page says:

Note: It is required that all new annotations will be experimentally determined to exist, directly or indirectly.

From their FAQ page:

Computational studies on their own do not constitute experimental evidence and must be accompanied by biological experiments that support the new annotation.

Our workflow is complex, and doesn't fit any of their neat bins exactly. Will reach out to my contacts at NCBI for guidance.

rcedgar · 2020-07-07T18:50:00Z

Our annotations will be TPA inferential: "A database of sequences annotated by inference, where the source molecule or its product(s) have not been the subject of direct experimentation."

https://www.ncbi.nlm.nih.gov/genbank/tpa-inf/

taltman · 2020-07-07T19:16:34Z

Emails sent, will update as I get more guidance.

taltman · 2020-07-10T09:51:19Z

I have received an initial email from the GenBank team. They have asserted the following:

The new annotation/assembly must be supported by experimental or inferential
evidence. Sequence similarity, computational, or bioinformatic studies
alone are not sufficient as supporting evidence.

I'm not clear what is meant by inferential evidence that is neither experimental nor computational. They pointed me to the following webpage, which gives a number of examples of "TPA:inferential" scenarios, but I'm still unclear about the actual definition:

https://www.ncbi.nlm.nih.gov/genbank/tpa-inf/

I've sent a quick reply asking for more description of what constitutes inferential evidence. My rough idea is that it involves indirect experimental evidence for a sequence or the annotation of the sequence.

taltman · 2020-07-10T10:00:59Z

Upon further reading of:
https://www.ncbi.nlm.nih.gov/genbank/tpafaq/

What is the difference between TPA:experimental and TPA:inferential?
Sequence records in the TPA:experimental database are supported directly by experimental evidence while sequence data and annotation in the TPA:inferential database is indirectly supported by experimental evidence.

So however you slice it, our sequences and annotations seem to need experimental evidence of some flavor in order to submit these TPA:inferential submissions to GenBank.

ababaian · 2020-12-09T21:16:19Z

This issue encompasses a set of submission issues which can be merged here to close this issue.

taltman added this to To do in Serratus Annotation via automation May 20, 2020

taltman added this to the Assembly: Outputs milestone May 20, 2020

taltman assigned taltman, ariahahn and nevetsmallah May 20, 2020


ababaian commented May 23, 2020 • edited

Meta-data Required

Submission Category

Annotation Features
