Create a checklist for GenBank submission -- begin to automate for high throughput #106

Open
ababaian opened this issue May 20, 2020 · 13 comments

Comments

@ababaian
Owner

ababaian commented May 20, 2020

We are now generating novel CoV sequences that are of high quality (complete assembled genomes or near-complete genomes). High quality sequences like Frank (Fr4NK?) and Ginger need to be deposited into the public GenBank repository ASAP.

As we expand analysis/assembly, the volume of data we generate is going to explode, and we will need to automate this process.

  1. Collect the best version of Frank and Ginger and initiate a GenBank submission for these sequences.
  2. Create an inventory of the annotations and meta-data which we will need to attach (and how we can automate this process).
  3. With our meta-data 'inventory' we can build an 'annotation' pipeline to generate specifically this data as a "deliverable" sequence. For our own use we can have more annotations, but we need the core set required by GenBank.

Examples of good CoV Annotation


Some questions I had on this.

  • Across distant CoV (i.e. Alpha vs. Delta), are all the proteins more or less conserved? If so, then we need a classifier tuned specifically for each of the ~25 ORFs in CoV.
    [Image: Screenshot from 2020-05-19 22-41-08]
@taltman
Collaborator

taltman commented May 20, 2020

Of course, we need to follow these best-practices:

https://www.nature.com/articles/nbt.4306

I think the fastest bioinformatic path is to use Prokka in virus mode, and generate all of the files necessary for GenBank submission.
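A minimal sketch of what such a Prokka call could look like, built as an argument list so it can be scripted per genome. The flags (`--kingdom Viruses`, `--compliant`, `--centre`, `--locustag`, `--outdir`, `--prefix`) are standard Prokka options; the file name, output directory, centre name and locus tag are hypothetical placeholders, not values agreed in this thread.

```python
import shlex


def build_prokka_command(fasta, outdir, prefix, centre, locustag):
    """Return the argv list for one Prokka run in virus mode."""
    return [
        "prokka",
        "--kingdom", "Viruses",   # use Prokka's viral annotation databases
        "--compliant",            # enforce GenBank/ENA/DDBJ-compliant output
        "--centre", centre,       # sequencing centre ID (placeholder here)
        "--locustag", locustag,   # locus tag prefix (placeholder here)
        "--outdir", outdir,
        "--prefix", prefix,
        fasta,                    # input contigs/genome in FASTA format
    ]


# Hypothetical inputs, for illustration only.
cmd = build_prokka_command(
    fasta="frank.fa",
    outdir="annotation/frank",
    prefix="frank",
    centre="SERRATUS",
    locustag="FRANK",
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Prokka writes, among other files, a `.gbk` GenBank file, a `.tbl` feature table and a `.sqn` submission file into the output directory, so looping this over a list of assemblies would be one way to start on the high-throughput side.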

If we want to have a quality annotation to go along with it (and I strongly advise this), then we should look to virus-specific annotation resources, as posted on the Wiki.

If we want to knock it out of the park, then we should lean on Robert's MUSCLE(s) when it comes to HMM design and search, to build HMMs for all coronavirus conserved proteins, and use that to annotate the novel coronavirus genomes. Of course, to build the HMMs, we need to have a basic systematic annotation of the known coronavirus genomes.
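As a rough sketch of that align, build, search loop for one protein family: the commands below assume MUSCLE v3-style options (`-in`/`-out`; MUSCLE v5 uses `-align`/`-output` instead) and HMMER's `hmmbuild` and `hmmsearch`. All file names are hypothetical, and the search target is assumed to be a FASTA file of translated ORFs from the novel genomes.

```python
def build_hmm_commands(protein_fasta, name, target_fasta):
    """Return the three command lines: align, build profile, search."""
    alignment = f"{name}.afa"   # aligned FASTA written by MUSCLE
    model = f"{name}.hmm"       # profile HMM written by hmmbuild
    hits = f"{name}.tblout"     # per-sequence hit table from hmmsearch
    return [
        # MUSCLE v3 syntax (assumption); adjust for MUSCLE v5
        ["muscle", "-in", protein_fasta, "-out", alignment],
        # hmmbuild <hmmfile_out> <msafile>
        ["hmmbuild", model, alignment],
        # hmmsearch [options] <hmmfile> <seqdb>; targets must be protein
        ["hmmsearch", "--tblout", hits, model, target_fasta],
    ]


# Hypothetical inputs: known spike proteins and translated ORFs.
steps = build_hmm_commands("spike_known.faa", "spike", "novel_orfs.faa")
for step in steps:
    print(" ".join(step))
```

Repeating this per conserved protein would give one HMM per family, which is where the basic systematic annotation of known genomes comes in as the source of the input sequences.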

Genome annotations can be improved and resubmitted to GenBank, but in reality, unless it is a funded model organism database, it doesn't happen too often. I'd say let's agree on a minimal quality level that we can all be happy with, and then get it done.

@taltman taltman added this to To do in Serratus Annotation via automation May 20, 2020
@taltman taltman added this to the Assembly: Outputs milestone May 20, 2020
@taltman
Collaborator

taltman commented May 20, 2020

Where is the image from, BTW?

@ababaian
Owner Author

We'll hopefully have HMM models for Pol and Spike soon (as we needed them badly). Once that procedure is hmmered out, we can hand it off and have them made for all the other proteins.

@rcedgar
Collaborator

rcedgar commented May 20, 2020

Edit: Deleted premature / uninformed comment by me.

@rcedgar
Collaborator

rcedgar commented May 20, 2020

Edit: RTFM (me). The Prokka tool mentioned by @taltman looks at first glance to be capable of high-throughput annotation with output in GenBank format. My bad. Would be fantastic if someone could volunteer to set up Prokka for this...

@ababaian
Owner Author

ababaian commented May 23, 2020

Meta-data Required

  • Primary Contact Information
  • Sequence Author List
  • Reference for publication (if avail) - Unpublished
  • Sequence Technology - Illumina
  • Assembled Sequence OR unassembled sequence
  • Assembly Program Name
  • Assembly Program Version
  • Assembly Name
  • Coverage
  • Molecule Type - genomic RNA
  • Topology - Linear
  • Is the sequence complete
  • Fasta File
  • Submission Category - TPA (see below)
  • TPA - Evidence
  • TPA - GenBank Accessions
  • Source - Host
  • Source - Note (SRA Accession)
  • Source - Strain/Isolate *
  • Source - Country *
  • Source - Collection Date *
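To start automating the 'Source' part of this inventory, here is a minimal sketch that writes the per-sequence source fields as a tab-delimited table. The column headers (`Sequence_ID`, `Host`, `Isolate`, `Country`, `Collection_date`) follow my understanding of NCBI's source-modifier table convention and should be checked against the current submission documentation. The record values are placeholders, and treating only `Sequence_ID` as mandatory is an assumption of this sketch.

```python
import csv
import io

# Assumed column headers for a source-modifier table; verify before use.
COLUMNS = ("Sequence_ID", "Host", "Isolate", "Country", "Collection_date")


def write_source_table(records, handle):
    """Write one row per sequence; fields missing from a record stay empty."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        if not record.get("Sequence_ID"):
            raise ValueError("every record needs a Sequence_ID")
        writer.writerow({col: record.get(col, "") for col in COLUMNS})


# Placeholder records, for illustration only (not real metadata).
records = [
    {"Sequence_ID": "Seq1", "Host": "placeholder host"},
    {"Sequence_ID": "Seq2", "Host": "placeholder host", "Country": "placeholder"},
]
buffer = io.StringIO()
write_source_table(records, buffer)
print(buffer.getvalue())
```

The `Sequence_ID` values have to match the identifiers in the FASTA file, so generating both from the same inventory would keep them in sync.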

Submission Category

The category of submission we fall under would be "TPA:Inferential" See: https://www.ncbi.nlm.nih.gov/genbank/tpa/

Annotation Features

This method is more suitable for adding many different features on a single sequence or on multiple sequences. Notes on the format:

  • It uses the five-column, tab-delimited feature table format, which is also used in Sequin.
  • Each table in the feature table file applies to only one sequence; if multiple sequences have been uploaded in your nucleotide fasta file, each corresponding table must be labeled with that sequence's Sequence ID.
  • Multiple tables can be uploaded in a single file.
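For reference, a minimal sketch of emitting that five-column format from Python. Each table starts with a `>Feature <Sequence ID>` line; a feature line carries start, stop and feature key in columns 1 to 3, and each qualifier goes on its own line with the key in column 4 and the value in column 5. The sequence ID, coordinates and product name below are made-up placeholders.

```python
def format_feature_table(seq_id, features):
    """Render one feature table for a single sequence.

    features: iterable of (start, stop, feature_key, qualifiers), where
    qualifiers is a list of (qualifier_key, value) pairs.
    """
    lines = [f">Feature {seq_id}"]
    for start, stop, key, qualifiers in features:
        # Columns 1-3: start, stop, feature key
        lines.append(f"{start}\t{stop}\t{key}")
        for qualifier_key, value in qualifiers:
            # Columns 4-5: qualifier key and value (columns 1-3 left empty)
            lines.append(f"\t\t\t{qualifier_key}\t{value}")
    return "\n".join(lines) + "\n"


# Placeholder annotation, for illustration only.
table = format_feature_table(
    "Seq1",
    [(101, 400, "CDS", [("product", "hypothetical protein")])],
)
print(table, end="")
```

Concatenating the output for several sequences into one file gives the "multiple tables in a single file" case, as long as each `>Feature` line uses the matching Sequence ID from the FASTA file.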

We can officially submit sequences without annotation, so there is no minimum annotation requirement. We can do a first-pass annotation, add the obvious/easy meta-data, and note the entries where we are not satisfied and that will require better annotation. This is likely to be manual and time-intensive work, so I suggest that if this ends up LWIA we opt to 'crowd source' it to virologists qualified to do so. We should still aim for a good high-throughput annotation pipeline.

@taltman
Collaborator

taltman commented Jul 7, 2020

Note, the TPA page says:

Note: It is required that all new annotations will be experimentally determined to exist, directly or indirectly.

From their FAQ page:

Computational studies on their own do not constitute experimental evidence and must be accompanied by biological experiments that support the new annotation.

Our workflow is complex, and doesn't fit any of their neat bins exactly. Will reach out to my contacts at NCBI for guidance.

@rcedgar
Collaborator

rcedgar commented Jul 7, 2020

Our annotations will be TPA inferential: "A database of sequences annotated by inference, where the source molecule or its product(s) have not been the subject of direct experimentation."

https://www.ncbi.nlm.nih.gov/genbank/tpa-inf/

@taltman
Collaborator

taltman commented Jul 7, 2020

Emails sent, will update as I get more guidance.

@taltman
Collaborator

taltman commented Jul 10, 2020

I have received an initial email from the GenBank team. They have asserted the following:

The new annotation/assembly must be supported by experimental or inferential evidence. Sequence similarity, computational, or bioinformatic studies alone are not sufficient as supporting evidence.

I'm not clear what is meant by inferential evidence that is neither experimental nor computational. They pointed me to the following webpage, which gives a number of examples of "TPA:inferential" scenarios, but I'm still unclear about the actual definition:

https://www.ncbi.nlm.nih.gov/genbank/tpa-inf/

I've sent a quick reply asking for more description of what constitutes inferential evidence. My rough idea is that it involves indirect experimental evidence for a sequence or the annotation of the sequence.

@taltman
Collaborator

taltman commented Jul 10, 2020

Upon further reading of:
https://www.ncbi.nlm.nih.gov/genbank/tpafaq/

What is the difference between TPA:experimental and TPA:inferential?
Sequence records in the TPA:experimental database are supported directly by experimental evidence while sequence data and annotation in the TPA:inferential database is indirectly supported by experimental evidence.

So however you slice it, our sequences and annotations seem to need experimental evidence of some flavor in order to submit these TPA:inferential submissions to GenBank.

5 participants