Deliver uniform annotation of reference coronavirus genomes #101

taltman · 2020-05-17T07:09:18Z

To better evaluate read hits to the reference genomes, we need a high-quality, consistent annotation of those genomes, so that we can see where the reads are hitting and whether the hits are significant.

taltman · 2020-05-17T07:10:04Z

@ababaian , could you please help me describe the problem here, and the list of requirements that would make the Bowtie2 hit analysis easier?

rcedgar · 2020-05-17T14:22:36Z

It would be very helpful to know the coordinates of biologically important segments such as "spike protein" and "polymerase" in every reference sequence. That way, we can automate detection and visualization of coverage on a gene-by-gene basis. It would also enable biologically informative analyses of the CoV family itself, such as gene gain and loss, which AFAIK has not been published, though I haven't tried very hard to find it (references solicited!). I've taken a stab at cleaning up the annotations, and it looks like a lot of manual effort would be needed, because the annotations in the GenBank files vary in quality, nomenclature, etc. If this work has not in fact been done already, it would be a useful contribution / preprint / paper in its own right, and it doesn't require much (if any) understanding of Serratus, which can be a bit daunting to a newcomer.
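As a rough illustration of the gene-by-gene coverage idea, here is a minimal Python sketch. It assumes the annotation is already available as a mapping from gene name to 1-based inclusive coordinates, and that per-base read depth has been extracted from the alignment as a plain list. The function name, the data layout, and the example coordinates are all hypothetical, not part of Serratus.

```python
from typing import Dict, List, Tuple

# Hypothetical annotation layout: gene name -> (start, end), 1-based inclusive.
Annotation = Dict[str, Tuple[int, int]]


def gene_coverage(depth: List[int], annotation: Annotation,
                  min_depth: int = 1) -> Dict[str, float]:
    """Return, per gene, the fraction of bases with depth >= min_depth."""
    result = {}
    for gene, (start, end) in annotation.items():
        window = depth[start - 1:end]  # convert 1-based inclusive to a 0-based slice
        covered = sum(1 for d in window if d >= min_depth)
        result[gene] = covered / len(window) if window else 0.0
    return result


# Toy example: a 10-base "genome" with two made-up genes.
toy_depth = [0, 0, 5, 5, 5, 5, 0, 0, 2, 2]
toy_annotation = {"geneA": (3, 6), "geneB": (7, 10)}
toy_result = gene_coverage(toy_depth, toy_annotation)
```

With uniform coordinates for every reference, the same function could be run across all genomes to compare coverage of a given gene between hits.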

Edit: to be clear, the annotations should be harmonized into a single system of identifiers and metadata, so that "spike protein" is always "spike protein" rather than ORF1, unidentified protein, etc.
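A minimal sketch of what name harmonization could look like, assuming a hand-curated synonym table. The synonym lists below are illustrative examples only, not a vetted vocabulary, and `harmonize` is a hypothetical helper. Labels that cannot be resolved by name alone (such as "unidentified protein") return `None` and would need sequence-based evidence instead.

```python
import re
from typing import Optional

# Illustrative synonym table (an assumption, not a curated resource):
# canonical label -> product names that should map to it.
CANONICAL = {
    "spike protein": ["spike protein", "spike glycoprotein",
                      "surface glycoprotein", "s protein"],
    "polymerase": ["polymerase", "rna-dependent rna polymerase", "rdrp"],
}

# Invert the table once for direct lookup by synonym.
_LOOKUP = {syn: canon for canon, syns in CANONICAL.items() for syn in syns}


def harmonize(product: str) -> Optional[str]:
    """Map a free-text product name to its canonical label, or None if unknown."""
    key = re.sub(r"\s+", " ", product.strip().lower())
    return _LOOKUP.get(key)
```

For example, `harmonize("  Spike  Glycoprotein ")` returns `"spike protein"`, while `harmonize("unidentified protein")` returns `None`, flagging the record for manual or HMM-based review.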

rcedgar · 2020-05-17T15:30:31Z

Here's an idea for generating uniform annotations with less effort. Build HMMs from the well-annotated proteins and other features in the GenBank records. Use these HMMs to find homologs in the other reference sequences, then use the homologs to make expanded MSAs and better HMMs. It might take a couple of iterations, but this is the kind of thing you can do in a few hours. Do this with both nt and aa sequences. The HMMs can be used to annotate contigs directly without needing a reference alignment.
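The iteration described above could be sketched as follows for the protein case, assuming HMMER is the tool (the comment does not name one). The function only builds the command lines and does not run anything. File names are hypothetical, and the target database is assumed to be protein sequences (for example, translated ORFs from the reference genomes). It uses `hmmbuild <hmmfile> <msafile>` and `hmmsearch` with `-A` (save an alignment of all hits) and `--tblout` (tabular hit summary).

```python
from typing import List


def hmm_iteration_commands(seed_msa: str, seq_db: str, name: str,
                           rounds: int = 2) -> List[List[str]]:
    """Plan build -> search -> rebuild rounds; returns argv lists, runs nothing."""
    commands = []
    msa = seed_msa
    for i in range(1, rounds + 1):
        hmm = f"{name}.round{i}.hmm"
        hits_msa = f"{name}.round{i}.sto"  # Stockholm alignment of all hits
        hits_tbl = f"{name}.round{i}.tbl"  # per-target hit table
        commands.append(["hmmbuild", hmm, msa])
        commands.append(["hmmsearch", "-A", hits_msa, "--tblout", hits_tbl,
                         hmm, seq_db])
        msa = hits_msa  # the expanded alignment seeds the next round
    return commands


# Hypothetical inputs: a seed alignment of well-annotated spike proteins
# and a FASTA file of proteins from the other reference genomes.
plan = hmm_iteration_commands("spike_seed.sto", "refs.faa", "spike", rounds=2)
```

Each argv list could be passed to `subprocess.run(cmd, check=True)`. This sketch applies no score threshold between rounds, so in practice the hits would need filtering (and the seed sequences should be in the database, or merged back in) to keep the model from drifting. For the nt case, `nhmmer` would play the role of `hmmsearch`.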

ababaian · 2020-05-17T18:31:17Z

rcedgar · 2020-05-17T21:27:17Z

I'm assigning myself to develop a system for annotating a CoV nt sequence with the coordinates of functional domains (ORFs, regulatory sequences, etc.). I'm thinking of something like Pfam HMMs.

rcedgar · 2020-05-17T21:39:48Z

@ababaian said "this belongs to coordinates ~21,000-23,000 on the pan-genome". This kind of coordinate system works well for SSU genes and immunoglobulins, and would be ideal here. It may not be possible, or a good idea, if there is gene gain or loss, or mosaic / chimeric viruses in the family. The first step is to parse the genomes into functional units, then ask whether domain organization is well conserved.

rcedgar · 2020-05-17T21:43:48Z

Related note: naively projecting onto a pan-genome with 32 bins (summarizer strategy) is robust against the complications mentioned in my last comment.
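For illustration, naive projection onto a fixed number of bins can be sketched as below. This assumes each genome is simply divided into 32 equal-width bins by relative position. Whether the summarizer computes its bins exactly this way is an assumption, and `pan_genome_bin` is a hypothetical name.

```python
def pan_genome_bin(position: int, genome_length: int, n_bins: int = 32) -> int:
    """Map a 1-based genome position to a 0-based bin by relative position."""
    if not 1 <= position <= genome_length:
        raise ValueError("position must lie within the genome")
    # Integer arithmetic avoids floating-point edge cases at bin boundaries.
    return (position - 1) * n_bins // genome_length
```

Because the bin depends only on relative position, no gene-level alignment between genomes is required. The flip side is that the same bin index may cover different genes in genomes with different organization.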

taltman · 2020-05-17T22:43:33Z

@rcedgar If you want to take this on, awesome! I might suggest some tools here:
Prokka, for performing virus-oriented gene calling and a basic annotation using UniProt
EggNOG-Mapper, which has viral HMMs and DIAMOND protein databases built already.
Also, check out these resources from Steven Hallam:
https://github.com/ababaian/serratus/wiki/Serratus-Annotation

taltman · 2020-07-07T18:42:39Z

We now have VADR for annotating CoVs. Do we still want to annotate the reference CoV genomes?

rcedgar · 2020-07-07T18:47:08Z

If VADR is the tool we're going to use in production, then it would be great validation to provide VADR annotations of the ~800 full-length CoV genomes currently in GenBank. For the RefSeqs especially, this will show the strengths and weaknesses of the method. Edit: actually, I guess not the RefSeqs, because IIRC they are used as references by VADR.

taltman created this issue from a note in Serratus Annotation (To do) May 17, 2020

taltman added this to the Annotation: Ref Sequences milestone May 17, 2020

ababaian mentioned this issue May 17, 2020

Taxonomy identifiers for Cov reference database #45

Closed

rcedgar self-assigned this May 17, 2020

taltman mentioned this issue May 17, 2020

Genbank parser dev #61

Closed

ababaian closed this as completed Dec 9, 2020

Serratus Annotation automation moved this from To do to Done Dec 9, 2020

