Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deliver uniform annotation of reference coronavirus genomes #101

Closed
taltman opened this issue May 17, 2020 · 10 comments
Closed

Deliver uniform annotation of reference coronavirus genomes #101

taltman opened this issue May 17, 2020 · 10 comments

Comments

@taltman
Copy link
Collaborator

taltman commented May 17, 2020

To better evaluate read hits to the reference genomes, we need to have a quality, consistent annotation of those genomes, to see where the reads are hitting, and whether they are significant.

@taltman taltman created this issue from a note in Serratus Annotation (To do) May 17, 2020
@taltman
Copy link
Collaborator Author

taltman commented May 17, 2020

@ababaian , could you please help me describe the problem here, and the list of requirements that would make the Bowtie2 hit analysis easier?

@taltman taltman added this to the Annotation: Ref Sequences milestone May 17, 2020
@rcedgar
Copy link
Collaborator

rcedgar commented May 17, 2020

It would be very helpful to know the coordinates of biologically important segments such as "spike protein" and "polymerase" in every reference sequence. That way, we can automate detection and visualization of coverage on a gene-by-gene basis, and it would also enable biologically informative analyses of the Cov family itself such as gene gain and loss which AFAIK is not published though I haven't tried very hard to find it -- reference solicited! I've taken a stab at cleaning up the annotations, and it looks to be a lot of work because annotations in the Genbank files have variable quality, nomenclature etc. Looks to me like a lot of manual effort would be needed to clean this up. If in fact this work has not been done already, it would be a useful contribution / preprint / paper in its own right and doesn't require much / any understanding of Serratus which can be a bit daunting to a newcomer.

Edit: to be clear, the annotations should be harmonized so that there is a single system of identifiers and meta-data so that "spike protein" is always "spike protiein" rather than ORF1, unidentified protein etc.

@rcedgar
Copy link
Collaborator

rcedgar commented May 17, 2020

Here's an idea for generating uniform annotations with less effort. Build HMMs from the well-annotated proteins and other features in the Genbank records. Use these to find homologs in the other reference sequences. Use these to make expanded MSAs and better HMMs. Might take a couple of iterations, but this is the kind of thing you can do in a few hours. Do this with both nt and aa sequences. The HMMs can be used to annotate contigs directly without needing a reference alignment.

@ababaian
Copy link
Owner

ababaian commented May 17, 2020

Sequence Annotation Fields

I'd like to have all the coronavirus metadata in a single, uniform, flat TSV file. This need not contain the sequence data directly, only a uniform and clean meta-data describing each sequence.
(see also: #45 and Genbank Parsing for a start to this and the near complete genbank parser #61).

We'll then want this metadata generated for every 'new genome' that we find.

Desired Fields

For every sequence in cov0

  • Accession ID
  • CoV Species Taxonomy ID
  • CoV Species description
  • Host Species Taxonomy ID
  • Inferred Host Species Taxonomy ID. When a host species is not known, use an 'inferred' host species from the closest known virus.
  • Host / Inferred Host Description
  • Whole genome or partial
  • Sequence length
  • Sequence coverage (how many NNNN are there)
  • open reading frames in sequence (comma seperated). Process with a pan-CoV and uniform nomenclature.
  • Coordinates of each ORF
  • Is this a blacklisted sequence?

Motivation: Why do we need this right now? We are currently using cov2m which contains both full-length genomes and sequence fragments from highly diverged CoV (i.e. cloned pol gene). We want to allow the 'summarizer' to bin hits onto a theoretical/rough "pan-genome" such that a divergent virus like Frankie, when the hits are distributed across 10+ genomes and fragments, all of those hits can be uniformly placed onto the 'pan-genome' or the 'pan-pol' or 'pan-spike'.

This is similar to the 'liftOver` operation between genomes, but in this instance we want all genomes/fragments to be 'liftOver' to a uniform pan-genome.

With this meta-data, we can then align proteins in sequences/fragments and create a true 'pan-genome'. For instance, if we only have the Spike gene in a fragment, this belongs to coordinates ~21,000-23,000 on the pan-genome.

edit: Then of course for the assembly/contigs that are generated in new libraries. We can quickly infer meta-data about what should be in a contig, by knowing approximately where it sits on the pan-CoV genome.

@rcedgar rcedgar self-assigned this May 17, 2020
@rcedgar
Copy link
Collaborator

rcedgar commented May 17, 2020

I'm assigning myself to develop a system for annotating a Cov nt sequence with coordinates of functional domains (ORFs, regulatory sequences etc.). Something like PFAM HMMs I'm thinking.

@rcedgar
Copy link
Collaborator

rcedgar commented May 17, 2020

@ababaian said "this belongs to coordinates ~21,000-23,000 on the pan-genome" This kind of system works well for SSU genes and immunoglobulins, would be ideal here. It may not be possible, or a good idea, if there is gene gain or loss, or mosaic / chimeric viruses in the family. First step is to parse into functional units, then ask if domain organization is well conserved.

@rcedgar
Copy link
Collaborator

rcedgar commented May 17, 2020

Related note: naively projecting onto a pan-genome with 32 bins (summarizer strategy) is robust against the complications mentioned in my last comment.

@taltman
Copy link
Collaborator Author

taltman commented May 17, 2020

@rcedgar If you want to take this on, awesome! I might suggest some tools here:
Prokka, for performing virus-oriented gene calling and a basic annotation using UniProt
EggNOG-Mapper, which has viral HMMs and DIAMOND protein databases built already.
Also, check out these resources from Steven Hallam:
https://github.com/ababaian/serratus/wiki/Serratus-Annotation

@taltman
Copy link
Collaborator Author

taltman commented Jul 7, 2020

We now have VADR for annotating CoVs. Do we still want to annotate the reference CoV genomes?

@rcedgar
Copy link
Collaborator

rcedgar commented Jul 7, 2020

If VADR is the tool we're going to use in production, then would be great validation to provide VADR annotations of the ~800 full-length Cov genomes currently in GenBank. For the RefSeqs especially, this will show the strengths and weaknesses of the method. Edit -- Actually I guess not the RefSeqs because they are used as references by VADR IIRC.

@ababaian ababaian closed this as completed Dec 9, 2020
Serratus Annotation automation moved this from To do to Done Dec 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

No branches or pull requests

3 participants