-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deliver uniform annotation of reference coronavirus genomes #101
Comments
@ababaian , could you please help me describe the problem here, and the list of requirements that would make the Bowtie2 hit analysis easier? |
It would be very helpful to know the coordinates of biologically important segments such as "spike protein" and "polymerase" in every reference sequence. That way, we can automate detection and visualization of coverage on a gene-by-gene basis, and it would also enable biologically informative analyses of the Cov family itself such as gene gain and loss which AFAIK is not published though I haven't tried very hard to find it -- reference solicited! I've taken a stab at cleaning up the annotations, and it looks to be a lot of work because annotations in the Genbank files have variable quality, nomenclature etc. Looks to me like a lot of manual effort would be needed to clean this up. If in fact this work has not been done already, it would be a useful contribution / preprint / paper in its own right and doesn't require much / any understanding of Serratus which can be a bit daunting to a newcomer. Edit: to be clear, the annotations should be harmonized so that there is a single system of identifiers and meta-data so that "spike protein" is always "spike protiein" rather than ORF1, unidentified protein etc. |
Here's an idea for generating uniform annotations with less effort. Build HMMs from the well-annotated proteins and other features in the Genbank records. Use these to find homologs in the other reference sequences. Use these to make expanded MSAs and better HMMs. Might take a couple of iterations, but this is the kind of thing you can do in a few hours. Do this with both nt and aa sequences. The HMMs can be used to annotate contigs directly without needing a reference alignment. |
Sequence Annotation FieldsI'd like to have all the coronavirus metadata in a single, uniform, flat TSV file. This need not contain the sequence data directly, only a uniform and clean meta-data describing each sequence. We'll then want this metadata generated for every 'new genome' that we find. Desired FieldsFor every sequence in
Motivation: Why do we need this right now? We are currently using This is similar to the 'liftOver` operation between genomes, but in this instance we want all genomes/fragments to be 'liftOver' to a uniform pan-genome. With this meta-data, we can then align proteins in sequences/fragments and create a true 'pan-genome'. For instance, if we only have the Spike gene in a fragment, this belongs to coordinates ~21,000-23,000 on the pan-genome. edit: Then of course for the assembly/contigs that are generated in new libraries. We can quickly infer meta-data about what should be in a contig, by knowing approximately where it sits on the pan-CoV genome. |
I'm assigning myself to develop a system for annotating a Cov nt sequence with coordinates of functional domains (ORFs, regulatory sequences etc.). Something like PFAM HMMs I'm thinking. |
@ababaian said "this belongs to coordinates ~21,000-23,000 on the pan-genome" This kind of system works well for SSU genes and immunoglobulins, would be ideal here. It may not be possible, or a good idea, if there is gene gain or loss, or mosaic / chimeric viruses in the family. First step is to parse into functional units, then ask if domain organization is well conserved. |
Related note: naively projecting onto a pan-genome with 32 bins (summarizer strategy) is robust against the complications mentioned in my last comment. |
@rcedgar If you want to take this on, awesome! I might suggest some tools here: |
We now have VADR for annotating CoVs. Do we still want to annotate the reference CoV genomes? |
If VADR is the tool we're going to use in production, then would be great validation to provide VADR annotations of the ~800 full-length Cov genomes currently in GenBank. For the RefSeqs especially, this will show the strengths and weaknesses of the method. Edit -- Actually I guess not the RefSeqs because they are used as references by VADR IIRC. |
To better evaluate read hits to the reference genomes, we need to have a quality, consistent annotation of those genomes, to see where the reads are hitting, and whether they are significant.
The text was updated successfully, but these errors were encountered: