Taxonomy identifiers for Cov reference database #45

rcedgar · 2020-04-23T15:04:44Z

It would be useful to include taxonomy information in pan-genome reference to enable analyses such as taxonomy-aware summaries of bowtie2 hits. Taxonomy descriptions are included in the full FASTA deflines, but not the taxonomy id (i.e., the integer accession in the NCBI Taxonomy database). The integer accession is better because it places the sequence in a tree structure. For example, MH726362.1 is "Porcine epidemic diarrhea virus isolate GDS05, complete genome" and the taxonomy id in the Genbank record is 28295. To implement this, we need a script which can take a long list (~33k) of sequence accessions (MH726362.1) and fetch their taxonomy identifiers. I'm guessing this can be done by a simply query using Entrez or something like that.

Even better would be to have the taxonomy identifier (ideal) or species name (good) of the host as well. This can be done by downloading the Genbank record and extracting the "/host" annotation which gives the species name, e.g. for MH726362.1 you would find /host="Sus scrofa". The taxonomy identifier of the host is not given, so an additional lookup (Entrez?) would be required to retrieve it.

ababaian · 2020-04-24T02:54:05Z

I also downloaded the original cov0 dataset in genbank format cov0 that is rich with this information

LOCUS       X81024                   761 bp    RNA     linear   VRL 14-JUL-2016
DEFINITION  Feline enteric coronavirus genomic RNA, ORF3a, 3d, 4.
ACCESSION   X81024
VERSION     X81024.1
KEYWORDS    ORF3a; ORF3d; ORF4.
SOURCE      Coronaviridae
  ORGANISM  Coronaviridae
            Viruses; ssRNA viruses; ssRNA positive-strand viruses, no DNA
            stage; Nidovirales.
REFERENCE   1
  AUTHORS   Lewis,E.L.
  JOURNAL   Thesis (1996) Bristol University
REFERENCE   2  (bases 1 to 761)
  AUTHORS   Lewis,E.L.
  TITLE     Direct Submission
  JOURNAL   Submitted (11-AUG-1994) E.L. Lewis, University of Bristol, Dept of
            Clinical Veterinary Sciences, Langford House, Bristol BS18 7DU, UK
  REMARK    Revised by author 15-JAN-96
COMMENT     nucleotides 1-10 overlap sequence 4456-4465 of entry X80799.
            
            CTAAAC is minimal conserved sequence of transcription signals. ORF
            3d possibly not transcribed.
            ORF3a contains a large deletion compared with the wild type FECV
            strain 79-1683.
            No protein products of ORF3a or 3d yet identified in FECV strain
            79-1683 infected cells.
            ORF3 thought to encode non-structural gene product/s. Further
            analysis underway.
FEATURES             Location/Qualifiers
     source          1..761
                     /organism="Coronaviridae"
                     /mol_type="genomic RNA"
                     /strain="79-1683"
                     /isolate="Actinomycin D sensitive strain"
                     /db_xref="taxon:11118"
                     /map="between ORF2 (spike) and ORF4 (small membrane)"
                     /clone="registered plasmid prefix:PVML"
                     /clone_lib="Lambda ZAPII; FECV strain 79-1683 cDNA
                     /Bluescript SK"
     misc_feature    1..9
                     /note="ORF3a transcription signal"
     CDS             31..168
                     /codon_start=1
                     /product="ORF3a"
                     /protein_id="CAA56929.1"
                     /db_xref="InterPro:IPR006784"
                     /db_xref="UniProtKB/TrEMBL:Q66926"
                     /translation="MDIVKSIDTSVDAVLDELDRAYFAVTLKVEFIIQQMCILHNKNV
                     L"
     misc_feature    267..274
                     /note="ORF3d transcription signal"
     CDS             297..>761
                     /codon_start=1
                     /product="ORF3d"
                     /protein_id="CAA56930.1"
                     /db_xref="InterPro:IPR004293"
                     /db_xref="UniProtKB/TrEMBL:Q66927"
                     /translation="MFKIVSMTLIGPMLIAYGYYIDGTVTTTVLALRFVYLAYFWYVN
                     SRFEFILYNTTTLMFVHGRAAPFMRSSHSSIYVTLYGGINYMFVNDLTLHFVDPMLVS
                     IAIRGLAHADLTVVRAVELLNGDFIYVFSQEPVVGVYNAAFSQAVLNEIDLKE"
     misc_feature    738..743
                     /note="ORF4 transcription signal"
     misc_feature    761
                     /note="ORF4 start codon"
ORIGIN      
        1 actaaactta tgagtcacta cagatcttgt atggatattg tcaaatctat tgacacatcc
       61 gtagacgctg tacttgacga acttgatcgt gcatactttg ctgttactct taaagtagaa
      121 tttataatac agcaaatgtg cattctacac aacaagaacg tgttatagta caacagcatc
      181 aagttgttag tgctagaaca caaaactatt attcagagtt cagcatcgct gtactctttg
      241 tatcattttt ggctttgtac cgtagtacaa actttaagac gtgtgtcggc atcttaatgt
      301 ttaagattgt atcaatgaca cttataggac ctatgcttat agcatatggt tactatattg
      361 atggcactgt tacaacaact gtcttagctt taagatttgt ctacttagca tacttttggt
      421 atgttaatag taggtttgaa tttattttat acaatacaac gacactcatg tttgtacatg
      481 gcagagctgc accgtttatg agaagttctc acagctctat ttatgtcaca ttgtatggtg
      541 gcataaatta tatgtttgtg aatgacctca cgttgcattt tgtagaccct atgcttgtaa
      601 gcatagcaat acgtggctta gctcatgctg atctaactgt tgttagagca gtcgaacttc
      661 tcaatggtga ttttatttat gtattttcac aggagcccgt agtcggtgtt tacaatgcag
      721 ccttttctca ggcggttcta aacgaaattg acttaaaaga a
//

ababaian · 2020-04-24T20:03:37Z

Desired Fields (when avail)

RCE: deleted requested field "FASTA sequence length" because it duplicates information in the sequence itself.

Bdegraaf1234 · 2020-04-24T20:57:16Z

I'm looking at this now, the most recent R package (https://bioconductor.org/packages/release/bioc/html/genbankr.html) does not allow for only some of the entries to have chromosome names:

This file appears to have some source features which specify chromosome names and others that do not. This is not currently supported. Please contact the maintainer if you need this feature

I guess I will keep looking, I don't know if they want the files to just be consistent, or if we just need the info. Otherwise, we might delete the chromosome name beforehand?

edit: seems we have only 21 entries with the chromosome name so I will delete it

Bdegraaf1234 · 2020-04-25T12:00:19Z

OPEN questions:

Duplicates: I'm unclear whats meant there.. @rcedgar
How to identify blacklisted entries?

Desired Fields (when avail)

* [x]  Accession ID

* [x]  CoV Species Taxonomy ID

* [x]  Host Species Taxonomy ID --> not as ID yet, 80%

* [x]  Whole genome or partial --> not formatted perfectly (~90%)

* [x]  open reading frames in sequence (comma seperated) --> what do we want? i took all the gene_ids for now

* [ ]  Is blacklisted?

* [x] deleted FASTA sequence length (RCE)

* [ ] duplicates information in the sequence itself. (RCE)

definition is mostly formatted like this:

Human coronavirus 229E strain HCoV-229E-8/8/01 spike glycoprotein(S) gene, complete cds.

I'm taking the string between the 1st and 2nd comma but its not 100% consistent, more like 90.
6 possible entries: complete/partial cds, complete/partial sequence, complete/partial genome

strain field is quite inconsistent and incomplete (~33%), might be useful though?
ORF is now a list of gene_id which looks like this: S,orf3a,orf3b,E,M,orf6,orf7a,orf7b,orf8a,orf8b,N,orf9b. some rows give the orf as 1a, b etc so i cant filter on "orf"
#duplicates counts the number of "n" nuleotides

rcedgar · 2020-04-25T14:52:53Z

This issue is solved by having the Genbank records, if we have those there is no need to copy fields into FASTA headers.

Bdegraaf1234 · 2020-04-25T16:19:42Z

@rcedgar
So, no use for a flattened version of the genbank records/their metadata?

rcedgar · 2020-04-25T16:24:19Z

None that I know of right now. Sorry you wasted time on that. Things are a bit disorganized because we're moving quickly; not sure how to avoid this problem in future, perhaps wait until an issue is assigned to you.

Bdegraaf1234 · 2020-04-25T16:44:50Z

makes sense, thanks for the quick response

ababaian · 2020-04-25T17:52:25Z

I set up @Bdegraaf1234 on this. I'd like to have that genbank metadata in a flat TSV file and in an R data.frame for easier downstream analysis. There's alot of these small formatting issues inconsistencies (i.e. data tidying) which will be incredibly helpful to have moving forward.

Duplicates: I'm unclear whats meant there
How to identify blacklisted entries?

We can swing back to these once the main data is cleaned up and well annotated. These values are something we are generating as we process theses bulk sequences into what we are calling a 'pan-genome'

You can see this processing procedure in this notebook entry

rcedgar · 2020-04-25T18:23:08Z

The thing with "duplicates" is a formatting problem caused by copy-pasting the list. See the first time it appears above; I edited the list to remove "sequence length" from @ababaian 's original wish-list on the grounds that it was redundant (which it would be with fasta, but not with tsv).

rcedgar · 2020-04-25T18:26:15Z

If we're going to re-format the gb records so that they are more easily parsed, then I would suggest using FASTA per my original "ABC" proposal. This is more flexible and forwards-compatible than TSV, and avoids introducing a new file format -- everything can read fasta, and parsing a defline with fields is just as easy as parsing a tsv file, e.g. in Python it is trivial to do Fields = Line.split(':') where:is the field separator, thus avoiding a dependency on a tsv library import. If the file size of including the sequence is an issue, then the deflines can trivially be obtained by grep "^>" to create a file which is almost the same as a tsv. This secondary file could be generated automatically at the same time the FASTA is made.

taltman · 2020-05-17T07:23:40Z

What's the current status on this issue?

ababaian · 2020-05-17T18:21:19Z

Closing this and moving discussion to #101

rcedgar added enhancement New feature or request Bioinformatics Bioinformatics task labels Apr 23, 2020

ababaian mentioned this issue Apr 23, 2020

Preserving full headers in pan-genome #46

Closed

rcedgar closed this as completed Apr 25, 2020

ababaian assigned Bdegraaf1234 Apr 25, 2020

ababaian reopened this Apr 25, 2020

ababaian linked a pull request Apr 27, 2020 that will close this issue

Genbank parser dev #61

Open

rcedgar mentioned this issue Apr 28, 2020

Updates for cov3r pan-genome #64

Closed

taltman added this to To do in Serratus Annotation via automation May 17, 2020

taltman added this to the Annotation: Ref Sequences milestone May 17, 2020

ababaian closed this as completed May 17, 2020

Serratus Annotation automation moved this from To do to Done May 17, 2020

ababaian mentioned this issue May 17, 2020

Deliver uniform annotation of reference coronavirus genomes #101

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Taxonomy identifiers for Cov reference database #45

Taxonomy identifiers for Cov reference database #45

rcedgar commented Apr 23, 2020 •

edited

ababaian commented Apr 24, 2020

ababaian commented Apr 24, 2020 •

edited by rcedgar

Bdegraaf1234 commented Apr 24, 2020 •

edited

Bdegraaf1234 commented Apr 25, 2020 •

edited

Desired Fields (when avail)

rcedgar commented Apr 25, 2020

Bdegraaf1234 commented Apr 25, 2020

rcedgar commented Apr 25, 2020

Bdegraaf1234 commented Apr 25, 2020

ababaian commented Apr 25, 2020 •

edited

rcedgar commented Apr 25, 2020 •

edited

rcedgar commented Apr 25, 2020 •

edited

taltman commented May 17, 2020

ababaian commented May 17, 2020

Taxonomy identifiers for Cov reference database #45

Taxonomy identifiers for Cov reference database #45

Comments

rcedgar commented Apr 23, 2020 • edited

ababaian commented Apr 24, 2020

ababaian commented Apr 24, 2020 • edited by rcedgar

Desired Fields (when avail)

Bdegraaf1234 commented Apr 24, 2020 • edited

Bdegraaf1234 commented Apr 25, 2020 • edited

Desired Fields (when avail)

rcedgar commented Apr 25, 2020

Bdegraaf1234 commented Apr 25, 2020

rcedgar commented Apr 25, 2020

Bdegraaf1234 commented Apr 25, 2020

ababaian commented Apr 25, 2020 • edited

rcedgar commented Apr 25, 2020 • edited

rcedgar commented Apr 25, 2020 • edited

taltman commented May 17, 2020

ababaian commented May 17, 2020

rcedgar commented Apr 23, 2020 •

edited

ababaian commented Apr 24, 2020 •

edited by rcedgar

Bdegraaf1234 commented Apr 24, 2020 •

edited

Bdegraaf1234 commented Apr 25, 2020 •

edited

ababaian commented Apr 25, 2020 •

edited

rcedgar commented Apr 25, 2020 •

edited

rcedgar commented Apr 25, 2020 •

edited