Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Taxonomy identifiers for Cov reference database #45

Closed
rcedgar opened this issue Apr 23, 2020 · 13 comments · May be fixed by #61
Closed

Taxonomy identifiers for Cov reference database #45

rcedgar opened this issue Apr 23, 2020 · 13 comments · May be fixed by #61
Assignees
Labels
Bioinformatics Bioinformatics task enhancement New feature or request

Comments

@rcedgar
Copy link
Collaborator

rcedgar commented Apr 23, 2020

It would be useful to include taxonomy information in pan-genome reference to enable analyses such as taxonomy-aware summaries of bowtie2 hits. Taxonomy descriptions are included in the full FASTA deflines, but not the taxonomy id (i.e., the integer accession in the NCBI Taxonomy database). The integer accession is better because it places the sequence in a tree structure. For example, MH726362.1 is "Porcine epidemic diarrhea virus isolate GDS05, complete genome" and the taxonomy id in the Genbank record is 28295. To implement this, we need a script which can take a long list (~33k) of sequence accessions (MH726362.1) and fetch their taxonomy identifiers. I'm guessing this can be done by a simply query using Entrez or something like that.

Even better would be to have the taxonomy identifier (ideal) or species name (good) of the host as well. This can be done by downloading the Genbank record and extracting the "/host" annotation which gives the species name, e.g. for MH726362.1 you would find /host="Sus scrofa". The taxonomy identifier of the host is not given, so an additional lookup (Entrez?) would be required to retrieve it.

@rcedgar rcedgar added enhancement New feature or request Bioinformatics Bioinformatics task labels Apr 23, 2020
@ababaian
Copy link
Owner

I also downloaded the original cov0 dataset in genbank format cov0 that is rich with this information

LOCUS       X81024                   761 bp    RNA     linear   VRL 14-JUL-2016
DEFINITION  Feline enteric coronavirus genomic RNA, ORF3a, 3d, 4.
ACCESSION   X81024
VERSION     X81024.1
KEYWORDS    ORF3a; ORF3d; ORF4.
SOURCE      Coronaviridae
  ORGANISM  Coronaviridae
            Viruses; ssRNA viruses; ssRNA positive-strand viruses, no DNA
            stage; Nidovirales.
REFERENCE   1
  AUTHORS   Lewis,E.L.
  JOURNAL   Thesis (1996) Bristol University
REFERENCE   2  (bases 1 to 761)
  AUTHORS   Lewis,E.L.
  TITLE     Direct Submission
  JOURNAL   Submitted (11-AUG-1994) E.L. Lewis, University of Bristol, Dept of
            Clinical Veterinary Sciences, Langford House, Bristol BS18 7DU, UK
  REMARK    Revised by author 15-JAN-96
COMMENT     nucleotides 1-10 overlap sequence 4456-4465 of entry X80799.
            
            CTAAAC is minimal conserved sequence of transcription signals. ORF
            3d possibly not transcribed.
            ORF3a contains a large deletion compared with the wild type FECV
            strain 79-1683.
            No protein products of ORF3a or 3d yet identified in FECV strain
            79-1683 infected cells.
            ORF3 thought to encode non-structural gene product/s. Further
            analysis underway.
FEATURES             Location/Qualifiers
     source          1..761
                     /organism="Coronaviridae"
                     /mol_type="genomic RNA"
                     /strain="79-1683"
                     /isolate="Actinomycin D sensitive strain"
                     /db_xref="taxon:11118"
                     /map="between ORF2 (spike) and ORF4 (small membrane)"
                     /clone="registered plasmid prefix:PVML"
                     /clone_lib="Lambda ZAPII; FECV strain 79-1683 cDNA
                     /Bluescript SK"
     misc_feature    1..9
                     /note="ORF3a transcription signal"
     CDS             31..168
                     /codon_start=1
                     /product="ORF3a"
                     /protein_id="CAA56929.1"
                     /db_xref="InterPro:IPR006784"
                     /db_xref="UniProtKB/TrEMBL:Q66926"
                     /translation="MDIVKSIDTSVDAVLDELDRAYFAVTLKVEFIIQQMCILHNKNV
                     L"
     misc_feature    267..274
                     /note="ORF3d transcription signal"
     CDS             297..>761
                     /codon_start=1
                     /product="ORF3d"
                     /protein_id="CAA56930.1"
                     /db_xref="InterPro:IPR004293"
                     /db_xref="UniProtKB/TrEMBL:Q66927"
                     /translation="MFKIVSMTLIGPMLIAYGYYIDGTVTTTVLALRFVYLAYFWYVN
                     SRFEFILYNTTTLMFVHGRAAPFMRSSHSSIYVTLYGGINYMFVNDLTLHFVDPMLVS
                     IAIRGLAHADLTVVRAVELLNGDFIYVFSQEPVVGVYNAAFSQAVLNEIDLKE"
     misc_feature    738..743
                     /note="ORF4 transcription signal"
     misc_feature    761
                     /note="ORF4 start codon"
ORIGIN      
        1 actaaactta tgagtcacta cagatcttgt atggatattg tcaaatctat tgacacatcc
       61 gtagacgctg tacttgacga acttgatcgt gcatactttg ctgttactct taaagtagaa
      121 tttataatac agcaaatgtg cattctacac aacaagaacg tgttatagta caacagcatc
      181 aagttgttag tgctagaaca caaaactatt attcagagtt cagcatcgct gtactctttg
      241 tatcattttt ggctttgtac cgtagtacaa actttaagac gtgtgtcggc atcttaatgt
      301 ttaagattgt atcaatgaca cttataggac ctatgcttat agcatatggt tactatattg
      361 atggcactgt tacaacaact gtcttagctt taagatttgt ctacttagca tacttttggt
      421 atgttaatag taggtttgaa tttattttat acaatacaac gacactcatg tttgtacatg
      481 gcagagctgc accgtttatg agaagttctc acagctctat ttatgtcaca ttgtatggtg
      541 gcataaatta tatgtttgtg aatgacctca cgttgcattt tgtagaccct atgcttgtaa
      601 gcatagcaat acgtggctta gctcatgctg atctaactgt tgttagagca gtcgaacttc
      661 tcaatggtga ttttatttat gtattttcac aggagcccgt agtcggtgtt tacaatgcag
      721 ccttttctca ggcggttcta aacgaaattg acttaaaaga a
//

@ababaian
Copy link
Owner

ababaian commented Apr 24, 2020

Desired Fields (when avail)

  • Accession ID

  • CoV Species Taxonomy ID

  • Host Species Taxonomy ID

  • Whole genome or partial

  • open reading frames in sequence (comma seperated)

  • Is blacklisted?

RCE: deleted requested field "FASTA sequence length" because it duplicates information in the sequence itself.

@Bdegraaf1234
Copy link

Bdegraaf1234 commented Apr 24, 2020

I'm looking at this now, the most recent R package (https://bioconductor.org/packages/release/bioc/html/genbankr.html) does not allow for only some of the entries to have chromosome names:

This file appears to have some source features which specify chromosome names and others that do not. This is not currently supported. Please contact the maintainer if you need this feature

I guess I will keep looking, I don't know if they want the files to just be consistent, or if we just need the info. Otherwise, we might delete the chromosome name beforehand?

edit: seems we have only 21 entries with the chromosome name so I will delete it

@Bdegraaf1234
Copy link

Bdegraaf1234 commented Apr 25, 2020

OPEN questions:

  • Duplicates: I'm unclear whats meant there.. @rcedgar
  • How to identify blacklisted entries?

Desired Fields (when avail)

* [x]  Accession ID

* [x]  CoV Species Taxonomy ID

* [x]  Host Species Taxonomy ID --> not as ID yet, 80%

* [x]  Whole genome or partial --> not formatted perfectly (~90%)

* [x]  open reading frames in sequence (comma seperated) --> what do we want? i took all the gene_ids for now

* [ ]  Is blacklisted?

* [x] deleted FASTA sequence length (RCE)

* [ ] duplicates information in the sequence itself. (RCE)
  • definition is mostly formatted like this:
Human coronavirus 229E strain HCoV-229E-8/8/01 spike glycoprotein(S) gene, complete cds.

I'm taking the string between the 1st and 2nd comma but its not 100% consistent, more like 90.
6 possible entries: complete/partial cds, complete/partial sequence, complete/partial genome

  • strain field is quite inconsistent and incomplete (~33%), might be useful though?
  • ORF is now a list of gene_id which looks like this: S,orf3a,orf3b,E,M,orf6,orf7a,orf7b,orf8a,orf8b,N,orf9b. some rows give the orf as 1a, b etc so i cant filter on "orf"
  • #duplicates counts the number of "n" nuleotides

@rcedgar
Copy link
Collaborator Author

rcedgar commented Apr 25, 2020

This issue is solved by having the Genbank records, if we have those there is no need to copy fields into FASTA headers.

@rcedgar rcedgar closed this as completed Apr 25, 2020
@Bdegraaf1234
Copy link

@rcedgar
So, no use for a flattened version of the genbank records/their metadata?

@rcedgar
Copy link
Collaborator Author

rcedgar commented Apr 25, 2020

None that I know of right now. Sorry you wasted time on that. Things are a bit disorganized because we're moving quickly; not sure how to avoid this problem in future, perhaps wait until an issue is assigned to you.

@Bdegraaf1234
Copy link

makes sense, thanks for the quick response

@ababaian
Copy link
Owner

ababaian commented Apr 25, 2020

I set up @Bdegraaf1234 on this. I'd like to have that genbank metadata in a flat TSV file and in an R data.frame for easier downstream analysis. There's alot of these small formatting issues inconsistencies (i.e. data tidying) which will be incredibly helpful to have moving forward.


Duplicates: I'm unclear whats meant there
How to identify blacklisted entries?

We can swing back to these once the main data is cleaned up and well annotated. These values are something we are generating as we process theses bulk sequences into what we are calling a 'pan-genome'

You can see this processing procedure in this notebook entry

@ababaian ababaian reopened this Apr 25, 2020
@rcedgar
Copy link
Collaborator Author

rcedgar commented Apr 25, 2020

The thing with "duplicates" is a formatting problem caused by copy-pasting the list. See the first time it appears above; I edited the list to remove "sequence length" from @ababaian 's original wish-list on the grounds that it was redundant (which it would be with fasta, but not with tsv).

@rcedgar
Copy link
Collaborator Author

rcedgar commented Apr 25, 2020

If we're going to re-format the gb records so that they are more easily parsed, then I would suggest using FASTA per my original "ABC" proposal. This is more flexible and forwards-compatible than TSV, and avoids introducing a new file format -- everything can read fasta, and parsing a defline with fields is just as easy as parsing a tsv file, e.g. in Python it is trivial to do Fields = Line.split(':') where:is the field separator, thus avoiding a dependency on a tsv library import. If the file size of including the sequence is an issue, then the deflines can trivially be obtained by grep "^>" to create a file which is almost the same as a tsv. This secondary file could be generated automatically at the same time the FASTA is made.

@ababaian ababaian linked a pull request Apr 27, 2020 that will close this issue
@taltman
Copy link
Collaborator

taltman commented May 17, 2020

What's the current status on this issue?

@taltman taltman added this to To do in Serratus Annotation via automation May 17, 2020
@taltman taltman added this to the Annotation: Ref Sequences milestone May 17, 2020
@ababaian
Copy link
Owner

Closing this and moving discussion to #101

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bioinformatics Bioinformatics task enhancement New feature or request
Projects
Development

Successfully merging a pull request may close this issue.

4 participants