SATIVA requires a 'taxonomy file', i.e. a text file that links each sequence record to an NCBI taxonomic ID (aka taxid - see here).

We are going to generate this file from a bunch of sequence records downloaded from Genbank.

We have downloaded our reference sequences from Genbank in the so-called genbank format. This is a rich format containing all sorts of metadata for every sequence record, including a specific taxonomic id (taxid), i.e. an identity number that uniquely identifies every species (and also higher taxonomic levels, such as genera, families, classes, etc.) recorded on Genbank.
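
In the parsed records the taxid is found on the `source` feature, as a `db_xref` qualifier of the form `taxon:<number>` (this is what the parsing code further down relies on). As a minimal sketch, the taxid can be pulled out of a list of such qualifier strings like this. The helper name `taxid_from_db_xrefs` is ours, not part of Biopython, and the qualifier lists are made up for illustration.

```python
def taxid_from_db_xrefs(db_xrefs):
    """Return the NCBI taxid from a list of db_xref strings, or None."""
    for xref in db_xrefs:
        # db_xref values look like '<database>:<identifier>'
        database, _, identifier = xref.partition(":")
        if database == "taxon":
            return identifier
    return None


# Hypothetical qualifier lists, for illustration only
print(taxid_from_db_xrefs(["taxon:9612"]))                  # 9612
print(taxid_from_db_xrefs(["BOLD:ABC123", "taxon:52631"]))  # 52631
print(taxid_from_db_xrefs([]))                              # None
```

Matching on the part before the colon, rather than testing whether 'taxon' occurs anywhere in the string, avoids picking up other cross-references by accident.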

Let's parse our Genbank file using Biopython.


In [1]:
from Bio import SeqIO

# SeqIO.parse accepts a filename and takes care of opening and closing the file
records = SeqIO.to_dict(SeqIO.parse('../fetch_clean_align_tree/12S_preSATIVA_mammals.gb', 'genbank'))

In the cell below we display the first of our sequence records in Genbank format. Try to identify the taxid of the record and use NCBI's taxonomy database (here) to retrieve the full taxonomic tree for the taxon.

In [2]:
# print only the first record, then stop
for r in records.keys():
    print(records[r].format('genbank'))
    break

LOCUS       KM092492                 963 bp    DNA     circular MAM 27-OCT-2014
DEFINITION  Neomys fodiens mitochondrion, complete genome.
ACCESSION   KM092492
VERSION     KM092492.1
KEYWORDS    .
SOURCE      mitochondrion Neomys fodiens (Eurasian water shrew)
  ORGANISM  Neomys fodiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Laurasiatheria; Insectivora; Soricidae;
            Soricinae; Neomys.
REFERENCE   1  (bases 1 to 17260)
  AUTHORS   Xu,C., Liu,Z. and Zhao,W.
  TITLE     Mitochondrial DNA complete genome of Neomys fodiens
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 17260)
  AUTHORS   Xu,C., Liu,Z. and Zhao,W.
  TITLE     Direct Submission
  JOURNAL   Submitted (03-JUL-2014) College of Life Science, Northeast
            Agricultural University, No. 59 Mucai Street Xiangfang District,
            Harbin, Heilongjiang 150030, China
FEATURES             Location/Qualifiers
     source          1..963
     

To keep things simple we will not deal with subspecies: the taxonomy file for SATIVA will only go down to species level.

Iterate over all records and check whether the organism name refers to a species or a subspecies. For species-level records, store the taxid. For subspecies records, only note the record id for further processing; these will be reduced to species level in the next step.


In [3]:
from collections import defaultdict

taxon_to_taxid = {}                # organism name -> taxid
recs_to_adjust = []                # ids of subspecies records
taxon_to_recs = defaultdict(list)  # organism name -> record ids

for key in records.keys():
    r = records[key]

    source = [f for f in r.features if f.type == 'source'][0]
    organism = source.qualifiers['organism'][0]
    words = organism.split(" ")

    # binomials ('Genus species') and unidentified species ('Genus sp. XYZ')
    if len(words) == 2 or (len(words) > 2 and words[1] == 'sp.'):
        for t in source.qualifiers.get('db_xref', []):
            if t.startswith('taxon:'):
                if organism not in taxon_to_taxid:
                    print("%s  .. add to records" % organism)
                    # keep only the number after 'taxon:'
                    taxon_to_taxid[organism] = t.split(":")[1]
                else:
                    print("%s  .. already covered" % organism)
                taxon_to_recs[organism].append(r.id)
    # anything else is treated as a subspecies
    else:
        print("subspecies: %s" % organism)
        recs_to_adjust.append(r.id)

Neomys fodiens  .. add to records
Bos taurus  .. add to records
Lepus europaeus  .. add to records
Rangifer tarandus  .. add to records
Myocastor coypus  .. add to records
Capreolus capreolus  .. add to records
Balaenoptera physalus  .. add to records
Mustela putorius  .. add to records
Lagenorhynchus acutus  .. add to records
subspecies: Canis lupus familiaris
Ovis aries  .. add to records
Mustela putorius  .. already covered
Felis catus  .. add to records
Equus caballus  .. add to records
Sus scrofa  .. add to records
Physeter catodon  .. add to records
subspecies: Felis silvestris lybica
subspecies: Canis lupus familiaris
Sus scrofa  .. already covered
Sus scrofa  .. already covered
Micromys minutus  .. add to records
Capreolus capreolus  .. already covered
Equus caballus  .. already covered
Ovis aries  .. already covered
Ovis aries  .. already covered
subspecies: Sus scrofa domesticus
Capra hircus  .. add to records
Mustela putorius  .. already covered
subspecies: Canis lupus famil
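
The decision made in the cell above rests purely on the number of words in the organism name. As a rough sketch, that rule can be isolated in a small function and tried on a few names. `classify_organism` is a hypothetical helper, and 'Myotis sp. ABC' is a made-up name.

```python
def classify_organism(name):
    """Classify an organism name as 'species', 'unidentified' or 'subspecies'."""
    words = name.split(" ")
    if len(words) == 2:
        return "species"       # plain binomial, e.g. 'Sus scrofa'
    if len(words) > 2 and words[1] == "sp.":
        return "unidentified"  # e.g. 'Genus sp. XYZ'
    return "subspecies"        # everything else, as in the cell above


for name in ["Neomys fodiens", "Canis lupus familiaris", "Myotis sp. ABC"]:
    print("%s -> %s" % (name, classify_organism(name)))
```

Note that this is only a heuristic: any name that is neither a two-word binomial nor a 'sp.' name (a hybrid name, for instance) would end up in the subspecies category as well.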

For the records that were identified as being *subspecies*, reduce the name to species level and check whether the taxid for that species has already been encountered. If not, we need to fetch it from NCBI.

In [4]:
from collections import defaultdict

to_fetch = defaultdict(list)  # species without a known taxid -> record ids

for key in records.keys():
    r = records[key]
    if r.id in recs_to_adjust:
        source = [f for f in r.features if f.type == 'source'][0]
        adjust_from = source.qualifiers['organism'][0]
        # keep only genus and species epithet
        adjust_to = " ".join(adjust_from.split(" ")[:2])
        print("%s -> %s" % (adjust_from, adjust_to))
        if adjust_to in taxon_to_taxid:
            taxon_to_recs[adjust_to].append(r.id)
        else:
            to_fetch[adjust_to].append(r.id)

Canis lupus familiaris -> Canis lupus
Felis silvestris lybica -> Felis silvestris
Canis lupus familiaris -> Canis lupus
Sus scrofa domesticus -> Sus scrofa
Canis lupus familiaris -> Canis lupus
Canis lupus familiaris -> Canis lupus
Canis lupus familiaris -> Canis lupus
Canis lupus familiaris -> Canis lupus
Canis lupus familiaris -> Canis lupus
Canis lupus familiaris -> Canis lupus
Canis lupus familiaris -> Canis lupus
Canis lupus familiaris -> Canis lupus
Canis lupus familiaris -> Canis lupus
Canis lupus familiaris -> Canis lupus
Canis lupus familiaris -> Canis lupus
Canis lupus familiaris -> Canis lupus
Sus scrofa domesticus -> Sus scrofa
Mus musculus castaneus -> Mus musculus
Canis lupus familiaris -> Canis lupus
Canis lupus familiaris -> Canis lupus
Canis lupus familiaris -> Canis lupus
Rhinolophus ferrumequinum tragatus -> Rhinolophus ferrumequinum
Sciurus vulgaris fuscoater -> Sciurus vulgaris
Canis lupus familiaris -> Canis lupus
Cervus elaphus yarkandensis -> Cervus elaphus
Cani
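
The same reduction can be tried out on a toy example without any Genbank records. The sketch below assumes a plain list of (record id, organism name) pairs; the record ids are made up and the function names are ours.

```python
from collections import defaultdict


def to_binomial(name):
    """Keep only the first two words (genus and species epithet)."""
    return " ".join(name.split(" ")[:2])


def split_known_unknown(subspecies_recs, known_taxa):
    """Group record ids by binomial, separating taxa with a known taxid."""
    known = defaultdict(list)
    to_fetch = defaultdict(list)
    for rec_id, name in subspecies_recs:
        binomial = to_binomial(name)
        target = known if binomial in known_taxa else to_fetch
        target[binomial].append(rec_id)
    return dict(known), dict(to_fetch)


# Toy input: (record id, organism name) pairs; the ids are made up
recs = [("rec1", "Canis lupus familiaris"),
        ("rec2", "Sus scrofa domesticus"),
        ("rec3", "Canis lupus familiaris")]
known, to_fetch = split_known_unknown(recs, known_taxa={"Sus scrofa"})
print(known)     # {'Sus scrofa': ['rec2']}
print(to_fetch)  # {'Canis lupus': ['rec1', 'rec3']}
```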

Check if we are good, or if there are any taxids missing.

In [5]:
if to_fetch:
    print("need to fetch some taxids")
else:
    print("Have taxids for all records")

need to fetch some taxids


In [6]:
print(to_fetch)

defaultdict(<type 'list'>, {'Crocidura suaveolens': ['AF434825.1'], 'Canis lupus': ['KU290857.1', 'AB499816.1', 'KU290550.1', 'EU789654.1', 'FJ817359.1', 'KU290815.1', 'JF342816.1', 'U12828.1', 'KJ637080.1', 'EU789741.1', 'KU290423.1', 'JX088676.1', 'KU290831.1', 'KU290971.1', 'JF342876.1', 'KU290643.1', 'KU290731.1', 'KM224242.1', 'KU290795.1', 'KJ637068.1', 'KU290793.1', 'KM061533.1', 'AB048590.1', 'KU290785.1', 'EU408291.1', 'KU290907.1', 'JF342845.1', 'JF342836.1', 'AY656746.1', 'KU290905.1', 'KF661086.1', 'KU290939.1', 'DQ480489.1', 'KU290497.1', 'EU408260.1', 'KJ139384.1', 'EU789662.1', 'KU290561.1', 'KU290557.1', 'KU290509.1', 'JX088685.1', 'KU290468.1', 'EU256473.1', 'KU290462.1', 'KM061586.1', 'KU291066.1', 'KU290464.1', 'KF661083.1', 'KU290812.1', 'EU789759.1', 'JF342890.1', 'KF661093.1', 'KU290680.1', 'KU290814.1', 'KU290850.1', 'KU291054.1', 'KU290404.1', 'KJ139388.1', 'KU290432.1', 'DQ480499.1', 'FJ817358.1', 'KU290822.1', 'EU789717.1', 'EU789640.1', 'FJ817364.1', 'JF34286

In [7]:
from Bio import Entrez
Entrez.email = "L.Harper@2015.hull.ac.uk"

for binomial in to_fetch:
    print(binomial)
    # search NCBI's Taxonomy database for the species name
    handle = Entrez.esearch(db="Taxonomy", term=binomial)
    record = Entrez.read(handle)
    handle.close()
    # take the first hit as the taxid for this species
    taxid = record["IdList"][0]
    print(taxid)
    taxon_to_taxid[binomial] = taxid

    taxon_to_recs[binomial] = to_fetch[binomial]

Crocidura suaveolens
52631
Canis lupus
9612
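
The lookup above simply takes the first entry of `IdList`. That list may be empty if the name is not found, and may hold more than one id if the name is ambiguous. A possible guard is sketched below; `pick_single_taxid` is a hypothetical helper that works on the plain list, so it can be tried without network access.

```python
def pick_single_taxid(id_list, name):
    """Return the only taxid in id_list, or raise if the hit is not unique."""
    if len(id_list) == 0:
        raise ValueError("no taxid found for %r" % name)
    if len(id_list) > 1:
        raise ValueError("ambiguous name %r: %s" % (name, ", ".join(id_list)))
    return id_list[0]


# In the notebook this would be called as
#   pick_single_taxid(record["IdList"], binomial)
print(pick_single_taxid(["9612"], "Canis lupus"))  # 9612
```

Raising an error is a deliberate choice here: a wrong or missing taxid would otherwise only surface much later, as a wrong taxonomy line in the file for SATIVA.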


Write taxids to file and fetch full taxonomy for all of them using taxit from the taxtastic package.

In [8]:
taxids = []

# one taxid per line; the file is closed automatically at the end of the block
with open("taxids.txt", 'w') as out:
    for sp in taxon_to_taxid:
        taxids.append(taxon_to_taxid[sp])
        out.write(taxon_to_taxid[sp] + "\n")

Create a comma-separated text file (taxa.csv) with the full taxonomic tree for each taxid.

In [9]:
!taxit taxtable -d /usr/bin/taxonomy.db -t taxids.txt -o taxa.csv

The full 'taxonomy string' for a given taxon as returned from NCBI could, for example, look like this:

cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Actinopterygii; Actinopteri; Neopterygii; Teleostei; Osteoglossocephalai; Clupeocephala; Euteleosteomorpha; Protacanthopterygii; Salmoniformes; Salmonidae; Salmoninae; Salmo; Salmo trutta

The number of taxonomic levels in the taxonomy string may vary between taxa. Some taxonomic groups are classified into relatively uncommon intermediate taxonomic levels, that may not exist for other taxa.

In order to make our lives easier downstream we will limit ourselves only to a defined set of the most common taxonomic levels, that should be known for pretty much all taxa.

  -  superkingdom
  -  phylum
  -  class
  -  order
  -  family
  -  genus
  -  species
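
To illustrate the idea, here is a minimal sketch that builds such a fixed-level taxonomy string from a rank-to-name mapping. Levels that are missing are filled with 'unknown' and intermediate ranks are dropped. The function name and the input dictionary are assumptions made for this example; the notebook itself reads the names from the taxit output.

```python
TAX_LEVELS = ["superkingdom", "phylum", "class", "order",
              "family", "genus", "species"]


def taxonomy_string(lineage, levels=TAX_LEVELS, missing="unknown"):
    """Join the names for the chosen ranks, filling gaps with a placeholder."""
    return ";".join(lineage.get(level) or missing for level in levels)


# Rank -> name mapping for one taxon. 'order' is deliberately absent and
# 'subfamily' is an intermediate rank that is not in TAX_LEVELS.
lineage = {
    "superkingdom": "Eukaryota",
    "phylum": "Chordata",
    "class": "Mammalia",
    "family": "Bovidae",
    "subfamily": "Caprinae",
    "genus": "Ovis",
    "species": "Ovis aries",
}
print(taxonomy_string(lineage))
# Eukaryota;Chordata;Mammalia;unknown;Bovidae;Ovis;Ovis aries
```

Because every string has the same number of fields, the lines for different taxa can be compared level by level downstream.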

Extract the 'taxonomy string' for a specific set of taxonomic levels.


In [10]:
import csv

tax_levels = ['superkingdom','phylum','class','order','family','genus','species']
taxdict = {}
taxids_to_taxonomy = {}

with open("taxa.csv", 'r') as infile:
    reader = csv.reader(infile)
    header = next(reader)
    # column index of every wanted level, in the order given in tax_levels
    indices = [header.index(level) for level in tax_levels]

    # taxid -> remaining columns of that row
    for row in reader:
        taxdict[row[0]] = row[1:]

for t in taxids:
    names = []
    for i in indices:
        # taxid of the ancestor at this level (empty if the level is not defined)
        level_taxid = taxdict[t][i-1]
        if level_taxid == "":
            names.append('unknown')
        else:
            # the third of the remaining columns holds the taxon name
            names.append(taxdict[level_taxid][2])
    taxonomy = ";".join(names)
    print("%s %s" % (t, taxonomy))
    taxids_to_taxonomy[t] = taxonomy

47230 Eukaryota;Chordata;Mammalia;Rodentia;Cricetidae;Microtus;Microtus arvalis
9940 Eukaryota;Chordata;Mammalia;unknown;Bovidae;Ovis;Ovis aries
9883 Eukaryota;Chordata;Mammalia;unknown;Cervidae;Hydropotes;Hydropotes inermis
9870 Eukaryota;Chordata;Mammalia;unknown;Cervidae;Rangifer;Rangifer tarandus
9720 Eukaryota;Chordata;Mammalia;Carnivora;Phocidae;Phoca;Phoca vitulina
59463 Eukaryota;Chordata;Mammalia;Chiroptera;Vespertilionidae;Myotis;Myotis lucifugus
39082 Eukaryota;Chordata;Mammalia;Rodentia;Gliridae;Muscardinus;Muscardinus avellanarius
39089 Eukaryota;Chordata;Mammalia;Carnivora;Phocidae;Phoca;Phoca groenlandica
103596 Eukaryota;Chordata;Mammalia;Cetacea;Delphinidae;Peponocephala;Peponocephala electra
51298 Eukaryota;Chordata;Mammalia;Chiroptera;Vespertilionidae;Myotis;Myotis myotis
30532 Eukaryota;Chordata;Mammalia;unknown;Cervidae;Dama;Dama dama
59472 Eukaryota;Chordata;Mammalia;Chiroptera;Vespertilionidae;Pipistrellus;Pipistrellus kuhlii
9755 Eukaryota;Chordata;Mammalia;Ceta

Write out *.tax file for SATIVA.

In [11]:
# one line per record: record id, tab, taxonomy string
with open("tax_for_SATIVA.tax", 'w') as out:
    for sp in taxon_to_recs:
        for rec in taxon_to_recs[sp]:
            out.write("%s\t%s\n" % (rec, taxids_to_taxonomy[taxon_to_taxid[sp]]))
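
Before handing the file to SATIVA it can be worth checking that every line has the expected shape. The sketch below is not part of SATIVA. It assumes the layout written above (record id, tab, seven semicolon-separated levels) and, as a further assumption, that every record id should occur only once. The example lines use made-up record ids.

```python
def check_tax_lines(lines, n_levels=7):
    """Return (line number, problem) tuples for malformed .tax lines."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            problems.append((number, "expected 2 tab-separated fields"))
            continue
        rec_id, taxonomy = fields
        if rec_id in seen:
            problems.append((number, "duplicate record id %s" % rec_id))
        seen.add(rec_id)
        if len(taxonomy.split(";")) != n_levels:
            problems.append((number, "expected %d taxonomic levels" % n_levels))
    return problems


# Made-up example lines; in practice: check_tax_lines(open("tax_for_SATIVA.tax"))
example = [
    "rec1\tEukaryota;Chordata;Mammalia;Carnivora;Phocidae;Phoca;Phoca vitulina\n",
    "rec2\tEukaryota;Chordata;Mammalia\n",
]
print(check_tax_lines(example))
# [(2, 'expected 7 taxonomic levels')]
```

An empty list means that no problems were found.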