SATIVA requires a 'taxonomy file', i.e. a text file that links each sequence record to its taxonomy, which we will look up via the record's NCBI taxonomic ID (aka taxid - see here).

We are going to generate this file from a bunch of sequence records downloaded from Genbank.

We have downloaded our reference sequences from Genbank in the so-called genbank format, a rich format containing all sorts of metadata for every sequence record. This includes a specific taxonomic ID (taxid), i.e. an identity number that uniquely identifies every species (and also higher taxonomic levels, such as genera, families, classes, etc.) recorded on Genbank.
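
In a parsed Genbank record the taxid is stored in the `db_xref` qualifier of the `source` feature, as a string of the form `taxon:<number>`. The following is a minimal sketch of pulling the number out of such a qualifier list. The helper name `taxid_from_db_xref` and the example values are made up for illustration and do not come from a real record.

```python
def taxid_from_db_xref(db_xrefs):
    """Return the numeric taxid (as a string) from a list of db_xref values, or None."""
    for xref in db_xrefs:
        # db_xref entries have the form 'database:identifier'
        database, _, identifier = xref.partition(":")
        if database == "taxon":
            return identifier
    return None

# Placeholder qualifier values, not taken from a real record
example = ["OTHERDB:XYZ123", "taxon:12345"]
taxid = taxid_from_db_xref(example)
```

Matching on the part before the colon avoids picking up an unrelated cross-reference that merely contains the word 'taxon'.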

Let's parse our Genbank file using Biopython.


In [1]:
from Bio import SeqIO

records = SeqIO.to_dict(SeqIO.parse(open('../fetch_clean_align_tree/12S_preSATIVA_birds.gb','r'),'genbank'))

In the cell below we display the first of our sequence records in Genbank format. Try to identify the taxid of the record and use NCBI's taxonomy database (here) to retrieve the full taxonomic tree for the taxon.

In [2]:
for r in records.keys():
    print records[r].format('genbank')
    break

LOCUS       LC012932                 651 bp    DNA     linear   VRT 12-NOV-2015
DEFINITION  Chrysolophus pictus mitochondrial gene for 12S rRNA, partial
            sequence.
ACCESSION   LC012932
VERSION     LC012932.1
KEYWORDS    .
SOURCE      mitochondrion Chrysolophus pictus (golden pheasant)
  ORGANISM  Chrysolophus pictus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Archelosauria; Archosauria; Dinosauria; Saurischia; Theropoda;
            Coelurosauria; Aves; Neognathae; Galloanserae; Galliformes;
            Phasianidae; Phasianinae; Chrysolophus.
REFERENCE   1
  AUTHORS   Nadeem,M.S., Ahmad,H., Muhammad,K., Mahmood,S.A., Akbar,N.,
            Murtaza,B.N., Saleem,A. and Rehman,M.
  TITLE     Chrysolophus pictus 12S RNA gene partial sequence
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 651)
  AUTHORS   Nadeem,M.S.
  TITLE     Direct Submission
  JOURNAL   Submitted (08-DEC-2014) Contact:Muhammad Shahid Nadeem Hazara
            

We don't want to make our life harder than it already is by dealing with subspecies, so the tax file for SATIVA will only go down to species level.

Iterate over all records and check whether the organism name is a plain binomial (or an unnamed 'Genus sp.' taxon). If so, store its taxid together with the record ID. Any other name is treated as a subspecies: for those, note the record ID for further processing.
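
The decision is made purely on the number of words in the organism name. A minimal sketch of that rule on its own, with a hypothetical helper name `classify_organism`, could look like this. Note that, as in the notebook, anything that is neither a two-word name nor a 'Genus sp. ...' name ends up in the subspecies bin.

```python
def classify_organism(name):
    """Classify an organism name by its word count."""
    words = name.split(" ")
    if len(words) == 2:
        # plain binomial, e.g. 'Apus apus'
        return "species"
    if len(words) > 2 and words[1] == "sp.":
        # unnamed species, e.g. 'Genus sp. some-label'
        return "unnamed species"
    # everything else is treated as a subspecies
    return "subspecies"

labels = [classify_organism(n) for n in
          ["Apus apus", "Buteo buteo burmanicus", "Turdus sp. label-1"]]
```

Because the rule only counts words, names such as hybrids or 'cf.' identifications would also be labelled as subspecies, so it is worth checking the printed list by eye.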


In [3]:
from collections import defaultdict

taxon_to_taxid = {}
recs_to_adjust = []
taxon_to_recs = defaultdict(list)

for key in records.keys():
    r = records[key]

    # the 'source' feature holds the organism name and the taxid cross-reference
    source = [f for f in r.features if f.type == 'source'][0]
    organism = source.qualifiers['organism'][0]
    words = organism.split(" ")

    if len(words) == 2:
        # plain binomial, i.e. a species-level record
        print organism,
        for t in source.qualifiers.get('db_xref', []):
            if 'taxon' in t:
                if not organism in taxon_to_taxid:
                    print " .. add to records"
                    taxon_to_taxid[organism] = t.split(":")[1]
                else:
                    print " .. already covered"
                taxon_to_recs[organism].append(r.id)
    elif len(words) > 2 and words[1] == 'sp.':
        # unnamed species ('Genus sp. ...'), keyed by the full organism name
        print organism,
        for t in source.qualifiers.get('db_xref', []):
            if 'taxon' in t:
                if not organism in taxon_to_taxid:
                    print " .. add to records"
                    # keep only the numeric part, as for the binomials above
                    taxon_to_taxid[organism] = t.split(":")[1]
                else:
                    print " .. already covered"
                taxon_to_recs[organism].append(r.id)
    else:
        # anything else is treated as a subspecies and handled in the next step
        print "subspecies: %s" %organism
        recs_to_adjust.append(r.id)

Chrysolophus pictus  .. add to records
Schoeniclus aureolus  .. add to records
Apus apus  .. add to records
Phoenicopterus roseus  .. add to records
Podilymbus podiceps  .. add to records
Poecile palustris  .. add to records
Dupetor flavicollis  .. add to records
Phoenicopterus ruber  .. add to records
Tetrao urogallus  .. add to records
Remiz consobrinus  .. add to records
Emberiza pusilla  .. add to records
Turdus migratorius  .. add to records
Dupetor flavicollis  .. already covered
Fulica americana  .. add to records
Turdus philomelos  .. add to records
Phylloscopus orientalis  .. add to records
Podiceps auritus  .. add to records
Abrornis inornata  .. add to records
Picus viridis  .. add to records
subspecies: Buteo buteo burmanicus
Charadrius semipalmatus  .. add to records
Emberiza citrinella  .. add to records
Recurvirostra avosetta  .. add to records
Cygnus cygnus  .. add to records
Pluvialis dominica  .. add to records
Luscinia cyanura  .. add to records
Arenaria interpres  .

For the records that were identified as *subspecies*, reduce the name to the species binomial and check whether the taxid for that species has already been encountered. If not, we need to fetch it from NCBI.
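
Reducing a subspecies name to its species simply means keeping the first two words. As a small sketch, with placeholder record IDs rather than real accessions, several subspecies of the same species collapse onto one binomial:

```python
from collections import defaultdict

def to_binomial(name):
    """Keep only the genus and species epithet of an organism name."""
    return " ".join(name.split(" ")[:2])

# Record IDs here are placeholders, not real accessions
subspecies_records = [
    ("REC1", "Motacilla flava lutea"),
    ("REC2", "Motacilla flava feldegg"),
    ("REC3", "Buteo buteo burmanicus"),
]

# group record IDs under the species they belong to
by_species = defaultdict(list)
for rec_id, organism in subspecies_records:
    by_species[to_binomial(organism)].append(rec_id)
```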

In [4]:
from collections import defaultdict

to_fetch = defaultdict(list)

for key in records.keys():
    r = records[key]
    if r.id in recs_to_adjust:
        source = [f for f in r.features if f.type == 'source'][0]
        adjust_from = source.qualifiers['organism'][0]
        adjust_to = " ".join(adjust_from.split(" ")[:2])
        print "%s -> %s" %(adjust_from,adjust_to)
        if adjust_to in taxon_to_taxid:
            taxon_to_recs[adjust_to].append(r.id)
        else:
            to_fetch[adjust_to].append(r.id)

Buteo buteo burmanicus -> Buteo buteo
Motacilla flava cinereocapilla -> Motacilla flava
Oenanthe oenanthe oenanthe -> Oenanthe oenanthe
Motacilla cinerea cinerea -> Motacilla cinerea
Falco tinnunculus interstinctus -> Falco tinnunculus
Trachemys scripta elegans -> Trachemys scripta
Milvus migrans lineatus -> Milvus migrans
Cygnus columbianus jankowskii -> Cygnus columbianus
Motacilla flava lutea -> Motacilla flava
Motacilla flava flava -> Motacilla flava
Oxyura jamaicensis jamaicensis -> Oxyura jamaicensis
Motacilla flava feldegg -> Motacilla flava
Gallinago gallinago gallinago -> Gallinago gallinago
Turdus ruficollis ruficollis -> Turdus ruficollis
Loxia curvirostra pusilla -> Loxia curvirostra
Loxia leucoptera megaplaga -> Loxia leucoptera
Trachemys scripta elegans -> Trachemys scripta
Pandion haliaetus haliaetus -> Pandion haliaetus
Phylloscopus fuscatus robustus -> Phylloscopus fuscatus
Hirundo rustica erythrogaster -> Hirundo rustica
Phylloscopus fuscatus fuscatus -> Phylloscopus 

Check if we are good, or if there are any taxids missing.

In [5]:
if to_fetch:
    print "need to fetch some taxids"
else:
    print "Have taxids for all records"

need to fetch some taxids


In [6]:
print to_fetch

defaultdict(<type 'list'>, {'Dendrocopos minor': ['KF766041.1'], 'Oenanthe oenanthe': ['EU154491.1'], 'Thalasseus sandvicensis': ['AY631349.1'], 'Turdus naumanni': ['KJ834096.1'], 'Turdus ruficollis': ['EU154560.1'], 'Zoothera sibirica': ['EU154578.1'], 'Zoothera dauma': ['EU154573.1', 'KT340629.1'], 'Turdus viscivorus': ['EU154569.1'], 'Loxia leucoptera': ['AF171660.1', 'AF171655.1', 'AF171661.1'], 'Motacilla flava': ['AY259399.1', 'AY259421.1', 'AY259411.1', 'AY259407.1', 'AY259404.1', 'AY259424.1'], 'Hirundo rustica': ['KX398931.1', 'KP148840.1'], 'Loxia curvirostra': ['AF171662.1', 'AF171658.1', 'AF171652.1', 'AF171653.1'], 'Seicercus borealis': ['AY635094.1'], 'Oxyura jamaicensis': ['AY747700.1']})


In [7]:
from Bio import Entrez
Entrez.email = "L.Harper@2015.hull.ac.uk"

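for binomial in to_fetch:
    print binomial
    # search the NCBI taxonomy database by species name
    handle = Entrez.esearch(db="Taxonomy", term=binomial)
    record = Entrez.read(handle)
    handle.close()
    # take the first hit; this assumes the search returned at least one taxid
    print record["IdList"][0]
    taxon_to_taxid[binomial] = record["IdList"][0]
    
    taxon_to_recs[binomial] = to_fetch[binomial]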

Dendrocopos minor
1517834
Oenanthe oenanthe
279966
Thalasseus sandvicensis
126723
Turdus naumanni
34940
Turdus ruficollis
411525
Zoothera sibirica
311352
Zoothera dauma
36288
Turdus viscivorus
301543
Loxia leucoptera
96539
Motacilla flava
180448
Hirundo rustica
43150
Loxia curvirostra
64802
Seicercus borealis
36273
Oxyura jamaicensis
8884
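
The loop above takes `record["IdList"][0]` directly, which raises an `IndexError` if a name returns no hits (for example because of a synonym or a misspelling). A minimal sketch of a guard, assuming the parsed result behaves like a dictionary with an `IdList` key as it is used above, could look like this. The helper name `first_taxid` and the example values are hypothetical, and plain dictionaries stand in for real search results.

```python
def first_taxid(search_result):
    """Return the first ID of an esearch-style result, or None when nothing matched."""
    id_list = search_result.get("IdList", [])
    return id_list[0] if id_list else None

# Plain dicts standing in for parsed search results (values are placeholders)
hit = {"IdList": ["12345"]}
miss = {"IdList": []}

# collect the names that could not be resolved instead of crashing on them
unresolved = [name for name, result in [("Name one", hit), ("Name two", miss)]
              if first_taxid(result) is None]
```

Names that end up in `unresolved` can then be checked by hand in the NCBI taxonomy browser.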


Write taxids to file and fetch full taxonomy for all of them using taxit from the taxtastic package.

In [8]:
taxids = []

out=open("taxids.txt",'w')
for sp in taxon_to_taxid:
    taxids.append(taxon_to_taxid[sp])
    out.write(taxon_to_taxid[sp]+"\n")
out.close()

Create a comma-separated text file (taxa.csv) with the full taxonomic tree for each taxid.

In [9]:
!taxit taxtable -d /usr/bin/taxonomy.db -t taxids.txt -o taxa.csv
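
The resulting table can be read with Python's `csv` module, which copes with quoted fields that themselves contain commas. The sketch below uses a tiny stand-in for taxa.csv. It assumes that the table starts with the columns `tax_id`, `parent_id`, `rank` and `tax_name`, followed by one column per rank holding the taxid of the ancestor at that rank. This layout is inferred from how the table is used in this notebook, and all IDs and names in the example are invented.

```python
import csv

# Tiny stand-in for taxa.csv; column layout and all values are assumptions
sample = '''"tax_id","parent_id","rank","tax_name","genus","species"
"100","1","genus","Examplegenus","100",""
"101","100","species","Examplegenus, var. x","100","101"
'''

def read_taxtable(lines):
    """Map each tax_id to its row (a dict keyed by column name)."""
    return {row["tax_id"]: row for row in csv.DictReader(lines)}

table = read_taxtable(sample.splitlines())

# the 'genus' column of a species row holds the taxid of its genus
species_row = table["101"]
genus_name = table[species_row["genus"]]["tax_name"]
```

For the real file, an open file handle can be passed to `read_taxtable` in place of the list of lines.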

The full 'taxonomy string' for a given taxon as returned from NCBI could look like this, for example:

cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Actinopterygii; Actinopteri; Neopterygii; Teleostei; Osteoglossocephalai; Clupeocephala; Euteleosteomorpha; Protacanthopterygii; Salmoniformes; Salmonidae; Salmoninae; Salmo; Salmo trutta

The number of taxonomic levels in the taxonomy string may vary between taxa. Some taxonomic groups are classified into relatively uncommon intermediate taxonomic levels that may not exist for other taxa.

To make our lives easier downstream, we will limit ourselves to a defined set of the most common taxonomic levels, which should be known for pretty much all taxa:

  -  superkingdom
  -  phylum
  -  class
  -  order
  -  family
  -  genus
  -  species
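
As a sketch of the idea, given a mapping from rank to name for one taxon, the fixed set of levels above can be turned into a taxonomy string with a placeholder for any level that is missing. The function name `taxonomy_string` and the `missing` parameter are illustrative choices. The example lineage is the one for Tetrao urogallus shown in this notebook's output, plus an invented extra rank to show that other levels are ignored.

```python
TAX_LEVELS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]

def taxonomy_string(rank_to_name, levels=TAX_LEVELS, missing="unknown"):
    """Join the names for the chosen levels, filling gaps with a placeholder."""
    return ";".join(rank_to_name.get(level) or missing for level in levels)

lineage = {
    "superkingdom": "Eukaryota",
    "phylum": "Chordata",
    "class": "Aves",
    "order": "Galliformes",
    "family": "Phasianidae",
    "genus": "Tetrao",
    "species": "Tetrao urogallus",
    "some_intermediate_rank": "ignored",  # invented rank, dropped from the output
}

full = taxonomy_string(lineage)

# the same taxon with the 'order' level left empty
partial = taxonomy_string(dict(lineage, order=""))
```

Every string produced this way has exactly seven fields, regardless of how many intermediate levels a taxon has.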

Extract the 'taxonomy string' for a specific set of taxonomic levels.


In [10]:
from collections import defaultdict

tax_levels=['superkingdom','phylum','class','order','family','genus','species']
indices = []
taxdict = defaultdict(list)
taxids_to_taxonomy = {}

infile=open("taxa.csv",'r')
header=infile.next()

header_as_list=header.strip().replace('"','').split(",")
for i in range(len(header_as_list)):
#    print header_as_list[i]
    if header_as_list[i] in tax_levels:
#        print "\t"+header_as_list[i],i
        indices.append(i)

for line in infile:
    line_as_list=line.strip().replace('"',"").split(",")
    taxdict[line_as_list[0]] = line_as_list[1:]

infile.close()

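for t in taxids:
    print t,
    taxonomy=""
    for i in range(len(tax_levels)):
        # taxdict values lack the leading tax_id column, hence the -1 offset
        if taxdict[t][indices[i]-1] == "":
            taxonomy+='unknown'+';'
        else:
            # the rank column holds a taxid; look up the name of that taxon
            taxonomy+=taxdict[taxdict[t][indices[i]-1]][2]+";"
    # drop the trailing semicolon
    print taxonomy[:-1]
    taxids_to_taxonomy[t] = taxonomy[:-1]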

85097 Eukaryota;Chordata;Aves;Gaviiformes;Gaviidae;Gavia;Gavia pacifica
1108006 Eukaryota;Chordata;Aves;Passeriformes;Motacillidae;Anthus;Anthus novaeseelandiae
132587 Eukaryota;Chordata;Aves;Anseriformes;Anatidae;Anser;Anser fabalis
100830 Eukaryota;Chordata;Aves;Galliformes;Phasianidae;Tetrao;Tetrao urogallus
257867 Eukaryota;Chordata;Aves;Pelecaniformes;Threskiornithidae;Platalea;Platalea leucorodia
9052 Eukaryota;Chordata;Aves;Galliformes;Phasianidae;Perdix;Perdix perdix
9091 Eukaryota;Chordata;Aves;Galliformes;Phasianidae;Coturnix;Coturnix coturnix
9252 Eukaryota;Chordata;Aves;Podicipediformes;Podicipedidae;Podilymbus;Podilymbus podiceps
37610 Eukaryota;Chordata;Aves;Passeriformes;Turdidae;Erithacus;Erithacus rubecula
172684 Eukaryota;Chordata;Aves;Gruiformes;Otididae;Tetrax;Tetrax tetrax
47245 Eukaryota;Chordata;Aves;Columbiformes;Columbidae;Zenaida;Zenaida macroura
48895 Eukaryota;Chordata;Aves;Passeriformes;Paridae;Poecile;Poecile montanus
9121 Eukaryota;Chordata;Aves;Gruiforme

Write out *.tax file for SATIVA.

In [11]:
out=open("tax_for_SATIVA.tax", 'w')

for sp in taxon_to_recs:
    for rec in taxon_to_recs[sp]:
        out.write("%s\t%s\n" %(rec,taxids_to_taxonomy[taxon_to_taxid[sp]]))
        
out.close()
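
Each line written above consists of a record ID, a tab, and the semicolon-separated taxonomy string. A quick sanity check of the file, sketched here under the assumption that every line should carry exactly seven levels, could parse each line back and complain if the count is off. The helper name `parse_tax_line` and the record ID in the example are placeholders.

```python
def parse_tax_line(line, n_levels=7):
    """Split one line of the tax file into record ID and list of levels."""
    rec_id, taxonomy = line.rstrip("\n").split("\t")
    levels = taxonomy.split(";")
    if len(levels) != n_levels:
        raise ValueError("%s: expected %d levels, found %d"
                         % (rec_id, n_levels, len(levels)))
    return rec_id, levels

# Placeholder record ID with a taxonomy string from this notebook's output
line = "REC0001.1\tEukaryota;Chordata;Aves;Galliformes;Phasianidae;Tetrao;Tetrao urogallus\n"
rec_id, levels = parse_tax_line(line)
```

Running this over every line of tax_for_SATIVA.tax would flag records whose taxonomy string was built incorrectly before the file is passed on.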