SATIVA requires a 'taxonomy file', i.e. a text file that links each sequence record to an NCBI taxonomic ID (aka taxid - see here).

We are going to generate this file from a bunch of sequence records downloaded from Genbank.

We have downloaded our reference sequences from Genbank in the so-called genbank format, a rich format containing all sorts of metadata for every sequence record. This includes a taxonomic ID (taxid), i.e. an identification number that uniquely identifies every species (and also higher taxonomic levels, such as genera, families, classes, etc.) recorded on Genbank.
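
As a minimal sketch of where the taxid sits: in the records parsed in this notebook it is read from the `db_xref` qualifier of the `source` feature, as a string of the form `taxon:<id>`. The helper name `taxid_from_db_xref` and the `BOLD:ABC123` entry are made up for illustration; the block uses only the standard library and is written for Python 3.

```python
def taxid_from_db_xref(db_xrefs):
    """Return the NCBI taxid from a list of db_xref strings, or None if absent."""
    for xref in db_xrefs:
        # Taxonomy cross-references look like 'taxon:64176';
        # cross-references to other databases carry other prefixes.
        if xref.startswith("taxon:"):
            return xref.split(":", 1)[1]
    return None


# Hypothetical qualifier list, shaped like source.qualifiers['db_xref'].
# 'BOLD:ABC123' is an invented entry to show that non-taxon xrefs are skipped.
example = ["BOLD:ABC123", "taxon:64176"]
print(taxid_from_db_xref(example))
```

Returning `None` instead of raising makes it easy to spot records that carry no taxid at all.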

Let's parse our Genbank file using Biopython.


In [1]:
from Bio import SeqIO

# SeqIO.parse accepts a filename directly, so no file handle is left open
records = SeqIO.to_dict(SeqIO.parse('../fetch_clean_align_tree/12S_preSATIVA_reptiles.gb', 'genbank'))

In the cell below we display the first of our sequence records in Genbank format. Try to identify the taxid of the record and use NCBI's taxonomy database (here) to retrieve the full taxonomic lineage for the taxon.

In [2]:
for r in records.keys():
    print records[r].format('genbank')
    break

LOCUS       KT030713                 387 bp    DNA     linear   VRT 20-JUN-2016
DEFINITION  Podarcis muralis isolate Mur02 12S ribosomal RNA gene, partial
            sequence; mitochondrial.
ACCESSION   KT030713
VERSION     KT030713.1
KEYWORDS    .
SOURCE      mitochondrion Podarcis muralis (common wall lizard)
  ORGANISM  Podarcis muralis
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Lepidosauria; Squamata; Bifurcata; Unidentata; Episquamata;
            Laterata; Lacertibaenia; Lacertidae; Podarcis.
REFERENCE   1  (bases 1 to 387)
  AUTHORS   Rodriguez,V., Buades,J.M., Brown,R.P., Terrasa,B., Perez-Mellado,V.,
            Corti,C., Delaugerre,M., Castro,J.A., Picornell,A. and Ramon,C.
  TITLE     A multilocus analysis of evolution of the endemic wall lizard
            Podarcis tiliguerta from the Mediterranean islands of Corsica and
            Sardinia
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 387)
  AUTHORS   Rodriguez,V., Buad

We don't want to make our lives harder than necessary by dealing with subspecies, so the taxonomy file for SATIVA will only go down to species level.

Iterate over all records and check whether the organism name refers to a species or a subspecies. For a species, store its taxid together with the record ID. For a subspecies, note the record ID for further processing; it will be reduced to its species in the next step.
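
The rule used to tell the cases apart is purely based on the number of words in the organism name. A minimal sketch of that rule, assuming names are whitespace-separated (the function name `classify_organism` and the label `unidentified` are illustrative, and `Podarcis sp. XYZ` is an invented name):

```python
def classify_organism(name):
    """Classify an organism name by its word count, mirroring the notebook's rule."""
    parts = name.split()
    if len(parts) == 2:
        # Plain binomial, e.g. 'Podarcis muralis'
        return "species"
    if len(parts) > 2 and parts[1] == "sp.":
        # Unidentified species, e.g. 'Genus sp. voucher'
        return "unidentified"
    # Everything else is treated as a subspecies, e.g. 'Mus musculus castaneus'
    return "subspecies"


for organism in ["Podarcis muralis", "Mus musculus castaneus", "Podarcis sp. XYZ"]:
    print("%s -> %s" % (organism, classify_organism(organism)))
```

Note that this heuristic also labels any other name with more than two words (hybrids, for example) as a subspecies, so the list of flagged records is worth a quick visual check.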


In [3]:
from collections import defaultdict

taxon_to_taxid = {}
recs_to_adjust = []
taxon_to_recs = defaultdict(list)

for key in records.keys():
    r = records[key]
   
    source = [f for f in r.features if f.type == 'source'][0]
    if (len(source.qualifiers['organism'][0].split(" ")) == 2):
        print source.qualifiers['organism'][0],
        if 'db_xref' in source.qualifiers:
#            print source.qualifiers['db_xref']
            for t in source.qualifiers['db_xref']:
#                print t
                if 'taxon' in t:
                    if not source.qualifiers['organism'][0] in taxon_to_taxid:
                        print " .. add to records"
                        taxon_to_taxid[source.qualifiers['organism'][0]] = t.split(":")[1]
                    else:
                        print " .. already covered"
                    taxon_to_recs[source.qualifiers['organism'][0]].append(r.id)
    elif (len(source.qualifiers['organism'][0].split(" ")) > 2 and source.qualifiers['organism'][0].split(" ")[1] == 'sp.'):
        print source.qualifiers['organism'][0],
        if 'db_xref' in source.qualifiers:
            for t in source.qualifiers['db_xref']:
                if 'taxon' in t:
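                    # key by organism name and store the bare taxid, as in the species branch above
                    if not source.qualifiers['organism'][0] in taxon_to_taxid:
                        taxon_to_taxid[source.qualifiers['organism'][0]] = t.split(":")[1]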
                    else:
                        print " .. already covered" 
                    taxon_to_recs[source.qualifiers['organism'][0]].append(r.id)
    else:
        print "subspecies: %s" %source.qualifiers['organism'][0]
        recs_to_adjust.append(r.id)

Podarcis muralis  .. add to records
Podarcis muralis  .. already covered
Caretta caretta  .. add to records
subspecies: Mus musculus castaneus
Podarcis muralis  .. already covered
Emys orbicularis  .. add to records
Podarcis muralis  .. already covered
Chelonia mydas  .. add to records
Zootoca vivipara  .. add to records
Zootoca vivipara  .. already covered
Eretmochelys imbricata  .. add to records
Zootoca vivipara  .. already covered
Zootoca vivipara  .. already covered
Eretmochelys imbricata  .. already covered
Chelonia mydas  .. already covered
Xenopus laevis  .. add to records
Lacerta viridis  .. add to records
Lacerta agilis  .. add to records
Eretmochelys imbricata  .. already covered
subspecies: Trachemys scripta scripta
Chelydra serpentina  .. add to records
Trachemys scripta  .. add to records
subspecies: Mus musculus castaneus
Podarcis muralis  .. already covered
Chelonia mydas  .. already covered
Podarcis muralis  .. already covered
Podarcis muralis  .. already covered
Podar

For the records identified as *subspecies*, reduce the name to the species and check whether the taxid for that species has already been encountered. If not, we need to fetch it from NCBI.
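
A minimal sketch of this step in isolation: the trinomial is cut down to its first two words, and record IDs are sorted into those whose species is already known and those that still need a lookup. The function names and the record IDs `rec1` and `rec2` are placeholders, not real accessions.

```python
from collections import defaultdict


def to_species(name):
    """Reduce a trinomial such as 'Mus musculus castaneus' to its binomial."""
    return " ".join(name.split()[:2])


def split_by_known(subspecies_recs, known_species):
    """Group record IDs by species, separating known species from ones to fetch."""
    covered, to_fetch = defaultdict(list), defaultdict(list)
    for rec_id, name in subspecies_recs:
        species = to_species(name)
        target = covered if species in known_species else to_fetch
        target[species].append(rec_id)
    return dict(covered), dict(to_fetch)


# Placeholder record IDs paired with organism names seen in this dataset.
recs = [("rec1", "Mus musculus castaneus"), ("rec2", "Trachemys scripta elegans")]
covered, missing = split_by_known(recs, known_species={"Trachemys scripta"})
print(covered)
print(missing)
```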

In [4]:
from collections import defaultdict

to_fetch = defaultdict(list)

for key in records.keys():
    r = records[key]
    if r.id in recs_to_adjust:
        source = [f for f in r.features if f.type == 'source'][0]
        adjust_from = source.qualifiers['organism'][0]
        adjust_to = " ".join(adjust_from.split(" ")[:2])
        print "%s -> %s" %(adjust_from,adjust_to)
        if adjust_to in taxon_to_taxid:
            taxon_to_recs[adjust_to].append(r.id)
        else:
            to_fetch[adjust_to].append(r.id)

Mus musculus castaneus -> Mus musculus
Trachemys scripta scripta -> Trachemys scripta
Mus musculus castaneus -> Mus musculus
Mus musculus castaneus -> Mus musculus
Chrysemys picta bellii -> Chrysemys picta
Trachemys scripta elegans -> Trachemys scripta
Mus musculus castaneus -> Mus musculus
Emys orbicularis orbicularis -> Emys orbicularis
Mus musculus molossinus -> Mus musculus
Mus musculus musculus -> Mus musculus
Mus musculus domesticus -> Mus musculus
Emys orbicularis orbicularis -> Emys orbicularis
Xenopus laevis sudanensis -> Xenopus laevis
Mus musculus musculus -> Mus musculus
Mus musculus musculus -> Mus musculus
Mus musculus musculus -> Mus musculus
Lacerta viridis viridis -> Lacerta viridis
Mus musculus castaneus -> Mus musculus
Mus musculus musculus -> Mus musculus
Trachemys scripta elegans -> Trachemys scripta
Trachemys scripta elegans -> Trachemys scripta


Check if we are good, or if there are any taxids missing.

In [5]:
if to_fetch:
    print "need to fetch some taxids"
else:
    print "Have taxids for all records"

Have taxids for all records


If you need to fetch taxids, run the cells below.

In [6]:
#print to_fetch

defaultdict(<type 'list'>, {})


In [7]:
#from Bio import Entrez
#Entrez.email = "L.Harper@2015.hull.ac.uk"

#for binomial in to_fetch:
    #print binomial
    #handle = Entrez.esearch(db="Taxonomy", term=binomial)
    #record = Entrez.read(handle)
    #print record["IdList"][0]
    #taxon_to_taxid[binomial] = record["IdList"][0]
    
    #taxon_to_recs[binomial] = to_fetch[binomial]
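
One thing to watch if the lookup above is ever enabled: indexing `record["IdList"][0]` fails when a search returns no hits. A minimal sketch of a guard, assuming the search result behaves like a dictionary with an `IdList` key as in the commented code; `first_taxid` is a hypothetical helper and the dictionaries below are hand-written stand-ins, not real query results.

```python
def first_taxid(esearch_result):
    """Return the first ID of an esearch-style result, or None if there were no hits."""
    id_list = esearch_result.get("IdList", [])
    return id_list[0] if id_list else None


# Hand-written stand-ins for parsed search results (illustrative only).
hit = {"IdList": ["10090"]}
no_hit = {"IdList": []}

print(first_taxid(hit))
print(first_taxid(no_hit))
```

A `None` result can then be reported by name, rather than stopping the loop with an `IndexError`.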

Write taxids to file and fetch full taxonomy for all of them using taxit from the taxtastic package.

In [7]:
taxids = []

# the with-block closes the file even if writing fails part-way
with open("taxids.txt", 'w') as out:
    for sp in taxon_to_taxid:
        taxids.append(taxon_to_taxid[sp])
        out.write(taxon_to_taxid[sp]+"\n")
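
If two organism names ever map to the same taxid, the file would contain that taxid twice. A minimal sketch of a variant that writes each taxid once, in sorted order; the function name `write_taxids` is illustrative, and the example writes to a temporary directory so that it does not touch the real `taxids.txt`.

```python
import os
import tempfile


def write_taxids(taxon_to_taxid, path):
    """Write each distinct taxid on its own line and return the list written."""
    # Taxids are strings here, so the order is lexicographic, not numeric.
    taxids = sorted(set(taxon_to_taxid.values()))
    with open(path, "w") as out:
        for taxid in taxids:
            out.write(taxid + "\n")
    return taxids


# Taxids as they appear in this notebook's output for these two species.
example = {"Zootoca vivipara": "8524", "Podarcis muralis": "64176"}
path = os.path.join(tempfile.mkdtemp(), "taxids.txt")
written = write_taxids(example, path)

with open(path) as handle:
    lines = handle.read().splitlines()
print(lines)
```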

Create a comma-separated text file (taxa.csv) with the full taxonomic lineage for each taxid.

In [8]:
!taxit taxtable -d /usr/bin/taxonomy.db -t taxids.txt -o taxa.csv

The full 'taxonomy string' for a given taxon as returned by NCBI could look like this, for example:

cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Actinopterygii; Actinopteri; Neopterygii; Teleostei; Osteoglossocephalai; Clupeocephala; Euteleosteomorpha; Protacanthopterygii; Salmoniformes; Salmonidae; Salmoninae; Salmo; Salmo trutta

The number of taxonomic levels in the taxonomy string may vary between taxa. Some taxonomic groups are classified into relatively uncommon intermediate taxonomic levels that may not exist for other taxa.

To make our lives easier downstream, we will limit ourselves to a defined set of the most common taxonomic levels, which should be known for pretty much all taxa.

  -  superkingdom
  -  phylum
  -  class
  -  order
  -  family
  -  genus
  -  species
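
As a minimal sketch of how a fixed set of levels turns into a taxonomy string, the block below reads a small table with Python's `csv` module and fills any missing level with `unknown`. It assumes a layout of `tax_id`, `parent_id`, `rank`, `tax_name` followed by one column per rank holding the taxid of the ancestor at that rank, which is what the indexing used in this notebook implies. The table is a toy with invented IDs and only four levels, and the code is written for Python 3.

```python
import csv
import io

# Toy table with invented IDs; the 'class' column is deliberately left empty.
TOY_CSV = "\n".join([
    '"tax_id","parent_id","rank","tax_name","class","family","genus","species"',
    '"10","1","family","Lacertidae","","10","",""',
    '"20","10","genus","Podarcis","","10","20",""',
    '"30","20","species","Podarcis muralis","","10","20","30"',
])


def taxonomy_strings(csv_text, levels):
    """Map every tax_id in the table to a ';'-joined string of the chosen levels."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    names = {row["tax_id"]: row["tax_name"] for row in rows}
    result = {}
    for row in rows:
        parts = []
        for level in levels:
            ancestor = row.get(level, "")
            # An empty cell, or an ID missing from the table, becomes 'unknown'.
            parts.append(names.get(ancestor, "unknown") if ancestor else "unknown")
        result[row["tax_id"]] = ";".join(parts)
    return result


strings = taxonomy_strings(TOY_CSV, ["class", "family", "genus", "species"])
print(strings["30"])
```

Using `csv` rather than splitting on commas keeps the parsing correct if a quoted name ever contains a comma. Unlike the cell in this notebook, the sketch always emits every requested level, even one that is absent from the header.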

Extract the 'taxonomy string' for a specific set of taxonomic levels.


In [9]:
from collections import defaultdict

tax_levels=['superkingdom','phylum','class','order','family','genus','species']
indices = []
taxdict = defaultdict(list)
taxids_to_taxonomy = {}
temp_tax_levels = []

infile=open("taxa.csv",'r')
header=infile.next()

header_as_list=header.strip().replace('"','').split(",")
for i in range(len(header_as_list)):
#    print header_as_list[i]
    if header_as_list[i] in tax_levels:
#        print "\t"+header_as_list[i],i
        temp_tax_levels.append(header_as_list[i])
        indices.append(i)
        
for line in infile:
#    print line.strip()
    line_as_list=line.strip().replace('"',"").split(",")
    taxdict[line_as_list[0]] = line_as_list[1:]

infile.close()

#In rare cases main taxonomic levels are missing in all taxa that are to be processed. In such cases we'll adjust
#the list of taxonomic levels to be included in the taxonomy file for SATIVA
if not len(tax_levels) == len(temp_tax_levels):
    tax_levels = temp_tax_levels[:]


for t in taxids:
    print t,
#    print taxdict[t]
    taxonomy=""
    for i in range(len(tax_levels)):
        if taxdict[t][indices[i]-1] == "":
            taxonomy+='unknown'+';'
        else:
            taxonomy+=taxdict[taxdict[t][indices[i]-1]][2]+";"
    print taxonomy[:-1]
    taxids_to_taxonomy[t] = taxonomy[:-1]

8524 Eukaryota;Chordata;unknown;Squamata;Lacertidae;Zootoca;Zootoca vivipara
201439 Eukaryota;Chordata;unknown;Squamata;Colubridae;Zamenis;Zamenis longissimus
100952 Eukaryota;Chordata;Actinopteri;Perciformes;Cottidae;Cottus;Cottus gobio
82168 Eukaryota;Chordata;unknown;Testudines;Emydidae;Emys;Emys orbicularis
64176 Eukaryota;Chordata;unknown;Squamata;Lacertidae;Podarcis;Podarcis muralis
8479 Eukaryota;Chordata;unknown;Testudines;Emydidae;Chrysemys;Chrysemys picta
102178 Eukaryota;Chordata;unknown;Squamata;Anguidae;Anguis;Anguis fragilis
201445 Eukaryota;Chordata;unknown;Squamata;Colubridae;Coronella;Coronella austriaca
65476 Eukaryota;Chordata;unknown;Squamata;Lacertidae;Lacerta;Lacerta viridis
8469 Eukaryota;Chordata;unknown;Testudines;Cheloniidae;Chelonia;Chelonia mydas
8355 Eukaryota;Chordata;Amphibia;Anura;Pipidae;Xenopus;Xenopus laevis
100823 Eukaryota;Chordata;unknown;Squamata;Colubridae;Natrix;Natrix natrix
31155 Eukaryota;Chordata;unknown;Squamata;Viperidae;Vipera;Vipera beru

Write out *.tax file for SATIVA.

In [10]:
out=open("tax_for_SATIVA.tax", 'w')

for sp in taxon_to_recs:
    for rec in taxon_to_recs[sp]:
        out.write("%s\t%s\n" %(rec,taxids_to_taxonomy[taxon_to_taxid[sp]]))
        
out.close()
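
Before handing the file to SATIVA, a quick sanity check can catch malformed lines. A minimal sketch, assuming one line per record with a record ID, a tab, and a `;`-separated taxonomy with the same number of levels on every line, as in the file written above; `check_tax_lines` is a hypothetical helper, and the uniqueness check on record IDs is an assumption of this sketch rather than a documented SATIVA requirement.

```python
def check_tax_lines(lines, n_levels):
    """Return a list of (line_number, reason) tuples for malformed lines."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, 1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            problems.append((number, "expected 2 tab-separated fields"))
            continue
        rec_id, taxonomy = fields
        if rec_id in seen:
            problems.append((number, "duplicate record id"))
        seen.add(rec_id)
        if len(taxonomy.split(";")) != n_levels:
            problems.append((number, "wrong number of taxonomic levels"))
    return problems


good = "GQ142072.1\tEukaryota;Chordata;unknown;Squamata;Lacertidae;Zootoca;Zootoca vivipara\n"
bad = "GQ142072.1\tEukaryota;Chordata\n"  # deliberately broken line for illustration
print(check_tax_lines([good, bad], n_levels=7))
```

On the real file, the function could be called with `open("tax_for_SATIVA.tax").readlines()` and the number of levels actually used.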

In [11]:
!cat tax_for_SATIVA.tax

GQ142072.1	Eukaryota;Chordata;unknown;Squamata;Lacertidae;Zootoca;Zootoca vivipara
KM401599.1	Eukaryota;Chordata;unknown;Squamata;Lacertidae;Zootoca;Zootoca vivipara
AY151993.1	Eukaryota;Chordata;unknown;Squamata;Lacertidae;Zootoca;Zootoca vivipara
AF080334.1	Eukaryota;Chordata;unknown;Squamata;Lacertidae;Zootoca;Zootoca vivipara
KF781329.1	Eukaryota;Chordata;unknown;Squamata;Lacertidae;Zootoca;Zootoca vivipara
AY122780.1	Eukaryota;Chordata;unknown;Squamata;Colubridae;Zamenis;Zamenis longissimus
AB188189.1	Eukaryota;Chordata;Actinopteri;Perciformes;Cottidae;Cottus;Cottus gobio
HG802830.1	Eukaryota;Chordata;unknown;Testudines;Emydidae;Emys;Emys orbicularis
AB090021.1	Eukaryota;Chordata;unknown;Testudines;Emydidae;Emys;Emys orbicularis
HG802786.1	Eukaryota;Chordata;unknown;Testudines;Emydidae;Emys;Emys orbicularis
HG802772.1	Eukaryota;Chordata;unknown;Testudines;Emydidae;Emys;Emys orbicularis
HG802805.1	Eukaryota;Chordata;unknown;Testudines;Emydidae;Emys;Emys orbicularis
JN99