SATIVA requires a 'taxonomy file', i.e. a text file that links each sequence record to an NCBI taxonomic id (aka taxid - see here).

We are going to generate this file from a set of sequence records downloaded from GenBank.

We have downloaded our reference sequences from GenBank in the so-called genbank format, a rich format that contains all sorts of metadata for every sequence record. This includes a taxonomic id (taxid), i.e. an identifier that uniquely identifies every species (and also higher taxonomic levels, such as genera, families, classes, etc.) recorded on GenBank.

Let's parse our Genbank file using Biopython.


In [1]:
from Bio import SeqIO

# SeqIO.parse accepts a file name directly, so no file handle is left open
records = SeqIO.to_dict(SeqIO.parse('../fetch_clean_align_tree/12S_preSATIVA_amphib.gb', 'genbank'))

In the cell below we display the first of our sequence records in GenBank format. Try to identify the taxid of the record and use NCBI's taxonomy database (here) to retrieve the full taxonomic lineage for the taxon.

In [2]:
for r in records.keys():
    print records[r].format('genbank')
    break

LOCUS       GU477775                 445 bp    DNA     linear   VRT 31-DEC-2010
DEFINITION  Chelydra serpentina 12S ribosomal RNA gene, partial sequence;
            mitochondrial.
ACCESSION   GU477775
VERSION     GU477775.1
KEYWORDS    .
SOURCE      mitochondrion Chelydra serpentina (Common snapping turtle)
  ORGANISM  Chelydra serpentina
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Archelosauria; Testudines; Cryptodira; Durocryptodira;
            Americhelydia; Chelydroidea; Chelydridae; Chelydra.
REFERENCE   1  (bases 1 to 445)
  AUTHORS   Zheng,J., Cheng,Q. and Zhang,Q.
  TITLE     Mitochondrial 12S rRNA gene sequence variation and phylogenetic
            relationships of tortoise
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 445)
  AUTHORS   Zheng,J., Cheng,Q. and Zhang,Q.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-JAN-2010) Department of Fishery Biotechnology, East
            China Sea Fisheries Research Institute,
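
The output above is cut off before the FEATURES table, which is where GenBank keeps the taxid: the `source` feature carries a qualifier of the form `/db_xref="taxon:<taxid>"`. As a minimal sketch, the number could be pulled out of such a qualifier list as shown below. The helper name `taxid_from_db_xref` is hypothetical, the code is written for Python 3 (the notebook cells use Python 2), and the example value 8406 is the taxid listed for Pelophylax ridibundus further down in this notebook.

```python
def taxid_from_db_xref(db_xrefs):
    """Return the NCBI taxid found in a list of db_xref strings, or None.

    Hypothetical helper: with Biopython, the list would come from
    source.qualifiers['db_xref'] of a record's 'source' feature.
    """
    for xref in db_xrefs:
        # each entry looks like '<database>:<identifier>'
        database, _, identifier = xref.partition(":")
        if database == "taxon":
            return identifier
    return None


print(taxid_from_db_xref(["taxon:8406"]))
```

Matching on the database name before the colon, rather than on the substring 'taxon' anywhere in the entry, avoids picking up cross-references to other databases by accident.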

We don't want to make our lives harder than they already are by dealing with subspecies, so in the tax file for SATIVA we'll only go down to species level.

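Iterate over all records and check whether each one is assigned to a species or to a subspecies. For species records, store the taxid and the record id. Records identified only to genus level ('Genus sp. ...') are kept under their full name. For subspecies records, store the record id so that the record can be reassigned to its species in the next step.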
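
The decision rule relies purely on the number of words in the organism name. A minimal sketch of that rule, written for Python 3 with a hypothetical function name `classify_organism` and hypothetical category labels:

```python
def classify_organism(name):
    """Classify an organism name by its word count (sketch only)."""
    parts = name.split(" ")
    if len(parts) == 2:
        # plain binomial, e.g. 'Rana temporaria'
        return "species"
    if len(parts) > 2 and parts[1] == "sp.":
        # identified to genus only, e.g. 'Rana sp. <voucher>'
        return "genus only"
    # everything else is treated as a subspecies
    return "subspecies"


for name in ["Rana temporaria", "Bombina variegata scabra"]:
    print(name, "->", classify_organism(name))
```

This is a heuristic: any name that is neither a binomial nor a 'sp.' name ends up in the subspecies bucket, so unusual names (hybrids, 'cf.' or 'aff.' designations) would need to be checked by hand.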


In [3]:
from collections import defaultdict

taxon_to_taxid = {}
recs_to_adjust = []
taxon_to_recs = defaultdict(list)

for key in records.keys():
    r = records[key]
   
    source = [f for f in r.features if f.type == 'source'][0]
    if (len(source.qualifiers['organism'][0].split(" ")) == 2):
        print source.qualifiers['organism'][0],
        if 'db_xref' in source.qualifiers:
#            print source.qualifiers['db_xref']
            for t in source.qualifiers['db_xref']:
#                print t
                if 'taxon' in t:
                    if not source.qualifiers['organism'][0] in taxon_to_taxid:
                        print " .. add to records"
                        taxon_to_taxid[source.qualifiers['organism'][0]] = t.split(":")[1]
                    else:
                        print " .. already covered"
                    taxon_to_recs[source.qualifiers['organism'][0]].append(r.id)
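    elif (len(source.qualifiers['organism'][0].split(" ")) > 2 and source.qualifiers['organism'][0].split(" ")[1] == 'sp.'):
        # taxon identified to genus only ('Genus sp. ...'): keep it under its full name
        print source.qualifiers['organism'][0],
        if 'db_xref' in source.qualifiers:
            for t in source.qualifiers['db_xref']:
                if 'taxon' in t:
                    # look up by organism name and store the bare taxid, as in the species branch
                    if not source.qualifiers['organism'][0] in taxon_to_taxid:
                        print " .. add to records"
                        taxon_to_taxid[source.qualifiers['organism'][0]] = t.split(":")[1]
                    else:
                        print " .. already covered"
                    taxon_to_recs[source.qualifiers['organism'][0]].append(r.id)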
    else:
        print "subspecies: %s" %source.qualifiers['organism'][0]
        recs_to_adjust.append(r.id)

Chelydra serpentina  .. add to records
subspecies: Alytes obstetricans pertinax
Rana temporaria  .. add to records
Bombina bombina  .. add to records
subspecies: Salamandra salamandra morenica
Epidalea calamita  .. add to records
Gallus gallus  .. add to records
Discoglossus pictus  .. add to records
Gallus gallus  .. already covered
subspecies: Bombina variegata scabra
Discoglossus pictus  .. already covered
subspecies: Bombina variegata variegata
Rana dalmatina  .. add to records
Rana temporaria  .. already covered
Pelophylax ridibundus  .. add to records
Rana catesbeiana  .. add to records
Gallus gallus  .. already covered
subspecies: Discoglossus pictus pictus
subspecies: Bombina variegata scabra
Rana dalmatina  .. already covered
Rana temporaria  .. already covered
Triturus carnifex  .. add to records
Hydromantes genei  .. add to records
Pelophylax lessonae  .. add to records
subspecies: Discoglossus pictus auritus
Xenopus laevis  .. add to records
Gallus gallus  .. already covere

For the records that were identified as *subspecies*, reduce the name to the species and check whether the taxid for that species has already been encountered. If not, we would need to fetch it from NCBI.

In [4]:
from collections import defaultdict

to_fetch = defaultdict(list)

for key in records.keys():
    r = records[key]
    if r.id in recs_to_adjust:
        source = [f for f in r.features if f.type == 'source'][0]
        adjust_from = source.qualifiers['organism'][0]
        adjust_to = " ".join(adjust_from.split(" ")[:2])
        print "%s -> %s" %(adjust_from,adjust_to)
        if adjust_to in taxon_to_taxid:
            taxon_to_recs[adjust_to].append(r.id)
        else:
            to_fetch[adjust_to].append(r.id)

Alytes obstetricans pertinax -> Alytes obstetricans
Salamandra salamandra morenica -> Salamandra salamandra
Bombina variegata scabra -> Bombina variegata
Bombina variegata variegata -> Bombina variegata
Discoglossus pictus pictus -> Discoglossus pictus
Bombina variegata scabra -> Bombina variegata
Discoglossus pictus auritus -> Discoglossus pictus
Gallus gallus spadiceus -> Gallus gallus
Salamandra salamandra morenica -> Salamandra salamandra
Lissotriton vulgaris meridionalis -> Lissotriton vulgaris
Salamandra salamandra bernardezi -> Salamandra salamandra
Alytes obstetricans boscai -> Alytes obstetricans
Bombina variegata kolombatovici -> Bombina variegata
Alytes obstetricans boscai -> Alytes obstetricans
Discoglossus pictus auritus -> Discoglossus pictus
Alytes obstetricans almogavarii -> Alytes obstetricans
Gallus gallus murghi -> Gallus gallus
Salamandra salamandra almanzoris -> Salamandra salamandra
Discoglossus pictus auritus -> Discoglossus pictus
Salamandra salamandra bejarae -

Check if we are good, or if there are any taxids missing.

In [5]:
if to_fetch:
    print "need to fetch some taxids"
else:
    print "Have taxids for all records"

Have taxids for all records


Write taxids to file and fetch full taxonomy for all of them using taxit from the taxtastic package.

In [6]:
taxids = []

# write one taxid per line; the 'with' block closes the file for us
with open("taxids.txt",'w') as out:
    for sp in taxon_to_taxid:
        taxids.append(taxon_to_taxid[sp])
        out.write(taxon_to_taxid[sp]+"\n")

Create a comma-separated text file (taxa.csv) with the full taxonomic lineage for each taxid.

In [7]:
!taxit taxtable -d /usr/bin/taxonomy.db -t taxids.txt -o taxa.csv

The full 'taxonomy string' for a given taxon as returned by NCBI could, for example, look like this:

cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Actinopterygii; Actinopteri; Neopterygii; Teleostei; Osteoglossocephalai; Clupeocephala; Euteleosteomorpha; Protacanthopterygii; Salmoniformes; Salmonidae; Salmoninae; Salmo; Salmo trutta

The number of taxonomic levels in the taxonomy string may vary between taxa. Some taxonomic groups are classified into relatively uncommon intermediate taxonomic levels that may not exist for other taxa.

To make our lives easier downstream, we will limit ourselves to a defined set of the most common taxonomic levels, which should be known for pretty much all taxa.

  -  superkingdom
  -  phylum
  -  class
  -  order
  -  family
  -  genus
  -  species
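
The idea of reducing a lineage of variable depth to this fixed set of levels can be sketched as follows. This is an illustration in Python 3 only: the function name `fixed_rank_taxonomy` and the dictionary layout (rank name mapped to taxon name) are assumptions made for the example, and the lineage is the one shown for Pelophylax ridibundus in the output further down, plus one extra intermediate rank for illustration.

```python
TAX_LEVELS = ["superkingdom", "phylum", "class", "order",
              "family", "genus", "species"]


def fixed_rank_taxonomy(lineage, levels=TAX_LEVELS, missing="unknown"):
    """Join the names of the chosen ranks; fill absent ranks with a placeholder."""
    return ";".join(lineage.get(level) or missing for level in levels)


lineage = {
    "superkingdom": "Eukaryota",
    "phylum": "Chordata",
    "subphylum": "Craniata",  # intermediate rank, not in TAX_LEVELS, so it is dropped
    "class": "Amphibia",
    "order": "Anura",
    "family": "Ranidae",
    "genus": "Pelophylax",
    "species": "Pelophylax ridibundus",
}

print(fixed_rank_taxonomy(lineage))
```

Because every taxon ends up with exactly seven fields, the resulting strings can be compared position by position, whatever the depth of the original lineage.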

Extract the 'taxonomy string' for a specific set of taxonomic levels.


In [8]:
from collections import defaultdict

tax_levels=['superkingdom','phylum','class','order','family','genus','species']
indices = []
taxdict = defaultdict(list)
taxids_to_taxonomy = {}

infile=open("taxa.csv",'r')
header=infile.next()

header_as_list=header.strip().replace('"','').split(",")
for i in range(len(header_as_list)):
#    print header_as_list[i]
    if header_as_list[i] in tax_levels:
#        print "\t"+header_as_list[i],i
        indices.append(i)

for line in infile:
    line_as_list=line.strip().replace('"',"").split(",")
    taxdict[line_as_list[0]] = line_as_list[1:]

infile.close()

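for t in taxids:
    print t,
#    print taxdict[t]
    taxonomy=""
    for i in range(len(tax_levels)):
        # rows in taxdict lack the leading tax_id column, hence the -1 offset
        if taxdict[t][indices[i]-1] == "":
            taxonomy+='unknown'+';'
        else:
            # the rank column holds a taxid; element 2 of that taxid's row is its name
            taxonomy+=taxdict[taxdict[t][indices[i]-1]][2]+";"
    print taxonomy[:-1]
    taxids_to_taxonomy[t] = taxonomy[:-1]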

8406 Eukaryota;Chordata;Amphibia;Anura;Ranidae;Pelophylax;Pelophylax ridibundus
8443 Eukaryota;Chordata;Amphibia;Anura;Alytidae;Alytes;Alytes obstetricans
8351 Eukaryota;Chordata;Amphibia;Anura;Alytidae;Discoglossus;Discoglossus pictus
8324 Eukaryota;Chordata;Amphibia;Caudata;Salamandridae;Lissotriton;Lissotriton vulgaris
45623 Eukaryota;Chordata;Amphibia;Anura;Ranidae;Pelophylax;Pelophylax lessonae
256425 Eukaryota;Chordata;Amphibia;Caudata;Salamandridae;Lissotriton;Lissotriton helveticus
54263 Eukaryota;Chordata;Amphibia;Caudata;Salamandridae;Ichthyosaura;Ichthyosaura alpestris
51331 Eukaryota;Chordata;Amphibia;Anura;Ranidae;Rana;Rana dalmatina
30326 Eukaryota;Chordata;Amphibia;Anura;Bufonidae;Epidalea;Epidalea calamita
8355 Eukaryota;Chordata;Amphibia;Anura;Pipidae;Xenopus;Xenopus laevis
57571 Eukaryota;Chordata;Amphibia;Caudata;Salamandridae;Salamandra;Salamandra salamandra
8326 Eukaryota;Chordata;Amphibia;Caudata;Salamandridae;Triturus;Triturus carnifex
8407 Eukaryota;Chordata;Amp
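
Splitting each line on commas works as long as no field contains a comma itself. A possible alternative, sketched here for Python 3, is to let the `csv` module handle quoting and to address columns by name. The column names (`tax_id`, `tax_name`, one column per rank holding a taxid) are an assumption inferred from how the cell above indexes the file, the function name `taxonomy_strings` is hypothetical, and the two-row table with its ids is made up purely for the example.

```python
import csv
import io


def taxonomy_strings(handle, levels, missing="unknown"):
    """Map every tax_id in a taxtable-like CSV to a ';'-joined string of names."""
    rows = {row["tax_id"]: row for row in csv.DictReader(handle)}
    result = {}
    for tax_id, row in rows.items():
        names = []
        for level in levels:
            level_id = row.get(level, "")
            # an empty cell, or an id that is absent from the table, counts as missing
            names.append(rows[level_id]["tax_name"] if level_id in rows else missing)
        result[tax_id] = ";".join(names)
    return result


# made-up miniature table, NOT real NCBI ids
example = io.StringIO(
    '"tax_id","parent_id","rank","tax_name","genus","species"\n'
    '"100","1","genus","Rana","100",""\n'
    '"200","100","species","Rana temporaria","100","200"\n'
)

for tax_id, taxonomy in sorted(taxonomy_strings(example, ["genus", "species"]).items()):
    print(tax_id, taxonomy)
```

Addressing columns by header name also removes the need for the index bookkeeping (`indices` and the -1 offset) used in the cell above.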

Write out the *.tax file for SATIVA.

In [9]:
out=open("tax_for_SATIVA.tax", 'w')

for sp in taxon_to_recs:
    for rec in taxon_to_recs[sp]:
        out.write("%s\t%s\n" %(rec,taxids_to_taxonomy[taxon_to_taxid[sp]]))
        
out.close()
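
Before passing the file on, it can be worth checking that every line has the layout written above: a record id, a tab, and a taxonomy string with one entry per taxonomic level. The following Python 3 sketch does only that. The function name `check_tax_lines` is hypothetical, the record ids in the example are made up, and the check reflects the layout produced in this notebook rather than any formal specification of SATIVA's input.

```python
import io


def check_tax_lines(handle, n_levels=7):
    """Return (line number, message) pairs for lines that break the expected layout."""
    problems = []
    for line_number, line in enumerate(handle, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            problems.append((line_number, "expected 2 tab-separated fields"))
            continue
        if len(fields[1].split(";")) != n_levels:
            problems.append((line_number, "expected %d taxonomic levels" % n_levels))
    return problems


# made-up record ids, for illustration only
example = io.StringIO(
    "rec1\tEukaryota;Chordata;Amphibia;Anura;Ranidae;Rana;Rana temporaria\n"
    "rec2\tEukaryota;Chordata;Amphibia\n"
)

print(check_tax_lines(example))
```

To check the real file, the same function could be called on `open("tax_for_SATIVA.tax")`; an empty list means that every line has the expected shape.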