`SATIVA` requires a 'taxonomy file', i.e. a text file that assigns a taxonomy to each sequence record. We will look up this taxonomy via each record's NCBI taxonomic ID (aka `taxid`, see [here](http://www.ncbi.nlm.nih.gov/taxonomy)).

We are going to generate this file from a bunch of sequence records downloaded from Genbank.



We have downloaded our reference sequences from Genbank in the so-called `genbank` format, a rich format that contains all sorts of metadata for every sequence record. This includes a taxonomic ID (`taxid`), i.e. a number that uniquely identifies every species (and also higher taxonomic levels, such as genera, families, classes, etc.) recorded on Genbank.
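
In a parsed record, the `taxid` shows up as a `db_xref` qualifier of the form `taxon:<number>` on the `source` feature. The following is a minimal sketch of how such a list of strings could be searched for the `taxid`. The helper name `taxid_from_db_xref` and the example values are made up for illustration and are not part of Biopython.

```python
def taxid_from_db_xref(db_xrefs):
    """Return the taxid found in a list of db_xref strings, or None if there is none."""
    for xref in db_xrefs:
        # split "database:identifier" at the first colon
        database, _, identifier = xref.partition(":")
        if database == "taxon":
            return identifier
    return None


# illustrative qualifier values, not taken from the actual file
example_xrefs = ["BOLD:EXAMPLE001", "taxon:7936"]
print(taxid_from_db_xref(example_xrefs))
```

Matching on the part before the colon, rather than testing whether the string merely contains `taxon`, avoids picking up cross references to other databases by accident.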

Let's parse our Genbank file using Biopython.


In [4]:
from Bio import SeqIO

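# parse all records into a dictionary keyed by record id; the file is closed afterwards
with open('../2-nr/CytB_nr_pre_Sativa.gb', 'r') as handle:
    records = SeqIO.to_dict(SeqIO.parse(handle, 'genbank'))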



In the cell below we display the first of our sequence records in Genbank format. Try to identify the `taxid` of the record and use NCBI's taxonomy database ([here](http://www.ncbi.nlm.nih.gov/taxonomy)) to retrieve the full taxonomic tree for the taxon.

In [7]:
# pick the first record id in the dictionary and print that record
first_id = next(iter(records))
print(records[first_id].format('genbank'))

LOCUS       AB021776                1140 bp    DNA              VRT 14-JUL-2016
DEFINITION  Anguilla anguilla mitochondrial cytb gene for cytochrome b, complete
            cds.
ACCESSION   AB021776
VERSION     AB021776.1  GI:12248828
KEYWORDS    .
SOURCE      mitochondrion Anguilla anguilla (European eel)
  ORGANISM  Anguilla anguilla
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Actinopterygii; Neopterygii; Teleostei; Anguilliformes; Anguillidae;
            Anguilla.
REFERENCE   1
  AUTHORS   Aoyama,J.
  TITLE     Molecular phylogeny and evolution of the freshwater eels, genus
            Anguilla
  JOURNAL   Thesis (1998) Ocean Research Institute, University of Tokyo
REFERENCE   2  (bases 1 to 1140)
  AUTHORS   Aoyama,J., Nishida,M. and Tsukamoto,K.
  TITLE     Direct Submission
  JOURNAL   Submitted (23-DEC-1998) Contact:Jun Aoyama Ocean Research Institute,
            University of Tokyo, Division of Fisheries Ecology; 1-15-1,
         

We don't want to make our lives harder than they already are by dealing with _subspecies_, so the tax file for `SATIVA` will only go down to species level.

Iterate over all records and check whether the organism name refers to a species or a subspecies. For a species, store its `taxid` and remember which records belong to it. For a subspecies, note the record id for further processing; it will be assigned to its species in the next step.
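
The decision is based purely on the number of words in the organism name. A minimal sketch of that rule, isolated in a small function, is shown here; the function name `classify_organism` and the label strings are assumptions made for illustration.

```python
def classify_organism(name):
    """Classify an organism name by its word count, as a rough heuristic."""
    parts = name.split(" ")
    if len(parts) == 2:
        # plain binomial, e.g. "Gobio gobio"
        return "species"
    if len(parts) > 2 and parts[1] == "sp.":
        # unidentified species, e.g. "Genus sp. XYZ"
        return "unidentified species"
    # everything else is lumped together as subspecies
    return "subspecies"


for name in ["Gobio gobio", "Barbus barbus barbus"]:
    print("%s: %s" % (name, classify_organism(name)))
```

Note that this is a heuristic: any name that is neither a binomial nor a `sp.` name ends up in the subspecies category, including cultivar-like names in quotes.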

In [9]:
from collections import defaultdict

taxon_to_taxid = {}                # species name -> NCBI taxid
recs_to_adjust = []                # ids of subspecies records, handled in the next cell
taxon_to_recs = defaultdict(list)  # species name -> ids of the records for that species

for key in records.keys():
    r = records[key]

    source = [f for f in r.features if f.type == 'source'][0]
    organism = source.qualifiers['organism'][0]
    name_parts = organism.split(" ")

    # binomials ("Genus species") and unidentified species ("Genus sp. XYZ")
    # are kept as they are; everything else is treated as a subspecies
    if len(name_parts) == 2 or (len(name_parts) > 2 and name_parts[1] == 'sp.'):
        for t in source.qualifiers.get('db_xref', []):
            if t.startswith('taxon:'):
                if organism not in taxon_to_taxid:
                    print("%s  .. add to records" % organism)
                    # store only the number, i.e. the part after "taxon:"
                    taxon_to_taxid[organism] = t.split(":")[1]
                else:
                    print("%s  .. already covered" % organism)
                taxon_to_recs[organism].append(r.id)
    else:
        print("subspecies: %s" % organism)
        recs_to_adjust.append(r.id)


Anguilla anguilla  .. add to records
Coregonus lavaretus  .. add to records
Gobio gobio  .. add to records
Micropterus salmoides  .. add to records
Gobio gobio  .. already covered
Carassius auratus  .. add to records
Pungitius pungitius  .. add to records
Squalius cephalus  .. add to records
Gobio gobio  .. already covered
Pungitius pungitius  .. already covered
Squalius cephalus  .. already covered
Alburnoides bipunctatus  .. add to records
Proterorhinus semilunaris  .. add to records
Pungitius pungitius  .. already covered
Pseudorasbora parva  .. add to records
subspecies: Barbus barbus barbus
Leuciscus leuciscus  .. add to records
Alburnoides bipunctatus  .. already covered
Alburnoides bipunctatus  .. already covered
Carassius auratus  .. already covered
subspecies: Chondrostoma nasus nasus
Pseudorasbora parva  .. already covered
Pungitius pungitius  .. already covered
Coregonus lavaretus  .. already covered
Coregonus lavaretus  .. already covered
subspecies: Carassius auratus aurat

For the records that were identified as _subspecies_, reduce the name to species level and check whether the `taxid` for that species has already been encountered. If not, we'd need to fetch it from NCBI.
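
Reducing a name to species level simply means keeping its first two words. A minimal sketch of the reduction and the subsequent lookup follows; the function name `assign_to_species` and the small dictionary are illustrative stand-ins for the notebook's own data.

```python
def assign_to_species(subspecies_name, known_taxids):
    """Return (species name, taxid or None) for a subspecies name."""
    # keep only genus and species epithet
    species = " ".join(subspecies_name.split(" ")[:2])
    # None signals that the taxid would still have to be fetched
    return species, known_taxids.get(species)


# illustrative stand-in for the species -> taxid dictionary
known = {"Carassius auratus": "7957"}

print(assign_to_species("Carassius auratus grandoculis", known))
print(assign_to_species("Barbus barbus barbus", known))
```

In the second call the species is not in the dictionary, so the second element of the result is `None`.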

In [11]:
from collections import defaultdict

to_fetch = defaultdict(list)  # species name -> ids of records whose species taxid is still unknown

for key in records.keys():
    r = records[key]
    if r.id in recs_to_adjust:
        source = [f for f in r.features if f.type == 'source'][0]
        adjust_from = source.qualifiers['organism'][0]
        # keep only genus and species epithet
        adjust_to = " ".join(adjust_from.split(" ")[:2])
        print("%s -> %s" % (adjust_from, adjust_to))
        if adjust_to in taxon_to_taxid:
            taxon_to_recs[adjust_to].append(r.id)
        else:
            to_fetch[adjust_to].append(r.id)
            


Barbus barbus barbus -> Barbus barbus
Chondrostoma nasus nasus -> Chondrostoma nasus
Carassius auratus auratus -> Carassius auratus
Carassius auratus auratus -> Carassius auratus
Alburnoides bipunctatus strymonicus -> Alburnoides bipunctatus
Alburnoides bipunctatus strymonicus -> Alburnoides bipunctatus
Alburnoides bipunctatus rossicus -> Alburnoides bipunctatus
Carassius auratus grandoculis -> Carassius auratus
Carassius auratus ssp. 'Pingxiang' -> Carassius auratus
Chondrostoma nasus nasus -> Chondrostoma nasus
Alburnoides bipunctatus tzanevi -> Alburnoides bipunctatus
Alburnoides bipunctatus strymonicus -> Alburnoides bipunctatus
Chondrostoma nasus nasus -> Chondrostoma nasus
Chondrostoma nasus nasus -> Chondrostoma nasus
Barbus barbus barbus -> Barbus barbus
Chondrostoma nasus nasus -> Chondrostoma nasus
Barbus barbus barbus -> Barbus barbus
Carassius auratus grandoculis -> Carassius auratus
Barbus barbus barbus -> Barbus barbus
Cyprinus carpio 'xingguonensis' -> Cyprinus carpio
Ca

Check if we are good, or if there are any taxids missing.

In [12]:
if to_fetch:
    print("need to fetch some taxids")
else:
    print("Have taxids for all records")



Have taxids for all records


Write taxids to file and fetch full taxonomy for all of them using `taxit` from the [taxtastic](http://fhcrc.github.io/taxtastic/index.html#) package.

In [13]:
taxids = []

# one taxid per line; the file is closed automatically when the block ends
with open("taxids.txt", 'w') as out:
    for sp in taxon_to_taxid:
        taxids.append(taxon_to_taxid[sp])
        out.write(taxon_to_taxid[sp] + "\n")

Create a comma-separated text file (`taxa.csv`) with the full taxonomic tree for each `taxid`.

In [14]:
!taxit taxtable -d /usr/bin/taxonomy.db -t taxids.txt -o taxa.csv

 

The full 'taxonomy string' for a given taxon, as returned from NCBI, could look like this, for example:

```
cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Actinopterygii; Actinopteri; Neopterygii; Teleostei; Osteoglossocephalai; Clupeocephala; Euteleosteomorpha; Protacanthopterygii; Salmoniformes; Salmonidae; Salmoninae; Salmo; Salmo trutta
```

The number of taxonomic levels in the taxonomy string may vary between taxa. Some taxonomic groups are classified into relatively uncommon intermediate levels that may not exist for other taxa.

To make our lives easier downstream, we will limit ourselves to a defined set of the most common taxonomic levels, which should be known for pretty much all taxa:


- superkingdom
- phylum
- class
- order
- family
- genus
- species
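
The idea can be sketched as follows: given a mapping from rank to taxon name, pick the seven levels in a fixed order and fill in `unknown` wherever a level is missing. The function name `taxonomy_string` and the dictionary layout are assumptions made for this illustration; the example values correspond to the lamprey entry in the output further down, which has no `class`.

```python
TAX_LEVELS = ['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']


def taxonomy_string(rank_to_name, levels=TAX_LEVELS, missing='unknown'):
    """Join the names for the given levels with ';', using a placeholder for gaps."""
    # 'or missing' covers both absent keys and empty strings
    return ";".join(rank_to_name.get(level) or missing for level in levels)


lamprey = {
    'superkingdom': 'Eukaryota',
    'phylum': 'Chordata',
    'order': 'Petromyzontiformes',
    'family': 'Petromyzontidae',
    'genus': 'Lampetra',
    'species': 'Lampetra fluviatilis',
}
print(taxonomy_string(lamprey))
# Eukaryota;Chordata;unknown;Petromyzontiformes;Petromyzontidae;Lampetra;Lampetra fluviatilis
```

Because every string has the same number of fields in the same order, downstream tools can rely on the position of a field to tell which level it represents.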



Extract the 'taxonomy string' for a specific set of taxonomic levels.

In [16]:
from collections import defaultdict

tax_levels = ['superkingdom','phylum','class','order','family','genus','species']
taxdict = defaultdict(list)
taxids_to_taxonomy = {}

infile = open("taxa.csv", 'r')
header = next(infile)

header_as_list = header.strip().replace('"','').split(",")
# column index of each level, in the order given in tax_levels
# (raises a ValueError if a level is missing from the header)
indices = [header_as_list.index(level) for level in tax_levels]

for line in infile:
    line_as_list = line.strip().replace('"',"").split(",")
    # key: taxid (first column); value: all remaining columns
    taxdict[line_as_list[0]] = line_as_list[1:]

infile.close()

for t in taxids:
    taxonomy = []
    for i in indices:
        # -1 because the taxid column was removed from the stored row
        level_taxid = taxdict[t][i-1]
        if level_taxid == "":
            taxonomy.append('unknown')
        else:
            # third remaining column of that taxon's row, taken to be its name
            taxonomy.append(taxdict[level_taxid][2])
    taxids_to_taxonomy[t] = ";".join(taxonomy)
    print("%s %s" % (t, taxids_to_taxonomy[t]))
    


322563 Eukaryota;Chordata;Actinopteri;Gobiiformes;Gobiidae;Proterorhinus;Proterorhinus semilunaris
8030 Eukaryota;Chordata;Actinopteri;Salmoniformes;Salmonidae;Salmo;Salmo salar
58325 Eukaryota;Chordata;Actinopteri;Cypriniformes;Cyprinidae;Leuciscus;Leuciscus leuciscus
69944 Eukaryota;Chordata;Actinopteri;Gadiformes;Lotidae;Lota;Lota lota
278164 Eukaryota;Chordata;Actinopteri;Clupeiformes;Clupeidae;Alosa;Alosa alosa
58324 Eukaryota;Chordata;Actinopteri;Cypriniformes;Cyprinidae;Phoxinus;Phoxinus phoxinus
36185 Eukaryota;Chordata;Actinopteri;Salmoniformes;Salmonidae;Thymallus;Thymallus thymallus
7748 Eukaryota;Chordata;unknown;Petromyzontiformes;Petromyzontidae;Lampetra;Lampetra fluviatilis
40830 Eukaryota;Chordata;Actinopteri;Cypriniformes;Cyprinidae;Barbus;Barbus barbus
8010 Eukaryota;Chordata;Actinopteri;Esociformes;Esocidae;Esox;Esox lucius
58317 Eukaryota;Chordata;Actinopteri;Cypriniformes;Cyprinidae;Blicca;Blicca bjoerkna
7984 Eukaryota;Chordata;Actinopteri;Cypriniformes;Cobitidae;

Write out `*.tax` file for `SATIVA`.

In [17]:
# one line per record: record id, tab, taxonomy string
with open("tax_for_SATIVA.tax", 'w') as out:
    for sp in taxon_to_recs:
        for rec in taxon_to_recs[sp]:
            out.write("%s\t%s\n" % (rec, taxids_to_taxonomy[taxon_to_taxid[sp]]))
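
Before handing the file to `SATIVA`, it may be worth checking that every line has the expected shape: a record id, a tab, and seven semicolon-separated levels. The following is a minimal sketch of such a check. The function name `malformed_tax_lines` and the record ids in the example are hypothetical, and the check only reflects the layout written above, not any validation performed by `SATIVA` itself.

```python
def malformed_tax_lines(lines, n_levels=7):
    """Return the 1-based numbers of lines that are not 'id<TAB>level;level;...'."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # expect exactly two tab-separated fields and n_levels taxonomy levels
        if len(fields) != 2 or len(fields[1].split(";")) != n_levels:
            bad.append(number)
    return bad


# hypothetical record ids, used for illustration only
example_lines = [
    "REC0001\tEukaryota;Chordata;Actinopteri;Salmoniformes;Salmonidae;Salmo;Salmo salar\n",
    "REC0002\tEukaryota;Chordata;Actinopteri\n",
]
print(malformed_tax_lines(example_lines))
```

To run the check on the real file, the lines could be read with `open("tax_for_SATIVA.tax").readlines()` and passed to the function; an empty list means that every line has the expected layout.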