# Running SATIVA - positives

Prepare input files.
The alignment was produced with ReproPhylo in the Sativa_pos_prep notebook. We'll just clean up the sequence headers and create a local copy.

In [22]:
!cat rbcL@mafftLinsi_aln_clipped.phy | sed 's/_f[0-9] / /' > alignment.phy
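
For readers who prefer to stay in Python, the `sed` substitution above can be sketched as a small function. The name `strip_frame_suffix` and the example line are hypothetical; only the pattern (`_f`, one digit, then a space, replaced once per line by a single space) comes from the command above.

```python
import re


def strip_frame_suffix(line):
    """Replace the first '_f<digit> ' on a line with one space, as the sed call does."""
    return re.sub(r'_f[0-9] ', ' ', line, count=1)


# Made-up PHYLIP-style line for illustration; the sequence is not real data.
example = "JX571820.1_f1 ATGTCACCAC"
print(strip_frame_suffix(example))  # JX571820.1 ATGTCACCAC
```

Lines without the suffix, such as the PHYLIP header line, pass through unchanged.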

SATIVA requires a 'taxonomy file', i.e. a text file that links each sequence record to an NCBI taxonomic ID (aka taxid).

We are going to generate this file from the reference sequence records we downloaded from GenBank in .gb format, which contains metadata including the taxid.

Let's parse the GenBank file.

In [23]:
from Bio import SeqIO

records = SeqIO.to_dict(SeqIO.parse(open('rbcL_nr_pre_Sativa_pos.gb','r'),'genbank'))
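
In a GenBank record the taxid is stored in the `db_xref` qualifier of the `source` feature, as a string of the form `taxon:<id>`. A minimal sketch of pulling it out of such a list of strings is shown here; the helper name `taxid_from_db_xref` is hypothetical, and the example value is the Dracaena draco taxid that appears later in this notebook.

```python
def taxid_from_db_xref(db_xrefs):
    """Return the NCBI taxid from a list of db_xref strings, or None if there is none."""
    for xref in db_xrefs:
        # Each entry looks like '<database>:<identifier>'.
        database, _, identifier = xref.partition(':')
        if database == 'taxon':
            return identifier
    return None


print(taxid_from_db_xref(['taxon:100532']))  # 100532
```

With a Biopython record, the list would typically come from `source.qualifiers.get('db_xref', [])`, where `source` is the record's `source` feature.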

Here is the first GenBank format record:

In [24]:
for r in records.keys():
    print records[r].format('genbank')
    break

LOCUS       JX571820                 552 bp    DNA     linear   PLN 17-SEP-2013
DEFINITION  Dracaena draco voucher Hosam00045 ribulose-1,5-bisphosphate
            carboxylase/oxygenase large subunit (rbcL) gene, partial cds;
            chloroplast.
ACCESSION   JX571820
VERSION     JX571820.1
KEYWORDS    .
SOURCE      chloroplast Dracaena draco
  ORGANISM  Dracaena draco
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; Liliopsida; Asparagales; Asparagaceae;
            Nolinoideae; Dracaena.
REFERENCE   1  (bases 1 to 552)
  AUTHORS   Elansary,H.O.
  TITLE     Towards a DNA barcode library for Egyptian flora, with a preliminary
            focus on ornamental trees and shrubs of two major gardens
  JOURNAL   DNA Barcodes (Berlin) 1, 46-55 (2013)
REFERENCE   2  (bases 1 to 552)
  AUTHORS   Elansary,H.O.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-SEP-2012) Floriculture, Ornamental Horticulture and
     

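Iterate over all records and check whether the organism name denotes a species or a subspecies. For a species (a two-word name, or a 'Genus sp. ...' name), store its taxid and remember which records belong to it. For a subspecies, just record the record ID for further processing in the next step.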

In [25]:
from collections import defaultdict

taxon_to_taxid = {}
recs_to_adjust = []
taxon_to_recs = defaultdict(list)

for key in records.keys():
    r = records[key]
   
    source = [f for f in r.features if f.type == 'source'][0]
    if (len(source.qualifiers['organism'][0].split(" ")) == 2):
        print source.qualifiers['organism'][0],
        if 'db_xref' in source.qualifiers:
#            print source.qualifiers['db_xref']
            for t in source.qualifiers['db_xref']:
#                print t
                if 'taxon' in t:
                    if not source.qualifiers['organism'][0] in taxon_to_taxid:
                        print " .. add to records"
                        taxon_to_taxid[source.qualifiers['organism'][0]] = t.split(":")[1]
                    else:
                        print " .. already covered"
                    taxon_to_recs[source.qualifiers['organism'][0]].append(r.id)
    elif (len(source.qualifiers['organism'][0].split(" ")) > 2 and source.qualifiers['organism'][0].split(" ")[1] == 'sp.'):
        print source.qualifiers['organism'][0],
        if 'db_xref' in source.qualifiers:
            for t in source.qualifiers['db_xref']:
                if 'taxon' in t:
                    if not source.qualifiers['organism'][0] in taxon_to_taxid:
                        taxon_to_taxid[source.qualifiers['organism'][0]] = t.split(":")[1]
                    else:
                        print " .. already covered" 
                    taxon_to_recs[source.qualifiers['organism'][0]].append(r.id)
    else:
        print "subspecies: %s" %source.qualifiers['organism'][0]
        recs_to_adjust.append(r.id)

Dracaena draco  .. add to records
Kalanchoe pinnata  .. add to records
Kalanchoe pinnata  .. already covered
Kalanchoe pinnata  .. already covered
Dracaena draco  .. already covered
Dracaena transvaalensis  .. add to records
Crassula perforata  .. add to records
Dracaena fragrans  .. add to records
Dracaena draco  .. already covered
Dracaena mannii  .. add to records
Dracaena aletriformis  .. add to records
Crassula nudicaulis  .. add to records
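
The name-based rule used in the cell above can be isolated into two small helpers, which makes it easier to test on its own. This is only a sketch: the function names and the category labels are hypothetical, and the two longer example names are made up for illustration.

```python
def classify_organism(name):
    """Classify an organism name by its word count, mirroring the cell above."""
    words = name.split(" ")
    if len(words) == 2:
        return 'species'       # plain binomial, e.g. 'Dracaena draco'
    if len(words) > 2 and words[1] == 'sp.':
        return 'unnamed'       # 'Genus sp. <label>' is kept under its full name
    return 'subspecies'        # everything else is reduced to a binomial later


def species_binomial(name):
    """Reduce a longer name to its first two words (genus and species epithet)."""
    return " ".join(name.split(" ")[:2])


print(classify_organism("Dracaena draco"))                 # species
print(classify_organism("Crassula sp. X1"))                # unnamed (made-up name)
print(species_binomial("Dracaena draco subsp. example"))   # Dracaena draco
```

Note that, as in the original cell, anything that is neither a two-word name nor a `sp.` name ends up in the 'subspecies' category, including single-word names.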


For the records that were identified as _subspecies_, reduce the name to the species and check whether the taxid for that species has already been encountered. If not, we need to fetch it from NCBI.

In [26]:
from collections import defaultdict

to_fetch = defaultdict(list)

for key in records.keys():
    r = records[key]
    if r.id in recs_to_adjust:
        source = [f for f in r.features if f.type == 'source'][0]
        adjust_from = source.qualifiers['organism'][0]
        adjust_to = " ".join(adjust_from.split(" ")[:2])
        print "%s -> %s" %(adjust_from,adjust_to)
        if adjust_to in taxon_to_taxid:
            taxon_to_recs[adjust_to].append(r.id)
        else:
            to_fetch[adjust_to].append(r.id)

Check if we are good or if any are missing.

In [27]:
if to_fetch:
    print "need to fetch some taxids"
else:
    print "Have taxids for all records"

Have taxids for all records


Write taxids to file and fetch full taxonomy for all of them using taxit from the taxtastic package.

In [28]:
taxids = []

out=open("taxids.txt",'w')
for sp in taxon_to_taxid:
    taxids.append(taxon_to_taxid[sp])
    out.write(taxon_to_taxid[sp]+"\n")
out.close()

Create a comma-separated text file (taxa.csv) with the full taxonomic lineage for each taxid.

In [29]:
!taxit taxtable -d /usr/bin/taxonomy.db -t taxids.txt -o taxa.csv

In order to make our lives easier downstream we will limit ourselves to only a defined set of the most common taxonomic levels, that should be known for pretty much all taxa: superkingdom, phylum, class, order, family, genus, species.

Extract 'taxonomy string' for a specific set of taxonomic levels.

In [30]:
from collections import defaultdict

tax_levels=['superkingdom','phylum','class','order','family','genus','species']
indices = []
taxdict = defaultdict(list)
taxids_to_taxonomy = {}

infile=open("taxa.csv",'r')
header=infile.next()

header_as_list=header.strip().replace('"','').split(",")
for i in range(len(header_as_list)):
#    print header_as_list[i]
    if header_as_list[i] in tax_levels:
#        print "\t"+header_as_list[i],i
        indices.append(i)

for line in infile:
    line_as_list=line.strip().replace('"',"").split(",")
    taxdict[line_as_list[0]] = line_as_list[1:]

infile.close()

for t in taxids:
    print t,
#    print taxdict[t]
    taxonomy=""
    for i in range(len(tax_levels)):
        if taxdict[t][indices[i]-1] == "":
            taxonomy+='unknown'+';'
        else:
            taxonomy+=taxdict[taxdict[t][indices[i]-1]][2]+";"
    print taxonomy[:-1]
    taxids_to_taxonomy[t] = taxonomy[:-1]


131161 Eukaryota;Streptophyta;unknown;Saxifragales;Crassulaceae;Crassula;Crassula perforata
231032 Eukaryota;Streptophyta;Liliopsida;Asparagales;Asparagaceae;Dracaena;Dracaena fragrans
80913 Eukaryota;Streptophyta;unknown;Saxifragales;Crassulaceae;Kalanchoe;Kalanchoe pinnata
992684 Eukaryota;Streptophyta;Liliopsida;Asparagales;Asparagaceae;Dracaena;Dracaena aletriformis
1237548 Eukaryota;Streptophyta;Liliopsida;Asparagales;Asparagaceae;Dracaena;Dracaena transvaalensis
1237547 Eukaryota;Streptophyta;Liliopsida;Asparagales;Asparagaceae;Dracaena;Dracaena mannii
1641072 Eukaryota;Streptophyta;unknown;Saxifragales;Crassulaceae;Crassula;Crassula nudicaulis
100532 Eukaryota;Streptophyta;Liliopsida;Asparagales;Asparagaceae;Dracaena;Dracaena draco
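
The same lineage strings can also be built by looking columns up by name rather than by position, which avoids relying on the header listing the ranks in the same order as `tax_levels`. The sketch below is written for Python 3 (unlike the Python 2 cells in this notebook) and assumes the table has `tax_id` and `tax_name` columns plus one column per rank. The function name and the toy table, including its placeholder IDs `g1` and `s1`, are made up for illustration.

```python
import csv
import io

TAX_LEVELS = ['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']


def lineages_from_taxtable(handle, levels=TAX_LEVELS):
    """Map each tax_id in a taxtable-style CSV to a ';'-joined lineage string."""
    rows = list(csv.DictReader(handle))
    # Lookup from tax_id to scientific name, used to translate rank columns.
    names = {row['tax_id']: row['tax_name'] for row in rows}
    lineages = {}
    for row in rows:
        parts = []
        for level in levels:
            ancestor_id = row.get(level) or ''
            # Missing rank, or an ancestor that is not in the table, becomes 'unknown'.
            parts.append(names.get(ancestor_id, 'unknown'))
        lineages[row['tax_id']] = ';'.join(parts)
    return lineages


# Toy table with placeholder IDs and a reduced set of rank columns.
TOY_TABLE = (
    '"tax_id","parent_id","rank","tax_name","class","genus","species"\n'
    '"g1","","genus","Dracaena","","g1",""\n'
    '"s1","g1","species","Dracaena draco","","g1","s1"\n'
)

toy = lineages_from_taxtable(io.StringIO(TOY_TABLE), levels=['class', 'genus', 'species'])
print(toy['s1'])  # unknown;Dracaena;Dracaena draco
```

Using the `csv` module also copes with quoted fields that contain commas, which a plain `split(",")` would break apart.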


Write out the *.tax file for SATIVA.

In [31]:
out=open("tax_for_SATIVA.tax", 'w')

for sp in taxon_to_recs:
    for rec in taxon_to_recs[sp]:
        out.write("%s\t%s\n" %(rec,taxids_to_taxonomy[taxon_to_taxid[sp]]))
        
out.close()

Check the result.

In [32]:
!head tax_for_SATIVA.tax

AF274594.1	Eukaryota;Streptophyta;unknown;Saxifragales;Crassulaceae;Crassula;Crassula perforata
JQ734500.1	Eukaryota;Streptophyta;Liliopsida;Asparagales;Asparagaceae;Dracaena;Dracaena fragrans
JQ591185.1	Eukaryota;Streptophyta;unknown;Saxifragales;Crassulaceae;Kalanchoe;Kalanchoe pinnata
GU135277.1	Eukaryota;Streptophyta;unknown;Saxifragales;Crassulaceae;Kalanchoe;Kalanchoe pinnata
KP208892.1	Eukaryota;Streptophyta;unknown;Saxifragales;Crassulaceae;Kalanchoe;Kalanchoe pinnata
JF265398.1	Eukaryota;Streptophyta;Liliopsida;Asparagales;Asparagaceae;Dracaena;Dracaena aletriformis
JX572540.1	Eukaryota;Streptophyta;Liliopsida;Asparagales;Asparagaceae;Dracaena;Dracaena transvaalensis
JX572539.1	Eukaryota;Streptophyta;Liliopsida;Asparagales;Asparagaceae;Dracaena;Dracaena mannii
KP110251.1	Eukaryota;Streptophyta;unknown;Saxifragales;Crassulaceae;Crassula;Crassula nudicaulis
JX571820.1	Eukaryota;Streptophyta;Liliopsida;Asparagales;Asparagaceae;Dracaena;Dracaena draco
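
Each line of the file holds a sequence ID, a tab, and a semicolon-separated lineage. A quick sanity check of that layout could look like the sketch below. The function name is hypothetical, and the requirement of exactly seven ranks reflects the seven levels chosen in this notebook rather than a documented SATIVA rule.

```python
def parse_tax_line(line, n_ranks=7):
    """Split one taxonomy-file line into (sequence ID, list of ranks)."""
    seq_id, lineage = line.rstrip("\n").split("\t")
    ranks = lineage.split(";")
    if len(ranks) != n_ranks:
        raise ValueError("%s: expected %d ranks, found %d" % (seq_id, n_ranks, len(ranks)))
    return seq_id, ranks


# Last line of the `head` output above.
line = ("JX571820.1\tEukaryota;Streptophyta;Liliopsida;Asparagales;"
        "Asparagaceae;Dracaena;Dracaena draco\n")
seq_id, ranks = parse_tax_line(line)
print(seq_id)     # JX571820.1
print(ranks[-1])  # Dracaena draco
```

Running this over every line of `tax_for_SATIVA.tax` before starting SATIVA would catch truncated or malformed lineages early.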


Let's now run SATIVA. If it's not already installed, we first need to clone and build it.

In [33]:
!git clone --recursive https://github.com/amkozlov/sativa.git

fatal: destination path 'sativa' already exists and is not an empty directory.


In [34]:
cd sativa/

/home/working/Sativa/sativa


Record the SHA-1 hash of the current commit for reproducibility.

In [35]:
!git log -1 | head -n 1

commit 8a99328f3f5382f7f541526878d049415af70999


In [36]:
!./install.sh

Your compiler: gcc 4.8
Building AVX: yes
Building AVX2: yes
make: Entering directory `/home/working/Sativa/sativa/raxml'
rm -rf builddir.*
rm -f unpack.*.stamp
make: Leaving directory `/home/working/Sativa/sativa/raxml'

Done!


Done! Let's run it.

In [37]:
cd ..

/home/working/Sativa


In [38]:
!./sativa/sativa.py -s alignment.phy -t tax_for_SATIVA.tax -x zoo -n rbcL -o ./ -T 5 -v


SATIVA v0.9-55-g0cbb090, released on 2016-06-28. Last version: https://github.com/amkozlov/sativa 
By A.Kozlov and J.Zhang, the Exelixis Lab. Based on RAxML 8.2.3 by A.Stamatakis.

SATIVA was called as follows:

./sativa/sativa.py -s alignment.phy -t tax_for_SATIVA.tax -x zoo -n rbcL -o ./ -T 5 -v

Mislabels search is running with the following parameters:
 Alignment:                        alignment.phy
 Taxonomy:                         tax_for_SATIVA.tax
 Output directory:                 /home/working/Sativa
 Job name / output files prefix:   rbcL
 Model of rate heterogeneity:      AUTO
 Confidence cut-off:               0.000000
 Number of threads:                5

*** STEP 1: Building the reference tree using provided alignment and taxonomic annotations ***

=> Loading taxonomy from file: tax_for_SATIVA.tax ...

==> Loading reference alignment from file: alignment.phy ...

Guessing input format: not fasta
Guessing input format: not phylip_relaxed
===> Validating taxonomy and al