## Exercise 4 - Reference database curation ##

In this exercise we will be using [SATIVA](http://nar.oxfordjournals.org/content/44/11/5022.long), a recently published tool designed for 'Phylogeny-aware identification and correction of taxonomically mislabeled sequences'.

For this exercise we have prepared a (semi-) real life example of a custom reference database that requires curation before it can be used as a reference for taxonomic assignment: CytB sequences for all UK freshwater fish species, downloaded from GenBank. Below you will run SATIVA, and we will discuss the results, their interpretation and their implications. The steps to produce this toy dataset are outlined [here](https://github.com/HullUni-bioinformatics/metabarcode-course-2016/blob/master/data/exercise-4/supplementary_material/build_tree/build_reduced_tree_for_ref_db_discussion.ipynb).

For future reference:
A (real-) real life example of custom database curation, which produced the full reference database for the analyses later this afternoon, can be found [here](https://github.com/HullUni-bioinformatics/metabarcode-course-2016/tree/master/data/exercise-5/supplementary_data/reference_db).



`SATIVA` is not installed on your machine. Install it locally, like so:

In [1]:
!git clone --recursive https://github.com/amkozlov/sativa.git

Cloning into 'sativa'...
remote: Counting objects: 541, done.
remote: Total 541 (delta 0), reused 0 (delta 0), pack-reused 541
Receiving objects: 100% (541/541), 3.81 MiB | 2.55 MiB/s, done.
Resolving deltas: 100% (346/346), done.
Checking connectivity... done.


In [2]:
cd sativa/

/home/working/media/chrishah/STORAGE/Dropbox/Github/metabarcode-course-2016-temp/data/exercise-4/supplementary_material/run_SATIVA/sativa


Record the SHA-1 hash of the current commit for reproducibility.

In [3]:
!git log -1 | head -n 1

commit 8a99328f3f5382f7f541526878d049415af70999


In [None]:
!./install.sh

In [5]:
cd ..

/home/working/media/chrishah/STORAGE/Dropbox/Github/metabarcode-course-2016-temp/data/exercise-4/supplementary_material/run_SATIVA


Check that SATIVA runs.

In [6]:
!./sativa/sativa.py -h

usage: sativa.py -s ALIGNMENT -t TAXONOMY -x {BAC,BOT,ZOO,VIR} [options]

SATIVA v0.9-55-g0cbb090, released on 2016-06-28. Last version: https://github.com/amkozlov/sativa 
By A.Kozlov and J.Zhang, the Exelixis Lab. Based on RAxML 8.2.3 by A.Stamatakis.

optional arguments:
  -h, --help            show this help message and exit
  -s ALIGN_FNAME        Reference alignment file (PHYLIP or FASTA). Sequences
                        must be aligned, their IDs must correspond to those in
                        taxonomy file.
  -t TAXONOMY_FNAME     Reference taxonomy file.
  -x {bac,bot,zoo,vir}  Taxonomic code: BAC(teriological), BOT(anical),
                        ZOO(logical), VIR(ological)
  -n OUTPUT_NAME        Job name, will be used as a prefix for output file
                        names (default: taxonomy file name without extension)
  -o OUTPUT_DIR         Output directory (default: current).
  -T NUM_THREADS        Specify the number of CPUs (default: 12)
  -N

__What does SATIVA need?__

As a minimum SATIVA requires two things:
 - An alignment of the reference sequences 
 - A SATIVA taxonomy `*.tax` file containing full taxonomic information for every sequence in the alignment 

Both files have been prepared for you and are present in the directory `data/exercise-4/input_data/`.
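
The help text above says that the sequence IDs in the alignment must correspond to those in the taxonomy file. A minimal sketch of how that could be checked before a run is shown below. It assumes a relaxed, sequential PHYLIP file with one record per line (interleaved files are not handled) and a tab-separated taxonomy file. The function names and the toy records are hypothetical, and `XX000000.1` is a made-up ID used only to show a mismatch.

```python
def phylip_ids(text):
    """Return sequence IDs from relaxed, sequential PHYLIP text (one record per line)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The first non-empty line holds the sequence count and alignment length.
    return [ln.split()[0] for ln in lines[1:]]


def taxonomy_ids(text):
    """Return sequence IDs from tab-separated taxonomy text (ID<TAB>lineage)."""
    return [ln.split("\t")[0].strip() for ln in text.splitlines() if ln.strip()]


def compare_ids(aln_ids, tax_ids):
    """Return (IDs only in the alignment, IDs only in the taxonomy), both sorted."""
    aln, tax = set(aln_ids), set(tax_ids)
    return sorted(aln - tax), sorted(tax - aln)


# Toy input: one shared ID, one ID missing from each side.
toy_alignment = "2 8\nKF552102.1 ACGTACGT\nXX000000.1 ACGTACGA\n"
toy_taxonomy = (
    "KF552102.1\tEukaryota;Chordata;Actinopteri;Cypriniformes;Cyprinidae;Rutilus;Rutilus rutilus\n"
    "EU444633.1\tEukaryota;Chordata;Actinopteri;Gobiiformes;Gobiidae;Proterorhinus;Proterorhinus semilunaris\n"
)

only_in_alignment, only_in_taxonomy = compare_ids(
    phylip_ids(toy_alignment), taxonomy_ids(toy_taxonomy)
)
print("Only in alignment:", only_in_alignment)
print("Only in taxonomy:", only_in_taxonomy)
```

With real files, the two text arguments would come from reading the alignment and the `*.tax` file. Two empty lists would indicate that the IDs agree.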


The alignment has been produced using ReproPhylo (see [notebook](https://github.com/HullUni-bioinformatics/metabarcode-course-2016/blob/master/data/exercise-4/supplementary_material/build_tree/build_reduced_tree_for_ref_db_discussion.ipynb)). We'll just clean up the sequence headers and create a local copy.

In [7]:
!sed 's/_f[0-9] / /' ./input_data/CytB@mafftLinsi_aln_clipped.phy > alignment.phy
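
To make clear what the `sed` expression above does, here is a rough Python equivalent applied to single lines. On each line it removes the first `_f` plus one digit that is directly followed by a space, so that the ID matches the accession used in the taxonomy file. The function name and the example header (including its `_f0` suffix) are assumptions for illustration.

```python
import re


def strip_feature_suffix(line):
    """Mimic sed 's/_f[0-9] / /': drop the first '_f<digit>' that precedes a space."""
    return re.sub(r"_f[0-9] ", " ", line, count=1)


# Hypothetical alignment lines; the '_f0' suffix is an assumed example.
cleaned = strip_feature_suffix("KF552102.1_f0 ACGTACGT")
untouched = strip_feature_suffix("2 8")
print(cleaned)    # KF552102.1 ACGTACGT
print(untouched)  # 2 8
```

Note that the pattern only matches a single digit, so a suffix such as `_f10` would be left unchanged by both the `sed` command and this sketch.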

The tax file has been generated from a GenBank file using [this](https://github.com/HullUni-bioinformatics/metabarcode-course-2016/blob/master/data/exercise-4/supplementary_material/run_SATIVA/create_SATIVA_tax_file_from_gb.ipynb) notebook.

In [8]:
!head ./input_data/tax_for_SATIVA.tax

KF552102.1	Eukaryota;Chordata;Actinopteri;Cypriniformes;Cyprinidae;Rutilus;Rutilus rutilus
HM156757.1	Eukaryota;Chordata;Actinopteri;Cypriniformes;Cyprinidae;Rutilus;Rutilus rutilus
KF784812.1	Eukaryota;Chordata;Actinopteri;Cypriniformes;Cyprinidae;Rutilus;Rutilus rutilus
KF784831.1	Eukaryota;Chordata;Actinopteri;Cypriniformes;Cyprinidae;Rutilus;Rutilus rutilus
AJ555554.1	Eukaryota;Chordata;Actinopteri;Cypriniformes;Cyprinidae;Rutilus;Rutilus rutilus
EU444633.1	Eukaryota;Chordata;Actinopteri;Gobiiformes;Gobiidae;Proterorhinus;Proterorhinus semilunaris
EU444608.1	Eukaryota;Chordata;Actinopteri;Gobiiformes;Gobiidae;Proterorhinus;Proterorhinus semilunaris
KJ605209.1	Eukaryota;Chordata;Actinopteri;Gobiiformes;Gobiidae;Proterorhinus;Proterorhinus semilunaris
KJ605211.1	Eukaryota;Chordata;Actinopteri;Gobiiformes;Gobiidae;Proterorhinus;Proterorhinus semilunaris
EU444632.1	Eukaryota;Chordata;Actinopteri;Gobiiformes;Gobiidae;Proterorhinus;Proterorhinus semilunaris
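
Each line of the tax file shown above pairs a sequence ID with a semicolon-separated lineage. The sketch below shows one way to load such lines and summarise them, for example to see how many sequences each species has and whether all lineages have the same number of levels. It assumes the two columns are tab-separated. The function name is hypothetical and the toy records are copied from the output above.

```python
from collections import Counter


def parse_tax(text):
    """Map each sequence ID to its lineage, split into a list of levels."""
    lineages = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        seq_id, lineage = line.split("\t", 1)
        lineages[seq_id] = lineage.strip().split(";")
    return lineages


toy_tax = (
    "KF552102.1\tEukaryota;Chordata;Actinopteri;Cypriniformes;Cyprinidae;Rutilus;Rutilus rutilus\n"
    "HM156757.1\tEukaryota;Chordata;Actinopteri;Cypriniformes;Cyprinidae;Rutilus;Rutilus rutilus\n"
    "EU444633.1\tEukaryota;Chordata;Actinopteri;Gobiiformes;Gobiidae;Proterorhinus;Proterorhinus semilunaris\n"
)

lineages = parse_tax(toy_tax)
# The last level of each lineage is the species name.
species_counts = Counter(levels[-1] for levels in lineages.values())
# Distinct lineage depths; a single value means all lineages have the same length.
depths = {len(levels) for levels in lineages.values()}
print(species_counts)
print(depths)
```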


Let's now run `SATIVA`.

In [9]:
!./sativa/sativa.py -s alignment.phy -t tax_for_SATIVA.tax -x zoo -n CytB -o ./ -T 5 -v


SATIVA v0.9-55-g0cbb090, released on 2016-06-28. Last version: https://github.com/amkozlov/sativa 
By A.Kozlov and J.Zhang, the Exelixis Lab. Based on RAxML 8.2.3 by A.Stamatakis.

SATIVA was called as follows:

./sativa/sativa.py -s alignment.phy -t tax_for_SATIVA.tax -x zoo -n CytB -o ./ -T 5 -v

Mislabels search is running with the following parameters:
 Alignment:                        alignment.phy
 Taxonomy:                         tax_for_SATIVA.tax
 Output directory:                 /home/working/media/chrishah/STORAGE/Dropbox/Github/metabarcode-course-2016-temp/data/exercise-4/supplementary_material/run_SATIVA
 Job name / output files prefix:   CytB
 Model of rate heterogeneity:      AUTO
 Confidence cut-off:               0.000000
 Number of threads:                5

*** STEP 1: Building the reference tree using provided alignment and taxonomic annotations ***

=> Loading taxonomy from file: tax_for_SATIVA.tax ...

==> Loading reference alignment from file: alignment.phy .

`SATIVA` writes, among other things, a report to a text file (`CytB.mis`). Its long lines are wrapped in the output below. See also [here](https://github.com/HullUni-bioinformatics/metabarcode-course-2016/blob/master/data/exercise-4/results_backup/CytB.mis)

In [10]:
!cat CytB.mis

;Our suggestion should only be taken as indicative of an affiliation to the same group, whose correct name must be determined 
;in an additional step according to the specific rules of nomenclature that apply to the studied organisms.
;
;SeqID	MislabeledLevel	OriginalLabel	ProposedLabel	Confidence	OriginalTaxonomyPath	ProposedTaxonomyPath	PerRankConfidence
AJ969128.1	Order	Centrarchiformes	Cypriniformes	1.000	Eukaryota;Chordata;Actinopteri;Centrarchiformes;Centrarchidae;Lepomis;Lepomis gibbosus	Eukaryota;Chordata;Actinopteri;Cypriniformes;Cyprinidae;Tinca;Tinca tinca	1.000;1.000;1.000;1.000;1.000;1.000;1.000
KP644340.1	Order	Gadiformes	Cypriniformes	1.000	Eukaryota;Chordata;Actinopteri;Gadiformes;Lotidae;Lota;Lota lota	Eukaryota;Chordata;Actinopteri;Cypriniformes;Cyprinidae;Blicca;Blicca bjoerkna	1.000;1.000;1.000;1.000;1.000;1.000;1.000
JF489783.1	Genus	Pseudorasbora	Gobio	1.000	Eukaryota;Chordata;Actinopteri;Cypriniformes;Cyprinidae;Pseudorasbora;Pseudorasbora parva	Eukaryota;C
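
The report above is easier to discuss once it is loaded into a table-like structure. The following is a minimal sketch, assuming that comment lines start with `;`, that the column names sit on the line beginning with `;SeqID`, and that columns are tab-separated. The function name is hypothetical. The sample text reuses the header and the first record from the output above, with the introductory comment shortened.

```python
def parse_mis(text):
    """Parse SATIVA-style report text into a list of dicts keyed by column name."""
    header = None
    rows = []
    for line in text.splitlines():
        if line.startswith(";SeqID"):
            # The column names are stored on a comment line.
            header = line.lstrip(";").split("\t")
        elif line.startswith(";") or not line.strip():
            continue  # other comments and blank lines
        elif header is not None:
            rows.append(dict(zip(header, line.split("\t"))))
    return rows


sample = "\n".join([
    ";Our suggestion should only be taken as indicative of an affiliation to the same group",
    ";",
    ";" + "\t".join([
        "SeqID", "MislabeledLevel", "OriginalLabel", "ProposedLabel", "Confidence",
        "OriginalTaxonomyPath", "ProposedTaxonomyPath", "PerRankConfidence",
    ]),
    "\t".join([
        "AJ969128.1", "Order", "Centrarchiformes", "Cypriniformes", "1.000",
        "Eukaryota;Chordata;Actinopteri;Centrarchiformes;Centrarchidae;Lepomis;Lepomis gibbosus",
        "Eukaryota;Chordata;Actinopteri;Cypriniformes;Cyprinidae;Tinca;Tinca tinca",
        "1.000;1.000;1.000;1.000;1.000;1.000;1.000",
    ]),
])

rows = parse_mis(sample)
# Sequences flagged at the order level are the most conspicuous candidates for review.
order_level = [r["SeqID"] for r in rows if r["MislabeledLevel"] == "Order"]
print(order_level)
```

For the real report, `sample` would be replaced by the contents of `CytB.mis`. Filtering on `MislabeledLevel` or on `float(r["Confidence"])` is one way to prioritise which records to inspect first.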

Clean up by removing `SATIVA` once you're done.

In [11]:
!rm -rf sativa/