# Exercise 5 - Taxonomic assignment

In this exercise we will attempt to assign taxonomic identity to the set of denovo OTUs obtained from the 39 UK tree arthropod metabarcoding samples that we have produced in [exercise-3](https://github.com/HullUni-bioinformatics/egrep-2016/tree/master/data/exercise-3).

We will again be using our custom pipeline metaBEAT to make our lives easier and to facilitate reproducibility.

Taxonomic assignment will be performed using two different approaches:
 - BLAST based LCA
 - [Kraken](http://ccb.jhu.edu/software/kraken/MANUAL.html) (k-mer based sequence classification)
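
To give a rough idea of what "LCA" (lowest common ancestor) means, here is a minimal sketch. It assumes that each BLAST hit has already been translated into a root-to-tip lineage; the hit list is made up for illustration, and the function `lowest_common_ancestor` is hypothetical, not metaBEAT's actual implementation (which also decides which hits to consider in the first place).

```python
def lowest_common_ancestor(lineages):
    """Return the part of the lineage that all hits share, from the root down."""
    if not lineages:
        return []
    shared = []
    # zip(*lineages) walks through the ranks of all lineages in parallel
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # the hits disagree at this rank, so stop here
        shared.append(ranks[0])
    return shared


# Hypothetical lineages for three hits of a single query sequence
hits = [
    ["Arthropoda", "Insecta", "Coleoptera", "Carabidae"],
    ["Arthropoda", "Insecta", "Coleoptera", "Curculionidae"],
    ["Arthropoda", "Insecta", "Coleoptera", "Carabidae"],
]

lca = lowest_common_ancestor(hits)
print(" > ".join(lca))
```

In this toy case the hits disagree at the family level, so the query would only be assigned to the order they share.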

We will be using a custom reference database that we have prepared for you (details can be found [here](https://github.com/HullUni-bioinformatics/egrep-2016/blob/master/data/reference-dbs/supplementary_material/download_reference_data.ipynb)).

The final result of this exercise will be taxonomically annotated OTU tables in [BIOM](http://biom-format.org/) format from each approach, which we can then go on to compare (if we have time). The BIOM format, together with an associated set of Python functions, was developed as a standard for representing 'biological sample by observation contingency tables' in the -omics area, so essentially big tables.
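
To make the 'sample by observation' idea more concrete, the sketch below writes out a tiny table by hand, loosely following the JSON flavour of the format (BIOM 1.0). The field names are given as we recall them from the specification, so please check the official documentation before relying on them. The OTU IDs and counts are invented; the sample IDs and sample metadata are borrowed from the metadata example in this exercise.

```python
import json

# Two OTUs (rows, the 'observations') by two samples (columns).
table = {
    "id": None,
    "format": "Biological Observation Matrix 1.0.0",
    "format_url": "http://biom-format.org",
    "type": "OTU table",
    "generated_by": "hand-written example",
    "date": "2016-01-01T00:00:00",
    "rows": [
        {"id": "OTU_1", "metadata": None},  # invented OTU IDs
        {"id": "OTU_2", "metadata": None},
    ],
    "columns": [
        {"id": "THB_BET", "metadata": {"Forest": "Thetford", "Species": "Betula_pendula"}},
        {"id": "THA_OAK", "metadata": {"Forest": "Thetford", "Species": "Quercus_robur"}},
    ],
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [2, 2],
    # sparse entries: [row index, column index, count] (invented counts)
    "data": [[0, 0, 12], [1, 1, 3]],
}


def to_dense(biom_dict):
    """Expand the sparse [row, column, count] entries into a full matrix."""
    n_rows, n_cols = biom_dict["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, count in biom_dict["data"]:
        dense[row][col] = count
    return dense


dense = to_dense(table)
text = json.dumps(table)  # this string is what would be stored in a file
print(dense)
```

Only non-zero counts are stored in `data`, which keeps the file small when most OTUs are absent from most samples.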


Before we set the pipeline going we need to prepare a metadata file with relevant information about the samples (e.g. tree species, forest, replicate) that we want to add to our final BIOM tables.

This should be a simple comma-delimited text file that looks like this:
```bash
Sample,Forest,Species,Replicate
THB_BET,Thetford,Betula_pendula,TH_BET
THA_OAK,Thetford,Quercus_robur,TH_OAK
...
...
```

The complete metadata for the samples is in `../metadata/2014_tree_metadata.csv`. 

Use your command line skills to produce this file. Note that the header line and columns 1, 2 and 3 can be extracted straight from the complete metadata file. Column 4, 'Replicate', can be produced by modifying the original sample ID.

Below is a solution using python.

In [None]:
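# Read the complete metadata and write a reduced four-column file.
with open('../metadata/2014_tree_metadata.csv', 'r') as infile:
    # next() works in both Python 2 and 3 (infile.next() is Python 2 only)
    headers = next(infile).strip().split(",")

    # keep the sample, forest and species columns; add the new 'Replicate' header
    out = "%s,%s,%s,Replicate\n" % (headers[0], headers[3], headers[5])

    for line in infile:
        l = line.strip().split(",")
        # Replicate = first two characters of the sample id + '_' + tree code
        out += "%s,%s,%s,%s_%s\n" % (l[0], l[3], l[5], l[0][:2], l[0][4:])

with open('metadata.csv', 'w') as outfile:
    outfile.write(out)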
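
As an optional sanity check, the rule for the 'Replicate' column can be isolated and tested against the two example rows shown earlier. This is a sketch only: the helper names `replicate_label` and `check_metadata` are hypothetical, and it assumes that all sample IDs follow the same pattern as `THB_BET` (three-character site code, underscore, tree code).

```python
import csv
import io


def replicate_label(sample_id):
    """'THB_BET' -> 'TH_BET': drop the third character of the site code."""
    return "%s_%s" % (sample_id[:2], sample_id[4:])


def check_metadata(handle):
    """Return the sample IDs whose 'Replicate' value disagrees with the rule."""
    reader = csv.DictReader(handle)
    return [row["Sample"] for row in reader
            if row["Replicate"] != replicate_label(row["Sample"])]


# The two example rows from above, held in memory instead of a file
example = (
    "Sample,Forest,Species,Replicate\n"
    "THB_BET,Thetford,Betula_pendula,TH_BET\n"
    "THA_OAK,Thetford,Quercus_robur,TH_OAK\n"
)

mismatches = check_metadata(io.StringIO(example))
print(mismatches)
```

To check the real file, pass `open('metadata.csv')` instead of the in-memory example; an empty list means every row follows the rule.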

Now let's start the pipeline. As input you need to specify the denovo OTU table from exercise 3 in BIOM format. We'll also ask metaBEAT to attempt taxonomic assignment with the two different approaches mentioned above.

Both BLAST and Kraken require databases in specific formats that metaBEAT will build automatically if necessary, but for the purpose of this course we have prepared them for you (see [here](https://github.com/HullUni-bioinformatics/egrep-2016/blob/master/data/reference-dbs/supplementary_material/download_reference_data.ipynb) if you are interested in how exactly it was done). The respective databases can be found at:
```bash
../reference-dbs/BLAST/egrep-custom/blast_k31n5 #blast database
../reference-dbs/KRAKEN/kraken_k31n5 #kraken database
```

metaBEAT will automatically wrangle the data into the particular file formats that are required by each of the methods, run all necessary steps, and finally convert the outputs of each program to a standardized BIOM table.


__Set it off!__

In [None]:
%%bash

metaBEAT_global.py \
-B ../exercise-3/GLOBAL/COI-trim30min100-merge-c3-id97-OTU-denovo.biom \
--metadata metadata.csv \
--kraken --kraken_db ../reference-dbs/KRAKEN/kraken_k31n5 \
--blast --blast_db ../reference-dbs/BLAST/egrep-custom/blast_k31n5 --min_ident 0.8 \
-n 5 -o COI-trim30min100-merge-c3-id97





Detailed explanation of the above command:
```bash
metaBEAT_global.py \
-B ../exercise-3/GLOBAL/COI-trim30min100-merge-c3-id97-OTU-denovo.biom \ #denovo BIOM table
--metadata metadata.csv \ #metadata file
--kraken \ #use kraken for assignment
--kraken_db ../reference-dbs/KRAKEN/kraken_k31n5 \ #location of kraken database
--blast \ #use blast for assignment
--blast_db ../reference-dbs/BLAST/egrep-custom/blast_k31n5 \ # location of blast database
--min_ident 0.8 \ #only attempt assignments for queries with at least 80% similarity to any reference sequence
-n 5 \ #use 5 processors where possible
-o COI-trim30min100-merge-c3-id97 #arbitrary name of the analysis
```

Note that the entire process only took ~10 minutes.

As before, everything that metaBEAT did is presented to you in the output (see above). All intermediate files are kept in the directory `./GLOBAL`, with a separate subdirectory for each of the two approaches.

Scroll through the output cell to get a quick idea of all the things that the pipeline just did for you. There is no need to go into detail, just have a quick look.

metaBEAT has produced results tables in BIOM format for each of the approaches in the corresponding directory.

A nice tool for interactive exploration of your results is [phinch](http://phinch.org/). Load one of the tables (`*.biom`) and see if you can get an overview of your results.

As mentioned before, the BIOM format is a standardized file format and there is a range of tools out there that can be used to analyse these data.

