# Exercise 5 - Taxonomic assignment

In this exercise we will attempt to assign taxonomic identities to the set of denovo OTUs that we produced in [exercise-3](https://github.com/HullUni-bioinformatics/metabarcode-course-2016/tree/master/data/exercise-3) from the 57 Lake Windermere eDNA samples.

We will be using a custom phylogenetically curated reference database (details about how this was created can be found [here](https://github.com/HullUni-bioinformatics/metabarcode-course-2016/tree/master/data/exercise-5/supplementary_data/reference_db)).

Taxonomic assignment will be performed using three different approaches:
 - BLAST-based LCA (lowest common ancestor)
 - [pplacer](http://matsen.fhcrc.org/pplacer/) (Phylogenetic placement)
 - [Kraken](http://ccb.jhu.edu/software/kraken/MANUAL.html) (k-mer based sequence classification)
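
To illustrate the idea behind an LCA assignment, here is a minimal Python sketch. It is not metaBEAT's implementation: it assumes that each BLAST hit has already been translated into a lineage listed from the highest rank down to species, and it simply returns the deepest part of the lineage that all hits share. The function name `lowest_common_ancestor` and the simplified lineages are assumptions made for this example.

```python
def lowest_common_ancestor(lineages):
    """Return the part of the lineage shared by all hits, from the root down."""
    if not lineages:
        return []
    shared = []
    # zip() walks the ranks in parallel and stops at the shortest lineage
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # the hits disagree at this rank, so stop here
        shared.append(ranks[0])
    return shared


# Simplified lineages (class, order, family, genus, species) for the
# reference sequences that one hypothetical OTU matches equally well.
hits = [
    ["Actinopterygii", "Salmoniformes", "Salmonidae", "Salmo", "Salmo trutta"],
    ["Actinopterygii", "Salmoniformes", "Salmonidae", "Salmo", "Salmo salar"],
]

print(lowest_common_ancestor(hits)[-1])  # Salmo
```

Because the two hits disagree only at species level, the OTU is assigned to the genus rather than to either species.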

We will again be using our custom pipeline metaBEAT to make our lives easier and to facilitate reproducibility.

The final result of this exercise will be a taxonomically annotated OTU table in [BIOM](http://biom-format.org/) format from each approach, which we can then go on to compare (if we have time). The BIOM format and its associated set of Python functions were developed as a standardized way of representing 'biological sample by observation contingency tables' in the -omics field, in other words, big tables.
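
To make the idea of a 'sample by observation contingency table' concrete, here is a minimal Python sketch that uses plain dictionaries rather than the BIOM library. The sample names, OTU names, counts and taxonomy are invented for illustration and are not taken from the Windermere data.

```python
# Rows are observations (OTUs), columns are samples, cells are read counts.
samples = ["sample_A", "sample_B", "sample_C"]  # hypothetical sample IDs
counts = {
    "OTU_1": [120, 0, 35],
    "OTU_2": [0, 48, 7],
}

# Per-observation metadata: after taxonomic assignment, each OTU carries a taxonomy.
taxonomy = {
    "OTU_1": ["Percidae", "Perca", "Perca fluviatilis"],
    "OTU_2": ["Salmonidae", "Salmo", "Salmo trutta"],
}


def reads_per_sample(counts, samples):
    """Sum the read counts of all OTUs for every sample (column totals)."""
    totals = [sum(column) for column in zip(*counts.values())]
    return dict(zip(samples, totals))


print(reads_per_sample(counts, samples))
# {'sample_A': 120, 'sample_B': 48, 'sample_C': 42}
```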

You have already produced most of the input data needed for this analysis in exercise 3. The rest you can find in the folder `input_data`. The directory `supplementary_data` documents the detailed processing steps and contains all intermediate files used to obtain the input data.

First, specify the location and file format of your reference sequences. As before, you could mix and match several files with different formats (`fasta`, `Genbank`) if you wanted. You'll need to prepare a simple text file that contains the path to each file and its format specification, like so:

```bash
PATH/TO/YOUR/sequences1.fasta <tab> fasta
PATH/TO/YOUR/sequences2.genbank <tab> gb
...
...
```

We have prepared a set of reference sequences in Genbank format for you in the directory `./input_data`. The file is called `CytB_European-fish_SATIVA_cleaned.gb`.

This time you can use your command line skills to produce the file. We call it `REFmap.txt`.

In [10]:
!echo './input_data/CytB_European-fish_SATIVA_cleaned.gb\tgb' > REFmap.txt

In [11]:
!cat REFmap.txt

./input_data/CytB_European-fish_SATIVA_cleaned.gb	gb
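
If you prefer to stay in Python, the same tab-separated file can be written without relying on how the shell's `echo` treats `\t`. The following is a minimal sketch. The helper name `write_refmap` is an assumption made for this example, and the file is written to a temporary directory so that it does not overwrite the `REFmap.txt` created above.

```python
import os
import tempfile


def write_refmap(entries, path):
    """Write (sequence_file, format) pairs as one tab-separated line each."""
    with open(path, "w") as handle:
        for sequence_file, file_format in entries:
            handle.write("\t".join([sequence_file, file_format]) + "\n")


# Temporary location used for the sketch only
refmap_path = os.path.join(tempfile.mkdtemp(), "REFmap.txt")
write_refmap(
    [("./input_data/CytB_European-fish_SATIVA_cleaned.gb", "gb")],
    refmap_path,
)

with open(refmap_path) as handle:
    print(repr(handle.read()))
# './input_data/CytB_European-fish_SATIVA_cleaned.gb\tgb\n'
```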


That's almost it. Now let's start the pipeline. As input you need to specify the denovo OTU table from exercise 3 in BIOM format and the `REFmap.txt` file. We'll also ask metaBEAT to attempt taxonomic assignment with the three different approaches mentioned above.

Phylogenetic placement using `pplacer` requires a number of specific files that need to be formatted and packaged in a particular way into a so-called reference package (aka 'refpkg'). At a minimum this needs to contain a phylogenetic tree, the underlying alignment as `fasta`, a Hidden Markov Model (HMM) profile of the alignment, and a number of other files summarizing the taxonomic identity of the taxa in the reference. We are using the program `taxit` from the [taxtastic](http://fhcrc.github.io/taxtastic/commands.html#create) package to do the formatting. What exactly we did to produce the refpkg for this analysis is outlined in [this](https://github.com/HullUni-bioinformatics/metabarcode-course-2016/blob/master/data/exercise-5/supplementary_data/pplacer/build-refpkg/build_pplacer_refpkg.ipynb) notebook.

Kraken requires a specific database that metaBEAT will build automatically if necessary. For the purpose of this course we have prepared it for you (see [here](https://github.com/HullUni-bioinformatics/metabarcode-course-2016/blob/master/data/exercise-5/supplementary_data/kraken/build_custom_db.ipynb) for how it was done), and it should be present in `./input_data`.
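
To give a rough feel for what 'k-mer based' classification means, here is a toy Python sketch. It is a deliberate simplification and not Kraken's algorithm: Kraken maps every k-mer to the lowest common ancestor of the references that contain it and then scores paths through the taxonomy, whereas this sketch just counts shared k-mers per reference. The sequences, taxon labels and the value of `k` are invented for illustration.

```python
from collections import Counter


def kmers(sequence, k):
    """Return all overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def classify(query, references, k=5):
    """Return the reference label sharing the most k-mers with the query, or None."""
    query_kmers = set(kmers(query, k))
    votes = Counter()
    for label, sequence in references.items():
        votes[label] = len(query_kmers & set(kmers(sequence, k)))
    label, shared = votes.most_common(1)[0]
    return label if shared > 0 else None


# Hypothetical reference 'database' with two made-up sequences
references = {
    "taxon_A": "ACGTACGTTGCA",
    "taxon_B": "TTTTGGGGCCCC",
}

print(classify("CGTACGTT", references))  # taxon_A
print(classify("AAAAAAA", references))   # None (no shared k-mers)
```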

metaBEAT will automatically wrangle the data into the particular file formats that are required by each of the methods, run all necessary steps, and finally convert the outputs of each program to a standardized BIOM table.



__Set it off!__

In [13]:
%%bash

metaBEAT_global.py \
-B ../exercise-3/GLOBAL/CytB-trim30min100-merge-c3-id1-OTU-denovo.biom \
-R REFmap.txt \
--blast --min_ident 0.85 \
--kraken --kraken_db ./input_data/KRAKEN_DB/ \
--pplace --refpkg ./input_data/CytB.refpkg/ \
-n 5 -o CytB-trim30min100-merge-c3-id1


metaBEAT - metaBarcoding and Environmental DNA Analyses tool
version: v.0.96-global


Tue Sep 13 09:05:27 2016

/usr/bin/metaBEAT_global.py -B ../exercise-3/GLOBAL/CytB-trim30min100-merge-c3-id1-OTU-denovo.biom -R REFmap.txt --blast --min_ident 0.85 --kraken --kraken_db ./input_data/KRAKEN_DB/ --pplace --refpkg ./input_data/CytB.refpkg/ -n 5 -o CytB-trim30min100-merge-c3-id1


metaBEAT may be querying NCBI's Entrez databases to fetch/verify taxonomic ids. Entrez User requirements state that you need to identify yourself by providing an email address so that NCBI can contact you in case there is a problem.

As the mail address is not specified in the script itself (variable 'Entrez.email'), metaBEAT expects a simple text file called 'user_email.txt' that contains your email address (first line of file) in the same location as the metaBEAT.py script (in your case: /usr/bin/)

found 'c.hahn@hull.ac.uk' in /usr/bin/user_email.txt

taxonomy.db found at /usr/bin/taxonomy.db

parsing BIOM ta




Detailed explanation of the above command (annotated for reading only: the comments after each `\` break the line continuations, so this version cannot be pasted into a shell as-is):
```bash
metaBEAT_global.py \
-B ../exercise-3/GLOBAL/CytB-trim30min100-merge-c3-id1-OTU-denovo.biom \ #denovo BIOM table
-R REFmap.txt \ #file listing all sequence files to be used as reference
--blast \ #specify BLAST LCA based assignment strategy
--min_ident 0.85 \ #only attempt assignment for queries with at least 85% identity to any reference sequence (relevant only for BLAST and pplacer)
--kraken  \ #specify to use Kraken for assignment
--kraken_db ./input_data/KRAKEN_DB/ \ #path to the Kraken database (if omitted metaBEAT will generate it)
--pplace \ #specify to use pplacer for assignment
--refpkg ./input_data/CytB.refpkg/ \ #refpkg for pplacer
-n 5 \ #use 5 threads
-o CytB-trim30min100-merge-c3-id1 &> taxonomic_assignment.log #prefix for output; the redirect writes the log to a file
```

Note that the entire process only took ~10 minutes.

As before, everything that metaBEAT did is reported in the output (see above). All intermediate files are kept in the directory `./GLOBAL`, with a separate subdirectory for each of the three approaches.

Scroll through the output cell to get a quick idea of all the things that the pipeline just did for you. There is no need to go into detail, just have a quick look.

metaBEAT has produced results tables in BIOM format for each of the approaches in the corresponding directory.

A nice tool for interactive exploration of your results is [phinch](http://phinch.org/). Load one of the tables (`*.biom`) and see if you can get an overview of your results.

As mentioned before, the BIOM format is a standardized file format and there is a range of tools out there that can be used to analyse these data.

