# Cumbrian Lake eDNA metabarcoding data

In this exercise we will use [metaBEAT](https://github.com/HullUni-bioinformatics/metaBEAT), a tool we have developed in-house for reproducible and efficient analysis of metabarcoding data. It is still under active development and will likely be extended further in the future. The pipeline is available in a Docker [container](https://registry.hub.docker.com/u/chrishah/metabeat2/) with all necessary dependencies. The Docker image builds on [ReproPhylo](https://registry.hub.docker.com/u/szitenberg/reprophylo/).

The data we will be analyzing are CytB sequences amplified from eDNA samples collected from 3 Cumbrian lakes in the Lake District: Lake Windermere, Derwent Water and Bassenthwaite Lake. The experiment was designed to evaluate the potential of the eDNA approach for assessing fish community composition. A manuscript (Haenfling et al.) is in preparation.
For the purpose of the exercise we will attempt BLAST-based taxonomic assignment. metaBEAT is designed as a wrapper around a complete analysis, from raw data to OTU tables in biom format, performing (optionally) de-multiplexing, quality filtering and clustering along the way. It currently supports BLAST and phylogenetic placement (pplacer). The plan is to include further approaches in the future and to allow for efficient and standardized comparative assessments of all approaches.

We will be analyzing a total of 79 individual samples sequenced on the Illumina MiSeq platform.

metaBEAT offers a large number of options. Most of them will sound familiar and should make sense to you given your experience from the course so far.


In [None]:
!metaBEAT.py -h

We will limit ourselves to a basic analysis for now.


The minimal input for an analysis is a set of reference sequences in one or several files (these may mix formats, e.g. `fasta` and `genbank`) and a set of query sequences, again in one or many files. The query files will be run through the pipeline sequentially.
You will need to provide information on the nature and location of the reference and query sequences in separate tab-delimited text files via the `-R` and `-Q` flags, respectively.

The format for the Reference file is:

`/path/to/file <tab> format`

You may generate the required text files in any text editor. 
A simple example of how to produce a file pointing to your reference sequences in the `data` directory using your command line skills could be:

In [4]:
!printf "data/CytB_cleaned_05_2015.gb\tgb\n" > REFmap.txt
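
The same file can also be written from Python, which avoids depending on how the shell handles the `\t` escape. This is only a minimal sketch: the helper name `write_refmap` is ours for illustration and is not part of metaBEAT.

```python
def write_refmap(entries, path='REFmap.txt'):
    """Write one 'file<tab>format' line per reference file.

    `entries` is a list of (file path, format) tuples.
    Hypothetical helper, not a metaBEAT function.
    """
    with open(path, 'w') as out:
        for ref_file, ref_format in entries:
            out.write('%s\t%s\n' % (ref_file, ref_format))

# the single GenBank reference file used in this exercise
write_refmap([('data/CytB_cleaned_05_2015.gb', 'gb')])
```

If you have several reference files, add one tuple per file to the list.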

You will have to produce a similar file for the query sequences, again tab-delimited:

`sample_ID <tab> format <tab> file1 <tab> file2 <tab> optionally barcodes`

Let's use a short Python script to write the required information into a text file, which I will call `Querymap.txt`. The sample ID is taken from the first three characters of each file name.

In [42]:
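import os

datadir = 'data/sequences'
lines = []
# every forward-read file (_R1_) is paired with the reverse-read file (_R2_) of the same name
for f in sorted(os.listdir(datadir)):
    if '_R1_' in f:
        read1 = '%s/%s' % (datadir, f)
        read2 = '%s/%s' % (datadir, f.replace('_R1_', '_R2_'))
        # columns: sample ID (first three characters of the file name), format, file1, file2
        lines.append('%s\tfastq\t%s\t%s\n' % (f[:3], read1, read2))

# write one tab-delimited line per sample
with open('Querymap.txt', 'w') as out:
    out.write(''.join(lines))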

Have a look:

In [48]:
!cat Querymap.txt

A01	fastq	data/sequences/A01D_S1_L001_R1_001.fastq.gz	data/sequences/A01D_S1_L001_R2_001.fastq.gz
A02	fastq	data/sequences/A02D_S2_L001_R1_001.fastq.gz	data/sequences/A02D_S2_L001_R2_001.fastq.gz
A03	fastq	data/sequences/A03D_S3_L001_R1_001.fastq.gz	data/sequences/A03D_S3_L001_R2_001.fastq.gz
A04	fastq	data/sequences/A04D_S4_L001_R1_001.fastq.gz	data/sequences/A04D_S4_L001_R2_001.fastq.gz
A05	fastq	data/sequences/A05D_S5_L001_R1_001.fastq.gz	data/sequences/A05D_S5_L001_R2_001.fastq.gz
A06	fastq	data/sequences/A06D_S6_L001_R1_001.fastq.gz	data/sequences/A06D_S6_L001_R2_001.fastq.gz
A07	fastq	data/sequences/A07D_S7_L001_R1_001.fastq.gz	data/sequences/A07D_S7_L001_R2_001.fastq.gz
A08	fastq	data/sequences/A08D_S8_L001_R1_001.fastq.gz	data/sequences/A08D_S8_L001_R2_001.fastq.gz
A09	fastq	data/sequences/A09D_S9_L001_R1_001.fastq.gz	data/sequences/A09D_S9_L001_R2_001.fastq.gz
A10	fastq	data/sequences/A10D_S10_L001_R1_001.fastq.gz	data/sequences/A10D_S10_L001_R2_001.fastq.gz
B01	fast
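
Beyond a visual check, it can be worth confirming that every line is well formed and that the listed read files actually exist before starting a long run. The sketch below assumes paired-end data as in this exercise (at least four tab-separated fields per line); `check_querymap` is a hypothetical helper of ours, not part of metaBEAT.

```python
import os

def check_querymap(path):
    """Return a list of human-readable problems found in a query map file."""
    problems = []
    seen = set()
    with open(path) as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 4:
                problems.append('line %d: expected at least 4 tab-separated fields' % number)
                continue
            sample_id, _format, read1, read2 = fields[:4]
            if sample_id in seen:
                problems.append('line %d: duplicate sample ID %s' % (number, sample_id))
            seen.add(sample_id)
            for reads in (read1, read2):
                if not os.path.isfile(reads):
                    problems.append('line %d: missing file %s' % (number, reads))
    return problems

# only run the check if the query map has been written already
if os.path.exists('Querymap.txt'):
    for problem in check_querymap('Querymap.txt'):
        print(problem)
```

No output means that no problems were found.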

metaBEAT also allows you to link metadata to your samples. In our case the metadata simply link each sample to a lake. A comma-separated file is present in your `data` directory. You could produce such a file in your favourite text editor, Excel or any other tool.

In [46]:
!head data/example_metadata_basic.csv

sample_ID,water body
A01,Windermere N basin
A02,Windermere N basin
A03,Windermere S basin
A04,Windermere S basin
A05,Windermere N basin
A06,Windermere S basin
A07,Windermere N basin
A08,Windermere N basin
A09,Windermere S basin
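
Since the metadata are linked to samples by their IDs, a quick cross-check of the two files can catch typos early. The following is a minimal sketch that assumes the sample ID is the first column of the query map and that the metadata column is named `sample_ID`, as in the file shown above; `compare_sample_ids` is a hypothetical helper, not part of metaBEAT.

```python
import csv
import os

def compare_sample_ids(querymap_path, metadata_path):
    """Return (IDs only in the query map, IDs only in the metadata), both sorted."""
    with open(querymap_path) as handle:
        query_ids = {line.rstrip('\n').split('\t')[0]
                     for line in handle if line.strip()}
    with open(metadata_path) as handle:
        metadata_ids = {row['sample_ID'] for row in csv.DictReader(handle)}
    return sorted(query_ids - metadata_ids), sorted(metadata_ids - query_ids)

querymap = 'Querymap.txt'
metadata = 'data/example_metadata_basic.csv'
# only run the comparison if both files are present
if os.path.exists(querymap) and os.path.exists(metadata):
    only_query, only_meta = compare_sample_ids(querymap, metadata)
    print('in query map but not in metadata: %s' % only_query)
    print('in metadata but not in query map: %s' % only_meta)
```

Two empty lists indicate that both files refer to the same set of samples.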


In [None]:
!metaBEAT.py -Q Querymap.txt -R REFmap.txt --blast --taxids --merge --cluster --clust_cov 3 -E --metadata data/example_metadata_basic.csv -o eDNA_blast_lake_metadata

This will run for about 2 hours and should produce a number of directories, along with the resulting OTU table in `eDNA_blast_lake_metadata.biom` and `eDNA_blast_lake_metadata.tsv`.

Import and explore the `eDNA_blast_lake_metadata.biom` file using e.g. [phinch](http://phinch.org/index.html).

__WELL DONE!__