The first step will be to __trim/clean our raw Illumina data__.

In [None]:
!mkdir trimming

In [19]:
cd trimming

/home/working/CytB/trimming


Prepare a text file specifying the samples to be processed, including the format and location of the reads.

The command below expects the Illumina data as two fastq files (forward and reverse reads) per sample in a directory `raw_reads/` (here `../../raw_reads/`, relative to the `trimming` directory). File names must start with 'sampleID-marker', followed by '\_1' or '\_2' to identify the forward or reverse read file, respectively. sampleID must correspond to the first column of the file `Sample_accessions.tsv`; marker is either '12S' or 'CytB'.

Read file names, for example:
```
../../raw_reads/Bassenthwaite_01-CytB_1.fastq.gz
../../raw_reads/Bassenthwaite_01-CytB_2.fastq.gz
../../raw_reads/Bassenthwaite_02-CytB_1.fastq.gz
../../raw_reads/Bassenthwaite_02-CytB_2.fastq.gz
../../raw_reads/Bassenthwaite_03-CytB_1.fastq.gz
../../raw_reads/Bassenthwaite_03-CytB_2.fastq.gz
../../raw_reads/Bassenthwaite_04-CytB_1.fastq.gz
../../raw_reads/Bassenthwaite_04-CytB_2.fastq.gz
../../raw_reads/Bassenthwaite_05-CytB_1.fastq.gz
```
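
The naming convention can be checked mechanically. The following is a minimal sketch, not part of the original workflow: `parse_read_name` is a hypothetical helper that splits a read file name into sampleID, marker and read number, assuming the `.fastq.gz` suffix shown in the examples. It splits on the last hyphen because sample IDs may themselves contain hyphens (e.g. `Bassenthwaite_shore-01`).

```bash
# parse_read_name FILE
# Prints sampleID, marker and read number (tab-separated) for a file named
# 'sampleID-marker_N.fastq.gz'. Hypothetical helper, for illustration only.
parse_read_name() {
    base=${1##*/}              # strip the directory part
    base=${base%.fastq.gz}     # strip the assumed file suffix
    read_no=${base##*_}        # text after the last underscore: 1 or 2
    rest=${base%_*}            # what remains is 'sampleID-marker'
    marker=${rest##*-}         # text after the last hyphen
    sample=${rest%-*}          # everything before the last hyphen
    printf '%s\t%s\t%s\n' "$sample" "$marker" "$read_no"
}

parse_read_name raw_reads/Bassenthwaite_shore-01-CytB_2.fastq.gz
```

A file whose marker field is neither '12S' nor 'CytB', or whose read number is neither 1 nor 2, does not follow the convention and would not be picked up correctly.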


In [None]:
%%bash

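# Sample IDs of all CytB samples: first column of the accession table, header line excluded
for a in $(grep "CytB" ../../supplementary_data/Sample_accessions.tsv | cut -f 1 | grep -v "SampleID")
do
    # Forward (_1) and reverse (_2) read files for this sample
    R1=$(ls -1 ../../raw_reads/$a-CytB_* | grep "_1.fastq")
    R2=$(ls -1 ../../raw_reads/$a-CytB_* | grep "_2.fastq")

    # One tab-separated line per sample: ID, format, forward reads, reverse reads
    echo -e "$a\tfastq\t$R1\t$R2"
done > Querymap.txt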

The resulting file should look like this:

In [20]:
!head Querymap.txt

Bassenthwaite_01	fastq	../../raw_reads/Bassenthwaite_01-CytB_1.fastq.gz	../../raw_reads/Bassenthwaite_01-CytB_2.fastq.gz
Bassenthwaite_02	fastq	../../raw_reads/Bassenthwaite_02-CytB_1.fastq.gz	../../raw_reads/Bassenthwaite_02-CytB_2.fastq.gz
Bassenthwaite_03	fastq	../../raw_reads/Bassenthwaite_03-CytB_1.fastq.gz	../../raw_reads/Bassenthwaite_03-CytB_2.fastq.gz
Bassenthwaite_04	fastq	../../raw_reads/Bassenthwaite_04-CytB_1.fastq.gz	../../raw_reads/Bassenthwaite_04-CytB_2.fastq.gz
Bassenthwaite_05	fastq	../../raw_reads/Bassenthwaite_05-CytB_1.fastq.gz	../../raw_reads/Bassenthwaite_05-CytB_2.fastq.gz
Bassenthwaite_shore-01	fastq	../../raw_reads/Bassenthwaite_shore-01-CytB_1.fastq.gz	../../raw_reads/Bassenthwaite_shore-01-CytB_2.fastq.gz
Derwent_01	fastq	../../raw_reads/Derwent_01-CytB_1.fastq.gz	../../raw_reads/Derwent_01-CytB_2.fastq.gz
Derwent_02	fastq	../../raw_reads/Derwent_02-CytB_1.fastq.gz	../../raw_reads/Derwent_02-CytB_2.fastq.gz
Derwent_03	fastq	../../raw_reads/Derwent_0
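
Before starting the trimming it can be worth confirming that every line of `Querymap.txt` is complete. The sketch below is an optional check, not part of metaBEAT: `check_querymap` is a hypothetical helper that assumes the four-column, tab-separated layout shown above and reports samples with a missing column or a missing or empty read file.

```bash
# check_querymap FILE
# Reports Querymap lines that lack a read file entry or point to a file that
# is missing or empty. Returns 1 if any problem was found, 0 otherwise.
# Hypothetical helper, for illustration only.
check_querymap() {
    querymap=$1
    status=0
    tab=$(printf '\t')
    while IFS=$tab read -r sample format r1 r2
    do
        if [ -z "$r1" ] || [ -z "$r2" ]
        then
            # Fewer than two read files listed for this sample
            echo "$sample: missing read file entry"
            status=1
        elif [ ! -s "$r1" ] || [ ! -s "$r2" ]
        then
            # A listed file does not exist or has zero size
            echo "$sample: read file not found or empty"
            status=1
        fi
    done < "$querymap"
    return "$status"
}
```

Run from the `trimming` directory as `check_querymap Querymap.txt`; no output means every sample has both read files in place.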

The amplicon is expected to be > 400 bp long. With a read length of 300 bp we don't expect to see any primer sequences, so it's not necessary to provide the primer sequences to the trimming algorithm.
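
To make the arithmetic behind this explicit, here is a small sketch. It assumes an amplicon of exactly 400 bp (the text only gives a lower bound) and paired reads of 300 bp each; both function names are hypothetical. A read that stops short of the far end of the amplicon cannot run into a primer located there, and the two reads of a pair still overlap enough to be merged.

```bash
# expected_overlap READ_LENGTH AMPLICON_LENGTH
# Number of bases covered by both the forward and the reverse read.
expected_overlap() {
    echo $(( 2 * $1 - $2 ))
}

# bases_short_of_far_end READ_LENGTH AMPLICON_LENGTH
# Distance between the end of a read and the far end of the amplicon.
bases_short_of_far_end() {
    echo $(( $2 - $1 ))
}

expected_overlap 300 400         # prints 200
bases_short_of_far_end 300 400   # prints 100
```

For longer amplicons the overlap shrinks accordingly and reaches zero at 600 bp.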

__Raw data trimming, removal of adapter sequences and merging of reads__ using the `metaBEAT` pipeline.

In [None]:
%%bash

metaBEAT.py \
-Q Querymap.txt \
--trim_qual 30 \
--trim_minlength 100 \
--merge \
--product_length 400 \
-n 5 -v &> log


In [21]:
cd ../

/home/working/CytB


Some stats on the read counts before/after trimming, merging etc. are summarized for you in `read_stats.csv`.

The next stage of processing is __chimera detection__ and removal of putative chimeric sequences. We'll do that using `uchime` as implemented in `vsearch`.

In [None]:
!mkdir chimera_detection

In [22]:
cd chimera_detection

/home/working/CytB/chimera_detection


Convert the reference database from GenBank to fasta format for use in chimera detection.

First prepare a REFmap file, i.e. a text file that specifies the location and the format of the reference to be used.

The reference sequences in GenBank format should already be present in the `CytB` directory: `CytB_cleaned_02_2016.gb`.

In [None]:
%%bash

#Write REFmap
for file in $(ls -1 ../* | grep "gb$")
do
    echo -e "$file\tgb"
done > REFmap.txt

In [23]:
!cat REFmap.txt

../CytB_cleaned_02_2016.gb	gb


In [None]:
%%bash

metaBEAT.py \
-R REFmap.txt \
-f

This will produce `refs.fasta`.
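
A quick way to sanity-check the conversion is to compare record counts. This sketch relies only on the file formats: each GenBank flat-file record starts with a `LOCUS` line and each fasta record with a `>` header. The two function names are hypothetical helpers.

```bash
# count_genbank_records FILE
# Counts records in a GenBank flat file (one LOCUS line per record).
count_genbank_records() {
    grep -c '^LOCUS' "$1"
}

# count_fasta_records FILE
# Counts records in a fasta file (one '>' header line per record).
count_fasta_records() {
    grep -c '^>' "$1"
}
```

Run from the `chimera_detection` directory, e.g. `count_genbank_records ../CytB_cleaned_02_2016.gb` and `count_fasta_records refs.fasta`. The two numbers need not match if metaBEAT filters records during conversion, which is not verified here.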

Now run chimera detection.

In [None]:
%%bash


for a in $(cut -f 1 ../trimming/Querymap.txt)
do
    # Only process samples that still have reads after trimming
    if [ -s ../trimming/$a/${a}_trimmed.fasta ]
    then
        echo -e "\n### Detecting chimeras in $a ###\n"
        mkdir $a
        cd $a
        # Reference-based chimera detection against the converted reference database
        vsearch --uchime_ref ../../trimming/$a/${a}_trimmed.fasta --db ../refs.fasta \
        --nonchimeras $a-nonchimeras.fasta --chimeras $a-chimeras.fasta &> log
        cd ..

    else
        echo -e "$a is empty"
    fi
done
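
To see how many sequences were flagged per sample, the fasta headers in the two output files can be counted. The following is a sketch under the assumption that the directory layout is the one created by the loop above (one subdirectory per sample containing `<sample>-nonchimeras.fasta` and `<sample>-chimeras.fasta`); `chimera_summary` is a hypothetical helper.

```bash
# chimera_summary DIR
# For every sample subdirectory of DIR prints, tab-separated:
# sample ID, number of non-chimeric sequences, number of chimeric sequences.
# Hypothetical helper, for illustration only.
chimera_summary() {
    for dir in "$1"/*/
    do
        [ -d "$dir" ] || continue
        sample=$(basename "$dir")
        nonchim="${dir}${sample}-nonchimeras.fasta"
        chim="${dir}${sample}-chimeras.fasta"
        # Skip directories that do not hold vsearch output
        [ -f "$nonchim" ] || continue
        # grep -c exits non-zero when the count is 0, hence '|| true'
        n_ok=$(grep -c '^>' "$nonchim" || true)
        n_chim=0
        if [ -f "$chim" ]
        then
            n_chim=$(grep -c '^>' "$chim" || true)
        fi
        printf '%s\t%s\t%s\n' "$sample" "$n_ok" "$n_chim"
    done
}
```

Called as `chimera_summary .` from within `chimera_detection`, it prints one line per processed sample.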




In [None]:
cd ..

The last step is __taxonomic assignment of reads based on a BLAST - LCA approach__ using the metaBEAT pipeline.

In [None]:
!mkdir non-chimeras

In [None]:
cd non-chimeras/

Prepare the Querymap and REFmap text files.

In [None]:
%%bash

#Querymap
for a in $(ls -l ../chimera_detection/ | grep "^d" | perl -ne 'chomp; @a=split(" "); print "$a[-1]\n"')
do
    echo -e "$a-nc\tfasta\t../chimera_detection/$a/$a-nonchimeras.fasta"
done > Querymap.txt

#Write REFmap
for file in $(ls -1 ../* | grep "gb$")
do
    echo -e "$file\tgb"
done > REFmap.txt
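
As an alternative to parsing the output of `ls -l`, the sample directories can be listed with a shell glob. This is a sketch rather than the method used above: `write_nc_querymap` is a hypothetical helper, and unlike the loop above it skips samples whose non-chimera file is missing or empty.

```bash
# write_nc_querymap CHIMERA_DIR
# Prints one Querymap line per sample subdirectory of CHIMERA_DIR:
# '<sample>-nc', the format 'fasta' and the path to the non-chimeric reads.
# Hypothetical helper, for illustration only.
write_nc_querymap() {
    for dir in "$1"/*/
    do
        [ -d "$dir" ] || continue
        sample=$(basename "$dir")
        fasta="${dir}${sample}-nonchimeras.fasta"
        # Leave out samples without any non-chimeric sequences
        [ -s "$fasta" ] || continue
        printf '%s-nc\tfasta\t%s\n' "$sample" "$fasta"
    done
}
```

From within `non-chimeras`, `write_nc_querymap ../chimera_detection > Querymap.txt` would write paths in the same form as the loop above.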

Sequence clustering and taxonomic assignment via metaBEAT.

In [None]:
%%bash

metaBEAT.py \
-Q Querymap.txt \
-R REFmap.txt \
--cluster --clust_match 1 --clust_cov 3 \
--blast --min_ident 0.95 \
-m CytB -n 5 \
-E -v \
-o CytB-trim_30-merge-nonchimeras-cluster_1c3-blast-min_ident_0.95 &> log


The final result of the taxonomic assignment can be found in the table `CytB-trim_30-merge-nonchimeras-cluster_1c3-blast-min_ident_0.95.tsv` (see also [here](https://github.com/HullUni-bioinformatics/Haenfling_et_al_2016/blob/master/supplementary_data/assignment_results/CytB-trim_30-merge-nonchimeras-cluster_1c3-blast-min_ident_0.95.tsv)).

metaBEAT also produced the final result in [BIOM](http://biom-format.org/) format (`CytB-trim_30-merge-nonchimeras-cluster_1c3-blast-min_ident_0.95.biom`), which can be analyzed with a number of tools and visually explored e.g. using [Phinch](http://phinch.org/).