I will attempt to assign taxonomic identity to the set of de novo OTUs obtained from the 532 freshwater pond eDNA samples and PCR controls previously processed using metaBEAT.

I will be using a custom phylogenetically curated reference database. Taxonomic assignment will be performed using two different approaches:
- BLAST based LCA
- Kraken (k-mer based sequence classification)
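
To make the LCA (lowest common ancestor) idea concrete, here is a minimal sketch of how a set of hits can be reduced to a single assignment. It is an illustration only and not metaBEAT's implementation. The function name `lowest_common_ancestor` and the hand-typed lineages are assumptions for this example, not BLAST output from this dataset.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest run of ranks shared by every lineage.

    Each lineage is a list ordered from the highest rank to the species.
    """
    if not lineages:
        return []
    shared = []
    # zip(*lineages) walks all lineages rank by rank, in parallel.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break  # first disagreement: everything below is ambiguous
    return shared


# Hand-typed, illustrative lineages (assumption: not taken from the results).
newt_a = ["Amphibia", "Caudata", "Salamandridae", "Lissotriton", "Lissotriton vulgaris"]
newt_b = ["Amphibia", "Caudata", "Salamandridae", "Lissotriton", "Lissotriton helveticus"]
crested = ["Amphibia", "Caudata", "Salamandridae", "Triturus", "Triturus cristatus"]

print(lowest_common_ancestor([newt_a, newt_b]))           # agreement down to genus
print(lowest_common_ancestor([newt_a, newt_b, crested]))  # agreement down to family
```

If the retained hits for an OTU all point to one species, the assignment is made at species level. If they disagree, the OTU is assigned to the deepest rank they share.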

I will again be using metaBEAT to facilitate reproducibility.

The final result of this notebook will be a taxonomically annotated OTU table in BIOM format from each approach, which I can then compare. The BIOM format and its associated set of Python functions have been developed as a standard for representing 'biological sample by observation contingency tables' in the -omics area.

Most of the input data was produced during processing of the eDNA samples. The rest can be found in the folder Reference_Alignment. 

I must specify the location and file format of the reference sequences. Different formats (fasta, GenBank) can be mixed and matched. To do this, prepare a simple text file that lists, for each reference file, its path and its format specification, separated by a tab.


The reference sequences in GenBank/fasta format are contained in the directory Reference_Alignment. The GenBank files are called `12S_UK...._SATIVA_cleaned.gb`, and additional fasta files contain Sanger sequences that supplement the records on GenBank.

Produce the text file listing the reference sequences using the command line. We call it REFmap.txt.

In [1]:
!echo '../Reference_Alignment/Amphibians/12S_UKamphibians_SATIVA_cleaned.gb\tgb\n' \
'../Reference_Alignment/Amphibians/Amphibian12S_ReferenceSequences.fasta\tfasta\n' \
'../Reference_Alignment/Reptiles/12S_UKreptiles_SATIVA_cleaned.gb\tgb\n' \
'../Reference_Alignment/Mammals/12S_UKmammals_SATIVA_cleaned.gb\tgb\n' \
'../Reference_Alignment/Birds/12S_UKbirds_SATIVA_cleaned.gb\tgb\n' \
'../Reference_Alignment/Fish/20161026_INBO_12S_fishrefs_hfj_edit.fasta\tfasta\n' \
'../Reference_Alignment/Fish/custom_extended_12S_edit_10_2016.gb\tgb\n' \
'../Reference_Alignment/Fish/RhamphochromisEsox_mt.gb\tgb' > REFmap.txt

In [2]:
!cat REFmap.txt

../Reference_Alignment/Amphibians/12S_UKamphibians_SATIVA_cleaned.gb	gb
 ../Reference_Alignment/Amphibians/Amphibian12S_ReferenceSequences.fasta	fasta
 ../Reference_Alignment/Reptiles/12S_UKreptiles_SATIVA_cleaned.gb	gb
 ../Reference_Alignment/Mammals/12S_UKmammals_SATIVA_cleaned.gb	gb
 ../Reference_Alignment/Birds/12S_UKbirds_SATIVA_cleaned.gb	gb
 ../Reference_Alignment/Fish/20161026_INBO_12S_fishrefs_hfj_edit.fasta	fasta
 ../Reference_Alignment/Fish/custom_extended_12S_edit_10_2016.gb	gb
 ../Reference_Alignment/Fish/RhamphochromisEsox_mt.gb	gb
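
Note that the `echo` call leaves a leading space on every line after the first, as visible in the output above. A possible alternative, sketched here in Python, writes one `path<TAB>format` line per reference file without that stray space. The function name `write_refmap` and the output name `REFmap_sketch.txt` are assumptions chosen so that the real REFmap.txt is not overwritten.

```python
from pathlib import Path

# Reference files and formats, copied from the echo command above.
REFERENCES = [
    ("../Reference_Alignment/Amphibians/12S_UKamphibians_SATIVA_cleaned.gb", "gb"),
    ("../Reference_Alignment/Amphibians/Amphibian12S_ReferenceSequences.fasta", "fasta"),
    ("../Reference_Alignment/Reptiles/12S_UKreptiles_SATIVA_cleaned.gb", "gb"),
    ("../Reference_Alignment/Mammals/12S_UKmammals_SATIVA_cleaned.gb", "gb"),
    ("../Reference_Alignment/Birds/12S_UKbirds_SATIVA_cleaned.gb", "gb"),
    ("../Reference_Alignment/Fish/20161026_INBO_12S_fishrefs_hfj_edit.fasta", "fasta"),
    ("../Reference_Alignment/Fish/custom_extended_12S_edit_10_2016.gb", "gb"),
    ("../Reference_Alignment/Fish/RhamphochromisEsox_mt.gb", "gb"),
]


def write_refmap(entries, out_path):
    """Write one tab-separated 'path, format' line per reference file."""
    lines = ["{}\t{}".format(path, fmt) for path, fmt in entries]
    Path(out_path).write_text("\n".join(lines) + "\n")
    return lines


# Hypothetical file name, used here only to avoid clobbering REFmap.txt.
lines = write_refmap(REFERENCES, "REFmap_sketch.txt")
print("\n".join(lines))
```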


Produce the text file containing non-chimera query sequences - Querymap.txt

In [3]:
%%bash

#Querymap
for a in $(ls -l ../2-chimera_detection/ | grep "^d" | perl -ne 'chomp; @a=split(" "); print "$a[-1]\n"')
do
    echo -e "$a-nc\tfasta\t../2-chimera_detection/$a/$a-nonchimeras.fasta"
done > Querymap.txt

In [4]:
!cat Querymap.txt

B1_-nc	fasta	../2-chimera_detection/B1_/B1_-nonchimeras.fasta
B2_-nc	fasta	../2-chimera_detection/B2_/B2_-nonchimeras.fasta
B3_-nc	fasta	../2-chimera_detection/B3_/B3_-nonchimeras.fasta
B4_-nc	fasta	../2-chimera_detection/B4_/B4_-nonchimeras.fasta
Blank_-nc	fasta	../2-chimera_detection/Blank_/Blank_-nonchimeras.fasta
F1_-nc	fasta	../2-chimera_detection/F1_/F1_-nonchimeras.fasta
F10_-nc	fasta	../2-chimera_detection/F10_/F10_-nonchimeras.fasta
F100_-nc	fasta	../2-chimera_detection/F100_/F100_-nonchimeras.fasta
F101_-nc	fasta	../2-chimera_detection/F101_/F101_-nonchimeras.fasta
F102_-nc	fasta	../2-chimera_detection/F102_/F102_-nonchimeras.fasta
F103_-nc	fasta	../2-chimera_detection/F103_/F103_-nonchimeras.fasta
F104_-nc	fasta	../2-chimera_detection/F104_/F104_-nonchimeras.fasta
F105_-nc	fasta	../2-chimera_detection/F105_/F105_-nonchimeras.fasta
F106_-nc	fasta	../2-chimera_detection/F106_/F106_-nonchimeras.fasta
F107_-nc	fasta	../2-chimera_detection/F107_/F107_-nonchimeras.fa

The Querymap.txt file has been created, but it includes an entry for the GLOBAL directory, in which all centroids and queries are contained (line 514). This entry will cause metaBEAT to fail, so it must be removed. Here `sed` deletes it and writes the result to Querymap_final.txt.

In [5]:
!sed '/GLOBAL/d' Querymap.txt > Querymap_final.txt

In [6]:
!cat Querymap_final.txt

B1_-nc	fasta	../2-chimera_detection/B1_/B1_-nonchimeras.fasta
B2_-nc	fasta	../2-chimera_detection/B2_/B2_-nonchimeras.fasta
B3_-nc	fasta	../2-chimera_detection/B3_/B3_-nonchimeras.fasta
B4_-nc	fasta	../2-chimera_detection/B4_/B4_-nonchimeras.fasta
Blank_-nc	fasta	../2-chimera_detection/Blank_/Blank_-nonchimeras.fasta
F1_-nc	fasta	../2-chimera_detection/F1_/F1_-nonchimeras.fasta
F10_-nc	fasta	../2-chimera_detection/F10_/F10_-nonchimeras.fasta
F100_-nc	fasta	../2-chimera_detection/F100_/F100_-nonchimeras.fasta
F101_-nc	fasta	../2-chimera_detection/F101_/F101_-nonchimeras.fasta
F102_-nc	fasta	../2-chimera_detection/F102_/F102_-nonchimeras.fasta
F103_-nc	fasta	../2-chimera_detection/F103_/F103_-nonchimeras.fasta
F104_-nc	fasta	../2-chimera_detection/F104_/F104_-nonchimeras.fasta
F105_-nc	fasta	../2-chimera_detection/F105_/F105_-nonchimeras.fasta
F106_-nc	fasta	../2-chimera_detection/F106_/F106_-nonchimeras.fasta
F107_-nc	fasta	../2-chimera_detection/F107_/F107_-nonchimeras.fa
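
The two steps above (list the sample directories, then drop GLOBAL) could also be done in one pass. The following is a minimal sketch under that assumption. The helper names `build_querymap` and `list_sample_dirs` are hypothetical, and the example call uses a short made-up directory list rather than the real folder contents.

```python
import os


def list_sample_dirs(base):
    """Return the names of all sub-directories of `base`."""
    return [entry.name for entry in os.scandir(base) if entry.is_dir()]


def build_querymap(sample_dirs, base="../2-chimera_detection", skip=("GLOBAL",)):
    """Build 'sample-nc<TAB>fasta<TAB>path' lines, leaving out skipped directories."""
    lines = []
    for name in sorted(sample_dirs):
        if name in skip:
            continue  # GLOBAL holds pooled data, not a single sample
        lines.append(
            "{0}-nc\tfasta\t{1}/{0}/{0}-nonchimeras.fasta".format(name, base)
        )
    return lines


# Toy directory list for illustration; in practice one would pass
# list_sample_dirs("../2-chimera_detection") instead.
example = build_querymap(["B1_", "GLOBAL", "F1_"])
print("\n".join(example))
```

Filtering by directory name before writing avoids the intermediate file and does not depend on the column layout of `ls -l`.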

That's almost it. Now start the pipeline to cluster the non-chimeric queries and assign taxonomy via metaBEAT. The inputs are Querymap_final.txt, which lists the samples that have been trimmed, merged and checked for chimeras, and REFmap.txt. metaBEAT will be asked to attempt taxonomic assignment with the two different approaches mentioned above.

Kraken requires a specific database that metaBEAT will build automatically if necessary.

metaBEAT will automatically wrangle the data into the particular file formats that are required by each of the methods, run all necessary steps, and finally convert the outputs of each program to a standardized BIOM table.

**GO!**

In [7]:
!metaBEAT_global.py -h

usage: metaBEAT.py [-h] [-Q <FILE>] [-B <FILE>] [--g_queries <FILE>] [-v] [-s]
                   [-f] [-p] [-k] [-t] [-b] [-m <string>] [-n <INT>] [-E] [-e]
                   [--read_stats_off] [--PCR_primer <FILE>]
                   [--trim_adapter <FILE>] [--trim_qual <INT>] [--phred <INT>]
                   [--trim_window <INT>] [--read_crop <INT>]
                   [--trim_minlength <INT>] [--merge] [--product_length <INT>]
                   [--merged_only] [--forward_only] [--length_filter <INT>]
                   [--length_deviation <FLOAT>] [-R <FILE>] [--gb_out <FILE>]
                   [--rec_check] [--gi_to_taxid <FILE>] [--cluster]
                   [--clust_match <FLOAT>] [--clust_cov <INT>]
                   [--blast_db <PATH>] [--blast_xml <PATH>]
                   [--min_ident <FLOAT>] [--min_ali_length <FLOAT>]
                   [--bitscore_skim_LCA <FLOAT>] [--bitscore_skim_adjust_off]
                   [--min_bit <INT>] [--refpkg <DIR>] [--jp

In [8]:
%%bash

metaBEAT_global.py \
-Q Querymap_final.txt \
-R REFmap.txt \
--cluster --clust_match 0.97 --clust_cov 5 \
--blast --min_ident 0.95 --min_ali_length 0.8 \
--kraken \
-m 12S -n 5 \
-E -v \
-@ L.Harper@2015.hull.ac.uk \
-o 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast95-id1 &> log95

metaBEAT will generate a directory with all temporary files that were created during the processing for each sample and will record useful stats summarizing the data processing in the file 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast95-id1_read_stats.csv.

Note that the entire process only took ~10 minutes.

All BLAST and KRAKEN outputs are kept in the directory ./GLOBAL.

metaBEAT has produced results tables in BIOM format for each of the approaches in the corresponding directory.
A nice tool for interactive exploration of the results is Phinch. Load one of the tables (*.biom) and see if you can get an overview of the results.
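
For a quick look without a viewer, a BIOM table can also be summarised in a few lines of Python. The sketch below assumes the table is in the JSON-based BIOM 1.0 layout (keys `rows`, `columns`, `matrix_type`, `data`). I have not verified which BIOM version metaBEAT writes, and an HDF5-based BIOM 2.x file would need the `biom-format` package instead. The toy table and the names `sampleA`, `sampleB` and `sample_totals` are made up for illustration.

```python
import json


def load_biom_json(path):
    """Load a JSON-formatted (BIOM 1.0) table from disk."""
    with open(path) as handle:
        return json.load(handle)


def sample_totals(biom):
    """Sum the counts per sample for a sparse or dense BIOM 1.0 table."""
    sample_ids = [column["id"] for column in biom["columns"]]
    totals = dict.fromkeys(sample_ids, 0)
    if biom["matrix_type"] == "sparse":
        # Sparse data is a list of [row_index, column_index, value] triplets.
        for _row_idx, col_idx, value in biom["data"]:
            totals[sample_ids[col_idx]] += value
    else:
        # Dense data is a list of rows, one value per column.
        for row in biom["data"]:
            for col_idx, value in enumerate(row):
                totals[sample_ids[col_idx]] += value
    return totals


# Toy table for illustration only (not results from this analysis).
toy = {
    "rows": [{"id": "OTU_1", "metadata": None}, {"id": "OTU_2", "metadata": None}],
    "columns": [{"id": "sampleA", "metadata": None}, {"id": "sampleB", "metadata": None}],
    "matrix_type": "sparse",
    "shape": [2, 2],
    "data": [[0, 0, 5], [1, 1, 3], [1, 0, 2]],
}
print(sample_totals(toy))
```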

If repeating analysis once databases have been created:

In [9]:
# metaBEAT_global.py \
#-Q Querymap_final.txt \
#-R REFmap.txt \
#--cluster --clust_match 0.97 --clust_cov 5 \
#--blast --blast_xml ./GLOBAL/BLAST_0.95/global_blastn.out.xml --min_ident 0.95 --min_ali_length 0.8 \
#--kraken --kraken_db ./GLOBAL/KRAKEN/KRAKEN_DB/ \
#-m 12S -n 5 \
#-E -v \
#-@ L.Harper@2015.hull.ac.uk \
#-o 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast95-id1 &> log95

Repeat the metaBEAT analysis at increasing minimum identity thresholds for the BLAST search. 

In [10]:
%%bash

metaBEAT_global.py \
-Q Querymap_final.txt \
-R REFmap.txt \
--cluster --clust_match 0.97 --clust_cov 5 \
--blast --min_ident 0.96 --min_ali_length 0.8 \
-m 12S -n 5 \
-E -v \
-@ L.Harper@2015.hull.ac.uk \
-o 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast96-id1 &> log96

In [11]:
%%bash

metaBEAT_global.py \
-Q Querymap_final.txt \
-R REFmap.txt \
--cluster --clust_match 0.97 --clust_cov 5 \
--blast --min_ident 0.97 --min_ali_length 0.8 \
-m 12S -n 5 \
-E -v \
-@ L.Harper@2015.hull.ac.uk \
-o 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast97-id1 &> log97

In [12]:
%%bash

metaBEAT_global.py \
-Q Querymap_final.txt \
-R REFmap.txt \
--cluster --clust_match 0.97 --clust_cov 5 \
--blast --min_ident 0.98 --min_ali_length 0.8 \
-m 12S -n 5 \
-E -v \
-@ L.Harper@2015.hull.ac.uk \
-o 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast98-id1 &> log98

In [13]:
%%bash

metaBEAT_global.py \
-Q Querymap_final.txt \
-R REFmap.txt \
--cluster --clust_match 0.97 --clust_cov 5 \
--blast --min_ident 0.99 --min_ali_length 0.8 \
-m 12S -n 5 \
-E -v \
-@ L.Harper@2015.hull.ac.uk \
-o 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast99-id1 &> log99

In [14]:
%%bash

metaBEAT_global.py \
-Q Querymap_final.txt \
-R REFmap.txt \
--cluster --clust_match 0.97 --clust_cov 5 \
--blast --min_ident 1 --min_ali_length 0.8 \
-m 12S -n 5 \
-E -v \
-@ L.Harper@2015.hull.ac.uk \
-o 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast100-id1 &> log100
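
The five BLAST-only runs above differ only in `--min_ident` and in the output and log names, so they could be generated from a single template. The sketch below only builds and prints the command lines, using the flags exactly as they appear in the cells above. The function name `metabeat_args` and the placeholder e-mail address are assumptions; actually launching the runs (for example with `subprocess.run`) is left out.

```python
import shlex

PREFIX = "12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5"


def metabeat_args(min_ident, email, prefix=PREFIX):
    """Return (argument list, log file name) for one BLAST identity threshold.

    `min_ident` is passed as a string so it reaches the command unchanged.
    """
    percent = int(round(float(min_ident) * 100))
    args = [
        "metaBEAT_global.py",
        "-Q", "Querymap_final.txt",
        "-R", "REFmap.txt",
        "--cluster", "--clust_match", "0.97", "--clust_cov", "5",
        "--blast", "--min_ident", min_ident, "--min_ali_length", "0.8",
        "-m", "12S", "-n", "5",
        "-E", "-v",
        "-@", email,
        "-o", "{}_blast{}-id1".format(prefix, percent),
    ]
    return args, "log{}".format(percent)


for threshold in ["0.96", "0.97", "0.98", "0.99", "1"]:
    # Placeholder address (assumption): replace with your own.
    args, log_name = metabeat_args(threshold, email="you@example.org")
    print(" ".join(shlex.quote(arg) for arg in args), "&>", log_name)
```

Deriving the output prefix and the log name from the same threshold keeps the two in step across runs.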

After inspecting the taxonomic assignments at the different BLAST minimum identity thresholds, I decided to proceed with a minimum identity threshold of 98%, as this gave sound species identifications. Lower thresholds included spurious species such as *Lota lota*, an introduced fish species not present in ponds, and higher thresholds removed common, native pond species with high read counts such as *Lissotriton vulgaris*. Due to the high variability of the 12S rRNA region, a 1 or 2 bp mismatch can prevent assignment of some species at high BLAST identity thresholds.
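
The effect of a 1 or 2 bp mismatch can be checked with simple arithmetic. The sketch below assumes identity is computed as matches divided by alignment length, and uses hypothetical alignment lengths of 100 bp and 90 bp for illustration. How metaBEAT and BLAST count gaps or partial alignments may differ from this simplification.

```python
def percent_identity(alignment_length, mismatches):
    """Identity as the percentage of matching positions in the alignment."""
    return 100.0 * (alignment_length - mismatches) / alignment_length


def passes(alignment_length, mismatches, min_ident_percent):
    """True if the hit meets the threshold (integer arithmetic, no rounding)."""
    matches = alignment_length - mismatches
    return matches * 100 >= min_ident_percent * alignment_length


# Hypothetical alignment lengths (assumption), with 1 and 2 bp mismatches.
for length in (100, 90):
    for mismatches in (1, 2):
        kept_at = [t for t in (95, 96, 97, 98, 99, 100) if passes(length, mismatches, t)]
        print(
            "{} bp, {} mismatch(es): {:.2f}% identity, kept at thresholds {}".format(
                length, mismatches, percent_identity(length, mismatches), kept_at
            )
        )
```

Under these assumptions, a single mismatch in a 100 bp alignment already fails the 100% threshold, and two mismatches fail the 99% threshold. On shorter alignments each mismatch costs proportionally more.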

Re-run the BLAST at a minimum identity threshold of 98% so that the GLOBAL files, including global_queries.fasta, contain these results rather than those of the last BLAST search at 100%. These files will be required to perform a BLAST of the unassigned reads.

In [15]:
%%bash

metaBEAT_global.py \
-Q Querymap_final.txt \
-R REFmap.txt \
--cluster --clust_match 0.97 --clust_cov 5 \
--blast --min_ident 0.98 --min_ali_length 0.8 \
-m 12S -n 5 \
-E -v \
-@ L.Harper@2015.hull.ac.uk \
-o 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast98-id1 &> log98