I will attempt to assign taxonomic identity to the set of de novo OTUs obtained from the 532 freshwater pond eDNA samples and PCR controls previously processed using metaBEAT.

I will be using a custom phylogenetically curated reference database. Taxonomic assignment will be performed using a BLAST-based LCA approach.

I will again be using [metaBEAT](https://github.com/HullUni-bioinformatics/metaBEAT) to facilitate reproducibility.

The final result of this notebook will be a taxonomically annotated OTU table in BIOM and TSV format, which I can then use in downstream analysis. The BIOM format, and the associated set of Python functions, was developed as a standardized format for representing 'biological sample by observation contingency tables' in the -omics area.

Most of the input data was produced during processing of the eDNA samples. The rest can be found in the folder `../Reference_database`. 

I must specify the location and file format of the reference sequences. Different formats (fasta, GenBank) can be mixed and matched. A simple text file that contains the path to each file and its format specification must be prepared.

The reference sequences in GenBank/fasta format are contained in the directory `../Reference_database`. The curated GenBank files follow the naming pattern `12S_UK...._SATIVA_cleaned.gb`, and additional fasta files contain Sanger sequences that supplement the records on GenBank.

In [1]:
!pwd

/home/working/Harper_et_al_2018/Jupyter_notebooks


In [2]:
cd ..

/home/working/Harper_et_al_2018


In [3]:
!mkdir 3-taxonomic_assignment

In [4]:
cd 3-taxonomic_assignment/

/home/working/Harper_et_al_2018/3-taxonomic_assignment


Produce the text file listing the reference sequence files and their formats using the command line - we call it `REFmap.txt`.

In [5]:
!echo '../Reference_database/Amphibians/12S_UKamphibians_SATIVA_cleaned.gb\tgb\n' \
'../Reference_database/Amphibians/Amphibian12S_ReferenceSequences.fasta\tfasta\n' \
'../Reference_database/Reptiles/12S_UKreptiles_SATIVA_cleaned.gb\tgb\n' \
'../Reference_database/Mammals/12S_UKmammals_SATIVA_cleaned.gb\tgb\n' \
'../Reference_database/Birds/12S_UKbirds_SATIVA_cleaned.gb\tgb\n' \
'../Reference_database/Fish/20161026_INBO_12S_fishrefs_hfj_edit.fasta\tfasta\n' \
'../Reference_database/Fish/custom_extended_12S_edit_10_2016.gb\tgb\n' \
'../Reference_database/Fish/RhamphochromisEsox_mt.gb\tgb' > REFmap.txt

In [6]:
!cat REFmap.txt

../Reference_database/Amphibians/12S_UKamphibians_SATIVA_cleaned.gb	gb
 ../Reference_database/Amphibians/Amphibian12S_ReferenceSequences.fasta	fasta
 ../Reference_database/Reptiles/12S_UKreptiles_SATIVA_cleaned.gb	gb
 ../Reference_database/Mammals/12S_UKmammals_SATIVA_cleaned.gb	gb
 ../Reference_database/Birds/12S_UKbirds_SATIVA_cleaned.gb	gb
 ../Reference_database/Fish/20161026_INBO_12S_fishrefs_hfj_edit.fasta	fasta
 ../Reference_database/Fish/custom_extended_12S_edit_10_2016.gb	gb
 ../Reference_database/Fish/RhamphochromisEsox_mt.gb	gb
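
The `cat` output above shows a leading space on every line after the first, because `echo` joins its arguments with a single space. Whether metaBEAT tolerates this is not stated here. As a minimal sketch of an alternative, the same map can be assembled in Python so that each line is exactly `path<TAB>format`. The helper names `refmap_text` and `write_refmap` are hypothetical and not part of metaBEAT.

```python
# Sketch: build REFmap.txt as "<path><TAB><format>" lines, one per reference file.
# The paths and formats are copied from the echo command above.
REFERENCES = [
    ("../Reference_database/Amphibians/12S_UKamphibians_SATIVA_cleaned.gb", "gb"),
    ("../Reference_database/Amphibians/Amphibian12S_ReferenceSequences.fasta", "fasta"),
    ("../Reference_database/Reptiles/12S_UKreptiles_SATIVA_cleaned.gb", "gb"),
    ("../Reference_database/Mammals/12S_UKmammals_SATIVA_cleaned.gb", "gb"),
    ("../Reference_database/Birds/12S_UKbirds_SATIVA_cleaned.gb", "gb"),
    ("../Reference_database/Fish/20161026_INBO_12S_fishrefs_hfj_edit.fasta", "fasta"),
    ("../Reference_database/Fish/custom_extended_12S_edit_10_2016.gb", "gb"),
    ("../Reference_database/Fish/RhamphochromisEsox_mt.gb", "gb"),
]


def refmap_text(entries):
    """Return the file content: one tab-separated 'path, format' line per entry."""
    return "".join("{}\t{}\n".format(path, fmt) for path, fmt in entries)


def write_refmap(entries, filename="REFmap.txt"):
    """Write the reference map to `filename` (hypothetical helper)."""
    with open(filename, "w") as handle:
        handle.write(refmap_text(entries))


# Preview the content without touching the file system.
print(refmap_text(REFERENCES).rstrip("\n"))
```

Calling `write_refmap(REFERENCES)` from the `3-taxonomic_assignment` directory would then write the file itself.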


Produce the text file containing non-chimera query sequences - `Querymap.txt`

In [7]:
%%bash

#Querymap
for a in $(ls -l ../2-chimera_detection/ | grep "^d" | perl -ne 'chomp; @a=split(" "); print "$a[-1]\n"')
do
    echo -e "$a-nc\tfasta\t../2-chimera_detection/$a/$a-nonchimeras.fasta"
done > Querymap.txt

In [8]:
!cat Querymap.txt

F1-nc	fasta	../2-chimera_detection/F1/F1-nonchimeras.fasta
F10-nc	fasta	../2-chimera_detection/F10/F10-nonchimeras.fasta
F100-nc	fasta	../2-chimera_detection/F100/F100-nonchimeras.fasta
F101-nc	fasta	../2-chimera_detection/F101/F101-nonchimeras.fasta
F102-nc	fasta	../2-chimera_detection/F102/F102-nonchimeras.fasta
F103-nc	fasta	../2-chimera_detection/F103/F103-nonchimeras.fasta
F104-nc	fasta	../2-chimera_detection/F104/F104-nonchimeras.fasta
F105-nc	fasta	../2-chimera_detection/F105/F105-nonchimeras.fasta
F106-nc	fasta	../2-chimera_detection/F106/F106-nonchimeras.fasta
F107-nc	fasta	../2-chimera_detection/F107/F107-nonchimeras.fasta
F108-nc	fasta	../2-chimera_detection/F108/F108-nonchimeras.fasta
F109-nc	fasta	../2-chimera_detection/F109/F109-nonchimeras.fasta
F11-nc	fasta	../2-chimera_detection/F11/F11-nonchimeras.fasta
F110-nc	fasta	../2-chimera_detection/F110/F110-nonchimeras.fasta
F111-nc	fasta	../2-chimera_detection/F111/F111-nonchimeras.fasta
F112-nc	fasta	../2-chi

The `Querymap.txt` file has been made but includes the `./GLOBAL` directory in which all centroids and queries are contained (line 514). This will cause metaBEAT to fail, so that line must be removed from the query map; the cleaned copy is saved as `Querymap_final.txt`.

In [9]:
!sed '/GLOBAL/d' Querymap.txt > Querymap_final.txt

In [10]:
!cat Querymap_final.txt

F1-nc	fasta	../2-chimera_detection/F1/F1-nonchimeras.fasta
F10-nc	fasta	../2-chimera_detection/F10/F10-nonchimeras.fasta
F100-nc	fasta	../2-chimera_detection/F100/F100-nonchimeras.fasta
F101-nc	fasta	../2-chimera_detection/F101/F101-nonchimeras.fasta
F102-nc	fasta	../2-chimera_detection/F102/F102-nonchimeras.fasta
F103-nc	fasta	../2-chimera_detection/F103/F103-nonchimeras.fasta
F104-nc	fasta	../2-chimera_detection/F104/F104-nonchimeras.fasta
F105-nc	fasta	../2-chimera_detection/F105/F105-nonchimeras.fasta
F106-nc	fasta	../2-chimera_detection/F106/F106-nonchimeras.fasta
F107-nc	fasta	../2-chimera_detection/F107/F107-nonchimeras.fasta
F108-nc	fasta	../2-chimera_detection/F108/F108-nonchimeras.fasta
F109-nc	fasta	../2-chimera_detection/F109/F109-nonchimeras.fasta
F11-nc	fasta	../2-chimera_detection/F11/F11-nonchimeras.fasta
F110-nc	fasta	../2-chimera_detection/F110/F110-nonchimeras.fasta
F111-nc	fasta	../2-chimera_detection/F111/F111-nonchimeras.fasta
F112-nc	fasta	../2-chi
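
The two steps above (list the sample directories, then drop `GLOBAL`) can also be done in one pass. The following is a minimal sketch, not the method used in this notebook; `querymap_lines` is a hypothetical helper, and the demonstration runs on a throwaway directory with made-up sample names rather than on `../2-chimera_detection`.

```python
import os
import tempfile


def querymap_lines(chimera_dir, skip=("GLOBAL",)):
    """Return 'sample-nc<TAB>fasta<TAB>path' lines for each sample directory.

    Directories named in `skip` and plain files are ignored.
    """
    root = chimera_dir.rstrip("/")
    lines = []
    for name in sorted(os.listdir(chimera_dir)):
        if name in skip or not os.path.isdir(os.path.join(chimera_dir, name)):
            continue
        fasta = "{0}/{1}/{1}-nonchimeras.fasta".format(root, name)
        lines.append("{0}-nc\tfasta\t{1}".format(name, fasta))
    return lines


# Demonstration on a temporary directory (hypothetical sample names).
demo_root = tempfile.mkdtemp()
for sample in ("F1", "F10", "GLOBAL"):
    os.mkdir(os.path.join(demo_root, sample))
demo_lines = querymap_lines(demo_root)
print("\n".join(demo_lines))
```

In the notebook's working directory, `querymap_lines("../2-chimera_detection/")` would give the lines to write to `Querymap_final.txt`.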

That's almost it. Now start the pipeline to cluster sequences and assign taxonomy to the non-chimeric queries via metaBEAT. As input, `Querymap_final.txt`, which lists the samples that have been trimmed, merged and checked for chimeras, and the `REFmap.txt` file must be specified. metaBEAT will be asked to attempt taxonomic assignment with the BLAST-based LCA approach mentioned above.

Kraken requires a specific database that metaBEAT will build automatically if necessary.

metaBEAT will automatically wrangle the data into the particular file formats that are required by each of the methods, run all necessary steps, and finally convert the outputs of each program to a standardized BIOM table.

**GO!**

In [11]:
!metaBEAT_global.py -h

usage: metaBEAT.py [-h] [-Q <FILE>] [-B <FILE>] [--g_queries <FILE>] [-v] [-s]
                   [-f] [-p] [-k] [-t] [-b] [-m <string>] [-n <INT>] [-E] [-e]
                   [--read_stats_off] [--PCR_primer <FILE>] [--bc_dist <INT>]
                   [--trim_adapter <FILE>] [--trim_qual <INT>] [--phred <INT>]
                   [--trim_window <INT>] [--read_crop <INT>]
                   [--trim_minlength <INT>] [--merge] [--product_length <INT>]
                   [--merged_only] [--forward_only] [--length_filter <INT>]
                   [--length_deviation <FLOAT>] [-R <FILE>] [--gb_out <FILE>]
                   [--rec_check] [--gb_to_taxid <FILE>] [--cluster]
                   [--clust_match <FLOAT>] [--clust_cov <INT>]
                   [--blast_db <PATH>] [--blast_xml <PATH>]
                   [--update_taxonomy] [--taxonomy_db <FILE>]
                   [--min_ident <FLOAT>] [--min_ali_length <FLOAT>]
                   [--bitscore_skim_LCA <FLOAT>] [--bitsc

In [None]:
%%bash

metaBEAT_global.py \
-Q Querymap_final.txt \
-R REFmap.txt \
--cluster --clust_match 0.97 --clust_cov 5 \
--blast --min_ident 0.95 --min_ali_length 0.8 \
-m 12S -n 5 \
-E -v \
-@ L.Harper@2015.hull.ac.uk \
-o 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast95-id1 &> log95

In [None]:
!tail -n 50 log95

metaBEAT will generate a directory with all temporary files that were created during the processing for each sample and will record useful stats summarizing the data processing in the file `12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast95-id1_read_stats.csv`.

Note that the entire process only took ~10 minutes.

All BLAST outputs are kept in the directory `./GLOBAL`.

metaBEAT has produced results tables in BIOM format for each of the approaches in the corresponding directory.
A nice tool for interactive exploration of results is Phinch. Load one of the tables (`*.biom`) and see if you can get an overview of the results.

**OR**

If you experience an error, the chances are this is related to changes in the NCBI taxonomy database. The metaBEAT image comes prepared with a database that contains taxonomic information for taxids. However, new taxids are constantly being added to GenBank, and the database contained in the image (`/usr/bin/taxonomy.db`) is not necessarily up to date. If the above command gives you errors about certain taxids not being found in the database, or KeyErrors relating to a taxid, you will need to do one or more of the following:
1. Create a `gb_to_taxid.csv` file with the problematic taxid and corresponding gi number
2. Update the local taxonomy database that metaBEAT uses.
3. Manually add taxids to the local taxonomy database.
4. Manually search for the gi number or taxid that causes the pipeline to break, and add it, along with its corresponding taxid or gi number, to the `gi_to_taxid.csv` file.

We have provided the `gb_to_taxid.csv` file that was created when we BLASTed the unassigned reads against the entire NCBI nucleotide database during our analysis. It is in `../Jupyter_notebooks` but must be moved to the directory in which the current notebook is performing the BLAST, in this case `../3-taxonomic_assignment` (the current working directory).

In [None]:
!mv ../Jupyter_notebooks/gb_to_taxid.csv ../3-taxonomic_assignment/

Now rerun the metaBEAT command to see if taxonomy issues have been resolved.

To update the local taxonomy database:

In [13]:
%%bash

# Remove the old database
rm /usr/bin/taxonomy.db

# Create a new database. This command will first download the latest NCBI taxonomy info (taxdmp.zip) and configure the database
taxit new_database -d /usr/bin/taxonomy.db

# Remove the temporary file that was downloaded by the previous command
rm /usr/bin/taxdmp.zip

usage: taxit [-h] [-V] [-v] [-q]
             {help,add_nodes,add_to_taxtable,check,composition,create,extract_nodes,findcompany,get_lineage,info,lineage_table,lonelynodes,namelookup,new_database,refpkg_intersection,reroot,rollback,rollforward,rp,strip,taxids,taxtable,update,update_taxids}
             ...
taxit: error: unrecognized arguments: -d
rm: cannot remove ‘/usr/bin/taxdmp.zip’: No such file or directory


In [14]:
%%bash

# Remove the old database
rm /usr/bin/taxonomy.db

# Create a new database. This command will first download the latest NCBI taxonomy info (taxdmp.zip) and configure the database
taxit new_database -u ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip -p /usr/bin/

# Remove the temporary file that was downloaded by the previous command
rm /usr/bin/taxdmp.zip

rm: cannot remove ‘/usr/bin/taxonomy.db’: No such file or directory


In [55]:
%%bash

# Remove the old database
rm -r /usr/bin/taxonomy.db

# Make directory for new database
mkdir /usr/bin/taxonomy.db

# This command will download the latest NCBI taxonomy info (taxdmp.zip) into the created directory
taxit new_database -u ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip -p /usr/bin/taxonomy.db/

# Unzip taxdmp.zip file to configure the database. Need to specify the destination directory.
unzip /usr/bin/taxonomy.db/taxdmp.zip -d /usr/bin/taxonomy.db/

# Remove the temporary file that was downloaded by the previous command
rm /usr/bin/taxonomy.db/taxdmp.zip

Archive:  /usr/bin/taxonomy.db/taxdmp.zip
  inflating: /usr/bin/taxonomy.db/citations.dmp  
  inflating: /usr/bin/taxonomy.db/delnodes.dmp  
  inflating: /usr/bin/taxonomy.db/division.dmp  
  inflating: /usr/bin/taxonomy.db/gencode.dmp  
  inflating: /usr/bin/taxonomy.db/merged.dmp  
  inflating: /usr/bin/taxonomy.db/names.dmp  
  inflating: /usr/bin/taxonomy.db/nodes.dmp  
  inflating: /usr/bin/taxonomy.db/gc.prt  
  inflating: /usr/bin/taxonomy.db/readme.txt  


In [79]:
%%bash

# Remove the old database
rm -r /usr/bin/taxonomy.db

# This command will download the latest NCBI taxonomy info (taxdmp.zip) into /usr/bin/
taxit new_database -u ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip -p /usr/bin/

# Unzip taxdmp.zip file to configure the database. Need to specify the destination directory.
unzip /usr/bin/taxdmp.zip -d /usr/bin/taxonomy.db/

# Remove the temporary file that was downloaded by the previous command
rm /usr/bin/taxdmp.zip

Archive:  /usr/bin/taxdmp.zip
  inflating: /usr/bin/taxonomy.db/citations.dmp  
  inflating: /usr/bin/taxonomy.db/delnodes.dmp  
  inflating: /usr/bin/taxonomy.db/division.dmp  
  inflating: /usr/bin/taxonomy.db/gencode.dmp  
  inflating: /usr/bin/taxonomy.db/merged.dmp  
  inflating: /usr/bin/taxonomy.db/names.dmp  
  inflating: /usr/bin/taxonomy.db/nodes.dmp  
  inflating: /usr/bin/taxonomy.db/gc.prt  
  inflating: /usr/bin/taxonomy.db/readme.txt  


rm: cannot remove ‘/usr/bin/taxonomy.db’: No such file or directory


In [70]:
!ls /usr/bin/taxonomy.db

citations.dmp  division.dmp  gencode.dmp  names.dmp  readme.txt
delnodes.dmp   gc.prt	     merged.dmp   nodes.dmp


This will take a few minutes, depending on your internet connection. Once updated, try rerunning the metaBEAT command.
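
Before rerunning, it can help to check whether a problematic taxid is actually present in the downloaded dump. The sketch below reads lines in the `nodes.dmp` layout, where fields are separated by `<TAB>|<TAB>` and the first three are the taxid, the parent taxid and the rank. It only inspects the dump files listed in the `ls` output above; it says nothing about what metaBEAT itself reads. The helper names and the sample lines are made up for illustration.

```python
# Synthetic lines in the nodes.dmp layout; the IDs are invented for illustration.
SAMPLE_NODES = [
    "111\t|\t1\t|\tgenus\t|\t\t|\n",
    "222\t|\t111\t|\tspecies\t|\t\t|\n",
]


def find_taxids(lines, wanted):
    """Map each wanted taxid found in `lines` to its (parent_id, rank)."""
    wanted = set(str(taxid) for taxid in wanted)
    found = {}
    for line in lines:
        tax_id, parent_id, rank = line.split("\t|\t")[:3]
        if tax_id in wanted:
            found[tax_id] = (parent_id, rank)
    return found


def missing_taxids(lines, wanted):
    """Return the wanted taxids that do not occur in `lines`, sorted."""
    present = find_taxids(lines, wanted)
    return sorted(str(taxid) for taxid in wanted if str(taxid) not in present)


print(find_taxids(SAMPLE_NODES, ["222", "999"]))
print(missing_taxids(SAMPLE_NODES, ["222", "999"]))

# Against the real dump (path taken from the `ls` output above):
# with open("/usr/bin/taxonomy.db/nodes.dmp") as handle:
#     print(missing_taxids(handle, ["1825799"]))
```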

An error can still come up if one or more taxids are present in the taxid files (`gi_to_taxid.csv`, `gb_to_taxid.csv`, `taxid.txt`) but not in the taxonomy database that the current metaBEAT image contains. Reasons for this are not clear. However, if the run fails again after updating the taxonomy database, a possible workaround is to manually add the problematic taxid to the taxonomy database. This involves creating a new node in the taxonomy database. You will have to create a csv file containing one row for each node to be added, formatted like so:

"tax_id","parent_id","rank","tax_name"

"1825799","190765","species","Ochlerotatus cantans"

There must only be one row per node but you may add as many new nodes as the number of taxids that metaBEAT could not find in the taxonomy database.
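
A file in this layout can be generated rather than typed by hand. The following minimal sketch, written for Python 3, quotes every field as in the example above and rejects a taxid that appears twice, which is one way to enforce the one-row-per-node rule. `nodes_csv` is a hypothetical helper; the single row is the example from this notebook.

```python
import csv
import io

HEADER = ["tax_id", "parent_id", "rank", "tax_name"]


def nodes_csv(rows):
    """Return CSV text with a quoted header and one quoted row per new node."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(HEADER)
    seen = set()
    for row in rows:
        tax_id = row[0]
        if tax_id in seen:
            raise ValueError("taxid listed more than once: {}".format(tax_id))
        seen.add(tax_id)
        writer.writerow(row)
    return buffer.getvalue()


NEW_NODES = [("1825799", "190765", "species", "Ochlerotatus cantans")]
print(nodes_csv(NEW_NODES), end="")

# To save the file used by the command that follows:
# with open("../new_taxids.csv", "w") as handle:
#     handle.write(nodes_csv(NEW_NODES))
```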

To add the nodes to the taxonomy database, use this command:

In [None]:
#!taxit add_nodes -d /usr/bin/taxonomy.db -N ../new_taxids.csv

In [78]:
!taxit -h

usage: taxit [-h] [-V] [-v] [-q]
             {help,add_nodes,add_to_taxtable,check,composition,create,extract_nodes,findcompany,get_lineage,info,lineage_table,lonelynodes,namelookup,new_database,refpkg_intersection,reroot,rollback,rollforward,rp,strip,taxids,taxtable,update,update_taxids}
             ...

Creation, validation, and modification of reference packages for use with
`pplacer` and related software.

positional arguments:
  {help,add_nodes,add_to_taxtable,check,composition,create,extract_nodes,findcompany,get_lineage,info,lineage_table,lonelynodes,namelookup,new_database,refpkg_intersection,reroot,rollback,rollforward,rp,strip,taxids,taxtable,update,update_taxids}
    help                Detailed help for actions using `help <action>`
    add_nodes           Add nodes and names to a database
    add_to_taxtable     Add nodes to an existing taxtable csv
    check               Validate a reference package
    composition         Show taxonomic composition of a r

If repeating the analysis once the databases have been created, the previous BLAST output can be supplied via `--blast_xml`:

In [13]:
# metaBEAT_global.py \
#-Q Querymap_final.txt \
#-R REFmap.txt \
#--cluster --clust_match 0.97 --clust_cov 5 \
#--blast --blast_xml ./GLOBAL/BLAST_0.95/global_blastn.out.xml --min_ident 0.95 --min_ali_length 0.8 \
#-m 12S -n 5 \
#-E -v \
#-@ L.Harper@2015.hull.ac.uk \
#-o 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast95-id1 &> log95

Repeat the metaBEAT analysis at increasing minimum identity thresholds for the BLAST search. 

In [14]:
%%bash

metaBEAT_global.py \
-Q Querymap_final.txt \
-R REFmap.txt \
--cluster --clust_match 0.97 --clust_cov 5 \
--blast --min_ident 0.96 --min_ali_length 0.8 \
-m 12S -n 5 \
-E -v \
-@ L.Harper@2015.hull.ac.uk \
-o 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast96-id1 &> log96

In [15]:
%%bash

metaBEAT_global.py \
-Q Querymap_final.txt \
-R REFmap.txt \
--cluster --clust_match 0.97 --clust_cov 5 \
--blast --min_ident 0.97 --min_ali_length 0.8 \
-m 12S -n 5 \
-E -v \
-@ L.Harper@2015.hull.ac.uk \
-o 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast97-id1 &> log97

In [16]:
%%bash

metaBEAT_global.py \
-Q Querymap_final.txt \
-R REFmap.txt \
--cluster --clust_match 0.97 --clust_cov 5 \
--blast --min_ident 0.98 --min_ali_length 0.8 \
-m 12S -n 5 \
-E -v \
-@ L.Harper@2015.hull.ac.uk \
-o 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast98-id1 &> log98

In [17]:
%%bash

metaBEAT_global.py \
-Q Querymap_final.txt \
-R REFmap.txt \
--cluster --clust_match 0.97 --clust_cov 5 \
--blast --min_ident 0.99 --min_ali_length 0.8 \
-m 12S -n 5 \
-E -v \
-@ L.Harper@2015.hull.ac.uk \
-o 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast99-id1 &> log99

In [18]:
%%bash

metaBEAT_global.py \
-Q Querymap_final.txt \
-R REFmap.txt \
--cluster --clust_match 0.97 --clust_cov 5 \
--blast --min_ident 1 --min_ali_length 0.8 \
-m 12S -n 5 \
-E -v \
-@ L.Harper@2015.hull.ac.uk \
-o 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast100-id1 &> log100
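
The cells above differ only in the `--min_ident` value, the output prefix and the log name. As a minimal sketch, the command lines can be generated from a list of thresholds so that those three parts cannot drift apart. `metabeat_args` is a hypothetical helper; the flags and values are copied from the cells above, the log names follow the `log95` pattern, and the sketch only prints the commands rather than running them.

```python
THRESHOLDS = [0.95, 0.96, 0.97, 0.98, 0.99, 1.0]


def metabeat_args(min_ident):
    """Return (argument list, log file name) for one BLAST identity threshold."""
    percent = int(round(min_ident * 100))
    prefix = (
        "12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_"
        "blast{}-id1".format(percent)
    )
    args = [
        "metaBEAT_global.py",
        "-Q", "Querymap_final.txt",
        "-R", "REFmap.txt",
        "--cluster", "--clust_match", "0.97", "--clust_cov", "5",
        "--blast", "--min_ident", "{:g}".format(min_ident),
        "--min_ali_length", "0.8",
        "-m", "12S", "-n", "5",
        "-E", "-v",
        "-@", "L.Harper@2015.hull.ac.uk",
        "-o", prefix,
    ]
    return args, "log{}".format(percent)


# Print the shell command for each threshold instead of executing it.
for threshold in THRESHOLDS:
    args, log_name = metabeat_args(threshold)
    print(" ".join(args) + " &> " + log_name)
```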

After inspection of the taxonomic assignment at different BLAST minimum identity thresholds, it was decided to proceed with a minimum identity threshold of 98%, as this gave sound species identifications. Lower thresholds included spurious species such as *Lota lota*, an introduced fish species not present in ponds, and higher thresholds removed common and native pond species with high read counts such as *Lissotriton vulgaris*. Due to high variability of the 12S rRNA region, a 1 or 2 bp mismatch can prevent assignment of some species at high BLAST identity thresholds.
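
The effect of a 1 or 2 bp mismatch on percent identity can be worked through with simple arithmetic. The sketch below assumes a hypothetical 100 bp alignment and assumes that a hit passes when its identity is at least the threshold; neither detail is stated in this notebook, and `percent_identity` and `passes` are illustrative helper names.

```python
def percent_identity(alignment_length, mismatches):
    """Percent of aligned positions that match."""
    return 100.0 * (alignment_length - mismatches) / alignment_length


def passes(alignment_length, mismatches, min_percent):
    """True if identity is at least `min_percent` (integer arithmetic, no rounding)."""
    return 100 * (alignment_length - mismatches) >= min_percent * alignment_length


ALIGNMENT_LENGTH = 100  # hypothetical length, for illustration only

for mismatches in (0, 1, 2):
    identity = percent_identity(ALIGNMENT_LENGTH, mismatches)
    kept_at = [t for t in (95, 96, 97, 98, 99, 100)
               if passes(ALIGNMENT_LENGTH, mismatches, t)]
    print("{} mismatch(es): {:.2f}% identity, retained at thresholds {}".format(
        mismatches, identity, kept_at))
```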

Re-run the BLAST at a minimum identity threshold of 98% so that the `./GLOBAL` files, including `global_queries.fasta`, contain these results rather than those of the last BLAST search at 100%. These files will be required to perform a BLAST of the unassigned reads.

In [19]:
%%bash

metaBEAT_global.py \
-Q Querymap_final.txt \
-R REFmap.txt \
--cluster --clust_match 0.97 --clust_cov 5 \
--blast --min_ident 0.98 --min_ali_length 0.8 \
-m 12S -n 5 \
-E -v \
-@ L.Harper@2015.hull.ac.uk \
-o 12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast98-id1 &> log98