__metaBEAT analysis of Illumina seq output for pollen transported by moths__

The first step will be to __trim/clean our raw Illumina data__.

Prepare a text file specifying the samples to be processed including the format and location of the reads.

The below command expects the Illumina data to be present in 2 fastq files (forward and reverse reads) per sample in a directory ./raw_data/. It expects the files to be named 'plateID_L001', followed by 'R1' or 'R2' to identify the forward/reverse read file respectively. 

We need a query map which lists these files along with the primer combinations for each well in each file, and the length of the primer sequence (including heterogeneity spacers, but NOT including any tags) to be trimmed off.

In [1]:
!head Querymap_global.txt

SA177	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	TCGCCTTA	TAGATCGC	20	22
SB140	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	AGCGTAGC	CTCTCTAT	20	23
T19	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	TCGCCTTA	TATCCTCT	20	24
Lg45	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	AGCGTAGC	AGAGTAGA	20	25
Q69	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	TCGCCTTA	GTAAGGAG	20	26
S72	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	AGCGTAGC	ACTGCATA	20	27
SC66	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	TCGCCTTA	AAGGAGTA	20	28
S31	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	AGCGTAGC	CTAAGCCT	20	29
SC74	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fa

In [3]:
ls raw_data/

Moth1_S1_L001_R1_001.fastq.gz  Moth3_S3_L001_R1_001.fastq.gz
Moth1_S1_L001_R2_001.fastq.gz  Moth3_S3_L001_R2_001.fastq.gz
Moth2_S2_L001_R1_001.fastq.gz  Moth4_S4_L001_R1_001.fastq.gz
Moth2_S2_L001_R2_001.fastq.gz  Moth4_S4_L001_R2_001.fastq.gz


In [5]:
%%bash

metaBEAT_global.py -h

usage: metaBEAT.py [-h] [-Q <FILE>] [-B <FILE>] [--g_queries <FILE>] [-v] [-s]
                   [-f] [-p] [-k] [-t] [-b] [-m <string>] [-n <INT>] [-E] [-e]
                   [--read_stats_off] [--PCR_primer <FILE>]
                   [--trim_adapter <FILE>] [--trim_qual <INT>] [--phred <INT>]
                   [--trim_window <INT>] [--read_crop <INT>]
                   [--trim_minlength <INT>] [--merge] [--product_length <INT>]
                   [--merged_only] [--forward_only] [--length_filter <INT>]
                   [--length_deviation <FLOAT>] [-R <FILE>] [--gb_out <FILE>]
                   [--rec_check] [--gi_to_taxid <FILE>] [--cluster]
                   [--clust_match <FLOAT>] [--clust_cov <INT>]
                   [--blast_db <PATH>] [--blast_xml <PATH>]
                   [--min_ident <FLOAT>] [--min_ali_length <FLOAT>]
                   [--min_bit <INT>] [--refpkg <DIR>] [--jplace <FILE>]
                   [--kraken_db <DIR>] [--rm_kraken_db] [-o OUTPUT_PREFIX]
   

In [6]:
%%bash

metaBEAT_global.py \
-Q Querymap_global.txt \
--trim_qual 30 \
--trim_minlength 90 \
--merge \
--product_length 350 \
--merged_only \
-R REFmap.txt \
--cluster \
--clust_match 1 \
--clust_cov 5 \
--min_ident 0.90 \
-o MothPollenRestart \
-@ callumjmacgregor@gmail.com
-n 5 -v &> log_restart


metaBEAT - metaBarcoding and Environmental DNA Analyses tool
version: v.0.97.4-global


Wed Oct 19 16:43:52 2016

/usr/bin/metaBEAT_global.py -Q Querymap_global.txt --trim_qual 30 --trim_minlength 90 --merge --product_length 350 --merged_only -R REFmap.txt --cluster --clust_match 1 --clust_cov 5 --min_ident 0.90 -o MothPollenRestart -@ callumjmacgregor@gmail.com


metaBEAT may be querying NCBI's Entrez databases to fetch/verify taxonomic ids. Entrez User requirements state that you need to identify yourself by providing an email address so that NCBI can contact you in case there is a problem.

You have specified: 'callumjmacgregor@gmail.com'

taxonomy.db found at /usr/bin/taxonomy.db

Parsing querylist file

Number of samples to process: 317
Sequence input format: defaultdict(<type 'int'>, {'fastq': 336})
Barcodes for demultiplexing provided for 336 samples
Cropping instructions provided for 336 samples


######## PROCESSING REFERENCE DATA ########


processing reference/Delosperma.gb

Trimming, pairing and clustering (within samples) has been successful. Next step is to try __identification using BLAST__ and __clustering across samples__.

Produce new querymap based on the one from the previous runs. Specify the clustering results from the last run as input.

In [7]:
%%bash

for sample in $(cat Querymap_global.txt | cut -f 1)
do
    fasta=$(ls -1 $sample/$sample\_trimmed.fasta)
    echo -e "$sample\tfasta\t$fasta"
done > Querymap.txt

head Querymap.txt

SA177	fasta	SA177/SA177_trimmed.fasta
SB140	fasta	SB140/SB140_trimmed.fasta
T19	fasta	T19/T19_trimmed.fasta
Lg45	fasta	Lg45/Lg45_trimmed.fasta
Q69	fasta	Q69/Q69_trimmed.fasta
S72	fasta	S72/S72_trimmed.fasta
SC66	fasta	SC66/SC66_trimmed.fasta
S31	fasta	S31/S31_trimmed.fasta
SC74	fasta	SC74/SC74_trimmed.fasta
SA68	fasta	SA68/SA68_trimmed.fasta


Now try doing a BLAST against the entirety of Genbank, using a recently downloaded local version

N.b. sometimes this will fail due to a dodgy GI/TaxID combination. In this case you need to: (1) figure out the GI of the search that failed; (2) go to NCBI and find out the TaxID for that GI; (3) create a file called gi_to_taxid.csv; (4) write the first line of that file as gi,taxid. If this still fails you may not have chosen the right GI, or there may be several dodgy ones.

In [13]:
%%bash

metaBEAT_global.py \
-Q Querymap.txt \
--cluster \
--clust_match 1 \
--clust_cov 5 \
--blast \
--blast_db Final/nt/nt \
--min_ident 0.95 \
-n 5 \
-v \
-@ callumjmacgregor@gmail.com \
-o MothPollenGenbankRestart > log

We want to inspect the outputs of this on a case by case basis. The file we need for this is a .xml file and it would be easier to have it as a .txt file, so we partially rerun the above query to convert it.

In [14]:
%%bash
blastn -query /home/working/GLOBAL/global_centroids.fasta \
-db /home/working/Final/nt/nt \
-out GLOBAL/BLAST_0.95/global_blastn.out.txt

Now we want to download our curated reference list (with help from inspecting the outputs of the above, and a list of East Yorkshire flora from Dick Middleton).

The list of binomial species names was prepared in a simple text file: FinalRefList.txt.

In [15]:
!head FinalRefList.txt

Abies alba
Abies cephalonica
Abies cilicica
Abies concolor
Abies delavayi
Abies firma
Abies grandis
Abies homolepis
Abies lasiocarpa
Abies nordmanniana


In [11]:
!fetch_from_db.py -t FinalRefList.txt -m rbcl -o eyorks_flora_curated -@ callumjmacgregor@gmail.com


metaBEAT may be querying NCBI's Entrez databases to fetch/verify taxonomic ids. Entrez User requirements state that you need to identify yourself by providing an email address so that NCBI can contact you in case there is a problem.

You have specified: 'callumjmacgregor@gmail.com'


QUERYING GENBANK

check for synonyms for "rbcl" (this is relevant only for Genbank searches)

fetching accessions ..

#	Abies alba	8
#	Abies cephalonica	2
#	Abies cilicica	1
#	Abies concolor	6
#	Abies delavayi	6
#	Abies firma	6
#	Abies grandis	4
#	Abies homolepis	4
#	Abies lasiocarpa	5
#	Abies nordmanniana	4
#	Abies pinsapo	5
#	Abies procera	1
#	Abies veitchii	4
#	Acaena novae-zelandiae	3
#	Acer campestre	17
#	Acer platanoides	6
#	Acer pseudoplatanus	16
#	Achillea alpina	0
#	Achillea distans	0
#	Achillea ligustica	0
#	Achillea millefolium	39
#	Achillea ptarmica	4
#	Aconitum napellus	5
#	Acorus calamus	33
#	Actaea spicata	5
#	Adonis annua	2
#	Adoxa moschatellina	7
#	Aegopodium podagraria	7
#	Aesculus carne

Now we want an additional reference file containing the positive controls (unfortunately this may be less successful as several are unexpectedly missing - so I've included a number of congenerics for those that I know are missing, in the hope of a hit)

In [12]:
!fetch_from_db.py -t PosList.txt -m rbcl -o positives_curated -@ callumjmacgregor@gmail.com


metaBEAT may be querying NCBI's Entrez databases to fetch/verify taxonomic ids. Entrez User requirements state that you need to identify yourself by providing an email address so that NCBI can contact you in case there is a problem.

You have specified: 'callumjmacgregor@gmail.com'


QUERYING GENBANK

check for synonyms for "rbcl" (this is relevant only for Genbank searches)

fetching accessions ..

#	Bryophyllum pinnatum	0
#	Crassula capitella	0
#	Crassula columnaris	0
#	Crassula deceptor	0
#	Crassula lactea	0
#	Crassula nudicaulis	1
#	Crassula ovata	0
#	Crassula perforata	1
#	Crassula socialis	0
#	Delosperma cooperi	0
#	Delosperma echinatum	3
#	Delosperma jansei	0
#	Delosperma sutherlandii	0
#	Delosperma tradescantioides	0
#	Dracaena aletriformis	2
#	Dracaena draco	3
#	Dracaena fragrans	1
#	Dracaena mannii	1
#	Dracaena marginata	0
#	Dracaena transvaalensis	1
#	Kalanchoe pinnata	7

total number of accessions fetched: 20


downloading 20 records .. processing 1000 accessions per batch

Now run another BLAST with the same parameters as before, but the curated reference database instead of the local Genbank copy

In [21]:
cd AgainstRefDB/

/home/working/AgainstRefDB


Create a new Querymap.txt, telling it to go up one directory first (in order to find the trimmed, tag-assigned sequences)

In [23]:
%%bash

for sample in $(cat ../Querymap_global.txt | cut -f 1)
do
    fasta=$(ls -1 ../$sample/$sample\_trimmed.fasta)
    echo -e "$sample\tfasta\t$fasta"
done > Querymap.txt

head Querymap.txt

SA177	fasta	../SA177/SA177_trimmed.fasta
SB140	fasta	../SB140/SB140_trimmed.fasta
T19	fasta	../T19/T19_trimmed.fasta
Lg45	fasta	../Lg45/Lg45_trimmed.fasta
Q69	fasta	../Q69/Q69_trimmed.fasta
S72	fasta	../S72/S72_trimmed.fasta
SC66	fasta	../SC66/SC66_trimmed.fasta
S31	fasta	../S31/S31_trimmed.fasta
SC74	fasta	../SC74/SC74_trimmed.fasta
SA68	fasta	../SA68/SA68_trimmed.fasta


In [44]:
%%bash

metaBEAT_global.py \
-Q Querymap.txt \
--cluster \
--clust_match 1 \
--clust_cov 5 \
--blast \
--REFlist REFmap.txt \
--min_ident 0.95 \
-n 5 \
-v \
-@ callumjmacgregor@gmail.com \
-o MothPollenReflistRestart > log_reflist

Again we want a .txt file to inspect

In [51]:
%%bash
blastn -query /home/working/AgainstRefDB/GLOBAL/global_centroids.fasta \
-db /home/working/AgainstRefDB/GLOBAL/BLAST_0.95/marker_blast_db \
-out GLOBAL/BLAST_0.95/global_blastn.out.txt

USAGE
  blastn [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-entrez_query entrez_query]
    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-perc_identity float_value] [-xdrop_ungap float_value]
    [-xdrop_gap float_value] [-xdrop_gap_final float_value]
    [-searchsp int_value] [-max_hsps_per_subject int_value] [-penalty penalty]
    [-reward reward] [-no_greedy] [-min_raw_gapped_score int_value]
    [-template_type type] [-template_length int_value] [-dust DUST_options]
    [-filtering_db filtering_database]
    [-window_masker_taxid window_masker_taxid]
    [-window_mask

In [None]:
%%bash

metaBEAT_global.py \
-Q Querymap.txt \
--cluster \
--clust_match 1 \
--clust_cov 5 \
--blast \
--blast_db Final/nt/nt \
--min_ident 0.95 \
-n 5 \
-v \
-@ callumjmacgregor@gmail.com \
-o MothPollenGenbankRestart > log

In [29]:
%%bash

metaBEAT_global.py \
-B ../GLOBAL/MothPollenGenbankRestart-OTU-denovo.biom \
--REFlist REFmap.txt \
--min_ident 0.95 \
-n 5 \
-v \
-@ callumjmacgregor@gmail.com \
-o MothPollenReflistRestart > log_reflist.txt

In [None]:
%%bash
blastn -query /home/working/AgainstRefDB/GLOBAL/global_centroids.fasta \
-db /home/working/Final/nt/nt \
-out GLOBAL/BLAST_0.95/global_blastn.out.txt

In [25]:
%%bash

metaBEAT_global.py \
-B GLOBAL/MothPollenGenbank-OTU-denovo.biom \
--blast_xml GLOBAL/BLAST_0.95/global_blastn.out.xml \
--blast --min_ident 0.95 \
-n 5 -v \
-@ callumjmacgregor@gmail.com \
-o MothPollenGenbank > log-assignment_restart.txt

In [26]:
import metaBEAT_global_misc_functions as mb

In [27]:
table = mb.load_BIOM('GLOBAL/BLAST_0.95/MothPollenGenbank-by-taxonomy-readcounts.blast.biom')


Specified BIOM input format 'json' - ok!


In [28]:
mb.find_target(target='Ensete_ventricosum', BIOM=table)

S79.blast	(0.0191 %)
SA98.blast	(0.1277 %)
SB163.blast	(0.0508 %)
