#metaBEAT analysis of Illumina seq output for pollen transported by moths

The first step will be to __trim/clean our raw Illumina data__.

Prepare a text file specifying the samples to be processed including the format and location of the reads.

The below command expects the Illumina data to be present in 2 fastq files (forward and reverse reads) per sample in a directory ./raw_data/. It expects the files to be named 'plateID_L001', followed by 'R1' or 'R2' to identify the forward/reverse read file respectively. 

We need a query map which lists these files along with the primer combinations for each well in each file, and the length of the primer sequence (including heterogeneity spacers, but NOT including any tags) to be trimmed off.

In [13]:
pwd

u'/home/working'

In [14]:
!head Querymap_global.txt

SA177	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	TCGCCTTA	TAGATCGC	20	22
SB140	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	AGCGTAGC	CTCTCTAT	20	23
T19	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	TCGCCTTA	TATCCTCT	20	24
Lg45	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	AGCGTAGC	AGAGTAGA	20	25
Q69	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	TCGCCTTA	GTAAGGAG	20	26
S72	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	AGCGTAGC	ACTGCATA	20	27
SC66	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	TCGCCTTA	AAGGAGTA	20	28
S31	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	AGCGTAGC	CTAAGCCT	20	29
SC74	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fa

In [3]:
ls raw_data/

Moth1_S1_L001_R1_001.fastq.gz  Moth3_S3_L001_R1_001.fastq.gz
Moth1_S1_L001_R2_001.fastq.gz  Moth3_S3_L001_R2_001.fastq.gz
Moth2_S2_L001_R1_001.fastq.gz  Moth4_S4_L001_R1_001.fastq.gz
Moth2_S2_L001_R2_001.fastq.gz  Moth4_S4_L001_R2_001.fastq.gz


In [5]:
%%bash

metaBEAT_global.py -h

usage: metaBEAT.py [-h] [-Q <FILE>] [-B <FILE>] [--g_queries <FILE>] [-v] [-s]
                   [-f] [-p] [-k] [-t] [-b] [-m <string>] [-n <INT>] [-E] [-e]
                   [--read_stats_off] [--PCR_primer <FILE>]
                   [--trim_adapter <FILE>] [--trim_qual <INT>] [--phred <INT>]
                   [--trim_window <INT>] [--read_crop <INT>]
                   [--trim_minlength <INT>] [--merge] [--product_length <INT>]
                   [--merged_only] [--forward_only] [--length_filter <INT>]
                   [--length_deviation <FLOAT>] [-R <FILE>] [--gb_out <FILE>]
                   [--rec_check] [--gi_to_taxid <FILE>] [--cluster]
                   [--clust_match <FLOAT>] [--clust_cov <INT>]
                   [--blast_db <PATH>] [--blast_xml <PATH>]
                   [--min_ident <FLOAT>] [--min_ali_length <FLOAT>]
                   [--min_bit <INT>] [--refpkg <DIR>] [--jplace <FILE>]
                   [--kraken_db <DIR>] [--rm_kraken_db] [-o OUTPUT_PREFIX]
   

Now we run the trimming and clustering; note that there is no assignment method selected at this stage so metaBEAT will not attempt to make any assignments

In [6]:
%%bash

metaBEAT_global.py \
-Q Querymap_global.txt \
--trim_qual 30 \
--trim_minlength 90 \
--merge \
--product_length 350 \
--merged_only \
-R REFmap.txt \
--cluster \
--clust_match 1 \
--clust_cov 5 \
--min_ident 0.90 \
-o MothPollenRestart \
-@ callumjmacgregor@gmail.com
-n 5 -v &> log_restart


metaBEAT - metaBarcoding and Environmental DNA Analyses tool
version: v.0.97.4-global


Wed Oct 19 16:43:52 2016

/usr/bin/metaBEAT_global.py -Q Querymap_global.txt --trim_qual 30 --trim_minlength 90 --merge --product_length 350 --merged_only -R REFmap.txt --cluster --clust_match 1 --clust_cov 5 --min_ident 0.90 -o MothPollenRestart -@ callumjmacgregor@gmail.com


metaBEAT may be querying NCBI's Entrez databases to fetch/verify taxonomic ids. Entrez User requirements state that you need to identify yourself by providing an email address so that NCBI can contact you in case there is a problem.

You have specified: 'callumjmacgregor@gmail.com'

taxonomy.db found at /usr/bin/taxonomy.db

Parsing querylist file

Number of samples to process: 317
Sequence input format: defaultdict(<type 'int'>, {'fastq': 336})
Barcodes for demultiplexing provided for 336 samples
Cropping instructions provided for 336 samples


######## PROCESSING REFERENCE DATA ########


processing reference/Delosperma.gb

Trimming, pairing and clustering (within samples) has been successful. Next step is to try __identification using BLAST__ and __clustering across samples__.

#BLAST

Produce new querymap based on the one from the previous runs. Specify the clustering results from the last run as input.

In [7]:
%%bash

for sample in $(cat Querymap_global.txt | cut -f 1)
do
    fasta=$(ls -1 $sample/$sample\_trimmed.fasta)
    echo -e "$sample\tfasta\t$fasta"
done > Querymap.txt

head Querymap.txt

SA177	fasta	SA177/SA177_trimmed.fasta
SB140	fasta	SB140/SB140_trimmed.fasta
T19	fasta	T19/T19_trimmed.fasta
Lg45	fasta	Lg45/Lg45_trimmed.fasta
Q69	fasta	Q69/Q69_trimmed.fasta
S72	fasta	S72/S72_trimmed.fasta
SC66	fasta	SC66/SC66_trimmed.fasta
S31	fasta	S31/S31_trimmed.fasta
SC74	fasta	SC74/SC74_trimmed.fasta
SA68	fasta	SA68/SA68_trimmed.fasta


#Blast against Genbank

Now try doing a BLAST against the entirety of Genbank, using a recently downloaded local version

N.b. sometimes this will fail due to a dodgy GI/TaxID combination. In this case you need to: (1) figure out the GI of the search that failed; (2) go to NCBI and find out the TaxID for that GI; (3) create a file called gi_to_taxid.csv; (4) write the first line of that file as gi,taxid. If this still fails you may not have chosen the right GI, or there may be several dodgy ones.

In [13]:
%%bash

metaBEAT_global.py \
-Q Querymap.txt \
--cluster \
--clust_match 1 \
--clust_cov 5 \
--blast \
--blast_db Final/nt/nt \
--min_ident 0.95 \
-n 5 \
-v \
-@ callumjmacgregor@gmail.com \
-o MothPollenGenbankRestart > log

We want to inspect the outputs of this on a case by case basis. The file we need for this is a .xml file and it would be easier to have it as a .txt file, so we partially rerun the above query to convert it.

In [14]:
%%bash
blastn -query /home/working/GLOBAL/global_centroids.fasta \
-db /home/working/Final/nt/nt \
-out GLOBAL/BLAST_0.95/global_blastn.out.txt

#Blast against reference database

Now we want to download our curated reference list (with help from inspecting the outputs of the above, and a list of East Yorkshire flora from Dick Middleton).

The list of binomial species names was prepared in a simple text file: FinalRefList.txt.

In [15]:
!head FinalRefList.txt

Abies alba
Abies cephalonica
Abies cilicica
Abies concolor
Abies delavayi
Abies firma
Abies grandis
Abies homolepis
Abies lasiocarpa
Abies nordmanniana


In [11]:
!fetch_from_db.py -t FinalRefList.txt -m rbcl -o eyorks_flora_curated -@ callumjmacgregor@gmail.com


metaBEAT may be querying NCBI's Entrez databases to fetch/verify taxonomic ids. Entrez User requirements state that you need to identify yourself by providing an email address so that NCBI can contact you in case there is a problem.

You have specified: 'callumjmacgregor@gmail.com'


QUERYING GENBANK

check for synonyms for "rbcl" (this is relevant only for Genbank searches)

fetching accessions ..

#	Abies alba	8
#	Abies cephalonica	2
#	Abies cilicica	1
#	Abies concolor	6
#	Abies delavayi	6
#	Abies firma	6
#	Abies grandis	4
#	Abies homolepis	4
#	Abies lasiocarpa	5
#	Abies nordmanniana	4
#	Abies pinsapo	5
#	Abies procera	1
#	Abies veitchii	4
#	Acaena novae-zelandiae	3
#	Acer campestre	17
#	Acer platanoides	6
#	Acer pseudoplatanus	16
#	Achillea alpina	0
#	Achillea distans	0
#	Achillea ligustica	0
#	Achillea millefolium	39
#	Achillea ptarmica	4
#	Aconitum napellus	5
#	Acorus calamus	33
#	Actaea spicata	5
#	Adonis annua	2
#	Adoxa moschatellina	7
#	Aegopodium podagraria	7
#	Aesculus carne

Now we want an additional reference file containing the positive controls (unfortunately this may be less successful as several are unexpectedly missing - so I've included a number of congenerics for those that I know are missing, in the hope of a hit)

In [11]:
!fetch_from_db.py -t PosList.txt -m rbcl -o positives_curated -@ callumjmacgregor@gmail.com


metaBEAT may be querying NCBI's Entrez databases to fetch/verify taxonomic ids. Entrez User requirements state that you need to identify yourself by providing an email address so that NCBI can contact you in case there is a problem.

You have specified: 'callumjmacgregor@gmail.com'


QUERYING GENBANK

check for synonyms for "rbcl" (this is relevant only for Genbank searches)

fetching accessions ..

#	Bryophyllum pinnatum	0
#	Crassula capitella	0
#	Crassula columnaris	0
#	Crassula deceptor	0
#	Crassula lactea	0
#	Crassula nudicaulis	1
#	Crassula ovata	0
#	Crassula perforata	1
#	Crassula socialis	0
#	Delosperma cooperi	0
#	Delosperma jansei	0
#	Delosperma sutherlandii	0
#	Delosperma tradescantioides	0
#	Dracaena aletriformis	2
#	Dracaena draco	3
#	Dracaena fragrans	1
#	Dracaena mannii	1
#	Dracaena marginata	0
#	Dracaena transvaalensis	1
#	Kalanchoe pinnata	7

total number of accessions fetched: 17


downloading 17 records .. processing 1000 accessions per batch

[Mon Oct 24 2016 11:40:

Now run another BLAST with the same parameters as before, but the curated reference database instead of the local Genbank copy

In [2]:
cd AgainstRefDB/

/home/working/AgainstRefDB


Create a new Querymap.txt, telling it to go up one directory first (in order to find the trimmed, tag-assigned sequences)

In [23]:
%%bash

for sample in $(cat ../Querymap_global.txt | cut -f 1)
do
    fasta=$(ls -1 ../$sample/$sample\_trimmed.fasta)
    echo -e "$sample\tfasta\t$fasta"
done > Querymap.txt

head Querymap.txt

SA177	fasta	../SA177/SA177_trimmed.fasta
SB140	fasta	../SB140/SB140_trimmed.fasta
T19	fasta	../T19/T19_trimmed.fasta
Lg45	fasta	../Lg45/Lg45_trimmed.fasta
Q69	fasta	../Q69/Q69_trimmed.fasta
S72	fasta	../S72/S72_trimmed.fasta
SC66	fasta	../SC66/SC66_trimmed.fasta
S31	fasta	../S31/S31_trimmed.fasta
SC74	fasta	../SC74/SC74_trimmed.fasta
SA68	fasta	../SA68/SA68_trimmed.fasta


In [44]:
%%bash

metaBEAT_global.py \
-Q Querymap.txt \
--cluster \
--clust_match 1 \
--clust_cov 5 \
--blast \
--REFlist REFmap.txt \
--min_ident 0.95 \
-n 5 \
-v \
-@ callumjmacgregor@gmail.com \
-o MothPollenReflistRestart > log_reflist

Again we want a .txt file to inspect

In [51]:
%%bash
blastn -query /home/working/AgainstRefDB/GLOBAL/global_centroids.fasta \
-db /home/working/AgainstRefDB/GLOBAL/BLAST_0.95/marker_blast_db \
-out GLOBAL/BLAST_0.95/global_blastn.out.txt

#Blast against curated reference list

I now want to curate the reference list in order to improve the quality of the assignments, and to get as many of them as possible to genus or species level. To do this, I inspect the outputs of the previous BLAST on a case-by-case basis, and determine where there are hits to species that are not plausibly correct. Remove these species from the reference list and repeat the previous steps in a new sub-folder, first copying a couple of unchanged files to the new location.

In [9]:
cp positives_curated.gb Curat1/

In [10]:
cp REFmap.txt Curat1/

In [2]:
cd Curat1/

/home/working/AgainstRefDB/Curat1


In [3]:
ls

[0m[01;32mAccessionBlacklist.txt[0m*                 [01;34mS96[0m/    [01;34mSB15[0m/
eyorks_flora_curat1.fasta               [01;34mS97[0m/    [01;34mSB150[0m/
eyorks_flora_curat1.gb                  [01;34mSA1[0m/    [01;34mSB153[0m/
eyorks_flora_curated.fasta              [01;34mSA104[0m/  [01;34mSB159[0m/
eyorks_flora_curated.gb                 [01;34mSA105[0m/  [01;34mSB163[0m/
[01;34mGLOBAL[0m/                                 [01;34mSA109[0m/  [01;34mSB166[0m/
[01;34mLg27[0m/                                   [01;34mSA111[0m/  [01;34mSB170[0m/
[01;34mLg37[0m/                                   [01;34mSA112[0m/  [01;34mSB171[0m/
[01;34mLg45[0m/                                   [01;34mSA113[0m/  [01;34mSB172[0m/
[01;34mLg53[0m/                                   [01;34mSA114[0m/  [01;34mSB180[0m/
[01;34mLg57[0m/                                   [01;34mSA115[0m/  [01;34mSB182[0m/
[01;34mLg65[0m/                

In [4]:
!head RefListCurat1.txt

Abies alba
Abies cephalonica
Abies cilicica
Abies concolor
Abies delavayi
Abies firma
Abies grandis
Abies homolepis
Abies lasiocarpa
Abies nordmanniana


In [5]:
!fetch_from_db.py -t RefListCurat1.txt -m rbcl -o eyorks_flora_curat1 -@ callumjmacgregor@gmail.com


metaBEAT may be querying NCBI's Entrez databases to fetch/verify taxonomic ids. Entrez User requirements state that you need to identify yourself by providing an email address so that NCBI can contact you in case there is a problem.

You have specified: 'callumjmacgregor@gmail.com'


QUERYING GENBANK

check for synonyms for "rbcl" (this is relevant only for Genbank searches)

fetching accessions ..

#	Abies alba	8
#	Abies cephalonica	2
#	Abies cilicica	1
#	Abies concolor	6
#	Abies delavayi	6
#	Abies firma	6
#	Abies grandis	4
#	Abies homolepis	4
#	Abies lasiocarpa	5
#	Abies nordmanniana	4
#	Abies pinsapo	5
#	Abies procera	1
#	Abies veitchii	4
#	Acaena novae-zelandiae	3
#	Acer campestre	18
#	Acer platanoides	6
#	Acer pseudoplatanus	16
#	Achillea alpina	0
#	Achillea distans	0
#	Achillea ligustica	0
#	Achillea millefolium	39
#	Achillea ptarmica	4
#	Aconitum napellus	5
#	Acorus calamus	33
#	Actaea spicata	5
#	Adonis annua	2
#	Adoxa moschatellina	7
#	Aegopodium podagraria	7
#	Aesculus carne

#Remove bad sequences

We have a few blacklisted accession numbers to remove from this file:

In [5]:
!head AccessionBlacklist.txt

KU569098.1
Z37446.1
KP643954.1
JN965994.1
JQ412340.1
HQ623997.1
HQ590280.1
KJ204422.1
JN892973.1
JN892974.1


And we also want to run Sativa, to check for any putative mis-labelled sequences in the reference db. This is a lengthy process so let's do it in a separate notebook. However, we can separately just remove the blacklisted accessions from the local copy of the reference db here.

In [8]:
from Bio import SeqIO

ids=[]
temp=[]
blacklist=[]
accessions = open('AccessionBlacklist.txt','r')
dropped=0
numberofrecords=0

for line in accessions:
    ids.append(line.strip())
    

recs_to_keep = {'rbcL': []}
recs_to_drop = {'rbcL': ids}

print "Blacklist recs to drop: %s" %recs_to_drop['rbcL']

records = SeqIO.parse('eyorks_flora_curat1.gb','genbank')

for r in records:       
    numberofrecords+=1
    if not r.id in recs_to_drop['rbcL']: 
        recs_to_keep['rbcL'].append(r.id)
        temp.append(r)
    
    else:
        print "exclude: %s" %r.id
        dropped+=1
        blacklist.append(r.id)
        
out = open('eyorks_flora_curated.gb', 'w')
SeqIO.write(temp, out, 'genbank')
out.close()

print "Read %s records" %numberofrecords
print "Dropped blacklist records: %s (of %s)" %(dropped,len(recs_to_drop['rbcL']))
print "Kept: %s" %len(temp)

del(temp)

Blacklist recs to drop: ['KU569098.1', 'Z37446.1', 'KP643954.1', 'JN965994.1', 'JQ412340.1', 'HQ623997.1', 'HQ590280.1', 'KJ204422.1', 'JN892973.1', 'JN892974.1', 'KF613060.1', 'Y08501.2', 'HG670306.1']
exclude: Y08501.2
exclude: KJ204422.1
exclude: HQ590280.1
exclude: KF613060.1
exclude: JN892973.1
exclude: JN892974.1
exclude: HG670306.1
Read 12636 records
Dropped blacklist records: 7 (of 13)
Kept: 12629


We can see that this has worked. Some records from the blacklist have not been found in the reference database because they have been removed during previous curation steps; those that have been identified now have been removed.

Update the Querymap.txt file to adjust for the additional level down that we're now in. The REFmap.txt file shouldn't need updating as we've placed reference databases with the same names in the new directory.

In [9]:
%%bash

for sample in $(cat ../Querymap.txt | cut -f 1)
do
    fasta=$(ls -1 ../../$sample/$sample\_trimmed.fasta)
    echo -e "$sample\tfasta\t$fasta"
done > Querymap.txt

head Querymap.txt

SA177	fasta	../../SA177/SA177_trimmed.fasta
SB140	fasta	../../SB140/SB140_trimmed.fasta
T19	fasta	../../T19/T19_trimmed.fasta
Lg45	fasta	../../Lg45/Lg45_trimmed.fasta
Q69	fasta	../../Q69/Q69_trimmed.fasta
S72	fasta	../../S72/S72_trimmed.fasta
SC66	fasta	../../SC66/SC66_trimmed.fasta
S31	fasta	../../S31/S31_trimmed.fasta
SC74	fasta	../../SC74/SC74_trimmed.fasta
SA68	fasta	../../SA68/SA68_trimmed.fasta


In [10]:
!head REFmap.txt

eyorks_flora_curated.gb	gb
positives_curated.gb	gb


In [7]:
ls

[0m[01;32mAccessionBlacklist.txt[0m*                 [01;34mS92[0m/    [01;34mSB147[0m/
[01;34mBreadth2%[0m/                              [01;34mS96[0m/    [01;34mSB15[0m/
[01;34mBreadth3%[0m/                              [01;34mS97[0m/    [01;34mSB150[0m/
[01;34mBreadth5%[0m/                              [01;34mSA1[0m/    [01;34mSB153[0m/
eyorks_flora_curat1.gb                  [01;34mSA104[0m/  [01;34mSB159[0m/
eyorks_flora_curated.gb                 [01;34mSA105[0m/  [01;34mSB163[0m/
[01;34mGLOBAL[0m/                                 [01;34mSA109[0m/  [01;34mSB166[0m/
[01;34mLg27[0m/                                   [01;34mSA111[0m/  [01;34mSB170[0m/
[01;34mLg37[0m/                                   [01;34mSA112[0m/  [01;34mSB171[0m/
[01;34mLg45[0m/                                   [01;34mSA113[0m/  [01;34mSB172[0m/
[01;34mLg53[0m/                                   [01;34mSA114[0m/  [01;34mSB180[0m/
[01;34mL

We can see that everything is now set up correctly to run a BLAST against our new, curated reference db

In [19]:
%%bash

metaBEAT_global.py \
-Q Querymap.txt \
--cluster \
--clust_match 1 \
--clust_cov 5 \
--blast \
--REFlist REFmap.txt \
--min_ident 0.95 \
-n 5 \
-v \
-@ callumjmacgregor@gmail.com \
-o MothPollenReflistCurat1 > log_curat1

In [20]:
%%bash
blastn -query /home/working/AgainstRefDB/Curat1/GLOBAL/global_centroids.fasta \
-db /home/working/AgainstRefDB/Curat1/GLOBAL/BLAST_0.95/marker_blast_db \
-out /home/working/AgainstRefDB/Curat1/GLOBAL/BLAST_0.95/global_blastn.out.txt

#Restrict width of BLAST

Christoph has now implemented some settings in metaBEAT that allow you to move away from some of the default BLAST settings, which didn't seem to be appropriate for my data. Let's now try using that implementation to see what happens!

First let's just try changing the breadth of the BLAST search (which by default was 10% of the bitscore of the top hit) to lower values: 5%, 3%, 2%.

#5%

In [3]:
cd /home/working/AgainstRefDB/Curat1/Breadth5%/

/home/working/AgainstRefDB/Curat1/Breadth5%


The Querymap and the REFmap can be adjusted from the previous BLAST.

In [9]:
%%bash

for sample in $(cat ../Querymap.txt | cut -f 1)
do
    fasta=$(ls -1 ../../../$sample/$sample\_trimmed.fasta)
    echo -e "$sample\tfasta\t$fasta"
done > Querymap.txt

head Querymap.txt

SA177	fasta	../../../SA177/SA177_trimmed.fasta
SB140	fasta	../../../SB140/SB140_trimmed.fasta
T19	fasta	../../../T19/T19_trimmed.fasta
Lg45	fasta	../../../Lg45/Lg45_trimmed.fasta
Q69	fasta	../../../Q69/Q69_trimmed.fasta
S72	fasta	../../../S72/S72_trimmed.fasta
SC66	fasta	../../../SC66/SC66_trimmed.fasta
S31	fasta	../../../S31/S31_trimmed.fasta
SC74	fasta	../../../SC74/SC74_trimmed.fasta
SA68	fasta	../../../SA68/SA68_trimmed.fasta


In [17]:
!cp ../REFmap.txt .

!head REFmap.txt

eyorks_flora_curated.gb	gb
positives_curated.gb	gb


In [18]:
cp ../eyorks_flora_curated.gb .

In [19]:
cp ../positives_curated.gb .

In [20]:
ls

eyorks_flora_curated.gb  positives_curated.gb  REFmap.txt
log_breadth5%            Querymap.txt


In [21]:
%%bash

metaBEAT_global.py \
-Q Querymap.txt \
--cluster \
--clust_match 1 \
--clust_cov 5 \
--blast \
--bitscore_skim_LCA 0.05 \
--REFlist REFmap.txt \
--min_ident 0.95 \
-n 5 \
-v \
-@ callumjmacgregor@gmail.com \
-o MothPollenReflistCurat1Breadth5 > log_breadth5%

In [22]:
%%bash
blastn -query /home/working/AgainstRefDB/Curat1/Breadth5%/GLOBAL/global_centroids.fasta \
-db /home/working/AgainstRefDB/Curat1/Breadth5%/GLOBAL/BLAST_0.95/marker_blast_db \
-out /home/working/AgainstRefDB/Curat1/Breadth5%/GLOBAL/BLAST_0.95/global_blastn.out.txt

#3%

In [42]:
cd /home/working/AgainstRefDB/Curat1/Breadth3%/

/home/working/AgainstRefDB/Curat1/Breadth3%


In [43]:
%%bash

for sample in $(cat ../Querymap.txt | cut -f 1)
do
    fasta=$(ls -1 ../../../$sample/$sample\_trimmed.fasta)
    echo -e "$sample\tfasta\t$fasta"
done > Querymap.txt

head Querymap.txt

SA177	fasta	../../../SA177/SA177_trimmed.fasta
SB140	fasta	../../../SB140/SB140_trimmed.fasta
T19	fasta	../../../T19/T19_trimmed.fasta
Lg45	fasta	../../../Lg45/Lg45_trimmed.fasta
Q69	fasta	../../../Q69/Q69_trimmed.fasta
S72	fasta	../../../S72/S72_trimmed.fasta
SC66	fasta	../../../SC66/SC66_trimmed.fasta
S31	fasta	../../../S31/S31_trimmed.fasta
SC74	fasta	../../../SC74/SC74_trimmed.fasta
SA68	fasta	../../../SA68/SA68_trimmed.fasta


In [44]:
!cp ../REFmap.txt .

!head REFmap.txt

eyorks_flora_curated.gb	gb
positives_curated.gb	gb


In [45]:
cp ../eyorks_flora_curated.gb .

In [46]:
cp ../positives_curated.gb .

In [47]:
ls

eyorks_flora_curated.gb                         [0m[01;34mS33[0m/    [01;34mSA46[0m/   [01;34mSB34[0m/
[01;34mGLOBAL[0m/                                         [01;34mS35[0m/    [01;34mSA47[0m/   [01;34mSB37[0m/
[01;34mLg27[0m/                                           [01;34mS39[0m/    [01;34mSA48[0m/   [01;34mSB4[0m/
[01;34mLg37[0m/                                           [01;34mS43[0m/    [01;34mSA49[0m/   [01;34mSB45[0m/
[01;34mLg45[0m/                                           [01;34mS50[0m/    [01;34mSA5[0m/    [01;34mSB48[0m/
[01;34mLg53[0m/                                           [01;34mS56[0m/    [01;34mSA50[0m/   [01;34mSB51[0m/
[01;34mLg57[0m/                                           [01;34mS59[0m/    [01;34mSA51[0m/   [01;34mSB53[0m/
[01;34mLg65[0m/                                           [01;34mS62[0m/    [01;34mSA53[0m/   [01;34mSB59[0m/
[01;34mLg9[0m/                                         

In [49]:
%%bash

metaBEAT_global.py \
-Q Querymap.txt \
--cluster \
--clust_match 1 \
--clust_cov 5 \
--blast \
--bitscore_skim_LCA 0.03 \
--REFlist REFmap.txt \
--min_ident 0.95 \
-n 5 \
-v \
-@ callumjmacgregor@gmail.com \
-o MothPollenReflistCurat1Breadth3 > log_breadth3%

In [50]:
%%bash
blastn -query /home/working/AgainstRefDB/Curat1/Breadth3%/GLOBAL/global_centroids.fasta \
-db /home/working/AgainstRefDB/Curat1/Breadth3%/GLOBAL/BLAST_0.95/marker_blast_db \
-out /home/working/AgainstRefDB/Curat1/Breadth3%/GLOBAL/BLAST_0.95/global_blastn.out.txt

#2%

In [51]:
cd /home/working/AgainstRefDB/Curat1/Breadth2%/

/home/working/AgainstRefDB/Curat1/Breadth2%


In [52]:
%%bash

for sample in $(cat ../Querymap.txt | cut -f 1)
do
    fasta=$(ls -1 ../../../$sample/$sample\_trimmed.fasta)
    echo -e "$sample\tfasta\t$fasta"
done > Querymap.txt

head Querymap.txt

SA177	fasta	../../../SA177/SA177_trimmed.fasta
SB140	fasta	../../../SB140/SB140_trimmed.fasta
T19	fasta	../../../T19/T19_trimmed.fasta
Lg45	fasta	../../../Lg45/Lg45_trimmed.fasta
Q69	fasta	../../../Q69/Q69_trimmed.fasta
S72	fasta	../../../S72/S72_trimmed.fasta
SC66	fasta	../../../SC66/SC66_trimmed.fasta
S31	fasta	../../../S31/S31_trimmed.fasta
SC74	fasta	../../../SC74/SC74_trimmed.fasta
SA68	fasta	../../../SA68/SA68_trimmed.fasta


In [53]:
!cp ../REFmap.txt .

!head REFmap.txt

eyorks_flora_curated.gb	gb
positives_curated.gb	gb


In [54]:
cp ../eyorks_flora_curated.gb .

In [55]:
cp ../positives_curated.gb .

In [56]:
ls

eyorks_flora_curated.gb                         [0m[01;34mS33[0m/    [01;34mSA46[0m/   [01;34mSB34[0m/
[01;34mGLOBAL[0m/                                         [01;34mS35[0m/    [01;34mSA47[0m/   [01;34mSB37[0m/
[01;34mLg27[0m/                                           [01;34mS39[0m/    [01;34mSA48[0m/   [01;34mSB4[0m/
[01;34mLg37[0m/                                           [01;34mS43[0m/    [01;34mSA49[0m/   [01;34mSB45[0m/
[01;34mLg45[0m/                                           [01;34mS50[0m/    [01;34mSA5[0m/    [01;34mSB48[0m/
[01;34mLg53[0m/                                           [01;34mS56[0m/    [01;34mSA50[0m/   [01;34mSB51[0m/
[01;34mLg57[0m/                                           [01;34mS59[0m/    [01;34mSA51[0m/   [01;34mSB53[0m/
[01;34mLg65[0m/                                           [01;34mS62[0m/    [01;34mSA53[0m/   [01;34mSB59[0m/
[01;34mLg9[0m/                                         

In [57]:
%%bash

metaBEAT_global.py \
-Q Querymap.txt \
--cluster \
--clust_match 1 \
--clust_cov 5 \
--blast \
--bitscore_skim_LCA 0.02 \
--REFlist REFmap.txt \
--min_ident 0.95 \
-n 5 \
-v \
-@ callumjmacgregor@gmail.com \
-o MothPollenReflistCurat1Breadth2 > log_breadth2%

In [58]:
%%bash
blastn -query /home/working/AgainstRefDB/Curat1/Breadth2%/GLOBAL/global_centroids.fasta \
-db /home/working/AgainstRefDB/Curat1/Breadth2%/GLOBAL/BLAST_0.95/marker_blast_db \
-out /home/working/AgainstRefDB/Curat1/Breadth2%/GLOBAL/BLAST_0.95/global_blastn.out.txt

At this point I have carefully examined the BLAST assignments on a case-by-case basis. I have made the following decisions:

- of the three search breadths, 5% sometimes makes assignments to family or order level that on inspection can clearly be made to genus level (effectively Type II error). 2%, however, sometimes does not consider matches that are quite close to the top hit and therefore seems to risk Type I errors. Therefore I will proceed with 3% search breadth in the final BLAST, which in most cases strikes a suitable balance between the two.

- in some cases, assignments have been made to the species level on account of the "if 100% full-length match, do not consider lower bit-scores" that appear to be possible Type I errors. Therefore I will turn this rule off in the final BLAST.

- in a number of cases, more accurate taxonomic assignments could be made with the removal of certain species from the reference database which are not plausibly present in the samples, either as a wind-carried pollen species or as a flower visited by moths in East Yorkshire during late summer. Therefore I will re-curate the reference database one final time before proceeding with the final BLAST.

- lastly, I have got top-hits to things that are not a very close match. Given how invariable this region appears to be, there seems to be a strong chance that these are not correct identifications; therefore, I will experiment with changing the --min_ident flag to 0.98 or even 0.99 before doing the final BLAST.

#Try increasing the minimum identities for the top hit

I'll try this in conjunction with my planned final conditions: 3% BLAST with no 100% match rule. For comparison I will also run a no-rule BLAST with the original 95% setting, so I can see exactly what my options are.

#95%

In [96]:
cd /home/working/AgainstRefDB/Curat1/Breadth3%/NoRule/95ident/

/home/working/AgainstRefDB/Curat1/Breadth3%/NoRule/95ident


In [97]:
%%bash

for sample in $(cat ../../Querymap.txt | cut -f 1)
do
    fasta=$(ls -1 ../../../../../$sample/$sample\_trimmed.fasta)
    echo -e "$sample\tfasta\t$fasta"
done > Querymap.txt

head Querymap.txt

SA177	fasta	../../../../../SA177/SA177_trimmed.fasta
SB140	fasta	../../../../../SB140/SB140_trimmed.fasta
T19	fasta	../../../../../T19/T19_trimmed.fasta
Lg45	fasta	../../../../../Lg45/Lg45_trimmed.fasta
Q69	fasta	../../../../../Q69/Q69_trimmed.fasta
S72	fasta	../../../../../S72/S72_trimmed.fasta
SC66	fasta	../../../../../SC66/SC66_trimmed.fasta
S31	fasta	../../../../../S31/S31_trimmed.fasta
SC74	fasta	../../../../../SC74/SC74_trimmed.fasta
SA68	fasta	../../../../../SA68/SA68_trimmed.fasta


In [98]:
!cp ../../REFmap.txt .

!head REFmap.txt

eyorks_flora_curated.gb	gb
positives_curated.gb	gb


In [99]:
cp ../../eyorks_flora_curated.gb .

In [100]:
cp ../../positives_curated.gb .

In [101]:
ls

eyorks_flora_curated.gb  positives_curated.gb  Querymap.txt  REFmap.txt


In [102]:
%%bash

metaBEAT_global.py \
-Q Querymap.txt \
--cluster \
--clust_match 1 \
--clust_cov 5 \
--blast \
--bitscore_skim_LCA 0.03 \
--bitscore_skim_adjust_off \
--REFlist REFmap.txt \
--min_ident 0.95 \
-n 5 \
-v \
-@ callumjmacgregor@gmail.com \
-o MothPollenReflistCurat1Breadth3-95Ident > log_95ident

In [103]:
%%bash
blastn -query /home/working/AgainstRefDB/Curat1/Breadth3%/NoRule/95ident/GLOBAL/global_centroids.fasta \
-db /home/working/AgainstRefDB/Curat1/Breadth3%/NoRule/95ident/GLOBAL/BLAST_0.95/marker_blast_db \
-out /home/working/AgainstRefDB/Curat1/Breadth3%/NoRule/95ident/GLOBAL/BLAST_0.95/global_blastn.out.txt

#98%

In [76]:
cd /home/working/AgainstRefDB/Curat1/Breadth3%/NoRule/98ident/

/home/working/AgainstRefDB/Curat1/Breadth3%/NoRule/98ident


In [77]:
%%bash

for sample in $(cat ../../Querymap.txt | cut -f 1)
do
    fasta=$(ls -1 ../../../../../$sample/$sample\_trimmed.fasta)
    echo -e "$sample\tfasta\t$fasta"
done > Querymap.txt

head Querymap.txt

SA177	fasta	../../../../../SA177/SA177_trimmed.fasta
SB140	fasta	../../../../../SB140/SB140_trimmed.fasta
T19	fasta	../../../../../T19/T19_trimmed.fasta
Lg45	fasta	../../../../../Lg45/Lg45_trimmed.fasta
Q69	fasta	../../../../../Q69/Q69_trimmed.fasta
S72	fasta	../../../../../S72/S72_trimmed.fasta
SC66	fasta	../../../../../SC66/SC66_trimmed.fasta
S31	fasta	../../../../../S31/S31_trimmed.fasta
SC74	fasta	../../../../../SC74/SC74_trimmed.fasta
SA68	fasta	../../../../../SA68/SA68_trimmed.fasta


In [78]:
!cp ../../REFmap.txt .

!head REFmap.txt

eyorks_flora_curated.gb	gb
positives_curated.gb	gb


In [79]:
cp ../../eyorks_flora_curated.gb .

In [80]:
cp ../../positives_curated.gb .

In [81]:
ls

eyorks_flora_curated.gb  positives_curated.gb  Querymap.txt  REFmap.txt


In [82]:
%%bash

metaBEAT_global.py \
-Q Querymap.txt \
--cluster \
--clust_match 1 \
--clust_cov 5 \
--blast \
--bitscore_skim_LCA 0.03 \
--bitscore_skim_adjust_off \
--REFlist REFmap.txt \
--min_ident 0.98 \
-n 5 \
-v \
-@ callumjmacgregor@gmail.com \
-o MothPollenReflistCurat1Breadth3-98Ident > log_98ident

In [83]:
%%bash
blastn -query /home/working/AgainstRefDB/Curat1/Breadth3%/NoRule/98ident/GLOBAL/global_centroids.fasta \
-db /home/working/AgainstRefDB/Curat1/Breadth3%/NoRule/98ident/GLOBAL/BLAST_0.98/marker_blast_db \
-out /home/working/AgainstRefDB/Curat1/Breadth3%/NoRule/98ident/GLOBAL/BLAST_0.98/global_blastn.out.txt

#99%

In [84]:
cd /home/working/AgainstRefDB/Curat1/Breadth3%/NoRule/99ident/

/home/working/AgainstRefDB/Curat1/Breadth3%/NoRule/99ident


In [85]:
%%bash

for sample in $(cat ../../Querymap.txt | cut -f 1)
do
    fasta=$(ls -1 ../../../../../$sample/$sample\_trimmed.fasta)
    echo -e "$sample\tfasta\t$fasta"
done > Querymap.txt

head Querymap.txt

SA177	fasta	../../../../../SA177/SA177_trimmed.fasta
SB140	fasta	../../../../../SB140/SB140_trimmed.fasta
T19	fasta	../../../../../T19/T19_trimmed.fasta
Lg45	fasta	../../../../../Lg45/Lg45_trimmed.fasta
Q69	fasta	../../../../../Q69/Q69_trimmed.fasta
S72	fasta	../../../../../S72/S72_trimmed.fasta
SC66	fasta	../../../../../SC66/SC66_trimmed.fasta
S31	fasta	../../../../../S31/S31_trimmed.fasta
SC74	fasta	../../../../../SC74/SC74_trimmed.fasta
SA68	fasta	../../../../../SA68/SA68_trimmed.fasta


In [86]:
!cp ../../REFmap.txt .

!head REFmap.txt

eyorks_flora_curated.gb	gb
positives_curated.gb	gb


In [87]:
cp ../../eyorks_flora_curated.gb .

In [88]:
cp ../../positives_curated.gb .

In [89]:
ls

eyorks_flora_curated.gb  positives_curated.gb  Querymap.txt  REFmap.txt


In [94]:
%%bash

metaBEAT_global.py \
-Q Querymap.txt \
--cluster \
--clust_match 1 \
--clust_cov 5 \
--blast \
--bitscore_skim_LCA 0.03 \
--bitscore_skim_adjust_off \
--REFlist REFmap.txt \
--min_ident 0.99 \
-n 5 \
-v \
-@ callumjmacgregor@gmail.com \
-o MothPollenReflistCurat1Breadth3-99Ident > log_99ident

In [95]:
%%bash
blastn -query /home/working/AgainstRefDB/Curat1/Breadth3%/NoRule/99ident/GLOBAL/global_centroids.fasta \
-db /home/working/AgainstRefDB/Curat1/Breadth3%/NoRule/99ident/GLOBAL/BLAST_0.99/marker_blast_db \
-out /home/working/AgainstRefDB/Curat1/Breadth3%/NoRule/99ident/GLOBAL/BLAST_0.99/global_blastn.out.txt

There is not very much difference at all between the settings here, so let's just proceed with 95% identities.

#Final BLAST

First I will move the things I need into a new subdirectory. This time I am going to include the list of positive controls in my main species list, so that I can work with a single reference database file.

In [106]:
cd /home/working/AgainstRefDB/Curat1/FINAL/

/home/working/AgainstRefDB/Curat1/FINAL


In [107]:
cp ../AccessionBlacklist.txt .

In [108]:
ls

AccessionBlacklist.txt   Querymap.txt            REFmap.txt
eyorks_flora_curated.gb  [0m[01;32mRefListCuratFinal.txt[0m*  [01;34mSativa[0m/


In [109]:
!head RefListCuratFinal.txt

Abies alba
Abies cephalonica
Abies cilicica
Abies concolor
Abies delavayi
Abies firma
Abies grandis
Abies homolepis
Abies lasiocarpa
Abies nordmanniana


In [69]:
!fetch_from_db.py -t RefListCuratFinal.txt -m rbcl -o eyorks_flora_curated -@ callumjmacgregor@gmail.com


metaBEAT may be querying NCBI's Entrez databases to fetch/verify taxonomic ids. Entrez User requirements state that you need to identify yourself by providing an email address so that NCBI can contact you in case there is a problem.

You have specified: 'callumjmacgregor@gmail.com'


QUERYING GENBANK

check for synonyms for "rbcl" (this is relevant only for Genbank searches)

fetching accessions ..

#	Abies alba	8
#	Abies cephalonica	2
#	Abies cilicica	1
#	Abies concolor	6
#	Abies delavayi	6
#	Abies firma	6
#	Abies grandis	4
#	Abies homolepis	4
#	Abies lasiocarpa	5
#	Abies nordmanniana	4
#	Abies pinsapo	5
#	Abies procera	1
#	Abies veitchii	4
#	Acaena novae-zelandiae	3
#	Acer campestre	18
#	Acer platanoides	6
#	Acer pseudoplatanus	16
#	Achillea alpina	0
#	Achillea distans	0
#	Achillea ligustica	0
#	Achillea millefolium	39
#	Achillea ptarmica	4
#	Aconitum napellus	5
#	Acorus calamus	33
#	Actaea spicata	5
#	Adonis annua	2
#	Adoxa moschatellina	7
#	Aegopodium podagraria	7
#	Aesculus carne

At this point I will put this reference database back through SATIVA (which will also remove the records in the AccessionBlacklist) and place the final database in this directory for blasting.

Just to keep things separate, I'm going to copy the SATIVA notebook and files into a new subdirectory within this one and run it separately from the previous version of SATIVA.

The above is now complete and ready for BLASTing. We need an updated Querymap and REFmap:

In [110]:
%%bash

for sample in $(cat ../Querymap.txt | cut -f 1)
do
    fasta=$(ls -1 ../../../$sample/$sample\_trimmed.fasta)
    echo -e "$sample\tfasta\t$fasta"
done > Querymap.txt

head Querymap.txt

SA177	fasta	../../../SA177/SA177_trimmed.fasta
SB140	fasta	../../../SB140/SB140_trimmed.fasta
T19	fasta	../../../T19/T19_trimmed.fasta
Lg45	fasta	../../../Lg45/Lg45_trimmed.fasta
Q69	fasta	../../../Q69/Q69_trimmed.fasta
S72	fasta	../../../S72/S72_trimmed.fasta
SC66	fasta	../../../SC66/SC66_trimmed.fasta
S31	fasta	../../../S31/S31_trimmed.fasta
SC74	fasta	../../../SC74/SC74_trimmed.fasta
SA68	fasta	../../../SA68/SA68_trimmed.fasta


In [111]:
!echo "eyorks_flora_final.gb\tgb" > REFmap.txt

And the new curated reference db, which has been run through SATIVA:

In [117]:
cp Sativa/eyorks_flora_final.gb .

In [118]:
ls

AccessionBlacklist.txt          [0m[01;34mR9[0m/                     [01;34mSA147[0m/   [01;34mSB104[0m/  [01;34mSB82[0m/
eyorks_flora_curated.gb         [01;34mR90[0m/                    [01;34mSA15[0m/    [01;34mSB106[0m/  [01;34mSB83[0m/
eyorks_flora_final.gb           [01;34mR95[0m/                    [01;34mSA153[0m/   [01;34mSB111[0m/  [01;34mSB85[0m/
[01;34mGLOBAL[0m/                         [01;34mR98[0m/                    [01;34mSA155[0m/   [01;34mSB116[0m/  [01;34mSB86[0m/
[01;34mLg27[0m/                           [01;34mR99[0m/                    [01;34mSA156[0m/   [01;34mSB119[0m/  [01;34mSB87[0m/
[01;34mLg37[0m/                           [01;32mRefListCuratFinal.txt[0m*  [01;34mSA157[0m/   [01;34mSB120[0m/  [01;34mSB91[0m/
[01;34mLg45[0m/                           REFmap.txt              [01;34mSA160[0m/   [01;34mSB127[0m/  [01;34mSB93[0m/
[01;34mLg53[0m/                           [01;34mS100[0m/   

In [119]:
%%bash

metaBEAT_global.py \
-Q Querymap.txt \
--cluster \
--clust_match 1 \
--clust_cov 5 \
--blast \
--bitscore_skim_LCA 0.03 \
--bitscore_skim_adjust_off \
--REFlist REFmap.txt \
--min_ident 0.95 \
-n 5 \
-v \
-@ callumjmacgregor@gmail.com \
-o MothPollenFinal > log_final

In [120]:
%%bash
blastn -query /home/working/AgainstRefDB/Curat1/FINAL/GLOBAL/global_centroids.fasta \
-db /home/working/AgainstRefDB/Curat1/FINAL/GLOBAL/BLAST_0.95/marker_blast_db \
-out /home/working/AgainstRefDB/Curat1/FINAL/GLOBAL/BLAST_0.95/global_blastn.out.txt

#Kraken

Now this is all done, we want to try a different method, called Kraken, which is k-mer based. We will need our reference databases from here in the new folder.

In [25]:
cp positives_curated.gb /home/working/Kraken/

In [26]:
cp eyorks_flora_curated.gb /home/working/Kraken/

In [27]:
cp Querymap.txt /home/working/Kraken/

In [39]:
cd /home/working/Kraken/

/home/working/Kraken


In [40]:
ls

eyorks_flora_curated.gb            log                             Querymap.txt
eyorks_flora_curated.kraken.fasta  positives_curated.gb            REFmap.txt
[0m[01;34mGLOBAL[0m/                            positives_curated.kraken.fasta


Because of an error caused by the very large file size, we have manually constructed the Kraken database outside of the pipeline, using these steps:

In [None]:
from Bio import SeqIO

db_in_gb = 'eyorks_flora_curated.gb'
db_for_kraken = 'eyorks_flora_curated.kraken.fasta'

Seqs = SeqIO.parse(open(db_in_gb,'r'), 'genbank')
Seqs_new = []

for r in Seqs:
#    print r.id
    source=r.features[0]
    for t in source.qualifiers['db_xref']:
        if 'taxon' in t:
            taxid = t.split(":")[1]
#            print taxid
#    print "%s|kraken:taxid|%s" %(r.id,taxid)
    r.id = "%s|kraken:taxid|%s" %(r.id,taxid)
    r.description = r.id
    Seqs_new.append(r)
    
out = open(db_for_kraken,'w')
SeqIO.write(Seqs_new, out, 'fasta')
out.close()

In [None]:
!kraken-build --download-taxonomy --db ./KRAKEN_DB/

In [None]:
!kraken-build --add-to-library eyorks_flora_curated.kraken.fasta --db ./KRAKEN_DB/

In [None]:
!kraken-build --build --threads 5 --db ./KRAKEN_DB/ --jellyfish-hash-size 2900M

And now we can run Kraken, using the same metaBEAT syntax

In [41]:
%%bash

metaBEAT_global.py \
-B ../AgainstRefDB/Curat1/GLOBAL/MothPollenReflistCurat1-OTU-denovo.biom \
--kraken \
--kraken_db GLOBAL/KRAKEN/KRAKEN_DB \
-@ callumjmacgregor@gmail.com \
-n 5 -o Kraken > log

Visual inspection of the Kraken output has revealed some very weird assignments, which are much less plausible than the equivalent BLAST assignments (e.g. species-level assignment to Silene gallica, as opposed to genus-level assignment to Silene due to confusion between S. latifolia and S. dioica. Therefore I will not pursue Kraken any further at this stage.

#Phylogenetic placement (pplacer)

Now we have completed all the steps in the SATIVA process, we also have a cleaned, trimmed reference alignment that can be used to attempt assignments by phylogenetic placement!

We first need to build a reference tree, using the cleaned reference database and the final alignment.

Navigate to a new directory, copy over the relevant files. Then read in the Genbank file and create taxids.txt and seq_info.csv files.

In [121]:
cd /home/working/pplacer

/home/working/pplacer


In [122]:
cp /home/working/AgainstRefDB/Curat1/FINAL/Sativa/postSATIVA_cleaned.gb .

In [123]:
cp /home/working/AgainstRefDB/Curat1/FINAL/rbcL_eyorks_clean_alignment.phylip .

In [127]:
cp /home/working/AgainstRefDB/Curat1/FINAL/rbcL_eyorks_clean_alignment.fasta .

In [124]:
ls

postSATIVA_cleaned.gb  rbcL_eyorks_clean_alignment.phylip


In [125]:
from Bio import SeqIO

gb = 'postSATIVA_cleaned.gb'
seqinfo_file = 'seq_info.csv'
taxid_file = 'taxids.txt'

seq_info = ['"seqname","accession","tax_id","species_name","is.type"']
taxids = []
Seqs = SeqIO.parse(open(gb,'r'), 'genbank')

for r in Seqs:
    sp = r.features[0].qualifiers['organism'][0]
    for db_xref in r.features[0].qualifiers['db_xref']:
        if 'taxon' in db_xref:
            taxid = db_xref.split(":")[1]
    r_seqinfo = '"%s","%s","%s","%s","0"' %(r.id,r.id, taxid, sp)
    seq_info.extend([r_seqinfo])
    taxids.append(taxid)
    
seq_info_out = open(seqinfo_file, 'w')
for l in seq_info:
    seq_info_out.write(l+"\n")
seq_info_out.close()

taxids_out = open(taxid_file, 'w')
for t in list(set(taxids)):
    taxids_out.write(t+"\n")
taxids_out.close()


Build a HMM for the reference alignment, using the program hmmbuild

In [126]:
!hmmbuild -h

# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.1b1 (May 2013); http://hmmer.org/
# Copyright (C) 2013 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmbuild [-options] <hmmfile_out> <msafile>

Basic options:
  -h     : show brief help on version and usage
  -n <s> : name the HMM <s>
  -o <f> : direct summary output to file <f>, not stdout
  -O <f> : resave annotated, possibly modified MSA to file <f>

Options for selecting alphabet rather than guessing it:
  --amino : input alignment is protein sequence data
  --dna   : input alignment is DNA sequence data
  --rna   : input alignment is RNA sequence data

Alternative model construction strategies:
  --fast           : assign cols w/ >= symfrac residues as consensus  [default]
  --hand           : manual construction (requires reference annotation)

In [128]:
!hmmbuild rbcL_ref.hmm rbcL_eyorks_clean_alignment.fasta

# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.1b1 (May 2013); http://hmmer.org/
# Copyright (C) 2013 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# input alignment file:             rbcL_eyorks_clean_alignment.fasta
# output HMM file:                  rbcL_ref.hmm
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# idx name                  nseq  alen  mlen     W eff_nseq re/pos description
#---- -------------------- ----- ----- ----- ----- -------- ------ -----------
1     rbcL_eyorks_clean_alignment  2309   174   174   293     1.87  0.450 

# CPU time: 0.12u 0.00s 00:00:00.12 Elapsed: 00:00:00.14


Create a reference package for pplacer:

First we will need to compile some information about the taxonomy of the reference sequences.

We start by producing a taxonomy table for the set of taxa that is used as reference. The file taxids.txt is a simple text file that contains the taxonomic ids for all taxa.

In [129]:
!head taxids.txt

271541
500232
204222
61450
59351
202448
57918
241213
313220
267568


We will use a tool from the taxtastic package to fetch the taxonomic information for these taxa from NCBI taxonomy (in the 'taxonomy dump').

In [130]:
!taxit taxtable -d /usr/bin/taxonomy.db -t taxids.txt -o taxa.csv

The resulting taxa.csv file contains just the taxonomic information relevant for the reference sequences to be used for the phylogenetic placement.

In [131]:
!head taxa.csv

"tax_id","parent_id","rank","tax_name","root","below_root","superkingdom","kingdom","phylum","below_phylum","below_below_phylum","below_below_below_phylum","below_below_below_below_phylum","below_below_below_below_below_phylum","below_below_below_below_below_below_phylum","below_below_below_below_below_below_below_phylum","below_below_below_below_below_below_below_below_phylum","below_below_below_below_below_below_below_below_below_phylum","below_below_below_below_below_below_below_below_below_below_phylum","class","below_class","subclass","below_subclass","order","below_order","suborder","family","below_family","subfamily","below_subfamily","below_below_subfamily","tribe","below_tribe","below_below_tribe","subtribe","genus","below_genus","subgenus","species_group","species","subspecies","varietas","forma"
"1","1","root","root","1","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",""
"131567","1","below_root","cellular or

We also need to provide information linking the taxids to the sequence ids. This file is called the 'seqinfo' file by taxtastic. We provide this as seq_info.csv.

In [132]:
!head seq_info.csv

"seqname","accession","tax_id","species_name","is.type"
"JN893788.1","JN893788.1","12990","Carpinus betulus","0"
"KJ841577.1","KJ841577.1","45834","Solanum dulcamara","0"
"HM559497.1","HM559497.1","13821","Pteris vittata","0"
"AY034021.1","AY034021.1","74693","Veronica anagallis-aquatica","0"
"KF724370.1","KF724370.1","28472","Epipactis helleborine","0"
"HE963482.1","HE963482.1","76027","Fallopia baldschuanica","0"
"KT695421.1","KT695421.1","65561","Hypericum perforatum","0"
"JN005477.1","JN005477.1","164261","Goodyera repens","0"
"JN893409.1","JN893409.1","35899","Galium odoratum","0"


The reference package also needs to contain a reference tree, the log from the tree inference, the underlying alignment in fasta format as well as the HMM profile that has just been produced to align query sequences to. The tree has been built in the Sativa_final notebook, and we need to generate a raxml info file containing the model parameters:

In [133]:
cp /home/working/AgainstRefDB/Curat1/FINAL/Sativa/RAxML_bestTree.rbcL@mafftLinsi-SATIVA@gappyout0 .

In [134]:
%%bash

raxmlHPC-PTHREADS -f e -T 5 \
-t RAxML_bestTree.rbcL@mafftLinsi-SATIVA@gappyout0 \
-m GTRGAMMA \
-s rbcL_eyorks_clean_alignment.phylip \
-n test











































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































Now build the reference package

In [136]:
%%bash
taxit create \
-l rbcL \
-P rbcL.refpkg \
--aln-fasta rbcL_eyorks_clean_alignment.fasta \
--tree-stats RAxML_info.test \
--tree-file RAxML_bestTree.rbcL@mafftLinsi-SATIVA@gappyout0 \
--profile rbcL_ref.hmm \
--seq-info seq_info.csv \
--taxonomy taxa.csv

rerooting at below_below_below_below_phylum
root found at node 2055


Finally, pplacer needs to run an initial BLAST search so we need to copy the relevant files over from the final BLAST.

In [144]:
cp ../AgainstRefDB/Curat1/FINAL/Sativa/eyorks_flora_final.gb .

In [145]:
!echo "eyorks_flora_final.gb\tgb" > REFmap.txt

We are now ready to run assignment with phylogenetic placement!

In [139]:
ls

postSATIVA_cleaned.gb
RAxML_bestTree.rbcL@mafftLinsi-SATIVA@gappyout0
RAxML_info.test
RAxML_log.test
RAxML_result.test
rbcL_eyorks_clean_alignment.fasta
rbcL_eyorks_clean_alignment.phylip
rbcL_eyorks_clean_alignment.phylip.reduced
rbcL_ref.hmm
[0m[01;34mrbcL.refpkg[0m/
seq_info.csv
taxa.csv
taxids.txt


In [146]:
%%bash

metaBEAT_global.py \
-B ../AgainstRefDB/Curat1/GLOBAL/MothPollenReflistCurat1-OTU-denovo.biom \
-R REFmap.txt \
--blast --min_ident 0.95 \
--pplace \
--refpkg rbcL.refpkg/ \
-@ callumjmacgregor@gmail.com \
-n 5 -o pplacer > log

Traceback (most recent call last):
  File "/usr/bin/metaBEAT_global.py", line 2628, in <module>
    taxonomy_count = assign_taxonomy_pplacer(pplacer_out=pplacer_out_dict, tax_dict=tax_dict, v=args.verbose)
  File "/usr/bin/metaBEAT_global.py", line 1516, in assign_taxonomy_pplacer
    index = tax_dict["tax_id"].index(tax_dict[pplacer_out['hit'][query][0]][1])
KeyError: u'126436'


As it transpires, metaBEAT cannot currently handle plants in pplacer - this is because many plant species are not assigned at some of the higher taxonomic levels and it doesn't know how to handle this.

#Re-run final, corrected BLAST

At this point I have discovered an error in the Querymap for the original trimming and clustering which led to positives and negatives being overwritten with each successive plate and retained for plate 4 only, so I am going to just rerun this stage and the final BLAST to correct this.

In [1]:
cd /home/working/FinalCorrected/

/home/working/FinalCorrected


I have updated the Querymap_global file to mark which plate each control comes from so that their names are different for each instance.

In [11]:
!grep "POS" Querymap_global_corrected.txt

POS2_1	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	GCTCAGGA	CTGCGCAT	23	28
POS4_1	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	GCAGCGTA	CTAAGCCT	20	29
POS1_1	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	CATGCCTA	ACTGCATA	25	27
POS3_1	fastq	raw_data/Moth1_S1_L001_R1_001.fastq.gz	raw_data/Moth1_S1_L001_R2_001.fastq.gz	GAGCGCTA	AGAGTAGA	25	25
POS2_2	fastq	raw_data/Moth2_S2_L001_R1_001.fastq.gz	raw_data/Moth2_S2_L001_R2_001.fastq.gz	GCTCAGGA	CTGCGCAT	23	28
POS4_2	fastq	raw_data/Moth2_S2_L001_R1_001.fastq.gz	raw_data/Moth2_S2_L001_R2_001.fastq.gz	GCAGCGTA	CTAAGCCT	20	29
POS1_2	fastq	raw_data/Moth2_S2_L001_R1_001.fastq.gz	raw_data/Moth2_S2_L001_R2_001.fastq.gz	CATGCCTA	ACTGCATA	25	27
POS3_2	fastq	raw_data/Moth2_S2_L001_R1_001.fastq.gz	raw_data/Moth2_S2_L001_R2_001.fastq.gz	GAGCGCTA	AGAGTAGA	25	25
POS2_3	fastq	raw_data/Moth3_S3_L001_R1_001.fastq.gz	raw_data/Mot

I have also copied the raw data across.

In [10]:
ls

[0m[01;32mQuerymap_global_corrected.txt[0m*  [01;34mraw_data[0m/


In [12]:
%%bash

metaBEAT_global.py \
-Q Querymap_global_corrected.txt \
--trim_qual 30 \
--trim_minlength 90 \
--merge \
--product_length 350 \
--merged_only \
--cluster \
--clust_match 1 \
--clust_cov 5 \
-o MothPollenFinalCorrected \
-@ callumjmacgregor@gmail.com
-n 5 -v &> log


metaBEAT - metaBarcoding and Environmental DNA Analyses tool
version: v.0.97.7-global


Sat Nov  5 10:36:33 2016

/usr/bin/metaBEAT_global.py -Q Querymap_global_corrected.txt --trim_qual 30 --trim_minlength 90 --merge --product_length 350 --merged_only --cluster --clust_match 1 --clust_cov 5 -o MothPollenFinalCorrected -@ callumjmacgregor@gmail.com


metaBEAT may be querying NCBI's Entrez databases to fetch/verify taxonomic ids. Entrez User requirements state that you need to identify yourself by providing an email address so that NCBI can contact you in case there is a problem.

You have specified: 'callumjmacgregor@gmail.com'

taxonomy.db found at /usr/bin/taxonomy.db

Parsing querylist file

Number of samples to process: 335
Sequence input format: defaultdict(<type 'int'>, {'fastq': 336})
Barcodes for demultiplexing provided for 336 samples
Cropping instructions provided for 336 samples


Sat Nov  5 10:36:33 2016


### DEMULTIPLEXING ###

assessing basic characteristics
data comes 

Now we can run the final BLAST against the new files

In [13]:
cd FinalBlast/

/home/working/FinalCorrected/FinalBlast


We need an updated Querymap, based on the corrected version in the folder above.

In [14]:
%%bash

for sample in $(cat ../Querymap_global_corrected.txt | cut -f 1)
do
    fasta=$(ls -1 ../$sample/$sample\_trimmed.fasta)
    echo -e "$sample\tfasta\t$fasta"
done > Querymap.txt

head Querymap.txt

SA177	fasta	../SA177/SA177_trimmed.fasta
SB140	fasta	../SB140/SB140_trimmed.fasta
T19	fasta	../T19/T19_trimmed.fasta
Lg45	fasta	../Lg45/Lg45_trimmed.fasta
Q69	fasta	../Q69/Q69_trimmed.fasta
S72	fasta	../S72/S72_trimmed.fasta
SC66	fasta	../SC66/SC66_trimmed.fasta
S31	fasta	../S31/S31_trimmed.fasta
SC74	fasta	../SC74/SC74_trimmed.fasta
SA68	fasta	../SA68/SA68_trimmed.fasta


In [16]:
!grep "POS" Querymap.txt

POS2_1	fasta	../POS2_1/POS2_1_trimmed.fasta
POS4_1	fasta	../POS4_1/POS4_1_trimmed.fasta
POS1_1	fasta	../POS1_1/POS1_1_trimmed.fasta
POS3_1	fasta	../POS3_1/POS3_1_trimmed.fasta
POS2_2	fasta	../POS2_2/POS2_2_trimmed.fasta
POS4_2	fasta	../POS4_2/POS4_2_trimmed.fasta
POS1_2	fasta	../POS1_2/POS1_2_trimmed.fasta
POS3_2	fasta	../POS3_2/POS3_2_trimmed.fasta
POS2_3	fasta	../POS2_3/POS2_3_trimmed.fasta
POS4_3	fasta	../POS4_3/POS4_3_trimmed.fasta
POS1_3	fasta	../POS1_3/POS1_3_trimmed.fasta
POS3_3	fasta	../POS3_3/POS3_3_trimmed.fasta
POS2_4	fasta	../POS2_4/POS2_4_trimmed.fasta
POS4_4	fasta	../POS4_4/POS4_4_trimmed.fasta
POS1_4	fasta	../POS1_4/POS1_4_trimmed.fasta
POS3_4	fasta	../POS3_4/POS3_4_trimmed.fasta


And a REFmap and reference database (fortunately, the error had nothing to do with the reference database, so we can just copy this across from the previous "Final BLAST" folder!)

In [17]:
!echo "eyorks_flora_final.gb\tgb" > REFmap.txt

In [18]:
cp ../../AgainstRefDB/Curat1/FINAL/Sativa/eyorks_flora_final.gb .

In [19]:
ls

eyorks_flora_final.gb  Querymap.txt  REFmap.txt


Ready!

In [24]:
%%bash

metaBEAT_global.py \
-Q Querymap.txt \
--cluster \
--clust_match 1 \
--clust_cov 5 \
--blast \
--bitscore_skim_LCA 0.03 \
--bitscore_skim_adjust_off \
--REFlist REFmap.txt \
--min_ident 0.95 \
-n 5 \
-v \
-@ callumjmacgregor@gmail.com \
-o MothPollenFinal > log_final

In [25]:
%%bash
blastn -query /home/working/FinalCorrected/FinalBlast/GLOBAL/global_centroids.fasta \
-db /home/working/FinalCorrected/FinalBlast/GLOBAL/BLAST_0.95/marker_blast_db \
-out /home/working/FinalCorrected/FinalBlast/GLOBAL/BLAST_0.95/global_blastn.out.txt

Finally we want to move the outputs from this into a folder for downstream analysis in R.

In [6]:
ls GLOBAL/BLAST_0.95/

[0m[01;32mglobal_blastn.out.txt[0m*
[01;32mglobal_blastn.out.xml[0m*
[01;32mmarker_blast_db.nhr[0m*
[01;32mmarker_blast_db.nin[0m*
[01;32mmarker_blast_db.nsq[0m*
[01;32mMothPollenFinal-by-taxonomy-clustercounts.blast.biom[0m*
[01;32mMothPollenFinal-by-taxonomy-clustercounts.blast.tsv[0m*
[01;32mMothPollenFinal-by-taxonomy-readcounts.blast.biom[0m*
[01;32mMothPollenFinal-by-taxonomy-readcounts.blast.tsv[0m*
[01;32mMothPollenFinal-OTU-taxonomy.blast.biom[0m*
[01;32mMothPollenFinal-OTU-taxonomy.blast.tsv[0m*
[01;32mrefs.fasta[0m*
[01;32mtaxa.csv[0m*
[01;32mtaxids.txt[0m*


In [7]:
cp GLOBAL/BLAST_0.95/MothPollenFinal* ../../Analysis/Chapter-4/Data/Raw_data/

In [8]:
ls ../../Analysis/Chapter-4/Data/Raw_data/

[0m[01;32mMothPollenFinal-by-taxonomy-clustercounts.blast.biom[0m*
[01;32mMothPollenFinal-by-taxonomy-clustercounts.blast.tsv[0m*
[01;32mMothPollenFinal-by-taxonomy-readcounts.blast.biom[0m*
[01;32mMothPollenFinal-by-taxonomy-readcounts.blast.tsv[0m*
[01;32mMothPollenFinal-OTU-taxonomy.blast.biom[0m*
[01;32mMothPollenFinal-OTU-taxonomy.blast.tsv[0m*


All done!