Data processing for the Molecular Ecology paper.



In [2]:
!mkdir 12S

Prepare a text file (the Querymap) that specifies the samples to be processed, including the format and location of the reads.

The command below expects the Illumina data as two fastq files (forward and reverse reads) per sample in the directory `./raw_reads/`. The files must be named 'sampleID-marker', followed by '\_1' or '\_2' to identify the forward or reverse read file, respectively. sampleID must correspond to the first column in the file `Sample_accessions.tsv`; marker is either '12S' or 'CytB'.

Read file names, for example:
```
./raw_reads/Bassenthwaite_01-12S_1.fastq.gz
./raw_reads/Bassenthwaite_01-12S_2.fastq.gz
./raw_reads/Bassenthwaite_01-CytB_1.fastq.gz
./raw_reads/Bassenthwaite_01-CytB_2.fastq.gz
./raw_reads/Bassenthwaite_02-12S_1.fastq.gz
./raw_reads/Bassenthwaite_02-12S_2.fastq.gz
./raw_reads/Bassenthwaite_02-CytB_1.fastq.gz
./raw_reads/Bassenthwaite_02-CytB_2.fastq.gz
./raw_reads/Bassenthwaite_03-12S_1.fastq.gz
./raw_reads/Bassenthwaite_03-12S_2.fastq.gz
```
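To make the naming convention concrete, the following is a minimal sketch of how such a file name can be split into sample ID, marker and read direction using plain bash parameter expansion. The helper name `parse_read_name` is hypothetical and not part of the pipeline; the sketch assumes the `.fastq.gz` suffix shown in the examples above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split <sampleID>-<marker>_<1|2>.fastq.gz into its parts.
parse_read_name() {
    local base direction marker sample
    base=$(basename "$1" .fastq.gz)   # e.g. Bassenthwaite_01-12S_1
    direction=${base##*_}             # text after the last '_' (1 or 2)
    base=${base%_*}                   # drop the read suffix
    marker=${base##*-}                # text after the last '-' (12S or CytB)
    sample=${base%-*}                 # everything before the marker
    printf '%s\t%s\t%s\n' "$sample" "$marker" "$direction"
}

parse_read_name ./raw_reads/Bassenthwaite_01-12S_1.fastq.gz
parse_read_name ./raw_reads/Bassenthwaite_shore-01-12S_2.fastq.gz
```

Splitting at the last '-' and the last '\_' keeps sample IDs that themselves contain these characters (such as `Bassenthwaite_shore-01`) intact.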


In [10]:
%%bash

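# For every 12S sample in Sample_accessions.tsv (header line excluded), write one
# tab-separated line: sampleID, read format, forward read file, reverse read file
for a in $(grep "12S" Sample_accessions.tsv | cut -f 1 | grep -v "SampleID")
do
    R1=$(ls -1 ./raw_reads/$a-12S_* | grep "_1.fastq")
    R2=$(ls -1 ./raw_reads/$a-12S_* | grep "_2.fastq")

    echo -e "$a\tfastq\t$R1\t$R2"
done > 12S/Querymap.txt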

The resulting file should look similar to this:

In [11]:
!head 12S/Querymap.txt

Bassenthwaite_01	fastq	./raw_reads/Bassenthwaite_01-12S_1.fastq.gz	./raw_reads/Bassenthwaite_01-12S_2.fastq.gz
Bassenthwaite_02	fastq	./raw_reads/Bassenthwaite_02-12S_1.fastq.gz	./raw_reads/Bassenthwaite_02-12S_2.fastq.gz
Bassenthwaite_03	fastq	./raw_reads/Bassenthwaite_03-12S_1.fastq.gz	./raw_reads/Bassenthwaite_03-12S_2.fastq.gz
Bassenthwaite_04	fastq	./raw_reads/Bassenthwaite_04-12S_1.fastq.gz	./raw_reads/Bassenthwaite_04-12S_2.fastq.gz
Bassenthwaite_05	fastq	./raw_reads/Bassenthwaite_05-12S_1.fastq.gz	./raw_reads/Bassenthwaite_05-12S_2.fastq.gz
Bassenthwaite_shore-01	fastq	./raw_reads/Bassenthwaite_shore-01-12S_1.fastq.gz	./raw_reads/Bassenthwaite_shore-01-12S_2.fastq.gz
Derwent_01	fastq	./raw_reads/Derwent_01-12S_1.fastq.gz	./raw_reads/Derwent_01-12S_2.fastq.gz
Derwent_02	fastq	./raw_reads/Derwent_02-12S_1.fastq.gz	./raw_reads/Derwent_02-12S_2.fastq.gz
Derwent_03	fastq	./raw_reads/Derwent_03-12S_1.fastq.gz	./raw_reads/Derwent_03-12S_2.fastq.gz
Derwent_04	fastq	./raw_reads
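If a read file is missing, the loop that writes the Querymap still emits a line for that sample, only with an empty file column. A quick structural check can catch this before the pipeline is started. The following is a minimal sketch; the function name `check_querymap` and the sample `Derwent_99` are made up for illustration, and the check only looks at the four-column layout shown above, not at the files themselves.

```bash
#!/usr/bin/env bash
# Hypothetical check: every line needs 4 tab-separated fields,
# 'fastq' in column 2 and non-empty read file paths in columns 3 and 4.
check_querymap() {
    awk -F'\t' '
        NF != 4 || $2 != "fastq" || $3 == "" || $4 == "" {
            printf "line %d looks malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Demonstration on a temporary file (Derwent_99 is an invented sample without reads)
tmp=$(mktemp)
printf 'Derwent_01\tfastq\t./raw_reads/Derwent_01-12S_1.fastq.gz\t./raw_reads/Derwent_01-12S_2.fastq.gz\n' > "$tmp"
printf 'Derwent_99\tfastq\t\t\n' >> "$tmp"
check_querymap "$tmp" || echo "Querymap needs attention"
rm -f "$tmp"
```

On the real data this would be called as `check_querymap 12S/Querymap.txt`.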

Prepare the REFmap file, i.e. a text file that specifies the location and format of the reference to be used.

In [20]:
%%bash

#copy the reference database into the 12S directory
cp reference_DBs/custom_extended_12S.gb 12S/

#Write REFmap
for file in $(ls -1 ./12S/ | grep "gb$")
do
    echo -e "$file\tgb"
done > 12S/REFmap.txt

In [21]:
!cat 12S/REFmap.txt

custom_extended_12S.gb	gb
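Before using the reference, it can be worth confirming that the GenBank file is not empty or truncated. Each GenBank record starts with a `LOCUS` line, so counting those lines gives the number of records. This is a minimal sketch: the function name `count_gb_records` is hypothetical, and the two mock records below are invented purely for the demonstration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count records in a GenBank flat file via their LOCUS lines.
count_gb_records() {
    grep -c '^LOCUS' "$1"
}

# Demonstration on a temporary file with two invented mock records
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
LOCUS       mock_record_1   10 bp    DNA
ORIGIN
        1 acgtacgtac
//
LOCUS       mock_record_2   10 bp    DNA
ORIGIN
        1 ttgcaattgc
//
EOF
count_gb_records "$tmp"    # prints 2
rm -f "$tmp"
```

For the reference used here the call would be `count_gb_records 12S/custom_extended_12S.gb`.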


The 12S amplicon sequenced here is only 106 bp long, whereas the MiSeq run used a read length of 2x300 bp. Our reads are thus longer than the amplicon, so we expect them to contain primer/adapter sequences that need to be removed during raw data processing.

Specifically, forward reads are expected to contain the reverse complement of the reverse primer plus the reverse Illumina adapter (FA501 - FA508), and reverse reads will contain reverse complements of the forward primers and adapters (RB701 - RB712).

The expected sequences have been collected in the file `adapters_rc.fasta` (printed below) and are passed to the trimming step.
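The read-through argument can be checked with simple arithmetic. The sketch below uses a hypothetical helper, `read_through`, that reports how many bases of a read extend beyond the amplicon. It assumes the read starts at the beginning of the amplicon and treats the 106 bp figure as the full amplicon length; the text does not say whether this length includes the primers.

```bash
#!/usr/bin/env bash
# Hypothetical helper: bases of a read that extend past the amplicon.
# Usage: read_through READ_LENGTH AMPLICON_LENGTH
read_through() {
    local extra=$(( $1 - $2 ))
    (( extra < 0 )) && extra=0    # a read shorter than the amplicon never runs past it
    echo "$extra"
}

read_through 300 106    # prints 194
```

Under these assumptions, up to 194 bases at the 3' end of each 300 bp read are not amplicon sequence, which is why adapter trimming is needed here.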

In [5]:
!ln -s ../adapters_rc.fasta .

In [6]:
!cat adapters_rc.fasta

>FA501_rc
GGGGTATCTAATCCCAGTCCAATTACCATACGTACGATGTGTAGATCTCGGTGGTCGCCGTATCATT
>FA502_rc
GGGGTATCTAATCCCAGTCCAATTACCATACAGATAGTGTGTAGATCTCGGTGGTCGCCGTATCATT
>FA503_rc
GGGGTATCTAATCCCAGTCCAATTACCATAACTCGCTAGTGTAGATCTCGGTGGTCGCCGTATCATT
>FA504_rc
GGGGTATCTAATCCCAGTCCAATTACCATAACACGCAGGTGTAGATCTCGGTGGTCGCCGTATCATT
>FA505_rc
GGGGTATCTAATCCCAGTCCAATTACCATACTCGATGAGTGTAGATCTCGGTGGTCGCCGTATCATT
>FA506_rc
GGGGTATCTAATCCCAGTCCAATTACCATACACTCACGGTGTAGATCTCGGTGGTCGCCGTATCATT
>FA507_rc
GGGGTATCTAATCCCAGTCCAATTACCATAAGATATCCGTGTAGATCTCGGTGGTCGCCGTATCATT
>FA508_rc
GGGGTATCTAATCCCAGTCCAATTACCATAACGGTGTCGTGTAGATCTCGGTGGTCGCCGTATCATT
>RB701_rc
CTAGAGGAGCCTGTTCTAGGCTGACTGACTCTCGACTTATCTCGTATGCCGTCTTCTGCTTG
>RB702_rc
CTAGAGGAGCCTGTTCTAGGCTGACTGACTCGAAGTATATCTCGTATGCCGTCTTCTGCTTG
>RB703_rc
CTAGAGGAGCCTGTTCTAGGCTGACTGACTTAGCAGCTATCTCGTATGCCGTCTTCTGCTTG
>RB704_rc
CTAGAGGAGCCTGTTCTAGGCTGACTGACTTCTCTATGATCTCGTATGCCGTCTTCTGCTTG
>RB705_rc
CTAGAGGAGCCTGTTCTAGGCTGACTGACTGATCTACGATCTCGTATGC
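The entries in this file are stored as reverse complements. As an illustration of what that means, a reverse complement can be computed with standard command line tools. This is a minimal sketch, not the procedure used to build `adapters_rc.fasta`; the function name `revcomp` is hypothetical, and the `rev` utility is assumed to be installed (it ships with util-linux and with macOS).

```bash
#!/usr/bin/env bash
# Hypothetical helper: reverse the sequence, then swap each base for its complement.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp GATTACA               # prints TGTAATC
revcomp GGGGTATCTAATCCCAGT    # first 18 bases of the FA501_rc entry; prints ACTGGGATTAGATACCCC
```

Applying the function twice returns the original sequence, which is a convenient sanity check.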

Run metaBEAT (raw data trimming).

In [1]:
!mkdir trimming

In [8]:
%%bash

metaBEAT.py \
-Q Querymap.txt \
-R REFmap.txt \
--trim_qual 30 \
--trim_adapter adapters_rc.fasta \
--trim_minlength 90 \
--merge \
--product_length 110 \
--cluster --clust_match 1 --clust_cov 3 \
--blast \
-m 12S \
-n 5 \
--min_ident 1 -E -v \
-@ c.hahn@hull.ac.uk \
-o 12S-trim_30-cluster_1c3-blast-min_ident_1.0 &> log

cd trimming/

metaBEAT.py \
-Q ../Querymap.txt \
-R ../REFmap.txt \
--trim_adapter ../adapters_rc.fasta \
--trim_minlength 90 \
--merge \
--product_length 120 \
--forward_only \
-m 12S \
-n 5 \
-v \
-o 12S_trim_30-merge \
-@ c.hahn@hull.ac.uk &> log

cd ..