The first step will be to __trim/clean our raw Illumina data__.

In [1]:
!mkdir trimming

In [2]:
cd trimming

/home/working/Cytb/trimming


Prepare a text file specifying the samples to be processed, including the format and location of the reads.

The command below expects the Illumina data to be present as two fastq files (forward and reverse reads) per sample in the directory `../../raw_reads/`. File names are expected to start with the sample ID followed by `-Cytb`, and to contain `_R1.fastq` or `_R2.fastq` to identify the forward and reverse read file, respectively (e.g. `WOF01_summer-Cytb_R1.fastq.gz`).

The raw data must first be downloaded with `How_to_download_Rawdata_from_SRA.ipynb` (see [here](https://github.com/HullUni-bioinformatics/Handley_et_al_2018/blob/master/How_to_download_Rawdata_from_SRA.ipynb)).

The sample ID must correspond to the first column of the file `Sample_accessions.tsv` (see [here](https://github.com/HullUni-bioinformatics/Handley_et_al_2018/blob/master/supplementary_data/Sample_accessions.tsv)); the marker is `Cytb`.


In [3]:
%%bash

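# Select the Cytb rows of the accession table, keep the sample ID (column 1) and drop the header line
for a in $(cat ../../supplementary_data/Sample_accessions.tsv | grep "Cytb" | cut -f 1 | grep "SampleID" -v)
do
    # Locate the forward (R1) and reverse (R2) read files for this sample
    R1=$(ls -1 ../../raw_reads/$a-Cytb_* | grep "_R1.fastq")
    R2=$(ls -1 ../../raw_reads/$a-Cytb_* | grep "_R2.fastq")

    # One line per sample: sample ID, format, forward reads, reverse reads
    echo -e "$a\tfastq\t$R1\t$R2"
done > Querymap.txt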

The resulting file should look like this, for example:

In [5]:
!head -n 8 Querymap.txt

WOF01_summer	fastq	../../raw_reads/WOF01_summer-Cytb_R1.fastq.gz	../../raw_reads/WOF01_summer-Cytb_R2.fastq.gz
WOF02_summer	fastq	../../raw_reads/WOF02_summer-Cytb_R1.fastq.gz	../../raw_reads/WOF02_summer-Cytb_R2.fastq.gz
WOF03_summer	fastq	../../raw_reads/WOF03_summer-Cytb_R1.fastq.gz	../../raw_reads/WOF03_summer-Cytb_R2.fastq.gz
WOF04_summer	fastq	../../raw_reads/WOF04_summer-Cytb_R1.fastq.gz	../../raw_reads/WOF04_summer-Cytb_R2.fastq.gz
WOF05_summer	fastq	../../raw_reads/WOF05_summer-Cytb_R1.fastq.gz	../../raw_reads/WOF05_summer-Cytb_R2.fastq.gz
WOF06_summer	fastq	../../raw_reads/WOF06_summer-Cytb_R1.fastq.gz	../../raw_reads/WOF06_summer-Cytb_R2.fastq.gz
WOF07_summer	fastq	../../raw_reads/WOF07_summer-Cytb_R1.fastq.gz	../../raw_reads/WOF07_summer-Cytb_R2.fastq.gz
WOF08_summer	fastq	../../raw_reads/WOF08_summer-Cytb_R1.fastq.gz	../../raw_reads/WOF08_summer-Cytb_R2.fastq.gz
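
Before starting the trimming it can be worth checking that every line of `Querymap.txt` has the four tab-separated fields shown above. If the `ls | grep` step finds no file for a sample, the corresponding field is left empty without any error. The following is a minimal sketch of such a check; the function name `check_querymap_line` is hypothetical and not part of metaBEAT.

```python
def check_querymap_line(line):
    """Return a list of problems found in one Querymap line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return [f"expected 4 tab-separated fields, found {len(fields)}"]

    sample, fmt, r1, r2 = fields
    problems = []
    if fmt != "fastq":
        problems.append(f"unexpected format {fmt!r}")
    if "_R1.fastq" not in r1:
        problems.append("forward read path lacks '_R1.fastq'")
    if "_R2.fastq" not in r2:
        problems.append("reverse read path lacks '_R2.fastq'")
    if not sample or sample not in r1 or sample not in r2:
        problems.append("sample ID missing from a read path")
    return problems


# Example line taken from the output above
example = (
    "WOF01_summer\tfastq\t"
    "../../raw_reads/WOF01_summer-Cytb_R1.fastq.gz\t"
    "../../raw_reads/WOF01_summer-Cytb_R2.fastq.gz"
)
print(check_querymap_line(example))  # prints []
```

To check the whole file, the function could be applied to each line of `open("Querymap.txt")` and any non-empty result reported together with its line number.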


The amplicon is expected to be > 400 bp long. With a read length of 300 bp we don't expect to see any primer sequences, so it's not necessary to provide the primer sequences to the trimming algorithm.
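
The arithmetic behind these numbers can be sketched as follows. This is only an illustration under simple assumptions (both reads have the full length and start at opposite ends of the amplicon); the function names are hypothetical.

```python
def expected_overlap(read_length, amplicon_length):
    """Overlap in bp between a forward and a reverse read spanning one amplicon.

    A value of zero or less means the two reads do not overlap.
    """
    return 2 * read_length - amplicon_length


def read_reaches_far_end(read_length, amplicon_length):
    """True if a single read is long enough to reach the opposite amplicon end."""
    return read_length >= amplicon_length


print(expected_overlap(300, 400))      # prints 200
print(read_reaches_far_end(300, 400))  # prints False
```

For a 400 bp amplicon the two 300 bp reads overlap by about 200 bp, which leaves room for merging; longer amplicons overlap correspondingly less. A single 300 bp read stops before the opposite end of the amplicon.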

__Raw data trimming, removal of adapter sequences and merging of reads__ using the `metaBEAT` pipeline.

In [None]:
%%bash

echo -e "Starttime: $(date)\n"

metaBEAT_global.py \
-Q Querymap.txt \
--trim_qual 30 \
--trim_minlength 100 \
--merge --product_length 400 --forward_only \
-@ haikuilee@gmail.com \
-n 5 -v &> log_trimming

echo -e "Endtime: $(date)\n"
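
To put the quality threshold into perspective: assuming `--trim_qual` is a Phred-scaled quality score, it relates to the per-base error probability as P = 10^(-Q/10). A minimal sketch of this conversion (the function name is hypothetical):

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score Q to the probability of a wrong base call."""
    return 10 ** (-q / 10)


for q in (20, 30):
    print(f"Q{q}: {phred_to_error_probability(q):.4f}")
```

Under this assumption, a threshold of 30 corresponds to an expected error rate of about 1 in 1,000 bases, compared to 1 in 100 at Q20.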

In [6]:
cd ../

/home/working/Cytb


Some stats on the read counts before/after trimming, merging etc. are summarized for you in `metaBEAT_read_stats.csv`.

The next stage of the processing is __chimera detection__ and removal of putative chimeric sequences. We'll do that using `uchime` as implemented in `vsearch`.

In [7]:
!mkdir chimera_detection

In [8]:
cd chimera_detection

/home/working/Cytb/chimera_detection


Convert the reference database from GenBank to fasta format for use in chimera detection.

Prepare a Refmap file, i.e. a text file that specifies the location and the format of the reference to be used.

The reference sequences in GenBank format should already be present in `../../supplementary_data/reference_DBs/`: `Cytb_evohull_DEC_2017.gb`.

In [9]:
%%bash

#Write REFmap
for file in $(ls -1 ../../supplementary_data/reference_DBs/* | grep "Cytb_evohull_DEC_2017.gb$")
do
      echo -e "$file\tgb"
done > REFmap.txt

In [10]:
!cat REFmap.txt

../../supplementary_data/reference_DBs/Cytb_evohull_DEC_2017.gb	gb


In [None]:
%%bash

metaBEAT_global.py \
-R REFmap.txt \
-f \
-@ haikuilee@gmail.com

This will produce `refs.fasta`.

Now run chimera detection.

In [None]:
%%bash


# Run reference-based chimera detection for every sample with a non-empty trimmed file
for a in $(cut -f 1 ../trimming/Querymap.txt)
do
    if [ -s "../trimming/$a/${a}_trimmed.fasta" ]
    then
        echo -e "\n### Detecting chimeras in $a ###\n"
        mkdir -p "$a"
        cd "$a"
        vsearch --uchime_ref "../../trimming/$a/${a}_trimmed.fasta" --db ../refs.fasta \
        --nonchimeras "$a-nonchimeras.fasta" --chimeras "$a-chimeras.fasta" &> log
        cd ..

    else
        # -s is false if the trimmed file is missing or has zero size
        echo -e "$a is empty"
    fi
done
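
Once the loop has finished, it can be useful to see how many reads were kept and how many were flagged per sample. The following is a minimal sketch that counts fasta header lines in the two output files written above; the function names are hypothetical, and it assumes both files exist for the sample.

```python
import os


def count_fasta_records(path):
    """Count sequences in a fasta file by counting header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def chimera_summary(base_dir, sample):
    """Return (non-chimeric, chimeric) read counts for one sample directory."""
    sample_dir = os.path.join(base_dir, sample)
    kept = count_fasta_records(
        os.path.join(sample_dir, f"{sample}-nonchimeras.fasta")
    )
    flagged = count_fasta_records(
        os.path.join(sample_dir, f"{sample}-chimeras.fasta")
    )
    return kept, flagged
```

Run from within the `chimera_detection` directory, `chimera_summary(".", "WOF01_summer")` would return the two counts for that sample.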




In [12]:
cd ..

/home/working/Cytb


The last step is __taxonomic assignment of reads based on a BLAST - LCA approach__ using the metaBEAT pipeline.

In [13]:
!mkdir non-chimeras

In [14]:
cd non-chimeras/

/home/working/Cytb/non-chimeras


Prepare the Querymap and Refmap text files.

In [19]:
%%bash

#Querymap
for a in $(ls -l ../chimera_detection/ | grep "^d" | perl -ne 'chomp; @a=split(" "); print "$a[-1]\n"')
do
   if [ "$a" != "GLOBAL" ]
   then
      echo -e "$a-nc\tfasta\t../chimera_detection/$a/$a-nonchimeras.fasta"
   fi
done > Querymap.txt



#Write REFmap
for file in $(ls -1 ../../supplementary_data/reference_DBs/* | grep "Cytb_evohull_DEC_2017.gb$")
do
      echo -e "$file\tgb"
done > REFmap.txt

for file in $(ls -1 ../../supplementary_data/reference_DBs/* | grep "RhamphochromisEsox_mt.gb$")
do
      echo -e "$file\tgb"
done >> REFmap.txt

In [20]:
!head -10  Querymap.txt

In [21]:
!cat REFmap.txt

../../supplementary_data/reference_DBs/Cytb_evohull_DEC_2017.gb	gb
../../supplementary_data/reference_DBs/RhamphochromisEsox_mt.gb	gb
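
The Querymap loop above finds the sample directories by parsing the output of `ls -l`. For readers who prefer to avoid that, here is a rough Python equivalent as a sketch. The function name `nonchimera_querymap_lines` is hypothetical, and the order of the lines may differ from the order produced by `ls`.

```python
import os


def nonchimera_querymap_lines(chimera_dir, skip=("GLOBAL",)):
    """Build one Querymap line per sample subdirectory of chimera_dir."""
    chimera_dir = chimera_dir.rstrip("/")
    lines = []
    for name in sorted(os.listdir(chimera_dir)):
        # Keep only directories, and skip metaBEAT's GLOBAL directory
        if name in skip or not os.path.isdir(os.path.join(chimera_dir, name)):
            continue
        fasta = f"{chimera_dir}/{name}/{name}-nonchimeras.fasta"
        lines.append(f"{name}-nc\tfasta\t{fasta}")
    return lines
```

Calling `nonchimera_querymap_lines("../chimera_detection")` and writing the returned lines to `Querymap.txt`, one per line, should give the same three columns as the bash loop.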


__Sequence clustering and taxonomic assignment via metaBEAT.__

In [None]:
%%bash

metaBEAT.py \
-Q Querymap.txt \
-R REFmap.txt \
--cluster --clust_match 1 --clust_cov 3 \
--blast --min_ident 0.95 \
-m Cytb -n 5 \
-E -v \
-@ haikuilee@gmail.com \
-o Cytb-trim_30-merge-nonchimeras-cluster_1c3-blast-min_ident_0.95 &> log
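
The idea behind a lowest common ancestor (LCA) assignment can be illustrated with a short sketch: if a query has equally good BLAST hits to several taxa, it is assigned to the deepest taxonomic level that all of these hits share. The code below is a simplified illustration of that principle only, not metaBEAT's implementation; the function name is hypothetical and the lineages are illustrative examples rather than results from this dataset.

```python
def lowest_common_ancestor(lineages):
    """Return the shared prefix of several root-to-tip taxonomic lineages."""
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):
        # Stop at the first level where the hits disagree
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared


trutta = ["Actinopterygii", "Salmoniformes", "Salmonidae", "Salmo", "Salmo trutta"]
salar = ["Actinopterygii", "Salmoniformes", "Salmonidae", "Salmo", "Salmo salar"]
mykiss = ["Actinopterygii", "Salmoniformes", "Salmonidae", "Oncorhynchus",
          "Oncorhynchus mykiss"]

print(lowest_common_ancestor([trutta, salar])[-1])          # prints Salmo
print(lowest_common_ancestor([trutta, salar, mykiss])[-1])  # prints Salmonidae
```

A read with a single best hit keeps its species-level assignment, while ambiguous reads are moved up to genus, family or higher.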

__Results are in the `GLOBAL` folder.__