#Downloading raw read data from SRA

The cell below downloads the raw Illumina data associated with the study and renames the files as specified in the text file `SraAccList.txt` (one `name,accession` pair per line) provided in the `data` directory.

The cell expects the [SRA Toolkit](http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc) programs `prefetch` and `fastq-dump` to be in your path (tested with SRA Toolkit version 2.3.5). `prefetch` downloads data in SRA format and by default places it in `~/ncbi/public/sra/`. `fastq-dump` converts the SRA-formatted data to gzipped FASTQ.
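
One way to confirm that both tools are reachable before starting the download is sketched here. The helper name `missing_tools` is an assumption made for this illustration; it is not part of the SRA Toolkit or of the original notebook.

```shell
# Print the name of every given tool that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools prefetch fastq-dump)
if [ -n "$missing" ]; then
    echo "not found on PATH:" $missing
else
    echo "prefetch and fastq-dump found"
fi
```

If either name is reported as missing, the download cell will fail at the corresponding step.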

In [None]:
%%bash

# Use $HOME here: a "~" inside quotes is not expanded by bash
sra_path="$HOME/ncbi/public/sra"

cd data

for line in $(cat SraAccList.txt)
do
    # Each line is expected to be "name,accession"
    name=$(echo "$line" | cut -d "," -f 1)
    acc=$(echo "$line" | cut -d "," -f 2)
    echo -e "\nDOWNLOADING: $name\t$acc\n################\n"
    prefetch "$acc"
    # Convert the downloaded .sra file to gzipped FASTQ, one file per read
    fastq-dump --split-files --gzip --defline-seq '@$ac-$sn/$ri' --defline-qual '+' "$sra_path/$acc.sra"
    # Rename from accession to sample name
    mv "${acc}_1.fastq.gz" "${name}_1.fastq.gz"
    mv "${acc}_2.fastq.gz" "${name}_2.fastq.gz"
done

cd ..
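
The loop above implies that each line of `SraAccList.txt` holds a sample name and an accession separated by a comma. Under that assumption, the following sketch prints the renames the loop would perform, without downloading anything. The sample names, the placeholder accessions, and the helper `plan_renames` are made up for illustration.

```shell
# Hypothetical two-line list in the assumed "name,accession" format.
tmp=$(mktemp -d)
printf 'sampleA,SRRxxxxxx1\nsampleB,SRRxxxxxx2\n' > "$tmp/SraAccList.txt"

# For each line, print the two renames (forward and reverse reads).
plan_renames() {
    while IFS=, read -r name acc; do
        for read_no in 1 2; do
            echo "${acc}_${read_no}.fastq.gz -> ${name}_${read_no}.fastq.gz"
        done
    done < "$1"
}

plan_renames "$tmp/SraAccList.txt"
rm -r "$tmp"
```

Running a dry run like this first makes it easy to spot a malformed line (for example, a missing comma) before any data is fetched.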

#Processing read data

__EXPLORING CLUSTERING PARAMETER SPACE__

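The following cell runs the clustering steps across a range of parameters: clustering thresholds (`--clust_match`) from 0.90 to 1.00 in steps of 0.01, and coverage values (`--clust_cov`) from 100 to 500 in steps of 50. Each run writes its own log file, and its `reads_stats.csv` is renamed to include the parameter values.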

In [None]:
%%bash

mkdir cluster_space
cd cluster_space

for i in $(seq 0.90 0.01 1)
do
    for j in $(seq 100 50 500)
    do
        echo -e "running with clustering threshold $i coverage $j"
        metaBEAT.py -Q ../../data/QUERYmap --merge --merged_only --length_filter 310 --product_length 400 --clust_match $i --clust_cov $j -n 5 -v --cluster &> log_$i-$j.txt
        mv reads_stats.csv reads_stats_$i-$j.csv
    done
done

cd ..
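
The size of the parameter grid can be checked without invoking metaBEAT. The sketch below enumerates the same threshold and coverage pairs as the loops above: 11 thresholds times 9 coverage values, so 99 runs. The function name `param_grid` is an assumption for this illustration, and the thresholds are built from integer percentages so that the two-decimal format does not depend on how a particular `seq` implementation prints floating-point values.

```shell
# Print one "threshold coverage" pair per line for the whole grid.
param_grid() {
    for pct in $(seq 90 100); do
        # 90 -> 0.90, ..., 100 -> 1.00
        i=$(printf '%d.%02d' $((pct / 100)) $((pct % 100)))
        for j in $(seq 100 50 500); do
            echo "$i $j"
        done
    done
}

# Show the first three combinations and the total number of runs.
param_grid | sed -n '1,3p'
echo "total runs: $(param_grid | wc -l | tr -d ' ')"
```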

Next, we concatenate all results into a single file with one header line, ready for processing in R.

In [None]:
%%bash

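# Take the header line from one of the result files
head -n 1 cluster_space/reads_stats_0.90-100.csv > cluster_space/combined_reads_stats.csv
# Append the data lines of all result files, skipping their header lines
cat cluster_space/reads_stats_* | grep -v "sample," >> cluster_space/combined_reads_stats.csv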
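
The effect of the two commands above can be seen on toy data. This is a minimal sketch: the helper `combine_stats` and all column names other than `sample` are made up for illustration, and the real `reads_stats.csv` files contain different columns.

```shell
# Toy stand-ins for two per-run result files.
demo=$(mktemp -d)
printf 'sample,total,clustered\nA,100,80\n' > "$demo/reads_stats_0.90-100.csv"
printf 'sample,total,clustered\nA,100,60\n' > "$demo/reads_stats_0.95-100.csv"

# Keep the header of the first file only, then append every non-header line.
combine_stats() {
    set -- "$1"/reads_stats_*.csv   # expand the glob into the argument list
    head -n 1 "$1"
    cat "$@" | grep -v "sample,"
}

combine_stats "$demo" > "$demo/combined_reads_stats.csv"
cat "$demo/combined_reads_stats.csv"
rm -r "$demo"
```

Because the combined file name starts with `combined_`, it is not matched by the `reads_stats_*` pattern and so is never appended to itself.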



The file `cluster_space/combined_reads_stats.csv` is processed with the R script `clustering_paramters_heat.R` to produce Figure 5 in the manuscript.

__PERFORMING FINAL ANALYSES (INCL. TAXONOMIC ASSIGNMENT) FROM RAW READ DATA__

In [None]:
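# Use the %cd magic so that the directory change persists for later cells
# (a cd inside a %%bash cell only affects that cell's own subprocess)
!mkdir -p final_analysis
%cd final_analysis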

The following cell trims and clusters the reads, then assigns taxonomy with BLAST against the custom reference database (`positives.gb`), all using the metaBEAT pipeline.

In [None]:
!metaBEAT.py -Q ../../data/QUERYmap -R ../../data/REFmap --merge --merged_only --length_filter 310 --product_length 400 --clust_match 0.95 --clust_cov 50 --trim_minlength 100 -n 4 --cluster --PCR_primer ../../data/PCR_primers.fasta -E -b --min_ident 0.95 > log

The read summary statistics produced by metaBEAT (`reads_stats.csv`) were processed in R with the script `trimming_results_script.R` to produce Figures 3 and 4.

The file `metaBEAT.tsv` contains the results from the taxonomic assignment in human-readable text format. The pipeline also produces the results in [BIOM](http://biom-format.org/) format (`metaBEAT.biom`). The next cell will re-format the text file for subsequent processing in R.

In [None]:
!cat metaBEAT.tsv | grep "# " -v | sed 's/#//' | sed 's/\.blast//g' | sed 's/ /_/' | perl -ne 'chomp; @a=split("\t"); pop(@a); $out=join("\t", @a); print "$out\n"' > metaBEAT-processed.tsv
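
To make the individual steps of the command above easier to follow, here is a sketch that applies them to a toy table. The input layout (a comment line, a header starting with `#`, sample columns ending in `.blast`, and a trailing taxonomy column) is an assumption about what `metaBEAT.tsv` looks like. The helper `reformat_table` is hypothetical, and `awk` stands in for the perl one-liner that drops the last tab-separated column.

```shell
# Same steps as the notebook command, reading from standard input.
reformat_table() {
    grep -v "# " \
        | sed 's/#//' \
        | sed 's/\.blast//g' \
        | sed 's/ /_/' \
        | awk -F '\t' '{ out = $1; for (i = 2; i < NF; i++) out = out "\t" $i; print out }'
    # grep: drop comment lines containing "# "
    # sed 1: strip the leading "#" from the header
    # sed 2: remove the ".blast" suffix from sample names
    # sed 3: replace the first space on each line with "_"
    # awk: print all columns except the last one
}

# Toy input in the assumed layout; names and counts are made up.
printf '# Constructed from biom file\n#OTU ID\tsample1.blast\tsample2.blast\ttaxonomy\nGenus species\t10\t0\tk__A; p__B\n' \
    | reformat_table
```

On this toy input the result is a two-line table with the header `OTU_ID`, `sample1`, `sample2` and one data row for `Genus_species`, with the taxonomy column removed.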

The file `metaBEAT-processed.tsv` was used to produce Figure 6 with the R script `well_composition_script.R`.