Skip to content

Latest commit

 

History

History
185 lines (157 loc) · 7.11 KB

File metadata and controls

185 lines (157 loc) · 7.11 KB

Data Access

Making use of the prefetched data and the sra_getter.sh script, as described here.

Performing quality checks

The script used for this, quality_controller.sh, takes two arguments, constititing the first and last accession in the list to be prefetched, same as with prefetching. It performs FastQC quality control on each fastq file loaded, and removes those large files afterwards, a necessity when working on a small working directory drive.

user@server:~$ bash quality_controller.sh 1 106
looking at SRR1427728
...
#!/bin/bash
#makes fastqs from SRA by accession stored in textfile and by number of accession in that file.
#takes two arguments: the line number of the first
#and the last accession in that file to make fastqs from.
#it then performs fastqc quality control and deletes the fastqs again.

#go through all accessions
i=1
for accession in $(cat sra_accessions.txt)
do
	#look only at selected accessions
	if [ $i -ge $1 ] && [ $i -le $2 ]
	then
		echo looking at $accession
		bash sra_getter.sh $accession get > "$accession"_tempfile.txt
		fasterq-dump -S $accession
		bash sra_getter.sh $accession putback $(cat "$accession"_tempfile.txt)
		rm "$accession"_tempfile.txt
		fastqc "$accession"* -q &
		wait
		rm "$accession"*.fastq
	fi
#look at next accession
let "i=i+1"

done

The produced quality control files were then used to determine filter parameters for trimming.

Trimming and screening parameters

To better determine which parameters to use for trimming with TrimGalore, we wanted to examine the properties of the contigs that would later be produced (using mothur) from these trimmed sequences. For that, the following R script screeningtester_graphics.R was used, which plots four histograms of a contigs.report file as produced by mothur. These display overall contig length, overlap length, number of mismatches and number of a,biguities for one sample. Using the outputs of both this and the previous step with fastqc, parameters for TrimGalore and the later contig screening with mothur were arrived at. It is called from the command line like this:

user@server:~$ R --quiet --vanilla -f screeningtester_graphics.R --args SRR1427728_1_val
#a script to visualize parameters of the contigs created from forward and reverse fastqs for one sample.
#called as R --quiet --vanilla -f screeningtester_graphics.R --args "$accession"_1_val_1
#or R --quiet --vanilla -f screeningtester_graphics.R --args "$accession"_1
filename <- commandArgs(trailingOnly = TRUE) 
mytable <- read.table(paste0(filename,".contigs.report"),header=T)
#plotting four histograms: contig length, oveerlap length, number of mismatches and number of ambiguities.
#each has 50 braks, meaning numbre of distinct columns in the histogram to be plotted.
Length_hist <- hist(mytable$Length, plot=F, breaks=50)
Overlap_hist <- hist(mytable$Overlap_Length, plot=F, breaks=50)
Mismatch_hist <- hist(mytable$MisMatches, plot=F, breaks=50)
Ambig_hist <- hist(mytable$Num_Ns, plot=F, breaks=50)

#writing into pdf with set x ranges (may need modification to your needs)
pdf(paste0(filename,"_hist.pdf"))
plot(Length_hist,xaxt="n")
xtick<-seq(0, 400, by=5)
axis(side=1, at=xtick)

plot(Overlap_hist,xaxt="n")
xtick<-seq(0, 400, by=2)
axis(side=1, at=xtick)

plot(Mismatch_hist,xaxt="n")
xtick<-seq(0, 100, by=1)
axis(side=1, at=xtick)

plot(Ambig_hist,xaxt="n")
xtick<-seq(0, 100, by=1)
axis(side=1, at=xtick)

dev.off()

A wrapper script (contig_histogrammer.sh) in bash was produced, which loops, as previous scripts, though the acession list, applies this R script as well as allowing an option to either trim or not trim fastqs before making contigs.

user@server:~$ bash contig_histogrammer.sh 1 106 trim
looking at SRR1427728
...
#!/bin/bash
#makes fastqs from SRA by accession stored in textfile and by number of accession in that file. then makes contigs with or without trimming, and histograms from that.
#takes three arguments: the line number of the first and the last accession in that file to make fastqs from; and "trim" or "notrim"


operation=$3

#make sure prefetch is working:
cd ~/
export PATH=$PATH:$PWD/sratoolkit.2.11.2-ubuntu64/bin
cd ~/WORKING_DIR/
if [ $(which prefetch | wc -l) -eq 1 ]
then
	#go through all accessions
	i=1
	for accession in $(cat sra_accessions.txt)
	do
	
		#look only at selected accessions
		if [ $i -ge $1 ] && [ $i -le $2 ]
		then
			echo looking at $accession
			#retrieve prefetched file
			bash sra_getter.sh $accession get > "$accession"_tempfile.txt
			#load fastqs
			fasterq-dump -S $accession
			bash sra_getter.sh $accession putback $(cat "$accession"_tempfile.txt)
			rm "$accession"_tempfile.txt
			if [ $operation = "trim" ]
			then
				#here different trimming parameters can be chosen for metagenomic or metatranscriptomic data
				a=$(grep $(echo $accession) accessions_with_some_info.txt | awk '{print $6}')
				if [ $a = METAGENOMIC ]
				then
					~/TrimGalore-0.6.7/trim_galore -j 6 -q 30 --paired --three_prime_clip_R1 10 --three_prime_clip_R2 10 --output_dir ~/WORKING_DIR/ ~/WORKING_DIR/"$accession"_1.fastq ~/WORKING_DIR/"$accession"_2.fastq
				fi
				if [ $a = METATRANSCRIPTOMIC ]
				then
					~/TrimGalore-0.6.7/trim_galore -j 6 -q 30 --paired --three_prime_clip_R1 10 --three_prime_clip_R2 10 --output_dir ~/WORKING_DIR/ ~/WORKING_DIR/"$accession"_1.fastq ~/WORKING_DIR/"$accession"_2.fastq
				fi
				rm "$accession"*.fastq
				#making contigs from trimmed data
				~/mothur_1_45_3/mothur "#set.dir(output=~/WORKING_DIR/);make.contigs(ffastq=~/WORKING_DIR/"$accession"_1_val_1.fq, rfastq=~/WORKING_DIR/"$accession"_2_val_2.fq, processors=6, checkorient=T);quit()"
				#plotting histograms of contig parameters
				R --quiet --vanilla -f screeningtester_graphics.R --args "$accession"_1_val_1
			else
				#making contigs directly from untrimmed fastqs
				~/mothur_1_45_3/mothur "#set.dir(output=~/WORKING_DIR/);make.contigs(ffastq=~/WORKING_DIR/"$accession"_1.fastq, rfastq=~/WORKING_DIR/"$accession"_2.fastq, processors=6, rename=T, checkorient=T);quit()"
				#plotting histograms of contig parameters
				R --quiet --vanilla -f screeningtester_graphics.R --args "$accession"_1
			fi
			#clearing the working directory
			rm "$accession"*contigs* mothur*logfile "$accession"*report* "$accession"*fq
		fi
	#look at next accession
	let "i=i+1"
	done
	
	#move output files into folders
	for i in $(ls SRR*.pdf)
	do
		a=$(grep $(echo $i | sed 's/_.*//') accessions_with_some_info.txt | awk '{print $6}')
		if [ $operation = "trim" ]
		then
			if [ $a = METAGENOMIC ]
			then
				mv $i ./histograms/trimming/metagenomics/$i
			fi
			if [ $a = METATRANSCRIPTOMIC ]
			then
				mv $i ./histograms/trimming/metatranscriptomics/$i
			fi
		else
			if [ $a = METAGENOMIC ]
			then
				mv $i ./histograms/untrimmed/metagenomics/$i
			fi
			if [ $a = METATRANSCRIPTOMIC ]
			then
				mv $i ./histograms/untrimmed/metatranscriptomics/$i
			fi
		fi
	done
fi