## Quality control
To decide whether a dataset is worth processing, first assess its quality. The following command is run on the raw reads from the Illumina sequencer:

In [1]:

%%bash
# %%bash makes Jupyter run this cell as a shell script; without it, the
# Python kernel rejects the command with "SyntaxError: invalid syntax".
fastqc ../data/working/*.fastq \
    --outdir=../data/results

This does what it needs to, but it dumps all of its output in one place without any organisation. I would like to add a step that collects everything into one tidy summary. The next step is to trim the reads using Trim Galore:

In [3]:
#TODO: the Trim Galore script still needs to be written
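Until that script exists, a paired-end Trim Galore call could look roughly like the sketch below. The input file names (`<sample>_R1_001.fastq` and `<sample>_R2_001.fastq`), the sample list and the `trim_pair` helper are assumptions made for illustration. They were chosen so that the outputs would be named `<sample>_R1_001_val_1.fq` and `<sample>_R2_001_val_2.fq`, which is what the host-removal step further down expects. By default the sketch only prints the commands.

```shell
# Assumptions (not from the notebook): input files are named
# <sample>_R1_001.fastq / <sample>_R2_001.fastq and live in READSPATH.
READSPATH=../data/working
OUTPATH=../data/working

# DRY_RUN defaults to "echo", so the commands are only printed.
# Run with DRY_RUN= (set but empty) to execute them for real.
DRY_RUN=${DRY_RUN-echo}

# Hypothetical helper: trims one pair of read files.
trim_pair() {
    $DRY_RUN trim_galore --paired --output_dir "$OUTPATH" \
        "$READSPATH/${1}_R1_001.fastq" \
        "$READSPATH/${1}_R2_001.fastq"
}

for f in Coral1 Coral2 Coral3 Coral4
do
    trim_pair "$f" || { echo "Trimming failed for $f, exiting." >&2; exit 1; }
done
```

Printing the commands first makes it easy to check the paths before committing to a long run.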

In [None]:
#this is an additional filtering step you can take
iu-gen-configs ../data/working/cavernosa_samples.txt -o ../data/working/quality_configs
iu-filter-quality-minoche ../data/working/quality_configs/Coral1.ini
iu-filter-quality-minoche ../data/working/quality_configs/Coral2.ini
iu-filter-quality-minoche ../data/working/quality_configs/Coral3.ini
iu-filter-quality-minoche ../data/working/quality_configs/Coral4.ini

Your samples file needs to be tab-delimited, with the sample name in the first column, the path to the corresponding R1 file in the second, and the path to the corresponding R2 file in the third. THE FIRST ROW MUST BE 'Sample', 'r1' AND 'r2', OR IT WON'T DO ANYTHING.

Aside from that, the software refuses to run if the file name it wants to write to is already taken, so that it cannot overwrite important data.
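Because a malformed samples file fails silently, it may help to generate and check it from the shell. The sketch below is only an illustration: the sample names, the read paths and the `check_samples_file` helper are made up, the header spelling is taken from the description above, and the file is written to a temporary location rather than to `../data/working/cavernosa_samples.txt`.

```shell
# Hypothetical sample names and read paths; replace them with your own.
SAMPLES_FILE=$(mktemp)

# printf with \t guarantees real tab characters between the columns.
printf 'Sample\tr1\tr2\n' > "$SAMPLES_FILE"
for f in Coral1 Coral2
do
    printf '%s\t%s\t%s\n' "$f" "reads/${f}_R1.fastq" "reads/${f}_R2.fastq" >> "$SAMPLES_FILE"
done

# Hypothetical helper: succeeds only when the header matches and
# every row has exactly three tab-separated columns.
check_samples_file() {
    awk -F '\t' '
        NR == 1 && ($1 != "Sample" || $2 != "r1" || $3 != "r2") { bad = 1 }
        NF != 3 { bad = 1 }
        END { exit bad }
    ' "$1"
}

check_samples_file "$SAMPLES_FILE" && echo "samples file looks fine"
```

A file separated by spaces instead of tabs, or one with a different header, makes the check fail before `iu-gen-configs` is ever called.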

### Read normalisation


In [None]:
#set parameters
SAMPLENAME=
READSPATH=
SCRIPTPATH=
OUTPATH=
READNAME1=read_name_without_variable_part_R1
READNAME2=read_name_without_variable_part_R2

#check paths


#normalise reads in a loop
for f in Coral1 Coral2 Coral4 Coral5
do
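#note: normalising R1 and R2 in separate calls can leave the pairs out of sync; bbnorm.sh also accepts in2= and out2= to process both files in one call
"$SCRIPTPATH"/bbnorm.sh in="$READSPATH"/"$f"_"$READNAME1".fq out="$OUTPATH"/"$f"_norm_R1.fq target=70 min=5 || { echo 'Forward reads not normalised, exiting.'; exit 1; }
"$SCRIPTPATH"/bbnorm.sh in="$READSPATH"/"$f"_"$READNAME2".fq out="$OUTPATH"/"$f"_norm_R2.fq target=70 min=5 || { echo 'Reverse reads not normalised, exiting.'; exit 1; }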
done
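The `#check paths` placeholder in the cell above is still empty. One possible way to fill it is sketched below. The `check_paths` helper is hypothetical, and it assumes that the reads, script and output locations are all directories and that `bbnorm.sh` sits directly inside the script directory.

```shell
# Hypothetical helper: returns non-zero (and says why) if any directory is
# missing or if bbnorm.sh cannot be executed from the script directory.
check_paths() {
    reads=$1; scripts=$2; out=$3
    status=0
    for d in "$reads" "$scripts" "$out"
    do
        if [ ! -d "$d" ]; then
            echo "Missing directory: '$d'" >&2
            status=1
        fi
    done
    if [ ! -x "$scripts/bbnorm.sh" ]; then
        echo "bbnorm.sh not found or not executable in '$scripts'" >&2
        status=1
    fi
    return $status
}

# In the normalisation cell this would be called before the loop as:
#   check_paths "$READSPATH" "$SCRIPTPATH" "$OUTPATH" || exit 1
```

With the parameters still left empty, the check fails straight away instead of letting the loop run against paths that do not exist.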

### Read filtering
To remove host reads, I used bowtie2-build to create an index of the host genome that was available, then aligned the reads against that index using bowtie2. Mapped and unmapped reads are then separated using samtools, which can also write them to different files and convert them back into fastq files.

In [None]:
#set parameters:
SAMPLENAME=
GENOME=
READSPATH=/path/to/reads/
INDEX="$GENOME"_DB

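#this is how you build a bowtie2 index from a known genome
#usage: bowtie2-build path/to/input index-name
#the index is written to ../data/working so that the bowtie2 call below can find it
bowtie2-build ../data/working/M.cavernosa.fasta ../data/working/"$INDEX"

#replace the <sample> placeholders below with your own sample names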
for f in <sample1> <sample2> <sample3> <sample4>
do
bowtie2 -p 8 -x ../data/working/$INDEX -1 "$READSPATH"/"$f"_R1_001_val_1.fq -2 "$READSPATH"/"$f"_R2_001_val_2.fq -S ../data/working/"$f"_mapped_and_unmapped.sam
#this aligns your reads against the host genome index

samtools view -bS ../data/working/"$f"_mapped_and_unmapped.sam > ../data/working/"$f"_mapped_and_unmapped.bam
#this converts the sam file from bowtie to a bam file for processing

samtools view -b -f 12 -F 256 ../data/working/"$f"_mapped_and_unmapped.bam > ../data/working/"$f"_bothReadsUnmapped.bam
#this keeps only the pairs in which neither read maps to the host genome (-f 12) and drops secondary alignments (-F 256)

samtools sort -n -m 5G -@ 2 ../data/working/"$f"_bothReadsUnmapped.bam -o ../data/working/"$f"_bothReadsUnmapped_sorted.bam
samtools fastq -@ 8 ../data/working/"$f"_bothReadsUnmapped_sorted.bam \
    -1 "$f"_host_removed_R1.fastq \
    -2 "$f"_host_removed_R2.fastq \
    -0 /dev/null -s /dev/null -n

#this sorts the file by read name so both mates are together and then extracts them back as .fastq files
done
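The values passed to `-f` and `-F` in the cell above are sums of SAM flag bits: 12 is 4 (read unmapped) plus 8 (mate unmapped), and 256 marks a secondary alignment. The sketch below uses plain shell arithmetic to show how a flag value is tested against them. The helper names `explain_flag` and `passes_filter` are made up for illustration, and only the three bits used in this step are covered.

```shell
# Prints the meaning of the bits set in a SAM flag value
# (limited to the bits relevant to this filtering step).
explain_flag() {
    flag=$1
    [ $((flag & 4)) -ne 0 ]   && echo "4: read unmapped"
    [ $((flag & 8)) -ne 0 ]   && echo "8: mate unmapped"
    [ $((flag & 256)) -ne 0 ] && echo "256: secondary alignment"
    return 0
}

# Mirrors "samtools view -f 12 -F 256": both bits of 12 must be set,
# and the 256 bit must not be set.
passes_filter() {
    flag=$1
    [ $((flag & 12)) -eq 12 ] && [ $((flag & 256)) -eq 0 ]
}

explain_flag 12
passes_filter 77 && echo "flag 77 is kept"
passes_filter 73 || echo "flag 73 is dropped"
```

Flag 77 (paired, read unmapped, mate unmapped, first in pair) passes the filter, while flag 73 (paired, mate unmapped, first in pair) does not, because the read itself mapped to the host.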

This leaves you with trimmed, filtered reads. After this, you should run FastQC again. Be mindful to change its output directory and logging location, OR IT WILL OVERWRITE THE EARLIER RESULTS.

Now comes the first step towards a metagenomic workflow: combining your read files. Mark this well: for the result to be scientifically meaningful, you need to combine samples that are connected in some way:
* A time series of samples
* A series of samples from the same habitat or species
* The same sample, taken in different locations

To practise, it's perfectly okay to combine your samples at random. To find something interesting, do as above!

In [None]:
#The following code will combine your read files into large files that can be used for de novo co-assembly
cat *_R1.fastq > reads_R1_ALL.fastq
cat *_R2.fastq > reads_R2_ALL.fastq
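Since the choice of samples matters, an explicit sample list may be safer than a wildcard that picks up every file in the directory. The sketch below is one way to do that. The `combine_reads` helper and its arguments are hypothetical, and it assumes input files named `<sample>_R1.fastq` and `<sample>_R2.fastq`. It appends the forward and reverse files in the same sample order and then checks that both combined files have the same number of lines.

```shell
# Hypothetical helper: concatenates the R1 and R2 files of the listed
# samples in the same order.
# Usage: combine_reads IN_DIR OUT_PREFIX SAMPLE [SAMPLE...]
combine_reads() {
    in_dir=$1; prefix=$2
    shift 2
    # Start from empty output files (existing ones are overwritten).
    : > "${prefix}_R1_ALL.fastq"
    : > "${prefix}_R2_ALL.fastq"
    for s in "$@"
    do
        cat "$in_dir/${s}_R1.fastq" >> "${prefix}_R1_ALL.fastq" || return 1
        cat "$in_dir/${s}_R2.fastq" >> "${prefix}_R2_ALL.fastq" || return 1
    done
    # Both files should hold the same number of lines (4 per read).
    n1=$(wc -l < "${prefix}_R1_ALL.fastq")
    n2=$(wc -l < "${prefix}_R2_ALL.fastq")
    [ "$n1" -eq "$n2" ]
}

# Example call, using the host-removed files written to the current directory:
#   combine_reads . reads Coral1_host_removed Coral2_host_removed
```

A non-zero return value means that a file was missing or that the forward and reverse files ended up with different line counts, which is worth resolving before assembly.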

Now you are ready to proceed with the de novo co-assembly of your samples.