# De Novo Genome Assembly
This notebook walks through de novo genome assembly of a small organism.
It assumes you have completed the lectures and notebooks for all Introduction and Language classes, including the Alignment notebook.

## Dataset download
First we need to get our hands on the data for the analysis. In this case, we will download a set of paired-end, Illumina-based reads from the [European Nucleotide Archive (ENA)](https://www.ebi.ac.uk/ena). At the end of the notebook, see if you can figure out what the organism was just from the DNA!

In [None]:
%%bash
mkdir -p raw_denovo && cd raw_denovo && \
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR840/001/SRR8404401/SRR8404401_1.fastq.gz && \
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR840/001/SRR8404401/SRR8404401_2.fastq.gz && cd ..
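
The two download URLs above follow a regular pattern, so they can be built from the run accession. The sketch below is a minimal illustration. The helper name `ena_fastq_urls` is hypothetical, and the directory rule (first six characters, then any digits past the ninth character zero-padded to three) is an assumption inferred from the URLs used here, so check it against the ENA documentation before relying on it for other accessions.

```python
def ena_fastq_urls(accession, paired=True):
    """Build ENA FTP FASTQ URLs for a run accession (layout is an assumption)."""
    prefix = accession[:6]           # e.g. "SRR840"
    extra_digits = accession[9:]     # digits beyond the ninth character, e.g. "1"
    parts = ["ftp://ftp.sra.ebi.ac.uk/vol1/fastq", prefix]
    if extra_digits:
        # Assumed rule: the sub-directory is those digits zero-padded to width 3.
        parts.append(extra_digits.zfill(3))
    parts.append(accession)
    base = "/".join(parts)
    mates = ("_1", "_2") if paired else ("",)
    return [f"{base}/{accession}{mate}.fastq.gz" for mate in mates]


for url in ena_fastq_urls("SRR8404401"):
    print(url)
```

For `SRR8404401` this reproduces the two URLs passed to `wget` above.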

## FASTQ Quality Check
This is the first step in quality checking your samples. To check your FASTQ files, use FastQC, published by the [Babraham Bioinformatics](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) team, as the first-line QC tool.

This tool will evaluate the overall health of your sample. In particular, you will want to look at:
- Total number of sequences
    - Make sure it matches the read count you targeted
- Per base sequencing quality
    - Quality should drop toward the end of the read, especially in 2x300 bp reads. Make sure it doesn't drop too far; ideally more than 90% of bases are at Q30 or above
- GC content
    - Make sure it matches your genome of interest
- %N
    - N is an ambiguous base. You don't want many of these in your sample (if any)
- Sequence Length
    - Make sure the read length you set on the sequencer is what's coming out
- Duplication levels
    - Keep as low as possible, but expect some in any library prepared with PCR
- Adapter content
    - This is driven by your insert size and sequencing length. If you sequence through the entire insert and into the adapter on the other side, you should expect contamination here. It is usually best to avoid that, but if you see some you will need to trim it out.
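
To make a few of these metrics concrete, here is a minimal sketch that computes GC content, %N, and the fraction of bases at Q30 or above from FASTQ records. It is not how FastQC works internally. The function names and the two toy reads are made up for illustration, and it assumes four-line records with Phred+33 quality encoding.

```python
def fastq_records(lines):
    """Yield (sequence, quality string) pairs from four-line FASTQ records."""
    lines = [ln.rstrip("\n") for ln in lines]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i + 1], lines[i + 3]


def qc_summary(records, phred_offset=33):
    """Summarise GC%, N%, and the share of bases with quality >= 30."""
    n_reads = n_bases = gc = n_count = q30 = 0
    for seq, qual in records:
        n_reads += 1
        n_bases += len(seq)
        gc += sum(base in "GCgc" for base in seq)
        n_count += sum(base in "Nn" for base in seq)
        # Phred score = ASCII code minus the encoding offset (33 assumed).
        q30 += sum(ord(ch) - phred_offset >= 30 for ch in qual)
    if n_bases == 0:
        raise ValueError("no bases found")
    return {
        "reads": n_reads,
        "gc_pct": 100 * gc / n_bases,
        "n_pct": 100 * n_count / n_bases,
        "q30_pct": 100 * q30 / n_bases,
    }


# Two invented reads: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
toy_fastq = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "GGCCNNAT", "+", "IIII##II",
]
print(qc_summary(fastq_records(toy_fastq)))
```

For real files you would stream lines from `gzip.open(path, "rt")` instead of a list, but FastQC and MultiQC remain the tools to use in practice.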

In [None]:
%%bash
fastqc raw_denovo/SRR8404401_1.fastq.gz
fastqc raw_denovo/SRR8404401_2.fastq.gz

To view your FastQC output, navigate into the raw_denovo directory (containing the raw files for this assembly) and look for the .html output files. Download those and open them in your browser to explore the reports.

## Read Trimming
Trimming the reads is often a critical step in preparing your samples for assembly. It removes ambiguous bases, low-quality bases and reads, and any adapter content that may be in the reads. For most trimming cases, [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) is a viable tool. Another commonly used tool is [CutAdapt](https://cutadapt.readthedocs.io/en/stable/), but here we will use Trimmomatic.

Parameters used:
- PE
    - Denotes the input as paired-end reads. It is followed by the two input files and then four output files in this order: read 1 paired, read 1 unpaired, read 2 paired, read 2 unpaired
- ILLUMINACLIP
    - Clips off any Illumina adapter sequence found in the reads. The first argument is the FASTA file of adapter sequences. The numbers that follow are the seed mismatches (maximum number of mismatches allowed when identifying an adapter), the palindrome clip threshold (how accurate the match between the two adapter-ligated reads of a pair must be), and the simple clip threshold (how accurate the match between an adapter and a read must be)
- LEADING TRAILING
    - Removes bases from the start and end of the read if they are below the given quality thresholds. "2" is Illumina for "low quality"
- SLIDINGWINDOW
    - Scans the read with a sliding window and takes the average quality of the bases in it. The first number is the window size and the second is the required average quality. Again "2" is Illumina for "low quality", but averaging over a window keeps a single poor base from discarding an otherwise high-quality read.
- MINLEN
    - The minimum length a read must have after trimming to be kept.

Here you'll notice we use a longer minimum read length. That's because this dataset has much longer reads than the alignment dataset (check your FastQC report!), and we traditionally need larger fragments for de novo assembly.
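
The quality-based steps above can be sketched in a few lines. This is a simplified illustration, not Trimmomatic's actual implementation: it cuts the read at the start of the first window whose mean quality is too low, and the function names and toy quality values are invented for the example. Qualities are given as lists of Phred scores.

```python
def sliding_window_trim(seq, quals, window=4, min_avg=2):
    """Cut the read at the first window whose mean quality is below min_avg."""
    if len(seq) < window:
        return seq, quals  # too short to fill a window; left unchanged here
    for start in range(len(seq) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_avg:
            return seq[:start], quals[:start]
    return seq, quals


def trim_read(seq, quals, leading=2, trailing=2, window=4, min_avg=2, minlen=140):
    """Apply LEADING, TRAILING, SLIDINGWINDOW, then MINLEN, in that order."""
    start = 0
    while start < len(quals) and quals[start] < leading:
        start += 1                      # drop low-quality bases at the 5' end
    end = len(quals)
    while end > start and quals[end - 1] < trailing:
        end -= 1                        # drop low-quality bases at the 3' end
    seq, quals = seq[start:end], quals[start:end]
    seq, quals = sliding_window_trim(seq, quals, window, min_avg)
    # MINLEN: discard the read entirely if too little of it survives.
    return (seq, quals) if len(seq) >= minlen else None


toy_quals = [1, 35, 35, 35, 35, 35, 35, 1, 1, 1]
print(trim_read("ACGTACGTAC", toy_quals, min_avg=20, minlen=5))
print(trim_read("ACGTACGTAC", toy_quals, min_avg=20, minlen=140))
```

With the toy values, the first call keeps the six high-quality bases in the middle, while the second returns `None` because the trimmed read is shorter than `minlen`.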

In [None]:
%%bash
trimmomatic PE raw_denovo/SRR8404401_1.fastq.gz raw_denovo/SRR8404401_2.fastq.gz \
            raw_denovo/SRR8404401_1_paired.fastq.gz raw_denovo/SRR8404401_1_unpaired.fastq.gz \
            raw_denovo/SRR8404401_2_paired.fastq.gz raw_denovo/SRR8404401_2_unpaired.fastq.gz \
            ILLUMINACLIP:raw_align/adapters.fasta:2:40:15 \
            LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 \
            MINLEN:140

And then do a QC check on the samples again to see how they are after trimming!

We also introduce a tool called [MultiQC](https://multiqc.info/). This is a great QC aggregation tool. Play around with it a bit, but after running this you can download the multiqc file, open it in a browser, and view things aggregated. 

In [None]:
%%bash
fastqc raw_denovo/SRR8404401_1_paired.fastq.gz
fastqc raw_denovo/SRR8404401_2_paired.fastq.gz
multiqc raw_denovo/.

So far, this has been nearly identical to the Alignment workflow. From here on, we get into the assembly and things start to change.

## SPAdes assembly
Now we get to the assembly itself. For this we will use [SPAdes](http://cab.spbu.ru/software/spades/), an assembler based on de Bruijn graph algorithms that can be used with both short and long reads.
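
To give a feel for the de Bruijn graph idea, here is a toy sketch. It is not SPAdes's implementation, and real assemblers add error handling and graph simplification that this example omits. The function names and the three short reads are invented for illustration. Each k-mer in a read becomes an edge from its first k-1 bases to its last k-1 bases, and a contig is spelled out by walking edges while the path is unambiguous.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from start while each node has one distinct successor."""
    contig, node, seen = start, start, {start}
    while True:
        successors = set(graph.get(node, ()))
        if len(successors) != 1:
            break                       # dead end or branch: stop the contig
        node = successors.pop()
        if node in seen:
            break                       # avoid looping around a cycle
        seen.add(node)
        contig += node[-1]              # each step adds one new base
    return contig


# Three invented, overlapping, error-free reads.
toy_reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]
graph = de_bruijn_graph(toy_reads, k=4)
print(walk_unambiguous_path(graph, "ATG"))
```

Here the walk recovers `ATGGCGTGCA`, the sequence the three reads were cut from. Repeats and sequencing errors create branches in the graph, which is a large part of why assembly is iterative in practice.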

For better or worse, genome assembly has some art to it and is often an iterative process. Therefore, we will run the SPAdes assembler in three different configurations to see how the result changes. The parameters listed below reflect this "experiment" design. Please see the documentation for usage with mate-pair reads, single-end reads, multiple libraries, or hybrid assemblies.

SPAdes natively uses several modules, and the parameters below let you select among them if you wish. The workflow primarily consists of [BayesHammer](http://bioinf.spbau.ru/en/spades/bayeshammer) as an error correction tool, followed by the actual SPAdes assembly, and finally a MismatchCorrector.

Parameters:
- -1
    - The first read of your pair
- -2
    - The second read of your pair
- -o
    - The output directory
- -k
    - k-mer sizes to use. If omitted, SPAdes selects them automatically based on the maximum read length in most cases.
- --cov-cutoff
    - The coverage cutoff to pass assembly. Default is off
- --careful
    - Runs the MismatchCorrector to try to reduce the number of mismatches and short indels. Usually only for small genomes
- --only-assembler
    - SPAdes runs the read error correction module by default. Adding this flag skips it and runs only the assembly.

First, we will run the assembler only, with no error correction.

In [None]:
%%bash
spades.py -1 raw_denovo/SRR8404401_1_paired.fastq.gz -2 raw_denovo/SRR8404401_2_paired.fastq.gz \
          -o spades-default-assembly/ --only-assembler

Next, we run it with the native error correction enabled.

In [None]:
%%bash
spades.py -1 raw_denovo/SRR8404401_1_paired.fastq.gz -2 raw_denovo/SRR8404401_2_paired.fastq.gz \
          -o spades-errorcorrected-assembly/

And lastly, we run it with the mismatch correction included as well.

In [None]:
%%bash
spades.py -1 raw_denovo/SRR8404401_1_paired.fastq.gz -2 raw_denovo/SRR8404401_2_paired.fastq.gz \
          -o spades-careful-assembly/ --careful

## Quast
Now that we have our assemblies, let's take a look at them and see which one is the best to move forward with. A great tool for this is [QUAST](http://quast.sourceforge.net/quast), which calculates a variety of metrics to evaluate your assembly quality.

There are quite a few parameters to review depending on your use case and how you want to visualize the output. I would recommend experimenting with them, but we will cover the basics here.

Parameters:
- -o
    - path to output
- -b
    - Include BUSCO outputs
- -l
    - List of the names for your assemblies (if comparing multiple)
- -r
    - Path to a reference file (if applicable). Will compare the assembly to the reference.
- -g
    - Path to the annotation file (if applicable, usually gff format).
- --gene-finding
    - Enables an automatic gene finder

In [None]:
%%bash
quast.py -o quast_output -b \
    -l "SPAdes-default, SPAdes-error-corrected, SPAdes-careful" \
    spades-default-assembly/contigs.fasta \
    spades-errorcorrected-assembly/contigs.fasta \
    spades-careful-assembly/contigs.fasta

Things to look for in the output:
- Number of contigs
    - Target is as close to the number of chromosomes as possible
- Total length of contigs
    - Target is the total length of the predicted genome
- N50/L50
    - Sort the contigs from longest to shortest and add them up until you reach 50% of the total assembly size. N50 is the length of the last contig added, and L50 is the number of contigs it took. The same idea gives N75 or N90 for 75% and 90% respectively. As your assembly gets closer to completion, you'll want to use the higher thresholds.
- BUSCO %
    - A look at how complete your genome is based on orthologs. Note that this is just a prediction and shouldn't be taken as gospel.
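
The N50/L50 definition is easy to check by hand on a small example. The sketch below is a minimal illustration, not QUAST's code. The helper names and the contig lengths are made up, and on a real assembly you would feed in the lines of `contigs.fasta`.

```python
def fasta_lengths(lines):
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths, current, started = [], 0, False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if started:
                lengths.append(current)
            current, started = 0, True
        elif line:
            current += len(line)
    if started:
        lengths.append(current)
    return lengths


def nx_lx(contig_lengths, fraction=0.5):
    """Return (Nx, Lx): contig length and contig count at the given fraction."""
    lengths = sorted(contig_lengths, reverse=True)
    target = fraction * sum(lengths)
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("no contigs supplied")


toy_lengths = [80, 70, 50, 40, 30, 20, 10]    # invented lengths, total 300
print(nx_lx(toy_lengths, 0.5))    # (70, 2): 80 + 70 reaches 150
print(nx_lx(toy_lengths, 0.75))   # (40, 4): 80 + 70 + 50 + 40 reaches 225
```

A higher N50 with a lower L50 generally means a more contiguous assembly, but contiguity says nothing about correctness, which is why BUSCO and a reference comparison are worth checking too.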

## Prokka annotations
Now that we have our assembled genome, let's take a shot at annotating it. There are several tools to do this, but [Prokka](https://github.com/tseemann/prokka) is generally considered the bacterial standard. Prokka wraps several tools that predict coding regions, rRNA genes, tRNA genes, non-coding RNA, and signal leader peptides. It then annotates the identified coding regions with a stepwise search: first against a curated, user-provided list, then UniProt, then RefSeq, and lastly a set of hidden Markov model databases such as Pfam. Any coding regions that remain unmatched are classified as "hypothetical". As you might expect, there are a variety of parameters here, and once again I would advise reading the documentation for advanced usage. For basic usage:

Parameters:
- --outdir
    - The output directory
- --centre
    - Sequencing center ID. This is required for submission compliance
- --compliant
    - Forces submission compliance for GenBank and ENA. Implies --addgenes --mincontiglen 200
- --kingdom
    - Changes the annotation mode. Default is Bacteria
- --usegenus
    - Uses genus-specific BLAST databases (requires --genus). You can also focus the search with --gram (Gram negative or positive), --proteins (a list of trusted proteins), etc.

In [None]:
%%bash
prokka --outdir prokka_output --centre MAGIC --compliant spades-careful-assembly/contigs.fasta

You'll receive a variety of output files. A few of note:
- .gff file
    - Master Annotation file
- .gbk file
    - Standard Genbank file
- .fna file
    - Nucleotide FASTA
- .txt file
    - Output stat file
- .tsv file
    - Output file of all the features found
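
As a quick way to explore the .tsv file, the sketch below counts features by type and tallies how many coding sequences were left as "hypothetical protein". It is a minimal illustration. The column names (`ftype`, `product`, and the rest of the header) are an assumption based on recent Prokka versions, so check the header of your own file, and the rows shown are invented rather than real output.

```python
import csv
import io
from collections import Counter


def summarize_features(tsv_handle):
    """Count features per type and the number of hypothetical CDS entries."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    by_type = Counter()
    hypothetical = 0
    for row in reader:
        by_type[row["ftype"]] += 1
        if row["ftype"] == "CDS" and row["product"] == "hypothetical protein":
            hypothetical += 1
    return by_type, hypothetical


# Invented rows using the assumed Prokka column layout.
toy_tsv = (
    "locus_tag\tftype\tlength_bp\tgene\tEC_number\tCOG\tproduct\n"
    "DEMO_00001\tCDS\t1350\tdnaA\t\t\tChromosomal replication initiator protein DnaA\n"
    "DEMO_00002\tCDS\t300\t\t\t\thypothetical protein\n"
    "DEMO_00003\ttRNA\t76\t\t\t\ttRNA-Ala\n"
)
counts, n_hypothetical = summarize_features(io.StringIO(toy_tsv))
print(dict(counts), n_hypothetical)
```

On real output you would pass an open file handle for the .tsv in `prokka_output` instead of the `io.StringIO` wrapper.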