# Assembly Quality Control with QUAST

[QUality ASsessment Tool](http://quast.sourceforge.net/docs/manual.html#sec2) (QUAST) is a tool that generates statistics and quality information about a genome assembly. It can help you understand how well your assembly was done: how complete is it? There are a number of ways to ask this question. You can also examine multiple assemblies and compare them. If there is a reference assembly (i.e., an assembly that was generated and published by researchers elsewhere), you can compare your assembly with the accepted standard (did you do as well, better, or worse?).

## What you'll need to run this notebook

1. You will need one or more genome assemblies (FASTA files: .fa, .fasta).
2. S. polyrhiza has a reference genome; we will use this as a comparison.
3. We recommend running this notebook with at least 16 CPUs and 64 GB of RAM.

### Watch the video introduction for a little bit of background on how this assembly QC tool works

[Compare Multiple assemblies using QUAST](https://youtu.be/rLsEH2XIuIE)

## Installing QUAST

As always, we will install the software; again we will use conda. 

**Important**: Make sure you execute each numbered step 

1. We will search for the tool we want to install

We will use the `conda search` command and the channel (`-c`) flag to search [bioconda](https://bioconda.github.io/)

In [None]:
conda search quast -c bioconda

There is another tool that creates a [circos plot](http://circos.ca/); this does not come with QUAST so we'll need that too. 

In [None]:
conda search circos -c bioconda

2. Create a conda environment

Conda uses something called "environments", which are essentially isolated configurations on our computer where we can include all the needed compatible tools and exclude other tools that are unnecessary or would conflict with our desired tool. We will use the `-y` option to install without prompting the user for input, and the `--name` option to name the environment for the tool. We will enforce versioning (`tool==version`) so that we know what version of a tool was used to do an analysis, should we wish to repeat the analysis.

**Tip**: Use the latest version where possible, but if you get an error with dependencies, using a lower version may help. Some tools may never be installed successfully using conda, but we will deal with those when we have to.

**Bonus tip**: In the installation command below we also specify the [build](https://medium.com/webgentle/what-is-the-software-build-all-you-need-to-know-4046b0e674bb) by adding an additional `=` after the version (5.0.2) and copying the value from the build column of our search results above. In this case we are specifying a build compatible with Python 3.7 (a more up-to-date version than currently installed). We also add the `-c` flag to tell conda that it can search for all the software [dependencies](https://askubuntu.com/questions/361741/what-are-dependencies) in both the bioconda and the conda-forge software repositories. We will simultaneously install the Circos package.


In [None]:
conda create -y --name quast quast==5.0.2=py37pl5321hfecc14a_6 circos==0.69.8=hdfd78af_1 -c bioconda -c conda-forge

3. We will use the `conda init` command so that conda can be configured for this shell

In [None]:
conda init

4. **DON'T SKIP**: We need to restart the notebook's kernel. Go to the **Kernel** menu and choose **Restart Kernel**.

5. Finally, we can activate the conda environment (using the name we gave it when it was created). When you run the next cell it should return the name of the environment.

In [None]:
conda activate quast

## Running QUAST

We have example assemblies from Flye, Redbean, and Canu, as well as a reference for this species. You can use them in comparison with your own.

**Tip**: When using commands or searching for files, the tab key will help you autocomplete (and help ensure the files and commands you think you have are actually accessible).



## Questions to think about before running this software

There are a number of terms to be familiar with before you examine the QUAST reports. Wikipedia and YouTube will be your friend here: 

1. What is a [contig](https://en.wikipedia.org/wiki/Contig)?
2. What is the N50 metric? (See [Different Assembly statistics (N50, L50, NG50, LG50, NA50, NGA50 and Misassemblies)](https://youtu.be/ViXzKrQo25k))
3. What is a [reference genome](https://en.wikipedia.org/wiki/Reference_genome)?
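
To make the N50 question more concrete, here is a minimal sketch that computes N50 from a list of contig lengths. The lengths are made up for illustration, and `n50` is a hypothetical helper written for this notebook, not a QUAST command. It sorts the contigs from longest to shortest and reports the length of the contig at which the running total first reaches half of the total assembly length.

```shell
# Hypothetical helper: compute N50 from contig lengths given as arguments.
n50() {
  printf '%s\n' "$@" | sort -rn | awk '
    { len[NR] = $1; total += $1 }      # store lengths (longest first) and sum them
    END {
      running = 0
      for (i = 1; i <= NR; i++) {
        running += len[i]
        if (running * 2 >= total) {    # reached at least half of the assembly
          print len[i]
          exit
        }
      }
    }'
}

# Made-up contig lengths (in bp) for illustration
n50 80 70 50 40 30 20 10
```

Here the total length is 300, and the two longest contigs (80 + 70) already cover half of it, so the sketch prints 70. QUAST reports this value for you; the sketch is only meant to show what the number means.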


### Setting up BUSCO

Before we start the QUAST tool, we will also download the [BUSCO](https://busco.ezlab.org/) dataset. One of the ways we can determine if we have a complete assembly is to search for genes we would expect to be present. For different forms of life (e.g., bacteria, animals, etc.) there are genes we would expect to be present. The percentage of those genes that are detected in your assembly is a clue to its quality.  

As suggested in the installation output (the `conda create` command above), we should run another QUAST command to install some additional tools. This has not always worked, so we have an optional command below in case you get error messages. For this job, we need Augustus (software for gene prediction) and the BUSCO eukaryote library to install. We can ignore other failed installations. Perhaps an updated QUAST version will fix this.

In [None]:
# first try this
quast-download-busco

In [None]:
# If you get "ERROR! Failed downloading eukaryota database" run this cell

wget https://data.cyverse.org/dav-anon/iplant/projects/cyverse_training/classrooms/dnalc_genome_camp/jupyter_notebooks/assembly_qc/busco/eukaryota_odb9.tar.gz \
 --output-document /opt/anaconda3/envs/quast/lib/python3.7/site-packages/quast_libs/busco/eukaryota.tar.gz

### Example 1 (optional) - running QUAST on example assemblies

We have selected two draft genome assemblies you can examine:

- `spolyrhiza_wtdbg.ctg.fa`; assembled by Redbean
- `spolyrhiza_canu.contigs.fasta`; assembled by Canu

We can examine these using QUAST and you can add your own genome assembly to the comparison. 

1. Since we will have a lot of results, let's first make an output directory to keep them organized

In [None]:
mkdir -p quast_example_QC

2. Since there is a draft assembly (.faa, another extension for a FASTA file) for another S. polyrhiza strain, we can download and use that in our comparisons. We will also get the gene annotations for this assembly (.gff file).


In [None]:
#Reference genome
wget https://data.cyverse.org/dav-anon/iplant/projects/cyverse_training/classrooms/dnalc_genome_camp/jupyter_notebooks/assembly_qc/busco/Spirodela_polyrhiza_strain_7498.faa

#Reference annotations
wget https://data.cyverse.org/dav-anon/iplant/projects/cyverse_training/classrooms/dnalc_genome_camp/jupyter_notebooks/assembly_qc/busco/Spirodela_polyrhiza_strain_7498.gff

We'll use the following options from the [QUAST manual](http://quast.sourceforge.net/docs/manual.html#sec2.2):

- `-o` - Specify the output directory
- `-r` - Specify the reference genome
- `--features` - Specify the gene locations and names from a gene annotation file
- `--threads` - Specify number of threads
- `--eukaryote` - Indicate data is eukaryotic
- `--circos` - Create circos plot
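- `--labels` - A list of labels for the assemblies - ensure these are in the order you enter your assembly paths
- `--no-sv` - Skip structural variant calling (only meaningful when reads are supplied)
- `--no-snps` - Do not report SNP statistics (reduces memory use and run time on large genomes)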

**Note**: QUAST is done when you get the **Thank you for using QUAST!** message. 

In [None]:
quast -o quast_example_QC \
 --no-sv \
 --no-snps \
 --circos \
 -r Spirodela_polyrhiza_strain_7498.faa \
 --features Spirodela_polyrhiza_strain_7498.gff \
 --threads 40 \
 --labels "Redbean example, Canu example" \
 spolyrhiza_wtdbg.ctg.fa spolyrhiza_canu.contigs.fasta
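
Because `--labels` is matched to the assembly files by position, it is easy to mislabel a result when the list grows. One possible way to guard against this, sketched below, is to write each label next to its file and build both arguments from the same list. The labels and the `label|file` layout are assumptions made for illustration; the sketch only prints the arguments and does not run QUAST.

```shell
# Build the --labels value and the file list from one paired list,
# so the two can never get out of order. Format: label|file (hypothetical layout).
labels=""
files=""
while IFS='|' read -r label file; do
  labels="${labels:+$labels, }$label"   # comma-separated labels
  files="${files:+$files }$file"        # space-separated file names
done <<'EOF'
wtdbg2 assembly|spolyrhiza_wtdbg.ctg.fa
Canu assembly|spolyrhiza_canu.contigs.fasta
EOF

# Print the arguments for inspection instead of running QUAST
echo "--labels \"$labels\" $files"
```

You could then paste the printed text onto the end of your `quast` command. This simple version assumes file names without spaces.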


3. We can copy the results to our output folder

In [None]:
cp -r quast_example_QC data/output

4. In your Jupyter file browser, navigate to the `quast_example_QC` folder to see your results. You can view some of the reports in this Jupyter environment (i.e., graphs and text files). The HTML reports will not render completely, so to view them you will need to terminate this analysis and then open your output folder in the CyVerse Data Store.

## Results

The [QUAST manual](http://quast.sourceforge.net/docs/manual.html#sec2.2) goes through each report in detail, 
but here are some highlights to look for:

- `basic_stats/cumulative_plot.pdf` - shows how close your assembly is to the reference genome (i.e. how many contigs before all/most of the reference genome are covered)
- `basic_stats/GC_content_plot.pdf` - shows how the proportion of G and C nucleotides compares between assemblies
- `circos/circos.png` - shows a [circos plot](http://circos.ca/intro/genomic_data/) comparing the genome assemblies to a reference
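
To get a feel for what GC content measures, the sketch below computes the percentage of G and C bases for each sequence in a FASTA file. It is a simplified illustration, not how QUAST builds its plot: `gc_content` is a hypothetical helper, the FASTA file is a tiny made-up example, and the result is one number per sequence rather than a distribution.

```shell
# Hypothetical helper: percent GC for each sequence in a FASTA file.
gc_content() {
  awk '
    function report() {
      if (name != "") printf "%s\t%.1f\n", name, (total > 0 ? 100 * gc / total : 0)
    }
    /^>/ {                            # a header line starts a new sequence
      report()
      name = substr($1, 2); gc = 0; total = 0
      next
    }
    {
      seq = toupper($0)
      total += length(seq)
      gc += gsub(/[GC]/, "", seq)     # gsub returns how many G or C characters it replaced
    }
    END { report() }
  ' "$1"
}

# Tiny made-up FASTA file for illustration
toy_fasta=$(mktemp)
printf '>contig1\nGGCC\nAATT\n>contig2\nGCGCGCAT\n' > "$toy_fasta"
gc_content "$toy_fasta"
```

To try it on a real assembly, pass the assembly file name instead of the toy file.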

## Challenge - Use QUAST with your assembly

Now it's up to you to use `QUAST` to evaluate the assemblies you have made.


### What to do

1. Use the `mkdir` command to make a unique output directory to save your results (e.g. QUAST_my_assembly). 

In [None]:
mkdir 

2. Complete the command below to evaluate your assembly using the same parameters as the example above.

- For the labels, inside the quotes enter a title for each of the assemblies you'd like to compare to the reference, separated by commas
- On the last line of the command, enter the file name of each of your assemblies (separated by a space if there is more than one). The order entered should match what you have put in the labels option.
- You can consult the [QUAST manual](http://quast.sourceforge.net/docs/manual.html#sec2.2) to see if there are other changes and/or reports you'd like to try. 


In [None]:
quast -o QUAST_my_assembly \
 --no-sv \
 --no-snps \
 --circos \
 -r Spirodela_polyrhiza_strain_7498.faa \
 --features Spirodela_polyrhiza_strain_7498.gff \
 --threads 40 \
 --labels "YOUR LABELS" \
 file1.fasta 

## Document your work

We need to keep good track of what changes were made and how each file was produced so that we can fully document our work. In this exercise, it will be critical to know your settings so we can compare results across everyone who does this experiment. You will also be able to go back and reproduce your work if needed.
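
One lightweight way to keep track of settings, in addition to saving the notebook, is to append each command you run to a plain-text log. The sketch below is only an illustration: `log_command` and the log file name are hypothetical, and the helper records the command text without running it.

```shell
# Hypothetical log file name; change it to suit your project
LOG_FILE="quast_commands.log"

# Hypothetical helper: append a timestamp and the given command text to the log
log_command() {
  printf '%s\t%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$LOG_FILE"
}

# Record the settings you used (this only writes text; it does not run QUAST)
log_command quast -o QUAST_my_assembly --threads 40 --labels "My assembly" file1.fasta

# Show what has been recorded so far
cat "$LOG_FILE"
```

Note that the quotes around the label are not preserved in the log, so treat the entry as a record of your settings rather than a command to paste back in verbatim.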

**Make sure to save a copy of this notebook**

When you terminate your application in CyVerse the results and data should be written back. You can also select this notebook in the file browser and choose Save and Export Notebook As (HTML) to save an easy-to-read version you can view anytime. 

1. Copy your QUAST results folder to `data/output`

In [None]:
cp -r 

2. Save and then copy your notebook to `data/output`

In [None]:
cp assembly_qc_quast.ipynb data/output