# Genome Assembly with Canu

[Canu](https://canu.readthedocs.io/en/latest/quick-start.html) "specializes in assembling PacBio or Oxford Nanopore sequences. Canu operates in three phases: correction, trimming and assembly. The correction phase will improve the accuracy of bases in reads. The trimming phase will trim reads to the portion that appears to be high-quality sequence, removing suspicious regions such as remaining SMRTbell adapter. The assembly phase will order the reads into contigs, generate consensus sequences and create graphs of alternate paths.

For eukaryotic genomes, coverage more than 20x is enough to outperform current hybrid methods, however, between 30x and 60x coverage is the recommended minimum. More coverage will let Canu use longer reads for assembly, which will result in better assemblies."

## What you'll need to run this notebook

1. You will need a set of sequencing reads produced previously (i.e. from running your `ReadQC-with-fastp` notebook). 
2. We recommend running this notebook with at least 32 CPUs and 64GB of RAM - genome assembly is usually computationally intensive. Some programs may take days even on a very powerful machine. 
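
A quick way to see what the current machine offers is sketched below. It assumes a Linux environment (it reads `/proc/meminfo` when available); the variable names `cpus` and `ram_gb` are only illustrative.

```shell
# Number of CPUs visible to this session (nproc, with a POSIX fallback).
cpus=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "CPUs available: $cpus"

# Total RAM in GB, read from /proc/meminfo (Linux only; skipped elsewhere).
ram_gb="unknown"
if [ -r /proc/meminfo ]; then
    ram_gb=$(awk '/^MemTotal:/ { printf "%.0f", $2 / 1024 / 1024 }' /proc/meminfo)
fi
echo "RAM available (GB): $ram_gb"
```

If the numbers are well below 32 CPUs and 64GB, expect the assembly to take longer.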

### Watch the video introduction for a little bit of background on how this assembly tool works

[Canu: Long Read Genome Assembly Tool](https://youtu.be/rb9w9y9e9gs)

## Installing Canu

As always, we will install the software; again we will use conda. 

**Important**: Make sure you execute each numbered step 

1. We will search for the tool we want to install

We will use the `conda search` command and the channel (`-c`) flag to search [bioconda](https://bioconda.github.io/)

In [None]:
conda search canu -c bioconda

2. Create a conda environment

Conda uses "environments", which are isolated configurations on our computer where we can include all the needed compatible tools and exclude other tools that are unnecessary or would conflict with our desired tool. We will use the `-y` option to install without prompting the user for input and the `--name` option to name the environment for the tool. We will enforce versioning (`tool==version`) so that we know what version of a tool was used to do an analysis should we wish to repeat the analysis. 

**Tip**: Use the latest version where possible, but if you get an error with dependencies, using a lower version may help. Some tools may never install successfully using conda, but we will face those when we have to. 

**Bonus tip**: In the installation command below we also specify the [build](https://medium.com/webgentle/what-is-the-software-build-all-you-need-to-know-4046b0e674bb) by adding an additional `=` after the version (2.2) and copying the value from the build column in our search results above. 


In [None]:
conda create -y --name canu canu==2.2=ha47f30e_0 -c bioconda -c conda-forge

3. We will use the `conda init` command so that conda can be configured for this shell

In [None]:
conda init

4. **DON'T SKIP**: We need to restart the computer's [kernel](https://en.wikipedia.org/wiki/Kernel_(operating_system)). Go to the **Kernel** menu and choose **Restart Kernel**.

5. Finally, we can activate the conda environment (using the name we gave it when we created it). When you run the next cell it should return the name of the environment.  

In [None]:
conda activate canu
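
To double-check that the activation worked, something like the following sketch may help. It relies on the `CONDA_DEFAULT_ENV` variable that conda sets on activation and on `canu -version`; the variable name `active_env` is only illustrative.

```shell
# conda sets CONDA_DEFAULT_ENV to the name of the active environment.
active_env="${CONDA_DEFAULT_ENV:-none}"
echo "Active conda environment: $active_env"

# Report the canu version if the program is on PATH.
if command -v canu >/dev/null 2>&1; then
    canu -version
else
    echo "canu is not on PATH; activate the canu environment first"
fi
```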

## Running Canu

Previously, we ran our assembly on a small subset of reads (100 or 1000 reads). Here we may as well run on the entire set of reads from your quality control step. 

**Tip**: When using commands or searching for files, the tab key will help you autocomplete (and help ensure the files and commands you think you have are actually accessible).



## Questions to answer before running this software

It's important you know these answers. You will likely do more than one assembly so that you can tweak the settings and make improvements. You may need to refer to your `fastp` output to answer these. 

1. What is the name of the file containing your cleaned reads?
2. How many reads total do you have in your input? 
3. How many nucleotides do you have in your input? 
4. What is the mean length of your reads?
5. What percentage of reads are Q20 ([phred score](https://en.wikipedia.org/wiki/Phred_quality_score)) or above?
6. What is your estimated coverage?
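
If the `fastp` report is not at hand, questions 2 to 4 can also be recomputed from the reads file. The sketch below builds a tiny, made-up FASTQ file so that it runs anywhere; point `reads` at your own file instead. It assumes standard four-line FASTQ records (sequences not wrapped across lines).

```shell
# Hypothetical toy input; replace "$reads" with the path to your own file.
tmpdir=$(mktemp -d)
reads="$tmpdir/toy_reads.fastq.gz"
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n' | gzip -c > "$reads"

# In four-line FASTQ, the sequence is line 2 of every record.
stats=$(gzip -dc "$reads" | awk '
    NR % 4 == 2 { n++; bases += length($0) }
    END { if (n > 0) printf "%d %d %.1f", n, bases, bases / n }')
echo "reads, total bases, mean length: $stats"

rm -rf "$tmpdir"
```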

To answer question 6, remember that coverage is the number of nucleotides you have divided by the size of the genome (also in nucleotides). For example, if you have 5 gigabases (5 Gb; 5,000,000,000 bases) and your genome is 150 Mb (150,000,000 bases), you have ~33X coverage (33 times the genome size). 
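
The same arithmetic as a small sketch; `total_bases` and `genome_size` are illustrative variable names, so substitute your own numbers.

```shell
total_bases=5000000000   # nucleotides in your input
genome_size=150000000    # estimated genome size in nucleotides

# Coverage = total bases / genome size
coverage=$(awk -v b="$total_bases" -v g="$genome_size" 'BEGIN { printf "%.1f", b / g }')
echo "Estimated coverage: ${coverage}X"
```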

## Notes on Canu

Canu is a "beast" of a program; it is actually several programs for correcting reads, trimming them, and constructing contigs. It can produce very nice assemblies, but it may take a long time (days to a week or more) and use a lot of resources.

 >We don’t have a good way to estimate of disk space used for the assembly. It varies with genome size, repeat content, and sequencing depth. A human genome sequenced with PacBio or Nanopore at 40-50x typically requires 1-2TB of space at the peak. Plants, unfortunately, seem to want a lot of space. 10TB is a reasonable guess. We’ve seen it as bad as 20TB on some very repetitive genomes. ([Canu FAQ](https://canu.readthedocs.io/en/latest/faq.html)). 
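
Since peak disk usage is hard to predict, it may be worth checking free space before you start and again while the assembly runs. This is a minimal sketch using standard `df`; `workdir` is an illustrative variable that should point at wherever your output will be written.

```shell
workdir="."   # assumption: output is written under the current directory

# Column 4 of POSIX-format df output is the available space in KB.
free_kb=$(df -Pk "$workdir" | awk 'NR == 2 { print $4 }')
free_gb=$((free_kb / 1024 / 1024))
echo "Free space under $workdir: ${free_gb} GB"
```

Running `du -sh` on your output directory during the run shows how much the assembly has used so far.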
 
From the [Canu documentation](https://canu.readthedocs.io/en/latest/quick-start.html) here are the steps Canu will undertake: 


- Load reads into the read database, gkpStore.
- Compute k-mer counts in preparation for the overlap computation.
- Compute overlaps.
- Load overlaps into the overlap database, ovlStore.
- Do something interesting with the reads and overlaps.
  - The read correction task will replace the original noisy read sequences with consensus sequences computed from overlapping reads.
  - The read trimming task will use overlapping reads to decide what regions of each read are high-quality sequence, and what regions should be trimmed. After trimming, the single largest high-quality chunk of sequence is retained.
  - The unitig construction task finds sets of overlaps that are consistent, and uses those to place reads into a multialignment layout. The layout is then used to generate a consensus sequence for the unitig.

**When running this program expect that it will take significant time. You may need to come back regularly to extend your time on VICE. You will get warnings by email from CyVerse as you approach your time limit. Go to analyses and select the running analysis for this notebook to extend the time.**


### Canu options

There are many more possible `canu` options than we will play with. Here are the important ones for our exercises. You can see the whole list of options in the [canu parameter reference](https://canu.readthedocs.io/en/latest/parameter-reference.html). 

- `-p`: Specify a prefix name to label all files 
- `-d`: Specify a directory to output files
- `genomeSize=<number>[g|m|k]`: A genome size estimate in billions (g), millions (m), or thousands (k). 
- `maxInputCoverage=<number>`: randomly down-sample input reads to this coverage
- `-nanopore`: Optimizes settings for nanopore data

The type of read (i.e. Nanopore) should be followed with the path to the reads to assemble. 
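
To connect the `genomeSize` suffixes to the coverage calculation, the sketch below expands a value such as `150m` into a plain number of bases. `size_to_bases` is a hypothetical helper written for this notebook, not part of Canu.

```shell
# Hypothetical helper: expand a size like 150m, 1.2g or 500k into bases.
size_to_bases() {
    awk -v s="$1" 'BEGIN {
        unit = tolower(substr(s, length(s), 1))
        mult = 1
        if (unit == "g") mult = 1e9
        else if (unit == "m") mult = 1e6
        else if (unit == "k") mult = 1e3
        # Drop the suffix letter only when one was present.
        num = (mult == 1) ? s : substr(s, 1, length(s) - 1)
        printf "%.0f\n", num * mult
    }'
}

size_to_bases 150m
```

The result is the genome size you would divide your total bases by when estimating coverage.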

**Tip**: Use the help command to see all options

In [None]:
canu --help

### Example 1 (optional) - running canu on `spolyrhiza_reads_filtered.fastq.gz`

The file `spolyrhiza_reads_filtered.fastq.gz` was generated in advance using the raw reads (`spolyrhiza_reads.fastq.gz`) cleaned and filtered with `fastp`. The resulting data set has the following characteristics: 

- Total reads: 1.06 million
- Total bases: 6.57 billion 
- Q20 bases: 4.39 billion
- Mean read length: 6149bp

If you like, you can run `canu` on this dataset (since we know it works and will generate an assembly). The following command is a good example to show you how to run `canu`. You can modify it to run with your own input file. 

**Tip**: The *S.polyrhiza* genome size is about 150 million bases
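
Plugging the numbers for this dataset into the coverage formula gives a feel for what Canu has to work with. The sketch below is illustrative only; the variable names are placeholders, and the comparison against `maxInputCoverage` assumes down-sampling only happens when the input exceeds that value.

```shell
total_bases=6570000000     # 6.57 billion bases
q20_bases=4390000000       # 4.39 billion Q20 bases
genome_size=150000000      # about 150 million bases
max_input_coverage=100     # value used for maxInputCoverage in this example

q20_pct=$(awk -v q="$q20_bases" -v t="$total_bases" 'BEGIN { printf "%.1f", 100 * q / t }')
coverage=$(awk -v t="$total_bases" -v g="$genome_size" 'BEGIN { printf "%.1f", t / g }')

echo "Q20 bases: ${q20_pct}%"
echo "Estimated coverage: ${coverage}X"

# Compare the estimate with the down-sampling threshold.
below_limit=$(awk -v c="$coverage" -v m="$max_input_coverage" 'BEGIN { print (c <= m) ? "yes" : "no" }')
echo "Coverage at or below maxInputCoverage: $below_limit"
```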

1. Since we will have a lot of results, let's first make an output directory to keep the results organized.

In [None]:
mkdir -p data/output/canu_example_assembly

2. Canu runs its three phases (correction, trimming, and assembly) from a single command, so there is only one command to run. 

Run the command below to assemble (this will take **several** hours).

**IMPORTANT - how to know when this is finished**: The [\*] you see next to a cell indicates that a program is running. If you see this, the program is still running (even though a given stage of the program may say "DONE"). 


In [None]:
canu -p example_spolyrhiza \
     -d data/output/canu_example_assembly \
     genomeSize=150m \
     maxInputCoverage=100 \
     -nanopore \
     /data/input/concat_fastq/spolyrhiza_reads_filtered.fastq.gz

-- canu 2.2
--
-- CITATIONS
--
-- For 'standard' assemblies of PacBio or Nanopore reads:
--   Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
--   Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
--   Genome Res. 2017 May;27(5):722-736.
--   http://doi.org/10.1101/gr.215087.116
-- 
-- Read and contig alignments during correction and consensus use:
--   Šošic M, Šikic M.
--   Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.

### Results

Like most assemblers, `canu` generates several intermediate outputs at each step. In the output folder, you will see several folders of results. The final assembly file will be called `<YOUR-PREFIX>.contigs.fasta` (`<YOUR-PREFIX>` is whatever you specified with `-p` in the command you used to start canu). 
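
Once the contigs file exists, a quick summary (number of contigs, total length, longest contig) is a reasonable first look at the result. The sketch below uses a tiny, made-up FASTA file so that it runs anywhere; set `contigs` to your own contigs file instead.

```shell
# Hypothetical toy FASTA; replace "$contigs" with your own contigs file.
tmpdir=$(mktemp -d)
contigs="$tmpdir/toy.contigs.fasta"
printf '>tig1\nACGTACGTAC\nACGT\n>tig2\nACGTAC\n' > "$contigs"

# Header lines start a new contig; other lines add to its length.
summary=$(awk '
    /^>/ { if (len > max) max = len; len = 0; n++; next }
    { len += length($0); total += length($0) }
    END { if (len > max) max = len; printf "%d %d %d", n, total, max }' "$contigs")
echo "contigs, total length, longest contig: $summary"

rm -rf "$tmpdir"
```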

## Challenge - Use Canu to assemble your reads

Now, it's up to you to use  `canu` to generate your own genome assembly using the reads generated by fastp. 


### What to do

1. Use the `mkdir` command to make a unique output directory to save your results (e.g. canu_assembly_1, canu_assembly_2). Make a new folder each time you do a new assembly. 

In [None]:
mkdir 
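
One possible pattern for keeping each assembly in its own folder is to pick the next unused number automatically. This is only a sketch; the `canu_assembly` base name follows the suggestion above, and `outdir` is an illustrative variable.

```shell
base="canu_assembly"
n=1
# Find the first number that is not already taken.
while [ -e "${base}_${n}" ]; do
    n=$((n + 1))
done
outdir="${base}_${n}"
mkdir -p "$outdir"
echo "Created $outdir"
```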

2. Complete the command below to try an assembly using the same parameters as the example above. You will have to set the prefix name (`-p`) to something of your choice, specify the output directory (`-d`), and fill in `genomeSize`. Canu will automatically use the available CPU threads; the CPU number was specified when you launched the application in the Discovery Environment. 

**Note about completing the cells below**
- Enter your options after the flag (e.g. `-p`) and before the `\`
- Leave a space between what you write and the `\`
- Look closely how the working commands in the example are written if you have difficulties
- Look closely for errors (unable to find file,  unable to write file, etc.)
- Don't forget to end the command with the path to your reads. 
- Ask for help if you can't get something to work

In [None]:
canu -p  \
     -d  \
     genomeSize= \
     maxInputCoverage=100 \
     -nanopore \
     

3. Things to change. There are many things you can change which may improve (or worsen) the quality of your assembly. For now, it does not make sense to try too many things. Since canu does its own read correction, an interesting thing to try would be running it with unfiltered/uncorrected reads vs. reads you filtered with `fastp`. 




## Try something

Generate another assembly to see what parameters make a difference. 

Possible changes to try

- You were asked to create two different `fastp` outputs, try them both with the same settings
- Try canu on reads that were unfiltered by `fastp`. 

Add as many cells as you need to try alternative assemblies. You should also try at least one other assembler. 

## Document your work

We need to keep good track of what changes were made and how each file was produced so that we can fully document our work. In this exercise, it will be critical to know your settings so we can compare results across everyone who does this experiment. You will also be able to go back and reproduce your work if needed. 
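
One lightweight way to do this is to write the settings for each run into a small text file inside the output directory. The sketch below is only a suggestion; the file name `run_settings.txt` and all values shown are placeholders to replace with your own.

```shell
outdir="canu_assembly_1"              # hypothetical output directory (-d)
prefix="my_assembly"                  # hypothetical prefix (-p)
genome_size="150m"                    # value given to genomeSize
reads="path/to/your_reads.fastq.gz"   # placeholder path to your reads

mkdir -p "$outdir"
log="$outdir/run_settings.txt"
{
    echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "prefix: $prefix"
    echo "output_dir: $outdir"
    echo "genomeSize: $genome_size"
    echo "maxInputCoverage: 100"
    echo "reads: $reads"
    # Record the canu version when the program is available.
    if command -v canu >/dev/null 2>&1; then
        echo "canu_version: $(canu -version 2>&1)"
    fi
} > "$log"
cat "$log"
```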

**Make sure to save a copy of this notebook**

When you terminate your application in CyVerse the results and data should be written back. You can also select this notebook in the file browser and choose Save and Export Notebook As (HTML) to save an easy-to-read version you can view anytime. 