# <span style="color:#006E7F">__Introduction to Oxford Nanopore Data Analysis__ <a class="anchor"></span>  


Created by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and G. Sarah (AGAP-INRAE) - September 2021, SouthGreen training

Adapted by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) - November 2022

Adapted by J. Orjuela (DIADE-IRD) - May 2023


# <span style="color:#006E7F">__TP2 - Assembly and correction__ <a class="anchor" id="data"></span>  
    
## <span style="color: #4CACBC;"> 1. Assemblies </span>  

Highly contiguous genome assemblies can be generated from long sequencing reads, but assembly is not a trivial task. Eukaryotic genome assembly is complex (large genome size, high proportion of repeated sequences, high heterozygosity and even polyploidy). While prokaryotic genomes may appear less challenging, specific features such as circular DNA molecules must be taken into consideration to achieve a high-quality assembly.

* For assembly, ONT recommends sequencing a human genome to a minimum depth of 30x with reads of 25–35 kb.
However, sequencing to a depth of 60x (min 20x) is advisable to obtain the best assembly metrics. ONT also recommends basecalling in high accuracy/sup/Dorado mode. The greatest contig N50 is usually obtained with Shasta and Flye. Polishing/correction is also recommended (Medaka). https://nanoporetech.com/sites/default/files/s3/literature/human-genome-assembly-workflow.pdf


* Long reads simplify genome assembly, with the ability to span repeat-rich sequences (characteristic of  antimicrobial resistance genes) and structural variants. Nanopore sequencing also shows a lack of bias in GC-rich regions, in contrast to other sequencing platforms. To perform microbial genome assembly, we suggest using the third-party *de novo* assembly tool Flye. We also recommend one round of polishing with Medaka. https://nanoporetech.com/sites/default/files/s3/literature/microbial-genome-assembly-workflow.pdf 
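
As a rough check against the depth recommendations above, sequencing depth can be estimated as the total number of sequenced bases divided by the expected genome size. The sketch below is illustrative only: the function name `estimate_depth` is hypothetical (it is not part of any ONT tool), and it assumes a standard FASTQ file with four lines per record.

```shell
# Hypothetical helper (not part of any ONT tool): estimate sequencing depth.
# $1: FASTQ file, plain or gzip-compressed; $2: expected genome size in bases.
# Assumes four lines per FASTQ record (no wrapped sequences).
estimate_depth() {
    if [ "${1##*.}" = "gz" ]; then
        gzip -cd "$1"
    else
        cat "$1"
    fi | LC_ALL=C awk -v g="$2" '
        NR % 4 == 2 { bases += length($0); n++ }   # line 2 of each record is the sequence
        END { printf "reads=%d bases=%.0f depth=%.1fx\n", n, bases, bases / g }
    '
}

# Example on the course data, assuming a genome size of about 15 Mb
READS="$HOME/work/DATA/ONT/4222_RB2.fastq.gz"
if [ -f "$READS" ]; then
    estimate_depth "$READS" 15000000
fi
```

The genome size is only an estimate, so the reported depth should be read as an order of magnitude rather than an exact value.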

Flye  https://github.com/fenderglass/Flye 

Canu  https://canu.readthedocs.io/en/latest/quick-start.html

Miniasm  https://github.com/lh3/miniasm + Minipolish version https://github.com/rrwick/Minipolish

Shasta  https://github.com/chanzuckerberg/shasta

Smartdenovo  https://github.com/ruanjue/smartdenovo

Raven  https://github.com/lbcb-sci/raven

### <span style="color: #4CACBD;"> 1.1 Assembly using Flye </span>

We are going to assemble the three algae samples using Flye https://github.com/fenderglass/Flye

Flye first generates disjointigs, which are concatenations of multiple disjoint genomic segments, and uses them to build a repeat graph. Reads are then mapped to this repeat graph to resolve conflicts (unbridged repeats) and to output contigs.

In [None]:
# create a directory to save the Flye assembly results
mkdir -p ~/work/RESULTS/4222_FLYE
cd ~/work/RESULTS/
pwd

In [None]:
# Run Flye with an estimated genome size of about 15 Mb
time flye --nano-hq ~/work/DATA/ONT/4222_RB2.fastq.gz --genome-size 15000000 --out-dir 4222_FLYE --threads 8

### How many contigs were obtained by Flye?

### What do you think about the N50 and the mean contig length?
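
To help answer these questions, the contig count, total length, mean contig length and N50 can be computed directly from the assembly FASTA. The following is a minimal sketch using only `awk` and `sort`; the function name `fasta_stats` is hypothetical, and the path assumes the Flye run above wrote `assembly.fasta` into `4222_FLYE`. Flye also writes an `assembly_info.txt` table in the same directory, which can be used to cross-check the contig lengths.

```shell
# Hypothetical helper: basic contiguity statistics for a FASTA file.
# N50 is the length of the contig at which the cumulative length of contigs,
# sorted from longest to shortest, first reaches half of the total length.
fasta_stats() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # new record: emit previous length
        { len += length($0) }                            # sequence line: accumulate
        END { if (len > 0) print len }
    ' "$1" | sort -rn | LC_ALL=C awk '
        { lens[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "contigs=0"; exit }
            half = total / 2; cum = 0
            for (i = 1; i <= NR; i++) {
                cum += lens[i]
                if (cum >= half) { n50 = lens[i]; break }
            }
            printf "contigs=%d total=%.0f mean=%.1f N50=%d\n", NR, total, total / NR, n50
        }
    '
}

# Example on the Flye output of this practical (skipped if the file is absent)
ASSEMBLY="$HOME/work/RESULTS/4222_FLYE/assembly.fasta"
if [ -f "$ASSEMBLY" ]; then
    fasta_stats "$ASSEMBLY"
fi
```

A high N50 relative to the mean contig length suggests that most of the assembly sits in a few long contigs, with a tail of short ones.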

## <span style="color: #4CACBC;"> 2. Assemblies correction with Medaka </span>  

Correction can improve the consensus sequence for a draft genome assembly.

Medaka uses basecalled reads (.fasta or .fastq) to correct contigs using trained models. These models are freely available.

Medaka also allows you to train a model on data from your favorite species. You can then use that model directly to obtain a consensus for your favorite organism.

### <span style="color: #4CACBC;"> 2.3 Correct assemblies with Medaka </span>  

We will use Medaka to correct the assemblies produced by Flye.


### <span style="color: #4CACBC;"> 2.3.1 Before correcting assemblies, index them </span>  

### Index the raw Flye assembly

In [None]:
cd ~/work/RESULTS/4222_FLYE
pwd

In [None]:
time samtools faidx assembly.fasta 
time minimap2 -d assembly.fasta.mmi assembly.fasta

In [None]:
whereis samtools

### Create a directory to save the Medaka results for the Flye assembly


In [None]:
mkdir -p ~/work/RESULTS/4222_FLYE_MEDAKA

### <span style="color: #4CACBC;"> 2.3.2 Medaka_consensus </span>  


Medaka is a tool to create a consensus sequence from nanopore sequencing data. 

This task is performed by using neural networks applied to a pileup of individual sequencing reads against a draft assembly.

It outperforms graph-based methods operating on basecalled data, and can be competitive with state-of-the-art signal-based methods whilst being much faster.

As input medaka accepts reads in either a .fasta or a .fastq file. It requires a draft assembly as a .fasta.

### Check the usage of medaka_consensus

In [None]:
medaka_consensus

### Check the medaka model to use

Medaka models are named to indicate i) the pore type, ii) the sequencing device (MinION or PromethION), iii) the basecaller variant, and iv) the basecaller version:

{pore}_{device}_{caller variant}_{caller version}

Example:

r941_min_fast_g303: MinION R9.4.1 flowcells using the fast Guppy basecaller version 3.0.3.
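
The naming scheme above can be illustrated by splitting a model name on underscores. This is only a sketch: `parse_medaka_model` is a hypothetical helper, not a Medaka command, and it handles only names with exactly the four fields described here. Model names that follow a different scheme are reported as unrecognized.

```shell
# Hypothetical helper: split a Medaka model name into its four fields.
# Only handles the {pore}_{device}_{caller variant}_{caller version} scheme;
# names with a different number of fields are rejected with exit status 1.
parse_medaka_model() {
    printf '%s\n' "$1" | awk -F_ '
        NF == 4 { printf "pore=%s device=%s variant=%s version=%s\n", $1, $2, $3, $4; next }
        { print "unrecognized model name: " $0; exit 1 }
    '
}

parse_medaka_model r941_min_fast_g303
# prints: pore=r941 device=min variant=fast version=g303
```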


In [None]:
medaka tools list_models

### Medaka on the Flye assembly

We run Medaka on the Flye assembly.

In [None]:
# go to the Medaka results directory
cd ~/work/RESULTS/4222_FLYE_MEDAKA

In [None]:
time medaka_consensus -i ~/work/DATA/ONT/4222_RB2.fastq.gz -d ~/work/RESULTS/4222_FLYE/assembly.fasta -o MEDAKA_CONSENSUS -t 8
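
Once polishing has finished, a quick sanity check is to compare the number of sequences and the total length of the draft and polished assemblies, which should stay close to each other. The sketch below assumes that `medaka_consensus` wrote its result to `consensus.fasta` inside the `MEDAKA_CONSENSUS` output directory; `seq_summary` is a hypothetical helper name.

```shell
# Hypothetical helper: print "<number of sequences> <total bases>" for a FASTA file.
seq_summary() {
    LC_ALL=C awk '
        /^>/ { n++; next }            # header line: count one sequence
        { bases += length($0) }       # sequence line: accumulate bases
        END { printf "%d %.0f\n", n, bases }
    ' "$1"
}

# Paths assumed from the commands used earlier in this practical
DRAFT="$HOME/work/RESULTS/4222_FLYE/assembly.fasta"
POLISHED="$HOME/work/RESULTS/4222_FLYE_MEDAKA/MEDAKA_CONSENSUS/consensus.fasta"
if [ -f "$DRAFT" ] && [ -f "$POLISHED" ]; then
    echo "draft:    $(seq_summary "$DRAFT")"
    echo "polished: $(seq_summary "$POLISHED")"
fi
```

Small differences in total length are expected, since polishing mostly corrects short insertions, deletions and substitutions.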

### <span style="color: #4CACBC;"> Conclusion </span>  


1. Here, we have built a genome sequence from an alga by using Flye.

2. We have corrected the raw Flye assembly with Medaka models.

Are the genome sizes of the reference and of the Flye assembly really different? Why?

* You can try to compare both using D-GENIES https://dgenies.toulouse.inra.fr/

What do you observe? 


### Now ...

You can run a similar pipeline using the Raven assembler on the 4222 algae sample.

You can follow the same protocol with the B8 and G11 samples.

We will compare the results in the next practical, "Quality Assemblies".