# <span style="color:#006E7F">__Introduction to Oxford Nanopore Data Analysis__ <a class="anchor"></span>  


Created by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and G. Sarah (AGAP-INRAE) - September 2021, SouthGreen training

Adapted by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) - November 2022

Adapted by J. Orjuela (DIADE-IRD) - May 2023


# <span style="color:#006E7F">__TP2 - Assembly and correction__ <a class="anchor" id="data"></span>  
    
## <span style="color: #4CACBC;"> 1. Assemblies </span>  

Long sequencing reads make it possible to generate more contiguous genome assemblies, but assembly is still not an easy task. Eukaryotic genome assembly is complex (large genome sizes, high rates of repeated sequences, high heterozygosity levels and even polyploidy). While prokaryotic genomes may appear less challenging, specific features such as circular DNA molecules must be taken into consideration to achieve a high-quality assembly.

* For assembly, ONT recommends sequencing a human genome to a minimum depth of 30x with 25–35 kb reads.
However, sequencing to a depth of 60x is advisable to obtain the best assembly metrics. We also recommend basecalling in high accuracy mode. The greatest contig N50 is usually obtained with Shasta and Flye. Polishing/correction is also recommended (Racon and Medaka). https://nanoporetech.com/sites/default/files/s3/literature/human-genome-assembly-workflow.pdf


* Long reads simplify genome assembly, with the ability to span repeat-rich sequences (characteristic of  antimicrobial resistance genes) and structural variants. Nanopore sequencing also shows a lack of
bias in GC-rich regions, in contrast to other sequencing platforms. To perform microbial genome assembly, we suggest using the third-party de novo assembly tool Flye. We also recommend one round of polishing with Medaka. https://nanoporetech.com/sites/default/files/s3/literature/microbial-genome-assembly-workflow.pdf 

Flye  https://github.com/fenderglass/Flye 

Canu  https://canu.readthedocs.io/en/latest/quick-start.html

Miniasm  https://github.com/lh3/miniasm + Minipolish version https://github.com/rrwick/Minipolish

Shasta  https://github.com/chanzuckerberg/shasta

Smartdenovo  https://github.com/ruanjue/smartdenovo

Raven  https://github.com/lbcb-sci/raven

### <span style="color: #4CACBD;"> 1.1 Assembly using Flye </span>

We are going to assemble some clones using Flye https://github.com/fenderglass/Flye

Flye first generates disjointigs, which are concatenations of multiple disjoint genomic segments, and uses them to build a repeat graph. Reads are then mapped to this repeat graph to resolve conflicts (unbridged repeats) and output contigs.

In [1]:
# create a directory to save the Flye assembly results
mkdir -p ~/work/RESULTS/4222_FLYE
cd ~/work/RESULTS/
pwd

/home/jovyan/work/RESULTS


In [2]:
# Run flye with genome size of about 15Mb
time flye --nano-hq ~/work/DATA/4222_RB2.fastq.gz --genome-size 15000000 --out-dir 4222_FLYE --threads 8

[2023-10-28 08:29:15] INFO: Starting Flye 2.9.2-b1786
[2023-10-28 08:29:15] INFO: >>>STAGE: configure
[2023-10-28 08:29:15] INFO: Configuring run
[2023-10-28 08:29:31] INFO: Total read length: 746791553
[2023-10-28 08:29:31] INFO: Input genome size: 15000000
[2023-10-28 08:29:31] INFO: Estimated coverage: 49
[2023-10-28 08:29:31] INFO: Reads N50/N90: 20829 / 6197
[2023-10-28 08:29:31] INFO: Minimum overlap set to 6000
[2023-10-28 08:29:31] INFO: >>>STAGE: assembly
[2023-10-28 08:29:31] INFO: Assembling disjointigs
[2023-10-28 08:29:31] INFO: Reading sequences
[2023-10-28 08:29:43] INFO: Building minimizer index
[2023-10-28 08:29:43] INFO: Pre-calculating index storage
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2023-10-28 08:30:10] INFO: Filling index
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2023-10-28 08:30:53] INFO: Extending reads
[2023-10-28 08:33:43] INFO: Overlap-based coverage: 30
[2023-10-28 08:33:43] INFO: Median overlap divergence: 0.0796524
0% 10% 30% 60% 70% 80% 90% 
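
The `Estimated coverage: 49` line in the log can be checked by hand: it is consistent with dividing the total read length by the genome size given on the command line and dropping the remainder. The sketch below reproduces that arithmetic. `estimate_coverage` is a hypothetical helper name, and the integer division is an assumption chosen to match the log, not a statement about Flye's internals.

```shell
# Rough depth estimate: total sequenced bases divided by the expected genome size.
# Assumption: integer division, chosen only because it matches the value in the log.
estimate_coverage() {
    echo $(( $1 / $2 ))
}

# Values copied from the Flye log above (total read length, input genome size)
estimate_coverage 746791553 15000000
```

With these inputs the helper prints 49, which sits between the 30x minimum and the 60x depth mentioned in the ONT recommendation for human genomes.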

### How many contigs were obtained by Flye?

### What do you think about the N50 and the mean contig length?
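
One way to answer these questions from the command line is to compute the statistics directly from the FASTA file. The following is a minimal sketch using only `awk` and `sort`. `assembly_stats` is a hypothetical helper, and it assumes an uncompressed FASTA. N50 is taken here as the length of the contig at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total assembly length.

```shell
# Summarise a FASTA file: number of contigs, total length, mean length, N50.
assembly_stats() {
    awk '
        /^>/ { if (len) print len; len = 0; next }   # header line: emit previous contig length
        { len += length($0) }                         # sequence line: accumulate length
        END { if (len) print len }                    # emit the last contig
    ' "$1" | sort -rn | awk '
        { lengths[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "contigs=0"; exit }
            half = total / 2
            for (i = 1; i <= NR; i++) {
                running += lengths[i]
                if (running >= half) { n50 = lengths[i]; break }
            }
            # mean is rounded to the nearest base
            printf "contigs=%d total=%d mean=%d N50=%d\n", NR, total, int(total / NR + 0.5), n50
        }'
}
```

It could be called as `assembly_stats ~/work/RESULTS/4222_FLYE/assembly.fasta`. Flye also writes a per-contig table, `assembly_info.txt`, in its output directory, which can be inspected alongside these numbers.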

## <span style="color: #4CACBC;"> 2. Assemblies correction with Medaka </span>  

Correction can improve the consensus sequence for a draft genome assembly.

Medaka uses basecalled reads (.fasta or .fastq) and trained models to correct contigs. These models are freely available.

Medaka also allows you to train a model on data from your favorite species. You can then use it directly to obtain a consensus for your favorite organism.

### <span style="color: #4CACBC;"> 2.3 Correct assemblies with Medaka </span>  

We will use medaka to correct assemblies from FLYE and FLYE+RACON analysis.


### <span style="color: #4CACBC;"> 2.3.1 Index the assemblies before correcting them </span>  

### Index raw flye assembly

In [3]:
cd ~/work/RESULTS/4222_FLYE
pwd

/home/jovyan/work/RESULTS/4222_FLYE


In [4]:
time samtools faidx assembly.fasta 
time minimap2 -d assembly.fasta.mmi assembly.fasta


real	0m0.155s
user	0m0.107s
sys	0m0.005s
[M::mm_idx_gen::1.561*1.00] collected minimizers
[M::mm_idx_gen::2.030*1.45] sorted minimizers
[M::main::2.492*1.37] loaded/built the index for 31 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 31
[M::mm_idx_stat::2.586*1.35] distinct minimizers: 3863882 (94.32% are singletons); average occurrences: 1.098; average spacing: 5.331; total length: 22623684
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -d assembly.fasta.mmi assembly.fasta
[M::main] Real time: 2.615 sec; CPU: 3.528 sec; Peak RSS: 0.215 GB

real	0m2.784s
user	0m3.059s
sys	0m0.484s
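
The `.fai` index written by `samtools faidx` is a plain tab-separated text file whose first two columns are the sequence name and its length, so it can serve as a quick sanity check on the assembly. A minimal sketch follows; `fai_summary` is a hypothetical helper name.

```shell
# Count sequences and sum their lengths from a samtools .fai index
# (column 1 = sequence name, column 2 = sequence length in bases).
fai_summary() {
    awk -F'\t' '{ n++; total += $2 }
                END { printf "sequences=%d total_length=%d\n", n, total }' "$1"
}
```

Run it as `fai_summary assembly.fasta.fai`. The result would be expected to agree with the `#seq: 31` and `total length: 22623684` values reported by minimap2 in the log above.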


In [5]:
whereis samtools

samtools: /usr/local/bin/samtools /opt/conda/envs/busco/bin/samtools


### Create a medaka directory to save medaka results from the FLYE analysis


In [6]:
mkdir -p ~/work/RESULTS/4222_FLYE_MEDAKA

### <span style="color: #4CACBC;"> 2.3.2 Medaka_consensus </span>  


Medaka is a tool to create a consensus sequence from nanopore sequencing data. 

This task is performed by using neural networks applied to a pileup of individual sequencing reads against a draft assembly.

It outperforms graph-based methods operating on basecalled data, and can be competitive with state-of-the-art signal-based methods whilst being much faster.

As input medaka accepts reads in either a .fasta or a .fastq file. It requires a draft assembly as a .fasta.

### Check the usage of medaka_consensus

In [7]:
medaka_consensus

TF_CPP_MIN_LOG_LEVEL is set to '3'
Cannot import pyabpoa, some features may not be available.
Cannot import pyabpoa, some features may not be available.

medaka 1.11.0
------------

Assembly polishing via neural networks. Medaka is optimized
to work with the Flye assembler.

medaka_consensus [-h] -i <fastx> -d <fasta>

    -h  show this help text.
    -i  fastx input basecalls (required).
    -d  fasta input assembly (required).
    -o  output folder (default: medaka).
    -g  don't fill gaps in consensus with draft sequence.
    -r  use gap-filling character instead of draft sequence (default: None)
    -m  medaka model, (default: r1041_e82_400bps_sup_v4.2.0).
        Choices: r103_fast_g507 r103_hac_g507 r103_min_high_g345 r103_min_high_g360 r103_prom_high_g360 r103_sup_g507 r1041_e82_260bps_fast_g632 r1041_e82_260bps_hac_g632 r1041_e82_260bps_hac_v4.0.0 r1041_e82_260bps_hac_v4.1.0 r1041_e82_260bps_sup_g632 r1041_e82_260bps_sup_v4.0.0 r1041_e82_260bps_sup_v4.1.0 r1041_e82_400bps_fast


### Check the medaka model to use

Medaka models are named to indicate i) the pore type, ii) the sequencing device (MinION or PromethION), iii) the basecaller variant, and  iv) the basecaller version

{pore}_{device}_{caller variant}_{caller version}

examples: 

r941_min_fast_g303 : MinION R9.4.1 flowcells using the fast Guppy basecaller version 3.0.3. 
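
As an illustration of this naming scheme, the sketch below splits a four-field model name on underscores. `parse_model_name` is a hypothetical helper, not part of medaka. It only applies to names that follow the four-field pattern above; some names in the medaka model list, such as `r1041_e82_400bps_sup_v4.2.0`, carry more fields and would be mislabelled by it.

```shell
# Split a four-field medaka model name: {pore}_{device}_{caller variant}_{caller version}
parse_model_name() {
    old_ifs=$IFS
    IFS=_
    set -- $1          # word-split the name on underscores into $1..$4
    IFS=$old_ifs
    echo "pore=$1 device=$2 variant=$3 version=$4"
}

parse_model_name r941_min_fast_g303
# prints: pore=r941 device=min variant=fast version=g303
```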


In [8]:
medaka tools list_models

Cannot import pyabpoa, some features may not be available.
Available: r103_fast_g507, r103_fast_snp_g507, r103_fast_variant_g507, r103_hac_g507, r103_hac_snp_g507, r103_hac_variant_g507, r103_min_high_g345, r103_min_high_g360, r103_prom_high_g360, r103_prom_snp_g3210, r103_prom_variant_g3210, r103_sup_g507, r103_sup_snp_g507, r103_sup_variant_g507, r1041_e82_260bps_fast_g632, r1041_e82_260bps_fast_variant_g632, r1041_e82_260bps_hac_g632, r1041_e82_260bps_hac_v4.0.0, r1041_e82_260bps_hac_v4.1.0, r1041_e82_260bps_hac_variant_g632, r1041_e82_260bps_hac_variant_v4.1.0, r1041_e82_260bps_sup_g632, r1041_e82_260bps_sup_v4.0.0, r1041_e82_260bps_sup_v4.1.0, r1041_e82_260bps_sup_variant_g632, r1041_e82_260bps_sup_variant_v4.1.0, r1041_e82_400bps_fast_g615, r1041_e82_400bps_fast_g632, r1041_e82_400bps_fast_variant_g615, r1041_e82_400bps_fast_variant_g632, r1041_e82_400bps_hac_g615, r1041_e82_400bps_hac_g632, r1041_e82_400bps_hac_v4.0.0, r1041_e82_400bps_hac_v4.1.0, r1041_e82_400bps_hac_v4.2.0, r1
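
The list is long, so it can help to filter it for the pore type and basecaller variant of your run. The sketch below reads a comma-separated list like the one above on standard input and prints the matching names one per line. `filter_models` is a hypothetical helper, and the sample input is just three names copied from the output above.

```shell
# Read a comma-separated model list on stdin and print, one per line,
# the model names that contain the given pattern.
filter_models() {
    sed 's/^Available: //' | tr ',' '\n' | tr -d ' ' | grep -- "$1"
}

# Sample input: three names taken from the list above
printf 'Available: r103_fast_g507, r1041_e82_400bps_hac_v4.1.0, r1041_e82_400bps_hac_v4.2.0\n' \
    | filter_models 400bps_hac
```

On a machine with medaka installed, the same filter could be fed with `medaka tools list_models | filter_models 400bps_hac`. The chosen name can then be passed to `medaka_consensus` with the `-m` option listed in its help text.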

### Medaka on the FLYE assembly

We run medaka on the FLYE assembly.

In [9]:
# go to the medaka results directory
cd ~/work/RESULTS/4222_FLYE_MEDAKA

In [10]:
time medaka_consensus -i ~/work/DATA/4222_RB2.fastq.gz -d ~/work/RESULTS/4222_FLYE/assembly.fasta -o MEDAKA_CONSENSUS -t 8

TF_CPP_MIN_LOG_LEVEL is set to '3'
Cannot import pyabpoa, some features may not be available.
Cannot import pyabpoa, some features may not be available.
Attempting to automatically select model version.
Checking program versions
This is medaka 1.11.0
Cannot import pyabpoa, some features may not be available.
Program    Version    Required   Pass     
bcftools   1.13       1.11       True     
bgzip      1.13+ds    1.11       True     
minimap2   2.24       2.11       True     
samtools   1.11       1.11       True     
tabix      1.13+ds    1.11       True     
Cannot import pyabpoa, some features may not be available.
[18:50:50 - MdlStrTF] Successfully removed temporary files from /tmp/tmpmarvpwcx.
Cannot import pyabpoa, some features may not be available.
[18:50:51 - MdlStrTF] Successfully removed temporary files from /tmp/tmpvu7vtvvl.
Aligning basecalls to draft
Using the existing fai index file /home/jovyan/work/RESULTS/4222_FLYE/assembly.fasta.fai
Creating mmi index file /home/jov

### <span style="color: #4CACBC;"> Conclusion </span>  


1. Here, we have assembled the genome of an alga using Flye.

2. We have corrected (polished) the raw Flye assembly with Medaka models.

Is the genome size of the reference really different from that of the Flye assembly? Why?

* You can try comparing both using D-GENIES https://dgenies.toulouse.inra.fr/

What do you observe? 


### Now ...

You can run a similar pipeline using the RAVEN assembler on the 4222 algae sample.

You can follow a similar protocol with the B8 and G11 samples.

We will compare the results in the next practical, "Quality Assemblies". 