# <span style="color:#006E7F">__Introduction to Oxford Nanopore Data Analysis__ <a class="anchor"></span>  


Created by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and G. Sarah (AGAP-INRAE) - September 2021, SouthGreen training course

Adapted by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) - November 2022


# <span style="color:#006E7F">__TP2 - Assembly and correction__ <a class="anchor" id="data"></span>  
    
## <span style="color: #4CACBC;"> 1. Assemblies </span>  

More contiguous genome assemblies can be generated using long sequencing reads, but assembly is not an easy task. Eukaryotic genome assembly is complex (large genome sizes, high proportions of repeated sequences, high heterozygosity levels and even polyploidy). While prokaryotic genomes may appear less challenging, specific features, such as circular DNA molecules, must be taken into account to achieve a high-quality assembly.

* For assembly, ONT recommends sequencing a human genome to a minimum depth of 30x with 25–35 kb reads.
However, sequencing to a depth of 60x is advisable to obtain the best assembly metrics. ONT also recommends basecalling in high accuracy mode. The greatest contig N50 is usually obtained with Shasta and Flye. Polishing/correction is also recommended (Racon and Medaka). https://nanoporetech.com/sites/default/files/s3/literature/human-genome-assembly-workflow.pdf


* Long reads simplify genome assembly, with the ability to span repeat-rich sequences (characteristic of antimicrobial resistance genes) and structural variants. Nanopore sequencing also shows a lack of bias in GC-rich regions, in contrast to other sequencing platforms. For microbial genome assembly, ONT suggests using the third-party de novo assembly tool Flye, followed by one round of polishing with Medaka. https://nanoporetech.com/sites/default/files/s3/literature/microbial-genome-assembly-workflow.pdf

Flye  https://github.com/fenderglass/Flye 

Canu  https://canu.readthedocs.io/en/latest/quick-start.html

Miniasm  https://github.com/lh3/miniasm + Minipolish version https://github.com/rrwick/Minipolish

Shasta  https://github.com/chanzuckerberg/shasta

Smartdenovo  https://github.com/ruanjue/smartdenovo

Raven  https://github.com/lbcb-sci/raven

### <span style="color: #4CACBD;"> 1.1 Assembly using Flye </span>

We are going to assemble some clones using Flye https://github.com/fenderglass/Flye

Flye first generates disjointigs, which are concatenations of multiple disjoint genomic segments, and uses them to build a repeat graph. Reads are then mapped to this repeat graph to resolve conflicts (unbridged repeats) and output contigs.

In [1]:
# declare your clone number
CLONE="Clone10"
# move to the directory where the Flye assembly results will be saved
cd ~/work/RESULTS/
pwd

/home/jovyan/work/RESULTS


In [2]:
# Run flye
time flye --nano-raw ../DATA/${CLONE}/ONT/${CLONE}.fastq.gz --genome-size 1000000 --out-dir FLYE --threads 4

[2022-11-09 13:36:53] INFO: Starting Flye 2.9.1-b1780
[2022-11-09 13:36:53] INFO: >>>STAGE: configure
[2022-11-09 13:36:53] INFO: Configuring run
[2022-11-09 13:36:56] INFO: Total read length: 166436048
[2022-11-09 13:36:56] INFO: Input genome size: 1000000
[2022-11-09 13:36:56] INFO: Estimated coverage: 166
[2022-11-09 13:36:56] INFO: Reads N50/N90: 22325 / 7615
[2022-11-09 13:36:56] INFO: Minimum overlap set to 8000
[2022-11-09 13:36:56] INFO: >>>STAGE: assembly
[2022-11-09 13:36:56] INFO: Assembling disjointigs
[2022-11-09 13:36:56] INFO: Reading sequences
[2022-11-09 13:37:01] INFO: Counting k-mers:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2022-11-09 13:38:00] INFO: Filling index table (1/2)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2022-11-09 13:38:25] INFO: Filling index table (2/2)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2022-11-09 13:39:53] INFO: Extending reads
[2022-11-09 13:42:02] INFO: Overlap-based coverage: 122
[2022-11-09 13:42:02] INFO: Median overlap div
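
The "Estimated coverage" line in the Flye log above matches the total read length divided by the genome size passed with `--genome-size`. The sketch below reproduces that arithmetic. The helper names `total_read_length` and `estimate_coverage` are hypothetical and are not part of Flye; the first one assumes standard 4-line FASTQ records.

```shell
# Hypothetical helper: sum the lengths of all reads in a gzipped FASTQ file
# (assumes standard 4-line FASTQ records).
total_read_length() {
    gzip -cd "$1" | awk 'NR % 4 == 2 { total += length($0) } END { printf "%.0f\n", total }'
}

# Hypothetical helper: integer coverage estimate = total bases / genome size.
estimate_coverage() {
    echo $(( $1 / $2 ))
}

# Values copied from the Flye log: total read length and --genome-size
estimate_coverage 166436048 1000000   # prints 166
```

To start from the reads themselves, something like `estimate_coverage "$(total_read_length ~/work/DATA/${CLONE}/ONT/${CLONE}.fastq.gz)" 1000000` should give a comparable figure.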

### <span style="color: #4CACBD;"> 1.2 Assembly using Raven </span>

In [3]:
# create a directory to save the Raven assembly results
mkdir -p ~/work/RESULTS/RAVEN
cd ~/work/RESULTS/RAVEN
pwd

/home/jovyan/work/RESULTS/RAVEN


In [4]:
# Run raven
time raven -p 0 --graphical-fragment-assembly ${CLONE}_raven.gfa -t 4 ../../DATA/${CLONE}/ONT/${CLONE}.fastq.gz > ${CLONE}_raven.fasta

[raven::] loaded 11235 sequences 2.466825s
[raven::Graph::Construct] minimized 0 - 11235 / 11235 4.062874s
[raven::Graph::Construct] mapped sequences 8.123777s
[raven::Graph::Construct] annotated piles 0.079130s
[raven::Graph::Construct] removed contained sequences 0.030713s
[raven::Graph::Construct] removed chimeric sequences 0.002846s
[raven::Graph::Construct] reached checkpoint 0.047079s
[raven::Graph::Construct] minimized 0 - 214 / 214 0.440931s
[raven::Graph::Construct] mapped valid sequences 0.531934s
[raven::Graph::Construct] updated overlaps 0.000331s
[raven::Graph::Construct] removed false overlaps 0.010451s
[raven::Graph::Construct] stored 428 nodes 0.061572s
[raven::Graph::Construct] stored 3100 edges 0.000000s
[raven::Graph::Construct] reached checkpoint 0.076549s
[raven::Graph::Construct] 13.484354s
[raven::Graph::Assemble] removed transitive edges 0.002083s
[raven::Graph::Assemble] reached checkpoint 0.068241s
[raven::Graph::Assemble] removed tips and bubbles 0.035455s
[r

* How many contigs were obtained by Flye and Raven? Please fill in the results in the shared [file](https://lite.framacalc.org/9pd3-ont_sg_2021)
* Calculate first statistics about the assemblies: N50 and mean contig length.
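
One possible way to compute these statistics with standard tools is sketched below. Here N50 is taken as the length of the contig at which the cumulative length of the contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name `fasta_stats` is hypothetical, and the sketch assumes an uncompressed FASTA file.

```shell
# Hypothetical helper: print contig count, total length, mean length and N50
# for one uncompressed FASTA file.
fasta_stats() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # header: emit previous contig length
        { len += length($0) }                             # sequence line: accumulate length
        END { if (len > 0) print len }                    # emit the last contig
    ' "$1" | sort -rn | LC_ALL=C awk '
        { lengths[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "contigs=0"; exit }
            half = total / 2; running = 0
            for (i = 1; i <= NR; i++) {
                running += lengths[i]
                if (running >= half) { n50 = lengths[i]; break }
            }
            printf "contigs=%d total=%d mean=%.1f N50=%d\n", NR, total, total / NR, n50
        }
    '
}

# Example call (path from this practical):
# fasta_stats ~/work/RESULTS/FLYE/assembly.fasta
```

Dedicated tools report the same metrics with more detail; this sketch is only meant to make the definitions concrete.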

In [5]:
grep -c ^'>' ~/work/RESULTS/*/*.fasta

/home/jovyan/work/RESULTS/FLYE/assembly.fasta:4
/home/jovyan/work/RESULTS/RAVEN/Clone10_raven.fasta:3


## <span style="color: #4CACBC;"> 2. Polishing assemblies with Racon </span>  

Racon corrects raw contigs generated by rapid assembly methods using the original ONT reads.

The community usually runs two to four rounds of Racon.

Polish the contigs assembled by Flye using two rounds of Racon.

In [6]:
racon

[racon::] error: missing input file(s)!
usage: racon [options ...] <sequences> <overlaps> <target sequences>

    #default output is stdout
    <sequences>
        input file in FASTA/FASTQ format (can be compressed with gzip)
        containing sequences used for correction
    <overlaps>
        input file in MHAP/PAF/SAM format (can be compressed with gzip)
        containing overlaps between sequences and target sequences
    <target sequences>
        input file in FASTA/FASTQ format (can be compressed with gzip)
        containing sequences which will be corrected

    options:
        -u, --include-unpolished
            output unpolished target sequences
        -f, --fragment-correction
            perform fragment correction instead of contig polishing
            (overlaps file should contain dual/self overlaps!)
        -w, --window-length <int>
            default: 500
            size of window on which POA is performed
        -q, --quality-threshold <float>
            


## Racon first round

In [7]:
mkdir -p ~/work/RESULTS/FLYE_RACON
cd ~/work/RESULTS/FLYE_RACON
time minimap2 -t 6 ../FLYE/assembly.fasta ../../DATA/${CLONE}/ONT/${CLONE}.fastq.gz 1> assembly.minimap4racon1.paf
time racon -t 6 ../../DATA/${CLONE}/ONT/${CLONE}.fastq.gz assembly.minimap4racon1.paf ../FLYE/assembly.fasta > assembly.racon1.fasta

[M::mm_idx_gen::0.067*1.06] collected minimizers
[M::mm_idx_gen::0.080*1.65] sorted minimizers
[M::main::0.080*1.65] loaded/built the index for 4 target sequence(s)
[M::mm_mapopt_update::0.085*1.62] mid_occ = 5
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 4
[M::mm_idx_stat::0.089*1.59] distinct minimizers: 226987 (98.75% are singletons); average occurrences: 1.016; average spacing: 5.343
[M::worker_pipeline::5.068*3.68] mapped 11235 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -t 6 ../FLYE/assembly.fasta ../../DATA/Clone10/ONT/Clone10.fastq.gz
[M::main] Real time: 5.072 sec; CPU: 18.667 sec; Peak RSS: 0.180 GB

real	0m5.085s
user	0m18.559s
sys	0m0.116s
[racon::Polisher::initialize] loaded target sequences 0.009216 s
[racon::Polisher::initialize] loaded sequences 2.476519 s
[racon::Polisher::initialize] loaded overlaps 0.013746 s
[racon::Polisher::initialize] transformed data into windows 0.232758 s
[racon::Polisher::] total = 132.830566 s

real	2m12.875s

## Racon second round

In [8]:
# second round
cd ~/work/RESULTS/FLYE_RACON
time minimap2 -t 4 assembly.racon1.fasta ../../DATA/${CLONE}/ONT/${CLONE}.fastq.gz 1> assembly.minimap4racon2.paf
time racon -t 4 ../../DATA/${CLONE}/ONT/${CLONE}.fastq.gz assembly.minimap4racon2.paf assembly.racon1.fasta > assembly.racon2.fasta

[M::mm_idx_gen::0.071*1.05] collected minimizers
[M::mm_idx_gen::0.088*1.59] sorted minimizers
[M::main::0.088*1.59] loaded/built the index for 4 target sequence(s)
[M::mm_mapopt_update::0.099*1.53] mid_occ = 5
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 4
[M::mm_idx_stat::0.106*1.49] distinct minimizers: 225838 (98.72% are singletons); average occurrences: 1.017; average spacing: 5.338
[M::worker_pipeline::5.405*2.66] mapped 11235 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -t 4 assembly.racon1.fasta ../../DATA/Clone10/ONT/Clone10.fastq.gz
[M::main] Real time: 5.410 sec; CPU: 14.363 sec; Peak RSS: 0.176 GB

real	0m5.426s
user	0m14.240s
sys	0m0.136s
[racon::Polisher::initialize] loaded target sequences 0.009998 s
[racon::Polisher::initialize] loaded sequences 2.735626 s
[racon::Polisher::initialize] loaded overlaps 0.015789 s
[racon::Polisher::initialize] transformed data into windows 0.220627 s
[racon::Polisher::] total = 152.995375 s

real	2m33.038s
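
The two rounds above follow the same pattern: map the reads to the current draft with minimap2, then polish that draft with Racon. As a minimal sketch, the pattern can be written as a loop. The function name `polish_with_racon` and its argument order are hypothetical; the minimap2 and racon calls mirror the ones used in the cells above.

```shell
# Sketch: run N rounds of minimap2 + racon in the current directory.
# Each round polishes the output of the previous one.
# Arguments: reads, draft assembly, number of rounds (default 2), threads (default 4).
polish_with_racon() {
    reads=$1; draft=$2; rounds=${3:-2}; threads=${4:-4}
    i=1
    while [ "$i" -le "$rounds" ]; do
        paf="assembly.minimap4racon${i}.paf"
        polished="assembly.racon${i}.fasta"
        # map the reads to the current draft
        minimap2 -t "$threads" "$draft" "$reads" > "$paf" || return 1
        # polish the current draft with the reads and their overlaps
        racon -t "$threads" "$reads" "$paf" "$draft" > "$polished" || return 1
        draft=$polished
        i=$((i + 1))
    done
    # report the name of the final polished assembly
    echo "$draft"
}

# Example call (paths from this practical):
# polish_with_racon ../../DATA/${CLONE}/ONT/${CLONE}.fastq.gz ../FLYE/assembly.fasta 2 4
```

Keeping one PAF and one FASTA per round, as in the manual cells, makes it easy to compare intermediate results later.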


## <span style="color: #4CACBC;"> 3. Assembly correction with Medaka </span>  

Correction can improve the consensus sequence for a draft genome assembly.

Medaka corrects contigs using basecalled reads (.fasta or .fastq) and pre-trained models. These models are freely available.

Medaka also allows you to train your own model on data from your favorite species, and then use it to obtain a consensus for your favorite organism.

### <span style="color: #4CACBC;"> 1.3 Correct assemblies with Medaka </span>  

### <span style="color: #4CACBC;"> 1.3.1 Index the assemblies before correcting them </span>  

In [3]:
cd ~/work/RESULTS/FLYE_RACON
pwd

/home/jovyan/work/RESULTS/FLYE_RACON


In [4]:
time samtools faidx assembly.racon2.fasta 
time minimap2 -d assembly.racon2.fasta.mmi assembly.racon2.fasta


real	0m0.063s
user	0m0.006s
sys	0m0.018s
[M::mm_idx_gen::0.049*1.07] collected minimizers
[M::mm_idx_gen::0.063*1.48] sorted minimizers
[M::main::0.083*1.37] loaded/built the index for 4 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 4
[M::mm_idx_stat::0.089*1.34] distinct minimizers: 225782 (98.72% are singletons); average occurrences: 1.017; average spacing: 5.337
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -d assembly.racon2.fasta.mmi assembly.racon2.fasta
[M::main] Real time: 0.097 sec; CPU: 0.127 sec; Peak RSS: 0.015 GB

real	0m0.102s
user	0m0.106s
sys	0m0.022s


In [2]:
# create a directory for the Medaka results
mkdir -p ~/work/RESULTS/FLYE_RACON_MEDAKA
cd ~/work/RESULTS/FLYE_RACON_MEDAKA
pwd

/home/jovyan/work/RESULTS/FLYE_RACON_MEDAKA


### <span style="color: #4CACBC;"> 1.3.2 Medaka_consensus </span>  


Medaka is a tool to create a consensus sequence from nanopore sequencing data. 

This task is performed by using neural networks applied to a pileup of individual sequencing reads against a draft assembly.

It outperforms graph-based methods operating on basecalled data, and can be competitive with state-of-the-art signal-based methods whilst being much faster.

As input medaka accepts reads in either a .fasta or a .fastq file. It requires a draft assembly as a .fasta.

### Check the usage of medaka_consensus

In [3]:
medaka_consensus


medaka 1.7.2
------------

Assembly polishing via neural networks. Medaka is optimized
to work with the Flye assembler.

medaka_consensus [-h] -i <fastx> -d <fasta>

    -h  show this help text.
    -i  fastx input basecalls (required).
    -d  fasta input assembly (required).
    -o  output folder (default: medaka).
    -g  don't fill gaps in consensus with draft sequence.
    -r  use gap-filling character instead of draft sequence (default: None)
    -m  medaka model, (default: r941_min_hac_g507).
        Choices: r103_fast_g507 r103_hac_g507 r103_min_high_g345 r103_min_high_g360 r103_prom_high_g360 r103_sup_g507 r1041_e82_260bps_fast_g632 r1041_e82_260bps_hac_g632 r1041_e82_260bps_sup_g632 r1041_e82_400bps_fast_g615 r1041_e82_400bps_fast_g632 r1041_e82_400bps_hac_g615 r1041_e82_400bps_hac_g632 r1041_e82_400bps_sup_g615 r104_e81_fast_g5015 r104_e81_hac_g5015 r104_e81_sup_g5015 r104_e81_sup_g610 r10_min_high_g303 r10_min_high_g340 r941_e81_fast_g514 r941_e81_hac_g514 r941_e81_sup_g51


### Check the medaka model to use

Medaka models are named to indicate i) the pore type, ii) the sequencing device (MinION or PromethION), iii) the basecaller variant, and iv) the basecaller version:

{pore}_{device}_{caller variant}_{caller version}

Example:

r941_min_fast_g303: MinION R9.4.1 flowcells using the fast Guppy basecaller version 3.0.3.
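
To make the naming scheme concrete, the sketch below splits a model name on underscores and labels the four fields. The function name `describe_model` is hypothetical and is not part of Medaka. It only handles names that follow the four-field pattern described above; other names in the model list (for example those starting with `r1041_e82_`) use more fields and are rejected.

```shell
# Hypothetical helper: split a four-field Medaka model name into its parts.
describe_model() {
    old_ifs=$IFS
    IFS=_
    set -- $1            # split the name on underscores into $1..$n
    IFS=$old_ifs
    if [ "$#" -ne 4 ]; then
        echo "unexpected model name format" >&2
        return 1
    fi
    echo "pore=$1 device=$2 variant=$3 version=$4"
}

describe_model r941_min_high_g360   # prints: pore=r941 device=min variant=high version=g360
```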


In [4]:
medaka tools list_models

Available: r103_fast_g507, r103_fast_snp_g507, r103_fast_variant_g507, r103_hac_g507, r103_hac_snp_g507, r103_hac_variant_g507, r103_min_high_g345, r103_min_high_g360, r103_prom_high_g360, r103_prom_snp_g3210, r103_prom_variant_g3210, r103_sup_g507, r103_sup_snp_g507, r103_sup_variant_g507, r1041_e82_260bps_fast_g632, r1041_e82_260bps_fast_variant_g632, r1041_e82_260bps_hac_g632, r1041_e82_260bps_hac_variant_g632, r1041_e82_260bps_sup_g632, r1041_e82_260bps_sup_variant_g632, r1041_e82_400bps_fast_g615, r1041_e82_400bps_fast_g632, r1041_e82_400bps_fast_variant_g615, r1041_e82_400bps_fast_variant_g632, r1041_e82_400bps_hac_g615, r1041_e82_400bps_hac_g632, r1041_e82_400bps_hac_variant_g615, r1041_e82_400bps_hac_variant_g632, r1041_e82_400bps_sup_g615, r1041_e82_400bps_sup_variant_g615, r104_e81_fast_g5015, r104_e81_fast_variant_g5015, r104_e81_hac_g5015, r104_e81_hac_variant_g5015, r104_e81_sup_g5015, r104_e81_sup_g610, r104_e81_sup_variant_g610, r10_min_high_g303, r10_min_high_g340, r941

### Medaka in FLYE + RACON

We run Medaka on the Flye assembly polished with Racon.

In [5]:
# go to the Medaka results directory
cd ~/work/RESULTS/FLYE_RACON_MEDAKA

In [None]:
# activate the medaka conda environment
conda activate medaka

In [6]:
# medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${OUTDIR} -t ${NPROC} -m r941_min_high_g360
time medaka_consensus -i ~/work/DATA/${CLONE}/ONT/${CLONE}.fastq.gz -d ~/work/RESULTS/FLYE_RACON/assembly.racon2.fasta -o MEDAKA_CONSENSUS -t 4 -m r941_min_high_g360

Checking program versions
This is medaka 1.7.2
Program    Version    Required   Pass     
bcftools   1.10.2     1.11       False    
bgzip      1.10.2-3   1.11       False    
minimap2   2.17       2.11       True     
samtools   1.10       1.11       False    
tabix      1.10.2-3   1.11       False    

real	0m6.217s
user	0m5.755s
sys	0m2.678s


: 1
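
In the version table above, the Pass column is True only for minimap2 (2.17 installed, 2.11 required); the four other tools are older than the required 1.11. The sketch below shows one way to make the same kind of comparison by hand before launching a run. The function name `version_ge` is hypothetical, and the field-by-field numeric comparison is an assumption about how such a check can be done, not Medaka's own code.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to version $2.
# Versions are split on "." and "-" and compared field by field as numbers;
# missing fields count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, /[.-]/); nb = split(b, y, /[.-]/)
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = x[i] + 0; yi = y[i] + 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Values copied from the table above
version_ge 2.17 2.11 && echo "minimap2 2.17 satisfies >= 2.11"
version_ge 1.10 1.11 || echo "samtools 1.10 is older than 1.11"
```

Note that 1.10 is compared as "one, ten", so it is older than 1.11 and newer than 1.9.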

In [None]:
conda deactivate

### <span style="color: #4CACBC;"> Conclusion </span>  


1. Here, we obtained a pseudomolecule of a clone using Flye.

2. We polished it twice with Racon.

3. We corrected it with Medaka models.

You can run a similar pipeline using the Raven assembler with your favorite clone.

We will compare the results in the next practical, "Quality Assemblies". 