# <span style="color:#006E7F">__Introduction to Oxford Nanopore Data Analysis__ <a class="anchor"></a></span>  


Created by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and G. Sarah (AGAP-INRAE) - September 2021, SouthGreen training

Adapted by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) - November 2022

Adapted by J. Orjuela (DIADE-IRD) - May 2023


# <span style="color:#006E7F">__TP2 - Assembly and correction__ <a class="anchor" id="data"></a></span>  
    
## <span style="color: #4CACBC;"> 1. Assemblies </span>  

Long sequencing reads can produce more contiguous genome assemblies, but assembly is not a trivial task. Eukaryotic genome assembly is complex (large genome sizes, high proportions of repeated sequences, high heterozygosity levels and even polyploidy). While prokaryotic genomes may appear less challenging, specific features such as circular DNA molecules must be taken into consideration to achieve a high-quality assembly.

* For assembly, ONT recommends sequencing a human genome to a minimum depth of 30x with 25–35 kb reads. However, sequencing to a depth of 60x is advisable to obtain the best assembly metrics. ONT also recommends basecalling in high accuracy mode. The greatest contig N50 is usually obtained with Shasta and Flye. Polishing/correction is also recommended (Racon and Medaka). https://nanoporetech.com/sites/default/files/s3/literature/human-genome-assembly-workflow.pdf


* Long reads simplify genome assembly, with the ability to span repeat-rich sequences (characteristic of antimicrobial resistance genes) and structural variants. Nanopore sequencing also shows a lack of bias in GC-rich regions, in contrast to other sequencing platforms. For microbial genome assembly, ONT suggests using the third-party de novo assembly tool Flye, followed by one round of polishing with Medaka. https://nanoporetech.com/sites/default/files/s3/literature/microbial-genome-assembly-workflow.pdf

Flye  https://github.com/fenderglass/Flye 

Canu  https://canu.readthedocs.io/en/latest/quick-start.html

Miniasm  https://github.com/lh3/miniasm + Minipolish version https://github.com/rrwick/Minipolish

Shasta  https://github.com/chanzuckerberg/shasta

Smartdenovo  https://github.com/ruanjue/smartdenovo

Raven  https://github.com/lbcb-sci/raven

### <span style="color: #4CACBD;"> 1.1 Assembly using Flye </span>

We are going to assemble some clones using Flye https://github.com/fenderglass/Flye

Flye generates the concatenation of multiple disjoint genomic segments called disjointigs to build a repeat graph. Reads are mapped to this repeat graph to resolve conflicts (unbridged repeats) and output contigs.

In [1]:
# create a directory to save the Flye assembly results
mkdir -p ~/work/RESULTS/FLYE
cd ~/work/RESULTS/
pwd

/home/jovyan/work/RESULTS


In [2]:
# Run Flye with an expected genome size of 15 Mb
time flye --nano-raw ~/work/DATA/4222_RB2.fastq.gz --genome-size 15000000 --out-dir FLYE --threads 8

[2023-05-18 14:59:14] INFO: Starting Flye 2.9.1-b1780
[2023-05-18 14:59:14] INFO: >>>STAGE: configure
[2023-05-18 14:59:14] INFO: Configuring run
[2023-05-18 14:59:32] INFO: Total read length: 746791553
[2023-05-18 14:59:32] INFO: Input genome size: 15000000
[2023-05-18 14:59:32] INFO: Estimated coverage: 49
[2023-05-18 14:59:32] INFO: Reads N50/N90: 20829 / 6197
[2023-05-18 14:59:32] INFO: Minimum overlap set to 6000
[2023-05-18 14:59:32] INFO: >>>STAGE: assembly
[2023-05-18 14:59:32] INFO: Assembling disjointigs
[2023-05-18 14:59:32] INFO: Reading sequences
[2023-05-18 14:59:51] INFO: Counting k-mers:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2023-05-18 15:01:31] INFO: Filling index table (1/2)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2023-05-18 15:02:58] INFO: Filling index table (2/2)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2023-05-18 15:04:42] INFO: Extending reads
[2023-05-18 15:06:38] INFO: Overlap-based coverage: 33
[2023-05-18 15:06:38] INFO: Median overlap dive
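
The estimated coverage reported in the log can be checked by hand: it is the total read length divided by the genome size given with `--genome-size`. A minimal sketch of this arithmetic, using the two values printed in the log above (the variable names are only illustrative):

```shell
# Values copied from the Flye log above
total_read_length=746791553   # "Total read length"
genome_size=15000000          # "Input genome size" (--genome-size)

# Shell integer division truncates the result to a whole number
coverage=$(( total_read_length / genome_size ))
echo "Estimated coverage: ${coverage}x"
```

This prints 49x, which matches the `Estimated coverage: 49` line of the log.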

### How many contigs were obtained by Flye?

### What do you think about the N50 and the mean contig length?
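
One possible way to answer these questions from the command line is sketched below. It assumes the assembly is an uncompressed FASTA file; `assembly_stats` is a hypothetical helper written for this practical, not a Flye command. It computes the length of each contig, sorts the lengths from longest to shortest, and reports the N50 as the length of the contig at which the cumulative sum reaches half of the total assembly length.

```shell
# Hypothetical helper: print contig count, total length, mean length and N50
# of an uncompressed FASTA file given as first argument.
assembly_stats() {
    # Step 1: one length per contig (sequence lines are summed per header)
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Step 2: longest contigs first
    sort -rn |
    # Step 3: accumulate lengths until half of the total is reached
    awk '{ lens[NR] = $1; total += $1 }
         END {
             if (NR == 0) { print "no contigs found"; exit 1 }
             half = total / 2; cum = 0
             for (i = 1; i <= NR; i++) {
                 cum += lens[i]
                 if (cum >= half) { n50 = lens[i]; break }
             }
             printf "contigs=%d total=%d mean=%d N50=%d\n", NR, total, total / NR, n50
         }'
}

# Run it on the Flye assembly if it is present
ASSEMBLY=~/work/RESULTS/FLYE/assembly.fasta
if [ -f "$ASSEMBLY" ]; then
    assembly_stats "$ASSEMBLY"
fi
```

Flye also writes a per-contig summary, `assembly_info.txt`, in its output directory, which can be inspected alongside these numbers.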

## <span style="color: #4CACBC;"> 2. Polishing assemblies with Racon </span>  

Racon corrects raw contigs generated by rapid assembly methods using the original ONT reads.

The community usually runs 2 to 4 rounds of racon.

Polish the contigs assembled by Flye using two rounds of racon.

In [4]:
racon

[racon::] error: missing input file(s)!
usage: racon [options ...] <sequences> <overlaps> <target sequences>

    #default output is stdout
    <sequences>
        input file in FASTA/FASTQ format (can be compressed with gzip)
        containing sequences used for correction
    <overlaps>
        input file in MHAP/PAF/SAM format (can be compressed with gzip)
        containing overlaps between sequences and target sequences
    <target sequences>
        input file in FASTA/FASTQ format (can be compressed with gzip)
        containing sequences which will be corrected

    options:
        -u, --include-unpolished
            output unpolished target sequences
        -f, --fragment-correction
            perform fragment correction instead of contig polishing
            (overlaps file should contain dual/self overlaps!)
        -w, --window-length <int>
            default: 500
            size of window on which POA is performed
        -q, --quality-threshold <float>
            


## Racon first round

In [5]:
mkdir -p ~/work/RESULTS/FLYE_RACON
cd ~/work/RESULTS/FLYE_RACON

In [6]:
time minimap2 -t 8 ../FLYE/assembly.fasta  ~/work/DATA/4222_RB2.fastq.gz 1> assembly.minimap4racon1.paf
time racon -t 8 ~/work/DATA/4222_RB2.fastq.gz assembly.minimap4racon1.paf ../FLYE/assembly.fasta > assembly.racon1.fasta

[M::mm_idx_gen::0.424*1.01] collected minimizers
[M::mm_idx_gen::0.474*1.68] sorted minimizers
[M::main::0.474*1.68] loaded/built the index for 33 target sequence(s)
[M::mm_mapopt_update::0.577*1.55] mid_occ = 21
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 33
[M::mm_idx_stat::0.616*1.52] distinct minimizers: 3858964 (94.03% are singletons); average occurrences: 1.101; average spacing: 5.331
[M::worker_pipeline::9.486*4.02] mapped 42406 sequences
[M::worker_pipeline::11.132*4.57] mapped 20357 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -t 8 ../FLYE/assembly.fasta /home/jovyan/work/DATA/4222_RB2.fastq.gz
[M::main] Real time: 11.161 sec; CPU: 50.932 sec; Peak RSS: 0.879 GB

real	0m11.191s
user	0m50.323s
sys	0m0.633s
[racon::Polisher::initialize] loaded target sequences 0.163065 s
[racon::Polisher::initialize] loaded sequences 8.077494 s
[racon::Polisher::initialize] loaded overlaps 0.089194 s
[racon::Polisher::initialize] transformed data into windows 0.7

## Racon second round

In [7]:
# second round
cd ~/work/RESULTS/FLYE_RACON
time minimap2 -t 8 assembly.racon1.fasta ~/work/DATA/4222_RB2.fastq.gz 1> assembly.minimap4racon2.paf
time racon -t 8 ~/work/DATA/4222_RB2.fastq.gz assembly.minimap4racon2.paf assembly.racon1.fasta > assembly.racon2.fasta

[M::mm_idx_gen::0.449*1.00] collected minimizers
[M::mm_idx_gen::0.494*1.60] sorted minimizers
[M::main::0.494*1.60] loaded/built the index for 33 target sequence(s)
[M::mm_mapopt_update::0.580*1.51] mid_occ = 21
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 33
[M::mm_idx_stat::0.621*1.48] distinct minimizers: 3855426 (94.01% are singletons); average occurrences: 1.104; average spacing: 5.323
[M::worker_pipeline::12.020*5.06] mapped 42406 sequences
[M::worker_pipeline::14.756*5.58] mapped 20357 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -t 8 assembly.racon1.fasta /home/jovyan/work/DATA/4222_RB2.fastq.gz
[M::main] Real time: 14.804 sec; CPU: 82.443 sec; Peak RSS: 0.963 GB

real	0m14.836s
user	1m21.783s
sys	0m0.690s
[racon::Polisher::initialize] loaded target sequences 0.151780 s
[racon::Polisher::initialize] loaded sequences 8.508921 s
[racon::Polisher::initialize] loaded overlaps 0.090673 s
[racon::Polisher::initialize] transformed data into windows 0.6
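
The two rounds above follow the same pattern: map the reads to the current assembly with minimap2, then give the reads, the overlaps and the current assembly to racon. A minimal sketch of that pattern as a loop is shown here, assuming the same file naming as in the cells above. `polish_rounds` and the `DRY_RUN` variable are hypothetical names introduced for illustration; with `DRY_RUN=1` the commands are only printed, not executed.

```shell
# Hypothetical helper: run N rounds of minimap2 + racon, each round polishing
# the output of the previous one. With DRY_RUN=1 the commands are only printed.
# Arguments: <reads> <draft assembly> <number of rounds> [threads, default 8]
polish_rounds() (
    reads=$1; draft=$2; rounds=$3; threads=${4:-8}
    current=$draft
    i=1
    while [ "$i" -le "$rounds" ]; do
        paf="assembly.minimap4racon${i}.paf"
        out="assembly.racon${i}.fasta"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "minimap2 -t $threads $current $reads > $paf"
            echo "racon -t $threads $reads $paf $current > $out"
        else
            minimap2 -t "$threads" "$current" "$reads" > "$paf" || exit 1
            racon -t "$threads" "$reads" "$paf" "$current" > "$out" || exit 1
        fi
        # The next round polishes the assembly produced by this round
        current=$out
        i=$(( i + 1 ))
    done
    # Name of the last polished assembly
    echo "$current"
)

# Dry run: print the commands for the two rounds used in this practical
DRY_RUN=1 polish_rounds ~/work/DATA/4222_RB2.fastq.gz ../FLYE/assembly.fasta 2
```

Without `DRY_RUN=1`, the same call would execute the commands in the current directory, so it is meant to be run from `~/work/RESULTS/FLYE_RACON`.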

## <span style="color: #4CACBC;"> 3. Assemblies correction with Medaka </span>  

Correction can improve the consensus sequence for a draft genome assembly.

Medaka uses the basecalled reads (not the raw fast5 signal) to correct contigs using trained models. These models are freely available.

Medaka also allows you to train a model on data from your favorite species, which you can then use directly to obtain a consensus for your favorite organism.

### <span style="color: #4CACBC;"> 3.1 Correct assemblies with Medaka </span>  

We will use medaka to correct assemblies from FLYE and FLYE+RACON analysis.


### <span style="color: #4CACBC;"> 3.1.1 Index the assemblies before correcting them </span>  

Index the raw Flye assembly

In [8]:
cd ~/work/RESULTS/FLYE
pwd

/home/jovyan/work/RESULTS/FLYE


In [9]:
time samtools faidx assembly.fasta 
time minimap2 -d assembly.fasta.mmi assembly.fasta


real	0m0.144s
user	0m0.115s
sys	0m0.009s
[M::mm_idx_gen::0.423*1.00] collected minimizers
[M::mm_idx_gen::0.595*1.58] sorted minimizers
[M::main::0.721*1.48] loaded/built the index for 33 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 33
[M::mm_idx_stat::0.753*1.46] distinct minimizers: 3858964 (94.03% are singletons); average occurrences: 1.101; average spacing: 5.331
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -d assembly.fasta.mmi assembly.fasta
[M::main] Real time: 0.759 sec; CPU: 1.103 sec; Peak RSS: 0.217 GB

real	0m0.769s
user	0m0.927s
sys	0m0.185s


Index the polished assembly

In [10]:
cd ~/work/RESULTS/FLYE_RACON
pwd

/home/jovyan/work/RESULTS/FLYE_RACON


In [11]:
time samtools faidx assembly.racon2.fasta 
time minimap2 -d assembly.racon2.fasta.mmi assembly.racon2.fasta


real	0m0.114s
user	0m0.090s
sys	0m0.017s
[M::mm_idx_gen::0.433*1.00] collected minimizers
[M::mm_idx_gen::0.611*1.58] sorted minimizers
[M::main::0.790*1.45] loaded/built the index for 33 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 33
[M::mm_idx_stat::0.828*1.43] distinct minimizers: 3855804 (94.02% are singletons); average occurrences: 1.103; average spacing: 5.325
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -d assembly.racon2.fasta.mmi assembly.racon2.fasta
[M::main] Real time: 0.836 sec; CPU: 1.191 sec; Peak RSS: 0.216 GB

real	0m0.845s
user	0m1.019s
sys	0m0.180s


In [12]:
# create directories to store the Medaka results from the FLYE and FLYE-RACON2 analyses
mkdir -p ~/work/RESULTS/FLYE_MEDAKA
mkdir -p ~/work/RESULTS/FLYE_RACON_MEDAKA

### <span style="color: #4CACBC;"> 3.1.2 Medaka_consensus </span>  


Medaka is a tool to create a consensus sequence from nanopore sequencing data. 

This task is performed by using neural networks applied to a pileup of individual sequencing reads against a draft assembly.

It outperforms graph-based methods operating on basecalled data, and can be competitive with state-of-the-art signal-based methods whilst being much faster.

As input medaka accepts reads in either a .fasta or a .fastq file. It requires a draft assembly as a .fasta.

### Check the usage of medaka_consensus

In [13]:
medaka_consensus


medaka 1.7.2
------------

Assembly polishing via neural networks. Medaka is optimized
to work with the Flye assembler.

medaka_consensus [-h] -i <fastx> -d <fasta>

    -h  show this help text.
    -i  fastx input basecalls (required).
    -d  fasta input assembly (required).
    -o  output folder (default: medaka).
    -g  don't fill gaps in consensus with draft sequence.
    -r  use gap-filling character instead of draft sequence (default: None)
    -m  medaka model, (default: r941_min_hac_g507).
        Choices: r103_fast_g507 r103_hac_g507 r103_min_high_g345 r103_min_high_g360 r103_prom_high_g360 r103_sup_g507 r1041_e82_260bps_fast_g632 r1041_e82_260bps_hac_g632 r1041_e82_260bps_sup_g632 r1041_e82_400bps_fast_g615 r1041_e82_400bps_fast_g632 r1041_e82_400bps_hac_g615 r1041_e82_400bps_hac_g632 r1041_e82_400bps_sup_g615 r104_e81_fast_g5015 r104_e81_hac_g5015 r104_e81_sup_g5015 r104_e81_sup_g610 r10_min_high_g303 r10_min_high_g340 r941_e81_fast_g514 r941_e81_hac_g514 r941_e81_sup_g51


### Check the medaka model to use

Medaka models are named to indicate i) the pore type, ii) the sequencing device (MinION or PromethION), iii) the basecaller variant, and  iv) the basecaller version

{pore}_{device}_{caller variant}_{caller version}

examples: 

r941_min_fast_g303: MinION R9.4.1 flowcells using the fast Guppy basecaller version 3.0.3.
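
The naming scheme can be illustrated by splitting a model name on underscores. This is only a sketch: `describe_model` is a hypothetical helper, not part of medaka, and it handles only names with exactly four fields. Several names in the medaka listing (for example `r103_hac_g507` or `r1041_e82_400bps_sup_g615`) have a different number of fields and are reported as such.

```shell
# Hypothetical helper: split a four-field Medaka model name into its parts,
# following the {pore}_{device}_{caller variant}_{caller version} scheme.
describe_model() {
    echo "$1" | awk -F'_' '
        NF == 4 { printf "pore=%s device=%s variant=%s caller=%s\n", $1, $2, $3, $4; next }
                { print "not a four-field model name: " $0 }'
}

describe_model r941_min_fast_g303
describe_model r941_min_high_g360
```

The second call describes the model passed to `-m` in the `medaka_consensus` commands of this practical.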


In [14]:
medaka tools list_models

Available: r103_fast_g507, r103_fast_snp_g507, r103_fast_variant_g507, r103_hac_g507, r103_hac_snp_g507, r103_hac_variant_g507, r103_min_high_g345, r103_min_high_g360, r103_prom_high_g360, r103_prom_snp_g3210, r103_prom_variant_g3210, r103_sup_g507, r103_sup_snp_g507, r103_sup_variant_g507, r1041_e82_260bps_fast_g632, r1041_e82_260bps_fast_variant_g632, r1041_e82_260bps_hac_g632, r1041_e82_260bps_hac_variant_g632, r1041_e82_260bps_sup_g632, r1041_e82_260bps_sup_variant_g632, r1041_e82_400bps_fast_g615, r1041_e82_400bps_fast_g632, r1041_e82_400bps_fast_variant_g615, r1041_e82_400bps_fast_variant_g632, r1041_e82_400bps_hac_g615, r1041_e82_400bps_hac_g632, r1041_e82_400bps_hac_variant_g615, r1041_e82_400bps_hac_variant_g632, r1041_e82_400bps_sup_g615, r1041_e82_400bps_sup_variant_g615, r104_e81_fast_g5015, r104_e81_fast_variant_g5015, r104_e81_hac_g5015, r104_e81_hac_variant_g5015, r104_e81_sup_g5015, r104_e81_sup_g610, r104_e81_sup_variant_g610, r10_min_high_g303, r10_min_high_g340, r941

### Medaka on FLYE and FLYE + RACON

We run medaka first on the raw Flye assembly, then on the Flye assembly polished with racon.

In [15]:
# activate the medaka conda environment
conda activate medaka


In [16]:
# go to the Medaka results directory for the raw Flye assembly
cd ~/work/RESULTS/FLYE_MEDAKA


In [17]:
time medaka_consensus -i ~/work/DATA/4222_RB2.fastq.gz -d ~/work/RESULTS/FLYE/assembly.fasta -o MEDAKA_CONSENSUS -t 4 -m r941_min_high_g360

Checking program versions
This is medaka 1.7.2
Program    Version    Required   Pass     
bcftools   1.16       1.11       True     
bgzip      1.16       1.11       True     
minimap2   2.24       2.11       True     
samtools   1.16.1     1.11       True     
tabix      1.16       1.11       True     
Aligning basecalls to draft
Using the existing fai index file /home/jovyan/work/RESULTS/FLYE/assembly.fasta.fai
Creating mmi index file /home/jovyan/work/RESULTS/FLYE/assembly.fasta.map-ont.mmi
[M::mm_idx_gen::0.462*1.00] collected minimizers
[M::mm_idx_gen::0.632*1.53] sorted minimizers
[M::main::0.783*1.43] loaded/built the index for 33 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 33
[M::mm_idx_stat::0.818*1.41] distinct minimizers: 3858964 (94.03% are singletons); average occurrences: 1.101; average spacing: 5.331; total length: 22656985
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -I 16G -x map-ont -d /home/jovyan/work/RESULTS/FLYE/assembly.


In [18]:
# go to the Medaka results directory for the racon-polished assembly
cd ~/work/RESULTS/FLYE_RACON_MEDAKA


In [None]:
# medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${OUTDIR} -t ${NPROC} -m r941_min_high_g360
time medaka_consensus -i ~/work/DATA/4222_RB2.fastq.gz -d ~/work/RESULTS/FLYE_RACON/assembly.racon2.fasta -o MEDAKA_CONSENSUS -t 4 -m r941_min_high_g360

In [None]:
conda deactivate

### <span style="color: #4CACBC;"> Conclusion </span>  


1. Here, we have obtained a pseudomolecule of a clone by using Flye.

2. We have polished it twice with racon.

3. We have corrected both the raw Flye assembly and the racon-polished assembly with Medaka models.

You can run a similar pipeline using the Raven assembler with your favorite sample.

We will compare the results in the next practical, "Quality Assemblies". 