# <span style="color:green">Training in Abidjan 2023</span> - Introduction to MinION data analysis for viral metagenomes

Created by J. Orjuela (DIADE-IRD), D. Filloux (PHIM-CIRAD) and A. Comte (PHIM-IRD) 

September 2023

***

# <span style="color: #006E7F">Table of contents</span>
<a class="anchor" id="home"></a>
   

[TP4 : De novo assembly of the Metavirome](#tp4) 

[1. Assembly with Flye](#flye)

   * [1.1 Launch Flye](#runflye)
   * [1.2. Estimate the quality of the assembly](#quality)
       * [1.2.1. QUAST](#quast)
       * [1.2.2. CheckV](#checkv)
   * [1.3. Polishing of the meta-assembly](#polish)
   * [1.4. Estimate the quality of the polished assembly](#qualpolish)
       * [1.4.1. CheckV](#checkpolish)
   * [1.5. Taxonomic assignation of contigs](#contigs)
   * [1.6. Visualise the coverage of the reads in an interesting contig](#coverage)
      * [1.6.1. Remapping of the reads in an interesting contig](#contigcool)
      * [1.6.2. Visualise the coverage / depth of the reads on the contig on `TABLET`](#tablet)

***

# <span style="color:#006E7F">__TP4 : De novo assembly of the Metavirome__ <a class="anchor" id="tp4"></span>  


Metagenome assembly is the process of transforming a set of short, overlapping, and potentially erroneous DNA segments from environmental samples into an accurate representation of the genomes of the underlying microbiome. It is an efficient approach to deciphering the “microbial dark matter” in the microbiota based on metagenomic sequencing.

The objective is to reconstruct the viruses present in the dataset.

Many assemblers exist for long reads. Here we will focus on Flye, which is fast and often accurate:

- Flye : https://github.com/fenderglass/Flye

You can also have a look at SPAdes, another assembler that supports metagenomes.

- SPAdes : https://github.com/ablab/spades

## <span style="color: #4CACBC;"> 1. Assembly with Flye<a class="anchor" id="flye"> </span>

Flye is a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PacBio / ONT reads as input and outputs polished contigs. Flye also has a special mode for metagenome assembly.

In [None]:
# create the working directory
mkdir -p ~/work/SG-ONT-2023/ASSEMBLY/FLYE
cd ~/work/SG-ONT-2023/ASSEMBLY/FLYE

### <span style="color: #4CACBC;"> 1.1 Launch Flye<a class="anchor" id="runflye"> </span>

In [None]:
flye --help

In [None]:
# this can take time to run
# replace READS with the path to your cleaned reads (FASTQ)
time flye --meta --nano-hq READS -o out_flye

**How many contigs do we have after the assembly?**
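
One way to answer this is to count the FASTA header lines in the Flye output. The sketch below wraps that in a small helper; `count_contigs` is a name chosen for this example, and the path assumes the `out_flye` directory created by the command above.

```shell
# Count sequences in a FASTA file by counting header lines (lines starting with ">").
# count_contigs is a helper name chosen for this example, not a Flye command.
count_contigs() {
    grep -c '^>' "$1"
}

# Path assumed from the Flye command above (run from the FLYE working directory).
assembly=out_flye/assembly.fasta
if [ -f "$assembly" ]; then
    echo "contigs: $(count_contigs "$assembly")"
fi
```

Flye also writes `out_flye/assembly_info.txt`, a table with one row per contig (length, coverage, circularity), which is worth opening alongside this count.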

### <span style="color: #4CACBC;"> 1.2. Estimate the quality of the assembly<a class="anchor" id="quality"> </span>

#### <span style="color: #4CACBC;"> 1.2.1. QUAST<a class="anchor" id="quast"> </span>

QUAST evaluates genome assemblies.

QUAST works both with and without a reference genome.
The tool accepts multiple assemblies, so it is suitable for comparing them.

http://quast.sourceforge.net/quast

In [None]:
metaquast.py -h

In [None]:
time metaquast.py ../../ASSEMBLY/FLYE/out_flye/assembly.fasta --silent

**Observe QUAST output**

In [None]:
head -25 quast_results/latest/report.txt
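
The report lists, among other metrics, the total assembly length and the N50. To make the N50 concrete, here is a minimal sketch that recomputes both values directly from a FASTA file with `awk`. `fasta_n50` is a hypothetical helper, not part of QUAST, and because QUAST ignores contigs below a minimum length by default, its figures may differ slightly from these.

```shell
# Print the total length and N50 of a FASTA file.
# N50 = length of the contig at which the running sum of contig lengths,
# taken from longest to shortest, first reaches half of the total length.
fasta_n50() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # new record: emit previous length
        { len += length($0) }                            # accumulate sequence length
        END { if (len > 0) print len }
    ' "$1" | sort -nr | awk '
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= half) { printf "total=%d N50=%d\n", total, lens[i]; exit }
            }
        }'
}

# Path assumed from the Flye step (run from the FLYE working directory).
if [ -f out_flye/assembly.fasta ]; then
    fasta_n50 out_flye/assembly.fasta
fi
```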

#### <span style="color: #4CACBC;"> 1.2.2. CheckV<a class="anchor" id="checkv"> </span>

CheckV assesses the quality and completeness of metagenome-assembled viral genomes.

In [None]:
checkv --help

In [None]:
# download database
time checkv download_database ./

In [None]:
# the directory name must match the database version downloaded above (check with ls)
export CHECKVDB=~/work/SG-ONT-2023/ASSEMBLY/FLYE/checkv-db-v1.4
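
The name of the database directory depends on the version that `checkv download_database` fetched. A minimal sketch for picking it up automatically is shown here. It assumes the download was run in the FLYE working directory and that the directory follows the `checkv-db-v*` naming used in this practical; `find_checkv_db` is a hypothetical helper.

```shell
# Print the path of a checkv-db-v* directory found under the given folder.
# If several versions are present, the last one in glob (alphabetical) order wins.
# Returns a non-zero status when no such directory exists.
find_checkv_db() {
    found=""
    for d in "${1:-.}"/checkv-db-v*; do
        if [ -d "$d" ]; then found=$d; fi   # skip archives such as *.tar.gz
    done
    [ -n "$found" ] && printf '%s\n' "$found"
}

# Location assumed from the download step above.
if db=$(find_checkv_db ~/work/SG-ONT-2023/ASSEMBLY/FLYE); then
    export CHECKVDB=$db
    echo "CHECKVDB=$CHECKVDB"
fi
```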

In [None]:
time checkv end_to_end out_flye/assembly.fasta output_checkv

**Observe the different output files**

Are there any interesting high-quality viral contigs?

Contig_31 appears to be a high-quality viral contig, 6877 nucleotides long.
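
To find such contigs without scrolling through the whole table, the summary can be filtered on its quality column. The sketch below assumes that CheckV wrote `output_checkv/quality_summary.tsv`, that the first two columns are the contig name and its length, and that the quality tier is in a column named `checkv_quality`. The column is looked up by name in the header, and `checkv_best` is a hypothetical helper.

```shell
# List contigs that CheckV classifies as "Complete" or "High-quality".
# Prints: contig name, contig length, quality tier (tab-separated).
checkv_best() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "checkv_quality") q = i; next }
        q && ($q == "Complete" || $q == "High-quality") { print $1 "\t" $2 "\t" $q }
    ' "$1"
}

# Path assumed from the checkv end_to_end command above.
if [ -f output_checkv/quality_summary.tsv ]; then
    checkv_best output_checkv/quality_summary.tsv
fi
```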

### <span style="color: #4CACBC;"> 1.3. Polishing of the meta-assembly <a class="anchor" id="polish"> </span>

Medaka is a tool that creates a consensus sequence from nanopore sequencing data. This task is performed using neural networks applied to a pileup of individual sequencing reads against a draft assembly. It outperforms graph-based methods operating on basecalled data, and can be competitive with state-of-the-art signal-based methods whilst being much faster.

https://denbi-nanopore-training-course.readthedocs.io/en/latest/polishing/medaka/medaka.html

In [None]:
# create the working directory
mkdir -p ~/work/SG-ONT-2023/ASSEMBLY/MEDAKA
cd ~/work/SG-ONT-2023/ASSEMBLY/MEDAKA

In [None]:
medaka_consensus -h

In [None]:
# this can take a while to run
conda activate medaka
time medaka_consensus -i ../../CLEANING/reads_vs_ananas_unmapped.fastq -d ../FLYE/out_flye/assembly.fasta -m r941_prom_sup_g507 -o medaka_polishing
conda deactivate
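
Since polishing corrects substitutions and small insertions or deletions, a quick sanity check is to compare contig lengths before and after. The sketch below assumes that Medaka keeps the contig names of the draft (this practical refers to `contig_31` both before and after polishing) and that the paths match the commands above. `fasta_lengths` and `compare_lengths` are hypothetical helpers.

```shell
# Print "name<TAB>length" for every record of a FASTA file, sorted by name.
fasta_lengths() {
    awk '
        /^>/ { if (name != "") print name "\t" len; name = substr($1, 2); len = 0; next }
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k1,1
}

# Join draft and polished lengths on the contig name.
# Prints: name, draft length, polished length, difference (polished - draft).
compare_lengths() {
    draft_tab=$(mktemp)
    polished_tab=$(mktemp)
    fasta_lengths "$1" > "$draft_tab"
    fasta_lengths "$2" > "$polished_tab"
    LC_ALL=C join -t "$(printf '\t')" "$draft_tab" "$polished_tab" |
        awk -F'\t' '{ print $1 "\t" $2 "\t" $3 "\t" ($3 - $2) }'
    rm -f "$draft_tab" "$polished_tab"
}

# Paths assumed from the Flye and Medaka steps (run from the MEDAKA directory).
draft=../FLYE/out_flye/assembly.fasta
polished=medaka_polishing/consensus.fasta
if [ -f "$draft" ] && [ -f "$polished" ]; then
    compare_lengths "$draft" "$polished"
fi
```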

### <span style="color: #4CACBC;"> 1.4. Estimate the quality of the polished assembly<a class="anchor" id="qualpolish"> </span>

#### <span style="color: #4CACBC;"> 1.4.1. CheckV<a class="anchor" id="checkpolish"> </span>

In [None]:
# create the working directory
mkdir -p ~/work/SG-ONT-2023/ASSEMBLY/MEDAKA/checkv
cd ~/work/SG-ONT-2023/ASSEMBLY/MEDAKA/checkv

In [None]:
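# point to the same database directory as in section 1.2.2; the version suffix must match the directory on disk
export CHECKVDB=~/work/SG-ONT-2023/ASSEMBLY/FLYE/checkv-db-v1.5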

In [None]:
checkv end_to_end ../medaka_polishing/consensus.fasta output_checkv_polished

Compare the CheckV output for the polished and unpolished assemblies. Are there any differences?
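
A side-by-side view makes this comparison easier. The sketch below assumes both runs produced a `quality_summary.tsv` with the contig name in the first column and a `checkv_quality` column, and that contig names are unchanged by polishing. `compare_checkv` is a hypothetical helper.

```shell
# Print, for each contig of the second (polished) summary, its quality tier
# in the first (unpolished) summary and in the second one.
# "NA" marks contigs that are absent from the first summary.
compare_checkv() {
    awk -F'\t' '
        FNR == 1 { col = 0; for (i = 1; i <= NF; i++) if ($i == "checkv_quality") col = i; next }
        !col { next }                                   # column not found: skip this file
        NR == FNR { before[$1] = $col; next }           # first file: remember the tier
        { print $1 "\t" (($1 in before) ? before[$1] : "NA") "\t" $col }
    ' "$1" "$2"
}

# Paths assumed from the two CheckV runs (run from the MEDAKA/checkv directory).
unpolished=../../FLYE/output_checkv/quality_summary.tsv
polished=output_checkv_polished/quality_summary.tsv
if [ -f "$unpolished" ] && [ -f "$polished" ]; then
    compare_checkv "$unpolished" "$polished"
fi
```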

**What species / genus of virus is contig_31?**

### <span style="color: #4CACBC;"> 1.5. Taxonomic assignation of contigs<a class="anchor" id="contigs"> </span>

In [None]:
# create the working directory
mkdir -p ~/work/SG-ONT-2023/ASSEMBLY/DIAMOND
cd ~/work/SG-ONT-2023/ASSEMBLY/DIAMOND

We are going to reuse the viral DIAMOND database from TP3.

In [None]:
# create a symbolic link to the DIAMOND database
ln -s ~/work/SG-ONT-2023/ASSIGNATION/DIAMOND/viral.dmnd viral.dmnd

In [None]:
diamond blastx --quiet -d viral.dmnd --outfmt 6 stitle qtitle pident length mismatch gapopen qstart qend sstart send evalue bitscore -q ../MEDAKA/medaka_polishing/consensus.fasta -o diamond-matches.csv

In [None]:
head diamond-matches.csv

In [None]:
grep "contig_31" diamond-matches.csv

It looks like contig_31 is a Vitivirus.
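
The same check can be run on every contig at once by keeping only the best-scoring hit per contig. The sketch below relies on the column order requested with `--outfmt 6` above (column 1 `stitle`, column 2 `qtitle`, column 3 `pident`, column 12 `bitscore`) and on the file being tab-separated despite its `.csv` extension. `best_hits` is a hypothetical helper.

```shell
# For each query contig, keep the hit with the highest bitscore.
# Prints: contig, subject title, percent identity, bitscore (tab-separated).
best_hits() {
    awk -F'\t' '
        !($2 in score) || $12 + 0 > score[$2] + 0 {
            score[$2] = $12; title[$2] = $1; ident[$2] = $3
        }
        END { for (q in score) print q "\t" title[q] "\t" ident[q] "\t" score[q] }
    ' "$1" | LC_ALL=C sort
}

# File name assumed from the diamond blastx command above.
if [ -f diamond-matches.csv ]; then
    best_hits diamond-matches.csv
fi
```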

### <span style="color: #4CACBC;"> 1.6. Visualise the coverage of the reads in an interesting contig<a class="anchor" id="coverage"> </span>

#### <span style="color: #4CACBC;"> 1.6.1. Remapping of the reads in an interesting contig<a class="anchor" id="contigcool"> </span>

In [None]:
mkdir -p ~/work/SG-ONT-2023/ASSEMBLY/CONTIG31
cd  ~/work/SG-ONT-2023/ASSEMBLY/CONTIG31

In [None]:
# extract contig_31 from the multi-FASTA file
samtools faidx ../MEDAKA/medaka_polishing/consensus.fasta contig_31 > contig_31.fasta

In [None]:
minimap2 -ax map-ont contig_31.fasta ../../CLEANING/reads_vs_ananas_unmapped.fastq > contig31_vs_reads.sam

In [None]:
samtools flagstat contig31_vs_reads.sam

In [None]:
samtools view -b -S contig31_vs_reads.sam > contig31_vs_reads.bam

In [None]:
samtools sort contig31_vs_reads.bam -o contig31_vs_reads_sorted.bam

In [None]:
samtools index contig31_vs_reads_sorted.bam
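
The mapping, conversion, sorting, and indexing steps above can also be chained through a pipe, which avoids writing the intermediate SAM and unsorted BAM files. This is a sketch built from the same `minimap2` and `samtools` calls used in the cells above; `map_sorted` is a hypothetical helper name.

```shell
# Map reads to a reference and write a sorted, indexed BAM in one pass.
# Arguments: reference FASTA, reads FASTQ, output BAM.
map_sorted() {
    ref=$1
    reads=$2
    out=$3
    if [ ! -f "$ref" ] || [ ! -f "$reads" ]; then
        echo "missing input: $ref or $reads" >&2
        return 1
    fi
    # minimap2 writes SAM to stdout; samtools sort reads it from stdin ("-").
    minimap2 -ax map-ont "$ref" "$reads" | samtools sort -o "$out" - &&
        samtools index "$out"
}

# File names taken from the cells above (run from the CONTIG31 directory).
if [ -f contig_31.fasta ]; then
    map_sorted contig_31.fasta ../../CLEANING/reads_vs_ananas_unmapped.fastq contig31_vs_reads_sorted.bam
fi
```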

In [None]:
samtools coverage contig31_vs_reads_sorted.bam
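
For a per-base view, `samtools depth -a` prints one line per position (contig, position, depth), which can then be summarised with `awk`. The sketch below reads that three-column table from standard input; `depth_summary` is a hypothetical helper.

```shell
# Summarise a "contig<TAB>position<TAB>depth" table read from standard input.
# Reports the number of positions, the mean, minimum and maximum depth,
# and the percentage of positions covered by at least one read.
depth_summary() {
    awk -F'\t' '
        {
            sum += $3; n++
            if ($3 > 0) covered++
            if (n == 1 || $3 < min) min = $3
            if ($3 > max) max = $3
        }
        END {
            if (n) printf "positions=%d mean=%.1f min=%d max=%d covered=%.1f%%\n",
                          n, sum / n, min, max, 100 * covered / n
        }'
}

# BAM name taken from the cells above (run from the CONTIG31 directory).
if [ -f contig31_vs_reads_sorted.bam ]; then
    samtools depth -a contig31_vs_reads_sorted.bam | depth_summary
fi
```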

#### <span style="color: #4CACBC;"> 1.6.2. Visualise the coverage / depth of the reads on the contig on TABLET<a class="anchor" id="tablet"> </span>
1. Open Tablet.
2. Click on "Open Assembly".
3. Import the sorted BAM as the primary assembly and contig_31.fasta as the reference.

Explore the data.