# <span style="color:green">Training in Burkina Faso 2022</span> - Introduction to MinION data analysis for viral metagenomes

Created by J. Orjuela (DIADE-IRD), D. Filloux (PHIM-CIRAD) and A. Comte (PHIM-IRD) 

September 2022

***

# <span style="color: #006E7F">Table of contents</span>
<a class="anchor" id="home"></a>
   

[TP4 : De novo assembly of the Metavirome](#tp3) 

[1. Assembly with Flye](#flye)

   * [1.1 Launch Flye](#runflye)
   * [1.2. Estimate the quality of the assembly](#quality)
       * [1.2.1. QUAST](#quast)
       * [1.2.2. CheckV](#checkv)
   * [1.3. Polishing of the meta-assembly](#polish)
   * [1.4. Estimate the quality of the polished assembly](#qualpolish)
       * [1.4.1. CheckV](#checkpolish)
   * [1.5. Taxonomic assignation of contigs](#contigs)
   * [1.6. Visualise the coverage of the reads in an interesting contig](#coverage)
       * [1.6.1. Remapping of the reads in an interesting contig](#contigcool)
       * [1.6.2. Visualise the coverage / depth of the reads on the contig on `TABLET`](#tablet)

***

# <span style="color:#006E7F">__TP4 : De novo assembly of the Metavirome__ <a class="anchor" id="tp3"></span>  


Metagenome assembly is the process of transforming a set of short, overlapping, and potentially erroneous DNA segments from environmental samples into an accurate representation of the genomes of the underlying microbiome. Metagenome assembly is an efficient approach to deciphering the “microbial dark matter” in the microbiota based on metagenomic sequencing.

The objective is to reconstruct the genomes of the viruses present in the dataset.

Many assemblers exist for long reads. Here we will focus on Flye, which is fast and generally accurate:

- Flye : https://github.com/fenderglass/Flye

You can also have a look at SPAdes, another assembler that can handle metagenomes.

- SPAdes : https://github.com/ablab/spades

## <span style="color: #4CACBC;"> 1. Assembly with Flye<a class="anchor" id="flye"> </span>

Flye is a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PacBio / ONT reads as input and outputs polished contigs. Flye also has a special mode for metagenome assembly.

In [1]:
# create the working directory
mkdir -p ~/work/SG-ONT-2022/ASSEMBLY/FLYE
cd ~/work/SG-ONT-2022/ASSEMBLY/FLYE

### <span style="color: #4CACBC;"> 1.1 Launch Flye<a class="anchor" id="runflye"> </span>

In [2]:
flye --help

usage: flye (--pacbio-raw | --pacbio-corr | --pacbio-hifi | --nano-raw |
	     --nano-corr | --nano-hq ) file1 [file_2 ...]
	     --out-dir PATH

	     [--genome-size SIZE] [--threads int] [--iterations int]
	     [--meta] [--polish-target] [--min-overlap SIZE]
	     [--keep-haplotypes] [--debug] [--version] [--help] 
	     [--scaffold] [--resume] [--resume-from] [--stop-after] 
	     [--read-error float] [--extra-params]

Assembly of long reads with repeat graphs

optional arguments:
  -h, --help            show this help message and exit
  --pacbio-raw path [path ...]
                        PacBio regular CLR reads (<20% error)
  --pacbio-corr path [path ...]
                        PacBio reads that were corrected with other methods
                        (<3% error)
  --pacbio-hifi path [path ...]
                        PacBio HiFi reads (<1% error)
  --nano-raw path [path ...]
                        ONT regular reads, pre-Guppy5 (<20% error)
  --nano-corr path [path ...]
     

In [3]:
# this can take time to run
time flye --meta --nano-hq ../../CLEANING/reads_vs_ananas_unmapped.fastq -o out_flye

[2022-09-05 21:44:01] INFO: Starting Flye 2.9-b1768
[2022-09-05 21:44:01] INFO: >>>STAGE: configure
[2022-09-05 21:44:01] INFO: Configuring run
[2022-09-05 21:44:04] INFO: Total read length: 174205060
[2022-09-05 21:44:04] INFO: Reads N50/N90: 435 / 276
[2022-09-05 21:44:04] INFO: Minimum overlap set to 1000
[2022-09-05 21:44:04] INFO: >>>STAGE: assembly
[2022-09-05 21:44:04] INFO: Assembling disjointigs
[2022-09-05 21:44:04] INFO: Reading sequences
[2022-09-05 21:44:05] INFO: Building minimizer index
[2022-09-05 21:44:05] INFO: Pre-calculating index storage
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2022-09-05 21:44:09] INFO: Filling index
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2022-09-05 21:44:18] INFO: Extending reads
[2022-09-05 21:44:24] INFO: Overlap-based coverage: 22
[2022-09-05 21:44:24] INFO: Median overlap divergence: 0.0776014
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2022-09-05 21:45:00] INFO: Assembled 54 disjointigs
[2022-09-05 21:45:00] INFO: Generating se

**How many contigs do we have after the assembly?**
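
One way to answer this question is to count the FASTA header lines in the assembly, since each contig starts with a `>` line. The sketch below is only an illustration: the helper name `count_contigs` is made up here, and it is demonstrated on a toy FASTA file with invented sequences rather than on the course data.

```shell
# Count the sequences in a FASTA file: each record starts with a '>' header line.
count_contigs() {
    grep -c '^>' "$1"
}

# Toy FASTA for demonstration only (made-up sequences, not course data).
demo_dir=$(mktemp -d)
printf '>toy_1\nACGTACGT\n>toy_2\nGGCC\nTTAA\n>toy_3\nA\n' > "$demo_dir/toy.fasta"

n_contigs=$(count_contigs "$demo_dir/toy.fasta")
echo "contigs: $n_contigs"
```

On the course data, the same helper could be called from the FLYE directory as `count_contigs out_flye/assembly.fasta`. Flye also writes an `assembly_info.txt` file in its output directory that lists each contig with its length and coverage.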

### <span style="color: #4CACBC;"> 1.2. Estimate the quality of the assembly<a class="anchor" id="quality"> </span>

#### <span style="color: #4CACBC;"> 1.2.1. QUAST<a class="anchor" id="quast"> </span>

QUAST evaluates genome assemblies; MetaQUAST, used here, is its extension for metagenome assemblies. It works both with and without reference genomes and accepts several assemblies at once, which makes it suitable for comparing them.

http://quast.sourceforge.net/quast

In [4]:
metaquast.py -h

MetaQUAST: Quality Assessment Tool for Metagenome Assemblies
Version: 5.2.0

Usage: python /opt/conda/envs/quast/bin/metaquast.py [options] <files_with_contigs>

Options:
-o  --output-dir  <dirname>       Directory to store all result files [default: quast_results/results_<datetime>]
-r   <filename,filename,...>      Comma-separated list of reference genomes or directory with reference genomes
--references-list <filename>      Text file with list of reference genome names for downloading from NCBI
-g  --features [type:]<filename>  File with genomic feature coordinates in the references (GFF, BED, NCBI or TXT)
                                  Optional 'type' can be specified for extracting only a specific feature type from GFF
-m  --min-contig  <int>           Lower threshold for contig length [default: 500]
-t  --threads     <int>           Maximum number of threads [default: 25% of CPUs]

Advanced options:
-s  --split-scaffolds                 Split assemblies by continuous fragments

In [5]:
time metaquast.py ../../ASSEMBLY/FLYE/out_flye/assembly.fasta --silent

/opt/conda/envs/quast/bin/metaquast.py ../../ASSEMBLY/FLYE/out_flye/assembly.fasta --silent


System information:
  OS: Linux-5.10.0-11-cloud-amd64-x86_64-with-debian-bullseye-sid (linux_64)
  Python version: 3.7.12
  CPUs number: 8

Started: 2022-09-05 21:57:44

Logging to /home/jovyan/work/SG-ONT-2022/ASSEMBLY/FLYE/quast_results/results_2022_09_05_21_57_44/metaquast.log
NOTICE: Maximum number of threads is set to 2 (use --threads option to set it manually)

Contigs:
  Pre-processing...
  ../../ASSEMBLY/FLYE/out_flye/assembly.fasta ==> assembly

No references are provided, starting to search for reference genomes in SILVA 16S rRNA database and to download them from NCBI...

2022-09-05 21:57:47
NOTICE: Permission denied accessing /opt/conda/envs/quast/lib/python3.7/site-packages/quast_libs/silva. Silva will be downloaded to home directory /home/jovyan/.quast
 0.0% of 197133995 bytes
 1.0% of 197133995 bytes
 2.0% of 197133995 bytes
 3.0% of 197133995 bytes
 4.0% of 197133995 bytes
 5.0

**Examine the QUAST output**

In [6]:
head -25 quast_results/latest/report.txt

All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).

Assembly                    assembly
# contigs (>= 0 bp)         32      
# contigs (>= 1000 bp)      32      
# contigs (>= 5000 bp)      3       
# contigs (>= 10000 bp)     0       
# contigs (>= 25000 bp)     0       
# contigs (>= 50000 bp)     0       
Total length (>= 0 bp)      89929   
Total length (>= 1000 bp)   89929   
Total length (>= 5000 bp)   20138   
Total length (>= 10000 bp)  0       
Total length (>= 25000 bp)  0       
Total length (>= 50000 bp)  0       
# contigs                   32      
Largest contig              7608    
Total length                89929   
GC (%)                      45.28   
N50                         2455    
N90                         2144    
auN                         3444.4  
L50                         12      
L90                         28      
# N's per 100 kbp          
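
The N50 and L50 values in the report can be easier to interpret after computing them by hand once. N50 is the length of the contig at which, going from the longest contig to the shortest, the cumulative length first reaches half of the total assembly length; L50 is the number of contigs needed to get there. The following is a minimal sketch on made-up contig lengths; the function name `n50_l50` is hypothetical and this is not how QUAST itself is implemented.

```shell
# Read one contig length per line on stdin; print "N50 L50".
n50_l50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += len[i]
                # first contig at which the cumulative length reaches half the total
                if (running * 2 >= total) { print len[i], i; exit }
            }
        }'
}

# Toy lengths (made up): total 1500, half 750; 500 + 400 = 900 >= 750.
result=$(printf '%s\n' 100 200 300 400 500 | n50_l50)
echo "N50 L50: $result"
```

With these toy lengths the two longest contigs (500 and 400) already cover half of the total, so the sketch reports an N50 of 400 and an L50 of 2.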

#### <span style="color: #4CACBC;"> 1.2.2. CheckV<a class="anchor" id="checkv"> </span>

CheckV assesses the quality and completeness of metagenome-assembled viral genomes.

In [7]:
checkv --help

CheckV v0.9.0: assessing the quality of metagenome-assembled viral genomes
https://bitbucket.org/berkeleylab/checkv

usage: checkv <program> [options]

programs:
    end_to_end          run full pipeline to estimate completeness, contamination, and identify closed genomes
    contamination       identify and remove host contamination on integrated proviruses
    completeness        estimate completeness for genome fragments
    complete_genomes    identify complete genomes based on terminal repeats and flanking host regions
    quality_summary     summarize results across modules
    download_database   download the latest version of CheckV's database
    update_database     update CheckV's database with your own complete genomes

optional arguments:
  -h, --help  show this help message and exit


In [8]:
# download database
time checkv download_database ./


CheckV v0.9.0: download_database
[1/4] Checking latest version of CheckV's database...
[2/4] Downloading 'checkv-db-v1.4'...
[3/4] Extracting 'checkv-db-v1.4'...
[4/4] Building DIAMOND database...
Run time: 223.36 seconds
Peak mem: 1.28 GB

real	3m43.798s
user	1m55.478s
sys	0m27.053s


In [11]:
export CHECKVDB=~/work/SG-ONT-2022/ASSEMBLY/FLYE/checkv-db-v1.4

In [12]:
time checkv end_to_end out_flye/assembly.fasta output_checkv


CheckV v0.9.0: contamination
[1/8] Reading database info...
[2/8] Reading genome info...
[3/8] Calling genes with Prodigal...
[4/8] Reading gene info...
[5/8] Running hmmsearch...
[6/8] Annotating genes...
[7/8] Identifying host regions...
[8/8] Writing results...
Run time: 88.41 seconds
Peak mem: 0.22 GB

CheckV v0.9.0: completeness
[1/8] Skipping gene calling...
[2/8] Initializing queries and database...
[3/8] Running DIAMOND blastp search...
[4/8] Computing AAI...
[5/8] Running AAI based completeness estimation...
[6/8] Running HMM based completeness estimation...
[7/8] Determining genome copy number...
[8/8] Writing results...
Run time: 34.59 seconds
Peak mem: 1.47 GB

CheckV v0.9.0: complete_genomes
[1/7] Reading input sequences...
[2/7] Finding complete proviruses...
[3/7] Finding direct/inverted terminal repeats...
[4/7] Filtering terminal repeats...
[5/7] Checking genome for completeness...
[6/7] Checking genome for large duplications...
[7/7] Writing results...
Run time: 0.01

**Look at the different output files**

Are there any interesting high-quality viral contigs?

contig_31 appears to be a high-quality viral contig, 6877 nucleotides long.
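
To spot such contigs without reading the whole table, the CheckV `quality_summary.tsv` file can be filtered on its `checkv_quality` column. The sketch below assumes a tab-separated file with `contig_id` and `checkv_quality` columns and looks them up by header name, so the exact column order does not matter. The function name `high_quality_contigs` and the toy table are invented for illustration.

```shell
# Print contig_id and checkv_quality for contigs rated Complete or High-quality.
# Columns are located by header name, so their position in the file does not matter.
high_quality_contigs() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $col["checkv_quality"] == "Complete" || $col["checkv_quality"] == "High-quality" {
            print $col["contig_id"] "\t" $col["checkv_quality"]
        }' "$1"
}

# Toy table for demonstration only (all values are made up).
demo_dir=$(mktemp -d)
printf '%s\t%s\t%s\n' \
    contig_id contig_length checkv_quality \
    toy_1 6000 High-quality \
    toy_2 1500 Low-quality \
    toy_3 2000 Not-determined \
    > "$demo_dir/toy_summary.tsv"

hq=$(high_quality_contigs "$demo_dir/toy_summary.tsv")
echo "$hq"
```

On the course data, the helper could be pointed at `output_checkv/quality_summary.tsv`.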

### <span style="color: #4CACBC;"> 1.3. Polishing of the meta-assembly <a class="anchor" id="polish"> </span>

Medaka is a tool that creates consensus sequences from nanopore sequencing data. It does so with neural networks applied to a pileup of individual sequencing reads against a draft assembly. It outperforms graph-based methods operating on basecalled data, and can be competitive with state-of-the-art signal-based methods whilst being much faster.

https://denbi-nanopore-training-course.readthedocs.io/en/latest/polishing/medaka/medaka.html

In [13]:
# create the working directory
mkdir -p ~/work/SG-ONT-2022/ASSEMBLY/MEDAKA
cd ~/work/SG-ONT-2022/ASSEMBLY/MEDAKA

In [14]:
medaka_consensus -h


medaka 1.6.1
------------

Assembly polishing via neural networks. Medaka is optimized
to work with the Flye assembler.

medaka_consensus [-h] -i <fastx> -d <fasta>

    -h  show this help text.
    -i  fastx input basecalls (required).
    -d  fasta input assembly (required).
    -o  output folder (default: medaka).
    -g  don't fill gaps in consensus with draft sequence.
    -m  medaka model, (default: r941_min_hac_g507).
        Choices: r103_fast_g507 r103_hac_g507 r103_min_high_g345 r103_min_high_g360 r103_prom_high_g360 r103_sup_g507 r1041_e82_400bps_fast_g615 r1041_e82_400bps_hac_g615 r1041_e82_400bps_sup_g615 r104_e81_fast_g5015 r104_e81_hac_g5015 r104_e81_sup_g5015 r104_e81_sup_g610 r10_min_high_g303 r10_min_high_g340 r941_e81_fast_g514 r941_e81_hac_g514 r941_e81_sup_g514 r941_min_fast_g303 r941_min_fast_g507 r941_min_hac_g507 r941_min_high_g303 r941_min_high_g330 r941_min_high_g340_rle r941_min_high_g344 r941_min_high_g351 r941_min_high_g360 r941_min_sup_g507 r941_prom_fast

In [15]:
# this can take a while to run
conda activate medaka
time medaka_consensus -i ../../CLEANING/reads_vs_ananas_unmapped.fastq -d ../FLYE/out_flye/assembly.fasta -m r941_prom_sup_g507 -o medaka_polishing
conda deactivate

(medaka) Checking program versions
This is medaka 1.6.1
Program    Version    Required   Pass     
bcftools   1.15.1     1.11       True     
bgzip      1.15.1     1.11       True     
minimap2   2.24       2.11       True     
samtools   1.15.1     1.11       True     
tabix      1.15.1     1.11       True     
Aligning basecalls to draft
Creating fai index file /home/jovyan/work/SG-ONT-2022/ASSEMBLY/FLYE/out_flye/assembly.fasta.fai
Creating mmi index file /home/jovyan/work/SG-ONT-2022/ASSEMBLY/FLYE/out_flye/assembly.fasta.map-ont.mmi
[M::mm_idx_gen::0.011*1.25] collected minimizers
[M::mm_idx_gen::0.016*1.76] sorted minimizers
[M::main::0.021*1.58] loaded/built the index for 32 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 32
[M::mm_idx_stat::0.022*1.55] distinct minimizers: 15028 (92.42% are singletons); average occurrences: 1.174; average spacing: 5.097; total length: 89929
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -I 16G -x map-ont -d /h


### <span style="color: #4CACBC;"> 1.4. Estimate the quality of the polished assembly<a class="anchor" id="qualpolish"> </span>

#### <span style="color: #4CACBC;"> 1.4.1. CheckV<a class="anchor" id="checkpolish"> </span>

In [None]:
# create the working directory
mkdir -p ~/work/SG-ONT-2022/ASSEMBLY/MEDAKA/checkv
cd ~/work/SG-ONT-2022/ASSEMBLY/MEDAKA/checkv

In [None]:
export CHECKVDB=~/work/SG-ONT-2022/ASSEMBLY/FLYE/checkv-db-v1.4

In [None]:
checkv end_to_end ../medaka_polishing/consensus.fasta output_checkv_polished

Compare the CheckV output for the polished and unpolished assemblies. Are there any differences?

**Which species / genus of virus does contig_31 belong to?**
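
One possible way to make this comparison is to put the two `quality_summary.tsv` files side by side and list only the contigs whose quality tier changed. The sketch below assumes that contig names are the same before and after polishing and that both files have `contig_id` and `checkv_quality` columns. The function name `compare_quality` and the toy tables are invented for illustration.

```shell
# Print "contig_id <tab> quality_before <tab> quality_after" for contigs whose
# checkv_quality differs between two CheckV summaries (before, after).
compare_quality() {
    awk -F'\t' '
        FNR == 1 {
            # locate the columns by header name in each file
            for (i = 1; i <= NF; i++) {
                if ($i == "contig_id") id = i
                if ($i == "checkv_quality") q = i
            }
            next
        }
        NR == FNR { before[$id] = $q; next }
        ($id in before) && before[$id] != $q { print $id "\t" before[$id] "\t" $q }
    ' "$1" "$2"
}

# Toy tables for demonstration only (all values are made up).
demo_dir=$(mktemp -d)
printf '%s\t%s\n' contig_id checkv_quality toy_1 Medium-quality toy_2 Low-quality \
    > "$demo_dir/before.tsv"
printf '%s\t%s\n' contig_id checkv_quality toy_1 High-quality toy_2 Low-quality \
    > "$demo_dir/after.tsv"

changed=$(compare_quality "$demo_dir/before.tsv" "$demo_dir/after.tsv")
echo "$changed"
```

In the toy example only `toy_1` is reported, because `toy_2` has the same tier in both tables.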

### <span style="color: #4CACBC;"> 1.5. Taxonomic assignation of contigs<a class="anchor" id="contigs"> </span>

In [None]:
# create the working directory
mkdir -p ~/work/SG-ONT-2022/ASSEMBLY/DIAMOND
cd ~/work/SG-ONT-2022/ASSEMBLY/DIAMOND

We are going to reuse the viral DIAMOND database from TP3.

In [None]:
# create a symbolic link to the DIAMOND database
ln -s ~/work/SG-ONT-2022/ASSIGNATION/DIAMOND/viral.dmnd viral.dmnd

In [None]:
diamond blastx --quiet -d viral.dmnd --outfmt 6 stitle qtitle pident length mismatch gapopen qstart qend sstart send evalue bitscore -q ../MEDAKA/medaka_polishing/consensus.fasta -o diamond-matches.csv

In [None]:
head diamond-matches.csv

In [None]:
grep "contig_31" diamond-matches.csv

It looks like contig_31 is a Vitivirus.
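
When a contig has many hits, it can help to keep only the best-scoring hit per contig. The sketch below assumes the tab-separated column order requested in the `diamond blastx` command above, with the subject title in column 1, the query title in column 2 and the bitscore in column 12. The function name `best_hits` and the toy rows are invented for illustration.

```shell
# Keep, for each query (column 2), the hit with the highest bitscore (column 12).
# Prints: query <tab> subject title <tab> bitscore
best_hits() {
    tab=$(printf '\t')
    sort -t "$tab" -k2,2 -k12,12nr "$1" |
        awk -F'\t' '!seen[$2]++ { print $2 "\t" $1 "\t" $12 }'
}

# Toy matches for demonstration only (all values are made up), 12 columns per row.
demo_dir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    'Toy protein B' toy_1 60.0 90 36 0 1 270 1 90 1e-20 95.5 \
    'Toy protein A' toy_1 80.0 100 20 0 1 300 1 100 1e-50 200 \
    'Toy protein C' toy_2 70.0 50 15 0 4 153 1 50 1e-10 60 \
    > "$demo_dir/toy_matches.tsv"

best=$(best_hits "$demo_dir/toy_matches.tsv")
echo "$best"
```

On the course data, the helper could be applied to `diamond-matches.csv`, which is tab-separated despite its extension.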

### <span style="color: #4CACBC;"> 1.6. Visualise the coverage of the reads in an interesting contig<a class="anchor" id="coverage"> </span>

#### <span style="color: #4CACBC;"> 1.6.1. Remapping of the reads in an interesting contig<a class="anchor" id="contigcool"> </span>

In [None]:
mkdir -p ~/work/SG-ONT-2022/ASSEMBLY/CONTIG31
cd  ~/work/SG-ONT-2022/ASSEMBLY/CONTIG31

In [None]:
# extract contig_31 from the multi-FASTA file
samtools faidx ../MEDAKA/medaka_polishing/consensus.fasta contig_31 > contig_31.fasta

In [None]:
minimap2 -ax map-ont contig_31.fasta ../../CLEANING/reads_vs_ananas_unmapped.fastq > contig31_vs_reads.sam

In [None]:
samtools flagstat contig31_vs_reads.sam

In [None]:
samtools view -b -S contig31_vs_reads.sam > contig31_vs_reads.bam

In [None]:
samtools sort contig31_vs_reads.bam -o contig31_vs_reads_sorted.bam

In [None]:
samtools index contig31_vs_reads_sorted.bam

In [None]:
samtools coverage contig31_vs_reads_sorted.bam
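
For a quick numerical summary before opening a viewer, the per-position depth can also be condensed into a mean depth and a breadth of coverage. The sketch below assumes three tab-separated columns (reference name, position, depth), which is the layout printed by `samtools depth`. The function name `summarise_depth` and the toy input are invented for illustration.

```shell
# Read "reference <tab> position <tab> depth" lines on stdin and print the number of
# positions, the mean depth, and the fraction of positions covered by at least one read.
summarise_depth() {
    awk -F'\t' '
        { n++; sum += $3; if ($3 > 0) covered++ }
        END {
            if (n > 0)
                printf "positions=%d mean_depth=%.2f breadth=%.2f\n", n, sum / n, covered / n
        }'
}

# Toy depth table for demonstration only (made-up values).
depth_summary=$(printf 'toy_1\t1\t0\ntoy_1\t2\t4\ntoy_1\t3\t6\ntoy_1\t4\t10\n' | summarise_depth)
echo "$depth_summary"
```

On the course data, this could be used as `samtools depth -a contig31_vs_reads_sorted.bam | summarise_depth`, where `-a` asks samtools to report positions with zero depth as well.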

#### <span style="color: #4CACBC;"> 1.6.2. Visualise the coverage / depth of the reads on the contig on TABLET<a class="anchor" id="tablet"> </span>
1. Open Tablet.
2. Click on Open Assembly.
3. Import the sorted BAM file as the primary assembly and contig_31.fasta as the reference.

Explore the data.