# <span style="color:green">Training in Burkina Faso 2022</span> - Introduction to MinION data analysis for viral metagenome analysis

Created by J. Orjuela (DIADE-IRD), D. Filloux (PHIM-CIRAD) and A. Comte (PHIM-IRD) 

September 2022

***

# <span style="color: #006E7F">Table of contents</span>
<a class="anchor" id="home"></a>
   

[TP3 : TAXONOMIC ASSIGNMENT OF READS](#tp3) 

[1. Use Diamond for taxonomic assignment](#diamond)

   * [1.1. Download Genomic viral bank](#viraldbdiamond)
   * [1.2. Create Diamond Database](#diamondmakedb)
   * [1.3. Launch Diamond](#rundiamond) 
   
[2. Use KAIJU for taxonomic assignment](#kaiju)
   * [2.1. Create Kaiju viruses database](#kaijudb)
   * [2.2. Launch Kaiju](#kaiju)  
   * [2.3. Adding taxa names to output file](#kaijunames) 
   * [2.4. Creating input file for Krona](#kronainput) 

[3. (BONUS) Use KRAKEN2 for taxonomic assignment](#kraken2)
   * [3.1. Download a viral database](#viraldb)
   * [3.2. Run Kraken](#kraken)
   * [3.3. Visualise Kraken output with Krona](#krakenkrona)
   
</span>

***


# <span style="color:#006E7F">__TP3 : TAXONOMIC ASSIGNMENT OF READS__ <a class="anchor" id="tp3"></span>  


Taxonomic assignment is the process of assigning sequences (reads or contigs) to an Operational Taxonomic Unit (OTU, that is, a group of related individuals). To assign an OTU to a sequence, the sequence is compared against a database, and this comparison can be done in different ways. The comparison database in this assignment process must be constructed using complete genomes. Many programs perform taxonomic assignment, and almost all of them follow one of these strategies:


- BLAST: These tools use BLAST or DIAMOND to search a reference database for the most likely hit for each sequence (i.e. alignment-based mapping). This strategy is generally slower than the k-mer approach.

- K-mers: A genome database is broken into pieces of length k, so that pieces unique to each taxonomic group can be searched for, from the lowest common ancestor (LCA) through phylum down to species. The algorithm then breaks the query sequence (reads or contigs) into pieces of length k, looks for where these are placed within the tree, and classifies the sequence at the most probable position.

- Markers: These tools search the sequences to be classified for markers taken from a database built a priori, and assign the taxonomy depending on the hits obtained.

https://carpentries-incubator.github.io/metagenomics/06-taxonomic/index.html
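
To make the k-mer strategy more concrete, here is a minimal sketch of how a sequence is cut into overlapping pieces of length k. The helper name `kmers` and the toy sequence are hypothetical and only serve as an illustration; real classifiers do this internally and far more efficiently.

```shell
# Print every overlapping k-mer of a sequence, one per line.
# "kmers" is a hypothetical helper, not part of any classifier.
kmers() {
    awk -v s="$1" -v k="$2" 'BEGIN {
        # slide a window of length k along the sequence, one base at a time
        for (i = 1; i + k - 1 <= length(s); i++)
            print substr(s, i, k)
    }'
}

# Toy example: a 6-base sequence contains 6 - 4 + 1 = 3 k-mers of length 4
kmers "ATGCGT" 4
```

A sequence of length L contains L - k + 1 overlapping k-mers, so a long Nanopore read yields thousands of k-mers that can each be looked up in the database.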

## <span style="color: #4CACBC;"> 1. Use Diamond for taxonomic assignment<a class="anchor" id="diamond"> </span>

### <span style="color: #4CACBC;"> 1.1. Download Genomic viral bank<a class="anchor" id="viraldbdiamond"> </span>

In [None]:
# create the working directory
mkdir -p ~/work/SG-ONT-2022/ASSIGNATION/DIAMOND

In [None]:
# move into the working directory
cd ~/work/SG-ONT-2022/ASSIGNATION/DIAMOND
pwd

In [None]:
# RefSeq viral protein database, pre-downloaded from NCBI (https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/)
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training-2022/viral.protein.faa

The database you use will determine the result you get for your data.

You can customise it by adding sequences from other organisms to the FASTA file used to build it.

Imagine you are searching for a lineage that was recently discovered and it is not part of the available databases. Would you find it?
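
One possible way to customise the bank is simply to concatenate extra protein sequences to the FASTA file before building the database. The sketch below uses two tiny made-up FASTA files; the helper `add_to_bank` and all file names in it are hypothetical.

```shell
# add_to_bank BANK EXTRA OUT: concatenate two protein FASTA files into OUT
# and report how many sequences OUT contains (hypothetical helper).
add_to_bank() {
    cat "$1" "$2" > "$3"
    # every FASTA record starts with a header line beginning with ">"
    grep -c '^>' "$3"
}

# Toy demonstration with made-up sequences in a temporary directory
workdir=$(mktemp -d)
printf '>protA\nMKV\n>protB\nMLT\n' > "$workdir/bank.faa"
printf '>new_lineage_prot\nMSS\n' > "$workdir/extra.faa"
add_to_bank "$workdir/bank.faa" "$workdir/extra.faa" "$workdir/custom.faa"
```

The combined file would then be passed to `diamond makedb --in` in place of `viral.protein.faa`. Make sure the first file ends with a newline, otherwise the first added header is glued to the last sequence of the bank.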

### <span style="color: #4CACBC;"> 1.2. Create Diamond Database<a class="anchor" id="diamondmakedb"> </span>

In [None]:
diamond makedb --in viral.protein.faa -d viral

### <span style="color: #4CACBC;"> 1.3. Launch Diamond<a class="anchor" id="rundiamond"> </span>

In [None]:
# Complete the command line below
diamond blastx --outfmt 6 stitle qtitle pident length mismatch gapopen qstart qend sstart send evalue bitscore ....

In [None]:
# list the database entries that matched the most reads in the data (top 20):
awk -F '\t' '{print $1}' diamond-matches.csv | sort | uniq -c | sort -n | tail -20

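**Inspect the results in the csv file and comment on them.**

Be careful with the separator: despite the .csv extension, the columns are separated by tabs, so choose tab when opening the file.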
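
Once `diamond-matches.csv` exists, one possible way to summarise it is to keep only the best-scoring hit for each read. The sketch below assumes the 12-column layout requested with `--outfmt 6` above (query title in column 2, bit score in column 12). The helper `best_hits` and the three toy rows are hypothetical.

```shell
# best_hits FILE: keep the hit with the highest bit score for each query.
# Assumes tab-separated columns with qtitle in column 2 and bitscore in column 12.
best_hits() {
    # sort by query title, then by bit score (descending),
    # and keep the first line seen for each query
    sort -t "$(printf '\t')" -k2,2 -k12,12nr "$1" | awk -F '\t' '!seen[$2]++'
}

# Toy data only: three made-up hits for two reads
toy_hits=$(mktemp)
printf 'Virus A protein\tread1\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200.5\n' > "$toy_hits"
printf 'Virus B protein\tread1\t60.0\t80\t32\t0\t1\t240\t1\t80\t1e-10\t80.1\n' >> "$toy_hits"
printf 'Virus B protein\tread2\t85.0\t90\t13\t0\t1\t270\t5\t94\t1e-40\t150.0\n' >> "$toy_hits"
best_hits "$toy_hits"
```

On the real data, `best_hits diamond-matches.csv` could be piped into the `awk ... | sort | uniq -c` command above so that each read is counted only once.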

## <span style="color: #4CACBC;"> 2. Use KAIJU for taxonomic assignment <a class="anchor" id="kaiju"> </span>

Kaiju is a program for the taxonomic classification of high-throughput sequencing reads, e.g., Illumina or Roche/454, from whole-genome sequencing of metagenomic DNA. Reads are directly assigned to taxa using the NCBI taxonomy and a reference database of protein sequences from microbial and viral genomes.

Kaiju can also be used via a web server: https://kaiju.binf.ku.dk/server

In [None]:
# create the working directory
mkdir -p ~/work/SG-ONT-2022/ASSIGNATION/KAIJU

In [None]:
cd ~/work/SG-ONT-2022/ASSIGNATION/KAIJU

### <span style="color: #4CACBC;"> 2.1 Create Kaiju viruses database<a class="anchor" id="kaijudb"> </span>

In [None]:
kaiju-makedb -s viruses

### <span style="color: #4CACBC;"> 2.2 Launch Kaiju (takes a while to run)<a class="anchor" id="kaiju"> </span>

In [None]:
kaiju --help

In [None]:
kaiju -t nodes.dmp -z 4 -f viruses/kaiju_db_viruses.fmi -i ~/work/SG-ONT-2022/CLEANING/reads_vs_ananas_unmapped.fastq -v -o kaiju.out

**output format**

Kaiju prints one line for each read or read pair. The default output format contains three tab-separated columns (the first three items below). The option -v enables the verbose output, which adds the remaining four columns:

- either C or U, indicating whether the read is classified or unclassified.
- name of the read
- NCBI taxon identifier of the assigned taxon
- the length or score of the best match used for classification
- the taxon identifiers of all database sequences with the best match
- the accession numbers of all database sequences with the best match
- matching fragment sequence(s)

In [None]:
head kaiju.out
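
Since the first column flags each read as classified (C) or unclassified (U), a quick summary can be obtained by counting these flags. This is a minimal sketch: the helper `kaiju_summary`, the toy reads and the taxon identifiers are made up for illustration.

```shell
# kaiju_summary FILE: count classified (C) and unclassified (U) reads
# from the first column of a Kaiju output file (hypothetical helper).
kaiju_summary() {
    awk -F '\t' '
        { n[$1]++ }
        END {
            total = n["C"] + n["U"]
            if (total > 0)
                printf "classified=%d unclassified=%d percent_classified=%.1f\n",
                       n["C"], n["U"], 100 * n["C"] / total
        }' "$1"
}

# Toy data only: four made-up reads in the default three-column layout
toy_kaiju=$(mktemp)
printf 'C\tread1\t11111\nC\tread2\t11111\nC\tread3\t22222\nU\tread4\t0\n' > "$toy_kaiju"
kaiju_summary "$toy_kaiju"
```

On the real data this would be run as `kaiju_summary kaiju.out`. The extra columns produced by -v do not matter here because only the first column is read.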

### <span style="color: #4CACBC;"> 2.3. Adding taxa names to output file<a class="anchor" id="kaijunames"> </span>

In [None]:
kaiju-addTaxonNames -t nodes.dmp -n names.dmp -i kaiju.out -o kaiju.names.out

In [None]:
head kaiju.names.out

### <span style="color: #4CACBC;"> 2.4 Creating input file for Krona<a class="anchor" id="kronainput"> </span>

In [None]:
kaiju2krona -t nodes.dmp -n names.dmp -i kaiju.out -o kaiju.out.krona

In [None]:
ktImportText -o kaiju.out.html kaiju.out.krona

**Observe the results**

Now open the HTML file by clicking on it in the left-hand menu.

If you get the error "Javascript must be enabled to view this page", please click on "trust HTML".

What can you see in this Krona chart?

We are interested in **vitiviruses**. Try to zoom in on this genus.

## <span style="color: #4CACBC;"> 3. (BONUS) Use KRAKEN2 for taxonomic assignment<a class="anchor" id="kraken2"> </span>

Kraken is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences. Kraken examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer.

In [None]:
kraken2 --help

### <span style="color: #4CACBC;"> 3.1. Download a viral database<a class="anchor" id="viraldb"> </span>

For this TP we will download a pre-made simplified kraken database.

MiniKraken DB_8GB (6.0 GB): A pre-built 8 GB database constructed from complete bacterial, archaeal, and viral genomes in RefSeq (as of Oct. 18, 2017). This can be used by users without the computational resources needed to build a Kraken database. This contains around 5% of kmers from the original standard database. 
It can be found here: https://ccb.jhu.edu/software/kraken/

You can build your own custom database (see https://github.com/DerrickWood/kraken2/wiki/Manual). However, this takes a lot of resources and time.


In [None]:
# create the working directory
mkdir -p ~/work/SG-ONT-2022/ASSIGNATION/KRAKEN

In [None]:
# move into the working directory
cd ~/work/SG-ONT-2022/ASSIGNATION/KRAKEN

In [None]:
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training-2022/minikraken2_v2_8GB_201904.tgz

In [None]:
# uncompress the database
tar zxvf minikraken2_v2_8GB_201904.tgz

In [None]:
# Inspect the database content
kraken2-inspect --db minikraken2_v2_8GB_201904_UPDATE/ | head -15

### <span style="color: #4CACBC;"> 3.2. Run Kraken<a class="anchor" id="kraken"> </span>

In [None]:
# per-read classifications go to output_kraken, the summary report to report.txt
kraken2 --db minikraken2_v2_8GB_201904_UPDATE/ --report report.txt --report-minimizer-data ../../CLEANING/reads_vs_ananas_unmapped.fastq > output_kraken

**Standard Kraken Output Format**

Each sequence (or sequence pair, in the case of paired reads) classified by Kraken 2 results in a single line of output. Kraken 2's output lines contain five tab-delimited fields; from left to right, they are:

- "C"/"U": a one letter code indicating that the sequence was either classified or unclassified.

- The sequence ID, obtained from the FASTA/FASTQ header.

- The taxonomy ID Kraken 2 used to label the sequence; this is 0 if the sequence is unclassified.

- The length of the sequence in bp. In the case of paired read data, this will be a string containing the lengths of the two sequences in bp, separated by a pipe character, e.g. "98|94".

- A space-delimited list indicating the LCA mapping of each k-mer in the sequence(s). For example, "562:13 561:4 A:31 0:1 562:3" would indicate that:
    - the first 13 k-mers mapped to taxonomy ID #562
    - the next 4 k-mers mapped to taxonomy ID #561
    - the next 31 k-mers contained an ambiguous nucleotide
    - the next k-mer was not in the database
    - the last 3 k-mers mapped to taxonomy ID #562


In [None]:
head output_kraken
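
To read the fifth field more easily, the k-mer counts can be summed per taxonomy ID. The sketch below applies this to the example string from the list above. The helper name `kmer_votes` is hypothetical, and this is only an illustration of the format, not how Kraken 2 itself chooses the final label.

```shell
# kmer_votes "taxid:count taxid:count ...": total number of k-mers per taxonomy ID
# (hypothetical helper working on the fifth field of a Kraken 2 output line).
kmer_votes() {
    # put one "taxid:count" pair per line, then add up the counts per taxid
    printf '%s\n' "$1" | tr ' ' '\n' | awk -F ':' '
        { votes[$1] += $2 }
        END { for (t in votes) print t "\t" votes[t] }' | sort
}

# Example string from the Kraken 2 output description:
# taxonomy ID 562 appears twice (13 + 3 k-mers), so it totals 16 k-mers
kmer_votes "562:13 561:4 A:31 0:1 562:3"
```

On a real file, the fifth field of one read can be extracted with `cut -f5` and passed to the helper.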

**Report format output**

With the --report-minimizer-data flag, the report is similar to the standard Kraken 2 sample report format, with two additional columns (marked "new" below). The fields, from left to right, are:

1. Percentage of fragments covered by the clade rooted at this taxon
2. Number of fragments covered by the clade rooted at this taxon
3. Number of fragments assigned directly to this taxon
4. Number of minimizers in read data associated with this taxon (new)
5. An estimate of the number of distinct minimizers in read data associated with this taxon (new)
6. A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., "G2" is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.
7. NCBI taxonomic ID number
8. Indented scientific name


In [None]:
head -10 report.txt
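
The rank code in column 6 makes it easy to pull out a single taxonomic level from the report. Here is a minimal sketch that lists species-level rows sorted by their fragment count, assuming the eight-column layout described above. The helper `top_species` and the toy report (names, IDs and counts) are made up.

```shell
# top_species FILE: species-level rows of a Kraken 2 report written with
# --report-minimizer-data, as "fragments<TAB>taxid<TAB>name", most abundant first.
top_species() {
    # rank code is column 6, clade fragment count column 2,
    # taxonomy ID column 7, indented name column 8
    awk -F '\t' '$6 == "S" {
        name = $8
        sub(/^ +/, "", name)   # remove the indentation in front of the name
        print $2 "\t" $7 "\t" name
    }' "$1" | sort -t "$(printf '\t')" -k1,1nr
}

# Toy report only: made-up taxa and counts
toy_report=$(mktemp)
printf ' 50.00\t5\t5\t0\t0\tU\t0\tunclassified\n' > "$toy_report"
printf ' 50.00\t5\t0\t100\t90\tR\t1\troot\n' >> "$toy_report"
printf ' 20.00\t2\t2\t40\t35\tS\t2222\t      Made-up virus B\n' >> "$toy_report"
printf ' 30.00\t3\t3\t60\t55\tS\t1111\t      Made-up virus A\n' >> "$toy_report"
top_species "$toy_report"
```

On the real data this would be run as `top_species report.txt | head`. Changing `"S"` to `"G"` would list genera instead.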

### <span style="color: #4CACBC;"> 3.3. Visualise Kraken output with Krona<a class="anchor" id="krakenkrona"> </span>

In [None]:
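# -m 3: magnitude = fragments assigned directly to the taxon; -t 7: NCBI taxonomy ID column (column 7 because the report was written with --report-minimizer-data)
ktImportTaxonomy -m 3 -t 7 report.txt -o kraken.html 2> krakenkrona.err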