
# <span style="color:#006E7F">__Introduction to Oxford Nanopore Data Analysis__ <a class="anchor"></span>  


Created by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and G. Sarah (AGAP-INRAE) - September 2021, SouthGreen training

Adapted by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) - November 2022

Adapted by J. Orjuela (DIADE-IRD) - May 2023


# <span style="color:#006E7F">__TP3 - CONTIGS QUALITY__ <a class="anchor" id="data"></span>  
    
# <span style="color: #4CACBC;"> 1. Perform a comparison of assemblies using QUAST</span>  


The metrics of assemblies can be evaluated using a quality assessment tool such as [QUAST](http://quast.bioinf.spbau.ru/manual.html).

In part 2, we assembled the raw data with several assemblers and polished/corrected the resulting assemblies.

Aggregate all the assemblies you have produced for your sample in a new folder named "AGGREGATED".

We are going to compare them with QUAST.

### ⚠️ If you had trouble with the previous assemblies ...

If you had a problem, you can download the AGGREGATED directory directly from the remote server.

In [None]:
cd ~/work/RESULTS/
wget https://itrop.ird.fr/algae_data/AGGREGATED.tar.gz 
tar -xvf AGGREGATED.tar.gz

### ⚠️ If you want to continue with your own assemblies ...

In [None]:
mkdir -p ~/work/RESULTS/AGGREGATED
cd ~/work/RESULTS/AGGREGATED

### Organize the assemblies obtained for each sample

#### sample 4222

In [None]:
ln -s ~/work/RESULTS/4222_FLYE/assembly.fasta 4222_FLYE.fasta
ln -s ~/work/RESULTS/4222_FLYE_MEDAKA/MEDAKA_CONSENSUS/consensus.fasta 4222_FLYE_MEDAKA.fasta
ln -s ~/work/RESULTS/4222_RAVEN/assembly.fasta 4222_RAVEN.fasta
ln -s ~/work/RESULTS/4222_RAVEN_MEDAKA/MEDAKA_CONSENSUS/consensus.fasta 4222_RAVEN_MEDAKA.fasta

#### sample B8

In [None]:
ln -s ~/work/RESULTS/B8_FLYE/assembly.fasta B8_FLYE.fasta
ln -s ~/work/RESULTS/B8_FLYE_MEDAKA/MEDAKA_CONSENSUS/consensus.fasta B8_FLYE_MEDAKA.fasta
ln -s ~/work/RESULTS/B8_RAVEN/assembly.fasta B8_RAVEN.fasta
ln -s ~/work/RESULTS/B8_RAVEN_MEDAKA/MEDAKA_CONSENSUS/consensus.fasta B8_RAVEN_MEDAKA.fasta

#### sample G11

In [None]:
ln -s ~/work/RESULTS/G11_FLYE/assembly.fasta G11_FLYE.fasta
ln -s ~/work/RESULTS/G11_FLYE_MEDAKA/MEDAKA_CONSENSUS/consensus.fasta G11_FLYE_MEDAKA.fasta
ln -s ~/work/RESULTS/G11_RAVEN/assembly.fasta G11_RAVEN.fasta
ln -s ~/work/RESULTS/G11_RAVEN_MEDAKA/MEDAKA_CONSENSUS/consensus.fasta G11_RAVEN_MEDAKA.fasta

#### check directory with symbolic links

In [None]:
ls -l ~/work/RESULTS/AGGREGATED

### <span style="color: #4CACBD;"> 1.1  Run QUAST on assemblies and compare them </span>

In [None]:
cd ~/work/RESULTS/AGGREGATED/
time quast.py *.fasta -o QUAST

#### Go to the QUAST directory and check the file content.

#### Looking at the output statistics, what are the main differences between the assemblies?

#### Look at the total sizes, number of contigs and N50 statistics.

#### For each sample, which assembly is the best one?
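
To make the N50 value reported by QUAST more concrete, here is a minimal sketch that recomputes the number of contigs, the total size and the N50 of a FASTA file with `awk` and `sort`. The helper name `assembly_stats` and the toy file are illustrative assumptions and are not part of QUAST. The QUAST report remains the reference; note that QUAST ignores contigs shorter than 500 bp by default, so numbers can differ slightly.

```shell
# Hypothetical helper (not part of QUAST): contig count, total size and N50 of a FASTA file
assembly_stats() {
    # 1) print one length per contig, 2) sort the lengths in decreasing order,
    # 3) walk down the list until half of the total size is reached
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             for (i = 1; i <= NR; i++) {
                 cum += lens[i]
                 if (cum >= half) { n50 = lens[i]; break }
             }
             printf "contigs=%d total=%d N50=%d\n", NR, total, n50
         }'
}

# Toy example: three contigs of 10, 6 and 4 bp
toy=$(mktemp)
printf '>c1\nAAAAAAAAAA\n>c2\nCCCCCC\n>c3\nGGGG\n' > "$toy"
stats=$(assembly_stats "$toy")
echo "$stats"
rm -f "$toy"
```

On the course data, the helper could be called from the AGGREGATED directory, for example `assembly_stats 4222_FLYE.fasta`.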

### <span style="color: #4CACBD;"> 1.2  Comparison of the assemblies against the reference genome of a closely related organism </span>

It is possible to compare assemblies against a given reference genome of a closely related organism. 

Use the corrected assemblies and compare them with the reference genome.

### Compare best algae assemblies against the reference sequence using QUAST

For this, create a QUAST_REF directory inside the AGGREGATED one.
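
One possible way to run this comparison is sketched below. `-r` is the QUAST option that supplies a reference genome. The reference path stored in `REF` is a placeholder assumption (its location is not given here), and taking the `*_MEDAKA.fasta` files as the corrected assemblies is also an assumption to adapt to your own results.

```shell
# ASSUMPTION: placeholder path, replace it with the reference FASTA you were given
REF="${REF:-$HOME/work/DATA/reference.fasta}"
AGG="$HOME/work/RESULTS/AGGREGATED"
OUTDIR="QUAST_REF"

if command -v quast.py >/dev/null 2>&1 && [ -f "$REF" ] && [ -d "$AGG" ]; then
    cd "$AGG"
    # -r: reference genome; -o: output directory, created inside AGGREGATED
    quast.py *_MEDAKA.fasta -r "$REF" -o "$OUTDIR"
else
    echo "quast.py, the reference or the AGGREGATED directory is missing; nothing was run"
fi
```

With a reference, the QUAST report gains reference-based metrics such as the genome fraction and the number of misassemblies, which help answer the questions below.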

#### Which assemblies are closest in size to the reference?

#### Do they also have the largest N50 and the fewest contigs?

#### Which assembly obtains the best contiguity metrics?


## <span style="color: #4CACBD;"> 2. Assessing gene space using BUSCO </span>


Benchmarking Universal Single-Copy Orthologs ([BUSCO](https://busco.ezlab.org/busco_userguide.html)) helps to check if you have a good assembly at the genic level, by searching the expected single-copy lineage-conserved orthologs in any newly-sequenced genome from an appropriate phylogenetic clade.

Calculate the gene space completion for each assembly

In [None]:
mkdir -p ~/work/RESULTS/AGGREGATED/BUSCO
cd ~/work/RESULTS/AGGREGATED/BUSCO

The lineage can be chosen from the BUSCO datasets, which are listed with the `--list-datasets` parameter.

In [None]:
# busco env
conda activate busco

In [None]:
busco --list-datasets

In [None]:
ASSEMBLY="/home/jovyan/work/RESULTS/AGGREGATED/4222_FLYE_MEDAKA.fasta"
LINEAGE=chlorophyta_odb10
busco -i "$ASSEMBLY" -l "$LINEAGE" -c "${CPUS:-8}" -m genome -o 4222_FLYE_MEDAKA_BUSCO 
# deactivate busco env
conda deactivate

### What does the gene space look like for this assembly?

### Run BUSCO for the best assemblies ...
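
A minimal sketch of how this could be looped over several assemblies is given below, reusing the same BUSCO options as in the cell above. The list in `BEST` is an assumption (the polished Flye assemblies); adapt it to what your QUAST results suggest. The loop is meant to be run from the BUSCO directory with the `busco` conda environment activated.

```shell
LINEAGE=chlorophyta_odb10
# ASSUMPTION: the polished Flye assemblies are taken as the "best" ones here
BEST="4222_FLYE_MEDAKA B8_FLYE_MEDAKA G11_FLYE_MEDAKA"
AGG="$HOME/work/RESULTS/AGGREGATED"

for name in $BEST; do
    out="${name}_BUSCO"
    if command -v busco >/dev/null 2>&1 && [ -f "$AGG/$name.fasta" ]; then
        # same options as for the single-assembly run
        busco -i "$AGG/$name.fasta" -l "$LINEAGE" -c "${CPUS:-8}" -m genome -o "$out"
    else
        echo "skipping $out (busco or $name.fasta not available)"
    fi
done
```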

## <span style="color: #4CACBD;"> 3. Read alignment statistics - remapping</span>

Read congruency is an important measure in determining assembly accuracy. Clusters of read pairs that align incorrectly are strong indicators of mis-assembly.

How well do the reads align back to the draft assemblies? Use minimap2 and samtools to assess the basic alignment statistics.

Make a folder for your results.

We will use the 4222_FLYE_MEDAKA.fasta assembly with ONT reads, and Illumina reads as well.

In [None]:
mkdir -p ~/work/RESULTS/REMAPPING/ONT
cd ~/work/RESULTS/REMAPPING/ONT

In [None]:
ASSEMBLY="/home/jovyan/work/RESULTS/AGGREGATED/4222_FLYE_MEDAKA.fasta"
ONT="/home/jovyan/work/DATA/ONT/4222_RB2.fastq.gz"

In [None]:
# symbolic link to the assembly in the current directory (REMAPPING/ONT)
ln -s "${ASSEMBLY}" 4222_FLYE_MEDAKA.fasta
# overwrite ASSEMBLY variable
ASSEMBLY=4222_FLYE_MEDAKA.fasta

## I. Mapping on assemblies with ONT reads

In [None]:
minimap2 -ax map-ont -t 4 ${ASSEMBLY} ${ONT} | samtools sort -@ 1 -T "${ASSEMBLY/.fasta/}" -O BAM -o "${ASSEMBLY/.fasta/_ONT_minimap2.bam}" -
samtools index "${ASSEMBLY/.fasta/_ONT_minimap2.bam}"
samtools view -F 0x904 -c "${ASSEMBLY/.fasta/_ONT_minimap2.bam}"

### What is the percentage of aligned ONT reads in your assembly?
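
The `samtools view -F 0x904 -c` command only counts the mapped primary reads (it excludes unmapped, secondary and supplementary records). A minimal sketch for turning that count into a percentage follows. `percent_mapped` is a hypothetical helper, and the total is assumed to be the number of primary records, mapped or not (`-F 0x900`). `samtools flagstat` reports comparable figures directly.

```shell
# Hypothetical helper: percentage of mapped reads from two counts
percent_mapped() {
    # $1 = mapped primary reads, $2 = all primary reads (mapped + unmapped)
    LC_ALL=C awk -v m="$1" -v t="$2" \
        'BEGIN { if (t > 0) printf "%.2f\n", 100 * m / t; else print "NA" }'
}

BAM="4222_FLYE_MEDAKA_ONT_minimap2.bam"
if command -v samtools >/dev/null 2>&1 && [ -f "$BAM" ]; then
    mapped=$(samtools view -c -F 0x904 "$BAM")   # primary and mapped
    total=$(samtools view -c -F 0x900 "$BAM")    # primary, mapped or unmapped
    percent_mapped "$mapped" "$total"
fi

# Toy numbers: 950 mapped reads out of 1000
toy_pct=$(percent_mapped 950 1000)
echo "$toy_pct"
```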

## II. Mapping on assemblies with ILLUMINA reads now !

In [None]:
mkdir -p ~/work/RESULTS/REMAPPING/ILLUMINA
cd ~/work/RESULTS/REMAPPING/ILLUMINA

Illumina data are available for only two samples on the remote server: https://itrop.ird.fr/algae_data/ILLUMINA.tar.gz

Don't forget to download the archive with `wget` and decompress it with the `tar` command!

You can use minimap2 to align short reads to assemblies!

`minimap2 -ax sr ref.fa read1.fa read2.fa > aln.sam      # short genomic paired-end reads`

In [None]:
cd ~/work/DATA
wget  https://itrop.ird.fr/algae_data/ILLUMINA.tar.gz
tar -xvf ILLUMINA.tar.gz

In [None]:
ASSEMBLY="/home/jovyan/work/RESULTS/AGGREGATED/G11_FLYE_MEDAKA.fasta"
ILLUMINA_R1="/home/jovyan/work/DATA/ILLUMINA/G11_R1.fastq.gz"
ILLUMINA_R2="/home/jovyan/work/DATA/ILLUMINA/G11_R2.fastq.gz"

In [None]:
cd ~/work/RESULTS/REMAPPING/ILLUMINA

In [None]:
# ASSEMBLY is an absolute path here: use its basename so that the BAM is written in the current directory
PREFIX="$(basename "${ASSEMBLY}" .fasta)"
minimap2 -ax sr -t 4 "${ASSEMBLY}" "${ILLUMINA_R1}" "${ILLUMINA_R2}" | samtools sort -@ 1 -T "${PREFIX}" -O BAM -o "${PREFIX}_ILL_minimap2.bam" -
samtools index "${PREFIX}_ILL_minimap2.bam"
samtools view -F 0x904 -c "${PREFIX}_ILL_minimap2.bam"

### What is the percentage of aligned ILLUMINA reads in your assembly?

## <span style="color: #4CACBD;"> [OPTIONAL] Blobtools </span>

During the sequence quality assessment stage we tried to discern whether contamination was present. Sometimes this is not feasible at the read level. By plotting Contig GC content vs Contig Read Coverage we can look for clusters of contigs that share similar coverage. The appearance of multiple clusters can indicate multiple organisms. Occasionally, contigs can also be taxonomically classified, providing further evidence for contaminants.
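
To illustrate one axis of such a plot, the sketch below computes the GC content of each contig with `awk`. The helper name `gc_per_contig` and the toy FASTA are illustrative assumptions. Blobtools computes GC content and coverage itself, so this is only meant to show what the GC axis represents.

```shell
# Hypothetical helper: print "contig_name<TAB>GC fraction" for each sequence of a FASTA file
gc_per_contig() {
    LC_ALL=C awk '
        function report() { if (name != "") printf "%s\t%.2f\n", name, (n ? gc / n : 0) }
        /^>/ { report(); name = substr($1, 2); gc = 0; n = 0; next }
        {
            seq = toupper($0)
            n += length(seq)              # all bases, including N
            gc += gsub(/[GC]/, "", seq)   # gsub returns the number of G/C it removed
        }
        END { report() }' "$1"
}

# Toy example: c1 is 50% GC, c2 is 100% GC
toy=$(mktemp)
printf '>c1 desc\nGGCC\nAATT\n>c2\nGCGC\n' > "$toy"
gc_table=$(gc_per_contig "$toy")
echo "$gc_table"
rm -f "$toy"
```

Contigs from a contaminant often stand apart on this axis, for example a bacterial contig with a GC fraction clearly different from that of the algal contigs.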

First we need to download some files

In [None]:
cd ~/work/DATA
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training/blobtools.tar.gz
tar zxvf blobtools.tar.gz
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training/testBacteria.dmnd

Run Blobtools on each assembly. Blobtools requires both a BAM file as input and BLAST output for the classification step.

In [None]:
mkdir -p ~/work/RESULTS/BLOBTOOLS
cd ~/work/RESULTS/BLOBTOOLS

### Blastx using diamond 

Run diamond in blastx mode using assembled contigs vs a pre-formatted diamond bacteria database (protein)

In [None]:
# symbolic link to the polished assembly in the current BLOBTOOLS directory
ln -s ~/work/RESULTS/4222_FLYE_MEDAKA/MEDAKA_CONSENSUS/consensus.fasta 4222_FLYE_MEDAKA.fasta

In [None]:
time diamond blastx --query 4222_FLYE_MEDAKA.fasta --db ~/work/DATA/testBacteria.dmnd --outfmt 6 qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore --sensitive  --max-target-seqs 1  --evalue 1e-25  --threads 4  --out diamond.csv

In [None]:
head diamond.csv

In [None]:
pwd

### Run blobtools 

Specific to Blobtools: the nodes and names files from the NCBI taxdump database can be downloaded from [here](https://github.com/DRL/blobtools#download-ncbi-taxdump-and-create-nodesdb).

In this training, the nodes and names files are available in the DATA directory.

In [None]:
ASSEMBLY="4222_FLYE_MEDAKA.fasta"
BAM=~/work/RESULTS/REMAPPING/ONT/${ASSEMBLY/.fasta/_ONT_minimap2.bam}
DIAMONDX=~/work/RESULTS/BLOBTOOLS/diamond.csv
BLOB_NODES=~/work/DATA/blobtools/nodes.dmp
BLOB_NAMES=~/work/DATA/blobtools/names.dmp

In [None]:
# -f: do not fail if nodesDB.txt does not exist yet
rm -f nodesDB.txt
blobtools create -i ${ASSEMBLY} -b ${BAM} -t ${DIAMONDX} -o quality --names ${BLOB_NAMES} --nodes ${BLOB_NODES} --db nodesDB.txt

In [None]:
blobtools view -i quality.blobDB.json --cov -o output;
blobtools plot -i quality.blobDB.json;

#### Is there contamination in the assembly?

#### Do any assemblies show strange clustering?

#### Why might coverage vary across contigs within an assembly?

The Blobplots all indicate a single cluster. Some contigs show fairly high coverage compared with the rest of the genome; these could correspond to repetitive elements.