# South Green Training 2021

## Introduction to MinION data analysis

### PART 3

Created by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and G. Sarah (AGAP-INRAE)

September 2021

# 1. Perform a comparison of assemblies using QUAST

The metrics of assemblies can be evaluated using a quality assessment tool such as [QUAST](http://quast.bioinf.spbau.ru/manual.html).

In part 2, we assembled the raw data with several assemblers, then polished and corrected the resulting assemblies.

Aggregate all assemblies you have produced for your favorite clone in a new folder named "AGGREGATED".

We will now compare them with QUAST.

In [None]:
# activate the quast environment
conda activate quast

In [None]:
CLONE=Clone20
cd ~/SG-ONT-2021/RESULTS
mkdir -p ~/SG-ONT-2021/RESULTS/AGGREGATED
cd ~/SG-ONT-2021/RESULTS/AGGREGATED
ln -s ~/SG-ONT-2021/RESULTS/FLYE/assembly.fasta ${CLONE}_FLYE.fasta
ln -s ~/SG-ONT-2021/RESULTS/FLYE_RACON/assembly.racon2.fasta ${CLONE}_FLYE_RACONx2.fasta
ln -s ~/SG-ONT-2021/RESULTS/FLYE_RACON_MEDAKA/MEDAKA_CONSENSUS/consensus.fasta ${CLONE}_FLYE_RACONx2_MEDAKA.fasta
ls -l ~/SG-ONT-2021/RESULTS/AGGREGATED

#### Create similar symbolic links for the RAVEN results and check your folder
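Note that `ln -s` silently creates a dangling link when the source file is missing. A minimal sketch of a guard is given here; `link_assembly` is a hypothetical helper (not part of the training material), and the demo runs in a temporary folder rather than on the real RAVEN paths, which you must supply yourself.

```shell
# link_assembly SOURCE LINK_NAME
# Hypothetical helper: create the symbolic link only if SOURCE is an existing file.
link_assembly() {
    if [ -f "$1" ]; then
        ln -sf "$1" "$2" && echo "linked $2"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Demo in a throwaway folder (toy files, not real assemblies)
tmp=$(mktemp -d)
: > "$tmp/assembly.fasta"
link_assembly "$tmp/assembly.fasta" "$tmp/Clone20_DEMO.fasta"
link_assembly "$tmp/does_not_exist.fasta" "$tmp/Clone20_BROKEN.fasta" || echo "skipped"
```

For the exercise, call the helper with your own RAVEN output files as `SOURCE` and a `${CLONE}_RAVEN...fasta` name of your choice as `LINK_NAME`.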

## 1.1  Run QUAST on CLONE assemblies and compare them

In [None]:
cd ~/SG-ONT-2021/RESULTS/AGGREGATED/
quast.py *.fasta -o QUAST

#### Go to the QUAST directory and check the file content.

#### Looking at the output statistics, what are the main differences between the assemblies?

#### Look at the total sizes, the number of contigs and the N50 statistics.

#### For your favorite clone, which assembly is the best?
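To make the N50 statistic concrete: sort the contigs from longest to shortest and add up their lengths; the N50 is the length of the contig at which the running sum reaches half of the total assembly size. The sketch below recomputes it with `awk` on a toy FASTA file. The function name `n50` and the toy sequences are illustrative assumptions; QUAST remains the tool to use on the real assemblies.

```shell
# n50 FASTA: print the N50 of the sequences in a FASTA file.
n50() {
    awk '
        /^>/ { if (len) lens[++n] = len; len = 0; next }   # header: store the previous contig length
        { gsub(/[ \t\r]/, ""); len += length($0) }          # sequence line: accumulate its length
        END {
            if (len) lens[++n] = len
            for (i = 1; i <= n; i++) total += lens[i]
            # sort lengths in descending order (insertion sort, fine for a few contigs)
            for (i = 2; i <= n; i++) {
                v = lens[i]
                for (j = i - 1; j >= 1 && lens[j] < v; j--) lens[j + 1] = lens[j]
                lens[j + 1] = v
            }
            # walk down the sorted list until half of the total length is covered
            for (i = 1; i <= n; i++) {
                cum += lens[i]
                if (cum * 2 >= total) { print lens[i]; exit }
            }
        }
    ' "$1"
}

# Toy assembly with contigs of length 4, 3, 2 and 1 (total 10)
tmp=$(mktemp -d)
printf '>c1\nAAAA\n>c2\nCCC\n>c3\nGG\n>c4\nT\n' > "$tmp/toy.fasta"
n50 "$tmp/toy.fasta"   # prints 3: 4 + 3 = 7 covers at least half of 10
```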

## 1.2  Comparison of the assemblies against the reference genome of a closely related organism (REAL DATA !)

It is possible to compare assemblies against the reference genome of a closely related organism.

In our case we will use the real dataset Hh.

Precomputed assemblies for Hh, generated with CulebrONT, can be found in DATA/real_Hh/Hh-ASSEMBLIES.

In [None]:
ls -lh ~/SG-ONT-2021/DATA/real_Hh/Hh-ASSEMBLIES

### Compare Hh assemblies against the reference sequence using QUAST

#### WARNING : This can take a while!! 

In [None]:
mkdir -p ~/SG-ONT-2021/RESULTS/AGGREGATED_Hh/
cd ~/SG-ONT-2021/RESULTS/AGGREGATED_Hh/
time quast.py /home/jovyan/SG-ONT-2021/DATA/real_Hh/Hh-ASSEMBLIES/*MEDAKA_STARTFIXED-CIRCULARISED.fasta -R /home/jovyan/SG-ONT-2021/DATA/real_Hh/REFH_M1C132.fasta -o QUAST_REF

#### Which assembly is closest in size to the reference?

#### Does this assembly also have the largest N50 and the fewest contigs?

#### Which assembly obtains the best contiguity metrics?


# 2. Assessing gene space using BUSCO (REAL DATA)

Benchmarking Universal Single-Copy Orthologs ([BUSCO](https://busco.ezlab.org/busco_userguide.html)) helps to check if you have a good assembly, by searching the expected single-copy lineage-conserved orthologs in any newly-sequenced genome from an appropriate phylogenetic clade.

Calculate the gene space for each assembly

In [None]:
mkdir -p ~/SG-ONT-2021/RESULTS/AGGREGATED_Hh/BUSCO
cd ~/SG-ONT-2021/RESULTS/AGGREGATED_Hh/BUSCO

The lineage can be chosen from the BUSCO database; list the available datasets with the --list-datasets parameter.

In [None]:
# activate the busco environment
conda activate busco
busco --list-datasets

In [None]:
# use $HOME here: a ~ inside double quotes is not expanded by the shell
ASSEMBLY="$HOME/SG-ONT-2021/DATA/real_Hh/Hh-ASSEMBLIES/FLYE-STEP_CORRECTION_MEDAKA_STARTFIXED-CIRCULARISED.fasta"
LINEAGE=bacteria_odb10
busco -i "$ASSEMBLY" -l "$LINEAGE" -c "${CPUS:-4}" -m genome -o BUSCO_RESULTS 

In [None]:
conda deactivate

### What does the gene space look like for this assembly?
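BUSCO reports its result as a one-line score string of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` (Complete, Single-copy, Duplicated, Fragmented, Missing, number of BUSCOs searched), typically found in a `short_summary*.txt` file in the output folder. The sketch below pulls one percentage out of such a string. The helper name `busco_field` is hypothetical and the score string is made up for illustration; it is not a result from this dataset.

```shell
# busco_field SCORE_STRING KEY
# Hypothetical helper: extract one percentage (C, S, D, F or M) from a BUSCO score string.
busco_field() {
    printf '%s\n' "$1" | sed -n "s/.*$2:\([0-9.]*\)%.*/\1/p"
}

# Made-up score string, for illustration only
line='C:98.4%[S:98.4%,D:0.0%],F:0.8%,M:0.8%,n:124'
echo "complete: $(busco_field "$line" C) %"
echo "missing:  $(busco_field "$line" M) %"
```

A high C value with a low D value suggests that the gene space is well covered without redundant copies.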

#### Run BUSCO on all the generated assemblies if you want...
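One possible way to do this is to loop over the FASTA files and build one `busco` command per assembly. The sketch below only prints the commands (a dry run), so you can review them before executing anything. `print_busco_commands` and the `BUSCO_<name>` output naming are assumptions made for this example; the `busco` options are the ones already used in this section.

```shell
# print_busco_commands DIR LINEAGE
# Hypothetical helper: print one busco command per *.fasta file found in DIR.
print_busco_commands() {
    dir=$1
    lineage=$2
    for fasta in "$dir"/*.fasta; do
        [ -e "$fasta" ] || continue              # the glob matched nothing: skip
        name=$(basename "$fasta" .fasta)         # assembly name without extension
        printf 'busco -i %s -l %s -c 4 -m genome -o BUSCO_%s\n' "$fasta" "$lineage" "$name"
    done
}

# Dry run on the precomputed Hh assemblies (prints nothing if the folder is absent)
print_busco_commands "$HOME/SG-ONT-2021/DATA/real_Hh/Hh-ASSEMBLIES" bacteria_odb10
```

Once the printed commands look right, they can be run from the BUSCO folder with the busco environment active, for example by piping the output into `bash`.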

# 3. Read alignment statistics - remapping (CLONE)

Read congruency is an important measure in determining assembly accuracy. Clusters of read pairs that align incorrectly are strong indicators of mis-assembly.

How well do the reads align back to the draft assemblies? Use minimap2 and samtools to assess the basic alignment statistics.

Make a folder for your results.

We will use the CloneX_FLYE_RACONx2_MEDAKA.fasta assembly with both the ONT reads and the Illumina reads.

In [None]:
mkdir -p ~/SG-ONT-2021/RESULTS/REMAPPING
cd ~/SG-ONT-2021/RESULTS/REMAPPING/

In [None]:
CLONE="Clone10"
ASSEMBLY="/home/jovyan/SG-ONT-2021/RESULTS/FLYE_RACON_MEDAKA/MEDAKA_CONSENSUS/consensus.fasta"
ONT="/home/jovyan/SG-ONT-2021/DATA/${CLONE}/ONT/${CLONE}.fastq.gz"
ILLUMINA_R1="/home/jovyan/SG-ONT-2021/DATA/${CLONE}/ILL/${CLONE}_R1.fastq.gz"
ILLUMINA_R2="/home/jovyan/SG-ONT-2021/DATA/${CLONE}/ILL/${CLONE}_R2.fastq.gz"

In [None]:
# symbolic link to the final assembly in the current directory (REMAPPING)
ln -s ${ASSEMBLY} ${CLONE}_FLYE_RACONx2_MEDAKA.fasta
# overwrite ASSEMBLY variable
ASSEMBLY=${CLONE}_FLYE_RACONx2_MEDAKA.fasta

## Mapping assemblies vs ONT reads

In [None]:
minimap2 -ax map-ont -t 4 ${ASSEMBLY} ${ONT} | samtools sort -@ 1 -T "${ASSEMBLY/.fasta/}" -O BAM -o "${ASSEMBLY/.fasta/_ONT_minimap2.bam}" -
samtools index "${ASSEMBLY/.fasta/_ONT_minimap2.bam}"
samtools view -F 0x904 -c "${ASSEMBLY/.fasta/_ONT_minimap2.bam}"

## Mapping assemblies vs ILLUMINA reads

In [None]:
#minimap2 -ax sr ref.fa read1.fq read2.fq > aln.sam
minimap2 -ax sr -t 4 ${ASSEMBLY} ${ILLUMINA_R1} ${ILLUMINA_R2} | samtools sort -@ 1 -T "${ASSEMBLY/.fasta/}" -O BAM -o "${ASSEMBLY/.fasta/_ILL_minimap2.bam}" -
samtools index "${ASSEMBLY/.fasta/_ILL_minimap2.bam}"
samtools view -F 0x904 -c "${ASSEMBLY/.fasta/_ILL_minimap2.bam}"

### What is the percentage of aligned ONT and Illumina reads in your clone assembly?
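The `samtools view -F 0x904 -c` commands above count primary alignments of mapped reads: the mask excludes unmapped (0x4), secondary (0x100) and supplementary (0x800) records. To turn that count into a percentage you also need the total number of input reads. A minimal sketch follows, with two hypothetical helpers (`count_fastq_reads`, `percent`) and a toy FASTQ file; for paired-end Illumina data, pass both R1 and R2 to `count_fastq_reads`, since each mate is counted separately in the BAM file.

```shell
# count_fastq_reads FILE...: number of reads in gzipped FASTQ files (4 lines per read).
count_fastq_reads() {
    gzip -dc "$@" | awk 'END { printf "%d\n", NR / 4 }'
}

# percent MAPPED TOTAL: percentage with two decimals, or NA if TOTAL is zero.
percent() {
    awk -v m="$1" -v t="$2" 'BEGIN { if (t > 0) printf "%.2f\n", 100 * m / t; else print "NA" }'
}

# The flag mask is the bitwise OR of the three excluded flags
printf '0x%x\n' $((0x4 | 0x100 | 0x800))   # prints 0x904

# Toy FASTQ with two reads; pretend one of them mapped
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n' | gzip > "$tmp/toy.fastq.gz"
total=$(count_fastq_reads "$tmp/toy.fastq.gz")
percent 1 "$total"                          # prints 50.00

# With the real data (bash, samtools required):
# mapped=$(samtools view -F 0x904 -c "${ASSEMBLY/.fasta/_ONT_minimap2.bam}")
# percent "$mapped" "$(count_fastq_reads "$ONT")"
```

`samtools flagstat` on the BAM file gives a comparable summary, although its totals also include secondary and supplementary records.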

# 4. Blobtools

During the sequence quality assessment stage we tried to discern whether contamination was present. Sometimes this is not feasible at the read level. By plotting Contig GC content vs Contig Read Coverage we can look for clusters of contigs that share similar coverage. The appearance of multiple clusters can indicate multiple organisms. Occasionally, contigs can also be taxonomically classified, providing further evidence for contaminants.
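To see what the GC axis of such a plot is made of, the sketch below computes the length and GC percentage of each contig with `awk`. The function name `gc_per_contig` and the toy sequences are assumptions for illustration; Blobtools computes these values itself. The matching per-contig coverage comes from the BAM file (recent samtools versions offer `samtools coverage` for this).

```shell
# gc_per_contig FASTA: print "name<TAB>length<TAB>GC%" for each sequence.
gc_per_contig() {
    awk '
        function report() {
            if (name != "") printf "%s\t%d\t%.2f\n", name, len, (len ? 100 * gc / len : 0)
        }
        /^>/ { report(); name = substr($1, 2); len = 0; gc = 0; next }
        {
            seq = toupper($0)
            gsub(/[^A-Z]/, "", seq)        # keep letters only (drops spaces and line endings)
            len += length(seq)
            gc  += gsub(/[GC]/, "", seq)   # gsub returns the number of G/C characters replaced
        }
        END { report() }
    ' "$1"
}

# Toy FASTA: one contig with 50% GC, one AT-only contig
tmp=$(mktemp -d)
printf '>c1 toy contig\nGGCC\naatt\n>c2\nATAT\n' > "$tmp/toy.fasta"
gc_per_contig "$tmp/toy.fasta"
```

Contigs from a single organism are expected to cluster around similar GC values, so an outlier in this table is a first hint to follow up in the blobplot.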

First we need to download some files

In [None]:
cd ~/SG-ONT-2021/DATA
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training/blobtools.tar.gz
tar zxvf blobtools.tar.gz
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training/testBacteria.dmnd

Run Blobtools on each assembly. Blobtools requires a BAM file (for the coverage) and BLAST-style hits (for the taxonomic classification step).

In [None]:
mkdir -p ~/SG-ONT-2021/RESULTS/BLOBTOOLS
cd ~/SG-ONT-2021/RESULTS/BLOBTOOLS

### Blastx using diamond 

Run diamond in blastx mode using assembled contigs vs a pre-formatted diamond bacteria database (protein)
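The `--outfmt 6` column list used for diamond in this section writes a tab-separated table whose second column holds the subject taxids (`staxids`). Once `diamond.csv` exists, a quick tally of that column shows which taxa dominate the hits. The sketch below is one way to do it; `tally_taxids` is a hypothetical helper and the three rows are made-up toy data, not diamond output.

```shell
# tally_taxids FILE
# Hypothetical helper: count rows per value of column 2, most frequent first.
tally_taxids() {
    cut -f 2 "$1" | sort | uniq -c | sort -k1,1nr | awk '{ print $2 "\t" $1 }'
}

# Toy table (made-up rows: contig, taxid, bitscore)
tmp=$(mktemp -d)
printf 'contig_1\t562\t250.0\ncontig_2\t562\t180.5\ncontig_3\t1280\t90.1\n' > "$tmp/toy.tsv"
tally_taxids "$tmp/toy.tsv"

# With the real output: tally_taxids diamond.csv
```

A table dominated by a single taxid is consistent with an uncontaminated assembly; several well-represented taxids deserve a closer look in the blobplot.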

In [None]:
#prepare assembly file
ASSEMBLY="/home/jovyan/SG-ONT-2021/RESULTS/FLYE_RACON_MEDAKA/MEDAKA_CONSENSUS/consensus.fasta"
# symbolic link to the final assembly in the current directory (BLOBTOOLS)
ln -s ${ASSEMBLY} ${CLONE}_FLYE_RACONx2_MEDAKA.fasta
# overwrite ASSEMBLY variable
ASSEMBLY=${CLONE}_FLYE_RACONx2_MEDAKA.fasta

In [None]:
conda activate diamond
time diamond blastx --query ${ASSEMBLY}  --db ~/SG-ONT-2021/DATA/testBacteria.dmnd  --outfmt 6 qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore --sensitive  --max-target-seqs 1  --evalue 1e-25  --threads 4  --out diamond.csv
conda deactivate

In [None]:
head diamond.csv

### Run blobtools 

Specific to Blobtools: the nodes and names files from the NCBI taxdump database can be downloaded from [here](https://github.com/DRL/blobtools#download-ncbi-taxdump-and-create-nodesdb).

In this training, the nodes and names files are available in the DATA directory.

In [None]:
BAM=~/SG-ONT-2021/RESULTS/REMAPPING/${ASSEMBLY/.fasta/_ONT_minimap2.bam}
DIAMONDX=~/SG-ONT-2021/RESULTS/BLOBTOOLS/diamond.csv
BLOB_NODES=~/SG-ONT-2021/DATA/blobtools/nodes.dmp
BLOB_NAMES=~/SG-ONT-2021/DATA/blobtools/names.dmp

In [None]:
conda activate blobtools
blobtools create -i ${ASSEMBLY} -b ${BAM} -t ${DIAMONDX} -o quality --names ${BLOB_NAMES} --nodes ${BLOB_NODES} --db nodesDB.txt

In [None]:
blobtools view -i quality.blobDB.json --cov -o output;
blobtools plot -i quality.blobDB.json;

In [None]:
conda deactivate

#### Is there contamination in the assembly?

#### Do any assemblies show strange clustering?

#### Why might coverage vary across contigs within an assembly?

The blobplots all indicate a single cluster. Some contigs show fairly high coverage compared to the rest of the genome; these could correspond to repetitive elements.

# 5. Comparative Alignment

Comparative alignment is a useful tool to see how assemblies compare to each other. This can be useful to compare assemblies to a reference, or to see if assemblies have large structural differences.



Check the assembled genome with D-GENIES: http://dgenies.toulouse.inra.fr/

Since a reference genome is available, an alternative to de novo assembly is reference-guided assembly, in which the sequence reads are mapped to the reference.

Prepare the data to upload to D-GENIES:

 * Reference.fasta ` ~/SG-ONT-2021/DATA/CloneX/reference.fasta `
 
 * ONT assembly in CloneX (Flye+Raconx2+Medaka) ` ~/SG-ONT-2021/RESULTS/FLYE_RACON_MEDAKA/MEDAKA_CONSENSUS/consensus.fasta ` 
 
 * ABYSS assembly generated with Illumina reads  ` ~/SG-ONT-2021/DATA/DGENIES/Clone20-abyss.fasta ` 
 

#### What is the main difference between the Illumina and ONT assemblies for Clone20?