# South Green Training 2021

## Introduction to MinION data analysis

### PART 3

Created by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and G. Sarah (AGAP-CIRAD)

September 2021

## 1. Perform a comparison of assemblies using QUAST

The metrics of an assembly can be evaluated with a quality assessment tool such as [QUAST](http://quast.bioinf.spbau.ru/manual.html).

We have assembled the raw data with various assemblers and polished the resulting assemblies.

All intermediate and result files are available here: LINK

You can find the assembly result files in the AGGREGATED directory.

In [1]:
USER="jovyan"
cd /home/${USER}/SG-ONT-2021/
## wget 

Run QUAST on the assemblies you now have, writing the results to the QUAST directory, and compare them.

In [None]:
cd /home/${USER}/SG-ONT-2021/AGGREGATED

In [None]:
quast.py *.fasta -o QUAST

Go to the QUAST directory and check the file content.

Open the report.html file in a web browser.

* Looking at the output statistics, what are the main differences between the assemblies?

* Look at the total sizes, number of contigs and N50 statistics.

* For your favorite clone, which assembly is the best one?
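
To make the N50 statistic concrete, here is a minimal shell sketch that computes it directly from a FASTA file with `awk` and `sort`. The function name `n50` and the toy sequences are illustrative only and are not part of the course material; QUAST remains the tool to use for the real assemblies.

```shell
# n50 FILE: print the N50 of the sequences in a FASTA file.
# N50 is the length L such that contigs of length >= L together
# cover at least half of the total assembly length.
n50 () {
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 cum += lens[i]
                 if (cum * 2 >= total) { print lens[i]; exit }
             }
         }'
}

# Toy assembly (made-up sequences): contigs of 10, 6 and 4 bp.
TOY_FASTA=$(mktemp)
printf '>c1\nACGTACGTAC\n>c2\nACGTAC\n>c3\nACGT\n' > "$TOY_FASTA"
n50 "$TOY_FASTA"   # the 10 bp contig alone reaches half of the 20 bp total
```

The same function can be pointed at any of the `*.fasta` files in AGGREGATED to cross-check the value reported by QUAST.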

### Comparison of the assemblies against the reference genome of a closely related organism

It is possible to compare assemblies against the reference genome of a closely related organism.

In our case, we will give the *reference.fasta* sequence to QUAST.

This sequence can be downloaded from the [itrop server](https://itrop.ird.fr/ont-training).

In [None]:
wget https://itrop.ird.fr/ont-training/DATA/reference.fasta

Compare the output from your assemblies against the complete genome sequence using QUAST:

In [None]:
quast.py *.fasta -R reference.fasta -o QUAST_REF

* Which assembly is closest in size to the reference?

* Does it also have the largest N50 and the fewest contigs?

The ideal shape of the Nx graph would be a vertical line if the bacterial genome were assembled into a single contig. As more contigs are present, the cumulative length graph shifts slightly to the right as contigs get shorter.

The Nx graph shows that the X assembly produces the longest and fewest contigs.

The GC content graph does not indicate contamination.

The X assembly has the best contiguity metrics.

## 2. Assessing gene space using BUSCO 

Benchmarking Universal Single-Copy Orthologs ([BUSCO](https://busco.ezlab.org/busco_userguide.html)) helps to check if you have a good assembly, by searching the expected single-copy lineage-conserved orthologs in any newly-sequenced genome from an appropriate phylogenetic clade.

Assess the gene space of each assembly.

In [None]:
mkdir -p BUSCO
cd BUSCO
ln -s ../*.fasta .
# run the following two lines to setup the busco environment
rsync -r /opt/miniconda3/pkgs/augustus-3.2.2-boost1.61_3/config/{species,model} .
export AUGUSTUS_CONFIG_PATH=$PWD
apply_BUSCO () {
    ASSEMBLY="$1" #The assembly is the first parameter to this function. The file must end in fasta
    LINEAGE=/home/data/opt-byod/busco/lineages/bacteria_odb9
    busco -i "$ASSEMBLY" -l "$LINEAGE" -c "${CPUS:-4}" -m genome -o "${ASSEMBLY/.fasta/_busco}"
}

In [None]:
for FASTA in *.fasta; do
    apply_BUSCO "$FASTA"
done

* What does the gene space look like for each assembly?

* Which assembly is the best one?
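
To compare the gene space across assemblies, the one-line score from each BUSCO summary file can be collected in a single listing. The sketch below assumes that each BUSCO output directory contains a file whose name starts with `short_summary` and ends in `.txt`, and that it holds a score line of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; the function name `busco_scores` and the toy summary are made up for illustration.

```shell
# busco_scores DIR: print "<summary file>: <score line>" for every
# BUSCO short summary found below DIR.
busco_scores () {
    find "$1" -name 'short_summary*.txt' | sort | while read -r f; do
        score=$(grep 'C:.*%' "$f" | head -n 1 | tr -d '\t ')
        printf '%s: %s\n' "$f" "$score"
    done
}

# Toy summary file with made-up numbers, only to show the expected layout.
TOY_BUSCO=$(mktemp -d)
printf '%s\n' '# BUSCO toy summary (not a real result)' \
    '	C:95.0%[S:94.0%,D:1.0%],F:2.0%,M:3.0%,n:100' \
    > "$TOY_BUSCO/short_summary_toy.txt"
busco_scores "$TOY_BUSCO"
```

On the real data, `busco_scores .` run from the BUSCO directory should list one score line per assembly, which makes the C (complete), F (fragmented) and M (missing) percentages easy to compare.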

## 3. Read alignment statistics - remapping

Read congruency is an important measure in determining assembly accuracy. Clusters of read pairs that align incorrectly are strong indicators of mis-assembly.

How well do the reads align back to the draft assemblies? Use minimap2 and samtools to assess the basic alignment statistics.

Make a folder for your results.

In [None]:
mkdir MINIMAP2
cd MINIMAP2
ln -s ../*.fasta . # link the fasta files in this directory

Then copy this function into your terminal.

In [None]:
align_reads () {
    ASSEMBLY="$1" # The assembly is the first parameter to this function. Must end in .fasta
    READS="$2"    # The reads file (FASTQ) is the second parameter to this function
    INDEX="${ASSEMBLY/.fasta/.mmi}"
    BAM="${ASSEMBLY/.fasta/_bwa_alignment.bam}" # file name kept as-is for the later Blobtools step
    minimap2 -x map-ont -d "$INDEX" "$ASSEMBLY" # Index the assembly prior to alignment
    minimap2 -ax map-ont -t 1 "$INDEX" "$READS" | samtools sort -@ 1 -T "${ASSEMBLY/.fasta/}" -O BAM -o "$BAM" -
    samtools index "$BAM"
    # minimap2 -d : build a minimap2 index (.mmi) of the assembly
    # minimap2 -ax map-ont : Align Nanopore reads to the assembly (SAM output)
    # samtools sort : Sort the output by coordinate
    #    -O BAM : save the output as a BAM file
    #    -@ <int> : use <int> cores
    #    -T <temp> : Write temporary files to <temp>.nnnn.bam
    # samtools index : index the BAM file
    # samtools view -c -F 0x904 : count primary alignments only
    #    (skip unmapped, secondary and supplementary records)
    samtools view -F 0x904 -c "$BAM"
}
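
The `-F 0x904` filter used in the last `samtools view` call excludes every record that carries at least one of three SAM flag bits. The sketch below decodes that mask with plain shell arithmetic; the variable names and the helper `is_counted` are ours, the bit values come from the SAM specification.

```shell
# SAM flag bits combined in the 0x904 mask
FLAG_UNMAPPED=$((0x4))          # read is unmapped
FLAG_SECONDARY=$((0x100))       # secondary alignment
FLAG_SUPPLEMENTARY=$((0x800))   # supplementary (chimeric) alignment

MASK=$((FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_SUPPLEMENTARY))
printf 'mask = 0x%X (%d)\n' "$MASK" "$MASK"

# is_counted FLAG: succeed if a record with this flag passes -F 0x904,
# i.e. if none of the three bits above is set.
is_counted () {
    [ $(( $1 & MASK )) -eq 0 ]
}

is_counted 0    && echo "flag 0: counted (primary, forward strand)"
is_counted 16   && echo "flag 16: counted (primary, reverse strand)"
is_counted 2064 || echo "flag 2064: skipped (supplementary, reverse strand)"
```

Because each mapped read has exactly one primary alignment, the number printed by `samtools view -F 0x904 -c` is the number of reads that mapped to the assembly.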

Run the function above for all assemblies.

In [None]:
for FASTA in *.fasta; do
    align_reads "$FASTA" ../CloneXX.fastq.gz 
done
# TODO: also retrieve the clone number automatically

## 4. Blobtools

During the sequence quality assessment stage we tried to discern whether contamination was present. Sometimes this is not feasible at the read level. By plotting Contig GC content vs Contig Read Coverage we can look for clusters of contigs that share similar coverage. The appearance of multiple clusters can indicate multiple organisms. Occasionally, contigs can also be taxonomically classified, providing further evidence for contaminants.

Run Blobtools on each assembly. Blobtools requires both a BAM file and BLAST output as input for the classification step.
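
The GC axis of such a plot can be reproduced by hand. As a minimal sketch, the function below prints the length and GC fraction of each contig in a FASTA file using `awk`; the name `gc_per_contig` and the toy sequences are illustrative, and Blobtools computes these values itself.

```shell
# gc_per_contig FILE: print "<contig>\t<length>\t<GC fraction>" per sequence.
gc_per_contig () {
    awk '
        function report() {
            if (name != "")
                printf "%s\t%d\t%.3f\n", name, len, (len ? gc / len : 0)
        }
        /^>/ { report(); name = substr($1, 2); len = 0; gc = 0; next }
        {
            seq = toupper($0)
            len += length(seq)
            gc += gsub(/[GC]/, "", seq)   # gsub returns the number of G/C it replaced
        }
        END { report() }
    ' "$1"
}

# Toy contigs (made-up): c1 is all G/C, c2 has 2 G/C out of 8 bases.
TOY_GC=$(mktemp)
printf '>c1\nGGCC\n>c2\nATGC\nATAT\n' > "$TOY_GC"
gc_per_contig "$TOY_GC"
```

Contigs from a contaminant organism often stand out because their GC fraction differs from that of the bulk of the assembly.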

In [18]:
cd ..
mkdir Blobtools
cd Blobtools
ln -s ../*.fasta .
ln -s ../MINIMAP2/*.bam . # BAM files produced by the remapping step
ln -s ../Blast/*.tsv .
apply_Blobtools () {
    ASSEMBLY="$1" # The assembly is the first parameter to this function. The file must end in .fasta
    BAM="$2" # The BAM file is the second parameter to this function
    BLAST="$3" # The BLAST file is the third parameter to this function
    BLOB_DB=/opt/blobtools/data/nodesDB.txt
    blobtools create -i "$ASSEMBLY" -b "$BAM" -t "$BLAST" -o "${ASSEMBLY/.fasta/_blobtools}" --db "$BLOB_DB"
    blobtools blobplot -i "${ASSEMBLY/.fasta/_blobtools}.blobDB.json" -o "${ASSEMBLY/.fasta/_blobtools}"
}

In [None]:
for FASTA in *.fasta; do
    apply_Blobtools "$FASTA" "${FASTA/.fasta/_bwa_alignment.bam}" "${FASTA/.fasta/_blast_alignment.tsv}"
done

* Is there contamination in the assembly?

* Do any assemblies show strange clustering?

* Why might coverage vary across contigs within an assembly?

The Blobplots all indicate a single cluster. Some contigs show fairly high coverage compared to the rest of the genome; these could be repetitive elements.

## 5. Kraken Taxonomic Classification

Occasionally classification might not be informative at the read level. By applying Kraken to the longer contigs, we can get a better idea of what is in the assembly as long as the classification database contains that information.

Run Kraken on each assembly.

In [None]:
mkdir Kraken
cd Kraken
ln -s ../*.fasta .
apply_Kraken () {
    ASSEMBLY="$1" # The assembly is the first parameter to this function. This file must end in .fasta
    KRAKEN_DB=/home/data/byod/minikraken_20141208 # The location of the kraken database
    echo "Running Kraken: $ASSEMBLY"
    kraken --threads "${CPUS:-4}" --db "$KRAKEN_DB" --fasta-input "$ASSEMBLY" > "${ASSEMBLY/.fasta/.kraken.tsv}"
    kraken-report --db "$KRAKEN_DB" "${ASSEMBLY/.fasta/.kraken.tsv}" > "${ASSEMBLY/.fasta/.kraken.rpt}"
    ktImportTaxonomy <( cut -f2,3 "${ASSEMBLY/.fasta/.kraken.tsv}" ) -o "${ASSEMBLY/.fasta/.krona.html}"
}

In [None]:
for FASTA in *.fasta; do
    apply_Kraken "$FASTA"
done

* What did you identify?

Here Kraken classifies around half of the contigs, in contrast to the reads, where it struggled to find a match. A high number of species are identified, unlike all the other analyses, which indicate a single organism. This suggests that the sample is closely related to the identified organisms but that the real organism is not present in the database used with Kraken, which is also consistent with many contigs remaining unclassified.
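
The proportion of classified contigs can be read straight from the Kraken output, whose first column is `C` (classified) or `U` (unclassified). The following sketch counts both; the function name `kraken_summary` and the toy records are made up for illustration.

```shell
# kraken_summary FILE: count classified (C) and unclassified (U) sequences
# in a Kraken output file (first tab-separated column).
kraken_summary () {
    awk -F '\t' '
        $1 == "C" { c++ }
        $1 == "U" { u++ }
        END {
            n = c + u
            printf "classified\t%d\nunclassified\t%d\n", c, u
            if (n) printf "classified_fraction\t%.2f\n", c / n
        }
    ' "$1"
}

# Toy Kraken output (made-up records): status, sequence ID, taxid, length, k-mer hits.
TOY_KRAKEN=$(mktemp)
printf 'C\tcontig_1\t562\t1200\t562:10\n'  > "$TOY_KRAKEN"
printf 'U\tcontig_2\t0\t800\t0:5\n'       >> "$TOY_KRAKEN"
printf 'C\tcontig_3\t1280\t950\t1280:7\n' >> "$TOY_KRAKEN"
printf 'U\tcontig_4\t0\t400\t0:3\n'       >> "$TOY_KRAKEN"
kraken_summary "$TOY_KRAKEN"
```

Running it on each `*.kraken.tsv` file produced above gives a quick per-assembly view before opening the Krona report.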

## 6. Visual inspection of an assembly with Bandage

Bandage is a tool for visualizing assembly graphs with connections.

We can zoom in on specific areas of the graph and interact with it by moving nodes, adding labels, changing colors and extracting sequences (GNU).

In [2]:
Bandage -h

QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-jovyan'

  ____                  _                  
 |  _ \                | |                 
 | |_) | __ _ _ __   __| | __ _  __ _  ___ 
 |  _ < / _` | '_ \ / _` |/ _` |/ _` |/ _ \
 | |_) | (_| | | | | (_| | (_| | (_| |  __/
 |____/ \__,_|_| |_|\__,_|\__,_|\__, |\___|
                                 __/ |     
                                |___/      
Version: 0.8.1

Usage:    Bandage <command> [options]
          
Commands: <blank>      Launch the Bandage GUI
          load         Launch the Bandage GUI and load a graph file
          info         Display information about a graph
          image        Generate an image file of a graph
          querypaths   Output graph paths for BLAST queries
          reduce       Save a subgraph of a larger graph
          
Options:  --help       View this help message
          --helpall    View all command line settings
          --version    View Bandage version number

In [12]:
cd /home/${USER}/SG-ONT-2021/

In [13]:
Bandage info RAVEN_Clone10/Clone10_raven.gfa


QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-jovyan'
Node count:                       3
Edge count:                       0
Smallest edge overlap (bp):       0
Largest edge overlap (bp):        0
Total length (bp):                1203789
Total length no overlaps (bp):    1203789
Dead ends:                        6
Percentage dead ends:             100%
Connected components:             3
Largest component (bp):           1105336
Total length orphaned nodes (bp): 1105336
N50 (bp):                         1105336
Shortest node (bp):               49172
Lower quartile node (bp):         49227
Median node (bp):                 49281
Upper quartile node (bp):         577309
Longest node (bp):                1105336
Median depth:                     0.000181845
Estimated sequence length (bp):   1203789


* What is the median depth of the assembly (similar to average coverage)?
Note this number down: a node whose depth is close to it represents a sequence that occurs once in your genome.

In [14]:
Bandage image RAVEN_Clone10/Clone10_raven.gfa RAVEN_Clone10/Clone10_raven.png

QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-jovyan'


* What are we seeing in GFA files?

FASTA forces the assemblies to be broken into linear sections and loses information about ambiguities.

FASTG preserves linearity and keeps local complexity.

A GFA graph may look chaotic at first sight, but it represents the actual complexity of the assembly.
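
The node and edge counts reported by `Bandage info` correspond to the `S` (segment) and `L` (link) records of the GFA file. As a rough sketch, they can be counted with `awk`; this assumes a GFA 1 file in which segment sequences are stored inline in the third column (not replaced by `*`), and the name `gfa_counts` and the toy graph are illustrative.

```shell
# gfa_counts FILE: count segments (S), links (L) and total segment length.
gfa_counts () {
    awk -F '\t' '
        $1 == "S" { nodes++; total += length($3) }
        $1 == "L" { edges++ }
        END { printf "nodes\t%d\nedges\t%d\ntotal_bp\t%d\n", nodes, edges, total }
    ' "$1"
}

# Toy graph (made-up): two segments joined by one link.
TOY_GFA=$(mktemp)
printf 'H\tVN:Z:1.0\n'          > "$TOY_GFA"
printf 'S\t1\tACGTACGT\n'      >> "$TOY_GFA"
printf 'S\t2\tACGT\n'          >> "$TOY_GFA"
printf 'L\t1\t+\t2\t+\t0M\n'   >> "$TOY_GFA"
gfa_counts "$TOY_GFA"
```

An assembly with many links per segment is where the graph view adds the most information compared to the FASTA file.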

## 7. Comparative Alignment

Comparative alignment is a useful tool to see how assemblies compare to each other. This can be useful to compare assemblies to a reference, or to see if assemblies have large structural differences.



Verify the assembled genome with D-GENIES: http://dgenies.toulouse.inra.fr/

## 8. Mapping of the sequence reads against the reference


Since a reference genome is available, an alternative to de novo assembly is a reference-guided approach, in which the sequence reads are mapped against the reference.

In [None]:
# Build a new index of the reference



# Align the reads from the sample (and produce sorted fastq output of mapped reads and unmapped reads)

# Convert the SAM file to BAM in preparation for sorting

# Sort the BAM file, in preparation for SNP calling

# Index a bam file

# Build a new index of the reference with samtools faidx

# Call variants from the sorted BAM
