After analysing the quality of our assemblies (with BUSCO and QUAST), we chose to continue with the assemblies built from Dorado modified basecalling, because we plan to sequence more than 250 isolates in the future. Dorado proved to be faster than Guppy and provided similar or better basecalling accuracy, as well as similar assembly quality.

# What is Blobtools?
Blobtools is a tool that allows the identification of contaminants in an assembly.

It takes as input:
- The assembly (.fasta)
- Sorted reads mapped to that assembly (.bam)
- A BLAST hits file (.out)

We already have the assemblies (see HyPo) and the mapped reads (see BWA-MEM2), so only the BLAST file needs to be generated.

# Blobtools

In [None]:
## Show all available databases
update_blastdb.pl --showall

## Download the nt (nucleotide) database
## (files are written to the current directory, so run this from the folder passed to -db below)
update_blastdb.pl --decompress nt

## Blastn each assembly against the nt database, then create the Blobtools plots

for dir in /bigvol/omion/05-Polishing/With_HyPo/Gd*; do
    # Extract the base name from the directory
    base_name=$(basename "$dir")

    # Define input and output file paths
    fasta_file="$dir/hypo_${base_name}_Dorado_modbasecalling.fasta"
    bam_file="/bigvol/omion/05-Polishing/BWA-MEM2/Mapping/Dorado/modbasecalling/sorted_bwa_mapping_${base_name}_Dorado_modbasecalling.bam"
    out_file="${dir}.out"
    blobtools_output="${base_name}_create"

    # Run blastn
    blastn -query "$fasta_file" \
           -outfmt "6 qseqid staxids bitscore std" \
           -db /bigvol/omion/Software/database_blast/nt \
           -max_target_seqs 1 \
           -max_hsps 1 \
           -evalue 1e-25 \
           -out "$out_file" \
           -num_threads 40

    # Create blobtools database
    /bigvol/omion/Software/blobtools-blobtools_v1.1.1/blobtools create -i "$fasta_file" \
                      -b "$bam_file" \
                      -t "$out_file" \
                      -o "$blobtools_output"

    # View blobtools database
    /bigvol/omion/Software/blobtools-blobtools_v1.1.1/blobtools view -i "${blobtools_output}.blobDB.json"

    # Plot blobtools database
    /bigvol/omion/Software/blobtools-blobtools_v1.1.1/blobtools plot -i "${blobtools_output}.blobDB.json"
done


Contigs smaller than 50000 bp, contigs corresponding to mitochondrial sequences, and contaminants (identity > 85 %) were removed from further analysis.
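
The commands used for this filtering step are not recorded here. As an illustration only, the length criterion on its own could be applied with a small awk helper like the one below. The function name `filter_by_length` and the file names in the usage line are hypothetical. The mitochondrial and contaminant criteria are not covered by this sketch, since they depend on the Blobtools/BLAST results.

```shell
#!/bin/bash

# Hypothetical helper (not part of the original workflow):
# print only the contigs of at least <min_len> bp from a FASTA file.
# Sequences are written on a single line (line wrapping is not preserved).
filter_by_length() {
  local min_len="$1" fasta="$2"
  awk -v min_len="$min_len" '
    # Print the contig currently held in memory if it is long enough
    function emit() {
      if (name != "" && length(seq) >= min_len) {
        print name
        print seq
      }
    }
    /^>/ { emit(); name = $0; seq = ""; next }   # new header: flush previous contig
    { seq = seq $0 }                             # sequence line: append
    END { emit() }                               # flush the last contig
  ' "$fasta"
}
```

Example call, with placeholder file names: `filter_by_length 50000 assembly.fasta > assembly.min50kb.fasta`.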

# Contig lengths of final assemblies

The final assemblies, from which contaminant and mitochondrial contigs were removed, can be found at this path: /bigvol/omion/07-Filtered_Assemblies   
Contig lengths were extracted using the following code:


In [None]:
#!/bin/bash

# Loop over each file matching the pattern filtered_hypo*
for file in /bigvol/omion/07-Filtered_Assemblies/filtered_hypo*; do
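  # Extract the isolate name from the file name (the directory part is dropped first,
  # otherwise the full path would end up in the prefix)
  prefix=$(basename "$file" | sed -E 's/^filtered_hypo_(Gd[^_]+)_.*/\1/')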

  # Awk to extract contig length and store in a .tsv file
  awk -v prefix="$prefix" '
  /^>/ {
    if (seq) {
      print prefix "\t" substr(name, 2) "\t" length(seq);
    }
    name = $0;
    seq = "";
  }
  !/^>/ {
    seq = seq $0;
  }
  END {
    if (seq) {
      print prefix "\t" substr(name, 2) "\t" length(seq);
    }
  }' "$file" >> contig_lengths.tsv
done
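
As a quick sanity check on the resulting table, the contig count and total assembly length per isolate could be summarised as sketched below. This assumes the three-column layout written by the loop above (isolate, contig name, length). The function name `summarise_contigs` is hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: per-isolate contig count and total length
# from a tab-separated table with columns: isolate, contig, length.
summarise_contigs() {
  awk -F '\t' '
    { n[$1]++; total[$1] += $3 }                     # accumulate per isolate
    END {
      for (iso in n) print iso "\t" n[iso] "\t" total[iso]
    }
  ' "$1" | sort                                      # awk array order is unspecified
}
```

Example call: `summarise_contigs contig_lengths.tsv`. Because the extraction loop appends to `contig_lengths.tsv`, running it twice without deleting the file first would double these counts.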
