This part of the pipeline runs the Mash-based QC analysis that ships with Panaroo (`panaroo-qc`).

**WARNING: very heavy memory load! On my system (32 GB RAM + 32 GB swap) I could run only 4 threads at once.**

### Checking dependencies

In [None]:
conda activate panaroo
panaroo-qc --version
conda deactivate

### Paths and parameters

#### Pipeline input folders

In [None]:
bakta_folder="./01-bakta"
bakta_genomes="$bakta_folder/genomes"
bakta_gffs="$bakta_folder/gffs"
bakta_proteomes="$bakta_folder/proteomes"
bakta_proteomes_gbk="$bakta_folder/proteomes_gbk"

failed_checkm="./00-refseq/failed_checkm"

metadata='./genomes_metadata'

#### Pipeline output folders

In [None]:
task_root="./02-QC"
indices="$task_root/indices"

data="$task_root/data"
QCed_gffs="$data/gffs"
QCed_genomes="$data/genomes"
QCed_proteomes="$data/proteomes"
QCed_proteomes_gbk="$data/proteomes_gbk"

mkdir -p $task_root $indices $data $QCed_gffs $QCed_genomes $QCed_proteomes $QCed_proteomes_gbk

#### Tool pointers and parameters

In [None]:
threads=4

ref_db="./utils/refseq.genomes.k21s1000.msh"

mash_plotter="./utils/plot_mash_grouped.R"

### Mash QC

In [None]:
conda activate panaroo

In [None]:
panaroo-qc -i $bakta_gffs/*.gff -o $task_root -t $threads --graph_type all --ref_db $ref_db

In [None]:
conda deactivate

In [None]:
rm -f $task_root/tmp*

### MDS contamination plot by rRNA cluster colour scale

In [None]:
root=$(pwd)

In [None]:
Rscript $mash_plotter $task_root/mds_coords.txt $metadata $task_root

In [None]:
cd $task_root

In [None]:
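# join needs both inputs sorted on the join field:
# field 2 of names_by_ids, field 1 of the Mash hits table
join -t $'\t' -1 2 -2 1 \
<(tail -n +2 ../names_by_ids | sort -t $'\t' -k 2,2) \
<(sort -t $'\t' -k 1,1 mash_contamination_hits.tab) \
> mash_contamination_hits_with_name.tab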

head mash_contamination_hits_with_name.tab

In [None]:
cd $root

**--> Manually inspect the Mash QC output <--** Carefully go through `mash_contamination_hits_with_name.tab` to pinpoint any contaminated genomes (look for high similarity to a RefSeq assembly that is certainly not your species). Also screen the boxplots and the MDS plot of Mash distances for outliers. Mark each failed genome with an `X` in the `Failed_Mash` column of your metadata spreadsheet and re-export it to `genomes_metadata`.

### Filter genome dataset for QC-succeeded genomes

**--> Have you marked all QC-failed genomes in the `genomes_metadata` file? <--**

In [None]:
# column 11 = Failed_Mash flag ("X" = failed), column 2 = accession number
awk -F '\t' '$11 == "X" {print $2}' "$metadata" > $task_root/failed_mash
cat $failed_checkm $task_root/failed_mash > failed_qc
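
The filter above depends on fixed column positions (2 for the accession number, 11 for `Failed_Mash`), which shift silently if the spreadsheet layout changes. A minimal sketch for confirming them before filtering, assuming the export has a header row; `header_names` is a hypothetical helper, not part of the pipeline:

```shell
# Hypothetical helper: print the header name of selected columns in a TSV file.
# Usage: header_names FILE COL [COL...]
header_names () (
    file=$1; shift
    for col in "$@"; do
        # Read only the first (header) line and print "column<TAB>name"
        awk -F '\t' -v c="$col" 'NR == 1 { printf "%s\t%s\n", c, $c; exit }' "$file"
    done
)

# Example (assumes the first line of genomes_metadata holds the column names):
# header_names genomes_metadata 2 11
```

If column 11 does not come back as `Failed_Mash`, adjust the column number in the `awk` filter first.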

Copy the data of the QC-passed genomes to separate folders.

In [None]:
comm -23 <(dir -1 $bakta_genomes | xargs basename -s .fna | sort) <(cat failed_qc | sort) | \
xargs -I % bash -c "
cp -u $bakta_genomes/%.fna $QCed_genomes/%.fna
cp -u $bakta_gffs/%.gff $QCed_gffs/%.gff
cp -u $bakta_proteomes/%.faa $QCed_proteomes/%.faa
cp -u $bakta_proteomes_gbk/%.gbff $QCed_proteomes_gbk/%.gbff"
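
Because the copy step runs four `cp` commands per genome, a missing source file only shows up as an error message that is easy to overlook. A minimal sketch of a completeness check; `report_missing` is a hypothetical helper that reads accession numbers on stdin and prints every expected file that does not exist:

```shell
# Hypothetical helper: list expected files that are absent from a folder.
# Usage: report_missing DIR EXT < accession_list
report_missing () (
    dir=$1
    ext=$2
    while IFS= read -r acc; do
        # Print the full expected path when the file is not there
        [ -e "$dir/$acc.$ext" ] || printf '%s\n' "$dir/$acc.$ext"
    done
)

# Example: check the GFF folder against the accessions of the copied genomes
# dir -1 $QCed_genomes | xargs basename -s .fna | report_missing $QCed_gffs gff
```

Empty output means every accession has its file in that folder.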

In [None]:
dir -1 $QCed_genomes | wc -l

### Make rRNA cluster indices

The index files serve to assign each QC-passed genome to an rRNA cluster, so that you can easily grab a set of genomes by rRNA cluster later on, without having to make a dedicated subdirectory structure.

The extensionless files are plain lists of accession numbers, while those with a `.list` extension link each accession number to the associated Bakta GenBank file.

In [None]:
function index () {
    # list of accession numbers of QC-passed genomes in rRNA cluster $1
    awk -v group="$1" -F '\t' '{if ($3==group && $12=="FALSE") print $2}' genomes_metadata | sort \
    > $indices/group$1

    # list of genome file locations (overwrite, so re-runs do not duplicate entries)
    cat $indices/group$1 | xargs -I % \
    echo -e "%\t$QCed_proteomes_gbk/%.gbff" \
    > $indices/group$1.list
}

In [None]:
index 1
index 4
index 14a
index 14b
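
As an illustration of how such an index can be used later on, the sketch below runs an arbitrary command on every file listed in a `.list` index. `for_each_in_index` is a hypothetical helper, and it assumes the two-column `accession<TAB>path` layout described above:

```shell
# Hypothetical helper: run a command on every file listed in a .list index.
# Usage: for_each_in_index LIST_FILE COMMAND [ARGS...]
for_each_in_index () (
    list=$1; shift
    tab=$(printf '\t')
    # Read the index on file descriptor 3 so the command keeps its own stdin
    while IFS="$tab" read -r acc path <&3; do
        "$@" "$path"
    done 3< "$list"
)

# Example: show the size of every GenBank file in rRNA cluster 1
# for_each_in_index $indices/group1.list ls -lh
```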

An index of the full set is just the union of all the separate indices.

In [None]:
root=$(pwd)
cd $indices
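# escape the dot and skip earlier merge output, so the merged files never read themselves
dir -1 | grep -v -e '\.list$' -e '^merge$' | xargs cat > merge
dir -1 | grep '\.list$' | grep -v '^merge\.list$' | xargs cat > merge.list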
cd $root
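
The merged index is only a true union if no accession number appears in more than one group. A minimal sketch of that check; `duplicated_lines` is a hypothetical helper:

```shell
# Hypothetical helper: print every line that occurs more than once in a file.
# Usage: duplicated_lines FILE
duplicated_lines () (
    # uniq -d prints one copy of each repeated line; it needs sorted input
    sort "$1" | uniq -d
)

# Example: an empty result means each genome sits in exactly one rRNA cluster
# duplicated_lines $indices/merge
```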

In [None]:
wc -l $indices/*.list