This part of the pipeline runs several scanner tools, each of which screens genomes specifically for one MGE type. This notebook generates the raw output that is processed downstream.

**Make sure your Docker engine is active!**

### Paths and parameters

#### Pipeline input folders

In [None]:
genomes="./02-QC/data/genomes"
proteomes_gbk="./02-QC/data/proteomes_gbk"

#### Pipeline output folders

In [None]:
task_root="./10-MGEs"
ises="$task_root/ISEs"
ises_output="$ises/output"
pseudos="$task_root/pseudogenes"
pseudos_output="$pseudos/output"
prophages="$task_root/prophages"
prophages_output="$prophages/output"
bgcs="$task_root/BGCs"
bgcs_output="$bgcs/output"
bgcs_logs="$bgcs/logs"
genome_sizes="$task_root/genome_sizes"

mkdir -p "$task_root" "$ises" "$ises_output" "$pseudos" "$pseudos_output" \
    "$prophages" "$prophages_output" "$bgcs" "$bgcs_logs" "$bgcs_output" "$genome_sizes"

#### Tool pointers and parameters

In [None]:
pseudofinder_db="/mnt/STORAGE/databases/blast_databases/Clostridia_nr"
pseudofinder="$HOME/bin/pseudofinder/pseudofinder.py"

phastest_docker_home="/mnt/STORAGE/databases/phastest/slurm-docker-cluster"
phastest_docker_input_folder="$phastest_docker_home/phastest_inputs"
phastest_docker_results_folder="$phastest_docker_home/phastest-app-docker/JOBS"

In [None]:
n_cores=20
n_pipes=5
n_cores_per_pipe=$(( n_cores / n_pipes ))
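
Because `n_cores_per_pipe` comes from integer division, any remainder is silently dropped. The sketch below repeats the values from the cell above and reports cores that no pipe would use; `leftover` is a name introduced here for illustration only.

```shell
n_cores=20
n_pipes=5
n_cores_per_pipe=$(( n_cores / n_pipes ))   # integer division, remainder discarded
leftover=$(( n_cores % n_pipes ))           # hypothetical helper: cores left without a pipe

echo "cores per pipe: $n_cores_per_pipe"
if [ "$leftover" -ne 0 ]; then
    echo "warning: $leftover core(s) unused with $n_pipes pipes" >&2
fi
```

With the values used in this notebook the division is exact, so no warning is printed.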

### Testing dependencies

In [None]:
conda activate isescan
isescan.py --version
conda deactivate

In [None]:
conda activate pseudofinder
$pseudofinder version
conda deactivate

In [None]:
conda activate antismash
antismash --version
conda deactivate

### ISEs

In [None]:
conda activate isescan
dir -1 $genomes | parallel --eta -j $n_pipes -I % isescan.py --seqfile $genomes/% --output $ises_output/% --nthread $n_cores_per_pipe \
| tee $ises/run.log
conda deactivate

### Pseudogenes

In [None]:
conda activate pseudofinder
dir -1 $genomes | xargs basename -s .fna | xargs -I % $pseudofinder annotate --genome $proteomes_gbk/%.gbff \
--outprefix $pseudos_output/% --database $pseudofinder_db --threads $n_cores --diamond | tee $pseudos/run.log
conda deactivate
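
The command above derives each GenBank path from the genome file name, so it relies on every `.fna` file having a matching `.gbff` file. A minimal sketch of such a check, run on throwaway folders with hypothetical file names (in the notebook, `$genomes` and `$proteomes_gbk` would take their place):

```shell
# Demo folders standing in for $genomes and $proteomes_gbk.
demo=$(mktemp -d)
mkdir -p "$demo/genomes" "$demo/proteomes_gbk"
touch "$demo/genomes/strainA.fna" "$demo/genomes/strainB.fna"   # hypothetical names
touch "$demo/proteomes_gbk/strainA.gbff"                        # strainB.gbff left out on purpose

# Collect the genomes that have no GenBank counterpart.
missing=""
for fna in "$demo"/genomes/*.fna; do
    name=$(basename -s .fna "$fna")
    if [ ! -f "$demo/proteomes_gbk/$name.gbff" ]; then
        missing="$missing $name"
    fi
done
echo "genomes without a GenBank file:$missing"
rm -rf "$demo"
```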

In [None]:
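# Collect each genome's Pseudofinder output files in a folder named after the genome.
# `grep -Ev '[0-9]$'` skips entries whose names end in a digit, which keeps a genome
# folder with such a name from being moved into itself.
dir -1 $genomes | xargs basename -s .fna | xargs -I % bash -c "
mkdir -p $pseudos_output/%
dir -1 $pseudos_output | grep % | grep -Ev '[0-9]$' | xargs -I {} mv $pseudos_output/{} $pseudos_output/%/{}
"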

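The tidy-up step above can be tried on a throwaway folder to see what the two `grep` filters select. This is only a sketch: the genome name and the output file names are hypothetical, and real Pseudofinder file names may differ.

```shell
demo=$(mktemp -d)
genome="GCF_000001.1"                        # hypothetical genome name ending in a digit
# Hypothetical output files plus the per-genome folder.
touch "$demo/${genome}_log.txt" "$demo/${genome}_pseudos.gff"
mkdir -p "$demo/$genome"

# Same filters as in the notebook cell: keep entries that mention the genome,
# then drop those whose name ends in a digit (here, the folder itself).
ls -1 "$demo" | grep "$genome" | grep -Ev '[0-9]$' \
    | xargs -I {} mv "$demo/{}" "$demo/$genome/{}"

moved=$(ls -1 "$demo/$genome" | wc -l | tr -d ' ')
left=$(ls -1 "$demo")
echo "moved $moved file(s); top level now holds: $left"
rm -rf "$demo"
```

Note that the filter depends on output files not ending in a digit; a genome name without a trailing digit would need a different way of excluding the folder.
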
### BGCs

The flags below are intended to match the antiSMASH web server defaults.

In [None]:
conda activate antismash
dir -1 $genomes | xargs basename -s .fna | parallel --eta -j $n_pipes -I % antismash --cpus $(( n_cores_per_pipe + 1 )) --taxon bacteria \
--output-dir $bgcs_output/% --output-basename % --logfile $bgcs_logs/%.log \
--cb-knownclusters --cb-subclusters --asf --rre --tfbs --genefinding-tool prodigal \
$genomes/%.fna
conda deactivate

### Prophages

`Phastest` has its own work folder structure we have to adhere to, so let's reproduce it here.

In [None]:
cp -u $genomes/* $phastest_docker_input_folder/

In [None]:
root=$(pwd)
cd $phastest_docker_home

`xargs -t` prints each command before executing it, showing which run is in progress.

`docker compose run -T` disables pseudo-TTY allocation, which is needed when `docker` is fed from an `xargs` pipe.

`phastest --single-diamond` runs only one `diamond` process at a time to avoid file access clashes.

In [None]:
dir -1 phastest_inputs/ | xargs -t -I % docker compose run -T --rm phastest -i contig -m lite -s % --yes --phage-only --single-diamond
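
To preview the commands that `xargs` builds without starting a container, `echo` can stand in for the real call. This is a dry-run sketch only; the two genome file names are hypothetical.

```shell
# `xargs -t` writes each constructed command to stderr; `echo` prints it to stdout
# instead of executing docker.
dry_run=$(printf '%s\n' strainA.fna strainB.fna \
    | xargs -t -I % echo docker compose run -T --rm phastest -i contig -m lite -s % --yes --phage-only --single-diamond)
printf '%s\n' "$dry_run"
```

Each input line yields one command, with `%` replaced by the file name.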

In [None]:
cd $root

In [None]:
mv $phastest_docker_results_folder/* $prophages_output/
rm -f $phastest_docker_input_folder/*
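
Since the last cell deletes the copied inputs unconditionally, one option is to clear them only when results were actually produced. The sketch below shows the idea on throwaway folders; the folder and file names are hypothetical stand-ins for `$phastest_docker_results_folder`, `$prophages_output` and `$phastest_docker_input_folder`.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/JOBS" "$demo/output" "$demo/inputs"
touch "$demo/JOBS/job1" "$demo/inputs/strainA.fna"   # hypothetical result and input

# Move results and clear inputs only if the results folder is not empty.
if [ -n "$(ls -A "$demo/JOBS")" ]; then
    mv "$demo"/JOBS/* "$demo/output/"
    rm -f "$demo"/inputs/*
    status="moved"
else
    status="no results, inputs kept"
fi

n_out=$(ls -A "$demo/output" | wc -l | tr -d ' ')
n_in=$(ls -A "$demo/inputs" | wc -l | tr -d ' ')
echo "$status: $n_out result(s) in output, $n_in input(s) remaining"
rm -rf "$demo"
```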