### Host linkages:
We have 38,956 viral sequences (vOTUs), but they are of limited use unless we can link them to hosts.
Goal: link the viruses to hosts and, if possible, get AMG predictions.

After host linkages: run co-occurrence calculations for viruses and their hosts (sourmash?) to check whether the linkages make sense.
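
As a rough idea of what such a check could look like, here is a minimal sketch that computes a Jaccard index over the samples in which a vOTU and its putative host MAG are detected. This is not sourmash; the detection table (feature ID, sample ID), the IDs and the `jaccard` helper are all hypothetical and only illustrate the calculation.

```shell
# Toy detection table (assumption): feature ID <TAB> sample ID, one row per detection
det=$(mktemp)
printf 'vOTU_7\tS1\nvOTU_7\tS2\nvOTU_7\tS3\n'     > "$det"
printf 'MAG_12\tS2\nMAG_12\tS3\nMAG_12\tS4\n'    >> "$det"

# usage: jaccard <detections.tsv> <virus_id> <host_id>
# Prints |shared samples| / |samples with either feature|, or NA if neither is detected.
jaccard() {
  awk -F'\t' -v v="$2" -v h="$3" '
    $1 == v { inv[$2] = 1; seen[$2] = 1 }
    $1 == h { inh[$2] = 1; seen[$2] = 1 }
    END {
      for (s in seen) { uni++; if ((s in inv) && (s in inh)) both++ }
      if (uni == 0) print "NA"; else printf "%.2f\n", both / uni
    }' "$1"
}

j=$(jaccard "$det" vOTU_7 MAG_12)
echo "$j"
```

A value near 1 means the virus and host are almost always found in the same samples; a value near 0 would make the linkage look doubtful.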

How to:
- Use MinCED to predict CRISPR repeat-spacer arrays in the MAGs.
- For MAGs with a predicted repeat-spacer array, run Prokka to predict the cas genes.
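
The second step can be sketched roughly as follows, assuming one MinCED spacer FASTA per MAG named `<MAG>_spacers.fa`, where an empty or missing file means no array was predicted. The file layout, MAG names and the commented-out Prokka paths are hypothetical; the actual workflow lives in `crispr_hostlink.smk`.

```shell
# Toy layout (assumption): one spacer FASTA per MAG, named <MAG>_spacers.fa
outdir=$(mktemp -d)
printf '>c1_CRISPR_1_spacer_1\nACGTACGTACGTACGTACGTACGTACGTACGT\n' > "$outdir/MAG_A_spacers.fa"
: > "$outdir/MAG_B_spacers.fa"   # empty file: no array predicted for this MAG
printf '>c9_CRISPR_1_spacer_1\nTTGACCGGTAACCGGTTAACCGGTTAAGGCC\n' > "$outdir/MAG_C_spacers.fa"

# Collect the MAGs whose spacer file is non-empty; only these would go on to Prokka.
mags_with_spacers=""
for f in "$outdir"/*_spacers.fa; do
  if [ -s "$f" ]; then
    mag=$(basename "$f" _spacers.fa)
    mags_with_spacers="$mags_with_spacers $mag"
    # prokka --outdir "prokka/$mag" --prefix "$mag" "mags/$mag.fasta"   # hypothetical paths
  fi
done
mags_with_spacers=${mags_with_spacers# }   # drop the leading space
echo "$mags_with_spacers"
```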


Tried:
- iPHoP
- CRISPRCasTyper
- CRISPRCasFinder
- Crass

Of the CRISPR detection tools tried, MinCED is the most user-friendly so far.
With the `-spacers` option, MinCED outputs a FASTA file with all spacer sequences, which we can use to link MAGs to vOTUs.
Only MAGs with predicted spacers can be linked this way.
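
One way the spacer-to-vOTU linking could work is to BLAST the spacers against the vOTUs and keep only near-perfect, full-length matches. The sketch below filters a toy hit table. The column layout (a custom `-outfmt "6 qseqid sseqid pident length mismatch gapopen qlen"`), the one-mismatch cutoff, the IDs and the `filter_spacer_hits` helper are assumptions for illustration, not the settings used in the workflow.

```shell
# Toy BLAST table (assumption): qseqid sseqid pident length mismatch gapopen qlen
hits=$(mktemp)
printf 'c1_CRISPR_1_spacer_1\tvOTU_7\t100.0\t32\t0\t0\t32\n'  > "$hits"   # perfect, full length
printf 'c1_CRISPR_1_spacer_2\tvOTU_7\t96.9\t32\t1\t0\t32\n'  >> "$hits"   # 1 mismatch, full length
printf 'c9_CRISPR_1_spacer_1\tvOTU_3\t90.6\t32\t3\t0\t32\n'  >> "$hits"   # too many mismatches
printf 'c9_CRISPR_1_spacer_4\tvOTU_5\t100.0\t20\t0\t0\t32\n' >> "$hits"   # partial alignment only

# usage: filter_spacer_hits <hits.tsv> [max_mismatches]
# Keeps hits with no gaps, at most max_mismatches (default 1) mismatches,
# and an alignment covering the whole spacer; prints spacer ID and vOTU ID.
filter_spacer_hits() {
  awk -F'\t' -v max_mm="${2:-1}" \
    '$6 == 0 && $5 <= max_mm && $4 == $7 { print $1 "\t" $2 }' "$1"
}

links=$(filter_spacer_hits "$hits")
echo "$links"
```

Because the spacer IDs carry the contig name, each surviving row can be mapped back to its MAG to give a MAG-to-vOTU link.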


In [None]:
# Use MinCED to detect CRISPR spacers in the MAGs.
# https://github.com/ctSkennerton/minced
# Result: a total of 1482 MAGs have a CRISPR array.
mamba activate minced

# request an interactive session with srun
srun --account=ctbrowngrp -p med2 -J ccf -t 3:00:00 -c 24 --mem=50gb --pty bash

# run snakemake
mamba activate branchwater
snakemake --use-conda --resources mem_mb=50000 --rerun-triggers mtime \
-c 12 --rerun-incomplete -k -s crispr_hostlink.smk

## iPHoP stuff:
- Downloaded the iPHoP database and added our own new MAG sequences.
- iPHoP needed too much time and memory, so we use a different approach (the MinCED spacers above).

In case I change my mind later, here is the code for adding MAGs to the iPHoP database and running iPHoP.

In [None]:
# download iphop db 
# db in: /home/amhorst/databases/iphop/Aug_2023_pub_rw
mamba activate iphop_env
iphop download --db_dir ./iphop/

# run GTDB-Tk
# activate the env and run classify_wf (pplacer needs a lot of memory, so give it a scratch dir)
# A de novo classification is not needed to get the MAGs into iPHoP;
# the decorate command can be used on the classify trees instead.
mamba activate gtdbtk
gtdbtk classify_wf --cpus 36 --scratch_dir /home/amhorst/pplacer_scratch \
--genome_dir ../drep.999/dereplicated_genomes/ \
--extension fasta --skip_ani_screen --out_dir ./
# # bacteria
# gtdbtk de_novo_wf --genome_dir ../drep.999/dereplicated_genomes/ --bacteria \
# --outgroup_taxon p__Patescibacteria --out_dir pig_gut_MAGs_gtdbtk \
# --cpus 100 --force --extension fasta
# # archaea
# gtdbtk de_novo_wf --genome_dir ../drep.999/dereplicated_genomes/ --archaea \
# --outgroup_taxon p__Altiarchaeota --out_dir pig_gut_MAGs_gtdbtk \
# --cpus 100 --force --extension fasta

# decorate trees
# (decorating the classify_wf trees instead of running de_novo_wf does work)
gtdbtk decorate --input_tree gtdbtk_classify/classify/gtdbtk.bac120.classify.tree.8.tree \
--output_tree gtdbtk_decorate/gtdbtk.bac120.decorated.tree.8.tree \
--gtdbtk_classification_file gtdbtk_classify/gtdbtk.bac120.summary.tsv

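# Made the directory structure the same as in the example folder,
# with an infer folder holding the tree and taxonomy files. Not sure yet whether that works.
# Problem: the bacterial classification gives 8 output trees. This is probably because
# classify_wf by default splits the bacterial reference tree into a backbone tree plus
# class-level subtrees to save memory (--full_tree would use a single tree instead).
# For now just using the backbone tree: renamed it and removed the word "backbone".
# With de novo MAGs, may need to use the de novo command (de_novo_wf).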

In [None]:
# add new MAGs:
# iPHoP v1.3.3
# works if you have one tree for bacteria and one tree for archaea.
# Use the backbone tree from gtdbtk
# 
# https://bitbucket.org/srouxjgi/iphop/issues/7/no-new-spacers-in-added-custom-mags
srun --account=ctbrowngrp -p med2 -J iphop_add -t 14:00:00 -c 50 --mem=50gb --pty bash

mamba activate iphop_env

iphop add_to_db \
-f /group/ctbrowngrp2/scratch/annie/2024-pigparadigm/results/MAGs/drep.999/dereplicated_genomes/ \
-g /group/ctbrowngrp2/scratch/annie/2024-pigparadigm/results/MAGs/gtdbtk_decorate/ \
-o /home/amhorst/databases/iphop/pig_gut_MAGs_July2024/ \
-d /home/amhorst/databases/iphop/Aug_2023_pub_rw \
-t 50 

In [None]:
# Run iphop with new MAGs added
srun --account=ctbrowngrp -p med2 -J iphop -t 100:00:00 -c 60 --mem=70gb --pty bash

mamba activate iphop_env
iphop predict -f hq_virseqs.95.cluster.fa \
-o ./iphop -d /home/amhorst/databases/iphop/pig_gut_MAGs_July2024/ -t 60