## Viral taxonomy on pig vOTUs

in sourmash/viral_taxonomy

Do with both the ICTV and GenBank databases.

Steps:
1) Run fastmultigather of all vOTUs against a viral taxonomy db.
   All vOTUs are in: sig_files/signatures_concat/allvOTUs.zip
2) Run at two scaled values, 100 and 1000, and see what the difference is between the two.
3) Use the Snakefile to go from fastmultigather output to taxonomy.
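
For step 2, one quick way to see the difference between the two scales is to check which queries produced a result file at one scale but not the other. This is a minimal sketch, assuming fastmultigather wrote one CSV per query into a separate directory per run and that file names match across runs. `compare_scales` is a made-up helper name, not part of sourmash.

```shell
# Hypothetical helper: list result files found in only one of two run directories.
# Assumes each directory holds one CSV per query, with matching basenames.
compare_scales() {
    dir_a=$1
    dir_b=$2
    list_a=$(mktemp)
    list_b=$(mktemp)
    # Basenames only, sorted so comm can compare them line by line.
    (cd "$dir_a" && ls | grep '\.csv$' | sort) > "$list_a"
    (cd "$dir_b" && ls | grep '\.csv$' | sort) > "$list_b"
    # comm -23: lines unique to the first list; comm -13: unique to the second.
    comm -23 "$list_a" "$list_b" | sed "s|^|only in $dir_a: |"
    comm -13 "$list_a" "$list_b" | sed "s|^|only in $dir_b: |"
    rm -f "$list_a" "$list_b"
}
```

Called as `compare_scales sc100 sc1000`, it prints nothing when both runs produced the same set of files.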

Compare to vcontact2 output and genomad output

### Sourmash:
- genbank
- ictv
- scaled 100 and 1000


In [None]:
sourmash scripts fastmultigather \
sig.txt \
GCA_018585365.k31.rocksdb \
-k 31 -t 1000 -s 1000


In [None]:
# symlink the dbs
ln -s /path/to/db/ .

# genbank db and taxonomy file:
/home/ntpierce/2023-spillover/output.genbank-viral/genbank.2023-05.viral.dna.zip
/group/ctbrowngrp/sourmash-db/genbank-2022.03/genbank-2022.03-viral.lineages.csv.gz

# ICTV dbs and tax file
/home/ntpierce/2023-spillover/output.vmr/
/home/ntpierce/2023-spillover/output.vmr/vmr_MSL38_v1.taxonomy.csv

# try the rocksdbs

## ICTV
- run against ictv tax
- use scale 100
- use scale 1000

For now, just use a bp threshold of 0; this can always be adjusted later. Maybe use csvtk or Python to subset the results.
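
Subsetting afterwards could look something like the sketch below, which keeps only the rows at or above a minimum overlap. It assumes the gather CSV has a column named `intersect_bp` (check the header of the actual output first). `filter_gather_by_bp` is a hypothetical helper. The small CSV splitter is there because match names may contain quoted commas.

```shell
# Hypothetical helper: keep the header plus rows with intersect_bp >= a minimum.
# $1 = gather CSV (assumed to have an "intersect_bp" column), $2 = minimum bp
filter_gather_by_bp() {
    awk -v min_bp="$2" '
        # Split one CSV line into fields[], treating commas inside
        # double quotes as part of the field.
        function split_csv(line, fields,    n, i, c, inq, cur) {
            split("", fields)
            n = 0; cur = ""; inq = 0
            for (i = 1; i <= length(line); i++) {
                c = substr(line, i, 1)
                if (c == "\"") {
                    inq = !inq
                } else if (c == "," && !inq) {
                    fields[++n] = cur; cur = ""
                } else {
                    cur = cur c
                }
            }
            fields[++n] = cur
            return n
        }
        NR == 1 {
            # Find the column by name instead of hard-coding its position.
            n = split_csv($0, h)
            for (i = 1; i <= n; i++) if (h[i] == "intersect_bp") col = i
            if (!col) { print "no intersect_bp column" > "/dev/stderr"; exit 1 }
            print
            next
        }
        { split_csv($0, f); if (f[col] + 0 >= min_bp + 0) print }
    ' "$1"
}
```

For example, `filter_gather_by_bp results.csv 1000 > results.min1000.csv`, where `results.csv` is a placeholder name. If csvtk is installed, its `filter` subcommand should do the same job.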

In [None]:
# run fastmultigather with sc100 and sc1000
cd ICTV
mkdir sc100
mkdir sc1000

# symlink
ln -s /home/ntpierce/2023-spillover/output.vmr/vmr_MSL38_v1.dna.k21.zip

# do srun
srun --account=ctbrowngrp -p med2 -J fmg_gb -t 5:00:00 -c 48 --mem=70gb --pty bash

mamba activate branchwater

# run the fastmultigather
cd sc100
sourmash scripts fastmultigather \
../../votu_sigs_s100.txt \
../vmr_MSL38_v1.dna.k21.zip \
-c 48 -k 21 -t 0 -s 100

cd ../sc1000
sourmash scripts fastmultigather \
../../votu_sigs_s100.txt \
../vmr_MSL38_v1.dna.k21.zip \
-c 48 -k 21 -t 0 -s 1000

In [None]:
# genbank
# run the fastmultigather
cd sc100
sourmash scripts fastmultigather \
../../votu_sigs_s100.txt \
../genbank.2023-05.viral.dna.zip \
-c 48 -k 21 -t 300 -s 100

cd ../sc1000
sourmash scripts fastmultigather \
../../votu_sigs_s100.txt \
../genbank.2023-05.viral.dna.zip \
-c 48 -k 21 -t 0 -s 1000

In [None]:
ls . | grep 'fullgather' | wc -l
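
To turn that count into a check, the number of result files can be compared with the number of queries. This is a sketch that assumes the signature list has one path per line and that each query with matches yields exactly one file matching the pattern. `check_output_count` is a hypothetical helper. A mismatch may simply mean some vOTUs had no matches, not that the run failed.

```shell
# Hypothetical helper: compare the number of result files with the number of queries.
# $1 = signature list (assumed: one signature path per line)
# $2 = directory with fastmultigather output
# $3 = substring that marks a result file (the count above greps for 'fullgather')
check_output_count() {
    # Count non-empty lines in the list and matching file names in the directory.
    n_queries=$(grep -c . "$1")
    n_results=$(ls "$2" | grep -c "$3")
    echo "queries: $n_queries, result files: $n_results"
    # The function's exit status is 0 only when the two counts agree.
    [ "$n_queries" -eq "$n_results" ]
}
```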

In [None]:
# sourmash tax
sourmash tax genome --gather-csv <gather_csv> [ ... ] --taxonomy-csv <taxonomy_csv>
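
With one gather CSV per vOTU, listing them all on the command line gets unwieldy, so a list file passed via `--from-file` may be easier. The sketch below only prints the command, it does not run it. `make_tax_cmd` is a hypothetical helper, and it has not been checked here whether the fastmultigather CSVs carry every column `sourmash tax` expects; that may depend on the plugin version.

```shell
# Hypothetical helper: collect gather CSVs into a list file and print the
# sourmash tax command that would consume it (printed, not executed).
# $1 = directory with gather CSVs, $2 = taxonomy CSV, $3 = list file to write
make_tax_cmd() {
    # One gather CSV path per line, sorted for reproducible order.
    find "$1" -maxdepth 1 -name '*.csv' | sort > "$3"
    # Use the run directory name (e.g. sc100) as the output prefix.
    echo "sourmash tax genome --from-file $3 --taxonomy-csv $2 --output-base $(basename "$1")"
}
```

For example, `make_tax_cmd ICTV/sc100 vmr_MSL38_v1.taxonomy.csv ictv_sc100_gather.txt` writes the list and prints a command to review before running it. The list file name is a placeholder.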

## Rocksdb not working:
- For the genbank one: Error: Invalid argument: Column family not found: metadata
- For the ictv one: Error: No such file or directory (os error 2)
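
The ICTV error (os error 2) means some path could not be found. One possible cause, not confirmed here, is a symlink whose target does not exist or a relative path resolved from the wrong directory. A quick check, using a hypothetical helper `check_path`:

```shell
# Hypothetical helper: report whether a path exists, and flag dangling symlinks.
check_path() {
    if [ -e "$1" ]; then
        # -e follows symlinks, so this is true only if the target exists.
        echo "ok: $1"
    elif [ -L "$1" ]; then
        # The link itself exists but its target does not.
        echo "dangling symlink: $1"
        return 1
    else
        echo "missing: $1"
        return 1
    fi
}
```

Running it on both the db path and the signature list, from the directory where fastmultigather is launched, should show whether either one is the missing file.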

## vConTACT2
- vConTACT2 is no longer supported, so current installations throw errors in the Python code
- The Java installation doesn't work; maybe try --vcs-mode MCL instead of the Java option

In [None]:
# because vcontact isn't updated for current Python, use Chris' old version
source /home/csantosm/initconda
conda activate VCONTACT2

# Compare to vcontact2 output
# prodigal file already available
# strip everything after the first space in the FASTA headers
sed '/^>/ s/ .*//' ../../../virsorter2/protein_files >> tomato_pigeon_br_ns.faa


# and gene2genome
mamba activate vcontact2

python vcontact_gene2genome.py -p 240214_allvOTUs_highq.ns.faa -o \
240214_allvOTUs_highq.csv -s Prodigal-FAA

# run vcontact2, 24 threads, 70GB, may need more
srun --account=ctbrowngrp -p bmm -J vcontact -t 16:00:00 -c 36 --mem=70gb --pty bash


vcontact2 --raw-proteins 240214_allvOTUs_highq.ns.faa \
--rel-mode 'Diamond' \
--db 'ProkaryoticViralRefSeq85-Merged' \
--proteins-fp 240214_allvOTUs_highq.csv \
--pcs-mode MCL \
--vcs-mode ClusterONE \
--threads 24 \
--c1-bin /home/csantosm/miniconda3/bin/cluster_one-1.0.jar \
--output-dir .

# if vcontact fails, rerun from the intermediate files with ClusterONE (in vcontact2_redo):
vcontact2 --threads 24 \
--pcs vcontact2/vConTACT_pcs.csv \
--contigs vcontact2/vConTACT_contigs.csv \
--pc-profiles vcontact2/vConTACT_profiles.csv \
--vcs-mode ClusterONE \
--c1-bin /home/csantosm/miniconda3/bin/cluster_one-1.0.jar \
--db 'ProkaryoticViralRefSeq85-Merged' --output-dir vcontact2_redo

# if vcontact fails, try with MCL (in vcontact2_mcl):
vcontact2 --threads 36 \
--pcs vcontact2/vConTACT_pcs.csv \
--contigs vcontact2/vConTACT_contigs.csv \
--pc-profiles vcontact2/vConTACT_profiles.csv \
--vcs-mode MCL \
--c1-bin /home/csantosm/miniconda3/bin/cluster_one-1.0.jar \
--db 'ProkaryoticViralRefSeq85-Merged' --output-dir vcontact2_mcl



### genomad
- Uses ICTV taxonomy (vmr 19)
- I think we only need genomad aggregated-classification


In [None]:
# db is in the Emerson lab group directory; symlink that and the vOTU file:
ln -s /group/jbemersogrp/databases/genomad/genomad_db .

# srun it; needs quite a lot of memory
srun --account=ctbrowngrp -p bmm -J genomad -t 64:00:00 -c 8 --mem=100gb --pty bash

# run end-to-end for everything; annotation is needed for classification
mamba activate genomad
genomad end-to-end --threads 8 --enable-score-calibration \
240214_allvOTUs_highq.fa genomad_out genomad_db --splits 3
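
For the comparison with the sourmash results, a tally of the assigned lineages is a useful first look. This is a sketch that assumes the geNomad summary table is tab-separated with a header column named `taxonomy`; the file is expected to be something like `genomad_out/240214_allvOTUs_highq_summary/240214_allvOTUs_highq_virus_summary.tsv`, but check the actual output. `tally_taxonomy` is a hypothetical helper.

```shell
# Hypothetical helper: count sequences per taxonomy string in a tab-separated table.
# $1 = TSV with a header row containing a "taxonomy" column (assumed layout)
tally_taxonomy() {
    tab=$(printf '\t')
    awk -F '\t' '
        NR == 1 {
            # Find the column by name instead of hard-coding its position.
            for (i = 1; i <= NF; i++) if ($i == "taxonomy") col = i
            if (!col) { print "no taxonomy column" > "/dev/stderr"; exit 1 }
            next
        }
        { count[$col]++ }
        END { for (t in count) print count[t] "\t" t }
    ' "$1" | sort -t "$tab" -k1,1nr -k2,2
}
```

The output is one line per lineage, most frequent first, with the count in the first column.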



In [None]:
# sketch vOTUs at different scales if necessary
# (not necessary, because the scaled=100 sketches can be downsampled to 1000)
for f in *.fa
do
echo sourmash sketch dna -p k=21,scaled=100 $f --name ${f%.fa*} -o ../../vOTUs_100/${f%.fa*}.sig.gz
done | parallel -j 42 &&
for f in *.fa
do
echo sourmash sketch dna -p k=21,scaled=1000 $f --name ${f%.fa*} -o ../../vOTUs_1000/${f%.fa*}.sig.gz
done | parallel -j 42

In [None]:
# run the Snakefile with snakemake
srun --account=ctbrowngrp -p med2 -J fmg -t 12:00:00 -c 30 --mem=60gb --pty bash
snakemake --resources mem_mb=60000 --rerun-triggers mtime -c 30 --rerun-incomplete -k --latency-wait 1
