## How to run sourmash for viral genome taxonomy and recovery

How should sourmash be run for viral taxonomic classification?
Which database works best, and at what k-mer size and scaled value? Do protein signatures work?

## Steps
1. Start with simulated reads from Roux et al., 2017.
2a. Create signature files directly from these reads.
2b. Run the reads through the virome pipeline (assembly, VirSorter2).
3. Predict proteins on the predicted viruses.
4. Create signatures.
5. Run fastmultigather (on 2a, 2b, 3) -> gather -> taxonomy.
6. Run geNomad on 2b and compare to the others.

Roux et al. use RefSeq v69 for this benchmark; Tessa built a sourmash database from it:
- db: /home/ntpierce/2023-vsmash/output.refseq69
- tax: /home/ntpierce/2023-vsmash/output.refseq69/refseq69_phages.taxonomy.csv

Reads from Roux 2017 at: https://datacommons.cyverse.org/browse/iplant/home/shared/iVirus/Virome_pipeline_benchmark


In [None]:
# use gocommands to download
mamba activate gocommands

# get the QC'd reads for one sample (Sample_9 here); repeat per sample to get the whole folder
gocmd get --progress /iplant/home/shared/iVirus/Virome_pipeline_benchmark/Simulated_Viromes/Simulations_10M/Sample_9/Reads_QC_R*.fastq.gz

# move the simulated R2 reads from Roux 2017 into one folder, prefixed with the sample name
for f in Sample_*/Reads_QC_R2.fastq.gz
do
    mv "$f" "../sim_reads/${f%/Reads*}_QC_R2.fastq.gz"
done
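
The rename above relies on the `${f%/Reads*}` suffix-stripping expansion. A minimal sketch of the same trick on empty placeholder files, assuming we also want the R1 mates (the loop in the notes only moves R2); the temp folder and the `sample`/`mate` variable names are made up for the demo:

```shell
# Toy layout: two sample folders with empty placeholder read files.
demo=$(mktemp -d)
mkdir -p "$demo/Simulations_10M/Sample_1" "$demo/Simulations_10M/Sample_9" "$demo/sim_reads"
for s in Sample_1 Sample_9; do
    touch "$demo/Simulations_10M/$s/Reads_QC_R1.fastq.gz" \
          "$demo/Simulations_10M/$s/Reads_QC_R2.fastq.gz"
done

(
    cd "$demo/Simulations_10M" || exit 1
    for f in Sample_*/Reads_QC_R[12].fastq.gz; do
        sample=${f%/Reads*}   # "Sample_9/Reads_QC_R2.fastq.gz" -> "Sample_9"
        mate=${f##*_}         # "Sample_9/Reads_QC_R2.fastq.gz" -> "R2.fastq.gz"
        mv "$f" "../sim_reads/${sample}_QC_${mate}"
    done
)

ls "$demo/sim_reads"
```

Running it in a subshell keeps the `cd` from leaking into the rest of the session.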


In [None]:
# how to run a snakefile: grab an interactive node first
srun --account=ctbrowngrp -p med2 -J simread -t 24:00:00 -c 12 --mem=30gb --pty bash
mamba activate branchwater
# -n makes this a dry run; drop it to actually execute the workflow
snakemake --use-conda --resources mem_mb=30000 --rerun-triggers mtime \
    -c 12 --rerun-incomplete -k -n

In [None]:
# fastmultigather only takes lists of gz files.
# Don't have those for reads, so use a for loop over the zips.

for f in *.zip
do
    sourmash scripts fastmultigather \
        "$f" /home/ntpierce/2023-vsmash/output.refseq69/refseq69_phages.dna-sc100.zip \
        -c 54 -k 15 -s 100 -t 300
done
for f in *.csv
do
    mv "$f" "../../../fastmultigather/${f%.csv}.k15.s100.csv"
done

for f in *.zip
do
    sourmash scripts fastmultigather \
        "$f" /home/ntpierce/2023-vsmash/output.refseq69/refseq69_phages.protein-sc100.zip \
        -c 54 -k 10 -s 100 -t 300 -m protein
done
for f in *.csv
do
    mv "$f" "../../../fastmultigather/${f%.csv}.tr.k10.s100.csv"
done

sourmash gather -k 21 Sample_1_vs_s1000_dna.sig.gz /home/ntpierce/2023-vsmash/output.refseq69/refseq69_phages.dna-sc1000.zip -o Sample1_vs_s1000.csv

In [None]:
# tag the signature zips with a _reads suffix
for f in *.zip
do
    mv "$f" "${f%.zip}_reads.zip"
done

In [None]:
We will need to do a bunch of fastmultigathers:
protein (k7, k10) and dna (k15, k21), each at scale 100 and scale 1000,
for reads, mh contigs and virsorter contigs.

Can we put it in a snakefile?
We can give fmg a list of files, so put all files of the same type in the same list:
a protein list and a dna list at each scale, then loop over the ksizes.

# FILES is a placeholder for the glob matching one group of signatures
readlink -f FILES > filelist.txt

In [None]:
# quick bash loop for sourmash compare
# maybe try --containment or --avg-containment instead of --ani
mkdir -p sourmash_compare
for i in 7 10 12
do
    sourmash compare \
        *.nucl.zip \
        -o "sourmash_compare/vir_nucl.k$i.cmp" \
        -k "$i" --ani \
        --labels-to "sourmash_compare/vir_nucl.k$i.labels.csv"
done

In [None]:
# make a couple of file lists for the contig signatures
readlink -f *.prot.s1000.7* > prot.s1000.7.txt
readlink -f *.prot.s1000.10* > prot.s1000.10.txt
readlink -f *.prot.s100.7* > prot.s100.7.txt
readlink -f *.prot.s100.10* > prot.s100.10.txt
readlink -f *.dna.s1000* > dna.s1000.txt
readlink -f *.dna.s100* > dna.s100.txt

sourmash scripts fastmultigather dna.s100.txt \
    /home/ntpierce/2023-vsmash/output.refseq69/refseq69_phages.dna-sc100.zip \
    -k 21 -c 54 -t 100

# label the DNA k21, scale 100 results
for f in *.csv
do
    mv "$f" "../../fastmultigather/${f%.csv}.k21.s100.csv"
done

for f in Sample_1.mh.dna.s100.sig.gz
do
    sourmash scripts fastmultigather \
        "$f" /home/ntpierce/2023-vsmash/output.refseq69/refseq69_phages.dna-sc100.zip \
        -c 54 -k 21 -s 100 -t 300
done
# label the DNA k21, scale 100 results
for f in *.csv
do
    mv "$f" "../../../fastmultigather/${f%.csv}.k21.s100.csv"
done



sourmash scripts fastmultigather \
Samples.dna.s1000.sig.gz /home/ntpierce/2023-vsmash/output.refseq69/refseq69_phages.dna-sc1000.zip \
-c 54 -k 21 -s 1000 -t 300 && mv *.csv ./dna_contigs/k21_s1000/ && \
sourmash scripts fastmultigather \
Samples.dna.s1000.sig.gz /home/ntpierce/2023-vsmash/output.refseq69/refseq69_phages.dna-sc1000.zip \
-c 54 -k 15 -s 1000 -t 300 && mv *.csv ./dna_contigs/k15_s1000/ && \
sourmash scripts fastmultigather \
Samples.prot.k10.s100.sig.gz /home/ntpierce/2023-vsmash/output.refseq69/refseq69_phages.protein-sc100.zip \
-c 54 -k 10 -s 100 -t 300 -m protein && mv *.csv ./protein/k10_s100/ && \
sourmash scripts fastmultigather \
Samples.prot.k10.s1000.sig.gz /home/ntpierce/2023-vsmash/output.refseq69/refseq69_phages.protein-sc1000.zip \
-c 54 -k 10 -s 1000 -t 300 -m protein && mv *.csv ./protein/k10_s1000/ && \
sourmash scripts fastmultigather \
Samples.prot.k7.s100.sig.gz /home/ntpierce/2023-vsmash/output.refseq69/refseq69_phages.protein-sc100.zip \
-c 54 -k 7 -s 100 -t 300 -m protein && mv *.csv ./protein/k7_s100/ && \
sourmash scripts fastmultigather \
Samples.prot.k7.s1000.sig.gz /home/ntpierce/2023-vsmash/output.refseq69/refseq69_phages.protein-sc1000.zip \
-c 54 -k 7 -s 1000 -t 300 -m protein && mv *.csv ./protein/k7_s1000/ 
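
The chained calls above all follow one pattern, so they could probably be generated instead of typed. A dry-run sketch that only prints the commands, using the same flags as above. Assumptions: `plan_fmg` is a made-up helper name, the full grid (including `Samples.dna.s100.sig.gz`, which is not used in the chain above) follows the same file naming, and the output folders may not exist yet, hence the `mkdir -p`:

```shell
# Dry run: print one fastmultigather call per moltype / k / scaled combination.
DB_DIR=/home/ntpierce/2023-vsmash/output.refseq69   # database folder from the notes

plan_fmg() {   # hypothetical helper, not part of sourmash
    local moltype ksizes k scaled query db outdir extra
    for moltype in dna protein; do
        if [ "$moltype" = dna ]; then ksizes="15 21"; else ksizes="7 10"; fi
        for k in $ksizes; do
            for scaled in 100 1000; do
                if [ "$moltype" = dna ]; then
                    query="Samples.dna.s${scaled}.sig.gz"        # assumed naming
                    outdir="./dna_contigs/k${k}_s${scaled}"
                    extra=""
                else
                    query="Samples.prot.k${k}.s${scaled}.sig.gz"
                    outdir="./protein/k${k}_s${scaled}"
                    extra=" -m protein"
                fi
                db="$DB_DIR/refseq69_phages.${moltype}-sc${scaled}.zip"
                # The glob stays quoted here, so it is printed, not expanded.
                echo "mkdir -p $outdir && sourmash scripts fastmultigather $query $db -c 54 -k $k -s $scaled -t 300$extra && mv *.csv $outdir/"
            done
        done
    done
}

plan_fmg
```

After checking the printed lines, `plan_fmg | bash` would run them one after another.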

In [None]:
# concatenate the per-sample gather outputs, then run taxonomy
csvtk concat *.gather.tr.k7.s100.csv > ../reads.gather.tr.k7.s100.csv
csvtk concat *.gather.tr.k7.s1000.csv > ../reads.gather.tr.k7.s1000.csv
csvtk concat *.gather.tr.k10.s100.csv > ../reads.gather.tr.k10.s100.csv
csvtk concat *.gather.tr.k10.s1000.csv > ../reads.gather.tr.k10.s1000.csv

csvtk concat *.gather.k15.s100.csv > ../reads.gather.k15.s100.csv
csvtk concat *.gather.k15.s1000.csv > ../reads.gather.k15.s1000.csv
csvtk concat *.gather.k21.s100.csv > ../reads.gather.k21.s100.csv
csvtk concat *.gather.k21.s1000.csv > ../reads.gather.k21.s1000.csv


# Snakemake-style placeholders: {input.csv} = gather CSV, {input.taxdb} = taxonomy CSV
sourmash tax genome -g {input.csv} -t {input.taxdb} > {output}
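
Outside of Snakemake, the same taxonomy call could be looped over the eight concatenated gather files. A dry-run sketch that only prints the commands; `plan_tax` and the `.tax.csv` output names are made up here, and the taxonomy path is the one listed at the top of these notes:

```shell
# Dry run: print one `sourmash tax genome` call per concatenated gather CSV.
TAX=/home/ntpierce/2023-vsmash/output.refseq69/refseq69_phages.taxonomy.csv

plan_tax() {   # hypothetical helper, not part of sourmash
    local suffix
    for suffix in tr.k7.s100 tr.k7.s1000 tr.k10.s100 tr.k10.s1000 \
                  k15.s100 k15.s1000 k21.s100 k21.s1000; do
        # Output file name is an assumption; pick whatever fits the folder layout.
        echo "sourmash tax genome -g reads.gather.${suffix}.csv -t $TAX > reads.gather.${suffix}.tax.csv"
    done
}

plan_tax
```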


In [None]:
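# after VirSorter2, split the contigs into one FASTA per contig to run prodigal
# (close each file before opening the next so awk does not run out of file handles)
awk '/^>/ {if (OUT) close(OUT); OUT=substr($0,2) ".fa"} OUT {print > OUT}' contigs.fa

# the individual contigs can then feed the snakefile (prodigal, sourmash).
# rename the contigs first; see the snakefile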
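
A small sketch of the split-and-rename idea on a toy FASTA, assuming headers may carry VirSorter2-style suffixes such as `||full` or a description after a space, which make awkward file names. The toy sequences, the temp folder and the "replace anything unusual with `_`" rule are assumptions for the demo, not what the snakefile does:

```shell
# Toy input: two contigs, one with a description, one with a "||full" suffix.
split_dir=$(mktemp -d)
printf '>contig_1 len=8\nACGTACGT\n>contig_2||full\nGGCC\nTTAA\n' > "$split_dir/contigs.fa"

awk -v outdir="$split_dir" '
    /^>/ {
        if (out) close(out)                  # avoid piling up open files
        name = substr($1, 2)                 # first word of the header, minus ">"
        gsub(/[^A-Za-z0-9._-]/, "_", name)   # keep only file-name-safe characters
        out = outdir "/" name ".fa"
    }
    out { print > out }                      # header and sequence lines go to the current file
' "$split_dir/contigs.fa"

ls "$split_dir"
```

The header line inside each file is left untouched; only the file name is cleaned up.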