## Viral sequence processing:

Sequences need to be filtered on quality stats and clustered at 95% ANI. I am doing this with CD-HIT because it is the low-effort option and works as well as dRep here. Only sequences from the 338 'good' metagenomic datasets are used.

After clustering:
- pangenomics
- host linkages
- AMGs (through VIBRANT)?




### Clustering and filtering viral sequences:
- We have low, medium and high quality sequences. Since these come from metaGs, I only want to keep the high and medium quality ones.
- Need to include all new vOTUs
- Then cluster using CD-HIT, keeping the full sequence names in the output (see https://www.biostars.org/p/366171/)
- The full sequence names are needed for pangenomics later
- vOTUs can be used for AMGs and host linking
- Cluster at 95% ANI
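
As a cross-check on the filtering step, the same two thresholds used in this notebook (`max_score >= 0.9`, `length >= 5000`) can be applied with plain awk. This is only a sketch: the function name `filter_viral_scores` is hypothetical, and it assumes a tab-separated table with a header row containing `max_score` and `length` columns.

```shell
# Hypothetical helper (not part of the original workflow):
# keep rows of a tab-separated score table that pass both thresholds.
# $1 = TSV with header, $2 = minimum max_score, $3 = minimum length
filter_viral_scores() {
    awk -F '\t' -v s="$2" -v l="$3" '
        NR == 1 {
            # locate columns by header name rather than by position
            for (i = 1; i <= NF; i++) col[$i] = i
            print
            next
        }
        $col["max_score"] >= s && $col["length"] >= l
    ' "$1"
}
```

Example use: `filter_viral_scores renamed_viralscore.tsv 0.9 5000 | wc -l` should give the number of retained sequences plus one header line, which can be compared with the csvtk result.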

## AMGs
For AMGs:
- Rerun VS2 with dramV setting
- Run DRAM-V

In [None]:
# Clustering and filtering for high-quality vOTUs

# In: /group/ctbrowngrp2/scratch/annie/2024-pigparadigm/results/vOTUs/virsorter2
# concatenate all viral score tsvs (n=111,082)
mamba activate csvtk
csvtk concat virsorter2/*/final-viral-score.tsv -t > virsorter2_viralscores.tsv

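# Extract the original FASTA headers from each VirSorter2 output, one csv per sample
for f in virsorter2/*/final-viral-combined.fa
do
echo "$f"
# use only the sample directory name, so the csv lands directly in oldhead/
sample=$(basename "$(dirname "$f")")
grep '^>' "$f" > "./tsv_files/oldhead/${sample}.csv"
done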

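# now for the renamed ones (viral seqs were renamed because of spaces in names)
for f in contigs/*rename.fa
do
echo "$f"
# strip the directory and the _rename suffix so names match the oldhead files
sample=$(basename "${f%_rename*}")
grep '^>' "$f" > "../header_files/newhead/${sample}.csv"
done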

# remove the angle brackets and commas, in both the oldhead and newhead folders
for f in *.csv
do
sed 's/[<>,]//g' "$f" > "${f%.csv}.clean.csv"
done

# now paste old and new names side by side (run from the oldhead folder)
for f in *.clean.csv
do
paste -d "\t" "$f" "../newhead/$f" > "../combined_head/$f"
done

# add header (run from the combined_head folder)
for f in *.csv
do
csvtk add-header -t "$f" -n seqname,newname > "$f.newname"
done

# concat all
csvtk concat -t *.newname > ../header_keys.tsv

# join the final csvs
csvtk join -t -f seqname virsorter2_viralscores.tsv header_files/header_keys.tsv > renamed_viralscore.tsv

# select for max_score >= 0.9 (n=62,146)
csvtk filter -f "max_score>=0.9" -t renamed_viralscore.tsv > highqual_renamed_viralscore.tsv
# then keep only sequences of at least 5,000 bp
csvtk filter -f "length>=5000" -t highqual_renamed_viralscore.tsv > highqual_renamed_viralscore.len.tsv


# take the column with the new names so we can filter the fasta
csvtk cut -f "newname" highqual_renamed_viralscore.len.tsv -t > highqual_namelist.txt

# now use bbmap to filter fasta
mamba activate bbmap 
filterbyname.sh in=hq_virseqs.fa out=hq_virseqs.len.fa names=./tsv_files/highqual_namelist.txt include=t

# and sort it by length 
sortbyname.sh in=hq_virseqs.len.fa out=hq_virseqs.sort.fa length descending

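# now cluster with CD-HIT to deduplicate:
# -c 0.95 = 95% identity, -aS 0.85 = alignment must cover 85% of the shorter sequence,
# -d 0 = keep full sequence names in the .clstr file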
srun --account=ctbrowngrp -p bmm -J cdhit -t 5:00:00 -c 32 --mem=70gb --pty bash

mamba activate cdhit
cd-hit-est -i hq_virseqs.sort.fa \
-o hq_virseqs.95.cluster.fa -d 0 \
-c 0.95 -aS 0.85 -M 70000 -T 32

## Pangenomics
- Picked 10 clusters with more than 30 strains
- Picked 10 metaGs to cross-check
- Use Snakefile_vir
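
To find candidate clusters by size, the `.clstr` file that CD-HIT writes next to its FASTA output can be summarised with awk. A minimal sketch, assuming the standard `.clstr` layout (a `>Cluster N` line followed by one line per member, with the representative marked by a trailing `*`); the function name `large_clusters` is hypothetical.

```shell
# Hypothetical helper: list clusters with more than a given number of members.
# $1 = CD-HIT .clstr file, $2 = member count that must be exceeded
# Output columns: cluster id, member count, representative sequence name
large_clusters() {
    awk -v min="$2" '
        function report() {
            if (id != "" && n > min) print id "\t" n "\t" rep
        }
        /^>Cluster/ { report(); id = $2; n = 0; rep = ""; next }
        {
            n++
            if ($NF == "*") {
                # representative line: the name sits between ">" and "..."
                rep = $3
                sub(/^>/, "", rep)
                sub(/\.\.\.$/, "", rep)
            }
        }
        END { report() }
    ' "$1"
}
```

For the clustering run above this would be called as `large_clusters hq_virseqs.95.cluster.fa.clstr 30`.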

In [None]:
# run the Snakemake workflow from an interactive node
srun --account=ctbrowngrp -p med2 -J snake -t 20:00:00 -c 40 --mem=20gb --pty bash
mamba activate branchwater

# -n makes this a dry run; drop it to actually execute the jobs
snakemake -s Snakefile_vir_tax --use-conda --resources mem_mb=20000 --rerun-triggers mtime \
-c 40 --rerun-incomplete -k -n



## Taxonomy
- Use sourmash for the viral taxonomy
- Use genomad for taxonomy
- Compare output
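
One way to compare the two classifications is to reduce each tool's output to a two-column table (sequence name, family) and check agreement per sequence. The sketch below assumes that reduction has already been done; the function name `compare_families` and the headerless two-column TSV inputs are hypothetical, and the real geNomad and sourmash outputs contain more columns than this.

```shell
# Hypothetical helper: compare family assignments from two tools.
# $1, $2 = headerless TSVs: sequence name <TAB> family
# Prints sequences present in both files with an agree/differ flag.
compare_families() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR { fam[$1] = $2; next }   # first file: remember family per sequence
        $1 in fam { print $1, fam[$1], $2, (fam[$1] == $2 ? "agree" : "differ") }
    ' "$1" "$2"
}
```

Sequences classified by only one tool are skipped here; counting them separately would show how much each tool leaves unclassified.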

In [None]:
# geNomad
# run it through srun, it needs a fair amount of memory
srun --account=ctbrowngrp -p med2 -J genomad -t 24:00:00 -c 36 --mem=80gb --pty bash

# run end-to-end on everything: annotation is required for classification
# Use for taxonomy of DNA phage

mamba activate genomad
genomad end-to-end \
hq_virseqs.95.cluster.fa \
./genomad /group/jbemersogrp/databases/genomad/genomad_db \
--threads 36 --enable-score-calibration \
--splits 20 --cleanup 

In [None]:
# Find the sourmash code tomorrow
# Sourmash
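# Split into individual FASTA files, one per sequence
# (close each file before opening the next to avoid running out of file handles)
awk '/^>/ {if (OUT) close(OUT); OUT=substr($0,2) ".fa"} OUT {print > OUT}' ../hq_virseqs.95.cluster.fa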
# sketch contigs
# Use fastgather -> ICTV
# Then sourmash tax
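
For the sketching step, one option is the `manysketch` command from the sourmash branchwater plugin, which takes a CSV listing the input files. The sketch below only builds that CSV from the per-sequence `.fa` files; the function name `make_manysketch_csv` and the file names in the commented command are hypothetical, and the CSV header and the `-p` parameter string should be checked against the installed plugin version. It also assumes the file names contain no commas.

```shell
# Hypothetical helper: write a manysketch input CSV for a directory of .fa files.
# $1 = directory holding one FASTA file per sequence
make_manysketch_csv() {
    echo "name,genome_filename,protein_filename"
    for fa in "$1"/*.fa; do
        [ -e "$fa" ] || continue          # skip if the glob matched nothing
        name=$(basename "$fa" .fa)
        # protein_filename is left empty because these are DNA sequences
        printf '%s,%s,\n' "$name" "$fa"
    done
}

# Assumed follow-up (not run here; k and scaled match the fastmultigather call in this notebook):
# make_manysketch_csv . > vOTUs.csv
# sourmash scripts manysketch vOTUs.csv -o vOTUs.zip -p dna,k=21,scaled=100
```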


In [None]:
srun --account=ctbrowngrp -p med2 -J fmg -t 24:00:00 -c 50 --mem=60gb --pty bash


# Can fastmultigather (fmg) be run against the ICTV database?
sourmash scripts fastmultigather \
../vOTU_sketchpaths.txt \
/home/ntpierce/2023-spillover/output.vmr/vmr_MSL38_v1.dna.k21.zip \
-k 21 --scaled 100 -t 100 -m DNA -c 50 

In [None]:

# tax annotate: add ICTV lineages to the gather results
sourmash tax annotate --from-file fmg_out.txt \
-t /home/ntpierce/2023-spillover/output.vmr/vmr_MSL38_v1.taxonomy.csv \
-o fmg_tax --ictv

# tax genome 
sourmash tax genome -q --from-file fmg_out.txt -r family \
-t /home/ntpierce/2023-spillover/output.vmr/vmr_MSL38_v1.taxonomy.csv \
--output-dir fmg_tax_genome --ictv -o fmg_tax_genome_fam