# process the DNA assemblies to get viral contigs
- concatenate all contigs
- remove everything under 10kb
- rename
- run virsorter2
- run iphop
- map reads


All assembled contigs from the xx samples were concatenated into one FASTA file. Contigs shorter than 10 kb were removed and the remaining contigs were renamed, using reformat.sh and rename.sh, respectively, from BBMap (ref). VirSorter2 (ref) was then run on all remaining contigs (n=454,868), using --min-length 10000 and --min-score 0.9. The resulting viral contigs were dereplicated at approximately species level (95% ANI over 85% of the length of the shorter contig) with cd-hit-est from CD-HIT (ref), using -c 0.95 and -aS 0.85. Clustering resulted in xx viral contigs that were used for further analysis.
Bowtie2 (ref) was used to map reads to the viral contigs. A Bowtie2 index was created using the bowtie2-build command with otherwise default settings. Both metatranscriptomic and metagenomic reads were mapped, to get an overview of the 'active' viral community and the present viral community, respectively. Reads were mapped using the --no-unal and --sensitive settings.
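
The notebook does not record the mapping commands themselves, so here is a minimal sketch of what the step could look like. The index prefix, sample file names, thread count, SAM output, and the assumption that reads are paired-end are all illustrative and not taken from the original run. `map_cmd` is a hypothetical helper that only prints the bowtie2 call, so it can be inspected before being executed.

```shell
# Build the index once from the dereplicated viral contigs (default settings).
# The index prefix "viral_contigs.95" is an assumption:
#   bowtie2-build viral_contigs.95.cluster.fa viral_contigs.95

# map_cmd INDEX R1 R2 OUT_SAM [THREADS]
# Hypothetical helper: prints the bowtie2 call for one paired-end sample,
# using the --sensitive and --no-unal settings described above.
map_cmd() {
    printf 'bowtie2 -x %s -1 %s -2 %s --sensitive --no-unal -p %s -S %s\n' \
        "$1" "$2" "$3" "${5:-8}" "$4"
}

# Inspect the command first, then run it with: eval "$(map_cmd ...)"
# map_cmd viral_contigs.95 sample_R1_trim.fastq.gz sample_R2_trim.fastq.gz sample.sam 24
```

Printing the command rather than running it keeps the sketch safe to paste into a session where bowtie2 is not yet activated; the same call would be repeated per sample for the metagenomic and the metatranscriptomic reads.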



In [None]:
# concatenate (n=)
cat /home/hfm/ER_rumenShotgun/ER_Nanuq/MAG/assembled_reads/NS*/final.contigs.fa > all_DNA_contigs.Hugo.fa

# srun 
srun --account=ctbrowngrp -p bmm -J vs2 -t 0:10:00 -c 1 --mem=50gb --pty bash

# rename and remove < 10k
mamba activate bbmap
reformat.sh in=all_DNA_contigs.Hugo.fa out=all_DNA_contigs.10k.fa minlength=10000 -Xmx50g
rename.sh in=all_DNA_contigs.10k.fa out=all_DNA_contigs.10k.rn.fa prefix=ath_rumen_2024_ -Xmx50g

# Partition it because virsorter will take forever on 454,868 sequences
partition.sh in=../all_DNA_contigs.10k.rn.fa out=out_%.fasta ways=40 -Xmx50g
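
A quick sanity check after filtering and renaming can confirm the sequence count (454,868 in this run) and that nothing shorter than 10 kb slipped through. The sketch below uses only awk; `fasta_stats` is a hypothetical helper name and not part of bbmap.

```shell
# fasta_stats FILE
# Hypothetical helper: prints "<number of sequences> <shortest sequence length>".
# Handles multi-line FASTA records.
fasta_stats() {
    awk '
        /^>/ {
            # a new header closes the previous record: update the minimum
            if (seen && (min == "" || len < min)) min = len
            seen = 1; n++; len = 0
            next
        }
        { len += length($0) }
        END {
            # account for the last record in the file
            if (seen && (min == "" || len < min)) min = len
            print n + 0, min + 0
        }
    ' "$1"
}

# Usage on the file produced above:
# fasta_stats all_DNA_contigs.10k.rn.fa
```

If the second number is below 10000, the length filter did not apply as expected.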


In [None]:
# snakemake virsorter on the subsetted files
srun --account=ctbrowngrp -p med2 -J vs2 -t 24:00:00 -c 48 --mem=50gb --pty bash

mamba activate branchwater
snakemake --use-conda --resources mem_mb=50000 --rerun-triggers mtime \
-c 48 --rerun-incomplete -k -s Snakefile_vs2


# snakemake file
FASTA, = glob_wildcards('../resources/split_DNA_contigs/{fasta}.fasta')

rule all:
    input:
        expand("../results/virsorter2/DNA/check/{fasta}.done", fasta=FASTA),

rule virsorter2:
    input:
        fasta = '../resources/split_DNA_contigs/{fasta}.fasta',
    output:
        check='../results/virsorter2/DNA/check/{fasta}.done',
    conda: 
        "virsorter2"
    threads: 16
    shell:
        """
        virsorter run all -w ../results/virsorter2/DNA/{wildcards.fasta} \
        -i {input.fasta} -d /group/jbemersogrp/databases/virsorter/ \
        --min-length 10000 -j {threads} --min-score 0.9 \
        && touch {output.check}
        """
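
Once every partition has finished, the per-partition VirSorter2 results still have to be merged into the single `viral_contigs.fa` used in the next cell. A minimal sketch, assuming each working directory contains VirSorter2's standard `final-viral-combined.fa`; `collect_viral` is a hypothetical helper name.

```shell
# collect_viral RESULTS_DIR OUT_FASTA
# Hypothetical helper: concatenates final-viral-combined.fa from every
# partition directory and prints the number of viral sequences collected.
collect_viral() {
    results_dir=$1
    out=$2
    : > "$out"    # start from an empty output file
    for f in "$results_dir"/*/final-viral-combined.fa; do
        [ -e "$f" ] || continue    # skip if the glob matched nothing
        cat "$f" >> "$out"
    done
    grep -c '^>' "$out" || true    # grep exits 1 when the count is zero
}

# Usage (paths as in the Snakefile above):
# collect_viral ../results/virsorter2/DNA viral_contigs.fa
```

The printed count is the number that should match the total reported at the top of the next cell.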


In [None]:
# resulted in 58,188 viral contigs
# Now concatenate and drep at 95

# srun 
srun --account=ctbrowngrp -p bmm -J vs2 -t 0:10:00 -c 1 --mem=50gb --pty bash

# rename again because VirSorter2 appends '||' suffixes (e.g. ||full) to the sequence names
mamba activate bbmap
rename.sh in=viral_contigs.fa out=viral_contigs.rn.fa prefix=ath_rumvir_24_ -Xmx50g

# split contigs for drep
mkdir contigs
cd ./contigs
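# one file per contig; close each file before opening the next to avoid running out of file handles
awk '/^>/ {if (OUT) close(OUT); OUT=substr($0,2) ".fa"} OUT {print > OUT}' ../viral_contigs.rn.fa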

srun --account=ctbrowngrp -p bmh -J drep -t 10:00:00 -c 24 --mem=100gb --pty bash
mamba activate drep

# Run dRep at 95% ANI over 85% of length of longest contigs
dRep dereplicate ./drep --S_algorithm ANImf --ignoreGenomeQuality -l 10000 -sa 0.95 -nc 0.85 -p 24

In [None]:
# dRep did not work well here, so dereplicate with CD-HIT instead
srun --account=ctbrowngrp -p bmm -J cdhit -t 5:00:00 -c 32 --mem=70gb --pty bash

mamba activate cdhit
cd-hit-est -i viral_contigs.rn.fa \
-o viral_contigs.95.cluster.fa -d 0 \
-c 0.95 -aS 0.85 -M 95000 -T 24
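
CD-HIT writes the cluster membership next to the output FASTA, in a file with the `.clstr` suffix (here `viral_contigs.95.cluster.fa.clstr`). A minimal sketch for summarising it, to get the number of clusters reported in the text; `clstr_summary` is a hypothetical helper name.

```shell
# clstr_summary FILE.clstr
# Hypothetical helper: prints the number of clusters and how many of them
# are singletons (clusters with a single member).
clstr_summary() {
    awk '
        /^>Cluster/ {
            # a new cluster header closes the previous cluster
            if (n > 0 && size == 1) single++
            n++; size = 0
            next
        }
        { size++ }    # every other line is one cluster member
        END {
            if (n > 0 && size == 1) single++
            printf "clusters=%d singletons=%d\n", n, single
        }
    ' "$1"
}

# Usage on the output of the run above:
# clstr_summary viral_contigs.95.cluster.fa.clstr
```

The cluster count should equal the number of sequences in `viral_contigs.95.cluster.fa`, since CD-HIT keeps one representative per cluster.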

In [None]:
# DNA reads:
/home/hfm/ER_rumenShotgun/ER_Nanuq/TRIMMED/*R1_trim.fastq.gz

# metaT reads
/home/hfm/Rumen_Microbiome_Genomics/1_Sequences_Guanhui/TRIMMED/18048XR-81-24_S11_L003_R2_001_trim.fixed.gz

In [None]:
# iPHoP host prediction
# Only run iPHoP on metaT contigs with a predicted RdRp; I don't think it makes sense to use the rest.
# iPHoP can also be run on the DNA viruses.
mamba activate bbmap
filterbyname.sh in=../../resources/Hugo_metaT.assembly.fa out=contigs_w_rdrp.fa \
names=../genomad_out/contigs_w_rdrp.txt include=t

ln -s /group/jbemersogrp/databases/iphop . 

# run it
srun --account=ctbrowngrp -p med2 -J iphop -t 24:00:00 -c 30 --mem=70gb --pty bash

mamba activate iphop_env
iphop predict -f contigs_w_rdrp.fa \
-o ./ -d ./iphop/latest/Aug_2023_pub_rw -t 30