## Process metatranscriptomic assemblies to look for RNA viruses
- Get all metatranscriptomic assemblies (n=9,822,279)
- Keep contigs of at least 600 bp (room for at least one ORF) (n=1,459,768)
- Run geNomad to recover viral proteins (uses prodigal-gv)
- Manually go through the output and only keep matches to RdRps
- Deduplicate both the protein (99% AAI) and nucleotide sequences (95% ANI)
- Use seqkit to get RdRp protein sequences
- Download NCBI RefSeq (v. 224) RdRp sequences (n=2,124)
- Run BLASTp against all RdRps in RefSeq
- Manually compare BLAST and geNomad taxonomy
- If they match, place the RdRp in the family-level tree


For taxonomic classification of RdRps and the comparison of BLAST vs. geNomad taxonomy, see the taxonomy notebook (taxonomy.ipynb).




In [None]:
# LENGTH CUTOFF
# Keep contigs of at least 600 bp (about 200 aa)
mamba activate bbmap
reformat.sh in=Hugo_metaT.assembly.fa out=Hugo_metaT.over600.trim.fa minlength=600 -Xmx50g

# Remove spaces from headers (trd = trimreaddescription: trims each header at the first whitespace)
reformat.sh in=Hugo_metaT.over600.trim.fa out=Hugo_metaT.over600.trim.ns.fa trd
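
To sanity-check the contig count after the length cutoff, something like the following could be used. This is a minimal sketch on a toy FASTA (hypothetical data); `make_seq` and `count_min_length` are helper names made up for illustration and are not part of BBTools.

```shell
# Toy FASTA (hypothetical): 700 bp, 100 bp, and a 600 bp contig wrapped over two lines
tmpdir=$(mktemp -d)
make_seq() { awk -v n="$1" -v c="$2" 'BEGIN { for (i = 0; i < n; i++) printf "%s", c; print "" }'; }
{
  echo ">contig_1 flag=1"; make_seq 700 A
  echo ">contig_2";        make_seq 100 C
  echo ">contig_3";        make_seq 300 G; make_seq 300 T
} > "$tmpdir/toy.fa"

# Count records whose total sequence length is >= the cutoff (handles wrapped FASTA)
count_min_length() {
  # $1 = FASTA file, $2 = minimum length in bp
  awk -v min="$2" '
    /^>/ { if (seen && len >= min) n++; len = 0; seen = 1; next }
    { len += length($0) }
    END { if (seen && len >= min) n++; print n + 0 }
  ' "$1"
}

n_kept=$(count_min_length "$tmpdir/toy.fa" 600)
echo "contigs >= 600 bp: $n_kept"
```

On the real data, pointing `count_min_length` at Hugo_metaT.over600.trim.fa should reproduce the n reported for the cutoff step, assuming the file is plain uncompressed FASTA.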


In [None]:
# GENOMAD
# Use Genomad to predict and recover viral proteins from metaTs.
# link the genomad db 
ln -s /group/jbemersogrp/databases/genomad/genomad_db . 

# Request an interactive session with srun; geNomad needs a fair amount of memory
srun --account=ctbrowngrp -p bmm -J genomad -t 12:00:00 -c 50 --mem=100gb --pty bash

# Run the end-to-end pipeline on everything; annotation is needed for classification.
# After geNomad finishes, screen the results by hand.
# For contigs predicted to encode an RdRp, create phylogenetic trees and map reads to the contigs.
mamba activate genomad
genomad end-to-end \
Hugo_metaT.over600.trim.fa ../results/genomad_out genomad_db \
--threads 50 --enable-score-calibration \
--splits 10 --cleanup
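
The RdRp screening was done by hand, but a keyword pre-filter on the gene table can narrow down what needs to be looked at. The sketch below assumes the table is tab-separated and has `gene` and `annotation_description` columns (as I believe recent geNomad releases write in `*_virus_genes.tsv`); the toy table, the keyword pattern, and the output file name are all hypothetical.

```shell
# Hypothetical excerpt of a geNomad gene table; real tables have many more columns
tmpdir=$(mktemp -d)
{
  printf 'gene\tlength\tannotation_description\n'
  printf 'contig_1_1\t2100\tRNA-dependent RNA polymerase\n'
  printf 'contig_1_2\t600\tcoat protein\n'
  printf 'contig_7_1\t1500\tRdRp, viral\n'
  printf 'contig_9_1\t900\tNA\n'
} > "$tmpdir/toy_virus_genes.tsv"

# Print gene IDs whose description mentions an RdRp; columns are looked up by header name
list_rdrp_genes() {
  awk -F'\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "gene") g = i
        if ($i == "annotation_description") d = i
      }
      next
    }
    g && d && tolower($d) ~ /rna-dependent rna polymerase|rdrp/ { print $g }
  ' "$1"
}

list_rdrp_genes "$tmpdir/toy_virus_genes.tsv" > "$tmpdir/candidate_rdrp_proteins.txt"
cat "$tmpdir/candidate_rdrp_proteins.txt"
```

The resulting list is only a set of candidates; it does not replace the manual check described above.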

In [None]:
# FILTER SEQUENCES
# Use genomad output to filter assemblies and proteins to only retain hits to RdRp
# Filter assemblies (n=3,491)
mamba activate bbmap
filterbyname.sh in=Hugo_metaT.over600.trim.fa \
out=metaT.Rdrphits.fa \
names=all_contigs_rdrphit.txt include=t

# filter proteins (proteins from genomad_output, n=3,498)
mamba activate seqkit
seqkit grep -n -f all_proteins_rdrphit.txt \
Hugo_metaT.over600.trim_virus_proteins.faa -o metaT.Rdrphits.faa

# Clean headers, as I don't want spaces in them
cut -d" " -f1 metaT.Rdrphits.faa > metaT.Rdrphits.ns.faa
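
The contig list and the protein list are closely related: prodigal-style protein IDs are the contig name plus a trailing `_<gene number>`, so a contig list can be derived from a protein list. A minimal sketch on toy IDs (hypothetical; the `toy_*` file names are made up so the real lists are not overwritten):

```shell
# Toy protein ID list (hypothetical): two RdRp hits on contig_1, one on contig_7
tmpdir=$(mktemp -d)
printf '%s\n' contig_1_1 contig_1_2 contig_7_1 > "$tmpdir/toy_proteins_rdrphit.txt"

# Strip the trailing _<gene number> and collapse to unique contig names
sed 's/_[0-9][0-9]*$//' "$tmpdir/toy_proteins_rdrphit.txt" \
  | sort -u > "$tmpdir/toy_contigs_rdrphit.txt"

n_proteins=$(awk 'END { print NR }' "$tmpdir/toy_proteins_rdrphit.txt")
n_contigs=$(awk 'END { print NR }' "$tmpdir/toy_contigs_rdrphit.txt")
echo "proteins: $n_proteins, contigs: $n_contigs"
```

A protein count slightly above the contig count, as in the numbers above (3,498 vs 3,491), is what one would expect if a few contigs carry more than one RdRp hit.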


In [None]:
# DEREPLICATION
# Contigs, dereplicate at 95% ANI (n=2,378)
mamba activate cdhit
cd-hit-est -i metaT.Rdrphits.fa \
-o 240708_rdrp_contigs.fa -d 0 \
-c 0.95 -aS 0.85 -M 95000 -T 24

# Proteins, dereplicate at 99% AAI
# Deduplicate the RdRps of interest together with the RefSeq RdRps (2,656 sequences left):
mamba activate cdhit
cd-hit -i own_refseq.rdrp.faa -o own_refseq.rdrp.dedup.faa \
-c 0.99 -T 1 -d 0
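
CD-HIT writes a `.clstr` file next to each output FASTA, which is a quick way to confirm the post-dereplication counts. The sketch below runs on a toy `.clstr` file (hypothetical sequence names and accession) written in the usual CD-HIT layout, where the representative of each cluster is the line ending in `*`.

```shell
# Toy .clstr file (hypothetical content) in CD-HIT's cluster format
tmpdir=$(mktemp -d)
{
  printf '>Cluster 0\n'
  printf '0\t2100aa, >contig_1_1... *\n'
  printf '1\t2098aa, >NP_000001.1... at 99.50%%\n'
  printf '>Cluster 1\n'
  printf '0\t1500aa, >contig_7_1... *\n'
} > "$tmpdir/toy.dedup.faa.clstr"

# Number of clusters = number of sequences kept after dereplication
n_clusters=$(grep -c '^>Cluster' "$tmpdir/toy.dedup.faa.clstr")

# Representative sequence of each cluster (member lines that end in "*")
awk '/\*$/ { sub(/^.*>/, ""); sub(/\.\.\..*$/, ""); print }' \
  "$tmpdir/toy.dedup.faa.clstr" > "$tmpdir/representatives.txt"

echo "clusters: $n_clusters"
cat "$tmpdir/representatives.txt"
```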

In [None]:
# READ MAPPING
# Create a bowtie2 index (bowtie2-build takes --threads for the thread count)
mamba activate bowtie2
bowtie2-build --threads 30 240708_rdrp_contigs.fa 240708_rdrp_contigs

# link the clean metaT reads
ln -s /home/hfm/Rumen_Microbiome_Genomics/1_Sequences_Guanhui/TRIMMED/*R1_001_trim* .
ln -s /home/hfm/Rumen_Microbiome_Genomics/1_Sequences_Guanhui/TRIMMED/*R2_001_trim* .

# Use snakemake for read mapping (see Snakefile_bowtie)
# Request an interactive session
srun --account=ctbrowngrp -p med2 -J bt2 -t 5:00:00 -c 30 --mem=50gb --pty bash
# Run snakemake (-n is a dry run; remove it to actually execute the jobs)
snakemake -s Snakefile_bowtie --use-conda --resources mem_mb=50000 --rerun-triggers mtime \
-c 30 --rerun-incomplete -k -n

In [None]:
# COVERAGE TABLE
# Use CoverM to create a coverage table for RNA viruses with an RdRp
mamba activate coverm

# Make a coverage table; a contig must be covered over at least 75% of its length, otherwise its coverage is reported as 0
coverm contig -m mean --min-covered-fraction 0.75 -b *.bam > ../../240708_coverM-RNA.tsv
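
Because contigs below the covered-fraction threshold are reported with a coverage of 0, the table can be summarised as the number of contigs detected per sample. A minimal sketch on a toy table (hypothetical sample names and values), assuming the layout of one `Contig` column followed by one mean-coverage column per sample:

```shell
# Toy coverage table (hypothetical) in the assumed CoverM layout
tmpdir=$(mktemp -d)
{
  printf 'Contig\tsampleA Mean\tsampleB Mean\n'
  printf 'contig_1\t12.5\t0\n'
  printf 'contig_7\t0\t3.2\n'
  printf 'contig_9\t4.0\t0\n'
} > "$tmpdir/toy_coverage.tsv"

# For each sample column, count contigs with mean coverage above zero
awk -F'\t' '
  NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; ncol = NF; next }
  { for (i = 2; i <= ncol; i++) if ($i + 0 > 0) detected[i]++ }
  END { for (i = 2; i <= ncol; i++) printf "%s\t%d\n", name[i], detected[i] + 0 }
' "$tmpdir/toy_coverage.tsv" > "$tmpdir/detected_per_sample.tsv"

cat "$tmpdir/detected_per_sample.tsv"
```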

In [None]:
# PHYLOGENETIC TREES
# Use protein predictions to create phylogenetic trees for RdRps (and thus RNA viruses)
# Compare the blastp and geNomad taxonomy to decide which family-level tree each RdRp belongs in

# create a diamond db 
mamba activate diamond
diamond makedb --in refseqrdrp.ns.faa --db refseqrdrp.ns.dmnd

# Run blastp (of 2,656 queries, 2,028 aligned; 1,826 aligned to a reference classified at family level). Output is in ./results/alignments/
mamba activate diamond
diamond blastp -q own.dedup.faa \
--max-target-seqs 1 --header \
-e 1 --threads 1 \
--very-sensitive \
--db refseqrdrp.ns.dmnd \
-o refseq.genomad.blastp.tsv
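
The "how many aligned" numbers can be pulled from the tabular output. The sketch below assumes DIAMOND's default 12-column tabular format (query in column 1, e-value in column 11) with `#` header lines from `--header`; the toy hits, the accessions, and the stricter e-value ceiling of 1e-5 are hypothetical and only there to show the filtering.

```shell
# Toy DIAMOND output (hypothetical hits), default tabular columns:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
tmpdir=$(mktemp -d)
{
  printf '# toy header line\n'
  printf 'contig_1_1\tNP_000001.1\t85.0\t700\t105\t0\t1\t700\t1\t700\t1e-120\t800\n'
  printf 'contig_7_1\tNP_000002.1\t30.0\t100\t70\t0\t1\t100\t5\t104\t0.5\t35\n'
  printf 'contig_9_1\tNP_000003.1\t42.0\t250\t145\t0\t1\t250\t1\t250\t3e-8\t60\n'
} > "$tmpdir/toy.blastp.tsv"

# Count unique queries with any hit, and with a hit at or below an e-value ceiling
summarize_hits() {
  # $1 = DIAMOND tabular output, $2 = e-value ceiling
  grep -v '^#' "$1" | awk -F'\t' -v max_e="$2" '
    !($1 in seen) { seen[$1] = 1; total++ }
    ($11 + 0) <= (max_e + 0) && !($1 in strict) { strict[$1] = 1; n_strict++ }
    END { printf "aligned=%d strict=%d\n", total, n_strict }
  '
}

summary=$(summarize_hits "$tmpdir/toy.blastp.tsv" 1e-5)
echo "$summary"
```

With `-e 1` the search is deliberately permissive, so a second count at a stricter ceiling helps judge how many of the alignments are likely to be informative for the family-level comparison.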

In [None]:
# Create phylogenetic trees using clustalo, trimal and fasttree
# An overarching tree for all RdRps was created with the same commands
# See Snakefile_alignment
# Request an interactive session
srun --account=ctbrowngrp -p bmm -J tree -t 1:00:00 -c 36 --mem=50gb --pty bash

# Run snakemake (-n is a dry run; remove it to actually execute the jobs)
snakemake -s Snakefile_alignment --use-conda --resources mem_mb=50000 --rerun-triggers mtime \
-c 36 --rerun-incomplete -k -n

In [None]:
# Make sure to export each new env into a yml (run this with the env activated)
conda env export > environment.yml

See https://github.com/AnneliektH/2024-caleb-snakemake/ for how to call these environment files from snakemake.