# Metaviromics

In this exercise we will use a very basic workflow to search for viral sequences in RNA-seq data.

## Before starting

Create the folders in which we'll store our data. There are two folders already, Fastq and ReferenceGenome, whose content is self-explanatory! Let's create folders for the alignment, the unmapped reads, the Clark output table and the viral reads we'll select. SPAdes will create its own output folder. The per-lane fastq files are then concatenated into one file per mate.

In [None]:
cd /home/student/DATA/metaviromics_data/
mkdir alignment unmapped clark viral
cd Fastq
cat *R1_L*.fastq.gz > RNAseq_R1.fastq.gz
cat *R2_L*.fastq.gz > RNAseq_R2.fastq.gz
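Concatenating gzip files with `cat` yields a valid multi-member gzip file, and the glob expands the lanes in the same sorted order for both mates. A quick sanity check, sketched here under the assumption that you are still inside the Fastq folder, is to confirm that R1 and R2 hold the same number of reads. The helper name `count_reads` is ours, not part of any tool.

```shell
# Count reads in a gzipped FASTQ file (a FASTQ record spans 4 lines).
count_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Compare the two mates, but only if both merged files are present.
r1=RNAseq_R1.fastq.gz
r2=RNAseq_R2.fastq.gz
if [ -f "$r1" ] && [ -f "$r2" ]; then
    n1=$(count_reads "$r1")
    n2=$(count_reads "$r2")
    if [ "$n1" = "$n2" ]; then
        echo "OK: $n1 read pairs"
    else
        echo "Mismatch: R1=$n1 R2=$n2" >&2
    fi
fi
```

A mismatch usually means a lane file is missing or truncated for one of the mates.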

Step 1: Check the quality of the fastq files with FastQC. When the program finishes, open the .html reports in any browser.

In [None]:
fastqc RNAseq_R1.fastq.gz RNAseq_R2.fastq.gz

Step 3: Build the index for Hisat2 (it will be used to map RNA-sequencing reads)

In [None]:
cd ../ReferenceGenome
hisat2-build AaegL5_Chr1.fa AaegL5_index

Step 4: Map the raw RNA-seq reads to the host reference genome, in our case a fragment of chromosome 1 of the yellow fever mosquito (Aedes aegypti, assembly AaegL5). The output is an uncompressed SAM file.

In [None]:
cd ..
hisat2 -x ReferenceGenome/AaegL5_index -1 Fastq/RNAseq_R1.fastq.gz -2 Fastq/RNAseq_R2.fastq.gz -p 2 -S alignment/RNAseq_AlignedRefGenome.sam

Step 5: Extract the unmapped reads. We are interested in reads that are NOT derived from the host genome.

In [None]:
samtools fastq -@ 2 -f 4 -1 unmapped/Unmapped_R1.fastq -2 unmapped/Unmapped_R2.fastq -s unmapped/Singletons.fastq alignment/RNAseq_AlignedRefGenome.sam
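The `-f 4` filter keeps only records whose SAM FLAG has the "read unmapped" bit (0x4) set. To see what that means in practice, the following sketch tallies that bit directly from the second column of the SAM file using plain awk. The function name `count_unmapped` is hypothetical, and the count is per alignment record, not per read pair.

```shell
# Tally records by the "unmapped" bit (0x4) of the SAM FLAG column.
count_unmapped() {
    awk -F '\t' '
        /^@/ { next }                          # skip SAM header lines
        { total++ }
        int($2 / 4) % 2 == 1 { unmapped++ }    # bit 0x4 set: read is unmapped
        END { printf "%d unmapped of %d records\n", unmapped, total }
    ' "$1"
}

# Report on the alignment only if it has already been produced.
sam=alignment/RNAseq_AlignedRefGenome.sam
if [ -f "$sam" ]; then
    count_unmapped "$sam"
fi
```

A very low unmapped fraction leaves little material for assembly, while a very high one may point to a problem with the index or the input files.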

Step 6: Time to start searching for viruses! Assemble contigs using SPAdes in viral RNA mode.

In [None]:
rnaviralspades.py -t 2 -m 2 -1 unmapped/Unmapped_R1.fastq -2 unmapped/Unmapped_R2.fastq -s unmapped/Singletons.fastq -o SPAdes_output
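Before classifying the contigs it can help to get a feel for the assembly. The sketch below summarizes a FASTA file with plain awk: number of contigs, total length and the longest contig. The function name `contig_stats` is ours; the path follows the `-o SPAdes_output` option used above.

```shell
# Print the number of contigs, the total length and the longest contig of a FASTA file.
contig_stats() {
    awk '
        # Close the current record and update the running totals.
        function tally() {
            if (name != "") {
                count++
                total += n
                if (n > max) { max = n; longest = name }
            }
        }
        /^>/ { tally(); name = substr($1, 2); n = 0; next }   # new header line
        { n += length($0) }                                   # sequence line
        END {
            tally()
            printf "%d contigs, %d bp total, longest %s (%d bp)\n", count, total, longest, max
        }
    ' "$1"
}

# Summarize the assembly only if SPAdes has already written it.
contigs=SPAdes_output/contigs.fasta
if [ -f "$contigs" ]; then
    contig_stats "$contigs"
fi
```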

Step 7: Let's classify all the contigs produced by SPAdes using our custom database of viruses (including Flaviviruses, Alphaviruses and Vesiculoviruses) and see if we find any contig from these genera!
Do NOT launch this command, as it requires downloading the taxonomy, which takes a lot of time.

In [None]:
bash set_targets.sh /Clark/ClarkDB/ custom
bash classify_metagenome.sh -m 2 -n 4 -O SPAdes_output/contigs.fasta -R clark/Clark_output

Step 8: Extract a fasta file containing only the contigs that Clark classified as viral. First, check the Clark output table and note which contigs received a viral ID.
Then write the names of those contigs, one per line, to viral/virIDs.txt before proceeding.
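One way to build the ID list is sketched below. It assumes the Clark results are a CSV file with a header row followed by `Object_ID,Length,Assignment` rows, with `NA` marking unassigned contigs; check your own output table before relying on this layout. Because the custom database contains only viruses, any assignment is treated as viral. The function name `make_vir_ids` is hypothetical.

```shell
# Write the IDs of contigs that received a taxon assignment, one per line.
# Assumed layout: header row, then Object_ID,Length,Assignment (NA = unassigned).
make_vir_ids() {
    awk -F ',' '
        NR == 1 { next }                              # skip the header row
        {
            id = $1; tax = $3
            gsub(/^[ \t]+|[ \t\r]+$/, "", id)         # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t\r]+$/, "", tax)
        }
        tax != "" && tax != "NA" { print id }
    ' "$1"
}

# Usage (replace the placeholder with the path of your Clark results table):
#   make_vir_ids <Clark results .csv> > viral/virIDs.txt
```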

In [None]:
python ExtractFasta.py -f SPAdes_output/contigs.fasta -k viral/virIDs.txt > viral/Viral_contigs.fasta
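To make the extraction step less of a black box, here is a minimal awk sketch of the same idea: keep every FASTA record whose header name appears in an ID list. This is not the implementation of ExtractFasta.py, only an illustration, and the function name `extract_by_id` is ours. It matches on the first word of the header line, without the leading `>`.

```shell
# Print FASTA records whose header name is listed in the ID file.
# Usage: extract_by_id ids.txt contigs.fasta
extract_by_id() {
    awk '
        NR == FNR { if ($1 != "") keep[$1] = 1; next }   # first file: load the IDs
        /^>/ { printing = (substr($1, 2) in keep) }      # header: keep this record?
        printing                                         # print lines of kept records
    ' "$1" "$2"
}

# Cross-check the result of the step above, if the files exist.
if [ -f viral/virIDs.txt ] && [ -f SPAdes_output/contigs.fasta ]; then
    extract_by_id viral/virIDs.txt SPAdes_output/contigs.fasta | grep -c '^>'
fi
```

The final number is the count of extracted contigs; it should match the number of IDs in viral/virIDs.txt.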

## Now that we have the Clark output and the viral contigs fasta...

Let's check them using BLAST, the NCBI viral taxonomy, metagenomics analysis tools (e.g. MEGAN) or online tools, and visualize the data using Krona...