In [None]:
import pandas as pd
import os

sag = '{SAG ID here}'

In this notebook, we will run the quality-controlled virus candidates through DRAM-v to look more closely at the annotations on these viral contigs.

This notebook will go through the following steps:

7. Run viral contigs through DRAM-v
8. Summarize results and assess identified virus contigs.


First, let's redefine the key variables from the previous notebooks:

In [None]:
# setting a working subdirectory for virus finding
sagdir = "{}_vfinding".format(sag)

# setting a location to place SAG contigs
sag_contigs = os.path.join(sagdir, '{}_contigs.fasta'.format(sag))

In the last notebook, we put together an input file for DRAM-v: 

In [None]:
dramv_infasta = os.path.join(sagdir, '{}_cvpassing_vcandidates.fasta'.format(sag))

#### DRAM-v

In [None]:
dram_outdir = os.path.join(sagdir, "dramv")

DRAM-v should be run in a terminal from the DRAM conda environment, rather than in this notebook. Load the environment in a terminal by typing:

```
source activate /mnt/storage/envs/dram
```

The general command for DRAM-v is:
```
DRAM-v.py annotate -i {contig} -o {outdir} --min_contig_size 2000 --low_mem_mode --threads {threads}
```

To run DRAM-v for your SAG, paste the command printed by the cell below into your terminal:

In [None]:
print('DRAM-v.py annotate -i {dramv_infasta} -o {dram_outdir} --min_contig_size 2000 --low_mem_mode --threads 2'.format(**locals()))
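
Before launching DRAM-v, it can help to check how many candidate contigs will survive the `--min_contig_size 2000` filter. The sketch below is not part of DRAM-v: `count_passing_contigs` is a hypothetical helper, the toy sequences are made up, and it assumes the cutoff is inclusive (contigs of exactly 2000 bp are kept).

```python
def count_passing_contigs(lines, min_len=2000):
    """Return (n_passing, n_total) for FASTA records read from an iterable of lines.

    Assumption: a contig passes when its length is >= min_len.
    """
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    n_passing = sum(1 for n in lengths if n >= min_len)
    return n_passing, len(lengths)


# Made-up toy input: contig_a is 12 bp, contig_b is 3 bp
toy = [">contig_a", "ACGTACGT", "ACGT", ">contig_b", "ACG"]
print(count_passing_contigs(toy, min_len=10))  # (1, 2)
```

For a real run, pass an open file handle instead of the toy list, for example `with open(dramv_infasta) as fh: print(count_passing_contigs(fh))`.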

Once DRAM-v has finished, let's check out its outputs:

In [None]:
!ls {sagdir}/dramv/

DRAM-v has summarizing functions, but they require output from specific virus finders. Since we are using a blend of several, we cannot use DRAM-v to summarize the annotation results. Let's check them out ourselves.

In [None]:
andf = pd.read_csv(os.path.join(dram_outdir, 'annotations.tsv'), sep = "\t")
andf = andf.rename(columns = {'Unnamed: 0':'orfid'})

With this table, we can start to figure out whether our candidate contigs are indeed viral.

Let's look at the columns:

In [None]:
andf.columns

DRAM-v compares ORFs to a number of different databases, and it can be hard to get a clear picture of each ORF from the raw annotation file. Let's parse a couple of the columns to pull out the annotation text, and create a summary column holding the most descriptive annotation for each ORF.

In [None]:
#andf['vogdb_text'] = [" ".join(i.split(" ")[1:-1]).replace(";",'') if type(i) != float else i for i in andf['vogdb_description']]
#andf.loc[andf['vogdb_categories'].str.contains('Xr') | andf['vogdb_categories'].str.contains('Xs') | andf['vogdb_categories'].str.contains('Xh') | andf['vogdb_categories'].str.contains('Xp'), 'vog_db_vir_protein'] = 1
#andf['vog_db_vir_protein'] = andf['vog_db_vir_protein'].fillna(0)

andf['viral_hit_genome'] = [i.split("[")[-1].split("]")[0] if type(i) != float else i for i in andf['viral_hit']]
andf['viral_hit_gene_desc'] = [" ".join(i.split("[")[0].split(" ")[1:]) if type(i) != float else i for i in andf['viral_hit']]


anns = []
source = []

# columns to draw annotations from, in order of preference
ann_cols = ['viral_hit_gene_desc', 'pfam_hits', 'kegg_hit']

for _, row in andf.iterrows():
    annotated = False
    # first pass: take the first available annotation that is not 'hypothetical'
    # (missing values are read in as float NaN, so only strings count as annotations)
    for col in ann_cols:
        if isinstance(row[col], str) and 'hypothetical' not in row[col]:
            anns.append(row[col])
            source.append(col)
            annotated = True
            break
    if not annotated:

        #if type(row['feature_max']) != float:
        #    anns.append(row['feature_max'])
        #    source.append('phanns')
        #    annotated = True

        #else:
        # second pass: accept any available annotation, even a 'hypothetical' one
        for col in ann_cols:
            if isinstance(row[col], str):
                anns.append(row[col])
                source.append(col)
                annotated = True
                break
    if not annotated:
        # no annotation from any source
        anns.append('')
        source.append('')

andf['annotation'] = anns
andf['annotation_source'] = source
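
To make the `viral_hit` parsing at the top of the cell above easier to follow, here is a small worked example. It assumes that `viral_hit` strings follow an `ACCESSION description [genome name]` layout, which is what the splitting logic implies. The helper name `split_viral_hit` and the example string are made up for illustration, and unlike the notebook code the helper strips the trailing space from the description.

```python
def split_viral_hit(hit):
    """Split 'ACCESSION description [genome]' into (description, genome).

    Assumption: the genome name is the last bracketed field and the
    accession is the first space-delimited token.
    """
    # text between the last '[' and the following ']'
    genome = hit.split("[")[-1].split("]")[0]
    # everything before the first '[', minus the leading accession
    desc = " ".join(hit.split("[")[0].split(" ")[1:]).strip()
    return desc, genome


# Made-up example string, not a real database record
example = "YP_000000.1 major capsid protein [Example phage X1]"
print(split_viral_hit(example))  # ('major capsid protein', 'Example phage X1')
```

If a hit has no bracketed genome name, this logic returns the whole string as the genome, so unusual entries are worth checking by eye.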

It's time to take a manual look at these data and take stock of what these sequences might actually be. Let's save this dataframe, keeping only the columns that we want to scrutinize more closely.

In [None]:
andf[['orfid','scaffold','viral_hit_genome','annotation','annotation_source','is_transposon', 'amg_flags']].to_csv(os.path.join(sagdir, '{}_vcandidate_annotations.csv'.format(sag)))
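
Alongside the saved table, a per-contig tally can make the manual review quicker. The following is a minimal sketch rather than a DRAM-v feature: `summarize_by_scaffold` is a hypothetical helper, the toy rows are made up, and it assumes the `scaffold`, `annotation`, and `annotation_source` columns built earlier in this notebook.

```python
import pandas as pd


def summarize_by_scaffold(df):
    """Count ORFs, annotated ORFs, and viral-database hits per scaffold."""
    return df.groupby("scaffold").agg(
        n_orfs=("annotation", "size"),
        # ORFs left without an annotation hold an empty string
        n_annotated=("annotation", lambda s: int((s != "").sum())),
        n_viral_hits=(
            "annotation_source",
            lambda s: int((s == "viral_hit_gene_desc").sum()),
        ),
    )


# Made-up toy rows standing in for the real annotation table
toy = pd.DataFrame(
    {
        "scaffold": ["c1", "c1", "c2"],
        "annotation": ["terminase", "", "ABC transporter"],
        "annotation_source": ["viral_hit_gene_desc", "", "kegg_hit"],
    }
)
summary = summarize_by_scaffold(toy)
print(summary)
```

With the real data, `summarize_by_scaffold(andf)` gives one row per candidate contig. Contigs with few or no viral hits are the ones to look at most critically.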

Questions:

Do any of your contigs contain phage structural or other hallmark genes?
(e.g. capsid, tail, collar, baseplate, wedge, integrase, terminase)
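
One way to start on this question is a simple keyword search over the `annotation` column. This is a rough sketch under stated assumptions: `flag_hallmark_orfs` is a hypothetical helper, the toy rows are made up, and the keyword list is just the examples from the question. Matching is by case-insensitive substring, so short keywords such as "tail" can produce false positives that need a manual check.

```python
import pandas as pd

# Keywords taken from the examples in the question above
HALLMARK_KEYWORDS = [
    "capsid", "tail", "collar", "baseplate", "wedge", "integrase", "terminase",
]


def flag_hallmark_orfs(df, keywords=HALLMARK_KEYWORDS):
    """Return the rows whose annotation mentions any of the keywords."""
    pattern = "|".join(keywords)
    mask = df["annotation"].fillna("").str.contains(pattern, case=False, regex=True)
    return df[mask]


# Made-up toy rows standing in for the real annotation table
toy = pd.DataFrame(
    {
        "scaffold": ["c1", "c1", "c2", "c2"],
        "annotation": [
            "Major capsid protein",
            "DNA polymerase",
            "Terminase large subunit",
            "",
        ],
    }
)
flagged = flag_hallmark_orfs(toy)
print(flagged)
```

On the real table, `flag_hallmark_orfs(andf)['scaffold'].value_counts()` would show how many flagged ORFs each contig carries.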

Can you observe any patterns in your identified viral candidate contigs?

After looking at the annotations, how many contigs in your SAG do you believe are viral?