In [1]:
import os
# pip install pandas as needed
import pandas as pd

Let's find some viruses!  This notebook will go through the following steps:

1. Run two virus finders (VirSorter2 and DeepVirFinder)
2. Apply thresholds to filter out unlikely candidates
3. Merge results from virus finders

First, pick a SAG!

Check out the table [here](https://docs.google.com/spreadsheets/d/1yn6GsOv8dHwtsn2UKs5KIYUtCLsc1PL5Lsp6aMV6294/edit?usp=sharing) to select a SAG. Fill in your name next to your selected SAG.

Now assign your SAG ID to the `sag` variable.  Let's also set some other variables, such as the location of the working directory and where we will place the SAG contigs.

In [45]:
# choice of SAGs
# 1. AG-910-D08
# 2. AG-910-N05 (smallest - quickest)
# 3. AG-910-N11
# 4. AG-910-F10
# 5. AG-910-K03 - really big - 20x of the smallest... 

sag = 'AG-910-N05'

# setting a working subdirectory for virus finding
sagdir = "{}_vfinding".format(sag)

# setting a location to place SAG contigs
sag_contigs = os.path.join(sagdir, 'final_contigs_{}.fasta'.format(sag))

Now let's create a working directory and copy our SAG contigs into it:

In [46]:
# -p avoids an error if the directory already exists (e.g. when re-running this cell)
!mkdir -p {sagdir}

!cp /mnt/storage/data/ag-910/{sag}/final_contigs_{sag}.fasta {sagdir}
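
If you would rather stay in Python than shell out, the same setup can be sketched with the standard library. This is only a minimal sketch and not part of the original workflow: `stage_contigs` is a hypothetical helper name, and the demonstration uses a temporary directory with a dummy FASTA file in place of the real data under `/mnt/storage/data/ag-910`.

```python
import os
import shutil
import tempfile


def stage_contigs(source_fasta, workdir):
    """Create workdir if needed and copy source_fasta into it (hypothetical helper)."""
    os.makedirs(workdir, exist_ok=True)  # no error if the directory already exists
    return shutil.copy(source_fasta, workdir)  # returns the path of the copied file


# Demonstration with throwaway files rather than the real SAG data
tmp = tempfile.mkdtemp()
demo_source = os.path.join(tmp, "final_contigs_DEMO.fasta")
with open(demo_source, "w") as handle:
    handle.write(">DEMO_NODE_1\nACGT\n")

demo_workdir = os.path.join(tmp, "DEMO_vfinding")
staged = stage_contigs(demo_source, demo_workdir)
staged_again = stage_contigs(demo_source, demo_workdir)  # safe to re-run
```

Because `os.makedirs(..., exist_ok=True)` tolerates an existing directory, the cell can be re-run without errors.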

### VirSorter2

First let's make a directory for VirSorter2 outputs:

In [47]:
vs2_output_dir = os.path.join(sagdir, 'vs2')

!mkdir -p {vs2_output_dir}

VirSorter2 is installed within a conda environment.  To load VirSorter2, open a terminal and enter:

```
source activate /mnt/storage/envs/vs2
```

The general command for VirSorter2 is:

```
virsorter run -i {sag_contigs} -w {output_dir} -j {number of jobs}
```

For this tutorial, enter the line printed below into your terminal:

In [48]:
print('virsorter run -i {sag_contigs} -w {vs2_output_dir} -j 2'.format(sag_contigs = sag_contigs, vs2_output_dir = vs2_output_dir))

virsorter run -i AG-910-N05_vfinding/final_contigs_AG-910-N05.fasta -w AG-910-N05_vfinding/vs2 -j 2


### DeepVirFinder

Next, let's run our SAG through DeepVirFinder.

In [49]:
dvf_outdir = os.path.join(sagdir, "dvf")

!mkdir -p {dvf_outdir}

Like VirSorter2, DeepVirFinder runs from a terminal inside its own conda environment rather than in this notebook. Load the environment in a terminal by typing:

```
source activate /mnt/storage/envs/dvf
```

The general command for DeepVirFinder is:
```
dvf.py -i {contig} -o {output directory} -c {cpus}
```

To run it for your SAG, enter the line printed below into your terminal:

In [50]:
print('python /mnt/storage/software/DeepVirFinder/dvf.py -i {sag_contigs} -o {dvf_outdir} -c 2 '.format(**locals()))

python /mnt/storage/software/DeepVirFinder/dvf.py -i AG-910-N05_vfinding/final_contigs_AG-910-N05.fasta -o AG-910-N05_vfinding/dvf -c 2 


Now let's explore the outputs!

### VirSorter2

Let's list the directory contents:

In [51]:
!ls {vs2_output_dir}

config.yaml		  final-viral-combined.fa  iter-0
final-viral-boundary.tsv  final-viral-score.tsv    log


The file we'll look at is called 'final-viral-score.tsv'.  Next, we'll load it into a pandas dataframe:

In [52]:
vs2df = pd.read_csv(os.path.join(vs2_output_dir, "final-viral-score.tsv"), sep = "\t")
vs2df

Unnamed: 0,seqname,dsDNAphage,ssDNA,max_score,max_score_group,length,hallmark,viral,cellular
0,AG-910-N05_NODE_1||full,0.993,0.28,0.993,dsDNAphage,17534,0,90.0,0.0
1,AG-910-N05_NODE_2||full,0.94,0.7,0.94,dsDNAphage,7245,0,42.9,0.0


Let's add some columns derived from `seqname` so that this table can be merged with other virus finder outputs:

In [54]:
vs2df['contig'] = [i.split("|")[0] for i in vs2df['seqname']]
vs2df['vs2_type'] = [i.split("|")[-1] for i in vs2df['seqname']]
vs2df['sag'] = [i.split("_")[0] for i in vs2df['contig']]
vs2df

Unnamed: 0,seqname,dsDNAphage,ssDNA,max_score,max_score_group,length,hallmark,viral,cellular,contig,vs2_type,sag
0,AG-910-N05_NODE_1||full,0.993,0.28,0.993,dsDNAphage,17534,0,90.0,0.0,AG-910-N05_NODE_1,full,AG-910-N05
1,AG-910-N05_NODE_2||full,0.94,0.7,0.94,dsDNAphage,7245,0,42.9,0.0,AG-910-N05_NODE_2,full,AG-910-N05
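
The same parsing can also be written with pandas' vectorized string methods instead of list comprehensions. The sketch below is one possible alternative, shown on a small stand-in table (`demo`) that copies only the `seqname` values displayed above; it assumes every `seqname` follows the `<contig>||<type>` pattern and that the SAG ID never contains an underscore.

```python
import pandas as pd

# Stand-in table holding only the seqname values shown above
demo = pd.DataFrame({"seqname": ["AG-910-N05_NODE_1||full", "AG-910-N05_NODE_2||full"]})

# Split once on "|" and reuse the pieces
parts = demo["seqname"].str.split("|")
demo["contig"] = parts.str[0]     # text before the first "|"
demo["vs2_type"] = parts.str[-1]  # text after the last "|"

# The SAG ID is everything before the first underscore of the contig name
demo["sag"] = demo["contig"].str.split("_").str[0]
```

Both approaches give the same columns; the vectorized form mainly reads more compactly on larger tables.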


VirSorter2 is one of the few workflows that deliberately looks for different types of viruses by running a separate search for each type.  With the default parameters, it searched for dsDNA phages and ssDNA viruses, but we could have asked it to search for additional types of viruses.

Your results may all look promising, but this is not always the case.  Let's apply a filter to this dataframe to keep only the matches that VirSorter2 gives stronger support for.

In pseudocode:
```
Keep the contig as a viral candidate if
max_score > 0.9
or
max_score_group == 'ssDNA'
or
hallmark > 0
```

We will also add a column flagging these contigs as VirSorter2 positive.

In [77]:
vs2_keeps = vs2df[(vs2df['max_score'] > 0.9) | 
((vs2df['max_score_group'] == 'ssDNA') | (vs2df['hallmark'] > 0))].copy()

vs2_keeps['vs2_pos'] = 1
vs2_keeps

Unnamed: 0,seqname,dsDNAphage,ssDNA,max_score,max_score_group,length,hallmark,viral,cellular,contig,vs2_type,sag,vs2_pos
0,AG-910-N05_NODE_1||full,0.993,0.28,0.993,dsDNAphage,17534,0,90.0,0.0,AG-910-N05_NODE_1,full,AG-910-N05,1
1,AG-910-N05_NODE_2||full,0.94,0.7,0.94,dsDNAphage,7245,0,42.9,0.0,AG-910-N05_NODE_2,full,AG-910-N05,1
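
How the three conditions are combined matters. The sketch below uses an invented four-contig table (not real VirSorter2 output) to contrast joining the conditions with `|`, where any one of them is enough, against requiring the score threshold together with one of the other two via `&`. The names `toy`, `high_score` and `supported` are made up for this illustration.

```python
import pandas as pd

# Hypothetical scores, invented for illustration only
toy = pd.DataFrame({
    "contig": ["c1", "c2", "c3", "c4"],
    "max_score": [0.95, 0.95, 0.50, 0.50],
    "max_score_group": ["dsDNAphage", "ssDNA", "dsDNAphage", "dsDNAphage"],
    "hallmark": [0, 0, 2, 0],
})

high_score = toy["max_score"] > 0.9
supported = (toy["max_score_group"] == "ssDNA") | (toy["hallmark"] > 0)

# Permissive rule: any one condition is enough
any_condition = toy[high_score | supported]

# Strict rule: high score AND (ssDNA group or at least one hallmark gene)
both_required = toy[high_score & supported]

print(any_condition["contig"].tolist())
print(both_required["contig"].tolist())
```

The strict rule keeps far fewer contigs on this toy table, so it is worth comparing both subsets on your own SAG before settling on a rule.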


### DeepVirFinder

And now let's check out the DeepVirFinder results:

In [71]:
!ls {dvf_outdir}

final_contigs_AG-910-N05.fasta_gt1bp_dvfpred.txt


Only one output table!  Let's check it out:

In [72]:
dvfdf = pd.read_csv(os.path.join(dvf_outdir, 'final_contigs_{}.fasta_gt1bp_dvfpred.txt'.format(sag)), sep = "\t")
dvfdf

Unnamed: 0,name,len,score,pvalue
0,AG-910-N05_NODE_2,7372,0.999517,0.002908
1,AG-910-N05_NODE_1,17534,0.99646,0.005533


In [73]:
dvfdf['sag'] = [i.split("_")[0] for i in dvfdf['name']]
dvfdf['contig'] = dvfdf['name']

The thresholds we are going to apply to these outputs are:  

dvf_score_min = 0.9   
dvf_pvalue_max = 0.05

In [74]:
dvf_keep = dvfdf[(dvfdf['score'] > 0.9) & (dvfdf['pvalue'] < 0.05)].copy()
dvf_keep['dvf_pos'] = 1
dvf_keep

Unnamed: 0,name,len,score,pvalue,sag,contig,dvf_pos
0,AG-910-N05_NODE_2,7372,0.999517,0.002908,AG-910-N05,AG-910-N05_NODE_2,1
1,AG-910-N05_NODE_1,17534,0.99646,0.005533,AG-910-N05,AG-910-N05_NODE_1,1
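
To make the thresholds easy to change, the filter can be wrapped in a small function with the cutoffs as named parameters. This is a sketch only: `filter_dvf` is a hypothetical helper, the parameter names simply echo the `dvf_score_min` and `dvf_pvalue_max` labels used above, and the demonstration rows are invented rather than real DeepVirFinder output.

```python
import pandas as pd


def filter_dvf(df, score_min=0.9, pvalue_max=0.05):
    """Keep rows with score above score_min and pvalue below pvalue_max (hypothetical helper)."""
    keep = df[(df["score"] > score_min) & (df["pvalue"] < pvalue_max)].copy()
    keep["dvf_pos"] = 1  # flag the surviving contigs as DeepVirFinder positive
    return keep


# Invented rows for illustration; not real DeepVirFinder output
demo = pd.DataFrame({
    "name": ["demo_NODE_1", "demo_NODE_2", "demo_NODE_3"],
    "score": [0.99, 0.95, 0.40],
    "pvalue": [0.01, 0.20, 0.60],
})

kept = filter_dvf(demo)                                   # default cutoffs
kept_loose = filter_dvf(demo, score_min=0.3, pvalue_max=0.7)  # looser cutoffs
```

Both comparisons are strict, matching the cell above, so a contig scoring exactly 0.9 would be dropped.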


Now let's merge these tables together and see what our SAG looks like:

In [75]:

# defining a smaller number of columns to keep
keep_cols = ['contig','sag','vs2_pos','vs2_type','dvf_pos']

# merging the virus finding outputs on their shared columns, and keeping only the above columns
merged_df = dvf_keep.merge(vs2_keeps, on = ['contig','sag'], how = 'outer')[keep_cols]

# replacing 'na' values with 0 for the virus finder results columns
merged_df[['vs2_pos','dvf_pos']] = merged_df[['vs2_pos','dvf_pos']].fillna(0)

# creating a consensus score per contig based on how many virus finders identified it.
merged_df['consensus_score'] = merged_df['vs2_pos'] + merged_df['dvf_pos']
merged_df['consensus_score'].value_counts()
merged_df

Unnamed: 0,contig,sag,vs2_pos,vs2_type,dvf_pos,consensus_score
0,AG-910-N05_NODE_1,AG-910-N05,1,full,1,2
1,AG-910-N05_NODE_2,AG-910-N05,1,full,1,2
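
When the two tools disagree, the outer merge is what keeps contigs found by only one of them. The sketch below illustrates this on two invented candidate tables (`dvf_demo` and `vs2_demo`, with made-up contig names), and additionally uses the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Invented candidates: "a" found by both tools, "b" only by DeepVirFinder,
# "c" only by VirSorter2
dvf_demo = pd.DataFrame({"contig": ["a", "b"], "sag": ["S", "S"], "dvf_pos": [1, 1]})
vs2_demo = pd.DataFrame({"contig": ["a", "c"], "sag": ["S", "S"], "vs2_pos": [1, 1]})

# Outer merge keeps every contig from either table
merged_demo = dvf_demo.merge(vs2_demo, on=["contig", "sag"], how="outer", indicator=True)

# A missing flag means that tool did not call the contig
flags = ["vs2_pos", "dvf_pos"]
merged_demo[flags] = merged_demo[flags].fillna(0).astype(int)

# Number of tools that called each contig
merged_demo["consensus_score"] = merged_demo["vs2_pos"] + merged_demo["dvf_pos"]
```

In `_merge`, `left_only` marks contigs present only in the DeepVirFinder table, `right_only` those present only in the VirSorter2 table, and `both` those called by the two tools.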


In [66]:
merged_df.groupby(['dvf_pos','vs2_pos'], as_index = False)['contig'].count()

Unnamed: 0,dvf_pos,vs2_pos,contig
0,1,1,2
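
A cross-tabulation gives another view of the same counts, with one tool on each axis. The sketch below is an illustration on an invented merged table (`demo`), not on the results above; it uses `pd.crosstab` for the overlap and a column sum for the per-tool totals.

```python
import pandas as pd

# Invented merged table for illustration
demo = pd.DataFrame({
    "contig": ["a", "b", "c", "d"],
    "vs2_pos": [1, 0, 1, 1],
    "dvf_pos": [1, 1, 0, 1],
})

# Rows: DeepVirFinder call, columns: VirSorter2 call, cells: number of contigs
overlap = pd.crosstab(demo["dvf_pos"], demo["vs2_pos"])

# Total number of contigs called by each tool
per_tool = demo[["vs2_pos", "dvf_pos"]].sum()
```

The cell at row 1, column 1 counts contigs called by both tools, while the off-diagonal cells count contigs called by only one of them.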


Finally, write this merged dataframe to the working directory so that it can be referenced later:

In [76]:
merged_df.to_csv(os.path.join(sagdir, '{}_vfinding_vcandidates_merged_table.csv'.format(sag)), index = False)

Questions:

Which virus finder identified the most contigs?  
How much overlap is there between virus candidate contigs identified by different tools?