In [None]:
import pandas as pd
from pyfaidx import Fasta
import os

import seaborn as sns
import matplotlib.pyplot as plt

sag = '{SAG ID here}'

Next, let's check the quality of the viral candidate contigs we identified. To do so, we first need to extract the viral sequences and write them to a new file; then we can run them through CheckV.

4. Extract viral contigs from the SAG into a new FASTA file
5. Run viral contigs through CheckV
6. Curate viral candidate contigs based on CheckV results
7. Extract CheckV-passing viral contigs

First let's set up some variables that we'll be using within this notebook (same as the last notebook).

In [None]:
# setting a working subdirectory for virus finding
sagdir = "{}_vfinding".format(sag)

# setting a location to place SAG contigs
sag_contigs = os.path.join(sagdir, '{}_contigs.fasta'.format(sag))

And load the dataframe we'd put together in notebook 1:

In [None]:
df = pd.read_csv(os.path.join(sagdir, '{}_vfinding_vcandidates_merged_table.csv'.format(sag)))
df

### Extracting viral sequences

We'll use a package called pyfaidx to grab sequences from our fasta file, and our own function to write them to a new file.

In [None]:
# function to write to a new output file handle:

def write_fa_record(name, seq, oh, line_len=60):
    print(">{}".format(name), file=oh)
    for i in range(0, len(seq), line_len):
        print(seq[i:i+line_len], file=oh)
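
To see what the helper above produces, here is a minimal sketch that writes a made-up 160 bp sequence to an in-memory buffer instead of a file. The sequence and the name `toy_contig` are invented for illustration, and the helper is repeated only so the snippet runs on its own.

```python
import io

def write_fa_record(name, seq, oh, line_len=60):
    # same helper as in the cell above, repeated so this sketch is self-contained
    print(">{}".format(name), file=oh)
    for i in range(0, len(seq), line_len):
        print(seq[i:i+line_len], file=oh)

buf = io.StringIO()              # stands in for an open output file handle
toy_seq = "ACGT" * 40            # made-up 160 bp sequence
write_fa_record("toy_contig", toy_seq, buf)

lines = buf.getvalue().splitlines()
print(lines[0])                          # the header line
print([len(x) for x in lines[1:]])       # lengths of the wrapped sequence lines
```

The sequence is wrapped at `line_len` characters per line, with the remainder on the last line.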

In [None]:
fa = Fasta(sag_contigs)
vcandidate_fasta = os.path.join(sagdir, '{}_initial_vcandidates.fasta'.format(sag))

with open(vcandidate_fasta, 'w') as oh:
    for i, l in df.iterrows():
        seq = fa[l.contig]
        write_fa_record(l.contig, seq, oh)
        
print("Our fasta file to run through checkv is", vcandidate_fasta)

### CheckV of viral candidate sequences

In [None]:
checkv_outdir = os.path.join(sagdir, "checkv")

# -p avoids an error if the directory already exists (e.g. when re-running this cell)
!mkdir -p {checkv_outdir}

We'll run CheckV from a terminal, inside its conda environment, rather than in this notebook. Activate the environment in a terminal by typing:

```
source activate /mnt/storage/envs/checkv
```

The general command for CheckV is:
```
checkv end_to_end {infa} {outdir} -t {threads} -d {location of checkv database}
```

To run it for your SAG, paste the command printed by the cell below into your terminal:

In [None]:
print('checkv end_to_end {vcandidate_fasta} {checkv_outdir} -t 2 -d /mnt/storage/reference_dbs/checkv/checkv-db-v1.5'.format(**locals()))

Once CheckV has finished, let's check out the results!

In [None]:
!ls {checkv_outdir}

Let's look specifically into the quality summary file:

In [None]:
cvdf = pd.read_csv(os.path.join(checkv_outdir, 'quality_summary.tsv'), sep = "\t")
cvdf

What's the CheckV quality of your SAG's viral candidates?

In [None]:
cvdf['checkv_quality'].value_counts()

A metric I like to look at is the proportion of genes identified as host genes on each contig. Let's add a column that shows this:

In [None]:
cvdf['pct_host_genes'] = round(cvdf['host_genes'] / cvdf['gene_count'] * 100, 1)
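
As a quick illustration of this calculation, here is a small sketch on a made-up table. The contig names and gene counts are invented; only the column names match the CheckV quality summary used above. Note the edge case of a contig with no predicted genes, where the division yields a missing value rather than a percentage.

```python
import pandas as pd

# made-up values for illustration only
toy = pd.DataFrame({
    "contig_id": ["c1", "c2", "c3"],
    "gene_count": [10, 8, 0],
    "host_genes": [1, 6, 0],
})

# same calculation as above: percent of genes flagged as host genes
toy["pct_host_genes"] = (toy["host_genes"] / toy["gene_count"] * 100).round(1)
print(toy)
```

A missing `pct_host_genes` will never satisfy a `< 50` comparison, which is worth keeping in mind when filtering on this column.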

In [None]:
cvdf[['contig_id','pct_host_genes','checkv_quality']]

We'll select contigs to move forward with based on the following criteria:

```
Keep IF:
checkv_quality is not "Not-determined"
OR IF:
pct_host_genes < 50%
```

In [None]:
cvdf_keeps = cvdf[(cvdf['checkv_quality'] != 'Not-determined') | (cvdf['pct_host_genes'] < 50)]
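
Because the two criteria are joined with OR, a contig only needs to meet one of them to be kept. The sketch below applies the same filter to a made-up table (all values invented for illustration) to show which combinations pass.

```python
import pandas as pd

# made-up values for illustration only
toy = pd.DataFrame({
    "contig_id": ["c1", "c2", "c3", "c4"],
    "checkv_quality": ["High-quality", "Not-determined", "Not-determined", "Low-quality"],
    "pct_host_genes": [0.0, 10.0, 80.0, 60.0],
})

# same filter as above: keep if quality was determined OR host genes are under 50%
keep = toy[(toy["checkv_quality"] != "Not-determined") | (toy["pct_host_genes"] < 50)]
kept_ids = keep["contig_id"].tolist()
print(kept_ids)
```

Here `c3` is the only contig dropped, since it fails both criteria. `c4` is kept even though most of its genes are flagged as host genes, because its quality tier was determined.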



In [None]:
cvdf_keeps

Finally, let's save this table for the next step.

In [None]:
cvdf_keeps.to_csv(os.path.join(sagdir, '{}_post_qc_vcandidates_checkv_quality_table.csv'.format(sag)), index = False)

Now let's extract the CheckV-passing contigs into a new fasta file to use as input for the annotation step.

This time, we'll extract contigs from the CheckV output. We do it this way because CheckV separates prophages from host sequence, so if CheckV determined that any of our sequences were prophages, we will only carry the phage portion of the sequence into the next step.

In [None]:
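vfa = Fasta(os.path.join(checkv_outdir, 'viruses.fna'))
profa = None  # opened lazily, only if a provirus is encountered

vcandidate_fasta = os.path.join(sagdir, '{}_cvpassing_vcandidates.fasta'.format(sag))

with open(vcandidate_fasta, 'w') as oh:
    # loop over the contigs that passed the filter, not the full CheckV table
    for i, l in cvdf_keeps.iterrows():

        if l['provirus'] == 'Yes':
            # open the provirus fasta once, rather than on every iteration
            if profa is None:
                profa = Fasta(os.path.join(checkv_outdir, 'proviruses.fna'))
            seq = profa[l['contig_id']]
        else:
            seq = vfa[l['contig_id']]

        write_fa_record(l['contig_id'], seq, oh)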
        
print("Our fasta file to run through DRAM-v is", vcandidate_fasta)
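
A quick sanity check after writing a fasta file is to count its records. The sketch below is one simple way to do that, using a hypothetical helper, `count_fasta_records`, that is not part of pyfaidx or CheckV. It is demonstrated on a tiny made-up file so that it runs on its own.

```python
import os
import tempfile

def count_fasta_records(path):
    """Count header lines (those starting with '>') in a FASTA file."""
    with open(path) as fh:
        return sum(1 for line in fh if line.startswith(">"))

# toy demonstration on a made-up two-record file
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.fasta")
    with open(toy_path, "w") as oh:
        oh.write(">a\nACGT\n>b\nGGCC\n")
    n_records = count_fasta_records(toy_path)

print(n_records)
```

In this notebook you could call `count_fasta_records(vcandidate_fasta)` and compare the result with the number of rows in the table you expected to write. Running it on the initial candidates file as well shows how many contigs the filter removed.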

Questions:

How much did the CheckV filter affect the final contig counts?

How are you feeling about your SAG at this point?  Do you think it contains a virus?

Do you agree with the decisions in this notebook for filtering out viral contigs?