Recruitment values from the traditional viruscope pipeline and the batch viruscope pipeline don't match. This notebook tries to figure out why.

In [1]:
exesag = "AG-903-I06"

In [10]:
import pandas as pd
import os
import os.path as op
from recruitment_for_vs import *

In [11]:
odpovtsv = "/mnt/scgc/simon/simonsproject/bats248_vs/diamond/pergenome/{}_vs_POV.tsv.gz".format(exesag)
odlineptsv = "/mnt/scgc/simon/simonsproject/bats248_vs/diamond/pergenome/{}_vs_LineP-all.tsv.gz".format(exesag)

In [12]:
pov = import_diamond_tsv(odpovtsv, best_hit=False)

In [13]:
def construct_recruit_tbl(vir_tsv, bac_tsv, read_count_dict, contig_file):
    '''
    Build a per-contig recruitment table from diamond hits against two metagenomes.

    Args:
        vir_tsv: diamond recruitment output, converted to tsv, for the viral metagenome
        bac_tsv: diamond recruitment output, converted to tsv, for the bacterial metagenome
        read_count_dict: dict of metagenome read counts with two keys, 'vir_reads' and 'bac_reads'
        contig_file: path to a file containing the SAG contigs, in either fasta or gff format
    Returns:
        pandas DataFrame with the fraction of each metagenome recruited per contig
    '''
    # column layout of the diamond tabular output (kept for reference; not used below)
    cnames = "qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore".split()
    bac_df = import_diamond_tsv(bac_tsv)
    vir_df = import_diamond_tsv(vir_tsv)

    # number of metagenome reads recruited to each ORF
    bac_sum = pd.Series(bac_df.groupby('sseqid')['qseqid'].count(), name='hit_mg-bac')
    vir_sum = pd.Series(vir_df.groupby('sseqid')['qseqid'].count(), name='hit_mg-vir')

    # combine the two counts per ORF, then attach each ORF to its contig
    orfhits = pd.concat([bac_sum, vir_sum], axis=1).reset_index().rename(columns={'index':'orf'})
    orfhits = map_orfs_to_contigs(orfhits, contig_file)
    
    # sum ORF hits per contig and record the total read count of each metagenome
    chits = pd.concat([summarize_by_contig(orfhits, 'hit_mg-bac'), summarize_by_contig(orfhits, 'hit_mg-vir')], axis=1)
    chits['reads_mg-vir'] = float(read_count_dict['vir_reads'])
    chits['reads_mg-bac'] = float(read_count_dict['bac_reads'])
    
    clens = contig_lengths(contig_file)
    
    # fraction recruited per contig, scaled by mult
    out_tbl = compute_fr(chits.reset_index(), clens, mult=1e6)
    
    return out_tbl
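
The per-ORF counts in `construct_recruit_tbl` come from grouping diamond hits by `sseqid` and counting reads. The sketch below reproduces that one step on toy data. It assumes a tab-separated file with the twelve BLAST-style columns and assumes that "best hit" means the highest-bitscore hit per read; `read_hits` and the toy rows are hypothetical stand-ins, not the `import_diamond_tsv` from `recruitment_for_vs`.

```python
import io
import pandas as pd

# twelve BLAST-style tabular columns (assumed layout of the tsv files)
COLS = ("qseqid sseqid pident length mismatch gapopen "
        "qstart qend sstart send evalue bitscore").split()

def read_hits(handle, best_hit=True):
    """Hypothetical loader: read hits and optionally keep one hit per read."""
    df = pd.read_csv(handle, sep="\t", names=COLS)
    if best_hit:
        # assumption: the best hit is the one with the highest bitscore
        df = (df.sort_values("bitscore", ascending=False)
                .drop_duplicates("qseqid"))
    return df

# toy data: read1 hits two ORFs, read2 hits one
toy = io.StringIO(
    "read1\torfA\t90.0\t50\t5\t0\t1\t150\t1\t50\t1e-20\t100.0\n"
    "read1\torfB\t80.0\t50\t10\t0\t1\t150\t1\t50\t1e-10\t60.0\n"
    "read2\torfA\t95.0\t50\t2\t0\t1\t150\t1\t50\t1e-25\t110.0\n"
)
hits = read_hits(toy)

# reads recruited per ORF, as in construct_recruit_tbl
per_orf = hits.groupby("sseqid")["qseqid"].count()
print(per_orf)
```

With best-hit filtering, read1 counts only toward orfA, so orfB drops out of the counts entirely. Any difference in how ORFs are called or named therefore changes which rows appear in this table.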

In [14]:
# construct the recruit table from the vs test run:

vtsv = "/mnt/scgc/simon/simonsproject/jb_vs_test/AG-903/AG-903-I06/diamond/POV.tsv.gz"
btsv = "/mnt/scgc/simon/simonsproject/jb_vs_test/AG-903/AG-903-I06/diamond/LineP-all.tsv.gz"
read_count_dict = {'vir_reads':5922080, 'bac_reads':8279226}
contig_file = "/mnt/scgc/simon/simonsproject/bats248_contigs/coassemblies/AG-903/AG-903-I06_contigs.fasta"

In [15]:
assert op.exists(contig_file)

In [16]:
df1 = construct_recruit_tbl(vtsv, btsv, read_count_dict, contig_file)

doesn't look like input contig file is in gff format.  Will assume that contig name is embedded in the ORF name.
looks like input config fiel is in fasta format.


In [17]:
df1

Unnamed: 0,contig,hit_mg-bac,hit_mg-vir,reads_mg-vir,reads_mg-bac,contig_length,fr_mg-bac,fr_mg-vir
0,AG-903-I06_NODE_1,555.0,1551.0,5922080.0,8279226.0,35014.0,0.001915,0.00748
1,AG-903-I06_NODE_10,21.0,148.0,5922080.0,8279226.0,12198.0,0.000208,0.002049
2,AG-903-I06_NODE_11,393.0,189.0,5922080.0,8279226.0,10789.0,0.0044,0.002958
3,AG-903-I06_NODE_12,285.0,3194.0,5922080.0,8279226.0,10602.0,0.003247,0.050871
4,AG-903-I06_NODE_13,57.0,,5922080.0,8279226.0,10047.0,0.000685,
5,AG-903-I06_NODE_14,479.0,4.0,5922080.0,8279226.0,9782.0,0.005915,6.9e-05
6,AG-903-I06_NODE_15,43.0,1.0,5922080.0,8279226.0,7199.0,0.000721,2.3e-05
7,AG-903-I06_NODE_16,1.0,,5922080.0,8279226.0,5516.0,2.2e-05,
8,AG-903-I06_NODE_17,19.0,2.0,5922080.0,8279226.0,5199.0,0.000441,6.5e-05
9,AG-903-I06_NODE_18,2.0,,5922080.0,8279226.0,3460.0,7e-05,
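
The `fr_` columns in this table are consistent with hits divided by (metagenome reads × contig length), scaled by `mult=1e6`. That formula is inferred from the numbers shown, not read from the source of `compute_fr`, so treat the sketch below as a sanity check rather than the actual implementation; `fraction_recruited` is a hypothetical name.

```python
def fraction_recruited(hits, total_reads, contig_length, mult=1e6):
    """Hits per metagenome read per base of contig, scaled by mult (inferred formula)."""
    return hits / (total_reads * contig_length) * mult

# AG-903-I06_NODE_1 values copied from the table above
fr_bac = fraction_recruited(555, 8279226, 35014)
fr_vir = fraction_recruited(1551, 5922080, 35014)

# these round to the tabulated fr_mg-bac (0.001915) and fr_mg-vir (0.00748)
print(round(fr_bac, 6), round(fr_vir, 6))
```

Because the read totals and contig lengths are fixed for a given SAG, any difference in `fr_` values between two runs has to come from the hit counts.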


In [18]:
realvs = pd.read_csv("/mnt/scgc/simon/simonsproject/jb_vs_test/AG-903/AG-903-I06/summary/AG-903-I06_contigs-summary.csv.gz")

In [20]:
realvs.columns

Index(['Unnamed: 0', 'contig_length', 'gene_count', 'viral_phage_gene_count',
       'viral_phage_gene_fraction', 'viral2_phage_gene_count',
       'viral2_phage_gene_fraction', 'Similarity_1.LineP.all.fr',
       'Similarity_1.POV.fr', 'ratio_virus_bacteria', 'virus_class',
       'virus_prob'],
      dtype='object')

In [22]:
rvs = realvs[['Unnamed: 0','Similarity_1.LineP.all.fr','Similarity_1.POV.fr', 'ratio_virus_bacteria']]
# rename into a new frame; an in-place rename on this slice raises pandas' SettingWithCopyWarning
rvs = rvs.rename(columns = {'Unnamed: 0':'contig'})


In [26]:
rvs.merge(df1, on='contig', how='outer')[['contig', 'Similarity_1.LineP.all.fr','fr_mg-bac','Similarity_1.POV.fr','fr_mg-vir']]

Unnamed: 0,contig,Similarity_1.LineP.all.fr,fr_mg-bac,Similarity_1.POV.fr,fr_mg-vir
0,AG-903-I06_NODE_1,0.001918,0.001915,0.00761,0.00748
1,AG-903-I06_NODE_2,0.01305,0.01305,0.001685,0.001685
2,AG-903-I06_NODE_3,0.01827,0.01827,0.001012,0.001012
3,AG-903-I06_NODE_4,0.000308,0.000308,0.005089,0.005089
4,AG-903-I06_NODE_5,0.012137,0.012137,0.00927,0.00927
5,AG-903-I06_NODE_6,0.001548,0.001548,0.000126,0.000126
6,AG-903-I06_NODE_7,0.003379,0.003387,0.024417,0.024714
7,AG-903-I06_NODE_8,2.7e-05,2.7e-05,,
8,AG-903-I06_NODE_9,0.007619,0.007619,0.000525,0.000525
9,AG-903-I06_NODE_10,0.000208,0.000208,0.002049,0.002049


Given the same input, the Python and R functions generate nearly identical recruit tables (of the contigs shown, only NODE_1 and NODE_7 differ, and only slightly). Why, then, do the recruit tables differ between the batch version and the normal version?

In [27]:
batch_tbl = pd.read_csv("/mnt/scgc/simon/simonsproject/jb_vs_test/AG-903-I06_new_script.csv")

In [29]:
rvs.merge(batch_tbl, on='contig', how='outer')[['contig', 'Similarity_1.LineP.all.fr','fr_mg-bac','Similarity_1.POV.fr','fr_mg-vir']]

Unnamed: 0,contig,Similarity_1.LineP.all.fr,fr_mg-bac,Similarity_1.POV.fr,fr_mg-vir
0,AG-903-I06_NODE_1,0.001918,0.000604,0.00761,0.000757
1,AG-903-I06_NODE_2,0.01305,0.011669,0.001685,0.002154
2,AG-903-I06_NODE_3,0.01827,0.017506,0.001012,0.000689
3,AG-903-I06_NODE_4,0.000308,3.5e-05,0.005089,2.1e-05
4,AG-903-I06_NODE_5,0.012137,0.008243,0.00927,0.008587
5,AG-903-I06_NODE_6,0.001548,0.00245,0.000126,0.000231
6,AG-903-I06_NODE_7,0.003379,0.004998,0.024417,0.027699
7,AG-903-I06_NODE_8,2.7e-05,0.000118,,1.3e-05
8,AG-903-I06_NODE_9,0.007619,0.007513,0.000525,0.000592
9,AG-903-I06_NODE_10,0.000208,0.000119,0.002049,5.5e-05
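
To put a number on the disagreement, the sketch below compares the bacterial recruitment columns of the two merged tables against the viruscope reference. The values are copied from the tables above; the 5% relative tolerance and the helper name `mismatches` are arbitrary choices made for illustration.

```python
import math

contigs = [f"AG-903-I06_NODE_{i}" for i in range(1, 11)]

# Similarity_1.LineP.all.fr from the viruscope summary (reference)
reference = [0.001918, 0.01305, 0.01827, 0.000308, 0.012137,
             0.001548, 0.003379, 2.7e-05, 0.007619, 0.000208]
# fr_mg-bac recomputed in Python from the same diamond output
same_input = [0.001915, 0.01305, 0.01827, 0.000308, 0.012137,
              0.001548, 0.003387, 2.7e-05, 0.007619, 0.000208]
# fr_mg-bac from the batch script
batch = [0.000604, 0.011669, 0.017506, 3.5e-05, 0.008243,
         0.00245, 0.004998, 0.000118, 0.007513, 0.000119]

def mismatches(ref, test, rel_tol=0.05):
    """Contigs whose two values differ by more than rel_tol (relative)."""
    return [c for c, r, t in zip(contigs, ref, test)
            if not math.isclose(r, t, rel_tol=rel_tol)]

print(len(mismatches(reference, same_input)), "of 10 differ (same input)")
print(len(mismatches(reference, batch)), "of 10 differ (batch)")
```

At this tolerance the same-input table agrees on every contig, while the batch table agrees only on NODE_3 and NODE_9.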


The two are NOT the same. Maybe the issue is the input files: I am using the Prokka-generated ORFs instead of creating them myself with Prodigal. I'll run a small test on the viruscope-generated Prodigal ORFs to see whether I get the same output as before.
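
One thing worth checking, offered as a guess rather than a confirmed cause: the recruitment code reported earlier that it assumes the contig name is embedded in the ORF name. Prodigal names proteins `<contig>_<n>`, while Prokka writes locus tags that need not contain the contig name. The sketch below flags ORF IDs that cannot be mapped to a contig by stripping the trailing `_<n>`. The helper names and the toy headers (including the `EXAMPLE_` locus tag) are hypothetical.

```python
def fasta_ids(lines):
    """Return the sequence IDs (first token after '>') from FASTA-formatted lines."""
    return [ln[1:].split()[0] for ln in lines if ln.startswith(">")]

def unmapped_orfs(orf_ids, contig_ids):
    """ORF IDs whose name, minus the trailing '_<n>', is not a known contig."""
    known = set(contig_ids)
    return [orf for orf in orf_ids if orf.rsplit("_", 1)[0] not in known]

# toy inputs; real use would read the contig and protein fasta files instead
contig_ids = fasta_ids([">AG-903-I06_NODE_1", "ACGT", ">AG-903-I06_NODE_2", "ACGT"])
prodigal_like = fasta_ids([">AG-903-I06_NODE_1_1 # 2 # 400 # 1", "MKV",
                           ">AG-903-I06_NODE_2_1 # 5 # 310 # -1", "MLS"])
prokka_like = fasta_ids([">EXAMPLE_00001 hypothetical protein", "MKV",
                         ">EXAMPLE_00002 hypothetical protein", "MLS"])

print(unmapped_orfs(prodigal_like, contig_ids))  # every ORF maps to a contig
print(unmapped_orfs(prokka_like, contig_ids))    # locus tags do not map
```

If a real Prokka protein file produced a non-empty list here, the ORF-to-contig mapping would be worth inspecting before looking at the ORF calls themselves.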

In [35]:
prots = '/mnt/scgc/simon/simonsproject/jb_vs_test/AG-903/AG-903-I06/prodigal/AG-903-I06_contigs_proteins.fasta'
contigs = '/mnt/scgc/simon/simonsproject/bats248_contigs/coassemblies/AG-903/AG-903-I06_contigs.fasta'

In [36]:
"python recruitment_for_vs.py --threads 10 --output /mnt/scgc/simon/simonsproject/bats248_vs/diamond/ \
--sag-contigs {contigs} \
{prots} /mnt/scgc_nfs/ref/viral_dbs/POV.fasta.gz /mnt/scgc_nfs/ref/viral_dbs/LineP-all.fasta.gz".format(contigs=contigs, prots=prots)

'python recruitment_for_vs.py --threads 10 --output /mnt/scgc/simon/simonsproject/bats248_vs/diamond/ --sag-contigs /mnt/scgc/simon/simonsproject/bats248_contigs/coassemblies/AG-903/AG-903-I06_contigs.fasta /mnt/scgc/simon/simonsproject/jb_vs_test/AG-903/AG-903-I06/prodigal/AG-903-I06_contigs_proteins.fasta /mnt/scgc_nfs/ref/viral_dbs/POV.fasta.gz /mnt/scgc_nfs/ref/viral_dbs/LineP-all.fasta.gz'

In [37]:
!ls /mnt/scgc/simon/simonsproject/bats248_vs/diamond/

AG-903-I06_contigs_proteins.dmnd
AG-903-I06_contigs_proteins_mg_diamond_recruitment_tbl.csv
AG-903-I06_contigs_proteins_vs_LineP-all.daa
AG-903-I06_contigs_proteins_vs_LineP-all.tsv.gz
AG-903-I06_contigs_proteins_vs_POV.daa
AG-903-I06_contigs_proteins_vs_POV.tsv.gz
pergenome


In [38]:
testtbl = pd.read_csv("/mnt/scgc/simon/simonsproject/bats248_vs/diamond/AG-903-I06_contigs_proteins_mg_diamond_recruitment_tbl.csv")

In [39]:
rvs.merge(testtbl, on='contig', how='outer')[['contig', 'Similarity_1.LineP.all.fr','fr_mg-bac','Similarity_1.POV.fr','fr_mg-vir']]

Unnamed: 0,contig,Similarity_1.LineP.all.fr,fr_mg-bac,Similarity_1.POV.fr,fr_mg-vir
0,AG-903-I06_NODE_1,0.001918,0.001915,0.00761,0.00748
1,AG-903-I06_NODE_2,0.01305,0.01305,0.001685,0.001685
2,AG-903-I06_NODE_3,0.01827,0.01827,0.001012,0.001012
3,AG-903-I06_NODE_4,0.000308,0.000308,0.005089,0.005089
4,AG-903-I06_NODE_5,0.012137,0.012137,0.00927,0.00927
5,AG-903-I06_NODE_6,0.001548,0.001548,0.000126,0.000126
6,AG-903-I06_NODE_7,0.003379,0.003387,0.024417,0.024714
7,AG-903-I06_NODE_8,2.7e-05,2.7e-05,,
8,AG-903-I06_NODE_9,0.007619,0.007619,0.000525,0.000525
9,AG-903-I06_NODE_10,0.000208,0.000208,0.002049,0.002049


### Those are almost exactly the same. Prokka ORFs give very different recruitment results from Prodigal-generated ORFs. Now we (I) know.