Generate coverage tables

Requires a directory with genomes positive for the selected gene of interest
Requires a directory with genomes negative for the selected gene of interest


In [7]:
BGC_type = 'Rieske'
metagenome_name = 'SynSponge' # do not use '_'
pct_sampled = 0.75 #proxy of metagenome complexity, corresponding to amount of neg samples to include
amount_pos_samples = 1 #specify this manually to create files with a certain amount of pos samples
amount_cov_tables = 1

In [2]:
import os
from os import listdir, mkdir
from os.path import isfile, join
import pandas as pd
import numpy as np
from pathlib import Path
#import plotnine #currently installed in home. Should better be under project, but couldn't figure it out
import random

In [3]:
# Helper function for making directories only if they don't exist yet
def makedir(dirpath):
    if os.path.isdir(dirpath):
        print(dirpath,'exists already')
    else:
        print('Making', dirpath)    
        os.mkdir(dirpath)

        
# Defining paths for required directory structure for input and output files relative to parent directory
parent_dir='/nesi/project/vuw03285/'
BGC_path=os.path.join(parent_dir, BGC_type)
base_genomes_path=os.path.join(BGC_path, 'base_genomes')
neg_genomes_path=os.path.join(base_genomes_path, 'neg_genomes')
pos_genomes_path=os.path.join(base_genomes_path, 'pos_genomes')
coverage_table_path=os.path.join(BGC_path, 'coverage_tables')


# Calling function to make directories if they don't exist yet
makedir(coverage_table_path)

Making /nesi/project/vuw03285/Rieske/coverage_tables


In [None]:
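# Generating a report file for this particular script.
# The report is written directly instead of via %%capture: the variable `cap` is only
# assigned after the capturing cell has finished, so it cannot be read inside that cell.
report_lines = [
    f'BGC_type = {BGC_type}',
    f'metagenome_name = {metagenome_name}',
    f'pct_sampled = {pct_sampled}',
    f'amount_pos_samples = {amount_pos_samples}',
    f'amount_cov_tables = {amount_cov_tables}',
]
with open(os.path.join(BGC_path, 'report_generate_coverage_table.txt'), 'w') as f:
    # One setting per line, separated by blank lines as in the printed version
    f.write('\n' + '\n\n'.join(report_lines) + '\n')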

In [4]:
# Number depends on the selected pos and neg genomes in previous script
# Sugimoto default is 140 neg and 10 pos genomes
neg_filenames = [f for f in listdir(neg_genomes_path) if isfile(join(neg_genomes_path, f))]
pos_filenames = [f for f in listdir(pos_genomes_path) if isfile(join(pos_genomes_path, f))]

all_filenames=neg_filenames+pos_filenames
df_len = len(all_filenames)
print('length of coverage tables is:', df_len)

#Randomly select 140 from neg_filenames, move them to a different directory
#Randomly select 10 from pos_filenames, move them to a different directory
# Keep in mind that there are duplicate proteins in the positive

length of coverage tables is: 13


The function below generates coverage tables with the following arguments:
- complexity: the proportion of negative genomes to include in the synthetic metagenome, relative to the total number of files in the neg_genomes directory. Must be a value between 0 and 1. E.g. for 100 files and a value of 0.1, 10 negative genomes will be included in the synthetic metagenome. Genomes that are not included are still listed in the table, with a coverage of 0.
- pos_samples: the number of positive genomes to include in the synthetic metagenome. Must be a value between 0 and the number of files in the pos_genomes directory. Remember to remove some of the positive samples from the directory for validation (this is done automatically by the second script). **These must not be included in the training dataset!** The for loop calling the function below generates tables for synthetic metagenomes for all possible values of pos_samples.
- amount_tables: the number of coverage tables to generate with the above settings. The for loop calling the function below generates this number of tables for each possible value of pos_samples.
- logn_mean and logn_std: the mean and standard deviation of the normal distribution underlying the lognormal distribution that the coverages are sampled from. The defaults are the values used in Sugimoto et al. 2019. Both positive and negative genomes are sampled from the same distribution, as described for dataset 2 in Sugimoto et al.
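
To make the arguments above more concrete, here is a minimal sketch of how complexity translates into a number of negative genomes and what logn_mean means for the sampled coverages. The helper name `n_negatives` and the seed are only illustrative and are not part of this notebook. Note that Python's `round` rounds exact halves to the nearest even integer, so 0.75 of 10 files gives 8 genomes, while 0.25 of 10 files gives 2.

```python
import numpy as np

def n_negatives(complexity, n_neg_files):
    """Hypothetical helper: number of negative genomes that get a nonzero coverage."""
    return round(complexity * n_neg_files)

print(n_negatives(0.1, 100))   # 10
print(n_negatives(0.75, 10))   # 8 (7.5 is rounded to the even neighbour)
print(n_negatives(0.25, 10))   # 2 (2.5 is rounded to the even neighbour)

# logn_mean and logn_std describe the underlying normal distribution,
# so the median of the sampled coverages is exp(logn_mean), not logn_mean.
rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
coverages = rng.lognormal(mean=1, sigma=2, size=100_000)
print(np.median(coverages))     # close to exp(1), i.e. roughly 2.7
```
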


In [6]:
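def assemble_coverage_tables(complexity, pos_samples, amount_tables, logn_mean=1, logn_std=2):
    # Writes amount_tables CSV files to coverage_table_path. Relies on the globals
    # neg_filenames, pos_filenames, metagenome_name and coverage_table_path defined above.
    if not 0 <= complexity <= 1:
        raise ValueError('complexity must be between 0 and 1')
    if not 0 <= pos_samples <= len(pos_filenames):
        raise ValueError('pos_samples must be between 0 and the number of positive genomes')

    # Number of negative genomes that receive a nonzero coverage, rounded to nearest int
    add_samples = round(complexity*len(neg_filenames))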
    
    for mg_number in range(0,amount_tables):
        
        neg_coverage_list = []
        for i in range(0,add_samples):
            neg_coverage_list.append(np.random.lognormal(mean=logn_mean, sigma=logn_std))
        while len(neg_coverage_list) < len(neg_filenames):
            neg_coverage_list.append(0)
        randomised_neg_cov = random.sample(neg_coverage_list,len(neg_filenames))

        pos_coverage_list = []
        for i in range(0,pos_samples):
            pos_coverage_list.append(np.random.lognormal(mean=logn_mean, sigma=logn_std))
        while len(pos_coverage_list) < len(pos_filenames):
            pos_coverage_list.append(0)
        randomised_pos_cov = random.sample(pos_coverage_list,len(pos_filenames))    

        neg_df_dict = {'metagenome_name':[],'complexity':[],'base_genome_filename':[],'coverage':[]}
        for i in range(0,len(neg_filenames)):
            neg_df_dict['metagenome_name'].append(metagenome_name+'_'+str(mg_number))
            neg_df_dict['complexity'].append(complexity)
            neg_df_dict['base_genome_filename'].append(neg_filenames[i])
            neg_df_dict['coverage'].append(randomised_neg_cov[i])
        neg_cov_df = pd.DataFrame(neg_df_dict)

        pos_df_dict = {'metagenome_name':[],'complexity':[],'base_genome_filename':[],'coverage':[]}
        for i in range(0,len(pos_filenames)):
            pos_df_dict['metagenome_name'].append(metagenome_name+'_'+str(mg_number))
            pos_df_dict['complexity'].append(complexity)
            pos_df_dict['base_genome_filename'].append(pos_filenames[i])
            pos_df_dict['coverage'].append(randomised_pos_cov[i])
        pos_cov_df = pd.DataFrame(pos_df_dict)

        # See plot_mg_correlation notebook to match naming convention! (e.g 0_15_7375_S148.csv)
        result = pd.concat([neg_cov_df, pos_cov_df])
        result.to_csv(coverage_table_path+'/'+str(pos_samples)+'_'+str(len(pos_filenames)-pos_samples)+'_'+metagenome_name+'_'+str(mg_number)+'.csv', sep=',', index=False)

        
#assemble_coverage_tables(pct_sampled, 3, amount_cov_tables)
print('function loaded')


function loaded
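
The core step of the function above (draw a few lognormal coverages, pad with zeros, shuffle) could also be written in vectorised form. The following is only a sketch under assumptions: `sample_coverages` is a hypothetical name, and it uses a seeded NumPy `Generator` instead of the mix of `np.random` and `random` used in the notebook, so that tables can be reproduced.

```python
import numpy as np

def sample_coverages(n_genomes, n_present, rng, logn_mean=1, logn_std=2):
    """Hypothetical helper: n_present lognormal coverages at random positions, zeros elsewhere."""
    if not 0 <= n_present <= n_genomes:
        raise ValueError('n_present must be between 0 and n_genomes')
    cov = np.zeros(n_genomes)
    # Fill the first n_present slots, then shuffle in place so positions are random
    cov[:n_present] = rng.lognormal(mean=logn_mean, sigma=logn_std, size=n_present)
    rng.shuffle(cov)
    return cov

rng = np.random.default_rng(42)             # arbitrary seed
neg_cov = sample_coverages(10, 8, rng)      # e.g. 10 negative genomes, 8 present
pos_cov = sample_coverages(3, 1, rng)       # e.g. 3 positive genomes, 1 present
print(np.count_nonzero(neg_cov), np.count_nonzero(pos_cov))
```

Passing the generator in explicitly keeps the sampling reproducible without touching global random state.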


In [9]:
# Create individual coverage tables with pre-defined parameters
print('calling assemble_coverage_tables function with: complexity =', pct_sampled, 'pos_samples =', amount_pos_samples, 'amount_tables =', amount_cov_tables)
assemble_coverage_tables(complexity=pct_sampled, pos_samples=amount_pos_samples, amount_tables=amount_cov_tables)

calling assemble_coverage_tables function with: complexity = 0.75 pos_samples = 1 amount_tables = 1


In [10]:
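# Generate a selection of coverage tables with all possible numbers of positive samples.
# Note: file names depend only on pos_samples, metagenome_name and the table number, so this
# loop overwrites tables written earlier with the same settings (e.g. by the previous cell).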
for i in range(0,len(pos_filenames)+1):
    print('calling assemble_coverage_tables function with: complexity =', pct_sampled, 'pos_samples =', i, 'amount_tables =', amount_cov_tables)
    assemble_coverage_tables(complexity=pct_sampled, pos_samples=i, amount_tables=amount_cov_tables)

calling assemble_coverage_tables function with: complexity = 0.75 pos_samples = 0 amount_tables = 1
calling assemble_coverage_tables function with: complexity = 0.75 pos_samples = 1 amount_tables = 1
calling assemble_coverage_tables function with: complexity = 0.75 pos_samples = 2 amount_tables = 1
calling assemble_coverage_tables function with: complexity = 0.75 pos_samples = 3 amount_tables = 1
