Generate coverage tables

Requires a directory with genomes positive for the selected gene of interest.
Requires a directory with genomes negative for the selected gene of interest.

The function below generates coverage tables with the following arguments:
- complexity: the proportion of negative genomes to include in the synthetic metagenome, relative to the total number of files in the neg_genomes directory. Must be a value between 0 and 1. E.g. for 100 files and a value of 0.1, 10 negative files will be in the synthetic metagenome.
- pos_samples: the number of positive samples to include in the synthetic metagenome. Must be a value between 0 and the number of files in the pos_genomes directory. The for loop calling the function below will generate tables for synthetic metagenomes for all possible values of pos_samples.
- amount_tables: the number of coverage tables to generate with the above settings. The for loop calling the function below will generate this number for each possible value of pos_samples.
- logn_mean and logn_std: the mean and standard deviation of the normal distribution underlying the lognormal distribution from which the coverages are sampled. The defaults are the values used in Sugimoto et al. 2019. Both positive and negative genomes are sampled from the same distribution, as described for dataset 2 in Sugimoto et al. 2019.
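
As a rough illustration of how `complexity`, `logn_mean` and `logn_std` interact, the sketch below computes how many genomes a given complexity selects and draws that many coverages from a lognormal distribution, leaving the rest at zero. The helper name `sample_coverages` and the seeded random generator are assumptions made for this example; they are not part of the notebook.

```python
import numpy as np

def sample_coverages(n_files, complexity, logn_mean=1, logn_std=2, seed=0):
    """Hypothetical helper: lognormal coverage for a share of n_files, zero for the rest."""
    n_included = round(complexity * n_files)  # number of genomes that get a coverage
    rng = np.random.default_rng(seed)
    coverages = np.zeros(n_files)
    # mean and sigma describe the normal distribution underlying the lognormal
    coverages[:n_included] = rng.lognormal(mean=logn_mean, sigma=logn_std, size=n_included)
    rng.shuffle(coverages)  # spread the non-zero coverages randomly over the files
    return coverages

cov = sample_coverages(n_files=100, complexity=0.1)
print(int((cov > 0).sum()), "of", len(cov), "genomes have non-zero coverage")
```

With 100 files and a complexity of 0.1, ten entries receive a non-zero coverage, matching the example given for `complexity` above.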


In [1]:
BGC_type = 'nitrile_hydratase_beta'
metagenome_name = 'nifhbSynSponge' # do not use '_'

In [2]:
import os
from os import listdir, mkdir
from os.path import isfile, join
import pandas as pd
import numpy as np
from pathlib import Path
import random
from datetime import datetime

In [3]:
# Helper function for making directories only if they don't exist yet
def makedir(dirpath):
    if os.path.isdir(dirpath):
        print(dirpath,'exists already')
    else:
        print('Making', dirpath)    
        os.mkdir(dirpath)

        
# Define paths for the required input and output directory structure.
# Directories are created relative to the path the notebook was opened in.
parent_dir= !echo $(pwd)
BGC_path=os.path.join(parent_dir[0], BGC_type)
base_genomes_path=os.path.join(BGC_path, 'base_genomes')
neg_genomes_path=os.path.join(BGC_path, 'base_genomes/neg_genomes')
pos_genomes_path=os.path.join(BGC_path, 'base_genomes/pos_genomes')
coverage_table_path=os.path.join(BGC_path, 'validation_coverage_tables')

# made directories manually. comment out or remove after running
#validation_pos_genomes_path=os.path.join(BGC_path, 'base_genomes/validation_pos_genomes')
#validation_coverage_table_path=os.path.join(BGC_path, 'validation_coverage_tables')

# Calling function to make directories if they don't exist yet
makedir(coverage_table_path)

/media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/validation_coverage_tables exists already


In [4]:
with open(BGC_path+'/'+'report_3_generate_coverage_table.txt', 'w') as f:
    f.write('Output directory is: '+BGC_path+'\n')
    f.write('\nBGC_type = '+BGC_type)
    f.write('\nmetagenome_name = '+str(metagenome_name)+'\n')

In [4]:
# Number depends on the selected pos and neg genomes in previous script
# Sugimoto default is 140 neg and 10 pos genomes
neg_filenames = [f for f in listdir(neg_genomes_path) if isfile(join(neg_genomes_path, f))]
pos_filenames = [f for f in listdir(pos_genomes_path) if isfile(join(pos_genomes_path, f))]
#pos_filenames = [f for f in listdir(validation_pos_genomes_path) if isfile(join(validation_pos_genomes_path, f))]

all_filenames=neg_filenames+pos_filenames
df_len = len(all_filenames)
print('Max length of coverage tables is:', df_len)

with open(BGC_path+'/'+'report_3_generate_coverage_table.txt', 'a') as f:
    f.write('\nLength of coverage tables is: '+str(df_len)+'\n')
    f.write('Consisting of: '+str(len(neg_filenames))+' negative genomes and '+str(len(pos_filenames))+' positive genomes.\n\n')

Max length of coverage tables is: 107


The function below generates coverage tables with the following arguments:
- complexity: the proportion of negative genomes to include in the synthetic metagenome, relative to the total number of files in the neg_genomes directory. Must be a value between 0 and 1. E.g. for 100 files and a value of 0.1, 10 negative files will be in the synthetic metagenome.
- pos_samples: the number of positive samples to include in the synthetic metagenome. Must be a value between 0 and the number of files in the pos_genomes directory. Remember to remove some of the positive samples from the directory for validation (this is done automatically by the second script). **These must not be included in the training dataset!** The for loop calling the function below will generate tables for synthetic metagenomes for all possible values of pos_samples.
- amount_tables: the number of coverage tables to generate with the above settings. The for loop calling the function below will generate this number for each possible value of pos_samples.
- logn_mean and logn_std: the mean and standard deviation of the normal distribution underlying the lognormal distribution from which the coverages are sampled. The defaults are the values used in Sugimoto et al. 2019. Both positive and negative genomes are sampled from the same distribution, as described for dataset 2 in Sugimoto et al. 2019.


Rewrite the cell below here according to Sugimoto:

Dataset 1:
- high complexity metagenomes contain 90% of 140 neg genomes, low complexity metagenomes contain 30% (generate 70 each)
- Both high and low complexity metagenomes randomly contain between 0 and 3 positive genomes
- (The probability of being selected differs between negative genomes but is the same for every positive genome.)
- proportion of each negative genome is log-normally distributed with mean=1 and variance=4, then normalised so that sum of proportions is 1
- Each selected positive genome can have either "high" (9-11) or "low" (1-4/3) coverage, sampled from a uniform distribution
- If a metagenome contains 1 or 3 positive genomes, coverage of each is equally likely to be high or low
- If a metagenome contains 2 positive genomes, coverage is high for 1 and low for the other
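
The negative-genome part of the description above can be sketched as follows. This is only an illustration: it reads "mean=1 and variance=4" as the parameters of the underlying normal distribution (so sigma = 2, matching the `logn_mean` and `logn_std` defaults in this notebook), and the variable names and the seed are made up for the example. The high/low coverage rule for positive genomes is not covered here.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only to make the example repeatable

n_neg_total = 140   # negative genomes available, as in the description above
complexity = 0.9    # high-complexity case; 0.3 would be the low-complexity case
n_selected = round(complexity * n_neg_total)

# Assumption: variance = 4 of the underlying normal distribution, i.e. sigma = 2
raw = rng.lognormal(mean=1.0, sigma=2.0, size=n_selected)
proportions = raw / raw.sum()  # normalise so that the proportions sum to 1

# Number of positive genomes, drawn uniformly from 0, 1, 2, 3
n_pos = int(rng.integers(0, 4))

print(n_selected, "negative genomes selected")
print("sum of proportions:", proportions.sum())
print("positive genomes in this metagenome:", n_pos)
```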


Dataset 2:
- Identical to dataset 1, except that the coverages of both positive and negative genomes are sampled from a lognormal distribution.

In [6]:
def assemble_coverage_tables(complexity, pos_samples, amount_tables, logn_mean=1, logn_std=2):

    add_samples = round(complexity*len(neg_filenames)) # round to nearest int
    
    for mg_number in range(0,amount_tables):
        
        neg_coverage_list = []
        for i in range(0,add_samples):
            neg_coverage_list.append(np.random.lognormal(mean=logn_mean, sigma=logn_std))
        while len(neg_coverage_list) < len(neg_filenames):
            neg_coverage_list.append(0)
        randomised_neg_cov = random.sample(neg_coverage_list,len(neg_filenames))

        pos_coverage_list = []
        for i in range(0,pos_samples):
            pos_coverage_list.append(np.random.lognormal(mean=logn_mean, sigma=logn_std))
        while len(pos_coverage_list) < len(pos_filenames):
            pos_coverage_list.append(0)
        randomised_pos_cov = random.sample(pos_coverage_list,len(pos_filenames))    

        neg_df_dict = {'metagenome_name':[],'complexity':[],'base_genome_filename':[],'coverage':[]}
        for i in range(0,len(neg_filenames)):
            neg_df_dict['metagenome_name'].append(metagenome_name+'_'+str(mg_number))
            neg_df_dict['complexity'].append(complexity)
            neg_df_dict['base_genome_filename'].append(neg_filenames[i])
            neg_df_dict['coverage'].append(randomised_neg_cov[i])
        neg_cov_df = pd.DataFrame(neg_df_dict)

        pos_df_dict = {'metagenome_name':[],'complexity':[],'base_genome_filename':[],'coverage':[]}
        for i in range(0,len(pos_filenames)):
            pos_df_dict['metagenome_name'].append(metagenome_name+'_'+str(mg_number))
            pos_df_dict['complexity'].append(complexity)
            pos_df_dict['base_genome_filename'].append(pos_filenames[i])
            pos_df_dict['coverage'].append(randomised_pos_cov[i])
        pos_cov_df = pd.DataFrame(pos_df_dict)
        
        with open(BGC_path+'/'+'report_3_generate_coverage_table.txt', 'a') as f:
            f.write(datetime.now(tz=None).strftime('%d/%m/%y, %H:%M:%S')+'\tGenerating coverage table with '+str(add_samples)+' negative genome(s) and '+str(pos_samples)+' positive genome(s).\n')
        
        # See plot_mg_correlation notebook to match naming convention! (e.g 0_15_7375_S148.csv)
        # Naming must include complexity! Update naming in plot_mg_correlation notebook!
        result = pd.concat([neg_cov_df, pos_cov_df])
        result.to_csv(coverage_table_path+'/'+str(pos_samples)+'_'+str(len(pos_filenames)-pos_samples)+'_'+str(complexity)+'_'+metagenome_name+'_'+str(mg_number), sep=',', index=False, header=False)
        #result.to_csv(validation_coverage_table_path+'/'+str(pos_samples)+'_'+str(len(pos_filenames)-pos_samples)+'_'+str(complexity)+'_'+metagenome_name+'_'+str(mg_number), sep=',', index=False, header=False)

print('function loaded')

function loaded
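
The tables written by `assemble_coverage_tables` have no header row, and their settings are encoded in the '_'-delimited file name, which is presumably why `metagenome_name` must not contain '_'. The sketch below shows one way such a table could be read back. The helpers `parse_table_name` and `read_coverage_table`, as well as the genome file names in the demonstration, are made up for this example and do not exist in the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Column order used when the tables are written (they are saved without a header row)
COLUMNS = ['metagenome_name', 'complexity', 'base_genome_filename', 'coverage']

def parse_table_name(filename):
    """Hypothetical helper: split the '_'-delimited table name into its five fields."""
    pos, pos_unused, complexity, name, number = filename.split('_')
    return {
        'pos_samples': int(pos),
        'pos_unused': int(pos_unused),
        'complexity': float(complexity),
        'metagenome_name': name,
        'mg_number': int(number),
    }

def read_coverage_table(path):
    """Hypothetical helper: load one header-less coverage table with named columns."""
    return pd.read_csv(path, header=None, names=COLUMNS)

# Demonstration on a tiny made-up table written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    table_path = Path(tmp) / '1_2_0.3_nifhbSynSponge_0'
    table_path.write_text(
        'nifhbSynSponge_0,0.3,neg_a.fasta,2.5\n'
        'nifhbSynSponge_0,0.3,neg_b.fasta,0\n'
        'nifhbSynSponge_0,0.3,pos_a.fasta,7.1\n'
    )
    info = parse_table_name(table_path.name)
    table = read_coverage_table(table_path)

print(info)
print(table[table['coverage'] > 0])  # genomes actually present in the metagenome
```

Unpacking into exactly five fields makes the parser fail loudly if a metagenome name containing '_' slips through.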


In [7]:
pct_sampled = 0.3 #proxy of metagenome complexity, corresponding to amount of neg samples to include
amount_pos_samples = 1 #specify this manually to create files with a certain amount of pos samples
amount_cov_tables = 1

# Create individual coverage tables with pre-set parameters
#print('calling assemble_coverage_tables function with: complexity =', pct_sampled, 'pos_samples =', amount_pos_samples, 'amount_tables =', amount_cov_tables)
#assemble_coverage_tables(complexity=pct_sampled, pos_samples=amount_pos_samples, amount_tables=amount_cov_tables)

In [8]:
# Good for validation with different amounts of spiked in positive genomes and varying background complexity

pct_sampled = 0.9 #proxy of metagenome complexity, corresponding to amount of neg samples to include
amount_cov_tables = 3

# generate a selection of coverage tables with all possible amounts of positive samples
#for i in range(0,len(pos_filenames)+1):
#    print('calling assemble_coverage_tables function with: complexity =', pct_sampled, 'pos_samples =', i, 'amount_tables =', amount_cov_tables)
#    assemble_coverage_tables(complexity=pct_sampled, pos_samples=i, amount_tables=amount_cov_tables)

calling assemble_coverage_tables function with: complexity = 0.9 pos_samples = 0 amount_tables = 3
calling assemble_coverage_tables function with: complexity = 0.9 pos_samples = 1 amount_tables = 3
calling assemble_coverage_tables function with: complexity = 0.9 pos_samples = 2 amount_tables = 3
calling assemble_coverage_tables function with: complexity = 0.9 pos_samples = 3 amount_tables = 3
calling assemble_coverage_tables function with: complexity = 0.9 pos_samples = 4 amount_tables = 3
calling assemble_coverage_tables function with: complexity = 0.9 pos_samples = 5 amount_tables = 3
calling assemble_coverage_tables function with: complexity = 0.9 pos_samples = 6 amount_tables = 3
calling assemble_coverage_tables function with: complexity = 0.9 pos_samples = 7 amount_tables = 3
calling assemble_coverage_tables function with: complexity = 0.9 pos_samples = 8 amount_tables = 3
calling assemble_coverage_tables function with: complexity = 0.9 pos_samples = 9 amount_tables = 3
calling as

In [7]:
#Mimicking Sugimoto 2019 datasets

amount_cov_tables = 14

# Generate 70 high-complexity and 70 low-complexity metagenomes: for each complexity, 14 tables each with 0, 1, 2, 3 and 4 pos samples
for i in range(0,5):
    print('calling assemble_coverage_tables function with: complexity =', 0.3, 'pos_samples =', i, 'amount_tables =', amount_cov_tables)
    assemble_coverage_tables(complexity=0.3, pos_samples=i, amount_tables=amount_cov_tables)
    print('calling assemble_coverage_tables function with: complexity =', 0.9, 'pos_samples =', i, 'amount_tables =', amount_cov_tables)
    assemble_coverage_tables(complexity=0.9, pos_samples=i, amount_tables=amount_cov_tables)

calling assemble_coverage_tables function with: complexity = 0.3 pos_samples = 0 amount_tables = 14
calling assemble_coverage_tables function with: complexity = 0.9 pos_samples = 0 amount_tables = 14
calling assemble_coverage_tables function with: complexity = 0.3 pos_samples = 1 amount_tables = 14
calling assemble_coverage_tables function with: complexity = 0.9 pos_samples = 1 amount_tables = 14
calling assemble_coverage_tables function with: complexity = 0.3 pos_samples = 2 amount_tables = 14
calling assemble_coverage_tables function with: complexity = 0.9 pos_samples = 2 amount_tables = 14
calling assemble_coverage_tables function with: complexity = 0.3 pos_samples = 3 amount_tables = 14
calling assemble_coverage_tables function with: complexity = 0.9 pos_samples = 3 amount_tables = 14
calling assemble_coverage_tables function with: complexity = 0.3 pos_samples = 4 amount_tables = 14
calling assemble_coverage_tables function with: complexity = 0.9 pos_samples = 4 amount_tables = 14
