This is similar to script 3.0, but the generation of coverage tables, i.e. synthetic metagenomes, is more closely aligned with the Sugimoto 2019 publication. Both dataset 1 and dataset 2 type synthetic metagenomes can be generated with this script. As set up by default, it generates 70 low-complexity and 70 high-complexity metagenomes for each dataset.


Requires a directory with genomes positive for the selected gene of interest.
Requires a directory with genomes negative for the selected gene of interest.

The function below generates coverage tables with the following arguments:
- complexity: The proportion of negative genomes to include in the synthetic metagenome, relative to the total number of files in the neg_genomes directory. Needs to be a value between 0 and 1. E.g. for 100 files and a value of 0.1, 10 negative genomes will be in the synthetic metagenome.
- pos_samples: The maximum number of positive genomes to include in the synthetic metagenome (default 3). For dataset 2, the actual number is drawn at random between 0 and pos_samples on each call. Dataset 1 always draws between 0 and 3. Needs to be a value between 0 and the number of files in the pos_genomes directory.
- amount_tables: The number of coverage tables to generate with the above settings. As written, the function returns after the first table, so the loops below call it with amount_tables=1, once per metagenome.
- logn_mean and logn_std: The mean and standard deviation of the normal distribution underlying the lognormal distribution that the coverages are sampled from. The defaults are the values used in Sugimoto et al. 2019.
- dataset: Either 1 or 2 (default 2). With dataset=2, both positive and negative genomes are sampled from the same lognormal distribution, as described for dataset 2 in Sugimoto et al. 2019. With dataset=1, positive genomes follow the "high"/"low" coverage rules described further down.
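
To make the `complexity` argument concrete, the following is a minimal sketch of how a fraction can be turned into a number of negative genomes and a coverage vector with one entry per file. The names `sketch_neg_coverages` and `n_neg_files` are hypothetical and only used for illustration; the seeds are fixed purely to make the example repeatable.

```python
import random

import numpy as np


def sketch_neg_coverages(n_neg_files, complexity, logn_mean=1, logn_std=2):
    """Return one coverage per negative genome; unselected genomes get 0."""
    if not 0 <= complexity <= 1:
        raise ValueError('complexity must be between 0 and 1')
    n_selected = round(complexity * n_neg_files)  # e.g. 0.1 * 100 -> 10
    # Draw lognormal coverages for the selected genomes only
    coverages = [float(np.random.lognormal(mean=logn_mean, sigma=logn_std))
                 for _ in range(n_selected)]
    # Pad with zeros so there is one entry per file, then shuffle positions
    coverages += [0.0] * (n_neg_files - n_selected)
    random.shuffle(coverages)
    return coverages


random.seed(0)
np.random.seed(0)
example = sketch_neg_coverages(n_neg_files=100, complexity=0.1)
print(len(example), sum(c > 0 for c in example))
```

With 100 files and a complexity of 0.1, the list has 100 entries, of which 10 are non-zero.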


In [1]:
BGC_type = 'RTX_toxin_acyltransferase_combined'
metagenome_name = 'RTXtoxinSynSponge' # do not use '_'

In [2]:
import os
from os import listdir, mkdir
from os.path import isfile, join
import pandas as pd
import numpy as np
from pathlib import Path
import random
from datetime import datetime

In [3]:
# Helper function for making directories only if they don't exist yet
def makedir(dirpath):
    if os.path.isdir(dirpath):
        print(dirpath,'exists already')
    else:
        print('Making', dirpath)    
        os.mkdir(dirpath)

        
# Defining paths for required directory structure for input and output files relative to parent directory
# Directories are made relative to the path the notebook was opened in
parent_dir = os.getcwd()
BGC_path = os.path.join(parent_dir, BGC_type)
base_genomes_path=os.path.join(BGC_path, 'base_genomes')
neg_genomes_path=os.path.join(BGC_path, 'base_genomes/neg_genomes')
pos_genomes_path=os.path.join(BGC_path, 'base_genomes/pos_genomes')
ds1_coverage_table_path=os.path.join(BGC_path, 'ds1_coverage_tables')
ds2_coverage_table_path=os.path.join(BGC_path, 'ds2_coverage_tables')

# Validation directories were made manually. Comment out or remove these lines after running
validation_pos_genomes_path=os.path.join(BGC_path, 'base_genomes/validation_pos_genomes')
validation_coverage_table_path=os.path.join(BGC_path, 'validation_coverage_tables')


# Calling function to make directories if they don't exist yet
makedir(ds1_coverage_table_path)
makedir(ds2_coverage_table_path)

Making /media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_combined/ds1_coverage_tables
Making /media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_combined/ds2_coverage_tables


In [4]:
with open(BGC_path+'/'+'report_3_generate_coverage_table.txt', 'w') as f:
    f.write('Output directory is: '+BGC_path+'\n')
    f.write('\nBGC_type = '+BGC_type)
    f.write('\nmetagenome_name = '+str(metagenome_name)+'\n')

In [5]:
# Number depends on the selected pos and neg genomes in previous script
# Sugimoto default is 140 neg and 10 pos genomes
neg_filenames = [f for f in listdir(neg_genomes_path) if isfile(join(neg_genomes_path, f))]
pos_filenames = [f for f in listdir(pos_genomes_path) if isfile(join(pos_genomes_path, f))]
#pos_filenames = [f for f in listdir(validation_pos_genomes_path) if isfile(join(validation_pos_genomes_path, f))]

all_filenames=neg_filenames+pos_filenames
df_len = len(all_filenames)
print('Neg genomes available:', len(neg_filenames))
print('Pos genomes available:', len(pos_filenames))
print('Max length of coverage tables is:', df_len)

with open(BGC_path+'/'+'report_3_generate_coverage_table.txt', 'a') as f:
    f.write('\nLength of coverage tables is: '+str(df_len)+'\n')
    f.write('Consisting of: '+str(len(neg_filenames))+' negative genomes and '+str(len(pos_filenames))+' positive genomes.\n\n')

Neg genomes available: 140
Pos genomes available: 10
Max length of coverage tables is: 150


The cell below is written to follow the dataset design in Sugimoto et al. 2019:

Dataset 1:
- high complexity metagenomes contain 90% of 140 neg genomes, low complexity metagenomes contain 30% (generate 70 each)
- Both high and low complexity metagenomes randomly contain between 0 and 3 positive genomes
- (The probability that each negative genome is selected is different; the probability for each positive genome to be selected is the same.)
- proportion of each negative genome is log-normally distributed with mean=1 and variance=4, then normalised so that sum of proportions is 1
- Each selected positive genome can have either "high" (9-11) or "low" (1-4/3) coverage, sampled from a uniform distribution
- If a metagenome contains 1 or 3 positive genomes, coverage of each is equally likely to be high or low
- If a metagenome contains 2 positive genomes, coverage is high for 1 and low for the other
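
The positive-genome rules for dataset 1 can be sketched as below. This is an illustrative sketch under the rules listed above, not the notebook's own function; `sketch_ds1_pos_coverages` is a hypothetical name.

```python
import random


def sketch_ds1_pos_coverages(n_pos, rng=random):
    """Coverages for n_pos selected positive genomes (0 to 3)."""
    def high():
        return rng.uniform(9, 11)

    def low():
        return rng.uniform(1, 4 / 3)

    if n_pos not in (0, 1, 2, 3):
        raise ValueError('dataset 1 expects 0 to 3 positive genomes')
    if n_pos == 2:
        # Exactly one high and one low coverage
        return [high(), low()]
    # For 1 or 3 genomes, each is high or low with equal probability
    return [high() if rng.random() < 0.5 else low() for _ in range(n_pos)]


random.seed(1)  # fixed only to make the example repeatable
for n in range(4):
    print(n, [round(c, 2) for c in sketch_ds1_pos_coverages(n)])
```

Zero positive genomes gives an empty list; the remaining positive genome files would then receive a coverage of 0 in the table.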


Dataset 2:
- Exactly as dataset 1 with the difference that both positive and negative genomes were sampled from a lognormal distribution.
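
As a rough illustration of the lognormal sampling and the normalisation step mentioned above, the sketch below draws values with mean=1 and sigma=2 (variance 4 of the underlying normal distribution) and rescales them to sum to 1. The variable names `raw` and `proportions` and the sample size are assumptions made for the example. Note that the function in this notebook stores the raw draws as coverages, so the normalisation here only illustrates the description.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen only for repeatability

# mean and sigma describe the underlying normal distribution,
# so sigma=2 corresponds to a variance of 4
raw = rng.lognormal(mean=1, sigma=2, size=126)  # e.g. 90% of 140 genomes

# Normalise so that the proportions sum to 1
proportions = raw / raw.sum()

print(proportions.sum())
print(np.median(raw))
```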

In [6]:
#This is for generating datasets required for build input. For validation, use script 3.0
def assemble_coverage_tables(complexity, amount_tables, pos_samples=3, logn_mean=1, logn_std=2, dataset=2):

    amount_pos_samples = random.randint(0,pos_samples) # Select between 0 and 3 by default
    add_samples = round(complexity*len(neg_filenames)) # round to nearest int
    
    for mg_number in range(0,amount_tables):
        
        #generate a list of negative coverages of length 'add_samples' and randomly select coverages from it
        neg_coverage_list = []
        for i in range(0,add_samples):
            neg_coverage_list.append(np.random.lognormal(mean=logn_mean, sigma=logn_std))
        while len(neg_coverage_list) < len(neg_filenames):
            neg_coverage_list.append(0)
        randomised_neg_cov = random.sample(neg_coverage_list,len(neg_filenames))

        # default - dataset 2, positive and negative samples are selected from same distribution
        if dataset == 2:
            pos_coverage_list = []
            for i in range(0,amount_pos_samples):
                pos_coverage_list.append(np.random.lognormal(mean=logn_mean, sigma=logn_std))
            while len(pos_coverage_list) < len(pos_filenames):
                pos_coverage_list.append(0)
            randomised_pos_cov = random.sample(pos_coverage_list,len(pos_filenames))


        # dataset 1 - pos samples are chosen from a particular ruleset
        elif dataset == 1:
            pos_coverage_list = []
            #decide how many pos samples are chosen. Not compatible with non-default pos_samples
            rand_pos = random.randint(0,3)
            if rand_pos == 0:
                pos_coverage_list.append(0)
            elif rand_pos == 1:
                if random.random() <= 0.5:
                    pos_coverage_list.append(random.uniform(9,11))
                else:
                    pos_coverage_list.append(random.uniform(1,4/3))
            elif rand_pos == 2:
                pos_coverage_list.append(random.uniform(9,11))
                pos_coverage_list.append(random.uniform(1,4/3))
            elif rand_pos == 3:
                for sample in range(0,rand_pos):
                    if random.random() <= 0.5:
                        pos_coverage_list.append(random.uniform(9,11))
                    else:
                        pos_coverage_list.append(random.uniform(1,4/3))
            else:
                print('dataset 1 not compatible with non-default amount of pos_samples')
                return()

            while len(pos_coverage_list) < len(pos_filenames):
                pos_coverage_list.append(0)
            randomised_pos_cov = random.sample(pos_coverage_list,len(pos_filenames))
        else:
            # Fail loudly instead of returning an empty tuple that breaks the callers
            raise ValueError('Only 1 and 2 are valid values for dataset')

        neg_df_dict = {'metagenome_name':[],'complexity':[],'base_genome_filename':[],'coverage':[]}
        for i in range(0,len(neg_filenames)):
            neg_df_dict['metagenome_name'].append(metagenome_name+'_'+str(mg_number))
            neg_df_dict['complexity'].append(complexity)
            neg_df_dict['base_genome_filename'].append(neg_filenames[i])
            neg_df_dict['coverage'].append(randomised_neg_cov[i])
        neg_cov_df = pd.DataFrame(neg_df_dict)

        pos_df_dict = {'metagenome_name':[],'complexity':[],'base_genome_filename':[],'coverage':[]}
        for i in range(0,len(pos_filenames)):
            pos_df_dict['metagenome_name'].append(metagenome_name+'_'+str(mg_number))
            pos_df_dict['complexity'].append(complexity)
            pos_df_dict['base_genome_filename'].append(pos_filenames[i])
            pos_df_dict['coverage'].append(randomised_pos_cov[i])
        pos_cov_df = pd.DataFrame(pos_df_dict)
        
        
        # See plot_mg_correlation notebook to match naming convention! (e.g 0_15_7375_S148.csv)
        # Naming must include complexity! Update naming in plot_mg_correlation notebook!
        result = pd.concat([neg_cov_df, pos_cov_df])

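        # Note: returning inside the loop means only the first table is produced,
        # so callers should pass amount_tables=1 and loop themselves
        return result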

print('function loaded')

function loaded


In [7]:
# dataset 2

# Generate coverage tables for 70 low complexity metagenomes

metagenome_counter = 0
final_amount_tables = 70

while metagenome_counter < final_amount_tables:
    low_complexity=0.3
    result_df = assemble_coverage_tables(complexity=low_complexity, amount_tables=1, dataset=2)
    pos_name = len(list(filter(lambda num: num != 0, (result_df.iloc[-len(pos_filenames):,3]))))
    neg_name = len(list(filter(lambda num: num != 0, (result_df.iloc[:len(neg_filenames),3]))))
    with open(BGC_path+'/'+'report_3_generate_coverage_table.txt', 'a') as f:
        f.write(datetime.now(tz=None).strftime('%d/%m/%y, %H:%M:%S')+'\tGenerating low complexity coverage table '+str(pos_name)+'_'+str(len(pos_filenames)-pos_name)+'_'+str(low_complexity)+'_'+metagenome_name+'_'+str(metagenome_counter)+' with '+str(pos_name)+' positive genome(s) and '+str(neg_name)+' negative genome(s).\n')
    result_df.to_csv(ds2_coverage_table_path+'/'+str(pos_name)+'_'+str(len(pos_filenames)-pos_name)+'_'+str(low_complexity)+'_'+metagenome_name+'_'+str(metagenome_counter), sep=',', index=False, header=False)
    metagenome_counter +=1

    
# Generate coverage tables for 70 high complexity metagenomes

metagenome_counter = 0
final_amount_tables = 70

while metagenome_counter < final_amount_tables:
    high_complexity=0.9
    result_df = assemble_coverage_tables(complexity=high_complexity, amount_tables=1, dataset=2)
    pos_name = len(list(filter(lambda num: num != 0, (result_df.iloc[-len(pos_filenames):,3]))))
    neg_name = len(list(filter(lambda num: num != 0, (result_df.iloc[:len(neg_filenames),3]))))
    with open(BGC_path+'/'+'report_3_generate_coverage_table.txt', 'a') as f:
        f.write(datetime.now(tz=None).strftime('%d/%m/%y, %H:%M:%S')+'\tGenerating high complexity coverage table '+str(pos_name)+'_'+str(len(pos_filenames)-pos_name)+'_'+str(high_complexity)+'_'+metagenome_name+'_'+str(metagenome_counter)+' with '+str(pos_name)+' positive genome(s) and '+str(neg_name)+' negative genome(s).\n')
    result_df.to_csv(ds2_coverage_table_path+'/'+str(pos_name)+'_'+str(len(pos_filenames)-pos_name)+'_'+str(high_complexity)+'_'+metagenome_name+'_'+str(metagenome_counter), sep=',', index=False, header=False)
    metagenome_counter +=1


print('Dataset 2 coverage tables done.')

Dataset 2 coverage tables done.
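
Because the tables are written without a header row, reading one back requires supplying the column names. The sketch below shows one way to do that and to count the genomes that are actually present. The in-memory CSV and its genome filenames are made up for the example, and `COLUMNS` is a hypothetical constant mirroring the column order used when the tables are assembled.

```python
import io

import pandas as pd

# Column order used when the tables are written (they have no header row)
COLUMNS = ['metagenome_name', 'complexity', 'base_genome_filename', 'coverage']

# Tiny stand-in for a coverage table file; the genome filenames are made up
example_csv = io.StringIO(
    'RTXtoxinSynSponge_0,0.3,neg_genome_a.fna,2.71\n'
    'RTXtoxinSynSponge_0,0.3,neg_genome_b.fna,0.0\n'
    'RTXtoxinSynSponge_0,0.3,pos_genome_a.fna,10.2\n'
)

table = pd.read_csv(example_csv, header=None, names=COLUMNS)
present = table[table['coverage'] != 0]  # genomes with non-zero coverage
print(len(table), len(present))
```

For a real table, the `io.StringIO` object would be replaced by the path to a file in one of the coverage table directories.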


In [8]:
# dataset 1

# Generate coverage tables for 70 low complexity metagenomes

metagenome_counter = 0
final_amount_tables = 70

while metagenome_counter < final_amount_tables:
    low_complexity=0.3
    result_df = assemble_coverage_tables(complexity=low_complexity, amount_tables=1, dataset=1)
    pos_name = len(list(filter(lambda num: num != 0, (result_df.iloc[-len(pos_filenames):,3]))))
    neg_name = len(list(filter(lambda num: num != 0, (result_df.iloc[:len(neg_filenames),3]))))
    with open(BGC_path+'/'+'report_3_generate_coverage_table.txt', 'a') as f:
        f.write(datetime.now(tz=None).strftime('%d/%m/%y, %H:%M:%S')+'\tGenerating low complexity coverage table '+str(pos_name)+'_'+str(len(pos_filenames)-pos_name)+'_'+str(low_complexity)+'_'+metagenome_name+'_'+str(metagenome_counter)+' with '+str(pos_name)+' positive genome(s) and '+str(neg_name)+' negative genome(s).\n')
    result_df.to_csv(ds1_coverage_table_path+'/'+str(pos_name)+'_'+str(len(pos_filenames)-pos_name)+'_'+str(low_complexity)+'_'+metagenome_name+'_'+str(metagenome_counter), sep=',', index=False, header=False)
    metagenome_counter +=1

    
# Generate coverage tables for 70 high complexity metagenomes

metagenome_counter = 0
final_amount_tables = 70

while metagenome_counter < final_amount_tables:
    high_complexity=0.9
    result_df = assemble_coverage_tables(complexity=high_complexity, amount_tables=1, dataset=1)
    pos_name = len(list(filter(lambda num: num != 0, (result_df.iloc[-len(pos_filenames):,3]))))
    neg_name = len(list(filter(lambda num: num != 0, (result_df.iloc[:len(neg_filenames),3]))))
    with open(BGC_path+'/'+'report_3_generate_coverage_table.txt', 'a') as f:
        f.write(datetime.now(tz=None).strftime('%d/%m/%y, %H:%M:%S')+'\tGenerating high complexity coverage table '+str(pos_name)+'_'+str(len(pos_filenames)-pos_name)+'_'+str(high_complexity)+'_'+metagenome_name+'_'+str(metagenome_counter)+' with '+str(pos_name)+' positive genome(s) and '+str(neg_name)+' negative genome(s).\n')
    result_df.to_csv(ds1_coverage_table_path+'/'+str(pos_name)+'_'+str(len(pos_filenames)-pos_name)+'_'+str(high_complexity)+'_'+metagenome_name+'_'+str(metagenome_counter), sep=',', index=False, header=False)
    metagenome_counter +=1


print('Dataset 1 coverage tables done.')

Dataset 1 coverage tables done.
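
The output filenames encode the number of positive genomes present, the number absent, the complexity, the metagenome name and a counter, joined by underscores. The sketch below builds and parses such a name, which also shows why `metagenome_name` should not contain '_'. The helpers `build_table_name` and `parse_table_name` and the `TableName` type are hypothetical and not part of the notebook.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    pos_present: int
    pos_absent: int
    complexity: float
    metagenome_name: str
    index: int


def build_table_name(pos_present, n_pos_files, complexity, metagenome_name, index):
    # An underscore in the name would make the fields ambiguous when splitting
    if '_' in metagenome_name:
        raise ValueError("metagenome_name must not contain '_'")
    return f'{pos_present}_{n_pos_files - pos_present}_{complexity}_{metagenome_name}_{index}'


def parse_table_name(name):
    pos_present, pos_absent, complexity, metagenome_name, index = name.split('_')
    return TableName(int(pos_present), int(pos_absent), float(complexity),
                     metagenome_name, int(index))


name = build_table_name(2, 10, 0.3, 'RTXtoxinSynSponge', 5)
print(name)
print(parse_table_name(name))
```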
