This script generates the 2 remaining files required for input:
- True positive genes (nucleotide fasta)
- protein alignment (amino acid fasta)
IMPORTANTLY, these 2 files are created from non-overlapping sequences!


The script also moves the corresponding genome files to new directories so that coverage tables can be generated from them.
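The non-overlap requirement can be illustrated with a minimal sketch. It assumes the candidate assemblies are already sorted by priority; the function name `split_positive_assemblies` and the example accessions are hypothetical and not part of this notebook.

```python
def split_positive_assemblies(sorted_assemblies, n_selected):
    """Split a priority-sorted list of assembly accessions into two disjoint bins."""
    # remove repeated accessions while keeping the priority order
    unique = list(dict.fromkeys(sorted_assemblies))
    selected = unique[:n_selected]    # source of the TP genes (nucleotide fasta)
    remaining = unique[n_selected:]   # source of the protein sequences for the alignment
    # the two bins must never share an assembly
    assert not set(selected) & set(remaining)
    return selected, remaining


# hypothetical accessions, for illustration only
selected, remaining = split_positive_assemblies(
    ['GCF_A', 'GCF_B', 'GCF_A', 'GCF_C', 'GCF_D'], n_selected=2)
print(selected)
print(remaining)
```

Because both bins are slices of the same de-duplicated list, no assembly can contribute to both the TP genes file and the protein alignment.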

In the first cell specify:
- BGC type (This name must stay constant throughout the scripts)
- select_neg_genomes, i.e. the number of negative genomes to be transferred to the neg_genomes directory
- select_pos_genomes, i.e. the number of positive genomes to be transferred to the pos_genomes directory and used to generate the tp_genes file (the surplus genomes are used to generate the protein alignment)
- pos_isolation_source_filter. If these terms are found in the isolation_source column of the positive samples in the summary file, the samples are scored higher in a scoring column, i.e. samples from a known and desired isolation source are used preferentially.
- neg_isolation_source_filter, accordingly for the negative samples
- avoid_list. Isolation sources that exactly match one of these terms are scored 0, end up at the bottom of the table, and are picked last. This is useful when an uncommon gene is searched for and more isolation sources, and/or more tenuous ones, were allowed during download. These are generally words that contain one of the search terms, e.g. 'sea' in 'diseased'.
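The scoring described in the list above can be sketched as follows. This is a simplified illustration under stated assumptions, not the notebook's own function: `score_isolation_source` is a hypothetical name, and the keyword and avoid lists are shortened examples.

```python
def score_isolation_source(source, keywords, avoid_terms):
    """Return a priority score for one isolation_source entry."""
    # an entry that exactly matches an avoid term is scored 0 and ends up last
    if source in avoid_terms:
        return 0
    # otherwise start at 1 and add 1 for every keyword contained in the entry
    return 1 + sum(1 for word in keywords if word in source)


keywords = ['marine', 'sea', 'sediment']   # shortened example list
avoid_terms = ['', 'diseased']             # shortened example list

sources = ['soil', 'marine sediment', 'diseased', 'sea water']
ranked = sorted(sources,
                key=lambda s: score_isolation_source(s, keywords, avoid_terms),
                reverse=True)
for s in ranked:
    print(score_isolation_source(s, keywords, avoid_terms), s)
```

Without the avoid list, 'diseased' would score 2 because it contains 'sea'; the exact-match check pushes it to the bottom instead.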

TODO: modify the script so that the TP genes are used as a query against each individual negative genome. A negative genome should only be moved from the temp directory to the neg_genomes directory if the BLAST search comes back negative.
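One possible shape for such a per-genome check is sketched below. It assumes BLAST+ (`blastn`) is installed and that each genome is an uncompressed nucleotide fasta file. The function names, the e-value cutoff of 1e-5, and the injectable `run` argument are illustrative assumptions, not part of this notebook.

```python
import subprocess


def build_blastn_command(query_fasta, genome_fasta, evalue=1e-5):
    """Build a blastn call that searches one genome file directly (no database)."""
    return ['blastn',
            '-query', str(query_fasta),
            '-subject', str(genome_fasta),
            '-evalue', str(evalue),   # assumed cutoff, adjust as needed
            '-outfmt', '6']           # tabular output, one line per hit


def genome_is_negative(query_fasta, genome_fasta, run=subprocess.run):
    """Return True if blastn reports no hit of the TP genes in the genome."""
    result = run(build_blastn_command(query_fasta, genome_fasta),
                 capture_output=True, text=True, check=True)
    # an empty tabular report means that no TP gene was found
    return result.stdout.strip() == ''
```

A genome would then only be moved to the neg_genomes directory when `genome_is_negative(...)` returns True. The `run` argument exists only so that the check can be exercised without BLAST installed.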

In [1]:
BGC_type = 'nitrile_hydratase_beta'
select_neg_genomes = 140
select_pos_genomes = 10

avoid_list = ['', 'isolation_source not annotated', 'diseased', 'mice', 'spice', 'septicemic', 'research', 'crevice']
#these are identical to first script, but don't have to be
pos_isolation_source_filter =  ['marine', 'sea', 'sponge', 'ocean', 'porifera', 'seafloor','sediment', 'water', 'tidal', 'coral', 'reef', 'coast', 'ship', 'fish', 'aquaculture', 'atlantic', 'pacific', 'mediterranean', 'baltic', 'pond', 'river', 'ice', 'carribean', 'lake', 'fjord', 'marina', 'hydro', 'algal', 'algae']
neg_isolation_source_filter = ['marine', 'sea', 'sponge', 'ocean', 'porifera', 'seafloor', 'sediment', 'water', 'tidal', 'coral', 'reef', 'coast', 'ship', 'fish', 'aquaculture', 'atlantic', 'pacific', 'mediterranean', 'baltic', 'pond', 'river', 'ice', 'carribean', 'lake', 'fjord', 'marina', 'hydro', 'algal', 'algae']

In [2]:
import os
from os import listdir, mkdir
from os.path import isfile, join
from pathlib import Path
import pandas as pd
import random
import glob
import warnings

In [3]:
def makedir(dirpath):
    if os.path.isdir(dirpath):
        print(dirpath,'exists already')
    else:
        print('Making', dirpath)    
        os.mkdir(dirpath)

        
# Defining paths for required directory structure for input and output files relative to parent directory
#parent_dir='/media/manu/RiPP_Prioritiser/'
#will make directories relative to the path the notebook was opened in
parent_dir = os.getcwd()  # plain-Python equivalent of the IPython shell call `!echo $(pwd)`
BGC_path=os.path.join(parent_dir, BGC_type)
neg_genomes_path=os.path.join(BGC_path, 'base_genomes/temp_neg_genomes')
pos_genomes_path=os.path.join(BGC_path, 'base_genomes/temp_pos_genomes')
output_neg_path=os.path.join(BGC_path, 'base_genomes/neg_genomes')
output_pos_path=os.path.join(BGC_path, 'base_genomes/pos_genomes')


# Calling function to make directories if they don't exist yet
makedir(output_neg_path)
makedir(output_pos_path)

os.chdir(BGC_path)

Making /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/neg_genomes
Making /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/pos_genomes


In [4]:
# Generating a report file for this script
with open(BGC_path+'/'+'report_2_generate_tp.txt', 'w') as f:
    f.write('Output directory is: '+BGC_path+'\n')
    f.write('\nBGC_type = '+BGC_type)
    f.write('\nselect_neg_genomes = '+str(select_neg_genomes))
    f.write('\nselect_pos_genomes = '+str(select_pos_genomes))
    f.write('\navoid_list = '+str(avoid_list))
    f.write('\nneg_isolation_source_filter = '+str(neg_isolation_source_filter))
    f.write('\npos_isolation_source_filter = '+str(pos_isolation_source_filter)+'\n')

In [5]:
# load summary table into data frame (output from script 1)
summary_file = pd.read_csv('summary.tsv', sep='\t')

Change order of tables to prioritize samples that have an isolation source

In [6]:
warnings.filterwarnings('ignore')

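#filter positives and drop all duplicate protein sequences originating from different organisms
#(keep=False removes every copy of a duplicated protein_id, not just the repeats)
pos_mask = (summary_file['dir'] == '+')
#.copy() gives an independent data frame, so columns can be added later without modifying a slice
pos_df = summary_file[pos_mask].drop_duplicates(subset='protein_id', keep=False).copy()


#filter negatives
neg_mask = (summary_file['dir'] == '-')
neg_df = summary_file[neg_mask].copy()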

#scoring words in isolation source so as to preferentially pick samples with chosen isolation sources

def custom_sorting(source,isolation_source_filter):
    # pick the keyword list that matches the requested sample type
    if isolation_source_filter=='pos':
        keywords = pos_isolation_source_filter
    elif isolation_source_filter=='neg':
        keywords = neg_isolation_source_filter
    else:
        keywords = []
    # base score of 1, plus 1 for every keyword found in the isolation source
    score = 1 + sum(1 for word in keywords if word in source)
    # sources that exactly match an entry of avoid_list are scored 0 and picked last
    if source in avoid_list:
        score = 0
    return score


pos_df['scoring_column'] = pos_df.apply(lambda x: custom_sorting(x['isolation_source'],'pos'),axis=1)
neg_df['scoring_column'] = neg_df.apply(lambda x: custom_sorting(x['isolation_source'],'neg'),axis=1)

pos_df.sort_values(by=['scoring_column'], axis=0, ascending=False, inplace=True)
neg_df.sort_values(by=['scoring_column'], axis=0, ascending=False, inplace=True)

In [7]:
#Split positive genomes into 2 bins, one goes towards tp-genes and is the pos-genomes used for synthesising metagenomes
#the other one constitutes a source of protein sequences for alignment as an input file

# Genomes selected in such a way that they are from the top of the pre-sorted pos_df
unique_pos_df = pos_df.drop_duplicates(subset='assembly', inplace=False)
selected_tp_genomes = list(unique_pos_df.iloc[:,1])[0:select_pos_genomes]
remaining_pos_genomes = list(unique_pos_df.iloc[:,1])[select_pos_genomes:]

with open(BGC_path+'/'+'report_2_generate_tp.txt', 'a') as f:
    f.write('\nselected_tp_genomes are:\n')
    f.write(str(selected_tp_genomes)+'\n')

#select genomes and isolate GCF number from them, move selected tp genomes to final pos_genomes directory
for genome in selected_tp_genomes:
    print('moving positive', genome, 'to', output_pos_path)
    !mv "{pos_genomes_path}"/"{genome}"* "{output_pos_path}"
    
#generate dataframe containing all tp-genomes and all the tp-genes contained in it
filtered_pos_df = pos_df[pos_df['assembly'].isin(selected_tp_genomes)]
remaining_pos_df = pos_df[~pos_df['assembly'].isin(selected_tp_genomes)]

#isolate all the headers and transfer them to the selected_tp_genes file
full_header_list = []
for i in range(0,len(filtered_pos_df)):
    full_header=str('>')+filtered_pos_df.iloc[i,1]+str('_')+filtered_pos_df.iloc[i,3]+str('_')+filtered_pos_df.iloc[i,5]
    full_header_list.append(full_header)

# generate fasta file with selected tp genes found in the selected genomes
print('generating selected_tp_genes.fasta')
with open(BGC_path+'/'+'report_2_generate_tp.txt', 'a') as f:
    f.write('\nselected_tp_genes in positive genomes are:\n')
tp_gene_counter=0
with open(BGC_path+'/'+BGC_type+'_tp_genes.fasta') as fh:
    lines=fh.readlines()
# open the report and the output fasta once instead of once per matching header
with open(BGC_path+'/'+'report_2_generate_tp.txt', 'a') as f, \
     open(BGC_path+'/'+BGC_type+'_selected_tp_genes.fasta', 'a') as outfile:
    for i in range(0,len(lines)):
        for full_header in full_header_list:
            if full_header in lines[i]:
                tp_gene_counter+=1
                f.write(lines[i][1:-1]+'\n')
                # assumes each sequence occupies exactly one line after its header
                outfile.write(lines[i]+lines[i+1])
with open(BGC_path+'/'+'report_2_generate_tp.txt', 'a') as f:
    f.write('\n'+str(len(selected_tp_genomes))+' unique genomes with '+ str(tp_gene_counter)+' unique tp genes.\n\n')
                    
                    
# transfer all amino acid sequences that are not part of the tp-genomes to a fasta file
print('generating selected_tp_aa.fasta')
with open(BGC_path+'/'+'report_2_generate_tp.txt', 'a') as f:
    f.write('\nselected_tp_aa sequences for muscle alignment are:\n')
tp_aa_counter = 0
with open(BGC_path+'/'+BGC_type+'_selected_tp_aa.fasta', 'a') as outfile:
    for i in range(0,len(remaining_pos_df)):
        tp_aa_counter+=1
        fasta_header=str('>')+remaining_pos_df.iloc[i,1]+str('_')+remaining_pos_df.iloc[i,3]+str('_')+remaining_pos_df.iloc[i,5]+'\n'
        sequence = remaining_pos_df.iloc[i,6][2:-2]+'\n'
        with open(BGC_path+'/'+'report_2_generate_tp.txt', 'a') as f:
            f.write(fasta_header[1:-1]+'\n')
        outfile.write(fasta_header)
        outfile.write(sequence)
with open(BGC_path+'/'+'report_2_generate_tp.txt', 'a') as f:
    f.write('\n'+str(len(remaining_pos_genomes))+' unique genomes with '+ str(tp_aa_counter)+' unique aa sequences.\n\n')        
        
#Move selected neg genome files to different location
unique_neg_df = neg_df.drop_duplicates(subset='assembly', inplace=False)
selected_neg_genomes = list(unique_neg_df.iloc[:,1])[0:select_neg_genomes]

with open(BGC_path+'/'+'report_2_generate_tp.txt', 'a') as f:
    f.write('\nselected_neg_genomes are:\n')
    f.write(str(selected_neg_genomes))

for genome in selected_neg_genomes:
    print('moving negative', genome, 'to', output_neg_path)
    !mv "{neg_genomes_path}"/"{genome}"* "{output_neg_path}"

    
print('Done')

moving positive GCF_002983865.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/pos_genomes
moving positive GCF_011290545.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/pos_genomes
moving positive GCF_009363555.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/pos_genomes
moving positive GCF_011040495.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/pos_genomes
moving positive GCF_009363335.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/pos_genomes
moving positive GCF_009363615.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/pos_genomes
moving positive GCF_002393525.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/pos_genomes
moving positive GCF_006970865.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/pos_genomes
moving positive GCF_003544895.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/pos_

moving negative GCF_009371885.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/neg_genomes
moving negative GCF_002241655.2 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/neg_genomes
moving negative GCF_003959225.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/neg_genomes
moving negative GCF_000772065.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/neg_genomes
moving negative GCF_003067445.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/neg_genomes
moving negative GCF_018402175.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/neg_genomes
moving negative GCF_000214435.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/neg_genomes
moving negative GCF_000317675.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/neg_genomes
moving negative GCF_003993825.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/neg_

moving negative GCF_000604425.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/neg_genomes
moving negative GCF_004215015.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/neg_genomes
moving negative GCF_002874535.1 to /media/manu/RiPP_Prioritiser/nitrile_hydratase_beta/base_genomes/neg_genomes
Done


The cell below generates a MUSCLE alignment of the protein sequences from genomes that were not chosen as pos_genomes. However, to follow the procedure of the 2019 Sugimoto paper more closely, download the seed alignment for the protein of interest from Pfam in fasta format instead and save it as {BGC_type}_selected_tp_alignment.fasta

In [8]:
#!module load MUSCLE/3.8.1551
#not sure why full path is required?
with open(BGC_path+'/'+'report_2_generate_tp.txt', 'a') as f:
    f.write('\n\nMUSCLE alignment details:\n')

!muscle -in "{BGC_path}"/"{BGC_type}"_selected_tp_aa.fasta -out "{BGC_path}"/"{BGC_type}"_selected_tp_alignment.fasta -loga "{BGC_path}"/report_2_generate_tp.txt
#!module unload MUSCLE/3.8.1551


MUSCLE v3.8.1551 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

nitrile_hydratase_beta_selected 53 seqs, lengths min 90, max 319, avg 175
00:00:00    15 MB(-3%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00    15 MB(-3%)  Iter   1  100.00%  K-mer dist pass 2
00:00:00    24 MB(-4%)  Iter   1  100.00%  Align node       
00:00:00    24 MB(-4%)  Iter   1  100.00%  Root alignment
00:00:00    24 MB(-4%)  Iter   2  100.00%  Refine tree   
00:00:00    24 MB(-4%)  Iter   2  100.00%  Root alignment
00:00:00    24 MB(-4%)  Iter   2  100.00%  Root alignment
00:00:01    24 MB(-4%)  Iter   3  100.00%  Refine biparts
00:00:02    24 MB(-4%)  Iter   4  100.00%  Refine biparts
00:00:02    24 MB(-4%)  Iter   5  100.00%  Refine biparts
00:00:02    24 MB(-4%)  Iter   5  100.00%  Refine biparts


TODO: introduce a step here in which all negative genomes are concatenated, a BLAST database is built from the concatenated file, and all selected TP genes are used to query that database. If a genome produces any hits, remove it from the selected_neg_genomes list, add another one, then check again.
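The replace-and-recheck loop described above could look roughly like the sketch below. It is an illustration under assumptions, not an existing part of this notebook. The search itself is passed in as a callable, which in practice might wrap `makeblastdb -in <concatenated fasta> -dbtype nucl -out <db>` followed by `blastn -query <tp genes> -db <db> -outfmt 6`. The function names, the contig-to-genome mapping, and the example report line are hypothetical.

```python
def subject_ids_from_tabular(blast_tabular):
    """Collect subject sequence IDs (column 2) from BLAST tabular output (-outfmt 6)."""
    return {line.split('\t')[1] for line in blast_tabular.splitlines() if line.strip()}


def screen_negative_genomes(selected, reserve, find_hit_genomes):
    """Swap genomes with hits for reserve genomes until a search comes back clean.

    find_hit_genomes takes a list of genomes and returns those that contain a hit.
    """
    selected, reserve = list(selected), list(reserve)
    target = len(selected)
    while True:
        # only genomes that are currently selected can be removed
        hits = set(find_hit_genomes(selected)) & set(selected)
        if not hits:
            return selected
        selected = [g for g in selected if g not in hits]
        # top up from the reserve as long as replacements are available
        while reserve and len(selected) < target:
            selected.append(reserve.pop(0))


# hypothetical data standing in for a real BLAST run
contig_to_genome = {'contig_1': 'GCF_A', 'contig_2': 'GCF_B', 'contig_3': 'GCF_C'}
example_report = 'tp_gene_1\tcontig_2\t98.5\t300\t4\t0\t1\t300\t10\t309\t1e-150\t540\n'


def fake_search(genomes):
    # stands in for makeblastdb + blastn on the concatenated genomes
    hit_genomes = {contig_to_genome[c] for c in subject_ids_from_tabular(example_report)}
    return hit_genomes & set(genomes)


final_selection = screen_negative_genomes(['GCF_A', 'GCF_B'], ['GCF_C'], fake_search)
print(final_selection)
```

If the reserve runs out, the loop returns fewer genomes than requested rather than keeping a genome with a hit; whether that is acceptable depends on how many negative genomes the later steps need.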