In addition to what the original script, 2_generate_TP, does, this script runs a BLAST search of selected_tp_genes against temp_neg_genomes and filters out all negative genomes that have a hit.


This script generates the 2 remaining files required for input:
- True positive genes (nucleotide fasta)
- protein alignment (amino acid fasta)
IMPORTANTLY, these 2 files are created from non-overlapping sequences!
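
A quick way to confirm that the two files really are non-overlapping is to compare the genome accessions in their headers. The sketch below is only an illustration: the helper names `fasta_accessions` and `shared_accessions` are hypothetical, and it assumes that every header contains an NCBI assembly accession of the form GCF_xxxxxxxxx.n (or GCA_...), as the headers written by this notebook do.

```python
import re

# Assumption: each FASTA header contains an assembly accession such as GCF_009363655.1
ACCESSION = re.compile(r"GC[AF]_\d+\.\d+")


def fasta_accessions(path):
    """Return the set of assembly accessions found in the header lines of a FASTA file."""
    accessions = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                match = ACCESSION.search(line)
                if match:
                    accessions.add(match.group())
    return accessions


def shared_accessions(nucleotide_fasta, protein_fasta):
    """Return accessions present in both files; an empty set means no overlap."""
    return fasta_accessions(nucleotide_fasta) & fasta_accessions(protein_fasta)
```

Calling `shared_accessions` on the two generated files should return an empty set if the split worked as intended.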


It also moves the corresponding genome files to new directories so that coverage tables can be generated from them.

In the first cell specify:
- BGC type (This name must stay constant throughout the scripts)
- select_neg_genomes, i.e. the number of negative genomes to be transferred to the neg_genomes directory
- select_pos_genomes, i.e. the number of positive genomes to be transferred to the pos_genomes directory and used to generate the tp_genes file (the surplus genomes are used to generate the protein alignment)
- pos_isolation_source_filter. Positive samples whose isolation_source column in the summary file contains these terms receive a higher value in a scoring column, i.e. samples from a known and desired isolation source are used preferentially.
- neg_isolation_source_filter, the same for the negative samples
- avoid_list. Isolation sources equal to one of these terms are scored 0, end up at the bottom of the table, and are picked last. This is useful when searching for an uncommon gene for which more, and/or more tenuous, isolation sources were allowed during download. The terms are generally words that contain one of the search terms, e.g. 'sea' in 'diseased'.
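
The scoring idea behind the two filters and the avoid_list can be illustrated with a small stand-alone function. This is a minimal sketch rather than the notebook's own code: `score_source` is a hypothetical name, and giving non-string values (for example missing annotations) a score of 0 is an assumption made here. Like the function in the scoring cell further down, it treats avoid terms as exact matches and filter terms as substrings.

```python
def score_source(source, preferred_terms, avoid_terms):
    """Score one isolation_source value; higher scores are picked first."""
    # Assumption: missing or non-string annotations are ranked last.
    if not isinstance(source, str):
        return 0
    # Avoid terms are compared for equality, not as substrings.
    if source in avoid_terms:
        return 0
    # Base score of 1, plus 1 for every preferred term contained in the source.
    return 1 + sum(term in source for term in preferred_terms)


preferred = ["sea", "water", "marine"]
avoid = ["", "diseased"]
print(score_source("sea water", preferred, avoid))      # two terms match
print(score_source("diseased", preferred, avoid))       # exact avoid match
print(score_source("diseased fish", preferred, avoid))  # 'sea' still matches as a substring
```

The last call shows why the choice between exact and substring matching matters: an isolation source that merely contains an avoid term is not demoted.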

The TP genes are used as the query against each individual negative genome. Negative genomes are only moved from the temp directory to the neg_genomes directory if the BLAST search returns no hits.
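
The contamination check described above boils down to asking, per negative genome, whether its BLAST output contains any hit. The following sketch uses only the standard library; the function names are hypothetical, and it assumes one tab-separated result file per genome named `<accession>.blastout`, which is the naming used in the BLAST cell of this notebook.

```python
import csv
import os


def genomes_with_hits(results_dir, suffix=".blastout"):
    """Return the accessions whose tabular BLAST output holds at least one hit."""
    contaminated = set()
    for name in sorted(os.listdir(results_dir)):
        if not name.endswith(suffix):
            continue
        path = os.path.join(results_dir, name)
        with open(path, newline="") as handle:
            # An empty file (no rows) means the query genes were not found.
            rows = [row for row in csv.reader(handle, delimiter="\t") if row]
        if rows:
            contaminated.add(name[: -len(suffix)])
    return contaminated


def clean_genomes(candidates, contaminated):
    """Keep the candidate order, dropping every genome that had a hit."""
    return [genome for genome in candidates if genome not in contaminated]
```

Because `clean_genomes` preserves the input order, a list that was pre-sorted by isolation-source score stays sorted after filtering.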

In [1]:
BGC_type = 'YcaO'
select_neg_genomes = 140
select_pos_genomes = 10

avoid_list = ['', 'isolation_source not annotated', 'diseased', 'mice', 'spice', 'septicemic', 'research', 'crevice']
#these are identical to first script, but don't have to be
pos_isolation_source_filter = ['marine', 'sea', 'sponge', 'ocean', 'porifera', 'seafloor', 'sediment', 'water', 'tidal', 'coral', 'reef', 'coast', 'ship', 'fish', 'aquaculture', 'atlantic', 'pacific', 'mediterranean', 'baltic', 'pond', 'river', 'ice', 'caribbean', 'lake', 'fjord', 'marina', 'hydro', 'algal', 'algae']
neg_isolation_source_filter = ['marine', 'sea', 'sponge', 'ocean', 'porifera', 'seafloor', 'sediment', 'water', 'tidal', 'coral', 'reef', 'coast', 'ship', 'fish', 'aquaculture', 'atlantic', 'pacific', 'mediterranean', 'baltic', 'pond', 'river', 'ice', 'caribbean', 'lake', 'fjord', 'marina', 'hydro', 'algal', 'algae']

In [2]:
import os
from os import listdir, mkdir
from os.path import isfile, join
from pathlib import Path
import pandas as pd
from pandas.errors import EmptyDataError
import random
import glob
import warnings

In [3]:
def makedir(dirpath):
    if os.path.isdir(dirpath):
        print(dirpath,'exists already')
    else:
        print('Making', dirpath)    
        os.mkdir(dirpath)

        
# Defining paths for required directory structure for input and output files relative to parent directory
#parent_dir='/media/manu/RiPP_Prioritiser/'
#will make directories relative to the path the notebook was opened in
parent_dir = os.getcwd()
BGC_path = os.path.join(parent_dir, BGC_type)
neg_genomes_path=os.path.join(BGC_path, 'base_genomes/temp_neg_genomes')
pos_genomes_path=os.path.join(BGC_path, 'base_genomes/temp_pos_genomes')
output_neg_path=os.path.join(BGC_path, 'base_genomes/neg_genomes')
output_pos_path=os.path.join(BGC_path, 'base_genomes/pos_genomes')
neg_blast_path=os.path.join(BGC_path, 'base_genomes/neg_blast')
neg_blast_db_path=os.path.join(BGC_path, 'base_genomes/neg_blast/databases')
neg_blast_results_path=os.path.join(BGC_path, 'base_genomes/neg_blast/results')


# Calling function to make directories if they don't exist yet
makedir(output_neg_path)
makedir(output_pos_path)
makedir(neg_blast_path)
makedir(neg_blast_db_path)
makedir(neg_blast_results_path)

os.chdir(BGC_path)

Making /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
Making /media/manu/RiPP_Prioritiser/YcaO/base_genomes/pos_genomes
Making /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast
Making /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases
Making /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/results


In [4]:
# Generating a report file for this script
with open(BGC_path+'/'+'report_2_generate_tp.txt', 'w') as f:
    f.write('Output directory is: '+BGC_path+'\n')
    f.write('\nBGC_type = '+BGC_type)
    f.write('\nselect_neg_genomes = '+str(select_neg_genomes))
    f.write('\nselect_pos_genomes = '+str(select_pos_genomes))
    f.write('\navoid_list = '+str(avoid_list))
    f.write('\nneg_isolation_source_filter = '+str(neg_isolation_source_filter))
    f.write('\npos_isolation_source_filter = '+str(pos_isolation_source_filter)+'\n')

In [5]:
# load the summary table (output of script 1) into a data frame
summary_file = pd.read_csv('summary.tsv', sep='\t')

Change order of tables to prioritize samples that have an isolation source

In [6]:
warnings.filterwarnings('ignore')

#filter positives and drop every protein sequence that occurs in more than one row
#(keep=False removes all copies of a duplicated protein_id, not just the repeats)
pos_mask = (summary_file['dir'] == '+')
pos_df = summary_file[pos_mask].drop_duplicates(subset='protein_id', keep=False)


#filter negatives (copy, so that adding the scoring column below does not write to a view)
neg_mask = (summary_file['dir'] == '-')
neg_df = summary_file[neg_mask].copy()

#scoring words in isolation source so as to preferentially pick samples with chosen isolation sources

def custom_sorting(source,isolation_source_filter):
    # pick the term list that belongs to the requested sample type
    if isolation_source_filter=='pos':
        terms = pos_isolation_source_filter
    elif isolation_source_filter=='neg':
        terms = neg_isolation_source_filter
    else:
        return 1
    # sources that exactly match an avoid_list entry are scored 0 and picked last
    if source in avoid_list:
        return 0
    # base score of 1, plus 1 for every filter term contained in the source
    return 1 + sum(word in source for word in terms)


pos_df['scoring_column'] = pos_df.apply(lambda x: custom_sorting(x['isolation_source'],'pos'),axis=1)
neg_df['scoring_column'] = neg_df.apply(lambda x: custom_sorting(x['isolation_source'],'neg'),axis=1)

pos_df.sort_values(by=['scoring_column'], axis=0, ascending=False, inplace=True)
neg_df.sort_values(by=['scoring_column'], axis=0, ascending=False, inplace=True)

In [7]:
#Split positive genomes into 2 bins, one goes towards tp-genes and is the pos-genomes used for synthesising metagenomes
#the other one constitutes a source of protein sequences for alignment as an input file

# Genomes selected in such a way that they are from the top of the pre-sorted pos_df
unique_pos_df = pos_df.drop_duplicates(subset='assembly', inplace=False)
selected_tp_genomes = list(unique_pos_df.iloc[:,1])[0:select_pos_genomes]
remaining_pos_genomes = list(unique_pos_df.iloc[:,1])[select_pos_genomes:]

with open(BGC_path+'/'+'report_2_generate_tp.txt', 'a') as f:
    f.write('\nselected_tp_genomes are:\n')
    f.write(str(selected_tp_genomes)+'\n')

#select genomes and isolate GCF number from them, move selected tp genomes to final pos_genomes directory
for genome in selected_tp_genomes:
    print('moving positive', genome, 'to', output_pos_path)
    !mv "{pos_genomes_path}"/"{genome}"* "{output_pos_path}"
    
#generate dataframe containing all tp-genomes and all the tp-genes contained in it
filtered_pos_df = pos_df[pos_df['assembly'].isin(selected_tp_genomes)]
remaining_pos_df = pos_df[~pos_df['assembly'].isin(selected_tp_genomes)]

#isolate all the headers and transfer them to the selected_tp_genes file
full_header_list = []
for i in range(0,len(filtered_pos_df)):
    full_header=str('>')+filtered_pos_df.iloc[i,1]+str('_')+filtered_pos_df.iloc[i,3]+str('_')+filtered_pos_df.iloc[i,5]
    full_header_list.append(full_header)

# generate fasta file with selected tp genes found in the selected genomes
# (assumes every record in the tp_genes file is one header line followed by one sequence line)
print('generating selected_tp_genes.fasta')
tp_gene_counter=0
with open(BGC_path+'/'+BGC_type+'_tp_genes.fasta') as fh:
    lines=fh.readlines()
# open both output files once; 'w' keeps a re-run from appending duplicate records to the fasta
with open(BGC_path+'/'+'report_2_generate_tp.txt', 'a') as f, open(BGC_path+'/'+BGC_type+'_selected_tp_genes.fasta', 'w') as outfile:
    f.write('\nselected_tp_genes in positive genomes are:\n')
    # stop one line early so that lines[i+1] always exists
    for i in range(0,len(lines)-1):
        for full_header in full_header_list:
            if full_header in lines[i]:
                tp_gene_counter+=1
                f.write(lines[i][1:-1]+'\n')
                outfile.write(lines[i]+lines[i+1])
    f.write('\n'+str(len(selected_tp_genomes))+' unique genomes with '+ str(tp_gene_counter)+' unique tp genes.\n\n')
                    
                    
# transfer all amino acid sequences that are not part of the tp-genomes to a fasta file
print('generating selected_tp_aa.fasta')
tp_aa_counter = 0
# open both output files once; 'w' keeps a re-run from appending duplicate records to the fasta
with open(BGC_path+'/'+'report_2_generate_tp.txt', 'a') as f, open(BGC_path+'/'+BGC_type+'_selected_tp_aa.fasta', 'w') as outfile:
    f.write('\nselected_tp_aa sequences for muscle alignment are:\n')
    for i in range(0,len(remaining_pos_df)):
        tp_aa_counter+=1
        fasta_header=str('>')+remaining_pos_df.iloc[i,1]+str('_')+remaining_pos_df.iloc[i,3]+str('_')+remaining_pos_df.iloc[i,5]+'\n'
        # strip the first and last two characters of the stored sequence string
        sequence = remaining_pos_df.iloc[i,6][2:-2]+'\n'
        f.write(fasta_header[1:-1]+'\n')
        outfile.write(fasta_header)
        outfile.write(sequence)
    f.write('\n'+str(len(remaining_pos_genomes))+' unique genomes with '+ str(tp_aa_counter)+' unique aa sequences.\n\n')

    
print('Done')

moving positive GCF_009363655.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/pos_genomes
moving positive GCF_009363735.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/pos_genomes
moving positive GCF_013085545.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/pos_genomes
moving positive GCF_012317585.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/pos_genomes
moving positive GCF_014490785.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/pos_genomes
moving positive GCF_009649995.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/pos_genomes
moving positive GCF_004328865.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/pos_genomes
moving positive GCF_002009335.2 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/pos_genomes
moving positive GCF_000217795.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/pos_genomes
moving positive GCF_015265455.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/pos_genomes
generating selected_tp_genes.fasta
generating selected_tp_aa

In [8]:
#Move selected neg genome files to different location
unique_neg_df = neg_df.drop_duplicates(subset='assembly', inplace=False)
# list of all unique negative genome accessions, in pre-sorted order (the requested number is selected after the blast check)
neg_genomes_list = list(unique_neg_df.iloc[:,1])

with open(BGC_path+'/'+'report_2_generate_tp.txt', 'a') as f:
    f.write('\n'+'Using blastn to check negative genomes for contamination'+'\n\n')

# makes blast databases of all individual neg genomes (easier to keep track of accession numbers than when combining)
for genomes in neg_genomes_list:
    !makeblastdb -in "{neg_genomes_path}"/"{genomes}"* -dbtype nucl -out "{neg_blast_db_path}"/"{genomes}"_db

# runs blastn search of all TP genes against all blast databases of negative genomes
for genomes in neg_genomes_list:
    !blastn -db "{neg_blast_db_path}"/"{genomes}"_db -query "{BGC_path}"/"{BGC_type}"_selected_tp_genes.fasta -out "{neg_blast_results_path}"/"{genomes}".blastout -outfmt "6 qseqid sseqid pident evalue"

# use pandas to concatenate all blast output tables
df_list = []
for outfile in os.listdir(neg_blast_results_path):
    try:
        blast_df = pd.read_csv(neg_blast_results_path+'/'+outfile, sep='\t', names=['qseqid', 'sseqid', 'pident', 'evalue'], index_col=None)
        blast_df['subject_accession'] = '.'.join(outfile.split('.')[0:2])
        df_list.append(blast_df)
    except EmptyDataError:
        continue
        
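# Generate a list of contaminated negative genomes
# (pd.concat raises a ValueError on an empty list, which happens when no genome had a hit)
if df_list:
    remove_df = pd.concat(df_list)
else:
    remove_df = pd.DataFrame(columns=['qseqid', 'sseqid', 'pident', 'evalue', 'subject_accession'])
remove_list = list(remove_df.iloc[:,4])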

with open(BGC_path+'/'+'report_2_generate_tp.txt', 'a') as f:
    f.write('\n'+'Contamination with tp_seqs:\n'+str(list(remove_df.iloc[:,0]))+'\nsequences identified in negative samples:\n'+str(remove_list)+'\n\n')

# Remove these contaminated samples from the possible pool of negative sequences
cleaned = ~unique_neg_df['assembly'].isin(remove_list)
cleaned_unique_neg_df = unique_neg_df[cleaned]

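# Select a preset number of negative genomes and warn if fewer clean genomes are available than requested
selected_neg_genomes = list(cleaned_unique_neg_df.iloc[:,1])[0:select_neg_genomes]
if len(selected_neg_genomes) < select_neg_genomes:
    print('Warning: only', len(selected_neg_genomes), 'clean negative genomes available,', select_neg_genomes, 'requested')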

# report block
with open(BGC_path+'/'+'report_2_generate_tp.txt', 'a') as f:
    f.write('\nselected_neg_genomes are:\n')
    f.write(str(selected_neg_genomes))

for genome in selected_neg_genomes:
    print('moving negative', genome, 'to', output_neg_path)
    !mv "{neg_genomes_path}"/"{genome}"* "{output_neg_path}"

print('Done')



Building a new DB, current time: 07/20/2021 10:57:53
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_002094855.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_002094855.1_ASM209485v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 89 sequences in 0.105674 seconds.


Building a new DB, current time: 07/20/2021 10:57:53
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_009664145.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_009664145.1_ASM966414v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1 sequences in 0.039063 seconds.


Building a new DB, current time: 07/20/2021 10:57:54
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_001316435.1_db
New DB title:  

Adding sequences from FASTA; added 376 sequences in 0.131297 seconds.


Building a new DB, current time: 07/20/2021 10:57:58
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_009910325.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_009910325.1_ASM991032v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 67 sequences in 0.0560842 seconds.


Building a new DB, current time: 07/20/2021 10:57:58
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_002806945.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_002806945.1_ASM280694v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 30 sequences in 0.0383899 seconds.


Building a new DB, current time: 07/20/2021 10:57:58
New DB name:   /media/manu/RiPP_Prioritiser

Adding sequences from FASTA; added 75 sequences in 0.0393951 seconds.


Building a new DB, current time: 07/20/2021 10:58:03
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_001683595.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_001683595.1_NGF2_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 48 sequences in 0.0580959 seconds.


Building a new DB, current time: 07/20/2021 10:58:03
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_003429285.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_003429285.1_ASM342928v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1 sequences in 0.0521002 seconds.


Building a new DB, current time: 07/20/2021 10:58:03
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/ba

Adding sequences from FASTA; added 227 sequences in 0.0477121 seconds.


Building a new DB, current time: 07/20/2021 10:58:08
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_000286315.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_000286315.1_ASM28631v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 365 sequences in 0.0717492 seconds.


Building a new DB, current time: 07/20/2021 10:58:08
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_003839475.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_003839475.1_ASM383947v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 46 sequences in 0.065124 seconds.


Building a new DB, current time: 07/20/2021 10:58:08
New DB name:   /media/manu/RiPP_Prioritiser

Adding sequences from FASTA; added 42 sequences in 0.056807 seconds.


Building a new DB, current time: 07/20/2021 10:58:13
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_015461845.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_015461845.1_ASM1546184v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 20 sequences in 0.069968 seconds.


Building a new DB, current time: 07/20/2021 10:58:13
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_000979635.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_000979635.1_gtlEnvA5udCFS_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 139 sequences in 0.14609 seconds.


Building a new DB, current time: 07/20/2021 10:58:13
New DB name:   /media/manu/RiPP_Prioritiser

Adding sequences from FASTA; added 80 sequences in 0.0669372 seconds.


Building a new DB, current time: 07/20/2021 10:58:18
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_000812065.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_000812065.1_ASM81206v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 58 sequences in 0.04634 seconds.


Building a new DB, current time: 07/20/2021 10:58:18
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_001189955.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_001189955.1_ASM118995v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 192 sequences in 0.154388 seconds.


Building a new DB, current time: 07/20/2021 10:58:18
New DB name:   /media/manu/RiPP_Prioritiser/Yc

Adding sequences from FASTA; added 58 sequences in 1.00035 seconds.


Building a new DB, current time: 07/20/2021 10:58:52
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_000746665.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_000746665.1_ASM74666v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 3 sequences in 1.36549 seconds.


Building a new DB, current time: 07/20/2021 10:58:53
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_011299615.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_011299615.1_ASM1129961v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1 sequences in 0.463944 seconds.


Building a new DB, current time: 07/20/2021 10:58:54
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/b

Adding sequences from FASTA; added 344 sequences in 0.982996 seconds.


Building a new DB, current time: 07/20/2021 10:59:29
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_000774065.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_000774065.1_ASM77406v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 145 sequences in 1.28384 seconds.


Building a new DB, current time: 07/20/2021 10:59:31
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_017255225.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_017255225.1_ASM1725522v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 38 sequences in 0.887856 seconds.


Building a new DB, current time: 07/20/2021 10:59:32
New DB name:   /media/manu/RiPP_Prioritiser/Y

Adding sequences from FASTA; added 2 sequences in 1.44485 seconds.


Building a new DB, current time: 07/20/2021 11:00:06
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_001833965.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_001833965.1_ASM183396v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 42 sequences in 1.59144 seconds.


Building a new DB, current time: 07/20/2021 11:00:08
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_015888165.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_015888165.1_ASM1588816v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1 sequences in 1.4616 seconds.


Building a new DB, current time: 07/20/2021 11:00:10
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/ba

Adding sequences from FASTA; added 15 sequences in 1.02805 seconds.


Building a new DB, current time: 07/20/2021 11:00:38
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_004405125.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_004405125.1_ASM440512v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 34 sequences in 1.10477 seconds.


Building a new DB, current time: 07/20/2021 11:00:40
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_blast/databases/GCF_014770435.1_db
New DB title:  /media/manu/RiPP_Prioritiser/YcaO/base_genomes/temp_neg_genomes/GCF_014770435.1_ASM1477043v1_genomic.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1 sequences in 1.40478 seconds.


Building a new DB, current time: 07/20/2021 11:00:42
New DB name:   /media/manu/RiPP_Prioritiser/YcaO/

Adding sequences from FASTA; added 101 sequences in 0.050967 seconds.
moving negative GCF_002094855.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_009664145.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_001316435.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_013283165.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_013283445.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_009363475.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_004354305.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_004022445.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_003584015.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_001434745.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/ne

moving negative GCF_003349265.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_000490355.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_003197505.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_017350015.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_010093095.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_016770915.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_016595265.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_003295565.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_004337335.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_000931385.1 to /media/manu/RiPP_Prioritiser/YcaO/base_genomes/neg_genomes
moving negative GCF_009177065.1 to /media/manu/RiPP_Prioriti

The cell below generates a MUSCLE alignment of the protein sequences from genomes that were not chosen as pos_genomes. However, to follow the procedure of the 2019 Sugimoto paper more closely, download the seed alignment for the respective protein of interest from Pfam in fasta format instead and save it as {BGC_type}_selected_tp_alignment.fasta
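
Whichever route is taken, it can be worth checking that the saved file is actually an alignment, i.e. that all sequences (including gap characters) have the same length. The sketch below is a hedged illustration with hypothetical helper names; it assumes unique headers and simply joins wrapped sequence lines.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict of header -> sequence (wrapped lines joined)."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                header = line[1:]
                records[header] = ""
            elif header is not None and line:
                records[header] += line
    return records


def is_aligned(records):
    """True if there is at least one record and all sequences share one length."""
    lengths = {len(sequence) for sequence in records.values()}
    return len(lengths) == 1
```

For example, `is_aligned(read_fasta(alignment_path))` returns False for an unaligned protein fasta such as the selected_tp_aa file, unless all its sequences happen to have equal length.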

In [9]:
#!module load MUSCLE/3.8.1551
#not sure why full path is required?
#with open(BGC_path+'/'+'report_2_generate_tp.txt', 'a') as f:
#    f.write('\n\nMUSCLE alignment details:\n')

#!muscle -in "{BGC_path}"/"{BGC_type}"_selected_tp_aa.fasta -out "{BGC_path}"/"{BGC_type}"_selected_tp_alignment.fasta -loga "{BGC_path}"/report_2_generate_tp.txt
#!module unload MUSCLE/3.8.1551

Introduce a step here where all negative genomes are concatenated, a blastdb is built from the concatenated file, and all selected TP genes are used to query the blast db. If there are any hits, remove that sequence from the selected_neg_genomes_list, and add another one, then check again.
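
The remove-and-refill loop proposed above can be sketched independently of BLAST. In this illustration `select_clean_genomes` and its `find_hits` argument are hypothetical: `find_hits` stands in for "concatenate this batch of genomes, build one BLAST database, query it with the selected TP genes, and return the genomes that were hit". Only newly added genomes are re-checked in each round, which gives the same result as re-checking the whole selection.

```python
def select_clean_genomes(ranked_genomes, wanted, find_hits):
    """Pick up to `wanted` genomes in ranked order, replacing any that have hits."""
    pool = list(ranked_genomes)
    selected = pool[:wanted]
    reserve = pool[wanted:]
    unchecked = list(selected)
    while unchecked:
        # One search per round over all genomes that have not been checked yet.
        contaminated = set(find_hits(unchecked))
        selected = [genome for genome in selected if genome not in contaminated]
        # Refill from the reserve; the replacements are checked in the next round.
        missing = wanted - len(selected)
        unchecked = reserve[:missing]
        reserve = reserve[missing:]
        selected.extend(unchecked)
    return selected


# Toy stand-in for the BLAST step: g2 and g4 are "contaminated".
bad = {"g2", "g4"}
print(select_clean_genomes(["g1", "g2", "g3", "g4", "g5"], 3,
                           lambda batch: [g for g in batch if g in bad]))
```

If the reserve runs out, the function returns fewer genomes than requested, so the caller should compare the length of the result with `wanted`.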