This script downloads gbk and fasta files fo generating the synthetic metagenome inputs for metaBGC build. It is far from perfect, but it's a start. Currently, the genomes parsed and downloaded are randomised. This is in principle horribly inefficient and depending on the search term, this script can run for a couple of days. However, it can be justified with generating a relatively unbiased dataset (at least as unbiased as the random number generation is).


There are a number of ways of specifying the downloads through the following inputs:
- BGC_type: Name of the output directory relative to the current working directory. This name must stay constant throughout the remaining scripts.
- amount_pos_genomes: The amount of positive genomes for the specified search_term
- amount_neg_genomes: The amount of negative genomes for the specified search_term
- search_term: name of the protein of interest to parse in the feature tables. Has to be all lower case. Can be the full term to be specific, or just part of a name to have a more loose search. E.g. 'nitrile hydratase subunit alpha' vs. 'nitrile hydratase'.
- neg_isolation_source_filter and pos_isolation_source_filter: pass a list of terms to narrow down the isolation source of organisms. These can be individually specified and don't need to be the same. To permit all isolation sources, pass ['', 'isolation_source not annotated']


Narrowing down the search space:

As of now, this script will generate a table with all currently available bacterial and archaeal genomes in refseq independent of their level of completion (currently ~214000). In order to change this, all the calls to ncbi-genome-download have to be changed.


Outputs:
- all_refseq_accessions.tsv: as of June 4th, 2021, 10.50am NZDT, the refseq table that is parsed is also downloaded to the BGC_type directory.
- tp_genes.fasta: nucleotide fasta file of all positive hits for search_term
- summary.csv: table with accession number, isolation source, organism, description, as well as for positive hits protein_id and amino acid sequence of all downloaded genomes
- backup_pos_accessions.txt: a list of refseq accession numbers that refer to genomes positive for search_term, but not downloaded as they were not in the desired pos_isolation_source_filter.
- report_1_download_base_genomes.txt: report of running this notebook with the given parameters including timestamps (i.e. to know what version of refseq was used at the time of download)
- base_genomes directory, containing:
    - features_tables: feature tables of all downloaded genomes
    - neg_gbk_files: gbk files of all downloaded genomes negative for search_term
    - pos_gbk_files: gbk files of all downloaded genomes positive for search_term
    - temp_neg_genomes: fasta files of all downloaded genomes negative for search_term
    - temp_pos_genomes: fasta files of all downloaded genomes positive for search_term

In [1]:
#Output directory will be named after BGC_type
BGC_type = 'RTX_toxin_acyltransferase_pos'
#amount genomes does not equal amount sequences as 1 genome can have multiple hits for a search term.
amount_neg_genomes = 1 # need significantly more (1000) due to false negative filtering
amount_pos_genomes = 25 # 15 is enough. Tp alignment is taken from pfam. Increase if the domain is noevel
#names are converted to lower case for searching, so search term should be all lower case letters
search_term = 'rtx toxin-activating lysine-acyltransferase'
#if isolation source doesn't matter, include '' and 'not annotated' in the filter
#isolation source information is also converted to lower case, so keep search terms lower case
#can modify positive and negative isolation source separately.

## ['marine', 'sea', 'sponge', 'ocean', 'porifera', 'seafloor', 'sediment', 'water', 'tidal', 'coral', 'reef', 'coast', 'ship', 'fish', 'aquaculture', 'atlantic', 'pacific', 'mediterranean', 'baltic', 'pond', 'river', 'ice', 'carribean', 'lake', 'fjord', 'marina', 'hydro', 'algal', 'algae']

neg_isolation_source_filter = ['marine', 'sea', 'sponge', 'ocean', 'porifera', 'seafloor', 'sediment', 'water', 'tidal', 'coral', 'reef', 'coast', 'ship', 'fish', 'aquaculture', 'atlantic', 'pacific', 'mediterranean', 'baltic', 'pond', 'river', 'ice', 'carribean', 'lake', 'fjord', 'marina', 'hydro', 'algal', 'algae', 'clam', 'shell', 'mussel']
pos_isolation_source_filter =  ['marine', 'sea', 'sponge', 'ocean', 'porifera', 'seafloor', 'sediment', 'water', 'tidal', 'coral', 'reef', 'coast', 'ship', 'fish', 'aquaculture', 'atlantic', 'pacific', 'mediterranean', 'baltic', 'pond', 'river', 'ice', 'carribean', 'lake', 'fjord', 'marina', 'hydro', 'algal', 'algae', 'clam', 'shell', 'mussel']

In [2]:
import os
from os import listdir, mkdir
from os.path import isfile, join
from pathlib import Path
import pandas as pd
import random
import glob
from Bio import SeqIO
from datetime import datetime

In [3]:
def makedir(dirpath):
    if os.path.isdir(dirpath):
        print(dirpath,'exists already')
    else:
        print('Making', dirpath)    
        os.mkdir(dirpath)

        
# Defining paths for required directory structure for input and output files relative to parent directory
parent_dir='/media/manu/RiPP_Prioritiser'
BGC_path=os.path.join(parent_dir, BGC_type)
base_genomes_path=os.path.join(BGC_path, 'base_genomes')
neg_genomes_path=os.path.join(BGC_path, 'base_genomes/temp_neg_genomes')
pos_genomes_path=os.path.join(BGC_path, 'base_genomes/temp_pos_genomes')
feature_tables_path=os.path.join(BGC_path, 'base_genomes/feature_tables')
neg_gbk_path=os.path.join(BGC_path, 'base_genomes/neg_gbk_files')
pos_gbk_path=os.path.join(BGC_path, 'base_genomes/pos_gbk_files')



# Calling function to make directories if they don't exist yet
makedir(BGC_path)
makedir(base_genomes_path)
makedir(neg_genomes_path)
makedir(pos_genomes_path)
makedir(feature_tables_path)
makedir(neg_gbk_path)
makedir(pos_gbk_path)

os.chdir(parent_dir)

Making /media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_pos
Making /media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_pos/base_genomes
Making /media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_pos/base_genomes/temp_neg_genomes
Making /media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_pos/base_genomes/temp_pos_genomes
Making /media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_pos/base_genomes/feature_tables
Making /media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_pos/base_genomes/neg_gbk_files
Making /media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_pos/base_genomes/pos_gbk_files


In [4]:
# Generating a report file for this particular script
with open(BGC_path+'/'+'report_1_download_base_genomes.txt', 'w') as f:
    f.write('Output directory is: '+BGC_path+'\n')
    f.write('\nBGC_type = '+BGC_type)
    f.write('\namount_neg_genomes = '+str(amount_neg_genomes))
    f.write('\namount_pos_genomes = '+str(amount_pos_genomes))
    f.write('\nsearch_term = '+search_term)
    f.write('\nneg_isolation_source_filter = '+str(neg_isolation_source_filter))
    f.write('\npos_isolation_source_filter = '+str(pos_isolation_source_filter)+'\n')

Using ncbi-genome-download to get a list of accession numbers of complete bacterial and archaeal genomes from refseq. Save this file as a tsv.
Run only if file doesn't exist yet

In [5]:
# Download table of GCF accessions
#including archaea: 21739 (374 more)
#excluding archaea: 21365
#!ncbi-genome-download bacteria,archaea -s refseq -F features -l complete --flat-output --dry-run > "{parent_dir}"/all_refseq_accessions.tsv


# If including all assembly levels:
#including archaea: 214543 (1125 more)
#excluding archaea: 213418

now=!$"date"
#!ncbi-genome-download bacteria,archaea -s refseq -F features -l all --flat-output --dry-run > "{parent_dir}"/all_refseq_accessions.tsv
!ncbi-genome-download bacteria,archaea -s refseq -F features -l all --flat-output --dry-run > "{BGC_path}"/all_refseq_accessions.tsv

with open(BGC_path+'/'+'report_1_download_base_genomes.txt', 'a') as f:
    f.write('\nrefseq input df generated from table downloaded on '+now[0]+'\n')



In [6]:
# read .tsv file into a pandas dataframe
gcf_df = pd.read_csv(BGC_path+'/all_refseq_accessions.tsv', header=None, sep='\t', skiprows=[0])
gcf_df.columns=['accession','organism','strain'] 

#turn first column into a list of accession number to randomly select from
accession_list = list(gcf_df.loc[:,'accession'])

with open(BGC_path+'/'+'report_1_download_base_genomes.txt', 'a') as f:
    f.write('\nrefseq input df\n'+str(gcf_df)+'\n\n')

In [7]:
# https://stackoverflow.com/questions/54662605/how-to-pass-a-python-variables-to-shell-script-in-azure-databricks-notebookbles
#os.environ['LIST'] = ' '.join(random_gcf)
#print(os.getenv('LIST'))

- Also make a shutoff condition if coming back to download more genomes. I.e. parse feature table folder and see if GCF_... already exists, if so, skip (also should append to summary file in this case, but that's probably too much)
- pos only appended to summary if the protein id is in the last or only feature

In [8]:
#scramble accession_list
random_gcf_list=random.sample(accession_list,len(accession_list))

# pre-definiting headers and data types for faster loading of data frames
feature_table_headers = ["# feature", "class", "assembly", "assembly_unit", "seq_type", "chromosome", "genomic_accession", "start", "end", "strand", "product_accession", "non-redundant_refseq", "related_accession", "name", "symbol", "GeneID", "locus_tag", "feature_interval_length", "product_length", "attributes"]
data_types = {'# feature':str,'class':str,'assembly':str,'assembly_unit':str,'seq_type':str,'chromosome':str,'genomic_accession':str,'strand':str,'product_accession':str,'non-redundant_refseq':str,'related_accession':str,'name':str,'symbol':str,'locus_tag':str,'attributes':str}

#initiate summary dictionary
summary_dict={'dir':[],'assembly':[],'isolation_source':[],'organism':[],'description':[],'protein_id':[],'sequence':[]}

#initialise counters to keep track of the amount of checked and downloaded genomes
gcf_checked_counter=0
neg_genomes_counter=0
pos_genomes_counter=0


#open a file to parse positive sequences into --> generate the TP file for this gene

for i in range(0,len(random_gcf_list)): 

    os.chdir(feature_tables_path)
    #print('\nCurrently downloaded',pos_genomes_counter, 'of', amount_pos_genomes, 'pos genome(s).')
    #print('Currently downloaded',neg_genomes_counter, 'of', amount_neg_genomes, 'neg genome(s).')


    #exit from the loop once the pre-set amount of genomes has been downloaded
    if neg_genomes_counter == amount_neg_genomes and pos_genomes_counter == amount_pos_genomes:
        print('\n\n\nDownloaded specified amount of genomes.')
        break
       
    #print('Checked a total of:\t', gcf_checked_counter, 'genome(s). \n\n\n')
    gcf_checked_counter+=1
    
    #exit from the loop if all genomes have been checked (~22000, so that would take a while)
    if gcf_checked_counter == len(random_gcf_list):
        print('Checked all possible genomes. Exiting.')
        break

        
    random_gcf=random_gcf_list[i]
    !ncbi-genome-download bacteria,archaea -s refseq -F features -l all -A "{random_gcf}" --flat-output --output-folder "{feature_tables_path}"
    !gunzip -f "{feature_tables_path}"/"{random_gcf}"*
        
    # Pre-filter individual feature table by a range of criteria to make it more compact
    for feature_table in glob.glob(random_gcf+'*_feature_table.txt'):
        #print('\nFiltering', feature_table)
        features_df = pd.read_csv(feature_table, index_col=None, sep = '\t', names = feature_table_headers, dtype=data_types)
        mask = (features_df['# feature'] == 'CDS')
        export_df = features_df[mask]
        mask2 = (export_df['class'] == 'with_protein') # removes pseudogenes 'without protein'
        export_df2 = export_df[mask2]
        mask3 = (export_df2['assembly_unit'] == 'Primary Assembly')
        export_df3 = export_df2[mask3]
        mask4 = (export_df3['seq_type'] == 'chromosome') | (export_df3['seq_type'] == 'plasmid')
        export_df4 = export_df3[mask4]
        mask5 = pd.isnull(export_df4['attributes'])
        export_df5 = export_df4[mask5]
        export_df5.loc[:,'name'] = export_df5.loc[:,'name'].str.lower()
        mask6 = (~export_df5['name'].str.contains('hypothetical protein'))
        export_df6 = export_df5[mask6]

        
        # search for search term in reduced features table and then conditionally download fasta and gbk either into
        # pos or neg genomes directories
        interest_mask = export_df6['name'].str.contains(search_term)
        search_names = export_df6[interest_mask]    
        

        
##### NEGATIVE GENOMES #####

    # if search term is not found in genome and the set amount of negative genomes has not been reached
    if len(search_names) == 0 and neg_genomes_counter < amount_neg_genomes:
        
        # Download gbk file
        print('No hits for', search_term, 'in', random_gcf)
        print('Downloading gbk of', random_gcf, 'to negative gbk directory.')
        !ncbi-genome-download bacteria,archaea -s refseq -F genbank -l all -A "{random_gcf}" --flat-output --output-folder "{neg_gbk_path}"
        !gunzip -f "{neg_gbk_path}"/"{random_gcf}"*
 
        #parse gbk files for isolation source and organism
        os.chdir(neg_gbk_path)
        for gbk_file in glob.glob(random_gcf+'*.gbff'):
            print('Parsing gbk file of', random_gcf, 'for organism and isolation source.')
            with open(gbk_file):
                for record in SeqIO.parse(gbk_file,'gb'):
                    features = record.features[0]
                    
                    #parse organism if annotated, handle if not
                    if 'organism' in features.qualifiers:
                        organism = features.qualifiers['organism']
                    else:
                        organism = ['organism not annotated']

                    #parse isolation source if annotated, handle if not
                    if 'isolation_source' in features.qualifiers:
                        isolation_source = [entry.lower() for entry in record.features[0].qualifiers['isolation_source']]
                    else:
                        isolation_source = ['isolation_source not annotated']

                # if predefined search words are found in isolation source, increment a counter
                #download_indicator = 0
                wordlist = []
                for word in neg_isolation_source_filter:
                    if word in isolation_source[0]:
                        #download_indicator+=1
                        wordlist.append(word)
                if len(wordlist) >=1:
                    print(wordlist, 'in desired isolation source list.')
                else:
                    print('Not in desired isolation source list. Feature table and gbk files are removed.\n')

                        
                # if isolation source condition is met...
                if len(wordlist) >= 1:
                    #...add row with required information to the summary table...
                    print('Add row for:', random_gcf, 'in', record.description, 'to summary table.')
                    summary_dict['dir'].append('-')
                    summary_dict['assembly'].append(random_gcf)
                    summary_dict['isolation_source'].append(isolation_source[0])
                    summary_dict['organism'].append(organism[0])
                    summary_dict['description'].append(record.description)
                    summary_dict['protein_id'].append('None')
                    summary_dict['sequence'].append('None')

                    #...and download the fasta file for the corresponding genome and increment the counter.
                    os.chdir(feature_tables_path)
                    print('Downloading fasta of', random_gcf, 'to negative genomes directory.')
                    print('Checked a total of:\t', gcf_checked_counter, 'genome(s).\n')
                    with open(BGC_path+'/'+'report_1_download_base_genomes.txt', 'a') as f:
                        f.write(datetime.now(tz=None).strftime('%d/%m/%y, %H:%M:%S')+'\tDownloading fasta of:\t'+random_gcf+' to negative genomes directory.\n')
                    !ncbi-genome-download bacteria,archaea -s refseq -F fasta -l all -A "{random_gcf}" --flat-output --output-folder "{neg_genomes_path}"
                    !gunzip -f "{neg_genomes_path}"/"{random_gcf}"*
                    #download_indicator = 0
                    wordlist = []
                    neg_genomes_counter+=1

                # if isolation source condition is not met...
                else:
                    #print(random_gcf, 'negative for', search_term, 'but', isolation_source[0], 'not in desired list.')
                    #print('Removing gbk file and feature table of', random_gcf)
                    !rm "{neg_gbk_path}"/"{random_gcf}"*
                    !rm "{feature_tables_path}"/"{random_gcf}"*
                    wordlist = []
                        

    # if search term is not found in genome and the set amount of negative genomes has been reached
    elif len(search_names) == 0 and neg_genomes_counter == amount_neg_genomes:
        #print(random_gcf, 'is negative for', search_term+',', 'but specified number of negative genomes reached.')
        #print('Removing feature table of', random_gcf)
        !rm "{feature_tables_path}"/"{random_gcf}"*
    
    
##### POSITIVE GENOMES #####
    # if search term is found in genome and the set amount of positive genomes has not been reached
    elif len(search_names) != 0 and pos_genomes_counter < amount_pos_genomes:

        # Download gbk file
        product_accessions = list(search_names.loc[:,'product_accession'])
        print('Found following hit/s for search term', search_term+':', product_accessions)
        print('Downloading gbk of', random_gcf, 'to positive gbk directory.')
        !ncbi-genome-download bacteria,archaea -s refseq -F genbank -l all -A "{random_gcf}" --flat-output --output-folder "{pos_gbk_path}"
        !gunzip -f "{pos_gbk_path}"/"{random_gcf}"*
        
        
        # Adding an entry for each found hit in each genome into summary_dict
        os.chdir(pos_gbk_path)
        for gbk_file in glob.glob(random_gcf+'*.gbff'):
            print('Parsing gbk file of', random_gcf, 'for organism and isolation source.')
            with open(gbk_file):
                for record in SeqIO.parse(gbk_file,'gb'):    
                    features = record.features[0]
                    
                    #parse organism if annotated, handle if not
                    if 'organism' in record.features[0].qualifiers:
                        organism = record.features[0].qualifiers['organism']
                    else:
                        organism = ['organism not annotated']

                    #parse isolation source if annotated, handle if not
                    if 'isolation_source' in record.features[0].qualifiers:
                        isolation_source = [entry.lower() for entry in record.features[0].qualifiers['isolation_source']]
                        record.features[0].qualifiers['isolation_source']
                    else:
                        isolation_source = ['isolation_source not annotated']
        
        
                    # if predefined search words are found in isolation source, increment a counter
                    #download_indicator = 0
                    wordlist = []
                    for word in pos_isolation_source_filter:
                        if word in isolation_source[0]:
                            #download_indicator+=1
                            wordlist.append(word)
                    

                    # if isolation source condition is met...
                    if len(wordlist) >= 1:
                        #...add row with required information to the summary table...  
                        for i in record.features:
                            for j in product_accessions:
                                if i.qualifiers.get('protein_id') == [j]:
                                    #print(i.extract(record.seq))
                                    print('Add row for', j, 'in', record.description, 'to summary table.')
                                    with open(BGC_path+'/'+'report_1_download_base_genomes.txt', 'a') as f:
                                        f.write(datetime.now(tz=None).strftime('%d/%m/%y, %H:%M:%S')+'\tAdd row for:\t\t\t'+random_gcf+' '+j+' in '+record.description+' to summary table.\n')
                                    summary_dict['dir'].append('+')
                                    summary_dict['assembly'].append(random_gcf)
                                    summary_dict['isolation_source'].append(isolation_source[0])
                                    summary_dict['organism'].append(organism[0])
                                    summary_dict['description'].append(record.description)
                                    summary_dict['protein_id'].append(j)
                                    summary_dict['sequence'].append(i.qualifiers.get('translation'))
                                    print('Adding nucleotide sequence of', j, 'to TP file.')
                                    with open(BGC_path+'/'+BGC_type+'_tp_genes.fasta', 'a') as fasta_file:
                                        fasta_file.write('>'+random_gcf+'_'+organism[0]+'_'+j+'\n'+str(i.extract(record.seq))+'\n')



                                
        #...and download the fasta file for the corresponding genome and increment the counter.
        if len(wordlist) >= 1:
            os.chdir(feature_tables_path)
            print(wordlist, 'in desired isolation source list.')
            print('Downloading fasta of', random_gcf, 'to positive genomes directory.')
            print('Checked a total of:\t', gcf_checked_counter, 'genome(s).\n')
            with open(BGC_path+'/'+'report_1_download_base_genomes.txt', 'a') as f:
                f.write(datetime.now(tz=None).strftime('%d/%m/%y, %H:%M:%S')+'\tDownloading fasta of:\t'+random_gcf+' to positive genomes directory.\n')
            !ncbi-genome-download bacteria,archaea -s refseq -F fasta -l all -A "{random_gcf}" --flat-output --output-folder "{pos_genomes_path}"
            !gunzip -f "{pos_genomes_path}"/"{random_gcf}"*
            pos_genomes_counter+=1
            wordlist = []
        else:
            print(random_gcf, 'positive for', search_term, 'but', isolation_source[0], 'not in desired list.')
            print('Removing gbk file and feature table of', random_gcf)
            print('Checked a total of:\t', gcf_checked_counter, 'genome(s).\n')
            with open(BGC_path+'/'+'report_1_download_base_genomes.txt', 'a') as f:
                f.write(datetime.now(tz=None).strftime('%d/%m/%y, %H:%M:%S')+'\tDownloading\t\t\t\t'+random_gcf+' accession number to backup_pos_accessions.txt. '+isolation_source[0]+' not in desired list.\n')
            with open(BGC_path+'/'+'backup_pos_accessions.txt', 'a') as f:
                f.write(random_gcf+'\n')
            !rm "{pos_gbk_path}"/"{random_gcf}"*
            !rm "{feature_tables_path}"/"{random_gcf}"*
            wordlist = []


    # if search term is found in genome and the set amount of positive genomes has been reached
    elif len(search_names) != 0 and pos_genomes_counter == amount_pos_genomes:
        with open(BGC_path+'/'+'report_1_download_base_genomes.txt', 'a') as f:
            f.write(datetime.now(tz=None).strftime('%d/%m/%y, %H:%M:%S')+'\tDownloading\t\t\t\t'+random_gcf+' accession number to backup_pos_accessions.txt. Set number of positive genomes reached.\n')
        with open(BGC_path+'/'+'backup_pos_accessions.txt', 'a') as f:
            f.write(random_gcf+'\n')
        #print(random_gcf, 'is positive for', search_term+',', 'but specified number of positive genomes reached.')
        #print('Removing feature table of', random_gcf)
        !rm "{feature_tables_path}"/"{random_gcf}"*

        
        
        
##### FAIL? #####
        
    elif neg_genomes_counter == amount_neg_genomes and pos_genomes_counter == amount_pos_genomes:
        print("This shouldn't be happening", '\n')   
        

print('Done')

              
              
              
summary_df = pd.DataFrame(summary_dict)
summary_df.to_csv(BGC_path+'/summary.tsv', index=False, sep='\t')

No hits for rtx toxin-activating lysine-acyltransferase in GCF_000249955.1
Downloading gbk of GCF_000249955.1 to negative gbk directory.
Parsing gbk file of GCF_000249955.1 for organism and isolation source.
Not in desired isolation source list. Feature table and gbk files are removed.

No hits for rtx toxin-activating lysine-acyltransferase in GCF_002277675.1
Downloading gbk of GCF_002277675.1 to negative gbk directory.
Parsing gbk file of GCF_002277675.1 for organism and isolation source.
Not in desired isolation source list. Feature table and gbk files are removed.

No hits for rtx toxin-activating lysine-acyltransferase in GCF_000652895.1
Downloading gbk of GCF_000652895.1 to negative gbk directory.
Parsing gbk file of GCF_000652895.1 for organism and isolation source.
Not in desired isolation source list. Feature table and gbk files are removed.

No hits for rtx toxin-activating lysine-acyltransferase in GCF_018336695.1
Downloading gbk of GCF_018336695.1 to negative gbk directory.

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item_labels[indexer[info_axis]]] = value


Parsing gbk file of GCF_018336695.1 for organism and isolation source.
Not in desired isolation source list. Feature table and gbk files are removed.

No hits for rtx toxin-activating lysine-acyltransferase in GCF_000014365.2
Downloading gbk of GCF_000014365.2 to negative gbk directory.
Parsing gbk file of GCF_000014365.2 for organism and isolation source.
Not in desired isolation source list. Feature table and gbk files are removed.

No hits for rtx toxin-activating lysine-acyltransferase in GCF_003076075.1
Downloading gbk of GCF_003076075.1 to negative gbk directory.
Parsing gbk file of GCF_003076075.1 for organism and isolation source.
Not in desired isolation source list. Feature table and gbk files are removed.

No hits for rtx toxin-activating lysine-acyltransferase in GCF_012276715.1
Downloading gbk of GCF_012276715.1 to negative gbk directory.
Parsing gbk file of GCF_012276715.1 for organism and isolation source.
Not in desired isolation source list. Feature table and gbk files

GCF_003390535.1 positive for rtx toxin-activating lysine-acyltransferase but in vitro phage -resistant clone not in desired list.
Removing gbk file and feature table of GCF_003390535.1
Checked a total of:	 4814 genome(s).

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_001881196.1']
Downloading gbk of GCF_013085165.1 to positive gbk directory.
Parsing gbk file of GCF_013085165.1 for organism and isolation source.
GCF_013085165.1 positive for rtx toxin-activating lysine-acyltransferase but isolation_source not annotated not in desired list.
Removing gbk file and feature table of GCF_013085165.1
Checked a total of:	 5405 genome(s).

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_001881196.1']
Downloading gbk of GCF_000765415.1 to positive gbk directory.
Parsing gbk file of GCF_000765415.1 for organism and isolation source.
GCF_000765415.1 positive for rtx toxin-activating lysine-acyltransferase but isolatio

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_013183815.1']
Downloading gbk of GCF_000953355.1 to positive gbk directory.
Parsing gbk file of GCF_000953355.1 for organism and isolation source.
GCF_000953355.1 positive for rtx toxin-activating lysine-acyltransferase but isolation_source not annotated not in desired list.
Removing gbk file and feature table of GCF_000953355.1
Checked a total of:	 19205 genome(s).

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_033929666.1']
Downloading gbk of GCF_002313025.1 to positive gbk directory.
Parsing gbk file of GCF_002313025.1 for organism and isolation source.
Add row for WP_033929666.1 in Vibrio cholerae strain FORC_055 isolate MFDS chromosome 1, complete sequence to summary table.
Adding nucleotide sequence of WP_033929666.1 to TP file.
['fish'] in desired isolation source list.
Downloading fasta of GCF_002313025.1 to positive genomes directory.
Checked a tota

rm: cannot remove '/media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_pos/base_genomes/feature_tables/GCF_902832255.1*': No such file or directory
ERROR: Download from NCBI failed: ConnectionError(ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')))
gzip: /media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_pos/base_genomes/feature_tables/GCF_015264295.1*.gz: No such file or directory
rm: cannot remove '/media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_pos/base_genomes/feature_tables/GCF_015264295.1*': No such file or directory
ERROR: Download from NCBI failed: ConnectionError(ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')))
gzip: /media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_pos/base_genomes/feature_tables/GCF_002550925.1*.gz: No such file or directory
rm: cannot remove '/media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_pos/base_genomes/feature_tables/GCF_002550925.1*': No such

rm: cannot remove '/media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_pos/base_genomes/feature_tables/GCF_016018015.1*': No such file or directory
Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_039507923.1']
Downloading gbk of GCF_000342305.2 to positive gbk directory.
Parsing gbk file of GCF_000342305.2 for organism and isolation source.
GCF_000342305.2 positive for rtx toxin-activating lysine-acyltransferase but isolation_source not annotated not in desired list.
Removing gbk file and feature table of GCF_000342305.2
Checked a total of:	 26806 genome(s).

ERROR: Download from NCBI failed: ConnectionError(ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')))
gzip: /media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_pos/base_genomes/feature_tables/GCF_003411165.2*.gz: No such file or directory
rm: cannot remove '/media/manu/RiPP_Prioritiser/RTX_toxin_acyltransferase_pos/base_genomes/feature_tables/GC

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_001881196.1']
Downloading gbk of GCF_002946655.1 to positive gbk directory.
Parsing gbk file of GCF_002946655.1 for organism and isolation source.
GCF_002946655.1 positive for rtx toxin-activating lysine-acyltransferase but clinical isolate from adult patient not in desired list.
Removing gbk file and feature table of GCF_002946655.1
Checked a total of:	 36591 genome(s).

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_032474245.1']
Downloading gbk of GCF_019273635.1 to positive gbk directory.
Parsing gbk file of GCF_019273635.1 for organism and isolation source.
GCF_019273635.1 positive for rtx toxin-activating lysine-acyltransferase but food not in desired list.
Removing gbk file and feature table of GCF_019273635.1
Checked a total of:	 36782 genome(s).

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_161772470.1']
Dow

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_001881196.1']
Downloading gbk of GCF_009763065.1 to positive gbk directory.
Parsing gbk file of GCF_009763065.1 for organism and isolation source.
GCF_009763065.1 positive for rtx toxin-activating lysine-acyltransferase but stool not in desired list.
Removing gbk file and feature table of GCF_009763065.1
Checked a total of:	 56609 genome(s).

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_011081431.1']
Downloading gbk of GCF_009665475.1 to positive gbk directory.
Parsing gbk file of GCF_009665475.1 for organism and isolation source.
GCF_009665475.1 positive for rtx toxin-activating lysine-acyltransferase but isolation_source not annotated not in desired list.
Removing gbk file and feature table of GCF_009665475.1
Checked a total of:	 56865 genome(s).

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_010634639.1']
Downloa

Parsing gbk file of GCF_001318185.1 for organism and isolation source.
GCF_001318185.1 positive for rtx toxin-activating lysine-acyltransferase but isolation_source not annotated not in desired list.
Removing gbk file and feature table of GCF_001318185.1
Checked a total of:	 77058 genome(s).

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_046395515.1']
Downloading gbk of GCF_019090985.1 to positive gbk directory.
Parsing gbk file of GCF_019090985.1 for organism and isolation source.
GCF_019090985.1 positive for rtx toxin-activating lysine-acyltransferase but prolonged culture of phase i not in desired list.
Removing gbk file and feature table of GCF_019090985.1
Checked a total of:	 77113 genome(s).

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_001881196.1']
Downloading gbk of GCF_013085145.1 to positive gbk directory.
Parsing gbk file of GCF_013085145.1 for organism and isolation source.
GCF_013085145.1

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_011081431.1']
Downloading gbk of GCF_000746665.1 to positive gbk directory.
Parsing gbk file of GCF_000746665.1 for organism and isolation source.
Add row for WP_011081431.1 in Vibrio vulnificus strain 93U204 chromosome II, complete sequence to summary table.
Adding nucleotide sequence of WP_011081431.1 to TP file.
['sea'] in desired isolation source list.
Downloading fasta of GCF_000746665.1 to positive genomes directory.
Checked a total of:	 90644 genome(s).

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_001881196.1']
Downloading gbk of GCF_009762895.1 to positive gbk directory.
Parsing gbk file of GCF_009762895.1 for organism and isolation source.
GCF_009762895.1 positive for rtx toxin-activating lysine-acyltransferase but stool not in desired list.
Removing gbk file and feature table of GCF_009762895.1
Checked a total of:	 90773 genome(s).

Found followi

Parsing gbk file of GCF_017948345.1 for organism and isolation source.
GCF_017948345.1 positive for rtx toxin-activating lysine-acyltransferase but isolation_source not annotated not in desired list.
Removing gbk file and feature table of GCF_017948345.1
Checked a total of:	 114790 genome(s).

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_001881196.1']
Downloading gbk of GCF_000022585.1 to positive gbk directory.
Parsing gbk file of GCF_000022585.1 for organism and isolation source.
GCF_000022585.1 positive for rtx toxin-activating lysine-acyltransferase but isolation_source not annotated not in desired list.
Removing gbk file and feature table of GCF_000022585.1
Checked a total of:	 115349 genome(s).

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_001881196.1']
Downloading gbk of GCF_009763945.1 to positive gbk directory.
Parsing gbk file of GCF_009763945.1 for organism and isolation source.
GCF_0097639

Parsing gbk file of GCF_002813815.1 for organism and isolation source.
GCF_002813815.1 positive for rtx toxin-activating lysine-acyltransferase but isolation_source not annotated not in desired list.
Removing gbk file and feature table of GCF_002813815.1
Checked a total of:	 134605 genome(s).

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_032474245.1']
Downloading gbk of GCF_004117115.1 to positive gbk directory.
Parsing gbk file of GCF_004117115.1 for organism and isolation source.
GCF_004117115.1 positive for rtx toxin-activating lysine-acyltransferase but isolation_source not annotated not in desired list.
Removing gbk file and feature table of GCF_004117115.1
Checked a total of:	 134967 genome(s).

Found following hit/s for search term rtx toxin-activating lysine-acyltransferase: ['WP_017044044.1']
Downloading gbk of GCF_003390555.1 to positive gbk directory.
Parsing gbk file of GCF_003390555.1 for organism and isolation source.
GCF_0033905