# Acquisition of the NEGATIVE dataset via PhANNs DB
PhANNs DB article link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7660903/

To decide whether the "classes" dataset of PhANNs DB can serve as a negative set, we first need to establish whether anti-CRISPR (Acr) proteins could be annotated as structural proteins and therefore appear in it.

Anti-CRISPR proteins are typically encoded by bacteriophages and inhibit the activity of the host's CRISPR-Cas system. They are not usually considered structural components of the phage particle itself; instead, they act as regulatory proteins that interfere with the host's defense mechanisms. It is therefore unlikely that anti-CRISPR proteins would be classified as structural proteins in the "classes" dataset of PhANNs DB. On that basis, the structural proteins from that dataset are used as the negative dataset.

The file was downloaded from the PhANNs DB website at http://phanns.com/download/expandedDB.tgz. It was chosen because it has already been preprocessed by the authors.
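
The download and unpacking step can be scripted. The sketch below is only an illustration: the helper names `extract_tgz` and `download_and_extract` are hypothetical, and it assumes that unpacking the archive into a `PhANNs` folder yields the `PhANNs/expandedDB` directory used later in this notebook, which should be checked against the actual archive contents.

```python
import os
import tarfile
import urllib.request

PHANNS_URL = "http://phanns.com/download/expandedDB.tgz"  # URL quoted in the text above


def extract_tgz(archive_path, dest_dir):
    """Extract a .tgz archive into dest_dir, refusing members that would escape it."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Check every member path before extracting anything
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        tar.extractall(dest_root)


def download_and_extract(url=PHANNS_URL, dest_dir="PhANNs"):
    """Download the archive into dest_dir and unpack it there (needs network access)."""
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, archive_path)
    extract_tgz(archive_path, dest_dir)
    return archive_path
```

Calling `download_and_extract()` once would fetch and unpack the archive; `dest_dir="PhANNs"` is an assumption chosen to match the relative path used in the cells below.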

In [1]:
import os
from Bio import SeqIO
import pandas as pd
import re

In [2]:
# Merge the selected .fasta files into a single .fasta file
def merge_fasta(input_dir, output_dir, output_file):
    output_path = os.path.join(output_dir, output_file)
    # Create the destination folder if it does not exist yet
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    with open(output_path, "w") as out_handle:
        # Sort the file names so the merged file is reproducible across runs
        for filename in sorted(os.listdir(input_dir)):
            # Select all .fasta files EXCEPT the "_other.fasta" files
            if filename.endswith(".fasta") and not filename.endswith("_other.fasta"):
                filepath = os.path.join(input_dir, filename)
                for record in SeqIO.parse(filepath, "fasta"):
                    SeqIO.write(record, out_handle, "fasta")    # Append the record to the combined file

# Directory containing input FASTA files (relative path)
input_dir = "PhANNs/expandedDB"

# Output directory (current working directory)
output_dir = os.getcwd()

# Output file
output_file = "ProducedDatasets/negative_dataset.fasta"

# Merge the input FASTA files into one
merge_fasta(input_dir, output_dir, output_file)

print("Combined FASTA file saved as", output_file)

Combined FASTA file saved as ProducedDatasets/negative_dataset.fasta
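
A quick way to confirm that the merge lost nothing is to compare the number of records in the selected input files with the number in the merged file. The following is a minimal sketch using only the standard library; the helper names `count_fasta_records` and `records_per_file` are hypothetical, and counting header lines assumes that every record starts with a `>` line, as in standard FASTA.

```python
import os


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines (those starting with '>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def records_per_file(input_dir):
    """Return {filename: record count} using the same file filter as merge_fasta."""
    counts = {}
    for filename in sorted(os.listdir(input_dir)):
        if filename.endswith(".fasta") and not filename.endswith("_other.fasta"):
            counts[filename] = count_fasta_records(os.path.join(input_dir, filename))
    return counts
```

If the merge worked as intended, `sum(records_per_file("PhANNs/expandedDB").values())` should equal `count_fasta_records("ProducedDatasets/negative_dataset.fasta")`.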


In [3]:
# Reads a FASTA file (here, the merged negative dataset) and converts it into a pandas DataFrame
def fasta2df(fasta_file):
    """
    Reads a .fasta file of the dataset and converts it into a pandas DataFrame.

    Parameters:
    fasta_file (str): Path to the FASTA file of the negative dataset.

    Returns:
    pd.DataFrame: DataFrame with columns:
        - 'ID': Accession IDs of the sequences.
        - 'Sequence': Corresponding protein sequences.
        - 'Organism': Organism information extracted from the sequence description.
        - 'Structure': Annotation of the structural protein (e.g. 'tail fiber protein').
        - 'Size': Length of the protein sequences.
        - 'Protein Acr': A column with the value 'No' for all entries.
    """
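    # Parse each record: the organism is the text inside [...] in the header;
    # the annotation is what remains once the brackets and the ID are removed.
    data = []
    for fasta in SeqIO.parse(fasta_file, 'fasta'):
        organism = re.search(r'\[(.*?)\]', fasta.description)
        data.append((fasta.id,
                     str(fasta.seq),
                     organism.group(1) if organism else None,  # None when the header has no [organism]
                     re.sub(r'\[.*?\]', '', fasta.description).strip().replace(fasta.id, '').strip()))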

    # Create a DataFrame from the extracted data
    df = pd.DataFrame(data, columns=['ID', 'Sequence', 'Organism', 'Structure'])

    # Add new columns 'Size' and 'Protein Acr'
    df['Size'] = df['Sequence'].apply(len)
    df['Protein Acr'] = 'No'

    return df
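
To make the header parsing in `fasta2df` concrete, the sketch below applies the same two regular expressions to a single header string. The function name `split_description` is hypothetical, and the header layout `ID annotation [organism]` is an assumption inferred from the regular expressions and from the table shown below, not something documented by PhANNs DB.

```python
import re


def split_description(record_id, description):
    """Split a FASTA header into (organism, annotation) the way fasta2df does."""
    # Organism: first bracketed chunk, or None if the header has no brackets
    organism_match = re.search(r"\[(.*?)\]", description)
    organism = organism_match.group(1) if organism_match else None
    # Annotation: header text with the bracketed chunks and the ID removed
    annotation = re.sub(r"\[.*?\]", "", description).replace(record_id, "").strip()
    return organism, annotation


# Example with a header in the assumed layout
print(split_description(
    "WP_132722810.1",
    "WP_132722810.1 phage baseplate protein [Tenacibaculum sp. M341]",
))
```

Returning `None` for a missing organism, rather than failing, is a design choice of this sketch; headers without brackets would otherwise raise an `AttributeError` on `.group(1)`.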

In [4]:
df_negative = fasta2df('ProducedDatasets/negative_dataset.fasta')
df_negative 

Unnamed: 0,ID,Sequence,Organism,Structure,Size,Protein Acr
0,WP_132722810.1,MTNNTDTFLHYREGKTQSQRFLKNLDPENLKLNDLDVADWLLFAFN...,Tenacibaculum sp. M341,phage baseplate protein,1042,No
1,WP_130915135.1,MKKTDTFSHYREGKSQMQRFLAELDPGNLELHDFDLFDWLLFANNF...,Chryseobacterium gleum,phage baseplate protein,1027,No
2,AYZ12950.1,MKKTDTFSHYREGKSQMQRFLAELDPGNLELHDFDLFDWLLFANNF...,Chryseobacterium arthrosphaerae,phage baseplate protein,1016,No
3,WP_124535473.1,MKKTDTFSHYREGKSQMQRFLAELDPSNLELHDFDLFDWLLFANNF...,Chryseobacterium sp. KBW03,phage baseplate protein,1017,No
4,WP_123887321.1,MKKTDTFSHYREGKSQMQRFLAELDPGNLELHDFDLFDWLLFANNF...,Chryseobacterium indoltheticum,phage baseplate protein,1016,No
...,...,...,...,...,...,...
168655,YP_009031970.1,MAMNAHTPFDAKQDWTNPYCQNSSNDPMVDALLGNAYHVVRTVYCN...,Escherichia phage Bp4,tail fiber protein,600,No
168656,AGC31500.1,MNAHTPFDAKKDWSNPYCQNSSNDPMVDALLGNAYHVVRTVYCNLG...,Escherichia phage EC1-UPM,tail fiber protein,928,No
168657,AEL79676.1,MNAHTPFDANQDWSNPYCQNKSNDPMVDALLGNAYHVVRTVYCNLG...,Escherichia phage vB_EcoP_G7C,tail fiber protein,1063,No
168658,EHJ79759.1,MDNEFYTLLTDRGMAKIASALADKKQIHLQKMAVGDGGGQYYEPTA...,Salmonella enterica subsp. enterica serovar Ba...,Phage tail fiber protein,476,No
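
Before using the table as a negative set, a few basic checks may be worthwhile, such as how many sequences are exact duplicates and how the annotations are distributed. The sketch below is one possible way to do this; the function name `summarize_negative` is hypothetical, and lower-casing the annotations is an assumption made so that labels such as "Phage tail fiber protein" and "phage tail fiber protein" are counted together.

```python
import pandas as pd


def summarize_negative(df):
    """Return basic sanity statistics for a DataFrame shaped like df_negative."""
    return {
        "n_rows": len(df),
        # Rows whose sequence already appeared earlier in the table
        "n_duplicate_sequences": int(df["Sequence"].duplicated().sum()),
        # Rows for which no organism could be extracted
        "n_missing_organism": int(df["Organism"].isna().sum()),
        # Annotation frequencies, ignoring capitalization
        "annotation_counts": df["Structure"].str.lower().value_counts().to_dict(),
    }
```

Running `summarize_negative(df_negative)` would give an overview of the dataset; no figures are quoted here because they depend on the downloaded files.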
