# Acquisition of the NEGATIVE dataset via PhANNs DB
PhANNs DB article link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7660903/

To decide whether anti-CRISPR proteins could appear among the structural proteins in the "classes" dataset of PhANNs DB, we first need to consider what anti-CRISPR proteins are and how they function within bacteriophages.

Anti-CRISPR proteins are typically produced by bacteriophages and inhibit the activity of the host's CRISPR-Cas system. They are not usually considered structural components of the phage particle; instead, they are regulatory proteins that interfere with the host's defense mechanisms. It is therefore unlikely that an anti-CRISPR protein would be classified as a structural protein in the "classes" dataset of PhANNs DB. For this reason, the negative dataset is built from the structural protein classes of that dataset. The `_other.fasta` files are skipped when merging, so that only the structural classes are kept.

The file downloaded from the PhANNs DB website is http://phanns.com/download/expandedDB.tgz. This file was chosen because it has already been pre-processed and split for validation.
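
The download and extraction step is not shown in this notebook; a minimal sketch of how it could be done with the standard library follows. The helper names `download_archive` and `extract_archive` are hypothetical, and the internal layout of the archive is an assumption: the destination folder may need adjusting so that the FASTA files end up under `PhANNs/expandedDB`, the path used in the merge step.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL quoted in the text above
PHANNS_URL = "http://phanns.com/download/expandedDB.tgz"


def download_archive(url, archive_path):
    """Download url to archive_path unless the file already exists (hypothetical helper)."""
    archive_path = Path(archive_path)
    if not archive_path.exists():
        urllib.request.urlretrieve(url, str(archive_path))
    return archive_path


def extract_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir and return its member names."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names


# Example usage (assumes the archive unpacks into an "expandedDB" folder):
# archive = download_archive(PHANNS_URL, "expandedDB.tgz")
# members = extract_archive(archive, "PhANNs")
```

Listing the returned member names before merging is a quick way to confirm which class files are present and which end in `_other.fasta`.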

In [3]:
# Conversion of selected .fasta files into a single .fasta file

import os
from Bio import SeqIO

def merge_fasta(input_dir, output_dir, output_file):
    """Concatenate every .fasta file in input_dir, except *_other.fasta, into one file."""
    output_path = os.path.join(output_dir, output_file)
    with open(output_path, "w") as out_handle:
        # Sort the file names so the record order is reproducible across runs
        for filename in sorted(os.listdir(input_dir)):
            # Select all .fasta files EXCEPT "_other.fasta" files
            if filename.endswith(".fasta") and not filename.endswith("_other.fasta"):
                filepath = os.path.join(input_dir, filename)
                # Append every record of this file to the combined FASTA file
                SeqIO.write(SeqIO.parse(filepath, "fasta"), out_handle, "fasta")

# Directory containing input FASTA files (relative path)
input_dir = "PhANNs/expandedDB"

# Output directory (current working directory)
output_dir = os.getcwd()

# Output file
output_file = "negative_dataset.fasta"

# Merge the input FASTA files into one
merge_fasta(input_dir, output_dir, output_file)

print("Combined FASTA file saved as", os.path.join(output_dir, output_file))

Combined FASTA file saved as c:\Users\chris\OneDrive\LIGIN Laptop\Documentos\Programming\UMinho - A1S2 - ProjetoBioInf\negative_dataset.fasta
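
After merging, it may be worth checking the combined file before using it as the negative dataset. The sketch below is not part of the original notebook: `summarize_fasta` is a hypothetical helper that uses only the standard library, counts the records in a FASTA file, and lists any sequence IDs that occur more than once. It assumes the ID is the first whitespace-delimited token of each header line.

```python
from collections import Counter


def summarize_fasta(path):
    """Return (number of records, sorted list of IDs that occur more than once)."""
    ids = Counter()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # The ID is taken as the first whitespace-delimited token of the header
                tokens = line[1:].split()
                ids[tokens[0] if tokens else ""] += 1
    duplicates = sorted(seq_id for seq_id, count in ids.items() if count > 1)
    return sum(ids.values()), duplicates


# Example usage on the merged file:
# n_records, duplicate_ids = summarize_fasta("negative_dataset.fasta")
# print(n_records, "records;", len(duplicate_ids), "repeated IDs")
```

Repeated IDs do not necessarily indicate an error, but they are a cue to inspect the input files before training on the merged set.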
