In this notebook, I show that the Inphared files produced by Pharokka sometimes contain an incorrect Accession ID for a given phage.
- **What happens in most cases and is the expected behaviour**: the "Accession" and "contig" columns of the Inphared file hold the same identifier, which also matches the identifier in the "contig" column of the other output tsv files from Pharokka.
- **What sometimes happens and looks like a bug**: the identifier in the "contig" column of the Inphared file matches the "contig" column of every other Pharokka output file, but the "Accession" ID in the Inphared file is the identifier of a completely different phage.
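
As a minimal sketch of the two cases above, the check for a single Inphared file comes down to comparing its "Accession" and "contig" columns. The helper name `find_accession_mismatches` and the in-memory example file are assumptions made for illustration; a real Inphared file has more columns than the two shown here.

```python
import io

import pandas as pd

# Hypothetical miniature of an Inphared top-hits file. Only the 'Accession'
# and 'contig' column names come from the notebook; other columns are omitted.
example_tsv = "Accession\tcontig\nNC_055908\tMW012634\n"


def find_accession_mismatches(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose 'Accession' differs from 'contig'."""
    return df[df["Accession"] != df["contig"]]


inphared_df = pd.read_csv(io.StringIO(example_tsv), sep="\t")
mismatches = find_accession_mismatches(inphared_df)
print(mismatches)
```

An empty result corresponds to the expected behaviour; any returned row corresponds to the second case.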

In [1]:
import pandas as pd
from Bio import SeqIO
from Bio.SeqFeature import FeatureLocation
from Bio.Seq import UndefinedSequenceError
import os

The script below checks for mismatched phage IDs in a folder of subfolders, each holding the Pharokka output for one phage. The problem is that the Inphared output file sometimes carries the wrong ID: its Accession ID corresponds to a different phage (one that is present in the complete set of subfolders, but not here, because the data used in this notebook is only a subset of all phages).

In the cases where the Accession ID in the Inphared file seems incorrect, the correct ID is in the "contig" column of that same file. We can be confident that the contig column is the correct one because every other file in the Pharokka output contains the same identifier in its contig column.
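
The reasoning above can be sketched for a single phage: if every output file reports the same contig ID, that ID is treated as the trusted one and the Inphared Accession is compared against it. The helper `consensus_contig` is hypothetical, the file names are inferred from the column names in the results table further down, and the values are those of the `sequence_16455` row of that table.

```python
def consensus_contig(contig_ids):
    """Return the ID shared by every file, or None if the files disagree."""
    unique_ids = set(contig_ids.values())
    return next(iter(unique_ids)) if len(unique_ids) == 1 else None


# Contig IDs per output file for one phage (file names are an assumption,
# reconstructed from the column names of the results table).
contig_ids = {
    "pharokka_cds_final_merged_output.tsv": "MW012634",
    "pharokka_length_gc_cds_density.tsv": "MW012634",
    "pharokka_cds_functions.tsv": "MW012634",
    "pharokka_top_hits_mash_inphared.tsv": "MW012634",
}
inphared_accession = "NC_055908"

trusted_id = consensus_contig(contig_ids)
# The Accession is only flagged when the files agree on a different ID.
accession_is_suspect = trusted_id is not None and trusted_id != inphared_accession
print(trusted_id, accession_is_suspect)
```

Returning `None` when the files disagree keeps the sketch from guessing a "correct" ID in a situation the notebook does not cover.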


In [21]:
# Directory containing the subdirectories
main_directory = '10_folders/'

# Function to create a safe column name from a filename
def safe_column_name(filename):
    return filename.replace('.', '_').replace('-', '_')

# Process the subdirectories
results = []
for subdir in os.listdir(main_directory):
    subdir_path = os.path.join(main_directory, subdir)
    if os.path.isdir(subdir_path):
        row = {'Subdirectory': subdir}

        # Loop over the files in the subdirectory
        for filename in os.listdir(subdir_path):
            if filename.endswith(".tsv"):
                file_path = os.path.join(subdir_path, filename)

                # Read the TSV file into a DataFrame
                try:
                    df = pd.read_csv(file_path, sep='\t')
                except Exception as e:
                    print(f"Error reading {file_path}: {e}")
                    continue

                # Extract a single 'contig' value
                if 'contig' in df.columns and not df['contig'].empty:
                    contig_col_name = f'contig_{safe_column_name(filename)}'
                    row[contig_col_name] = df['contig'].iloc[0]

                # Special case for 'pharokka_top_hits_mash_inphared.tsv'
                if filename == 'pharokka_top_hits_mash_inphared.tsv' and 'Accession' in df.columns and not df['Accession'].empty:
                    row['Accession_inphared'] = df['Accession'].iloc[0]

        results.append(row)

results_df = pd.DataFrame(results)

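# Drop the last column by position. Which column this is depends on the
# order in which os.listdir returned the files.
results_df = results_df.iloc[:, :-1]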

# Function to check if all contig values are the same as the Accession value
def check_contig_accession(row):
    """Return True only if every contig_* value equals the Inphared Accession."""
    # Extract the Accession value (None if the column is missing)
    accession_value = row.get('Accession_inphared')

    # Compare each contig column with the Accession value
    for col in row.index:
        if col.startswith('contig_') and row[col] != accession_value:
            return False
    return True

# Create a new column 'ContigEqualsAccession'
results_df['ContigEqualsAccession'] = results_df.apply(check_contig_accession, axis=1)


The ContigEqualsAccession column flags the issue for each phage: it is True when every contig column matches the Inphared Accession ID, and False when the mismatch occurs.

In [22]:
results_df

,Subdirectory,contig_pharokka_cds_final_merged_output_tsv,contig_pharokka_length_gc_cds_density_tsv,contig_pharokka_cds_functions_tsv,contig_pharokka_top_hits_mash_inphared_tsv,Accession_inphared,ContigEqualsAccession
0,sequence_16448.fasta.pharokka,MT708545,MT708545,MT708545,MT708545,MT708545,True
1,sequence_16455.fasta.pharokka,MW012634,MW012634,MW012634,MW012634,NC_055908,False
2,sequence_16447.fasta.pharokka,MT708548,MT708548,MT708548,MT708548,MT708548,True
3,sequence_16452.fasta.pharokka,LC589952,LC589952,LC589952,LC589952,LC589952,True
4,sequence_16453.fasta.pharokka,MW074885,MW074885,MW074885,MW074885,MW074885,True
5,sequence_16446.fasta.pharokka,MW042787,MW042787,MW042787,MW042787,MW042787,True
6,sequence_16454.fasta.pharokka,MT708549,MT708549,MT708549,MT708549,MT708549,True
7,sequence_16449.fasta.pharokka,MW013503,MW013503,MW013503,MW013503,MW013503,True
8,sequence_16443.fasta.pharokka,MW042790,MW042790,MW042790,MW042790,MW042790,True
9,sequence_16456.fasta.pharokka,CP025907,CP025907,CP025907,CP025907,CP025907,True
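
To list only the affected phages, the table can be filtered on the ContigEqualsAccession column. The sketch below rebuilds two rows of the table above in a separate DataFrame (`example_df`, a name chosen here so that it does not overwrite `results_df`) and keeps the rows where the flag is False.

```python
import pandas as pd

# Two rows copied from the results table above: one match, one mismatch.
example_df = pd.DataFrame(
    {
        "Subdirectory": [
            "sequence_16448.fasta.pharokka",
            "sequence_16455.fasta.pharokka",
        ],
        "contig_pharokka_top_hits_mash_inphared_tsv": ["MT708545", "MW012634"],
        "Accession_inphared": ["MT708545", "NC_055908"],
        "ContigEqualsAccession": [True, False],
    }
)

# Keep only the rows where the contig and Accession IDs disagree.
mismatched = example_df.loc[
    ~example_df["ContigEqualsAccession"],
    ["Subdirectory", "contig_pharokka_top_hits_mash_inphared_tsv", "Accession_inphared"],
]
print(mismatched.to_string(index=False))
```

Applied to the full `results_df`, the same filter would leave only `sequence_16455.fasta.pharokka` for the ten folders shown here.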
