# Biological Data Project

Group members:

- Alberto Calabrese

- Marlon Helbing

- Lorenzo Baietti

"A protein domain is a conserved part of a given protein sequence and tertiary structure that can evolve, function, and exist independently of the rest of the protein chain. Each domain forms a compact three-dimensional structure and often can be independently stable and folded." ([Wikipedia](https://en.wikipedia.org/wiki/Protein_domain)).

The project is about the characterization of a single domain. Each group is provided with a representative domain sequence and the corresponding Pfam identifier (see table below). The objective of the project is to build a sequence model starting from the assigned sequence and to provide a functional characterization of the entire domain family (homologous proteins).

## Input
A representative sequence of the domain family. Columns are: group, UniProt accession, organism, Pfam identifier, Pfam name, domain position in the corresponding UniProt protein, domain sequence.

```bash
UniProt : P54315 
PfamID : PF00151 
Domain Position : 18-353 
Organism : Homo sapiens (Human) 
Pfam Name : Lipase/vitellogenin 
Domain Sequence : KEVCYEDLGCFSDTEPWGGTAIRPLKILPWSPEKIGTRFLLYTNENPNNFQILLLSDPSTIEASNFQMDRKTRFIIHGFIDKGDESWVTDMCKKLFEVEEVNCICVDWKKGSQATYTQAANNVRVVGAQVAQMLDILLTEYSYPPSKVHLIGHSLGAHVAGEAGSKTPGLSRITGLDPVEASFESTPEEVRLDPSDADFVDVIHTDAAPLIPFLGFGTNQQMGHLDFFPNGGESMPGCKKNALSQIVDLDGIWAGTRDFVACNHLRSYKYYLESILNPDGFAAYPCTSYKSFESDKCFPCPDQGCPQMGHYADKFAGRTSEEQQKFFLNTGEASNF
```

## Domain model definition
The objective of the first part of the project is to build a PSSM and HMM model representing the assigned domain. The two models will be generated starting from the assigned input sequence. The accuracy of the models will be evaluated against Pfam annotations as provided in the SwissProt database.

Model Building
1. Retrieve homologous proteins starting from your input sequence by performing a BLAST search
against UniProt, UniRef50, UniRef90, or any other database

- We use https://www.uniprot.org/blast with the domain sequence as input and the following parameters:
    - target database: UniProtKB
    - E-value threshold: 0.0001
    - maximum number of hits: 1000

- The search returns 1000 hits.
- We map the hit IDs to UniProtKB, which is required to download the sequences as a .fasta file.
- This results in 'UNIPROTKB_INITIAL_ORIGINAL.FASTA'.

2. Generate a multiple sequence alignment (MSA) starting from the retrieved hits using T-Coffee,
Clustal Omega, or MUSCLE
    - We use https://www.ebi.ac.uk/jdispatcher/msa/clustalo with the following parameters:
        - Output format: FASTA

- results in ClustalOmegaUniPortAlignment_ORIGINAL.fasta

3. If necessary, edit the MSA with JalView (or with your custom script or CD-HIT) to remove
non-conserved positions (columns) and/or redundant information (rows)
    - We first used JalView with a 100% identity threshold to remove redundant rows, which left us with 155 sequences.

- results in ClustalOmegaUniProtAlignment.fasta

    - We then use 'conservation.py' to remove columns based on several metrics. For now only the gap threshold is applied; we will check whether the model is good enough before cleaning up the code.
        - gap_threshold removes columns whose fraction of gaps exceeds the threshold.
        - entropy_threshold removes columns based on the Shannon entropy of the individual amino acids in the column.
        - group_entropy_threshold computes the entropy over predefined groups of similar amino acids in the column and removes columns based on that threshold.

    - We did.... (gap_threshold, group_entropy_threshold,.....)

    - TODO: For group_entropy, it is unclear whether the MSA already takes amino acid similarity into account, in which case it might not help us, but we should try anyway.
    - TODO: Try different models based on gap_threshold, entropy_threshold and group_entropy_threshold (especially the two entropy thresholds, because in class we used entropy to obtain a better model). Our current best model is:
        - HMM built after removing redundant rows (we can always start from ClustalOmegaUniProtAlignment.fasta in the Model/Building/MSA folder, which has rows with 100% identity removed using JalView), with only a gap threshold of 0.90 applied here in the notebook and nothing else:


        Protein-Level Confusion Matrix HMM:
        True Positives: 82
        False Positives: 1
        False Negatives: 0
        True Negatives: 0


        Protein-Level Metrics HMM:
        Precision: 0.9880
        Recall: 1.0000
        F-score: 0.9939
        Balanced Accuracy: 0.5000
        MCC: 0.0000



        Residue-Level Confusion Matrix HMM:
        True Positives: 22810
        False Positives: 5440
        False Negatives: 2205
        True Negatives: 2278


        Residue-Level Metrics HMM:
        Precision: 0.8074
        Recall: 0.9119
        F-score: 0.8565
        Balanced Accuracy: 0.6035
        MCC: 0.2556



- results in trimmed_alignment.fasta
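
The metrics reported above for the current best HMM follow directly from the confusion-matrix counts. The sketch below recomputes them in plain Python. The helper name `metrics_from_counts` is ours, and setting MCC to 0 when its denominator is zero is an assumption chosen to match the reported value. It also shows why the protein-level balanced accuracy is 0.5 and MCC is 0 despite near-perfect precision and recall: the evaluated set contains no true negatives.

```python
import math

def metrics_from_counts(tp, fp, fn, tn):
    """Compute the five reported metrics from raw confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f_score = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    balanced_accuracy = (recall + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: MCC is defined as 0 when the denominator vanishes
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }

# Counts copied from the confusion matrices listed above
protein_level = metrics_from_counts(tp=82, fp=1, fn=0, tn=0)
residue_level = metrics_from_counts(tp=22810, fp=5440, fn=2205, tn=2278)

for label, metrics in (("protein", protein_level), ("residue", residue_level)):
    print(label, {name: round(value, 4) for name, value in metrics.items()})
```

Because the protein-level numbers are computed only on proteins that are either predicted or annotated, precision, recall and F-score are the informative values there; balanced accuracy and MCC only become meaningful once true negatives are counted.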



In [2]:
from Bio import AlignIO
from collections import Counter
import pandas as pd
from scipy.stats import entropy
import math
from Bio.Align import MultipleSeqAlignment
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from Bio import SeqIO
import sys
import csv
import re 
import json
import requests
import time
from typing import Dict, List, Optional, Tuple
from collections import defaultdict
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, balanced_accuracy_score, matthews_corrcoef
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
from tqdm import tqdm
import logging
from Bio import Phylo
from io import StringIO
#from ete3 import Tree, TreeStyle, NodeStyle, TextFace
import xml.etree.ElementTree as ET
from scipy.stats import fisher_exact
from wordcloud import WordCloud
import random
import obonet
import networkx as nx
from statsmodels.stats.multitest import multipletests
from goatools import obo_parser
import ast


In [40]:
class ConservationAnalyzer:
    def __init__(self, alignment_file):
        """
        Initialize with an alignment file
            alignment_file (str): Path to the alignment file
        """
        self.alignment = AlignIO.read(alignment_file, 'fasta')
        self.num_sequences = len(self.alignment)
        self.alignment_length = self.alignment.get_alignment_length()
        
    def get_column(self, pos):
        """Extract a column from the alignment"""
        return [record.seq[pos] for record in self.alignment]
    
    def calculate_gap_frequency(self, pos):
        """Calculate frequency of gaps in a column"""
        column = self.get_column(pos)
        return column.count('-') / len(column)
    
    def calculate_amino_acid_frequencies(self, pos):
        """Calculate frequencies of each amino acid in a column"""
        column = self.get_column(pos)
        total = len(column) - column.count('-')  # Don't count gaps, such that when we calculate conservation scores the gaps don't mess it up 
        if total == 0:
            return {}
        
        counts = Counter(aa for aa in column if aa != '-')
        return {aa: count/total for aa, count in counts.items()}

    def calculate_entropy(self, pos):
        """
        Calculate Shannon entropy for a column
        Lower entropy means higher conservation
        """
        freqs = self.calculate_amino_acid_frequencies(pos)
        if not freqs:
            return float('inf')  
        
        return -sum(p * math.log2(p) for p in freqs.values())
    
    def get_amino_acid_groups(self):
        """Define groups of similar amino acids 
           Based on : https://en.wikipedia.org/wiki/Conservative_replacement#:~:text=There%20are%2020%20naturally%20occurring,both%20small%2C%20negatively%20charged%20residues.
        """
        return {
            'aliphatic': set('GAVLI'),
            'hydroxyl': set('SCUTM'),
            'cyclic': set('P'),
            'aromatic': set('FYW'),
            'basic': set('HKR'),
            'acidic': set('DENQ')
        }
    

    def calculate_group_entropy(self, pos):
        """
        Calculate Shannon entropy based on amino acid groups for a column
        Lower entropy means higher conservation of amino acid groups
        
        Returns:
            float: Group entropy value (lower means more conserved groups)
                Returns float('inf') if column contains only gaps
        """
        column = self.get_column(pos)
        groups = self.get_amino_acid_groups()
        
        # Create mapping from amino acid to group
        aa_to_group = {}
        for group_name, aas in groups.items():
            for aa in aas:
                aa_to_group[aa] = group_name
                
        # Count group occurrences (excluding gaps)
        group_counts = Counter(aa_to_group.get(aa, 'other') 
                            for aa in column if aa != '-')
        
        # If column has only gaps, return infinity
        if not group_counts:
            return float('inf')
            
        # Calculate group frequencies
        total = sum(group_counts.values())
        group_freqs = {group: count/total for group, count in group_counts.items()}
        
        # Calculate entropy
        return -sum(p * math.log2(p) for p in group_freqs.values())


    def analyze_columns(self, gap_threshold=0.90, group_entropy_threshold=0.4, entropy_threshold=0.5):
        """
        Analyze all columns and return comprehensive metrics
        Returns DataFrame with various conservation metrics for each position
        """
        data = []
        
        for i in range(self.alignment_length):
            gap_freq = self.calculate_gap_frequency(i)
          #  cons_score = self.calculate_conservation_score(i)
            info_content = self.calculate_entropy(i)
         #   group_cons = self.calculate_group_conservation(i)
            group_entropy = self.calculate_group_entropy(i)
            
            data.append({
                'position': i + 1,
                'gap_frequency': gap_freq,
             #   'single_conservation': cons_score,
                'entropy': info_content,
             #   'group_conservation': group_cons,
                'group_entropy': group_entropy,
                # Removal rule (open to better ideas):
                # currently a column is dropped only if its gap frequency is too high
                # (i.e. nearly all entries in the column are gaps '-').
                # The entropy-based criteria below are disabled. Group entropy would be preferred over
                # single amino acid entropy, since groups tolerate conservative substitutions
                # (with single amino acids we would delete more columns).
                # Note: lower entropy means higher conservation, so non-conserved columns are those ABOVE a threshold.
                'suggested_remove': (gap_freq > gap_threshold)
                                 # or group_entropy > group_entropy_threshold
                                 # or info_content > entropy_threshold
            })
        
        return pd.DataFrame(data)


def remove_columns_from_alignment(input_file, output_file, columns_to_remove, format="fasta"):
    """
    Remove specified columns from a multiple sequence alignment and save to new file
    
    Args:
        input_file (str): Path to input alignment file
        output_file (str): Path where to save trimmed alignment
        columns_to_remove (list): List of column indices to remove (0-based)
        format (str): File format (default: "fasta")
    """
    # Read the alignment
    alignment = AlignIO.read(input_file, format)
    
    # Sort columns to remove in descending order
    # (so removing them doesn't affect the indices of remaining columns)
    columns_to_remove = sorted(columns_to_remove, reverse=True)
    
    # Create new alignment records
    new_records = []
    
    # Process each sequence
    for record in alignment:
        # Convert sequence to list for easier manipulation
        seq_list = list(record.seq)
        
        # Remove specified columns
        for col in columns_to_remove:
            del seq_list[col]
        
        # Create new sequence record
        new_seq = Seq(''.join(seq_list)) # Join the list element to a string again (i.e. after removal of amino acids out of sequence represented as list, turn into one string again) and turn into Seq object
        new_record = SeqRecord(new_seq,
                            id=record.id,
                            name=record.name,
                            description=record.description)
        new_records.append(new_record)
    
    # Create new alignment
    # TODO : Maybe we have to add some variables here (i.e. how to do the MSA)!
    new_alignment = MultipleSeqAlignment(new_records)
    
    # Write to file
    AlignIO.write(new_alignment, output_file, format)
    
    return new_alignment
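
To make the difference between the two entropy measures in `ConservationAnalyzer` concrete, here is a minimal, self-contained sketch on a toy column. It reuses the same amino acid groups as above, but the helper `shannon_entropy` and the example column are ours, for illustration only. A column that mixes A, V and L has a high residue-level entropy, yet a group-level entropy of zero because all three residues are aliphatic.

```python
import math
from collections import Counter

# Same groups as in get_amino_acid_groups() above
GROUPS = {
    'aliphatic': set('GAVLI'),
    'hydroxyl': set('SCUTM'),
    'cyclic': set('P'),
    'aromatic': set('FYW'),
    'basic': set('HKR'),
    'acidic': set('DENQ'),
}
AA_TO_GROUP = {aa: name for name, aas in GROUPS.items() for aa in aas}

def shannon_entropy(symbols):
    """Shannon entropy (in bits) of a collection of symbols; inf if it is empty."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return float('inf')
    # "+ 0.0" turns a possible -0.0 into 0.0 for fully conserved columns
    return -sum((c / total) * math.log2(c / total) for c in counts.values()) + 0.0

column = "AAVVLL--"                      # toy alignment column with two gaps
residues = [aa for aa in column if aa != '-']

residue_entropy = shannon_entropy(residues)
group_entropy = shannon_entropy(AA_TO_GROUP.get(aa, 'other') for aa in residues)

print(f"residue entropy: {residue_entropy:.3f} bits")
print(f"group entropy:   {group_entropy:.3f} bits")
```

With three equally frequent residues the residue entropy is log2(3), about 1.585 bits, so a residue-level threshold would treat this column as variable while a group-level threshold would treat it as conserved.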

In [41]:
if __name__ == "__main__":
    # Initialize analyzer 
    analyzer = ConservationAnalyzer("Model/Building/MSA/ClustalOmegaUniProtAlignment.fasta")
    
    # Get comprehensive analysis
    analysis = analyzer.analyze_columns(gap_threshold=0.90)
   # analysis_2 = analyzer.analyze_rows()
    
    # Print summary statistics
    print("\nAlignment Summary:")
    print(f"Number of sequences: {analyzer.num_sequences}")
    print(f"Alignment length: {analyzer.alignment_length}")


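    # Count columns flagged for removal (True) and kept (False).
    # Summing the boolean column avoids a KeyError when one of the two values does not occur.
    counts_true = int(analysis['suggested_remove'].sum())  # To be removed
    counts_false = len(analysis) - counts_true  # To be kept

    # Note: the first number printed is a fraction (0 to 1), not a percentage
    print(f"With the current removal tactic, we would remove {(counts_true / (counts_true + counts_false)):.2f} percent of columns ; we keep {counts_false} of {counts_false + counts_true} columns")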
    

    # Save detailed analysis to CSV
    analysis.to_csv("Model/Building/MSA_processed/conservation_analysis.csv", index=False)


    # Get indices of columns marked for removal
    columns_to_remove = analysis[analysis['suggested_remove']]['position'].values.tolist()
    # Convert to 0-based indices (if positions were 1-based)
    columns_to_remove = [x-1 for x in columns_to_remove]
    
    # Remove columns and save new alignment
    new_alignment = remove_columns_from_alignment(
        "Model/Building/MSA/ClustalOmegaUniProtAlignment.fasta",
        "Model/Building/MSA_processed/trimmed_alignment.fasta",
        columns_to_remove
    )


        


    print(f"Original alignment length: {analyzer.alignment_length}")
    print(f"Number of columns removed: {len(columns_to_remove)}")
    print(f"New alignment length: {new_alignment.get_alignment_length()}")


Alignment Summary:
Number of sequences: 155
Alignment length: 3254
With the current removal tactic, we would remove 0.68 percent of columns ; we keep 1043 of 3254 columns
Original alignment length: 3254
Number of columns removed: 2211
New alignment length: 1043


The file trimmed_alignment.fasta causes problems during the construction of the PSSM, so it is necessary to remove some columns and some sequences from the alignment to avoid too many consecutive gaps. The new alignment is filtered_trimmed_alignment.fasta, and it is the one that must be used with PSI-BLAST.

In [None]:
def filter_sequences_and_columns(input_file, output_file, alpha, gap_threshold):
    """
    Filter an alignment in two steps:
      1. remove sequences whose fraction of gaps is above `alpha`;
      2. trim columns from both ends while any remaining sequence has a run of
         `gap_threshold` consecutive gaps at that end.
    """
    # Load the alignment
    alignment = AlignIO.read(input_file, "fasta")

    # Step 1: Filter sequences based on gap fraction
    filtered_sequences = []
    for record in alignment:
        gap_count = record.seq.count('-')
        gap_fraction = gap_count / len(record.seq)
        if gap_fraction <= alpha:
            filtered_sequences.append(record)

    if not filtered_sequences:
        # Raise an exception rather than exiting the interpreter from inside a helper function
        raise ValueError("No sequences left after filtering based on gap fraction.")

    filtered_alignment = MultipleSeqAlignment(filtered_sequences)

    # Step 2: Trim columns from both ends based on runs of consecutive gaps
    def trim_columns_based_on_gaps(alignment, gap_threshold):
        """
        Remove columns from the left and right ends of the alignment until no
        sequence begins (or ends) with `gap_threshold` consecutive gaps
        """
        alignment_length = alignment.get_alignment_length()
        start, end = 0, alignment_length - 1

        # Move the left boundary right while some sequence has an all-gap window starting there
        while start < alignment_length:
            if any(record.seq[start:start + gap_threshold].count('-') == gap_threshold for record in alignment):
                start += 1
            else:
                break

        # Move the right boundary left while some sequence has an all-gap window ending there
        # (stop once the window would extend past the first column)
        while end - gap_threshold + 1 >= 0:
            if any(record.seq[end - gap_threshold + 1:end + 1].count('-') == gap_threshold for record in alignment):
                end -= 1
            else:
                break

        # Trim the alignment to the columns start..end (inclusive)
        trimmed_sequences = []
        for record in alignment:
            new_seq = record.seq[start:end + 1]
            trimmed_sequences.append(SeqRecord(new_seq, id=record.id, description=record.description))

        return MultipleSeqAlignment(trimmed_sequences)

    final_alignment = trim_columns_based_on_gaps(filtered_alignment, gap_threshold)

    # Save the filtered and trimmed alignment
    AlignIO.write(final_alignment, output_file, "fasta")
    print(f"Filtered and trimmed alignment saved to {output_file}")

# Parameters
input_file = "Model/Building/MSA_processed/trimmed_alignment.fasta"
output_file = "Model/Building/MSA_processed/filtered_trimmed_alignment.fasta"
gap_threshold = 50  # Length of the run of consecutive gaps that triggers trimming at the beginning or end
alpha = 0.79  # Maximum fraction of gaps allowed per sequence (change as needed)
# We chose 0.79 because in trimmed_alignment.fasta a few sequences have a lot of gaps, while the other sequences have at most 60% gaps

filter_sequences_and_columns(input_file, output_file, alpha, gap_threshold)

Filtered and trimmed alignment saved to Model/Building/MSA_processed/filtered_trimmed_alignment.fasta
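
To choose `alpha` and `gap_threshold` it helps to look at each sequence's gap fraction and its longest run of consecutive gaps. The following is a minimal sketch on invented toy sequences. The helper `gap_stats` is hypothetical and not part of the notebook; on the real data one would apply it to `str(record.seq)` for each record returned by `AlignIO.read`.

```python
import re

def gap_stats(seq):
    """Return (gap fraction, length of the longest run of '-') for one aligned sequence."""
    runs = re.findall(r"-+", seq)
    longest_run = max((len(run) for run in runs), default=0)
    return seq.count('-') / len(seq), longest_run

# Toy aligned sequences (invented for illustration)
toy_alignment = {
    "seq1": "MK--LV----AG",
    "seq2": "MKTALVPQRSAG",
    "seq3": "------LVQRAG",
}

stats = {name: gap_stats(seq) for name, seq in toy_alignment.items()}
for name, (fraction, longest_run) in stats.items():
    print(f"{name}: gap fraction = {fraction:.2f}, longest gap run = {longest_run}")
```

Sequences whose gap fraction stands out from the rest are candidates for removal via `alpha`, while long leading or trailing runs indicate how aggressive the end trimming needs to be.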


4. Build a PSSM model starting from the MSA

- First navigate to the folder where ncbi-blast-2.16.0+ was installed

- Then run this terminal command:
```bash
ncbi-blast-2.16.0+/bin/psiblast -subject data/protein_family/filtered_trimmed_alignment.fasta -in_msa data/protein_family/filtered_trimmed_alignment.fasta -out_ascii_pssm data/protein_family/trimmed_alignment.pssm_ascii -out_pssm data/protein_family/trimmed_alignment.pssm
```

Note that filtered_trimmed_alignment.fasta needs to be in the folder used in the command (data/protein_family/).


5. Build a HMM model starting from the MSA

- First navigate to the folder where hmmer-3.4 was installed

- Then run this terminal command:
```bash
hmmer-3.4/src/hmmbuild data/protein_family/trimmed_alignment.hmm data/protein_family/trimmed_alignment.fasta
```
    Note that the trimmed_alignment.fasta needs to be in the described folder (data/protein_family/)



## Models evaluation
1. Generate predictions. Run HMM-SEARCH and PSI-BLAST with your models against
SwissProt.

    - Collect the list of retrieved hits

    - Collect matching positions of your models in the retrieved hits

PSI-BLAST :
- On macOS, we first have to clear the extended file attributes and make psiblast and makeblastdb executable (in a terminal, go to the folder where psiblast and makeblastdb are located and run this)
```bash
    xattr -c psiblast
    chmod +x psiblast
    xattr -c makeblastdb
    chmod +x makeblastdb
```
- Then use this command to build a BLAST protein database from the SwissProt FASTA file (makeblastdb converts the FASTA file into the indexed database files that psiblast later searches via `-db swissprot`)
```bash
      ./ncbi-blast-2.16.0+/bin/makeblastdb -in uniprot_sprot.fasta -dbtype prot -out swissprot
```
Note that uniprot_sprot.fasta needs to be downloaded and located in the terminal's current working directory.


- Finally, the actual predictions can be obtained with the following command
```bash
        ./ncbi-blast-2.16.0+/bin/psiblast -in_pssm trimmed_alignment.pssm \
            -db swissprot \
            -out psiblastsearch_output.txt \
            -outfmt "6 qseqid sseqid qstart qend sstart send pident evalue"
```
where 
```bash
    qseqid: Query sequence identifier (your domain)
    sseqid: Subject sequence identifier (matched protein)
    qstart: Start position in your query domain
    qend: End position in your query domain
    sstart: Start position in the matched sequence
    send: End position in the matched sequence
    pident: Percentage of identical matches
    evalue: Expectation value (statistical significance)
```
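
With the custom `-outfmt 6` string above, each hit is written as one tab-separated line whose columns appear in the order listed. As a minimal sketch, the lines can be mapped to named fields with the standard `csv` module. The function name `read_outfmt6` and the example row (including the accession `P00000`) are invented for illustration and are not real search output.

```python
import csv
import io

# Column order must match the -outfmt string used in the psiblast command
FIELDS = ["qseqid", "sseqid", "qstart", "qend", "sstart", "send", "pident", "evalue"]

def read_outfmt6(handle):
    """Parse tab-separated BLAST tabular output into a list of dicts with typed values."""
    hits = []
    for row in csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t"):
        for key in ("qstart", "qend", "sstart", "send"):
            row[key] = int(row[key])
        row["pident"] = float(row["pident"])
        row["evalue"] = float(row["evalue"])
        # Assumes SwissProt-style subject IDs of the form sp|ACCESSION|NAME
        row["uniprot_id"] = row["sseqid"].split("|")[1]
        hits.append(row)
    return hits

# Invented example line, for illustration only
example = "Query_1\tsp|P00000|EXAMPLE_HUMAN\t1\t300\t20\t330\t45.5\t1e-50\n"
hits = read_outfmt6(io.StringIO(example))
print(hits[0]["uniprot_id"], hits[0]["sstart"], hits[0]["send"], hits[0]["evalue"])
```

Naming the fields once keeps the later steps independent of column positions, so the `-outfmt` string can be extended without touching the downstream code.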

HMMER :

- Simply use 
```bash
    ./hmmer-3.4/src/hmmsearch trimmed_alignment.hmm uniprot_sprot.fasta > hmmsearch_output.txt
```

- To make both output files easier to work with, we wrote two parsing scripts
    - they produce two .csv files that are easier to compare, also directly by eye
    - these .csv files contain the matching positions of our two models in the retrieved hits
```bash
Returns psiblastsearch_output.csv
Returns hmmsearch_output.csv
```

- The PSI-BLAST output was easy to parse, since there was always only one domain hit per protein

In [4]:
"""FOR PSIBLAST"""
def parse_psiblast_output(input_file):
    results = []
    
    with open(input_file, 'r') as f:
        for line in f:
            # Skip empty lines
            if not line.strip():
                continue
                
            # Split the line by tabs or multiple spaces
            parts = re.split(r'\s+', line.strip())
            
            if len(parts) >= 8:  # Make sure we have all required fields
                query_id = parts[0]
                subject_id = parts[1]
                
                # Extract the UniProt ID from subject_id
                # Format is usually sp|UniprotID|Name
                subject_parts = subject_id.split('|')
                if len(subject_parts) >= 2:
                    uniprot_id = subject_parts[1]
                    
                    # Create result dictionary
                    result = {
                        'protein_name': subject_id,
                        'uniprot_id': uniprot_id,
                        'organism': 'N/A',  # PSIBLAST output doesn't include organism
                        'domain_start': int(parts[4]),  # sstart
                        'domain_end': int(parts[5]),    # send
                        'domain_length': int(parts[5]) - int(parts[4]) + 1,
                        'E-value': float(parts[7])  
                    }
                    results.append(result)
    
    return results

def write_csv(results, output_file):
    if not results:
        return
    # Notice that we skip the start & end positions in the query domain (i.e. the PSSM here), as we are only interested in where we found matches in the sequence of SwissProt we looked through
    fieldnames = ['protein_name', 'uniprot_id', 'organism', 'domain_start', 
                 'domain_end', 'domain_length', 'E-value']
    
    with open(output_file, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(results)


input_file = 'Model/Evaluation/Predictions/PSI-BLAST/psiblastsearch_output.txt'
output_file = 'Model/Evaluation/Predictions/PSI-BLAST/psiblastsearch_output.csv'

results = parse_psiblast_output(input_file)
write_csv(results, output_file)



The HMM output was more complicated to parse, because for some predicted proteins there were multiple domain hits. For each protein we decided to keep only the domain hit with the lowest E-value, i.e. the most significant one.

In [44]:
"""FOR HMM"""
# File paths
input_file_path = "Model/Evaluation/Predictions/HMM-SEARCH/hmmsearch_output.txt"
output_file_path = "Model/Evaluation/Predictions/HMM-SEARCH/hmmsearch_output.csv"

# Initialize storage for parsed data
parsed_data = []

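# Regular expressions to capture key information
header_regex = r">> ([^\s]+)"
# Domain table row: index, [!?], score (may be negative), bias, c-Evalue, i-Evalue (captured),
# hmm from, hmm to, HMM boundary flags ("[.", "..", ".]" or "[]"), ali from and ali to (both captured)
domain_regex = r"\s+(\d+) [!?]\s+-?[\d\.]+\s+[\d\.]+\s+[\de\.\+\-]+\s+([\de\.\+\-]+)\s+\d+\s+\d+\s+[\[\.][\.\]]\s+(\d+)\s+(\d+)"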

with open(input_file_path, "r") as infile:
    current_protein = None

    for line in infile:
        # Match protein header line
        header_match = re.match(header_regex, line)
        if header_match:
            # If we already captured a protein, save its data
            if current_protein:
                # Sort domains by E-value and keep only the best one (!)
                if current_protein["domains"]:
                    current_protein["domains"] = [min(current_protein["domains"], key=lambda x: x[0])]
                parsed_data.append(current_protein)

            # Start a new protein record
            protein_id = header_match.groups()[0]
            current_protein = {
                "protein_name": protein_id.split("|")[2],
                "uniprot_id": protein_id.split("|")[1],
                "domains": []
            }

        # Match domain annotation
        domain_match = re.match(domain_regex, line)
        if domain_match and current_protein:
            # Captured groups: domain index, i-Evalue, alignment start, alignment end
            _, evalue, start, end = domain_match.groups()
            start, end, evalue = int(start), int(end), float(evalue)
            length = end - start + 1
            current_protein["domains"].append((evalue, start, end, length))

    # Handle the last protein record
    if current_protein:
        if current_protein["domains"]:
            current_protein["domains"] = [min(current_protein["domains"], key=lambda x: x[0])]
        parsed_data.append(current_protein)

# Define fixed fieldnames
fieldnames = ["protein_name", "uniprot_id", "E-value", "domain_start", "domain_end", "domain_length"]

# Write to CSV
with open(output_file_path, "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    for protein in parsed_data:
        row = {
            "protein_name": protein["protein_name"],
            "uniprot_id": protein["uniprot_id"]
        }
        if protein["domains"]:  # Check if there are any domains
            # If there is, since we only keep the best domain, we can just take the first one
            best_domain = protein["domains"][0]  
            row["E-value"] = best_domain[0]
            row["domain_start"] = best_domain[1]
            row["domain_end"] = best_domain[2]
            row["domain_length"] = best_domain[3]
        writer.writerow(row)
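
As an alternative to parsing the human-readable report with regular expressions, hmmsearch can also write a per-domain table with its `--domtblout` option. This is not what we did above. The sketch below assumes the HMMER 3 domtblout layout (whitespace-separated columns, with the independent E-value in column 13 and the alignment start and end in columns 18 and 19, counting from 1), and the two example lines are invented. It applies the same rule as our parser: keep the domain with the lowest E-value for each protein.

```python
def best_domain_per_protein(lines):
    """Keep, for each target protein, the domain hit with the lowest i-Evalue."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.split(maxsplit=22)  # the last column is a free-text description
        target = cols[0]
        i_evalue = float(cols[12])
        ali_from, ali_to = int(cols[17]), int(cols[18])
        hit = {
            "i_evalue": i_evalue,
            "domain_start": ali_from,
            "domain_end": ali_to,
            "domain_length": ali_to - ali_from + 1,
        }
        if target not in best or i_evalue < best[target]["i_evalue"]:
            best[target] = hit
    return best

# Invented example lines in domtblout layout (two domains of one hypothetical protein)
example_lines = [
    "# target name accession tlen query name ...",
    "sp|P00000|EX1_HUMAN - 465 trimmed_alignment - 1043 1e-100 330.0 0.1 1 2 1e-90 2e-88 300.0 0.0 1 900 20 330 18 335 0.95 Example protein",
    "sp|P00000|EX1_HUMAN - 465 trimmed_alignment - 1043 1e-100 330.0 0.1 2 2 1e-07 3e-05 30.0 0.1 910 1000 340 400 338 405 0.80 Example protein",
]

best = best_domain_per_protein(example_lines)
for target, hit in best.items():
    print(target, hit)
```

Working from the table avoids the fragile dependence on the exact spacing and boundary symbols of the text report.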

2. Define your ground truth. Find all proteins in SwissProt annotated (and not annotated) with the assigned Pfam domain

    - Collect the list of proteins matching the assigned Pfam domain

    - Collect matching positions of the Pfam domain in the retrieved sequences. Domain positions are available [here](ftp://ftp.ebi.ac.uk/pub/databases/interpro/current/protein2ipr.dat.gz) (large tsv file) or using the [InterPro API](https://github.com/ProteinsWebTeam/interpro7-api/tree/master/docs) or align the Pfam domain yourself against SwissProt (HMMSEARCH)

        - For this, we decided to use the InterPro API:

        https://www.ebi.ac.uk/interpro/api/protein/reviewed/entry/pfam/PF00151/

        which returns the reviewed (SwissProt) proteins that match our assigned Pfam domain, PF00151

        - To retrieve all result pages from this endpoint, we wrote a small fetcher that follows the pagination links and saves everything to a single file

```bash
    Returns pfam_domain_positions.json 
```


In [9]:
class InterProAPIFetcher:
    def __init__(self, base_url: str):
        self.base_url = base_url
        self.processed_count = 0
        self.all_results = []
        self.seen_accessions = set()  # Track unique protein accessions
        self.duplicate_count = 0

    def fetch_page(self, url: str) -> Optional[Dict]:
        max_retries = 3
        retry_delay = 2
        
        for attempt in range(max_retries):
            try:
                # A timeout prevents the loop from hanging forever on a stalled connection
                response = requests.get(url, timeout=30)
                response.raise_for_status()
                return response.json()
            except requests.exceptions.RequestException as e:
                print(f"Attempt {attempt + 1} failed: {str(e)}")
                if attempt < max_retries - 1:
                    print(f"Waiting {retry_delay} seconds before retrying...")
                    time.sleep(retry_delay)
                    retry_delay *= 2
                else:
                    print("Max retries reached. Moving on...")
                    return None

    def fetch_all_pages(self) -> List[Dict]:
        next_url = self.base_url
        total_count = None
        page_number = 1
        
        while next_url:
            print(f"\nFetching page {page_number}...")
            print(f"URL: {next_url}")
            
            page_data = self.fetch_page(next_url)
            
            if page_data is None:
                print("Failed to fetch page. Stopping pagination.")
                break
            
            if total_count is None:
                total_count = page_data['count']
                print(f"API reports total count: {total_count}")
            
            # Check for duplicates in this page
            new_proteins = []
            page_duplicates = 0
            
            for protein in page_data['results']:
                accession = protein['metadata']['accession']
                if accession in self.seen_accessions:
                    page_duplicates += 1
                    self.duplicate_count += 1
                else:
                    self.seen_accessions.add(accession)
                    new_proteins.append(protein)
            
            print(f"Page {page_number} stats:")
            print(f"- Proteins in response: {len(page_data['results'])}")
            print(f"- New unique proteins: {len(new_proteins)}")
            print(f"- Duplicates found: {page_duplicates}")
            
            self.all_results.extend(new_proteins)
            self.processed_count = len(self.all_results)
            
            next_url = page_data.get('next')
            page_number += 1
            
            time.sleep(1)
        
        print(f"\nFinal Statistics:")
        print(f"Total unique proteins: {len(self.all_results)}")
        print(f"Total duplicates found: {self.duplicate_count}")
        print(f"Total processed entries: {self.processed_count + self.duplicate_count}")
        
        return self.all_results

    def save_results(self, filename: str):
        output_data = {
            'count': len(self.all_results),
            'results': self.all_results
        }
        
        with open(filename, 'w') as f:
            json.dump(output_data, f, indent=2)
        print(f"\nResults saved to {filename}")
        print(f"File contains {len(self.all_results)} unique proteins")

def main():
    base_url = "https://www.ebi.ac.uk/interpro/api/protein/reviewed/entry/pfam/PF00151/"
    
    fetcher = InterProAPIFetcher(base_url)
    fetcher.fetch_all_pages()
    fetcher.save_results('Model/Evaluation/Ground Truth/pfam_domain_positions.json')

if __name__ == "__main__":
    main()


Fetching page 1...
URL: https://www.ebi.ac.uk/interpro/api/protein/reviewed/entry/pfam/PF00151/
API reports total count: 82
Page 1 stats:
- Proteins in response: 20
- New unique proteins: 20
- Duplicates found: 0

Fetching page 2...
URL: https://www.ebi.ac.uk/interpro/api/protein/reviewed/entry/pfam/PF00151/?cursor=source%3As%3Ap0dmb4
Page 2 stats:
- Proteins in response: 20
- New unique proteins: 20
- Duplicates found: 0

Fetching page 3...
URL: https://www.ebi.ac.uk/interpro/api/protein/reviewed/entry/pfam/PF00151/?cursor=source%3As%3Ap51528
Page 3 stats:
- Proteins in response: 20
- New unique proteins: 20
- Duplicates found: 0

Fetching page 4...
URL: https://www.ebi.ac.uk/interpro/api/protein/reviewed/entry/pfam/PF00151/?cursor=source%3As%3Aq5rbq5
Page 4 stats:
- Proteins in response: 20
- New unique proteins: 20
- Duplicates found: 0

Fetching page 5...
URL: https://www.ebi.ac.uk/interpro/api/protein/reviewed/entry/pfam/PF00151/?cursor=source%3As%3Aq9u6w0
Page 5 stats:
- Protein

- After parsing, we also convert the .json file to .csv for further processing (and, again, easier inspection by eye).

For this we wrote another script:
```bash
Returns pfam_domain_positions.csv
```

In [12]:
import json

import pandas as pd


def extract_pfam_info(json_file):
    # Read the JSON file
    with open(json_file, 'r') as f:
        data = json.load(f)
    
    # List to store the extracted information
    pfam_matches = []
    
    # Iterate through each result in the JSON data
    for result in data['results']:
        # Extract protein metadata
        protein_info = {
            'protein_name': result['metadata']['name'],
            'uniprot_id': result['metadata']['accession'],
            'organism': result['metadata']['source_organism']['scientificName']
        }
        
        # Extract PFAM domain information
        # We know there's only one entry because we queried for a specific PFAM domain
        pfam_entry = result['entries'][0]
        
        # Get the domain fragments (start and end positions)
        for location in pfam_entry['entry_protein_locations']:
            for fragment in location['fragments']:
                domain_info = {
                    **protein_info,  # Include all protein information
                    'domain_start': fragment['start'],
                    'domain_end': fragment['end'],
                    'domain_length': fragment['end'] - fragment['start'] + 1,
                    'protein_length': pfam_entry['protein_length'],
                    'score': location['score']
                }
                pfam_matches.append(domain_info)
    
    return pfam_matches

# Use the function to extract information
json_file = 'Model/Evaluation/Ground Truth/pfam_domain_positions.json'
matches = extract_pfam_info(json_file)


# Convert the matches to a DataFrame
df = pd.DataFrame(matches)

# Save to CSV
df.to_csv('Model/Evaluation/Ground Truth/pfam_domain_positions.csv', index=False)
print("\nResults have been saved to 'Model/Evaluation/Ground Truth/pfam_domain_positions.csv'")


Results have been saved to 'Model/Evaluation/Ground Truth/pfam_domain_positions.csv'


- Finally, we now have three similarly structured .csv files: one holding the ground truth and two holding the predictions of the PSSM and HMM models.



3. Compare your model with the assigned Pfam. Calculate the precision, recall, F-score, balanced accuracy, MCC

    - Comparison at the protein level. Measure the ability of your model to retrieve the same proteins matched by Pfam

    - Comparison at the residue level. Measure the ability of your model to match the same position matched by Pfam
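
As a rough illustration of how the five requested metrics follow from the confusion-matrix counts, here is a minimal standard-library sketch. The function name `binary_metrics` and the convention of returning 0.0 for undefined ratios are assumptions made for this example, not part of the project code. It also shows a degenerate case: when the evaluated set is only the union of predicted and Pfam proteins, no protein can be a true negative, so balanced accuracy is stuck at 0.5 and MCC at 0 whenever there is at least one false positive and no false negative.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, F-score, balanced accuracy and MCC from counts."""

    def safe_div(num: float, den: float) -> float:
        # Assumed convention: undefined ratios (zero denominator) become 0.0
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)        # sensitivity, true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    f_score = safe_div(2 * tp, 2 * tp + fp + fn)
    balanced_accuracy = (recall + specificity) / 2
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_denominator)

    return {
        "Precision": precision,
        "Recall": recall,
        "F-score": f_score,
        "Balanced Accuracy": balanced_accuracy,
        "MCC": mcc,
    }


# Illustrative counts with no true negatives
metrics = binary_metrics(tp=82, fp=1, fn=0, tn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A more informative balanced accuracy and MCC would need an explicit negative set, for example reviewed proteins that carry neither a Pfam annotation nor a model hit.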


- To this end, we wrote an evaluation script.

    - The comparison at the protein level is straightforward: we only check whether each protein is among the Pfam-annotated ones.
    - The comparison at the residue level is more involved: for each protein we build one binary residue vector for the ground-truth domain and one for the predicted domain, then compare them position by position to see where the two agree and where they do not (see the code for details).
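
The residue-level idea can be sketched on a toy example. The coordinates below are hypothetical, and the helper `residue_vector` is introduced only for illustration; it uses plain lists instead of NumPy arrays. As in the approach described above, the vectors only span positions up to the later of the two domain ends, so residues beyond that point are not counted as true negatives.

```python
def residue_vector(start: int, end: int, length: int) -> list:
    """Binary vector with 1 on the 1-based inclusive interval [start, end]."""
    return [1 if start <= pos <= end else 0 for pos in range(1, length + 1)]


# Hypothetical domain boundaries (1-based, inclusive)
true_start, true_end = 3, 8     # ground-truth (Pfam) domain
pred_start, pred_end = 5, 10    # predicted (PSSM/HMM) domain

length = max(true_end, pred_end)
y_true = residue_vector(true_start, true_end, length)
y_pred = residue_vector(pred_start, pred_end, length)

# Position-by-position comparison of the two vectors
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

print(y_true)
print(y_pred)
print(tp, fp, fn, tn)
```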





In [9]:
"""Based on the .csv files that closely resemble each other, do the calculations"""


def create_residue_vectors(pred_df, pfam_df, protein_id):
    """
    For a single protein (based on its protein_id), create residue vectors for the ground truth and the prediction to compare them
    in their matching positions.
    """
 
    # Find both the ground truth and the prediction for the given protein
    pred_match = pred_df[pred_df['uniprot_id'] == protein_id]
    pfam_match = pfam_df[pfam_df['uniprot_id'] == protein_id]
    


    # Use the larger of the two domain end positions as the vector length
    # Note: only the first row (first hit) per protein is considered
    max_length = int(max(
        pred_match['domain_end'].iloc[0],
        pfam_match['domain_end'].iloc[0]
    ))
    
    # Create two binary vectors of the same size, one entry per residue position 1..max_length
    # (stored at indices 0..max_length-1), so they can be compared directly
    true_positions = np.zeros(max_length)
    pred_positions = np.zeros(max_length)
    

    # Fill in PFAM (true) positions
    pfam_row = pfam_match.iloc[0]
    start = int(pfam_row['domain_start'] - 1)  # Convert to 0-based indexing
    end = int(pfam_row['domain_end'])
    true_positions[start:end] = 1
        
    # Fill in PSSM/HMM (predicted) positions
    pred_row = pred_match.iloc[0]
    start = int(pred_row['domain_start'] - 1)  # Convert to 0-based indexing
    end = int(pred_row['domain_end'])
    pred_positions[start:end] = 1
    
    return true_positions, pred_positions




def evaluate_model(psiblast_file, hmm_file, pfam_file, only_found=False, e_threshold=0.0001):
    """
    Evaluate PSI-BLAST (PSSM) and HMM model performance against Pfam annotations
    
    Parameters:
    - psiblast_file: Path to PSIBLAST results CSV
    - hmm_file : Path to HMM results CSV
    - pfam_file: Path to Pfam ground truth CSV
    - only_found: If True, only evaluate proteins found by the respective model (PSIBLAST or HMM)
    - e_threshold: Maximum E-value for a hit to count as a prediction
    """
    # Step 1: Load the three CSV files
    psiblast_df = pd.read_csv(psiblast_file)
    hmm_df = pd.read_csv(hmm_file)
    pfam_df = pd.read_csv(pfam_file)

    # HMM finds a lot of hits, many with extremely high e-values: keep only those with an E-value at or below the threshold
    # We filter based on the first domain hit 
    filtered_hmm_proteins = hmm_df[hmm_df['E-value'] <= e_threshold]['uniprot_id']

    # PSIBLAST also finds some hits with high e-values, so we filter them out the same way
    filtered_psiblast_proteins = psiblast_df[psiblast_df['E-value'] <= e_threshold]['uniprot_id']
    
    # Step 2: Get unique list of proteins from both files
    psiblast_proteins = set(filtered_psiblast_proteins)
    hmm_proteins = set(filtered_hmm_proteins)
    pfam_proteins = set(pfam_df['uniprot_id'])

    
    if only_found:
        # Only consider proteins that PSIBLAST/HMM found
        all_proteins_psiblast = psiblast_proteins
        all_proteins_hmm = hmm_proteins
        print("\nEvaluating only PSIBLAST/HMM-found proteins:")
    else:
        # Consider all proteins from both sets
        all_proteins_psiblast = psiblast_proteins.union(pfam_proteins)
        all_proteins_hmm = hmm_proteins.union(pfam_proteins)
        print("\nEvaluating all proteins:")
    
    print(f"Number of proteins predicted by PSIBLAST: {len(psiblast_proteins)}")
    print(f"Number of proteins predicted by HMM: {len(hmm_proteins)}")
    print(f"Number of proteins in Pfam ground truth: {len(pfam_proteins)}")
    print(f"Number of proteins being evaluated for PSIBLAST: {len(all_proteins_psiblast)}")
    print(f"Number of proteins being evaluated for HMM: {len(all_proteins_hmm)}")
    

    print("\n=== Protein-Level Evaluation ===")
    # Step 3: Create binary vectors for true and predicted labels
    y_true_psiblast = []  # Ground truth from Pfam
    y_pred_psiblast = []  # Predictions from PSIBLAST
    y_true_hmm = []
    y_pred_hmm = []


    
    for protein in all_proteins_psiblast:
        y_true_psiblast.append(1 if protein in pfam_proteins else 0)
        y_pred_psiblast.append(1 if protein in psiblast_proteins else 0)


    for protein in all_proteins_hmm:
        y_true_hmm.append(1 if protein in pfam_proteins else 0)
        y_pred_hmm.append(1 if protein in hmm_proteins else 0)

    # So we have something like
    # y_true_psiblast 0 0 1 0 1 ...
    # y_pred_psiblast 0 1 1 0 1 ...
    
    # Step 4: Calculate performance metrics
    protein_results_psiblast = {
        'Precision': precision_score(y_true_psiblast, y_pred_psiblast),
        'Recall': recall_score(y_true_psiblast, y_pred_psiblast),
        'F-score': f1_score(y_true_psiblast, y_pred_psiblast),
        'Balanced Accuracy': balanced_accuracy_score(y_true_psiblast, y_pred_psiblast),
        'MCC': matthews_corrcoef(y_true_psiblast, y_pred_psiblast)
    }


    protein_results_hmm = {
        'Precision': precision_score(y_true_hmm, y_pred_hmm),
        'Recall': recall_score(y_true_hmm, y_pred_hmm),
        'F-score': f1_score(y_true_hmm, y_pred_hmm),
        'Balanced Accuracy': balanced_accuracy_score(y_true_hmm, y_pred_hmm),
        'MCC': matthews_corrcoef(y_true_hmm, y_pred_hmm)
    }


    # Step 5: Calculate confusion matrix components
    tp_psiblast = sum(1 for t, p in zip(y_true_psiblast, y_pred_psiblast) if t == 1 and p == 1)
    fp_psiblast = sum(1 for t, p in zip(y_true_psiblast, y_pred_psiblast) if t == 0 and p == 1)
    fn_psiblast = sum(1 for t, p in zip(y_true_psiblast, y_pred_psiblast) if t == 1 and p == 0)
    tn_psiblast = sum(1 for t, p in zip(y_true_psiblast, y_pred_psiblast) if t == 0 and p == 0)


    tp_hmm = sum(1 for t, p in zip(y_true_hmm, y_pred_hmm) if t == 1 and p == 1)
    fp_hmm = sum(1 for t, p in zip(y_true_hmm, y_pred_hmm) if t == 0 and p == 1)
    fn_hmm = sum(1 for t, p in zip(y_true_hmm, y_pred_hmm) if t == 1 and p == 0)
    tn_hmm = sum(1 for t, p in zip(y_true_hmm, y_pred_hmm) if t == 0 and p == 0)
    
    # Print detailed results
    print("\nProtein-Level Confusion Matrix PSIBLAST:")
    print(f"True Positives: {tp_psiblast}")
    print(f"False Positives: {fp_psiblast}")
    print(f"False Negatives: {fn_psiblast}")
    print(f"True Negatives: {tn_psiblast}")



    print("\nProtein-Level Confusion Matrix HMM:")
    print(f"True Positives: {tp_hmm}")
    print(f"False Positives: {fp_hmm}")
    print(f"False Negatives: {fn_hmm}")
    print(f"True Negatives: {tn_hmm}")


    print("\nProtein-Level Metrics PSIBLAST:")
    for metric, value in protein_results_psiblast.items():
        print(f"{metric}: {value:.4f}")


    print("\nProtein-Level Metrics HMM:")
    for metric, value in protein_results_hmm.items():
        print(f"{metric}: {value:.4f}")



    # Residue-level evaluation
   
    print("\n=== Residue-Level Evaluation ===")
    # Only evaluate residues for proteins found in both sets 
    common_proteins_psiblast = psiblast_proteins.intersection(pfam_proteins)
    common_proteins_hmm = hmm_proteins.intersection(pfam_proteins)
    print(f"Number of proteins for residue-level evaluation PSIBLAST: {len(common_proteins_psiblast)}")
    print(f"Number of proteins for residue-level evaluation HMM: {len(common_proteins_hmm)}")
    
    # Collect all residue-level predictions
    all_true_residues_psiblast = []
    all_pred_residues_psiblast = []

    all_true_residues_hmm = []
    all_pred_residues_hmm = []

    
    for protein in common_proteins_psiblast:
        result = create_residue_vectors(psiblast_df, pfam_df, protein)
        if result is not None:
            true_pos, pred_pos = result
            all_true_residues_psiblast.extend(true_pos)
            all_pred_residues_psiblast.extend(pred_pos)


    for protein in common_proteins_hmm:
        result = create_residue_vectors(hmm_df, pfam_df, protein)
        if result is not None:
            true_pos, pred_pos = result
            all_true_residues_hmm.extend(true_pos)
            all_pred_residues_hmm.extend(pred_pos)
    
    # Calculate residue-level metrics
    residue_results_psiblast = {
        'Precision': precision_score(all_true_residues_psiblast, all_pred_residues_psiblast),
        'Recall': recall_score(all_true_residues_psiblast, all_pred_residues_psiblast),
        'F-score': f1_score(all_true_residues_psiblast, all_pred_residues_psiblast),
        'Balanced Accuracy': balanced_accuracy_score(all_true_residues_psiblast, all_pred_residues_psiblast),
        'MCC': matthews_corrcoef(all_true_residues_psiblast, all_pred_residues_psiblast)
    }
    
    # Calculate residue-level confusion matrix
    tp = sum(1 for t, p in zip(all_true_residues_psiblast, all_pred_residues_psiblast) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(all_true_residues_psiblast, all_pred_residues_psiblast) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(all_true_residues_psiblast, all_pred_residues_psiblast) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(all_true_residues_psiblast, all_pred_residues_psiblast) if t == 0 and p == 0)
    
    print("\nResidue-Level Confusion Matrix PSIBLAST:")
    print(f"True Positives: {tp}")
    print(f"False Positives: {fp}")
    print(f"False Negatives: {fn}")
    print(f"True Negatives: {tn}")
    
    print("\nResidue-Level Metrics PSIBLAST:")
    for metric, value in residue_results_psiblast.items():
        print(f"{metric}: {value:.4f}")



    # Calculate residue-level metrics
    residue_results_hmm = {
        'Precision': precision_score(all_true_residues_hmm, all_pred_residues_hmm),
        'Recall': recall_score(all_true_residues_hmm, all_pred_residues_hmm),
        'F-score': f1_score(all_true_residues_hmm, all_pred_residues_hmm),
        'Balanced Accuracy': balanced_accuracy_score(all_true_residues_hmm, all_pred_residues_hmm),
        'MCC': matthews_corrcoef(all_true_residues_hmm, all_pred_residues_hmm)
    }
    
    # Calculate residue-level confusion matrix
    tp = sum(1 for t, p in zip(all_true_residues_hmm, all_pred_residues_hmm) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(all_true_residues_hmm, all_pred_residues_hmm) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(all_true_residues_hmm, all_pred_residues_hmm) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(all_true_residues_hmm, all_pred_residues_hmm) if t == 0 and p == 0)


    print("\nResidue-Level Confusion Matrix HMM:")
    print(f"True Positives: {tp}")
    print(f"False Positives: {fp}")
    print(f"False Negatives: {fn}")
    print(f"True Negatives: {tn}")
    
    print("\nResidue-Level Metrics HMM:")
    for metric, value in residue_results_hmm.items():
        print(f"{metric}: {value:.4f}")
    

psiblast_file = 'Model/Evaluation/Predictions/PSI-BLAST/psiblastsearch_output.csv'
hmm_file = 'Model/Evaluation/Predictions/HMM-SEARCH/hmmsearch_output.csv'
pfam_file = 'Model/Evaluation/Ground Truth/pfam_domain_positions.csv'



print("\n===============================")
print("Evaluation only on found proteins in both PSSM/HMM:")
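# Note: only_found=False, so each model is evaluated on the union of its predictions and the Pfam proteins
evaluate_model(psiblast_file, hmm_file, pfam_file, only_found=False, e_threshold=0.001)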


Evaluation only on found proteins in both PSSM/HMM:

Evaluating all proteins:
Number of proteins predicted by PSIBLAST: 83
Number of proteins predicted by HMM: 83
Number of proteins in Pfam ground truth: 82
Number of proteins being evaluated for PSIBLAST: 83
Number of proteins being evaluated for HMM: 83

=== Protein-Level Evaluation ===

Protein-Level Confusion Matrix PSIBLAST:
True Positives: 82
False Positives: 1
False Negatives: 0
True Negatives: 0

Protein-Level Confusion Matrix HMM:
True Positives: 82
False Positives: 1
False Negatives: 0
True Negatives: 0

Protein-Level Metrics PSIBLAST:
Precision: 0.9880
Recall: 1.0000
F-score: 0.9939
Balanced Accuracy: 0.5000
MCC: 0.0000

Protein-Level Metrics HMM:
Precision: 0.9880
Recall: 1.0000
F-score: 0.9939
Balanced Accuracy: 0.5000
MCC: 0.0000

=== Residue-Level Evaluation ===
Number of proteins for residue-level evaluation PSIBLAST: 82
Number of proteins for residue-level evaluation HMM: 82

Residue-Level Confusion Matrix PSIBLAST:
Tr

4. Consider refining your models to improve their performance

- We took multiple steps:

    - Remove redundancy at different thresholds in Jalview
        - We saw that even the 100% threshold usually removed around 80% of the sequences, so we only removed redundancy at the 100% threshold (the setting that removes the fewest sequences) to avoid having too few sequences to build the model
    - Use a different starting database
        - We tried UniRef90 first, but with the same settings (e-value 0.0001 etc.) it returned far fewer initial homologous sequences. Since this resulted in weaker model performance, we switched to a bigger database and chose UniProtKB for the final model
    - When removing columns, vary the thresholds to see what gives better results
        - tried different things (still TODO), using the 3 parameters (3 thresholds) we have: the aim was to remove the columns that consist almost entirely of gaps (hence the gap threshold) and, among the remaining aligned columns, to avoid columns with completely random amino acids (hence the entropy threshold)
    - Then state here which settings gave our best model
        - TODO 
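
The column-removal idea (drop columns that are mostly gaps, then drop columns whose residues look random) can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual trimming code: the function names, the toy alignment and the threshold values `max_gap_fraction` and `max_entropy` are all hypothetical, and entropy is computed here as Shannon entropy in bits over the non-gap residues of a column.

```python
import math
from collections import Counter


def column_stats(column: str):
    """Return (gap fraction, Shannon entropy in bits of the non-gap residues)."""
    residues = [c for c in column if c != "-"]
    gap_fraction = 1 - len(residues) / len(column)
    if not residues:
        return gap_fraction, 0.0
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in Counter(residues).values())
    return gap_fraction, entropy


def filter_columns(alignment, max_gap_fraction=0.5, max_entropy=3.0):
    """Keep only columns below both (hypothetical) thresholds."""
    keep = []
    for index, column in enumerate(zip(*alignment)):
        gap_fraction, entropy = column_stats("".join(column))
        if gap_fraction <= max_gap_fraction and entropy <= max_entropy:
            keep.append(index)
    return ["".join(seq[i] for i in keep) for seq in alignment]


# Toy alignment: the third column is almost entirely gaps
toy_alignment = ["AC-K", "AC-R", "AG-K", "A-WK"]
trimmed = filter_columns(toy_alignment, max_gap_fraction=0.5, max_entropy=1.0)
print(trimmed)
```

Lower `max_entropy` values keep only well-conserved columns, while lower `max_gap_fraction` values remove more sparsely populated columns; both trade model specificity against the amount of signal left in the alignment.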

## Domain family characterization
Once the family model is defined (previous step), you will look at functional (and structural) aspects/properties of the entire protein family. The objective is to provide insights about the main function of the family.

### Taxonomy

1. Collect the taxonomic lineage (tree branch) for each protein of the family_sequences dataset
from UniProt (entity/organism/lineage in the UniProt XML)

- TODO: the task asks for the lineage from the UniProt XML, but we obtained it by parsing the website; ask by email whether that gives the same result

2. Plot the taxonomic tree of the family with nodes size proportional to their relative abundance 
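
To make "relative abundance" concrete, the following sketch counts how many proteins pass through each node of the tree. It keys the counts by the full path from the root, so that identically named taxa in different branches are not merged. The lineages are toy values and the function name `count_lineage_paths` is hypothetical.

```python
from collections import Counter


def count_lineage_paths(lineages):
    """Count, for every root-to-node path, how many proteins pass through it."""
    counts = Counter()
    for lineage in lineages:
        for depth in range(1, len(lineage) + 1):
            counts[" > ".join(lineage[:depth])] += 1
    return counts


# Toy lineages for three proteins
lineages = [
    ["Eukaryota", "Metazoa", "Chordata"],
    ["Eukaryota", "Metazoa", "Arthropoda"],
    ["Eukaryota", "Metazoa", "Chordata"],
]

counts = count_lineage_paths(lineages)
total = len(lineages)

# Relative abundance of each node, usable as a scaling factor for node size
relative_abundance = {path: n / total for path, n in counts.items()}
for path, share in relative_abundance.items():
    print(f"{path}: {counts[path]} proteins ({share:.2f})")
```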


In [10]:
def load_protein_ids(psiblast_file, hmm_file, e_threshold=0.001):
    """Load protein IDs from PSI-BLAST and HMM search results."""
    psiblast_df = pd.read_csv(psiblast_file)
    hmm_df = pd.read_csv(hmm_file)
    
    # Note: only the HMM hits are filtered by E-value here; all PSI-BLAST hits are kept
    filtered_hmm_proteins = hmm_df[hmm_df['E-value'] <= e_threshold]['uniprot_id']
    psiblast_proteins = set(psiblast_df['uniprot_id'])
    hmm_proteins = set(filtered_hmm_proteins)
    
    return list(psiblast_proteins.union(hmm_proteins))

In [65]:
# TaxonomyAnalyzer Class for fetching taxonomy information
class TaxonomyAnalyzer:
    def __init__(self, max_retries: int = 3, retry_delay: int = 1):
        self.max_retries = max_retries
        self.retry_delay = retry_delay
        self.uniprot_base_url = "https://rest.uniprot.org/uniprotkb/"

    def fetch_taxonomy_info(self, protein_ids: list, output_file: str):
        """ Get the lineage data by parsing UniProt for a list of protein IDs (our family sequences) and save to a CSV file """
        taxonomy_data = []

        pbar = tqdm(protein_ids, desc="Fetching taxonomy data")
        
        # Iterate through each protein of our family
        for protein_id in pbar:
            pbar.set_description(f"Processing {protein_id}")

            for attempt in range(self.max_retries):
                try:
                    # Fetch data from UniProt API
                    response = requests.get(f"{self.uniprot_base_url}{protein_id}.json")
                    response.raise_for_status()
                    data = response.json()

                    # Get lineage information 
                    taxonomy = data.get("organism", {})
                    scientific_name = taxonomy.get("scientificName", "N/A")
                    lineage = taxonomy.get("lineage", [])

                    taxonomy_data.append([protein_id, scientific_name, " > ".join(lineage)])
                    break

                # Error Catching
                except requests.exceptions.RequestException as e:
                    print(f"Error fetching data for {protein_id}: {e}")
                    if attempt == self.max_retries - 1:
                        taxonomy_data.append([protein_id, "Error", ""])
                    else:
                        time.sleep(self.retry_delay)

        # Save to CSV
        taxonomy_df = pd.DataFrame(taxonomy_data, columns=["Protein ID", "Scientific Name", "Lineage"])
        taxonomy_df.to_csv(output_file, index=False)
        return taxonomy_df

# Process taxonomy data
def process_taxonomy(data):
    """ Process lineage data into a nested dictionary and frequency counts """
    taxonomy_dict = {}
    frequency_counts = {}
    
    for _, row in data.iterrows():
        # Skip proteins without lineage information (e.g. failed requests)
        if not isinstance(row["Lineage"], str) or not row["Lineage"]:
            continue
        # Split the lineage by " > " to get each level
        lineage = row["Lineage"].split(" > ")
        current = taxonomy_dict
        # Keep track of the full path as we traverse the lineage,
        # so that occurrences of a term are counted at the level (lineage "column") where they appear
        current_path = []
        
        # Iterate through each level in the lineage
        for level in lineage:
            current_path.append(level)
            path_key = " > ".join(current_path)
            
            # Count frequencies using the full path as key
            if path_key not in frequency_counts:
                frequency_counts[path_key] = 0
            frequency_counts[path_key] += 1
            
            if level not in current:
                current[level] = {}
            # Set current to be at the next level
            current = current[level]
    
    return taxonomy_dict, frequency_counts

# Create a Newick string for the taxonomy tree for representation purposes
def dict_to_newick(d):
    """ Convert dictionary to Newick format string """
    newick = ""
    for key, sub_dict in d.items():
        sub_tree = dict_to_newick(sub_dict)
        newick += f"({sub_tree}){key}," if sub_tree else f"{key},"
    return newick.rstrip(",")



def main():
    psiblast_file = "Model/Evaluation/Predictions/PSI-BLAST/psiblastsearch_output.csv"
    hmm_file = "Model/Evaluation/Predictions/HMM-SEARCH/hmmsearch_output.csv"
    protein_ids = load_protein_ids(psiblast_file, hmm_file)

    analyzer = TaxonomyAnalyzer()
    taxonomy_data = analyzer.fetch_taxonomy_info(protein_ids, "Taxonomy/taxonomy_info.csv")

    print("Taxonomy file saved to: Taxonomy/taxonomy_info.csv")


    # Create a nested dictionary of lineage data and frequency counts (of occurrences of each term in the lineage)
    taxonomy_dict, frequency_counts = process_taxonomy(taxonomy_data)

    # Create the newick tree based on the lineage data 
    newick_tree = f"({dict_to_newick(taxonomy_dict)});"

    # Plot using ETE Toolkit
    phylo_tree = Tree(newick_tree, format=1)
    tree_style = TreeStyle()
    tree_style.show_leaf_name = False


    # Adjust node sizes 
    max_size = 12  # Increase max size for nodes for visibility
    for node in phylo_tree.traverse():
        # Get the full path from root to this node
        path = []
        current = node
        while current:
            if current.name:  # Skip empty names
                path.insert(0, current.name)
            current = current.up
        
        path_key = " > ".join(path)
        count = frequency_counts.get(path_key, 1)
        nstyle = NodeStyle()
        # Set node size to the protein count of this taxon, capped at max_size
        nstyle["size"] = min(count, max_size)  
        node.set_style(nstyle)
        # Add label with name and count
        node.add_face(TextFace(f"{node.name} ({count})", fsize=10), column=0)

    # Improve tree spacing
    tree_style.branch_vertical_margin = 30  # Increase spacing for better visibility

    # Save the tree to PNG file
    output_file = "Taxonomy/phylogenetic_tree_freq.png"
    phylo_tree.render(output_file, w=3000, h=2000, tree_style=tree_style)

    print(f"Tree saved to: {output_file}")

if __name__ == "__main__":
    main()


0     P06857
1     P54316
2     Q5BKQ4
3     P16233
4     P54315
       ...  
78    P02844
79    P27587
80    P0DSI2
81    Q68KK0
82    P83629
Name: uniprot_id, Length: 83, dtype: object


Processing Q02157: 100%|██████████| 83/83 [00:52<00:00,  1.59it/s]    


Taxonomy file saved to: Taxonomy/taxonomy_info.csv
Tree saved to: Taxonomy/phylogenetic_tree_freq.png


### Function

1. Collect GO annotations for each protein of the family_sequences dataset (entity/dbReference type="GO" in the UniProt XML)

- We did this in two steps:
    - First, we collect the "direct" GO annotations of each protein (i.e. no ancestors, only the terms listed in the UniProt XML for that protein ID) using fetch_go_annotations().
    - Then we expand the collected GO terms with their ancestors using expand_go_terms_with_ancestors(), which takes the GO IDs found and walks up the ontology to add all ancestors of each term as well.
- TODO : maybe create .csv file here too so we also have the results of this stored
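
The ancestor expansion described above can be illustrated on a toy ontology without any external library. The term IDs (`T:A` etc.), the `toy_parents` mapping and both function names are placeholders invented for this sketch; in the real workflow the parent links come from the parsed GO ontology.

```python
def all_ancestors(term, parents):
    """Collect every ancestor of `term` by walking the parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def expand_with_ancestors(protein_to_go, parents):
    """Add all ancestors of each directly annotated term to every protein."""
    expanded = {}
    for protein, go_ids in protein_to_go.items():
        terms = set(go_ids)
        for go_id in go_ids:
            terms |= all_ancestors(go_id, parents)
        expanded[protein] = sorted(terms)
    return expanded


# Toy ontology: child -> list of direct parents (a term may have several parents)
toy_parents = {"T:C": ["T:B"], "T:B": ["T:A"], "T:D": ["T:B", "T:A"]}
direct_annotations = {"P1": ["T:C"], "P2": ["T:D"]}

expanded = expand_with_ancestors(direct_annotations, toy_parents)
print(expanded)
```

Expanding annotations this way means a protein annotated with a specific term also counts towards every more general term above it, which matters for the enrichment of high-level branches.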

2. Calculate the enrichment of each term in the dataset compared to GO annotations available in the SwissProt database (you can download the entire SwissProt XML [here](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz)). You can use Fisher's exact test and verify that both two-tailed and right-tailed P-values (or left-tailed, depending on how you build the confusion matrix) are close to zero

- This again takes multiple steps:
    - First, we parse again, this time the entire SwissProt XML, using parse_swissprot_go_terms().
        - In this step we skip the family proteins, which makes it easier to build the contingency table for Fisher's exact test later on; for the metrics that need them, we simply add the family proteins back.
    - Then we again expand the GO terms using expand_go_terms_with_ancestors().
    - Now we can calculate the actual enrichment.
        - We use the helper function get_go_terms_given_goid() to look up the GO term names from the GO IDs.
        - Since both the family parser and the SwissProt parser produce dictionaries of the form protein_id : [go_ids], we use another helper function to reverse them into go_id : [protein_ids], which simplifies the calculations for Fisher's exact test.
    - We also calculate some additional metrics for the enrichment (see the code and the columns of the .csv).

Results in enrichment_results.csv
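
For a single GO term, the contingency table and the right-tailed P-value can be sketched as below. This is a standard-library illustration of Fisher's exact test via the hypergeometric distribution, not the project's implementation; the function name and the toy counts are assumptions. The layout assumes the background excludes the family proteins, so the four cells are disjoint. `scipy.stats.fisher_exact(table, alternative="greater")` computes the same right-tailed value.

```python
from math import comb


def fisher_right_tail(a: int, b: int, c: int, d: int) -> float:
    """Right-tailed Fisher's exact test for the 2x2 table

                      has term   lacks term
        family           a           b
        background       c           d

    Returns P(X >= a) under the hypergeometric null distribution.
    """
    family_total = a + b
    term_total = a + c
    n = a + b + c + d
    k_max = min(family_total, term_total)
    favourable = sum(
        comb(term_total, k) * comb(n - term_total, family_total - k)
        for k in range(a, k_max + 1)
    )
    return favourable / comb(n, family_total)


# Toy counts (hypothetical): 3 of 4 family proteins carry the term,
# 1 of 4 background proteins carries it
p_value = fisher_right_tail(a=3, b=1, c=1, d=3)
print(f"right-tailed p-value: {p_value:.4f}")
```

With the reversed dictionaries, `a` would be the number of family proteins listed under the GO ID, `b` the remaining family proteins, `c` the number of non-family SwissProt proteins listed under it, and `d` the remaining non-family proteins.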

3. Plot enriched terms in a word cloud 
    - We plot a word cloud of the enriched terms, i.e. those with both two-tailed and right-tailed P-value < 0.05.
    - To weight the terms in the word cloud, we use the ratio between the percentage of family proteins annotated with the GO term and the percentage of all SwissProt proteins annotated with it.

Results in go_enrichment_wordcloud.png
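
The weighting described above amounts to a fold-enrichment ratio per term. A minimal sketch follows, with made-up counts and a hypothetical function name; it assumes the SwissProt counts cover the whole database, family included, so the denominator is never zero for a term seen in the family.

```python
def enrichment_weights(family_counts, swissprot_counts, n_family, n_swissprot):
    """Weight = (share of family proteins with the term) / (share of SwissProt proteins with it)."""
    weights = {}
    for term, family_n in family_counts.items():
        family_share = family_n / n_family
        swissprot_share = swissprot_counts[term] / n_swissprot
        weights[term] = family_share / swissprot_share
    return weights


# Hypothetical counts, for illustration only
family_counts = {"lipid catabolic process": 50, "extracellular region": 80}
swissprot_counts = {"lipid catabolic process": 100, "extracellular region": 2000}

weights = enrichment_weights(family_counts, swissprot_counts,
                             n_family=100, n_swissprot=10000)
for term, weight in weights.items():
    print(f"{term}: {weight:.1f}")
```

A dictionary of this shape (term name to weight) is the kind of input that the `wordcloud` package accepts through `WordCloud.generate_from_frequencies`.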

4. Take into consideration the hierarchical structure of the GO ontology and report most significantly enriched branches, i.e. high level terms
    - For each GO ID in the family, we walked up the ontology to find all of its ancestors and registered the starting GO term as an "enriched child" of each ancestor (the "enrichment" was weighted by the two-sided p-value obtained earlier).
    - We then printed the top 20 branches that had at least 2 enriched children and sat high up in the tree (i.e. no deeper than depth 3).

5. Always report the full name of the terms and not only the GO ID
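
The branch aggregation from step 4 can be sketched on a toy ontology. Everything here is illustrative: the term names, depths and p-values are invented, and scoring a branch by the sum of -log10(p) over its enriched children is only one possible way to weight by p-value, assumed for this example.

```python
import math
from collections import defaultdict


def all_ancestors(term, parents):
    """Collect every ancestor of `term` by walking the parent links upward."""
    seen, stack = set(), list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def enriched_branches(enriched_pvalues, parents, depth, min_children=2, max_depth=3):
    """Rank high-level terms by the enriched terms found below them."""
    children = defaultdict(dict)
    for term, p_value in enriched_pvalues.items():
        for ancestor in all_ancestors(term, parents):
            children[ancestor][term] = p_value

    rows = []
    for ancestor, kids in children.items():
        if len(kids) >= min_children and depth[ancestor] <= max_depth:
            # Assumed scoring: smaller p-values contribute more
            score = sum(-math.log10(p) for p in kids.values())
            rows.append((ancestor, len(kids), score))
    # Highest score first; ties broken by name for a stable order
    return sorted(rows, key=lambda row: (-row[2], row[0]))


# Toy ontology (child -> parents) with node depths and two enriched terms
parents = {"A": ["root"], "B": ["A"], "C": ["A"], "D": ["B"]}
depth = {"root": 0, "A": 1, "B": 2, "C": 2, "D": 3}
enriched = {"C": 1e-5, "D": 1e-3}

branches = enriched_branches(enriched, parents, depth)
for name, n_children, score in branches:
    print(f"{name}: {n_children} enriched children, score {score:.1f}")
```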

In [2]:
from collections import defaultdict


def reverse_protein_go_dict(protein_to_go):
    """
    Convert a protein ID : [GO IDs] dict into a GO ID : [protein IDs] dict.

    Args:
        protein_to_go (dict): Protein ID to GO IDs dictionary

    Returns:
        dict: GO ID to protein IDs dictionary
    """
    go_to_proteins = defaultdict(list)
    for protein, go_terms in protein_to_go.items():
        for go_term in go_terms:
            go_to_proteins[go_term].append(protein)
    return go_to_proteins

In [3]:
def expand_go_terms_with_ancestors(protein_to_go, go_obo):
    """
    Expand GO IDs with their ancestors.
    
    Args:
        protein_to_go (dict): Dictionary mapping protein IDs to lists of GO IDs.
        go_obo (GODag): Parsed GO DAG for ancestor retrieval.
        
    Returns:
        dict: Updated protein_to_go dictionary with expanded GO IDs.
    """

    expanded_protein_to_go = defaultdict(set)
    
    for protein, go_terms in protein_to_go.items():
        for go_id in go_terms:
            # Add the GO ID itself
            expanded_protein_to_go[protein].add(go_id)
            
            # Add all ancestor terms (i.e their IDs) of that GO ID 
            if go_id in go_obo:
                ancestors = go_obo[go_id].get_all_parents()  # Get all ancestors
                expanded_protein_to_go[protein].update(ancestors)
    
   
    return {protein: list(go_terms) for protein, go_terms in expanded_protein_to_go.items()}


In [6]:
import time
from typing import Dict, List

import requests


def get_go_terms_given_goid(protein_go_dict: Dict[str, List[str]]) -> Dict[str, str]:
    """
    Convert GO IDs to their corresponding terms using the Gene Ontology API.
    
    Args:
        protein_go_dict (dict) : Dictionary mapping protein IDs to lists of GO IDs
        
    Returns:
        dict : Dictionary mapping GO IDs to their terms
        

    """
    # Extract unique GO IDs from the dictionary
    go_ids = set()
    for go_list in protein_go_dict.values():
        go_ids.update(go_list)
    
    # Initialize results dictionary
    go_terms_dict = {}
    
    # Base URL for the Gene Ontology API
    base_url = "http://api.geneontology.org/api/ontology/term/"
    
    # Process each GO ID
    for go_id in go_ids:
        try:
            # Add delay 
            time.sleep(0.1)
            
            # Make API request using the Gene Ontology API, i.e. search the API based on the current GO ID
            response = requests.get(f"{base_url}{go_id}")
            response.raise_for_status()
            
            # Extract the corresponding GO term
            data = response.json()
            go_terms_dict[go_id] = data.get('label', 'Term not found')
        
        # Some error catching 
        except requests.exceptions.RequestException as e:
            print(f"Error fetching term for {go_id}: {str(e)}")
            go_terms_dict[go_id] = 'Error fetching term'
            
        except Exception as e:
            print(f"Unexpected error processing {go_id}: {str(e)}")
            go_terms_dict[go_id] = 'Error processing term'
    
    return go_terms_dict

In [7]:
def fetch_go_annotations(protein_id):
    """
    Fetch the direct GO annotations of a single protein from its UniProt XML entry.
    
    Args:
        protein_id (str): single protein ID of our family 
        
    Returns:
        List : List of the GO ids found for that protein
    """
    go_ids = []
    

    # Base URL for the UniProt API
    url = f"https://rest.uniprot.org/uniprotkb/{protein_id}.xml"

    try:
        response = requests.get(url)
        response.raise_for_status()
        
        namespaces = {'ns': 'http://uniprot.org/uniprot'}
        root = ET.fromstring(response.content)
        
        # Get all GO IDs for the protein
        for db_ref in root.findall(".//ns:dbReference[@type='GO']", namespaces):
            go_id = db_ref.attrib.get('id')
            
            if go_id:
                go_ids.append(go_id)
    
    # Some error catching
    except requests.exceptions.RequestException as e:
        print(f"Error fetching GO annotations for {protein_id}: {e}")
   
            
    return go_ids


# Debugging helper to understand the file structure:
# the big .xml file has the same structure as the per-protein ones 
# we already analyzed; thus, we can reuse the same parsing logic, but this time
# only collect the GO IDs per protein, because that is all we need (no categories, which would just make our code slower)
def print_swissprot_file(swissprot_xml_path, length = 50):
    """
    Print the first `length` lines of the file to inspect its structure
    """

    with open(swissprot_xml_path, 'r') as f:
        print(f"First {length} lines of the file:")
        for i, line in enumerate(f):
            if i < length:
                print(line.strip())
            else:
                break



def parse_swissprot_go_terms(swissprot_xml_path, family_proteins):
   """
   Parse GO IDs from SwissProt XML file for each protein, excluding proteins in the family.
   
   Args:
       swissprot_xml_path (str): Path to SwissProt XML file
       family_proteins (set): UniProt IDs in protein family
   
   Returns:
       dict: protein ID -> list of GO IDs. Proteins without any GO annotation
           get no entry, so len() of the result counts annotated proteins only.
   """
   protein_to_go = defaultdict(list)
   total_proteins = 0
   skipped_proteins = 0
   
   namespaces = {'ns': 'http://uniprot.org/uniprot'}
   context = ET.iterparse(swissprot_xml_path, events=('end',))
   
   print("Starting to parse SwissProt XML...")
   
   for event, elem in context:
       if elem.tag.endswith('entry'):
           accession = elem.find(".//ns:accession", namespaces)
           if accession is not None:
               uniprot_id = accession.text
               
               # Exclude family proteins
               if uniprot_id in family_proteins:
                   skipped_proteins += 1
               else:
                   # Get all GO IDs for the protein (same structure as in fetch_go_annotations)
                   for db_ref in elem.findall(".//ns:dbReference[@type='GO']", namespaces):
                       go_id = db_ref.attrib.get('id')
                       if go_id:
                           protein_to_go[uniprot_id].append(go_id)
                   total_proteins += 1

           elem.clear()
           
           # Keep track of progress, as it takes some time to parse the whole file
           if (total_proteins + skipped_proteins) % 10000 == 0:
               print(f"Processed {total_proteins} proteins "
                     f"(skipped {skipped_proteins} family proteins)...")
             
    
               
                    
               
   return protein_to_go

def calculate_go_enrichment(go_to_proteins_family, go_to_proteins_swissprot, total_proteins_family, total_proteins_swissprot, go_id_to_go_term):
    """ 
    Perform Fisher's exact test to calculate GO term enrichment in our protein family.

    Args:
        go_to_proteins_family (dict): GO ID to list of proteins in family
        go_to_proteins_swissprot (dict): GO ID to list of proteins in the rest of SwissProt (family excluded)
        total_proteins_family (int): Total proteins in family
        total_proteins_swissprot (int): Total proteins in the rest of SwissProt (family excluded)
        go_id_to_go_term (dict): GO ID to GO term mapping

    Returns:
        pd.DataFrame: GO term enrichment results, sorted by two-tailed p-value
"""
    results = []
    
    
    for go_id in go_to_proteins_family.keys():
   
        # Create the 2x2 contingency table for Fisher's exact test
        # The table looks like this:
        #                   Protein in family    Protein not in family (i.e. all in SwissProt - family proteins)
        # Has GO term            a                    b
        # No GO term             c                    d
        
        # Contingency table calculations:
        a = len(go_to_proteins_family[go_id])  # Proteins with this GO term in family
        
      
        b = len(go_to_proteins_swissprot.get(go_id, []))  # Proteins with GO term in rest of SwissProt (without family)
        
        c = total_proteins_family - a  # Proteins without GO term in family
        
  
        d = total_proteins_swissprot - b # Proteins without GO term in rest of SwissProt (without family)
        
        # Verify all values are non-negative before creating contingency table
        if all(x >= 0 for x in [a, b, c, d]):
            contingency_table = [[a, b], [c, d]]
            
            # Perform Fisher's exact test.
            # Question: does the GO term appear more often in our family than expected by chance?
            # Null hypothesis (H0): the proportion of proteins with this GO term in our family
            # is the same as in the rest of SwissProt (family proteins excluded).
            # In other words, under H0 carrying the GO term is independent of family membership,
            # so the term does not characterize the family.
            # Alternative hypotheses (H1):
            #   Right-tail (greater): the family has a higher proportion of this GO term than SwissProt
            #   Two-tail (two-sided): the proportion differs (either higher or lower)
            # The left tail (less: the term is depleted in the family) is not computed.
            #
            # Fisher's exact test gives the probability of observing our data, or something
            # more extreme, under H0. A small p-value (e.g. < 0.05) means:
            #   Two-tail: the frequency of this GO term differs significantly from SwissProt
            #   Right-tail: this GO term is significantly enriched (overrepresented) in our family

            odds_ratio, pvalue_two_tail = fisher_exact(contingency_table, alternative='two-sided')
            _, pvalue_greater = fisher_exact(contingency_table, alternative='greater')
            
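            # Calculate proportions: within the family, and within all of SwissProt
            # (family proteins plus the rest, hence a+b over the sum of both totals)
            my_proportion = a / total_proteins_family
            swissprot_proportion = (a+b) / (total_proteins_swissprot + total_proteins_family)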

     
            
            results.append({
                'GO_ID': go_id,
                'GO_Term': go_id_to_go_term.get(go_id, 'N/A'), # Include GO term name
                'Count_Prot_Dataset': a,
                'Count_Prot_SwissProt': b,
                'Count_Prot_SwissProt_Actual': a+b,
                'Percentage_Dataset': round(my_proportion * 100, 2),
                'Percentage_SwissProt': round(swissprot_proportion * 100, 10),
                'Fold_Enrichment': round(my_proportion/swissprot_proportion,2),
                'P_Value_Two_Tail': pvalue_two_tail,
                'P_Value_Greater': pvalue_greater,
            })
    
    # Convert to DataFrame and sort by p-value
    df_results = pd.DataFrame(results)
    if not df_results.empty:
        df_results = df_results.sort_values('P_Value_Two_Tail')

    df_results.to_csv("Function/enrichment_results.csv")
    
    return df_results
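
To make the contingency-table logic concrete, the right-tailed p-value can be reproduced from the hypergeometric distribution with the standard library alone. This is a minimal sketch under stated assumptions, not the `fisher_exact` routine used in the cell above: the function name `fisher_right_tail` and the toy counts are made up for illustration.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """Right-tailed Fisher p-value for the table [[a, b], [c, d]].

    Layout as in calculate_go_enrichment: columns are family / rest of
    SwissProt, rows are with / without the GO term.
    """
    with_term = a + b            # row total: proteins carrying the GO term
    in_family = a + c            # column total: proteins in the family
    n = a + b + c + d            # all proteins in the table
    # Under H0 the top-left cell is hypergeometric; sum P(X = k) for k >= a.
    largest = min(with_term, in_family)
    favourable = sum(
        comb(with_term, k) * comb(n - with_term, in_family - k)
        for k in range(a, largest + 1)
    )
    return favourable / comb(n, in_family)


# Toy table (made-up counts): 3 of 4 family proteins carry the term,
# versus 1 of 4 proteins in the background.
p_value = fisher_right_tail(3, 1, 1, 3)
print(f"right-tailed p = {p_value:.4f}")
```

With only four proteins per group, even a 3-to-1 imbalance is not significant, which illustrates why small families need strong over-representation to reach p < 0.05.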

In [8]:
# Hierarchical Structure
def analyze_go_hierarchy(go_id_to_go_term):
    """
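    Analyze the hierarchical structure of enriched GO terms.

    Reads Function/enrichment_results.csv, credits every significantly enriched term
    to all of its ancestors in the GO DAG, and writes the top high-level branches to
    Function/enriched_branches.csv. The go_id_to_go_term argument is currently unused.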
    """

    # Load the Gene Ontology DAG
    go_obo = obo_parser.GODag('go.obo')
    
    # Read our enrichment results
    df = pd.read_csv("Function/enrichment_results.csv")
    
    # Filter for significantly enriched terms
    enriched_terms = df[
        (df['P_Value_Two_Tail'] < 0.05) &
        (df['P_Value_Greater'] < 0.05)
    ]
    
    # Create a dictionary to store branch information
    branch_info = {}
    
    # For each enriched term, traverse up its ancestry
    for _, row in enriched_terms.iterrows():
        go_id = row['GO_ID']
        if go_id in go_obo:
            term = go_obo[go_id]
            
            # Get all ancestors (parents) up to the root of the DAG of the current term (i.e. the current GO ID)
            
            ancestors = term.get_all_parents()
            
            # Add information about this term to all its ancestor branches
            for ancestor_id in ancestors:
                if ancestor_id not in branch_info:
                    branch_info[ancestor_id] = {
                        'term_name': go_obo[ancestor_id].name,
                        'enriched_children': [], # filled in below
                        'total_significance': 0, # running sum of -log10(p), starts at 0
                        'depth': go_obo[ancestor_id].depth, # depth in the DAG (the root has depth 0)
                    }

               
                # The current go_id is a descendant of ALL ancestors returned by get_all_parents()
                # (not necessarily a direct child; it may sit much further down the tree).
                # Thus, add it to the enriched_children list of each ancestor with its two-tailed p-value
                branch_info[ancestor_id]['enriched_children'].append({
                    'id': go_id,
                    'name': term.name,
                    'p_value': row['P_Value_Two_Tail']
                })
                # Score the branch by summing -log10(p) over all enriched descendants (lower p-values give higher scores)
                branch_info[ancestor_id]['total_significance'] += -np.log10(row['P_Value_Two_Tail'])
    
    # Filter for high-level terms (lower depth) with multiple enriched children
    significant_branches = {
        go_id: info for go_id, info in branch_info.items() # take each key,value of the branch_info dictionary
        if len(info['enriched_children']) >= 2  # At least 2 enriched children
        and info['depth'] <= 3  # High-level terms having maximum depth of 3 (i.e. only look at GO terms high up in the tree)
    } 
    
    # Sort branches by their total significance
    sorted_branches = sorted(
        significant_branches.items(),
        key=lambda x: x[1]['total_significance'],
        reverse=True
    )
    
    # Create a list to store the branch information
    branch_data = []


    for go_id, info in sorted_branches[:20]:  # Top 20 branches
        branch_data.append({
            'GO_ID': go_id,
            'GO_Term': info['term_name'],
            'Hierarchy_Depth': info['depth'],
            'Number_Enriched_Terms': len(info['enriched_children']),
            'Total_Significance_Score': info['total_significance']
        })

    # Create a DataFrame and save to CSV
    branches_df = pd.DataFrame(branch_data)
    branches_df.to_csv('Function/enriched_branches.csv', index=False)
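
The branch scoring can be illustrated on a toy hierarchy without loading `go.obo`. The sketch below is a simplified stand-in for the cell above: the `PARENTS` dictionary, the term names, and the p-values are hypothetical, and `all_ancestors` only mimics what `get_all_parents()` is used for.

```python
import math
from collections import defaultdict

# Hypothetical toy DAG: child -> set of direct parents (not real GO IDs).
PARENTS = {
    "root": set(),
    "branch": {"root"},
    "leaf_1": {"branch"},
    "leaf_2": {"branch"},
}


def all_ancestors(term):
    """Collect every ancestor of `term` by walking parent links upwards."""
    found, stack = set(), list(PARENTS[term])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS[parent])
    return found


# Made-up p-values for two enriched terms.
enriched = {"leaf_1": 1e-4, "leaf_2": 1e-2}

scores = defaultdict(float)
children = defaultdict(list)
for term, p_value in enriched.items():
    for ancestor in all_ancestors(term):
        children[ancestor].append(term)
        scores[ancestor] += -math.log10(p_value)

for ancestor in sorted(scores):
    print(ancestor, sorted(children[ancestor]), round(scores[ancestor], 2))
```

Because each enriched term is credited to every ancestor, the score grows with the number of enriched descendants. It ranks branches but is not itself a p-value.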

In [12]:
def main():

    psiblast_file = "Model/Evaluation/Predictions/PSI-BLAST/psiblastsearch_output.csv"
    hmm_file = "Model/Evaluation/Predictions/HMM-SEARCH/hmmsearch_output.csv"
    protein_ids = load_protein_ids(psiblast_file, hmm_file)


    # Proteins_to_GO terms for our family 
    print("Fetching GO annotations...")
    family_annotations = {}
    for pid in tqdm(protein_ids, desc="Fetching GO annotations"):
        family_annotations[pid] = fetch_go_annotations(pid)

    total_proteins_family = len(family_annotations)


    
    # Proteins_to_GO terms for SwissProt
    swissprot_annotations = parse_swissprot_go_terms("uniprot_sprot.xml", protein_ids)

    total_proteins_swissprot = len(swissprot_annotations)

    # Load the GO DAG for ancestor expansion.
    # The go.obo file lets us parse the whole ontology.
    # Note: go.obo is kept locally and is not in the Git repository because of its size.
    print("Expanding GO terms to include ancestors...")
    go_obo = obo_parser.GODag('go.obo')
    expanded_family_annotations = expand_go_terms_with_ancestors(family_annotations, go_obo)
    expanded_swissprot_annotations = expand_go_terms_with_ancestors(swissprot_annotations, go_obo)

    # Fetch all GO terms for all found GO IDs in the family after expanding 
    go_id_to_go_term = get_go_terms_given_goid(expanded_family_annotations)

    # Reverse mapping (go_id to proteins mapping) for enrichment
    go_to_proteins_family = reverse_protein_go_dict(expanded_family_annotations)
    go_to_proteins_swissprot = reverse_protein_go_dict(expanded_swissprot_annotations)

    # Calculate GO enrichments
    _ = calculate_go_enrichment(go_to_proteins_family, go_to_proteins_swissprot,
                                total_proteins_family, total_proteins_swissprot, go_id_to_go_term)

    
    
    # Read the enrichment results
    df = pd.read_csv("Function/enrichment_results.csv")

    # Filter for significantly enriched terms
    enriched_terms = df[
        (df['P_Value_Two_Tail'] < 0.05) &
        (df['P_Value_Greater'] < 0.05)
    ]


    # Create word frequencies using the actual GO terms instead of IDs
    word_frequencies = {}
    for _, row in enriched_terms.iterrows():
        go_id = row['GO_ID']
        if go_id in go_id_to_go_term:  # Make sure we have the term for this ID
            term = go_id_to_go_term[go_id]
            # Use fold enrichment as weight
            weight = row['Fold_Enrichment']
            word_frequencies[term] = weight

    # Create and display the word cloud
    wordcloud = WordCloud(
        width=1200, 
        height=800,
        background_color='white',
        prefer_horizontal=0.7,
        max_words=50,  # Limit to top 50 terms for better readability
        min_font_size=10,
        max_font_size=60
    ).generate_from_frequencies(word_frequencies)

    # Plot and save the word cloud
    plt.figure(figsize=(20, 12))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('GO Term Enrichment Word Cloud', fontsize=16, pad=20)
    plt.savefig('Function/go_enrichment_wordcloud.png', dpi=300, bbox_inches='tight')
    plt.close()

    # Print out the enriched terms for verification
    print("\nTop enriched GO terms:")
    sorted_terms = sorted(word_frequencies.items(), key=lambda x: x[1], reverse=True)
    for term, weight in sorted_terms[:10]:
        print(f"\nTerm: {term}")
        print(f"Weight in word cloud: {weight:.2f}")

    
    # Hierarchy

    analyze_go_hierarchy(go_id_to_go_term)

    
if __name__ == "__main__":
    main()

Fetching GO annotations...


Fetching GO annotations: 100%|██████████| 92/92 [00:25<00:00,  3.67it/s]


Starting to parse SwissProt XML...
Processed 10000 proteins (skipped 0 family proteins)...
Processed 19999 proteins (skipped 1 family proteins)...
Processed 29983 proteins (skipped 17 family proteins)...
Processed 39982 proteins (skipped 18 family proteins)...
Processed 49982 proteins (skipped 18 family proteins)...
Processed 59979 proteins (skipped 21 family proteins)...
Processed 69979 proteins (skipped 21 family proteins)...
Processed 79978 proteins (skipped 22 family proteins)...
Processed 89958 proteins (skipped 42 family proteins)...
Processed 99958 proteins (skipped 42 family proteins)...
Processed 109958 proteins (skipped 42 family proteins)...
Processed 119958 proteins (skipped 42 family proteins)...
Processed 129958 proteins (skipped 42 family proteins)...
Processed 139958 proteins (skipped 42 family proteins)...
Processed 149958 proteins (skipped 42 family proteins)...
Processed 159958 proteins (skipped 42 family proteins)...
Processed 169958 proteins (skipped 42 family prot
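
The streaming approach used for the SwissProt dump can be tried on a few lines of XML before running it on the full file. The sketch below is an illustration under stated assumptions: the `SAMPLE` document, its accessions, and the helper name `go_ids_per_entry` are placeholders, and only the namespace URI is taken from the code above.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"ns": "http://uniprot.org/uniprot"}

# Tiny stand-in for uniprot_sprot.xml; accessions and IDs are placeholders.
SAMPLE = b"""<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X00001</accession>
    <dbReference type="GO" id="GO:0000001"/>
    <dbReference type="Pfam" id="PF00000"/>
  </entry>
  <entry>
    <accession>X00002</accession>
    <dbReference type="GO" id="GO:0000002"/>
  </entry>
</uniprot>"""


def go_ids_per_entry(xml_source, skip):
    """Stream <entry> elements and map accession -> GO IDs, skipping `skip`."""
    result = defaultdict(list)
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "{http://uniprot.org/uniprot}entry":
            continue
        accession = elem.find("ns:accession", NS)
        if accession is not None and accession.text not in skip:
            for ref in elem.findall("ns:dbReference[@type='GO']", NS):
                result[accession.text].append(ref.get("id"))
        elem.clear()  # free the finished entry so memory use stays flat
    return dict(result)


print(go_ids_per_entry(io.BytesIO(SAMPLE), skip={"X00002"}))
```

Handling only `end` events guarantees that an entry is complete when it is inspected, and clearing it afterwards is what keeps a multi-gigabyte file parseable on a laptop.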

### Motifs
1. Search for significantly conserved short motifs inside your family. Use [ELM classes](http://elm.eu.org/elms) and [ProSite patterns](https://ftp.expasy.org/databases/prosite/prosite.dat) (for ProSite, consider only the pattern “PA” lines, not the profiles). Make sure to consider as true matches only those that are found inside disordered regions. Disordered regions for the entire SwissProt (as defined by MobiDB-lite) are available [here](https://drive.google.com/file/d/1m7rdFvQiCRizOx54YPk1eMw4qF1iskbz/view?usp=sharing).

The following script comes from https://github.com/stevin-wilson/PrositePatternsToPythonRegex/tree/master and converts ProSite patterns into Python regex patterns.

In [25]:
# This script goes through the ProSite .dat file and creates a dictionary of
# accession -> pattern pairs.
# It also converts the patterns to be compatible with the Python re module.
import re


aaList =  ["A",
           "C",
           "D",
           "E",
           "F",
           "G",
           "H",
           "I",
           "K",
           "L",
           "M",
           "N",
           "P",
           "Q",
           "R",
           "S",
           "T",
           "V",
           "W",
           "Y"]

prositeFile = open('Motifs/prosite.dat.txt','r')
# 'w' rather than 'a', so that re-running the cell does not append duplicate entries
outputFile = open('Motifs/prosite_preprocessed.txt','w')

currentDict = {}

# currentPA holds the ProSite pattern from the .dat file; if the pattern spans more than
# one line, the consecutive lines starting with the PA tag are read and concatenated.

# When a new line with the AC tag is read, currentPA from the previous accession is cleared.
# AC is the accession line of a new entry (e.g. PS00001).


# PA is the line that holds the pattern.



for line in prositeFile:
    line = line.strip()
    if re.match("AC   ", line):
        currentAC = line.split()[1][:-1]
        if currentAC not in currentDict.keys():
            currentDict[currentAC] = []
            if 'currentPA' in globals():
                del currentPA
    elif re.match("PA   ", line): 
        currentPA = line.split()[1] # Gets the current Pattern
        if currentPA[-1] == ".":
            currentPA = currentPA[:-1]
        if lastLineTag == "PA   ": # check whether the previous line was also a PA line
            currentPA =  lastLinePA + currentPA # if so, concatenate the two pattern fragments
            # since the main loop visits every line, all PA lines of an entry end up concatenated
    else:
        if not 'currentPA' in globals():
            lastLineTag = line[:5]
            continue # skip the current loop (i.e. go until we find again AC/PA)

        
        currentPAList = currentPA.split("-")
        refinedCurrentPAList = []

        for n in range(len(currentPAList)):
        # Redefine the Pattern such that it works with Python Regex

            #change to range

            betweenCurlyList = []
            betweenSquaresList = []

            currentAAList = aaList.copy()
            if re.search("x", currentPAList[n]):
                currentPAList[n] = currentPAList[n].replace("x",".")
            if re.search("{" , currentPAList[n]):
                currentPAList[n] = currentPAList[n].replace("{","#")
            if re.search("}" , currentPAList[n]):
                currentPAList[n] = currentPAList[n].replace("}","%")
            if "(" in currentPAList[n]:
                currentPAList[n] = currentPAList[n].replace("(","{")
            if ")" in currentPAList[n]:
                currentPAList[n] = currentPAList[n].replace(")","}")
            if currentPAList[n][0] == "<":
                currentPAList[n] = currentPAList[n].replace("<","^")
            if currentPAList[n][-1] == ">":
                currentPAList[n] = currentPAList[n].replace(">","$")
            if re.search("#" , currentPAList[n]):
                element = currentPAList[n]
                betweenCurly =  element[element.find("#")+1 : element.find("%")]
                if len(betweenCurly)>1:
                    betweenCurlyList = list(betweenCurly)
                    for aa in betweenCurlyList:
                        if aa in currentAAList:
                            currentAAList.remove(aa)
                    betweenCurly = str("[") + "".join(currentAAList) + str("]")
                    #print(betweenCurly)
                    currentPAList[n] = re.sub(r'#.*%', betweenCurly,currentPAList[n])
                else:
                    if betweenCurly in currentAAList:
                            currentAAList.remove(betweenCurly)
                    betweenCurly = str("[") + "".join(currentAAList) + str("]")
                    #print(betweenCurly)
                    currentPAList[n] = re.sub(r'#.*%', betweenCurly,currentPAList[n])

            
        finalPA = "".join(currentPAList)
        if len(finalPA) > 1:
            if finalPA[-2] == ">":
                finalPA1 = finalPA[:-2] + "]"
                refinedCurrentPAList.append(finalPA1)
                finalPA2 = finalPA[:finalPA.rindex("[")] + "$"
                refinedCurrentPAList.append(finalPA2)
            else:
                refinedCurrentPAList.append(finalPA)

        currentDict[currentAC] = refinedCurrentPAList
        del currentPA
# save information from the present line for retrieval while reading the next line
# (useful when dealing with long ProSite patterns spanning multiple lines)
    lastLineTag = line[:5]
    if re.match("PA   ", line):
        if 'currentPA' in globals():
            lastLinePA = currentPA
        else:
            lastLinePA = line.split()[1]

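for key in currentDict.keys():
    if len(currentDict[key]) != 0:
        print(key,currentDict[key] )
        outputFile.write(key + "\t" + "\t&\t".join(currentDict[key])+ "\n")

# Close both files so the preprocessed patterns are flushed to disk before later cells read them
prositeFile.close()
outputFile.close()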

PS00001 ['N[ACDEFGHIKLMNQRSTVWY][ST][ACDEFGHIKLMNQRSTVWY]']
PS00004 ['[RK]{2}.[ST]']
PS00005 ['[ST].[RK]']
PS00006 ['[ST].{2}[DE]']
PS00007 ['[RK].{2}[DE].{3}Y']
PS00008 ['G[ACGILMNQSTV].{2}[STAGCN][ACDEFGHIKLMNQRSTVWY]']
PS00009 ['.G[RK][RK]']
PS00010 ['C.[DN].{4}[FY].C.C']
PS00011 ['E.{2}[ERK]E.C.{6}[EDR].{10,11}[FYA][YW]']
PS00012 ['[DEQGSTALMKRH][LIVMFYSTAC][GNQ][LIVMFYAG][DNEKHS]S[LIVMST][ADEGHIKLMNQRSTVW][STAGCPQLIVMF][LIVMATN][DENQGTAKRHLM][LIVMWSTA][LIVGSTACR][ACDEFGHKMNQRSTVW][ACDEFGHIKLMNPQRSTW][LIVMFA]']
PS00014 ['[KRHQSA][DENQ]EL$']
PS00016 ['RGD']
PS00017 ['[AG].{4}GK[ST]']
PS00018 ['D[ACDEFGHIKLMNPQRSTVY][DNS][ACDEGHKMNPQRST][DENSTG][DNQGHRK][ACDEFHIKLMNQRSTVWY][LIVMC][DENQSTAGC].{2}[DE][LIVMFYW]']
PS00019 ['[EQ][ACDEFGIKMPQRSTVW].[ATV][FY][CEFGHIKNPQRSTVWY][ACDEFGHIKLMNPQRSVWY]W[ACDEFHIKLMNQRSTVWY]N']
PS00020 ['[LIVM].[SGNL][LIVMN][DAGHENRS][SAGPNVT].[DNEAG][LIVM].[DEAGQ].{4}[LIVM].[LM][SAG][LIVM][LIVMT][WS].{0,1}[LIVM]{2}']
PS00021 ['[FY]C[RH][NS].{7,8}[WY]C']
PS00022 [
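
The core of the conversion fits in a few lines. The sketch below is a simplified alternative for illustration, not the script above: `prosite_to_regex` is a hypothetical helper, and it writes excluded residues as a negated class (`[^P]`) instead of listing the remaining amino acids, so `[^P]` would also accept non-standard characters. It does not handle a `>` placed inside square brackets, which the original script treats as a special case. The input pattern is typed by hand and chosen so that the result equals the PS00004 line in the output above.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regex (simplified)."""
    elements = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".")                       # any residue
        element = element.replace("{", "[^").replace("}", "]")    # excluded residues
        element = element.replace("(", "{").replace(")", "}")     # repetition counts
        element = element.replace("<", "^").replace(">", "$")     # sequence termini
        elements.append(element)
    return "".join(elements)


regex = prosite_to_regex("[RK](2)-x-[ST].")
print(regex)
print([m.group() for m in re.finditer(regex, "AKRGSAA")])
```

The order of the replacements matters: curly braces must be rewritten before parentheses are turned into regex quantifiers, otherwise the new `{n}` quantifiers would be mistaken for exclusion sets.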

In [28]:
def load_protein_ids(psiblast_file, hmm_file, e_threshold=0.001):
    """Load protein IDs from PSI-BLAST and HMM search results."""
    psiblast_df = pd.read_csv(psiblast_file)
    hmm_df = pd.read_csv(hmm_file)
    
    filtered_hmm_proteins = hmm_df[hmm_df['E-value'] <= e_threshold]['uniprot_id']
    print(filtered_hmm_proteins)
    psiblast_proteins = set(psiblast_df['uniprot_id'])
    hmm_proteins = set(filtered_hmm_proteins)
    
    return psiblast_proteins.union(hmm_proteins)

def load_disordered_regions(mobidb_file, protein_ids):
    """Find the disordered regions of the proteins in our family, on which the motif analysis is run."""
    disordered_regions = {}
    with open(mobidb_file, "r") as file:
        reader = csv.reader(file)
        for row in reader:
            protein_id = row[0]
            # If the protein is in the family, save its disordered regions
            if protein_id in protein_ids:
                # Regions are stored as a nested list [[...]]; literal_eval parses the string into real lists
                parsed_list = ast.literal_eval(row[1])
                disordered_regions[protein_id] = [tuple(pair) for pair in parsed_list]  # one (start, end) tuple per region
    return disordered_regions

def load_sequences(fasta_file, protein_ids):
    sequences = {}
    for record in SeqIO.parse(fasta_file, "fasta"):
        protein_id = record.id.split("|")[1]
        if protein_id in protein_ids:
            sequences[protein_id] = str(record.seq)
    return sequences

def load_motifs(elm_file, prosite_file):
    motifs = {}
    # Extract needed information from ELM 
    with open(elm_file, 'r') as file:
        for _ in range(5):  # Skip first 5 header lines (see elm file, first 5 are just headers)
            next(file)
        reader = csv.DictReader(file, delimiter='\t')
        for row in reader:
            motif_name = row['ELMIdentifier']
            pattern = row['Regex'].strip()
            motifs[motif_name] = pattern

    # Extract needed information from Prosite
    with open(prosite_file, 'r') as file:
        for line in file:
            # Each row looks like: PS00001<TAB>N[ACDEFGHIKLMNQRSTVWY][ST][ACDEFGHIKLMNQRSTVWY]
            parts = line.strip().split('\t')
            # Rows holding several alternative patterns (joined by "\t&\t" in the
            # preprocessing step) have more than two fields and are skipped here
            if len(parts) == 2:
                ac, pattern = parts  # accession and regex pattern
                motifs[ac] = pattern
    return motifs

def match_motifs(sequences, disordered_regions, motifs):
    # For every family sequence with annotated disordered regions, scan each region for motif matches
    results = {}
    for protein_id, sequence in sequences.items():
        if protein_id in disordered_regions:
            results[protein_id] = []
            for start, end in disordered_regions[protein_id]:
                region_seq = sequence[start - 1:end] # start-1 due to python indexing starting from 0 
                # Now that we are looking at a disordered region in one of the proteins of our family, lets
                # see if we find any motifs
                for motif_name, pattern in motifs.items():
                    # Example values:
                    #pattern = "N[ACDEFGHIKLMNQRSTVWY][ST][ACDEFGHIKLMNQRSTVWY]"
                    #region_seq = "NKSTMNLSTPNQSTV"

                    # re.finditer() yields every non-overlapping occurrence where:
                    # - N matches literally
                    # - followed by ANY ONE of the amino acids in [ACDEFGHIKLMNQRSTVWY]
                    # - followed by either S or T
                    # - followed by ANY ONE of the amino acids in [ACDEFGHIKLMNQRSTVWY]
                    matches = re.finditer(pattern, region_seq)
                    for match in matches:
                        results[protein_id].append({
                            "motif": motif_name,
                            "match": match.group(),
                            # 1-based, inclusive positions in the full protein sequence
                            "start": start + match.start(),
                            "end": start + match.end() - 1  # match.end() is exclusive
                        })
    return results

def save_results(results, output_file):
    with open(output_file, "w") as file:
        for protein_id, matches in results.items():
            file.write(f"> {protein_id}\n")
            for match in matches:
                file.write(
                    f"{match['motif']}\t{match['match']}\t{match['start']}-{match['end']}\n"
                )


# Note: uniprot_sprot.fasta is kept locally and is not in the Git repository because of its size
def main():
    mobidb_file = "Motifs/mobidb_lite_swissprot.csv"
    fasta_file = "uniprot_sprot.fasta"
    elm_file = "Motifs/elm_classes.tsv"
    prosite_file = "Motifs/prosite_preprocessed.txt"
    psiblast_file = "Model/Evaluation/Predictions/PSI-BLAST/psiblastsearch_output.csv"
    hmm_file = "Model/Evaluation/Predictions/HMM-SEARCH/hmmsearch_output.csv"
    output_file = "Motifs/conserved_motifs_in_disorder.txt"

    protein_ids = load_protein_ids(psiblast_file, hmm_file)
    sequences = load_sequences(fasta_file, protein_ids)
    disordered_regions = load_disordered_regions(mobidb_file, protein_ids)
    motifs = load_motifs(elm_file, prosite_file)

    results = match_motifs(sequences, disordered_regions, motifs)
    save_results(results, output_file)
    print(f"Results saved to {output_file}")

if __name__ == "__main__":
    main()

0     P06857
1     P54316
2     Q5BKQ4
3     P16233
4     P54315
       ...  
78    P02844
79    P27587
80    P0DSI2
81    Q68KK0
82    P83629
Name: uniprot_id, Length: 83, dtype: object
Results saved to Motifs/conserved_motifs_in_disorder.txt
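
Mapping a regex hit inside a disordered region back to protein coordinates is easy to get wrong by one residue, so a small worked example may help. It is a sketch that assumes region boundaries are 1-based and inclusive; the helper name `matches_in_region`, the sequence, and the region are made up, and only the regex is the PS00005 pattern from the listing above.

```python
import re


def matches_in_region(sequence, start, end, pattern):
    """Find `pattern` inside the 1-based, inclusive region [start, end].

    Returns (matched_text, first_residue, last_residue) with 1-based,
    inclusive positions in the full sequence.
    """
    region = sequence[start - 1:end]
    hits = []
    for match in re.finditer(pattern, region):
        first = start + match.start()        # region offset 0 maps to `start`
        last = start + match.end() - 1       # match.end() is exclusive
        hits.append((match.group(), first, last))
    return hits


# Made-up sequence and region, with the PS00005 regex from the listing above.
sequence = "MAAASPRKTGKAAA"
print(matches_in_region(sequence, 4, 11, "[ST].[RK]"))
```

Slicing the full sequence with `sequence[first - 1:last]` gives back the matched text, which is a quick sanity check for any reported coordinates.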
