# Ubicon Tutorial

This notebook demonstrates how to predict E3 ligase-substrate interactions using the Ubicon model. The tutorial walks through loading protein sequences, generating embeddings, and predicting interaction scores.

In [None]:

import pandas as pd
import torch
from embedding import Embedding

E3_fasta_path = "examples/E3.fasta"
Sub_fasta_path = "examples/Substrate.fasta"

# Examples of E3 ligases and their substrates (UniProt IDs in parentheses):

# FBXL5 (Q9UKA1) - IRP2 (P48200)
# VHL (P40337) - HIF1a (Q16665)
# RNF4 (P78317) - DDIT4 (Q9NX09)
# βTrCP2 (Q9UKB1) - p53 (P04637)

## 1. Loading Protein Sequences

We'll start by loading FASTA files containing protein sequences for E3 ligases and their substrates. The `fasta_to_dict` function converts these sequences into dictionaries that can be used for further processing.

In [None]:
from Bio import SeqIO
def fasta_to_dict(input_fasta):
    """
    Load the specified FASTA file and return a dictionary of {ID: sequence}.
    Sequences longer than 2046 residues are skipped.

    Parameters:
        input_fasta (str): Path to the input FASTA file.

    Returns:
        dict: Mapping from UniProt ID (or the raw record ID) to sequence string.
    """
    fasta_dict = {}

    for record in SeqIO.parse(input_fasta, "fasta"):
        # Sequence length restriction
        if len(record.seq) <= 2046:
            uniprot_id = record.id.split("|")[1] if "|" in record.id else record.id
            fasta_dict[uniprot_id] = str(record.seq)

    return fasta_dict


E3_seq_dict = fasta_to_dict(E3_fasta_path)
Sub_seq_dict = fasta_to_dict(Sub_fasta_path)
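
To make the ID handling and the length cutoff concrete, here is a minimal sketch that mirrors the logic of `fasta_to_dict` without Biopython. The helper name `extract_uniprot_id`, the record IDs, and the toy sequences are hypothetical and only serve as illustration; they are not taken from the example FASTA files.

```python
MAX_LEN = 2046  # same length cutoff as in fasta_to_dict

def extract_uniprot_id(record_id):
    """Return the accession from a UniProt-style ID, or the ID unchanged."""
    return record_id.split("|")[1] if "|" in record_id else record_id

# Hypothetical (record ID, sequence) pairs standing in for parsed FASTA records
records = [
    ("sp|P04637|P53_HUMAN", "MKTAYIAK"),   # UniProt-style header, toy sequence
    ("Q16665", "MKVLAAGI"),                # bare accession, toy sequence
    ("sp|TOY001|TOO_LONG", "A" * 2047),    # one residue over the cutoff, dropped
]

seq_dict = {
    extract_uniprot_id(rid): seq
    for rid, seq in records
    if len(seq) <= MAX_LEN
}
print(sorted(seq_dict))
```

Note that over-length sequences are dropped silently, so a protein listed in the FASTA file may be absent from the resulting dictionary.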

## 2. Feature Embeddings

Next, we'll generate or load feature embeddings for the proteins. These embeddings capture the protein sequence information in a format suitable for machine learning models.

Note: Embedding generation can be computationally intensive. If you cannot run the cell below, skip it and load the pre-computed feature embeddings in Section 3 instead by uncommenting the `torch.load` lines there.

In [None]:
# Feature embeddings using finetuned ESM C
E3_feature_embed, Sub_feature_embed = Embedding(E3_seq_dict=E3_seq_dict, Sub_seq_dict=Sub_seq_dict)

In [None]:
E3_feature_embed

In [None]:
Sub_feature_embed

## 3. Loading Pre-computed Embeddings

For this tutorial, we'll load pre-computed embeddings for:

- **Sequence features (using ESM-C)**: These embeddings capture protein sequence information using a fine-tuned language model.

- **Subcellular localization (using DeepLoc2)**: DeepLoc2 predicts protein subcellular localization based on sequence information. You can access DeepLoc2 through their [web server](https://services.healthtech.dtu.dk/services/DeepLoc-2.0/) or [GitHub repository](https://github.com/TviNet/DeepLoc-2.0).

- **Structural information (using Foldseek)**: Foldseek derives 3Di (3D interaction alphabet) strings from protein structures. The 3Di representation encodes the local structural environment of each residue as one letter, giving a 1D string. To generate these embeddings:
  - First, we obtain protein structures (e.g., from AlphaFold)
  - Then we run Foldseek's createdb command to extract the 3Di structural alphabet
  - This converts 3D structural information into a sequence-like representation that captures important structural features

These three types of embeddings provide complementary information about the proteins that helps predict their interactions accurately. By combining sequence, structure, and localization information, Ubicon can identify potential E3-substrate pairs more effectively than using any single data type alone.

In [None]:
import json
# If you could not generate the E3 and substrate feature embeddings in Section 2,
# uncomment the two lines below to load the pre-computed ones instead.
# E3_feature_embed = torch.load('examples/E3_feature_embedding.pt')
# Sub_feature_embed = torch.load('examples/Sub_feature_embedding.pt')



# Location embeddings using DeepLoc2
# These embeddings were obtained with DeepLoc2. See the DeepLoc2 paper (https://doi.org/10.1093/nar/gkac278) or GitHub repository (https://github.com/TviNet/DeepLoc-2.0) for details.

# If you cannot generate the E3 and substrate location embeddings yourself, load the pre-computed examples below.
E3_location_embed = pd.read_csv('examples/E3_location_embedding.csv')
Sub_location_embed = pd.read_csv('examples/Sub_location_embedding.csv')



# Structure embeddings using Foldseek
# These embeddings were obtained with Foldseek. See the Foldseek paper (https://doi.org/10.1038/s41587-023-01773-0) or GitHub repository (https://github.com/steineggerlab/foldseek) for details.

# If you cannot generate the E3 and substrate structure embeddings yourself, load the pre-computed examples below.
# Load the example structure embeddings (JSON)
with open('examples/E3_structure_embed.json', 'r') as f:
    E3_structure_embed = json.load(f)
with open('examples/Sub_structure_embed.json', 'r') as f:
    Sub_structure_embed = json.load(f)

## 4. Creating Protein Pairs

Now we'll create a dataframe containing E3-substrate pairs for prediction. For this example, we'll use four known E3-substrate pairs from the literature.

In [None]:
# Create dataframe for E3-substrate pairs
# Using 4 sample pairs
pairs_data = [
    {"e3_uniprot_id": "Q9UKA1", "substrate_uniprot_id": "P48200", "e3_name": "FBXL5", "substrate_name": "IRP2"},
    {"e3_uniprot_id": "P40337", "substrate_uniprot_id": "Q16665", "e3_name": "VHL", "substrate_name": "HIF1a"},
    {"e3_uniprot_id": "P78317", "substrate_uniprot_id": "Q9NX09", "e3_name": "RNF4", "substrate_name": "DDIT4"},
    {"e3_uniprot_id": "Q9UKB1", "substrate_uniprot_id": "P04637", "e3_name": "βTrCP2", "substrate_name": "p53"}
]
pairs_df = pd.DataFrame(pairs_data)

## 5. Predicting Interaction Scores

With all the embeddings loaded and pairs defined, we can now predict interaction scores using the Ubicon model. The following steps combine all embeddings and load the model for prediction.
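
Before scoring, it can help to confirm that every protein in the pair list actually has an embedding, since sequences over the length cutoff never make it into the sequence dictionaries. The sketch below assumes the embeddings are dictionaries keyed by UniProt ID (suggested by the `{**a, **b}` merge used in this section, but not confirmed). The helper `missing_ids` and the toy values are hypothetical, and how `process_chunk` itself treats missing IDs is not shown in this tutorial.

```python
def missing_ids(pairs, *lookups):
    """Return IDs used in pairs that are absent from at least one lookup dict."""
    needed = {p["e3_uniprot_id"] for p in pairs} | {
        p["substrate_uniprot_id"] for p in pairs
    }
    return sorted(pid for pid in needed if any(pid not in lk for lk in lookups))

# Toy stand-ins for the real embedding dictionaries
e3_embed = {"P40337": [0.1, 0.2]}
sub_embed = {"Q16665": [0.3, 0.4]}
combined = {**e3_embed, **sub_embed}  # on duplicate keys, the right-hand dict wins

pairs = [
    {"e3_uniprot_id": "P40337", "substrate_uniprot_id": "Q16665"},
    {"e3_uniprot_id": "Q9UKA1", "substrate_uniprot_id": "P48200"},
]
print(missing_ids(pairs, combined))
```

Any ID reported here would need its embedding generated or loaded before the pair can be scored.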

In [None]:
import sys
sys.path.append("src")
from score_utils import load_model, process_chunk

# Path to required resources
model_path = "models/final_catboost_model.cbm"  # Please change this path to the actual model path
# Combining embedding data
combined_embeddings = {**E3_feature_embed, **Sub_feature_embed}

# Combining location information dataframes
combined_location = pd.concat([E3_location_embed, Sub_location_embed])

# Combining structure embeddings
combined_structure = {**E3_structure_embed, **Sub_structure_embed}

# Load the model
print("Loading model...")
model = load_model(model_path)

# Calculate scores
print("Calculating scores for E3-substrate pairs...")
results_df = process_chunk(
    pairs_df, 
    model, 
    combined_embeddings, 
    combined_location, 
    combined_structure
)

## 6. Score Calibration

Finally, we calibrate the raw prediction scores to produce the final Ubicon scores. The calibration step passes the raw scores through a pre-trained isotonic regression model, so that the resulting scores are on a common scale and can be read as confidence levels for the predicted interactions.
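
The calibration model used here is loaded from disk, and how it was fitted is not shown in this tutorial. As an illustration of what isotonic regression does in general, the following sketch implements the pool-adjacent-violators algorithm in plain Python on toy data. The function name `pav_fit`, the scores, and the labels are all hypothetical.

```python
def pav_fit(y):
    """Return the non-decreasing least-squares fit to y (equal weights)."""
    blocks = []  # each block holds [sum of values, number of values]
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while the previous block's mean exceeds the last one's
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # every member of a block gets the block mean
    return fitted

# Toy raw scores (already sorted) with binary interaction labels
raw_scores = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
labels = [0, 1, 0, 0, 1, 1]

calibrated = pav_fit(labels)
for score, value in zip(raw_scores, calibrated):
    print(f"raw={score:.1f} -> calibrated={value:.3f}")
```

The fitted values never decrease as the raw score increases, which is the property that lets a calibrated score be compared across pairs.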

In [None]:
# Calculating calibrated scores (Ubicon)
import numpy as np
import joblib

# Path to calibration model
isotonic_model_path = "models/isotonic_calibration_model.pkl"  # Specify the actual model path

# Loading calibration model
try:
    # Load Isotonic Regression model
    ir_model = joblib.load(isotonic_model_path)
    
    # Calculate calibrated scores (Ubicon) from the original scores
    scores = np.array(results_df['substrate_prediction_score'])
    calibrated_scores = ir_model.predict(scores)
    
    # Add results to dataframe
    results_df['ubicon_score'] = calibrated_scores
    
    # Display calibrated scores (Ubicon)
    print("Ubicon scores calculated successfully")
    display(results_df[['e3_name', 'substrate_name', 'e3_uniprot_id', 'substrate_uniprot_id', 'ubicon_score']])
    
except Exception as e:
    # The try block covers both loading the model and applying it
    print(f"Failed to load the calibration model or compute calibrated scores: {e}")