<a href="https://colab.research.google.com/github/Shimizu-team/Ubicon/blob/main/colab_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ubicon: E3 Ligase-Substrate Interaction Prediction

This notebook demonstrates how to predict E3 ligase-substrate interactions using Ubicon in Google Colab.

**Paper**: High-Resolution Mapping of the Human E3-Substrate Interactome using Ubicon Uncovers Network Architecture and Cancer Vulnerabilities

**Repository**: https://github.com/shimizu-team/ubicon

## Setup and Installation

First, let's clone the repository and install the required packages.

In [1]:
# Clone the repository
!git clone https://github.com/shimizu-team/ubicon.git
%cd ubicon

# List contents to verify
!ls -la

Cloning into 'ubicon'...
remote: Enumerating objects: 47, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 47 (delta 7), reused 29 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (47/47), 6.10 MiB | 25.83 MiB/s, done.
Resolving deltas: 100% (7/7), done.
/content/ubicon
total 616
drwxr-xr-x 8 root root   4096 Jun 29 02:23 .
drwxr-xr-x 1 root root   4096 Jun 29 02:23 ..
-rw-r--r-- 1 root root  98779 Jun 29 02:23 colab_demo.ipynb
drwxr-xr-x 2 root root   4096 Jun 29 02:23 config
-rwxr-xr-x 1 root root   3086 Jun 29 02:23 embedding.py
-rw-r--r-- 1 root root   9346 Jun 29 02:23 environment.yml
drwxr-xr-x 2 root root   4096 Jun 29 02:23 examples
drwxr-xr-x 8 root root   4096 Jun 29 02:23 .git
-rw-r--r-- 1 root root    586 Jun 29 02:23 .gitignore
-rw-r--r-- 1 root root   1068 Jun 29 02:23 LICENSE
drwxr-xr-x 2 root root   4096 Jun 29 02:23 models
drwxr-xr-x 2 root root   4096 Jun 29 02:23 params
-rw-r--r-- 

In [2]:
# Install required packages
!pip install torch==2.6.0
!pip install catboost==1.2.8
!pip install transformers==4.46.3
!pip install peft==0.14.0
!pip install accelerate==1.4.0
!pip install datasets==3.3.2
!pip install evaluate==0.4.3
!pip install einops==0.8.1
!pip install safetensors==0.5.2
!pip install biopython==1.83
!pip install tqdm

# Standard packages (usually pre-installed in Colab)
!pip install pandas numpy scikit-learn matplotlib seaborn

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch==2.6.0)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch==2.6.0)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch==2.6.0)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torc

## Verify Installation

Let's check if everything is properly installed.

In [3]:
import torch
import pandas as pd
import numpy as np
from catboost import CatBoostClassifier
import sys
import os

print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Current directory: {os.getcwd()}")

# Check if model files exist
print(f"\nModel files:")
print(f"CatBoost model exists: {os.path.exists('models/final_catboost_model.cbm')}")
print(f"Calibration model exists: {os.path.exists('models/isotonic_calibration_model.pkl')}")
print(f"LoRA parameters exist: {os.path.exists('params/lora_param.pt')}")

Python version: 3.11.13 (main, Jun  4 2025, 08:57:29) [GCC 11.4.0]
PyTorch version: 2.6.0+cu124
CUDA available: True
Current directory: /content/ubicon

Model files:
CatBoost model exists: True
Calibration model exists: True
LoRA parameters exist: True
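The same check can be wrapped in a small helper that lists whatever is missing, which is convenient when the notebook is rerun from a different working directory. This is a minimal sketch: the three paths are the ones printed above, while `find_missing_files` and `REQUIRED_FILES` are hypothetical names introduced here and are not part of the Ubicon repository.

```python
from pathlib import Path

# Paths taken from the check above; adjust them if the repository layout changes.
REQUIRED_FILES = [
    "models/final_catboost_model.cbm",
    "models/isotonic_calibration_model.pkl",
    "params/lora_param.pt",
]

def find_missing_files(root, relative_paths=REQUIRED_FILES):
    """Return the relative paths that do not exist as files under root."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).is_file()]

missing = find_missing_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files found.")
```

An empty list means the later cells should be able to load the model, the calibrator, and the LoRA parameters from the current directory.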


## Loading Protein Sequences

We'll start by loading FASTA files containing protein sequences for E3 ligases and their substrates. The `fasta_to_dict` function converts these sequences into dictionaries that can be used for further processing.

In [5]:
from Bio import SeqIO
from embedding import Embedding

E3_fasta_path = "examples/E3.fasta"
Sub_fasta_path = "examples/Substrate.fasta"
def fasta_to_dict(input_fasta):
    """
    Load a FASTA file and return a dictionary of {ID: sequence}.

    Sequences longer than 2046 residues are skipped. For headers of the
    form "db|ACCESSION|NAME", the accession is used as the ID.

    Parameters:
        input_fasta (str): Path to the input FASTA file.

    Returns:
        dict: Mapping from protein ID to amino acid sequence.
    """
    fasta_dict = {}

    for record in SeqIO.parse(input_fasta, "fasta"):
        # Sequence length restriction
        if len(record.seq) <= 2046:
            uniprot_id = record.id.split("|")[1] if "|" in record.id else record.id
            fasta_dict[uniprot_id] = str(record.seq)

    return fasta_dict


E3_seq_dict = fasta_to_dict(E3_fasta_path)
Sub_seq_dict = fasta_to_dict(Sub_fasta_path)
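Because `fasta_to_dict` silently skips sequences longer than 2046 residues, it can be useful to see which IDs were dropped. The sketch below reproduces the ID parsing and the length cutoff on plain dictionaries so that it runs without Biopython. The helper names and the toy sequences are hypothetical and only illustrate the behavior.

```python
MAX_LEN = 2046  # same cutoff as in fasta_to_dict above

def extract_uniprot_id(record_id):
    """Return the accession from a 'db|ACCESSION|NAME' ID, else the ID unchanged."""
    return record_id.split("|")[1] if "|" in record_id else record_id

def split_by_length(seqs, max_len=MAX_LEN):
    """Partition {id: sequence} into (kept dict, sorted list of dropped IDs)."""
    kept = {k: s for k, s in seqs.items() if len(s) <= max_len}
    dropped = sorted(set(seqs) - set(kept))
    return kept, dropped

# Toy records (made-up sequences, not real proteins)
toy = {
    extract_uniprot_id("sp|Q9UKA1|FBXL5_HUMAN"): "MAPFPEE",
    extract_uniprot_id("my_long_protein"): "A" * 3000,
}
kept, dropped = split_by_length(toy)
print(sorted(kept))  # ['Q9UKA1']
print(dropped)       # ['my_long_protein']
```

Any ID that appears in `dropped` will have no embedding later on, so pairs that use it cannot be scored.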

## Feature Embeddings

Next, we'll generate or load feature embeddings for the proteins. These embeddings capture the protein sequence information in a format suitable for machine learning models.

Note: Embedding generation can be computationally intensive and downloads the model weights on first use. If you cannot run it, the pre-computed embeddings loaded later in this tutorial can be used instead.

In [6]:
# Feature embeddings using fine-tuned ESM C
E3_feature_embed, Sub_feature_embed = Embedding(E3_seq_dict=E3_seq_dict, Sub_seq_dict=Sub_seq_dict)

config.json:   0%|          | 0.00/770 [00:00<?, ?B/s]

modeling_esm_plusplus.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/Synthyra/ESMplusplus_small:
- modeling_esm_plusplus.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors:   0%|          | 0.00/1.33G [00:00<?, ?B/s]

In [7]:
E3_feature_embed

{'Q9UKA1': tensor([-1.7971e-02,  2.8005e-02,  3.2803e-02, -2.1185e-02,  2.2427e-03,
         -1.1852e-02, -2.1873e-02, -7.7720e-02, -3.9713e-02,  7.9991e-02,
          7.1428e-04, -5.8928e-02,  6.8013e-02,  1.5798e-02,  3.5328e-04,
         -2.6678e-02,  3.1527e-02, -4.5968e-02, -2.7916e-02,  3.4701e-02,
          1.4747e-02,  3.5335e-02, -2.6412e-02,  4.3075e-01, -1.1384e-02,
         -3.2949e-02, -1.7170e-02,  5.5331e-03,  7.1072e-02,  3.3606e-02,
          1.2064e-02,  7.8899e-02,  2.6033e-02, -4.3563e-02,  7.6398e-02,
          3.3207e-02,  2.3335e-02,  7.8749e-03,  3.0956e-03, -3.6149e-02,
         -1.6760e-02, -1.3650e-02, -5.4117e-02,  6.6705e-02, -5.9508e-02,
         -3.8095e-03, -1.0164e-02,  1.8336e-03, -6.0509e-02,  1.8894e-02,
         -1.9601e-02, -5.8772e-02, -5.5626e-02,  6.5476e-02,  3.5556e-02,
          4.7189e-02, -8.4873e-03, -4.4894e-02,  4.5144e-03, -3.7358e-02,
          2.4258e-02, -2.4783e-02, -9.6798e-03,  1.2051e-02,  5.7635e-05,
         -2.5497e-02, -9.420

In [8]:
Sub_feature_embed

{'P48200': tensor([-1.0256e-02, -8.2375e-04,  1.0948e-02, -1.9100e-02,  5.5208e-03,
          2.5606e-03,  1.2838e-03, -1.0149e-02, -1.8164e-02,  1.5094e-02,
          5.6958e-04, -1.6581e-02,  2.6992e-02,  2.2671e-02, -3.3346e-03,
          1.1482e-03,  1.9756e-02, -4.6691e-04, -9.7999e-03, -2.5259e-03,
          1.7904e-02,  8.7145e-03, -1.2864e-02,  6.0375e-01, -2.7164e-02,
         -1.5490e-02, -5.8356e-03,  9.7886e-03,  7.7909e-03,  2.4431e-02,
         -1.6514e-02,  1.1411e-02, -1.2129e-02, -1.1640e-02,  7.9603e-03,
          1.5659e-02, -6.8223e-03,  7.3915e-03,  7.1746e-03, -4.7423e-04,
         -2.6969e-02, -1.8539e-02,  8.7721e-03, -1.6671e-03, -1.7449e-02,
         -1.0913e-02, -1.1659e-02,  1.6708e-02, -1.4448e-02, -2.6743e-03,
          3.6679e-03, -8.2582e-03, -1.1607e-02,  2.8174e-02, -1.1163e-03,
          2.4683e-03,  6.0017e-03,  1.8819e-03, -5.8054e-03,  6.4707e-03,
         -1.1196e-02, -2.0034e-03, -1.2140e-02, -2.1066e-02,  6.4217e-03,
          6.2175e-03, -6.675

## Loading Pre-computed Embeddings

For this tutorial, we'll load pre-computed embeddings for:

- **Sequence features (using ESM-C)**: These embeddings capture protein sequence information using a fine-tuned language model.

- **Subcellular localization (using DeepLoc2)**: DeepLoc2 predicts protein subcellular localization based on sequence information. You can access DeepLoc2 through their [web server](https://services.healthtech.dtu.dk/services/DeepLoc-2.0/) or [GitHub repository](https://github.com/TviNet/DeepLoc-2.0).

- **Structural information (using Foldseek)**: Foldseek generates 3Di (3D interaction alphabet) embeddings from protein structures. The 3Di representation encodes the local structural environment of each residue as a 1D string. To generate these embeddings:
  - First, we obtain protein structures (e.g., from AlphaFold)
  - Then we run Foldseek's createdb command to extract the 3Di structural alphabet
  - This converts 3D structural information into a sequence-like representation that captures important structural features

These three types of embeddings provide complementary information about the proteins that helps predict their interactions accurately. By combining sequence, structure, and localization information, Ubicon can identify potential E3-substrate pairs more effectively than using any single data type alone.

In [9]:
import json
# Load the pre-computed sequence feature embeddings.
# Use these if you could not generate the E3 and substrate feature embeddings with Embedding() above.
E3_feature_embed = torch.load('examples/E3_feature_embedding.pt')
Sub_feature_embed = torch.load('examples/Sub_feature_embedding.pt')



# Location embeddings using DeepLoc2
# These embeddings are obtained with the DeepLoc2 model. For details, see the DeepLoc2 paper (https://doi.org/10.1093/nar/gkac278) or GitHub repository (https://github.com/TviNet/DeepLoc-2.0).

# Load the pre-computed E3 and substrate location embeddings.
E3_location_embed = pd.read_csv('examples/E3_location_embedding.csv')
Sub_location_embed = pd.read_csv('examples/Sub_location_embedding.csv')



# Structure embeddings using Foldseek
# These embeddings are obtained with Foldseek. For details, see the Foldseek paper (https://doi.org/10.1038/s41587-023-01773-0) or GitHub repository (https://github.com/steineggerlab/foldseek).

# Load the pre-computed E3 and substrate structure embeddings.
with open('examples/E3_structure_embed.json', 'r') as f:
    E3_structure_embed = json.load(f)
with open('examples/Sub_structure_embed.json', 'r') as f:
    Sub_structure_embed = json.load(f)

## Creating Protein Pairs

Now we'll create a dataframe containing E3-substrate pairs for prediction. For this example, we'll use four known E3-substrate pairs from the literature.

In [10]:
# Create dataframe for E3-substrate pairs
# Using 4 sample pairs
pairs_data = [
    {"e3_uniprot_id": "Q9UKA1", "substrate_uniprot_id": "P48200", "e3_name": "FBXL5", "substrate_name": "IRP2"},
    {"e3_uniprot_id": "P40337", "substrate_uniprot_id": "Q16665", "e3_name": "VHL", "substrate_name": "HIF1a"},
    {"e3_uniprot_id": "P78317", "substrate_uniprot_id": "Q9NX09", "e3_name": "RNF4", "substrate_name": "DDIT4"},
    {"e3_uniprot_id": "Q9UKB1", "substrate_uniprot_id": "P04637", "e3_name": "βTrCP2", "substrate_name": "p53"}
]
pairs_df = pd.DataFrame(pairs_data)
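Before scoring, it can help to confirm that every protein in the pair list has an entry in each embedding source, since a missing ID would otherwise only surface during prediction. The following is a minimal sketch on toy data. `missing_ids` is a hypothetical helper, and the assumption that each lookup is keyed by UniProt ID is confirmed above only for the sequence feature embeddings.

```python
def missing_ids(pairs, *id_collections):
    """Return the sorted IDs used in pairs that are absent from any collection."""
    needed = set()
    for pair in pairs:
        needed.add(pair["e3_uniprot_id"])
        needed.add(pair["substrate_uniprot_id"])
    missing = set()
    for ids in id_collections:
        missing |= needed - set(ids)
    return sorted(missing)

# Toy stand-ins for the embedding lookups (hypothetical contents)
pairs = [
    {"e3_uniprot_id": "Q9UKA1", "substrate_uniprot_id": "P48200"},
    {"e3_uniprot_id": "P40337", "substrate_uniprot_id": "Q16665"},
]
feature_ids = {"Q9UKA1", "P48200", "P40337", "Q16665"}
structure_ids = {"Q9UKA1", "P48200", "P40337"}  # Q16665 deliberately left out

print(missing_ids(pairs, feature_ids, structure_ids))  # ['Q16665']
```

In the notebook, the same helper could be called with `pairs_data` and the keys of the merged feature and structure dictionaries. The ID column of the location CSV files is not shown in this tutorial, so it is left out of the sketch.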

## Predicting Interaction Scores

With all the embeddings loaded and pairs defined, we can now predict interaction scores using the Ubicon model. The following steps combine all embeddings and load the model for prediction.

In [11]:
import sys
sys.path.append("src")
from score_utils import load_model, process_chunk  # load_model is used here instead of load_resources

# Path to the trained CatBoost model
model_path = "models/final_catboost_model.cbm"  # Change this if the model is stored elsewhere

# Combine the sequence feature embeddings
combined_embeddings = {**E3_feature_embed, **Sub_feature_embed}

# Combine the location information dataframes
combined_location = pd.concat([E3_location_embed, Sub_location_embed])

# Combine the structure embeddings
combined_structure = {**E3_structure_embed, **Sub_structure_embed}

# Load the model
print("Loading model...")
model = load_model(model_path)

# Calculate scores
print("Calculating scores for E3-substrate pairs...")
results_df = process_chunk(
    pairs_df,
    model,
    combined_embeddings,
    combined_location,
    combined_structure
)

Loading model...
Loading the model... models/final_catboost_model.cbm
Calculating scores for E3-substrate pairs...
Processing batch: 0-4/4


Preparing features: 100%|██████████| 4/4 [00:00<00:00, 619.93it/s]


## Score Calibration

Finally, we calibrate the raw prediction scores with a pre-trained isotonic regression model to produce the final Ubicon scores. Isotonic regression rescales the raw scores monotonically, so the ranking of pairs is preserved while the calibrated values can be interpreted as confidence levels for the predicted interactions.

In [12]:
# Calculate calibrated scores (Ubicon)
import numpy as np
import joblib

# Path to calibration model
isotonic_model_path = "models/isotonic_calibration_model.pkl"  # Specify the actual model path

# Loading calibration model
try:
    # Load Isotonic Regression model
    ir_model = joblib.load(isotonic_model_path)

    # Calculate calibrated scores (Ubicon) from the original scores
    scores = np.array(results_df['substrate_prediction_score'])
    calibrated_scores = ir_model.predict(scores)

    # Add results to dataframe
    results_df['ubicon_score'] = calibrated_scores

    # Display calibrated scores (Ubicon)
    print("Ubicon scores calculated successfully")
    display(results_df[['e3_name', 'substrate_name', 'e3_uniprot_id', 'substrate_uniprot_id', 'ubicon_score']])

except Exception as e:
    # The try block covers loading, prediction, and display, so report the step generically
    print(f"Failed to calculate Ubicon scores: {e}")

Ubicon scores calculated successfully



| | e3_name | substrate_name | e3_uniprot_id | substrate_uniprot_id | ubicon_score |
|---|---|---|---|---|---|
| 0 | FBXL5 | IRP2 | Q9UKA1 | P48200 | 0.883333 |
| 1 | VHL | HIF1a | P40337 | Q16665 | 0.821705 |
| 2 | RNF4 | DDIT4 | P78317 | Q9NX09 | 0.923077 |
| 3 | βTrCP2 | p53 | Q9UKB1 | P04637 | 0.878049 |
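
To give some intuition for the calibration step, the sketch below fits an isotonic (non-decreasing) function to toy binary labels with the pool-adjacent-violators algorithm. It only illustrates the idea behind isotonic regression. It is not the procedure used to train the Ubicon calibration model, which is loaded pre-trained from `models/isotonic_calibration_model.pkl`, and the labels here are made up.

```python
def pava(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to y (unit weights)."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge backwards while the previous block's mean exceeds the current block's mean
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

# Toy labels (1 = true interaction), already ordered by increasing raw score
labels = [0, 0, 1, 0, 1, 1]
print(pava(labels))  # [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
```

The out-of-order pair in the middle is pooled into a shared value of 0.5, which is how the fit remains monotone: a higher raw score never receives a lower calibrated score.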
