<a href="https://colab.research.google.com/github/Shimizu-team/Ubicon/blob/main/colab_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ubicon: E3 Ligase-Substrate Interaction Prediction

This notebook demonstrates how to predict E3 ligase-substrate interactions using Ubicon in Google Colab.

**Paper**: High-Resolution Mapping of the Human E3-Substrate Interactome using Ubicon Uncovers Network Architecture and Cancer Vulnerabilities

**Repository**: https://github.com/shimizu-team/ubicon

## Setup and Installation

First, let's clone the repository and install the required packages.

In [1]:
# Clone the repository
!git clone https://github.com/shimizu-team/ubicon.git
%cd ubicon

# List contents to verify
!ls -la

Cloning into 'ubicon'...
remote: Enumerating objects: 41, done.
remote: Counting objects: 100% (41/41), done.
remote: Compressing objects: 100% (38/38), done.
remote: Total 41 (delta 4), reused 29 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (41/41), 6.09 MiB | 11.90 MiB/s, done.
Resolving deltas: 100% (4/4), done.
/content/ubicon
total 516
drwxr-xr-x 8 root root   4096 Jun 29 01:57 .
drwxr-xr-x 1 root root   4096 Jun 29 01:57 ..
drwxr-xr-x 2 root root   4096 Jun 29 01:57 config
-rwxr-xr-x 1 root root   3086 Jun 29 01:57 embedding.py
-rw-r--r-- 1 root root   9346 Jun 29 01:57 environment.yml
drwxr-xr-x 2 root root   4096 Jun 29 01:57 examples
drwxr-xr-x 8 root root   4096 Jun 29 01:57 .git
-rw-r--r-- 1 root root    586 Jun 29 01:57 .gitignore
-rw-r--r-- 1 root root   1068 Jun 29 01:57 LICENSE
drwxr-xr-x 2 root root   4096 Jun 29 01:57 models
drwxr-xr-x 2 root root   4096 Jun 29 01:57 params
-rw-r--r-- 1 root root   6165 Jun 29 01:57 predict.py
-rwxr-xr-x 1 root

In [2]:
# Install required packages
!pip install torch==2.6.0
!pip install catboost==1.2.8
!pip install transformers==4.46.3
!pip install peft==0.14.0
!pip install accelerate==1.4.0
!pip install datasets==3.3.2
!pip install evaluate==0.4.3
!pip install einops==0.8.1
!pip install safetensors==0.5.2
!pip install biopython==1.83
!pip install tqdm

# Standard packages (usually pre-installed in Colab)
!pip install pandas numpy scikit-learn matplotlib seaborn

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch==2.6.0)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch==2.6.0)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch==2.6.0)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torc

## Verify Installation

Let's check if everything is properly installed.

In [3]:
import torch
import pandas as pd
import numpy as np
from catboost import CatBoostClassifier
import sys
import os

print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Current directory: {os.getcwd()}")

# Check if model files exist
print("\nModel files:")
print(f"CatBoost model exists: {os.path.exists('models/final_catboost_model.cbm')}")
print(f"Calibration model exists: {os.path.exists('models/isotonic_calibration_model.pkl')}")
print(f"LoRA parameters exist: {os.path.exists('params/lora_param.pt')}")

Python version: 3.11.13 (main, Jun  4 2025, 08:57:29) [GCC 11.4.0]
PyTorch version: 2.6.0+cu124
CUDA available: True
Current directory: /content/ubicon

Model files:
CatBoost model exists: True
Calibration model exists: True
LoRA parameters exist: True
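
The checks above confirm that the model files are present, but not that the pinned package versions were actually installed. A minimal sketch of such a check follows, using only the standard library. The `PINNED` dictionary and the `check_versions` helper are illustrative names, not part of Ubicon; the pins are copied from the install cell.

```python
from importlib import metadata

# Pins copied from the install cell; extend with the remaining packages as needed.
PINNED = {
    "torch": "2.6.0",
    "catboost": "1.2.8",
    "transformers": "4.46.3",
    "peft": "0.14.0",
}

def check_versions(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every mismatched or missing package."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore local build tags such as "+cu124" when comparing.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems

# Hypothetical usage: report anything that differs from the pins.
for name, (expected, found) in check_versions(PINNED).items():
    print(f"{name}: expected {expected}, found {found}")
```

An empty result means every listed package matches its pin. The `get_version` parameter exists only so the helper can be exercised without touching the real environment.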


## Quick Demo: Predict E3-Substrate Interactions

Let's run a quick prediction for a single pair from the provided sample data: the E3 ligase VHL (P40337) and its substrate HIF1a (Q16665).

In [4]:
# Run a quick prediction using the command line interface
!python predict.py --e3_id P40337 --substrate_id Q16665

Loading embeddings...
Loading structure data...
Loading location data...
Loading model from models/final_catboost_model.cbm
Loading the model... models/final_catboost_model.cbm
Calculating scores for E3: P40337 and Substrate: Q16665...
Processing batch: 0-1/1
Preparing features: 100% 1/1 [00:00<00:00, 70.05it/s]
Applying calibration model...
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations

===== Prediction Result =====
E3 ligase: P40337
Substrate: Q16665
Ubicon score: 0.878049
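
To use the command-line result programmatically, the score can be pulled out of the printed report. The sketch below assumes the report keeps the `Ubicon score: <number>` line shown above; `parse_ubicon_score` is a hypothetical helper, not something `predict.py` provides.

```python
import re

# Matches the "Ubicon score: 0.878049" line of the printed report.
SCORE_PATTERN = re.compile(r"^Ubicon score:\s*([0-9]*\.?[0-9]+)\s*$", re.MULTILINE)

def parse_ubicon_score(stdout_text):
    """Return the score printed by predict.py as a float, or None if absent."""
    match = SCORE_PATTERN.search(stdout_text)
    return float(match.group(1)) if match else None

sample = """===== Prediction Result =====
E3 ligase: P40337
Substrate: Q16665
Ubicon score: 0.878049
"""
print(parse_ubicon_score(sample))  # 0.878049
```

In a notebook, the text could come from `subprocess.run([sys.executable, "predict.py", "--e3_id", "P40337", "--substrate_id", "Q16665"], capture_output=True, text=True).stdout` instead of the hard-coded sample.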



## Batch Predictions: Multiple E3-Substrate Pairs

Now let's predict interactions for multiple known E3-substrate pairs.

In [10]:
# Add src to path for imports
sys.path.append("src")
from score_utils import load_model, process_chunk
import json

In [11]:
# Load pre-computed embeddings and features
print("Loading feature embeddings...")
E3_embeddings = torch.load('examples/E3_feature_embedding.pt')
Sub_embeddings = torch.load('examples/Sub_feature_embedding.pt')
combined_embeddings = {**E3_embeddings, **Sub_embeddings}

print("Loading location data...")
E3_location = pd.read_csv('examples/E3_location_embedding.csv', index_col=0)
Sub_location = pd.read_csv('examples/Sub_location_embedding.csv', index_col=0)
combined_location = pd.concat([E3_location, Sub_location])

print("Loading structure data...")
with open('examples/E3_structure_embed.json', 'r') as f:
    E3_structure = json.load(f)
with open('examples/Sub_structure_embed.json', 'r') as f:
    Sub_structure = json.load(f)
combined_structure = {**E3_structure, **Sub_structure}

print("\nAvailable proteins:")
print(f"E3 ligases: {list(E3_embeddings.keys())}")
print(f"Substrates: {list(Sub_embeddings.keys())}")

Loading feature embeddings...
Loading location data...
Loading structure data...

Available proteins:
E3 ligases: ['Q9UKA1', 'P40337', 'P78317', 'Q9UKB1']
Substrates: ['P48200', 'Q16665', 'Q9NX09', 'P04637']
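
Because scoring needs features for both proteins of a pair, it can help to confirm that every ID is covered before running the model. The following is a minimal sketch with toy data; `missing_ids` is a hypothetical helper, and it assumes that the embedding, structure, and location tables are all keyed by UniProt ID, which this notebook does not state explicitly.

```python
def missing_ids(pairs, *feature_tables):
    """Return the sorted IDs in `pairs` that are absent from any feature table.

    `pairs` is an iterable of (e3_id, substrate_id) tuples. Each feature table
    only needs to support the `in` operator (dict, set, pandas Index, ...).
    """
    missing = set()
    for e3_id, substrate_id in pairs:
        for protein_id in (e3_id, substrate_id):
            if any(protein_id not in table for table in feature_tables):
                missing.add(protein_id)
    return sorted(missing)

# Toy stand-ins for the loaded feature tables.
embeddings = {"P40337": [0.1], "Q16665": [0.2]}
structure = {"P40337": [1.0], "Q16665": [2.0]}
pairs = [("P40337", "Q16665"), ("P40337", "P04637")]
print(missing_ids(pairs, embeddings, structure))  # ['P04637']
```

With the real objects, the call would look like `missing_ids(pairs, combined_embeddings, combined_structure, combined_location.index)`. The `.index` matters because `in` on a DataFrame tests column labels rather than row labels.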


In [12]:
# Create dataframe with known E3-substrate pairs
pairs_data = [
    {"e3_uniprot_id": "Q9UKA1", "substrate_uniprot_id": "P48200", "e3_name": "FBXL5", "substrate_name": "IRP2"},
    {"e3_uniprot_id": "P40337", "substrate_uniprot_id": "Q16665", "e3_name": "VHL", "substrate_name": "HIF1a"},
    {"e3_uniprot_id": "P78317", "substrate_uniprot_id": "Q9NX09", "e3_name": "RNF4", "substrate_name": "DDIT4"},
    {"e3_uniprot_id": "Q9UKB1", "substrate_uniprot_id": "P04637", "e3_name": "βTrCP2", "substrate_name": "p53"}
]

pairs_df = pd.DataFrame(pairs_data)
print("E3-Substrate pairs to predict:")
print(pairs_df[['e3_name', 'substrate_name', 'e3_uniprot_id', 'substrate_uniprot_id']])

E3-Substrate pairs to predict:
  e3_name substrate_name e3_uniprot_id substrate_uniprot_id
0   FBXL5           IRP2        Q9UKA1               P48200
1     VHL          HIF1a        P40337               Q16665
2    RNF4          DDIT4        P78317               Q9NX09
3  βTrCP2            p53        Q9UKB1               P04637
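
The same column layout can describe pairs other than the four listed above, for example every E3 ligase against every substrate in the sample data. This is a sketch only: `all_pairs` is a hypothetical helper, and whether `process_chunk` requires columns beyond the two ID columns is not confirmed here.

```python
from itertools import product

def all_pairs(e3_ids, substrate_ids):
    """Build one record per (E3, substrate) combination, using the notebook's column names."""
    return [
        {"e3_uniprot_id": e3, "substrate_uniprot_id": sub}
        for e3, sub in product(e3_ids, substrate_ids)
    ]

e3_ids = ["Q9UKA1", "P40337", "P78317", "Q9UKB1"]
substrate_ids = ["P48200", "Q16665", "Q9NX09", "P04637"]
records = all_pairs(e3_ids, substrate_ids)
print(len(records))  # 16
print(records[0])    # {'e3_uniprot_id': 'Q9UKA1', 'substrate_uniprot_id': 'P48200'}
```

The resulting list can be passed to `pd.DataFrame(records)` in the same way as `pairs_data`.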


In [13]:
# Load model and run predictions
print("Loading Ubicon model...")
model = load_model("models/final_catboost_model.cbm")

print("\nCalculating interaction scores...")
results_df = process_chunk(
    pairs_df,
    model,
    combined_embeddings,
    combined_location,
    combined_structure
)

print("\nRaw prediction results:")
display(results_df[['e3_name', 'substrate_name', 'substrate_prediction_score']])

Loading Ubicon model...
Loading the model... models/final_catboost_model.cbm

Calculating interaction scores...
Processing batch: 0-4/4
Preparing features: 100%|██████████| 4/4 [00:00<00:00, 777.26it/s]

Raw prediction results:

  e3_name substrate_name  substrate_prediction_score
0   FBXL5           IRP2                    0.991312
1     VHL          HIF1a                    0.989225
2    RNF4          DDIT4                    0.997815
3  βTrCP2            p53                    0.978066


In [14]:
# Apply calibration to get final Ubicon scores
import joblib

print("Applying score calibration...")
raw_scores = results_df['substrate_prediction_score'].to_numpy()
try:
    # Load the calibration model and map raw scores to calibrated scores
    calibration_model = joblib.load("models/isotonic_calibration_model.pkl")
    results_df['ubicon_score'] = calibration_model.predict(raw_scores)
except Exception as e:
    # Fall back to the raw scores so the final table can still be shown
    print(f"Calibration failed: {e}")
    print("Using raw scores instead.")
    results_df['ubicon_score'] = raw_scores

print("\n Final Ubicon Scores:")
final_results = results_df[['e3_name', 'substrate_name', 'e3_uniprot_id', 'substrate_uniprot_id', 'ubicon_score']].copy()
final_results['ubicon_score'] = final_results['ubicon_score'].round(6)
display(final_results)

Applying score calibration...

 Final Ubicon Scores:


https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


  e3_name substrate_name e3_uniprot_id substrate_uniprot_id  ubicon_score
0   FBXL5           IRP2        Q9UKA1               P48200      0.878049
1     VHL          HIF1a        P40337               Q16665      0.878049
2    RNF4          DDIT4        P78317               Q9NX09      0.923077
3  βTrCP2            p53        Q9UKB1               P04637      0.821705
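
FBXL5-IRP2 and VHL-HIF1a receive the same calibrated score (0.878049) although their raw scores differ (0.991312 and 0.989225). This is the expected behaviour if the calibrator is an isotonic regression, as the file name `isotonic_calibration_model.pkl` suggests: an isotonic fit contains flat segments, so nearby raw scores can map to one value. The sketch below illustrates the idea with a small pool-adjacent-violators implementation on made-up data. It is not Ubicon's calibrator, and it uses a pure step function where scikit-learn interpolates between segments.

```python
def pav_fit(xs, ys):
    """Pool-adjacent-violators: fit a non-decreasing step function to (xs, ys).

    Returns a list of (x_min, x_max, value) blocks sorted by x.
    """
    blocks = []  # each block: [x_min, x_max, sum_of_y, count]
    for x, y in sorted(zip(xs, ys)):
        blocks.append([x, x, float(y), 1])
        # Pool backwards while the previous block's mean is not below the last one.
        while len(blocks) > 1 and (
            blocks[-2][2] / blocks[-2][3] >= blocks[-1][2] / blocks[-1][3]
        ):
            _, x_max, s_last, n_last = blocks.pop()
            blocks[-1][1] = x_max
            blocks[-1][2] += s_last
            blocks[-1][3] += n_last
    return [(lo, hi, s / n) for lo, hi, s, n in blocks]

def pav_predict(blocks, x):
    """Return the value of the last block that starts at or below x."""
    value = blocks[0][2]
    for lo, _hi, block_value in blocks:
        if x >= lo:
            value = block_value
    return value

# Made-up raw scores and binary labels, for illustration only.
raw = [0.2, 0.4, 0.6, 0.8, 0.97, 0.98, 0.99, 0.995]
labels = [0, 0, 1, 0, 1, 1, 0, 1]
blocks = pav_fit(raw, labels)

# Two different raw scores inside the same flat segment get the same output.
print(round(pav_predict(blocks, 0.975), 4), round(pav_predict(blocks, 0.985), 4))
# 0.6667 0.6667
```

Ties in the calibrated column therefore do not indicate an error. Where ranking within a tie matters, the raw `substrate_prediction_score` still distinguishes the pairs.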
