<a href="https://colab.research.google.com/github/Angelique28/Designing-Protein-Binding-Peptides---CECAM-Workshop/blob/main/notebooks/4_Peptide_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workshop Notebook 4: Generating Candidate Peptides (25 min)

In this notebook, we will:

1. Convert backbones into sequences with ProteinMPNN and ESM-IF1.
2. Filter and inspect candidate peptides for downstream evaluation.


In [1]:
%%time
#@title **Set up our environment (~3 mins)**
#@markdown Please execute this cell by pressing the *Play* button on
#@markdown the left.

#@markdown > ⚠️ Seeing `No module named numpy.rec` or a `ValueError` about a changed **dtype header**? Click *Runtime > Restart session*, then run this cell again.

import os, time, signal
import sys, random, string, re
from pathlib import Path
from google.colab import drive


### START ESM-IF1 Install
# Install biotite
!pip install -q biotite==0.41.1


# Install the correct version of Pytorch Geometric.
import torch
if not torch.cuda.is_available():
    print("⚠️ Warning: GPU runtime not detected. Please go to Runtime > Change runtime type > select GPU.")
else:
    print("✅ GPU detected:", torch.cuda.get_device_name(0))

def format_pytorch_version(version):
  return version.split('+')[0]

TORCH_version = torch.__version__
TORCH = format_pytorch_version(TORCH_version)

def format_cuda_version(version):
  return 'cu' + version.replace('.', '')

CUDA_version = torch.version.cuda
CUDA = format_cuda_version(CUDA_version)

!pip install biopython
!pip install -q torch-scatter -f https://data.pyg.org/whl/torch-{TORCH}+{CUDA}.html
!pip install -q torch-sparse -f https://data.pyg.org/whl/torch-{TORCH}+{CUDA}.html
!pip install -q torch-cluster -f https://data.pyg.org/whl/torch-{TORCH}+{CUDA}.html
!pip install -q torch-spline-conv -f https://data.pyg.org/whl/torch-{TORCH}+{CUDA}.html
!pip install -q torch-geometric

# Install esm
!pip install -q git+https://github.com/facebookresearch/esm.git
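# Patch esm for newer biotite releases, where filter_backbone was renamed.
# The python3.* glob avoids hard-coding the Colab Python version.
!sed -i 's|from biotite.structure import filter_backbone|from biotite.structure import filter_peptide_backbone as filter_backbone|' /usr/local/lib/python3.*/dist-packages/esm/inverse_folding/util.py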

### END ESM-IF1 Install


if not os.path.isdir("colabdesign"):
  print("installing ColabDesign...")
  os.system("pip -q install git+https://github.com/sokrypton/ColabDesign.git")
  os.system("ln -s /usr/local/lib/python3.*/dist-packages/colabdesign colabdesign")


from google.colab import files
import json
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML
import ipywidgets as widgets
import py3Dmol
import hashlib
import uuid

from colabdesign.shared.protein import pdb_to_string
from colabdesign.shared.plot import plot_pseudo_3D

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

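from typing import Iterable, Tuple, List

def write_sequences(sequences: list[str],
                    model_id: str,
                    path: Path
                    ) -> int:
  """Write sequences to a new, uniquely named FASTA file in `path`.

  Each record ID is `model_id` plus the first 7 hex digits of the sequence's
  SHA-1 hash, so identical sequences always get the same ID.
  Returns the number of records written.
  """
  # Create the output folder if it does not exist yet.
  path.mkdir(parents=True, exist_ok=True)

  records = []
  for seq in sequences:
    ident = f'{model_id}_{hashlib.sha1(seq.encode()).hexdigest()[:7]}'
    rec = SeqRecord(Seq(str(seq)), id=ident, description="")
    records.append(rec)

  # A random UUID filename keeps repeated runs from overwriting each other.
  out_path = path / f'{uuid.uuid4()}.fasta'
  return SeqIO.write(records, out_path, "fasta")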

✅ GPU detected: Tesla T4
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
CPU times: user 22.8 s, sys: 3.47 s, total: 26.3 s
Wall time: 59.8 s
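
The `write_sequences` helper in the setup cell names each FASTA record by combining a model prefix with a truncated SHA-1 hash of the sequence. Here is a minimal standalone sketch of that naming scheme. `sequence_id` is a hypothetical helper introduced only for illustration, and the example sequence is taken from the ProteinMPNN output recorded further down.

```python
import hashlib

def sequence_id(seq: str, model_id: str, n_hex: int = 7) -> str:
    """Hypothetical helper mirroring the ID scheme used in write_sequences."""
    digest = hashlib.sha1(seq.encode()).hexdigest()
    return f"{model_id}_{digest[:n_hex]}"

# The same sequence always maps to the same ID for a given model prefix.
id_a = sequence_id("REEMERRGSQ", "rfp")
id_b = sequence_id("REEMERRGSQ", "rfp")
# A different prefix changes only the part before the underscore.
id_c = sequence_id("REEMERRGSQ", "rfe")

print(id_a)
print("deterministic:", id_a == id_b)
print("same hash part:", id_a.split("_")[1] == id_c.split("_")[1])
```

Because the ID depends only on the sequence and the prefix, a sequence sampled twice gets the same ID both times, which makes duplicates easy to spot later. Seven hex digits give 16^7 (about 268 million) possible values, which should be ample for the few dozen sequences generated here.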


In [2]:
#@title **Set up our Paths and mount a Google Drive folder**

#@markdown We set a project ID to keep separate runs apart, and a step ID to keep the outputs of each step apart.
PROJECT_ID = "MDM2" #@param {type:"string"}
STEP_ID = "4"

#@markdown We will use Google Drive mounts for persistence between multiple notebooks in this tutorial.

#@markdown Log in with your Google account and give permissions to access the drive.
WORKSHOP_DIRECTORY = Path('/content/drive/MyDrive/cecam_workshop_2025_generative')
drive.mount(str(WORKSHOP_DIRECTORY.parent.parent))
STEP_PATH = WORKSHOP_DIRECTORY / 'projects' / PROJECT_ID / STEP_ID
STEP_PATH.mkdir(exist_ok = True, parents = True)

Mounted at /content/drive


## Convert Backbones into Peptide Sequences
We use two models:
1. ProteinMPNN
2. ESM-IF1

Using two different models gives us more diverse candidate sequences for each backbone.

Each model produces `num_seqs >= 1` sequences for a given backbone.
> After running both cells below, each backbone has `num_seqs` sequences from each of the two models. We can then inspect and filter them before co-folding with MDM2.
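
To see how many sequences to expect, here is a minimal sketch of the batching arithmetic used in the ProteinMPNN cell below (at most 8 sequences per batch, `num_seqs // batch_size` batches). `mpnn_batch_plan` is a hypothetical helper, and `n_backbones = 4` is an assumption chosen to match the 16 sequences in the recorded output for `num_seqs = 4`.

```python
def mpnn_batch_plan(num_seqs: int, max_batch: int = 8) -> tuple[int, int]:
    """Hypothetical helper: return (number of batches, batch size)."""
    batch_size = min(num_seqs, max_batch)
    num_batches = num_seqs // batch_size  # integer division drops any remainder
    return num_batches, batch_size

n_backbones = 4  # assumption for illustration only

for num_seqs in (1, 4, 8, 16, 64):
    num_batches, batch_size = mpnn_batch_plan(num_seqs)
    per_backbone = num_batches * batch_size
    print(f"num_seqs={num_seqs:>2}: {num_batches} x {batch_size} per backbone, "
          f"{per_backbone * n_backbones} in total")
```

All values offered in the `num_seqs` menu are powers of two, so the division is exact. A value above 8 that is not a multiple of 8 (for example 12) would be rounded down to the nearest multiple of 8 by this scheme.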


In [3]:
#@title run **ProteinMPNN** to generate sequences
#@markdown ### ProteinMPNN Settings

#@markdown Define how many sequences you want to generate for each of the backbones
num_seqs = 4 #@param ["1", "2", "4", "8", "16", "32", "64"] {type:"raw"}
#@markdown ---
#@markdown Chain ID of the peptide in the backbone PDB files
peptide_chain = 'B' #@param {type:"string"}
#@markdown ---

#@markdown A higher sampling temperature will result in more diverse sequences
mpnn_sampling_temp = 0.5 #@param ["0.0001", "0.1", "0.15", "0.2", "0.25", "0.3", "0.5", "0.75", "1.0"] {type:"raw"}

#@markdown ---
#@markdown Amino acids to exclude from the sampled sequences (leave empty to allow all)
rm_aa = "C" #@param {type:"string"}
#@markdown ---
use_solubleMPNN = True #@param {type:"boolean"}

from colabdesign.mpnn import mk_mpnn_model

if rm_aa == "":
  rm_aa = None

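# Sample in batches of at most 8 sequences. The menu values above are all
# powers of two, so num_seqs is always a multiple of batch_size.
batch_size = min(num_seqs, 8)

print("running proteinMPNN...")
sampling_temp = mpnn_sampling_temp
mpnn_model = mk_mpnn_model(weights="soluble" if use_solubleMPNN else "original")
outs = []
# Backbone PDB files written by step 3 of this project.
full_path = STEP_PATH.parent / '3' / 'rfdiffusion_outputs'
# Sort the files so that the sequence order is reproducible between runs.
for pdb_filename in sorted(full_path.glob('*.pdb')):
  # Load the peptide chain of this backbone.
  mpnn_model.prep_inputs(str(pdb_filename),
                         rm_aa = rm_aa,
                         chain = peptide_chain,
                         )
  outs.append(mpnn_model.sample(num=num_seqs//batch_size, batch=batch_size, temperature=sampling_temp))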

seqs = []
for out in outs:
  for sampled_seq in out['seq']:
    seqs.append(sampled_seq)
    print('Sampled sequence:', sampled_seq)

write_sequences(seqs,
                model_id = 'rfp',
                path = STEP_PATH.parent / 'sequences'
                )

running proteinMPNN...
Sampled sequence: REEMERRGSQ
Sampled sequence: EEAARQERAR
Sampled sequence: AEEVARRHAQ
Sampled sequence: QEETQRERAK
Sampled sequence: SSHWLQTLQS
Sampled sequence: TDVTREAEAQ
Sampled sequence: SATKREAAER
Sampled sequence: TEQEERERLS
Sampled sequence: EEQQRLREKE
Sampled sequence: ASQKLEESTR
Sampled sequence: VDEELRKQQQ
Sampled sequence: EEALYDVKTS
Sampled sequence: EEERRKAGLA
Sampled sequence: KEKIEREKQK
Sampled sequence: KEEENQAKVT
Sampled sequence: KLEIIRQLLK


16
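
Both sampling cells expose a temperature setting. The following is a generic sketch of temperature-scaled sampling, not the internal code of ProteinMPNN or ESM-IF1: logits are divided by the temperature before the softmax, so low temperatures concentrate probability on the top-scoring residue and high temperatures flatten the distribution. The logits are made-up numbers, and `temperature_probs` and `entropy` are hypothetical helpers.

```python
import numpy as np

def temperature_probs(logits, temperature):
    """Softmax over logits / temperature (hypothetical helper)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

def entropy(probs):
    """Shannon entropy in nats; higher means a more diverse distribution."""
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

# Made-up scores for four residue types at a single position.
logits = np.array([2.0, 1.0, 0.5, 0.0])

for t in (0.1, 0.5, 1.0):
    p = temperature_probs(logits, t)
    print(f"T={t}: probs={np.round(p, 3)}, entropy={entropy(p):.3f}")
```

The entropy grows with the temperature, which is the sense in which a higher sampling temperature gives more diverse sequences. The trade-off is that sampled residues are, on average, ones the model considers less likely for the backbone.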

In [4]:
#@title run **ESM-IF1** to generate sequences
#@markdown ### ESM-IF1 Settings

#@markdown Define how many sequences you want to generate for each of the backbones

num_seqs_esm = 4 #@param ["1", "2", "4", "8", "16", "32", "64"] {type:"raw"}
#@markdown ---
peptide_chain_esm = 'B' #@param {type:"string"}
#@markdown ---

#@markdown A higher sampling temperature will result in more diverse sequences

esm1_sampling_temp = 1.0 #@param ["0.1", "0.15", "0.2", "0.25", "0.3", "0.5", "0.75", "1.0"] {type:"raw"}

import esm.inverse_folding
import esm
model, alphabet = esm.pretrained.esm_if1_gvp4_t16_142M_UR50()
model = model.eval()

print("running ESM-IF1...")
seqs = []
full_path = STEP_PATH.parent / '3' / 'rfdiffusion_outputs'

for pdb_filename in list(full_path.glob('*.pdb')):
  structure = esm.inverse_folding.util.load_structure(str(pdb_filename), peptide_chain_esm)
  coords, native_seq = esm.inverse_folding.util.extract_coords_from_structure(structure)
  for _ in range(num_seqs_esm):
    sampled_seq = model.sample(coords, temperature=esm1_sampling_temp)
    print('Sampled sequence:', sampled_seq)
    seqs.append(sampled_seq)

write_sequences(seqs,
                model_id = 'rfe',
                path = STEP_PATH.parent / 'sequences'
                )

Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm_if1_gvp4_t16_142M_UR50.pt" to /root/.cache/torch/hub/checkpoints/esm_if1_gvp4_t16_142M_UR50.pt

running ESM-IF1...
Sampled sequence: PDFGAMLAQF
Sampled sequence: MMLIAVEALI
Sampled sequence: PEQQDWNDRR
Sampled sequence: PSEVLSLLEE
Sampled sequence: DLARAVGIAE
Sampled sequence: ALLLAILIVV
Sampled sequence: SYAAAVAIWS
Sampled sequence: QIENNMHQAI
Sampled sequence: LVAFLQARAG
Sampled sequence: ASAQIEKLYD
Sampled sequence: SLNALWQQQQ
Sampled sequence: NIEKVIAAFR
Sampled sequence: AFGAAVSQLG
Sampled sequence: MAVHAVDQLH
Sampled sequence: MLFDVLKQMK
Sampled sequence: AIRALIARLQ


16
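
With sequences from both models in hand, a first inspection and filtering pass could look like the sketch below. It is an illustration only, not part of the workshop pipeline: `summarize` and `filter_candidates` are hypothetical helpers, the residue groupings are a common simplification, and both thresholds are arbitrary assumptions. The net charge counts only K/R and D/E and ignores histidine and the termini.

```python
HYDROPHOBIC = set("AVILMFWY")
POSITIVE = set("KR")
NEGATIVE = set("DE")

def summarize(seq: str) -> dict:
    """Crude per-sequence statistics (hypothetical helper)."""
    n_pos = sum(aa in POSITIVE for aa in seq)
    n_neg = sum(aa in NEGATIVE for aa in seq)
    n_hyd = sum(aa in HYDROPHOBIC for aa in seq)
    return {
        "seq": seq,
        "net_charge": n_pos - n_neg,
        "hydrophobic_frac": n_hyd / len(seq),
    }

def filter_candidates(seqs, max_abs_charge=3, max_hydrophobic_frac=0.7):
    """Drop duplicates, then keep sequences within both (arbitrary) limits."""
    unique = list(dict.fromkeys(seqs))  # removes duplicates, keeps order
    kept = []
    for seq in unique:
        stats = summarize(seq)
        if (abs(stats["net_charge"]) <= max_abs_charge
                and stats["hydrophobic_frac"] <= max_hydrophobic_frac):
            kept.append(stats)
    return kept

# Three sequences from the recorded outputs above, one of them repeated.
candidates = ["REEMERRGSQ", "ALLLAILIVV", "SLNALWQQQQ", "REEMERRGSQ"]
for row in filter_candidates(candidates):
    print(row)
```

In this example the fully hydrophobic `ALLLAILIVV` is removed and the repeated sequence is kept only once. Any such filter is a heuristic, so thresholds should be tuned to the target and checked against the co-folding results from the next notebook.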

### Notebook Summary
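- Designed sequences for each backbone using ProteinMPNN and ESM-IF1.
- Saved the sequences from each model as a FASTA file in the project's `sequences` folder on Google Drive.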

➡️ Next: In Notebook 5, we will co-fold these peptide candidates with MDM2 using Boltz-2 and evaluate binding metrics.
