**PepMLM: Target Sequence-Conditioned Generation of Peptide Binders via Masked Language Modeling**

Target proteins that lack accessible binding pockets and conformational stability pose increasing challenges for drug development. Induced proximity strategies, such as PROTACs and molecular glues, have thus gained attention as pharmacological alternatives, but they still require small-molecule docking at binding pockets for targeted protein degradation (TPD). The computational design of protein-based binders presents unique opportunities to access “undruggable” targets, but has often relied on stable 3D structures or structure predictions for effective binder generation. Recently, we leveraged the expressive latent spaces of protein language models (pLMs) to prioritize peptide binders from sequence alone, and then fused these peptides to E3 ubiquitin ligase domains, creating a CRISPR-analogous TPD system for target proteins. However, those methods rely on training discriminator models to rank heuristically or unconditionally derived “guide” peptides by their target-binding capability. In this work, we introduce PepMLM, a purely target sequence-conditioned de novo generator of linear peptide binders. By employing a novel masking strategy that positions cognate peptide sequences at the terminus of target protein sequences, PepMLM tasks the state-of-the-art ESM-2 pLM with fully reconstructing the binder region, achieving low perplexities that match or improve upon those of previously validated peptide-protein sequence pairs. After successful in silico benchmarking with AlphaFold-Multimer, we experimentally verify PepMLM’s efficacy by fusing model-derived peptides to E3 ubiquitin ligase domains, demonstrating endogenous degradation of target substrates in cellular models. In total, PepMLM enables the generative design of candidate binders to any target protein, without requiring target structure, empowering downstream programmable proteome editing applications.
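
As a rough illustration of the masking strategy described above, the sketch below builds the model input by appending mask tokens to the end of a target sequence. The `<mask>` literal matches the token used later in this notebook; the helper name `build_masked_input` and the toy target sequence are assumptions made only for this example.

```python
MASK_TOKEN = "<mask>"  # same literal the generation cell in this notebook uses

def build_masked_input(target_seq: str, peptide_length: int) -> str:
    """Append `peptide_length` mask tokens to the end of the target sequence."""
    if peptide_length < 1:
        raise ValueError("peptide_length must be positive")
    return target_seq + MASK_TOKEN * peptide_length

# Toy target sequence, chosen only for illustration
masked = build_masked_input("MKTAYIAKQR", peptide_length=5)
print(masked)
```

The model is then asked to fill in every masked position, so the binder is conditioned only on the target sequence that precedes it.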


In [None]:
#@title Install Packages
!pip install Bio
!pip install transformers

from google.colab import files
import pandas as pd
from Bio import SeqIO
import io

Collecting Bio
  Downloading bio-1.6.0-py3-none-any.whl (279 kB)
Collecting biopython>=1.80 (from Bio)
  Downloading biopython-1.81-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Collecting mygene (from Bio)
  Downloading mygene-3.2.2-py2.py3-none-any.whl (5.4 kB)
Collecting gprofiler-official (from Bio)
  Downloading gprofiler_official-1.0.0-py3-none-any.whl (9.3 kB)
Collecting biothings-client>=0.2.6 (from mygene->Bio)
  Downloading biothings_client-0.3.1-py2.py3-none-any.whl (29 kB)
Installing collected packages: biopython, gprofiler-official, biothings-client, mygene, Bio
Successfully installed Bio-1.6.0 biopython-1.81 biothings-client-0.3.1 gprofiler-official-1.0.0 mygene-3.2.2


In [None]:
#@title Inputs and Parameters

#@markdown <font size = 4> How many sequences do you have?</font>

#@markdown <font size = 3> If you only have one sequence, check the box below and provide your sequence.</font>
single_sequence = True #@param {type:"boolean"}
protein_seq = "YAPSALVLTVGKGVSATTAAPERAVTLTCAPGPSGTHPAAGSACADLAAVGGDLNALTRGEDVMCPMVYDPVLLTVDGVWQGKRVSYERVFSNECEMNAHGSSVFAF" #@param {type:"string"}
#@markdown

#@markdown If you have multiple sequences, leave the <b><font color='darkblue'>`single_sequence`</font></b> box unchecked and upload a file containing your sequences.

#@markdown Format of your <b><font color='darkblue'>`.csv`</font></b>: put all your target sequences in **one column called "sequence"**.

#@markdown Watch for a prompt to upload your <b><font color='darkblue'>`.csv`</font></b> file!
#@markdown

jobname = "test" #@param {type: "string"}

# In single-sequence mode, protein_seq already holds the target string set above
if not single_sequence:
  uploaded = files.upload()
  key = list(uploaded.keys())[0]
  df = pd.read_csv(io.BytesIO(uploaded[key]), header=0)
  # Validate the layout before touching the column, so a bad file fails clearly
  if list(df.columns) != ['sequence']:
    raise ValueError('Improperly formatted file: expected exactly one column named "sequence"')
  # Strip stray whitespace and pass the targets on as a list
  protein_seq = df['sequence'].str.strip().tolist()


###Sliders
import ipywidgets as widgets
from ipywidgets import Layout
from IPython.display import display

style = {'description_width': 'initial'}

# Initial value for num_binders
num_binders = 1

# Initial values for top_k and peptide_length
top_k = 3
peptide_length = 15

# Define the function that will save the selected value from the dropdown to num_binders
def on_change(change):
    global num_binders
    if change['type'] == 'change' and change['name'] == 'value':
        num_binders = change['new']
        print(f"Updated num_binders: {num_binders}")

def update_values(change):
    global top_k, peptide_length
    top_k = top_k_slider.value
    peptide_length = peptide_length_slider.value
    print(f"Updated Top K Value: {top_k}")
    print(f"Updated Peptide Length: {peptide_length}")

# Create sliders for Top K Value and Peptide Length
peptide_length_slider = widgets.IntSlider(value=15, min=3, max=50, step=1, description='Peptide Length:', style=style)
top_k_slider = widgets.IntSlider(value=3, min=1, max=10, step=1, description='Top K Value:', style=style)

# Display the sliders
display(peptide_length_slider)
print("Default value is 15")
display(top_k_slider)
print("Default value is 3")

# Attach the update function to the sliders
peptide_length_slider.observe(update_values, names='value')
top_k_slider.observe(update_values, names='value')


# Create a dropdown with options
dropdown = widgets.Dropdown(
    options=[1, 2, 4, 8, 16, 32],
    value=1,
    description='Number of Binders',
    disabled=False,
    style=style
)


# Display the dropdown
display(dropdown)

# Attach the callback function to the dropdown
dropdown.observe(on_change)

IntSlider(value=15, description='Peptide Length:', max=50, min=3, style=SliderStyle(description_width='initial…

Default value is 15


IntSlider(value=3, description='Top K Value:', max=10, min=1, style=SliderStyle(description_width='initial'))

Default value is 3


Dropdown(description='Number of Binders', options=(1, 2, 4, 8, 16, 32), style=DescriptionStyle(description_wid…

Updated num_binders: 16
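
The multi-sequence upload in the cell above expects a `.csv` with exactly one column named `sequence`. The sketch below shows that layout and a strict header check using only the standard library; the notebook itself reads the file with pandas, and the helper name `read_target_sequences` and the two toy sequences are assumptions for illustration.

```python
import csv
import io

def read_target_sequences(csv_text: str) -> list[str]:
    """Parse CSV text that must contain exactly one column named 'sequence'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["sequence"]:
        raise ValueError('expected exactly one column named "sequence"')
    # Strip surrounding whitespace and drop rows that end up empty
    return [row["sequence"].strip() for row in reader if row["sequence"].strip()]

# Toy file contents: a header line followed by one target per line
example_csv = "sequence\nMKTAYIAKQR\nGSHMLEDPVK \n"
targets = read_target_sequences(example_csv)
print(targets)
```

A file with any other header, or with extra columns, is rejected before any sequence is read.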


In [None]:
#@title Load Model
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
from torch.distributions.categorical import Categorical
import numpy as np
import pandas as pd

# Load the model and tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("ChatterjeeLab/PepMLM-650M")
model = AutoModelForMaskedLM.from_pretrained("ChatterjeeLab/PepMLM-650M").to(device)

def compute_pseudo_perplexity(model, tokenizer, protein_seq, binder_seq):
    '''
    For alternative computation of PPL (in batch/matrix format), please check our github repo:
    https://github.com/programmablebio/pepmlm/blob/main/scripts/generation.py
    '''
    sequence = protein_seq + binder_seq
    tensor_input = tokenizer.encode(sequence, return_tensors='pt').to(model.device)
    total_loss = 0

    # Loop through each token in the binder sequence
    for i in range(-len(binder_seq)-1, -1):
        # Create a copy of the original tensor
        masked_input = tensor_input.clone()

        # Mask one token at a time
        masked_input[0, i] = tokenizer.mask_token_id
        # Create labels
        labels = torch.full(tensor_input.shape, -100).to(model.device)
        labels[0, i] = tensor_input[0, i]

        # Get model prediction and loss
        with torch.no_grad():
            outputs = model(masked_input, labels=labels)
            total_loss += outputs.loss.item()

    # Calculate the average loss
    avg_loss = total_loss / len(binder_seq)

    # Calculate pseudo perplexity
    pseudo_perplexity = np.exp(avg_loss)
    return pseudo_perplexity


def generate_peptide_for_single_sequence(protein_seq, peptide_length = 15, top_k = 3, num_binders = 4):

    peptide_length = int(peptide_length)
    top_k = int(top_k)
    num_binders = int(num_binders)

    binders_with_ppl = []

    for _ in range(num_binders):
        # Generate binder
        masked_peptide = '<mask>' * peptide_length
        input_sequence = protein_seq + masked_peptide
        inputs = tokenizer(input_sequence, return_tensors="pt").to(model.device)

        with torch.no_grad():
            logits = model(**inputs).logits
        mask_token_indices = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
        logits_at_masks = logits[0, mask_token_indices]

        # Apply top-k sampling
        top_k_logits, top_k_indices = logits_at_masks.topk(top_k, dim=-1)
        probabilities = torch.nn.functional.softmax(top_k_logits, dim=-1)
        predicted_indices = Categorical(probabilities).sample()
        predicted_token_ids = top_k_indices.gather(-1, predicted_indices.unsqueeze(-1)).squeeze(-1)

        generated_binder = tokenizer.decode(predicted_token_ids, skip_special_tokens=True).replace(' ', '')

        # Compute PPL for the generated binder
        ppl_value = compute_pseudo_perplexity(model, tokenizer, protein_seq, generated_binder)

        # Add the generated binder and its PPL to the results list
        binders_with_ppl.append([generated_binder, ppl_value])

    return binders_with_ppl

def generate_peptide(input_seqs, peptide_length=15, top_k=3, num_binders=4):
    if isinstance(input_seqs, str):  # Single sequence
        binders = generate_peptide_for_single_sequence(input_seqs, peptide_length, top_k, num_binders)
        return pd.DataFrame(binders, columns=['Binder', 'Pseudo Perplexity'])

    elif isinstance(input_seqs, list):  # List of sequences
        results = []
        for seq in input_seqs:
            binders = generate_peptide_for_single_sequence(seq, peptide_length, top_k, num_binders)
            for binder, ppl in binders:
                results.append([seq, binder, ppl])
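        return pd.DataFrame(results, columns=['Input Sequence', 'Binder', 'Pseudo Perplexity'])

    # Any other input type would otherwise silently return None
    raise TypeError("input_seqs must be a str or a list of str")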

tokenizer_config.json:   0%|          | 0.00/135 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/93.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/775 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.61G [00:00<?, ?B/s]
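
The top-k sampling step in `generate_peptide_for_single_sequence` keeps the `top_k` highest logits at each masked position, applies a softmax over just those, and samples one token. The sketch below reproduces that idea for a single position with the standard library instead of `torch`; the function name `sample_top_k` and the made-up logits are assumptions for illustration only.

```python
import math
import random

def sample_top_k(logits: list[float], k: int, rng: random.Random) -> int:
    """Return the index of one token sampled from the k highest-scoring logits."""
    # Indices of the k largest logits, best first
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to those k logits (shifted by the max for stability);
    # random.choices normalizes the weights itself
    max_logit = logits[top[0]]
    weights = [math.exp(logits[i] - max_logit) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # made-up scores for five tokens
draws = [sample_top_k(toy_logits, k=3, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

With `k=1` the procedure reduces to greedy decoding; larger values trade confidence for diversity among the generated binders.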

In [None]:
#@title Generate Peptides
peptide_df = generate_peptide(protein_seq, peptide_length, top_k, num_binders)
peptide_df

,Binder,Pseudo Perplexity
0,ARVYDYLAQQAACAX,11.075824
1,GCSYSTKSQQAACCK,20.955079
2,ACVADTLLRQQALLG,14.841255
3,CCSPATYLQQLACAK,41.743638
4,ARSYDYLSQQQLCCK,24.325061
5,GCVPDTKLQLQQLAK,10.2399
6,ACSYATYAQQAQLCK,24.198378
7,GCTYSDYSELAAACK,17.171036
8,GCTYDDLLRQLLACK,20.263574
9,ARSYSTYLRLQLALK,13.905217
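
The `Pseudo Perplexity` column is computed in `compute_pseudo_perplexity` as the exponential of the average masked-token loss over the binder positions. The sketch below shows the same arithmetic on a list of per-residue log-probabilities; the function name `pseudo_perplexity` and the uniform toy probabilities are assumptions, not model outputs.

```python
import math

def pseudo_perplexity(token_log_probs: list[float]) -> float:
    """exp of the mean negative log-likelihood over the masked positions."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# If every binder residue were predicted with probability 1/4,
# the pseudo-perplexity would be 4 regardless of binder length
uniform = [math.log(0.25)] * 15
print(round(pseudo_perplexity(uniform), 6))
```

Lower values mean the model assigns higher probability to the binder residues given the target sequence.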


In [None]:
#@title Download Results
from google.colab import files

# Assuming peptide_df is already defined and filled with data
peptide_df.to_csv(f"{jobname}.csv", index=False)  # Save the dataframe to a csv file

# Use colab's files.download method to trigger a file download in the browser
files.download(f"{jobname}.csv")
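
Once the results are saved, one possible next step is to rank the binders by pseudo-perplexity, lowest first. The sketch below does this with the standard library on CSV text in the same two-column layout the download cell writes for a single target; the helper name `rank_binders` is an assumption, and the three rows are copied from the results table above.

```python
import csv
import io

def rank_binders(csv_text: str) -> list[tuple[str, float]]:
    """Return (binder, pseudo-perplexity) pairs sorted from lowest to highest PPL."""
    rows = csv.DictReader(io.StringIO(csv_text))
    pairs = [(r["Binder"], float(r["Pseudo Perplexity"])) for r in rows]
    return sorted(pairs, key=lambda pair: pair[1])

# Three rows copied from the results table above
saved = (
    "Binder,Pseudo Perplexity\n"
    "ARVYDYLAQQAACAX,11.075824\n"
    "CCSPATYLQQLACAK,41.743638\n"
    "GCVPDTKLQLQQLAK,10.2399\n"
)
ranked = rank_binders(saved)
print(ranked[0])
```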

