ESM-2.0 Frozen Feature Extraction Notebook

This Jupyter notebook demonstrates how to extract frozen ESM-2.0 embeddings using the Hugging Face transformers implementation of the ESM-2 model. The extracted embeddings are intended to be used as fixed (non-trainable) features for downstream deep learning models.

In this example, embeddings are generated for the peptide sequences of the external validation dataset (external_data), but the same procedure should be applied to every other dataset (e.g., the training set) so that all features are produced consistently.

The notebook uses the pretrained ESM-2 (facebook/esm2_t33_650M_UR50D) model to encode input sequences. For each sequence, token-level representations are obtained from the final hidden layer, and a mean pooling operation is applied across the sequence length to produce a fixed-length embedding vector.
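The pooling step described above can be illustrated on a toy array. The notebook encodes one sequence at a time, so there is no padding and a plain mean over the sequence axis is enough (that mean also covers the special tokens the tokenizer adds). The sketch below is a NumPy illustration, not the notebook's code: `masked_mean_pool` is a hypothetical helper showing how the same averaging would need a mask if several sequences of different lengths were padded into one batch.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors over the sequence axis, ignoring masked-out positions.

    hidden: (batch, seq_len, hidden_size) token-level representations
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = mask[..., None].astype(hidden.dtype)      # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)             # (batch, hidden_size)
    counts = np.clip(mask.sum(axis=1), 1.0, None)    # (batch, 1), avoids division by zero
    return summed / counts

# Toy input: 2 sequences, 3 positions, hidden size 2 (illustrative values only)
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 2.0], [4.0, 4.0], [9.0, 9.0]],   # last position is padding
])
mask = np.array([[1, 1, 1], [1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)  # (2, 2): one fixed-length vector per sequence
```

With an all-ones mask this reduces to the unmasked mean used in the notebook.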

Before running this notebook, install the required dependencies, including:

- transformers

- torch

- tqdm

- numpy

- pandas
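One way to confirm the environment is ready before running the cells is a small standard-library check. This is only a sketch: `missing_packages` and the `REQUIRED` list are hypothetical names, and the list is assumed from the imports used in the code cells.

```python
from importlib.util import find_spec

# Packages the notebook relies on; numpy and pandas are included
# because the code cells import them.
REQUIRED = ["transformers", "torch", "tqdm", "numpy", "pandas"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```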

The generated embeddings are saved as NumPy (.npy) files and can be directly loaded during model training or evaluation as frozen ESM-2.0 features.
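As a sketch of how the saved files might be consumed later, the hypothetical helper below loads an embedding matrix together with its label vector and checks that they contain the same number of rows. The file names and placeholder arrays are for demonstration only; in practice the paths would point at the `.npy` files written by this notebook.

```python
import os
import tempfile
import numpy as np

def load_frozen_features(feature_path, label_path):
    """Load an embedding matrix and its labels, checking that the rows line up."""
    features = np.load(feature_path)
    labels = np.load(label_path)
    if features.shape[0] != labels.shape[0]:
        raise ValueError(
            f"{features.shape[0]} embeddings but {labels.shape[0]} labels"
        )
    return features, labels

# Demonstration with small placeholder arrays written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    feat_file = os.path.join(tmp, "features_demo.npy")
    label_file = os.path.join(tmp, "labels_demo.npy")
    np.save(feat_file, np.zeros((4, 1280), dtype=np.float32))  # 1280 = ESM-2 650M hidden size
    np.save(label_file, np.array([0, 1, 1, 0]))
    X, y_demo = load_frozen_features(feat_file, label_file)

print(X.shape, y_demo.shape)
```

Because the features are frozen, the loaded array can be passed to the downstream model as ordinary input data with no gradient flowing back into ESM-2.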

In [34]:
import torch
import numpy as np
import pandas as pd
from tqdm import tqdm
from transformers import AutoTokenizer, EsmModel

In [32]:
# MHC II external validation data (447 samples)
data_path = "MHC2_External_dataset.csv"  # replace with your own path

In [21]:
data_ex = pd.read_csv(data_path)

In [22]:
data_ex.columns

Index(['MHC_allele', 'Peptide', 'Immunogenicity', 'rename_MHC_fasta_Name',
       'Pseudo_Sequence', 'Full_sequence'],
      dtype='object')

In [23]:
y = data_ex["Immunogenicity"].values.astype(int)  # labels

In [24]:
# Save the labels (the original file name was missing the dot before "npy")
np.save("Labels_ex_data_447.npy", y)

In [25]:
len(y)

447

In [33]:
# Extract features for the peptide and MHC sequences independently, saving them in two separate files

In [27]:
model_checkpoint =  "facebook/esm2_t33_650M_UR50D" # model checkpoint huggingface

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = EsmModel.from_pretrained(model_checkpoint)
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Load dataset

#full = data_ex["Full_sequence"].tolist() # for MHC seq
peptide = data_ex["Peptide"].tolist() # for peptide seq
# To extract mean embeddings
# ---------------------------------------------------
def get_esm_embedding(sequence):
    inputs = tokenizer(sequence, return_tensors="pt", truncation=True, max_length=1024).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
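        # outputs.last_hidden_state → shape: (1, seq_len, hidden_size)
        # Mean-pool over the sequence length (all token positions, including
        # the special tokens added by the tokenizer) → shape: (hidden_size,)
        embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0).cpu().numpy()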
    return embedding

embeddings = []
for seq in tqdm(peptide, desc="Extracting ESM2 embeddings"):  # iterate over the peptide list defined above ("full" is commented out)
    emb = get_esm_embedding(seq)
    embeddings.append(emb)

# Convert to numpy array
embeddings = np.vstack(embeddings)  # shape: (N, hidden_size)
print("Final embedding shape:", embeddings.shape)

np.save("ESM_2_Peptide_seq_external.npy", embeddings)  # save
print("ESM_2_Peptide_seq_external.npy")

Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t33_650M_UR50D and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Extracting ESM2 embeddings: 100%|█████████████████████████████████████████████████| 447/447 [00:25<00:00, 17.77it/s]

Final embedding shape: (447, 1280)
ESM_2_Peptide_seq_external.npy
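Since the peptide and MHC sequences are embedded separately, the loop-and-save logic can be wrapped in a function so that the input list and output file are always passed together. The sketch below is one possible arrangement, not the notebook's code: `extract_and_save` and `dummy_embed` are hypothetical names, and `dummy_embed` stands in for `get_esm_embedding` so the example runs without loading the model.

```python
import os
import tempfile
import numpy as np

def extract_and_save(sequences, embed_fn, out_path):
    """Embed every sequence with embed_fn, stack the vectors, and save them."""
    features = np.vstack([embed_fn(seq) for seq in sequences])  # (N, hidden_size)
    np.save(out_path, features)
    return features

def dummy_embed(sequence):
    """Stand-in for get_esm_embedding: returns a length-4 vector per sequence."""
    return np.full(4, float(len(sequence)))

with tempfile.TemporaryDirectory() as tmp:
    out_file = os.path.join(tmp, "demo_peptide.npy")
    feats = extract_and_save(["ACDE", "KLMNPQ"], dummy_embed, out_file)
    reloaded = np.load(out_file)

print(feats.shape)  # (2, 4)
```

In the notebook this would be called as `extract_and_save(data_ex["Peptide"].tolist(), get_esm_embedding, "ESM_2_Peptide_seq_external.npy")`, and again with `data_ex["Full_sequence"].tolist()` and a second output file name of your choosing for the MHC sequences.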

In [29]:
feature_peptide_external = np.load("ESM_2_Peptide_seq_external.npy")

In [31]:
feature_peptide_external.shape

(447, 1280)