
# ESM Embedding Pipeline (Hugging Face Transformers)

This notebook generates **protein embeddings** from amino-acid sequences using **ESM-2** from the Hugging Face Hub (e.g., `facebook/esm2_t33_650M_UR50D`).  
It produces both **per-residue** and **per-sequence** embeddings, with batching and optional CSV input.

**Where do the pretrained weights come from?**  
They are downloaded automatically by `transformers` from the official model repositories on the **Hugging Face Hub** (mirrors of Meta FAIR's releases).


In [None]:

# --- 1) Install dependencies (uncomment if needed) ---
# If you're running locally and already have these, you can skip.
# %pip install -U torch transformers pandas numpy pyarrow tqdm


In [None]:

# --- 2) Config & imports ---
from typing import List, Tuple, Dict
from pathlib import Path
import numpy as np
import pandas as pd
import torch
from tqdm import tqdm
from transformers import AutoTokenizer, EsmModel

# Choose a model size:
#   facebook/esm2_t6_8M_UR50D     (8M params, fast, 320-dim)
#   facebook/esm2_t12_35M_UR50D   (35M, 480-dim)
#   facebook/esm2_t30_150M_UR50D  (150M, 640-dim)
#   facebook/esm2_t33_650M_UR50D  (650M, 1280-dim)  <-- good quality/speed trade-off
#   facebook/esm2_t48_15B_UR50D   (15B, huge)
MODEL_ID   = "facebook/esm2_t33_650M_UR50D"  # use esm2_t6_8M_UR50D for a quick pass
BATCH_SIZE = 24                              # tune up/down to fit memory
USE_FP16   = True                            # only takes effect on CUDA
DEVICE     = "cuda" if torch.cuda.is_available() else "cpu"
OUT_DIR    = Path("esm_outputs")  # where to save outputs

OUT_DIR.mkdir(parents=True, exist_ok=True)
print(f"Using device: {DEVICE} | Model: {MODEL_ID}")


Using device: cpu | Model: facebook/esm2_t33_650M_UR50D


In [None]:

# --- 3) Embedding function (per-residue + per-sequence) ---
from contextlib import nullcontext

@torch.inference_mode()
def esm_embed_hf(
    seqs: List[Tuple[str, str]],
    hub_id: str = MODEL_ID,
    batch_size: int = BATCH_SIZE,
    device: str = DEVICE,
    use_fp16: bool = USE_FP16
) -> Dict[str, Dict[str, torch.Tensor]]:
    """Compute ESM embeddings.

    Args:
        seqs: list of (seq_id, aa_sequence)
        hub_id: Hugging Face model id
        batch_size: batch size for inference
        device: 'cuda' or 'cpu'
        use_fp16: use autocast float16 on CUDA

    Returns:
        {
          'per_residue': {seq_id: FloatTensor[L, D]},
          'per_sequence': {seq_id: FloatTensor[D]}
        }
    """
    tokenizer = AutoTokenizer.from_pretrained(hub_id, do_lower_case=False)
    model = EsmModel.from_pretrained(hub_id).eval().to(device)

    per_residue, per_sequence = {}, {}
    for i in tqdm(range(0, len(seqs), batch_size), desc="Embedding"):
        batch = seqs[i:i+batch_size]
        ids, strings = zip(*batch)

        enc = tokenizer(
            list(strings),
            return_tensors="pt",
            padding=True,
            truncation=True,      # cut sequences that exceed max_length
            max_length=1022,      # counts <cls> and <eos>, so at most 1020 residues are kept
            add_special_tokens=True,
        ).to(device)

        ctx = torch.autocast(device_type="cuda", dtype=torch.float16) if (use_fp16 and device.startswith("cuda")) else nullcontext()
        with ctx:
            out = model(**enc)  # last_hidden_state: [B, T, D]
            hs = out.last_hidden_state

        # Build mask to exclude padding + CLS/EOS specials
        pad_id = tokenizer.pad_token_id
        cls_id = tokenizer.cls_token_id
        eos_id = tokenizer.eos_token_id
        ids_mat = enc["input_ids"]                    # [B, T]
        not_pad = (ids_mat != pad_id)
        not_special = (ids_mat != cls_id) & (ids_mat != eos_id)
        valid_mask = (not_pad & not_special).unsqueeze(-1)  # [B, T, 1]

        for b, sid in enumerate(ids):
            valid_idx = valid_mask[b, :, 0].nonzero(as_tuple=True)[0]
            residue_repr = hs[b, valid_idx, :].detach().cpu()       # [L, D]
            seq_repr = residue_repr.mean(dim=0)                     # [D]
            per_residue[sid] = residue_repr
            per_sequence[sid] = seq_repr

    return {"per_residue": per_residue, "per_sequence": per_sequence}
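
Because each batch is padded to its longest sequence, grouping sequences of similar length can reduce wasted computation. The following is a minimal sketch of that idea, not part of the original pipeline: `length_sorted_batches` is a hypothetical helper, and the sequences in the usage lines are made up. Since `esm_embed_hf` returns dictionaries keyed by sequence id, reordering the input should not change how results are looked up.

```python
from typing import Iterator, List, Tuple


def length_sorted_batches(
    seqs: List[Tuple[str, str]], batch_size: int
) -> Iterator[List[Tuple[str, str]]]:
    """Yield batches of (seq_id, sequence) pairs grouped by similar length."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    # Longest first, so a memory problem (if any) surfaces in the first batch.
    ordered = sorted(seqs, key=lambda pair: len(pair[1]), reverse=True)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


# Usage with made-up sequences.
example = [("a", "MK"), ("b", "MKTIIALSYIFCLVFA"), ("c", "MKT"), ("d", "MKTIIALSYIFCLV")]
batches = list(length_sorted_batches(example, batch_size=2))
batch_ids = [[seq_id for seq_id, _ in batch] for batch in batches]
print(batch_ids)
```

One way to use it would be to pass `sorted(seqs, key=lambda p: len(p[1]), reverse=True)` to `esm_embed_hf`, which already slices its input into consecutive batches.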




In [None]:

# --- 4) Save utilities (Parquet for per-seq; .npy for per-residue) ---

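def save_per_sequence_to_parquet(per_sequence: Dict[str, torch.Tensor], path: Path):
    """Save per-sequence embeddings as Parquet with columns: id, length, dim, embedding (list[float]).

    'length' duplicates 'dim' (the vector size, not the sequence length); values are rounded to float16 precision.
    """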
    rows = []
    for sid, vec in per_sequence.items():
        v = vec.detach().cpu().numpy().astype(np.float16)
        rows.append({
            "id": sid,
            "length": int(len(v)),  # actually 'dim', kept for compatibility
            "dim": int(len(v)),
            "embedding": v.tolist(),  # parquet supports nested lists via pyarrow
        })
    df = pd.DataFrame(rows, columns=["id", "length", "dim", "embedding"])
    df.to_parquet(path, index=False)
    print(f"Saved per-sequence embeddings → {path}")

def save_per_residue_npy(per_residue: Dict[str, torch.Tensor], out_dir: Path):
    """Save each sequence's per-residue matrix as an .npy file named <id>.npy (shape [L, D])."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for sid, mat in per_residue.items():
        arr = mat.detach().cpu().numpy().astype(np.float32)
        np.save(out_dir / f"{sid}.npy", arr)
    print(f"Saved per-residue embeddings → {out_dir} (one .npy per sequence)")
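
`save_per_residue_npy` uses each sequence id directly as a file name, which works for plain accessions such as `P08183` but can fail for ids containing characters like `/` or `|`. The sketch below shows one possible way to clean ids first. `safe_filename` is a hypothetical helper, not something the notebook defines, and the replacement rule (anything outside letters, digits, `.`, `_`, `-` becomes `_`) is an assumption you may want to adjust.

```python
import re


def safe_filename(seq_id: str, suffix: str = ".npy") -> str:
    """Map an arbitrary sequence id to a filesystem-friendly file name."""
    # Replace anything outside letters, digits, dot, underscore, and hyphen.
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", str(seq_id))
    # Avoid names that are empty or consist only of dots (e.g. "." or "..").
    if not cleaned.strip("."):
        cleaned = "unnamed"
    return cleaned + suffix


print(safe_filename("P08183"))
print(safe_filename("sp|P08183|MDR1_HUMAN"))
```

Distinct ids can map to the same cleaned name (for example `a/b` and `a|b`), so it may be worth checking for collisions before saving.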


In [None]:

# # --- 5) Example usage with toy sequences ---

# toy_seqs = [
#     ("P05362", "MKTIIALSYIFCLVFADYKDDDDK"),
#     ("Q9Y2Z4", "MGSSHHHHHHSSGLVPRGSHMASMTGGQQMGRGSLENS"),
# ]

# embs = esm_embed_hf(toy_seqs)
# print("Per-sequence vector shape (first):", next(iter(embs["per_sequence"].values())).shape)

# # Save outputs
# per_seq_path = OUT_DIR / "per_sequence.parquet"
# save_per_sequence_to_parquet(embs["per_sequence"], per_seq_path)

# per_res_dir = OUT_DIR / "per_residue_npy"
# save_per_residue_npy(embs["per_residue"], per_res_dir)



## Optional: Load sequences from CSV

If you have a CSV with columns `id,sequence`, you can load it and embed in batches.
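
Before embedding sequences read from a file, it can help to drop rows that are empty or contain non-protein characters. The sketch below uses only the standard library. `read_sequences` and `ALLOWED` are hypothetical names, and the allowed alphabet (the 20 standard amino acids plus `X`, `B`, `U`, `Z`, `O`) is an assumption; check `tokenizer.get_vocab()` for the symbols your model actually accepts.

```python
import csv
import io
from typing import List, Tuple

# Assumed alphabet: 20 standard amino acids plus X, B, U, Z, O.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYXBUZO")


def read_sequences(handle) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Return (valid (id, sequence) pairs, ids rejected as empty or non-protein)."""
    valid, rejected = [], []
    for row in csv.DictReader(handle):
        # Normalize whitespace and case before checking the alphabet.
        seq = (row["sequence"] or "").strip().upper()
        if seq and set(seq) <= ALLOWED:
            valid.append((row["id"], seq))
        else:
            rejected.append(row["id"])
    return valid, rejected


# Usage with an in-memory CSV containing made-up rows.
demo = io.StringIO("id,sequence\np1,MKTIIALSYIF\np2,mktii\np3,MKT*\np4,\n")
valid, rejected = read_sequences(demo)
print(valid, rejected)
```

With a real file, the same function could be called as `read_sequences(open(csv_path, newline=""))`.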


In [None]:
# # --- 6) Optional: Embed from CSV (id,sequence) ---
# csv_path = "my_sequences.csv"   # <- set your path
# df = pd.read_csv(csv_path)
# seqs = list(df[["id", "sequence"]].itertuples(index=False, name=None))
# embs = esm_embed_hf(seqs)
# save_per_sequence_to_parquet(embs["per_sequence"], OUT_DIR / "per_sequence_from_csv.parquet")
# save_per_residue_npy(embs["per_residue"], OUT_DIR / "per_residue_from_csv")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Set this to the path of your Parquet file in Google Drive
parquet_file_path = '/content/drive/MyDrive/scope_onside_common_v3.parquet/scope_onside_common_v3.parquet'
df_parquet = pd.read_parquet(parquet_file_path)
display(df_parquet.head())

Unnamed: 0,drug_chembl_id,target_uniprot_id,label,smiles,sequence,molfile_3d,rxcui
0,CHEMBL1000,O15245,0,O=C(O)COCCN1CCN(C(c2ccccc2)c2ccc(Cl)cc2)CC1,MPTVDDILEQVGESGWFQKQAFLILCLLSAAFAPICVGIVFLGFTP...,\n RDKit 3D\n\n 52 54 0 0 0 0...,20610
1,CHEMBL1000,P08183,1,O=C(O)COCCN1CCN(C(c2ccccc2)c2ccc(Cl)cc2)CC1,MDLEGDRNGGAKKKNFFKLNNKSEKDKKEKKPTVSVFSMFRYSNWL...,\n RDKit 3D\n\n 52 54 0 0 0 0...,20610
2,CHEMBL1000,P35367,1,O=C(O)COCCN1CCN(C(c2ccccc2)c2ccc(Cl)cc2)CC1,MSLPNSSCLLEDKMCEGNKTTMASPQLMPLVVVLSTICLVTVGLNL...,\n RDKit 3D\n\n 52 54 0 0 0 0...,20610
3,CHEMBL1000,Q02763,0,O=C(O)COCCN1CCN(C(c2ccccc2)c2ccc(Cl)cc2)CC1,MDSLASLVLCGVSLLLSGTVEGAMDLILINSLPLVSDAETSLTCIA...,\n RDKit 3D\n\n 52 54 0 0 0 0...,20610
4,CHEMBL1000,Q12809,0,O=C(O)COCCN1CCN(C(c2ccccc2)c2ccc(Cl)cc2)CC1,MPVRRGHVAPQNTFLDTIIRKFEGQSRKFIIANARVENCAVIYCND...,\n RDKit 3D\n\n 52 54 0 0 0 0...,20610


In [None]:
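# Rows are drug-target pairs, so a target can repeat; embed each unique target once
# (results are keyed by id, so repeated ids would only overwrite each other).
unique_targets = df_parquet[["target_uniprot_id", "sequence"]].drop_duplicates("target_uniprot_id")
seqs = list(unique_targets.itertuples(index=False, name=None))
embs = esm_embed_hf(seqs)
save_per_sequence_to_parquet(embs["per_sequence"], OUT_DIR / "ESM embeddings.parquet")
save_per_residue_npy(embs["per_residue"], OUT_DIR / "per_residues")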

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t33_650M_UR50D and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Embedding:   0%|          | 4/1448 [48:54<295:05:01, 735.67s/it]


---

### Notes & Provenance
- **Weights**: Downloaded via `transformers` from the Hugging Face Hub model repos such as `facebook/esm2_t33_650M_UR50D`.  
- **Max length**: Inputs are truncated to `max_length=1022` tokens. In `transformers` this count includes the `<cls>` and `<eos>` special tokens, so at most 1020 residues per sequence are embedded; consider chunking longer sequences.  
- **Pooling**: The per-sequence vector is a mean over residue embeddings, with padding and special tokens masked out. You can swap in the `<cls>` token representation or attention pooling.  
- **Precision**: With CUDA, FP16 autocast can speed up inference and reduce memory usage.
- **Formats**: Per-sequence embeddings → Parquet (list column). Per-residue embeddings → one `.npy` per sequence.
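
The chunking mentioned under **Max length** could look like the sketch below. It is an illustration, not part of the pipeline above: `chunk_sequence` is a hypothetical helper, the `_chunk<n>` id suffix is an invented naming scheme, and the default of 1020 residues assumes the `max_length=1022` setting used in `esm_embed_hf` (two positions go to special tokens).

```python
from typing import List, Tuple


def chunk_sequence(
    seq_id: str, sequence: str, max_residues: int = 1020, overlap: int = 0
) -> List[Tuple[str, str]]:
    """Split a sequence into windows of at most max_residues residues."""
    if max_residues < 1 or not 0 <= overlap < max_residues:
        raise ValueError("need max_residues >= 1 and 0 <= overlap < max_residues")
    if len(sequence) <= max_residues:
        return [(seq_id, sequence)]  # short sequences pass through unchanged
    step = max_residues - overlap
    chunks = []
    for start in range(0, len(sequence), step):
        window = sequence[start:start + max_residues]
        chunks.append((f"{seq_id}_chunk{len(chunks)}", window))
        if start + max_residues >= len(sequence):
            break  # this window already reaches the end of the sequence
    return chunks


# Usage with a made-up 2500-residue sequence.
pieces = chunk_sequence("demo", "A" * 2500)
print([(cid, len(seq)) for cid, seq in pieces])
```

The resulting `(id, sequence)` pairs can be passed to `esm_embed_hf` like any other input. How to recombine chunk embeddings (for example, concatenating per-residue matrices or averaging per-sequence vectors weighted by chunk length) is a design choice left to the downstream task.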

Happy embedding!
