# MolEncoder CLS Token Embedding Tutorial

This notebook shows how to extract CLS token embeddings from the MolEncoder model and use them as molecular representations.

In this example, you'll learn how to:
- Load the pre-trained MolEncoder model
- Extract CLS token embeddings from SMILES strings

**Important Note:** CLS token embeddings are most meaningful after the model has been fine-tuned on a specific downstream task. For examples of fine-tuning MolEncoder, please check the `classification_finetune.ipynb` and `regression_finetune.ipynb` notebooks in this folder.


In [None]:
%pip install transformers torch numpy


In [1]:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

model_id = "fabikru/MolEncoder"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

In [2]:
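smiles = "CCCCO"  # 1-butanol
# Tokenize the SMILES string; inputs longer than max_length tokens are truncated
inputs = tokenizer(smiles, return_tensors="pt", max_length=502, truncation=True)

# ModernBERT doesn't take token_type_ids, so drop them if the tokenizer returns them
if "token_type_ids" in inputs:
    del inputs["token_type_ids"]

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model(**inputs)

# Extract the [CLS] token representation (first position of the sequence)
# last_hidden_state has shape (batch_size, sequence_length, hidden_size)
cls_embedding = outputs.last_hidden_state[:, 0, :]
# Remove the batch dimension, leaving a vector of shape (hidden_size,)
cls_embedding = cls_embedding.squeeze(0)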

Compiling the model with `torch.compile` and using a `torch.cpu` device is not supported. Falling back to non-compiled mode.


In [3]:
cls_embedding.shape

torch.Size([384])
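
The indexing used in the extraction cell can be illustrated without loading the model. The sketch below uses a NumPy array of random values as a stand-in for `outputs.last_hidden_state`; the sequence length of 7 is an arbitrary assumption, and only the hidden size of 384 is taken from the output above. It shows how `[:, 0, :]` selects the first token position and how `squeeze(0)` removes the batch dimension.

```python
import numpy as np

# Stand-in for outputs.last_hidden_state (random values, NOT real model output).
# Assumed shape: (batch_size, sequence_length, hidden_size)
batch_size, seq_len, hidden_size = 1, 7, 384  # seq_len is an arbitrary choice
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# Select position 0 (the first token) for every item in the batch
cls_batch = last_hidden_state[:, 0, :]
print(cls_batch.shape)   # (1, 384)

# Drop the batch dimension to get a single vector
cls_single = cls_batch.squeeze(0)
print(cls_single.shape)  # (384,)
```

With more than one SMILES string in the batch, the first slice would keep one row per molecule, and `squeeze(0)` would no longer apply.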

In [4]:
# Convert to a NumPy array (moved to the CPU first, in case the tensor lives on a GPU)
cls_embedding = cls_embedding.cpu().numpy()
cls_embedding.shape

(384,)
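
Once embeddings are available as NumPy arrays, one common way to compare two molecules is cosine similarity. The following is a minimal sketch, not part of the original notebook: `cosine_similarity` is a hypothetical helper, and `emb_a` and `emb_b` are random stand-in vectors of length 384 rather than real MolEncoder embeddings. As noted at the top, such comparisons are likely to be more meaningful after fine-tuning.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

# Random stand-ins for two CLS embeddings (NOT real model output)
rng = np.random.default_rng(0)
emb_a = rng.normal(size=384).astype(np.float32)
emb_b = rng.normal(size=384).astype(np.float32)

print(cosine_similarity(emb_a, emb_a))  # a vector compared with itself
print(cosine_similarity(emb_a, emb_b))  # two unrelated random vectors
```

In practice, `emb_a` and `emb_b` would be replaced by the `cls_embedding` arrays computed for two different SMILES strings.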