In [1]:
import pandas as pd

In [2]:
zinc_data_cleaned = pd.read_csv('zinc_data_100to149_cleaned.csv')

In [3]:
zinc_data_cleaned.shape

(1086866, 3)

In [4]:
zinc_data_cleaned = zinc_data_cleaned.drop_duplicates(subset=['smiles'])

In [5]:
zinc_data_cleaned = zinc_data_cleaned.drop_duplicates(subset=['zinc_id'])

In [6]:
zinc_data_cleaned.shape

(1086836, 3)
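
The two shapes above imply that the pair of `drop_duplicates` calls removed 30 rows in total (1,086,866 minus 1,086,836). As a minimal sketch, the toy frame below (made-up values, not ZINC data) shows one way to count how many rows each key would remove before dropping them.

```python
import pandas as pd

# Toy data (hypothetical values) with one repeated SMILES and one repeated ID.
toy = pd.DataFrame({
    "smiles": ["CCO", "CCO", "CCN", "CCC"],
    "zinc_id": [1, 2, 3, 3],
})

# duplicated() marks every repeat after the first occurrence of a key.
dup_smiles = int(toy.duplicated(subset=["smiles"]).sum())
dup_ids = int(toy.duplicated(subset=["zinc_id"]).sum())

# Same two-step filter as in the cells above: first by SMILES, then by ID.
deduped = toy.drop_duplicates(subset=["smiles"]).drop_duplicates(subset=["zinc_id"])
print(dup_smiles, dup_ids, len(deduped))
```

Note that the second filter runs on what survives the first, so the counts from `duplicated()` on the full frame need not add up to the number of rows actually removed.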

In [7]:
zinc_data = zinc_data_cleaned.copy()

In [8]:
zinc_data.head()

,smiles,zinc_id,sanitized_smiles
0,OC[C@@H]1CCCNC1,388342,OC[C@@H]1CCCNC1
1,C[C@@H](O)C(CO)[C@@H](C)O,410291,C[C@@H](O)C(CO)[C@@H](C)O
2,NCC1(O)CC1,2540025,NCC1(O)CC1
3,CN1CCN=C1N,3075393,CN1CCN=C1N
4,O=Cc1cn(CC(=O)O)cn1,59724508,O=Cc1cn(CC(=O)O)cn1


#### Pretrained model

In [9]:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the ChemBERTa tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1", clean_up_tokenization_spaces=False)
model = AutoModelForMaskedLM.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

# Ensure the model is in evaluation mode
model.eval()

# Function to compute embeddings
def get_chemberta_embeddings(smiles_series):
    embeddings = []
    
    for smiles in smiles_series:
        # Tokenize the SMILES string
        inputs = tokenizer(smiles, return_tensors="pt", padding=True, truncation=True, max_length=512)
        
        # Forward pass through the model to get hidden states
        with torch.no_grad():
            outputs = model(**inputs, output_hidden_states=True)
        
        # Extract the last hidden state (embedding for each token)
        hidden_states = outputs.hidden_states[-1]
        
        # Average the embeddings for all tokens (this creates a single vector for the molecule)
        molecule_embedding = hidden_states.mean(dim=1).squeeze()
        
        # Convert to numpy array (or keep as tensor if preferred)
        embeddings.append(molecule_embedding.cpu().numpy())
    
    return embeddings


Some weights of the model checkpoint at seyonec/ChemBERTa-zinc-base-v1 were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The warning means the checkpoint contains weights for a pooler layer (roberta.pooler.dense) that the class being instantiated, RobertaForMaskedLM (selected by AutoModelForMaskedLM), does not have. Those weights are skipped at load time. The pooler is a dense layer applied to the first token's hidden state, typically used for classification-style tasks; masked language modeling does not need it. The warning is safe to ignore here because get_chemberta_embeddings averages the last hidden state over tokens and never uses the pooler output.

Why this happens:
from_pretrained copies checkpoint weights into the model by parameter name. Any checkpoint weight with no matching parameter in the target architecture is skipped and reported in this warning. The transformer layers that produce the hidden states are still loaded.
What you can do:
Ignore the warning: the embeddings come from the encoder's hidden states, which are not affected by the skipped pooler weights.

Load the model without the MLM head if you'd like: If you only need embeddings and not the MLM capability, you could use AutoModel instead of AutoModelForMaskedLM:

In [10]:
import torch
from transformers import AutoTokenizer, AutoModel

# Load the ChemBERTa tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1", clean_up_tokenization_spaces=False)
model = AutoModel.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

# Ensure the model is in evaluation mode
model.eval()

# Function to compute embeddings
def get_chemberta_embeddings(smiles_series):
    embeddings = []
    
    for smiles in smiles_series:
        # Tokenize the SMILES string
        inputs = tokenizer(smiles, return_tensors="pt", padding=True, truncation=True, max_length=512)
        
        # Forward pass through the encoder (no gradients needed for inference)
        with torch.no_grad():
            outputs = model(**inputs)
        
        # Token-level embeddings from the final layer
        hidden_states = outputs.last_hidden_state
        
        # Average the embeddings for all tokens (this creates a single vector for the molecule)
        molecule_embedding = hidden_states.mean(dim=1).squeeze()
        
        # Convert to numpy array (or keep as tensor if preferred)
        embeddings.append(molecule_embedding.cpu().numpy())
    
    return embeddings
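
The function above runs one molecule per forward pass, so no padding is involved and a plain mean over tokens is fine. If several SMILES strings are tokenized together to speed things up, the shorter ones are padded, and the padded positions should be left out of the average. The following is a minimal sketch of a mask-aware mean. The helper name `masked_mean_pool` is hypothetical, and it is demonstrated on synthetic tensors rather than on real ChemBERTa output.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    last_hidden_state: (batch, seq_len, hidden) float tensor
    attention_mask:    (batch, seq_len) tensor, 1 for real tokens, 0 for padding
    """
    # Broadcast the mask over the hidden dimension.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Number of real tokens per sequence; clamp avoids division by zero.
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts

# Synthetic stand-in for model output: 2 sequences, 4 positions, 3 features.
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])  # second sequence has 2 pads
pooled = masked_mean_pool(hidden, mask)
```

With the real model, the inputs would come from `tokenizer(list_of_smiles, return_tensors="pt", padding=True, truncation=True, max_length=512)`, and the function would receive `outputs.last_hidden_state` together with `inputs["attention_mask"]`.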


In [11]:
# Get embeddings for the ZINC molecules
zinc_embeddings = get_chemberta_embeddings(zinc_data['sanitized_smiles'])

# Add the embeddings to the dataframe for future use
zinc_data['embedding'] = zinc_embeddings

# Now, you can use these embeddings for similarity searches, clustering, etc.
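
As an illustration of the similarity search mentioned in the comment above, here is a minimal sketch of cosine similarity between one query vector and every row of an embedding matrix. The function name and the tiny two-dimensional vectors are made up for the example; real embeddings would be much wider.

```python
import numpy as np

def cosine_similarity_to_query(query, matrix):
    """Cosine similarity between one vector and each row of a 2-D array."""
    query = np.asarray(query, dtype=np.float64)
    matrix = np.asarray(matrix, dtype=np.float64)
    # Small epsilon guards against division by zero for all-zero vectors.
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query) + 1e-12
    return matrix @ query / denom

# Tiny made-up vectors standing in for molecule embeddings.
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
scores = cosine_similarity_to_query(embeddings[0], embeddings)
ranking = np.argsort(-scores)  # indices ordered from most to least similar
```

For the dataframe in this notebook, the matrix could be built with `np.stack(zinc_data['embedding'].to_list())`, assuming every embedding has the same length.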

In [12]:
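# Save the dataframe, including the embedding column, to a single CSV file.
# Note: each embedding array is written as its string representation.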
zinc_data.to_csv('zinc_data100-149_embeddings.csv', index=False)
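
Because `to_csv` writes each NumPy array through its string form, the embedding column comes back as text when the file is read again and has to be parsed by hand. One possible alternative, sketched below under the assumption that all embeddings have the same length, is to keep the vectors in a binary `.npy` file next to a CSV that holds the IDs and SMILES. The file name and the random vectors are placeholders for the example.

```python
import os
import tempfile
import numpy as np

# Made-up stand-in for the list returned by get_chemberta_embeddings.
embedding_list = [np.random.rand(8).astype(np.float32) for _ in range(5)]

# Stack equal-length vectors into one (n_molecules, hidden_size) array.
matrix = np.stack(embedding_list)

# Hypothetical output location; a temporary directory keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "embeddings_example.npy")
np.save(out_path, matrix)

# Loading restores the exact values, dtype, and shape.
restored = np.load(out_path)
```

Row `i` of the saved array corresponds to row `i` of the dataframe only as long as both keep the same order, so the two files should be written from the same, unshuffled frame.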

In [13]:
zinc_data.shape

(1086836, 4)