# Transformer Experiment: ChemBERTa Embeddings

## 1. Overview
We use the pre-trained Transformer model `seyonec/ChemBERTa-zinc-base-v1` to extract dense vector representations (embeddings) from SMILES strings. The embeddings encode chemical context that the model learned during pre-training on molecules from the ZINC database, without any melting-point labels.

We then train a regressor (XGBoost) on these embeddings to predict melting point.
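
The notebook does not show how `ChemBERTaFeaturizer` reduces the per-token hidden states to a single vector per molecule. One common choice is masked mean pooling, which averages the token vectors while skipping padding positions. The sketch below illustrates that idea on toy arrays. It is an assumption about the pooling step, not the featurizer's confirmed behaviour, and `masked_mean_pool` is a hypothetical helper name.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padding positions.

    hidden_states:  (batch, seq_len, hidden_dim) float array
    attention_mask: (batch, seq_len) array, 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (batch, hidden_dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)                 # guard against empty rows
    return summed / counts

# Toy stand-in for model output: 2 sequences, 3 token positions, 4 dimensions.
hidden = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
mask = np.array([[1, 1, 0],    # last position is padding
                 [1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)  # (2, 4)
```

With a real model, `hidden_states` would be the last hidden layer (768 dimensions here) and `attention_mask` would come from the tokenizer. Taking the first token's vector instead of the mean is another common option.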

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
import sys
import os

# Add the parent directory (project root) to the import path so `src` is importable
sys.path.append(os.path.abspath('..'))

from src.features import ChemBERTaFeaturizer
from src.models import XGBoostModel
from src.utils.metrics import calculate_metrics

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## 2. Load Data and Generate Embeddings
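This step replaces hand-crafted feature engineering with features extracted by the pre-trained model. The featurizer returns each input DataFrame with the embedding columns (`ChemBERTa_*`) appended to the original columns.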
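
As a rough illustration of the resulting layout, the sketch below appends a block of embedding columns to a small DataFrame. The helper name `attach_embeddings`, the zero-filled embedding matrix, the toy rows, and the zero-based column numbering are all assumptions made for the example; the real logic lives in `src.features` and is not shown in this notebook.

```python
import numpy as np
import pandas as pd

def attach_embeddings(df, embeddings, prefix="ChemBERTa_"):
    """Return df with one new column per embedding dimension appended."""
    emb_cols = [f"{prefix}{i}" for i in range(embeddings.shape[1])]
    # Reuse df's index so rows stay aligned when concatenating column-wise.
    emb_df = pd.DataFrame(embeddings, columns=emb_cols, index=df.index)
    return pd.concat([df, emb_df], axis=1)

# Toy input (illustrative values only): 2 molecules, 768-dim placeholder embeddings.
toy = pd.DataFrame({"id": [1, 2],
                    "SMILES": ["CCO", "c1ccccc1"],
                    "Tm": [159.05, 278.65]})
toy_emb = np.zeros((2, 768))
out = attach_embeddings(toy, toy_emb)
print(out.shape)  # (2, 771)
```

The same arithmetic is consistent with the shapes printed below: the test set has one column fewer than the train set because it has no `Tm` target.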

In [2]:
train_raw = pd.read_csv('../data/raw/train.csv')
test_raw = pd.read_csv('../data/raw/test.csv')

print("Initializing ChemBERTa (this downloads the model if first time)...")
featurizer = ChemBERTaFeaturizer()

print("Generating embeddings for Train set...")
train_emb = featurizer.calculate_transformer_features(train_raw, smiles_col='SMILES')

print("Generating embeddings for Test set...")
test_emb = featurizer.calculate_transformer_features(test_raw, smiles_col='SMILES')

print("Shapes:", train_emb.shape, test_emb.shape)
print(train_emb.head())

Initializing ChemBERTa (this downloads the model if first time)...
Loading ChemBERTa model: seyonec/ChemBERTa-zinc-base-v1 on cuda...


Generating embeddings for Train set...
Generating Embeddings: 100%|██████████| 84/84 [00:01<00:00, 56.22it/s]

Generating embeddings for Test set...
Generating Embeddings: 100%|██████████| 21/21 [00:00<00:00, 72.56it/s]

Shapes: (2662, 1195) (666, 1194)
     id                       SMILES      Tm  Group 1  Group 2  Group 3  \
0  2175        FC1=C(F)C(F)(F)C1(F)F  213.15        0        0        0   
1  1222  c1ccc2c(c1)ccc3Nc4ccccc4c23  407.15        0        0        0   
2  2994          CCN1C(C)=Nc2ccccc12  324.15        2        1        0   
3  1704                   CC#CC(=O)O  351.15        1        0        0   
4  2526                    CCCCC(S)C  126.15        2        3        0   

   Group 4  Group 5  Group 6  Group 7  ...  ChemBERTa_758  ChemBERTa_759  \
0        0        0        0        0  ...       0.516725      -0.220990   
1        0        0        0        0  ...       1.209761       0.328331   
2        0        0        0        0  ...       0.677359       0.531777   
3        0        0        0        0  ...       0.363788       0.355860   
4        0        0        0        0  ...      -0.732704      -0.026011   

   ChemBERTa_760  ChemBERTa_761  ChemBERTa_762  ChemBERTa_7




## 3. Train XGBoost on Embeddings
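Each embedding has 768 dimensions, and the relationship between individual dimensions and melting point is unlikely to be linear. Gradient-boosted trees such as XGBoost can model non-linear interactions between features and need no feature scaling, which makes them a reasonable first choice. Only the `ChemBERTa_*` columns are used as features here; the `Group` columns are left out.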

In [3]:
feature_cols = [c for c in train_emb.columns if c.startswith('ChemBERTa_')]
X = train_emb[feature_cols]
y = train_emb['Tm']
X_test = test_emb[feature_cols]

kf = KFold(n_splits=5, shuffle=True, random_state=42)
results = []
test_fold_preds = []

print("Training XGBoost on ChemBERTa embeddings...")

for fold, (train_idx, val_idx) in enumerate(kf.split(X, y)):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    
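    # Embeddings are dense, so hyperparameters may need to differ from those
    # used for sparse, hand-crafted features.
    model = XGBoostModel({'n_estimators': 2000, 'learning_rate': 0.02, 'max_depth': 6})
    # The validation fold is passed to fit(); if the wrapper uses it for early
    # stopping, the fold MAE reported below is slightly optimistic.
    model.fit(X_train, y_train, X_val, y_val)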
    
    val_pred = model.predict(X_val)
    metrics = calculate_metrics(y_val, val_pred)
    results.append(metrics)
    print(f"Fold {fold+1} MAE: {metrics['MAE']:.4f}")
    
    test_fold_preds.append(model.predict(X_test))

avg_mae = np.mean([m['MAE'] for m in results])
print(f"\nAverage CV MAE (Transformer): {avg_mae:.4f}")

# Create submission: average the test predictions of the five fold models
avg_preds = np.mean(test_fold_preds, axis=0)
submission = pd.DataFrame({'id': test_emb['id'], 'Tm': avg_preds})
submission.to_csv('../submissions/submission_chemberta.csv', index=False)
print("Saved ChemBERTa submission.")

Training XGBoost on ChemBERTa embeddings...
Fold 1 MAE: 42.6619
Fold 2 MAE: 41.0009
Fold 3 MAE: 39.8200
Fold 4 MAE: 40.9981
Fold 5 MAE: 40.3526

Average CV MAE (Transformer): 40.9667
Saved ChemBERTa submission.
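
The fold scores above come from `calculate_metrics`, which is defined in `src.utils.metrics` and not shown in this notebook. The only thing the notebook confirms is that it returns a dictionary with an `'MAE'` key. A minimal sketch of a compatible function might look like the following; the name `calculate_metrics_sketch`, the extra `'RMSE'` key, and the toy values are assumptions for illustration.

```python
import numpy as np

def calculate_metrics_sketch(y_true, y_pred):
    """Return regression error metrics as a dict (hypothetical stand-in)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "MAE": float(np.mean(np.abs(err))),         # mean absolute error
        "RMSE": float(np.sqrt(np.mean(err ** 2))),  # root mean squared error
    }

# Toy check: absolute errors of 10, 10 and 30 give an MAE of 50 / 3.
m = calculate_metrics_sketch([300.0, 350.0, 400.0], [310.0, 340.0, 430.0])
print(f"MAE: {m['MAE']:.4f}")  # MAE: 16.6667
```

MAE is expressed in the same units as `Tm`, so the average CV score of 40.9667 means predictions are off by about 41 units of `Tm` on average.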
