# Getting Started

[![arXiv:2409.15370](https://img.shields.io/badge/cs.LG-2409.15370-b31b1b?style=flat&logo=arxiv&logoColor=red)](https://arxiv.org/abs/2409.15370)
![PyPI - Downloads](https://img.shields.io/pypi/dm/smirk)
[![GitHub Release Date](https://img.shields.io/github/release-date/BattModels/smirk?display_date=published_at&logo=github)](https://github.com/BattModels/smirk)


Molecular foundation models are all the rage, but without a tokenizer that can represent *all* of chemistry, any model will be inherently limited. Current "atomwise" tokenizers are fundamentally limited because they require a different token for every "bracketed atom" in the [SMILES](https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Line_Entry_System) encoding of a compound.

The problem is, most atoms are bracketed. Most elements (e.g. lithium, `[Li]`), chiral centers, isotopes, and charged species all require bracketed atoms. Compounds where bracketed atoms are critical to their effectiveness include:

- [Cisplatin](https://en.wikipedia.org/wiki/Cisplatin): Effective chemotherapy drug, but its isomer ([transplatin](https://en.wikipedia.org/wiki/Transplatin)) is not.
- [Sodium pertechnetate](https://en.wikipedia.org/wiki/Sodium_pertechnetate): [Radiopharmaceutical](https://en.wikipedia.org/wiki/Radiopharmaceutical) used for thyroid imaging
- [Lithium Iron Phosphate](https://en.wikipedia.org/wiki/Lithium_iron_phosphate): Widely used cathode material within EV battery packs.
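
To make the limitation concrete, here is a minimal sketch of an atomwise tokenizer. The regular expression is a simplified stand-in written for this example (it is not the pattern used by any particular model), and `atomwise_tokenize` is a hypothetical helper. It treats everything between `[` and `]` as a single token, so every new combination of element, isotope, chirality, or charge needs its own vocabulary entry.

```python
import re

# Simplified atomwise pattern (an assumption for illustration):
# a whole bracketed atom, a two-letter halogen, a two-digit ring closure,
# or any other single character.
ATOMWISE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def atomwise_tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into atomwise tokens."""
    return ATOMWISE.findall(smiles)


examples = [
    "Cl[Pt@SP1](Cl)([NH3])[NH3]",  # Cisplatin
    "Cl[Pt@SP2](Cl)([NH3])[NH3]",  # Transplatin
    "[O-][99Tc](=O)(=O)=O.[Na+]",  # Sodium pertechnetate
]

# Collect every distinct bracketed token the vocabulary would need
bracketed = set()
for smi in examples:
    bracketed.update(t for t in atomwise_tokenize(smi) if t.startswith("["))

print(atomwise_tokenize(examples[0]))
print(sorted(bracketed))
```

Three small molecules already need six distinct bracketed tokens, and cisplatin and transplatin only differ by a token (`[Pt@SP1]` vs. `[Pt@SP2]`) that a fixed atomwise vocabulary may simply not contain.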

Smirk fixes this by fully tokenizing a SMILES string all the way down to its constituent elements. This enables a vocabulary of only 167 tokens to represent all of [OpenSMILES](http://opensmiles.org/), with all the special tokens needed for language modeling baked in.
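
As a rough illustration of what tokenizing "all the way down" can look like, the sketch below splits a bracketed atom into its parts following the OpenSMILES bracket-atom layout (isotope, symbol, chirality, hydrogen count, charge, atom class). It is a toy written for this page: the regular expression, the `decompose` helper, and the exact piece boundaries are assumptions and will not necessarily match smirk's actual vocabulary or implementation.

```python
import re

# Toy pattern for one bracketed atom (an assumption, not smirk's grammar)
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|[a-z]{1,2}|\*)"
    r"(?P<chiral>@(?:@|(?:TH|AL|SP|TB|OH)\d{1,2})?)?"
    r"(?P<hcount>H\d?)?"
    r"(?P<charge>\+\+|--|[+-]\d{0,2})?"
    r"(?::(?P<cls>\d+))?\]"
)


def split_field(field):
    """Split a field into two-letter tags (e.g. SP) and single characters."""
    return re.findall(r"[A-Z]{2}|.", field or "")


def decompose(atom: str) -> list[str]:
    """Break a bracketed atom such as '[Pt@SP1]' into small reusable pieces."""
    m = BRACKET_ATOM.fullmatch(atom)
    if m is None:
        raise ValueError(f"not a bracketed atom: {atom!r}")
    pieces = ["["]
    pieces += split_field(m["isotope"])
    pieces.append(m["symbol"])  # the element symbol stays whole
    for name in ("chiral", "hcount", "charge"):
        pieces += split_field(m[name])
    if m["cls"]:
        pieces += [":"] + split_field(m["cls"])
    pieces.append("]")
    return pieces


for atom in ["[Pt@SP1]", "[Pt@SP2]", "[NH4+]", "[99Tc]", "[O-]"]:
    print(atom, "->", " ".join(decompose(atom)))
```

With pieces like these, `[Pt@SP1]` and `[Pt@SP2]` share every token except the final digit, so a new isotope or chirality label is spelled out from existing pieces instead of requiring a new vocabulary entry.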

Check out the paper for all the details, but otherwise let's see it in action 😏

🐍 Installation is easy with pre-built binaries on [PyPI](https://pypi.org/project/smirk/) and [GitHub](https://github.com/BattModels/smirk/releases). Just run: `pip install smirk`
> Want to install from source? See [installing from source](./developer.md#installing-from-source)

In [None]:
!python -m pip install smirk transformers rdkit torch

## Getting Started
🤗 smirk is built using HuggingFace's [tokenizers](https://huggingface.co/docs/tokenizers) library for out-of-the-box compatibility with [transformers' PreTrainedTokenizerFast](https://huggingface.co/docs/transformers/main_classes/tokenizer). No need to learn another tokenizer; things just work.

In [None]:
from smirk import SmirkTokenizerFast

# Just import and tokenize!
smirk = SmirkTokenizerFast()
smirk("CC(=O)Nc1ccc(O)cc1")

In [None]:
# Batch Tokenization with Padding
batch = smirk([
    "C[C@@H]1CCCCCCCCCCCCC(=O)C1",
    "O=C(O)C[C@H](N)C(=O)N[C@H](C(=O)OC)Cc1ccccc1",
    "CN(C)S[N][Re@OH18]([C][O])([C][O])([C][O])([C][O])[C][O]"
], padding="longest")
batch

In [None]:
# Back to molecules!
smirk.batch_decode(batch["input_ids"], skip_special_tokens=True)

In [None]:
# By default, we don't add `[CLS]` and `[SEP]` tokens, but that's just a flag
smirk_bert = SmirkTokenizerFast(template="[CLS] $0 [SEP]")
" ".join(smirk_bert.tokenize("CNCCC(c1ccccc1)Oc2ccc(cc2)C(F)(F)F", add_special_tokens=True))

## What Makes Smirk Special?

In [None]:
from transformers import AutoTokenizer

tokenizers = {
    "smirk": smirk,
    "molformer": AutoTokenizer.from_pretrained("ibm/MoLFormer-XL-both-10pct", trust_remote_code=True),
    "GPT-4o": AutoTokenizer.from_pretrained("Xenova/gpt-4o"),
}

In [None]:
from rdkit import Chem
from rdkit.Chem.Draw import MolsToGridImage, rdMolDraw2D
from IPython.display import SVG

smi = [
    "Cl[Pt@SP1](Cl)([NH3])[NH3]", # Cisplatin
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C", # Caffeine
    "NCCc1cc(O)c(O)cc1", # Dopamine
    "Cl[Pt@SP2](Cl)([NH3])[NH3]", # Transplatin
    "[NH4+].[NH4+].OP([O-])([O-])=O", # Diammonium phosphate
    "[O-][99Tc](=O)(=O)=O.[Na+]", # Sodium pertechnetate with radiotracer

]
#smi = [smi[0]]
def get_legend(smi:str, tokenizers:dict):
    entries = []
    for name, tok in tokenizers.items():
        entries.append(f"{name}: {' '.join(tok.tokenize(smi))}")
    return "\n".join(entries)


drawOptions = rdMolDraw2D.MolDrawOptions()
drawOptions.fixedScale = 1
drawOptions.centreMoleculesBeforeDrawing = True
drawOptions.minFontSize = 6
drawOptions.legendFontSize = 24
drawOptions.legendFraction = 0.3
MolsToGridImage(
    [Chem.MolFromSmiles(s) for s in smi],
    molsPerRow=2, subImgSize=(400, 200),
    legends=[get_legend(s, tokenizers) for s in smi],
    drawOptions=drawOptions,
)
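
The grid gives a visual comparison; a rough numeric counterpart is to count how often a tokenizer falls back to its unknown token. The helper below is a sketch under assumptions: `unknown_rate` and `ToyTokenizer` are hypothetical names introduced here, and the helper relies only on the standard `unk_token_id` attribute and the callable interface of HuggingFace tokenizers. The character-level stand-in lets the block run without downloading anything.

```python
def unknown_rate(tokenizer, smiles):
    """Fraction of SMILES strings that produce at least one unknown token."""
    unk_id = tokenizer.unk_token_id
    if unk_id is None or not smiles:
        # Tokenizers without an unknown token (e.g. byte-level BPE) never hit it
        return 0.0
    hits = 0
    for s in smiles:
        ids = tokenizer(s, add_special_tokens=False)["input_ids"]
        hits += unk_id in ids
    return hits / len(smiles)


class ToyTokenizer:
    """Hypothetical stand-in: a tiny character vocabulary with id 0 as unknown."""

    unk_token_id = 0

    def __init__(self):
        self.vocab = {ch: i + 1 for i, ch in enumerate("CNO()=1c")}

    def __call__(self, text, add_special_tokens=False):
        ids = [self.vocab.get(ch, self.unk_token_id) for ch in text]
        return {"input_ids": ids}


toy = ToyTokenizer()
print(unknown_rate(toy, ["CCO", "C[Pt]"]))
```

In the notebook, the same helper could be applied to each entry of the `tokenizers` dictionary with the `smi` list, e.g. `{name: unknown_rate(tok, smi) for name, tok in tokenizers.items()}`.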

## Zero to Molecular Foundation Model with Smirk!

### HuggingFace Accelerate

In [None]:
!python -m pip install accelerate datasets

In [None]:
from accelerate import Accelerator
from transformers import Trainer, TrainingArguments, RobertaForMaskedLM, RobertaConfig, DataCollatorForLanguageModeling
from datasets import load_dataset

# MoleculeNet's QM9 dataset. Normally this would be a larger (and unlabeled)
# dataset, but for a demo it's perfect
dataset = load_dataset("csv", 
    data_files=["https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm9.csv"],
)["train"].select_columns("smiles").train_test_split(test_size=0.2)

# Tokenize the splits! For a larger dataset, this would be done on-the-fly
dataset = dataset.map(smirk, input_columns=["smiles"], desc="Tokenizing", num_proc=1)

# A very small model
config = RobertaConfig(
    vocab_size=len(smirk),
    hidden_size=256,
    intermediate_size=1024,
    num_hidden_layers=4,
    num_attention_heads=4,
)
model = RobertaForMaskedLM(config)

trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=smirk,
    data_collator=DataCollatorForLanguageModeling(smirk),
)

In [None]:
trainer.train()
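
For readers curious about what `DataCollatorForLanguageModeling` does to each batch, here is a plain-Python sketch of BERT-style masking: by default about 15% of non-special tokens are selected, and of those 80% become the mask token, 10% become a random token, and 10% are left unchanged, with unselected positions labeled `-100` so the loss ignores them. The `mask_tokens` function and the ids in the demo are hypothetical and only mirror the idea; the real collator operates on tensors.

```python
import random


def mask_tokens(ids, mask_id, vocab_size, special_ids=(), p=0.15, rng=random):
    """Return (inputs, labels) for masked language modeling on one sequence."""
    inputs = list(ids)
    labels = [-100] * len(ids)  # -100 marks positions the loss should skip
    for i, tok in enumerate(ids):
        if tok in special_ids or rng.random() >= p:
            continue  # token not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels


# Made-up ids for illustration: 0 and 2 play the role of special tokens
inputs, labels = mask_tokens(
    [0, 11, 12, 13, 14, 15, 2],
    mask_id=1,
    vocab_size=167,
    special_ids={0, 2},
    rng=random.Random(0),
)
print(inputs)
print(labels)
```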