
See the SELFIES repository: https://github.com/aspuru-guzik-group/selfies

Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation
Mario Krenn, Florian Haese, AkshatKumar Nigam, Pascal Friederich, Alan Aspuru-Guzik
Machine Learning: Science and Technology 1, 045024 (2020), extensive blog post January 2021.
Major contributors since v1.0.0: Alston Lo and Seyone Chithrananda

A main objective of SELFIES is to serve as direct input to machine learning
models, in particular generative models, so that the molecular graphs they
generate are both syntactically and semantically valid.

In [2]:
pip install selfies

Collecting selfies
  Downloading https://files.pythonhosted.org/packages/cc/a5/4b190ee192394068827f06a9e391f3e169cf6265b6491ef8a654cb763dcb/selfies-1.0.3-py3-none-any.whl
Installing collected packages: selfies
Successfully installed selfies-1.0.3


In [3]:
import selfies as sf

In [15]:
smiles = 'NCCc1ccccc1C#CO'

In [16]:
encoded_smiles = sf.encoder(smiles)
encoded_smiles

'[N][C][C][C][=C][C][=C][C][=C][Ring1][Branch1_2][C][#C][O]'

In [17]:
decoded_smiles = sf.decoder(encoded_smiles)
decoded_smiles

'NCCC1=CC=CC=C1C#CO'

In [18]:
symbols_smiles = list(sf.split_selfies(encoded_smiles))
symbols_smiles

['[N]',
 '[C]',
 '[C]',
 '[C]',
 '[=C]',
 '[C]',
 '[=C]',
 '[C]',
 '[=C]',
 '[Ring1]',
 '[Branch1_2]',
 '[C]',
 '[#C]',
 '[O]']
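The symbol list above shows that a SELFIES string is a sequence of bracketed tokens. As a rough illustration of that structure only, the splitting can be sketched without the library using a regular expression. The helper name `split_symbols` is hypothetical and is not part of the selfies package; this is not how `sf.split_selfies` is implemented, and the sketch ignores the `.` separator used for disconnected fragments.

```python
import re

# Hypothetical helper (not part of the selfies package): each SELFIES symbol
# is assumed to be a single bracketed token such as '[C]' or '[Branch1_2]'.
SYMBOL_PATTERN = re.compile(r"\[[^\[\]]*\]")


def split_symbols(selfies_string):
    """Return the bracketed symbols of a SELFIES string, in order."""
    return SYMBOL_PATTERN.findall(selfies_string)


encoded = '[N][C][C][C][=C][C][=C][C][=C][Ring1][Branch1_2][C][#C][O]'
symbols = split_symbols(encoded)

print(len(symbols))   # 14
print(symbols[:3])    # ['[N]', '[C]', '[C]']
```

For real work, `sf.split_selfies` and `sf.len_selfies` remain the better choice, since they follow the library's own definition of a symbol.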

In [None]:
import selfies as sf

benzene = "c1ccccc1"

# SMILES --> SELFIES translation
encoded_selfies = sf.encoder(benzene)  # '[C][=C][C][=C][C][=C][Ring1][Branch1_2]'

# SELFIES --> SMILES translation
decoded_smiles = sf.decoder(encoded_selfies)  # 'C1=CC=CC=C1'

len_benzene = sf.len_selfies(encoded_selfies)  # 8

symbols_benzene = list(sf.split_selfies(encoded_selfies))
# ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[Branch1_2]']

In [20]:
import selfies as sf

dataset = ['[C][O][C]', '[F][C][F]', '[O][=O]', '[C][C][O][C][C]']
alphabet = sf.get_alphabet_from_selfies(dataset)
alphabet.add('[nop]')  # '[nop]' is a special padding symbol
alphabet = list(sorted(alphabet))
print(alphabet)  # ['[=O]', '[C]', '[F]', '[O]', '[nop]']

pad_to_len = max(sf.len_selfies(s) for s in dataset)  # 5
symbol_to_idx = {s: i for i, s in enumerate(alphabet)}

# Label-encode a SELFIES string: one integer index per symbol, padded with '[nop]'
dimethyl_ether = dataset[0]  # '[C][O][C]'

# [1, 3, 1, 4, 4]
print(sf.selfies_to_encoding(dimethyl_ether,
                             vocab_stoi=symbol_to_idx,
                             pad_to_len=pad_to_len,
                             enc_type='label'))

# [[0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1]]
print(sf.selfies_to_encoding(dimethyl_ether,
                             vocab_stoi=symbol_to_idx,
                             pad_to_len=pad_to_len,
                             enc_type='one_hot'))

['[=O]', '[C]', '[F]', '[O]', '[nop]']
[1, 3, 1, 4, 4]
[[0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1]]
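To make the two encodings concrete, here is a minimal pure-Python sketch of what label and one-hot encoding with `[nop]` padding amount to, plus the reverse mapping from labels back to a SELFIES string. The function names `label_encode`, `one_hot_encode` and `label_decode` are hypothetical and are not the selfies API; the sketch assumes the input is already split into symbols and uses the same sorted alphabet as the cell above.

```python
def label_encode(symbols, symbol_to_idx, pad_to_len, pad_symbol='[nop]'):
    """Map each symbol to its integer index, right-padding with pad_symbol."""
    padded = symbols + [pad_symbol] * (pad_to_len - len(symbols))
    return [symbol_to_idx[s] for s in padded]


def one_hot_encode(labels, vocab_size):
    """Turn each integer label into a 0/1 row of length vocab_size."""
    return [[1 if i == label else 0 for i in range(vocab_size)]
            for label in labels]


def label_decode(labels, idx_to_symbol, pad_symbol='[nop]'):
    """Rebuild the SELFIES string from labels, dropping padding symbols."""
    return "".join(idx_to_symbol[i] for i in labels
                   if idx_to_symbol[i] != pad_symbol)


alphabet = ['[=O]', '[C]', '[F]', '[O]', '[nop]']
symbol_to_idx = {s: i for i, s in enumerate(alphabet)}
idx_to_symbol = {i: s for s, i in symbol_to_idx.items()}

# Symbols of '[C][O][C]', padded to length 5
labels = label_encode(['[C]', '[O]', '[C]'], symbol_to_idx, pad_to_len=5)
one_hot = one_hot_encode(labels, len(alphabet))
restored = label_decode(labels, idx_to_symbol)

print(labels)    # [1, 3, 1, 4, 4]
print(restored)  # [C][O][C]
```

Padding every string to the same length is what lets a whole dataset be stacked into a single fixed-shape array for a model; the decode step simply discards the padding again.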
