# Embeddings for Molecules

- Use a `SeqToSeq` model to generate fingerprints for classifying molecules.

- Reference paper: [Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery](https://doi.org/10.1145/3107411.3107424).


# Imports

In [1]:
!pip install --pre deepchem
import deepchem

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting deepchem
  Downloading deepchem-2.6.1-py3-none-any.whl (608 kB)
Collecting rdkit-pypi
  Downloading rdkit_pypi-2022.3.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.8 MB)
Installing collected packages: rdkit-pypi, deepchem
Successfully installed deepchem-2.6.1 rdkit-pypi-2022.3.5


In [4]:
import deepchem as dc
from deepchem.models.optimizers import Adam, ExponentialDecay

# Learning Embeddings with SeqToSeq

- Many types of models require their inputs to have a fixed shape, but molecules vary widely in their numbers of atoms and bonds.
- We therefore need a way of generating a fixed-length "fingerprint" for each molecule.
 - Earlier tutorials used Extended-Connectivity Fingerprints (ECFPs) for this purpose.
 - Here, instead of designing a fingerprint by hand, we let a `SeqToSeq` model learn its own method of creating fingerprints.

- A `SeqToSeq` model performs sequence-to-sequence translation and consists of an encoder and a decoder.
 - It is often used to translate text from one language to another.
 - The encoder is a stack of recurrent layers that reads the input sequence and generates a fixed-length vector called the "embedding vector".
 - The decoder is another stack of recurrent layers that performs the inverse operation: it takes the embedding vector as input and generates the output sequence.
 - By training it on appropriately chosen input/output pairs, you can create a model that performs many sorts of transformations.
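
To make the "fixed-length vector" idea concrete, here is a toy recurrent encoder in plain Python. It is only a sketch: the weights are random and untrained, and the names `make_toy_encoder`, `hidden_size`, and the seven-character alphabet are hypothetical choices for illustration, not part of DeepChem's `SeqToSeq`. The point it illustrates is that a recurrent update folds a sequence of any length into a state vector of constant size.

```python
import math
import random

def make_toy_encoder(alphabet, hidden_size=8, seed=0):
    """Build a toy recurrent encoder with fixed random (untrained) weights."""
    rng = random.Random(seed)
    # One input vector per token, plus a hidden-to-hidden weight matrix.
    input_weights = {tok: [rng.uniform(-1, 1) for _ in range(hidden_size)]
                     for tok in alphabet}
    recurrent_weights = [[rng.uniform(-1, 1) / hidden_size
                          for _ in range(hidden_size)]
                         for _ in range(hidden_size)]

    def encode(sequence):
        state = [0.0] * hidden_size
        for tok in sequence:
            x = input_weights[tok]
            # Recurrent update: new_state = tanh(W @ state + x)
            state = [math.tanh(sum(w * s for w, s in zip(row, state)) + x_i)
                     for row, x_i in zip(recurrent_weights, x)]
        return state  # final state acts as the "embedding vector"

    return encode

encode = make_toy_encoder(alphabet="CNO()=1", hidden_size=8)
short_vec = encode("CCO")           # 3 tokens
long_vec = encode("CC(=O)NC1CC1")   # 12 tokens
print(len(short_vec), len(long_vec))  # both lengths equal hidden_size
```

A trained model differs in that the weights are learned so that a decoder can reconstruct the sequence from the final state; the shape argument stays the same.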

- We will use SMILES strings describing molecules as the input sequences.  We will train the model as an autoencoder, so it tries to make the output sequences identical to the input sequences.  
 - Because the decoder sees only the embedding vector, the encoder must create embedding vectors that contain all information from the original sequence.  
 - That's exactly what we want in a fingerprint, so those embedding vectors may also be useful as a way to represent molecules in other models.

- We use the MUV dataset.  It includes 74,501 molecules in the training set and 9,313 molecules in the validation set, so it gives us plenty of SMILES strings to work with.

In [2]:
tasks, datasets, transformers = dc.molnet.load_muv(splitter='stratified')
train_dataset, valid_dataset, test_dataset = datasets
train_smiles = train_dataset.ids
valid_smiles = valid_dataset.ids

- We need to define the "alphabet" for our `SeqToSeq` model, that is, the list of all tokens that can appear in sequences.
- To build it, we make a list of every character that appears in any training sequence.

In [5]:
tokens = set()
for s in train_smiles:
  tokens.update(s)  # add every character of this SMILES string
tokens = sorted(tokens)

In [7]:
tokens[:10]

['#', '(', ')', '+', '-', '/', '1', '2', '3', '4']
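
As an illustration of what a character alphabet is for, the sketch below maps each character of a SMILES string to an integer id and right-pads to a fixed length. This is an assumed, simplified scheme: the helper names `build_vocab` and `encode_smiles`, the 1-based ids, and the `pad_id=0` convention are hypothetical, and the encoding used inside `SeqToSeq` may differ.

```python
def build_vocab(smiles_list):
    """Collect the sorted set of characters appearing in any string."""
    return sorted(set("".join(smiles_list)))

def encode_smiles(smiles, vocab, max_length, pad_id=0):
    """Map characters to 1-based ids and right-pad with pad_id (illustrative scheme)."""
    index = {tok: i + 1 for i, tok in enumerate(vocab)}
    ids = [index[c] for c in smiles]
    return ids + [pad_id] * (max_length - len(ids))

toy_smiles = ["CCO", "C#N", "CC(C)O"]  # toy inputs, not taken from MUV
vocab = build_vocab(toy_smiles)
encoded = [encode_smiles(s, vocab, max_length=6) for s in toy_smiles]
for s, ids in zip(toy_smiles, encoded):
    print(s, ids)
```

Note that a per-character alphabet splits two-letter element symbols such as `Cl` and `Br` into separate tokens, which is also what the tokenization in this notebook does.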

In [8]:
len(tasks)

17

- Use `ExponentialDecay` to multiply the learning rate by 0.9 after each epoch.

In [9]:
max_length = max(len(s) for s in train_smiles)  # longest SMILES string in the training set
batch_size = 100
batches_per_epoch = len(train_smiles)/batch_size  # decay interval: one learning-rate step per epoch
model = dc.models.SeqToSeq(tokens,
                           tokens,
                           max_length,
                           encoder_layers=2,
                           decoder_layers=2,
                           embedding_dimension=256,
                           model_dir='fingerprint',
                           batch_size=batch_size,
                           learning_rate=ExponentialDecay(0.001, 0.9, batches_per_epoch))
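
The schedule configured above can be sketched in plain Python. This is an illustration under assumptions, not DeepChem's implementation: it assumes staircase behaviour (the rate drops in discrete steps once every `decay_steps` batches), and the function name `decayed_learning_rate` is hypothetical. The training-set size of 74,501 is the figure quoted earlier for MUV.

```python
import math

def decayed_learning_rate(step, initial_rate=0.001, decay_rate=0.9,
                          decay_steps=745.01, staircase=True):
    """Exponentially decayed learning rate after `step` training batches."""
    exponent = step / decay_steps
    if staircase:
        # Decay in discrete jumps, once per completed interval.
        exponent = math.floor(exponent)
    return initial_rate * decay_rate ** exponent

n_train = 74501      # training-set size quoted above
batch_size = 100
batches_per_epoch = n_train / batch_size

for epoch in (0, 1, 10, 40):
    # Use the first batch index that lies past the end of the given epoch.
    step = math.ceil(epoch * batches_per_epoch)
    print(epoch, decayed_learning_rate(step, decay_steps=batches_per_epoch))
```

With a factor of 0.9 per epoch, the rate after 40 epochs is `0.001 * 0.9**40`, roughly 1.5% of its starting value.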

- The input to `fit_sequences()` is a generator that produces input/output pairs.  

In [12]:
def generate_sequences(epochs):
  for i in range(epochs):
    for s in train_smiles:
      yield (s, s)

model.fit_sequences(generate_sequences(40)) # train for 40 epochs

- To check how well the trained autoencoder works, we'll run the first 500 molecules from the validation set through it and see how many of them are exactly reproduced.

In [13]:
predicted = model.predict_from_sequences(valid_smiles[:500])
count = 0
for s,p in zip(valid_smiles[:500], predicted):
  if ''.join(p) == s:
    count += 1
print('reproduced', count, 'of 500 validation SMILES strings')

reproduced 165 of 500 validation SMILES strings
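
The count above corresponds to an exact-match rate of 165/500 = 33%. The same check can be wrapped in a small helper, sketched here with toy data; the function name `exact_match_rate` is hypothetical, and predictions are assumed to be lists of tokens, as in the cell above where they are joined with `''.join(p)`.

```python
def exact_match_rate(originals, predictions):
    """Fraction of sequences whose decoded tokens reproduce the input exactly."""
    matches = sum(1 for s, p in zip(originals, predictions)
                  if "".join(p) == s)
    return matches / len(originals)

# Toy data: the second prediction has one wrong character.
toy_inputs = ["CCO", "C#N", "CC(C)O"]
toy_predictions = [["C", "C", "O"], ["C", "=", "N"], ["C", "C", "(", "C", ")", "O"]]
rate = exact_match_rate(toy_inputs, toy_predictions)
print(rate)
```

Exact match is a strict criterion: a prediction that differs from the input by a single character counts as a failure.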


- Now we'll try using the encoder as a way to generate molecular fingerprints.  
- We compute the embedding vectors for all molecules in the training and validation datasets, and create new datasets that have those as their feature vectors.  
- The amount of data is small enough that we can just store everything in memory.

In [14]:
import numpy as np
train_embeddings = model.predict_embeddings(train_smiles)
train_embeddings_dataset = dc.data.NumpyDataset(train_embeddings,
                                                train_dataset.y,
                                                train_dataset.w.astype(np.float32),
                                                train_dataset.ids)

valid_embeddings = model.predict_embeddings(valid_smiles)
valid_embeddings_dataset = dc.data.NumpyDataset(valid_embeddings,
                                                valid_dataset.y,
                                                valid_dataset.w.astype(np.float32),
                                                valid_dataset.ids)

For classification, we'll use a simple fully connected network with one hidden layer.

In [15]:
classifier = dc.models.MultitaskClassifier(n_tasks=len(tasks),
                                           n_features=256,
                                           layer_sizes=[512])
classifier.fit(train_embeddings_dataset, nb_epoch=10)

0.10887121200561524

Find out how well it worked.  Compute the ROC AUC for the training and validation datasets.

In [16]:
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean, mode="classification")
train_score = classifier.evaluate(train_embeddings_dataset, [metric], transformers)
valid_score = classifier.evaluate(valid_embeddings_dataset, [metric], transformers)
print('Training set ROC AUC:', train_score)
print('Validation set ROC AUC:', valid_score)

Training set ROC AUC: {'mean-roc_auc_score': 0.9815878005141317}
Validation set ROC AUC: {'mean-roc_auc_score': 0.7945280315147936}
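
For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen active molecule receives a higher score than a randomly chosen inactive one. The sketch below computes it directly from that definition and averages over tasks, mirroring the `np.mean` aggregation used above. It is an unweighted illustration with hypothetical function names and toy data; the scores reported in this notebook come from DeepChem's `evaluate`, which may additionally account for sample weights.

```python
def roc_auc(labels, scores):
    """P(score of random positive > score of random negative); ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def mean_roc_auc(labels_per_task, scores_per_task):
    """Average the per-task ROC AUC values."""
    aucs = [roc_auc(y, s) for y, s in zip(labels_per_task, scores_per_task)]
    return sum(aucs) / len(aucs)

# Toy example with two tasks and four molecules each.
task_labels = [[0, 0, 1, 1], [1, 0, 1, 0]]
task_scores = [[0.1, 0.4, 0.35, 0.8], [0.9, 0.2, 0.7, 0.3]]
print(roc_auc(task_labels[0], task_scores[0]))
print(mean_roc_auc(task_labels, task_scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The gap between the training score (about 0.98) and the validation score (about 0.79) reported above is consistent with some overfitting of the classifier.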
