# Generating Molecules

Here we focus on the final step of our deep learning workflow: generating molecules with the models we built. We will show the interface shared by the different models and explore the generated chemistry through visualizations. The `utils` package defines the `smilesToGrid` function, which displays molecules as an interactive grid. Make sure you have the `mols2grid` package installed first:

In [8]:
!pip install mols2grid

Collecting mols2grid
  Downloading mols2grid-1.0.0-py2.py3-none-any.whl (100 kB)
Collecting jinja2>=2.11.0
  Using cached Jinja2-3.1.2-py3-none-any.whl (133 kB)
Collecting ipywidgets<8,>=7
  Downloading ipywidgets-7.7.2-py2.py3-none-any.whl (123 kB)
Collecting jupyterlab-widgets<3,>=1.0.0
  Downloading jupyterlab_widgets-1.1.1-py3-none-any.whl (245 kB)
Collecting ipython-genutils~=0.2.0
  Using cached ipython_genutils-0.2.0-py2.py3-none-any.whl (26 kB)
Collecting widgetsnbextension~=3.6.0
  Downloading widgetsnbextension-3.6.1-py2.py3-none-any.whl (1.6 MB)

In [9]:
from utils import smilesToGrid

smilesToGrid(["CCO", "c1ccccc1N"])

MolGridWidget()
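
If the import fails or the grid does not render, it can help to confirm that `mols2grid` is visible to the interpreter the notebook kernel is running. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of `utils` or DrugEx:

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found for import."""
    return importlib.util.find_spec(package_name) is not None


# Reports whether the current interpreter can locate mols2grid.
print(is_installed("mols2grid"))
```

If this prints `False` right after a successful `pip install`, the package was probably installed into a different environment than the one the kernel uses.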

## Graph-Based Generators

### Pretrained Model

We will start with the model pretrained on ChEMBL 27, which is already available in the tutorial data:

In [10]:
from drugex.training.models.transform import GraphModel
from drugex.data.corpus.vocabulary import VocGraph

GPUS = [0]

vocabulary = VocGraph.fromFile('jupyter/models/pretrained/graph/chembl27/chembl27_graph_voc.txt')
pretrained = GraphModel(voc_trg=vocabulary, use_gpus=GPUS)
pretrained.loadStatesFromFile('jupyter/models/pretrained/graph/chembl27/chembl27_graph.pkg')

The model has a convenience method (`sample_smiles`) that takes SMILES strings as input, fragments them, and generates the given number of molecules from the extracted fragments:

In [11]:
inputs = [
    "c1ccncc1CC2CC2", # pyridine ring and cyclopropane
]

smiles, frags = pretrained.sample_smiles(inputs, num_samples=30)

Standardizing molecules (batch processing): 100%|██████████| 1/1 [00:00<00:00, 13.89it/s]
Initialized empty dataset. The data set file does not exist (yet): /tmp/tmpeo_e9swv. You can add data by calling this instance with the appropriate parameters.
Creating fragment-molecule pairs (batch processing): 100%|██████████| 1/1 [00:00<00:00, 19.52it/s]
Encoding fragment-molecule pairs. (batch processing): 100%|██████████| 1/1 [00:00<00:00, 14.90it/s]
Generating molecules: 100%|██████████| 30/30 [00:01<00:00, 18.29it/s]


Let's take a look at the fragments that were created:

In [12]:
set(frags)

{'C1CC1', 'C1CC1.c1ccncc1', 'c1ccncc1'}

Each of the 30 generated molecules should therefore contain the pyridine ring, the cyclopropane ring, or both:

In [13]:
smilesToGrid(smiles)

MolGridWidget()

And here are the incorporated fragments for each molecule above:

In [14]:
smilesToGrid(frags)

MolGridWidget()
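
Because `frags` holds one entry per generated molecule, a quick tally shows how often each fragment combination was used as input. The sketch below uses a short hypothetical list in place of the real `frags` so that it runs without a model; the helper name `tally_fragments` is also an assumption made for illustration:

```python
from collections import Counter

# Hypothetical stand-in for the `frags` list returned by sampling.
example_frags = ["c1ccncc1", "C1CC1", "C1CC1.c1ccncc1", "c1ccncc1"]


def tally_fragments(frags):
    """Count how many generated molecules were built from each fragment input."""
    return Counter(frags)


counts = tally_fragments(example_frags)
for frag, n in counts.most_common():
    print(f"{frag}: {n}")
```

With the real output, `tally_fragments(frags)` would show whether sampling is spread evenly across the three fragment inputs.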

We can also compare this model with the one we trained from scratch in [pretraining](pretraining.ipynb). We only need to load its states:

In [9]:
# vocabulary = VocGraph.fromFile('data/models/pretrained/chembl_pretrained_fit.log.vocab')
pretrained = GraphModel(voc_trg=vocabulary, use_gpus=GPUS)
pretrained.loadStatesFromFile('data/models/pretrained/graph/chembl_pretrained.pkg')

In [10]:
inputs = [
    "c1ccncc1CC2CC2", # pyridine ring and cyclopropane
]

smiles, frags = pretrained.sample_smiles(inputs, num_samples=30)

Standardizing molecules (batch processing):   0%|          | 0/1 [00:00<?, ?it/s]

Initialized empty dataset. The data set file does not exist (yet): /tmp/tmpsvhg4vag. You can add data by calling this instance with the appropriate parameters.


Creating fragment-molecule pairs (batch processing):   0%|          | 0/1 [00:00<?, ?it/s]

Encoding fragment-molecule pairs. (batch processing):   0%|          | 0/1 [00:00<?, ?it/s]

Generating molecules:   0%|          | 0/30 [00:00<?, ?it/s]

In [11]:
smilesToGrid(smiles)

MolGridWidget()

In [12]:
smilesToGrid(frags)

MolGridWidget()

Depending on how long you trained the model in [pretraining](pretraining.ipynb), you will see different degrees of craziness among the structures, but the model should generally produce reasonable results after 60 epochs if given at least 1 million molecules as examples (tested on a random sample of ChEMBL 30).
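
Visual inspection can be complemented with a simple number. One rough option is the share of distinct SMILES strings in each sample. The sketch below assumes the strings are already canonical, since it compares text rather than structures; the function name and both example lists are hypothetical and are not real model output:

```python
def unique_fraction(smiles):
    """Return the share of distinct strings in a list of SMILES (0.0 if empty)."""
    if not smiles:
        return 0.0
    return len(set(smiles)) / len(smiles)


# Hypothetical stand-ins for the samples of two models; not real results.
from_chembl27 = ["CCO", "CCN", "CCO", "c1ccccc1"]
from_scratch = ["CCO", "CCO", "CCO", "CCN"]

print(unique_fraction(from_chembl27))
print(unique_fraction(from_scratch))
```

A low value suggests the model keeps repeating the same few structures, which can be a sign of insufficient training.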

### Finetuned Model

Now let's take a look at the finetuned model. This is the one we trained in [this tutorial](./fintuning.ipynb). It is unlikely that we will see any difference without sufficient training, but we will include it for the sake of completeness:

In [15]:
from drugex.training.models.transform import GraphModel
from drugex.data.corpus.vocabulary import VocGraph

finetuned = GraphModel(
    voc_trg=VocGraph.fromFile('jupyter/models/finetuned/graph/ligand_finetuned.vocab'),
    use_gpus=GPUS
)
finetuned.loadStatesFromFile('jupyter/models/finetuned/graph/chembl_ligand.pkg')

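We will give the model the same input molecule as above, plus methylcyclopropane (`CC2CC2`). As the log below shows, the latter is skipped because it yields only a single fragment: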

In [16]:
inputs = [
    "c1ccncc1CC2CC2",
    "CC2CC2",
]

smiles, frags = finetuned.sample_smiles(inputs, num_samples=30)
set(frags)

Standardizing molecules (batch processing): 100%|██████████| 1/1 [00:00<00:00, 20.38it/s]
Initialized empty dataset. The data set file does not exist (yet): /tmp/tmp8ob21zdy. You can add data by calling this instance with the appropriate parameters.
Creating fragment-molecule pairs (batch processing):   0%|          | 0/1 [00:00<?, ?it/s]Only one retrieved fragment for molecule: CC1CC1. Skipping...
Creating fragment-molecule pairs (batch processing): 100%|██████████| 1/1 [00:00<00:00, 17.89it/s]
Encoding fragment-molecule pairs. (batch processing): 100%|██████████| 1/1 [00:00<00:00, 14.70it/s]
Generating molecules: 100%|██████████| 30/30 [00:01<00:00, 16.18it/s]


{'C1CC1', 'C1CC1.c1ccncc1', 'c1ccncc1'}

In [17]:
smilesToGrid(smiles)

MolGridWidget()

You can see that our fragments were again incorporated into the structures. If the model was properly finetuned, you should also see structural patterns from the finetuning data appearing.

You can also use the original validation set (or any other `GraphFragDataSet` you previously created) to sample with the fragments already encoded within:

In [18]:
from drugex.data.datasets import GraphFragDataSet

ds = GraphFragDataSet('jupyter/data/sets/graph/ligand_test.tsv')

smiles, frags = finetuned.sample(ds.asDataLoader(128))
len(smiles)

209

In [19]:
smilesToGrid(smiles)

MolGridWidget()

In [20]:
smilesToGrid(frags)

MolGridWidget()
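
To keep the generated molecules together with their input fragments for later analysis, the two lists can be written out side by side. This is a minimal sketch using the standard `csv` module; the helper `write_pairs`, the column names, and the example molecules are hypothetical and not part of DrugEx:

```python
import csv
import io


def write_pairs(smiles, frags, handle):
    """Write one tab-separated row per generated molecule and its input fragments."""
    if len(smiles) != len(frags):
        raise ValueError("smiles and frags must have the same length")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["SMILES", "Frags"])  # hypothetical column names
    writer.writerows(zip(smiles, frags))


# Hypothetical stand-ins for the lists returned by sampling.
example_smiles = ["CC(C)c1ccncc1", "OCC1CC1"]
example_frags = ["c1ccncc1", "C1CC1"]

buffer = io.StringIO()
write_pairs(example_smiles, example_frags, buffer)
print(buffer.getvalue())
```

In practice you would pass a file opened with `open(path, "w", newline="")` instead of the in-memory buffer.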

### Agent from Reinforcement Learning

The only type of model left to show here for the graph-based transformer is the optimized agent from reinforcement learning that was trained in the [previous tutorial](rl_optimization.ipynb). Since it is a standard `GraphModel` with adjusted weights, it can be loaded and used just like the models before:

In [21]:
from drugex.training.models.transform import GraphModel
from drugex.data.corpus.vocabulary import VocGraph

reinforced = GraphModel(voc_trg=VocGraph.fromFile('jupyter/models/reinforced/graph/agent.vocab'), use_gpus=GPUS)
reinforced.loadStatesFromFile('jupyter/models/reinforced/graph/agent.pkg')

First, we will try to sample using our own fragment definitions from above:

In [22]:
inputs = [
    "c1ccncc1CC2CC2",
    "CC2CC2",
]

smiles, frags = reinforced.sample_smiles(inputs, num_samples=100)
set(frags)

Standardizing molecules (batch processing): 100%|██████████| 1/1 [00:00<00:00, 20.48it/s]
Initialized empty dataset. The data set file does not exist (yet): /tmp/tmpll1qzr9o. You can add data by calling this instance with the appropriate parameters.
Creating fragment-molecule pairs (batch processing):   0%|          | 0/1 [00:00<?, ?it/s]Only one retrieved fragment for molecule: CC1CC1. Skipping...
Creating fragment-molecule pairs (batch processing): 100%|██████████| 1/1 [00:00<00:00, 17.40it/s]
Encoding fragment-molecule pairs. (batch processing): 100%|██████████| 1/1 [00:00<00:00, 14.57it/s]
Generating molecules: 100%|██████████| 100/100 [00:06<00:00, 15.92it/s]


{'C1CC1', 'C1CC1.c1ccncc1', 'c1ccncc1'}

In [23]:
smilesToGrid(smiles)

MolGridWidget()

Next, we try with the test set as well:

In [24]:
from drugex.data.datasets import GraphFragDataSet

ds = GraphFragDataSet('jupyter/data/sets/graph/ligand_test.tsv')

smiles, frags = reinforced.sample(ds.asDataLoader(128))
len(smiles)

209

In [25]:
smilesToGrid(smiles)

MolGridWidget()

Remember that our model was not fully trained, so many of these structures will not look very interesting. If we used more data and optimized the model properly, the ratio of reasonable structures should reflect the desirability ratio reported during training (see [rl_optimization.ipynb](rl_optimization.ipynb)).
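
To make the idea of such a ratio concrete, the sketch below assumes that a molecule counts as desirable when a single score reaches a threshold. The scores, the threshold of 0.5, and the function name are all hypothetical; the scoring setup used during training may define desirability differently (for example, over several objectives at once):

```python
def desirable_fraction(scores, threshold=0.5):
    """Return the share of scores at or above the threshold (0.0 if empty)."""
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)


# Hypothetical scores for four sampled molecules; not real results.
example_scores = [0.9, 0.2, 0.7, 0.4]

ratio = desirable_fraction(example_scores)
print(f"{ratio:.0%} of the sampled molecules meet the threshold")
```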