# Generating Molecules

Here we focus on the final step of our deep learning efforts: generating molecules with the models we built. We will show the sampling interface for the different models and explore the generated chemistry through visualizations. The `utils` package defines the `smilesToGrid` function, which displays molecules as an interactive grid. Make sure the `mols2grid` package is installed first:

In [1]:
!pip install mols2grid



In [2]:
from utils import smilesToGrid

smilesToGrid(["CCO", "c1ccccc1N"])



## Graph-Based Generators

### Pretrained Model

We will start with the model pretrained on ChEMBL 27, which is already available in the tutorial data:

In [3]:
from drugex.training.models.transform import GraphModel
from drugex.data.corpus.vocabulary import VocGraph

vocabulary = VocGraph.fromFile('jupyter/models/pretrained/graph/chembl27/chembl27_graph_voc.txt')
pretrained = GraphModel(voc_trg=vocabulary, use_gpus=(6,7))
pretrained.loadStatesFromFile('jupyter/models/pretrained/graph/chembl27/chembl27_graph.pkg')

The model has a convenience method (`sampleFromSmiles`) that takes SMILES strings as input, fragments them, and generates the given number of molecules from the extracted fragments:

In [4]:
inputs = [
    "c1ccncc1CC2CC2", # pyridine ring and cyclopropane
]

smiles, frags = pretrained.sampleFromSmiles(inputs, min_samples=30)

Standardizing molecules (batch processing): 100%|██████████| 1/1 [00:00<00:00, 25.18it/s]
The data set file does not exist: /tmp/tmp4juey0zn. This data set is empty. You can add data by calling it.
Creating fragment-molecule pairs (batch processing): 100%|██████████| 1/1 [00:00<00:00, 23.95it/s]
Encoding fragment-molecule pairs. (batch processing): 100%|██████████| 1/1 [00:00<00:00, 15.69it/s]


Let's take a look at the fragments that were created:

In [5]:
set(frags)

{'C1CC1', 'C1CC1.c1ccncc1', 'c1ccncc1'}

We have therefore generated at least 30 molecules (`min_samples=30`), each of which should contain the pyridine ring, the cyclopropane ring, or both:

In [6]:
smilesToGrid(smiles)

And here are the incorporated fragments for each molecule above:

In [7]:
smilesToGrid(frags)
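
To get a quick numerical overview alongside the grids, the fragment inputs can also be tallied. The following is a minimal sketch in plain Python; the `frags` list here is a hypothetical stand-in for the list returned by `sampleFromSmiles`, and it relies only on the fact that `.` separates disconnected components in a SMILES string:

```python
from collections import Counter

# Hypothetical stand-in for the `frags` list returned by sampleFromSmiles
frags = [
    "c1ccncc1",
    "C1CC1",
    "C1CC1.c1ccncc1",
    "c1ccncc1",
    "C1CC1.c1ccncc1",
    "c1ccncc1",
]

# How often each fragment combination was used as input
combo_counts = Counter(frags)

# How often each individual fragment occurs across all combinations
fragment_counts = Counter(part for combo in frags for part in combo.split("."))

for combo, n in combo_counts.most_common():
    print(f"{combo}: {n}")
```

With the real `frags` list, `combo_counts` shows whether sampling was spread evenly across the available fragment combinations.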

We can now also check how this model performs in comparison to the model we trained from scratch in [pretraining](pretraining.ipynb). We just need to load its states:

In [8]:
# vocabulary = VocGraph.fromFile('data/models/pretrained/chembl_pretrained_fit.log.vocab')
pretrained = GraphModel(voc_trg=vocabulary, use_gpus=(6,7))
pretrained.loadStatesFromFile('data/models/pretrained/chembl_pretrained.pkg')

In [9]:
inputs = [
    "c1ccncc1CC2CC2", # pyridine ring and cyclopropane
]

smiles, frags = pretrained.sampleFromSmiles(inputs, min_samples=30)

Standardizing molecules (batch processing): 100%|██████████| 1/1 [00:00<00:00, 21.03it/s]
The data set file does not exist: /tmp/tmpsxeybgna. This data set is empty. You can add data by calling it.
Creating fragment-molecule pairs (batch processing): 100%|██████████| 1/1 [00:00<00:00, 19.21it/s]
Encoding fragment-molecule pairs. (batch processing): 100%|██████████| 1/1 [00:00<00:00, 14.42it/s]


In [10]:
smilesToGrid(smiles)

In [11]:
smilesToGrid(frags)

Depending on how long you trained the model in [pretraining](pretraining.ipynb), you will see different degrees of craziness among the structures, but the model should generally produce reasonable results after 60 epochs if given at least 1 million molecules as examples (tested on a random sample of ChEMBL 30).

### Finetuned Model

Now let's take a look at the finetuned model. This is the one we trained in [this tutorial](./fintuning.ipynb). It is unlikely that we will see any difference without sufficient training, but we will include it for the sake of completeness:

In [12]:
from drugex.training.models.transform import GraphModel
from drugex.data.corpus.vocabulary import VocGraph

finetuned = GraphModel(
    voc_trg=VocGraph.fromFile('data/models/finetuned/graph/ligand_finetuned.vocab'),
    use_gpus=(6,7)
)
finetuned.loadStatesFromFile('data/models/finetuned/graph/chembl_ligand.pkg')

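We will give the model the same input as above, plus a second molecule, `CC2CC2`. As the log shows, this second molecule is skipped because only one fragment could be retrieved from it: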

In [13]:
inputs = [
    "c1ccncc1CC2CC2",
    "CC2CC2",
]

smiles, frags = finetuned.sampleFromSmiles(inputs, min_samples=30)
set(frags)

Standardizing molecules (batch processing): 100%|██████████| 1/1 [00:00<00:00, 17.91it/s]
The data set file does not exist: /tmp/tmpemj9c5kh. This data set is empty. You can add data by calling it.
Creating fragment-molecule pairs (batch processing):   0%|          | 0/1 [00:00<?, ?it/s]Only one retrieved fragment for molecule: CC1CC1. Skipping...
Creating fragment-molecule pairs (batch processing): 100%|██████████| 1/1 [00:00<00:00, 14.98it/s]
Encoding fragment-molecule pairs. (batch processing): 100%|██████████| 1/1 [00:00<00:00, 12.57it/s]


{'C1CC1', 'C1CC1.c1ccncc1', 'c1ccncc1'}

In [14]:
smilesToGrid(smiles)

You can see that our fragments were again incorporated into the structures. If the model was properly finetuned, you should also see some structural patterns from the finetuning data appearing.

You can also use the original validation set (or any other `GraphFragDataSet` you previously created) to sample with the fragments already encoded within:

In [15]:
from drugex.data.datasets import GraphFragDataSet

ds = GraphFragDataSet('data/sets/graph/ligand_test.tsv')

smiles, frags = finetuned.sample(ds.asDataLoader(128))
len(smiles)

292

In [16]:
smilesToGrid(smiles)

In [17]:
smilesToGrid(frags)

### Agent from Reinforcement Learning

The only type of model left to show here for the graph-based transformer is the optimized agent from reinforcement learning that was trained in the [previous tutorial](rl_optimization.ipynb). Since it is a standard `GraphModel` with adjusted weights, it can be loaded and used just like the models before:

In [18]:
from drugex.training.models.transform import GraphModel
from drugex.data.corpus.vocabulary import VocGraph

reinforced = GraphModel(voc_trg=VocGraph.fromFile('data/models/reinforced/graph/agent.vocab'), use_gpus=(6,7))
reinforced.loadStatesFromFile('data/models/reinforced/graph/agent.pkg')

First, we will try to sample using our own fragment definitions from above:

In [19]:
inputs = [
    "c1ccncc1CC2CC2",
    "CC2CC2",
]

smiles, frags = reinforced.sampleFromSmiles(inputs, min_samples=100)
set(frags)

Standardizing molecules (batch processing): 100%|██████████| 1/1 [00:00<00:00, 19.85it/s]
The data set file does not exist: /tmp/tmpzofzyz3a. This data set is empty. You can add data by calling it.
Creating fragment-molecule pairs (batch processing):   0%|          | 0/1 [00:00<?, ?it/s]Only one retrieved fragment for molecule: CC1CC1. Skipping...
Creating fragment-molecule pairs (batch processing): 100%|██████████| 1/1 [00:00<00:00, 17.04it/s]
Encoding fragment-molecule pairs. (batch processing): 100%|██████████| 1/1 [00:00<00:00, 14.25it/s]


{'C1CC1', 'C1CC1.c1ccncc1', 'c1ccncc1'}

In [20]:
smilesToGrid(smiles)

Next, we try with the test set as well:

In [21]:
from drugex.data.datasets import GraphFragDataSet

ds = GraphFragDataSet('data/sets/graph/ligand_test.tsv')

smiles, frags = reinforced.sample(ds.asDataLoader(128))
len(smiles)

292

In [22]:
smilesToGrid(smiles)

Remember that our model was not fully trained, so many of these structures will not look very interesting. With more data and a properly optimized model, the fraction of reasonable structures should reflect the desirability ratio reported during training (see [rl_optimization.ipynb](rl_optimization.ipynb)). To check desirability and other metrics for these molecules, we can reconstruct the environment from the previous notebook and score the generated molecules using the pickled scorers:

In [23]:
import pickle
from drugex.training.interfaces import Scorer
from drugex.training.scorers.modifiers import ClippedScore
from drugex.training.scorers.predictors import Predictor

class ModelScorer(Scorer):
    
    def __init__(self, model, prefix):
        super().__init__()
        self.model = model
        self.prefix = prefix
    
    def getScores(self, mols, frags=None):
        X = Predictor.calc_physchem(mols)
        return self.model.predict(X)
    
    def getKey(self):
        return f"{self.prefix}_{type(self.model)}"

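# pickle needs the ModelScorer class defined above to restore the scorers
with open("data/models/reinforced/graph/scorer_a1.pkg", 'rb') as f:
    scorer_a1 = pickle.load(f)
with open("data/models/reinforced/graph/scorer_a3.pkg", 'rb') as f:
    scorer_a3 = pickle.load(f)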

from drugex.training.scorers.properties import Property

qed = Property(
    "QED",
    modifier=ClippedScore(lower_x=0, upper_x=1.0)
)

sascore = Property(
    "SA",
    modifier=ClippedScore(lower_x=4.5, upper_x=0)
)
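
The `ClippedScore` modifiers above rescale raw property values to the range from 0 to 1. As a rough illustration, the sketch below assumes the modifier is a linear ramp between `lower_x` and `upper_x` that is clipped outside that interval. The function `clipped_score` is a hypothetical stand-in written for this example, not the DrugEx implementation. Note how setting `lower_x` above `upper_x`, as is done for the SA score, inverts the ramp so that lower raw values score higher:

```python
def clipped_score(x, lower_x, upper_x, low_score=0.0, high_score=1.0):
    """Hypothetical linear modifier: low_score at lower_x, high_score at upper_x."""
    # Position of x on the ramp (0 at lower_x, 1 at upper_x)
    fraction = (x - lower_x) / (upper_x - lower_x)
    y = low_score + fraction * (high_score - low_score)
    # Clip values that fall outside the ramp
    return min(high_score, max(low_score, y))


# QED-like setting: the raw value is already in [0, 1] and passes through
print(clipped_score(0.5, lower_x=0, upper_x=1.0))

# SA-like setting: the ramp is inverted, so a raw value of 4.5 or more maps to 0
for raw_sa in (0.0, 2.25, 4.5, 6.0):
    print(raw_sa, clipped_score(raw_sa, lower_x=4.5, upper_x=0))
```

Under this assumption, a modified `SA` value of 0 means the raw synthetic accessibility score was 4.5 or higher.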

In [24]:
from drugex.training.environment import DrugExEnvironment
from drugex.training.rewards import ParetoCrowdingDistance

scorers = [
    scorer_a1,
    scorer_a3,
    qed,
    sascore
]
thresholds = [
    0.99,
    0.99,
    0.0,
    0.0
]

environment = DrugExEnvironment(scorers, thresholds, reward_scheme=ParetoCrowdingDistance())

In [25]:
df_scores = environment.getScores(smiles)
df_scores.head()

| | `A1_<class 'sklearn.ensemble._forest.RandomForestRegressor'>` | `A3_<class 'sklearn.ensemble._forest.RandomForestRegressor'>` | QED | SA | DESIRE | VALID |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.736371 | 0.410709 | 0.671622 | 0 | 1 |
| 1 | 1.0 | 1.0 | 0.746396 | 0.222007 | 1 | 1 |
| 2 | 0.833679 | 1.0 | 0.347574 | 0.0 | 0 | 1 |
| 3 | 1.0 | 0.939432 | 0.798488 | 0.041307 | 0 | 1 |
| 4 | 1.0 | 0.984794 | 0.241281 | 0.215869 | 0 | 1 |


Because sampling is random, we should get a desirability ratio close to the one we obtained at the end of reinforcement learning:

In [26]:
df_scores.DESIRE.sum() / len(df_scores.DESIRE)

0.2328767123287671
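
The `DESIRE` flag can be related to the thresholds passed to the environment. The sketch below is not the DrugEx code. It assumes that a molecule is flagged as desired when it is valid and every modified score reaches its threshold, which is consistent with the five rows shown by `df_scores.head()`. The short keys `A1` and `A3` are abbreviations of the longer column names, introduced here only for readability:

```python
# The five rows shown by df_scores.head(), with shortened column names
rows = [
    {"A1": 1.0, "A3": 0.736371, "QED": 0.410709, "SA": 0.671622, "VALID": 1},
    {"A1": 1.0, "A3": 1.0, "QED": 0.746396, "SA": 0.222007, "VALID": 1},
    {"A1": 0.833679, "A3": 1.0, "QED": 0.347574, "SA": 0.0, "VALID": 1},
    {"A1": 1.0, "A3": 0.939432, "QED": 0.798488, "SA": 0.041307, "VALID": 1},
    {"A1": 1.0, "A3": 0.984794, "QED": 0.241281, "SA": 0.215869, "VALID": 1},
]

# Same thresholds as passed to the environment
thresholds = {"A1": 0.99, "A3": 0.99, "QED": 0.0, "SA": 0.0}


def is_desired(row):
    """Assumed rule: valid and every score at or above its threshold."""
    meets_all = all(row[key] >= limit for key, limit in thresholds.items())
    return int(row["VALID"] == 1 and meets_all)


flags = [is_desired(row) for row in rows]
ratio = sum(flags) / len(flags)
print(flags, ratio)
```

Only the second row passes both activity thresholds of 0.99, which matches the `DESIRE` column in the table.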

In absolute terms, this is the number of generated molecules that we deem desirable:

In [27]:
print(df_scores.DESIRE.sum())

68


Now we can look at the desired molecules themselves and also remove those with an insufficient SAScore, that is, a modified `SA` value of 0:

In [28]:
df_scores['SMILES'] = smiles
df_filtered = df_scores[(df_scores.DESIRE == 1) & (df_scores.SA > 0)]
len(df_filtered)

66

In [29]:
smilesToGrid(df_filtered.SMILES)

These molecules look more viable, with fewer overly complex or unstable patterns.