# The Basic Tools of the Deep Life Sciences

## Predicting the solubility of small molecules
- goal: predict the solubility of a small molecule given its chemical formula. Solubility is a very important property in drug development
- the first thing we need is a data set of measured solubilities for real molecules
- MoleculeNet is a diverse collection of chemical and molecular data sets
 - we use its Delaney solubility data set
 - the target is log(solubility), where solubility is measured in moles/liter
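
To make the log(solubility) unit concrete, here is a minimal sketch of converting a log value back to a concentration. It assumes the logarithm is base 10, which is the usual convention for this kind of data; the function names, the example value and the molar mass are hypothetical and not taken from the data set.

```python
def log_solubility_to_molar(log_s: float) -> float:
    """Convert log10(solubility) back to solubility in mol/L (assumes base 10)."""
    return 10.0 ** log_s


def molar_to_grams_per_liter(molar: float, molar_mass: float) -> float:
    """Convert mol/L to g/L given a molar mass in g/mol."""
    return molar * molar_mass


# Hypothetical example values, not taken from the data set.
log_s = -2.0  # log10 of the solubility in mol/L
molar = log_solubility_to_molar(log_s)
grams = molar_to_grams_per_liter(molar, molar_mass=180.0)
print(f"{molar:.4f} mol/L, {grams:.2f} g/L")
```

Each step of 1 on the log scale is a factor of 10 in concentration, so small-looking prediction errors can still be large in absolute terms.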

In [1]:
!pip install --pre deepchem[tensorflow]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting deepchem[tensorflow]
  Downloading deepchem-2.6.1-py3-none-any.whl (608 kB)
Collecting rdkit-pypi
  Downloading rdkit_pypi-2022.3.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.8 MB)
Collecting tensorflow-addons
  Downloading tensorflow_addons-0.17.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Installing collected packages: rdkit-pypi, tensorflow-addons, deepchem
Successfully installed deepchem-2.6.1 rdkit-pypi-2022.3.5 tensorflow-addons-0.17.1


In [2]:
import deepchem as dc
dc.__version__

'2.6.1'

# Training a Model with DeepChem

1. Select the data set you will train your model on (or create a new data set if there isn't an existing suitable one).
2. Create the model.
3. Train the model on the data.
4. Evaluate the model on an independent test set to see how well it works.
5. Use the model to make predictions about new data.
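
The five steps are not specific to DeepChem. As a rough illustration, the sketch below walks through them with a toy one-variable linear model in plain Python. The synthetic data, the `fit_line` helper and the 80/20 split are all assumptions made for this example; they are not part of DeepChem's API.

```python
import random

# Step 1: a synthetic data set (hypothetical): y is roughly 2*x - 1 plus noise.
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
data = [(x, 2.0 * x - 1.0 + rng.gauss(0.0, 0.1)) for x in xs]
rng.shuffle(data)
train, test = data[:80], data[80:]  # hold out an independent test set


# Steps 2 and 3: "create" and "train" a linear model using the
# closed-form least-squares solution for slope and intercept.
def fit_line(points):
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(train)


def predict(x):
    return slope * x + intercept


# Step 4: evaluate on the held-out test set (mean squared error).
test_mse = sum((predict(x) - y) ** 2 for x, y in test) / len(test)

# Step 5: predict for an input that was not in the data set.
new_prediction = predict(12.5)
print(f"slope={slope:.2f} intercept={intercept:.2f} test MSE={test_mse:.4f}")
```

The important design point is that the test points never reach `fit_line`, so the test error estimates how the model behaves on data it has not seen.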



In [3]:
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets

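- `featurizer='GraphConv'`
 - tells the loader how to "featurize" the data, i.e. how to convert each molecule into the numerical input the model expects
 - here we ask for the format used by a graph convolutional network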
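
To give a feel for what a graph-style representation looks like, here is a hand-built toy for one small molecule. This is not DeepChem's featurizer, which computes much richer per-atom features for you; the element list, the one-hot features and the neighbour-sum update are simplifying assumptions chosen only to illustrate the idea.

```python
# Hand-written graph for propylene oxide, SMILES "CC1CO1".
# Atom indices follow the order of the atoms in the SMILES string.
atoms = ["C", "C", "C", "O"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 1)]  # (3, 1) closes the three-membered ring

# Toy node features: a one-hot encoding over a tiny, hypothetical element list.
ELEMENTS = ["C", "N", "O"]
node_features = [[1.0 if a == e else 0.0 for e in ELEMENTS] for a in atoms]

# Adjacency list: for each atom, the indices of its bonded neighbours.
neighbors = [[] for _ in atoms]
for i, j in bonds:
    neighbors[i].append(j)
    neighbors[j].append(i)

# One simplified "convolution" step: each atom's new feature vector is its own
# features plus the sum of its neighbours' features.
updated = []
for i, own in enumerate(node_features):
    total = list(own)
    for j in neighbors[i]:
        total = [t + f for t, f in zip(total, node_features[j])]
    updated.append(total)

for symbol, nbrs, feats in zip(atoms, neighbors, updated):
    print(symbol, nbrs, feats)
```

After one such step every atom's vector already summarizes its immediate chemical environment, which is the intuition behind stacking several graph convolution layers.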

In [4]:
model = dc.models.GraphConvModel(n_tasks=1, mode='regression', dropout=0.2)

In [5]:
model.fit(train_dataset, nb_epoch=20)  # use 100

  "shape. This may consume a large amount of memory." % value)

0.3162474822998047

- to evaluate the model, select an evaluation metric and call `evaluate()` on the model
- let's use the squared Pearson correlation coefficient, r<sup>2</sup>, as our metric

In [6]:
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print("Training set score:", model.evaluate(train_dataset, [metric], transformers))
print("Test set score:", model.evaluate(test_dataset, [metric], transformers))

Training set score: {'pearson_r2_score': 0.6525287725587504}
Test set score: {'pearson_r2_score': 0.4223258310293701}


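- a model that produced totally random outputs would have an r<sup>2</sup> close to 0, while one that made perfect predictions would have an r<sup>2</sup> of 1
- here the test score (0.42) is well below the training score (0.65), so the model fits the training data better than it generalizes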
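
For reference, the metric can be written out by hand. The sketch below is a plain-Python version of the squared Pearson correlation; the helper name `pearson_r2` and the sample values are made up for illustration, and it is meant as an explanation of the formula rather than a copy of DeepChem's implementation.

```python
import math


def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r = cov / math.sqrt(var_t * var_p)
    return r * r


y = [-3.0, -1.5, 0.0, 0.5, 2.0]  # hypothetical measured values

perfect = pearson_r2(y, y)                        # identical predictions
rescaled = pearson_r2(y, [2 * v + 3 for v in y])  # right trend, wrong scale
flipped = pearson_r2(y, [-v for v in y])          # exactly the wrong trend
uncorrelated = pearson_r2([-1.0, 0.0, 1.0], [1.0, -2.0, 1.0])
print(perfect, rescaled, flipped, uncorrelated)
```

The first three scores all come out at 1 (up to rounding) and the last at 0. Because r<sup>2</sup> only measures linear association, predictions that are rescaled or even sign-flipped still score perfectly, so it is worth pairing it with an error metric such as RMSE.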

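- let's just use the first ten molecules from the test set
 - for each one we print the predicted log(solubility), the measured value, and the chemical structure (represented as a SMILES string)
 - both numbers appear to be in the normalized units produced by the loader's transformers (assuming the default normalization of `load_delaney`), not raw log(moles/liter)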

In [7]:
solubilities = model.predict_on_batch(test_dataset.X[:10])
for molecule, solubility, test_solubility in zip(test_dataset.ids, solubilities, test_dataset.y):
    print(solubility, test_solubility, molecule)

[-1.4408386] [-1.60114461] c1cc2ccc3cccc4ccc(c1)c2c34
[0.7145318] [0.20848251] Cc1cc(=O)[nH]c(=S)[nH]1
[-0.7271201] [-0.01602738] Oc1ccc(cc1)C2(OC(=O)c3ccccc23)c4ccc(O)cc4 
[-1.4408864] [-2.82191713] c1ccc2c(c1)cc3ccc4cccc5ccc2c3c45
[-1.1227936] [-0.52891635] C1=Cc2cccc3cccc1c23
[1.2689223] [1.10168349] CC1CO1
[-0.12336537] [-0.88987406] CCN2c1ccccc1N(C)C(=S)c3cccnc23 
[-1.557469] [-0.52649706] CC12CCC3C(CCc4cc(O)ccc34)C2CCC1=O
[-1.1528693] [-0.76358725] Cn2cc(c1ccccc1)c(=O)c(c2)c3cccc(c3)C(F)(F)F
[0.25987867] [-0.64020358] ClC(Cl)(Cl)C(NC=O)N1C=CN(C=C1)C(NC=O)C(Cl)(Cl)Cl 
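
To summarize the ten rows above with numbers rather than by eye, here is a small sketch that computes the mean absolute error and root-mean-square error from the printed pairs. The values are copied from the output; the variable names are arbitrary, and the errors are in whatever units the notebook printed.

```python
import math

# (predicted, measured) pairs copied from the output above.
pairs = [
    (-1.4408386, -1.60114461),
    (0.7145318, 0.20848251),
    (-0.7271201, -0.01602738),
    (-1.4408864, -2.82191713),
    (-1.1227936, -0.52891635),
    (1.2689223, 1.10168349),
    (-0.12336537, -0.88987406),
    (-1.557469, -0.52649706),
    (-1.1528693, -0.76358725),
    (0.25987867, -0.64020358),
]

errors = [pred - meas for pred, meas in pairs]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

# The pair with the largest absolute difference between prediction and measurement.
worst_pred, worst_meas = max(pairs, key=lambda p: abs(p[0] - p[1]))

print(f"MAE={mae:.3f} RMSE={rmse:.3f}")
print(f"largest miss: predicted {worst_pred}, measured {worst_meas}")
```

Ten molecules are far too few for a reliable estimate, so this is only a quick sanity check alongside the test set score.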
