<a href="https://colab.research.google.com/github/KacperKubara/ml-cookbook/blob/master/solubility.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Click the badge below to open this notebook in Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1r3QAoLsI-k6se1EubeepUs8p0Bqvapb_?usp=sharing)

The DeepChem and dataset setup below is taken from the official DeepChem tutorial: [Modeling Solubility](https://github.com/deepchem/deepchem/blob/master/examples/tutorials/03_Modeling_Solubility.ipynb)

In [None]:
# Installing conda
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e



add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
python version: 3.6.9
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
done
installing miniconda to /root/miniconda
done
installing rdkit, openmm, pdbfixer
added conda-forge to channels
added omnia to channels
done
conda packages installation finished!


# conda environments:
#
base                  *  /root/miniconda



In [None]:
# Installing DeepChem
!pip install --pre deepchem
import deepchem
deepchem.__version__

Collecting deepchem
  Downloading https://files.pythonhosted.org/packages/5d/15/bc4f959ad4900875c28427b44196d645df06bdfebe669a6a4452d184b6f3/deepchem-2.4.0rc1.dev20200816062845.tar.gz (366kB)

'2.4.0-rc1.dev'

In [None]:
# Getting the Delaney solubility dataset
!wget https://raw.githubusercontent.com/deepchem/deepchem/master/datasets/delaney-processed.csv
dataset_file = "delaney-processed.csv"

# Loading the data from the CSV file.
# Note: smiles_field triggers the deprecation warning shown in the output;
# the warning names feature_field as its replacement.
loader = deepchem.data.CSVLoader(tasks=["ESOL predicted log solubility in mols per litre"],
                                 smiles_field="smiles",
                                 featurizer=deepchem.feat.ConvMolFeaturizer())
# Featurizing the dataset with ConvMolFeaturizer
dataset = loader.featurize(dataset_file)

--2020-08-17 16:23:03--  https://raw.githubusercontent.com/deepchem/deepchem/master/datasets/delaney-processed.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 96699 (94K) [text/plain]
Saving to: ‘delaney-processed.csv.6’


2020-08-17 16:23:04 (3.67 MB/s) - ‘delaney-processed.csv.6’ saved [96699/96699]



smiles_field is deprecated and will be removed in a future version of DeepChem. Use feature_field instead.


In [None]:
# The splitter partitions the dataset; here it plays the same role as
# train_test_split from sklearn
splitter = deepchem.splits.RandomSplitter()
# frac_test is only 0.01 because this example uses just the train and valid sets
train, valid, _ = splitter.train_valid_test_split(dataset,
                                                  frac_train=0.7,
                                                  frac_valid=0.29,
                                                  frac_test=0.01)
# The normalizer rescales the y values using statistics from the training set
normalizer = deepchem.trans.NormalizationTransformer(transform_y=True,
                                                     dataset=train,
                                                     move_mean=True)
train = normalizer.transform(train)
# Normalize the validation set as well, so it is evaluated on the same scale as train
valid = normalizer.transform(valid)

In [None]:
print(f"Size of the training data: {len(train.ids)}")
print(f"Size of the validation data: {len(valid.ids)}")
print(valid)

Size of the training data: 789
Size of the validation data: 327
<DiskDataset X.shape: (327,), y.shape: (327, 1), w.shape: (327, 1), ids: ['CC(C)=CCCC(C)=CC(=O)' 'Clc1cc(Cl)c(c(Cl)c1)c2c(Cl)cccc2Cl'
 'ClC4=C(Cl)C5(Cl)C3C1CC(C2OC12)C3C4(Cl)C5(Cl)Cl' ... 'CC(C)C(C)(C)C'
 'CCC(C)C' 'COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl'], task_names: ['ESOL predicted log solubility in mols per litre']>
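
The split and the y-normalization from the cells above can be sketched in plain NumPy. This is only an illustration of the idea, not DeepChem's implementation: the labels are random stand-ins, the 1128-row count is assumed from the size of the Delaney set, and the helper names `normalize` and `undo` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in for the log-solubility labels (assumed: 1128 rows, one task).
y = rng.normal(loc=-3.0, scale=2.0, size=(1128, 1))

# Random split: shuffle the row indices, then cut at 70% and 99%.
n = len(y)
indices = rng.permutation(n)
train_cutoff = int(0.7 * n)
valid_cutoff = int((0.7 + 0.29) * n)
train_idx = indices[:train_cutoff]
valid_idx = indices[train_cutoff:valid_cutoff]
test_idx = indices[valid_cutoff:]

# Normalization statistics come from the training labels only.
y_mean = y[train_idx].mean(axis=0)
y_std = y[train_idx].std(axis=0)

def normalize(values):
    """Shift and scale labels with the training mean and standard deviation."""
    return (values - y_mean) / y_std

def undo(values):
    """Map normalized labels or predictions back to the original scale."""
    return values * y_std + y_mean

y_train_norm = normalize(y[train_idx])
y_valid_norm = normalize(y[valid_idx])

print(len(train_idx), len(valid_idx), len(test_idx))
```

Reusing the training statistics for the validation labels keeps both sets on the same scale, and `undo` plays the role the normalizer has when it is passed to `evaluate` later on.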


In [None]:
# GraphConvModel is a GNN model based on
# Duvenaud, David K., et al. "Convolutional networks on graphs for
# learning molecular fingerprints."
from deepchem.models import GraphConvModel
graph_conv = GraphConvModel(n_tasks=1,
                            batch_size=50,
                            mode="regression")
# Defining the metric: squared Pearson correlation, where closer to 1 is better
metric = deepchem.metrics.Metric(deepchem.metrics.pearson_r2_score)
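
To give a feel for what a graph convolution does, here is a heavily simplified NumPy sketch. It is not how `GraphConvModel` is implemented: the two-dimensional atom features, the random weights, and the function `graph_conv_layer` are all hypothetical, and a single shared weight matrix is assumed for every atom.

```python
import numpy as np

# Toy molecule: the three-carbon chain of propane (C-C-C) as an adjacency matrix.
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])

# Hypothetical atom features: [is_carbon, number of heavy-atom neighbours].
atom_features = np.array([[1.0, 1.0],
                          [1.0, 2.0],
                          [1.0, 1.0]])

rng = np.random.default_rng(0)
weights = rng.normal(size=(2, 4))  # random stand-in for learned parameters

def graph_conv_layer(adjacency, features, weights):
    """One simplified layer: add neighbour features, project, apply ReLU."""
    aggregated = features + adjacency @ features
    return np.maximum(aggregated @ weights, 0.0)

# One hidden vector per atom.
hidden = graph_conv_layer(adjacency, atom_features, weights)

# Summing over atoms gives a fixed-size vector for the whole molecule,
# independent of the order in which the atoms are listed.
molecule_vector = hidden.sum(axis=0)

print(hidden.shape, molecule_vector.shape)
```

In a regression model, a vector like `molecule_vector` would then be passed through dense layers to predict a single value such as log solubility.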


In [None]:
# Fitting the model
graph_conv.fit(train, nb_epoch=10)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


0.16994555791219076

In [None]:
# Reversing the transformation and getting the metric scores on both datasets
train_scores = graph_conv.evaluate(train, [metric], [normalizer])
valid_scores = graph_conv.evaluate(valid, [metric], [normalizer])
print(f"Train Scores: {train_scores}")
print(f"Validation Scores: {valid_scores}")

n_samples is a deprecated argument which is ignored.
n_samples is a deprecated argument which is ignored.


Train Scores: {'pearson_r2_score': 0.6639371596233589}
Validation Scores: {'pearson_r2_score': 0.4961263318150533}
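
The `pearson_r2_score` values above are squared Pearson correlations between true and predicted labels. A minimal NumPy sketch of that quantity follows; the function `pearson_r2` and the five example values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation between two 1-D sequences of values."""
    r = np.corrcoef(np.ravel(y_true), np.ravel(y_pred))[0, 1]
    return r ** 2

# Made-up log-solubility values and predictions.
y_true = np.array([-1.0, -2.5, -0.3, -4.1, -3.2])
y_pred = np.array([-1.2, -2.0, -0.9, -3.5, -3.6])

score = pearson_r2(y_true, y_pred)

# Shifting and rescaling the true labels leaves the score unchanged.
rescaled_score = pearson_r2(2.0 * y_true + 5.0, y_pred)

print(score, rescaled_score)
```

Because the score is unaffected by shifting or rescaling, it measures how well predictions track the labels linearly rather than how large the errors are. An error-based metric such as RMSE would be a natural complement.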
