# Contents

In this notebook, we will learn how to load the PKIS2 dataset into memory using a `DatasetProvider` object. Then we will apply some standard featurization to obtain ML-compatible representations, and use a simple model to compute some activity predictions.

1. [X] Loading the data
2. [X] Featurizing the data
3. [X] Exporting the featurized data to PyTorch
4. [X] Build and train the model
5. [ ] Analyze results nicely

Disable stereochemistry _warnings_ generated by `openforcefield`.

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
import warnings
warnings.simplefilter("ignore") 
import logging
logging.basicConfig(level=logging.ERROR)
import numpy as np

# 1. Loading the data as a DatasetProvider

In [4]:
from kinoml.datasets.kinomescan.pkis2 import PKIS2DatasetProvider



Let's initialize the PKIS2 dataset provider. Instead of a regular `ClassName()` instantiation, we need to use `.from_source()` (with convenient default arguments).

__Why?__

This is due to the design of `BaseDatasetProvider.__init__`, which expects a list of `BaseMeasurement` objects (or instances of its subclasses).

* A `BaseMeasurement` class is normally subclassed by more specific measurements (like `PercentageDisplacementMeasurement`), but the design is the same. It takes:
  * `values`: an array of numeric values (single measurements are _arrayfied_ into single-element arrays). These can be replicates of the same value for statistical purposes, or perhaps values obtained under different concentrations of a reactant.
  * `conditions`: instance of `AssayConditions`. This class provides all the properties required to reproduce the experiment (say `pH`, `temperature`, `concentration`, etc). This should be paired (somehow) to the dimensionality of `values`, but I haven't thought much about that yet.
  * `system`: instance of a `System` class or subclass. The subclasses can restrict which type of `MolecularComponent` objects are allowed (e.g. `ProteinLigandComplex` only takes a `Protein` and a `Ligand`).
* A `System` is abstract enough to not impose any restrictions on the composition, but its subclasses can.
  * This is the case of the `Complex` object, which requires at least one `BaseProtein` and one `BaseLigand` object.
* The `MolecularComponent` class is the base class for all proteins and ligands, regardless of their representation (e.g. sequence vs 3D structure, SMILES vs molecular graph). `MolecularComponent` is immediately subclassed by:
  * `BaseProtein`, the abstract model which is subclassed by more concrete classes, like `AminoAcidSequence` and `ProteinStructure`.
  * `BaseLigand`, the abstract model which is subclassed by more concrete classes, like `Ligand` (based on `openforcefield.topology.Molecule`).
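
The relationships in the list above can be summarized with a minimal sketch. The classes below are simplified stand-ins that only reuse the names mentioned in the text; they are not the actual `kinoml` implementations, and their constructor signatures, the toy sequence and the toy value are assumptions made for illustration.

```python
class MolecularComponent:
    """Base class for every protein or ligand representation."""

    def __init__(self, name):
        self.name = name


class BaseProtein(MolecularComponent):
    """Abstract protein; subclasses pick a concrete representation."""


class BaseLigand(MolecularComponent):
    """Abstract ligand; subclasses pick a concrete representation."""


class AminoAcidSequence(BaseProtein):
    def __init__(self, name, sequence):
        super().__init__(name)
        self.sequence = sequence


class Ligand(BaseLigand):
    def __init__(self, name, smiles):
        super().__init__(name)
        self.smiles = smiles


class System:
    """Generic container: no restriction on which components it holds."""

    def __init__(self, components):
        self.components = list(components)


class ProteinLigandComplex(System):
    """Subclass that restricts composition to one protein and one ligand."""

    def __init__(self, components):
        super().__init__(components)
        proteins = [c for c in self.components if isinstance(c, BaseProtein)]
        ligands = [c for c in self.components if isinstance(c, BaseLigand)]
        if len(self.components) != 2 or len(proteins) != 1 or len(ligands) != 1:
            raise ValueError("expected exactly one protein and one ligand")
        self.protein, self.ligand = proteins[0], ligands[0]


class AssayConditions:
    def __init__(self, pH=7.0):
        self.pH = pH


class BaseMeasurement:
    def __init__(self, values, conditions, system):
        # Single numbers are wrapped so that `values` is always a list
        if isinstance(values, (int, float)):
            values = [values]
        self.values = list(values)
        self.conditions = conditions
        self.system = system


class PercentageDisplacementMeasurement(BaseMeasurement):
    """A more specific measurement; same constructor as the base class."""


# Toy data: the sequence and the value are placeholders, not PKIS2 entries
kinase = AminoAcidSequence("DYRK1A", sequence="MKVLAG")
ligand = Ligand("toy-ligand", smiles="Nc1n[nH]c2cc(ccc12)-c1cccc(N)c1")
complex_ = ProteinLigandComplex([kinase, ligand])
measurement = PercentageDisplacementMeasurement(
    14.0, conditions=AssayConditions(), system=complex_
)
print(measurement.values, measurement.system.protein.name)
```

A dataset provider would then be built from a list of such measurement objects, which is why a plain `ClassName()` call without arguments is not enough.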


Anyway, none of those details are needed to start using the provider. Right now there's no lazy behavior, so it will take _a bit_ to build all sequences and ligands. On my machine, it's about 20 seconds for all 260K datapoints.

In [5]:
%%time
pkis2 = PKIS2DatasetProvider.from_source()

CPU times: user 19.4 s, sys: 828 ms, total: 20.2 s
Wall time: 20.2 s


You can export a convenient dataframe with `to_dataframe()`. Keep in mind that this just uses the default implementation in the base class, which relies on the `__repr__` methods and `.name` attributes of the objects involved. For prettier dataframes, a subclass can always override `to_dataframe` to provide a better presentation.

In [6]:
df = pkis2.to_dataframe()
df

Unnamed: 0,Systems,n_components,PercentageDisplacementMeasurement
0,AAK1 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C2...,2,14.0
1,ABL1-nonphosphorylated & Clc1cccc(Cn2c(nn3c2nc...,2,28.0
2,ABL1-nonphosphorylated & Clc1cccc(Cn2c(nn3c2nc...,2,20.0
3,ABL2 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C2...,2,5.0
4,ACVR1 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C...,2,0.0
...,...,...,...
261865,ZAP70 & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)cc12)...,2,0.0
261866,p38-alpha & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)c...,2,0.0
261867,p38-beta & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)cc...,2,0.0
261868,p38-delta & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)c...,2,0.0


Although we have roughly 262K measurements, there are only about 258K _systems_, built from a much smaller number of ligands and proteins.

In [7]:
print("Measurements:", len(pkis2.measurements))
print("Systems:", len(pkis2.systems))
print("Ligands:", len({s.ligand for s in pkis2.systems}))
print("Proteins:", len({s.protein for s in pkis2.systems}))

Measurements: 261870
Systems: 257920
Ligands: 640
Proteins: 403
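
One way to read these counts: if systems compare equal by content, repeated measurements of the same protein-ligand pair collapse into a single system when collected in a set. The sketch below uses plain tuples with made-up names and values as stand-ins for `kinoml` objects, so it only illustrates the counting logic.

```python
# Each toy measurement is (protein_name, ligand_smiles, value)
measurements = [
    ("KINASE_A", "CCO", 28.0),
    ("KINASE_A", "CCO", 20.0),  # second measurement of the same system
    ("KINASE_B", "CCO", 5.0),
    ("KINASE_A", "CCN", 0.0),
]

# Sets keep one copy of each distinct item
systems = {(protein, ligand) for protein, ligand, _ in measurements}
ligands = {ligand for _, ligand, _ in measurements}
proteins = {protein for protein, _, _ in measurements}

print("Measurements:", len(measurements))
print("Systems:", len(systems))
print("Ligands:", len(ligands))
print("Proteins:", len(proteins))
```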


Notice how the string representations try to be a bit informative.

In [8]:
pkis2

<PKIS2DatasetProvider with 261870 PercentageDisplacementMeasurement measurements and 257920 systems>

In [9]:
one_random_system = next(iter(pkis2.systems))
one_random_system

<ProteinLigandComplex with 2 components (<AminoAcidSequence name=DYRK1A>, <Ligand name=Nc1n[nH]c2cc(ccc12)-c1cccc(N)c1>)>

Some areas still need improvement. For example, this is how you currently select all the entries that have wild-type kinases (which is all of them in PKIS2). Maybe Django-style queries would be a nicer interface?

In [10]:
wt = [ms for ms in pkis2.measurements if ms.system.protein.metadata["mutations"] is None]
wt_provider = PKIS2DatasetProvider(measurements=wt)
wt_provider

<PKIS2DatasetProvider with 261870 PercentageDisplacementMeasurement measurements and 257920 systems>
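
To make the Django-style idea concrete, here is a minimal sketch of what such a query helper could look like. `filter_measurements` is a hypothetical function, not part of `kinoml`, and the toy objects (including the mutation label) are built with `SimpleNamespace` purely for illustration.

```python
from types import SimpleNamespace


def filter_measurements(measurements, **lookups):
    """Keep measurements whose nested attributes match every lookup.

    Keys use double underscores to walk attributes (or dict keys), e.g.
    ``system__protein__name="ABL1"``.
    """

    def resolve(obj, path):
        for part in path.split("__"):
            obj = obj[part] if isinstance(obj, dict) else getattr(obj, part)
        return obj

    return [
        ms
        for ms in measurements
        if all(resolve(ms, path) == expected for path, expected in lookups.items())
    ]


def toy_measurement(protein_name, mutations, value):
    """Build a stand-in object with the attribute layout used in the cell above."""
    protein = SimpleNamespace(name=protein_name, metadata={"mutations": mutations})
    return SimpleNamespace(system=SimpleNamespace(protein=protein), values=[value])


data = [
    toy_measurement("ABL1", None, 28.0),
    toy_measurement("ABL1", "T315I", 3.0),
    toy_measurement("AAK1", None, 14.0),
]

wt = filter_measurements(data, system__protein__metadata__mutations=None)
abl1_wt = filter_measurements(
    data,
    system__protein__name="ABL1",
    system__protein__metadata__mutations=None,
)
print(len(wt), len(abl1_wt))
```

The filtered list could then be passed to the provider constructor through `measurements=`, as in the cell above.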

#  2. Featurizing the data

We will be using:

- MorganFingerprint n=1024 bits, r=2
- OneHotEncoding of protein sequence
- ... or composition of binding site

An isolated featurizer takes one system and returns the raw data:

In [11]:
from kinoml.features.ligand import MorganFingerprintFeaturizer
morgan_featurizer = MorganFingerprintFeaturizer(nbits=1024, radius=2)
syst = morgan_featurizer.featurize(next(iter(pkis2.systems)))
print(syst.featurizations[morgan_featurizer.name].shape, *syst.featurizations[morgan_featurizer.name][:75], "...")

(1024,) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 ...


In the context of a dataset provider, each system stores that raw data in its own internal dictionary (`.featurizations`). The dataset contains a huge amount of duplication, so without caching this would take ~30 minutes. Thanks to LRU caching at the `featurizer` level, each `Ligand` is featurized only once!
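
The effect of caching can be sketched with `functools.lru_cache` from the standard library. The featurization function here is a cheap placeholder (character counts, not a Morgan fingerprint), and keying the cache on the SMILES string is an assumption for the example; how `kinoml` keys its cache is not shown in this notebook.

```python
from functools import lru_cache

calls = {"count": 0}


@lru_cache(maxsize=None)
def featurize_ligand(smiles):
    """Placeholder featurization that records how often real work happens."""
    calls["count"] += 1
    return (smiles.count("C"), smiles.count("N"), smiles.count("O"))


# Four toy systems that share only two distinct ligands
systems = [("K1", "CCO"), ("K2", "CCO"), ("K3", "CCO"), ("K1", "CCN")]
features = [featurize_ligand(smiles) for _, smiles in systems]

print("systems:", len(systems), "actual featurizations:", calls["count"])
print(featurize_ligand.cache_info())
```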

In [12]:
from kinoml.features.protein import OneHotEncodedSequenceFeaturizer
sequence_featurizer = OneHotEncodedSequenceFeaturizer()
syst = sequence_featurizer.featurize(next(iter(pkis2.systems)))
print(syst.featurizations[sequence_featurizer.name].shape, *syst.featurizations[sequence_featurizer.name][0], "...")

(20, 381) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.

Since sequence length differs across the kinase set, we need to pad to the maximum length.

In [13]:
# Find the system with the longest protein sequence
best = max(pkis2.systems, key=lambda system: len(system.protein.sequence))
max_length = len(best.protein.sequence)
print(best, "has", max_length, "aas")

<ProteinLigandComplex with 2 components (<AminoAcidSequence name=PIK3C2B>, <Ligand name=c1cc2c(ccnc2[nH]1)-c1ccccc1>)> has 1634 aas
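
As a rough illustration of one-hot encoding followed by padding, here is a pure-Python sketch. It assumes a 20-letter alphabet and a `(20, length)` layout like the shape printed earlier; the alphabet order, the right-padding with zeros and the toy sequences are assumptions, not necessarily what `OneHotEncodedSequenceFeaturizer` and `PadFeaturizer` do.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_hot(sequence):
    """Return a 20 x len(sequence) matrix (list of rows) of 0.0 / 1.0."""
    matrix = [[0.0] * len(sequence) for _ in ALPHABET]
    for position, residue in enumerate(sequence):
        # str.index raises ValueError for residues outside the alphabet
        matrix[ALPHABET.index(residue)][position] = 1.0
    return matrix


def pad(matrix, length):
    """Right-pad every row with zeros up to `length` columns."""
    return [row + [0.0] * (length - len(row)) for row in matrix]


sequences = ["MKV", "MKVLAG", "GA"]  # toy sequences of different lengths
max_length = max(len(s) for s in sequences)
padded = [pad(one_hot(s), max_length) for s in sequences]

for seq, matrix in zip(sequences, padded):
    print(seq, "->", (len(matrix), len(matrix[0])))
```

After padding, every matrix has the same shape, so the encodings can be stacked into a single array.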


We could use UniProt annotations to clip the sequences to the useful domains. It might be too aggressive, but it's a good start.

We can also try to align to Dunbrack's MSA: https://www.nature.com/articles/s41598-019-56499-4

In [14]:
from kinoml.features.core import PadFeaturizer, Pipeline
padded_featurizer = PadFeaturizer((max_length,), key=sequence_featurizer.__class__.__name__)
padded_sequence_featurizer = Pipeline([sequence_featurizer, padded_featurizer])
syst = padded_sequence_featurizer.featurize(next(iter(pkis2.systems)))
print(syst.featurizations[padded_sequence_featurizer.name].shape, *syst.featurizations[padded_sequence_featurizer.name][0][:100], "...")

(1634, 1995) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...


In [15]:
syst.featurizations

{'MorganFingerprintFeaturizer': array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 'OneHotEncodedSequenceFeaturizer': array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 'Pipeline([OneHotEncodedSequenceFeaturizer, PadFeaturizer])': array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])}

For this exercise, we will use a simpler protein featurization, a hash of the protein sequence, for easy concatenation.

In [16]:
from kinoml.features.core import HashFeaturizer

hashed_sequence_featurizer = HashFeaturizer(("protein", "sequence"))
syst = hashed_sequence_featurizer.featurize(next(iter(pkis2.systems)))
print(syst.featurizations[hashed_sequence_featurizer.name])

[0.26205074]
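
The hashing scheme used by `HashFeaturizer` is not shown in this notebook, so the following is only a sketch of the general idea: map a string to a reproducible number in `[0, 1)` and append it to a ligand fingerprint. The function name, the use of SHA-256 and the toy fingerprint are all assumptions.

```python
import hashlib


def hash_to_unit_interval(text):
    """Map a string to a reproducible float in [0, 1)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Keep the top 53 bits of the first 8 bytes so the division is exact
    as_int = int.from_bytes(digest[:8], "big") >> 11
    return as_int / 2**53


fingerprint = [0, 1, 0, 0, 1, 0, 0, 0]  # toy 8-bit ligand fingerprint
sequence_hash = hash_to_unit_interval("MKVLAG")  # toy sequence

# Concatenate along the only axis: fingerprint bits followed by the hash
features = fingerprint + [sequence_hash]
print(len(features), features[-1])
```

A single hashed number tells the model which protein it is looking at but carries no information about sequence similarity, which is why it is only a convenient placeholder.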


In [17]:
from kinoml.features.core import Concatenated
concat_featurizers = Concatenated([morgan_featurizer, hashed_sequence_featurizer], axis=0)

In [18]:
pkis2.featurize(concat_featurizers)


Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/home/jaime/.conda/envs/kinoml-ci/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/jaime/.conda/envs/kinoml-ci/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jaime/.conda/envs/kinoml-ci/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/jaime/devel/py/openkinome/kinoml/kinoml/datasets/core.py", line 150, in _featurize_one
    featurizer.featurize(system, inplace=True)
  File "/home/jaime/devel/py/openkinome/kinoml/kinoml/features/core.py", line 37, in featurize
    features = self._featurize(system)
  File "/home/jaime/devel/py/openkinome/kinoml/kinoml/features/core.py", line 144, in _featurize
    features = [f._featurize(system_or_array) for f in self.featurizers]
  File "/home/jaime/devel/py/openkinome/kinoml/kinoml/features/core.py", 




KeyboardInterrupt: 

In [None]:
print(pkis2.systems[0])
print(pkis2.systems[0].featurizations)

# 3. Export to PyTorch

In [None]:
dataset = pkis2.to_pytorch()
dataset

This PyTorch dataset implements the `Dataset` protocol and provides two attributes: `measurements` and (featurized) `systems`:

In [None]:
dataset.measurements

In [None]:
print(dataset.systems[0].shape)
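
For reference, a map-style dataset in PyTorch only needs `__len__` and `__getitem__`. The sketch below shows that protocol in plain Python, without importing `torch`; `ToyDataset` and `minibatches` are hypothetical names, and the batching helper is just a stand-in for what a `DataLoader` does.

```python
class ToyDataset:
    """Map-style dataset: sized and indexable."""

    def __init__(self, systems, measurements):
        if len(systems) != len(measurements):
            raise ValueError("systems and measurements must have the same length")
        self.systems = systems
        self.measurements = measurements

    def __len__(self):
        return len(self.systems)

    def __getitem__(self, index):
        # One (features, target) pair
        return self.systems[index], self.measurements[index]


def minibatches(dataset, batch_size):
    """Yield lists of (x, y) pairs, the last one possibly shorter."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


toy = ToyDataset(
    systems=[[0, 1], [1, 0], [1, 1], [0, 0], [1, 0]],  # toy feature vectors
    measurements=[14.0, 28.0, 20.0, 5.0, 0.0],  # toy targets
)
batch_sizes = [len(batch) for batch in minibatches(toy, batch_size=2)]
print(len(toy), toy[0], batch_sizes)
```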

**TODO**: Look into specifying datatypes per featurizer to use memory more efficiently.

## 3.1 Ensuring we have an adequate observation model
The underlying measurement type, common to _all_ measurements in the dataset, provides an `observation_model` method that returns a dispatched callable, configurable per backend (the default is `pytorch`).

In [None]:
pct_displacement_model = pkis2.measurement_type.observation_model(backend="pytorch")
pct_displacement_model??

The callable returned by `observation_model` expects objects native to the chosen backend (or numpy arrays).
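
A backend dispatch of this kind can be sketched as a lookup table of callables. The registry, the `"python"` backend name, the single-site binding formula and the 1 µM default concentration are assumptions made for this example; check the `kinoml` source for the expression it actually uses.

```python
import math


def _percentage_displacement_python(dG_over_kT, inhibitor_conc=1e-6):
    """Assumed single-site binding model with Kd = exp(dG / kT)."""
    kd = math.exp(dG_over_kT)
    return 100.0 * (1.0 - 1.0 / (1.0 + inhibitor_conc / kd))


# Hypothetical registry: one callable per backend name
_BACKENDS = {"python": _percentage_displacement_python}


def observation_model(backend="python"):
    """Return the callable registered for the requested backend."""
    try:
        return _BACKENDS[backend]
    except KeyError:
        raise NotImplementedError(f"no observation model for {backend!r}") from None


model = observation_model()
half = model(math.log(1e-6))  # Kd equal to the inhibitor concentration
strong = model(-30.0)  # very tight binding
weak = model(0.0)  # very weak binding
print(round(half, 6), strong > half > weak)
```

Under this assumed model, a dissociation constant equal to the inhibitor concentration gives 50% displacement, and tighter binding pushes the value toward 100%.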

# 4. Building and training the model

- `NeuralNetworkRegression` (from `kinoml.ml.torch_models`)

## 4.1 Optimization loop

In [None]:
import torch
from kinoml.ml.torch_models import NeuralNetworkRegression

# Use DataLoader for minibatches
loader = dataset.as_dataloader(batch_size=512)

model = NeuralNetworkRegression(input_size=dataset.systems[0].shape[0])
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_function = torch.nn.MSELoss() # Mean squared error

print(f"{model} has {sum(param.numel() for param in model.parameters())} parameters")

nb_epoch = 100
loss_timeseries = []
for epoch in range(nb_epoch):
    cumulative_loss = 0
    for x, y in loader:
        # Clear gradients accumulated in the previous step
        optimizer.zero_grad()

        # Model output for this minibatch
        delta_g = model(x)

        # Map the model output to the observable through the observation model
        prediction = pct_displacement_model(delta_g)

        # Loss between predicted and measured values
        loss = loss_function(prediction, y)
        cumulative_loss += loss.item()

        # Gradients w.r.t. the parameters
        loss.backward()

        # Update the parameters
        optimizer.step()
    # Sum of the minibatch losses for this epoch
    loss_timeseries.append(cumulative_loss)
    if epoch % 10 == 0:
        print(f"epoch {epoch} : loss {loss_timeseries[-1]}")
print("Done!")

# 5. Analyze results

In [None]:
from matplotlib import pyplot as plt
plt.plot(loss_timeseries)