# Featurize a dataset

Machine learning models expect tensor representations of the chemical data. This notebook provides a workflow to produce them.

`kinoml.dataset.DatasetProvider` objects need to be available to deal with your collection of raw measurements for protein:ligand systems. These objects are, roughly, a list of `kinoml.core.BaseMeasurement` objects, each containing a set of `.values` and some extra metadata, like the `system` objects to be featurized here.

In ligand-based models, protein information is only considered marginally, and most of the action happens at the ligand level. Usually starting with a string representation such as SMILES, or a database identifier such as a PubChem ID, these are promoted to (usually) RDKit objects and then transformed into a tensor of some form (e.g. fingerprints, molecular graph as an adjacency matrix, etc).
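
To make "a tensor of some form" concrete, here is a minimal sketch of one of the representations mentioned above, a molecular graph stored as an adjacency matrix. It deliberately avoids RDKit: the atoms and bonds of ethanol (SMILES `CCO`) are typed in by hand, which is an assumption made for illustration, and `adjacency_matrix` is a hypothetical helper, not part of `kinoml`.

```python
def adjacency_matrix(n_atoms, bonds):
    """Return a symmetric 0/1 adjacency matrix as a list of lists."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        # Bonds are undirected, so fill both triangles
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


# Heavy atoms of ethanol (SMILES "CCO"), written by hand rather than parsed
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

adjacency = adjacency_matrix(len(atoms), bonds)
for symbol, row in zip(atoms, adjacency):
    print(symbol, row)
```

In a real workflow the atom and bond lists would come from a parsed molecule object instead of being hard-coded.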

Available featurizers can be found under `kinoml.features`.

## How to use

Run `python run_notebook.py --help` for more information.

In [1]:
# If this is the template file (and not a copy) and you are introducing changes,
# update VERSION with the current date (YYYY.MM.DD)
VERSION = "2020.12.07" 

## ✏ Define hyperparameters

In [2]:
# TEMPLATE VALUES -- these are overridden (see below if executed) by papermill using a YAML or Python file as input
DATASET_CLS = "import.path.to.DatasetProvider"
DATASET_KWARGS = {"option": "value", "option2": "value2"}

PIPELINE = {
    "someuniquekey": [
        ("import.path.to.SomeFeaturizer", {"option": "value", "option2": "value2"}),
        ("import.path.to.SomeOtherFeaturizer", {"option": "value", "option2": "value2"}),
    ]
}
PIPELINE_AGG = "kinoml.features.core.Concatenated"
PIPELINE_AGG_KWARGS = {}

FEATURIZE_KWARGS = {"processes": 1}

GROUPS = [
    ("kinoml.datasets.groups.CallableGrouper", {"function": "lambda something: something.attribute"}),
    ("kinoml.datasets.groups.CallableGrouper", {"function": "lambda otherthing: otherthing.attribute2"})
]

TRAIN_TEST_VAL_KWARGS = {"idx_train": 0.8, "idx_test": 0.1, "idx_val": 0.1}

## IGNORE THIS ONE (`_dh` is IPython's directory history; its last entry is the current working directory)
HERE = _dh[-1]

In [3]:
# Parameters
DATASET_CLS = "kinoml.datasets.kinomescan.pkis2.PKIS2DatasetProvider"
DATASET_KWARGS = {}
PIPELINE = {
    "ligand": [
        ["kinoml.features.ligand.SmilesToLigandFeaturizer", {"style": "rdkit"}],
        [
            "kinoml.features.ligand.MorganFingerprintFeaturizer",
            {"nbits": 512, "radius": 2},
        ],
    ]
}
PIPELINE_AGG = "kinoml.features.core.Concatenated"
PIPELINE_AGG_KWARGS = {}
FEATURIZE_KWARGS = {"processes": 1}
GROUPS = [
    [
        "kinoml.datasets.groups.CallableGrouper",
        {"function": "lambda measurement: measurement.system.protein.name"},
    ],
    [
        "kinoml.datasets.groups.CallableGrouper",
        {"function": "lambda measurement: type(measurement).__name__"},
    ],
]
TRAIN_TEST_VAL_KWARGS = {"idx_train": 0.8, "idx_test": 0.1, "idx_val": 0.1}
HERE = "/home/jaime/devel/py/openkinome/experiments-binding-affinity/features/ligand-only-morgan512"


⚠ From here on, you should _not_ need to modify anything else 🤞

---

Define key paths for data and outputs:

In [4]:
from pathlib import Path

HERE = Path(HERE)
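# Walk up from HERE until a directory containing `.github/` is found: that is the repo root
for parent in HERE.parents:
    if next(parent.glob(".github/"), None):
        REPO = parent
        break
else:
    raise FileNotFoundError(f"No parent of {HERE} contains a .github/ directory")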

# Generate paths for this pipeline
featurizer_path = []
for name, branch in PIPELINE.items():
    featurizer_path.append(name)
    for clsname, kwargs in branch:
        clsname = clsname.rsplit(".", 1)[1]
        kwargs = [f"{k}={''.join(c for c in str(v) if c.isalnum())}" for k,v in kwargs.items()]
        featurizer_path.append("_".join([clsname] + kwargs))

OUT = HERE / "_output"  / "__".join(featurizer_path) / DATASET_CLS.rsplit('.', 1)[1]
OUT.mkdir(parents=True, exist_ok=True)

print(f"This notebook:           HERE = ~/{HERE.relative_to(Path.home())}")
print(f"This repo:               REPO = ~/{REPO.relative_to(Path.home())}")
print(f"Outputs in:               OUT = ~/{OUT.relative_to(Path.home())}")

This notebook:           HERE = ~/devel/py/openkinome/experiments-binding-affinity/features/ligand-only-morgan512
This repo:               REPO = ~/devel/py/openkinome/experiments-binding-affinity
Outputs in:               OUT = ~/devel/py/openkinome/experiments-binding-affinity/features/ligand-only-morgan512/_output/ligand__SmilesToLigandFeaturizer_style=rdkit__MorganFingerprintFeaturizer_nbits=512_radius=2/PKIS2DatasetProvider
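
The output directory name is derived from the pipeline definition. The sketch below isolates that naming step in a standalone function so its behaviour is easier to inspect. `pipeline_slug` is a hypothetical name introduced here; the logic mirrors the loop in the cell above.

```python
def pipeline_slug(pipeline):
    """Build a filesystem-friendly name from a PIPELINE-style dict."""
    parts = []
    for name, branch in pipeline.items():
        parts.append(name)
        for import_path, kwargs in branch:
            class_name = import_path.rsplit(".", 1)[1]
            # Keep only the alphanumeric characters of each option value
            options = [
                f"{key}={''.join(c for c in str(value) if c.isalnum())}"
                for key, value in kwargs.items()
            ]
            parts.append("_".join([class_name] + options))
    return "__".join(parts)


example = {
    "ligand": [
        ["kinoml.features.ligand.SmilesToLigandFeaturizer", {"style": "rdkit"}],
        ["kinoml.features.ligand.MorganFingerprintFeaturizer", {"nbits": 512, "radius": 2}],
    ]
}
slug = pipeline_slug(example)
print(slug)
```

Because non-alphanumeric characters are dropped, distinct values can collapse to the same name (for instance `1.5` and `15` both become `15`), which is worth keeping in mind when sweeping over float-valued options.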


In [5]:
# Nasty trick: save all-caps local variables (CONSTANTS working as hyperparameters) so far in a dict to save it later
_hparams = {key: value for key, value in locals().items() if key.upper() == key and not key.startswith(("_", "OE_"))}
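
The same filter can be tried outside the notebook on a plain dictionary. This is a minimal sketch: the `namespace` contents are made up for illustration and `collect_hparams` is a hypothetical helper. It also shows why `default=str` matters when the values are later dumped to JSON, since objects such as `Path` are not JSON-serializable by themselves.

```python
import json
from pathlib import Path


def collect_hparams(namespace):
    """Pick ALL_CAPS names, skipping those starting with '_' or 'OE_'."""
    return {
        key: value
        for key, value in namespace.items()
        if key.upper() == key and not key.startswith(("_", "OE_"))
    }


# Made-up namespace standing in for locals()
namespace = {
    "DATASET_CLS": "import.path.to.DatasetProvider",
    "HERE": Path("/tmp/example"),
    "_PRIVATE": 1,
    "OE_SOMETHING": "skipped",
    "parent": "a lowercase loop variable",
}

hparams = collect_hparams(namespace)
# default=str converts anything json cannot handle (here, the Path) to a string
serialized = json.dumps(hparams, default=str, indent=2)
print(serialized)
```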

## Setup is finished, start working

In [6]:
from warnings import warn
import os
import sys
from pathlib import Path
from datetime import datetime

import numpy as np

from kinoml.utils import seed_everything, import_object
seed_everything();
print("Run started at", datetime.now())

Run started at 2020-12-07 12:00:14.946374


## Load raw data

In [7]:
dataset = import_object(DATASET_CLS).from_source(**DATASET_KWARGS)
dataset



<PKIS2DatasetProvider with 261870 PercentageDisplacementMeasurement measurements and 257920 systems (AminoAcidSequence=403, SmilesLigand=640)>
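
The hyperparameters refer to classes by dotted import path, and `import_object` turns those strings into Python objects. The sketch below shows one way such a helper could be written with `importlib`. It is an assumption about the general technique, not `kinoml`'s actual implementation, and `import_object_sketch` is a hypothetical name.

```python
import importlib


def import_object_sketch(import_path):
    """Resolve 'package.module.Name' to the object called Name."""
    module_path, _, name = import_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected a dotted path, got {import_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, name)


# A standard-library class keeps the example runnable anywhere
ordered_dict_cls = import_object_sketch("collections.OrderedDict")
instance = ordered_dict_cls(a=1)
print(type(instance).__name__)
```

Keeping classes as strings is what allows the whole configuration to live in a YAML or JSON file.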

In [8]:
df = dataset.to_dataframe()
df

| | Systems | n_components | PercentageDisplacementMeasurement |
|---|---|---|---|
| 0 | `AAK1 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C2...` | 2 | 14.0 |
| 1 | `ABL1-nonphosphorylated & Clc1cccc(Cn2c(nn3c2nc...` | 2 | 28.0 |
| 2 | `ABL1-nonphosphorylated & Clc1cccc(Cn2c(nn3c2nc...` | 2 | 20.0 |
| 3 | `ABL2 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C2...` | 2 | 5.0 |
| 4 | `ACVR1 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C...` | 2 | 0.0 |
| ... | ... | ... | ... |
| 261865 | `ZAP70 & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)cc12)...` | 2 | 0.0 |
| 261866 | `p38-alpha & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)c...` | 2 | 0.0 |
| 261867 | `p38-beta & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)cc...` | 2 | 0.0 |
| 261868 | `p38-delta & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)c...` | 2 | 0.0 |


## Featurize

In [9]:
# build pipeline
from kinoml.features.core import Pipeline

featurizers = []
for key, featurizer_instructions in PIPELINE.items():
    featurizers.append(Pipeline([import_object(import_str)(**kwargs) for import_str, kwargs in featurizer_instructions]))
featurizer = import_object(PIPELINE_AGG)(featurizers, **PIPELINE_AGG_KWARGS)
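
Conceptually, each branch of `PIPELINE` is a chain of steps applied one after another, and the aggregator combines the results of all branches. The toy classes below sketch that idea with plain Python callables. `ToyPipeline` and `ToyConcatenated` are hypothetical stand-ins, and the assumption that every branch receives the same input is ours, not a statement about how `kinoml.features.core` is implemented.

```python
class ToyPipeline:
    """Apply a list of callables in order, feeding each output to the next."""

    def __init__(self, steps):
        self.steps = steps

    def featurize(self, item):
        for step in self.steps:
            item = step(item)
        return item


class ToyConcatenated:
    """Run several pipelines on the same input and join their outputs."""

    def __init__(self, pipelines):
        self.pipelines = pipelines

    def featurize(self, item):
        features = []
        for pipeline in self.pipelines:
            features.extend(pipeline.featurize(item))
        return features


# Two made-up branches operating on a SMILES-like string
count_atoms = ToyPipeline([str.upper, lambda s: [s.count("C"), s.count("O")]])
length_only = ToyPipeline([lambda s: [len(s)]])

toy_featurizer = ToyConcatenated([count_atoms, length_only])
print(toy_featurizer.featurize("c1ccccc1O"))
```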

In [10]:
# prefeaturize everything
dataset.featurize(featurizer, **FEATURIZE_KWARGS);



## Filter

Remove systems that couldn't be featurized. Successful featurizations are stored in `measurement.system.featurizations['last']`, so we test for the existence of that key.

In [11]:
from kinoml.datasets.groups import CallableGrouper, RandomGrouper
grouper = CallableGrouper(lambda measurement: 'invalid' if 'last' not in measurement.system.featurizations else 'valid')
grouper.assign(dataset, overwrite=True, progress=False)
groups = dataset.split_by_groups()
if "invalid" in groups:
    _invalid = groups.pop("invalid")
    warn(f"{len(_invalid)} entries could not be featurized! Possible errors:")
    warn(f"{_invalid[0].system.featurizations}")
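
The key-existence test can be illustrated without any `kinoml` objects. In this sketch each system is reduced to its `featurizations` dictionary. The `"failed"` key and its message are made up for the example, and `split_valid_invalid` is a hypothetical helper.

```python
def split_valid_invalid(featurization_dicts, key="last"):
    """Return the positions of entries with and without `key`."""
    valid, invalid = [], []
    for index, featurizations in enumerate(featurization_dicts):
        (valid if key in featurizations else invalid).append(index)
    return valid, invalid


records = [
    {"last": [0, 1, 1, 0]},
    {"failed": "could not parse SMILES"},  # hypothetical failure record
    {"last": [1, 0, 0, 1]},
]

valid, invalid = split_valid_invalid(records)
print(f"{len(invalid)} of {len(records)} entries lack the 'last' key")
```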

## Groups

Cumulatively apply groups.

In [12]:
groups[("valid",)] = groups.pop("valid")
for grouper_str, grouper_kwargs in GROUPS:
    grouper_cls = import_object(grouper_str)
    ## We need this because lambda functions are not JSON-serializable
    if issubclass(grouper_cls, CallableGrouper):
        for k, v in list(grouper_kwargs.items()):
            if k == "function" and isinstance(v, str):
                grouper_kwargs[k] = eval(v)  # sorry :)
    ## End of lambda hack
    grouper = grouper_cls(**grouper_kwargs)        
    for group_key in list(groups.keys()):
        grouper.assign(groups[group_key], overwrite=True, progress=False)
        for subkey, subgroup in groups.pop(group_key).split_by_groups().items():
            groups[group_key + (subkey,)] = subgroup
print("10 groups to show keys:", *list(groups.keys())[:10], sep="\n")

10 groups to show keys:
('valid', 'AAK1', 'PercentageDisplacementMeasurement')
('valid', 'ABL1-nonphosphorylated', 'PercentageDisplacementMeasurement')
('valid', 'ABL2', 'PercentageDisplacementMeasurement')
('valid', 'ACVR1', 'PercentageDisplacementMeasurement')
('valid', 'ACVR1B', 'PercentageDisplacementMeasurement')
('valid', 'ACVR2A', 'PercentageDisplacementMeasurement')
('valid', 'ACVR2B', 'PercentageDisplacementMeasurement')
('valid', 'ACVRL1', 'PercentageDisplacementMeasurement')
('valid', 'ADCK3', 'PercentageDisplacementMeasurement')
('valid', 'ADCK4', 'PercentageDisplacementMeasurement')
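
Cumulative grouping means every existing group is split again by the next grouper, with the new label appended to the tuple key. The sketch below reproduces that behaviour on plain dictionaries. The measurements are toy data, `refine_groups` is a hypothetical helper, and the grouping functions are given as strings and passed through `eval`, as in the cell above, which should only ever be done with trusted input.

```python
def refine_groups(groups, key_function):
    """Split every group by key_function, extending each tuple key."""
    refined = {}
    for group_key, members in groups.items():
        for member in members:
            new_key = group_key + (key_function(member),)
            refined.setdefault(new_key, []).append(member)
    return refined


# Toy measurements standing in for kinoml measurement objects
measurements = [
    {"protein": "AAK1", "kind": "PercentageDisplacementMeasurement"},
    {"protein": "ABL2", "kind": "PercentageDisplacementMeasurement"},
    {"protein": "AAK1", "kind": "PercentageDisplacementMeasurement"},
]

# Functions arrive as strings so the configuration stays serializable
grouper_sources = ["lambda m: m['protein']", "lambda m: m['kind']"]

toy_groups = {("valid",): measurements}
for source in grouper_sources:
    toy_groups = refine_groups(toy_groups, eval(source))  # trusted input only

for key, members in toy_groups.items():
    print(key, len(members))
```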


## Write tensors to disk

Output files are written to `_output/<PIPELINE>/<DATASET>/<GROUP>.npz`.

Each `npz` will contain two `np.ndarray` objects: `X` (featurized systems) and `y` (associated measurements), plus the train/test/validation indices.
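
The train/test/validation indices can be thought of as a random partition of the row positions. The following standard-library sketch shows one way to produce such a partition from ratios like those in `TRAIN_TEST_VAL_KWARGS`. It is not the `RandomGrouper` implementation; `random_split_indices`, its `seed` argument, and the rounding rule are assumptions made for illustration.

```python
import random


def random_split_indices(n_items, ratios, seed=0):
    """Shuffle range(n_items) and cut it according to `ratios`."""
    if abs(sum(ratios.values()) - 1.0) > 1e-9:
        raise ValueError("ratios must add up to 1")
    order = list(range(n_items))
    random.Random(seed).shuffle(order)

    indices, start = {}, 0
    names = list(ratios)
    for name in names[:-1]:
        stop = start + int(round(n_items * ratios[name]))
        indices[name] = sorted(order[start:stop])
        start = stop
    # Whatever is left goes to the last split, so nothing is dropped
    indices[names[-1]] = sorted(order[start:])
    return indices


split = random_split_indices(10, {"idx_train": 0.8, "idx_test": 0.1, "idx_val": 0.1})
print({name: len(members) for name, members in split.items()})
```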

In [13]:
random_grouper = RandomGrouper(TRAIN_TEST_VAL_KWARGS)

for group, ds in sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True):
    indices = random_grouper.indices(ds)
    X = np.asarray(ds.featurized_systems())
    y = ds.measurements_as_array()
    np.savez(OUT / f"{'__'.join([g for g in group if g != 'valid'])}.npz", X=X, y=y.astype("float32"), **indices)
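
Downstream code only needs NumPy to consume these files. The round trip below writes a toy archive to a temporary directory and reads it back. The array contents are made up, and the key names `idx_train`, `idx_test` and `idx_val` are assumed to match the keys of `TRAIN_TEST_VAL_KWARGS`; the actual keys are whatever `random_grouper.indices(ds)` returns.

```python
import tempfile
from pathlib import Path

import numpy as np

# Toy stand-ins: 4 systems with 8-bit fingerprints and one measurement each
X = np.array([[0, 1, 0, 0, 1, 0, 1, 1]] * 4, dtype="uint8")
y = np.array([14.0, 28.0, 20.0, 5.0])
indices = {
    "idx_train": np.array([0, 1]),
    "idx_test": np.array([2]),
    "idx_val": np.array([3]),
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.npz"
    np.savez(path, X=X, y=y.astype("float32"), **indices)
    # Read every array into memory before the temporary file disappears
    with np.load(path) as data:
        loaded = {name: data[name] for name in data.files}

# Select the training rows with the stored indices
X_train = loaded["X"][loaded["idx_train"]]
y_train = loaded["y"][loaded["idx_train"]]
print(sorted(loaded), X_train.shape, y_train.dtype)
```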

In [14]:
print("Run finished at", datetime.now())

Run finished at 2020-12-07 12:09:14.766504


# Reproducibility logs

In [15]:
from kinoml.utils import watermark
w = watermark()

Watermark
---------
numpy 1.18.5
last updated: 2020-12-07 12:09:15 CET 2020-12-07T12:09:15+01:00

CPython 3.7.8
IPython 7.18.1

compiler   : GCC 7.5.0
system     : Linux
release    : 4.19.128-microsoft-standard
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit
host name  : jrodriguez
Git hash   : 0c851d0b018bd4259c68517889ccd7f8c3dda212
watermark 2.0.2

conda
-----


sys.version: 3.7.6 | packaged by conda-forge | (defau...
sys.prefix: /opt/miniconda
sys.executable: /opt/miniconda/bin/python
conda location: /opt/miniconda/lib/python3.7/site-packages/conda
conda-build: /opt/miniconda/bin/conda-build
conda-convert: /opt/miniconda/bin/conda-convert
conda-debug: /opt/miniconda/bin/conda-debug
conda-develop: /opt/miniconda/bin/conda-develop
conda-env: /opt/miniconda/bin/conda-env
conda-index: /opt/miniconda/bin/conda-index
conda-inspect: /opt/miniconda/bin/conda-inspect
conda-metapackage: /opt/miniconda/bin/conda-metapackage
conda-render: /opt/miniconda/bin/conda-render
conda-server: /opt/miniconda/bin/conda-server
conda-skeleton: /opt/miniconda/bin/conda-skeleton
conda-smithy: /opt/miniconda/bin/conda-smithy
user site dirs: ~/.local/lib/python3.8

CIO_TEST: <not set>
CONDA_DEFAULT_ENV: kinoml-ci
CONDA_EXE: /opt/miniconda/bin/conda
CONDA_PREFIX: /home/jaime/.conda/envs/kinoml-ci
CONDA_PREFIX_1: /opt/miniconda
CONDA_PREFIX_2: /home/jaime/.conda/envs/teach

# packages in environment at /home/jaime/.conda/envs/kinoml-ci:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
_py-xgboost-mutex         2.0                       cpu_0    conda-forge
absl-py                   0.10.0                   pypi_0    pypi
alabaster                 0.7.12                   pypi_0    pypi
amberlite                 16.0                     pypi_0    pypi
ambertools                20.9                     pypi_0    pypi
ansiwrap                  0.8.4                      py_0    conda-forge
appdirs                   1.4.4              pyh9f0ad1d_0    conda-forge
argon2-cffi               20.1.0           py37h8f50634_2    conda-forge
arpack                    3.7.0                hc6cf775_2    conda-forge
ase                       3.20.1                   pypi_0    pypi
astroid                   

In [16]:
%%capture cap --no-stderr
w = watermark()

In [17]:
import json

with open(OUT / "watermark.txt", "w") as f:
    f.write(cap.stdout)

with open(OUT / "hparams.json", "w") as f:
    json.dump(_hparams, f, default=str, indent=2)