#  Molecular Fingerprints

- A molecular fingerprint is a very simple representation that often works well for small drug-like molecules.

- [DeepChem](https://github.com/deepchem/deepchem/tree/master/examples/tutorials)

In [1]:
!pip install --pre deepchem

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting deepchem
  Downloading deepchem-2.6.1-py3-none-any.whl (608 kB)
Collecting rdkit-pypi
  Downloading rdkit_pypi-2022.3.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.8 MB)
Installing collected packages: rdkit-pypi, deepchem
Successfully installed deepchem-2.6.1 rdkit-pypi-2022.3.5


In [2]:
import deepchem as dc
dc.__version__

'2.6.1'

# What is a Fingerprint?

- Many (but not all) types of models require their inputs to have a fixed size.  This can be a challenge for molecules, since different molecules have different numbers of atoms.  

- Fingerprints are designed to address this problem.  A fingerprint is a fixed-length array, where different elements indicate the presence of different features in the molecule.  If two molecules have similar fingerprints, that indicates they contain many of the same features.

- The fingerprint type used here is the "Extended Connectivity Fingerprint", or "ECFP".
 - ECFPs are also sometimes called "circular fingerprints".
 - The ECFP algorithm begins by classifying atoms based only on their direct properties and bonds.  Each unique pattern is a feature.
 - For example, "carbon atom bonded to two hydrogens and two heavy atoms" would be a feature, and the corresponding element of the fingerprint is set for any molecule that contains it.
 - It then iteratively identifies new features by looking at larger circular neighborhoods.  One specific feature bonded to two other specific features becomes a higher-level feature, and the corresponding element is set for any molecule that contains it.  This continues for a fixed number of iterations, most often two.
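
The iterative idea behind circular fingerprints can be sketched in a few lines of pure Python. This is a toy illustration only, not the ECFP implementation used by RDKit or DeepChem: the molecule encoding (a list of element symbols plus a list of bonds), the helper name `toy_circular_fingerprint`, and the use of CRC32 as the hash function are all assumptions made for the example.

```python
import zlib


def toy_circular_fingerprint(atoms, bonds, size=1024, radius=2):
    """Toy hashed circular fingerprint (illustrative only, not real ECFP).

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs for bonded heavy atoms
    """
    # Build an adjacency list from the bond list.
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Iteration 0: classify each atom by its own properties only
    # (here just the element and the number of heavy-atom neighbors).
    labels = [f"{atoms[i]}|{len(neighbors[i])}" for i in range(len(atoms))]

    fingerprint = [0] * size
    for _ in range(radius + 1):
        # Every label is a feature: hash it to an index and set that element.
        for label in labels:
            fingerprint[zlib.crc32(label.encode("utf-8")) % size] = 1
        # Grow the neighborhood: an atom's label combined with the sorted
        # labels of its neighbors becomes a higher-level feature.
        labels = [
            labels[i] + "(" + ",".join(sorted(labels[j] for j in neighbors[i])) + ")"
            for i in range(len(atoms))
        ]
    return fingerprint


# Hypothetical heavy-atom skeletons of three molecules of different sizes.
ethanol = toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = toy_circular_fingerprint(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
octane = toy_circular_fingerprint(["C"] * 8, [(i, i + 1) for i in range(7)])

print(len(ethanol), len(octane))  # 1024 1024
shared_bits = sum(a & b for a, b in zip(ethanol, propanol))
print("bits shared by ethanol and propanol:", shared_bits)
```

Every molecule maps to an array of the same length regardless of its atom count, and molecules with common substructures set some of the same elements. Because features are hashed into a fixed number of slots, two different features can occasionally land on the same element.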

Let's take a look at a dataset that has been featurized with ECFP.

In [6]:
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='ECFP')
train_dataset, valid_dataset, test_dataset = datasets
print(train_dataset)

<DiskDataset X.shape: (6264, 1024), y.shape: (6264, 12), w.shape: (6264, 12), task_names: ['NR-AR' 'NR-AR-LBD' 'NR-AhR' ... 'SR-HSE' 'SR-MMP' 'SR-p53']>


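- The feature array `X` has shape (6264, 1024): the training set contains 6264 samples, each represented by a fingerprint of length 1024.
- The label array `y` has shape (6264, 12): this is a multitask dataset.
 - Tox21 contains information about the toxicity of molecules.  12 different assays were used to look for signs of toxicity, so each sample has 12 labels, one per assay.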

Let's also take a look at the weights array.

In [7]:
train_dataset.w

array([[1.04502242, 1.03632599, 1.12502653, ..., 1.05576503, 1.17464996,
        1.05288369],
       [1.04502242, 1.03632599, 1.12502653, ..., 1.05576503, 1.17464996,
        1.05288369],
       [1.04502242, 1.03632599, 1.12502653, ..., 1.05576503, 0.        ,
        1.05288369],
       ...,
       [1.04502242, 0.        , 1.12502653, ..., 1.05576503, 6.7257384 ,
        1.05288369],
       [1.04502242, 1.03632599, 1.12502653, ..., 1.05576503, 6.7257384 ,
        1.05288369],
       [1.04502242, 1.03632599, 1.12502653, ..., 0.        , 1.17464996,
        1.05288369]])

- Notice that some elements are 0.  The weights are being used to indicate missing data.  
 - Not all assays were actually performed on every molecule.  Setting the weight for a sample or sample/task pair to 0 causes it to be ignored during fitting and evaluation.  
 - It will have no effect on the loss function or other metrics.

- Most of the other weights are close to 1, but not exactly 1.  This is done to balance the overall weight of positive and negative samples on each task.  
 - When training the model, we want each of the 12 tasks to contribute equally, and on each task we want to put equal weight on positive and negative samples.  
 - Otherwise, the model might just learn that most of the training samples are non-toxic, and therefore become biased toward identifying other molecules as non-toxic.
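
To make the role of the weights concrete, here is a minimal NumPy sketch of one way balancing weights could be computed for a single task, and of how a zero weight removes an entry from a weighted loss. The helper names `balancing_weights` and `weighted_log_loss` and the tiny arrays are hypothetical, and DeepChem's own balancing transformer may compute its weights differently.

```python
import numpy as np


def balancing_weights(y, observed):
    """Give positives and negatives equal total weight; missing entries get 0.

    Assumes at least one observed positive and one observed negative.
    """
    y = np.asarray(y, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    w = np.zeros_like(y)
    pos = observed & (y == 1)
    neg = observed & (y == 0)
    n_obs = observed.sum()
    # Each class receives half of the total weight, split evenly over its members.
    w[pos] = n_obs / (2.0 * pos.sum())
    w[neg] = n_obs / (2.0 * neg.sum())
    return w


def weighted_log_loss(y, p, w):
    """Weighted mean binary cross-entropy; zero-weight entries contribute nothing."""
    y, p, w = (np.asarray(a, dtype=float) for a in (y, p, w))
    per_sample = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return float(np.sum(w * per_sample) / np.sum(w))


# One task, six molecules; the assay was not run on the last molecule.
y = np.array([0, 0, 0, 0, 1, 0])
observed = np.array([True, True, True, True, True, False])
w = balancing_weights(y, observed)
print(w)  # [0.625 0.625 0.625 0.625 2.5   0.   ]

# Changing the prediction for the missing entry leaves the loss unchanged.
loss_a = weighted_log_loss(y, [0.1, 0.2, 0.1, 0.3, 0.8, 0.50], w)
loss_b = weighted_log_loss(y, [0.1, 0.2, 0.1, 0.3, 0.8, 0.99], w)
print(np.isclose(loss_a, loss_b))  # True
```

In this sketch the single positive sample carries as much total weight as the four negatives combined, so a model cannot lower the loss simply by predicting "non-toxic" everywhere.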

# Training a Model on Fingerprints

- Because fingerprints are so simple, just a single fixed-length array per molecule, we can train a simple type of model on them.

In [8]:
model = dc.models.MultitaskClassifier(n_tasks=12, n_features=1024, layer_sizes=[1000])

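- `MultitaskClassifier` is a simple stack of fully connected layers (i.e. a multilayer perceptron, or MLP).
- In this example we tell it to use a single hidden layer of width 1000, to expect 1024 input features, and to produce predictions for 12 tasks.

- We could train a separate model for each task, but it turns out that training a single model for multiple tasks often works better.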
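
As a rough picture of what a model with these settings computes, the following NumPy sketch runs a forward pass through one hidden layer of width 1000 and produces one probability per task. It is an illustration under simplifying assumptions, not DeepChem's implementation: the weights are random and untrained, each task gets a single sigmoid output, and the name `forward` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_tasks = 1024, 1000, 12

# Randomly initialized parameters (a real model would learn these).
W1 = rng.normal(scale=0.01, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.01, size=(n_hidden, n_tasks))
b2 = np.zeros(n_tasks)


def forward(X):
    """Map a batch of fingerprints to one probability per task."""
    hidden = np.maximum(0.0, X @ W1 + b1)  # hidden layer shared by all tasks (ReLU)
    logits = hidden @ W2 + b2              # one output unit per task
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid


# A batch of 4 random binary "fingerprints" standing in for real molecules.
X = rng.integers(0, 2, size=(4, n_features)).astype(float)
probs = forward(X)
print(probs.shape)  # (4, 12)
```

The hidden layer is shared by all 12 outputs, which is what lets information learned for one task help the others.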

Let's train and evaluate the model.

In [9]:
import numpy as np

# Train for 10 epochs.
model.fit(train_dataset, nb_epoch=10)
# Score predictions with ROC AUC.
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print('training set score:', model.evaluate(train_dataset, [metric], transformers))
print('test set score:', model.evaluate(test_dataset, [metric], transformers))

training set score: {'roc_auc_score': 0.95858052248688}
test set score: {'roc_auc_score': 0.6862637477187774}


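Not bad performance for such a simple model and featurization, although the gap between the training score (0.96) and the test score (0.69) shows that the model overfits the training set.  More sophisticated models do slightly better on this dataset, but not enormously better.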
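
The scores reported here are ROC AUC values. As a reminder of what that metric measures, this sketch computes ROC AUC for one task as the fraction of positive/negative pairs in which the positive sample receives the higher score, then averages over tasks while skipping zero-weight entries. The helper names and toy arrays are hypothetical, and DeepChem's `Metric` class may handle weights and task averaging in its own way.

```python
import numpy as np


def roc_auc(y_true, scores):
    """ROC AUC as P(positive scored above negative); ties count as 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def mean_roc_auc(y, scores, w):
    """Average the per-task ROC AUC, ignoring entries whose weight is 0."""
    per_task = []
    for t in range(y.shape[1]):
        keep = w[:, t] != 0
        per_task.append(roc_auc(y[keep, t], scores[keep, t]))
    return float(np.mean(per_task))


# Toy data: 4 molecules, 2 tasks; the last molecule has no label for task 1.
y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
scores = np.array([[0.9, 0.2], [0.5, 0.6], [0.4, 0.7], [0.1, 0.8]])
w = np.array([[1, 1], [1, 1], [1, 1], [1, 0]], dtype=float)

print(mean_roc_auc(y, scores, w))  # 0.875
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect ranking of positives above negatives.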