# Applicability Domain Workflow

### Introduction

A workflow for estimating the Applicability Domain (AD) of a model based on Euclidean distances.

The AD of a QSAR model must be defined to flag test-set compounds for which predictions may be unreliable. In this workflow, the AD is defined from the Euclidean distances among all compounds of the training set. We then estimate whether each compound of the test set lies inside the AD of the training set: the distance from a test compound to its nearest neighbor in the training set is compared with the calculated AD Threshold (ADT) of the training set. If the distance exceeds this threshold, the prediction is considered unreliable.
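The nearest-neighbor check described above can be sketched with NumPy alone. This is a minimal illustration, not the workflow's own code: the helper name `outside_ad`, the toy coordinates, and the fixed `threshold=2.0` are assumptions chosen only to make the comparison visible.

```python
import numpy as np

def outside_ad(test_X, train_X, threshold):
    """Flag test compounds whose nearest training neighbor is farther than threshold."""
    test_X = np.asarray(test_X, dtype=float)
    train_X = np.asarray(train_X, dtype=float)
    # Pairwise Euclidean distances via broadcasting: shape (n_test, n_train)
    diff = test_X[:, None, :] - train_X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Distance from each test compound to its closest training compound
    nearest = dist.min(axis=1)
    return nearest > threshold

# Hypothetical 2-D descriptors (illustration only)
toy_train = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
toy_test = [[0.5, 0.0], [10.0, 0.0]]

flags = outside_ad(toy_test, toy_train, threshold=2.0)
print(flags)  # True marks a compound outside the AD
```

Here the first test compound sits 0.5 away from its nearest training compound and is inside the AD, while the second is 7.0 away and is flagged.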

ADT is calculated as follows: 

ADT = D + Zσ 

where Z is a similarity threshold parameter defined by the user (default is 0.5), and D and σ are the mean and standard deviation, respectively, of the Euclidean distances in the multidimensional descriptor space between each compound of the training set and its nearest neighbors.
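As a worked example of the formula, the sketch below computes ADT for a small made-up training set. The helper name `adt_from_training`, the toy coordinates, and the choice of `k=1` (one nearest neighbor per compound) are assumptions for illustration; the number of neighbors used in practice is a modeling decision.

```python
import numpy as np

def adt_from_training(X, z=0.5, k=1):
    """Return D + z * sigma over each compound's k nearest-neighbor distances."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances via broadcasting: shape (n, n)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Sort each row; column 0 is the zero self-distance, so skip it
    nn = np.sort(dist, axis=1)[:, 1:k + 1]
    return nn.mean() + z * nn.std()

# Hypothetical descriptors: four compounds on a line at x = 0, 1, 3, 7
toy_train = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]]

# Nearest-neighbor distances are 1, 1, 2, 4, so D = 2.0 and sigma = sqrt(1.5)
adt = adt_from_training(toy_train, z=0.5, k=1)
print(adt)
```

A larger Z widens the domain: with Z = 0 the threshold is simply the mean nearest-neighbor distance D.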

This method was described by A. Golbraikh, M. Shen, Z. Xiao, Y.-D. Xiao, K.-H. Lee, A. Tropsha. J. Comput. Aided Mol. Des. 2003, 17 (2–4), 241–253.

In [1]:
import pandas as pd
import numpy as np
from rdkit.Chem import PandasTools
from rdkit.Chem.AllChem import GetMorganFingerprintAsBitVect
from sklearn.model_selection import train_test_split
from scipy.spatial import distance_matrix

### ADT function

In [2]:
def calc_training_dist_matrix(training_descriptors):
    return np.sort(distance_matrix(training_descriptors,training_descriptors),axis=1)[:,1:]

def calc_d_cutoff(distances, user_cutoff=0.5):
    # ADT = mean + Z * std over all supplied distances (Z = user_cutoff)
    average_dist = np.mean(distances, axis=None)
    std_dev = np.std(distances, axis=None)
    return average_dist + user_cutoff * std_dev

def calc_test_distances(testing_descriptors,training_descriptors):
    return distance_matrix(testing_descriptors,training_descriptors)

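def test_against_cutoff(distances_for_compound, threshold):
    # Returns True when the compound's distance statistic exceeds the threshold,
    # i.e. when the compound falls outside the AD
    test_value = calc_d_cutoff(distances_for_compound)
    return bool(test_value > threshold)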

### Import curated datasets

In [3]:
train = PandasTools.LoadSDF('training_set.sdf')
train.head(1)

| | Activity | ID | MOL | Outcome | ROMol | Set |
|---|---|---|---|---|---|---|
| 0 | -0.12 | | ms_1 | Inactive | | ms |


In [4]:
test = PandasTools.LoadSDF('test_set.sdf')
test.head(1)

| | Activity | ID | MOL | Outcome | ROMol | Set |
|---|---|---|---|---|---|---|
| 0 | -0.16 | | es1_1 | Inactive | | es1 |


### Calculate descriptors

In [6]:
# Training set
def calcfp(mol, funcFPInfo=dict(radius=3, nBits=2048, useFeatures=False, useChirality=False)):
    # Morgan fingerprint as a pandas Series with one 0/1 entry per bit (bit_0 ... bit_2047)
    fp = GetMorganFingerprintAsBitVect(mol, **funcFPInfo)
    fp = pd.Series(np.asarray(fp))
    fp = fp.add_prefix('bit_')
    return fp

morgan_train = train.ROMol.apply(calcfp)
morgan_train.shape

(644, 2048)

In [7]:
# Test set (reuses calcfp defined in the previous cell)

morgan_test = test.ROMol.apply(calcfp)
morgan_test.shape

(449, 2048)
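
Because these descriptors are binary fingerprints, the Euclidean distance between two compounds equals the square root of the number of bits on which they differ. The short sketch below checks this on two made-up 8-bit vectors; the vectors `fp_a` and `fp_b` are hypothetical and stand in for rows of `morgan_train` or `morgan_test`.

```python
import numpy as np

# Hypothetical 8-bit fingerprints (illustration only)
fp_a = np.array([1, 0, 1, 1, 0, 0, 1, 0])
fp_b = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# Euclidean distance in descriptor space
euclid = np.linalg.norm(fp_a - fp_b)

# Number of positions where the two fingerprints disagree
hamming = int(np.sum(fp_a != fp_b))

print(euclid, np.sqrt(hamming))  # the two values agree
```

This is worth keeping in mind when interpreting the threshold: distances here grow with the square root of the bit differences, not linearly.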

### Prepare data for modeling


In [8]:
y_train = train['Activity']
X_train = morgan_train

y_test = test['Activity']
X_test = morgan_test

### Estimate the ADT of the model

In [9]:
distances = calc_training_dist_matrix(X_train)

In [None]:
D_cutoff = calc_d_cutoff(distances, user_cutoff=0.5)
print(D_cutoff)

### Determine whether molecules in the test set are in the AD of the model