## Generating QSAR Datasets using ComptoxAI
### Author: Joseph D. Romano, PhD (joseph.romano@pennmedicine.upenn.edu)

ComptoxAI makes it easy to assemble datasets for Quantitative Structure-Activity Relationship (QSAR) modeling. We currently provide chemical fingerprints as MACCS keys (a 166-bit array of binary features, each indicating the presence or absence of a specific structural characteristic), and will likely add other fingerprint formats in the near future. Activity endpoints can be defined dynamically, based either on specific chemical features (e.g., ontology annotations) or on relationships (edges) to other entities in ComptoxAI's knowledge graph.
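
To make the fingerprint format concrete, here is a minimal sketch of how binary fingerprints become a feature matrix. It assumes each fingerprint arrives as a string of `0`/`1` characters; that storage format, the helper name `bitstring_to_array`, and the shortened 8-bit example fingerprints are all illustrative assumptions, not ComptoxAI's documented behavior.

```python
import numpy as np


def bitstring_to_array(bits: str) -> np.ndarray:
    """Convert a string of '0'/'1' characters into a 1-D uint8 array."""
    if set(bits) - {"0", "1"}:
        raise ValueError("fingerprint must contain only '0' and '1'")
    return np.array([int(b) for b in bits], dtype=np.uint8)


# Hypothetical, shortened fingerprints (real MACCS keys are much longer).
fingerprints = {
    "chem_a": "10010010",
    "chem_b": "00010011",
}

# Stack one row per chemical: shape is (n_chemicals, n_bits).
X_example = np.vstack([bitstring_to_array(fp) for fp in fingerprints.values()])
print(X_example.shape)
```

Each column then corresponds to one structural key, so a model sees the presence or absence of each key as a separate binary feature.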

Here, we generate a dataset of water contaminants and train a model to predict their activity in the `tox21-pxr-p1` assay.

In [19]:
from comptox_ai.db import GraphDB

import pandas as pd
import numpy as np

db = GraphDB(hostname="comptox.ai")

In [2]:
# Example query:
# MATCH (n3:ChemicalList)-[r2]-(n1:Chemical)-[r1]-(n2:Assay)
# WHERE type(r1) = "CHEMICALHASACTIVEASSAY" AND n1.maccs IS NOT NULL AND n3.listAcronym = "EPAPCS"
# RETURN n2.commonName, count(*) AS ct
# ORDER BY ct DESC;

from comptox_ai.db import GraphDB
from comptox_ai.db.io import TabularExporter

g = GraphDB(hostname="comptox.ai")
exporter = TabularExporter(g)
df = exporter.stream_tabular_dataset()

Writing Cypher transaction: 
  CALL apoc.meta.stats();
Writing Cypher transaction: 
  CALL apoc.meta.graph();
Writing Cypher transaction: 
  CALL apoc.meta.graph();
CYPHER QUERY:
MATCH (sf:ChemicalList)-[r2]-(s:Chemical)-[r1]-(t:Assay) WHERE s.maccs IS NOT NULL AND sf.listAcronym = "EPAPCS" AND t.commonName = "tox21-pxr-p1" RETURN s.commonName AS name, s.maccs AS maccs, type(r1) AS rel_type;
Writing Cypher transaction: 
  MATCH (sf:ChemicalList)-[r2]-(s:Chemical)-[r1]-(t:Assay) WHERE s.maccs IS NOT NULL AND sf.listAcronym = "EPAPCS" AND t.commonName = "tox21-pxr-p1" RETURN s.commonName AS name, s.maccs AS maccs, type(r1) AS rel_type;
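
The query above returns one row per chemical with its name, fingerprint, and relationship type. A rough sketch of how such rows could be turned into a table with a binary `y` column follows. This is not the `TabularExporter` implementation: the helper `rows_to_frame`, the example rows, the bit-string fingerprint format, and the relationship name `CHEMICALHASINACTIVEASSAY` are assumptions made for illustration. Only `CHEMICALHASACTIVEASSAY` appears in the notebook.

```python
import pandas as pd

# Relationship type that marks an active result (from the example query).
ACTIVE_REL = "CHEMICALHASACTIVEASSAY"

# Hypothetical query results; fingerprints are shortened to 4 bits.
rows = [
    {"name": "chem_a", "maccs": "1001", "rel_type": "CHEMICALHASACTIVEASSAY"},
    {"name": "chem_b", "maccs": "0110", "rel_type": "CHEMICALHASINACTIVEASSAY"},
    {"name": "chem_c", "maccs": "1100", "rel_type": "CHEMICALHASACTIVEASSAY"},
]


def rows_to_frame(rows, active_rel=ACTIVE_REL):
    """Build a DataFrame of fingerprint bits plus a binary label column 'y'."""
    frame = pd.DataFrame(
        [[int(bit) for bit in row["maccs"]] for row in rows],
        index=[row["name"] for row in rows],
    )
    frame.columns = [f"maccs_{i}" for i in frame.columns]
    # Label is 1 when the chemical is linked to the assay by the active edge.
    frame["y"] = [int(row["rel_type"] == active_rel) for row in rows]
    return frame


df_example = rows_to_frame(rows)
print(df_example)
```

Keeping chemical names in the index leaves every remaining column numeric, which matches how the cells below drop `y` to obtain the feature matrix.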


In [39]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np

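clf = RandomForestClassifier()
# Separate the fingerprint features from the activity label ('y')
X = np.array(df.drop('y', axis=1))
y = np.array(df['y'])
# Hold out 30% of the chemicals for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)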

clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)

print(score)

0.8658536585365854


In [40]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Predict once and reuse the result for all three metrics
y_pred = clf.predict(X_test)

print("Precision: ", precision_score(y_test, y_pred))
print("Recall:    ", recall_score(y_test, y_pred))
print("F-1:       ", f1_score(y_test, y_pred))

Precision:  0.8541666666666666
Recall:     0.7321428571428571
F-1:        0.7884615384615384
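
As a worked check on how these metrics relate, the sketch below recomputes them from confusion-matrix counts. The counts are not reported in the notebook. They are one set of values inferred to be consistent with the printed accuracy, precision, recall, and F-1, so treat them as an illustration rather than the actual test-set breakdown.

```python
# Inferred (not reported) confusion-matrix counts for the positive class.
tp, fp, fn, tn = 41, 7, 15, 101

precision = tp / (tp + fp)   # fraction of predicted actives that are active
recall = tp / (tp + fn)      # fraction of true actives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)

print("Precision:", precision)
print("Recall:   ", recall)
print("F-1:      ", f1)
print("Accuracy: ", accuracy)
```

Under these counts the classifier misses roughly a quarter of the active chemicals, which accuracy alone does not reveal because inactive chemicals dominate the test set.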


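## Make discovery dataset

The discovery dataset contains the chemicals on the same list that have a fingerprint but no relationship to the assay, so their activity is unknown and can be predicted with the trained model.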

In [8]:
df_discovery = exporter.stream_tabular_dataset(make_discovery_dataset=True)

CYPHER QUERY:
MATCH (sf:ChemicalList)-[r2]-(s:Chemical) WHERE s.maccs IS NOT NULL AND sf.listAcronym = "EPAPCS" AND NOT (s)-[]-(:Assay { commonName: "tox21-pxr-p1" }) RETURN s.commonName AS name, s.maccs AS maccs;
Writing Cypher transaction: 
  MATCH (sf:ChemicalList)-[r2]-(s:Chemical) WHERE s.maccs IS NOT NULL AND sf.listAcronym = "EPAPCS" AND NOT (s)-[]-(:Assay { commonName: "tox21-pxr-p1" }) RETURN s.commonName AS name, s.maccs AS maccs;


In [42]:
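# Probability of the positive (active) class for each untested chemical
discovery_pred_probs = clf.predict_proba(df_discovery)[:, 1]
print(discovery_pred_probs)

# Pair each chemical name (the DataFrame index) with its predicted probability
names_with_probs = pd.DataFrame(
    list(zip(df_discovery.index, discovery_pred_probs)),
    columns=['Chemicals', 'Probability']
)
# Show the 10 chemicals most likely to be active
print(names_with_probs.sort_values('Probability', ascending=False)[:10])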


[0.34       0.51       0.09833333 ... 0.06       0.27       0.        ]
                      Chemicals  Probability
520            Tetrachlorophene         1.00
544           zeta-Cypermethrin         0.98
1176  D-trans-beta-Cypermethrin         0.98
750     2-Chloro-6-phenylphenol         0.96
85    3,5-Dichloro-2-biphenylol         0.96
572                   Desmetryn         0.96
304                Cyanofenphos         0.95
377                 Azaconazole         0.95
507      Gardona (trans-isomer)         0.95
310          Phenmedipham-ethyl         0.94
