# CYP3A4 Substrate, Carbon-Mangels et al.

### Dataset Description: CYP3A4 is an important enzyme in the body, mainly found in the liver and in the intestine. It oxidizes small foreign organic molecules (xenobiotics), such as toxins or drugs, so that they can be removed from the body. TDC used a dataset from [1], which merged information on substrates and nonsubstrates from six publications.

### Task Description: Binary Classification. Given a drug SMILES string, predict whether it is a substrate of the CYP3A4 enzyme.

### Dataset Statistics: 667 drugs.

### Metric: AUROC

## Leaderboard

| Rank | Model                       | Contact           | Link          | #Params   | AUROC         |
|------|-----------------------------|-------------------|---------------|-----------|---------------|
| 1    | CNN (DeepPurpose)           | Kexin Huang       | GitHub, Paper | 226,625   | 0.662 ± 0.031 |
| 2    | MapLight                    | Jim Notwell       | GitHub, Paper | N/A       | 0.650 ± 0.006 |
| 3    | MapLight + GNN              | Jim Notwell       | GitHub, Paper | N/A       | 0.647 ± 0.008 |
| 4    | SimGCN                      | Suman Kalyan Bera | GitHub, Paper | 1,103,000 | 0.640 ± 0.016 |
| 5    | RDKit2D + MLP (DeepPurpose) | Kexin Huang       | GitHub, Paper | 633,409   | 0.639 ± 0.012 |
| 6    | Morgan + MLP (DeepPurpose)  | Kexin Huang       | GitHub, Paper | 1,477,185 | 0.633 ± 0.013 |
| 7    | ZairaChem                   | Gemma Turon       | GitHub, Paper | N/A       | 0.630 ± 0.008 |
| 8    | Euclia ML model             | Euclia            | GitHub, Paper | 50        | 0.629 ± 0.027 |
| 9    | Chemprop-RDKit              | Kyle Swanson      | GitHub, Paper | N/A       | 0.619 ± 0.030 |
| 10   | ContextPred                 | Kexin Huang       | GitHub, Paper | 2,067,053 | 0.609 ± 0.025 |
| 11   | Basic ML                    | Nilavo Boral      | GitHub, Paper | N/A       | 0.605 ± 0.000 |
| 12   | Chemprop                    | Kyle Swanson      | GitHub, Paper | N/A       | 0.596 ± 0.018 |
| 13   | GCN                         | Kexin Huang       | GitHub, Paper | 191,810   | 0.590 ± 0.023 |
| 14   | AttrMasking                 | Kexin Huang       | GitHub, Paper | 2,067,053 | 0.582 ± 0.021 |
| 15   | NeuralFP                    | Kexin Huang       | GitHub, Paper | 480,193   | 0.578 ± 0.020 |
| 16   | AttentiveFP                 | Kexin Huang       | GitHub, Paper | 300,806   | 0.576 ± 0.025 |
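
Each leaderboard entry is an AUROC reported as mean ± standard deviation. As a minimal sketch of how such an entry can be produced, the snippet below computes AUROC from its rank-based definition (the probability that a randomly chosen positive is scored above a randomly chosen negative) in plain Python and aggregates it over several runs. The labels and scores are made-up toy values, the helper names `auroc` and `summarize` are hypothetical, and the use of the sample standard deviation is an assumption, since the aggregation protocol is not stated in this notebook.

```python
from statistics import mean, stdev


def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(run_scores):
    """Format per-run AUROC values as 'mean ± std' (sample standard deviation, an assumption)."""
    return f"{mean(run_scores):.3f} ± {stdev(run_scores):.3f}"


# Toy data: 1 = substrate, 0 = non-substrate; three hypothetical runs of one model.
y_true = [1, 0, 1, 1, 0, 0]
runs = [
    [0.9, 0.2, 0.6, 0.4, 0.5, 0.1],
    [0.8, 0.3, 0.7, 0.2, 0.4, 0.1],
    [0.7, 0.6, 0.9, 0.5, 0.4, 0.2],
]
per_run = [auroc(y_true, scores) for scores in runs]
print(summarize(per_run))
```

The pairwise loop is quadratic in the number of molecules, which is acceptable for a dataset of a few hundred drugs; a library routine would normally be used in practice.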

In [2]:
import pandas as pd
from deepmol.pipeline import Pipeline


In [5]:
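# read the test-set results; the first CSV row holds the column names
results = pd.read_csv('cyp3a4_substrate_test_set.csv', sep=',', header=0)
# set columns
results.columns = ['trial_id', 'mean', 'std']
results

|   | trial_id | mean  | std   |
|---|----------|-------|-------|
| 0 | 74.0     | 0.655 | 0.003 |
| 1 | 75.0     | 0.654 | 0.002 |
| 2 | 76.0     | 0.649 | 0.003 |
| 3 | 77.0     | 0.656 | 0.002 |
| 4 | 44.0     | 0.654 | 0.003 |
| 5 | 72.0     | 0.656 | 0.003 |
| 6 | 73.0     | 0.653 | 0.003 |
| 7 | 91.0     | 0.657 | 0.002 |
| 8 | 71.0     | 0.650 | 0.002 |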
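
One way to rank these trials is by mean score, breaking ties with the smaller standard deviation. The sketch below does this in plain Python on the values copied from the results shown above. The tuple layout and the name `ranked` are illustrative choices, and this ordering is only a demonstration of the tie-break: the notebook itself loads trial 74, the first listed row.

```python
# (trial_id, mean, std) rows copied from the results shown above.
rows = [
    (74, 0.655, 0.003),
    (75, 0.654, 0.002),
    (76, 0.649, 0.003),
    (77, 0.656, 0.002),
    (44, 0.654, 0.003),
    (72, 0.656, 0.003),
    (73, 0.653, 0.003),
    (91, 0.657, 0.002),
    (71, 0.650, 0.002),
]

# Sort by mean descending, then by std ascending to break ties.
ranked = sorted(rows, key=lambda r: (-r[1], r[2]))

for trial_id, mean, std in ranked:
    print(f"trial {trial_id}: {mean:.3f} ± {std:.3f}")
```

With a pandas DataFrame whose columns are named `mean` and `std` (an assumption about the column names), `results.sort_values(["mean", "std"], ascending=[False, True])` should give the same ordering.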


In [6]:
# load the pipeline saved for trial 74 (first row of the results)
pipeline = Pipeline.load('cyp3a4_substrate/trial_74/')

[12:58:00] Initializing Normalizer


In [7]:
pipeline.steps

[('standardizer',
  <deepmol.standardizer.custom_standardizer.CustomStandardizer at 0x7eddb3d883d0>),
 ('featurizer',
  <deepmol.compound_featurization.rdkit_fingerprints.AtomPairFingerprint at 0x7eddb3d9e380>),
 ('scaler',
  <deepmol.base.transformer.PassThroughTransformer at 0x7effa8495750>),
 ('feature_selector',
  <deepmol.feature_selection.base_feature_selector.LowVarianceFS at 0x7eddb3cd9c90>),
 ('model',
  SklearnModel(model=BaggingClassifier(base_estimator=SVC(),
                                       bootstrap_features=True,
                                       n_estimators=450),
               model_dir='cyp3a4_substrate/trial_74/model/model.pkl'))]
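
The `feature_selector` step listed above is a `LowVarianceFS`. As a rough illustration of the idea behind low-variance filtering (not DeepMol's implementation, whose threshold and internals are not shown in this notebook), the sketch below drops fingerprint columns whose variance does not exceed a cutoff. The function names `low_variance_mask` and `apply_mask`, the threshold value, and the toy bit matrix are all assumptions made for the example.

```python
from statistics import pvariance


def low_variance_mask(X, threshold=0.0):
    """Return one boolean per column: True if the column's population variance exceeds threshold."""
    columns = list(zip(*X))  # transpose rows into columns
    return [pvariance(col) > threshold for col in columns]


def apply_mask(X, mask):
    """Keep only the columns flagged True in mask."""
    return [[v for v, keep in zip(row, mask) if keep] for row in X]


# Toy fingerprint matrix: 4 molecules x 4 bits (hypothetical values).
X = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]

mask = low_variance_mask(X, threshold=0.2)
X_reduced = apply_mask(X, mask)
print(mask)
print(X_reduced)
```

Constant bits carry no information for the classifier, and nearly constant bits carry little, so removing them shrinks the feature space that the downstream model has to handle.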

In [5]:
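# load best trial pipeline (rank #...)
# re-read the results first so this cell also runs in a fresh kernel
results = pd.read_csv('cyp3a4_substrate_test_set.csv', sep=',', header=0)
best_trial_id = int(results.iloc[0, 0])  # first column holds the trial number
pipeline = Pipeline.load(f"cyp3a4_substrate/trial_{best_trial_id}/")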