# Acute Toxicity LD50

### Dataset Description: Acute toxicity LD50 measures the most conservative dose that can lead to lethal adverse effects. LD50 is the dose that is lethal to half of the test population, so a lower LD50 indicates a more toxic drug. This dataset is kindly provided by the authors of [1].

### Task Description: Regression. Given a drug SMILES string, predict its acute toxicity.

### Dataset Statistics: 7,385 drugs.

### Metric: MAE
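
MAE is the mean absolute difference between predicted and true values, and the leaderboard reports it as mean ± standard deviation. The sketch below shows one way to compute both with the Python standard library. The numbers are illustrative only, and the use of the sample standard deviation is an assumption, since the text does not say which variant the leaderboard uses.

```python
from statistics import mean, stdev


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Illustrative values only, not taken from the dataset.
y_true = [2.0, 3.5, 1.0, 4.0]
y_pred = [2.5, 3.0, 1.0, 5.0]
mae = mean_absolute_error(y_true, y_pred)

# Hypothetical MAE values from several runs, summarised as "mean ± std".
# stdev() is the sample standard deviation (an assumption, see above).
run_maes = [0.60, 0.62, 0.61]
summary = f"{mean(run_maes):.3f} ± {stdev(run_maes):.3f}"
```

Lower values are better for this metric, so rank 1 in the table below corresponds to the smallest MAE.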

## Leaderboard



| Rank | Model                         | Contact           | Link               | #Params    | MAE             |
|------|-------------------------------|-------------------|--------------------|------------|-----------------|
| 1    | BaseBoosting KyQVZ6b2         | David Huang       | GitHub, Paper      | N/A        | 0.552 ± 0.009   |
| 2    | MACCS keys + autoML           | Alexander Scarlat | GitHub, Paper      | N/A        | 0.588 ± 0.005   |
| 3    | Chemprop                      | Kyle Swanson      | GitHub, Paper      | N/A        | 0.606 ± 0.024   |
| 4    | MapLight                      | Jim Notwell       | GitHub, Paper      | N/A        | 0.621 ± 0.003   |
| 5    | QuGIN                         | Shuai Shi         | GitHub, Paper      | 1,797,506  | 0.622 ± 0.015   |
| 6    | Chemprop-RDKit                | Kyle Swanson      | GitHub, Paper      | N/A        | 0.625 ± 0.022   |
| 7    | MapLight + GNN                | Jim Notwell       | GitHub, Paper      | N/A        | 0.633 ± 0.003   |
| 8    | Basic ML                      | Nilavo Boral      | GitHub, Paper      | N/A        | 0.636 ± 0.001   |
| 9    | Euclia ML model               | Euclia            | GitHub, Paper      | 50         | 0.646 ± 0.011   |
| 10   | GCN                           | Kexin Huang       | GitHub, Paper      | 191,810    | 0.649 ± 0.026   |
| 11   | Morgan + MLP (DeepPurpose)    | Kexin Huang       | GitHub, Paper      | 1,477,185  | 0.649 ± 0.019   |
| 12   | NeuralFP                      | Kexin Huang       | GitHub, Paper      | 480,193    | 0.667 ± 0.020   |
| 13   | ContextPred                   | Kexin Huang       | GitHub, Paper      | 2,067,053  | 0.669 ± 0.030   |
| 14   | CNN (DeepPurpose)             | Kexin Huang       | GitHub, Paper      | 226,625    | 0.675 ± 0.011   |
| 15   | RDKit2D + MLP (DeepPurpose)   | Kexin Huang       | GitHub, Paper      | 633,409    | 0.678 ± 0.003   |
| 16   | AttentiveFP                   | Kexin Huang       | GitHub, Paper      | 300,806    | 0.678 ± 0.012   |
| 17   | AttrMasking                   | Kexin Huang       | GitHub, Paper      | 2,067,053  | 0.685 ± 0.025   |

In [1]:
import pandas as pd
from deepmol.pipeline import Pipeline


In [3]:
pipeline = Pipeline.load("ld50/trial_71")

In [4]:
pipeline.steps

[('standardizer',
  <deepmol.standardizer.custom_standardizer.CustomStandardizer at 0x7f0e387b2aa0>),
 ('featurizer',
  <deepmol.compound_featurization.rdkit_fingerprints.LayeredFingerprint at 0x7f0e387b2ef0>),
 ('scaler',
  <deepmol.base.transformer.PassThroughTransformer at 0x7eec409b3a60>),
 ('feature_selector',
  <deepmol.feature_selection.base_feature_selector.PercentilFS at 0x7eec409b3640>),
 ('model',
  SklearnModel(model=VotingRegressor(estimators=[('lr', LinearRegression()),
                                                 ('svr', SVR()),
                                                 ('rfr', RandomForestRegressor()),
                                                 ('gbr',
                                                  GradientBoostingRegressor()),
                                                 ('mlpr', MLPRegressor())],
                                     weights=[0.0003242488996986417,
                                              0.31322537497355585,
              

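The final step of the pipeline shown above is a `VotingRegressor` with one weight per estimator. In scikit-learn, a weighted `VotingRegressor` predicts the weighted average of its estimators' predictions. The following pure-Python sketch illustrates only that averaging step; `weighted_vote` is a hypothetical helper and the numbers are made up, not the weights from the loaded pipeline.

```python
def weighted_vote(predictions, weights):
    """Weighted average of per-estimator predictions for each sample.

    predictions: one list of predicted values per estimator.
    weights: one non-negative weight per estimator.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per estimator")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_samples = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]


# Illustrative numbers: three estimators, two samples.
per_estimator = [[2.0, 4.0], [3.0, 5.0], [4.0, 6.0]]
combined = weighted_vote(per_estimator, weights=[0.0, 1.0, 1.0])
```

An estimator whose weight is close to zero, like the first one here, contributes almost nothing to the combined prediction.
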
In [2]:
# read the per-trial test-set results (no header row: trial id, mean, std)
results = pd.read_csv('bioavailability/tdc_test_set_results.txt', sep=',', header=None, dtype={0: int, 1: float, 2: float})
# name the columns
results.columns = ['trial_id', 'mean', 'std']
results

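|     | trial_id | mean  | std   |
|-----|----------|-------|-------|
| 0   | 0        | 0.532 | 0.000 |
| 1   | 1        | 0.521 | 0.026 |
| 2   | 2        | 0.500 | 0.000 |
| 3   | 5        | 0.500 | 0.000 |
| 4   | 6        | 0.500 | 0.000 |
| ... | ...      | ...   | ...   |
| 60  | 95       | 0.504 | 0.015 |
| 61  | 96       | 0.535 | 0.019 |
| 62  | 97       | 0.632 | 0.020 |
| 63  | 98       | 0.534 | 0.035 |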


In [3]:
# order results by mean (std in case of tie)
results = results.sort_values(by=['mean', 'std'], ascending=False)
results

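|     | trial_id | mean  | std   |
|-----|----------|-------|-------|
| 7   | 11       | 0.645 | 0.016 |
| 6   | 10       | 0.634 | 0.020 |
| 62  | 97       | 0.632 | 0.020 |
| 44  | 70       | 0.608 | 0.035 |
| 10  | 14       | 0.594 | 0.026 |
| ... | ...      | ...   | ...   |
| 18  | 30       | 0.500 | 0.000 |
| 43  | 69       | 0.500 | 0.000 |
| 16  | 25       | 0.499 | 0.013 |
| 17  | 28       | 0.494 | 0.033 |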

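Sorting in descending order puts the highest mean first, which suits a score where higher is better. For an error metric such as MAE, the best trial would instead be the one with the lowest mean. The sketch below makes the direction explicit in plain Python; `rank_trials` is a hypothetical helper, the three rows are copied from the output above, and breaking ties by the smaller standard deviation is an assumption that differs from the `sort_values` call in the cell.

```python
def rank_trials(rows, higher_is_better):
    """Sort (trial_id, mean, std) rows from best to worst.

    Ties on the mean are broken by the smaller standard deviation.
    """
    sign = -1.0 if higher_is_better else 1.0
    return sorted(rows, key=lambda r: (sign * r[1], r[2]))


# (trial_id, mean, std) rows taken from the results shown above.
rows = [(11, 0.645, 0.016), (10, 0.634, 0.020), (28, 0.494, 0.033)]

best_if_higher = rank_trials(rows, higher_is_better=True)[0][0]
best_if_lower = rank_trials(rows, higher_is_better=False)[0][0]
```

With these rows, trial 11 ranks first when higher is better and trial 28 ranks first when lower is better.
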

In [14]:
# load best trial pipeline (rank #8)
best_trial_id = int(results.iloc[0]['trial_id'])
pipeline = Pipeline.load(f"bioavailability/trial_{best_trial_id}/")

[14:18:31] Initializing Normalizer


FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpw20t292e/model.pkl'
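
The traceback reports a missing `model.pkl` in a temporary directory. One way to narrow such a failure down is to check which files a saved trial folder actually contains before loading it. The sketch below assumes nothing about DeepMol's on-disk layout: `missing_files` is a hypothetical helper, `config.json` is a made-up file name used only for the demonstration, and `model.pkl` is the only name taken from the error message.

```python
from pathlib import Path
import tempfile


def missing_files(directory, expected):
    """Return the expected file names that are absent from `directory`."""
    base = Path(directory)
    return [name for name in expected if not (base / name).is_file()]


# Demonstration on a throwaway directory rather than a real trial folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")  # made-up file name
    absent = missing_files(tmp, ["config.json", "model.pkl"])
```

Running the same check against a real trial folder would show whether the model file was ever saved there, under the assumption that the loader expects it at the top level of that folder.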