# Lipophilicity, AstraZeneca

### Dataset Description: Lipophilicity measures the ability of a drug to dissolve in a lipid (e.g. fats, oils) environment. High lipophilicity often leads to a high rate of metabolism, poor solubility, high turnover, and low absorption. From MoleculeNet.

### Task Description: Regression. Given a drug SMILES string, predict its lipophilicity value.

### Dataset Statistics: 4,200 drugs.

### Metric: MAE (mean absolute error; lower is better)
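
The metric can be computed in a few lines. The sketch below is a minimal pure-Python version; the function name and the sample values are illustrative and are not taken from the benchmark.

```python
def mean_absolute_error(y_true, y_pred):
    """Return the mean of |y_true[i] - y_pred[i]| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up lipophilicity values, for illustration only.
measured = [1.0, 2.5, 3.0, 0.5]
predicted = [1.5, 2.0, 3.0, 1.5]
mae = mean_absolute_error(measured, predicted)
```

The absolute errors here are 0.5, 0.5, 0.0 and 1.0, so `mae` is 0.5.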

## Leaderboard

| Rank | Model                       | Contact      | Link          | #Params   | MAE           |
|------|-----------------------------|--------------|---------------|-----------|---------------|
| 1    | Chemprop-RDKit              | Kyle Swanson | GitHub, Paper | N/A       | 0.467 ± 0.006 |
| 2    | Chemprop                    | Kyle Swanson | GitHub, Paper | N/A       | 0.470 ± 0.009 |
| 3    | BaseBoosting KyQVZ6b2       | David Huang  | GitHub, Paper | N/A       | 0.479 ± 0.007 |
| 4    | MapLight + GNN              | Jim Notwell  | GitHub, Paper | N/A       | 0.525 ± 0.003 |
| 5    | ContextPred                 | Kexin Huang  | GitHub, Paper | 2,067,053 | 0.535 ± 0.012 |
| 6    | MapLight                    | Jim Notwell  | GitHub, Paper | N/A       | 0.539 ± 0.002 |
| 7    | GCN                         | Kexin Huang  | GitHub, Paper | 191,810   | 0.541 ± 0.011 |
| 8    | AttrMasking                 | Kexin Huang  | GitHub, Paper | 2,067,053 | 0.547 ± 0.024 |
| 9    | NeuralFP                    | Kexin Huang  | GitHub, Paper | 480,193   | 0.563 ± 0.023 |
| 10   | AttentiveFP                 | Kexin Huang  | GitHub, Paper | 300,806   | 0.572 ± 0.007 |
| 11   | RDKit2D + MLP (DeepPurpose) | Kexin Huang  | GitHub, Paper | 633,409   | 0.574 ± 0.017 |
| 12   | Basic ML                    | Nilavo Boral | GitHub, Paper | N/A       | 0.617 ± 0.003 |
| 13   | Euclia ML model             | Euclia       | GitHub, Paper | 50        | 0.621 ± 0.005 |
| 14   | Morgan + MLP (DeepPurpose)  | Kexin Huang  | GitHub, Paper | 1,477,185 | 0.701 ± 0.009 |
| 15   | CNN (DeepPurpose)           | Kexin Huang  | GitHub, Paper | 226,625   | 0.743 ± 0.020 |

In [1]:
import pandas as pd
from deepmol.pipeline import Pipeline


In [2]:
pipeline = Pipeline.load("lipophilicity/trial_92")

[08:54:37] Initializing Normalizer


In [3]:
pipeline.steps

[('standardizer',
  <deepmol.standardizer.custom_standardizer.CustomStandardizer at 0x7f36a7163850>),
 ('featurizer',
  <deepmol.compound_featurization.rdkit_fingerprints.MorganFingerprint at 0x7f3435754ac0>),
 ('scaler',
  <deepmol.base.transformer.PassThroughTransformer at 0x7f34357a9cf0>),
 ('feature_selector',
  <deepmol.feature_selection.base_feature_selector.PercentilFS at 0x7f34357dd750>),
 ('model',
  SklearnModel(model=StackingRegressor(estimators=[('lr', LinearRegression()),
                                                   ('svr', SVR()),
                                                   ('rfr',
                                                    RandomForestRegressor()),
                                                   ('gbr',
                                                    GradientBoostingRegressor())],
                                       final_estimator=MLPRegressor()),
               model_dir='lipophilicity/trial_92/model/model.pkl'))]
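
The model step shown above wraps a scikit-learn `StackingRegressor`: four base regressors whose cross-validated predictions feed an `MLPRegressor` final estimator. The sketch below rebuilds the same estimator layout directly in scikit-learn, assuming default hyperparameters because the repr lists none. The `random_state` values and the synthetic data are additions for illustration and do not come from the saved pipeline.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# Same estimator layout as the repr above. random_state is not in the
# original; it is added here only to make the sketch repeatable.
stack = StackingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("svr", SVR()),
        ("rfr", RandomForestRegressor(random_state=0)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=MLPRegressor(random_state=0),
)

# Random binary features stand in for fingerprint bits (synthetic data).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(80, 32)).astype(float)
y = X[:, :4].sum(axis=1) + rng.normal(scale=0.1, size=80)

stack.fit(X, y)
predictions = stack.predict(X[:5])
```

With the default settings the final `MLPRegressor` may emit a convergence warning on such a small dataset; that does not affect the structure being illustrated.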

In [2]:
# Read the per-trial test-set results (no header row: trial id, mean, std)
results = pd.read_csv('lipophilicity/tdc_test_set_results.txt', sep=',', header=None, dtype={0: int, 1: float, 2: float})
# Name the columns
results.columns = ['trial_id', 'mean', 'std']
results

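|    | trial_id | mean  | std   |
|----|----------|-------|-------|
| 0  | 0        | 0.955 | 0.004 |
| 1  | 1        | 0.937 | 0.035 |
| 2  | 4        | 0.903 | 0.004 |
| 3  | 5        | 0.656 | 0.036 |
| 4  | 8        | 1.012 | 0.067 |
| 5  | 12       | 0.663 | 0.035 |
| 6  | 13       | 0.674 | 0.042 |
| 7  | 16       | 0.676 | 0.023 |
| 8  | 23       | 0.639 | 0.028 |
| 9  | 27       | 0.642 | 0.016 |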


In [4]:
# order results by mean (std in case of tie)
results = results.sort_values(by=['mean', 'std'], ascending=True)
results

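|    | trial_id | mean  | std   |
|----|----------|-------|-------|
| 8  | 23       | 0.639 | 0.028 |
| 9  | 27       | 0.642 | 0.016 |
| 3  | 5        | 0.656 | 0.036 |
| 5  | 12       | 0.663 | 0.035 |
| 21 | 64       | 0.666 | 0.02  |
| 19 | 52       | 0.667 | 0.023 |
| 20 | 61       | 0.667 | 0.025 |
| 18 | 51       | 0.67  | 0.025 |
| 6  | 13       | 0.674 | 0.042 |
| 7  | 16       | 0.676 | 0.023 |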
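
To see where the best trial would land on the leaderboard, its mean MAE can be placed among the published MAE values. The sketch below is one way to do that with the standard library; the helper name is hypothetical, and the list is copied from the MAE column of the leaderboard table.

```python
from bisect import bisect_right

# MAE column of the leaderboard above, already sorted ascending.
leaderboard_mae = [0.467, 0.470, 0.479, 0.525, 0.535, 0.539, 0.541, 0.547,
                   0.563, 0.572, 0.574, 0.617, 0.621, 0.701, 0.743]


def leaderboard_rank(mae, existing=leaderboard_mae):
    """1-based rank a new entry with this MAE would take (ties rank below)."""
    return bisect_right(existing, mae) + 1


best_mae = 0.639  # mean of trial 23, the first row of the sorted results
rank = leaderboard_rank(best_mae)
```

Thirteen published entries have an MAE of at most 0.621, so `rank` is 14.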


In [6]:
# Load the best trial's pipeline; its mean MAE of 0.639 would place it 14th on the leaderboard above
best_trial_id = int(results.iloc[0]['trial_id'])
pipeline = Pipeline.load(f"lipophilicity/trial_{best_trial_id}/")

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmppdoiz22e/model.pkl'
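
The traceback above does not show why `model.pkl` is missing. One way to narrow it down is to check the saved trial directories before calling `Pipeline.load`. The sketch below assumes each trial stores its model at `trial_<id>/model/model.pkl`, a layout inferred from the `model_dir` shown for `trial_92`; other trials may be laid out differently. The helper name is hypothetical and is not part of DeepMol.

```python
from pathlib import Path


def missing_model_files(base_dir, trial_ids, relative_model_path="model/model.pkl"):
    """Return the trial ids whose expected model file is absent.

    The default relative path is an assumption taken from the model_dir
    shown for trial_92; adjust it if the saved layout differs.
    """
    base = Path(base_dir)
    return [
        trial_id
        for trial_id in trial_ids
        if not (base / f"trial_{trial_id}" / relative_model_path).is_file()
    ]
```

For example, `missing_model_files("lipophilicity", [23, 27, 5])` would list which of the top three trials lack a saved model file under that assumed layout.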