# Obtaining BO results from different molecule representations

## Step 1: Load Datasets
In this demonstration, we use the model dataset *LIPO*, which relates molecular structure to lipophilicity, a typical quantitative structure-property relationship (QSPR) modelling task.

Hersey, A. ChEMBL Deposited Data Set - AZ dataset; 2015. https://doi.org/10.6019/chembl3301361

In [1]:
import numpy as np
from data_helper import gen_data_feat, load_lipo_feat
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

featurizer_name = 'rdkit'
partition_ratio = 0.2

# Load from pre-featurized data
X, y = load_lipo_feat(filename='data/lipo_{}.csv'.format(featurizer_name))

# Split data into an initial training set and a candidate set
X_train, X_candidate, y_train, y_candidate = train_test_split(
    X, y,
    test_size=1-partition_ratio,
    random_state=1,
    shuffle=True
)

# Standardize input data (scaler is fitted on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_candidate = scaler.transform(X_candidate)

# Apply PCA to reduce dimensionality (optional)
# pca = PCA(n_components=50)
# X_train = pca.fit_transform(X_train)
# X_candidate = pca.transform(X_candidate)
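The preprocessing in the cell above can be tried without the LIPO file. The following is a minimal sketch on synthetic data, which is an assumption made purely for illustration; the names `X_demo`, `y_demo`, `demo_scaler` and `demo_pca` do not come from the notebook. It shows the same pattern: the scaler and the optional PCA are fitted on the initial training set only and then applied to the candidate set, so no candidate statistics leak into the transformation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the featurized dataset (an assumption, not the LIPO data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 20))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Keep 20% as the initial training set; the remaining 80% become candidates.
X_tr, X_cand, y_tr, y_cand = train_test_split(
    X_demo, y_demo, test_size=0.8, random_state=1, shuffle=True
)

# Fit the scaler on the training set only, then reuse it for the candidates.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_cand_scaled = demo_scaler.transform(X_cand)

# Optional PCA, again fitted on the training set only.
demo_pca = PCA(n_components=5)
X_tr_reduced = demo_pca.fit_transform(X_tr_scaled)
X_cand_reduced = demo_pca.transform(X_cand_scaled)

print(X_tr_reduced.shape, X_cand_reduced.shape)
```

Because the scaler only sees the training rows, the candidate features are generally not exactly zero-mean after the transform; that is expected and mirrors how unseen molecules would be handled.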


## Step 2: Prepare for BO

We first import the surrogate models and run a preliminary modelling test.

In [2]:
from sklearn.metrics import r2_score

from surrogates import Surrogate
from surrogates import RandomForestSurrogate
from surrogates import GPTanimotoSurrogate
from surrogates import GPRQSurrogate
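The `surrogates` module itself is not shown in this notebook. To make the interface concrete, here is a hedged sketch of what a surrogate exposing the three methods used later in this notebook (`load_data`, `fit`, `predict_means_and_stddevs`) might look like. The class `SketchForestSurrogate` is hypothetical and is not the `RandomForestSurrogate` from `surrogates`; it assumes a scikit-learn random forest and uses the spread of the per-tree predictions as a rough uncertainty estimate.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


class SketchForestSurrogate:
    """Hypothetical surrogate; not the RandomForestSurrogate from the surrogates module."""

    def __init__(self, n_estimators=100, random_state=0):
        self.model = RandomForestRegressor(
            n_estimators=n_estimators, random_state=random_state
        )
        self.train_x = None
        self.train_y = None

    def load_data(self, train_x, train_y):
        # Store the training data; fitting happens in a separate step.
        self.train_x = np.asarray(train_x)
        self.train_y = np.asarray(train_y).ravel()

    def fit(self):
        self.model.fit(self.train_x, self.train_y)

    def predict_means_and_stddevs(self, test_x):
        test_x = np.asarray(test_x)
        # One prediction per tree, stacked to shape (n_trees, n_samples).
        per_tree = np.stack(
            [tree.predict(test_x) for tree in self.model.estimators_]
        )
        return per_tree.mean(axis=0), per_tree.std(axis=0)


# Usage on synthetic data (made up for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 8))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)

sketch = SketchForestSurrogate(n_estimators=50)
sketch.load_data(train_x=X_demo[:80], train_y=y_demo[:80])
sketch.fit()
demo_means, demo_stddevs = sketch.predict_means_and_stddevs(X_demo[80:])
print(demo_means.shape, demo_stddevs.shape)
```

The per-tree standard deviation is only one possible uncertainty proxy for a forest; Gaussian-process surrogates would instead return the posterior standard deviation.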



In [3]:
# Define surrogate model.
my_surrogate = GPRQSurrogate()
my_surrogate.load_data(train_x=X_train, train_y=y_train)

# Fit surrogate model.
my_surrogate.fit()

# Get means and uncertainties from surrogate model.
means, uncertainties = my_surrogate.predict_means_and_stddevs(X_candidate)
print(f'Test shape: {X_candidate.shape}')
print(f'Mean shape: {means.shape}')
print(f'Uncertainty shape: {uncertainties.shape}')

# Report results of model fit.
print(f'R^2 Score on test set: {r2_score(y_candidate, means)}')

MultivariateNormal(loc: torch.Size([3360]))
Test shape: (3360, 198)
Mean shape: (3360,)
Uncertainty shape: (3360,)
R^2 Score on test set: 0.5528199295252667
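The predicted means and uncertainties are the inputs a BO loop needs to rank candidates. As a hedged illustration, the sketch below scores candidates with two common acquisition functions, upper confidence bound and expected improvement, assuming the target is to be maximized. The function names, the `beta` value and the toy numbers are assumptions for this example and are not taken from the notebook's own BO code.

```python
import numpy as np
from scipy.stats import norm


def upper_confidence_bound(means, stddevs, beta=2.0):
    # Larger beta puts more weight on uncertain candidates (exploration).
    return np.asarray(means) + beta * np.asarray(stddevs)


def expected_improvement(means, stddevs, best_observed):
    # Expected improvement over the best observed value, for maximization.
    means = np.asarray(means, dtype=float)
    stddevs = np.asarray(stddevs, dtype=float)
    improvement = means - best_observed
    # Avoid division by zero where the model reports no uncertainty.
    safe_std = np.where(stddevs > 0, stddevs, 1.0)
    z = improvement / safe_std
    ei = improvement * norm.cdf(z) + stddevs * norm.pdf(z)
    return np.where(stddevs > 0, ei, np.maximum(improvement, 0.0))


# Toy predictions standing in for the surrogate output (made-up numbers).
toy_means = np.array([1.0, 2.0, 1.5, 0.5])
toy_stddevs = np.array([0.1, 0.1, 1.0, 0.0])
best_so_far = 1.8

ucb = upper_confidence_bound(toy_means, toy_stddevs)
ei = expected_improvement(toy_means, toy_stddevs, best_so_far)

# Index of the candidate each criterion would query next.
next_by_ucb = int(np.argmax(ucb))
next_by_ei = int(np.argmax(ei))
print(next_by_ucb, next_by_ei)
```

In this toy case both criteria prefer the candidate with a moderate mean but large uncertainty over the one with the highest mean, which illustrates why a surrogate must return uncertainties as well as means.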
