# Bayes Model on Fingerprints



In [1]:
import collections
import pdb
import torch
from matplotlib import pyplot as plt
import corner
import numpy as np
from tqdm import tqdm
from sklearn import linear_model
import sys
sys.path.append('../')

from bayes_vs import bayes_models
from bayes_vs import chem_ops



In [2]:
chkpt = torch.load('../scripts/trained_oracles.chkpt')
chkpt.keys()

dict_keys(['ground-truth', 'cheap-docking_state_dict', 'expensive-docking_state_dict', 'FEP_state_dict'])

In [3]:
len(chkpt['ground-truth'])

220613

In [4]:
rng = np.random.RandomState(4184189)

In [5]:
shuffled = rng.permutation(list(chkpt['ground-truth'].items()))
smiles, values = zip(*shuffled)
smiles = list(smiles)
values = np.array(values, dtype=np.float32)
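
A note on the shuffle above: if the ground-truth values are plain Python floats, calling `rng.permutation` on a list of `(SMILES, value)` tuples builds a 2-D NumPy array first, which may coerce the values to strings before `np.array(..., dtype=np.float32)` parses them back. A minimal sketch of an index-based alternative, using a made-up `toy` dictionary in place of `chkpt['ground-truth']` (the names `toy`, `order`, `smiles_shuffled`, and `values_shuffled` are illustrative only), could look like this:

```python
import numpy as np

# Hypothetical stand-in for chkpt['ground-truth']: SMILES string -> score.
toy = {'CCO': -1.5, 'c1ccccc1': 0.25, 'CC(=O)O': 2.0, 'CCN': -0.75}

rng = np.random.RandomState(0)

# Keep keys and values in separate, type-homogeneous containers.
keys = list(toy.keys())
vals = np.array(list(toy.values()), dtype=np.float32)

# Permute integer indices, then apply the same order to both containers.
order = rng.permutation(len(keys))
smiles_shuffled = [keys[i] for i in order]
values_shuffled = vals[order]

print(smiles_shuffled, values_shuffled)
```

This produces a different ordering than the cell above for the same seed, so it would change which molecules land in the test set.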

In [6]:
test_set_size = 2500

smiles_train, smiles_test = smiles[:-test_set_size], smiles[-test_set_size:]
values_train, values_test = values[:-test_set_size], values[-test_set_size:]

In [7]:

def test_on_data(smiles_train, smiles_test, y_train, y_test):
    """Fit each baseline on the training split and score it on the test split.

    Returns one [name, training set size, MSE, avg log-likelihood] row per model.
    """
    out_rows = []
    training_set_size = len(smiles_train)
    fps_train = np.stack([chem_ops.morgan_fp_from_smiles(smi) for smi in tqdm(smiles_train, desc='smiles fp train')]).astype(np.float32)
    fps_test = np.stack([chem_ops.morgan_fp_from_smiles(smi) for smi in tqdm(smiles_test, desc='smiles fp test')]).astype(np.float32)

    # Identity embedding: the fingerprints are passed to the Bayesian model unchanged.
    bayes_embed = lambda x: x
    bayes_embed.fp_dim = fps_train.shape[1]

    # Dummy Gaussian: predict the training mean everywhere, with unit variance
    mn = y_train.mean()
    mse = np.mean((y_test - mn)**2)
    ll = -0.5 * np.mean(np.log(2*np.pi) + (y_test - mn)**2)
    out_rows.append(['Dummy Gaussian (var=1)', training_set_size, f'{mse:.2f}', f'{ll:.2f}'])

    # Linear regression with a point estimate of the weights (unit-variance likelihood)
    lin = linear_model.LinearRegression(fit_intercept=False)
    lin.fit(fps_train, y_train)
    predicted_mn = lin.predict(fps_test)
    mse = np.mean((y_test - predicted_mn)**2)
    ll = -0.5 * np.mean(np.log(2*np.pi) + (y_test - predicted_mn)**2)
    out_rows.append(['Linear Regression/w Gaussian likelihood (var=1)', training_set_size, f'{mse:.2f}', f'{ll:.2f}'])

    # Bayesian regression with the model's default precisions
    bayes_r = bayes_models.BayesianRegression(bayes_embed, False)
    bayes_r.fit(torch.tensor(fps_train), torch.tensor(y_train[:, None]))
    mvn = bayes_r.predict(torch.tensor(fps_test))
    mse = np.mean((y_test - mvn.mean.detach().numpy())**2)
    # Score each test point under its marginal predictive Gaussian (diagonal of the covariance).
    # mvn.log_prob would instead give the joint log-probability of the whole test set.
    var = torch.diag(mvn.covariance_matrix)
    ll = -0.5 * torch.mean(torch.log(2*np.pi*var) + (torch.tensor(y_test) - mvn.mean)**2 / var)
    ll = ll.item()
    out_rows.append(['Bayesian Regression', training_set_size, f'{mse:.2f}', f'{ll:.2f}'])

    # Sklearn Bayesian ridge regression (precisions learnt from the data)
    clf = linear_model.BayesianRidge(compute_score=True, fit_intercept=False)
    clf.fit(fps_train, y_train)
    predicted_mn, predicted_std = clf.predict(fps_test, return_std=True)
    ll = -0.5 * np.mean(np.log(2*np.pi*predicted_std**2) + (y_test - predicted_mn)**2 / predicted_std**2)
    mse = np.mean((y_test - predicted_mn)**2)
    out_rows.append(['Sklearn Bayesian Ridge Regression', training_set_size, f'{mse:.2f}', f'{ll:.2f}'])

    # Bayesian regression with the precisions learnt by sklearn:
    # sklearn's lambda_ is the weight precision, alpha_ is the noise precision.
    bayes_r = bayes_models.BayesianRegression(bayes_embed, False)
    bayes_r.alpha = clf.lambda_
    bayes_r.beta = clf.alpha_
    bayes_r.fit(torch.tensor(fps_train), torch.tensor(y_train[:, None]))
    mvn = bayes_r.predict(torch.tensor(fps_test))
    mse = np.mean((y_test - mvn.mean.detach().numpy())**2)
    var = torch.diag(mvn.covariance_matrix)
    ll = -0.5 * torch.mean(torch.log(2*np.pi*var) + (torch.tensor(y_test) - mvn.mean)**2 / var)
    ll = ll.item()
    out_rows.append([f'Bayesian Regression with the sklearn \n learnt precisions (weights: {bayes_r.alpha:.3f},'
                     f'noise:{bayes_r.beta:.3f})', training_set_size, f'{mse:.2f}', f'{ll:.2f}'])

    return out_rows
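
The Bayesian rows above report a per-point predictive variance, which is what separates their log-likelihoods from the fixed `var=1` baselines. The internals of `bayes_models.BayesianRegression` are not shown in this notebook, so the following is only a sketch of the textbook closed-form Bayesian linear regression it presumably resembles, assuming a zero-mean isotropic Gaussian prior with precision `alpha` on the weights and Gaussian noise with precision `beta`. The function names and the synthetic binary features are illustrative, not part of `bayes_vs`.

```python
import numpy as np


def posterior_predictive(X_train, y_train, X_test, alpha, beta):
    """Closed-form Bayesian linear regression (assumed model, not bayes_vs code).

    Prior: w ~ N(0, I / alpha). Likelihood: y ~ N(X w, 1 / beta).
    Returns the predictive mean and the per-point predictive variance.
    """
    n_features = X_train.shape[1]
    # Posterior precision and covariance of the weights.
    precision = alpha * np.eye(n_features) + beta * X_train.T @ X_train
    cov = np.linalg.inv(precision)
    # Posterior mean of the weights.
    w_mean = beta * cov @ X_train.T @ y_train
    # Predictive variance = noise variance + uncertainty from the weights.
    pred_mean = X_test @ w_mean
    pred_var = 1.0 / beta + np.einsum('ij,jk,ik->i', X_test, cov, X_test)
    return pred_mean, pred_var


def avg_gaussian_loglik(y, mean, var):
    """Average log-density of y under independent Gaussians N(mean, var)."""
    return -0.5 * np.mean(np.log(2 * np.pi * var) + (y - mean) ** 2 / var)


# Synthetic binary "fingerprints" and noisy linear targets (made-up data).
rng = np.random.RandomState(0)
X_train = rng.binomial(1, 0.3, size=(200, 16)).astype(np.float64)
X_test = rng.binomial(1, 0.3, size=(50, 16)).astype(np.float64)
w_true = rng.normal(size=16)
y_train = X_train @ w_true + 0.5 * rng.normal(size=200)
y_test = X_test @ w_true + 0.5 * rng.normal(size=50)

pred_mean, pred_var = posterior_predictive(X_train, y_train, X_test, alpha=1.0, beta=4.0)
mse = np.mean((y_test - pred_mean) ** 2)
ll = avg_gaussian_loglik(y_test, pred_mean, pred_var)
print(f'MSE: {mse:.2f}, avg log-likelihood: {ll:.2f}')
```

Passing `var=1` to `avg_gaussian_loglik` recovers the expression used for the dummy and linear-regression rows. In this parameterisation `alpha` plays the role of sklearn's `lambda_` and `beta` the role of sklearn's `alpha_`, matching the assignment in the last block of `test_on_data`.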
    



In [8]:
out = []
# Sweep over training-set sizes; the test set stays fixed.
for train_size in [10, 20, 50, 100, 500, 1000, 2500, 5000, 7500, 10000]:
    out.extend(test_on_data(smiles_train[:train_size], smiles_test, values_train[:train_size], values_test))
    # Three blank spacer rows, then a separator row, between training-set sizes.
    n_cols = len(out[-1])
    out.extend([[""] * n_cols for _ in range(3)])
    out.append(["---"] * n_cols)


In [9]:
import tabulate

In [10]:
print(tabulate.tabulate(out, headers=['Name', "Training set size", "MSE (↓)", "Avg Loglikelihood (↑)"]))

Name                                               Training set size    MSE (↓)    Avg Loglikelihood (↑)
-------------------------------------------------  -------------------  ---------  -----------------------
Dummy Gaussian (var=1)                             10                   23.69      -12.77
Linear Regression/w Gaussian likelihood (var=1)    10                   21.74      -11.79
Bayesian Regression                                10                   21.73      -3.01
Sklearn Bayesian Ridge Regression                  10                   20.83      -3.03
Bayesian Regression with the sklearn               10                   20.83      -2.95
 learnt precisions (weights: 7.076,noise:0.088)
---                                                ---                  ---        ---
Dummy Gaussian (var=1)                             20                   21.05      -11.44
Linear Regression/w Gaussian likelihood (var=1)    20                   19.98      -10.91
Bayesian Regression       