## BPMF using posterior propagation


### Downloading the data files

In these examples we use the ChEMBL dataset of compound-protein activities (IC50). The IC50 values and ECFP fingerprints can be downloaded with the `smurff.load_chembl` function:

In [3]:
import smurff
import logging

logging.basicConfig(level = logging.INFO)

ic50_train, ic50_test, ecfp = smurff.load_chembl()

In [3]:
# Convert to CSR and keep only the first 100 rows of the train and test matrices
ic50_train = ic50_train.tocsr()[:100, :]
ic50_test = ic50_test.tocsr()[:100, :]
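The `tocsr()` call before slicing matters because not every SciPy sparse format supports row slicing. The following is a minimal sketch on synthetic data, assuming the loaded matrices are SciPy sparse matrices; the matrix `coo` and its dimensions are made up for illustration and are not the ChEMBL data.

```python
import numpy as np
from scipy import sparse

# Synthetic sparse matrix in COO format (a hypothetical stand-in, not ChEMBL)
coo = sparse.random(500, 40, density=0.05, format="coo", random_state=0)

# Convert to CSR, which supports efficient row slicing, then keep 100 rows
subset = coo.tocsr()[:100, :]

print(subset.shape, subset.nnz)
```

The number of columns is unchanged, so the subset still lines up with any column-side information.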

### Running SMURFF

Next, we create a BPMF training session and call `run`. The `run` function builds the model and
returns the `predictions` for the test data. Setting `save_freq = 1` saves the model at every iteration, so each sample is available afterwards.

In [5]:
trainSession = smurff.BPMFSession(
                       Ytrain     = ic50_train,
                       Ytest      = ic50_test,
                       num_latent = 16,
                       burnin     = 40,
                       nsamples   = 20,
                       verbose    = 1,
                       save_freq  = 1)

predictions = trainSession.run()

INFO:root:PythonSession {
  Data: {
    Type: ScarceMatrixData [with NAs]
    Component-wise mean: 6.47948
    Component-wise variance: 1.80874
    Noise: Fixed gaussian noise with precision: 5.00
    Size: 325 [100 x 346] (0.94%)
  }
  Model: {
    Num-latents: 16
    Dimensions: 100,346
  }
  Priors: {
    0: NormalPrior
    1: NormalPrior
  }
  Result: {
    Test data: 89 [100 x 346] (0.26%)
  }
  Config: {
      Iterations: 40 burnin + 20 samples
      Save model: every 1 iteration
      Output file: /tmp/tmpmp0cncz1/output.hdf5
  }
}

INFO:root:Burnin   1/ 40: RMSE: nan (1samp: 6.8910)  U:[1.26e+01, 4.33e+01, ] [took: 0.0s, total: 0.0s]
INFO:root:Burnin   2/ 40: RMSE: nan (1samp: 3.0736)  U:[1.82e+01, 5.55e+01, ] [took: 0.0s, total: 0.0s]
INFO:root:Burnin   3/ 40: RMSE: nan (1samp: 3.1355)  U:[2.08e+01, 6.73e+01, ] [took: 0.0s, total: 0.0s]
INFO:root:Burnin   4/ 40: RMSE: nan (1samp: 3.9048)  U:[2.18e+01, 7.53e+01, ] [took: 0.0s, total: 0.0s]
INFO:root:Burnin   5/ 40: RMSE: nan (1

In [6]:
import numpy as np

predict_session = trainSession.makePredictSession()

In [16]:
# collect U for all samples
Us = [ s.latents()[0] for s in predict_session.samples ]

# stack them and compute mean
Ustacked = np.stack(Us)
mu = np.mean(Ustacked, axis = 0)

# Compute covariance, first unstack in different way
Uunstacked = np.squeeze(np.split(Ustacked, Ustacked.shape[1], axis = 1))
Ucov = [ np.cov(u, rowvar = False) for u in Uunstacked ]
# restack
Ucovstacked = np.stack(Ucov, axis = 2)
# reshape correctly
Lambda = Ucovstacked.reshape(Ucovstacked.shape[0], Ucovstacked.shape[1]*Ucovstacked.shape[2])

mu.shape, Lambda.shape

((100, 16), (16, 1600))
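The split/squeeze/stack sequence above computes one covariance matrix per row of U across the samples. As a cross-check, here is a sketch of the same computation written with `np.einsum`. It assumes the stacked samples have shape (n_samples, n_rows, n_latent), as in this notebook, and uses synthetic random data in place of the SMURFF samples; the names `samples`, `cov` and `flat` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_rows, n_latent = 20, 100, 16

# Synthetic stand-in for the stacked U samples: (n_samples, n_rows, n_latent)
samples = rng.normal(size=(n_samples, n_rows, n_latent))

# Per-row mean over the samples: (n_rows, n_latent)
mu = samples.mean(axis=0)

# Per-row sample covariance over the samples: (n_rows, n_latent, n_latent)
centered = samples - mu
cov = np.einsum("sri,srj->rij", centered, centered) / (n_samples - 1)

# Cross-check against np.cov applied row by row
reference = np.stack(
    [np.cov(samples[:, r, :], rowvar=False) for r in range(n_rows)]
)
assert np.allclose(cov, reference)

# Same flattened layout as in the notebook: (n_latent, n_latent * n_rows)
flat = np.moveaxis(cov, 0, -1).reshape(n_latent, n_latent * n_rows)

print(mu.shape, flat.shape)
```

With only 20 samples and 16 latent dimensions, each estimated covariance matrix is built from few observations, so it may be worth checking its conditioning before using it downstream.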

In [14]:
Ustacked.shape
mu = np.mean(Ustacked, axis = 0)
mu.shape

(100, 16)

In [8]:
session2 = smurff.BPMFSession(
                       Ytrain     = ic50_train,
                       Ytest      = ic50_test,
                       num_latent = 16,
                       burnin     = 40,
                       nsamples   = 20,
                       verbose    = 1,
                       save_freq = 1,
                       )
session2.addPropagatedPosterior(0, mu, Lambda)
predictions = session2.run()

RuntimeError: /dev/shm/smurff_1617879756731/work/cpp/SmurffCpp/Configs/Config.cpp:325 in function: validate
assert: mu of propagated posterior in mode 0 should have same number of columns as train in mode