# Machine learning example

In this notebook I show all of the machine learning steps needed for simulation-based metabolic flux inference.

## Generating our dataset

In [1]:
from sbmfi.models.small_models import spiro, multi_modal
from sbmfi.core.simulator import DataSetSim
from sbmfi.inference.priors import UniNetFluxPrior



In the cell below, we create the spiro model. We also automatically create a simulator that simulates labelling for 2 different labelling states named `'A'` and `'B'`. The simulator includes a boundary observation model for the boundary fluxes `['bm', 'd_out', 'h_out']` with errors drawn from a multivariate Gaussian. Note that in this incarnation of the model, we do not check whether the noisy boundary fluxes lie in the flux polytope.

In [2]:
model, kwargs = spiro(
    backend='torch',
    auto_diff=False,
    batch_size=1,
    add_biomass=True,
    v2_reversible=True,
    ratios=True,
    build_simulator=True,
    add_cofactors=True,
    which_measurements='lcms',
    seed=2,
    measured_boundary_fluxes = ('h_out', ),
    which_labellings=['A', 'B'],
    include_bom=True,
    v5_reversible=False,
    n_obs=0,
    kernel_basis='svd',
    basis_coordinates='rounded',
    logit_xch_fluxes=False,
    L_12_omega = 1.0,
    clip_min=None,
    transformation='ilr',
)
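
The boundary observation model described above can be pictured with a small standard-library sketch. It draws correlated Gaussian noise around the boundary-flux values that appear in `kwargs['measurements']` in this notebook (`h_out = 7.6`, `bm = 1.5`) and counts how often the noisy `bm` leaves its bounds `(0.05, 1.5)`. The Cholesky factor `CHOL` and the function name are made up for illustration; this is not how `sbmfi` implements its boundary observation model.

```python
import random

# Centre of the noise distribution; values taken from the measurement table in this notebook.
MEAN_FLUXES = {"h_out": 7.6, "bm": 1.5}
BM_BOUNDS = (0.05, 1.5)  # bounds of the biomass reaction in the spiro model

# Hypothetical lower-triangular Cholesky factor L; the noise covariance is L @ L.T.
CHOL = [[0.30, 0.00],
        [0.05, 0.10]]


def noisy_boundary_fluxes(rng):
    """Return one draw from N(MEAN_FLUXES, L @ L.T) without any polytope check."""
    z = [rng.gauss(0.0, 1.0) for _ in CHOL]
    return {
        name: MEAN_FLUXES[name] + sum(CHOL[i][j] * z[j] for j in range(i + 1))
        for i, name in enumerate(MEAN_FLUXES)
    }


rng = random.Random(2)
draws = [noisy_boundary_fluxes(rng) for _ in range(2000)]
outside = sum(not (BM_BOUNDS[0] <= d["bm"] <= BM_BOUNDS[1]) for d in draws)
fraction_outside = outside / len(draws)
print(f"fraction of noisy bm values outside its bounds: {fraction_outside:.2f}")
```

Because the centre value of `bm` sits on its upper bound, roughly half of the noisy draws fall outside the feasible range, which is the situation that is not checked in this incarnation of the model.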


The reactions of the model are displayed below:

In [3]:
for reaction in model.reactions:
    print(reaction, reaction.bounds)

a_in:  --> A/ab (10.0, 10.0)
d_out: D/abc -->  (0.0, 100.0)
f_out: F/a -->  (0.0, 100.0)
h_out: H/ab -->  (0.0, 100.0)
v1: A/ab --> B/ab (0.0, 100.0)
v2: B/ab ==> E/ab (0.0, 100.0)
v3: B/ab + E/cd --> C/abcd + cof (0.0, 100.0)
v4: E/ab --> H/ab (0.0, 100.0)
v5: F/a + D/bcd <-- C/abcd (-100.0, 0.0)
v6: D/abc --> E/ab + F/c (0.0, 100.0)
v7: F/a + F/b --> H/ab (0.0, 100.0)
bm: 0.3 H/. + 0.6 B/. + 0.5 E/. + 0.1 C/. -->  (0.05, 1.5)
EX_cof: cof -->  (0.0, 1000.0)


These are the measurements that we assume are available for both labelling conditions.

In [4]:
print(f"number of LC-MS signals for labelling condition A: {kwargs['annotation_df']['A'].shape}, and B {kwargs['annotation_df']['B'].shape}")

number of LC-MS signals for labelling condition A: (14, 9), and B (10, 9)


In [5]:
kwargs['measurements']

labelling_id,A,A,A,A,A,A,A,A,A,B,B,B,B,B,BOM,BOM
data_id,ilr_C_0,ilr_C_1,ilr_D_0,ilr_D_1,ilr_H_0,ilr_L_0,ilr_L_1,ilr_L_2,"ilr_L|[1,2]_0",ilr_C_0,ilr_D_0,ilr_H_{M+Cl}_0,ilr_H_0,"ilr_L|[1,2]_0",h_out,bm
0,-2.029316,-1.868853,-2.29619,-1.680012,-0.174556,-1.611885,-2.14425,-2.907779,-1.470387,-4.533702,-2.677548,-0.37377,-0.37377,-1.509012,7.6,1.5


In [6]:
kwargs['annotation_df']['B']

Unnamed: 0,met_id,nC13,adduct_name,mz,rt,sigma,omega,total_I,formula
0,C,0,M-H,157.018955,4.0,0.02,,700000.0,C4H6N4OS
1,C,3,M-H,160.02902,4.0,0.02,,700000.0,C4H6N4OS
2,D,0,M-H,37.008374,5.0,0.01,,100000.0,C3H2
3,D,2,M-H,39.015083,5.0,0.01,,100000.0,C3H2
4,H,0,M-H,25.008374,1.0,0.01,,3000.0,C2H2
5,H,1,M-H,26.011728,1.0,0.01,,3000.0,C2H2
6,H,0,M+Cl,60.985051,1.0,0.03,,2000.0,C2H2
7,H,1,M+Cl,61.988406,1.0,0.03,,2000.0,C2H2
8,"L|[1,2]",0,M-H,136.972776,6.0,0.01,1.0,40000.0,C2H2O7
9,"L|[1,2]",1,M-H,137.976131,6.0,0.01,1.0,40000.0,C2H2O7
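
The `mz` column can be sanity-checked with a short sketch. For a singly charged ion, each additional 13C atom shifts m/z by the 13C-12C mass difference of about 1.003355 Da, so the M+n rows should follow from the M+0 row of the same metabolite and adduct. The helper name `expected_mz` is made up for illustration; the values are copied from the table above.

```python
C13_C12_MASS_DIFF = 1.0033548  # Da, mass difference between 13C and 12C

# (met_id, adduct_name, nC13, mz) rows copied from the annotation table above
rows = [
    ("C", "M-H", 0, 157.018955), ("C", "M-H", 3, 160.02902),
    ("D", "M-H", 0, 37.008374), ("D", "M-H", 2, 39.015083),
    ("H", "M-H", 0, 25.008374), ("H", "M-H", 1, 26.011728),
    ("H", "M+Cl", 0, 60.985051), ("H", "M+Cl", 1, 61.988406),
]


def expected_mz(mz_m0, n_c13, charge=1):
    """m/z of the isotopologue with n_c13 heavy carbons, given the M+0 m/z."""
    return mz_m0 + n_c13 * C13_C12_MASS_DIFF / abs(charge)


# Look up the M+0 m/z for every (metabolite, adduct) pair.
m0 = {(met, adduct): mz for met, adduct, n, mz in rows if n == 0}
max_error = max(
    abs(expected_mz(m0[(met, adduct)], n) - mz) for met, adduct, n, mz in rows
)
print(f"largest deviation from the expected spacing: {max_error:.1e} Da")
```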


We will sample `n` flux vectors from a uniform prior and simulate `n_obs=3` observations per sampled flux vector.

In [7]:
n = 20000

bbs = kwargs['basebayes']
sdf = kwargs['substrate_df']
simulator = DataSetSim(model, sdf, bbs._obmods, bbs._bom, num_processes=3)
prior = UniNetFluxPrior(model, cache_size=n)

In [None]:
theta = prior.sample((n,))
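
To make the shape of the resulting dataset concrete, here is a toy sketch using only the standard library. It is not the `UniNetFluxPrior` sampler: it draws uniformly from a hypothetical two-dimensional polytope by rejection sampling and attaches `n_obs` noisy observations to each draw, with an assumed noise level `sigma`.

```python
import random


def sample_uniform_polytope(n, rng):
    """Rejection-sample n points uniformly from {v >= 0, v1 + v2 <= 10}."""
    samples = []
    while len(samples) < n:
        v = (rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0))
        if v[0] + v[1] <= 10.0:  # keep only points inside the toy polytope
            samples.append(v)
    return samples


def simulate_observations(theta, n_obs, rng, sigma=0.1):
    """Stand-in simulator: the 'data' are just theta plus Gaussian noise."""
    return [[t + rng.gauss(0.0, sigma) for t in theta] for _ in range(n_obs)]


rng = random.Random(0)
n, n_obs = 1000, 3
thetas = sample_uniform_polytope(n, rng)
dataset = [simulate_observations(theta, n_obs, rng) for theta in thetas]
shape = (len(dataset), len(dataset[0]), len(dataset[0][0]))
print(shape)  # laid out as (n, n_obs, n_features)
```

Rejection sampling becomes very inefficient in higher dimensions, so practical polytope samplers typically rely on Markov chain methods such as hit-and-run.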

In [None]:
# result = simulator.simulate_set(
#     theta,
#     n_obs=3,
#     fluxes_per_task=None,
#     what='all',
#     break_i=-1,
#     close_pool=True,
#     show_progress=False,
#     save_fluxes=True,
# )

Here we save the results.

In [None]:
hdf = 'spiro_mdvae_test_NEW.h5'
dataset_id = 'test1'
# simulator.to_hdf(
#     hdf=hdf,
#     result=result,
#     dataset_id=dataset_id,
#     append=True,
#     expectedrows_multiplier=10,
# )

## Representing labelling measurements in a reduced latent space

As a back-of-the-envelope calculation, suppose that LC-MS lets us measure around 40 central carbon metabolism (CCM) metabolites in *E. coli*. Furthermore, let's assume that on average we can measure 3 mass isotopomers per metabolite per labelling experiment. If we then run 3 labelling experiments (with different substrate labellings), we have a total of `40 * 3 * 3 = 360` numbers representing the labelling state that we use for inference.

The first thing to notice is that mass distribution vectors (MDVs) are an inefficient way of representing labelling data. To represent the labelling state of acetate, `ac`, as an MDV we need three numbers `[ac+0, ac+1, ac+2]`. Since an MDV is by definition a point on a probability simplex and therefore sums to 1, the acetate MDV has only 2 degrees of freedom. By applying the isometric log-ratio (ilr) transform to the MDV, we can represent the labelling state using only 2 real numbers (i.e. elements of $\mathbb{R}$) without any loss of information.
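
As a minimal sketch of this idea, the code below implements one common ilr basis (a Helmert-type pivot basis) with the standard library and round-trips a hypothetical acetate MDV. The basis that `sbmfi` uses internally may differ, and the example MDV values are made up. All entries must be strictly positive for the logarithm to exist.

```python
import math


def ilr(x):
    """Map a composition with D strictly positive parts to D - 1 real coordinates."""
    logs = [math.log(v) for v in x]
    coords = []
    for i in range(1, len(x)):
        mean_log = sum(logs[:i]) / i  # log of the geometric mean of the first i parts
        coords.append(math.sqrt(i / (i + 1)) * (mean_log - logs[i]))
    return coords


def ilr_inverse(z):
    """Map D - 1 real coordinates back to a composition that sums to 1."""
    clr = [0.0] * (len(z) + 1)
    for i, zi in enumerate(z, start=1):
        scale = math.sqrt(i / (i + 1))
        for k in range(i):
            clr[k] += zi * scale / i  # first i entries of the i-th basis vector
        clr[i] -= zi * scale          # entry i + 1 of the i-th basis vector
    exps = [math.exp(c) for c in clr]
    total = sum(exps)
    return [e / total for e in exps]


mdv_ac = [0.7, 0.2, 0.1]  # hypothetical [ac+0, ac+1, ac+2]
z_ac = ilr(mdv_ac)        # two unconstrained real numbers
mdv_back = ilr_inverse(z_ac)
print(z_ac, mdv_back)
```

The uniform MDV `[1/3, 1/3, 1/3]` maps to the origin, and any pair of real numbers maps back to a valid MDV, which is what makes the coordinates convenient as network inputs and outputs.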

By applying the ilr transform to all metabolites, we can represent the labelling data with `40 * (3-1) * 3 = 240` numbers. On top of that, these are unconstrained real numbers, free of the sum-to-one constraint that couples the entries of an MDV.
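
The two counts can be reproduced in a few lines. The variable names are only for illustration, and the numbers are the rough estimates used in the text.

```python
n_metabolites = 40   # measurable CCM metabolites (rough estimate from the text)
n_isotopomers = 3    # average number of measured mass isotopomers per metabolite
n_experiments = 3    # labelling experiments with different substrate labellings

n_mdv = n_metabolites * n_isotopomers * n_experiments        # MDV representation
n_ilr = n_metabolites * (n_isotopomers - 1) * n_experiments  # ilr representation
print(n_mdv, n_ilr)  # 360 240
```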

Another inefficiency is that different metabolites within a labelling experiment carry similar information. For example, alanine is made from pyruvate and thus has an MDV similar to that of pyruvate. Differences can arise from the LC-MS measurement itself: for instance, `ala+1` might not be measured whereas `pyr+1` is, or the two signals might have vastly different noise levels.

Generally, if we try to infer 20 free fluxes from many labelling experiments that yield hundreds of mass isotopomer measurements, we should try to compress the data to roughly 20 dimensions.
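
The compression argument can be illustrated with a linear stand-in. The sketch below uses PCA (via NumPy's SVD) on synthetic data in which 12 signals are driven by only 3 underlying degrees of freedom. PCA is swapped in here purely for illustration: the notebook itself trains a variational autoencoder, and all sizes and noise levels below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_free, n_features = 500, 3, 12  # toy sizes, not the spiro model

# Synthetic "measurements": 12 signals driven by only 3 underlying degrees of freedom.
latent = rng.normal(size=(n_samples, n_free))
mixing = rng.normal(size=(n_free, n_features))
data = latent @ mixing + 0.01 * rng.normal(size=(n_samples, n_features))

# PCA via SVD of the centred data matrix.
centered = data - data.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)

compressed = centered @ components[:n_free].T  # 12 columns reduced to 3
print(compressed.shape, float(explained[:n_free].sum()))
```

When the signals are this redundant, the first `n_free` components capture almost all of the variance, so little is lost by keeping only as many dimensions as there are free parameters.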

In addition to labelling measurements, we typically also have access to measurements of some boundary fluxes, such as the growth rate (i.e. biomass flux), substrate uptake and the excretion of some fermentation products.

In [None]:
from sbmfi.inference.mdvae import MDVAE_Dataset, ray_train_MDVAE
from sbmfi.core.simulator import _BaseSimulator

import math
import os
import numpy as np
import pandas as pd
from scipy.stats import random_correlation, loguniform

import torch
from torch.utils.data import Dataset, DataLoader, random_split
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from torch import nn

import time
import tqdm

Create the training and validation datasets.

In [None]:
DENOISE = True  # whether to feed denoised data (data without observation model noise added)

if not simulator._la.backend == 'torch':
    raise ValueError
mdvs = simulator.read_hdf(hdf=hdf, dataset_id=dataset_id, what='mdv') if DENOISE else None
data = simulator.read_hdf(hdf=hdf, dataset_id=dataset_id, what='data')
theta = simulator.read_hdf(hdf=hdf, dataset_id=dataset_id, what='theta')
mu = simulator.simulate(theta=theta, mdvs=mdvs, n_obs=0) if DENOISE else None
if (simulator._bom is not None):
    mu = mu[..., :-simulator._bomsize] if DENOISE else None
    data = data[..., :-simulator._bomsize]
dataset = MDVAE_Dataset(data, mu)

n_validate = math.ceil(0.10 * len(dataset))  # 10 % of the data are kept for validation

train_ds, val_ds = random_split(
    dataset,
    lengths=(len(dataset) - n_validate, n_validate),
    generator=simulator._la._BACKEND._rng
)

from sbmfi.settings import BASE_DIR
torch.save(train_ds, os.path.join(BASE_DIR, 'train_ds.pt'))
torch.save(val_ds, os.path.join(BASE_DIR, 'val_ds.pt'))
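
The role of `DENOISE` can be illustrated without `torch`. The sketch below pairs each noisy observation with its noise-free counterpart as the training target and then holds out 10 % of the pairs. `PairedDataset` is a hypothetical stand-in: how `MDVAE_Dataset` is implemented is not shown in this notebook, and falling back to the input as its own target when `mu` is `None` is an assumption.

```python
import math
import random


class PairedDataset:
    """Hypothetical stand-in for a dataset yielding (input, target) pairs."""

    def __init__(self, data, mu=None):
        self.data = data
        # Assumption: without denoising targets, the input is reconstructed as-is.
        self.target = data if mu is None else mu

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i], self.target[i]


rng = random.Random(0)
mu = [[float(i), float(-i)] for i in range(50)]                # noise-free values
data = [[m + rng.gauss(0.0, 0.1) for m in row] for row in mu]  # noisy observations

dataset = PairedDataset(data, mu)
n_validate = math.ceil(0.10 * len(dataset))  # 10 % held out for validation

# Shuffle the indices once, then cut them into a validation and a training part.
indices = list(range(len(dataset)))
rng.shuffle(indices)
val_idx, train_idx = indices[:n_validate], indices[n_validate:]
train_pairs = [dataset[i] for i in train_idx]
val_pairs = [dataset[i] for i in val_idx]
print(len(train_pairs), len(val_pairs))
```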

In [None]:
mdvae, losses = ray_train_MDVAE({}, cwd=BASE_DIR, show_progress=True)

In [37]:

from sbmfi.inference.mdvae import MDVAE  # assumption: MDVAE lives next to MDVAE_Dataset


def train_mdv_encoder(
    hdf,
    simulator,
    dataset_id,
    denoising_dataset=True,
    include_boundary_fluxes=False,
    n_epochs=3,        # assumption: inferred from the three validation printouts
    batch_size=32,     # assumption: inferred from the progress-bar total
    lr=1e-3,           # assumption: PyTorch's Adam default, original value not shown
    weight_decay=0.0,  # assumption: PyTorch's Adam default, original value not shown
    **kwargs,          # forwarded to the MDVAE constructor
):
    if simulator._la.backend != 'torch':
        raise ValueError('training the MDVAE requires the torch backend')

    # noise-free MDVs are only needed to build the denoising targets
    mdvs = simulator.read_hdf(hdf=hdf, dataset_id=dataset_id, what='mdv') if denoising_dataset else None
    data = simulator.read_hdf(hdf=hdf, dataset_id=dataset_id, what='data')
    theta = simulator.read_hdf(hdf=hdf, dataset_id=dataset_id, what='theta')
    mu = simulator.simulate(theta=theta, mdvs=mdvs, n_obs=0) if denoising_dataset else None
    if (simulator._bom is not None) and not include_boundary_fluxes:
        # drop the trailing boundary-flux columns
        mu = mu[..., :-simulator._bomsize] if denoising_dataset else None
        data = data[..., :-simulator._bomsize]
    dataset = MDVAE_Dataset(data, mu)

    n_validate = math.ceil(0.10 * len(dataset))  # 10 % of the data are kept for validation
    train_ds, val_ds = random_split(
        dataset,
        lengths=(len(dataset) - n_validate, n_validate),
        generator=simulator._la._BACKEND._rng
    )
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)

    # by default, use as many latent dimensions as there are free parameters
    kwargs.setdefault('n_latent', len(simulator.theta_id))
    mdvae = MDVAE(
        n_data=data.shape[-1],
        **kwargs,
    )

    loss_f = nn.MSELoss()
    optimizer = torch.optim.Adam(mdvae.parameters(), lr=lr, weight_decay=weight_decay)

    pbar = tqdm.tqdm(total=n_epochs * len(train_loader), ncols=100)
    prr = lambda x: x.to('cpu').data.numpy().round(4)
    try:
        for epoch in range(n_epochs):
            for i, (x, y) in enumerate(train_loader):
                x_hat, mean, log_var = mdvae.forward(x)
                reconstruct = loss_f(x_hat, y)
                KL_div = - 0.5 * torch.sum(1 + log_var - mean.pow(2) - log_var.exp())
                loss = reconstruct + KL_div
                optimizer.zero_grad()
                if torch.isfinite(loss):  # skip batches with a NaN or infinite loss
                    loss.backward()
                    optimizer.step()
                pbar.update()
                if i % 50 == 0:
                    pbar.set_postfix(loss=prr(loss), KL_div=prr(KL_div),  mse=prr(reconstruct))
            # evaluate on the full validation set after every epoch
            with torch.no_grad():
                x_val, y_val = val_ds[:]
                x_val_hat, mean, log_var = mdvae.forward(x_val)
                reconstruct_val = loss_f(x_val_hat, y_val)
                KL_div_val = - 0.5 * torch.sum(1 + log_var - mean.pow(2) - log_var.exp())
                loss_val = reconstruct_val + KL_div_val
                print(f'loss: {prr(loss_val)}, KL_div: {prr(KL_div_val)}, MSE: {prr(reconstruct_val)}', flush=True)
    except KeyboardInterrupt:
        pass  # allow stopping the training manually while keeping the model
    finally:
        pbar.close()
    # returning after the finally block, so other exceptions are not swallowed
    return mdvae
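
The regularization term in the training loop above is the closed-form KL divergence between the encoder's diagonal Gaussian, with mean `mean` and variance `exp(log_var)`, and a standard normal. A plain-Python sketch of the same expression, with a made-up function name, makes its behaviour easy to check:

```python
import math


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mean, log_var)
    )


kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])     # posterior equals the prior
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])  # mean shifted by one unit
kl_narrow = kl_to_standard_normal([0.0], [math.log(0.25)])  # variance shrunk to 0.25
print(kl_zero, kl_shifted, kl_narrow)
```

Note that the loop sums this term over the whole batch with `torch.sum`, whereas `nn.MSELoss()` averages over all elements by default, so the relative weight of the two terms depends on the batch size.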


In [83]:
mdvae = train_mdv_encoder(hdf, simulator, dataset_id, denoising_dataset=True)

 33%|██████▉              | 1686/5064 [01:15<01:33, 36.29it/s, KL_div=0.0153, loss=0.146, mse=0.131]

loss: 3.1619, KL_div: 3.0272, MSE: 0.1347


 67%|████████████▋      | 3375/5064 [02:26<00:46, 36.36it/s, KL_div=0.0055, loss=0.0969, mse=0.0915]

loss: 1.2142, KL_div: 1.1298, MSE: 0.0845


100%|██████████████████▉| 5063/5064 [03:34<00:00, 31.90it/s, KL_div=0.0027, loss=0.0453, mse=0.0426]

loss: 0.5907, KL_div: 0.513, MSE: 0.0778


100%|███████████████████| 5064/5064 [03:34<00:00, 23.65it/s, KL_div=0.0027, loss=0.0453, mse=0.0426]
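
The progress-bar total can be cross-checked against the dataset size. The sketch assumes that all `n * n_obs = 60000` simulated rows were stored. The batch size and number of epochs are not recorded in the output above; 32 and 3 are guesses that happen to reproduce the total of 5064 steps and the three validation printouts.

```python
import math

n, n_obs = 20000, 3               # from the sampling cell
n_rows = n * n_obs                # assumption: every simulated row was stored
n_validate = math.ceil(0.10 * n_rows)
n_train = n_rows - n_validate

batch_size, n_epochs = 32, 3      # guesses, not recorded in the output
batches_per_epoch = math.ceil(n_train / batch_size)  # the last batch may be partial
total_steps = n_epochs * batches_per_epoch
print(total_steps)
```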


In [84]:
denoising_dataset = True
include_boundary_fluxes = False

mdvs = simulator.read_hdf(hdf=hdf, dataset_id=dataset_id, what='mdv')
data = simulator.read_hdf(hdf=hdf, dataset_id=dataset_id, what='data')
theta = simulator.read_hdf(hdf=hdf, dataset_id=dataset_id, what='theta')
mu = simulator.simulate(theta=theta, mdvs=mdvs, n_obs=0) if denoising_dataset else None
if (simulator._bom is not None) and not include_boundary_fluxes:
    mu = mu[..., :-simulator._bomsize] if denoising_dataset else None
    data = data[..., :-simulator._bomsize]
dataset = MDVAE_Dataset(data, mu)

n_validate = math.ceil(0.10 * len(dataset))

train_ds, val_ds = random_split(
    dataset,
    lengths=(len(dataset) - n_validate, n_validate),
    generator=simulator._la._BACKEND._rng
)

In [101]:
x_in = val_ds[8][0]
x_hat, mean, log_var = mdvae.forward(x_in)

In [122]:
x_hat, mean, log_var = mdvae.forward(x_in)
x_hat

tensor([-0.0839, -0.1568, -0.1453,  0.0938,  1.5149,  0.3483, -0.2005, -0.9988,
         0.2527, -1.9048, -1.1830,  1.2056,  1.2408,  0.0537],
       grad_fn=<ViewBackward0>)

In [103]:
x_in

tensor([ 0.2128, -0.0636, -0.1091,  0.1319,  1.6744,  0.4989,  0.0603, -0.9097,
         0.2834, -2.2376, -1.3587,  1.0790,  0.9984, -0.1415])
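
A quick way to quantify how close the reconstruction is to the input is the mean squared error between the two printed tensors. The sketch below copies the values from the outputs above. Keep in mind that `x_hat` may differ between calls if `forward` samples from the latent distribution, and that with a denoising dataset the training target is the noise-free `mu` rather than `x_in`, so this number is only a rough indication.

```python
# Values copied from the printed tensors above.
x_in = [0.2128, -0.0636, -0.1091, 0.1319, 1.6744, 0.4989, 0.0603,
        -0.9097, 0.2834, -2.2376, -1.3587, 1.0790, 0.9984, -0.1415]
x_hat = [-0.0839, -0.1568, -0.1453, 0.0938, 1.5149, 0.3483, -0.2005,
         -0.9988, 0.2527, -1.9048, -1.1830, 1.2056, 1.2408, 0.0537]

# Element-wise squared differences, averaged over the 14 ilr coordinates.
squared_errors = [(a - b) ** 2 for a, b in zip(x_in, x_hat)]
mse = sum(squared_errors) / len(squared_errors)
worst = max(range(len(squared_errors)), key=squared_errors.__getitem__)
print(f"MSE = {mse:.4f}, largest error at coordinate {worst}")
```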

In [23]:
x_in[None, :].shape


In [180]:
simulator.to_partial_mdvs(x_in[None, :], pandalize=True)

labelling_id,A,A,A,A,A,A,A,A,A,A,...,B,B,B,B,B,B,B,BOM,BOM,BOM
data_id,C+0,C+3,C+4,D+0,D+2,D+3,H_{M+Cl}+0,H_{M+Cl}+1,H_{M+F}+0,H_{M+F}+1,...,H+1,L+0,L+1,L+2,L+5,"L|[1,2]+0","L|[1,2]+1",d_out,h_out,bm
0,0.196942,0.41339,0.389667,0.213277,0.444776,0.341946,0.873687,0.126313,0.884723,0.115277,...,0.171433,0.155488,0.133845,0.546081,0.164585,0.466329,0.533671,-1.086872,0.270122,-0.09538


In [174]:
simulator.to_partial_mdvs(x_hat[None, :], pandalize=True)

labelling_id,A,A,A,A,A,A,A,A,A,A,...,B,B,B,B,B,B,B,BOM,BOM,BOM
data_id,C+0,C+3,C+4,D+0,D+2,D+3,H_{M+Cl}+0,H_{M+Cl}+1,H_{M+F}+0,H_{M+F}+1,...,H+1,L+0,L+1,L+2,L+5,"L|[1,2]+0","L|[1,2]+1",d_out,h_out,bm
0,0.195169,0.363419,0.441413,0.219752,0.388493,0.391755,0.853166,0.146834,0.844307,0.155693,...,0.226241,0.142586,0.168176,0.619806,0.069433,0.451554,0.548446,-1.132424,1.095021,-0.137457

