# General remarks

This document serves to demonstrate the use of the `sbmfi` (simulation based metabolic flux inference) `python` package that I have developed over the course of my PhD. The folder structure of the package is shown below:
```
sbmfi
|___ core
|___ inference
|___ models
|___ lcmsanalysis
```
The `core` folder contains all the algorithms that are necessary to specify a simulator that takes fluxes as input and outputs mass distribution vectors (MDVs) of metabolites in a steady-state $^{13}C$ carbon labelling experiment (CLE). The `inference` folder contains all the algorithms necessary to perform $^{13}C$ metabolic flux analysis, such as four different priors and a host of posterior sampling algorithms, including Markov Chain Monte Carlo (MCMC), Sequential Monte Carlo (SMC) and Neural Density Inference (NDI). The `models` folder is self-explanatory. Last, the `lcmsanalysis` folder contains scripts that I used to analyze my LC-MS experiments. This code is unpolished and should generally be ignored, except for the `formula` file, which defines a convenient class for chemical formulas.


### Constructing a simulator: `sbmfi.core`

The simulator of `sbmfi` inherits its design from the popular [`cobrapy` package](https://cobrapy.readthedocs.io/en/latest/). It is highly recommended to first familiarize oneself with `cobrapy` before working with this simulator.

All computations in `sbmfi` are performed by one of two backends: `numpy` or `torch`. `numpy`, together with `scipy`, is the standard scientific numerics stack in Python, so the simulator can be used without `torch` as a dependency. `torch` is included as a backend for two reasons. First, `torch` is the backend used by all the machine-learning-related inference algorithms, such as the ones in the [`sbi` package](https://www.mackelab.org/sbi/). Second, the automatic differentiation capabilities of `torch` make it trivial to compute Jacobian matrices, such as the Jacobian of MDVs w.r.t. fluxes. It turns out that automatic differentiation is very slow in this case, so analytical computation is much preferred; the automatically differentiated Jacobians are nevertheless a very useful sanity check.
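
The sanity-check idea can be illustrated without `torch`. The sketch below is not `sbmfi` code: it uses central finite differences as a plain stand-in for automatic differentiation, and `toy_simulator` is a hypothetical function invented only for this illustration. The point is merely that a hand-derived Jacobian can be compared against a numerically obtained one.

```python
import numpy as np

def toy_simulator(x):
    """Hypothetical smooth map R^2 -> R^3 (not an sbmfi function)."""
    a, b = x
    return np.array([a * b, a ** 2, np.sin(b)])

def analytical_jacobian(x):
    """Hand-derived Jacobian of toy_simulator, shape (3, 2)."""
    a, b = x
    return np.array([
        [b, a],
        [2.0 * a, 0.0],
        [0.0, np.cos(b)],
    ])

def finite_difference_jacobian(func, x, eps=1e-6):
    """Central-difference approximation, one column per input variable."""
    x = np.asarray(x, dtype=float)
    columns = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        columns.append((func(x + step) - func(x - step)) / (2.0 * eps))
    return np.stack(columns, axis=1)

x0 = np.array([1.5, 0.3])
jac_exact = analytical_jacobian(x0)
jac_approx = finite_difference_jacobian(toy_simulator, x0)

# Largest entry-wise disagreement between the two Jacobians
max_abs_error = np.abs(jac_exact - jac_approx).max()
```

If `max_abs_error` is not small, either the analytical derivation or its implementation is wrong, which is exactly the kind of mistake such a check is meant to catch.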

The different backends are specified in `sbmfi.core.linalg.LinAlg`. A `LinAlg` object is responsible for:
- setting a global seed for random number generation
- specifying a device where all tensors are stored (e.g. for use of `CUDA` on GPU)
- setting of `kwargs` of functions
- setting the `batch_size` parameter, which is the number of fluxes that are processed in batch during simulations
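
To make the role of `batch_size` concrete, here is a minimal sketch of processing flux vectors in fixed-size batches. The helper `iter_batches` is hypothetical and is not part of `sbmfi`; it assumes that every batch must contain exactly `batch_size` flux vectors.

```python
import numpy as np

def iter_batches(fluxes, batch_size):
    """Yield consecutive row blocks of exactly `batch_size` rows (hypothetical helper)."""
    n_rows = fluxes.shape[0]
    if n_rows % batch_size != 0:
        raise ValueError("number of flux vectors must be a multiple of batch_size")
    for start in range(0, n_rows, batch_size):
        yield fluxes[start:start + batch_size]

# 4 made-up flux vectors with 3 reactions each
example_fluxes = np.arange(12, dtype=float).reshape(4, 3)
batches = list(iter_batches(example_fluxes, batch_size=2))
```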

In [1]:
import pandas as pd
import torch
import numpy as np

In [2]:
from sbmfi.core.linalg import LinAlg

np_la = LinAlg(backend='numpy', seed=1, batch_size=2)  
tr_la = LinAlg(backend='torch', seed=1, batch_size=2)

a = np_la.get_tensor(shape=(3,3))
print(a, 'a numpy array')

b = tr_la.get_tensor(shape=(3,3))
print(b, 'a torch tensor')

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]] a numpy array
tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]) a torch tensor


### Constructing a model from string input

There are different ways of constructing a model. Shown below is the construction of a model completely from string inputs. 

It is also possible to augment an [SBML model](https://sbml.org/) with atom-transition mappings. This way, we can use models from the [BiGG repository](http://bigg.ucsd.edu/models), which can be read by the `cobra.io` package. This approach is used for the *E. coli* model that I use throughout my thesis.

In [3]:
from sbmfi.core.model import EMU_Model

reaction_kwargs = {
    'a_in': {
        'upper_bound': 10.0, 
        'lower_bound': 10.0,
        'atom_map_str': '∅ --> A/abc'
    },
    'co2_out': {
        'upper_bound': 100.0,
        'atom_map_str': 'co2/a --> ∅'
    },
    'e_out': {
        'upper_bound': 100.0,
        'atom_map_str': 'E/ab --> ∅'
    },
    'v1': {
        'upper_bound': 100.0,
        'atom_map_str': 'A/abc --> B/ab + D/c'
    },
    'v2': {
        'upper_bound': 100.0,
        'atom_map_str': 'A/abc --> C/bc + D/a'
    },
    'v3': {
        'upper_bound': 100.0,
        'atom_map_str': 'B/ab + D/c --> E/ac + co2/b'
    },
    'v4': {
        'upper_bound': 100.0,
        'atom_map_str': 'C/ab + D/c --> E/cb + co2/a'
    },
}
metabolite_kwargs = {
    'E': {'formula': 'C2H4O2'},
}

model = EMU_Model(id_or_model='bimodal', linalg=tr_la)

model.add_reactions(
    reaction_kwargs=reaction_kwargs,
    metabolite_kwargs=metabolite_kwargs
)



Next, we set the $^{13}C$-labelling of the substrate metabolite, which in this case is metabolite `A`. The convention is that a 0 indicates a $^{12}C$ atom and a 1 indicates a $^{13}C$ atom.

In [4]:
input_labelling = pd.Series({'A/011': 1.0}, name='input')
model.set_input_labelling(input_labelling)
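
The labelling convention can be spelled out with a few lines of plain Python. `parse_labelling` is a hypothetical helper written for this illustration, not an `sbmfi` function; it only shows how a string such as `'A/011'` decomposes into a metabolite id, per-atom labels and the resulting mass shift.

```python
def parse_labelling(label):
    """Split 'A/011' into metabolite id, per-atom labels and mass shift."""
    metabolite, pattern = label.split('/')
    if set(pattern) - {'0', '1'}:
        raise ValueError(f"unexpected character in labelling pattern {pattern!r}")
    atoms = [int(char) for char in pattern]  # 0 = 12C, 1 = 13C
    return metabolite, atoms, sum(atoms)

metabolite, atoms, mass_shift = parse_labelling('A/011')
```

Here the first carbon of `A` is unlabelled and the other two are labelled, so the substrate is two mass units heavier than its fully unlabelled form.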

Last, we set the metabolites (or metabolite fragments) that we can measure given our experimental capabilities. This is important information for the EMU algorithm.

In [5]:
model.set_measurements(measurement_list=['E']) 

As in `cobra`, our model has reaction objects, though these have been augmented in order to handle labelling simulations:

In [6]:
for reaction in model.reactions:
    print(reaction, reaction.bounds, type(reaction))

a_in:  --> A/abc (10.0, 10.0) <class 'sbmfi.core.reaction.EMU_Reaction'>
co2_out: co2/a -->  (0.0, 100.0) <class 'sbmfi.core.reaction.EMU_Reaction'>
e_out: E/ab -->  (0.0, 100.0) <class 'sbmfi.core.reaction.EMU_Reaction'>
v1: A/abc --> B/ab + D/c (0.0, 100.0) <class 'sbmfi.core.reaction.EMU_Reaction'>
v2: A/abc --> C/bc + D/a (0.0, 100.0) <class 'sbmfi.core.reaction.EMU_Reaction'>
v3: B/ab + D/c --> E/ac + co2/b (0.0, 100.0) <class 'sbmfi.core.reaction.EMU_Reaction'>
v4: C/ab + D/c --> E/cb + co2/a (0.0, 100.0) <class 'sbmfi.core.reaction.EMU_Reaction'>


In the cell below, we build the simulator. This function has several important arguments:
- `kernel_basis` $\in$ {`rref`, `svd`}: how the basis of the null space (kernel) of the flux polytope is computed
- `basis_coordinates` $\in$ {`rounded`, `transformed`}: whether the input to the simulator is in transformed or rounded coordinates
- `free_reaction_id` specifies which fluxes are considered free when using the row reduced echelon form (RREF) kernel basis and transformed coordinates
- `logit_xch_fluxes`: whether the input to the simulator has logit transformed exchange flux coordinates
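
As an illustration of the `svd` option, the sketch below computes an orthonormal null-space basis of the toy model's stoichiometric matrix with plain `numpy`. The matrix is written out by hand from the reactions defined above, the column order is my own choice, and none of this is taken from the `sbmfi` implementation.

```python
import numpy as np

# Columns: v1, v2, v3, co2_out, e_out, v4, a_in
S = np.array([
    [-1, -1,  0,  0,  0,  0, 1],   # A
    [ 0,  0,  1, -1,  0,  1, 0],   # co2
    [ 0,  0,  1,  0, -1,  1, 0],   # E
    [ 1,  0, -1,  0,  0,  0, 0],   # B
    [ 1,  1, -1,  0,  0, -1, 0],   # D
    [ 0,  1,  0,  0,  0, -1, 0],   # C
], dtype=float)

# Rows of Vt beyond the numerical rank span the null space of S
_, singular_values, Vt = np.linalg.svd(S)
rank = int((singular_values > 1e-10).sum())
kernel = Vt[rank:].T   # one orthonormal basis vector per column
```

The kernel has two columns. Fixing `a_in` to 10 through its bounds removes one degree of freedom, which is consistent with the model having a single free flux.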

In [7]:
model.build_simulator(
    free_reaction_id=['v4'],
    kernel_basis='rref',
    basis_coordinates='transformed',
    logit_xch_fluxes=False,
    verbose=False,
)

fcm = model._fcm
type(fcm)

sbmfi.core.polytopia.FluxCoordinateMapper

The `fcm` variable above is where fluxes are mapped between different coordinate systems. The simulator takes labelling fluxes as input, whereas the thermodynamic coordinate system is more convenient and interpretable, and is what is typically reported in the literature on fluxes. Uniform sampling of a polytope is done in a rounded, full-dimensional polytope, which is yet another coordinate system for the same variables. When we build the simulator, a `sbmfi.core.polytopia.FluxCoordinateMapper` is created behind the scenes, in which all the relevant coordinate mappings are specified. For the `model` built above, we chose an `rref` null-space kernel and `transformed` coordinates, which corresponds to using free fluxes as our input coordinate system.

`fcm` has three `sbmfi.core.polytopia.LabellingPolytope` attributes: `fcm._Fn`, `fcm._F` and `fcm._Ft`, which represent the net-flux, labelling-flux and thermodynamic-flux polytopes respectively. Below we see that the `net_flux_polytope` object has attributes `S` and `h`, which represent the stoichiometric matrix and the right-hand side of the equality constraints, and attributes `A` and `b`, which represent the inequality constraints.

In [8]:
net_flux_polytope = fcm._Fn
print(net_flux_polytope.__dict__.keys())
pd.concat([net_flux_polytope.S, net_flux_polytope.h], axis=1)  # last column is thus the equality constraints!

dict_keys(['A', 'b', 'shift', 'transformation', 'inequality_only', 'S', 'h', '_mapper', '_objective', '_cvx_result', '_nlr'])


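| | v1 | v2 | v3 | co2_out | e_out | v4 | a_in | eq |
|---|---|---|---|---|---|---|---|---|
| A | -1.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 |
| co2 | 0.0 | 0.0 | 1.0 | -1.0 | 0.0 | 1.0 | 0.0 | 0 |
| E | 0.0 | 0.0 | 1.0 | 0.0 | -1.0 | 1.0 | 0.0 | 0 |
| B | 1.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| D | 1.0 | 1.0 | -1.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0 |
| C | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0 |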


Net fluxes are sampled from an affinely transformed and rounded polytope that is constructed in a `sbmfi.core.polytopia.PolytopeSamplingModel` object. This object is stored as the `fcm._sampler` attribute. The rounded polytope from which variables are sampled is stored in the `fcm._sampler._F_round` attribute and contains only inequality constraints.

In [9]:
rounded_net_flux_polytope = fcm._sampler._F_round
pd.concat([rounded_net_flux_polytope.A, rounded_net_flux_polytope.b], axis=1)  # last column is thus the inequality constraints!

| | B_v4 | ineq |
|---|---|---|
| v3\|lb | 5.0 | 5.0 |
| v4\|lb | -5.0 | 5.0 |
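
The two inequalities above read $5x \le 5$ and $-5x \le 5$, so the rounded coordinate $x$ lives in $[-1, 1]$. The sketch below samples that interval uniformly with `numpy` and maps the samples back to `v4`. The affine map `v4 = 5 * x + 5` is my own inference from the row labels (`v3|lb`, `v4|lb`) and the bounds of the model; it is an assumption and is not read from the `fcm` object.

```python
import numpy as np

# Inequality system A x <= b of the rounded polytope, copied from the output above
A = np.array([[5.0], [-5.0]])
b = np.array([5.0, 5.0])

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=(1000, 1))   # samples of the rounded coordinate

# Every sample should satisfy both inequalities
inside = np.all(x @ A.T <= b, axis=1)

# Assumed affine map back to the free flux (not taken from sbmfi)
v4 = 5.0 * x[:, 0] + 5.0
```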


Our model has only a single free flux, which we set to be `v4`. If we pick some values for `v4`, we can therefore map them to full flux vectors. To figure out which values of `v4` are allowed, we can run Flux Variability Analysis (FVA):

In [10]:
from sbmfi.core.polytopia import fast_FVA
fast_FVA(fcm._Fn)

| | min | max |
|---|---|---|
| v1 | -0.0 | 10.0 |
| v2 | -0.0 | 10.0 |
| v3 | -0.0 | 10.0 |
| co2_out | 10.0 | 10.0 |
| e_out | 10.0 | 10.0 |
| v4 | -0.0 | 10.0 |
| a_in | 10.0 | 10.0 |
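
For readers unfamiliar with FVA: each flux is minimised and maximised in turn, subject to the steady-state constraint and the flux bounds. The sketch below does this for the toy model with `scipy.optimize.linprog`. It is a generic textbook formulation and makes no claim about how `fast_FVA` is implemented; the stoichiometric matrix and bounds are copied by hand from the model definition.

```python
import numpy as np
from scipy.optimize import linprog

# Columns: v1, v2, v3, co2_out, e_out, v4, a_in
S = np.array([
    [-1, -1,  0,  0,  0,  0, 1],   # A
    [ 0,  0,  1, -1,  0,  1, 0],   # co2
    [ 0,  0,  1,  0, -1,  1, 0],   # E
    [ 1,  0, -1,  0,  0,  0, 0],   # B
    [ 1,  1, -1,  0,  0, -1, 0],   # D
    [ 0,  1,  0,  0,  0, -1, 0],   # C
], dtype=float)
bounds = [(0.0, 100.0)] * 6 + [(10.0, 10.0)]   # a_in is fixed at 10

def flux_variability(S, bounds):
    """Minimise and maximise each flux subject to S v = 0 and the bounds."""
    n_metabolites, n_reactions = S.shape
    b_eq = np.zeros(n_metabolites)
    ranges = np.zeros((n_reactions, 2))
    for i in range(n_reactions):
        objective = np.zeros(n_reactions)
        objective[i] = 1.0
        low = linprog(objective, A_eq=S, b_eq=b_eq, bounds=bounds, method='highs')
        high = linprog(-objective, A_eq=S, b_eq=b_eq, bounds=bounds, method='highs')
        ranges[i] = low.fun, -high.fun   # maximum is minus the minimum of -v_i
    return ranges

fva = flux_variability(S, bounds)   # rows follow the column order of S
```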


We chose `v4` as our free flux, and we see from the FVA results above that its range is `0.0 - 10.0`. Below we manually pick 4 values of the free flux within this range:

In [11]:
free_fluxes = pd.DataFrame([4.0, 6.0, 5.0, 8.0], columns=['v4'])
free_fluxes

| | v4 |
|---|---|
| 0 | 4.0 |
| 1 | 6.0 |
| 2 | 5.0 |
| 3 | 8.0 |


Many of the functions in `sbmfi` accept a boolean `pandalize` keyword argument, which turns the output of that function into a `pd.DataFrame` with sensible axis labels. Below we map the free fluxes to labelling fluxes via the coordinate mapper:

In [12]:
fluxes = fcm.map_theta_2_fluxes(free_fluxes, pandalize=True)
fluxes

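| | v1 | v2 | v3 | co2_out | e_out | v4 | a_in |
|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 4.0 | 6.0 | 10.0 | 10.0 | 4.0 | 10.0 |
| 1 | 4.0 | 6.0 | 4.0 | 10.0 | 10.0 | 6.0 | 10.0 |
| 2 | 5.0 | 5.0 | 5.0 | 10.0 | 10.0 | 5.0 | 10.0 |
| 3 | 2.0 | 8.0 | 2.0 | 10.0 | 10.0 | 8.0 | 10.0 |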
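
For this small model the mapping can also be written out by hand, which makes it clearer what the coordinate mapper does. At steady state, `v2 = v4`, `v1 = v3 = a_in - v4` and `co2_out = e_out = a_in`, so every flux vector is an affine function of `v4`. The sketch below is a hand derivation for this toy model only; the names `offset` and `direction` are mine and do not correspond to `sbmfi` attributes.

```python
import numpy as np

# Columns: v1, v2, v3, co2_out, e_out, v4, a_in
S = np.array([
    [-1, -1,  0,  0,  0,  0, 1],   # A
    [ 0,  0,  1, -1,  0,  1, 0],   # co2
    [ 0,  0,  1,  0, -1,  1, 0],   # E
    [ 1,  0, -1,  0,  0,  0, 0],   # B
    [ 1,  1, -1,  0,  0, -1, 0],   # D
    [ 0,  1,  0,  0,  0, -1, 0],   # C
], dtype=float)

# Flux vector = contribution of the fixed uptake a_in = 10 + v4 times a kernel direction
offset = 10.0 * np.array([1, 0, 1, 1, 1, 0, 1], dtype=float)
direction = np.array([-1, 1, -1, 0, 0, 1, 0], dtype=float)

free = np.array([4.0, 6.0, 5.0, 8.0])
fluxes_by_hand = offset + free[:, None] * direction   # shape (4, 7)

# Each row must satisfy the steady-state condition S v = 0
steady_state_residual = S @ fluxes_by_hand.T
```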


To simulate mass distribution vectors for the measured metabolites (in this case only metabolite `E`), we set the labelling fluxes and then call `model.cascade()` to run the simulation. Note that we MUST set 2 flux vectors at a time, since that is the `batch_size` that we set in the `LinAlg` object above!

In [13]:
model.set_fluxes(fluxes.loc[[0, 1]])
mdvs_1 = model.cascade(pandalize=True).copy()
model.set_fluxes(fluxes.loc[[2, 3]])
mdvs_2 = model.cascade(pandalize=True).copy()
pd.concat([mdvs_1, mdvs_2])

| mdv_id | E+0 | E+1 | E+2 |
|---|---|---|---|
| 0 | 0.24 | 0.52 | 0.24 |
| 1 | 0.24 | 0.52 | 0.24 |
| 2 | 0.25 | 0.5 | 0.25 |
| 3 | 0.16 | 0.68 | 0.16 |


In the cell above we see that the MDVs for flux vectors `[0, 1]` are exactly the same, which highlights the issue of unidentifiability!
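
The reason for this becomes visible when the MDV of `E` is written in closed form. The expression below is derived by hand from the atom mappings and the `A/011` input and is not taken from `sbmfi`. With $d = 1 - v_4 / 10$ the fraction of `D` that is labelled, `E` made by `v3` carries one unlabelled atom from `B` plus the atom from `D`, while `E` made by `v4` carries one labelled atom from `C` plus the atom from `D`.

```python
import numpy as np

def mdv_of_E(v4, a_in=10.0):
    """Hand-derived MDV (M+0, M+1, M+2) of E for the toy model with input A/011."""
    d = 1.0 - np.asarray(v4, dtype=float) / a_in   # labelled fraction of D
    return np.stack([d * (1 - d), d ** 2 + (1 - d) ** 2, (1 - d) * d], axis=-1)

mdvs_by_hand = mdv_of_E([4.0, 6.0, 5.0, 8.0])

# The expression is symmetric under d -> 1 - d, i.e. v4 -> 10 - v4
same_for_mirror_pair = np.allclose(mdv_of_E(3.0), mdv_of_E(7.0))
```

Because the expression is unchanged when $d$ is replaced by $1 - d$, the values `v4` and `10 - v4` always produce the same MDV, so measuring `E` alone cannot distinguish them.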

The fundamental calculation of the EMU simulation algorithm is the solution of a cascade of linear systems: $\pmb{X}_w = \pmb{A}_w^{-1}\cdot \pmb{B}_w \cdot \pmb{Y}_w$, where the subscript $w$ indicates the size of the EMUs in that step of the cascade. For more information on the algorithm, please read [the publication](https://www.sciencedirect.com/science/article/pii/S109671760600084X?via%3Dihub). We can inspect the matrices that are used for the simulation by calling `model.pretty_cascade(weight=...)`, which returns a `dict` with all the matrices of that size. Note that since `batch_size=2`, we have one set of matrices per flux vector that was passed (in this case vectors `[2, 3]`)!

In [14]:
EMU_matrices = model.pretty_cascade(2)
print(EMU_matrices.keys())
EMU_matrices['B']

dict_keys(['A', 'B', 'X', 'Y'])


| | | B\|[0] ∗ D\|[0] | C\|[1] ∗ D\|[0] |
|---|---|---|---|
| 2 | E\|[0,1] | -5.0 | -5.0 |
| 3 | E\|[0,1] | -2.0 | -8.0 |
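
The size-2 step for flux vectors `[2, 3]` can be reproduced by hand with `numpy`. In the sketch below, the `B` matrices are copied from the output above. Everything else is my own assumption for this toy model: `A` is taken to be minus the total flux into `E`, and the size-1 MDVs that enter `Y` are derived by hand from the atom mappings rather than read from `sbmfi`.

```python
import numpy as np

v3 = np.array([5.0, 2.0])
v4 = np.array([5.0, 8.0])
d = v3 / (v3 + v4)   # labelled fraction of D|[0], since v1 = v3 and v1 + v2 = v3 + v4

# One (1 x 1) A matrix and one (1 x 2) B matrix per flux vector in the batch
A = -(v3 + v4).reshape(2, 1, 1)                   # assumed sign convention
B = -np.stack([v3, v4], axis=1).reshape(2, 1, 2)  # matches the table above

# Size-1 MDVs (M+0, M+1) that are convolved to build Y
mdv_B0 = np.array([1.0, 0.0])   # B|[0] is unlabelled
mdv_C1 = np.array([0.0, 1.0])   # C|[1] is labelled
Y = np.stack([
    np.stack([
        np.convolve(mdv_B0, [1 - di, di]),   # B|[0] * D|[0]
        np.convolve(mdv_C1, [1 - di, di]),   # C|[1] * D|[0]
    ])
    for di in d
])                                           # shape (2, 2, 3)

# Solve A X = B Y for every flux vector in the batch
X = np.linalg.solve(A, B @ Y)                # shape (2, 1, 3)
```

Under these assumptions, the rows of `X` agree with the simulated MDVs of `E` for flux vectors 2 and 3.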


For now, I think this is enough information. The next step in the simulation process is to simulate actual measurements via an observation model; these models are specified in `sbmfi.core.observation`. I will leave this for a follow-up once you are familiar with this part of the package.