# Size Effect Normalization Tutorial

## 1. General Information

This tutorial explains how to use the ```size_effect_normalization``` package that accompanies our publication "Probabilistic quotient's work & pharmacokinetics' contribution: countering size effect in metabolic time series measurements". A preprint of the manuscript is available on bioRxiv [![DOI:10.1101/2022.01.17.476591](https://zenodo.org/badge/DOI/10.1007/978-3-319-76207-4_15.svg)](https://doi.org/10.1101/2022.01.17.476591).

The package is divided into three submodules:
1. ```size_effect_normalization.extended_model``` contains the model classes for the PKM and MIX models.
2. ```size_effect_normalization.normalization``` contains wrappers that call the model classes and optimize them.
3. ```size_effect_normalization.synthetic_data_generation``` contains functions for synthetic data generation.

## 2. Installation

To install the package, clone the Git repository and run ```python setup.py install``` in the base folder. We recommend using a virtual environment with Python 3.7 and all packages listed in ```requirements.txt```. Afterwards, all modules can be imported.

In [1]:
from size_effect_normalization import extended_model
from size_effect_normalization import synthetic_data_generation
from size_effect_normalization import normalization

Docstrings of all functions can be accessed with ```?<function>```.
Other required imports for this tutorial are:

In [2]:
import numpy as np
# Set seed of RNG
np.random.seed(13)

## 3. Generate Synthetic Data

For this tutorial we use synthetically generated data, as described in the original manuscript, instead of real data.

### 3.1 Definition of Data Parameters

In [3]:
# We assume that the first four metabolites have a describable kinetic over time.
# Definition of basic toy model kinetic parameters.
toy_parameters = np.array([[2,.1,1,0,.1],
                           [2,.1,2,0,.1],
                           [2,.1,3,0,.1],
                           [2,.1,.5,0,.1]])
n_known_metabolites = toy_parameters.shape[0] # i.e. 4
# Definition of time points of toy model
timepoints = np.linspace(0,15,20) # i.e. 20 equidistant time points from 0 to 15 h
n_timepoints = len(timepoints)
# Definition of bounds of pharmacokinetic parameters.
bounds_per_metabolite  = [3,3,5,15,3]
# Definition of experimental error size (SD/Mean)
error_sigma = .2
# Definition of the total number of metabolites in the data set.
n_metabolites = 60
# number of replicates
n_replicates = 1

### 3.2 Sampling

In [4]:
## SAMPLE VOLUMES (i.e. size effects).
v_tensor, v_list = synthetic_data_generation.generate_sweat_volumes(n_replicates,
                                                                    n_metabolites,
                                                                    n_timepoints)
# v_tensor is the expanded version of v_list: v_tensor has the shape
# (n_replicates,n_metabolites,n_timepoints), v_list has the shape (n_replicates,n_timepoints).
assert (v_tensor[:,0,:] == v_list[:,:]).all()
print(v_tensor.shape)

## SAMPLE EXPERIMENTAL ERRORS
e_tensor = synthetic_data_generation.generate_experimental_errors(n_replicates=n_replicates,
                                                                  n_metabolites=n_metabolites,
                                                                  n_timepoints=n_timepoints,
                                                                  error_sigma=error_sigma)
# In contrast to v_tensor, e_tensor does not have repetitive elements in the n_metabolites dimension.
print(e_tensor.shape)

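## SAMPLE MEASURED DATA
# The three calls below are alternatives. Each call overwrites c_tensor,
# so the following cells use the result of the last call (v3).
# Simulation v1 from the manuscript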
c_tensor = synthetic_data_generation.generate_random_kinetic_data(n_known_metabolites,
                                                                  n_metabolites,
                                                                  toy_parameters,
                                                                  timepoints,
                                                                  bounds_per_metabolite)
# Simulation v2 from the manuscript
c_tensor = synthetic_data_generation.generate_completely_random_data(n_known_metabolites,
                                                                     n_metabolites,
                                                                     toy_parameters,
                                                                     timepoints,
                                                                     bounds_per_metabolite)
# Simulation v3 from the manuscript
c_tensor = synthetic_data_generation.generate_random_from_real_data(n_known_metabolites,
                                                                    n_metabolites,
                                                                    toy_parameters,
                                                                    timepoints,
                                                                    bounds_per_metabolite)
print(c_tensor.shape)

(1, 60, 20)
(1, 60, 20)
(60, 20)
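
The relation between the two volume arrays checked by the ```assert``` above can be illustrated with plain NumPy. This is only a minimal sketch with made-up volumes; the names ```v_list_demo``` and ```v_tensor_demo``` and the sampling range are assumptions for illustration, and ```generate_sweat_volumes``` may build its output differently.

```python
import numpy as np

# Hypothetical stand-in for v_list: one volume per replicate and time point.
# The uniform range is arbitrary and not taken from the package.
n_replicates, n_metabolites, n_timepoints = 1, 60, 20
rng = np.random.default_rng(0)
v_list_demo = rng.uniform(0.05, 4.0, size=(n_replicates, n_timepoints))

# Copy the volumes along a new metabolite axis, so that every metabolite
# measured in one sample is assigned the same volume.
v_tensor_demo = np.repeat(v_list_demo[:, np.newaxis, :], n_metabolites, axis=1)

print(v_tensor_demo.shape)  # (1, 60, 20)

# Each metabolite slice equals the original volume vector.
assert (v_tensor_demo[:, 0, :] == v_list_demo).all()
assert (v_tensor_demo == v_list_demo[:, np.newaxis, :]).all()
```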


```v_tensor``` and ```e_tensor``` have the shape ```(n_replicates, n_metabolites, n_timepoints)```. ```v_tensor``` has duplicate elements along the ```n_metabolites``` axis because all metabolites of one sample share the same size effect.
```c_tensor``` has the shape ```(n_metabolites, n_timepoints)```.
The synthetic measured mass table is calculated by multiplying the three arrays element-wise.

In [5]:
# calculate M_tilde
m_tensor = c_tensor * v_tensor[0,:,:] * e_tensor[0,:,:]
print(m_tensor.shape)

(60, 20)
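
The multiplication can also be written with a one-dimensional volume vector, because NumPy broadcasts it across the metabolite axis. The sketch below uses hypothetical stand-in arrays (```c_demo```, ```v_demo```, ```e_demo```) and assumes a multiplicative error centred on 1. Neither the names nor the distributions come from the package. It also shows that dividing by the true volumes removes the size effect, so only the experimental error remains.

```python
import numpy as np

rng = np.random.default_rng(1)
n_metabolites, n_timepoints = 60, 20

# Hypothetical stand-ins, not generated by the package:
c_demo = rng.uniform(0.1, 3.0, size=(n_metabolites, n_timepoints))  # concentrations
v_demo = rng.uniform(0.05, 4.0, size=n_timepoints)                  # one volume per sample
e_demo = rng.normal(1.0, 0.2, size=(n_metabolites, n_timepoints))   # assumed multiplicative error

# v_demo has shape (20,) and is broadcast over the 60 metabolites.
m_demo = c_demo * v_demo * e_demo
print(m_demo.shape)  # (60, 20)

# Normalizing by the true volumes removes the size effect.
recovered = m_demo / v_demo
assert np.allclose(recovered, c_demo * e_demo)
```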


## 4. Size Effect Normalization

For this tutorial, PQN, PKM<sub>minimal</sub> and MIX<sub>minimal</sub> normalization are performed with the optimization parameters described in the manuscript.
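
For orientation, probabilistic quotient normalization in its textbook form can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation behind ```normalization.calculate_pqn```: the function name ```pqn_sketch``` is hypothetical, the reference profile is taken as the median over all samples, and the optional total-sum pre-normalization that some PQN variants apply is omitted.

```python
import numpy as np

def pqn_sketch(m):
    """Textbook PQN for a table of shape (n_metabolites, n_timepoints)."""
    m = np.asarray(m, dtype=float)
    # Reference profile: median abundance of each metabolite over all samples.
    reference = np.median(m, axis=1, keepdims=True)
    # Quotient of every measurement to the reference value of its metabolite.
    quotients = m / reference
    # The median quotient of each sample is its estimated relative size effect.
    return np.median(quotients, axis=0)

# Noise-free toy data: constant concentrations scaled by hypothetical volumes.
rng = np.random.default_rng(2)
c_demo = rng.uniform(0.5, 2.0, size=(60, 1)) * np.ones((60, 20))
v_true = rng.uniform(0.05, 4.0, size=20)
q = pqn_sketch(c_demo * v_true)

# The estimate is proportional to the true volumes, not equal to them.
ratio = q / v_true
assert np.allclose(ratio, ratio[0])
```

In this sketch the quotients are only defined up to a common factor, so the result describes relative rather than absolute volumes.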

In [6]:
## CREATE BOUNDS FOR THE MODEL
# mini model
mini_lb = np.concatenate((np.zeros(5*n_known_metabolites),np.ones(n_timepoints)*.05))
mini_ub = np.concatenate((bounds_per_metabolite*n_known_metabolites,np.ones(n_timepoints)*4))
# the bounds for known parameters are set to the true values + precision of the optimization function.
for p in [2,3,4]:
    mini_lb[:n_known_metabolites*5][p::5] = toy_parameters[:,p]
    mini_ub[:n_known_metabolites*5][p::5] = toy_parameters[:,p]+10e-8
    
## OPTIMIZATION PARAMETERS
n_cpu = 1
n_mc_replicates = 10
mini_lambda = 1/(n_known_metabolites+1)
    
## NORMALIZE FOR SWEAT VOLUME
# PQN
v_pqn                      = normalization.calculate_pqn(m_tensor)
# PKM
v_pkm_mini, pkm_mini_model = normalization.calculate_pkm(m_tensor[:n_known_metabolites,:], # only the first 4 metabolites are used for kinetic fitting: mini model
                                                         mini_lb,mini_ub,                  # parameter bounds
                                                         timepoints,                       # time point vector
                                                         n_known_metabolites,              # number of known metabolites, i.e 4
                                                         n_cpu,                            # number of CPUs to use, i.e 1
                                                         n_mc_replicates,                  # number of Monte Carlo replicates for optimization
                                                         'max_cauchy_loss',                # Loss name
                                                         'none',                           # transformation function
                                                         mini_lambda)                      # lambda
# MIX
v_mix_mini, mix_mini_model = normalization.calculate_mix(m_tensor[:n_known_metabolites,:], # only the first 4 metabolites are used for kinetic fitting: mini model
                                                         v_pqn,                            # PQN results
                                                         mini_lb,mini_ub,                  # parameter bounds
                                                         timepoints,                       # time point vector
                                                         n_known_metabolites,              # number of known metabolites, i.e 4
                                                         n_cpu,                            # number of CPUs to use, i.e 1
                                                         n_mc_replicates,                  # number of Monte Carlo replicates for optimization
                                                         'cauchy_loss',                    # Loss name
                                                         'log10',                          # transformation function
                                                         'standard',                       # scaling function
                                                         mini_lambda)                      # lambda

100%|██████████| 10/10 [00:01<00:00,  6.23it/s]
100%|██████████| 10/10 [00:02<00:00,  3.95it/s]
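
The bound vectors built at the top of the cell above use strided slicing on a flat parameter vector. The following sketch repeats that construction with shortened, hypothetical names (```lb```, ```ub```, ```toy```) to make the layout visible: five kinetic parameters per known metabolite, followed by one volume per time point. Note that ```10e-8``` equals ```1e-7```.

```python
import numpy as np

n_known, n_time = 4, 20
bounds_per_metabolite = [3, 3, 5, 15, 3]
toy = np.array([[2, .1, 1, 0, .1],
                [2, .1, 2, 0, .1],
                [2, .1, 3, 0, .1],
                [2, .1, .5, 0, .1]])

# Flat layout: 5 kinetic parameters for each known metabolite,
# then one volume per time point.
lb = np.concatenate((np.zeros(5 * n_known), np.ones(n_time) * .05))
# Multiplying a Python list by an integer repeats it: 5 * 4 = 20 entries.
ub = np.concatenate((bounds_per_metabolite * n_known, np.ones(n_time) * 4))

# [p::5] selects parameter p of every metabolite inside the kinetic block.
# Basic slices are views, so the assignment changes lb and ub in place.
for p in [2, 3, 4]:
    lb[:n_known * 5][p::5] = toy[:, p]
    ub[:n_known * 5][p::5] = toy[:, p] + 10e-8

print(lb.size)  # 40
print(lb[:5])   # lower bounds of the first metabolite
print(ub[:5])   # upper bounds of the first metabolite
```

The total of 5 · 4 + 20 = 40 entries agrees with the ```parameters``` value that ```info()``` reports for the minimal model.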


```<MIX_model>.get_sweat_volumes()``` can be called to get the estimated size effect volumes.

In [7]:
mix_mini_model.get_sweat_volumes()

array([0.72004742, 2.15850848, 1.1595772 , 1.64176852, 2.9131256 ,
       1.76553713, 3.05393397, 2.2016172 , 2.90195239, 0.51357463,
       0.63392759, 0.41319098, 1.50371188, 0.98762766, 2.3255772 ,
       1.42501868, 1.33349867, 1.91792571, 1.06824277, 0.56653971])

General information about the model is printed by calling ```<MIX_model>.info()```.

In [8]:
mix_mini_model.info()

| | MIX Model |
|---|---|
| n_metabolites | 4 |
| n_timepoints | 20 |
| pkm_fun | bateman |
| trans_fun | log10 |
| scale_fun | standard |
| parameters | 40 |
| bounds | True |
| measured data | True |
| metabolite names | False |
| is optimized | True |


Normalized values (i.e. C) for the metabolites used for modeling are returned by ```<MIX_model>.get_C_df()```. Metabolite names have to be set beforehand.

In [9]:
mix_mini_model.set_metabolite_names([f'Metabolite {i+1}' for i in range(4)])
mix_mini_model.get_C_df()

| | time | Metabolite 1 | Metabolite 2 | Metabolite 3 | Metabolite 4 |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.1 | 0.1 | 0.1 | 0.1 |
| 1 | 0.789474 | 0.940477 | 1.472053 | 1.588787 | 0.4316 |
| 2 | 1.578947 | 0.972801 | 1.76208 | 2.188761 | 0.506321 |
| 3 | 2.368421 | 0.917613 | 1.753107 | 2.370393 | 0.505039 |
| 4 | 3.157895 | 0.856888 | 1.666075 | 2.359192 | 0.482549 |
| 5 | 3.947368 | 0.799657 | 1.563002 | 2.262942 | 0.455351 |
| 6 | 4.736842 | 0.746637 | 1.460997 | 2.13287 | 0.42832 |
| 7 | 5.526316 | 0.697621 | 1.364497 | 1.993057 | 0.402813 |
| 8 | 6.315789 | 0.65232 | 1.274385 | 1.854574 | 0.379126 |
| 9 | 7.105263 | 0.610452 | 1.190567 | 1.722276 | 0.357243 |


## 5. Analysis

The true synthetic and the estimated size effect volumes can now be compared with the two comparison metrics described in the manuscript:

In [10]:
def RMSE(true,fit):
    return np.sqrt(np.sum((true-fit)**2)/len(true))

def rRMSE(true,fit):
    y = fit/true
    return np.std(y/np.mean(y))

print('---------------------------')
print('METHOD   |   RMSE  |  rRMSE')
print('---------------------------')
print('PQN      |  {:6} | {:6.3f}'.format('   -- ',rRMSE(v_list[0],v_pqn)))
print('PKM_mini |  {:6.3f} | {:6.3f}'.format(RMSE(v_list[0],v_pkm_mini),rRMSE(v_list[0],v_pkm_mini)))
print('MIX_mini |  {:6.3f} | {:6.3f}'.format(RMSE(v_list[0],v_mix_mini),rRMSE(v_list[0],v_mix_mini)))
print('---------------------------')

---------------------------
METHOD   |   RMSE  |  rRMSE
---------------------------
PQN      |     --  |  0.059
PKM_mini |   0.271 |  0.106
MIX_mini |   0.108 |  0.057
---------------------------
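
The difference between the two metrics shows up in a small worked example. The sketch below reuses the definitions from the cell above with hypothetical volumes: an estimate that is off by a constant factor has a non-zero RMSE but an rRMSE of zero, because rRMSE only measures relative deviations. This may be why no RMSE is listed for PQN, although that reading is an assumption here.

```python
import numpy as np

def RMSE(true, fit):
    # Root mean squared error of the absolute values.
    return np.sqrt(np.sum((true - fit) ** 2) / len(true))

def rRMSE(true, fit):
    # Standard deviation of the mean-scaled ratio fit/true.
    y = fit / true
    return np.std(y / np.mean(y))

v_true = np.array([0.5, 1.0, 2.0, 4.0])  # hypothetical true volumes
v_scaled = 2.0 * v_true                  # correct up to a constant factor

print(RMSE(v_true, v_scaled))   # greater than zero: absolute values are off
print(rRMSE(v_true, v_scaled))  # 0.0: relative values are exact
```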


MIX<sub>minimal</sub> outperforms PKM<sub>minimal</sub> in both RMSE and rRMSE and achieves a slightly lower rRMSE than PQN (0.057 vs. 0.059); no RMSE is reported for PQN.