In [1]:
import logging
logger = logging.getLogger('cmdstanpy')
logger.addHandler(logging.NullHandler())
logger.propagate = False
logger.setLevel(logging.CRITICAL)

import numpy as np
import pandas as pd

import cmdstanpy
import arviz as az

import iqplot
import bebi103

import bokeh.io
import bokeh.layouts
import bokeh.plotting

bokeh.io.output_notebook()



## Data input and preprocessing

In [4]:
# Read in data set
df = pd.read_csv("20231117_aggregateGliding.csv", index_col=0)

# Relevant experimental conditions
conds = ["Motor Type", "Motor_Conc_uM", "P_Conc_uM", "ADP_Conc_uM", "ATP_Conc_uM"]

# Drop rows with NaNs
df = df.dropna(subset=conds)

# Unique identifier for each condition
df['condition'] = df.groupby(conds).ngroup()

# For convenience, sort by identifier
df = df.sort_values(by='condition').reset_index(drop=True)

# Compute mean speeds for plotting purposes
df_mean = df.groupby(conds)["speed (nm/s)"].mean().reset_index()


## Fit for a single experiment

We will consider a single experiment with [ADP] = [P] = 0 and [motor] = 1 nM.

In [3]:
# Pull out experiment with [ADP] = [P] = 0 and [motor] = 1 nM (0.001 µM)
inds = (
    (df["ADP_Conc_uM"] == 0) & (df["P_Conc_uM"] == 0) & (df["Motor_Conc_uM"] == 0.001)
)
df_exp = df.loc[inds, :]

inds_mean = (
    (df_mean["ADP_Conc_uM"] == 0) & (df_mean["P_Conc_uM"] == 0) & (df_mean["Motor_Conc_uM"] == 0.001)
)
df_mean_exp = df_mean.loc[inds_mean, :]


Let's take a look at a plot.

In [4]:
# Plot speed versus ATP conc
p = bokeh.plotting.figure(
    frame_width=300,
    frame_height=200,
    x_axis_label="[ATP] (µM)",
    y_axis_label="speed (nm/s)",
    x_axis_type="log",
#    y_axis_type='log',
)
p.circle(source=df_exp, x="ATP_Conc_uM", y="speed (nm/s)", alpha=0.03)
p.circle(source=df_mean_exp, x="ATP_Conc_uM", y="speed (nm/s)", color='tomato')

bokeh.io.show(p)

### Distribution for each condition

Let's make an ECDF for each condition to see how the variation in speed differs among them.

In [5]:
p = iqplot.ecdf(
    df_exp,
    q='speed (nm/s)',
    cats='condition',
    x_axis_type='log',
    show_legend=False,
    palette=['#1f77b4'],
)

bokeh.io.show(p)

These look approximately log-normal, so we will model them as such.

## Generative model

Let $i$ index the condition specified by motor type and the concentrations of motors, ATP, ADP, and Pi. Let $j$ index the repeat of an experiment for a given condition. Thus, $v_{ij}$ is the speed measured for condition $i$ in the $j$th measurement. Let $t_i$, $d_i$, and $p_i$ be the ATP, ADP, and Pi concentrations, respectively, for condition $i$.

The theoretical mean speed is

\begin{align}
V(t, d, p; k_\mathrm{cat}, K_T, K_D, K_P, K_{DP}) = k_\mathrm{cat}\,\frac{t/K_T}{1 + t/K_T + d/K_D + p/K_P + dp / K_{DP}}.
\end{align}
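As a quick numerical check of this expression, here is a minimal Python sketch of the same formula. The function name `v_theor` and the parameter values are chosen only for illustration; they are not fitted values.

```python
import numpy as np


def v_theor(atp, adp, p, kcat, KT, KD, KP, KDP):
    """Theoretical mean speed; concentrations in µM, kcat in nm/s."""
    atp = np.asarray(atp, dtype=float)
    adp = np.asarray(adp, dtype=float)
    p = np.asarray(p, dtype=float)
    denom = 1.0 + atp / KT + adp / KD + p / KP + adp * p / KDP
    return kcat * (atp / KT) / denom


# Hypothetical parameter values, for illustration only
params = dict(kcat=150.0, KT=50.0, KD=1e3, KP=1e3, KDP=1e6)

# With no ADP or Pi, the speed is half of kcat when [ATP] = KT
half_max = v_theor(50.0, 0.0, 0.0, **params)

# At saturating ATP, the speed approaches kcat from below
near_max = v_theor(1e6, 0.0, 0.0, **params)
```

With ADP and Pi absent, the expression reduces to the familiar Michaelis-Menten form, so $K_T$ is the ATP concentration giving half-maximal speed.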

Note that $k_\mathrm{cat}$ is defined in units of speed. For brevity, we define $\theta = (k_\mathrm{cat}, K_T, K_D, K_P, K_{DP})$ and $x = (t, d, p)$ so that we may write, e.g., $V(x;\theta)$. Let $V_i$ be the typical speed for a given set of conditions $x_i$. We model this as

\begin{align}
\ln V_i \sim \text{Norm}(\ln V(x_i;\theta), \sigma_m)\;\forall i.
\end{align}

That is, the speed varies log-normally from the theoretical speed with the same standard deviation in log units for every condition. There is reason to suspect this is the case, since the plug-in estimates for the standard deviation of the log speed are about the same for all conditions, as can be seen in the calculation below.

In [6]:
df_exp.groupby('condition')['speed (nm/s)'].apply(lambda x: np.std(np.log(x)))

condition
2     0.654434
3     0.513325
4     0.426814
5     0.516340
6     0.478018
7     0.405987
8     0.611179
9     0.504707
10    0.465614
11    0.635622
12    0.416573
13    0.414808
14    0.456701
15    0.386691
Name: speed (nm/s), dtype: float64

We will also assume homoscedasticity for the measured velocities.

\begin{align}
v_{ij} \sim \text{Norm}(V_i, \sigma)\;\forall i,j.
\end{align}

To finish the model specification, we need priors for the parameters $\theta$, $\sigma_m$, and $\sigma$. Here are some weakly informative priors I chose, where velocity units are in nm/s and concentration units are in µM.

\begin{align}
&k_\mathrm{cat} \sim \text{HalfNorm}(1000), \\[1em]
&\log_{10}K_T \sim \text{Norm}(3, 1.5),\\[1em]
&\log_{10}K_D \sim \text{Norm}(3, 1.5),\\[1em]
&\log_{10}K_P \sim \text{Norm}(3, 1.5),\\[1em]
&\log_{10}K_{DP} \sim \text{Norm}(6, 3),\\[1em]
&\sigma_m \sim \text{HalfNorm}(10),\\[1em]
&\sigma \sim \text{HalfNorm}(100).
\end{align}

With those priors, the model is complete and we can sample out of it using Stan.
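Before sampling, it can help to draw a data set from the priors to see what the model considers plausible. The following is a minimal NumPy sketch of such a prior predictive draw, not the Stan implementation. It takes the Norm(6, 3) prior to be the one for $\log_{10} K_{DP}$, and the ATP concentrations, number of repeats, and function names are hypothetical.

```python
import numpy as np


def v_theor(atp, adp, p, kcat, KT, KD, KP, KDP):
    """Theoretical mean speed; concentrations in µM, kcat in nm/s."""
    return kcat * (atp / KT) / (1.0 + atp / KT + adp / KD + p / KP + adp * p / KDP)


def prior_predictive(atp, adp, p, n_repeats, rng):
    """Draw one data set of speeds, shape (n_conditions, n_repeats)."""
    # Half-normal draws are absolute values of zero-centered normal draws
    kcat = np.abs(rng.normal(0.0, 1000.0))
    KT, KD, KP = 10 ** rng.normal(3.0, 1.5, size=3)
    KDP = 10 ** rng.normal(6.0, 3.0)
    sigma_m = np.abs(rng.normal(0.0, 10.0))
    sigma = np.abs(rng.normal(0.0, 100.0))

    # Typical speed per condition varies log-normally about the theory
    ln_V = rng.normal(np.log(v_theor(atp, adp, p, kcat, KT, KD, KP, KDP)), sigma_m)
    V = np.exp(ln_V)

    # Individual measurements are normally distributed about V_i
    return rng.normal(V[:, np.newaxis], sigma, size=(len(atp), n_repeats))


rng = np.random.default_rng(3252)

# Hypothetical conditions: three ATP concentrations (µM), no ADP or Pi
atp = np.array([10.0, 100.0, 1000.0])
adp = np.zeros(3)
p = np.zeros(3)

v_sim = prior_predictive(atp, adp, p, n_repeats=5, rng=rng)
```

Because the measurement model is Normal, draws like these can include negative speeds, which is worth keeping in mind when judging the priors.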

### Sampling with Stan

For use in Stan, we need to relabel the conditions starting with 1.

In [7]:
cond_dict = {cond: i + 1 for i, cond in enumerate(df_exp['condition'].unique())}

with pd.option_context('mode.chained_assignment', None):
    df_exp['condition_stan'] = df_exp['condition'].apply(lambda x: cond_dict[x])
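As an aside, the same relabeling can be sketched with `pd.factorize`, which assigns integer codes in order of first appearance, just as the dictionary above does. The data frame here is a toy stand-in, not the real data set.

```python
import pandas as pd

# Toy stand-in for df_exp; the real labels come from groupby().ngroup()
df_toy = pd.DataFrame({"condition": [2, 2, 5, 9, 5]})

# factorize returns 0-based codes, so add 1 for Stan's 1-based indexing
codes, uniques = pd.factorize(df_toy["condition"])
df_toy = df_toy.assign(condition_stan=codes + 1)
```

Using `assign` returns a new data frame, so there is no need to silence the chained-assignment warning.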

We should also make a data frame with one row per condition, from which we will pull the arrays of concentrations.

In [8]:
df_cond = df_exp.drop_duplicates('condition_stan')

Now, we can make a data dictionary for Stan.

In [9]:
data = {
    'N': len(df_exp),
    'N_conditions': len(df_cond),
    'condition': df_exp['condition_stan'].values,
    'atp': df_cond['ATP_Conc_uM'].values,
    'adp': df_cond['ADP_Conc_uM'].values,
    'p': df_cond['P_Conc_uM'].values,
    'speed': df_exp['speed (nm/s)'].values,
}
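The per-condition arrays are only meaningful if entry `k` describes the condition labeled `k + 1`. Here is a small sketch of a sanity check on a toy data set; the toy values and the names `df_toy`, `df_cond_toy`, and `data_toy` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_exp, already sorted by condition
df_toy = pd.DataFrame(
    {
        "condition_stan": [1, 1, 2, 2, 3],
        "ATP_Conc_uM": [10.0, 10.0, 100.0, 100.0, 1000.0],
        "speed (nm/s)": [20.0, 25.0, 80.0, 90.0, 120.0],
    }
)
df_cond_toy = df_toy.drop_duplicates("condition_stan")

data_toy = {
    "N": len(df_toy),
    "N_conditions": len(df_cond_toy),
    "condition": df_toy["condition_stan"].values,
    "atp": df_cond_toy["ATP_Conc_uM"].values,
    "speed": df_toy["speed (nm/s)"].values,
}

# Row k of the per-condition arrays must describe condition k + 1
labels_in_order = np.array_equal(
    df_cond_toy["condition_stan"].values,
    np.arange(1, data_toy["N_conditions"] + 1),
)

# Looking up ATP through the 1-based index must reproduce the per-row values
atp_per_row = data_toy["atp"][data_toy["condition"] - 1]
lookup_ok = np.array_equal(atp_per_row, df_toy["ATP_Conc_uM"].values)
```

Both checks pass here because the labels were assigned in order of first appearance and `drop_duplicates` keeps the first row of each condition.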

Let's look at the Stan model implementing this.

In [10]:
sm = cmdstanpy.CmdStanModel(stan_file='gliding_fit.stan')

print(sm.code())

functions {
  real V_theor(
    real atp,  
    real adp,
    real p,
    real kcat,
    real KT,
    real KD,
    real KP,
    real KDP) {
    // Theoretical value of velocity based on M-M kinetics
    real result = kcat * atp / KT / (1.0 + atp / KT + adp / KD + p / KP + adp * p / KDP);

    return result <= 0 ? 0.00001 : result;
  }
}

data {
  int N;
  int N_conditions;
  array[N] int condition;
  array[N_conditions] real atp;
  array[N_conditions] real adp;
  array[N_conditions] real p;
  array[N] real speed;
}



parameters {
  real<lower=0> kcat;        // units of nm/s
  real log_KT;               // KT in units of uM
  real log_KD;               // KD in units of uM
  real log_KP;               // KP in units of uM
  real log_KDP;              // KDP in units of uM^2
  real<lower=0> sigma_m;     // units of nm/s
  real<lower=0> sigma;       // units of nm/s

  vector[N_conditions] ln_V; // V in units of nm/s
}


transformed parameters {
  real KT = 10^log_KT;
  real KD = 10^log

Now to sample!

In [11]:
samples = sm.sample(data=data)
samples = az.from_cmdstanpy(samples)



First, a quick check of sampler diagnostics.

In [12]:
bebi103.stan.check_all_diagnostics(samples)

Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.


0

Everything looks good!

The samples for $K_D$, $K_P$, and $K_{DP}$ will come out of their priors, since they do not enter into the likelihood at all for this case, so we will ignore them. Let's look at our samples of the other parameters.

In [13]:
bokeh.io.show(
    bebi103.viz.corner(
        samples,
        parameters=[
            ("kcat", "kcat (nm/s)"),
            ("KT", "KT (µM)"),
            ("sigma", "σ (nm/s)"),
            ("sigma_m", "σm"),
        ],
        xtick_label_orientation=np.pi / 4,
    )
)

We'll check quickly to make sure $K_D$, $K_P$, and $K_{DP}$ are coming from their priors.

In [14]:
kwargs = dict(x_axis_type='log', frame_height=150)
p1 = iqplot.ecdf(samples.posterior.KD.values.flatten(), x_axis_label='KD (µM)', **kwargs)
p2 = iqplot.ecdf(samples.posterior.KP.values.flatten(), x_axis_label='KP (µM)', **kwargs)
p3 = iqplot.ecdf(samples.posterior.KDP.values.flatten(), x_axis_label='KDP (µM²)', **kwargs)

bokeh.io.show(bokeh.layouts.column(p1, p2, p3))

Yep. Just the priors.
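For a numerical version of this eyeball check, we could compare quantiles of the samples with quantiles of the prior. The sketch below uses synthetic draws as a stand-in for `samples.posterior.KD.values.flatten()`, so it shows only the mechanics. The probabilities in `probs` are an arbitrary choice.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the K_D samples, drawn here from the prior itself
KD_samples = 10 ** rng.normal(3.0, 1.5, size=4000)

# Quantiles of log10(K_D) from the samples and from the Norm(3, 1.5) prior
probs = [0.1, 0.25, 0.5, 0.75, 0.9]
sample_q = np.quantile(np.log10(KD_samples), probs)
prior_q = np.array([NormalDist(mu=3.0, sigma=1.5).inv_cdf(q) for q in probs])

# Largest discrepancy, in decades of concentration
max_abs_diff = np.max(np.abs(sample_q - prior_q))
```

A small `max_abs_diff` is consistent with the samples coming from the prior. Real MCMC samples are autocorrelated, so this is a rough check rather than a formal test.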

We can get 95% credible intervals on $k_\mathrm{cat}$ and $K_T$, our primary parameters of interest, recalling that $k_\mathrm{cat}$ is in units of nm/s and $K_T$ is in units of µM.

In [15]:
kcat_interval = np.percentile(samples.posterior.kcat.values, [2.5, 97.5])
KT_interval = np.percentile(samples.posterior.KT.values, [2.5, 97.5])

print(f"""kcat credible interval: {kcat_interval} nm/s
KT credible interval: {KT_interval} µM""")

kcat credible interval: [121.1839  170.40015] nm/s
KT credible interval: [38.445275  77.2647975] µM
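Since we compute the same kind of interval for several parameters, a small helper can cut down on repetition. This is a sketch only; the name `percentile_interval` is hypothetical and the samples are synthetic stand-ins with the (chain, draw) shape of a posterior variable.

```python
import numpy as np


def percentile_interval(x, mass=0.95):
    """Central credible interval containing a fraction `mass` of the samples."""
    tail = 100 * (1 - mass) / 2
    return np.percentile(np.ravel(x), [tail, 100 - tail])


# Synthetic stand-in for samples.posterior.kcat.values
rng = np.random.default_rng(1)
fake_kcat = rng.normal(145.0, 12.0, size=(4, 1000))

low, high = percentile_interval(fake_kcat)
```

With the default `mass=0.95`, this reproduces the `[2.5, 97.5]` percentiles used above.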


### The entire experiment

And now for the entire experiment. We remake our inputs and then let 'er rip!

In [16]:
cond_dict = {cond: i + 1 for i, cond in enumerate(df['condition'].unique())}

with pd.option_context('mode.chained_assignment', None):
    df['condition_stan'] = df['condition'].apply(lambda x: cond_dict[x])

df_cond = df.drop_duplicates('condition_stan')

data = {
    'N': len(df),
    'N_conditions': len(df_cond),
    'condition': df['condition_stan'].values,
    'atp': df_cond['ATP_Conc_uM'].values,
    'adp': df_cond['ADP_Conc_uM'].values,
    'p': df_cond['P_Conc_uM'].values,
    'speed': df['speed (nm/s)'].values,
}

samples = sm.sample(data=data)
samples = az.from_cmdstanpy(samples)



First, diagnostics.

In [17]:
bebi103.stan.check_all_diagnostics(samples)

Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.


0

All good.

Suspecting that $K_D$ and $K_{DP}$ will not be identifiable, we will plot the samples for the other parameters.

In [18]:
bokeh.io.show(
    bebi103.viz.corner(
        samples,
        parameters=[
            ("kcat", "kcat (nm/s)"),
            ("KT", "KT (µM)"),
            ("KP", "KP (µM)"),
            ("sigma", "σ (nm/s)"),
            ("sigma_m", "σm"),
        ],
        xtick_label_orientation=np.pi / 4,
    )
)

To verify that $K_D$ and $K_{DP}$ are not identifiable, we will plot their samples and show that they match the prior.

In [19]:
p1 = iqplot.ecdf(samples.posterior.KD.values.flatten(), x_axis_label='KD (µM)', **kwargs)
p2 = iqplot.ecdf(samples.posterior.KDP.values.flatten(), x_axis_label='KDP (µM²)', **kwargs)

bokeh.io.show(bokeh.layouts.column(p1, p2))

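They do not quite match the prior. Both are skewed right, toward large values, where the $d/K_D$ and $dp/K_{DP}$ terms in the denominator of the theoretical speed become negligible. This suggests that these terms play little role in the kinetics.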

Finally, let's get credible intervals on the parameters we care about.

In [20]:
kcat_interval = np.percentile(samples.posterior.kcat.values, [2.5, 97.5])
KT_interval = np.percentile(samples.posterior.KT.values, [2.5, 97.5])
KP_interval = np.percentile(samples.posterior.KP.values, [2.5, 97.5])

print(f"""kcat credible interval: {kcat_interval} nm/s
KT credible interval: {KT_interval} µM
KP credible interval: {KP_interval} µM""")

kcat credible interval: [102.55335  133.910425] nm/s
KT credible interval: [14.647795 33.262845] µM
KP credible interval: [ 4826.23225 16995.14   ] µM


$K_P$ is also very large, but apparently identifiable in this experiment. It seems that inorganic phosphate, like ADP, has little effect on the kinetics.
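To get a feel for how small the effect is, we can compute the ratio of the theoretical speed with and without Pi in the absence of ADP, which from the expression for $V$ is $(1 + t/K_T)/(1 + t/K_T + p/K_P)$. In the sketch below, the $K_P$ values are roughly the ends of the credible interval above and $K_T$ is a round number inside its interval. The ATP and Pi concentrations are hypothetical, not taken from the data set.

```python
def relative_speed(atp, p, KT, KP):
    """V(atp, adp=0, p) / V(atp, adp=0, p=0) for the theoretical mean speed."""
    return (1.0 + atp / KT) / (1.0 + atp / KT + p / KP)


# KT (µM) is a round value inside its credible interval
KT = 25.0

# Hypothetical concentrations (µM), for illustration only
atp_conc = 1000.0
p_conc = 1000.0

# KP (µM) at roughly the two ends of its credible interval
ratios = {
    KP: relative_speed(atp=atp_conc, p=p_conc, KT=KT, KP=KP)
    for KP in (4826.0, 16995.0)
}

for KP, ratio in ratios.items():
    print(f"KP = {KP:.0f} µM: speed is {100 * ratio:.1f}% of the Pi-free value")
```

The reduction is larger at low ATP, where the $1 + t/K_T$ term is smaller, so how much Pi matters depends on the concentrations actually used.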