<img src="images/JHI_STRAP_Web.png" style="width: 150px; float: right;">

# Supplementary Information: Holmes *et al.* 2017

# 4. *etpD* knockouts and complementation

This notebook describes raw data import, cleaning and QA, then modelling of the *etpD* knockout and complementation experiments.

## Table of Contents

 1. [Experiment design](#design)
 2. [Data import](#import_data)
    1. [Problematic probes](#problem_probes)
    2. [Interpolation for problematic probes](#interpolation)   
 3. [Model definition](#definition)
 4. [Wide to long form](#wide_to_long)
 5. [Probe matches to Sakai and DH10B](#probe_matches)
 6. [Write data](#write)

## Python imports

```python
# Load numpy and matplotlib into the interactive namespace (IPython magic)
%pylab inline

# Suppress warnings to keep notebook output readable
import warnings
warnings.filterwarnings('ignore')
```


## 1. Experiment design <a id="design"></a>

The experiments comprise four variants of *E. coli* Sakai:

1. Wild Type (`WT`)
2. *etpD* knockouts: ΔetpD (`KO`)
3. empty plasmid pSE380: (`empty`)
4. plasmid pSE380 carrying *etpD* complement: (`complement`)

Each of these four variants is exposed separately to two spinach plant tissues, *leaf* and *root*, and logCFU is determined as a proxy for adherence, as described in the manuscript.

The questions at hand are:

1. Is there a difference in logCFU for `KO` with respect to `WT`? If so, what is the direction and magnitude of change?
2. Is there a difference in logCFU for `complement` with respect to `empty`? If so, what is the direction and magnitude of change?

For each tissue, the experiments were conducted such that each logCFU measurement was acquired on a single, independent plant or leaf. This means that there is **no natural pairing of specific `WT` and `KO`, or `empty` and `complement` measurements**. The natural comparison is within-batch, and pooled.

Measurements were made in batches, with five replicates of each of two variants per batch. That is, on a particular half-day a batch of ten measurements was made: five `KO` and five `WT`; or five `empty` and five `complement`. There is therefore **no natural pairing of `WT`/`KO` to `empty`/`complement` measurements**, as these were carried out in different batches at different times. We may assume that each batch is subject to specific effects that may bias the observed logCFU with respect to other batches.

## 2. Data import <a id="import_data"></a>

<div class="alert alert-warning">
Raw data were previously converted to plain text comma-separated values (CSV) format from the `Excel` file `etpD_raw_data.xlsx`:

<ul>
<li> The file `leaves.csv` contains data from experiments on spinach leaves
<li> The file `roots.csv` contains data from experiments on spinach roots
</ul>
</div>
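As a minimal sketch of how one of these files might be read, the helper below loads a CSV into a `pandas` dataframe and tags each row with its tissue. The function name `load_tissue`, the column names (`variant`, `batch`, `logCFU`) and the two example rows are hypothetical stand-ins, not taken from the real `leaves.csv` or `roots.csv`.

```python
import io

import pandas as pd


def load_tissue(source, tissue):
    """Read one tissue's CSV data and tag every row with the tissue name."""
    data = pd.read_csv(source)
    data["tissue"] = tissue
    return data


# Stand-in for `leaves.csv`: column names and values here are hypothetical
example_csv = io.StringIO(
    "variant,batch,logCFU\n"
    "WT,1,6.1\n"
    "KO,1,5.4\n"
)
leaves = load_tissue(example_csv, "leaf")
```

With the real files, `load_tissue("leaves.csv", "leaf")` and `load_tissue("roots.csv", "root")` would be called in the same way, assuming the files sit alongside the notebook.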

## 3. Model definition <a id="definition"></a>

We assume that each measurement with index *i* measures the logCFU (a proxy for the extent of adherence/attachment) of a particular Sakai variant when recovered from plant tissue. We define this measurement as the output $y_{i}$, and assume that it represents the true value of logCFU ($\hat{y_{i}}$) plus some irreducible measurement error ($\epsilon_i$). The error is assumed to follow the same distribution for all measurements: normal, with mean 0 and variance $\sigma^{2}_{y}$.

$$y_i = \hat{y_i} + \epsilon_i$$
$$\epsilon_i \sim N(0, \sigma_y^2) \implies y_i \sim N(\hat{y_i}, \sigma_y^2)$$
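The measurement model above implies a normal likelihood for the observations. A minimal sketch of the corresponding log-likelihood, written with `numpy` only, is shown below; the function name `normal_loglik` is an illustrative choice and is not part of the original analysis.

```python
import numpy as np


def normal_loglik(y, y_hat, sigma_y):
    """Log-likelihood of measurements y given true values y_hat.

    Assumes independent errors, each distributed N(0, sigma_y**2).
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    resid = y - y_hat
    n = y.size
    return (
        -0.5 * n * np.log(2 * np.pi)
        - n * np.log(sigma_y)
        - 0.5 * np.sum((resid / sigma_y) ** 2)
    )
```

The log-likelihood is largest when the proposed true values match the measurements, and it falls off quadratically (in units of $\sigma_y$) as they diverge.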

We assume that the 'true' value $\hat{y_{i}}$ is determined by a combination of factors:

1. The 'inherent' tendency of the `WT` wild-type variant to adhere to the tissue
2. Modification of (1) by the specific loss of *etpD*
3. Modification of (1) by the presence of plasmid pSE380
4. Modification of (1) by the presence of *etpD* on plasmid pSE380
5. Effects specific to the batch

We treat each of these factors as, essentially, the additive results of categorical effects.

(1) is essentially an intercept, as all observations are of either unmodified or modified WT Sakai, and can be represented as the parameter $\alpha$.

(2)-(4) can be coded as `1/0` integer values for a measurement with index $i$ ($t_i$, $u_i$, $v_i$), each with its own parameter ($\beta$, $\gamma$, $\delta$) representing combinations of influences. Splitting the factors in this way enables 'borrowing' of data from the plasmid-bearing variants for the estimate of the effect due to loss of *etpD*. Similarly, it enables the direct estimation of the effect of reintroducing *etpD* as a complement. With this interpretation, the parameters have the meanings:

* $\beta$ - the change in logCFU due to loss of *etpD* with respect to the wild type
* $\gamma$ - the change in logCFU due to incorporation of the pSE380 plasmid. Note that as there is no experiment in which the wild-type Sakai carries this plasmid, this parameter only estimates the effect in a $\Delta etpD$ background.
* $\delta$ - the change in logCFU due to incorporation of *etpD* on the pSE380 plasmid. Note again that this parameter only estimates the effect in a $\Delta etpD$ Sakai background, due to the experiment structure.
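The `1/0` coding can be sketched in `pandas` as follows. This assumes a cumulative coding, in which `empty` and `complement` also count as $\Delta etpD$ (so $t_i = 1$ for them); that reading is consistent with $\gamma$ and $\delta$ being estimated in a $\Delta etpD$ background, but it is an assumption here. The dataframe and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical long-form table: one row per measurement
data = pd.DataFrame({
    "variant": ["WT", "KO", "empty", "complement"],
    "batch": [1, 1, 2, 2],
})

# Cumulative 0/1 coding (an assumption about how t, u, v are assigned):
# t: strain lacks chromosomal etpD; u: strain carries pSE380;
# v: pSE380 carries the etpD complement
data["t"] = data["variant"].isin(["KO", "empty", "complement"]).astype(int)
data["u"] = data["variant"].isin(["empty", "complement"]).astype(int)
data["v"] = (data["variant"] == "complement").astype(int)
```

Under this coding each successive variant switches on one more indicator, so each parameter measures the change relative to the variant immediately before it.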

(5) can be represented as an array of parameters, $\phi_{j_{i}}$, where $j_i \in \{1, 2, \ldots, n\}$ and $n$ is the number of batches. Each value of $\phi_{j_{i}}$ represents the effect due to batch $j_{i}$, the batch to which measurement $i$ belongs.

We choose to model each of these parameters as an additive effect acting upon the baseline adherence $\alpha$. Parameters estimated to have negative values diminish adherence; parameters estimated to have positive values enhance adherence. Parameters whose credible intervals span zero are not shown to modify baseline adherence.
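As a worked illustration, suppose the indicators are coded cumulatively ($t_i = 1$ for `KO`, `empty` and `complement`; $u_i = 1$ for `empty` and `complement`; $v_i = 1$ for `complement` only). This coding is an assumption made for the sketch. The additive effects then give the following true values for a measurement in batch $j_i$:

$$\hat{y_i} \mid \texttt{WT} = \alpha + \phi_{j_{i}}$$
$$\hat{y_i} \mid \texttt{KO} = \alpha + \beta + \phi_{j_{i}}$$
$$\hat{y_i} \mid \texttt{empty} = \alpha + \beta + \gamma + \phi_{j_{i}}$$
$$\hat{y_i} \mid \texttt{complement} = \alpha + \beta + \gamma + \delta + \phi_{j_{i}}$$

Within a single batch the batch effect cancels, so the `KO` minus `WT` difference is $\beta$ and the `complement` minus `empty` difference is $\delta$. These are the two quantities asked about in the experiment design.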

For each parameter's prior we choose a Cauchy distribution:

$$\alpha \sim Cauchy(\mu_{\alpha}, \sigma_{\alpha}^2)$$
$$\beta \sim Cauchy(\mu_{\beta}, \sigma_{\beta}^2)$$
$$\gamma \sim Cauchy(\mu_{\gamma}, \sigma_{\gamma}^2)$$
$$\delta \sim Cauchy(\mu_{\delta}, \sigma_{\delta}^2)$$

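where $\sigma$ sets the spread of each parameter's prior and is itself drawn from a weak prior. (Strictly, a Cauchy distribution has no finite mean or variance, so $\mu$ and $\sigma$ act as its location and scale.) We choose an exponential distribution with mean of unity: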

$$\sigma_\alpha \sim \textrm{Exp}(1)$$
$$\sigma_\beta \sim \textrm{Exp}(1)$$
$$\sigma_\gamma \sim \textrm{Exp}(1)$$
$$\sigma_\delta \sim \textrm{Exp}(1)$$

We define batch effects for batch $j_i$ (the batch from which measurement $i$ is drawn) in the same way:

$$j_i \in \{1, 2, \ldots, 8\}$$
$$\phi_{j_{i}} \sim Cauchy(\mu_{\phi_{j_{i}}}, \sigma_{\phi_{j_{i}}}^2)$$
$$\sigma_{\phi_{j_{i}}} \sim \textrm{Exp}(1)$$

<div class="alert-success">
<b>We therefore construct the following model of the experiment, for a given tissue:</b>

$$y_i = \hat{y_i} + \epsilon_i$$
$$\hat{y_i} = \alpha + \beta t_i + \gamma u_i + \delta v_i + \phi_{j_{i}}$$

$$j_i \in \{1, 2, \ldots, 8\}$$

$$\alpha \sim Cauchy(\mu_{\alpha}, \sigma_{\alpha}^2)$$
$$\beta \sim Cauchy(\mu_{\beta}, \sigma_{\beta}^2)$$
$$\gamma \sim Cauchy(\mu_{\gamma}, \sigma_{\gamma}^2)$$
$$\delta \sim Cauchy(\mu_{\delta}, \sigma_{\delta}^2)$$
$$\phi_{j_{i}} \sim Cauchy(\mu_{\phi_{j_{i}}}, \sigma_{\phi_{j_{i}}}^2)$$

$$\sigma_\alpha \sim \textrm{Exp}(1)$$
$$\sigma_\beta \sim \textrm{Exp}(1)$$
$$\sigma_\gamma \sim \textrm{Exp}(1)$$
$$\sigma_\delta \sim \textrm{Exp}(1)$$
$$\sigma_{\phi_{j_{i}}} \sim \textrm{Exp}(1)$$

<ul>
<li> $y_i$: measured logCFU for a single application $i$ of bacteria to a plant
<li> $\hat{y_i}$: 'true' logCFU for a single application $i$ of bacteria to a plant
<li> $\epsilon_i$: measurement error in logCFU for a single application $i$ of bacteria to a plant
<li> $\alpha$: the expected logCFU for wild-type (`WT`) Sakai on a plant
<li> $\mu_\alpha$: mean logCFU for `WT` Sakai on the plant
<li> $\sigma_\alpha$: variance of logCFU for `WT` Sakai on the plant
<li> $\beta$: the expected modification of logCFU w.r.t. `WT` as the result of deletion of *etpD*
<li> $\mu_\beta$: mean change in logCFU for `KO` w.r.t. `WT` Sakai on the plant
<li> $\sigma_\beta$: variance for change in logCFU for `KO` w.r.t. `WT` Sakai on the plant
<li> $t_i$: 0/1 pseudovariable indicating whether the strain used for $i$ is $\Delta etpD$ (is `KO`)
<li> $\gamma$: the expected modification of logCFU w.r.t. `KO` as the result of incorporation of plasmid pSE380
<li> $\mu_\gamma$: mean change in logCFU for `empty` w.r.t. `KO` Sakai on the plant
<li> $\sigma_\gamma$: variance for change in logCFU for `empty` w.r.t. `KO` Sakai on the plant
<li> $u_i$: 0/1 pseudovariable indicating whether the strain used for $i$ is $\Delta etpD$ and carrying pSE380 (is `empty`)
<li> $\delta$: the expected modification of logCFU w.r.t. `empty` as the result of complementation with *etpD*
<li> $\mu_\delta$: mean change in logCFU for `complement` w.r.t. `empty` Sakai on the plant
<li> $\sigma_\delta$: variance for change in logCFU for `complement` w.r.t. `empty` Sakai on the plant
<li> $v_i$: 0/1 pseudovariable indicating whether the strain used for $i$ is $\Delta etpD$, carrying pSE380, and includes the complemented *etpD* (is `complement`)
<li> $j_i$: the batch from which $i$ is drawn, $j_i \in \{1, 2, \ldots, 8\}$
<li> $\phi_{j_{i}}$: the expected modification of logCFU for batch $j_i$ w.r.t. `WT`
<li> $\mu_{\phi_{j_{i}}}$: mean change in logCFU for batch $j_i$ w.r.t. `WT`
<li> $\sigma_{\phi_{j_{i}}}$: variance for change in logCFU for batch $j_i$ w.r.t. `WT`
</ul>
</div>
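To make the generative structure concrete, the following `numpy` sketch draws one simulated dataset from the priors and the measurement model. Several choices are assumptions made only for illustration: the prior locations ($\mu$ terms) are set to zero, $\sigma$ is used as the Cauchy scale, $\sigma_y$ is fixed at a placeholder value, batches are indexed from 0, and the first four batches are assigned to `WT`/`KO` with the rest assigned to `empty`/`complement`.

```python
import numpy as np

rng = np.random.default_rng(0)


def cauchy(loc, scale, size):
    """Draw from Cauchy distributions with the given location(s) and scale(s)."""
    return loc + scale * rng.standard_cauchy(size)


n_batches = 8

# Hypothetical design: batches 0-3 hold WT/KO, batches 4-7 hold empty/complement,
# with five measurements of each variant per batch (cumulative 0/1 coding assumed)
t, u, v, j = [], [], [], []
for batch in range(n_batches):
    pair = [(0, 0, 0), (1, 0, 0)] if batch < 4 else [(1, 1, 0), (1, 1, 1)]
    for ti, ui, vi in pair:
        for _ in range(5):
            t.append(ti)
            u.append(ui)
            v.append(vi)
            j.append(batch)
t, u, v, j = (np.array(x) for x in (t, u, v, j))

# Scales drawn from Exp(1); locations set to 0 as a placeholder
sigma = rng.exponential(1.0, size=4)
alpha, beta, gamma, delta = cauchy(0.0, sigma, size=4)
sigma_phi = rng.exponential(1.0, size=n_batches)
phi = cauchy(0.0, sigma_phi, size=n_batches)

# 'True' logCFU for each measurement, then add normally distributed error
y_hat = alpha + beta * t + gamma * u + delta * v + phi[j]
sigma_y = 0.5  # placeholder measurement s.d.; not a value from the source
y = y_hat + rng.normal(0.0, sigma_y, size=y_hat.size)
```

Because Cauchy priors are heavy-tailed, individual prior draws can be extreme; repeating the simulation with different seeds gives a feel for how weakly these priors constrain the parameters before any data are seen.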