## The Math Behind the Models

### Markov Chain Monte Carlo

Our preferred method for determining the positions of putative TF binding sites, and for understanding how mutations impact gene expression, is Markov Chain Monte Carlo (MCMC) inference. MCMC gets its name from two ideas, Monte Carlo and Markov chains. Monte Carlo is a method for estimating features of a distribution by randomly drawing samples from it. For example, one could estimate the mean or standard deviation of a distribution by drawing random samples and computing the mean and standard deviation of those samples. MCMC methods are often used when functions are not amenable to analytical solutions or calculations. They allow us to estimate the expectation value of a given parameter, and its uncertainty, without requiring full access to the underlying probability distribution. As in many cases in biology, the true underlying probability distribution is often complicated and difficult to access.
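As a minimal sketch of the Monte Carlo idea described above, the snippet below estimates the mean and standard deviation of a distribution purely from random draws. The target distribution (a normal with mean 2.0 and standard deviation 0.5), the sample size, and the helper name `monte_carlo_summary` are illustrative assumptions, not part of the Reg-Seq pipeline.

```python
import random
import statistics


def monte_carlo_summary(draw_sample, n_samples, seed=0):
    """Estimate the mean and standard deviation of a distribution from random draws."""
    rng = random.Random(seed)
    samples = [draw_sample(rng) for _ in range(n_samples)]
    return statistics.fmean(samples), statistics.stdev(samples)


# Hypothetical target distribution: normal with mean 2.0 and standard deviation 0.5.
mean_est, std_est = monte_carlo_summary(lambda rng: rng.gauss(2.0, 0.5), n_samples=50_000)
print(f"estimated mean = {mean_est:.3f}, estimated std = {std_est:.3f}")
```

The estimates approach the true values as the number of samples grows, which is the property MCMC relies on once it can generate samples from the distribution of interest.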

Our Reg-Seq method currently uses human intuition to determine binding sites (based on the characteristics of information footprints, discussed later in this Wiki protocol), but to validate our binding site choices, and to capture all regions of a sequence that are important for gene expression, we also need to computationally identify regions where gene expression is changed significantly up or down by mutation (p < 0.01), and discard any potential sites which do not fit this criterion. We infer the effect of mutation using MCMC and we use the distribution of parameters from the inference to form a 99 % confidence interval for the average effect of mutation across a 15 base pair region. We include binding sites that are statistically significant at the 0.01 level in any of the tested growth conditions.

One difficulty with estimating the mutual information from model predictions is that base pair identity (A, C, G, T) and the gene expression level $\mu$ are both discrete variables, while the binding energy prediction from the model, $x$, is a continuous variable. Formally, the mutual information is given by

\begin{align}
I(\mu, x) = \int_{- \infty}^{+ \infty} dx \sum_{\mu} p(x, \mu) \log_{2} \left( \frac{p(x, \mu)}{p(x) p(\mu)} \right) \tag{1}
\end{align}


where µ is a measure of gene expression,

\begin{align}
\mu = \begin{cases}
  0, \quad \text{for sequencing reads from DNA library}\\
  1, \quad \text{for sequencing reads originating from mRNA} \tag{2}
\end{cases}
\end{align}

during Reg-Seq.

The probability distribution, $p$, is not one that we have full access to, as we only have a discrete set of predictions (one from each of the $N$ unique DNA sequences in our data set). To compensate for the issues that can arise from estimating a continuous distribution from discrete data, we make use of the fact that the mutual information is unchanged under any transformation that preserves the rank order of the predictions (for instance, multiplying all model predictions by a positive constant).

We will define $z_q$ as the rank order in binding energy predictions of the $q$th sequence. We estimate $I(\mu,z)$ by first calculating binding energy predictions

\begin{align}
x = \sum_{i=1}^L \sum_{j=A}^T \theta_{ij} \cdot \delta_{ij} ,\tag{3}
\end{align}

where $\delta_{ij}$ is the Kronecker delta (equal to 1 if the base at position $i$ is $j$ and 0 otherwise), and then converting them to rank-order predictions $z$.

We then discretize the energy predictions into 1000 "bins" and convolve with a Gaussian kernel to estimate the probability distribution $p(z, \mu)$. We can then calculate the mutual information with

\begin{align}
I_{smooth}(\mu,z) = \sum_{z=1}^{1000} \sum_{\mu} F(\mu,z) \log_2 \frac{F(\mu,z)}{F(z) \cdot F(\mu)} \tag{4}
\end{align}

where $F(\mu,z)$ is the smoothed estimate of the probability distribution $p(\mu,z)$ obtained from finite data, and $F(z)$ and $F(\mu)$ are its marginals.
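The rank-ordering, binning, and smoothing steps can be sketched as follows. This is an illustrative implementation under stated assumptions, not the MPAthic code: the kernel width `sigma` (in units of bins), the truncation of the kernel at three standard deviations, the function name, and the synthetic data are all hypothetical choices. The demonstration uses 100 bins rather than 1000 to keep it fast.

```python
import math
import random


def smoothed_mutual_information(x, mu, n_bins=1000, sigma=4.0):
    """Estimate I(mu, z) in bits from predictions x and labels mu (0 = DNA, 1 = mRNA)."""
    n = len(x)
    # Rank-order the predictions, then map each rank onto one of n_bins bins.
    order = sorted(range(n), key=lambda q: x[q])
    counts = [[0.0] * n_bins for _ in range(2)]
    for rank, q in enumerate(order):
        counts[mu[q]][rank * n_bins // n] += 1.0
    # Smooth each row along z with a truncated Gaussian kernel (assumed width sigma).
    half = int(3 * sigma)
    offsets = range(-half, half + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    smoothed = [[0.0] * n_bins for _ in range(2)]
    for m in range(2):
        for z in range(n_bins):
            for k, w in zip(offsets, kernel):
                if 0 <= z + k < n_bins:
                    smoothed[m][z] += w * counts[m][z + k]
    # Normalize to a joint distribution F(mu, z) and compute its marginals.
    total = sum(map(sum, smoothed))
    F = [[v / total for v in row] for row in smoothed]
    F_mu = [sum(row) for row in F]
    F_z = [F[0][z] + F[1][z] for z in range(n_bins)]
    return sum(
        F[m][z] * math.log2(F[m][z] / (F_z[z] * F_mu[m]))
        for m in range(2)
        for z in range(n_bins)
        if F[m][z] > 0
    )


# Synthetic data: labels mu and predictions that are (or are not) correlated with them.
rng = random.Random(0)
mu = [rng.randint(0, 1) for _ in range(5000)]
x_informative = [m + rng.gauss(0.0, 1.0) for m in mu]
x_random = [rng.gauss(0.0, 1.0) for _ in mu]

mi_informative = smoothed_mutual_information(x_informative, mu, n_bins=100)
mi_random = smoothed_mutual_information(x_random, mu, n_bins=100)
# A rank-preserving transformation of the predictions leaves the estimate unchanged.
mi_rescaled = smoothed_mutual_information([3.0 * v + 1.0 for v in x_informative], mu, n_bins=100)
print(mi_informative, mi_random, mi_rescaled)
```

Because only the rank order of `x` enters the calculation, rescaling the predictions does not change the estimate, which is the property exploited in the text above.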

### Markov Chain Monte Carlo fitting procedure

As proven by [Justin Kinney (2008)](https://pdfs.semanticscholar.org/56ce/a3cb3609844a0df0554f99524dbb96479c2d.pdf) in his dissertation, "Biophysical Models of Transcriptional Regulation from Sequence Data", the likelihood of a model that predicts gene output is

\begin{align}
L(\theta|{\mu_s}) \propto 2^{NI_{smooth}(\mu,z)} \tag{5}
\end{align}

where $N$ is the total number of independent sequences and $I_{smooth}$ is the smoothed mutual information between gene expression, $\mu$, and the rank-ordered model predictions, $z$.

These probability distributions are very difficult to handle analytically. We use MCMC because it lets us estimate properties of the target probability distribution without needing an explicit expression for it. For example, we can estimate $\left\langle \theta \right \rangle$ by drawing many samples of $\theta$ using MCMC and taking the mean of the parameters.

We therefore need to construct a Markov chain whose stationary distribution is the distribution of interest, $p(\theta)$. A Markov chain is a sequence of values $\lbrace \theta_1, \theta_2, \theta_3, ..., \theta_N \rbrace$ generated over $N$ steps. We can then find $\left \langle \theta \right \rangle$ with

\begin{align}
\left \langle \theta \right \rangle = \frac{1}{N} \sum_{n=1}^{N} \theta_n \tag{6}
\end{align}

A Markov chain has no memory. That is, the probability that the $n^{th}$ value in the chain takes a value $\theta_n$ depends only on the $(n-1)^{th}$ value in the chain. To make things a bit more concrete, let's leave aside $\theta$ for the time being. Imagine that we have a light switch, and we know the switch is "on" 25% of the time and "off" 75% of the time. For each "step" in our Markov chain, we can change the state of the switch; if the switch is on, we turn it off with some rate $k_{off}$, and if the switch is currently off, we turn it on with the rate $k_{on}$. A sequence of states will be generated. One example would be [off, off, off, on, on, on, off]. These states constitute a Markov chain and, if the chain is continued for long enough, the distribution of states will converge such that $p_{on}$ = 0.25 and $p_{off}$ = 0.75.

A Markov chain is stationary if detailed balance is satisfied between its states. The condition of detailed balance obtains if the total rate of transitions from on to off is the same as the total rate of transitions from off to on. Mathematically, this condition can be written as

\begin{align}
k_{on} \times p_{off} = k_{off} \times p_{on} \tag{7}
\end{align}

This equation constrains the ratio of $k_{on}$ to $k_{off}$. It follows that

\begin{align}
\frac{k_{on}}{k_{off}} = \frac{p_{on}}{p_{off}}  = \frac{1}{3} \tag{8}
\end{align}

As long as the ratio of transition rates is satisfied and enough steps are taken, then the stationary distribution will converge to the proper distribution.
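The light switch example can be simulated directly. In this sketch the rates are interpreted as per-step switching probabilities, and the specific value $k_{off} = 0.3$ is an arbitrary assumption; only the ratio $k_{on}/k_{off} = 1/3$ comes from Eq. 8.

```python
import random


def simulate_switch(k_on, k_off, n_steps, seed=0):
    """Return the fraction of steps a two-state switch spends in the 'on' state."""
    rng = random.Random(seed)
    is_on = False
    steps_on = 0
    for _ in range(n_steps):
        if is_on:
            # Currently on: switch off with probability k_off.
            if rng.random() < k_off:
                is_on = False
        else:
            # Currently off: switch on with probability k_on.
            if rng.random() < k_on:
                is_on = True
        steps_on += is_on
    return steps_on / n_steps


k_off = 0.3          # assumed per-step probability of switching on -> off
k_on = k_off / 3.0   # ratio fixed by detailed balance (Eq. 8)
fraction_on = simulate_switch(k_on, k_off, n_steps=200_000)
print(f"fraction of steps spent on: {fraction_on:.3f}")
```

Any other pair of rates with the same ratio should settle to the same long-run fraction of time spent on, although chains with smaller rates need more steps to get there.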

For the far more complicated task of estimating $p(\theta)$, we can fall back on the standard Metropolis-Hastings sampling algorithm. $\theta$ will be a matrix where $\theta_{ij}$ is the energetic contribution to the energy matrix of the $i^{th}$ position and the $j^{th}$ base pair, where $i \in \lbrace 1,2,3,...,L \rbrace$ and $j \in \lbrace A,C,G,T \rbrace$. We can then follow the procedure:

1. Start with a random energy matrix $\theta_0$.
2. Make a random perturbation $d\theta$ to $\theta_0$. This perturbation makes a small adjustment to each element of $\theta_0$.
3. Compute the model likelihoods $L(\theta_0)$ and $L(\theta_0 + d\theta)$ using (Eqn. 5).
4. If $L(\theta_0 + d\theta) > L(\theta_0)$, accept the new parameter values $\theta_0 + d\theta$ as the next element in the Markov chain, $\theta_1$. Otherwise, accept $\theta_0 + d\theta$ with probability $\frac{L(\theta_0 + d\theta)}{L(\theta_0)}$. If the step is rejected, the next element $\theta_1$ in the Markov chain is set to the previous value $\theta_0$. These acceptance and rejection probabilities ensure that detailed balance is satisfied between the states $\theta_0$ and $\theta_0 + d\theta$.
5. Repeat steps 2-4 of this procedure until the chain converges to the stationary distribution. In practice this can be determined by monitoring when the mutual information plateaus.
6. To be certain that the distribution has in fact converged to the proper stationary distribution, multiple Markov Chains should be run starting with step 1. If all the chains converge to the same distribution, then they have properly converged.
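The procedure above can be sketched in a few lines. This is a toy illustration rather than the MPAthic implementation: the likelihood is a stand-in (independent Gaussians around hypothetical "true" parameters in `target`) instead of Eq. 5, $\theta$ is flattened to a short list, and the step size, chain length, and burn-in are arbitrary assumptions. With a symmetric proposal such as the Gaussian perturbation used here, the Metropolis-Hastings acceptance rule reduces to the likelihood ratio in step 4. The comparison is done with log-likelihoods for numerical stability; for Eq. 5 the log-likelihood would be $N I_{smooth} \ln 2$ up to a constant.

```python
import math
import random

target = [1.0, -0.5, 0.0, 2.0]  # hypothetical "true" parameters for the toy likelihood


def toy_log_likelihood(theta, width=0.2):
    """Stand-in for log L(theta); not the mutual-information likelihood of Eq. 5."""
    return -0.5 * sum(((t - c) / width) ** 2 for t, c in zip(theta, target))


def metropolis(log_likelihood, theta0, n_steps, step_size=0.1, seed=0):
    """Run one Markov chain following steps 1-5 and return every visited theta."""
    rng = random.Random(seed)
    theta = list(theta0)
    log_l = log_likelihood(theta)
    chain = []
    for _ in range(n_steps):
        # Step 2: small random perturbation of every element.
        proposal = [t + rng.gauss(0.0, step_size) for t in theta]
        # Step 3: likelihood of the proposed parameters.
        log_l_new = log_likelihood(proposal)
        # Step 4: accept if better, otherwise accept with probability L_new / L_old.
        if log_l_new >= log_l or rng.random() < math.exp(log_l_new - log_l):
            theta, log_l = proposal, log_l_new
        chain.append(list(theta))
    return chain


def chain_mean(chain, burn_in):
    """Average each parameter over the chain after discarding the burn-in steps."""
    kept = chain[burn_in:]
    return [sum(column) / len(kept) for column in zip(*kept)]


# Step 6: two chains from different random starting points should agree.
start_rng = random.Random(42)
starts = [[start_rng.uniform(-1.0, 1.0) for _ in target] for _ in range(2)]
mean_a = chain_mean(metropolis(toy_log_likelihood, starts[0], 20_000, seed=1), burn_in=5_000)
mean_b = chain_mean(metropolis(toy_log_likelihood, starts[1], 20_000, seed=2), burn_in=5_000)
print(mean_a)
print(mean_b)
```

Agreement between the two chain means is the same convergence check described in step 6, applied here to a problem where the answer is known in advance.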

The end result of these model-fitting efforts is an optimized linear binding energy matrix. You can get a measure of the uncertainty in $\theta_{ij}$ by forming a confidence interval out of the distribution of parameters formed from the Markov Chain. This inference is performed using the MPAthic software.

Now that we have discussed MCMC, let's turn to the other _potential_ method for data analysis: least squares analysis.

### Least Squares Analysis

In the Reg-Seq paper, we used exclusively MCMC analysis to interpret data emanating from sequencing experiments. However, it is still conceptually important to understand how Least Squares analysis works, as this offers an alternative method for analyzing the sequencing data.

Our least squares inference can be found in a paper by [Ireland and Kinney, _bioRxiv_](https://www.biorxiv.org/content/10.1101/054676v2).

Least squares provides a computationally simple inference procedure that overcomes the most onerous restrictions of enrichment ratio calculations. It can be used to infer any type of linear model, including both matrix models and neighbor models. It can also be used on data that consists of more than two bins. The idea behind the least squares approach is to choose parameters $\theta^{LS}$ that minimize a quadratic loss function. Specifically, we use

\begin{align}
\theta^{LS} = \underset{\theta}{\mathrm{argmin}}\, L(\theta) \tag{9}
\end{align}

where

\begin{align}
L(\theta) = \sum_M \sum_{n | M} \frac{[r (S^n, \theta) - \mu_M]^2 }{\sigma_M^2} + \alpha \sum_i \theta_i^2 \tag{10}
\end{align}

Here, $r(S^n, \theta)$ is the model prediction for sequence $S^n$, the inner sum runs over the sequences $n$ in bin $M$, $\mu_M$ is the assumed mean activity of sequences in bin $M$, $\sigma_M^2$ is the assumed variance in the activities of such sequences, $i$ indexes all parameters in the model, and $\alpha$ is the "ridge regression" regularization parameter. By using the objective function $L(\theta)$, one can rapidly compute values of the optimal parameters $\theta$ using standard algorithms.
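To make the loss in Eq. 10 concrete, the sketch below evaluates it for a tiny matrix model. Everything in the example is hypothetical: the two-position sequences, the parameter values, the choice $\alpha = 0.1$, and the function names. It assumes $r(S^n, \theta)$ is a matrix-model prediction, that is, a sum of one parameter per position. It only evaluates the loss; finding $\theta^{LS}$ would require passing this function to a standard optimizer.

```python
def predicted_activity(sequence, theta):
    """Matrix-model prediction r(S, theta): one parameter per (position, base)."""
    return sum(theta[i][base] for i, base in enumerate(sequence))


def least_squares_loss(binned_sequences, theta, mu, sigma2, alpha):
    """Evaluate the ridge-regularized quadratic loss of Eq. 10."""
    data_term = sum(
        (predicted_activity(seq, theta) - mu[m]) ** 2 / sigma2[m]
        for m, sequences in binned_sequences.items()
        for seq in sequences
    )
    ridge_term = alpha * sum(value ** 2 for row in theta for value in row.values())
    return data_term + ridge_term


# Hypothetical data: two bins of two-base sequences, with mu_M = M and sigma_M^2 = 1.
binned_sequences = {0: ["AA", "AT"], 1: ["CG", "CA"]}
theta = [
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.5, "T": 0.0},
]
mu = {0: 0.0, 1: 1.0}
sigma2 = {0: 1.0, 1: 1.0}

loss = least_squares_loss(binned_sequences, theta, mu, sigma2, alpha=0.1)
print(loss)
```

Here only the sequence "CG" is mispredicted (1.5 instead of 1), so the data term is 0.25 and the ridge term adds $0.1 \times (1^2 + 0.5^2) = 0.125$.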

One downside to least squares inference is the need to assume specific values for $\mu_M$ and for $\sigma_M^2$ for each bin $M$. MPAthic allows the user to manually specify these values. There is a danger here, since assuming incorrect values for $\mu_M$ and $\sigma_M^2$ will generally lead to bias in the inferred parameters $\theta^{LS}$. In practice, however, the default choice of $\mu_M = M$ and $\sigma_M^2 = 1$ often works surprisingly well when bins are arranged from lowest to highest average activity. Another downside to least squares is the need to assume that experimental noise, specifically $p(R|M)$, is Gaussian. Only in such cases does least squares inference correspond to a meaningful maximum likelihood calculation. In massively parallel assays, however, noise is often strongly non-Gaussian. In such situations, least squares inference cannot be expected to yield correct model parameters for any choice of $\mu_M$ and $\sigma_M^2$.

Now that we have discussed both MCMC and least squares analysis, we turn our attention to the _visualization_ of our analyzed data. Having coaxed the sequencing data into a usable format and performed inference on it, we get to the fun part: plotting the data! In the [original Reg-Seq](https://www.biorxiv.org/content/10.1101/2020.01.18.910323v3) paper, we visualize sequencing data in three ways: information footprints, sequence logos, and energy matrices. We discuss the _mathematics_ behind each of these visual formats next, and then conclude this Wiki protocol with in-depth details on how we plot these three visual tools.

## The Math Behind the Plots

### Mathematics of Information Footprints

We use information footprints as a tool for hypothesis generation to identify regions which may contain transcription factor binding sites. In general, a mutation within a transcription factor site is likely to weaken that site. We look for groups of positions where mutation away from wild type has a large effect on gene expression. Our datasets consist of nucleotide sequences, the number of times we sequenced a given, specific mutated promoter in the plasmid library, and the number of times we sequenced its corresponding mRNA. A simplified dataset on a hypothetical 4 nucleotide sequence is shown in the table below.


| Sequence    | DNA Counts  | mRNA Counts |
| ----------- | ----------- | ----------- |
| ACTA        | 5           | 23          |
| ATTA        | 5           | 3           |
| CCTG        | 11          | 11          |
| TAGA        | 12          | 3           |
| GTGC        | 2           | 0           |
| CACA        | 8           | 7           |
| AGGC        | 7           | 3           |



One strategy to measure the impact of a given mutation on expression is to take all sequences which have base $b$ at position $i$ and determine the number of mRNAs produced per read in the sequencing library. By comparing the values for different bases we can determine how large of an effect a mutation has on gene expression. For example in the table above, for the second position ($i = 2$) those sequences that contain the wild type base A ($b = A$) have 20 sequencing counts out of 50 from the DNA library and 10 sequencing counts from the 50 mRNA reads. For all other sequences ($b = C,G,$ or $T$), there are 30 sequencing counts from the DNA library and 40 sequencing counts from mRNA. A measure of the effect of mutation on expression would be to compare the ratios $\frac{\mbox{mRNA counts}}{\mbox{library counts}}$  between mutated and wild type sequences. For the data in the table above, sequences with a wild type base at position 2 will have a ratio of $0.5$ and sequences with a mutated base at position 2 will have a ratio of approximately $1.3$.
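The ratio comparison for position 2 can be reproduced from the table with a short script. The dictionary layout and the helper name `expression_ratio` are illustrative choices; the counts are copied from the table above.

```python
# sequence: (DNA counts, mRNA counts), copied from the table above
counts = {
    "ACTA": (5, 23),
    "ATTA": (5, 3),
    "CCTG": (11, 11),
    "TAGA": (12, 3),
    "GTGC": (2, 0),
    "CACA": (8, 7),
    "AGGC": (7, 3),
}


def expression_ratio(counts, position, base, match=True):
    """mRNA counts per library count for sequences that do (match=True) or do not
    (match=False) carry `base` at the 1-indexed `position`."""
    selected = [
        (dna, rna)
        for seq, (dna, rna) in counts.items()
        if (seq[position - 1] == base) == match
    ]
    total_dna = sum(dna for dna, _ in selected)
    total_rna = sum(rna for _, rna in selected)
    return total_rna / total_dna


wild_type_ratio = expression_ratio(counts, position=2, base="A")
mutant_ratio = expression_ratio(counts, position=2, base="A", match=False)
print(wild_type_ratio, mutant_ratio)
```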

    
While directly comparing ratios is one way to measure the effect on gene expression, this paper uses mutual information to quantify the effect of mutation, as [Kinney et al. (2010)](https://www.pnas.org/content/107/20/9158) demonstrated could be done successfully. In the table above, the frequency of the nucleotide A in the library at position 2 is 40$\%$, as 20 out of 50 sequencing counts from the DNA library have an A at position 2. 16 out of 50 sequencing counts have a C in the second position, implying that the frequency of C in the second position is 32$\%$. Similarly, the library is 14$\%$ G, as 7 out of 50 sequencing counts have a G at the second position, and 14$\%$ T, as the final 7 sequencing counts have a T in the second position. Cytosine is enriched in the mRNA transcripts over the original library, as the sequences with C at position 2 have 34 sequencing reads out of 50 from mRNA, implying that C composes 68$\%$ of all mRNA sequencing reads. A, G, and T only have 10, 3, and 3 mRNA reads out of 50, respectively. These numbers imply they compose only 20$\%$, 6$\%$, and 6$\%$, respectively, of all mRNA sequencing reads. Large enrichment of some bases over others in the mRNA counts occurs when base identity is important for gene expression. We are interested in quantifying the degree to which mutation away from a wild type sequence affects expression. Although there are 4 possible nucleotides, we can classify each base as either wild type or mutated so that $m$ in equation \ref{equ:MI2} represents only these two possibilities. Mutual information is given at position $i$ by

\begin{align}
    I_i =  \sum_{m=0}^1  \sum_{\mu=0}^1 p(m,\mu)\log_2\left(\frac{p(m,\mu)}{p_{mut}(m)p_{expr}(\mu)}\right),
    \tag{11}\label{equ:MI2}
\end{align}

where $p_{expr}(\mu)$ is the ratio of the number of DNA ($\mu=0$) or mRNA ($\mu=1$) sequencing counts to the total number of counts,

\begin{align}
    p_{expr}(\mu) =
    \begin{cases}
      \frac{\sum \mbox{mRNA counts}}{\mbox{total counts}}, & \text{if}\ \mu = 1 \\
      \frac{\sum \mbox{library sequencing counts}}{\mbox{total counts}}, & \text{if}\ \mu = 0.
    \end{cases}
    \tag{12}
\end{align}

From the example data in the table above we can calculate $p_{expr}$. To do so we sum up DNA counts ($5+5+11+12+2+8+7=50$) and mRNA counts ($23+3+11+3+0+7+3=50$) from all sequences and divide by the total number of counts ($50+50=100$) to obtain

\begin{equation}
    p_{expr}(\mu) =
    \begin{cases}
      0.5, & \text{if}\ \mu = 1 \\
      0.5, & \text{if}\ \mu = 0.
    \end{cases}
    \tag{13}
  \end{equation}
  
In addition, $p_{mut}(m)$ is the fraction of the total counts that have either a mutation ($m=1$) or a wild type base ($m=0$) at the given position. $p_{mut}$ has to be computed for each position individually. For position 1, the wild type base is an A, and we see that there are a total of 100 sequencing counts, of which 46 counts (DNA and mRNA combined) contain an A at position 1. Therefore $p_{mut}(m)$ can be calculated for position 1 as

\begin{equation}
    p_{mut}(m) =
    \begin{cases}
      0.46, & \text{if}\ m = 0 \\
      0.54, & \text{if}\ m = 1.
    \end{cases}
    \tag{14}
  \end{equation}
  
Lastly, the joint distribution $p(m,\mu)$ is the probability that a given sequencing read in the dataset will have expression level $\mu$ and mutation status $m$. $p(m, \mu)$ is calculated by dividing the number of sequencing reads at the chosen position with mutation status $m$ and expression status $\mu$ by the total number of sequencing reads. In the case of the example dataset and for $m = 0$ and $\mu = 0$, we sum the sequencing reads that are wild type at position 1 and also are in the DNA library. As there are 17 sequencing reads that fit these criteria out of 100 total reads, $p(m = 0,\mu = 0) = 0.17$. The other values of $p(m,\mu)$ can be calculated to be

\begin{equation}
p(m, \mu) =
\begin{cases}
  0.17, & \text{if}\ m = 0 \ \text{(wild type base)}\ \text{and}\ \mu = 0 \ \text{(DNA)}\\
  0.21, & \text{if}\ m = 1 \ \text{(mutated base)}\ \text{and}\ \mu = 1 \ \text{(RNA)}\\
  0.33, & \text{if}\ m = 1\ \text{and}\ \mu = 0\\
  0.29, & \text{if}\ m = 0\ \text{and}\ \mu = 1.
\end{cases}
\tag{15}
\end{equation}

The marginal distributions $p_{expr}$ and $p_{mut}$ can be obtained by summing over one of the two variables, i.e.,

\begin{align}
    p_{expr}(\mu) = \sum_m p(m,\mu),\\[1em]
    p_{mut}(m) = \sum_\mu p(m,\mu).
    \tag{16}
\end{align}

Plugging the values calculated above into equation (\ref{equ:MI2}) yields a mutual information value of approximately 0.04 bits at position 1. The unit is bits because the mutual information is computed with a logarithm of base 2. A different logarithm base can be chosen, but that results in different units for the mutual information.
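The calculation in equation (\ref{equ:MI2}) can be carried out directly from the example table. The sketch below is illustrative: the function name `footprint_information` and the data layout are assumptions, and the wild type base at position 1 is taken to be A as stated above.

```python
import math

# sequence: (DNA counts, mRNA counts), copied from the example table
counts = {
    "ACTA": (5, 23),
    "ATTA": (5, 3),
    "CCTG": (11, 11),
    "TAGA": (12, 3),
    "GTGC": (2, 0),
    "CACA": (8, 7),
    "AGGC": (7, 3),
}


def footprint_information(counts, position, wild_type_base):
    """Mutual information (bits) between mutation status m and read type mu."""
    # Joint counts indexed by (m, mu): m = 0 wild type / 1 mutated, mu = 0 DNA / 1 mRNA.
    joint = {(m, mu): 0 for m in (0, 1) for mu in (0, 1)}
    for seq, (dna, rna) in counts.items():
        m = 0 if seq[position - 1] == wild_type_base else 1
        joint[(m, 0)] += dna
        joint[(m, 1)] += rna
    total = sum(joint.values())
    p = {key: value / total for key, value in joint.items()}
    # Marginals obtained by summing the joint distribution over the other variable.
    p_mut = {m: p[(m, 0)] + p[(m, 1)] for m in (0, 1)}
    p_expr = {mu: p[(0, mu)] + p[(1, mu)] for mu in (0, 1)}
    return sum(
        value * math.log2(value / (p_mut[m] * p_expr[mu]))
        for (m, mu), value in p.items()
        if value > 0
    )


info_position_1 = footprint_information(counts, position=1, wild_type_base="A")
print(f"I_1 = {info_position_1:.3f} bits")
```

Repeating the call for every position, with the appropriate wild type base, gives one information value per position, which is the raw material of an information footprint.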

Mutual information is a measurement that quantifies how much the measurement of one of two variables reduces uncertainty about the other variable. For example, very low mutual information means that by knowing one variable one gains almost no information about the other variable, while high mutual information means that knowing one variable substantially increases our knowledge of the other.
At a position where base identity matters little for expression level, there would be little difference in the frequency distributions for the library and mRNA transcripts. The entropy of the distribution would decrease only by a small amount when considering the two types of sequencing reads separately.
    
We seek to determine the effect on gene expression of mutating a given base. However, if mutation rates at each position are not fully independent such that $p(m_i,m_{i'}) \neq p(m_i)p(m_{i'})$, then the information value calculated in equation (\ref{equ:MI2}) will also encode the effect of mutation at correlated positions. For instance, if position $i$ is part of an activator binding site, mutating it will have a large effect on gene expression. If position $i'$ is not within the activator site, then mutating position $i'$ will have minimal true effect on gene expression. However, if mutations at the two bases are correlated, mutating position $i'$ will make it more likely for $i$, and therefore the activator binding site, to be mutated. Knowledge that $i'$ is mutated is predictive of overall expression, and so position $i'$ will have high mutual information according to equation (\ref{equ:MI2}), even though that position has no regulatory function. In our experiment we designed sequences to be synthesized such that each position had a probability of mutation that was independent of mutation at any other position. However, due to errors in the oligonucleotide synthesis process, additional mutations in the ordered sequences were introduced. Sequencing our DNA libraries reveals that mutation at a given base pair can make mutation at another base pair more likely by up to $10\%$, where neighboring base pairs are the most likely to have correlations between mutations. This is enough to cloud the signature of most transcription factors in an information footprint calculated using equation (\ref{equ:MI2}).

We need to determine values for $p_i(m|\mu)$ when mutations are independent, and to do this we need to fit these quantities from our data. We assert that 

\begin{equation}
    \left\langle C_\mathrm{mRNA} \right\rangle \propto e^{-\beta E_{eff}}
    \tag{17}
\end{equation}

is a reasonable approximation to make, which we will justify by considering a number of possible regulatory scenarios. $\left\langle C_\mathrm{mRNA} \right\rangle$ is the average number of mRNAs produced and $E_{eff}$ is an effective binding energy for the sequence that can be determined by summing contributions from each position in the sequence independently. While we will show that under reasonable assumptions this approach is useful for any of these regulatory architectures, let us first consider the simple case where there is only an RNAP site in the region under study. We can write down an expression for average gene expression per cell as

\begin{equation}
\left\langle C_\mathrm{mRNA} \right\rangle \propto p_{bound} \propto \frac{\frac{P}{N_{NS}}e^{-\beta E_P}}{1 + \frac{P}{N_{NS}}e^{- \beta E_P}},
\tag{18}\label{equ:18}
\end{equation}

where $p_{bound}$ is the probability that the RNAP is bound to DNA and is known to be proportional to gene expression in *E. coli* ([Ackers and Johnson, 1982](https://pubmed.ncbi.nlm.nih.gov/6461856/), [Buchler et al., 2003b](https://www.pnas.org/content/100/9/5136), [Garcia and Phillips, 2011](https://www.pnas.org/content/108/29/12173)), $E_P$ is the energy of RNAP binding, $N_{NS}$ is the number of nonspecific DNA binding sites, and $P$ is the number of RNAP. If RNAP binds weakly then $\frac{P}{N_{NS}}e^{-\beta E_P} \ll 1$, and we can simplify equation (\ref{equ:18}) to 

\begin{equation}
    \left\langle C_\mathrm{mRNA} \right\rangle \propto e^{- \beta E_P}.
    \tag{19}
\end{equation}

Using this relation, we can compute the ratio of average mRNA counts in wild type $\left\langle C_\mathrm{mRNA}^{\mathrm{WT}_i} \right\rangle$ to average mRNA counts in a mutant $\left\langle C_\mathrm{mRNA}^{\mathrm{Mut}_i} \right\rangle$ as 

\begin{align}
\frac{\left\langle C_\mathrm{mRNA}^{\mathrm{WT}_i} \right\rangle}{\left\langle C_\mathrm{mRNA}^{\mathrm{Mut}_i} \right\rangle} =& \frac{e^{- \beta E_{P_{\mathrm{WT}_i}}}}{e^{- \beta E_{P_{\mathrm{Mut}_i}}}}, \\[1em]
\frac{\left\langle C_\mathrm{mRNA}^{\mathrm{WT}_i} \right\rangle}{\left\langle C_\mathrm{mRNA}^{\mathrm{Mut}_i} \right\rangle} =& e^{- \beta \left(E_{P_{\mathrm{WT}_i}} - E_{P_{\mathrm{Mut}_i}}\right)},
\tag{20}
\end{align}

where $E_{P_{\mathrm{WT}_i}}$ is the binding energy of RNAP to the wild type binding site and $E_{P_{\mathrm{Mut}_i}}$ is the binding energy of RNAP to the mutant binding site. Using the assumption that each position contributes independently to the binding energy, we can simplify the differences in energies to $E_{P_{\mathrm{WT}_i}} - E_{P_{\mathrm{Mut}_i}} = \Delta E_{P_i}$. We can now calculate the probability of finding a specific base in the expressed sequences. If the probability of finding a wild type base at position $i$ in the DNA library is $p_i(m=0|\mu=0)$, then 

\begin{align}
 p_i(m = 0|\mu = 1) = 
&  \frac{p_i(m=0|\mu=0) \frac{\left\langle C_\mathrm{mRNA}^{\mathrm{WT}_i} \right\rangle}{\left\langle C_\mathrm{mRNA}^\mathrm{Mut_i} \right\rangle}}{p_i(m=1|\mu=0)  + p_i(m=0|\mu=0) \frac{\left\langle C_\mathrm{mRNA}^{\mathrm{WT}_i} \right\rangle}{\left\langle C_\mathrm{mRNA}^{\mathrm{Mut}_i} \right\rangle}}, \\[1em]
p_i(m=0|\mu=1) =  
& \frac{p_i(m=0|\mu=0) e^{- \beta \Delta E_{P_i}}}{p_i(m=1|\mu=0)  + p_i(m=0|\mu=0) e^{- \beta \Delta E_{P_i}}}.
\tag{21}
\end{align}
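Equation 21 is simple enough to evaluate numerically. In the sketch below the function name and the example numbers (a 90% wild type library and energy differences of $\pm 1\ k_BT$) are hypothetical; the energy difference is expressed in units of $k_BT$, so the argument is $\beta \Delta E_{P_i}$.

```python
import math


def expressed_wild_type_fraction(p_wt_library, beta_delta_energy):
    """Eq. 21: p_i(m=0 | mu=1) from p_i(m=0 | mu=0) and beta * Delta E_{P_i}."""
    boltzmann_weight = math.exp(-beta_delta_energy)
    p_mut_library = 1.0 - p_wt_library
    return p_wt_library * boltzmann_weight / (p_mut_library + p_wt_library * boltzmann_weight)


# No energy difference: the mRNA pool mirrors the library.
unchanged = expressed_wild_type_fraction(0.9, 0.0)
# Wild type binds RNAP more strongly (Delta E < 0): wild type is enriched in mRNA.
enriched = expressed_wild_type_fraction(0.9, -1.0)
# Wild type binds RNAP more weakly (Delta E > 0): wild type is depleted in mRNA.
depleted = expressed_wild_type_fraction(0.9, 1.0)
print(unchanged, enriched, depleted)
```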

Under certain conditions, we can also infer a value for $p_i(m|\mu=1)$ using a linear model when there are any number of activator or repressor binding sites. We will demonstrate this in the case of a single activator and a single repressor, although a similar analysis can be done when there are greater numbers of transcription factors. Define $p = \frac{P}{N_{NS}}e^{- \beta E_P}$ and $a = \frac{A}{N_{NS}}e^{-\beta E_A}$ where $A$ is the number of activators, and ${E_A}$ is the binding energy of the activator. Also define $r = \frac{R}{N_{NS}}e^{-\beta E_R}$ where $R$ is the number of repressors and ${E_R}$ is the binding energy of the repressor. Then we can compute the average number of produced mRNA as
    
\begin{equation}
\left\langle C_\mathrm{mRNA} \right\rangle \propto p_{bound} \propto \frac{p + pae^{-\beta \epsilon_{AP}}}{1+a+p+r+pae^{-\beta \epsilon_{AP}}}.
\tag{22}
\end{equation}

One assumption we will make is that activators and RNAP bind weakly to their binding sites ($a \ll 1$ and $p \ll 1$) but interact strongly ($pae^{-\beta\epsilon_{AP}} \gg p$). Under this assumption, RNAP and associated activators are much more likely to bind DNA as a unit than separately. The binding energy measurements by [Forcier et al.](https://elifesciences.org/articles/40618) support this assumption in the case of CRP in the *lac* operon. The DNA-protein binding energy of CRP is measured to be -3.18 $k_BT$ and the interaction energy between CRP and RNAP ($\epsilon_{AP}$) is measured to be -6.56 $k_BT$. The copy number of CRP is $\approx$ 4000 ([Schmidt et al.](https://www.nature.com/articles/nbt.3418)), the copy number of RNAP is $\approx$ 2000 in slowly growing cells ([Bremer and Dennis, 1996](https://www.researchgate.net/publication/237130769_Modulation_of_Chemical_Composition_and_Other_Parameters_of_the_Cell_by_Growth_Rate)), and the RNAP binding energy for the wild type *lac* promoter is $\approx -5.2$ $k_BT$ ([Brewster, R. C., Jones, D. L., & Phillips, R. (2012)](https://pubmed.ncbi.nlm.nih.gov/23271961/)). As $N_{NS} \approx 4.6 \times 10^6$, the value of $a$ can be calculated to be $\frac{4000}{4.6\times 10^{6}}e^{3.18} \approx 0.02$. Similarly, $p$ can be calculated to be $\frac{2000}{4.6\times 10^{6}}e^{5.2} \approx 0.08$. Lastly, $pae^{-\beta\epsilon_{AP}}$ can be calculated to be $pae^{6.56} \approx 1$. These numbers satisfy the assumptions $a \ll 1$, $p \ll 1$, and $pae^{-\beta\epsilon_{AP}} \gg p$. We can simplify equation (22) to

\begin{equation}
\left\langle C_\mathrm{mRNA} \right\rangle \propto p_{bound} \propto \frac{pae^{-\beta \epsilon_{AP}}}{1+r+pae^{-\beta \epsilon_{AP}}}.
\tag{23}
\end{equation}

The last assumption we make is that repressors bind very strongly ($r \gg 1$ and $r \gg pae^{-\beta\epsilon_{AP}}$). To justify this assumption we can once again look to the *lac* operon. Wild type LacI copy number is $\approx 10$ and the wild type binding energy for the O1 operator is $\approx -16\ k_BT$ ([Garcia and Phillips, 2011](https://www.pnas.org/content/108/29/12173.full)). $r$ can then be calculated to be $\frac{10}{4.6\times 10^{6}}e^{16} \approx 20$. We can simplify equation (23) to

\begin{align}
\left\langle C_\mathrm{mRNA} \right\rangle \propto\, & \frac{pae^{-\beta \epsilon_{AP}}}{r} \\
\left\langle C_\mathrm{mRNA} \right\rangle \propto\, & e^{-\beta(E_P + E_A - E_R)} \label{24}
\end{align}
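The order-of-magnitude estimates quoted above for the *lac* operon can be checked with a few lines of Python. The copy numbers and energies are the values cited in the text; the variable names are illustrative only.

```python
import math

N_NS = 4.6e6  # number of nonspecific binding sites

# Energies are in units of k_B T, so each Boltzmann weight is exp(-E).
a = 4000 / N_NS * math.exp(3.18)        # activator (CRP) term, about 0.02
p = 2000 / N_NS * math.exp(5.2)         # RNAP term, about 0.08
pa_interaction = p * a * math.exp(6.56)  # RNAP and activator bound together, about 1
r = 10 / N_NS * math.exp(16)            # repressor (LacI at O1) term, about 20

print(f"a = {a:.3f}, p = {p:.3f}, p*a*exp(-beta*eps_AP) = {pa_interaction:.2f}, r = {r:.1f}")
```

These values are consistent with the orderings assumed in the derivation: $a$ and $p$ are much smaller than 1, the cooperative term is more than ten times larger than $p$, and $r$ is more than ten times larger than the cooperative term.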

As we typically assume that RNAP binding energy, activator binding energy, and repressor binding energy can all be represented as sums of contributions from their constituent bases, the combination of the energies can be written as a total effective energy $E_{eff}$, which is a sum of independent contributions from all positions within the binding sites.

We fit the parameters for each base using the Markov Chain Monte Carlo (MCMC) method. Two MCMC runs are conducted using randomly generated initial conditions. We require both chains to reach the same distribution to confirm convergence; otherwise, we repeat the runs. During the analysis we artificially treat mutation rates at all positions as equal, as we do not wish for mutation rate to play a role in mutual information calculations. The information values are smoothed by averaging with neighboring values.

### Mathematics of Sequence Logos

Sequence logos provide a simple way to visualize the sequence specificity of a transcription factor to DNA, as well as the amount of information present at each position. Here we describe how we generate them using either known genomic binding sites or the energy matrices from our Reg-Seq data. In each case we need to calculate a $4 \times L$ position weight matrix for a binding site of length $L$, which is used to estimate the position-dependent information content that will then be used to construct a sequence logo.

#### Generating Position Weight Matrices from Known Genomic Binding Sites.

To construct a position weight matrix using genomic binding sites, we must first align all the available binding site sequences and determine the nucleotide statistics at each position. Specifically, we count the number of each nucleotide, $N_{ij}$, at each position along the binding site. Here the subscript $i$ refers to the position, while $j$ refers to the nucleotide, A, C, G, or T. We can then calculate a position probability matrix (also $4 \times L$), where each entry is found by dividing these counts by the total number of sequences in our alignment, $N_g$,

\begin{align}
p_{ij} = \frac{N_{ij}}{N_g} \tag{24}
\end{align}

Note that in situations where the number of aligned sequences is small (e.g., less than five), we typically add 1 pseudocount sequence to regularize the probabilities of the counts in the calculation of position probabilities,

\begin{align}
p_{ij} = \frac{N_{ij} + B_p}{N_g + 4 \cdot B_p} \tag{25}
\end{align}

where $B_p$ is the value of the pseudocount. The argument for their use is that when selecting from a small number of binding site sequences, there is a chance that infrequent nucleotides will be absent, and assigning them a probability $p_{ij}$ of zero may be too stringent a penalty. We let $B_p$ = 0.1. In the limit of zero binding site sequences (i.e., with no sequences observed), this will result in probabilities $p_{ij}$ approximately equal to the background probability used in calculating the position weight matrix below (and a non-informative sequence logo).

Finally, the values of the position weight matrix are found by calculating the log probabilities relative to a background model,

\begin{align}
\mathrm{PWM}_{ij} = \log_2 \frac{p_{ij}}{b_j} \tag{26}
\end{align}

The background model reflects assumptions about the genomic background of the system under investigation. For instance, in many cases it may be reasonable to assume each base is equally likely to occur. Given that we know the base frequencies for _E. coli_, we choose a background model that reflects these frequencies $b_j$: A = 0.246, C = 0.254, G = 0.254, and T = 0.246 for strain MG1655; [BioNumbers ID 100528](http://bionumbers.hms.harvard.edu). The value at the $(i, j)$th position will be zero if the probability, $p_{ij}$, matches that of the background model, but non-zero otherwise. This reflects the fact that base frequencies matching the background model tell us nothing about the binding preferences of the transcription factor, while deviation from this background frequency indicates sequence specificity.
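The three steps above (counting, pseudocount-regularized probabilities, and log ratios against the background) can be sketched as follows. The aligned sites in `sites` are made-up sequences for illustration, and the function name is hypothetical; the pseudocount and background frequencies are the values given in the text.

```python
import math

# Background base frequencies for E. coli MG1655, as given in the text.
BACKGROUND = {"A": 0.246, "C": 0.254, "G": 0.254, "T": 0.246}


def position_weight_matrix(sites, pseudocount=0.1, background=BACKGROUND):
    """Return a list (one dict per position) of log2(p_ij / b_j) values."""
    n_sites = len(sites)
    length = len(sites[0])
    pwm = []
    for i in range(length):
        column = [site[i] for site in sites]
        row = {}
        for base, b_j in background.items():
            # Eq. 25: pseudocount-regularized position probability.
            p_ij = (column.count(base) + pseudocount) / (n_sites + 4 * pseudocount)
            # Eq. 26: log probability relative to the background model.
            row[base] = math.log2(p_ij / b_j)
        pwm.append(row)
    return pwm


# Hypothetical aligned binding sites (not real genomic sequences).
sites = ["TGTGA", "TGTGA", "TGAGA", "TTTGA"]
pwm = position_weight_matrix(sites)
for i, row in enumerate(pwm, start=1):
    print(i, {base: round(value, 2) for base, value in row.items()})
```

Positions where every site carries the same base get a large positive entry for that base and negative entries for the others, while the pseudocount keeps unobserved bases from producing $\log_2 0$.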

#### Generating Position Weight Matrices from Reg-Seq Data

Next we construct a position weight matrix using our Reg-Seq data. Here, we appeal to the result from [Berg and von Hippel](https://www.ncbi.nlm.nih.gov/pubmed/3612791), that the logarithms of the base frequencies above should be proportional to their binding energy contributions. Berg and von Hippel considered a statistical mechanical system containing $L$ independent binding site positions, with the choice of nucleotide at each position corresponding to a change in the energy level by $\epsilon_{ij}$ relative to the lowest energy state at that position.

$\epsilon_{ij}$ corresponds to the energy entry from our energy matrix, scaled to absolute units, $\epsilon_{ij} = A \cdot \theta_{ij} + B$, where $\theta_{ij}$ is the $(i, j)$th entry. An important assumption is that all nucleotide sequences that provide an equivalent binding energy will have equal probability of being present as a binding site. In this way, we can relate the binding energies considered here to the statistical distribution of binding sites in the previous section. The probability $p_{ij}$ of choosing a nucleotide at position $i$ will then be proportional to the probability that position $i$ has energy $\epsilon_{ij}$. Specifically, the probabilities will be given by their Boltzmann factors normalized by the sum of states for all nucleotides,

\begin{align}
p_{ij} = \frac{b_j \cdot e^{-\beta A \cdot \theta_{ij} \cdot s_{ij}}}{\sum_{j=A}^{T} b_j \cdot e^{-\beta A \cdot \theta_{ij} \cdot s_{ij}}} \tag{27}
\end{align}

where $\beta = 1 / k_B T$, $k_B$ is Boltzmann’s constant, and $T$ is the absolute temperature. As above, $b_j$ refers to the background probabilities of each nucleotide. Note that the constant $B$ drops out of this equation since it is shared across each term. One difficulty that arises when we use energy matrices that are not in absolute energy units is that we are left with an unknown scale factor $A$, preventing calculation of $p_{ij}$. We appeal to the expectation that mismatches usually involve an energy cost of 1-3 $k_B T$. In other work within our group, we have found this to be a reasonable assumption for _LacI_. Therefore, we choose $A$ such that the average cost of a mutation is 2 $k_B T$.
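The conversion from an arbitrary-unit energy matrix to position probabilities can be sketched as follows. This is one possible reading of the procedure and not the Reg-Seq implementation. The function names are hypothetical, the "average cost of a mutation" is assumed to mean the average energy of the three non-optimal nucleotides relative to the lowest-energy nucleotide at each position, and the $s_{ij}$ factor in equation (27) is treated as 1 for the nucleotide being scored.

```python
import numpy as np

# Background frequencies b_j for A, C, G, T (E. coli MG1655, as quoted in the text).
BACKGROUND = np.array([0.246, 0.254, 0.254, 0.246])


def scale_to_kbt(theta, mean_cost_kbt=2.0):
    """Rescale a 4 x L arbitrary-unit matrix so the mean mismatch costs mean_cost_kbt."""
    theta = np.asarray(theta, dtype=float)
    # Shift each position so its lowest-energy nucleotide sits at zero.
    shifted = theta - theta.min(axis=0)
    # Assumption: average over the three non-optimal nucleotides per position.
    mean_cost = shifted.sum() / (3 * theta.shape[1])
    if mean_cost == 0:
        raise ValueError("energy matrix has no sequence dependence")
    return shifted * (mean_cost_kbt / mean_cost)


def energies_to_probabilities(energy_kbt, background=BACKGROUND):
    """Boltzmann weights b_j * exp(-energy), normalized over nucleotides."""
    weights = background[:, None] * np.exp(-np.asarray(energy_kbt, dtype=float))
    return weights / weights.sum(axis=0)


# Hypothetical 4 x 2 energy matrix in arbitrary units.
theta = np.array([[0.0, 0.3],
                  [0.5, 0.0],
                  [0.2, 0.1],
                  [0.3, 0.2]])
scaled = scale_to_kbt(theta)
probabilities = energies_to_probabilities(scaled)
```

Because the per-position shift is the same for all four nucleotides, it cancels in the normalization, which mirrors the way the constant $B$ drops out of equation (27).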

#### Additional Details on Sequence Logos

With our position weight matrices in hand, we now construct sequence logos by calculating the average information content at each position along the binding site. With our four letter alphabet there is a maximum amount of information of 2 bits ($\log_2 4 = 2$ bits) at each position $i$. The information content will be zero at a position when the nucleotide frequencies match the genomic background, and will have a maximum of 2 bits only if a specific nucleotide is completely conserved. The total information content at position $i$ is determined through calculation of the Shannon entropy, and is given by

\begin{align}
I_i = \sum_{j=A}^{T} p_{ij} \cdot \log_2 \frac{p_{ij}}{b_j} = \sum_{j=A}^{T} p_{ij} \cdot \mathrm{PWM}_{ij} \tag{28}
\end{align}

Here, $\mathrm{PWM}_{ij}$ refers to the $(i, j)$th entry in the position weight matrix. The total information content contained in the position weight matrix is then the sum of information content across the length of the binding site.

To construct a sequence logo, the height of each letter at each position $i$ is determined by $\mathrm{SeqLogo}_{ij} = p_{ij} \cdot I_i$, which is in units of bits. This causes each nucleotide in the sequence logo to be displayed as the proportion of the nucleotide expected at that position scaled by the amount of information contained at that position. To construct and plot sequence logos, we use custom Python code written by Justin Kinney; this code is discussed in a subsequent section of this protocol.
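The information content of equation (28) and the resulting letter heights can be sketched as below. This is a stand-alone illustration and is not the plotting code referred to above; the function names and the example matrix are hypothetical, and rows are again ordered A, C, G, T.

```python
import numpy as np

# Background frequencies b_j for A, C, G, T (E. coli MG1655, as quoted in the text).
BACKGROUND = np.array([0.246, 0.254, 0.254, 0.246])


def information_content(ppm, background=BACKGROUND):
    """Equation (28): information I_i in bits at each position."""
    ppm = np.asarray(ppm, dtype=float)
    # Terms with p_ij = 0 contribute nothing (0 * log 0 is taken as 0).
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(ppm > 0, ppm * np.log2(ppm / background[:, None]), 0.0)
    return terms.sum(axis=0)


def logo_heights(ppm, background=BACKGROUND):
    """Letter heights p_ij * I_i in bits, as a 4 x L array."""
    ppm = np.asarray(ppm, dtype=float)
    return ppm * information_content(ppm, background)


# Hypothetical 4 x 2 matrix: column 0 matches the background,
# column 1 has a completely conserved A.
ppm = np.array([[0.246, 1.0],
                [0.254, 0.0],
                [0.254, 0.0],
                [0.246, 0.0]])
info = information_content(ppm)
heights = logo_heights(ppm)
```

Since each column of `ppm` sums to one, the letter heights at a position add up to that position's information content, so a background-like column produces an empty stack in the logo.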

### Mathematics of Energy Matrices

Energy matrices can be inferred using Bayesian parameter estimation with an error-model-averaged likelihood, as previously described by [Kinney & Atwal](https://www.pnas.org/content/111/9/3354).

Focusing on an individual putative transcription factor binding site, as revealed in an information footprint (which we discuss in the next section), we are next interested in developing a more fine-grained, quantitative understanding of how the underlying protein-DNA interaction is determined based on gene expression data. An energy matrix displays this information using a heat map format, where each column is a position in the putative binding site and each row displays the effect on binding that results from mutating to that given nucleotide (given as a change in the DNA-TF interaction energy upon mutation). These energy matrices are scaled such that the wild type sequence is colored in white, mutations that improve binding are shown in blue, and mutations that weaken binding are shown in red. These energy matrices encode a full quantitative picture for how we expect sequence to relate to binding for a given TF, such that we can provide a prediction for the binding energy of every possible binding site sequence as

\begin{align}
\sum_{i=1}^{N} \epsilon_i \tag{29}
\end{align}

where the energy matrix is predicated on an assumption of a linear binding model in which each base within the binding site region contributes a specific value ($\epsilon_i$ for the $i^{th}$ base in the sequence) to the total binding energy.
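Under this linear model, predicting the energy of a sequence amounts to picking one matrix entry per position and summing them. A minimal sketch follows; the function names and the $4 \times 3$ example matrix are hypothetical, with rows ordered A, C, G, T and the wild type sequence (here `ACG`) set to zero at every position.

```python
import numpy as np

NUCLEOTIDES = "ACGT"  # row order of the energy matrix (an assumption of this sketch)


def one_hot(sequence):
    """Return a 4 x L indicator matrix: 1 where the sequence carries that base."""
    encoded = np.zeros((4, len(sequence)))
    for position, base in enumerate(sequence.upper()):
        encoded[NUCLEOTIDES.index(base), position] = 1.0
    return encoded


def binding_energy(sequence, energy_matrix):
    """Sum the per-position energy contributions for one sequence."""
    energy_matrix = np.asarray(energy_matrix, dtype=float)
    if energy_matrix.shape != (4, len(sequence)):
        raise ValueError("energy matrix must be 4 x len(sequence)")
    return float((energy_matrix * one_hot(sequence)).sum())


# Hypothetical energy matrix for a 3 bp site, relative to wild type "ACG".
energy_matrix = np.array([[0.0, 1.2, 0.4],
                          [0.8, 0.0, 0.9],
                          [1.5, 0.7, 0.0],
                          [0.3, 2.0, 1.1]])
wild_type_energy = binding_energy("ACG", energy_matrix)
single_mutant_energy = binding_energy("TCG", energy_matrix)
```

The single mutant differs from the wild type only by the entry for T at the first position, which is the additivity that the linear model assumes.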

Energy matrices are either given in A.U. (arbitrary units), or if the gene has a simple repression or activation architecture with a single RNA polymerase (RNAP) site, are assigned $k_B T$ energy units following the procedure developed by [Kinney _et al_.](https://www.pnas.org/content/107/20/9158.long) and validated on the _lac_ operon by [Barnes _et al._ in _PLoS Comp. Biol._](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006226).

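The energy of a given sequence is predicted by

\begin{align}
E = \sum_{i=1}^{L} \sum_{j=A}^{T} \theta_{ij} \delta_{ij} \tag{30}\label{equ:energy}
\end{align}

which is a linear model for creating energy matrices, where $\delta_{ij}$ is 1 if the sequence has nucleotide $j$ at position $i$ and 0 otherwise. However, what we want to find is the set of model parameters, $\theta_{ij}$, that maximizes the probability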

\begin{align}
p(\{ \mu_s \} | \theta) \tag{31}\label{equ:prob}
\end{align}

where $\{ \mu_s \}$ is the set of Reg-Seq data for each sequence $s$.

The Reg-Seq measurements are independent, so

\begin{align}
p(\{ \mu_s \} | \theta) = \prod_{s=1}^{N} p(\mu_s | \theta) \tag{32}
\end{align}

Additionally, the model parameters $\theta$ only affect the probability in equation (\ref{equ:prob}) through the model predictions, which in this case are the energy predictions $E_s$ given by equation (\ref{equ:energy}).

\begin{align}
p(\{ \mu_s \} | \theta ) = p(\{ \mu_s \} | E_s ) \tag{33}
\end{align}

While we could calculate this probability by assuming an error model, [Kinney _et al._ (2008)](https://pdfs.semanticscholar.org/56ce/a3cb3609844a0df0554f99524dbb96479c2d.pdf) showed that averaging over error models leads to an expression for likelihood in terms of mutual information,

\begin{align}
p(\{ \mu_s \} | \theta ) \propto 2^{N I (\mu, E)} \tag{34}
\end{align}
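To make equation (34) concrete, the sketch below estimates $I(\mu, E)$ in bits from discrete expression labels and continuous energy predictions, then returns the base-2 log-likelihood $N \cdot I$ up to an additive constant. Binning the energies into equal-population rank bins is an assumption of this sketch, chosen for simplicity; it is not necessarily the estimator used in the Reg-Seq analysis. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def mutual_information_bits(expression_bins, energies, n_energy_bins=10):
    """Plug-in estimate of I(mu, E) in bits using rank-binned energies."""
    expression_bins = np.asarray(expression_bins)
    energies = np.asarray(energies, dtype=float)

    # Rank the energies and cut the ranks into equal-population bins.
    # Tied energies are split arbitrarily, a simplification of this sketch.
    ranks = np.argsort(np.argsort(energies))
    energy_bins = (ranks * n_energy_bins) // len(energies)

    # Joint histogram of (expression label, energy bin), normalized to sum to 1.
    _, mu_index = np.unique(expression_bins, return_inverse=True)
    joint = np.zeros((mu_index.max() + 1, n_energy_bins))
    np.add.at(joint, (mu_index, energy_bins), 1)
    joint /= joint.sum()

    # I = sum p(mu, E) * log2[ p(mu, E) / (p(mu) p(E)) ] over non-empty cells.
    p_mu = joint.sum(axis=1, keepdims=True)
    p_energy = joint.sum(axis=0, keepdims=True)
    independent = p_mu @ p_energy
    nonzero = joint > 0
    return float((joint[nonzero] * np.log2(joint[nonzero] / independent[nonzero])).sum())


def log2_likelihood(expression_bins, energies, n_energy_bins=10):
    """Equation (34) in log base 2, up to an additive constant: N * I(mu, E)."""
    return len(energies) * mutual_information_bits(expression_bins, energies, n_energy_bins)


# Synthetic data for illustration only: one label tracks the energy, one does not.
rng = np.random.default_rng(0)
energies = rng.normal(size=2000)
informative_labels = (energies > 0).astype(int)
shuffled_labels = rng.permutation(informative_labels)

informative_mi = mutual_information_bits(informative_labels, energies)
shuffled_mi = mutual_information_bits(shuffled_labels, energies)
```

A parameter set whose energy predictions track expression yields a much larger mutual information, and hence a much larger likelihood, than one whose predictions are unrelated to expression. This is the quantity an MCMC sampler would compare between proposed parameter sets.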