# Phylogeny

We denote the ancestral sequences by $a$, the observable leaf sequences by $s$, the internal branch lengths by $\tau$, and the lengths of the branches leading to the leaves by $t$.

Under the molecular clock hypothesis (mutations accumulate at a constant rate), we have $t_1 = t_2$ and $t_1 + \tau = t_2 + \tau = t_3$.

![Phylogeny](phenology_tree.png)

However, the molecular clock is often not a good assumption, and therefore we cannot constrain the branch lengths like this.

Instead, we want to calculate the likelihood of the sequence data under this model (assuming some tree structure $T$):

$$
P(\vec s^1, \vec s^2, \vec s^3, \vec a^1, \vec a^2 | t_1, t_2, t_3, \tau, T)
=
P(\vec a^1)P(\vec a^2 | \vec a^1, \tau) P(\vec s^3 | \vec a^1, t_3) P(\vec s^1 | \vec a^2, t_1) P(\vec s^2 | \vec a^2, t_2)
$$

The probability factorizes because the sequences mutate independently of each other along the different branches. The probability of the tree $T$ is then just the root probability times the probabilities of the left and right subtrees.
Assuming that the substitutions at different positions $i$ are independent, we then have

$$
P(\vec s^1, \vec s^2, \vec s^3, \vec a^1, \vec a^2 | t_1, t_2, t_3, \tau, T)
=
\prod_i P(\vec a^1_i)P(\vec a^2_i | \vec a^1_i, \tau) P(\vec s^3_i | \vec a^1_i, t_3) P(\vec s^1_i | \vec a^2_i, t_1) P(\vec s^2_i | \vec a^2_i, t_2)
$$

Due to this independence we can look at a single position, so from here on we drop the index $i$. We write $\alpha$ and $\beta$ for the bases at the internal nodes $a^1$ and $a^2$. Because these ancestral states are unknown to us, we sum over all possible bases $\alpha$, $\beta$ to get the total probability of observing the leaf data on the given tree. Here $q_\alpha$ is the probability of having the base $\alpha$ at the root.

$$
P(s^1, s^2, s^3 | t_1, t_2, t_3, \tau, T)
=
\sum_{\alpha, \beta} q_{\alpha} P(\beta | \alpha, \tau) P(s^3 | \alpha, t_3) P(s^1 | \beta, t_1) P(s^2 | \beta, t_2)
$$

Reversibility means $q_\alpha P(\beta | \alpha, \tau) = q_\beta P(\alpha | \beta, \tau)$, i.e. in equilibrium a change from $\alpha$ to $\beta$ is as likely as a change in the other direction.
Additivity means $\sum_\alpha P(\alpha | \beta, \tau) P(s^3 | \alpha, t_3) = P(s^3 | \beta, t_3 + \tau)$, i.e. we can look directly at the probability of the base change from $\beta$ to $s^3$ in time $t_3 + \tau$ instead of summing over the intermediate node on this path.

These properties allow us to rewrite the probability as

$$
\begin{align*}
    P(s^1, s^2, s^3 | t_1, t_2, t_3, \tau, T)
    &=
    \sum_{\alpha, \beta} q_{\alpha} P(\beta | \alpha, \tau) P(s^3 | \alpha, t_3) P(s^1 | \beta, t_1) P(s^2 | \beta, t_2) \\
    &=
     \sum_{\beta} \sum_{\alpha} \underbrace{q_{\alpha} P(\beta | \alpha, \tau)}_{\text{Reversibility}} P(s^3 | \alpha, t_3) P(s^1 | \beta, t_1) P(s^2 | \beta, t_2) \\
     &=
     \sum_{\beta} q_{\beta} \underbrace{\sum_{\alpha} P(\alpha | \beta, \tau) P(s^3 | \alpha, t_3)}_{\text{Additivity}} P(s^1 | \beta, t_1) P(s^2 | \beta, t_2) \\
     &=
     \sum_{\beta} q_\beta P(s^3 | \beta, t_3 + \tau) P(s^1 | \beta, t_1) P(s^2 | \beta, t_2) \\
\end{align*}
$$

Redefining $t_3 + \tau \rightarrow t_3$, we get

$$
P(s^1, s^2, s^3 | t_1, t_2, t_3, T)
=
\sum_{\beta} q_\beta P(s^3 | \beta, t_3) P(s^1 | \beta, t_1) P(s^2 | \beta, t_2)
$$

For 3 sequences we therefore get a star topology for our tree: the position of the root no longer matters.

![](tree_topology.png)
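The reduction to a star topology can be checked numerically. The following is a minimal sketch, assuming the Jukes-Cantor substitution model with uniform base frequencies $q_\alpha = \frac{1}{4}$; the mutation rate, the branch lengths and the helper names (`jc_prob`, `rooted_likelihood`, `star_likelihood`) are made up for illustration.

```python
import math
from itertools import product

BASES = "ACGT"
MU = 1.0  # assumed mutation rate (illustrative value)

def jc_prob(a, b, t, mu=MU):
    """Jukes-Cantor probability of observing base a after time t, starting from b."""
    decay = math.exp(-4.0 * mu * t / 3.0)
    return 0.25 + ((1.0 if a == b else 0.0) - 0.25) * decay

def rooted_likelihood(s1, s2, s3, t1, t2, t3, tau):
    """Sum over both internal bases: alpha at the root, beta at the inner node."""
    q = 0.25  # uniform base frequencies (assumption of this sketch)
    return sum(
        q * jc_prob(beta, alpha, tau) * jc_prob(s3, alpha, t3)
        * jc_prob(s1, beta, t1) * jc_prob(s2, beta, t2)
        for alpha, beta in product(BASES, repeat=2)
    )

def star_likelihood(s1, s2, s3, t1, t2, t3):
    """Sum over the single remaining internal base beta of the star tree."""
    q = 0.25
    return sum(
        q * jc_prob(s3, beta, t3) * jc_prob(s1, beta, t1) * jc_prob(s2, beta, t2)
        for beta in BASES
    )

# Illustrative leaf bases and branch lengths.
lhs = rooted_likelihood("A", "A", "G", 0.1, 0.2, 0.3, 0.15)
rhs = star_likelihood("A", "A", "G", 0.1, 0.2, 0.3 + 0.15)
print(lhs, rhs)  # the two values agree up to rounding
```

Both functions return the same value as long as the branch to $s^3$ in the star tree has length $t_3 + \tau$.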

## Likelihood of an evolutionary hypothesis

![](T4_topology.png)

The set of hypotheses for this data set consists of
1. The tree topology $T$
2. The length of the branches $t_1, t_2, t_3, t_4, \tau$
3. The letters at the internal nodes $\alpha, \beta$

To calculate this, we would have to integrate over the 5 unknown branch lengths, sum over the 16 possible letter combinations at the internal nodes and sum over the possible topologies. This is not feasible in practice.

Assuming a given topology, we would like to calculate

$$
P(s^1, s^2, s^3, s^4 | T) = \sum_{\alpha, \beta} \int dt_1 dt_2 dt_3 dt_4 d\tau P(s^1, s^2, s^3, s^4, \alpha, \beta, t_1, t_2, t_3, t_4, \tau | T)
$$

In practice we sum over the letters at the internal nodes (total probability, as these are unknown to us) and, rather than integrating over the branch lengths, find their maximum likelihood values.

## Felsenstein's recursion relations





![](T4_topology_prob.png)

We denote by $P(D_j | \alpha)$ the probability of all the data below node $j$ in the tree, assuming that the letter at node $j$ is $\alpha$, i.e. it is the probability of the subtree rooted at node $j$.

For example, $P(D_2 | a^2)$ is given by

$$
P(D_2 | a^2) = \left[ \sum_{a^3} P(a^3 | a^2, \tau_3) P(D_3 | a^3) \right] P(s^3 | a^2, t_3)
$$

The sum over $a^3$ appears because node 3 is an internal node whose base is unknown, so we sum over all possible bases.
In general, if $C(j)$ is the set of children of node $j$ and the letter at node $j$ is $\alpha$, we have

$$
P(D_j | \alpha) = \prod_{i \in C(j)} \left[ \sum_{a^i} P(a^i | \alpha, t_i)P(D_i | a^i) \right]
$$

This is the product over all the subtrees hanging off node $j$. If the child $i$ is a leaf in the tree, then $P(D_i | a^i) = \delta_{a^i s^i}$ and its contribution simplifies to

$$
\sum_{a^i} P(a^i | \alpha, t_i)P(D_i | a^i) = P(s^i| \alpha, t_i)
$$

## Felsenstein's algorithm

*Initialization*
- Number the nodes in the tree from top to bottom, so that the root is node 1
- Set the current node $k$ to $2n - 1$, the last node in the bottom layer ($n$ is the number of leaves in the tree)

*Recursion*
- If $k$ is a leaf, set $P(D_k | \alpha) = \delta_{\alpha s^k}$
- If $k$ is not a leaf, set for each $\alpha$ $$ P(D_k | \alpha) = \prod_{j \in C(k)} \left[ \sum_{a^j} P(a^j | \alpha, t_j)P(D_j | a^j) \right] $$
- Reduce $k$ by 1; if $k = 0$, go to termination, otherwise repeat the recursion step

*Termination*
- Set final probability $P(D) = \sum_{\alpha} q_\alpha P(D_1 | \alpha)$

This algorithm is an efficient way to calculate the likelihood of the sequence data, assuming a certain tree topology and given branch lengths. The algorithm needs the branch lengths as input, and we obtain them through optimization.
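The recursion can be sketched in a few lines of Python. This is a minimal illustration for a single alignment column, not a reference implementation: it assumes the Jukes-Cantor model with $q_\alpha = \frac{1}{4}$, uses a recursive traversal instead of the explicit countdown over node numbers, and the tree encoding (`children`, `leaf_base`) and the example tree are hypothetical.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t, mu=1.0):
    """Jukes-Cantor probability of base a after time t, starting from base b."""
    decay = math.exp(-4.0 * mu * t / 3.0)
    return 0.25 + ((1.0 if a == b else 0.0) - 0.25) * decay

def conditional_likelihoods(node, children, leaf_base):
    """Return {alpha: P(D_node | alpha)} for one alignment column."""
    if node in leaf_base:
        # Leaf: delta function on the observed base.
        return {a: 1.0 if a == leaf_base[node] else 0.0 for a in BASES}
    result = {a: 1.0 for a in BASES}
    for child, t in children[node]:
        below = conditional_likelihoods(child, children, leaf_base)
        for a in BASES:
            # Sum over the unknown base x at the child node.
            result[a] *= sum(jc_prob(x, a, t) * below[x] for x in BASES)
    return result

def site_likelihood(root, children, leaf_base, q=0.25):
    """Termination step: P(D) = sum_alpha q_alpha P(D_root | alpha)."""
    cond = conditional_likelihoods(root, children, leaf_base)
    return sum(q * cond[a] for a in BASES)

# Hypothetical tree: node 1 is the root, node 2 is internal, nodes 3, 4, 5 are leaves.
# children maps a node to a list of (child, branch length) pairs.
children = {1: [(2, 0.15), (5, 0.3)], 2: [(3, 0.1), (4, 0.2)]}
leaf_base = {3: "A", 4: "A", 5: "G"}
print(site_likelihood(1, children, leaf_base))
```

For a full alignment, the per-column likelihoods would be multiplied (or their logarithms summed) over all positions.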

### Optimizing branch lengths

We would like to optimize the branch lengths of our tree. Because of reversibility, the likelihood is independent of the position of the root, which allows us to re-root the tree at one end of the branch whose length $t$ we want to optimize. The data then splits into a left part $D_l$ and a right part $D_r$ on either side of this branch.
The probability of the tree is then

$$
P(D) = \sum_{\alpha, \beta} P(D_l | \alpha) P(\alpha | \beta, t) q_\beta P(D_r | \beta)
$$

![](branch_opt_tree.png)

The full likelihood for the entire sequence is then just given as the product over all $L$ positions

$$
P(D) = \prod_{i=1}^L \left[ \sum_{\alpha_i, \beta_i} P(D_{l, i} | \alpha_i) P(\alpha_i | \beta_i, t) q_{\beta_i} P(D_{r, i} | \beta_i) \right]
$$

From the Jukes-Cantor model we have $P(\alpha | \beta, t) = \frac{1}{4} + (\delta_{\alpha \beta} - \frac{1}{4})e^{- \frac{4 \mu t}{3}}$. The probability that the letter remains unchanged is therefore $c = \frac{1}{4} (1 + 3 e^{- \frac{4 \mu t}{3}})$, and the probability of a change to one specific other letter is $\frac{1 - c}{3}$. Plugging this into our likelihood gives

$$
\begin{align*}
    P(D)
    &=
    \prod_{i=1}^L \left[ \sum_{\alpha_i, \beta_i} P(D_{l, i} | \alpha_i) P(\alpha_i | \beta_i, t) q_{\beta_i} P(D_{r, i} | \beta_i) \right] \\
    &=
    \prod_{i=1}^L \left[ \sum_{\alpha_i} \sum_{\beta_i} P(D_{l, i} | \alpha_i) \left( \frac{1}{4} + (\delta_{\alpha_i \beta_i} - \frac{1}{4})e^{- \frac{4 \mu t}{3}} \right) q_{\beta_i} P(D_{r, i} | \beta_i) \right] \\
    &=
    \prod_{i=1}^L \left[ \sum_{\alpha_i = \beta_i} P(D_{l, i} | \alpha_i) \left( \frac{1}{4} + \frac{3}{4}e^{- \frac{4 \mu t}{3}} \right) q_{\beta_i} P(D_{r, i} | \beta_i) + \sum_{\alpha_i \neq \beta_i} P(D_{l, i} | \alpha_i) \left( \frac{1}{4} - \frac{1}{4}e^{- \frac{4 \mu t}{3}} \right) q_{\beta_i} P(D_{r, i} | \beta_i) \right] \\
    &=
    \prod_{i=1}^L \left[ \underbrace{\sum_{\alpha_i} P(D_{l, i} | \alpha_i) c q_{\alpha_i} P(D_{r, i} | \alpha_i)}_{\text{Prob. that base stays the same}} + \underbrace{\sum_{\alpha_i \neq \beta_i} P(D_{l, i} | \alpha_i) \left( \frac{1 -c}{3} \right) q_{\beta_i} P(D_{r, i} | \beta_i)}_{\text{Prob. that base changes}} \right] \\
\end{align*}
$$

Setting $A_i = \sum_{\alpha_i} P(D_{l, i} | \alpha_i) P(D_{r, i} | \alpha_{i}) q_{\alpha_i}$ and $B_i = \frac{1}{3} \sum_{\alpha_i \neq \beta_i} P(D_{l, i} | \alpha_i) P(D_{r, i} | \beta_i) q_{\beta_i}$, we get

$$
P(D) = \prod_{i=1}^L \left[ A_i c + B_i (1 - c) \right]
$$
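The decomposition into $A_i$ and $B_i$ can be verified for a single position. The sketch below assumes uniform $q_\alpha = \frac{1}{4}$ and $\mu = 1$; the conditional likelihoods `left` and `right` are made-up numbers standing in for $P(D_{l,i} | \alpha_i)$ and $P(D_{r,i} | \beta_i)$, and the function names are illustrative.

```python
import math

BASES = "ACGT"

def split_terms(left, right, q=0.25):
    """Return (A, B) for one site from left[a] = P(D_l | a) and right[b] = P(D_r | b)."""
    A = sum(left[a] * right[a] * q for a in BASES)
    B = sum(left[a] * right[b] * q for a in BASES for b in BASES if a != b) / 3.0
    return A, B

def direct_likelihood(left, right, t, mu=1.0, q=0.25):
    """Evaluate the double sum over alpha and beta with the Jukes-Cantor transition probability."""
    decay = math.exp(-4.0 * mu * t / 3.0)
    total = 0.0
    for a in BASES:
        for b in BASES:
            p = 0.25 + ((1.0 if a == b else 0.0) - 0.25) * decay
            total += left[a] * p * q * right[b]
    return total

# Made-up conditional likelihoods for the two sides of the branch.
left = {"A": 0.7, "C": 0.1, "G": 0.15, "T": 0.05}
right = {"A": 0.2, "C": 0.05, "G": 0.6, "T": 0.15}
t = 0.3
c = 0.25 * (1.0 + 3.0 * math.exp(-4.0 * t / 3.0))  # probability of no change, mu = 1

A, B = split_terms(left, right)
print(A * c + B * (1.0 - c), direct_likelihood(left, right, t))
```

The two printed numbers agree up to rounding, which confirms that the site likelihood is linear in $c$.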

The optimum occurs at $\frac{\partial P(D)}{\partial c} = 0$ or $\frac{\partial \log(P(D))}{\partial c} = 0$

$$
\frac{\partial \log(P(D))}{\partial c} = \sum_{i=1}^L \frac{A_i - B_i}{A_i c + B_i (1 - c)} = 0
$$

To solve this equation for $c$, we use the expectation maximization (EM) algorithm.

## Expectation maximization

- E-Step (Expectation): A function for the expectation of the log-likelihood is constructed from the data and the current parameters.
- M-Step (Maximization): The parameters that maximize this function are identified. They are used later to determine the distribution of the latent variables in the next E-step.

To derive the update of the parameter $c$, we multiply the optimality condition by $c$ and rearrange:

$$
\begin{align*}
    \sum_{i = 1}^L \frac{A_i - B_i}{A_i c + B_i(1 - c)}
    &=
    0 \\
    &=
    \sum_{i = 1}^L \frac{A_i c - B_i c}{A_i c + B_i(1 - c)} \\
    &=
    \sum_{i = 1}^L \frac{A_i c - B_i c + B_i - B_i}{A_i c + B_i(1 - c)} \\
    &=
    \sum_{i = 1}^L \frac{A_i c + B_i(1 - c) - B_i}{A_i c + B_i(1 - c)} \\
    &=
    \sum_{i = 1}^L \left( 1 - \frac{B_i}{A_i c + B_i(1 - c)} \right) \\
    \Leftrightarrow
    L &= \sum_{i = 1}^L \frac{B_i}{A_i c + B_i(1 - c)} \\
\end{align*}
$$

At the same time we have that

$$
\begin{align*}
    \sum_{i = 1}^L \frac{A_i - B_i}{A_i c + B_i(1 - c)}
    &=
    0 \\
    &\Rightarrow
    \sum_{i = 1}^L \frac{A_i}{A_i c + B_i(1 - c)} = \sum_{i = 1}^L \frac{B_i}{A_i c + B_i(1 - c)} = L
\end{align*}
$$

and, multiplying the first sum by $c$,

$$
\sum_{i = 1}^L \frac{A_i c}{A_i c + B_i(1 - c)} = Lc
$$

which gives the update rule

$$
c^{new} = \frac{1}{L}\sum_{i = 1}^L \frac{A_i c^{old}}{A_i c^{old} + B_i(1 - c^{old})}
$$

1. Start with an initial set of branch lengths ($\mu t$ with corresponding $c$)
2. Calculate $P(D_n^i | \alpha_i)$ for all nodes $n$ and positions $i$
3. For each branch, calculate $A_i$ and $B_i$ depending on the current branch lengths
4. Update the $c$'s, trying to maximize $P(D) = \prod_{i = 1}^L \left[ A_i c + B_i (1 - c) \right]$
5. Determine the total amount $D = \sum \left| \frac{c' - c}{c' + c} \right|$ by which the branch lengths have changed
6. If $D$ is below a cutoff, stop. Else repeat from step 2
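The update rule for a single branch can be sketched as follows. This is an illustration under simplifying assumptions: the $A_i$ and $B_i$ are made-up numbers that are held fixed while $c$ is iterated, whereas the full procedure above recomputes them for every branch in each round. The function names and the tolerance are hypothetical.

```python
import math

def em_update(c, A, B):
    """One step: c_new = (1/L) * sum_i A_i c / (A_i c + B_i (1 - c))."""
    return sum(a * c / (a * c + b * (1.0 - c)) for a, b in zip(A, B)) / len(A)

def log_likelihood(c, A, B):
    """log P(D) = sum_i log(A_i c + B_i (1 - c))."""
    return sum(math.log(a * c + b * (1.0 - c)) for a, b in zip(A, B))

def score(c, A, B):
    """Derivative of log P(D) with respect to c."""
    return sum((a - b) / (a * c + b * (1.0 - c)) for a, b in zip(A, B))

def optimise_c(A, B, c=0.5, tol=1e-12, max_iter=10_000):
    """Iterate the update until the relative change |c' - c| / (c' + c) drops below tol."""
    for _ in range(max_iter):
        c_new = em_update(c, A, B)
        if abs(c_new - c) / (c_new + c) < tol:
            return c_new
        c = c_new
    return c

# Made-up per-site values for a single branch.
A = [0.05, 0.002, 0.04, 0.03, 0.001]
B = [0.004, 0.01, 0.003, 0.005, 0.008]

c_opt = optimise_c(A, B)
# Convert back to a branch length; only valid for c > 1/4 under Jukes-Cantor.
mu_t = -0.75 * math.log((4.0 * c_opt - 1.0) / 3.0)
print(c_opt, mu_t, score(c_opt, A, B))
```

At convergence the derivative of the log-likelihood is close to zero, so the fixed point of the update rule is a stationary point of $P(D)$.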

For our model we still need the topology. We need to find a way to search among tree topologies, find a good starting topology, and generate and evaluate perturbations (changing the positions of branches, swapping entire subtrees).

We will now look at the procedure for constructing the tree for which the total evolutionary change along all branches is minimal.

## Estimating branch lengths from pairwise distances

From the Jukes-Cantor model we know the estimate of the evolutionary distance for a pair of leaves $(i, j)$ whose sequences of length $L$ differ at $d_{(i, j)}$ positions:

$$
t_{(i, j)}  = - \frac{3}{4 \mu } \log \left( 1 - \frac{4d_{(i, j)}}{3L} \right)
$$
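As a minimal sketch, the distance estimate can be computed directly from two aligned sequences. The function names and the example sequences are made up, $\mu = 1$ is an assumed default, and returning infinity for saturated pairs (where the logarithm is undefined) is a choice of this sketch.

```python
import math

def hamming(s1, s2):
    """Number of positions at which two aligned sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(s1, s2))

def jc_distance(s1, s2, mu=1.0):
    """Jukes-Cantor estimate t = -3 / (4 mu) * log(1 - 4 d / (3 L))."""
    L = len(s1)
    d = hamming(s1, s2)
    frac = 4.0 * d / (3.0 * L)
    if frac >= 1.0:
        return math.inf  # saturated: the estimate is undefined
    return -3.0 / (4.0 * mu) * math.log(1.0 - frac)

# Made-up sequences differing at 2 of 10 positions.
seq_a = "ACGTACGTAC"
seq_b = "ACGTTCGTAA"
print(hamming(seq_a, seq_b), jc_distance(seq_a, seq_b))
```

The estimate is always at least as large as the raw fraction of differences $d/L$, because it corrects for multiple substitutions at the same position.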

In general, the evolutionary time computed from the pairwise differences should obey

$$
t_{(i, j)} = \sum_b t_b
$$

Where $t_b$ is the length of branch $b$ on the path between $i$ and $j$.

We define the branch matrix $B$ with

$$
B_{(i, j)b} =
\begin{cases}
    1 & \text{if branch $b$ lies on the path connecting $(i, j)$} \\
    0 & \text{otherwise}
\end{cases}
$$

A good guess for the branch lengths is obtained by minimizing

$$
\triangle^2 = \sum_{i \neq j} \left( t_{(i, j)} - \sum_b B_{(i, j)b} t_b \right)^2
$$

This is the squared error in predicting the observed pairwise distances between the sequences.

To minimize this, we have to solve for all $k$

$$
\frac{\partial \triangle^2}{\partial t_k} = - 2 \sum_{i \neq j} \left( t_{(i, j)} - \sum_{b} B_{(i, j)b} t_b \right) B_{(i, j)k} = 0
$$

These equations give us the optimal (least-squares) values for the branch lengths $t_k^*$, given the topology and assuming that branch lengths are additive.

Filling these values into $\triangle^2$ gives a measure of consistency between the branch lengths and the evolutionary distances inferred from the differences between sequences.
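As a small worked case (an illustration that is not part of the derivation above), consider three leaves on a star tree with branches $t_1, t_2, t_3$. Each pair of leaves is connected by exactly two branches, so the conditions $\partial \triangle^2 / \partial t_k = 0$ become

$$
\begin{align*}
    (t_{(1,2)} - t_1 - t_2) + (t_{(1,3)} - t_1 - t_3) &= 0 \\
    (t_{(1,2)} - t_1 - t_2) + (t_{(2,3)} - t_2 - t_3) &= 0 \\
    (t_{(1,3)} - t_1 - t_3) + (t_{(2,3)} - t_2 - t_3) &= 0
\end{align*}
$$

which are solved by

$$
t_1^* = \frac{1}{2}(t_{(1,2)} + t_{(1,3)} - t_{(2,3)}) \qquad t_2^* = \frac{1}{2}(t_{(1,2)} + t_{(2,3)} - t_{(1,3)}) \qquad t_3^* = \frac{1}{2}(t_{(1,3)} + t_{(2,3)} - t_{(1,2)})
$$

Here every bracket vanishes separately, so the fit is exact and $\triangle^2 = 0$. With more than three leaves there are more pairwise distances than branches, so the measured distances can generally not all be matched and the minimum of $\triangle^2$ is positive.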

## Neighbour Joining

We set out to find a reasonable initial topology, one that minimizes

$$
\triangle^2 = \sum_{i \neq j} \left( t_{(i, j)} - \sum_b B_{(i, j)b} t_b \right)^2
$$

*Initialization*
- Construct a "star" topology tree with all nodes
- Compute pairwise distances between sequences
- Infer a set of branch lengths
- Calculate for each pair of nodes a new measure $T_{ij}$

*Iteration*
- Pick the pair $(i, j)$ for which $T_{ij}$ is minimal
- Create a new node $k$ as most recent common ancestor of $i$ and $j$ and use it in place of $i$ and $j$
- Calculate distances from $k$ to other nodes $m$
- Recalculate all $T_{ij}$'s

For each leaf node $i$ we define $r_i = \frac{1}{n-2} \sum_{j \neq i} t_{(i,j)}$, which has the form of an average distance from $i$ to the other nodes in the tree.

Then we define

$$
T_{ij} = t_{(i,j)} - r_i - r_j
$$

The pair $(i,j)$ for which $T_{ij}$ is smallest are neighbours in the tree.
We prove this by contradiction: if $i$ and $j$ are not neighbours, then there must be another node $k$ such that $T_{ik} < T_{ij}$.

Assume that $(i, j)$ are not neighbours, but that $i$ has a neighbour $k$ and that $j$ is separated from the common ancestor of $i$ and $k$ by an internal branch of length $\tau$.

$$
T_{ij} - T_{ik} = t_{(i, j)} - r_i - r_j - t_{(i,k)} + r_i + r_k = t_{(i, j)} - t_{(i, k)} + r_k - r_j
$$

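With $\langle t_X \rangle$ denoting the average distance from the internal node at which $j$ attaches to the remaining $n - 3$ leaves, we have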

$$
r_j = \frac{1}{n - 2}(t_i + \tau + t_j + t_k + \tau + t_j + (n - 3)(t_j + \langle t_X \rangle))
$$

$$
r_k = \frac{1}{n - 2}(t_i + t_k + t_j + \tau + t_k + (n - 3)(t_k + \tau + \langle t_X \rangle))
$$

Which then gives

$$
r_k - r_j = t_k - t_j + \frac{n - 4}{n - 2}\tau
$$

which then in the end gives us

$$
T_{ij} - T_{ik} = t_{(i, j)} - t_{(i, k)} + t_k - t_j + \frac{n - 4}{n - 2} \tau
$$

Now with the branch lengths

$$
t_{(i, j)} = t_i + \tau + t_j \qquad t_{(i, k)} = t_i + t_k
$$

this becomes

$$
T_{ij} - T_{ik} = t_i + \tau + t_j - t_i - t_k + t_k - t_j + \frac{n - 4}{n - 2} \tau = \left( 1 + \frac{n-4}{n-2}\right) \tau > 0
$$

Since $T_{ij} - T_{ik} > 0$ for $n > 3$, $T_{ij}$ is not the smallest, which is the contradiction we were looking for.

## Neighbour Joining algorithm

*Initialization*
- Construct a "star" topology tree with all nodes
- Compute pairwise distances between sequences
- Calculate for each pair of nodes a new measure $T_{ij} = t_{(i, j)} - r_i - r_j$

*Iteration*
- Pick the pair $(i, j)$ for which $T_{ij}$ is minimal
- Define a new node $k$ with distances to the other nodes $m$
    - $t_{(k, m)} = \frac{1}{2} (t_{(i, m)} + t_{(j,m)} - t_{(i, j)})$
    - This follows from adding $t_{(i,m)} = t_{(i, k)} + t_{(k,m)}$ and $t_{(j,m)} = t_{(j, k)} + t_{(k,m)}$ and using $t_{(i,j)} = t_{(i,k)} + t_{(j,k)}$
- Set the distances to node $k$: $t_i = \frac{1}{2}(t_{(i,j)} + r_i - r_j)$ and $t_j = \frac{1}{2}(t_{(i,j)} + r_j - r_i)$
    - Subtracting the same two relations gives $t_{(i,k)} = \frac{1}{2}(t_{(i,j)} + t_{(i,m)} - t_{(j,m)})$; averaging over $m$ we get the $r$'s
- Recalculate all $T_{ij} = t_{(i, j)} - r_i - r_j$
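The iteration above can be sketched compactly in Python. This is an illustrative version rather than an optimized implementation: the distance table is assumed to be a dictionary keyed by `frozenset` pairs, new internal nodes get hypothetical labels `node0`, `node1`, and so on, and the last two remaining nodes are simply connected by their remaining distance. The example distances come from a made-up additive tree.

```python
def neighbour_joining(names, dist):
    """names: list of leaf labels; dist: {frozenset({a, b}): distance}.
    Returns a list of (node, neighbouring node, branch length) edges."""
    d = dict(dist)
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r_i = 1 / (n - 2) * sum over all other nodes j of t_(i,j)
        r = {i: sum(d[frozenset((i, j))] for j in nodes if j != i) / (n - 2)
             for i in nodes}
        # Pick the pair (i, j) with minimal T_ij = t_(i,j) - r_i - r_j.
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                i, j = nodes[x], nodes[y]
                T = d[frozenset((i, j))] - r[i] - r[j]
                if best is None or T < best[0]:
                    best = (T, i, j)
        _, i, j = best
        dij = d[frozenset((i, j))]
        k = f"node{next_id}"  # label of the new internal node
        next_id += 1
        # Branch lengths from i and j to the new node k.
        edges.append((i, k, 0.5 * (dij + r[i] - r[j])))
        edges.append((j, k, 0.5 * (dij + r[j] - r[i])))
        # Distances from k to every remaining node m.
        for m in nodes:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (
                    d[frozenset((i, m))] + d[frozenset((j, m))] - dij
                )
        nodes = [m for m in nodes if m not in (i, j)] + [k]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

# Made-up additive tree: A (1) and B (2) share a parent, C (3) and D (4) share
# a parent, and the two parents are joined by an internal branch of length 5.
pairs = {("A", "B"): 3.0, ("A", "C"): 9.0, ("A", "D"): 10.0,
         ("B", "C"): 10.0, ("B", "D"): 11.0, ("C", "D"): 7.0}
dist = {frozenset(p): v for p, v in pairs.items()}
edges = neighbour_joining(["A", "B", "C", "D"], dist)
for edge in edges:
    print(edge)
```

Because the example distances are exactly additive, the recovered branch lengths match the ones used to build the distance table.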