# Local likelihood optimization

In the previous notebook, we went through the calculation of the likelihood function for an HMM and explicitly minimized it. Specifically, we looked at a situation where the transition and emission probability matrices took the form

$$ \boldsymbol{A} = \left[ \begin{matrix} 1 - \alpha & \alpha \\ \alpha & 1 - \alpha \end{matrix} \right] $$

and 

$$ \boldsymbol{B} = \left[ \begin{matrix} 1 - \beta & \beta \\ \beta & 1 - \beta \end{matrix} \right] $$

respectively, and then estimated the parameters $\alpha$ and $\beta$ given an observed sequence of states.
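As a concrete reminder of this parameterization, here is a minimal sketch that builds both matrices with NumPy. The helper name `symmetric_two_state_matrix` and the values chosen for `alpha` and `beta` are illustrative assumptions, not taken from the previous notebook.

```python
import numpy as np

def symmetric_two_state_matrix(p):
    """Return [[1 - p, p], [p, 1 - p]] for a probability p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return np.array([[1.0 - p, p], [p, 1.0 - p]])

# Illustrative parameter values (assumed for this sketch)
alpha, beta = 0.1, 0.2
A = symmetric_two_state_matrix(alpha)  # transition matrix
B = symmetric_two_state_matrix(beta)   # emission matrix

# Each row is a probability distribution, so every row sums to one
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```

Because both matrices share the same symmetric form, a single helper covers them, and the whole model is described by the two-component parameter vector $(\alpha, \beta)$.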

*A priori*, we do not know the hidden state sequence (it is hidden, after all). To deal with this, we would need to calculate the likelihood over all possible hidden state sequences and all parameter values in the $\boldsymbol{A}$ and $\boldsymbol{B}$ matrices. The number of hidden state sequences grows exponentially with the sequence length (there are $2^T$ of them for a two-state chain observed for $T$ steps), so an exhaustive search quickly becomes intractable and we must resort to other methods. Specifically, we can run a local optimization routine on the parameter space from a given starting point.

In this notebook, we will first build a custom gradient descent algorithm (our own local likelihood optimizer) to track the evolution of the parameter vector over iterations, and then use these results to validate a more efficient implementation that uses the `scipy.optimize.minimize` function to do the heavy lifting.
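To give a feel for the second step, the sketch below calls `scipy.optimize.minimize` on a stand-in objective. The function `toy_objective`, its minimum at $(0.1, 0.2)$, the starting point, and the choice of the `L-BFGS-B` method with box bounds are all assumptions made for illustration; the real objective would be the HMM likelihood from the previous notebook.

```python
import numpy as np
from scipy.optimize import minimize

def toy_objective(theta):
    """Stand-in for the HMM objective: a smooth bowl with its minimum at (0.1, 0.2)."""
    target = np.array([0.1, 0.2])
    return float(np.sum((theta - target) ** 2))

theta0 = np.array([0.4, 0.4])        # initial guess for (alpha, beta)
bounds = [(1e-6, 1 - 1e-6)] * 2      # keep both parameters inside (0, 1)

# L-BFGS-B is a local, gradient-based method that supports box bounds;
# with no gradient supplied, SciPy estimates it numerically.
result = minimize(toy_objective, theta0, method="L-BFGS-B", bounds=bounds)
print(result.x, result.fun, result.success)
```

As with any local optimizer, the answer depends on the starting point `theta0`, which is why comparing against a hand-written descent from the same starting point is a useful check.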

#### Gradient descent

Given that we are trying to minimize the likelihood function $\mathcal{L}(\theta)$ for a parameter vector $\theta$, a gradient descent algorithm updates the parameter vector $\theta_i \to \theta_{i+1}$ as

$$ \theta_{i+1} = \theta_i - \eta \nabla \mathcal{L}(\theta_i) $$

where $\nabla \mathcal{L}(\theta_i)$ is the gradient of the likelihood function evaluated at $\theta_i$ and $\eta$ is the *learning rate*, which determines how large the update steps are. For an initial implementation, we take the learning rate to be constant, but there are many schemes that update it dynamically so that the solution converges to a local minimum more rapidly. To calculate the derivatives of the likelihood function, we approximate each component of the gradient with a central finite difference

$$ \partial_{\theta^k}\mathcal{L}(\theta_i) \approx \frac{1}{2 \Delta\theta^{k}} \left[ \mathcal{L}(\theta_i + \Delta\theta^{k}) - \mathcal{L}(\theta_i - \Delta\theta^{k}) \right] $$

where $\Delta\theta^k$ is a small perturbation applied to the $k$-th component of the parameter vector only. This procedure requires two evaluations of the likelihood function per dimension of the parameter vector for each iterative update to $\theta$.
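The update rule and the finite-difference gradient can be sketched together as follows. This is a minimal illustration under stated assumptions: the names `central_difference_gradient`, `gradient_descent`, and `toy_objective` are hypothetical, the quadratic objective stands in for the HMM likelihood, and the step size, learning rate, and iteration count are arbitrary choices. No constraint keeping the parameters inside $[0, 1]$ is applied here.

```python
import numpy as np

def central_difference_gradient(f, theta, step=1e-5):
    """Approximate the gradient of f at theta, perturbing one component at a time."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        delta = np.zeros_like(theta)
        delta[k] = step
        # Two evaluations of f per dimension of the parameter vector
        grad[k] = (f(theta + delta) - f(theta - delta)) / (2.0 * step)
    return grad

def gradient_descent(f, theta0, learning_rate=0.1, n_iter=200):
    """Constant-learning-rate descent; returns the full trajectory of theta."""
    theta = np.asarray(theta0, dtype=float)
    history = [theta.copy()]
    for _ in range(n_iter):
        theta = theta - learning_rate * central_difference_gradient(f, theta)
        history.append(theta.copy())
    return np.array(history)

def toy_objective(theta):
    """Stand-in objective (assumed): a smooth bowl with its minimum at (0.1, 0.2)."""
    target = np.array([0.1, 0.2])
    return float(np.sum((theta - target) ** 2))

trajectory = gradient_descent(toy_objective, [0.4, 0.4])
theta_final = trajectory[-1]
print(theta_final)
```

Storing every iterate in `trajectory` makes it straightforward to plot how the parameter vector evolves, which is the main reason to write the optimizer by hand before switching to a library routine.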