# Bayesian methods for machine learning - Week 1

## Maximum likelihood estimate

Given a vector of model parameters $\theta$ and observed data $X = \{x_1, x_2, \ldots, x_N\}$, the **likelihood** $P(X|\theta)$ is viewed as a function of the model parameters: for each value of $\theta$, it gives the probability (or probability density) of the observed data under the model with those parameters.

**Important**: the likelihood is not a probability distribution over $\theta$ (e.g. its integral over $\theta$ does not have to equal one).
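To make this concrete, here is a minimal sketch that integrates a Gaussian likelihood over $\mu$ numerically. The three data points, the fixed $\sigma = 1$, and the grid limits are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data; sigma is assumed known and fixed at 1.
x = np.array([0.0, 1.0, 2.0])
sigma = 1.0


def likelihood(mu):
    """P(X | mu) for i.i.d. Gaussian data: product of the per-point densities."""
    dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.prod(dens)


# Evaluate the likelihood on a dense grid of mu values.
mu_grid = np.linspace(-10.0, 12.0, 20001)
lik = np.array([likelihood(mu) for mu in mu_grid])

# Approximate the integral over mu with a simple Riemann sum.
d_mu = mu_grid[1] - mu_grid[0]
area = np.sum(lik) * d_mu

print(f"integral of the likelihood over mu: {area:.4f}")
```

For this toy data set the integral comes out far below one, which is why the likelihood cannot be read as a density over $\mu$.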

The **Maximum Likelihood estimate (MLE)** is the value of $\theta$ that maximizes the probability of observing the data (i.e. the value of $\theta$ with the highest likelihood).

In this notebook we derive, following the course reading, the MLE of the parameter $\mu$ for univariate and multivariate Gaussian distributions. I recommend working through the derivation yourself first and consulting these notes only to check your result.

### MLE for univariate Gaussian

For a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ with parameters $\theta = (\mu, \sigma^2)$, the likelihood of a single point $x_i$ is:

$$
\begin{align*}
P(x_i|\theta) & = \mathcal{N}(x_i \mid \mu, \sigma^2) \\
                           & = \frac{1}{\sigma \sqrt{2\pi}} \mathrm{\exp} \left[-\frac{1}{2} \left( \frac{x_i - \mu}{\sigma} \right)^2\right]
\end{align*}
$$

Assuming the points are independent and identically distributed (i.i.d.), the likelihood for a set of points is the product of the individual likelihoods:

$$
\begin{align*}
P(X|\theta) & = \prod^N_{i=1} P(x_i |\theta) \\
            & = \prod^N_{i=1} \frac{1}{\sigma \sqrt{2\pi}} \mathrm{\exp} \left[ -\frac{1}{2} \left( \frac{x_i - \mu}{\sigma} \right)^2\right] \\
            & =  \left(\frac{1}{\sigma \sqrt{2\pi}} \right)^N\prod^N_{i=1} \mathrm{\exp} \left[ -\frac{1}{2} \left( \frac{x_i - \mu}{\sigma} \right)^2\right]
\end{align*}
$$

We want to find the maximum of the likelihood. Because the logarithm is monotonically increasing, this is equivalent to finding the maximum of the log likelihood, which is easier to work with since it turns products into sums:

$$
\mu_{MLE} = \underset{\mu}{\mathrm{argmax}} \log P(X|\theta)
$$

$$
\begin{align*}
\log P(X|\theta) & = \log \left[ \left(\frac{1}{\sigma \sqrt{2\pi}} \right)^N \prod^N_{i=1} \mathrm{\exp} \left[-\frac{1}{2} \left( \frac{x_i - \mu}{\sigma} \right)^2 \right] \right] \\
                 & = \log 1 - \log (\sigma \sqrt{2\pi})^N + \sum^N_{i=1} \log \mathrm{\exp} \left[-\frac{1}{2} \left( \frac{x_i - \mu}{\sigma} \right)^2 \right] \\
                 & = \log 1 - \log (\sigma \sqrt{2\pi})^N + \sum^N_{i=1} -\frac{1}{2} \left( \frac{x_i - \mu}{\sigma} \right)^2 \\
                 & = \log 1 - \log (\sigma \sqrt{2\pi})^N - \frac{1}{2\sigma^2} \sum^N_{i=1}  (x_i - \mu)^2
\end{align*}
$$
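The closed form above can be checked numerically. The sketch below, with made-up parameter values and synthetic data, compares it with the sum of per-point log densities and also shows why the raw product of densities is impractical: in double precision it underflows to zero for a few thousand points.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5  # hypothetical parameter values, chosen for illustration
x = rng.normal(mu, sigma, size=2000)
N = x.size

# Per-point Gaussian densities P(x_i | theta).
dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Direct product of the densities: underflows to 0.0 in float64 for large N.
direct_likelihood = np.prod(dens)

# Log likelihood as a sum of per-point log densities.
log_lik_sum = np.sum(np.log(dens))

# Closed form from the derivation (the log 1 term is zero and is omitted).
log_lik_closed = (
    -N * np.log(sigma * np.sqrt(2 * np.pi))
    - np.sum((x - mu) ** 2) / (2 * sigma**2)
)

print("direct product:", direct_likelihood)
print("sum of logs   :", log_lik_sum)
print("closed form   :", log_lik_closed)
```

The two log likelihood values should agree up to floating point error, while the direct product carries no usable information.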

Maximizing the log likelihood is equivalent to minimizing the negative log likelihood:

$$
\begin{align*}
\mu_{MLE} & = \underset{\mu}{\mathrm{argmin}} - \log P(X|\theta) \\
          & = \underset{\mu}{\mathrm{argmin}} - \log 1 + \log (\sigma \sqrt{2\pi})^N + \frac{1}{2\sigma^2} \sum^N_{i=1}  (x_i - \mu)^2
\end{align*}
$$

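The objective is a quadratic function of $\mu$ with a positive leading coefficient, so it has a single minimum, which we can find by setting the derivative with respect to $\mu$ to zero (the terms that do not depend on $\mu$ vanish):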

$$
\begin{align*}
\frac{\partial}{\partial \mu} \left[ - \log P(X|\theta) \right] & = 0 \\
\frac{\partial}{\partial \mu} \left( \frac{1}{2\sigma^2} \sum^N_{i=1} (x_i - \mu)^2 \right) & = 0 \\
\frac{1}{2\sigma^2} \sum^N_{i=1} (2\mu - 2 x_i) & = 0 \\
\frac{1}{\sigma^2} \left( \sum^N_{i=1} \mu - \sum^N_{i=1} x_i \right) & = 0 \\
\frac{N\mu}{\sigma^2} & = \frac{1}{\sigma^2} \sum^N_{i=1} x_i \implies \mu_{MLE} = \frac{1}{N} \sum^N_{i=1} x_i
\end{align*}
$$
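As a sanity check on this result, the following sketch evaluates the negative log likelihood on a grid of $\mu$ values and confirms that its minimizer matches the sample mean, where the derivative is also zero. The synthetic data, the known $\sigma$, and the grid range are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0  # assumed known
x = rng.normal(5.0, sigma, size=500)  # synthetic data


def neg_log_lik(mu):
    """Negative log likelihood of the data for a given mu."""
    n = x.size
    return n * np.log(sigma * np.sqrt(2 * np.pi)) + np.sum((x - mu) ** 2) / (
        2 * sigma**2
    )


def grad_neg_log_lik(mu):
    """Derivative with respect to mu: (1 / sigma^2) * sum(mu - x_i)."""
    return np.sum(mu - x) / sigma**2


# Closed-form MLE: the sample mean.
mu_mle = x.mean()

# Brute-force minimizer on a grid with spacing 0.001.
mu_grid = np.linspace(0.0, 10.0, 10001)
nll = np.array([neg_log_lik(m) for m in mu_grid])
mu_grid_best = mu_grid[np.argmin(nll)]

print("sample mean        :", mu_mle)
print("grid minimizer     :", mu_grid_best)
print("derivative at mean :", grad_neg_log_lik(mu_mle))
```

The grid minimizer can only agree with the sample mean up to the grid spacing, so a small difference between the two printed values is expected.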

### MLE for multivariate Gaussian

For a multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ with $\mu \in \mathbb{R}^{k}$ and $\Sigma \in \mathbb{R}^{k \times k}$, the likelihood of a single point $x_i \in \mathbb{R}^{k}$ is:

$$
\begin{align*}
P(x_i|\theta) & = \mathcal{N}(x_i \mid \mu, \Sigma) \\
              & = (2\pi)^{-\frac{k}{2}} |\Sigma|^{-\frac{1}{2}} \mathrm{\exp} \left[ -\frac{1}{2} (x_i - \mu)^{\top} \Sigma^{-1}(x_i-\mu) \right]
\end{align*}
$$

As before, assuming i.i.d. points, the likelihood for a set of points is the product of the individual likelihoods:

$$
\begin{align*}
P(X|\theta) & = \prod^N_{i=1} P(x_i |\theta) \\
            & = \prod^N_{i=1} (2\pi)^{-\frac{k}{2}} |\Sigma|^{-\frac{1}{2}} \mathrm{\exp}\left[-\frac{1}{2} (x_i - \mu)^{\top} \Sigma^{-1}(x_i-\mu)\right] \\
            & = (2\pi)^{-\frac{Nk}{2}} |\Sigma|^{-\frac{N}{2}} \prod^N_{i=1} \mathrm{\exp}\left[-\frac{1}{2} (x_i - \mu)^{\top} \Sigma^{-1}(x_i-\mu)\right]
\end{align*}
$$

Then, the log likelihood is:

$$
\begin{align*}
\log P(X|\theta) & = - \frac{Nk}{2} \log (2\pi) - \frac{N}{2} \log |\Sigma| + \sum^N_{i=1} -\frac{1}{2} (x_i - \mu)^{\top} \Sigma^{-1}(x_i - \mu) \\
                 & = - \frac{Nk}{2} \log (2\pi) - \frac{N}{2} \log |\Sigma| - \frac{1}{2} \sum^N_{i=1} (x_i^{\top} \Sigma^{-1} - \mu^{\top} \Sigma^{-1}) (x_i-\mu) \\
                 & = - \frac{Nk}{2} \log (2\pi) - \frac{N}{2} \log |\Sigma| - \frac{1}{2} \sum^N_{i=1} (x_i^{\top} \Sigma^{-1}x_i - x_i^{\top} \Sigma^{-1} \mu - \mu^{\top} \Sigma^{-1} x_i + \mu^{\top} \Sigma^{-1} \mu)
\end{align*}
$$
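The log likelihood above translates fairly directly into NumPy. The sketch below is one possible implementation, not a reference one: the function names and the test data are made up for illustration. It computes the compact form from the first line and checks it against the fully expanded sum from the last line.

```python
import numpy as np


def mvn_log_likelihood(X, mu, Sigma):
    """Log likelihood of the rows of X (shape (N, k)) under N(mu, Sigma)."""
    N, k = X.shape
    diff = X - mu  # rows are x_i - mu
    _, logdet = np.linalg.slogdet(Sigma)  # log |Sigma|, computed stably
    # Rows of `solved` are Sigma^{-1} (x_i - mu), without forming the inverse.
    solved = np.linalg.solve(Sigma, diff.T).T
    quad = np.sum(diff * solved)  # sum_i (x_i - mu)^T Sigma^{-1} (x_i - mu)
    return -0.5 * N * k * np.log(2 * np.pi) - 0.5 * N * logdet - 0.5 * quad


def mvn_log_likelihood_expanded(X, mu, Sigma):
    """Same quantity, using the four-term expansion of the quadratic form."""
    N, k = X.shape
    P = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    total = 0.0
    for x_i in X:
        total += x_i @ P @ x_i - x_i @ P @ mu - mu @ P @ x_i + mu @ P @ mu
    return -0.5 * N * k * np.log(2 * np.pi) - 0.5 * N * logdet - 0.5 * total


# Hypothetical parameters and synthetic data.
rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=200)

compact = mvn_log_likelihood(X, mu, Sigma)
expanded = mvn_log_likelihood_expanded(X, mu, Sigma)
print(compact, expanded)
```

Using `np.linalg.solve` and `np.linalg.slogdet` instead of an explicit inverse and determinant is a numerical stability choice; the expanded version keeps the explicit inverse only to mirror the equation term by term.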

We then define the MLE for the $\mu$ vector as:

$$
\begin{align*}
\mu_{MLE} & = \underset{\mu}{\mathrm{argmax}} \log P(X|\theta) \\
          & = \underset{\mu}{\mathrm{argmin}} - \log P(X|\theta)
\end{align*}
$$

To find the minimum, we set the gradient with respect to $\mu$ to zero (the terms that do not depend on $\mu$ vanish):
    
$$
\begin{align*}
\frac{\partial}{\partial{\mu}} \left[- \log P(X|\theta) \right] & = 0 \\
\frac{\partial}{\partial{\mu}} \left[ \frac{1}{2} \sum^N_{i=1} (x_i^{\top} \Sigma^{-1}x_i - x_i^{\top} \Sigma^{-1} \mu - \mu^{\top} \Sigma^{-1} x_i + \mu^{\top} \Sigma^{-1} \mu) \right] & = 0
\end{align*}
$$

Using the vector derivative rules $\frac{\partial}{\partial{x}} (a^{\top}x) = \frac{\partial}{\partial{x}} (x^{\top}a) = a$ and $\frac{\partial}{\partial{x}} (x^{\top}Ax) = (A + A^{\top})x$ for $x, a \in \mathbb{R}^{k}$, $A \in \mathbb{R}^{k \times k}$ (the latter reduces to $2Ax$ when $A$ is symmetric), and writing gradients as column vectors, we get:

$$
\frac{1}{2} \sum^N_{i=1} \left( - (x_i^{\top} \Sigma^{-1})^{\top} - \Sigma^{-1} x_i + 2 \Sigma^{-1} \mu \right) = 0
$$

As $\Sigma^{-1}$ is symmetric, we can express:

$$
(x_i^{\top}\Sigma^{-1})^{\top} = (\Sigma^{-1})^{\top} x_i = \Sigma^{-1} x_i
$$

Then we can rewrite the previous equation and, in the last step, multiply both sides by $\Sigma$:

$$
\begin{align*}
\frac{1}{2} \sum^N_{i=1} \left( - 2 \Sigma^{-1} x_i + 2 \Sigma^{-1} \mu \right) & = 0 \\
\sum^N_{i=1} \Sigma^{-1} \mu & = \sum^N_{i=1} \Sigma^{-1} x_i \\
N \Sigma^{-1} \mu & = \Sigma^{-1} \sum^N_{i=1} x_i  \implies \mu_{MLE} = \frac{1}{N} \sum^N_{i=1} x_i
\end{align*}
$$
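The multivariate result can also be checked numerically. In this sketch the covariance matrix, the true mean, and the perturbation size are arbitrary assumptions. It evaluates the gradient of the negative log likelihood, $\sum_i \Sigma^{-1}(\mu - x_i)$, at the sample mean and confirms that moving away from the sample mean increases the negative log likelihood.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical ground truth: a random symmetric positive definite covariance.
true_mu = np.array([0.5, -1.0, 2.0])
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3.0 * np.eye(3)
X = rng.multivariate_normal(true_mu, Sigma, size=1000)

Sigma_inv = np.linalg.inv(Sigma)
_, logdet = np.linalg.slogdet(Sigma)


def neg_log_lik(mu):
    """Negative log likelihood of X under N(mu, Sigma), Sigma held fixed."""
    N, k = X.shape
    diff = X - mu
    # sum_i (x_i - mu)^T Sigma^{-1} (x_i - mu)
    quad = np.einsum("ni,ij,nj->", diff, Sigma_inv, diff)
    return 0.5 * N * k * np.log(2 * np.pi) + 0.5 * N * logdet + 0.5 * quad


def grad_neg_log_lik(mu):
    """Gradient with respect to mu: sum_i Sigma^{-1} (mu - x_i)."""
    N = X.shape[0]
    return Sigma_inv @ (N * mu - X.sum(axis=0))


# Closed-form MLE: the sample mean of the rows of X.
mu_mle = X.mean(axis=0)
grad_at_mle = grad_neg_log_lik(mu_mle)

# Negative log likelihood at a few randomly perturbed means.
nll_at_mle = neg_log_lik(mu_mle)
nll_perturbed = [neg_log_lik(mu_mle + 0.1 * rng.normal(size=3)) for _ in range(5)]

print("gradient at sample mean:", grad_at_mle)
print("NLL at sample mean     :", nll_at_mle)
print("NLL at perturbed means :", nll_perturbed)
```

Note that the result does not depend on $\Sigma$: any symmetric positive definite covariance leads to the same $\mu_{MLE}$.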