(maximum-likelihood-estimation)=
# Maximum Likelihood Estimation (MLE)
---

**Maximum likelihood estimation (MLE)** is a principle for choosing the parameters of a probability distribution that **best explain** a sampled dataset.
The instances in the dataset are assumed to be independently sampled from the same distribution.
MLE gives a relatively easy way to estimate parameters from the data.

## Maximum likelihood
---

### Likelihood function

Given the dataset $\mathcal{X} = \{ \mathbf{x}_{1}, \dots, \mathbf{x}_{n} \}$, the **likelihood function** is the probability (or probability density) of $\mathcal{X}$ under the distribution of $\mathbf{X}$ with parameters $\mathbf{\Theta}$, viewed as a function of $\mathbf{\Theta}$:

$$ 
f(\mathbf{\Theta}) = \mathbb{P}_{\mathbf{X}}(\mathcal{X} ; \mathbf{\Theta}), 
$$

which measures how likely the parameters $\mathbf{\Theta}$ are with respect to the data.

Notes:

- Since the dataset is known, the likelihood function depends only on the parameters $\mathbf{\Theta}$, and thus generally doesn't have the same shape as the probability density of the variable $\mathbf{X}$. In particular, it need not integrate to $1$ over $\mathbf{\Theta}$.

- The semicolon indicates that $\mathbf{\Theta}$ is a parameter (a fixed value) rather than a random variable.
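To make the first note concrete, here is a minimal sketch, assuming a Bernoulli (coin-flip) model with a single parameter $\theta$. The data and the names `flips` and `likelihood` are made up for illustration. The data stay fixed while $\theta$ varies over a grid, so the result is a curve over parameter values rather than a distribution over outcomes.

```python
# Hypothetical data: 10 coin flips (1 = heads), assumed i.i.d. Bernoulli(theta).
flips = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

def likelihood(theta, data):
    """Probability of the observed data, viewed as a function of theta."""
    value = 1.0
    for x in data:
        value *= theta if x == 1 else (1.0 - theta)
    return value

# Evaluate the likelihood on a grid of candidate parameters in [0, 1].
step = 0.01
grid = [i / 100 for i in range(101)]
values = [likelihood(t, flips) for t in grid]

# Grid point with the highest likelihood.
best_theta = grid[values.index(max(values))]

# Trapezoidal approximation of the area under the likelihood curve
# (the endpoint values are zero here, so a plain sum times the step suffices).
area = sum(values) * step

print("theta with the highest likelihood on the grid:", best_theta)
print("approximate area under the likelihood curve:", area)
```

With 7 heads out of 10 flips, the grid maximum lands at the sample fraction of heads, and the area under the curve is far below $1$, which illustrates that the likelihood is not a probability density over $\theta$.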

### Procedures 

1. We collect a set of instances $\mathcal{X} = \{ \mathbf{x}_{1}, \dots, \mathbf{x}_{n} \}$ that we believe are drawn from the same distribution.

1. We select a parametric model $\mathbb{P}_{\mathbf{X}}(\mathbf{x} ; \mathbf{\Theta})$ that we think best explains the data.

1. We select the parameters $\mathbf{\Theta}^{*}$ to be the ones that maximize the probability of the data: 

$$ 
\begin{aligned}
\mathbf{\Theta}^{*} 
& = \arg\max_{\mathbf{\Theta}} \mathbb{P}_{\mathbf{X}}(\mathcal{X}; \mathbf{\Theta}) 
\\
& = \arg\max_{\mathbf{\Theta}} \prod_{i = 1}^{n} \mathbb{P}_{\mathbf{X}}(\mathbf{x}_{i} ; \mathbf{\Theta}) & [\text{the samples in } \mathcal{X} \text{ are i.i.d.}] 
\\
& = \arg\max_{\mathbf{\Theta}} \ln \prod_{i = 1}^{n} \mathbb{P}_{\mathbf{X}}(\mathbf{x}_{i} ; \mathbf{\Theta}) & [\text{the log trick: } \ln \text{ is strictly increasing}] 
\\
& = \arg\max_{\mathbf{\Theta}} \sum_{i = 1}^{n} \ln \mathbb{P}_{\mathbf{X}}(\mathbf{x}_{i} ; \mathbf{\Theta}).
\end{aligned}
$$
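The last two lines of the derivation also matter numerically: a product of many densities quickly underflows in floating point, while the sum of log-densities stays finite and has the same maximizer. The following is a minimal sketch, assuming a Gaussian model with known unit variance and unknown mean `mu`. The synthetic data, the grid range, and all function names are assumptions made for illustration.

```python
import math
import random

random.seed(0)
# Synthetic data: 2000 draws assumed i.i.d. from a Gaussian with unit variance.
data = [random.gauss(3.0, 1.0) for _ in range(2000)]

def gaussian_pdf(x, mu):
    """Density of N(mu, 1) evaluated at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def likelihood(mu, xs):
    """Product of densities; underflows to 0.0 for a sample this large."""
    return math.prod(gaussian_pdf(x, mu) for x in xs)

def log_likelihood(mu, xs):
    """Sum of log-densities; same maximizer as the product, but stable."""
    return sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi) for x in xs)

# Step 3 of the procedure as a crude grid search over candidate means.
grid = [2.0 + i / 100 for i in range(201)]  # 2.00, 2.01, ..., 4.00
mu_grid = max(grid, key=lambda mu: log_likelihood(mu, data))

# For this model the maximizer is known in closed form: the sample mean.
mu_closed_form = sum(data) / len(data)

print("raw likelihood at the sample mean:", likelihood(mu_closed_form, data))
print("log-likelihood at the sample mean:", log_likelihood(mu_closed_form, data))
print("grid maximizer:", mu_grid, "| sample mean:", mu_closed_form)
```

The grid search is only for illustration; it scales poorly with the number of parameters, which is one reason the optimization is usually done analytically or with gradient-based methods.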

### Optimization methods

If the likelihood function is concave and twice differentiable, the best parameters $\mathbf{\Theta}^{*}$ that maximize it are the ones at which the gradient of $f(\mathbf{\Theta})$ with respect to $\mathbf{\Theta}$ is $0$. Concavity means that the Hessian matrix is negative semidefinite everywhere, so such a stationary point is a global maximum.

$$ 
\nabla_{\mathbf{\Theta}} f(\mathbf{\Theta}) = \nabla_{\mathbf{\Theta}} \mathbb{P}_{\mathbf{X}}(\mathcal{X} ; \mathbf{\Theta}) = 0 
$$

$$ 
\nabla_{\mathbf{\Theta}}^{2} f(\mathbf{\Theta}) \preceq 0 
$$

Otherwise, or when these equations have no closed-form solution, numerical optimization methods (for example, gradient ascent) might need to be employed to solve the optimization problem.
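As one possible numerical method, the sketch below runs plain gradient ascent on the average log-likelihood. It assumes an exponential model with a single rate parameter, chosen only because its closed-form maximizer ($1$ divided by the sample mean) is available to check the result. The data are synthetic, and the initial guess, step size, and iteration count are hand-picked assumptions rather than recommended settings.

```python
import random

random.seed(1)
# Synthetic data: 1000 draws assumed i.i.d. from an exponential distribution.
data = [random.expovariate(2.0) for _ in range(1000)]
mean_x = sum(data) / len(data)

def grad_avg_log_likelihood(rate):
    """Derivative of (1/n) * sum(ln(rate) - rate * x_i), i.e. 1/rate - mean(x)."""
    return 1.0 / rate - mean_x

rate = 1.0           # initial guess (assumption)
learning_rate = 0.5  # step size (assumption, hand-picked for this data)
for _ in range(500):
    # Move uphill along the gradient of the average log-likelihood.
    rate += learning_rate * grad_avg_log_likelihood(rate)

# Closed-form solution of the zero-gradient condition, used as a check.
rate_closed_form = 1.0 / mean_x

print("gradient ascent estimate:", rate)
print("closed-form estimate:    ", rate_closed_form)
```

The log-likelihood is used instead of the likelihood itself because it has the same maximizer and a much simpler gradient. For models without a closed-form solution, the same loop applies, but convergence then has to be monitored directly, for example through the size of the gradient.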

## Example: linear regression
---

> TODO