# Maximum Likelihood Estimation and Gradients

Estimating parameters of data-generating distributions

## Recap: Log-Likelihood of the Linear-Gaussian Model

Recap recap recap

## The Gradient of a Multivariate Function

<span class="theorem-title">**Definition 1**</span> Let $f:\mathbb{R}^p\rightarrow \mathbb{R}$ be a function which accepts a vector input $\mathbf{w}=(w_1,\ldots,w_p)^T\in \mathbb{R}^p$ and returns a scalar output $f(\mathbf{w})\in \mathbb{R}$. The **partial derivative** of $f$ with respect to the $i$-th coordinate $w_i$ is defined as the limit

$$
\begin{aligned}
    \frac{\partial f}{\partial w_i} &= \lim_{h \rightarrow 0} \frac{f(w_1,\ldots,w_i + h, \ldots, w_p) - f(w_1,\ldots,w_i, \ldots, w_p)}{h} \\ 
    &= \lim_{h \rightarrow 0} \frac{f(\mathbf{w}+ h\mathbf{e}_i) - f(\mathbf{w})}{h}\;,
\end{aligned}
$$

where $\mathbf{e}_i = (0,0,\ldots,1,\ldots,0,0)^T$ is the $i$-th standard basis vector in $\mathbb{R}^p$, i.e., the vector with a 1 in the $i$-th position and 0’s elsewhere.
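
The limit definition can be approximated numerically by picking a small but nonzero $h$. The sketch below is one minimal way to do this in plain Python; the helper name `partial_derivative`, the step size `h=1e-6`, and the test function are all illustrative assumptions rather than part of the definition. Note that the code uses zero-based indices, while the text uses one-based indices.

```python
def partial_derivative(f, w, i, h=1e-6):
    """Forward-difference estimate of the i-th partial derivative of f at w.

    f: function taking a list of floats and returning a float
    w: list of floats (the point at which to differentiate)
    i: zero-based coordinate index
    h: small step size (an assumed default, not a tuned value)
    """
    # Build w + h * e_i by perturbing only coordinate i.
    w_shifted = list(w)
    w_shifted[i] += h
    return (f(w_shifted) - f(w)) / h


# Hypothetical test function: f(w) = w_0^2 + 3 * w_0 * w_1
def f(w):
    return w[0] ** 2 + 3 * w[0] * w[1]


w = [1.0, 2.0]

# Exact partials at (1, 2): df/dw_0 = 2*w_0 + 3*w_1 = 8 and df/dw_1 = 3*w_0 = 3.
approx_0 = partial_derivative(f, w, 0)
approx_1 = partial_derivative(f, w, 1)
print(approx_0, approx_1)
```

The estimates are close to the exact values but not equal to them, since $h$ is small rather than taken to the limit.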

Just like in single-variable calculus, it’s not usually convenient to work directly with the limit definition of the partial derivative. Instead we use the following heuristic:

<span class="theorem-title">**Proposition 1**</span> To compute $\frac{\partial f}{\partial w_i}$, treat all other variables $w_j$ for $j\neq i$ as constants, and differentiate $f$ with respect to $w_i$ using the usual rules of single-variable calculus (power rule, product rule, chain rule, etc.).

<span class="theorem-title">**Example 1**</span> Let $f:\mathbb{R}^3\rightarrow \mathbb{R}$ be defined by $f(x,y,z) = x^2\sin y + yz + z^3x$. Compute $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial y}$, and $\frac{\partial f}{\partial z}$.

> **Solution**
>
> To compute $\frac{\partial f}{\partial x}$, we treat $y$ and $z$ as constants, which yields
>
> $$
> \frac{\partial f}{\partial x} = 2x \sin y + z^3\;.
> $$
>
> Similarly, we can compute $\frac{\partial f}{\partial y}$ and $\frac{\partial f}{\partial z}$:
>
> $$
> \begin{aligned}
>     \frac{\partial f}{\partial y} &= x^2 \cos y + z \\ 
>     \frac{\partial f}{\partial z} &= y + 3z^2 x\;.    
> \end{aligned}
> $$
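
A quick way to catch algebra mistakes in a calculation like this one is to compare the formulas against finite-difference estimates at an arbitrary point. The following is a sketch of such a check using central differences; the evaluation point and the step size `h` are arbitrary choices made for illustration.

```python
import math


def f(x, y, z):
    # The function from the example: f(x, y, z) = x^2 sin(y) + y z + z^3 x
    return x**2 * math.sin(y) + y * z + z**3 * x


def analytic_partials(x, y, z):
    # The three partial derivatives computed by hand in the solution.
    df_dx = 2 * x * math.sin(y) + z**3
    df_dy = x**2 * math.cos(y) + z
    df_dz = y + 3 * z**2 * x
    return df_dx, df_dy, df_dz


def numeric_partials(x, y, z, h=1e-5):
    # Central differences: perturb one variable at a time, hold the others fixed.
    df_dx = (f(x + h, y, z) - f(x - h, y, z)) / (2 * h)
    df_dy = (f(x, y + h, z) - f(x, y - h, z)) / (2 * h)
    df_dz = (f(x, y, z + h) - f(x, y, z - h)) / (2 * h)
    return df_dx, df_dy, df_dz


point = (1.5, 0.7, -2.0)  # arbitrary evaluation point
analytic = analytic_partials(*point)
numeric = numeric_partials(*point)

# Largest disagreement between the hand-derived and numerical partials.
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_error)
```

Agreement at one point does not prove the formulas are correct, but a large disagreement reliably signals an error in either the derivation or the code.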

<span class="theorem-title">**Example 2**</span> Suppose we have $n$ independent and identically distributed samples $x_1,\ldots,x_n$ from a Gaussian distribution with unknown mean $\mu$ and known variance $\sigma^2$. The log-likelihood function for this data is

$$
\begin{aligned}
    \mathcal{L}(\mathbf{x};\sigma^2,\mu) &= \sum_{i=1}^n \log p(x_i;\sigma^2,\mu) \\
    &= \sum_{i=1}^n \log \left( \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) \right) \\
    &= \sum_{i=1}^n \left( -\frac{1}{2} \log(2\pi \sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \right) \\
    &= -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\;.
\end{aligned}
$$

Compute $\frac{\partial \mathcal{L}(\mathbf{x};\sigma^2,\mu)}{\partial \mu}$, the partial derivative of the log-likelihood with respect to the mean parameter $\mu$.

> **Solution**
>
> To perform this calculation, we treat $\sigma^2$ (and all the $x_i$’s) as constants and differentiate with respect to $\mu$. The first term in the log-likelihood does not depend on $\mu$, so its derivative is zero. For the second term, we need to use the chain rule:
>
> $$
> \begin{aligned}
>     \frac{\partial \mathcal{L}(\mathbf{x};\sigma^2,\mu)}{\partial \mu} &= -\frac{1}{2\sigma^2} \cdot \frac{\partial}{\partial \mu} \left( \sum_{i=1}^n (x_i - \mu)^2 \right) \\
>     &= -\frac{1}{2\sigma^2} \cdot  \left( \sum_{i=1}^n \frac{\partial}{\partial \mu}(x_i - \mu)^2 \right) \\
>     &= -\frac{1}{2\sigma^2} \cdot \sum_{i=1}^n 2(x_i - \mu)(-1) \\
>     &= \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu)\;.
> \end{aligned}
> $$
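
The same kind of numerical check applies to the log-likelihood derivative. The sketch below implements the closed-form log-likelihood and the derivative formula $\frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu)$, then compares the formula to a central-difference estimate. The data values, the variance, and the value of $\mu$ are made up for illustration.

```python
import math


def log_likelihood(xs, mu, sigma2):
    # -(n/2) log(2 pi sigma^2) - (1 / (2 sigma^2)) * sum_i (x_i - mu)^2
    n = len(xs)
    sum_sq = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sum_sq / (2 * sigma2)


def dloglik_dmu(xs, mu, sigma2):
    # Hand-derived partial derivative with respect to mu.
    return sum(x - mu for x in xs) / sigma2


xs = [2.1, 1.7, 3.0, 2.4, 1.9]  # made-up data, for illustration only
sigma2 = 0.5                    # assumed known variance
mu = 1.0                        # point at which we differentiate

# Central-difference estimate, treating xs and sigma2 as constants.
h = 1e-5
numeric = (log_likelihood(xs, mu + h, sigma2)
           - log_likelihood(xs, mu - h, sigma2)) / (2 * h)
analytic = dloglik_dmu(xs, mu, sigma2)
print(analytic, numeric)
```

Because the log-likelihood is quadratic in $\mu$, the central difference here agrees with the formula up to floating-point rounding.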

<span class="theorem-title">**Definition 2**</span> Let $f:\mathbb{R}^p\rightarrow \mathbb{R}$ be a differentiable function which accepts a vector input $\mathbf{w}=(w_1,\ldots,w_p)^T\in \mathbb{R}^p$ and returns a scalar output $f(\mathbf{w})\in \mathbb{R}$. The **gradient** of $f$ at $\mathbf{w}$ is the vector of partial derivatives

$$
\nabla f(\mathbf{w}) = \left( \frac{\partial f(\mathbf{w})}{\partial w_1}, \ldots, \frac{\partial f(\mathbf{w})}{\partial w_p} \right)^T \in \mathbb{R}^p\;.
$$
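
Since the gradient simply stacks the $p$ partial derivatives into one vector, a numerical approximation can be assembled one coordinate at a time. The sketch below does this with central differences in plain Python; the helper name `gradient`, the step size, and the test function are hypothetical choices made for illustration.

```python
def gradient(f, w, h=1e-5):
    """Central-difference estimate of the gradient of f at w.

    Returns a list whose i-th entry approximates the partial derivative
    of f with respect to coordinate i (zero-based).
    """
    grad = []
    for i in range(len(w)):
        # Perturb only coordinate i, in both directions.
        w_plus = list(w)
        w_plus[i] += h
        w_minus = list(w)
        w_minus[i] -= h
        grad.append((f(w_plus) - f(w_minus)) / (2 * h))
    return grad


# Hypothetical test function: f(w) = w_0^2 + w_0 * w_1 + w_2^3
def f(w):
    return w[0] ** 2 + w[0] * w[1] + w[2] ** 3


w = [1.0, 2.0, -1.0]

# Exact gradient: (2*w_0 + w_1, w_0, 3*w_2^2), which is (4, 1, 3) at this w.
g = gradient(f, w)
print(g)
```

Each entry costs two evaluations of $f$, so this approach needs $2p$ function evaluations per gradient. That makes it useful for checking formulas but expensive when $p$ is large.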

<span class="theorem-title">**Example 3**</span>  

## Critical Points and Local Extrema

<span class="theorem-title">**Example 4 (Estimating the Mean of a Gaussian)**</span>  

## Gradient of the Linear-Gaussian Log-Likelihood

## A First Look: Gradient Descent for Maximum Likelihood Estimation

## Deriving and Checking Gradient Formulas

## A Complete Example: Gradient Descent for Laplace Regression