## 7.4

Derive expressions for the elements of the $2\times 2$ Hessian matrix w.r.t. the weight and bias parameters of a linear regression model with the following form and error function:
$$
y(x, w, b) = wx + b \\ \ \\
E(w, b) = \frac{1}{2}\sum_{n=1}^N \{y(x_n, w, b) - t_n \}^2
$$
Then show that the trace and determinant of this Hessian are both positive.

### Hessian
The Hessian is the matrix of second partial derivatives w.r.t. the parameters $w$ and $b$. So, let's begin by taking the first partial derivatives. This could be done directly with matrix calculus, but the derivatives are easy to obtain without it:
$$
\frac{\partial E}{\partial w} = \frac{1}{2} \sum_{n=1}^N \frac{\partial}{\partial w} \{y(x_n, w, b) - t_n \}^2 \\ \ \\
= \frac{1}{2} \sum_{n=1}^N \frac{\partial}{\partial y}\{y(x_n, w, b) - t_n \}^2 \frac{\partial y}{\partial w} \\ \ \\
= \sum_{n=1}^N \{y(x_n, w, b) - t_n \} x_n \\ \ \\
= \bf x^\intercal ( y - t ) 
$$
Similarly,
$$
\frac{\partial E}{\partial b} = \sum_{n=1}^N \{y(x_n, w, b) - t_n \} = \bf 1^\intercal ( y - t) 
$$
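where $\bf 1$ is a vector of $1$'s with length $N$, and $\bf x$, $\bf y$ and $\bf t$ collect the inputs $x_n$, the predictions $y(x_n, w, b)$ and the targets $t_n$ into vectors of length $N$.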
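As a quick sanity check, the vectorized gradients above can be compared against central finite differences. This is a minimal sketch assuming NumPy; the function names, the data, and the parameter values are made up for illustration and are not part of the exercise.

```python
import numpy as np

def error(w, b, x, t):
    """Sum-of-squares error E(w, b) for the model y = w*x + b."""
    return 0.5 * np.sum((w * x + b - t) ** 2)

def gradient(w, b, x, t):
    """Analytic gradient: (x^T (y - t), 1^T (y - t))."""
    residual = w * x + b - t
    return np.array([x @ residual, np.ones_like(x) @ residual])

# Illustrative data and parameter values (assumptions, not from the exercise)
x = np.array([0.5, 1.0, 2.0, 3.5])
t = np.array([1.0, 0.0, 2.5, 4.0])
w, b, h = 0.3, -0.7, 1e-5

grad_analytic = gradient(w, b, x, t)

# Central differences in w and in b, respectively
grad_numeric = np.array([
    (error(w + h, b, x, t) - error(w - h, b, x, t)) / (2 * h),
    (error(w, b + h, x, t) - error(w, b - h, x, t)) / (2 * h),
])

print(grad_analytic, grad_numeric)
```

Because $E$ is quadratic in $w$ and $b$, central differences should agree with the analytic gradient up to floating-point rounding.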

Then the second partial derivatives:
$$
f_{ww} = \frac{\partial^2 E}{\partial w^2} = \frac{\partial}{\partial w} \sum_{n=1}^N \{y(x_n, w, b) - t_n \} x_n \\ \ \\
= \sum_{n=1}^N x_n^2 =  \mathbf{x}^\intercal \mathbf{x} \\ \ \\
f_{bb} = \frac{\partial^2 E}{\partial b^2} = \frac{\partial}{\partial b} \sum_{n=1}^N \{y(x_n, w, b) - t_n \} \\ \ \\
= \sum_{n=1}^N 1 = N \\ \ \\
f_{wb} = \frac{\partial^2 E}{\partial w \partial b} = \frac{\partial}{\partial b} \sum_{n=1}^N \{y(x_n, w, b) - t_n \} x_n \\ \ \\
= \sum_{n=1}^N x_n = \mathbf{x}^\intercal \mathbf{1} \\ \ \\
f_{bw} = \frac{\partial^2 E}{\partial b \partial w} = \frac{\partial}{\partial w} \sum_{n=1}^N \{y(x_n, w, b) - t_n \} \\ \ \\
= \sum_{n=1}^N x_n = \mathbf{x}^\intercal \mathbf{1}
$$

So, the Hessian is:
$$
\mathbf{H} = 
\begin{bmatrix}
f_{ww} & f_{wb} \\
f_{bw} & f_{bb}
\end{bmatrix} = 
\begin{bmatrix}
\sum x_n^2 & \sum x_n \\
\sum x_n & N
\end{bmatrix} = 
\begin{bmatrix}
\mathbf{x}^\intercal \mathbf{x} & \mathbf{x}^\intercal \mathbf{1} \\
\mathbf{x}^\intercal \mathbf{1} & N
\end{bmatrix}
$$
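The entries of the Hessian can be checked numerically. The sketch below, assuming NumPy and using made-up data, builds the matrix from the derived formulas and compares it with a central finite-difference approximation of the second derivatives of $E$. The helper names `analytic_hessian` and `numeric_hessian` are hypothetical.

```python
import numpy as np

def error(w, b, x, t):
    """Sum-of-squares error E(w, b) for the model y = w*x + b."""
    return 0.5 * np.sum((w * x + b - t) ** 2)

def analytic_hessian(x):
    """Hessian from the derivation: [[sum x^2, sum x], [sum x, N]]."""
    return np.array([[x @ x, x.sum()],
                     [x.sum(), x.size]], dtype=float)

def numeric_hessian(w, b, x, t, h=1e-3):
    """Central finite-difference approximation of the 2x2 Hessian."""
    p = np.array([w, b], dtype=float)
    steps = np.eye(2) * h
    H = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            ei, ej = steps[i], steps[j]
            H[i, j] = (error(*(p + ei + ej), x, t)
                       - error(*(p + ei - ej), x, t)
                       - error(*(p - ei + ej), x, t)
                       + error(*(p - ei - ej), x, t)) / (4 * h * h)
    return H

# Illustrative data (assumptions, not from the exercise)
x = np.array([0.5, 1.0, 2.0, 3.5])
t = np.array([1.0, 0.0, 2.5, 4.0])

H_analytic = analytic_hessian(x)
H_numeric = numeric_hessian(0.3, -0.7, x, t)
print(H_analytic)
print(H_numeric)
```

Note that the Hessian does not depend on $w$, $b$ or the targets $t_n$, so the point at which the finite differences are evaluated is arbitrary.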

### Determinant
$$
\text{det}(\mathbf{H}) = f_{ww}f_{bb} - f_{bw}f_{wb} = N \mathbf{x}^\intercal \mathbf{x} - \big(\mathbf{x}^\intercal \mathbf{1}\big)^2 \\ \ \\
= N\sum x_n^2 - \big(\sum x_n \big)^2
$$
This is $N$ times the sum of squares of $\bf x$ minus the squared sum of $\bf x$. This may remind us of the variance:
$$
\mathbb{V}(\mathbf{x}) = \mathbb{E}\big[\mathbf{x}^2\big] - \mathbb{E}[\mathbf{x}]^2 = \frac{1}{N}\sum_{n=1}^N x_n^2 - \bigg(\frac{1}{N}\sum_{n=1}^N x_n\bigg)^2
\\ \ \\
\implies \text{det}(\mathbf{H}) = N^2 \, \mathbb{V}(\mathbf{x})
$$


This means that the determinant is:
- Always non-negative since the variance is non-negative
- Zero *only* when all observations of the input variable $x_n$ are identical
- Increasing in the variance
    - And therefore, increasing in the spread of the $x_n$ values about their mean (rescaling all $x_n$ by a factor $c$ multiplies the determinant by $c^2$)
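The relation $\text{det}(\mathbf{H}) = N^2 \, \mathbb{V}(\mathbf{x})$ and the properties listed above can be illustrated with a short sketch. It assumes NumPy, whose `np.var` computes the population variance (`ddof=0`) by default, matching the definition used here. The data and the helper `hessian_det` are made up for illustration.

```python
import numpy as np

def hessian_det(x):
    """det(H) = N * sum(x^2) - (sum x)^2, from the derivation."""
    return x.size * np.sum(x ** 2) - np.sum(x) ** 2

# Illustrative inputs (assumptions, not from the exercise)
x_spread = np.array([0.5, 1.0, 2.0, 3.5])
x_const = np.full(4, 2.0)       # all observations identical
x_scaled = 2.0 * x_spread       # same data with twice the spread

det_spread = hessian_det(x_spread)
det_const = hessian_det(x_const)
det_scaled = hessian_det(x_scaled)

# Compare against N^2 times the population variance
print(det_spread, x_spread.size ** 2 * np.var(x_spread))
print(det_const, det_scaled)
```

In this sketch the constant input gives a zero determinant, and doubling the spread of the inputs multiplies the determinant by four.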

The determinant may be interpreted geometrically as the factor by which a matrix scales "volume" (area, in two dimensions) under the transformation it represents; here, that matrix is the Hessian. Because the Hessian is composed of second partial derivatives w.r.t. the parameters, it describes how the gradient (i.e. the vector of first partial derivatives) changes in response to a change in the parameters. A *larger* determinant, therefore, suggests a larger change in the gradient for a given change in the parameters. The magnitude of this change is dictated by the variance of the input variable $\bf x$ and the number of input observations $N$. When $\mathbb{V}(\bf x)$ is large, the determinant is large, indicating that the gradient changes more steeply w.r.t. the parameters $w$ and $b$, so the error is more sensitive to both parameters. When the variance is low, this sensitivity is weaker. In the case of constant $x_n = c$, such that $\mathbb{V}(\bf x) = 0$, the determinant is $0$ and the Hessian is singular: the predictions depend on the parameters only through the combination $wc + b$, so moving along the direction $(1, -c)$ in $(w, b)$ space changes neither the predictions nor the error. In this case, $w$ and $b$ cannot be determined separately, because many combinations of $w$ and $b$ give the same predictions.
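The degenerate case of constant inputs can be made concrete with a small sketch, assuming NumPy and made-up data. With every $x_n$ equal to a constant $c$, shifting $w$ by $\delta$ while shifting $b$ by $-c\delta$ leaves the predictions unchanged, and the direction $(1, -c)$ lies in the null space of the Hessian.

```python
import numpy as np

def error(w, b, x, t):
    """Sum-of-squares error E(w, b) for the model y = w*x + b."""
    return 0.5 * np.sum((w * x + b - t) ** 2)

# Illustrative setup (assumptions, not from the exercise)
c = 2.0
x = np.full(5, c)                          # every input equals c
t = np.array([1.0, 0.5, 2.0, 1.5, 3.0])
w, b, delta = 0.4, -0.1, 0.75

e_base = error(w, b, x, t)
e_moved = error(w + delta, b - c * delta, x, t)  # move along (1, -c)
e_other = error(w + delta, b, x, t)              # move along (1, 0)

# Hessian for constant inputs and its action on the flat direction
H = np.array([[x @ x, x.sum()],
              [x.sum(), x.size]], dtype=float)
flat_direction = np.array([1.0, -c])

print(e_base, e_moved, e_other)
print(H @ flat_direction)
```

The error is unchanged along $(1, -c)$ but does change along other directions, so the parameters still affect the error; only one particular combination of them is unconstrained.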

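So, a larger determinant indicates that the curvature of the error surface is more pronounced (steeper) overall, and that the Hessian is further from being singular. Strictly speaking, how well conditioned the optimization problem is depends on the ratio of the Hessian's eigenvalues rather than on their product, so the determinant is only a rough guide to conditioning.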

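The determinant of a matrix is **equal to the product of its eigenvalues**. When the determinant is strictly positive (i.e. the $x_n$ are not all identical), the two eigenvalues of the Hessian are nonzero and share the same sign, so a critical point must be either a local maximum or a local minimum rather than a saddle point.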

### Trace
The trace of a square matrix is the sum of the elements on its main diagonal:
$$\text{tr}(\mathbf{A}) = a_{11} + a_{22} + \cdots + a_{nn}$$
The trace of a matrix is **equal to the sum of its eigenvalues**.

$$\text{tr}(\mathbf{H}) = f_{ww} + f_{bb} = \sum_{n=1}^N x_n^2 + N$$
This is strictly positive because $\sum x_n^2 \geq 0$ and $N \geq 1$, so the sum of the Hessian's eigenvalues is positive.

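Since the sum and product of the Hessian's eigenvalues are both positive (the product strictly so whenever the $x_n$ are not all identical), both eigenvalues must be positive, and so the critical point of the error function **must be a minimum**.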
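The eigenvalue argument can be checked numerically. The following sketch assumes NumPy and made-up inputs; it uses `np.linalg.eigvalsh`, which is intended for symmetric matrices such as this Hessian, and compares the sum and product of the eigenvalues with the trace and determinant.

```python
import numpy as np

# Illustrative inputs (assumptions, not from the exercise)
x = np.array([0.5, 1.0, 2.0, 3.5])

# Hessian from the derivation: [[sum x^2, sum x], [sum x, N]]
H = np.array([[x @ x, x.sum()],
              [x.sum(), x.size]], dtype=float)

eigvals = np.linalg.eigvalsh(H)   # eigenvalues of a symmetric matrix
trace = np.trace(H)
det = np.linalg.det(H)

print(eigvals)
print(eigvals.sum(), trace)       # sum of eigenvalues vs. trace
print(eigvals.prod(), det)        # product of eigenvalues vs. determinant
```

For these inputs both eigenvalues come out positive, which is consistent with the Hessian being positive definite and the critical point being a minimum.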