### Singular Value Decomposition (SVD)

The SVD of a matrix $X$ is given by:

$$ X = U \Sigma V^\top $$

where:
- $U$ is the matrix of left singular vectors (orthonormal columns),
- $\Sigma$ is a diagonal matrix of non-negative singular values,
- $V^\top$ is the transpose of $V$, the matrix of right singular vectors (orthonormal columns).

#### Transpose of $X$

Taking the transpose of $X$ gives:

$$ X^\top = V \Sigma^\top U^\top $$

#### Correlation Matrix

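If the columns of $X$ are centered, $X^\top X$ is proportional to the sample covariance matrix of the features (and to the correlation matrix if the columns are also standardized). This matrix is central to principal component analysis (PCA) and related methods. We have: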

$$ X^\top X = V \Sigma^\top U^\top U \Sigma V^\top $$

Since $U^\top U = I$ (the identity matrix, because $U$ is orthogonal), this simplifies to:

$$ X^\top X = V \Sigma^\top \Sigma V^\top = V \Sigma^2 V^\top $$

This is an important result because it shows that $X^\top X$ is diagonalized by $V$. Here $\Sigma^2$ is shorthand for $\Sigma^\top \Sigma$, a diagonal matrix whose entries are the squared singular values, which are also the eigenvalues of $X^\top X$.

#### Eigenvalue Decomposition of $X^\top X$

Right-multiplying the previous equation by $V$ and using $V^\top V = I$ gives:

$$ X^\top X V = V \Sigma^2 $$

This shows that the columns of $V$ are the eigenvectors of $X^\top X$, and the diagonal elements of $\Sigma^2$ are the eigenvalues.

#### $XX^\top$ Matrix

Similarly, for the matrix $XX^\top$, we have:

$$ XX^\top = U \Sigma V^\top V \Sigma^\top U^\top $$

Again, since $V^\top V = I$, this simplifies to:

$$ XX^\top = U \Sigma^2 U^\top $$

This shows that the matrix $XX^\top$ is diagonalized by $U$, with eigenvalues given by the diagonal elements of $\Sigma^2$.

#### Summary

- $X^\top X = V \Sigma^2 V^\top$ shows that $V$ contains the eigenvectors of $X^\top X$, and $\Sigma^2$ contains the eigenvalues.
- $XX^\top = U \Sigma^2 U^\top$ shows that $U$ contains the eigenvectors of $XX^\top$, and $\Sigma^2$ contains the eigenvalues.

In both cases, the diagonal of $\Sigma^2$ holds the squared singular values. The nonzero ones are exactly the nonzero eigenvalues shared by $X^\top X$ and $XX^\top$; the larger of the two matrices simply has additional zero eigenvalues.
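The identities in this section can be checked numerically. The sketch below uses NumPy on an arbitrary random matrix (the shape and seed are assumptions chosen for illustration) and compares the squared singular values with the eigenvalues of $X^\top X$, then verifies both factorizations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))  # arbitrary example matrix (assumption)

# Thin SVD: U is 6x3, s holds the singular values, Vt is V^T (3x3).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Eigenvalues of the symmetric matrix X^T X, reversed into descending
# order to match the ordering of the singular values.
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]

# X^T X V = V Sigma^2
lhs = X.T @ X @ V
rhs = V @ np.diag(s**2)

# X X^T = U Sigma^2 U^T (the thin factors are enough here)
xxt = U @ np.diag(s**2) @ U.T

print(np.allclose(eigvals, s**2))
print(np.allclose(lhs, rhs))
print(np.allclose(X @ X.T, xxt))
```

Note that `np.linalg.svd` returns $V^\top$, not $V$, which is why the code transposes `Vt` before using it.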


# Ridge & Lasso regression

### Ridge Regression
Ridge regression adds a regularization term to the ordinary least squares (OLS) loss function. The objective function is:

$$
\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}
$$

where:
- $y_i$ are the observed responses,
- $\mathbf{x}_i$ are the feature vectors,
- $\beta$ is the vector of coefficients,
- $\lambda \geq 0$ is the regularization parameter controlling the strength of the penalty.
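Setting the gradient of the ridge objective to zero gives the standard closed form $\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$, which is not stated in these notes but follows directly from the objective. The sketch below implements it with NumPy; the function name `ridge_fit`, the simulated data, and the absence of an intercept are all assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - X b||^2 + lam * ||b||^2 via the normal equations."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data (assumption): 50 observations, 4 features, small noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
beta_true = np.array([2.0, -1.0, 0.5, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=50)

beta_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 recovers OLS
beta_ridge = ridge_fit(X, y, lam=10.0)  # larger lam shrinks the coefficients

print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

The ridge coefficient vector always has a smaller norm than the OLS one, but its entries are generally not exactly zero.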

### Lasso Regression
Lasso regression uses an L1 regularization term, which encourages sparsity in the coefficients. The objective function is:

$$
\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}
$$

where the terms are the same as in the ridge objective, but the penalty uses the **L1 norm** ($|\beta_j|$) instead of the squared L2 norm.


## Remarks

**Lasso regression** can remove useless variables from the model entirely, because the L1 penalty can shrink their coefficients **exactly to zero**.

**Ridge regression** shrinks coefficients toward zero, but they only reach zero asymptotically (as $\lambda \to \infty$).

So lasso is usually better than ridge at **reducing variance** in models with **many useless variables**.

Ridge regression tends to do better when all variables are useful.
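The difference is easiest to see in the special case of an orthonormal design, $X^\top X = I$ (an assumption made only for this illustration). There both objectives separate per coordinate: writing $z = X^\top \mathbf{y}$ for the OLS coefficients, ridge minimizes $(z_j - \beta_j)^2 + \lambda \beta_j^2$ and lasso minimizes $(z_j - \beta_j)^2 + \lambda |\beta_j|$. The sketch below codes the two resulting closed forms; the function names and the values in `z` are made up.

```python
import numpy as np

def ridge_orthonormal(z, lam):
    # Minimizer of (z - b)^2 + lam * b^2: proportional shrinkage.
    return z / (1.0 + lam)

def lasso_orthonormal(z, lam):
    # Minimizer of (z - b)^2 + lam * |b|: soft-thresholding at lam / 2.
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

z = np.array([3.0, -0.4, 0.2, -2.0])  # hypothetical OLS coefficients
lam = 1.0

beta_ridge = ridge_orthonormal(z, lam)
beta_lasso = lasso_orthonormal(z, lam)

print(beta_ridge)
print(beta_lasso)
```

With these values, ridge halves every coefficient and none becomes zero, while lasso sets the two small coefficients (those with $|z_j| \leq \lambda / 2$) exactly to zero and shifts the large ones toward zero by $\lambda / 2$.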

# Gauss-Markov Theorem

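Let $\hat{\beta}$ be the OLS estimator in the linear model $\mathbf{y} = X\beta + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathbb{V}(\varepsilon) = \sigma^2 I$. For any linear estimator $\hat{\theta} = c^\top \mathbf{y}$ that is unbiased for $a^\top \beta$, we have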

$$
\mathbb{V}(a^\top \hat{\beta}) \leq \mathbb{V}(c^\top \mathbf{y}).
$$
Cf. the exercises for weeks 4-5 for the proof.
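The inequality can be illustrated numerically under the usual Gauss-Markov assumptions, $\mathbb{E}[\mathbf{y}] = X\beta$ and $\mathbb{V}(\mathbf{y}) = \sigma^2 I$ (taken as assumptions here). A linear estimator $c^\top \mathbf{y}$ is unbiased for $a^\top \beta$ exactly when $X^\top c = a$, and its variance is $\sigma^2 \|c\|^2$. The sketch below builds the OLS weights and one competing unbiased estimator; the data, the vector `a`, and the way the competitor is constructed are illustrative choices, not the proof from the exercises.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
X = rng.normal(size=(n, p))      # arbitrary full-rank design (assumption)
a = np.array([1.0, -2.0, 0.5])   # arbitrary target combination (assumption)

# OLS: a^T beta_hat = c_ols^T y with c_ols = X (X^T X)^{-1} a.
c_ols = X @ np.linalg.solve(X.T @ X, a)

# Competitor: add a vector d with X^T d = 0, obtained by projecting a
# random vector onto the orthogonal complement of the column space of X.
w = rng.normal(size=n)
d = w - X @ np.linalg.solve(X.T @ X, X.T @ w)
c_alt = c_ols + d

# Var(c^T y) = sigma^2 * ||c||^2 when Var(y) = sigma^2 * I.
sigma2 = 1.0
var_ols = sigma2 * (c_ols @ c_ols)
var_alt = sigma2 * (c_alt @ c_alt)

print(np.allclose(X.T @ c_alt, a))  # the competitor is still unbiased
print(var_ols, var_alt)
```

Because `c_ols` is orthogonal to `d`, the competitor's variance exceeds the OLS variance by exactly $\sigma^2 \|d\|^2$.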


### Maximum A Posteriori (MAP) Estimator

The Maximum A Posteriori (MAP) estimator is given by:

$$ \hat{\theta}_{MAP} = \underset{\theta}{\mathrm{arg\,max}} \, P(\theta | \mathbf{x}) $$

Using Bayes' theorem, this can be rewritten as:

$$ \hat{\theta}_{MAP} = \underset{\theta}{\mathrm{arg\,max}} \, \left[ P(\mathbf{x} | \theta) P(\theta) \right] $$

where:

- $ P(\theta | \mathbf{x}) $ is the posterior probability of the parameter $ \theta $ given the data $ \mathbf{x} $,
- $ P(\mathbf{x} | \theta) $ is the likelihood of the data given $ \theta $,
- $ P(\theta) $ is the prior distribution of $ \theta $.

### Maximum Likelihood Estimator (MLE)

The Maximum Likelihood Estimator (MLE) is given by:

$$ \hat{\theta}_{MLE} = \underset{\theta}{\mathrm{arg\,max}} \, P(\mathbf{x} | \theta) $$

In other words, MLE estimates the parameter $ \theta $ that maximizes the likelihood of the observed data $ \mathbf{x} $.

### Relationship between MAP and MLE

If the prior $ P(\theta) $ is uniform (i.e., it does not depend on $ \theta $), then the MAP estimator simplifies to the MLE:

$$ \hat{\theta}_{MAP} = \hat{\theta}_{MLE} $$

#### Remarks


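- MLE **often overfits**, especially when the sample is small relative to the number of parameters.
- MAP **reduces overfitting**: the prior acts as a regularization (shrinkage) term.
- As $n \rightarrow \infty$, the likelihood dominates the prior and MAP tends to MLE.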
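A concrete case makes these remarks visible. Assume, purely for illustration, that $x_i \sim \mathcal{N}(\theta, \sigma^2)$ with known $\sigma^2$ and a Gaussian prior $\theta \sim \mathcal{N}(\mu_0, \tau^2)$. Then the MLE is the sample mean, and the MAP estimate is the precision-weighted average $\left( \sum_i x_i / \sigma^2 + \mu_0 / \tau^2 \right) / \left( n / \sigma^2 + 1 / \tau^2 \right)$. The function names and parameter values below are hypothetical.

```python
import numpy as np

def mle_mean(x):
    # MLE of a Gaussian mean: the sample mean.
    return np.mean(x)

def map_mean(x, sigma2, mu0, tau2):
    # MAP of a Gaussian mean under a N(mu0, tau2) prior:
    # precision-weighted average of the data and the prior mean.
    n = len(x)
    precision = n / sigma2 + 1.0 / tau2
    return (np.sum(x) / sigma2 + mu0 / tau2) / precision

rng = np.random.default_rng(3)
theta_true, sigma2 = 2.0, 1.0  # simulation settings (assumptions)
mu0, tau2 = 0.0, 0.5           # prior centered at 0 (assumption)

for n in (5, 50, 5000):
    x = rng.normal(theta_true, np.sqrt(sigma2), size=n)
    print(n, mle_mean(x), map_mean(x, sigma2, mu0, tau2))
```

For small `n` the MAP estimate is pulled toward the prior mean `mu0` (shrinkage); as `n` grows, the two estimates become nearly identical. Letting `tau2` grow very large approximates a flat prior, in which case MAP coincides with MLE.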



Let $ \mathbf{x} $ be a random vector and $ \mathbf{M} $ a fixed (non-random) matrix. Because $ \mathbf{M} \mathbf{x} - \mathbb{E}[\mathbf{M} \mathbf{x}] = \mathbf{M} (\mathbf{x} - \mathbb{E}[\mathbf{x}]) $, the matrix $ \mathbf{M} $ factors out of the covariance on both sides:

$$ \text{Var}(\mathbf{M} \mathbf{x}) = \mathbf{M} \, \text{Var}(\mathbf{x}) \, \mathbf{M}^\top $$

Here, $ \text{Var}(\mathbf{x}) $ is the covariance matrix of the vector $ \mathbf{x} $, and $ \mathbf{M}^\top $ is the transpose of the matrix $ \mathbf{M} $.
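The same identity holds exactly for sample covariance matrices, which gives a quick numerical check. In the sketch below the samples, the matrix `M`, and the dimensions are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# 1000 draws of a correlated 3-dimensional vector x, one draw per row.
samples = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))
M = rng.normal(size=(2, 3))  # fixed (non-random) matrix

cov_x = np.cov(samples, rowvar=False)         # 3x3 covariance of x
cov_Mx = np.cov(samples @ M.T, rowvar=False)  # 2x2 covariance of M x

print(np.allclose(cov_Mx, M @ cov_x @ M.T))
```

Each row of `samples @ M.T` is $(\mathbf{M} \mathbf{x})^\top$ for one draw, so `cov_Mx` is the sample covariance of the transformed vectors.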


### Jensen's Inequality

Jensen's inequality states that, for a convex function $f$ and any random variable $X$ with finite expectation:

$$ f\left( \mathbb{E}[X] \right) \leq \mathbb{E}[ f(X) ] $$

where:
- $f$ is a convex function,
- $\mathbb{E}[X]$ is the expected value of the random variable $X$,
- $\mathbb{E}[ f(X) ]$ is the expected value of the function $f$ applied to $X$.

### Interpretation

Jensen's inequality says that the value of a convex function at the expected value of a random variable is less than or equal to the expected value of the function applied to the random variable.

If $f$ is **concave**, the inequality is reversed:

$$ f\left( \mathbb{E}[X] \right) \geq \mathbb{E}[ f(X) ] $$
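Both directions can be checked on an empirical distribution, where expectations are sample means. The sketch below uses $f(x) = x^2$ as the convex example and $f(x) = \log x$ as the concave one; the exponential samples are an arbitrary choice that keeps the values positive.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=2.0, size=10_000)  # positive samples (assumption)

# Convex case, f(x) = x^2: E[f(X)] - f(E[X]) >= 0.
convex_gap = np.mean(x**2) - np.mean(x) ** 2

# Concave case, f(x) = log(x): f(E[X]) - E[f(X)] >= 0.
concave_gap = np.log(np.mean(x)) - np.mean(np.log(x))

print(convex_gap, concave_gap)
```

For $f(x) = x^2$ the gap is exactly the variance of $X$, which is one way to see why it can never be negative.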
