# Chapter 16: Singular Value Decomposition (SVD)

- content: p. 471 - 502
- exercises: p. 503 - 520

Recommended supplementary videos:
- [Singular Value Decomposition (the SVD) - Intro](https://youtu.be/mBcLRGuAFUk) - Strang /  (2016)
- [6. Singular Value Decomposition (SVD) - Full Lecture](https://youtu.be/rYz83XPxiZo) - Strang / MIT (2019)
- [Computing the Singular Value Decomposition](https://youtu.be/cOUTpqlX-Xs) | MIT 18.06SC Linear Algebra (Fall 2011)
- [Full series of 43 videos on SVD and various applications (with Python/Matlab implementation)](https://www.youtube.com/watch?v=gXbThCXjZFM&list=PLMrJAkhIeNNSVjnsviglFoY2nXildDCcv) - Steve Brunton (2020)

## 16.1 Singular Value Decomposition

- Singular Value Decomposition (SVD) is closely related to eigen-decomposition.
- In fact, eigendecomposition can be seen as a special case of the SVD, with SVD being the generalized algorithm.
  - i.e. eigendecomposition works only on square matrices, SVD works on all matrices.

**Core idea of SVD:**
- provide a set of basis vectors called *singular vectors* for the 4 matrix subspaces (row space, null space, column space, left-null space).
- provide scalar *singular values* that encode the "importance" of each singular vector.
  - (Singular vectors are similar to eigenvectors, singular values are similar to eigenvalues.)

**Equation for Singular Value Decomposition (SVD):**
$$A = U \Sigma V^T$$

$A$ = The MxN matrix to be decomposed.  It can be square or rectangular, and any rank.

$U$ = The *left singular vectors matrix* (MxM), which provides an orthonormal basis for $\mathbb{R}^M$.  This includes the column space of $A$ and its complementary left-null space.
- The size of $U$ corresponds to the number of rows in $A$ (recall that counter-intuitively, the size of the column space = the number of rows, i.e. the count of total elements in each column).

$\Sigma$ = The *singular values matrix* (MxN), which is diagonal and contains the singular values (the ith singular value is indicated $\sigma_i$).  All singular values are non-negative (that is, positive or zero) and real-valued.
- The size of $\Sigma$ is the same as A.

$V$ = The *right singular vectors matrix* (NxN), which provides an orthonormal basis for $\mathbb{R}^N$.  That includes the row space of $A$ and its complementary null space.
- The size of $V$ corresponds to the number of columns in $A$ (recall that counter-intuitively, the size of the row space = the number of columns, i.e. the count of total elements in each row).
- Notice that the decomposition contains $V^T$; hence, although the right singular vectors are in the *columns* of $V$, it is usually more convenient to speak of the right singular vectors as being the *rows* of $V^T$.

**Sizes of SVD matrices**

<img src='img/16/SVD-sizes.jpg' alt='SVD sizes' width=500>
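
The sizes described above can be checked numerically. This is a minimal sketch using NumPy's `np.linalg.svd`; the 2x3 example matrix is an arbitrary choice for illustration.

```python
import numpy as np

# Arbitrary 2x3 example matrix (M=2, N=3), chosen only for illustration
A = np.array([[3.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]])
M, N = A.shape

# full_matrices=True (the default) returns the full MxM and NxN matrices
U, s, Vt = np.linalg.svd(A, full_matrices=True)

print(U.shape)   # (M, M)
print(s.shape)   # (min(M, N),): the singular values come back as a vector
print(Vt.shape)  # (N, N)

# U and V are orthogonal, so each transpose acts as the inverse
print(np.allclose(U.T @ U, np.eye(M)))
print(np.allclose(Vt @ Vt.T, np.eye(N)))
```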

## 16.2 Computing the SVD

- You may think that computing the SVD is very difficult, but the truth is that once you know eigendecomposition, the SVD is almost trivial to compute.
- Start by considering eigendecomposition of matrix $A$ of size $M \neq N$
  - eigendecomposition is not defined for a non-square matrix; however, $A^TA$ is eigendecomposable.
  - Replacing $A^TA$ with the SVD matrices gives us the following:

$$A^TA = (U \Sigma V^T)^T(U \Sigma V^T)$$
$$A^TA = V \Sigma^T U^TU \Sigma V^T$$
$U$ is orthogonal, ergo $U^TU=I$.  Also $\Sigma$ is diagonal, so $\Sigma^T \Sigma = \Sigma^2$
$$A^TA = V \Sigma^2 V^T$$

- you can immediately see why the singular values are non-negative--any real number squared will be non-negative.
- we're missing the U matrix, but we can obtain it via the eigendecomposition of matrix $AA^T$:

$$AA^T = (U \Sigma V^T)(U \Sigma V^T)^T$$
$$AA^T = U \Sigma V^T V \Sigma^T U^T$$
$V$ is orthogonal, ergo $V^TV=I$.  Also $\Sigma$ is diagonal, so $\Sigma \Sigma^T = \Sigma^2$
$$AA^T = U \Sigma^2 U^T$$

- So now we see that the way to compute the SVD of any rectangular matrix is to apply the following steps...

### Steps to compute SVD:

1. Compute the eigendecomposition of $A^TA$ to get $V$ (and $\Sigma$).
2. Compute the eigendecomposition of $AA^T$ to obtain $U$ (and $\Sigma$).

- Note that it's actually not necessary to complete both steps to obtain the SVD.
- After completing one step, we can compute the missing matrix by using one of the following formulas:
$$A V \Sigma^{-1} = U$$
$$\Sigma^{-1}U^TA = V^T$$

- quick aside: how do we know that $U$ and $V$ are orthogonal matrices?
  - because they come from the eigendecomposition of a symmetric matrix.  Look back at ch. 15 eigendecomposition of symmetric matrices for more details.

- When first computing the SVD by hand (which the author recommends doing at least a few times to solidify the concept), you should first decide whether to apply step 1 and then solve for U, or apply step 2 and then solve for V.
- The best strategy depends on the size of the matrix, because you want to compute the eigendecomposition of whichever of $A^TA$ or $AA^T$ is smaller.
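
The procedure can be sketched in NumPy as follows. This is an illustration under stated assumptions, not how library SVD routines are implemented: it applies step 1 (eigendecomposition of $A^TA$), keeps only the non-zero singular values so that $\Sigma^{-1}$ exists, and then solves for the matching columns of $U$. The example matrix and the variable names (`V_r`, `U_r`) are chosen here for illustration.

```python
import numpy as np

A = np.array([[3.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]])   # 2x3 example, rank 2

# Step 1: eigendecomposition of the symmetric matrix A^T A gives V and sigma^2
evals, V = np.linalg.eigh(A.T @ A)   # eigh returns eigenvalues in ascending order

# Re-sort in descending order, matching the usual SVD convention
order = np.argsort(evals)[::-1]
evals, V = evals[order], V[:, order]

# Keep only the r non-zero singular values (r = rank), so they can be inverted
r = np.linalg.matrix_rank(A)
sigma = np.sqrt(evals[:r])
V_r = V[:, :r]

# Solve for the corresponding left singular vectors: U = A V Sigma^{-1}
U_r = (A @ V_r) / sigma   # broadcasting divides column i by sigma[i]

# The pieces reconstruct A and agree with NumPy's singular values
A_rebuilt = U_r @ np.diag(sigma) @ V_r.T
print(np.allclose(A, A_rebuilt))
print(np.allclose(sigma, np.linalg.svd(A, compute_uv=False)))
```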

### Normalizing singular vectors

- Singular vectors, like eigenvectors, are important because of their direction, so it may seem unnecessary to normalize them.
- But since the singular values are scaling the singular vectors, the vectors must be normalized or else $A \neq U \Sigma V^T$.
- Therefore, **all singular vectors must be normalized to unit vectors**.

- Also the signs of the singular vectors are not arbitrary; the singular values are all non-negative, so flipping signs of singular vectors may be necessary in reconstructing the matrix.
- This is noticeably different from diagonalization via eigendecomposition, where the eigenvectors are sign and magnitude invariant.
  - the key to understanding this difference is that the eigenvalues matrix is flanked on both sides by the eigenvectors matrix and its inverse ($V \Lambda V^{-1}$); any non-unit-magnitudes in $V$ can be absorbed into $V^{-1}$.
  - But in the SVD, $U$ and $V^T$ are not inverses of each other (indeed, they may even have different dimensionalities), and thus the magnitude of singular vectors is not cancelled.
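
A small numerical check of this difference, using arbitrary example matrices: rescaling the singular vectors breaks the SVD reconstruction, whereas rescaling the eigenvectors is absorbed by the inverse.

```python
import numpy as np

A = np.array([[3.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]])
U, s, Vt = np.linalg.svd(A)

# Build the MxN Sigma matrix from the vector of singular values
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)

# Unit-length singular vectors reconstruct A
print(np.allclose(U @ S @ Vt, A))          # True

# Doubling the left singular vectors breaks the reconstruction (gives 2A)
print(np.allclose((2 * U) @ S @ Vt, A))    # False

# Eigendecomposition: rescaled eigenvectors are cancelled by the inverse
B = np.array([[2.0, 1.0],
              [1.0, 3.0]])
evals, W = np.linalg.eig(B)
W_scaled = 2 * W
print(np.allclose(W_scaled @ np.diag(evals) @ np.linalg.inv(W_scaled), B))  # True
```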

## 16.3 Singular values and eigenvalues

- The previous section seemed to imply the trivial relationship that the eigenvalues of $A^TA$ equal the squared singular values of $A$.
- That is true, but there is a more nuanced relationship between the eigenvalues and the singular values of a matrix.
  - This relationship is organized into three cases:

### Case 1: eig $(A^TA)$ vs. svd $(A)$

- The eigenvalues equal the squared singular values, for the reasons explained in the previous section

Example:
$$
A = 
\begin{bmatrix}
3 & 1 & 0 \\
1 & 1 & 0
\end{bmatrix}
$$
$$
A^TA = 
\begin{bmatrix}
10 & 4 & 0 \\
4 & 2 & 0 \\
0 & 0 & 0
\end{bmatrix}
$$
$$\lambda(A^TA) = 0, .3431, 11.6569$$
$$\sigma(A) = .5858, 3.4142$$
$$\sigma^2(A) = .3431, 11.6569$$

- why are there three $\lambda$'s but only two $\sigma$'s?
  - It's because $A^TA$ is 3x3 but $\Sigma$ has the same size as the matrix $A$, which is 2x3; hence, the diagonal has only two elements.  But the non-zero $\lambda$'s equal the squared $\sigma$'s.
- This case concerns the eigenvalues of $A^TA$, not the eigenvalues of $A$.
  - In fact, there are no eigenvalues of $A$ because it is not a square matrix.
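
The example above can be reproduced with a short NumPy sketch. Note that `eigvalsh` returns eigenvalues in ascending order while `svd` returns singular values in descending order, and that the zero eigenvalue may show up as a tiny rounding-level number.

```python
import numpy as np

A = np.array([[3.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]])

evals = np.linalg.eigvalsh(A.T @ A)          # 3 eigenvalues, ascending
sigmas = np.linalg.svd(A, compute_uv=False)  # 2 singular values, descending

print(np.round(evals, 4))        # eigenvalues of A^T A
print(np.round(sigmas, 4))       # singular values of A
print(np.round(sigmas**2, 4))    # squared singular values of A
```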

### Case 2: eig $(A^TA)$ vs. svd $(A^TA)$

- In this case, the eigenvalues and singular values are identical--without squaring the singular values.
- This is because eigendecomposition and SVD are the same operation for a square symmetric matrix (more on this point later)
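
A quick check of this case, reusing the 2x3 matrix from case 1 as an assumed example:

```python
import numpy as np

A = np.array([[3.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]])
C = A.T @ A   # square and symmetric

evals = np.linalg.eigvalsh(C)[::-1]          # reversed to descending order
sigmas = np.linalg.svd(C, compute_uv=False)  # already descending

# Identical up to rounding error, with no squaring needed
print(np.allclose(evals, sigmas))
```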

### Case 3a: eig $(A)$ vs. svd $(A)$ for real-valued $\lambda$

- This is different from case 2 because here we assume that $A$ is not symmetric, which means that the eigenvalues can be real-valued or complex-valued, depending on the elements in the matrix.
- We start by considering the case of a matrix with all real-valued eigenvalues.
  - of course, the matrix needs to be square for it to have eigenvalues, so let's add another row to the previous example:

$$
A = 
\begin{bmatrix}
3 & 1 & 0 \\
1 & 1 & 0 \\
1 & 1 & 1
\end{bmatrix}
$$
$$\lambda(A) = .5858, 1, 3.4142$$
$$\sigma(A) = .4938, 1.1004, 3.6804$$

- There is no easy to spot relationship between the eigenvalues and the singular values.  In fact, there isn't really a relationship at all.
- Of course there is a macro relationship that $W \Lambda W^{-1} = U \Sigma V^T$ (i.e. the entire eigendecomposition equals the entire SVD since they are both equal to A)
  - But the macro relationship does not mean that there is any relationship between $\Lambda$ and $\Sigma$
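
The two sets of values for this 3x3 example can be computed side by side with NumPy. This sketch only prints them for comparison; it does not assume any relationship between them.

```python
import numpy as np

A = np.array([[3.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])

evals = np.linalg.eigvals(A)                  # general (non-symmetric) eigenvalues
sigmas = np.linalg.svd(A, compute_uv=False)   # singular values, descending

print(np.round(np.sort(evals.real), 4))   # eigenvalues, sorted ascending
print(np.round(np.sort(sigmas), 4))       # singular values, sorted ascending
```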

### Case 3b: eig $(A)$ vs. svd $(A)$ for complex-valued $\lambda$

- The lack of an obvious relationship between eigenvalues and singular values is even more apparent when a matrix has complex-valued eigenvalues.
- We know that a real-valued matrix can have complex-valued eigenvalues, but what do we know about singular values?
- **All matrices--real or complex--are guaranteed to have real-valued singular values.**
- Why?  Because the SVD can be obtained from the eigendecomposition of the matrix times its transpose, and that matrix is always symmetric.
  - *(The SVD of complex matrices uses the Hermitian transpose instead of the regular transpose)*
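
As an illustration, a 90-degree rotation matrix (an example chosen here, not taken from the text) is real-valued yet has complex eigenvalues, while its singular values are real:

```python
import numpy as np

# 90-degree rotation in the plane: real entries, complex eigenvalues
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])

evals = np.linalg.eigvals(R)
sigmas = np.linalg.svd(R, compute_uv=False)

print(evals)    # complex-valued (+1j and -1j, in some order)
print(sigmas)   # real-valued
```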

### Discussion of cases

- Cases 2 and 3 may initially seem contradictory.
- $A^TA$ is simply a matrix so if we set $C=A^TA$, then we've written that $\lambda(C)=\sigma(C)$ but $\lambda(A) \neq \sigma(A)$.
- care to guess why?
  - The difference is that $C$ is defined as a matrix times its transpose, whereas $A$ is not.  And a matrix times its transpose has definite properties that a normal matrix may not have (i.e. square, symmetric, etc)

### Matrix "energy"

- Eigendecomposition and SVD are exact decompositions, which means that all the "energy" contained in matrix $A$ must be contained inside the three eigendecomposition / SVD matrices.
  - For SVD, matrices $U$ and $V$ are orthogonal and have a matrix norm of 1, which means that **all the "energy" is contained in matrix $\Sigma$**.
  - Eigendecomposition, on the other hand, has an orthogonal eigenvectors matrix only when the matrix is symmetric.  When it is non-symmetric, then the "total energy" in the matrix can be distributed over the eigenvectors and eigenvalues.

- In conclusion, there is a clear relationship between the eigenvalues of $A^TA$ and the singular values of $A$ (or the singular values of $A^TA$), but there is no relationship between the eigenvalues of non-symmetric $A$ and the singular values of $A$.

### Code
The SVD is very easy to compute in Python.
- *note that Python returns $V^T$ whereas MATLAB returns $V$.*
- Python also returns the singular values in a vector instead of in a diagonal matrix

In [2]:
import numpy as np
a = [[1, 1, 0], [0, 1, 1]]
A = np.array(a)
U, s, V = np.linalg.svd(A)    # note: the third output is V^T (its rows are the right singular vectors)
print("U = {}\n".format(U))
print("s = {}\n".format(s))   # note that s is a vector and not a diagonal matrix
print("V = {}".format(V))

U = [[-0.70710678 -0.70710678]
 [-0.70710678  0.70710678]]

s = [1.73205081 1.        ]

V = [[-4.08248290e-01 -8.16496581e-01 -4.08248290e-01]
 [-7.07106781e-01  2.13278616e-16  7.07106781e-01]
 [ 5.77350269e-01 -5.77350269e-01  5.77350269e-01]]


In [3]:
S = np.diag(s)  # convert s --> S as a diagonal matrix
print(S)

[[1.73205081 0.        ]
 [0.         1.        ]]
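
Note that `np.diag(s)` is 2x2, whereas $\Sigma$ must have the same size as $A$ (2x3) for the product $U \Sigma V^T$ to be defined. A minimal sketch of one way to build the full-size $\Sigma$ and reconstruct $A$ (the padding approach is one option among several):

```python
import numpy as np

A = np.array([[1, 1, 0], [0, 1, 1]])
U, s, Vt = np.linalg.svd(A)

# Sigma has the same shape as A: singular values on the diagonal, zeros elsewhere
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)

A_rebuilt = U @ S @ Vt
print(S.shape)                     # (2, 3)
print(np.allclose(A, A_rebuilt))   # True
```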


## 16.4 SVD of a symmetric matrix

- Simply put: the left and right singular vectors of a symmetric matrix are the same.

**SVD of a symmetric matrix**
$$A = U \Sigma U^T, \;\;\;\;\; \text{if } A = A^T$$

- proving this simply involves writing out the SVD and its transpose:
$$A = U \Sigma V^T$$
$$A^T = (U \Sigma V^T)^T = V \Sigma U^T$$
- Because $A=A^T$, these two equations must be equal:
$$U \Sigma V^T = V \Sigma U^T$$
- This equality is satisfied when $U = V$, which gives $A = U \Sigma U^T$.
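
A numerical check, assuming a symmetric matrix whose eigenvalues are all positive (chosen here for illustration). For a symmetric matrix with negative eigenvalues, the corresponding columns of $U$ and $V$ differ by a sign, because singular values cannot be negative.

```python
import numpy as np

# Symmetric example with all-positive, distinct eigenvalues
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A)

# Left and right singular vectors coincide
print(np.allclose(U, Vt.T))

# And the singular values equal the eigenvalues (sorted descending)
print(np.allclose(s, np.linalg.eigvalsh(A)[::-1]))
```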

## 16.5 SVD and the four subspaces

- One of the remarkable features of the SVD is that it provides orthogonal basis sets for each of the four matrix subspaces, which is one of the main reasons the SVD is such a powerful and useful decomposition.
- But let's start by talking more about $\Sigma$--the matrix that contains the singular values on the diagonal and zeros everywhere else.

Example of the $\Sigma$ matrix for a 5x3, rank 2 matrix:
$$
\begin{bmatrix}
\sigma_1 & 0 & 0 \\
0 & \sigma_2 & 0 \\
0 & 0 & 0 \\
0 & 0 & 0 \\
0 & 0 & 0 \\
\end{bmatrix}
$$
- by construction, $\sigma_1 \geq \sigma_2$. (i.e. there is at least as much "energy" in $\sigma_1$)
- Indeed, SVD algorithms always sort the singular values descending from top-left to lower-right.
- $\sigma_3$ is of course 0 because this is a rank 2 matrix.
  - we'll learn below that any zero-valued singular values correspond to the null space of the matrix.
  - this means that the number of non-zero values in $\Sigma$ is equal to the rank.
  - in fact, this is how software programs like MATLAB and Python compute the rank: take the SVD and count the number of non-zero singular values.
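
A sketch of this idea for a 5x3, rank-2 matrix built for illustration (the third column is the sum of the first two). The cutoff `1e-10` is an arbitrary assumption used to separate rounding-level values from genuine singular values.

```python
import numpy as np

c1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
c2 = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
A = np.column_stack([c1, c2, c1 + c2])   # 5x3, rank 2 by construction

s = np.linalg.svd(A, compute_uv=False)
print(s)   # sorted descending; the last value is zero up to rounding error

# Count the singular values that are meaningfully non-zero
rank_from_svd = int(np.sum(s > 1e-10))
print(rank_from_svd, np.linalg.matrix_rank(A))
```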

### "Big picture" of SVD breakdown

**Visualization of the "big picture" of SVD breakdown**

<img src="img/16/SVD-big-picture.jpg" alt="SVD big picture" width=500>

There's a lot going on in the above figure so let's break it down:
- The overall picture is the visualization of the original equation $A = U \Sigma V^T$, for a quick review:
  - matrix $A$ is decomposed into three matrices:
    - $U$ provides an orthogonal basis for $\mathbb{R}^M$ and contains the left singular vectors.
    - $\Sigma$ is the diagonal matrix of singular values (all non-negative, all real-valued).
    - $V$ provides an orthogonal basis for $\mathbb{R}^N$ and contains the right singular vectors.  Recall that $V$ is transposed, so we are talking about the *rows* of $V^T$.
- This figure also shows how the columns of $U$ are organized into basis vectors for the column space (light gray) and left-null space (darker gray); and how the rows of $V^T$ are organized into basis vectors for the row space (light gray) and null space (darker gray).
- In particular, the first $r$ columns in $U$, and the first $r$ rows in $V^T$, are the bases for the column and row spaces of $A$.
- The columns and rows after $r$ get multiplied by the zero-valued singular values, and thus form bases for the null spaces.
- The singular vectors for the column and row spaces are sorted according to their "importance" / "energy" contribution to matrix $A$, as indicated by the relative magnitude of the corresponding singular values.

You can see that the SVD reveals a lot of important info about the matrix:
- The rank of the matrix $(r)$ is the number of non-zero singular values.
- The dimensionality of the left-null space is the number of columns in $U$ from $r+1$ to $M$.
- The dimensionality of the null space is the number of rows in $V^T$ from $r+1$ to $N$.
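
The bases for the four subspaces can be read off the SVD as in the following sketch. The 5x3, rank-2 matrix and the `1e-10` cutoff are assumptions made for illustration.

```python
import numpy as np

c1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
c2 = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
A = np.column_stack([c1, c2, c1 + c2])   # 5x3 (M=5, N=3), rank 2
M, N = A.shape

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))               # rank = number of non-zero singular values

col_space  = U[:, :r]      # first r columns of U
left_null  = U[:, r:]      # columns r+1..M of U
row_space  = Vt[:r, :].T   # first r rows of V^T (stored here as columns)
null_space = Vt[r:, :].T   # rows r+1..N of V^T (stored here as columns)

print(left_null.shape[1], M - r)     # dimensionality of the left-null space
print(null_space.shape[1], N - r)    # dimensionality of the null space

# Null-space vectors are sent to zero by A; left-null vectors by A^T
print(np.allclose(A @ null_space, 0))
print(np.allclose(left_null.T @ A, 0))
```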

### Reflection
There is a lot here to take in at first glance.  Don't expect to understand everything about the SVD just by staring at the figure discussed previously.  You'll gain more familiarity and intuition about the SVD by working with it, which is the goal of the rest of the chapter!

## 16.6 SVD and matrix rank

- It may seem strange that the number of non-zero eigenvalues doesn't necessarily equal the rank, as it does with SVD.  Reasons why will be explained below.
- One key difference between eigendecomposition and SVD is that for SVD, the two singular vector matrices span the entire ambient spaces ($\mathbb{R}^M$ and $\mathbb{R}^N$), which is not necessarily the case with eigenvectors.
- Given that matrix rank is the dimensionality of the column space, it is sensible that the rank of the matrix corresponds to the number of columns in $U$ that provide a basis for the column space (we could say the same thing about the number of rows in $V^T$ that provide a basis for the row space)
- Thus, it is sufficient to demonstrate that each column in $U$ that is in the column space of $A$ has a non-zero singular value, and that each column in $U$ that is in the left-null space of $A$ has a singular value of zero.
  - (again, the same is true of $V^T$, the row space, and the null space)

- Let's start by rewriting the SVD using one pair of singular vectors and their corresponding singular value.  (this is analogous to the single-vector eigenvalue equation)

$$Av = u\sigma$$

- Now let's refresh on the definition of column space and left-null space:
  - The column space comprises all vectors that can be expressed by some combination of the columns of $A$
      - $C(A): Ax = b$
  - The left-null space comprises all non-trivial combinations of the rows of $A$ that produce the zeros vector.
    - $N(A^T): A^Ty = 0$
- We can now think of the equation $Av = u\sigma$ in this context: all singular vectors are non-zero, so the right-hand side of the equation is non-zero whenever $\sigma$ is non-zero.
  - The only possible way for the right-hand side of the equation to be the zeros vector--and thus for $u$ to be in the left-null space of $A$--is for $\sigma$ to equal zero.
- Thus, any $u$ with a corresponding non-zero $\sigma$ is in the column space of the matrix, and any $u$ with a zero-valued $\sigma$ is in the left-null space.
- You can make the same argument for the row space, by starting from the equation $u^TA = \sigma v^T$.

- There's another way to explain why the rank of $A$ corresponds to the number of non-zero singular values.
  - This comes from the rule about the rank of a product of matrix multiplications.
- $A$ is the product of the three SVD matrices, therefore the rank of $A$ is constrained by the ranks of those matrices.
  - $U$ and $V$ are by definition full-rank.
  - $\Sigma$ is of size M x N but could have a rank smaller than M or N.
  - Thus, the maximum possible rank of $A$ is the rank of $\Sigma$
  - the rank of $A$ could not be smaller than the rank of $\Sigma$, because the ranks of $A$ and $U \Sigma V^T$ are equal.
  - Therefore, the rank of $A$ must equal the smallest rank of the three matrices, which is always that of $\Sigma$.  Because $\Sigma$ is a diagonal matrix, its rank is the number of non-zero diagonal elements, which is the number of non-zero singular values.

**"Effective" rank**

- We've found multiple times that computers have difficulties with really small and really large numbers.
  - i.e. rounding errors, precision errors, underflow, overflow, etc.
- How does a computer decide whether a singular value is small but non-zero vs. zero with a rounding error?
  - essentially, the program uses a "tolerance" value to separate the noise from the values.
  - more info on p. 488
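
A sketch of a tolerance-based rank. The rule used here (largest singular value times the larger matrix dimension times machine epsilon) follows the default documented for `np.linalg.matrix_rank`; other software may use a different rule, so treat the formula as one reasonable choice rather than a universal definition.

```python
import numpy as np

# Rank-2 matrix: the third row equals 2*(second row) - (first row)
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

s = np.linalg.svd(A, compute_uv=False)
print(s)   # the smallest value is tiny, but typically not exactly zero

# Tolerance: a multiple of machine precision, scaled to the matrix
tol = s.max() * max(A.shape) * np.finfo(s.dtype).eps
effective_rank = int(np.sum(s > tol))

print(tol)
print(effective_rank, np.linalg.matrix_rank(A))
```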

## 16.7 SVD spectral theory

- There are several "spectral theories" in math, and they all involve the concept that a complicated thing--such as a matrix, an operation, or a transformation--can be represented by the sum of simpler things.
  - similar to how light can be decomposed into the colors of the rainbow
- There is a spectral theory of matrices, which is sometimes used as another term for eigendecomposition.
- This concept and terminology will be modified slightly to create the *SVD spectral theory*, which is that all matrices can be expressed as a sum of rank-1 matrices, and that the SVD provides a great decomposition to obtain these individual rank-1 matrices.

- recall the "layer perspective" of matrix multiplication (p. 146)
  - which involves constructing a product matrix as the sum of outer-product matrices created from the columns of a matrix on the left, and the rows of a matrix to the right
  - each of those outer-product matrices has a rank of 1, because each column (or row) is a scalar multiple of one column (or one row).
- With matrix multiplication via layers, the two vectors that multiply to create each "layer" of the product matrix are defined purely by their physical position in the matrix.

- The SVD provides an interesting way to construct a matrix by summing rank-1 layers that are computed using the columns of $U$ and the rows of $V^T$, scaled by their corresponding singular value.
- The mechanics are given by re-writing the SVD formula using a summation of vectors instead of matrices:

$$A = \sum^r_{i=1} u_i \sigma_i v^T_i$$

- where $r$ is the rank of the matrix (the singular values after $\sigma_r$ are zeros, and thus can be omitted from this equation)
- The summation equation may seem less concise than the matrix SVD equation, but it sets us up for the SVD spectral theory, and also leads to one of the most important applications of SVD: low rank approximations.

- Let's consider only the first iteration of the summation:
$$A_1 = u_1 \sigma_1 v^T_1$$
- we might call matrix $A_1$ the "first SVD layer of $A$"
  - it is a rank-1 matrix (same size as A) formed as the outer product between the first left singular vector and the first right singular vector.
  - because the two vectors are unit length, the 2-norm of their outer product is also 1.
  - But is the norm of $A_1$ also equal to 1?  No! Because the outer product gets scalar multiplied by the corresponding singular value.

- Because we always sort singular values in descending order, $A_1$ is the "most meaningful" SVD-layer of matrix $A$.
  - *"meaningful" can be interpreted as the amount of total variance in the matrix, or as the most important feature of the matrix.*
- $A_2$ is the next most meaningful, and so on down to $A_r$
- Thus, each corresponding left and right singular vector combine to produce a layer of the matrix.
  - This layer is like a direction in the matrix space, but that direction is simply a pointer of unit length, it doesn't convey "importance".
- The singular *value* indicates how important each direction is.  It is the weight / magnitude applied to the direction vectors.

**Illustration of the SVD Spectral Theory**

<img src="img/16/SVD-spectral-theory.jpg" alt="Illustration of SVD Spectral Theory" width=500>

- notice that layer 1 ("L1" in the above figure) captures the most prominent feature of the matrix $A$ (the horizontal band in the middle)
  - we will refer to this as the best rank-1 approximation of A.
- Though it may not seem obvious from the color scaling, each column of the L1 matrix is simply the left-most column of $U$ scalar multiplied by the corresponding elements of the first row of $V^T$, and also multiplied by $\Sigma_{1,1}$
  - same story for L2 and L3
- Columns 4 and 5 of $U$ do not contribute to reconstructing $A$ (since they are multiplied by rows 4 and 5 of $\Sigma$ which are zeros)
  - in terms of matrix spaces, columns 4 and 5 are in the left-null space of $A$
  - *question: then do columns 4 and 5 serve any purpose or are they arbitrary noise?*

- SVD layer 2 captures the second most prominent feature of matrix $A$, which is the vertical stripe in the lower right.
- Layer 3 has a relatively small singular value (10% of total variance in the matrix), and therefore accounts for relatively little information in the matrix.

## 16.8 SVD and low-rank approximations

- We now see that SVD layers with smaller singular values are less important for the matrix.
- This leads to the idea of low-rank approximations:
  - Create a matrix $\tilde{A}$ that is sort of like matrix $A$; it is a rank-k version of the original rank-r matrix created by adapting the full vector-sum formula slightly:

**Low rank approximation formula**

$$\tilde{A} = \sum^k_{i=1} u_i \sigma_i v^T_i, \;\;\;\;\; k < r$$
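
A minimal sketch of the formula in NumPy, using a seeded random 6x5 matrix and $k=2$ as assumed example values. Keeping the first $k$ columns of $U$, singular values, and rows of $V^T$ is equivalent to summing the first $k$ layers. The last check relies on a standard fact not stated above: the Frobenius-norm error equals the root-sum-of-squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)     # seeded so the example is repeatable
A = rng.standard_normal((6, 5))    # arbitrary full-rank example (r = 5)
k = 2                              # target rank, k < r

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation from the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(A_k.shape)                     # same size as A
print(np.linalg.matrix_rank(A_k))    # k

# Approximation error vs. the discarded singular values
err = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(err, np.sqrt(np.sum(s[k:]**2))))
```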

**Example of using SVD for low rank approximation of a matrix**

<img src="img/16/SVD-approximation.jpg" alt="Example of using SVD for low rank approximation of a matrix" width=500>

So how does one choose a value for $k$? (the low-rank approximation).  One of two ways:
1. a visual inspection of the "scree plot" (the plot of decreasing singular values - B in the figure above).  Find the point where the incremental gain seems to diminish and the curve flattens out.  It is somewhat subjective and requires judgment.
2. A less subjective method is to define a minimum threshold, e.g. >1% of the variance of the entire matrix.

Why would we want a low rank approximation?  There are many valuable applications that are used in technology we use every day (Google, image compression, etc)

1. **Noise reduction** - It's possible that the data features associated with large singular values represent signal, whereas data features associated with small singular values represent noise.  Thus, by filtering out small singular values, we can eliminate some of the noise (this happens automatically with the SVD calc in Python/Matlab anyways, but may need stronger filtering)

2. **Machine-learning classification** - Many machine-learning analyses involve identifying features or patterns in data that predict some outcome variable.  It may be the case that the most relevant data features are represented in the $k$ largest singular vectors, and thus the analyses can be done faster and more accurately by training on the "most important" singular vectors instead of on the original data matrix.

3. **Data Compression** - Let's imagine a large dataset contained in a 10,000 x 100,000 matrix.  At 64-bit floating-point precision, this matrix can take up to 8 GB of hard-disk space (assuming no compression).  Now let's imagine that this is a rank-100 matrix.  Storing the first 100 columns of $U$ and $V$, and the first 100 singular values (as a vector, not as a diagonal matrix), would take up only 0.09 GB (around 90 MB).  It would then be possible to load in these vectors and scalars and recreate the full-sized $\tilde{A}$ as needed.
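
The storage figures in the compression example can be verified with simple arithmetic, assuming 8 bytes per stored value (64-bit floats) and 1 GB = $10^9$ bytes:

```python
# Assumptions: 8 bytes per value (64-bit floats), 1 GB = 1e9 bytes
M, N, k = 10_000, 100_000, 100
bytes_per_value = 8

# Full matrix: M*N values
full_bytes = M * N * bytes_per_value

# Rank-k storage: k columns of U (M values each), k columns of V (N values each),
# plus k singular values
compressed_bytes = (M * k + N * k + k) * bytes_per_value

print(f"{full_bytes / 1e9:.2f} GB")         # 8.00 GB
print(f"{compressed_bytes / 1e9:.3f} GB")   # 0.088 GB
```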

## 16.9 Normalizing singular values

- Imagine you have 2 different matrices, and find that the largest singular value of matrix $A$ is 8 and for $B$ is 678.
  - How would you interpret that difference?  Are the numbers comparable?  And what does the number "8" even mean in this context?
- The answer is that those two $\sigma_{max}$ values are not comparable, unless you know that the matrices have numbers in exactly the same range.

- Before showing how to normalize the singular values, let's demonstrate that the singular values are scale-dependent, meaning that they change with the scale of the numbers in the matrix.
- Below is an example: the second matrix is the same as the first but multiplied by 10.  Notice that their singular values are the same except for a scaling by 10.

$$A = \begin{bmatrix} 8 & 4 & 10 \\ 4 & 5 & 6 \end{bmatrix}, \;\;\;\;\; \Sigma_A = \begin{bmatrix} 15.867 & 0 \\ 0 & 2.286 \end{bmatrix}$$
$$B = 10A = \begin{bmatrix} 80 & 40 & 100 \\ 40 & 50 & 60 \end{bmatrix}, \;\;\;\;\; \Sigma_B = \begin{bmatrix} 158.674 & 0 \\ 0 & 22.863 \end{bmatrix}$$

- So when a matrix is multiplied by a scalar, its singular values are multiplied by that same scalar.
  - remember that the vectors are all unit-normalized so they cannot contain any "energy", so everything goes into the singular values.
- The point is that it is difficult to compare singular values without context.  So what is the solution?
- **The solution is to normalize the singular values.**

- Because the singular vectors are all unit-length and are scaled by the singular values when reconstructing the matrix from its SVD layers, the sum over all singular values can be interpreted as the total variance or "energy" in the matrix.
- This sum over all singular values is formally called the *Schatten 1-norm* of the matrix.

**Equation for the Schatten p-norm**

*this is the generalized equation but we are using p=1*
$$||A||_p = \biggl( \sum^r_{i=1}\sigma^p_i \biggr)^{1/p}$$

- The next step is to scale each singular value to the percent of the Schatten 1-norm
$$\tilde{\sigma}_i = \frac{100\sigma_i}{||A||_1}$$
- This is a useful normalization because it allows for direct interpretation of each singular value, as well as direct comparison of singular values across different matrices.
- In the example earlier of matrix $B$ being $A$ scaled by 10, the two sets of *normalized* singular values would be equal to each other.
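
A short sketch of this normalization applied to the $A$ and $B = 10A$ example. The helper name `normalized_singular_values` is made up for this illustration.

```python
import numpy as np

def normalized_singular_values(X):
    """Singular values as a percentage of their sum (the Schatten 1-norm)."""
    s = np.linalg.svd(X, compute_uv=False)
    return 100 * s / np.sum(s)

A = np.array([[8.0, 4.0, 10.0],
              [4.0, 5.0, 6.0]])
B = 10 * A

# Raw singular values differ by a factor of 10 ...
print(np.linalg.svd(A, compute_uv=False))
print(np.linalg.svd(B, compute_uv=False))

# ... but the normalized values are the same, and each set sums to 100
pct_A = normalized_singular_values(A)
pct_B = normalized_singular_values(B)
print(pct_A, pct_B)
```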

- Going back to the problem at the start of the section, two matrices with the largest singular values of 8 and 678 are not comparable, but let's say the normalized largest singular values are 35% and 82%.
  - i.e. the largest singular value of $A$ contributes/explains 35% of the total variance in the matrix, and the largest of $B$ contributes/explains 82% of the total variance.
- We can interpret that to mean various things, e.g. that the first matrix is more complicated since the largest contributor only explains 35% of the variance.

- Now let's think back to the question of how many SVD-layers to use in a low rank approximation (i.e. how to select $k$)
- When the singular values are normalized, you can pick some variance threshold and retain all SVD-layers that contribute at least that much variance.
  - e.g. you might keep all SVD-layers with $\sigma$ > 1%, or perhaps 0.1% to retain more information.
- the choice of a threshold is somewhat subjective, and depends on the situation
  - i.e. how critical is accuracy? is the goal to minimize storage space? etc...
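
A sketch of threshold-based selection of $k$. The function name `choose_k` and the diagonal test matrix (whose singular values are simply its diagonal entries, summing to 100) are hypothetical and chosen so the percentages are easy to read.

```python
import numpy as np

def choose_k(X, threshold_pct=1.0):
    """Number of SVD layers whose normalized singular value exceeds the threshold."""
    s = np.linalg.svd(X, compute_uv=False)
    pct = 100 * s / np.sum(s)
    return int(np.sum(pct > threshold_pct))

# Diagonal example: singular values are 50, 30, 15, 4.5, 0.5 (sum = 100)
X = np.diag([50.0, 30.0, 15.0, 4.5, 0.5])

k_strict = choose_k(X, threshold_pct=1.0)   # drops the 0.5% layer
k_loose = choose_k(X, threshold_pct=0.1)    # keeps every layer
print(k_strict, k_loose)                    # 4 5
```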

## 16.10 Condition number of a matrix

- The condition number of a matrix is used to evaluate the "spread" of a matrix.
- It is defined as the ratio of the largest to the smallest singular values, and is often indicated using the Greek letter $\kappa$.

**Condition number of a matrix**

$$\kappa = \frac{\sigma_{max}}{\sigma_{min}}$$

- for example, the condition number of the identity matrix is 1, because all of its singular values are 1.
- the condition number of any singular matrix is undefined ("not a number"; NaN) because singular matrices have at least one zero-valued singular value, and would lead to dividing by zero.
- The condition number of all orthogonal matrices is the same.
  - Can you guess what it is and why?
  - to build suspense, the answer will be provided later.

- A matrix is called *well-conditioned* if it has a low condition number and *ill-conditioned* if it has a high condition number.
- But there is no absolute threshold for when a matrix can be labeled ill-conditioned.
  - In some cases, $\kappa$ > 10,000 is used as a threshold, but this can be application specific.
- Furthermore, singular matrices can contain a lot of information and be useful in applications, but have a condition number of NaN.

- In data analysis and statistics, the condition number is used to indicate the stability of a matrix, i.e. the sensitivity of the matrix to small perturbations.
- A high condition number means that the matrix is very sensitive, which could lead to unreliable results in some analyses, e.g. those that require the matrix inverse.
- But don't take the condition number too seriously: matrices can contain a lot of info or very little info regardless of their condition number.  The condition number should never be used to determine if a matrix is useful or not.

- And now for the answer to the orthogonal matrices question earlier:
  - the condition number of any orthogonal matrix is 1, because all the singular values of an orthogonal matrix are 1.
  - This is the case because an orthogonal matrix satisfies $Q^TQ=I$: the squared singular values of $Q$ are the eigenvalues of $Q^TQ=I$, and the eigenvalues of a diagonal matrix are its diagonal elements (here, all 1).
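
This can be confirmed numerically. The sketch below obtains an orthogonal matrix from the QR decomposition of a seeded random matrix, which is just one convenient way to generate an example.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # Q is orthogonal

s = np.linalg.svd(Q, compute_uv=False)
kappa = s.max() / s.min()

print(np.allclose(Q.T @ Q, np.eye(4)))   # orthogonality check
print(np.allclose(s, 1.0))               # all singular values are 1
print(np.isclose(kappa, 1.0))            # so the condition number is 1
```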

### Code
You can compute the condition number on your own based on the SVD, but Python and MATLAB also have built-in functions.

In [2]:
# Compute the condition number in Python
import numpy as np
A = np.random.randn(5,5)
s = np.linalg.svd(A)[1]
cond_num_manual = np.max(s) / np.min(s)
cond_num_function = np.linalg.cond(A)
print(cond_num_manual)
print(cond_num_function)

7.800157320712822
7.80015732071283


## 16.11 SVD and the matrix inverse

Let's consider the inverse of a matrix and its SVD. (assume $A$ is square full rank)
$$A^{-1} = (U \Sigma V^T)^{-1}$$
$$A^{-1} = V \Sigma^{-1} U^{-1}$$
$$A^{-1} = V \Sigma^{-1} U^T$$

- because $U$ and $V$ are orthogonal matrices, their inverses are trivial to compute (their transposes)
- and because $\Sigma$ is a diagonal matrix, its inverse is also trivial to compute (simply invert each diagonal element - see 12.2 on p. 333)
  - actually, $\Sigma^{-1}$ may not be trivial to compute in practice, because if some singular values are close to machine precision (rounding threshold), then tiny rounding errors or other numerical inaccuracies can make the inverse unstable.
  - This shows why the explicit inverse of an ill-conditioned matrix can be numerically unstable.
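
The derivation above can be sketched in NumPy as follows. This is only an illustration on a random 4x4 matrix (assumed to be full rank, which holds with probability 1); note that `np.linalg.svd` returns $V^T$ rather than $V$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))    # random square matrix, assumed full rank

U, s, Vt = np.linalg.svd(A)        # NumPy returns V^T, not V

# A^{-1} = V Sigma^{-1} U^T: transpose U and V, invert each singular value
A_inv_svd = Vt.T @ np.diag(1 / s) @ U.T

print(np.allclose(A_inv_svd, np.linalg.inv(A)))
print(np.allclose(A_inv_svd @ A, np.eye(4)))
```

If one of the values in `s` were close to machine precision, `1 / s` would blow up, which is the instability described in the bullets above.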

- we can use the above equation to prove that the inverse of a symmetric matrix is itself symmetric
- let's write out the SVD for a symmetric matrix and its inverse (remember that symmetric matrices have identical left and right singular vectors)
$$A^T = (V \Sigma V^T)^T = V \Sigma V^T$$
$$A^{-1} = (V \Sigma V^T)^{-1} = V \Sigma^{-1} V^T$$
- It's immediately clear that $A$, $A^T$ and $A^{-1}$ have the same singular vectors
- the singular values may differ, but the point is that $\Sigma$ (and therefore $\Sigma^{-1}$) is diagonal and thus symmetric, so $A^{-1}$ is symmetric as long as $A$ is symmetric.
- This may seem academic, but it is crucial for the pseudoinverse, coming up next
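
A small numerical illustration of the argument, under one extra assumption that is not in the text above: the symmetric matrix is built as $X^TX$, so it is also positive definite, which keeps the signs of the left and right singular vectors aligned.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))
A = X.T @ X                          # symmetric (and positive definite, by construction)

U, s, Vt = np.linalg.svd(A)
A_inv = Vt.T @ np.diag(1 / s) @ Vt   # V Sigma^{-1} V^T

print(np.allclose(U, Vt.T))                  # left and right singular vectors coincide
print(np.allclose(A_inv, A_inv.T))           # the inverse is symmetric
print(np.allclose(A_inv, np.linalg.inv(A)))  # and it matches the usual inverse
```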

## 16.12 The MP Pseudoinverse, part 2

- Back in section 12.8 we read about the pseudoinverse: it is an approximation to an inverse for a singular matrix.
- We will now revisit the algorithm in the context of the SVD:

**Pseudoinverse via SVD equation (Moore-Penrose pseudoinverse)**

$$A^\dagger = (U \Sigma V^T)^\dagger$$
$$A^\dagger = V \Sigma^\dagger U^T$$
$$\Sigma^\dagger_{i,i} = \begin{cases}
  \frac{1}{\sigma_i} & \text{if } \sigma_i \neq 0 \\
  0 & \text{if } \sigma_i = 0
\end{cases}$$

- notice that this will work for any matrix: square, rectangular, full-rank or singular.
- When the matrix is square and full-rank, then the Moore-Penrose pseudoinverse will equal the true inverse.

- computer programs that implement the MP pseudoinverse will threshold very small-but-nonzero singular values to avoid numerical instability issues.
- As described previously, the tolerance is some multiple of machine precision, and any singular values below that threshold are treated as indistinguishable from zero.
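
The SVD formula and the thresholding idea can be sketched as below. The helper `pinv_via_svd` and its `rcond=1e-10` tolerance are illustrative choices for this sketch, not what any particular library uses by default; the same `rcond` is passed to `np.linalg.pinv` so that both apply the same cutoff.

```python
import numpy as np

def pinv_via_svd(A, rcond=1e-10):
    """Moore-Penrose pseudoinverse via the SVD (illustrative helper)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    cutoff = rcond * s.max()       # assumed tolerance, relative to the largest singular value
    s_inv = np.zeros_like(s)
    keep = s > cutoff
    s_inv[keep] = 1 / s[keep]      # invert only the singular values treated as nonzero
    return Vt.T @ np.diag(s_inv) @ U.T

# a singular (rank-2) matrix: the third row is the sum of the first two
A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [5., 7., 9.]])

A_pinv = pinv_via_svd(A)
print(np.allclose(A_pinv, np.linalg.pinv(A, rcond=1e-10)))
print(np.allclose(A @ A_pinv @ A, A))   # one of the Moore-Penrose conditions
```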

**Illustration of two examples of MP pseudoinverse for singular matrix**

<img src="img/16/SVD-MP-pseudoinverse.jpg" alt="Illustration of two examples of MP pseudoinverse for singular matrix" width=500>

**One sided inverse**
- When the matrix is rectangular and either full column rank or full row rank, then the pseudoinverse will equal the left inverse or the right inverse, respectively.
- This makes computing the one-sided inverse computationally efficient, because it can be done without explicitly computing $(A^TA)^{-1}$.
- To understand the relationship between one sided inverse and pseudo inverse, let's work through the math of the left inverse:
  - see multi-step proof on p. 502
- The take-home message is that when you write out the SVD of a left inverse and simplify, you end up with exactly the same expression as the SVD-based inverse of the original matrix (replace $^{-1}$ with $^\dagger$ where appropriate).
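
To see this numerically, here is a minimal sketch for the left inverse of a tall matrix. The 5x3 random matrix is an arbitrary example and is assumed to have full column rank (true with probability 1).

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))           # tall matrix, assumed full column rank

left_inv = np.linalg.inv(A.T @ A) @ A.T   # explicit left inverse (A^T A)^{-1} A^T
A_pinv = np.linalg.pinv(A)                # SVD-based pseudoinverse

print(np.allclose(left_inv, A_pinv))        # the two agree
print(np.allclose(A_pinv @ A, np.eye(3)))   # inverse from the left
print(np.allclose(A @ A_pinv, np.eye(5)))   # but not from the right
```

The last check fails on purpose: $AA^\dagger$ is a rank-3 projection matrix, not the 5x5 identity.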

## 16.13 - 16.14 Code Challenges

1. In Chapter 13, you learned about "economy" QR decomposition, which can be useful for large tall matrices. There is a comparable "economy" version of the SVD. Your goal here is to figure out what that means. First, generate three random matrices: square, wide, and tall. Then run the full SVD to confirm that the sizes of the SVD matrices match your expectations (e.g., Figure 16.1). Finally, run the economy SVD on all three matrices and compare the sizes to the full SVD.

In [8]:
# the "economy" SVD cuts off the extraneous columns/rows of U for tall matrices, and V for wide matrices
# those columns/rows are multiplied by 0 values in Sigma so they are extraneous anyways
wide = np.random.randn(2, 5)
# tall = np.random.randn(5, 2)  # not used in this example. Same situation as wide except U is truncated instead of V
Uf, sf, Vf = np.linalg.svd(wide)
Ue, se, Ve = np.linalg.svd(wide, full_matrices=False)
print("U (full): \n{}".format(Uf))
print("s (full): \n{}".format(sf)) # s = the singular values (the diagonal of Sigma) in vector format
print("V (full): \n{}".format(Vf)) # NumPy returns V^T; notice how the full version is 5x5, even though the bottom 3 rows will be multiplied by zeros in Sigma

U (full): 
[[-0.70729733 -0.70691618]
 [-0.70691618  0.70729733]]
s (full): 
(a vector of 2 singular values)
V (full): 
[[ 0.15583693  0.03764079 -0.85539808 -0.29153161 -0.39699052]
 [ 0.36959977 -0.91059016  0.10819417 -0.12554203 -0.0821875 ]
 [ 0.75621245  0.40595748  0.36394339 -0.21320633 -0.29228331]
 [ 0.32267258 -0.01049848 -0.2056213   0.91782704 -0.10528814]
 [ 0.40390099  0.06705991 -0.28610005 -0.10670101  0.8597259 ]]


In [9]:
print("U (economy): \n{}".format(Ue))
print("s (economy): \n{}".format(se))  # s = the singular values (the diagonal of Sigma) in vector format
print("V (economy): \n{}".format(Ve))  # notice how the economy version of V^T is 2x5; it excludes the bottom 3 rows, which would be multiplied by 0 anyway.

U (economy): 
[[-0.70729733 -0.70691618]
 [-0.70691618  0.70729733]]
s (economy): 
(the same vector of 2 singular values as in the full SVD)
V (economy): 
[[ 0.15583693  0.03764079 -0.85539808 -0.29153161 -0.39699052]
 [ 0.36959977 -0.91059016  0.10819417 -0.12554203 -0.0821875 ]]
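
The cells above only cover the wide case. A compact way to compare all three shapes asked for in the challenge might look like this; the sizes 4x4, 2x5 and 5x2 are arbitrary choices, and only the shapes are printed because the values are random.

```python
import numpy as np

rng = np.random.default_rng(4)
matrices = {
    "square": rng.standard_normal((4, 4)),
    "wide": rng.standard_normal((2, 5)),
    "tall": rng.standard_normal((5, 2)),
}

shapes = {}
for name, M in matrices.items():
    # shapes of (U, s, V^T) for the full and the economy SVD
    full = [x.shape for x in np.linalg.svd(M)]
    econ = [x.shape for x in np.linalg.svd(M, full_matrices=False)]
    shapes[name] = (full, econ)
    print(f"{name:>6}  full: {full}  economy: {econ}")
```

For the square matrix the two versions have identical sizes; the economy SVD only trims $V^T$ for the wide matrix and $U$ for the tall one.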


2. Obtain the three SVD matrices from eigendecomposition, as described in section 16.2. Then compute the SVD of that matrix using the svd() function, to confirm that your results are correct. Keep in mind the discussions of sign-indeterminacy.

3. Write code to reproduce panels B and C in Figure 16.5. Confirm that the reconstructed matrix (third matrix in panel C) is equal to the original matrix. (Note: The matrix was populated with random numbers, so don't expect your results to look exactly like those in the figure.)

4. Create a random-numbers matrix with a specified condition number. For example, create a $6 \times 16$ random matrix with a condition number of $\kappa=42$. Do this by creating random $U$ and $V$ matrices, an appropriate $\Sigma$ matrix, and then create $A = U \Sigma V^T$. Finally, compute the condition number of $A$ to confirm that it matches what you specified (42).
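
One possible approach to this challenge, as a sketch rather than the book's solution: it assumes that random orthogonal $U$ and $V$ can be taken from QR decompositions of random matrices, and that linearly spaced singular values are acceptable (any set whose largest-to-smallest ratio is 42 would do).

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, kappa = 6, 16, 42

# random orthogonal U (m x m) and V (n x n) from QR decompositions
U, _ = np.linalg.qr(rng.standard_normal((m, m)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))

# singular values from kappa down to 1, placed on the diagonal of an m x n Sigma
s = np.linspace(kappa, 1, min(m, n))
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)

A = U @ Sigma @ V.T
cond_A = np.linalg.cond(A)
print(cond_A)   # should be very close to 42
```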

5. This and the next two challenges involve taking the SVD of a picture. A picture is represented as a matrix, with the matrix values corresponding to grayscale intensities of the pixels. We will use a picture of Einstein. You can download the file at https://upload.wikimedia.org/wikipedia/en/8/86/Einstein_tongue.jpg. Of course, you can replace this with any other picture (a selfie, ...). However, you may need to apply some image processing to reduce the image matrix from 3D to 2D (thus, grayscale instead of RGB), and the datatype must be double (MATLAB) or floats (Python).

After importing the image, construct a low-rank approximation using various numbers of singular values. Show the original and low-rank approximations side-by-side. Test various numbers of components and qualitatively evaluate the results. Tip: You don't need to include the top components!


6. Create a scree plot of the percent-normalized singular values. Then test various thresholds for reconstructing the picture (e.g., including all components that explain at least $4 \%$ of the variance). What threshold seems reasonable?

7. The final challenge for this picture-SVD is to make the assessments of the number of appropriate components more quantitative. Compute the error between the reconstruction and the original image. The error can be operationalized as the RMS (root mean square) of the difference map. That is, create a difference image as the subtraction of the original and low-rank reconstructed image, then square all matrix elements (which are pixels), average over all pixels, and take the square root of that average. Make a plot of the RMS as a function of the number of components you included. How does that function compare to the scree plot?

8. What is the pseudoinverse of a column vector of constants? That is, the pseudoinverse of $k\mathbf{1}$. It obviously doesn't have a full inverse, but it is clearly a full column-rank matrix. First, work out your answer on paper, then confirm it in MATLAB or Python.
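
A sketch of the confirmation step, with $k=3$ and a length of 5 chosen arbitrarily. The paper answer used for comparison follows from the left-inverse formula for a column vector $v$, namely $v^\dagger = v^T / (v^Tv)$, which for $v = k\mathbf{1}$ of length $n$ gives $\frac{1}{kn}\mathbf{1}^T$.

```python
import numpy as np

k, n = 3.0, 5
v = k * np.ones((n, 1))                # column vector of constants

v_pinv = np.linalg.pinv(v)             # row vector of shape (1, n)
expected = np.ones((1, n)) / (k * n)   # paper answer: (1 / (k n)) * 1^T

print(v_pinv)
print(np.allclose(v_pinv, expected))
```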

9. The goal here is to implement the series of equations on page 505 and confirm that you get the same result as with the pinv() function. Start by creating a $4 \times 2$ matrix of random integers between 1 and 6. Then compute its SVD (Equation 16.29). Then implement each of the next four equations in code. Finally, compute the MP pseudoinverse of the tall matrix. You will now have five versions of the pseudoinverse; make sure they are all equal.

10. This challenge follows up on the first code challenge from the previous chapter (about generalized eigendecomposition implemented as two matrices vs. the product of one matrix and the other's inverse). The goal is to repeat the exploration of differences between eig(A, B) and eig(inv(B) * A). Use only $10 \times 10$ matrices, but now vary the condition number of the random matrices between $10^{1}$ and $10^{10}$. Do you come to different conclusions from the previous chapter?

11. This isn't a specific code challenge, but instead a general suggestion: Take any claim or proof I made in this chapter (or any other chapter), and demonstrate that concept using numerical examples in code. Doing so (1) helps build intuition, (2) improves your skills at translating math into code, and (3) gives you opportunities to continue exploring other linear algebra principles (I can't cover everything in one book!).