# Ch 13: Projections and Orthogonalization

- content: pp. 363 - 388
- exercises: pp. 389 - 394

In [1]:
import numpy as np
from matplotlib import pyplot as plt

### Intro

The goal of this chapter is to introduce a framework for projecting one space onto another space (e.g. a 3D shape forming a 2D shadow).  This framework forms the basis for orthogonalization and for an algorithm called *linear least-squares*, which is the primary method for estimating parameters and fitting models to data, and is therefore one of the most important algorithms in applied mathematics, including control engineering, statistics, and machine learning.  Along the way, we'll also rediscover the left inverse.

## 13.1 Projections in $\mathbb{R}^2$

We're going to discover a formula for projecting a point onto a line, and then generalize that formula to other projections.
- start with:
  - a vector $a$
  - point $b$ not on $a$
  - scalar $\beta$ such that $\beta a$ is as close to $b$ as possible without leaving $a$
- the question is, where do we place $\beta$ so that the point $\beta a$ is as close as possible to point $b$?
- the answer: when the line from $\beta a$ to $b$ is at a right angle to $a$. (i.e. orthogonal)

- we can express the line from point $b$ to point $\beta a$ as a subtraction from vector $b$.
- thus, the expression for the line is $b - \beta a$
- Importantly, vectors $a$ and $(b - \beta a)$ are orthogonal / perpendicular
$$(b - \beta a) \perp a$$
- and since they are orthogonal, that means that the dot product between them is 0, so we can rewrite the equation as:
$$(b - \beta a)^T a = 0$$

- from here, we can use algebra to solve for $\beta$
$$a^T(b - \beta a) = 0$$
$$a^Tb - \beta a^Ta = 0$$
$$\beta a^Ta = a^Tb$$
$$\beta = \frac{a^Tb}{a^Ta}$$
*(note that dividing both sides by $a^Ta$ is valid because it is a scalar, and it is nonzero as long as $a$ is not the zero vector)*

### Notation
- The projection of $b$ onto the subspace defined by vector $a$ is typically written as
$$proj_a(b)$$
- note that it can be tricky to remember which is projecting onto which in $proj_a(b)$ or $proj_b(a)$
  - a memory trick is that the **Subspace goes in the Subscript**

### Equation for the projection of a point onto a line
$$proj_a(b) = \frac{a^Tb}{a^Ta}a$$

### Example:

$$a = 
\begin{bmatrix}
-2 \\
-1
\end{bmatrix},
b = (3, -1)$$

$$
proj_a(b) = 
\frac
{
  \begin{bmatrix}
  -2 \\
  -1
  \end{bmatrix}^T
  \begin{bmatrix}
  3 \\
  -1
  \end{bmatrix}
}
{
  \begin{bmatrix}
  -2 \\
  -1
  \end{bmatrix}^T
  \begin{bmatrix}
  -2 \\
  -1
  \end{bmatrix}
}
\begin{bmatrix}
-2 \\
-1
\end{bmatrix}
= 
\frac{-6 + 1}{4 + 1}
\begin{bmatrix}
-2 \\
-1
\end{bmatrix}
=
-1
\begin{bmatrix}
-2 \\
-1
\end{bmatrix}
$$

- notice that $\beta = -1$
- thus, we are projecting "backwards" onto the vector.
- this makes sense when we think of $a$ as being a basis vector for a 1D subspace that is embedded in $\mathbb{R}^2$ 
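
The worked example above can be checked numerically. This is a minimal sketch using NumPy; the variable names (`beta`, `proj`, `residual`) are chosen here for illustration and are not from the book.

```python
import numpy as np

# Vectors from the worked example above
a = np.array([-2.0, -1.0])
b = np.array([3.0, -1.0])

# beta = (a^T b) / (a^T a)
beta = (a @ b) / (a @ a)

# Projection of b onto the line spanned by a
proj = beta * a

# The residual b - beta*a should be orthogonal to a (dot product of 0)
residual = b - proj

print(beta)          # -1.0
print(proj)          # [2. 1.]
print(residual @ a)  # 0.0
```

The zero dot product on the last line is the orthogonality condition $(b - \beta a)^T a = 0$ that the derivation started from.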

### Reflection
*Mapping over magnitude:*  Meditating on the projection equation will reveal that it is a mapping between two vectors, scaled by the squared length of the "target" vector.  It's useful to understand this intuition (mapping over magnitude), because many computations in linear algebra and its applications (e.g. correlation, convolution, normalization) involve some kind of mapping divided by some kind of magnitude or norm.

## 13.2 Projections in $\mathbb{R}^N$

- now that we can project a point onto a line, we're going to extend this by projecting a point onto a subspace of any dimensionality
- we begin by replacing vector $a$ (which spans a 1D subspace) with matrix $A$, whose columns span a subspace with dimensionality equal to the rank of $A$.
- point $b$ is still the same.
- Because $A$ has multiple columns, we also need to replace scalar $\beta$ with a vector.  We'll call that vector $x$

$$A^T(b-Ax)=0$$
$$A^Tb - A^TAx=0$$
$$A^TAx=A^Tb$$
- unlike before, we can't divide by $A^TA$ because it is a matrix
- So instead, we'll left multiply by the inverse (since that is essentially dividing by a matrix)
$$x=(A^TA)^{-1} A^Tb$$

- Note that $A^TA$ must be square and full rank for this equation to be valid
- $A^TA$ is always square (so that is given)
- $A^TA$ is full rank only when $A$ is full column rank

### Code

In [2]:
# calculate least squares in Python
A = [[1,2],[3,1],[1,1]]
b = [5.5, -3.5, 1.5]
np.linalg.lstsq(A, b, rcond=None)[0]   # rcond=None added because a FutureWarning recommended it

array([-2.5,  4. ])
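
As a cross-check, the formula $x=(A^TA)^{-1}A^Tb$ can be evaluated directly and compared with `np.linalg.lstsq`. The sketch below solves the normal equations $A^TAx=A^Tb$ with `np.linalg.solve` rather than forming the inverse explicitly (a choice made here, not something the notes prescribe). The second right-hand side, `b2`, is a hypothetical vector added for illustration, because the original $b$ happens to lie exactly in the column space of $A$.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 1.0], [1.0, 1.0]])
b = np.array([5.5, -3.5, 1.5])

# Solve the normal equations A^T A x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Reference solution from NumPy's least-squares routine
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_normal, x_lstsq))        # True

# Hypothetical b2 that is NOT in the column space of A
b2 = np.array([6.5, -3.5, 1.5])
x2 = np.linalg.solve(A.T @ A, A.T @ b2)
residual2 = b2 - A @ x2

# The residual is nonzero, but still orthogonal to every column of A
print(np.linalg.norm(residual2) > 0)         # True
print(np.allclose(A.T @ residual2, 0.0))     # True
```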

## 13.3 Orthogonal and parallel vector components

- in this section, we'll learn how to decompose one vector into two separate vectors that are orthogonal to each other, and that have a special relationship to a 3rd vector.
- Begin with an example:
  - start with 2 vectors: $w$ (target vector), and $v$ (reference vector)
  - we want to break up $w$ into two separate vectors, one *parallel* to $v$, and the other *perpendicular* to $v$
    - the component of $w$ that is parallel to $v$ is labeled $w_{||v}$
    - the component of $w$ that is perpendicular to $v$ is labeled $w_{\perp v}$
  - these two components sum to form w.  In other words:
$$w = w_{|| v} + w_{\perp v}$$

### Parallel component

- if w and v have their tails at the same point, then the component of w that is parallel to v is simply collinear with v.
- in other words, $w_{||v}$ is a scaled version of v.
- If we think about it, this is exactly what we did previously with projections!  So let's grab the formula and use it here.
$$w_{||v} = proj_v(w) = \frac{w^Tv}{v^Tv}v$$

### Perpendicular component

- to solve for $w_{\perp v}$, recall that the vector w is the sum of the parallel and perpendicular components
$$w_{\perp v} = w - w_{||v}$$

### Summary of equations decomposing a vector relative to another vector
$$w = w_{|| v} + w_{\perp v}$$
$$w_{||v} = proj_v(w) = \frac{w^Tv}{v^Tv}v$$
$$w_{\perp v} = w - w_{||v}$$

Tip to remember the formulas:
- because $w_{||v}$ is parallel to v, it's really the same as vector v but scaled.
- thus $w_{||v} = \alpha v$
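
The decomposition can be sketched in a few lines of NumPy, using the projection formula with $v^Tv$ in the denominator. The vectors `w` and `v` below are hypothetical values picked so the answer is easy to see by eye (`v` lies along the horizontal axis).

```python
import numpy as np

# Hypothetical example vectors (not from the book)
w = np.array([2.0, 3.0])   # target vector
v = np.array([4.0, 0.0])   # reference vector

# Parallel component: projection of w onto v
w_par = (w @ v) / (v @ v) * v

# Perpendicular component: whatever is left over
w_perp = w - w_par

print(w_par)        # [2. 0.]
print(w_perp)       # [0. 3.]
print(w_perp @ v)   # 0.0
```

Here the scalar $\alpha$ from the tip above is $\frac{w^Tv}{v^Tv} = 0.5$, so $w_{||v}$ is just $v$ scaled by one half.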

### Reflection
The geometric representations easily demonstrate that the algebra is correct.  Whenever possible, you should learn math by solving problems where you can *see* the correct answer, and working on more challenging problems only after understanding the concept and algorithm in the visualizable examples.  The same principle underlies the motivation to test statistical analysis methods on simulated data (where the ground truth is known) before applying those methods to empirical data (where ground truth is usually unknown).

## 13.4 Orthogonal matrices

### Properties of orthogonal matrices

- typically indicated using the letter $Q$
- (not the only letter used for orthogonal matrices, but if you see the letter $Q$ for a matrix, it's safe to assume it's orthogonal.)

2 key properties of an orthogonal matrix:
1. All columns are pairwise orthogonal.
2. All columns have a magnitude of 1.

### Orthogonal matrix definition:
$$Q^TQ = I$$

extending that to include other equalities (these hold when $Q$ is square):
$$Q^TQ = QQ^T = Q^{-1}Q = QQ^{-1} = I$$

### Rectangular Q
- an orthogonal matrix does not need to be square; however, the transpose of a rectangular orthogonal matrix is only a one-sided inverse.
- It's possible for a tall matrix to be orthogonal, but it's NOT possible for a wide matrix to be orthogonal.
- a wide matrix can't be orthogonal because it can't satisfy both properties of an orthogonal matrix (orthogonal columns, each with magnitude of 1): it has more columns than rows, so its columns cannot all be linearly independent, let alone pairwise orthogonal.
- that said, wide matrices can be "almost orthogonal" - they can have interesting properties that almost meet the criteria of an orthogonal matrix.
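
A quick numerical check of these properties, as a sketch: the rotation matrix and the truncated identity below are hypothetical examples chosen for illustration, not matrices from the book.

```python
import numpy as np

# A 2D rotation matrix is a classic square orthogonal matrix
theta = np.pi / 6
Q_square = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q_square.T @ Q_square, np.eye(2)))  # True
print(np.allclose(Q_square @ Q_square.T, np.eye(2)))  # True

# A tall matrix with orthonormal columns: first two columns of the 3x3 identity
Q_tall = np.eye(3)[:, :2]

print(np.allclose(Q_tall.T @ Q_tall, np.eye(2)))  # True: the transpose is a left inverse
print(np.allclose(Q_tall @ Q_tall.T, np.eye(3)))  # False: it is not a right inverse
```

The last two lines show what "one-sided inverse" means for a tall $Q$: $Q^TQ = I$ holds, but $QQ^T = I$ does not.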

### Recap
- orthogonal matrices seem a bit magical
- one might believe they are rare and difficult to construct
- Definitely not!  We already know everything needed to make an orthogonal matrix, and will put the pieces together in the next section

## 13.5 Orthogonalization via Gram-Schmidt

- let's say you have a set of vectors in $\mathbb{R}^N$ that is independent but not orthogonal.
- You can create a set of orthogonal vectors from the original vectors by applying the Gram-Schmidt process.

### Gram-Schmidt steps for creating a set of orthonormal vectors

1. Start with $v_1$ and normalize to unit length:
$v^*_1 = \frac{v_1}{|v_1|}$
For all remaining vectors in the set:
2. Orthogonalize $v^*_k$ to all previous vectors
3. Normalize $v^*_k$ to unit length

The result of this procedure is a set of orthonormal vectors that, when placed as column vectors in a matrix, yield an orthogonal matrix $Q$.
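
The steps above can be sketched as a short function. This is an illustrative implementation of the classical procedure, not the book's code; the function name `gram_schmidt` and the example matrix `V` are assumptions made here, and the function assumes the columns of its input are linearly independent.

```python
import numpy as np

def gram_schmidt(V):
    """Return a matrix whose columns are an orthonormalized version of the columns of V.

    Assumes the columns of V are linearly independent.
    """
    V = np.asarray(V, dtype=float)
    Q = np.zeros_like(V)
    for k in range(V.shape[1]):
        q = V[:, k].copy()
        # Step 2: subtract the projection onto each previous orthonormal vector
        for j in range(k):
            q -= (Q[:, j] @ V[:, k]) * Q[:, j]
        # Steps 1 and 3: normalize to unit length
        Q[:, k] = q / np.linalg.norm(q)
    return Q

# Hypothetical example: three independent, non-orthogonal vectors in R^3
V = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
Q = gram_schmidt(V)

print(np.allclose(Q.T @ Q, np.eye(3)))  # True
```

Because each previous vector already has unit length, the projection onto it reduces to a dot product times that vector, with no division needed.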

*See detailed example on page 382*

### Reflection
The Gram-Schmidt procedure is numerically unstable, due to round-off errors that propagate forward to each subsequent vector and affect both the normalization and the orthogonalization.  You'll see an example of this in the code challenges.  Computer programs therefore use numerically stable algorithms that achieve the same conceptual result, based on modifications to the standard Gram-Schmidt procedure or alternative methods such as Givens rotations or Householder reflections.
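
One way to see the instability is to orthogonalize a badly conditioned matrix and measure how far $Q^TQ$ is from the identity. The sketch below is not the book's code challenge; it uses a 10x10 Hilbert matrix as a hypothetical test case and compares a textbook Gram-Schmidt (the helper name `classical_gram_schmidt` is made up here) with NumPy's LAPACK-based `np.linalg.qr`.

```python
import numpy as np

def classical_gram_schmidt(V):
    """Textbook Gram-Schmidt on the columns of V (assumed independent)."""
    Q = np.zeros_like(V, dtype=float)
    for k in range(V.shape[1]):
        # Remove the components along all previously computed columns
        q = V[:, k] - Q[:, :k] @ (Q[:, :k].T @ V[:, k])
        Q[:, k] = q / np.linalg.norm(q)
    return Q

# Hilbert matrix H[i, j] = 1 / (i + j + 1): a standard badly conditioned example
n = 10
i, j = np.indices((n, n))
H = 1.0 / (i + j + 1)

Q_gs = classical_gram_schmidt(H)
Q_np, _ = np.linalg.qr(H)

# Distance of Q^T Q from the identity; 0 would mean perfectly orthonormal columns
err_gs = np.linalg.norm(Q_gs.T @ Q_gs - np.eye(n))
err_np = np.linalg.norm(Q_np.T @ Q_np - np.eye(n))

print(err_gs, err_np)
```

The first number should come out many orders of magnitude larger than the second, which is the loss of orthogonality described above.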

## 13.6 QR decomposition

- The Gram-Schmidt procedure transforms matrix $A$ into orthogonal matrix $Q$.
- Unless $A$ is already an orthogonal matrix, $Q$ will be different than $A$, possibly very different.
- Thus, information is lost when going from $A \rightarrow Q$.
- Though info is lost, it's possible to recover the information by QR decomposition.

$$A = QR$$
- The $Q$ here is the same $Q$ that you learned about above; it's the result of Gram-Schmidt orthogonalization (or another comparable but more numerically stable algorithm)
- $R$ is like a "residual" matrix that contains the information that was orthogonalized out of $A$

- To compute $R$, take advantage of the definition of orthogonal matrices:
$$Q^TA = Q^TQR$$
- and since $Q^TQ=I$
$$Q^TA=R$$

*see example on page 385*
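
The relationship $R = Q^TA$ can be verified numerically. In this sketch $Q$ comes from `np.linalg.qr` rather than from a hand-written Gram-Schmidt, and the random matrix and the seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
A = rng.standard_normal((4, 3))

Q, _ = np.linalg.qr(A)           # economy-size Q (4x3); NumPy's own R is discarded

# Recover R from Q and A
R = Q.T @ A

print(np.allclose(Q @ R, A))        # True: A = QR
print(np.allclose(R, np.triu(R)))   # True: R is upper triangular (up to round-off)
```

$R$ comes out upper triangular because each column of $A$ is built only from the current and earlier columns of $Q$.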

### Sizes of Q and R, given A

- The sizes of $Q$ and $R$ depend on the size of $A$, and on a parameter of the implementation.

**Square matrix:**
  - if $A$ is a square matrix, then $Q$ and $R$ are also square matrices, of the same size as $A$
  - this is true regardless of the rank of $A$


**Tall matrix:**
- computer algorithms can implement the "economy QR" or "full QR" decomposition
- economy QR:
  - $Q$ is the same size as $A$, and $R$ will be NxN.
  - however, it's also possible to create a square $Q$ from a tall $A$.
  - that's because the columns of $A$ are in $\mathbb{R}^M$, so even if $A$ has only N columns, there are M-N more possible columns to create that will be orthogonal to the first N.
- full QR:
  - will have $Q \in \mathbb{R}^{MxM}$ and $R \in \mathbb{R}^{MxN}$
  - in other words, $Q$ is square and $R$ is the same size as $A$

**Wide matrix**
- there's no economy QR for wide matrices, because $A$ already has more columns than could form a linearly independent set.
- Thus, for a wide matrix $A$, the story is the same as for the full QR decomposition of a tall matrix:
  - will have $Q \in \mathbb{R}^{MxM}$ and $R \in \mathbb{R}^{MxN}$
  - in other words, $Q$ is square and $R$ is the same size as $A$
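
In NumPy, the "parameter of the implementation" is the `mode` argument of `np.linalg.qr`: `"reduced"` (the default) gives the economy decomposition and `"complete"` gives the full one. The sketch below just prints the resulting shapes for arbitrary random matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

tall = rng.standard_normal((5, 3))   # M=5, N=3
wide = rng.standard_normal((3, 5))   # M=3, N=5

# Economy QR of a tall matrix (NumPy's default mode)
Q_econ, R_econ = np.linalg.qr(tall, mode="reduced")
print(Q_econ.shape, R_econ.shape)    # (5, 3) (3, 3)

# Full QR of a tall matrix
Q_full, R_full = np.linalg.qr(tall, mode="complete")
print(Q_full.shape, R_full.shape)    # (5, 5) (5, 3)

# Wide matrix: Q is square and R has the same size as A
Q_wide, R_wide = np.linalg.qr(wide)
print(Q_wide.shape, R_wide.shape)    # (3, 3) (3, 5)
```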

### Ranks of $Q, R$, and $A$

- $Q$ will always have its maximum possible rank (M or N depending on its size), even if $A$ is not full rank
  - it may seem surprising that the rank of $Q$ can be higher than the rank of $A$, considering that $Q$ is created from A, but it can.
- $R$ will have the same rank as $A$
  - because $R$ is created from $Q^TA$, the max possible rank of $R$ will be the rank of $A$, because the rank of $A$ is equal to or less than the rank of $Q$.
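
These rank claims can be checked on a small rank-deficient matrix. The matrix below is a hypothetical example whose third column is the sum of the first two; the explicit tolerance passed to `np.linalg.matrix_rank` is a choice made here so that round-off does not affect the count.

```python
import numpy as np

# Hypothetical rank-2 matrix: the third column is the sum of the first two
A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0],
              [2.0, 0.0, 2.0]])

Q, R = np.linalg.qr(A)

# Singular values below tol are treated as zero
tol = 1e-10
print(np.linalg.matrix_rank(A, tol=tol))  # 2
print(np.linalg.matrix_rank(Q, tol=tol))  # 3
print(np.linalg.matrix_rank(R, tol=tol))  # 2
```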

### Code

In [3]:
# QR decomposition in Python
A = np.random.randn(4,3)
Q,R = np.linalg.qr(A)
Q

array([[-0.62398737,  0.7539715 ,  0.18789483],
       [-0.74263329, -0.65200527,  0.15279724],
       [-0.01933606, -0.05213946, -0.27780652],
       [-0.24240809,  0.06080777, -0.92960856]])

In [4]:
R

array([[ 2.52939486, -0.32894288,  0.75203594],
       [ 0.        ,  1.85820658, -2.13106638],
       [ 0.        ,  0.        ,  1.1208514 ]])

## 13.7 Inverse via QR

- we noted previously that it's best to avoid having computers compute the inverse unless it's necessary, because of the risk of inaccurate results due to numerical instabilities.
- The QR decomposition provides a more stable algorithm to compute the matrix inverse, compared to the MCA algorithm learned in the previous chapter.

- We can use the $A=QR$ equation to derive the inverse:
$$A=QR$$
$$A^{-1} = (QR)^{-1}$$
$$A^{-1} = R^{-1}Q^{-1}$$
$$A^{-1} = R^{-1}Q^T$$

We still need to compute the explicit inverse of $R$, but if $A$ is a dense matrix, then $R^{-1}$ is easier and more stable to calculate than $A^{-1}$, because nearly half the elements in $R$ are zeros, and matrices with a lot of zeros are easy to work with.
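
A minimal sketch of the inverse via QR, using an arbitrary random square matrix. Instead of inverting $R$ explicitly, it solves $RX = Q^T$ with `np.linalg.solve`, which is a choice made here; `np.linalg.solve` is a general-purpose solver, and a dedicated triangular solver such as `scipy.linalg.solve_triangular` would exploit the zeros in $R$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # a random square matrix is almost surely invertible

Q, R = np.linalg.qr(A)

# A^{-1} = R^{-1} Q^T, computed by solving R X = Q^T
A_inv_qr = np.linalg.solve(R, Q.T)

print(np.allclose(A_inv_qr, np.linalg.inv(A)))   # True
print(np.allclose(A @ A_inv_qr, np.eye(4)))      # True
```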

## 13.8 - 13.9 Exercises

do with discussion group?

## 13.10 - 13.11 Code challenges

do with discussion group?