# vector and matrix

- matrix $\mathbf{A}\in \mathbb{R}^{m \times n}$

    - row indexing $\mathbf{A}[i,:]$
    
    - column indexing $\mathbf{A}[:,i]$
    
    - matrix slicing $\mathbf{A}[a:b,c:d]$


- vector

    - column vector $\mathbf{v}\in \mathbb{R}^{n}$

    - row vector $\mathbf{v}^T\in \mathbb{R}^{1 \times n}$

- dot product $\mathbf{a} \cdot \mathbf{b}=\mathbf{a} ^T \mathbf{b}$ where $\mathbf{a}, \mathbf{b} \in \mathbb{R}^n$


- matrix multiplication $\mathbf{A} \mathbf{B}$

- matrix transpose $\mathbf{A}^T$


- zero vector $\mathbf{0}$


- one vector $\mathbf{1}$


- $\mathbf{1}^T\mathbf{v}$ is sum of elements in $\mathbf{v}$

- matrix addition $\mathbf{A}+ \mathbf{B}$


- matrix subtraction $\mathbf{A} - \mathbf{B}$


- Hadamard product $\mathbf{A} \odot  \mathbf{B}$ = elementwise multiplication

$$A \odot B = 
\begin{bmatrix}
A_{1,1}B_{1,1} & A_{1,2}B_{1,2} & \cdots & A_{1,n}B_{1,n} \\
A_{2,1}B_{2,1} & A_{2,2}B_{2,2} & \cdots & A_{2,n}B_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
A_{m,1}B_{m,1} & A_{m,2}B_{m,2} & \cdots & A_{m,n}B_{m,n}
\end{bmatrix}
$$


- Hadamard quotient $\mathbf{A} \oslash \mathbf{B}$ (also written $\frac{\mathbf{A}}{\mathbf{B}}$) = elementwise division

$$
A \oslash B = 
\begin{bmatrix}
A_{1,1}/B_{1,1} & A_{1,2}/B_{1,2} & \cdots & A_{1,n}/B_{1,n} \\
A_{2,1}/B_{2,1} & A_{2,2}/B_{2,2} & \cdots & A_{2,n}/B_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
A_{m,1}/B_{m,1} & A_{m,2}/B_{m,2} & \cdots & A_{m,n}/B_{m,n}
\end{bmatrix}
$$
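The notation above maps closely onto NumPy. The following is a minimal sketch, assuming NumPy as the array library; the example values and variable names are illustrative only, and note that NumPy indices start at 0.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # A in R^{2x3}
B = np.array([[2.0, 2.0, 2.0],
              [4.0, 5.0, 6.0]])   # same shape as A

row = A[0, :]          # row indexing
col = A[:, 1]          # column indexing
block = A[0:2, 1:3]    # slicing: rows 0..1, columns 1..2

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
dot = a @ b                 # dot product a^T b
total = np.ones(3) @ a      # 1^T a = sum of the entries of a

hadamard = A * B            # Hadamard (elementwise) product
quotient = A / B            # Hadamard (elementwise) quotient
product = A @ B.T           # matrix multiplication, shape (2, 2)
```

Note that `*` is elementwise in NumPy, while `@` is the matrix product.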


# broadcasting


- scalar multiplication $c\mathbf{A}$


- scalar division $\frac{\mathbf{A}}{c}$


- scalar addition $c+\mathbf{A}$


- scalar subtraction $c-\mathbf{A}$


- vector-matrix addition $\mathbf{v}+\mathbf{A}$

    column vector: add vector to every column of matrix
    
    row vector: add vector to every row of matrix


- function broadcasting $f(\mathbf{A})$ where $f: \mathbb{R} \rightarrow \mathbb{R}$


![image.png](attachment:image.png)

![image.png](attachment:image.png)
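A short sketch of the broadcasting rules listed above, assuming NumPy semantics (a column vector is stored with shape `(m, 1)` and a row vector with shape `(1, n)`); the values are illustrative.

```python
import numpy as np

A = np.arange(6, dtype=float).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
col_v = np.array([[10.0], [20.0]])            # column vector, shape (2, 1)
row_v = np.array([[1.0, 2.0, 3.0]])           # row vector, shape (1, 3)

scaled = 2.0 * A        # scalar multiplication
shifted = 1.0 - A       # scalar subtraction c - A
plus_col = col_v + A    # col_v is added to every column of A
plus_row = row_v + A    # row_v is added to every row of A
f_of_A = np.exp(A)      # function broadcasting: f applied to every entry
```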

# span and vector space

- linear map (strictly speaking an affine map when $\mathbf{b} \neq \mathbf{0}$): a function $f: \mathbb{R}^m \rightarrow \mathbb{R}^n$

    $$
    f(\mathbf{x})= \mathbf{W}\mathbf{x}+\mathbf{b}
    $$

- bilinear map: a function $g: \mathbb{R}^m \times \mathbb{R}^n \rightarrow \mathbb{R}$

    $$
    g(\mathbf{x}, \mathbf{y})=\mathbf{x}^T \mathbf{W}\mathbf{y}+b
    $$

    $\mathbf{W}$: weight matrix, $\mathbf{b}$: bias


- span of $\mathbb{B}$: 

    let $\mathbb{B} \subseteq \mathbb{R}^n$ be a set of n-dimensional vectors,
    
    the span of $\mathbb{B}$ is the set of all linear combinations of vectors in $\mathbb{B}$
    
    $$
    \text{span}(\mathbb{B})=\left \{a_1 \mathbf{b}^{(1)} + a_2 \mathbf{b}^{(2)} +...+ a_k \mathbf{b}^{(k)} | \mathbf{a} \in \mathbb{R}^k \ \text{and}\ \forall i\ [\mathbf{b}^{(i)} \in \mathbb{B}] \right \}
    $$


- vector space $\mathbb{V}$

    if $\mathbb{V}=\text{span}(\mathbb{B})$ for some $\mathbb{B} \subseteq \mathbb{R}^n$,
    
    then the set of vectors $\mathbb{V} \subseteq \mathbb{R}^n$ is a vector space


- linearly independent

    a set of vectors $\mathbb{B} = \{\mathbf{b}^{(1)}, ..., \mathbf{b}^{(k)}\}$ is linearly independent if 
    
    $$
    a_1 \mathbf{b}^{(1)} + a_2 \mathbf{b}^{(2)} +...+ a_k \mathbf{b}^{(k)}=\mathbf{0}
    $$
    
    implies $\mathbf{a}=\mathbf{0}$, i.e., $a_1 = ... = a_k = 0$ is the only solution


- basis for vector space

    $\mathbb{B}$ is basis for vector space $\mathbb{V}$ if $\mathbb{B}$ is linearly independent and $\text{span}(\mathbb{B})=\mathbb{V}$


- dimension of vector space: number of elements in a basis $\mathbb{B}$ of $\mathbb{V}$

    $$
    \text{dim}(\mathbb{V})=|\mathbb{B}|
    $$
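Linear independence and the dimension of a span can be checked numerically through the matrix rank. This is a sketch under the assumption that the vectors are stacked as columns of a matrix; `is_linearly_independent` is a hypothetical helper name, not a library function.

```python
import numpy as np

def is_linearly_independent(vectors):
    """True if the given vectors (stacked as columns) are linearly independent."""
    B = np.column_stack(vectors)
    # Independent exactly when the rank equals the number of vectors.
    return np.linalg.matrix_rank(B) == B.shape[1]

b1 = np.array([1.0, 0.0, 0.0])
b2 = np.array([0.0, 1.0, 0.0])
b3 = b1 + 2.0 * b2                      # a linear combination of b1 and b2

independent = is_linearly_independent([b1, b2])
dependent = is_linearly_independent([b1, b2, b3])
dim_span = np.linalg.matrix_rank(np.column_stack([b1, b2, b3]))
```

Here $\{\mathbf{b}_1, \mathbf{b}_2\}$ is a basis of the span, so its dimension is 2 even though three vectors were supplied.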

# Similarity and distance metrics

## cosine similarity

- cosine similarity between 2 vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$ is

$$
\cos(\mathbf{u}, \mathbf{v})=\frac{\mathbf{u}^T \mathbf{v}}{\left \| \mathbf{u} \right \| \cdot \left \| \mathbf{v} \right \|}=\cos \theta \in [-1,1]
$$

where $\theta$ is the angle between 2 vectors $\mathbf{u}, \mathbf{v}$

when $\mathbf{u}, \mathbf{v}$ point in the same direction, $\cos(\mathbf{u}, \mathbf{v})=1$

when $\mathbf{u}, \mathbf{v}$ point in opposite directions, $\cos(\mathbf{u}, \mathbf{v})=-1$

when $\mathbf{u}, \mathbf{v}$ are orthogonal, $\cos(\mathbf{u}, \mathbf{v})=0$

## cosine distance

- cosine distance measures how different the directions of two vectors are

$$
1 - \cos(\mathbf{u}, \mathbf{v}) \in [0, 2]
$$

## Euclidean distance

- Euclidean distance ($l_2$ distance) between 2 vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$ is

$$
\left \| \mathbf{u}- \mathbf{v} \right \|
$$
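The three measures above can be sketched in a few lines. The function names are hypothetical helpers, and the sketch assumes neither vector is the zero vector (otherwise the cosine is undefined).

```python
import numpy as np

def cosine_similarity(u, v):
    # u^T v / (||u|| ||v||); assumes both norms are non-zero
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def cosine_distance(u, v):
    return 1.0 - cosine_similarity(u, v)

def euclidean_distance(u, v):
    return np.linalg.norm(u - v)

u = np.array([1.0, 0.0])
same = cosine_similarity(u, 2.0 * u)                      # same direction
opposite = cosine_similarity(u, -u)                       # opposite direction
orthogonal = cosine_similarity(u, np.array([0.0, 3.0]))   # orthogonal
```

Cosine similarity ignores vector length, while Euclidean distance does not: `u` and `2 * u` have cosine distance 0 but Euclidean distance 1.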

# variance and covariance

- variance of 1 random scalar-valued variable $a \in \mathbb{R}$

$$
var(a)= E[\left(a-E[a]\right)^2]=cov(a, a) \in \mathbb{R}
$$


- covariance of 2 random scalar-valued variables $a, b \in \mathbb{R}$

$$
cov(a,b)= E[\left(a-E[a]\right)\left(b-E[b]\right)] \in \mathbb{R}
$$


- **covariance matrix** of 2 random vector-valued variables  $\mathbf{a}, \mathbf{b} \in \mathbb{R}^n$

$$
cov(\mathbf{a},\mathbf{b})= E[\left(\mathbf{a}-E[\mathbf{a}]\right)\left(\mathbf{b}-E[\mathbf{b}]\right)^T]  \in \mathbb{R}^{n \times n}
$$

$$
=\begin{bmatrix}
cov(a_1, b_1) & ... & cov(a_1, b_n)\\ 
\vdots  & \ddots  & \vdots \\ 
cov(a_n, b_1) & ... & cov(a_n, b_n)
\end{bmatrix}
$$

Note $cov(\mathbf{a},\mathbf{b})_{i,j}=cov(a_i, b_j)$ for all positions $i, j$


- **covariance matrix** of a vector-valued variable $\mathbf{a} \in \mathbb{R}^n$

$$
cov(\mathbf{a},\mathbf{a})= var(\mathbf{a})=E[\left(\mathbf{a}-E[\mathbf{a}]\right)\left(\mathbf{a}-E[\mathbf{a}]\right)^T]  \in \mathbb{R}^{n \times n}
$$

$$
=\begin{bmatrix}
cov(a_1, a_1) & ... & cov(a_1, a_n)\\ 
\vdots  & \ddots  & \vdots \\ 
cov(a_n, a_1) & ... & cov(a_n, a_n)
\end{bmatrix}
=\begin{bmatrix}
var(a_1) & ... & cov(a_1, a_n)\\ 
\vdots  & \ddots  & \vdots \\ 
cov(a_n, a_1) & ... & var(a_n)
\end{bmatrix}
$$

- **covariance matrix** of a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$

$$
cov(\mathbf{X})=\frac{\left[\mathbf{X}-\bar{\mathbf{X}}\right]^T \left[\mathbf{X}-\bar{\mathbf{X}}\right]}{m} \in \mathbb{R}^{n \times n}
$$

where $\bar{\mathbf{X}}=\frac{\mathbf{1}^T\mathbf{X}}{m} \in \mathbb{R}^{1 \times n}$ is the mean row of matrix $\mathbf{X}$ (subtracted from every row of $\mathbf{X}$ by broadcasting), and the mean is assumed to be known **a priori**

when mean is 0, $\bar{\mathbf{X}}=\mathbf{0}$,

$$
cov(\mathbf{X})=\frac{{\mathbf{X}}^T \mathbf{X}}{m} \in \mathbb{R}^{n \times n}
$$


- if the mean is not known a priori but is estimated from the data, divide by $m-1$ instead of $m$ (Bessel's correction)

$$
cov(\mathbf{X})=\frac{\left[\mathbf{X}-\bar{\mathbf{X}}\right]^T \left[\mathbf{X}-\bar{\mathbf{X}}\right]}{m-1} \in \mathbb{R}^{n \times n}
$$

when mean is 0, $\bar{\mathbf{X}}=\mathbf{0}$,

$$
cov(\mathbf{X})=\frac{{\mathbf{X}}^T \mathbf{X}}{m-1} \in \mathbb{R}^{n \times n}
$$



we can view matrix $\mathbf{X}$ as a data sample of size $m$ (one observation per row) that estimates the distribution of an $n$-dimensional random variable $\mathbf{x}$

- where $\mathbf{X}_i$ is the $i$th column vector of matrix $\mathbf{X}$, i.e., the $m$ observations of the $i$th variable

$$
\mathbf{X} = [\mathbf{X}_1 ... \mathbf{X}_n]
$$
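The two covariance formulas (divide by $m$ when the mean is known, by $m-1$ when it is estimated) can be sketched as below. This assumes rows of `X` are observations and columns are variables; `covariance` and its `known_mean` parameter are hypothetical names introduced for illustration.

```python
import numpy as np

def covariance(X, known_mean=None):
    """Covariance matrix of X (rows = observations, columns = variables)."""
    m = X.shape[0]
    if known_mean is None:
        mean = X.mean(axis=0, keepdims=True)   # estimated from the data
        denom = m - 1
    else:
        mean = np.asarray(known_mean, dtype=float).reshape(1, -1)
        denom = m
    Xc = X - mean                # broadcasting subtracts the mean row from every row
    return Xc.T @ Xc / denom     # shape (n, n)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))     # m = 50 samples, n = 3 variables

C_estimated = covariance(X)
C_zero_mean = covariance(X, known_mean=np.zeros(3))
```

With the mean estimated from the data, the result should agree with `np.cov(X, rowvar=False)`, which also divides by $m-1$ by default.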

# matrix decomposition

- Note: while every matrix has a singular value decomposition, not every matrix has an eigendecomposition

- QR decomposition, $Q$ orthogonal, $R$ upper triangular

$$
M = QR
$$

- Singular value decomposition (SVD), $U, V$ orthogonal, $S$ diagonal

$$
M = USV^T
$$

- Eigendecomposition (ED) according to Spectral theorem, $M$ **symmetric**, $U$ orthogonal, $S$ diagonal

$$
M = USU^T
$$

- Polar decomposition, $U$ orthogonal, $S$ symmetric positive semi-definite

$$
M = US
$$
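The decompositions above can be checked numerically with `numpy.linalg`. This is a sketch on a random square matrix; the polar factors are built here from the SVD as $M = (UV^T)(VSV^T)$, which is one standard construction rather than a dedicated NumPy routine.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))

# QR: Q orthogonal, R upper triangular
Q, R = np.linalg.qr(M)

# SVD: M = U S V^T (NumPy returns the singular values s and V^T)
U, s, Vt = np.linalg.svd(M)
S = np.diag(s)

# Eigendecomposition of a symmetric matrix: sym = E diag(eigvals) E^T
sym = M + M.T
eigvals, E = np.linalg.eigh(sym)

# Polar decomposition from the SVD: M = U_polar @ P_polar
U_polar = U @ Vt            # orthogonal factor
P_polar = Vt.T @ S @ Vt     # symmetric positive semi-definite factor
```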

# orthogonal vectors and matrix

## orthogonal vectors

2 vectors in same dimension $x, y \in \mathbb{R}^n$ are orthogonal if the dot product of $x$ and $y$ is 0

$$
\left \langle x,y \right \rangle = x^Ty=0
$$

## orthonormal vectors

2 vectors in same dimension $x, y \in \mathbb{R}^n$ are **orthonormal** if

they are **orthogonal**

$$
\left \langle x,y \right \rangle = x^Ty=0
$$

and are **normalized**, i.e., they are **unit vectors**

$$
\left \| x \right \|_2=\left \| y \right \|_2=1
$$

## orthogonal matrix

- an orthogonal (orthonormal) matrix is a **square matrix** $U \in \mathbb{R}^{k \times k}$ whose inverse is its transpose

$$
U^{-1} = U^T
$$

thus, since $U^{-1}U = UU^{-1} = I_k$, where $I_k \in \mathbb{R}^{k \times k}$ is the identity matrix,

$$
U^TU = UU^T = I_k
$$

- a semi-orthogonal (semi-orthonormal) matrix is a **non**-square matrix $U \in \mathbb{R}^{p \times k}$ ($p \neq k$)

if $p > k$ (in PCA case), columns of $U$ are orthonormal vectors
$$
\begin{align}
&U^TU = I_k \in \mathbb{R}^{k \times k}\\[1em]
&UU^T \in \mathbb{R}^{p \times p} \text{ matrix of orthogonal projection onto column space of U}
\end{align}
$$

if $p < k$, rows of $U$ are orthonormal vectors
$$
\begin{align}
&UU^T = I_p \in \mathbb{R}^{p \times p}\\[1em]
&U^TU \in \mathbb{R}^{k \times k} \text{ matrix of orthogonal projection onto row space of U}
\end{align}
$$


**property of orthogonal matrix and semi-orthogonal matrix**

- orthogonal matrix: both columns and rows are orthonormal vectors. Proving that a square matrix $U$ is orthogonal $\Leftrightarrow $ proving that the columns of $U$ are orthonormal vectors

    semi-orthogonal matrix: either columns or rows are orthonormal vectors, determined by number of columns vs number of rows

- multiplying **a vector** $\mathbf{x}$ by an orthogonal matrix $U$ does not change the vector's **Euclidean norm**

$$
\left \| U\mathbf{x} \right \|_2 = \left \| \mathbf{x} \right \|_2 
$$

- multiplying **a matrix** $A$ by an orthogonal matrix on the left ($U$) or on the right ($V$) does not change the matrix's **Frobenius norm**

$$
\left \| UA \right \|_F^2 = \left \| A \right \|_F^2= \left \| AV \right \|_F^2
$$
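A quick numerical check of these properties. As an assumption of this sketch, orthogonal and semi-orthogonal matrices are generated from the QR decomposition of random matrices; any other source of orthonormal columns would do.

```python
import numpy as np

rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # square orthogonal matrix
V, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # square orthogonal matrix
W, _ = np.linalg.qr(rng.normal(size=(5, 2)))   # semi-orthogonal, p = 5 > k = 2

x = rng.normal(size=4)
A = rng.normal(size=(4, 3))

norm_x, norm_Ux = np.linalg.norm(x), np.linalg.norm(U @ x)
fro_A = np.linalg.norm(A, "fro")
fro_UA = np.linalg.norm(U @ A, "fro")
fro_AV = np.linalg.norm(A @ V, "fro")

WtW = W.T @ W     # equals I_2: the columns of W are orthonormal
WWt = W @ W.T     # 5x5 projection onto the column space of W, not I_5
```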

# direction of data

## 1-D projection: 1 unit vector

- suppose we have data $X \in \mathbb{R}^{p \times n}$ whose columns are data points $x_i \in \mathbb{R}^{p}$ with $i \in [n]$

    a **direction** is a **unit vector** $v \in \mathbb{R}^{p}$

$$
\left \| v \right \|_2 = 1
$$

- PCA is a linear method and requires its directions (unit vectors) to be **orthonormal** to each other

    other, non-linear methods do not require the unit vectors to be orthonormal 

- if we project data $\mathbf{x}_i$ onto that direction $v$,

    then we have projected data $\tilde {\mathbf{x}}_i = \tilde \alpha_i v$,

    the coordinate of projected data $\tilde {\mathbf{x}}_i$ is $\tilde \alpha_i = \left \langle x_i, v \right \rangle \in \mathbb{R}$


- our original data $x_i$ is in $p$ dimensions, and the projected data $\tilde {\mathbf{x}}_i$ is still in $p$ dimensions,

    but the coordinate of projected data $\tilde \alpha_i$ is in $1$ dimension, so we have achieved **dimensionality reduction**

## 2-D projection: 2 orthonormal vectors

- suppose we have 2 orthonormal directions (unit vectors) $u_1, u_2 \in \mathbb{R}^p$, i.e., $u_1, u_2$ form an orthonormal basis of their span

$$
\left \| u_1 \right \|_2=\left \| u_2 \right \|_2=1,\ 
\left \langle u_1,u_2 \right \rangle=0 
$$



- if we project data $\mathbf{x}_i$ on directions $u_1, u_2$,

    then we have projected data $\tilde x_i \in \mathbb{R}^p$ 

    the coordinate of projected data is $(\hat \alpha_i^1, \hat \alpha_i^2) \in \mathbb{R}^2$

    this is a linear function

$$
\hat \alpha_i^1 u_1 + \hat \alpha_i^2 u_2 = \tilde x_i^1 + \tilde x_i^2  = \tilde x_i\ (\tilde x_i^1, \tilde x_i^2 \in \mathbb{R}^p)
$$

$$
\left\{\begin{matrix}
\hat \alpha_i^1=\left \langle x_i,u_1 \right \rangle = u_1^T x_i\\ 
\hat \alpha_i^2=\left \langle x_i,u_2 \right \rangle = u_2^T x_i\\ 
\end{matrix}\right.
$$

## k-D projection: orthogonal base matrix $U \in \mathbb{R}^{p \times k}$

- suppose we have an orthogonal basis matrix $U \in \mathbb{R}^{p \times k}$

$$
U^TU=I_k
$$

where $I_k$ is the $k\times k$ identity matrix: a square matrix with ones on the diagonal and zeros everywhere else

- columns of $U$ are orthonormal vectors (when $p > k$, the rows cannot be)

- the $i$th column vector of $U$ is a unit vector $u_i \in \mathbb{R}^{p}$ with $i \in [k]$

- if we project data $\mathbf{x}_i$ onto the column space of $U$, then the projected data is $\tilde x_i = U \tilde \alpha \in \mathbb{R}^{p}$, with coordinates: 

$$
\tilde \alpha = U^T x_i \in \mathbb{R}^{k}
$$

## k-D projection: general base matrix $A \in \mathbb{R}^{p \times k}$

- for a general basis matrix $A \in \mathbb{R}^{p \times k}$ with $p \geq k$ and full column rank, $rank(A)=k$,

    the projected data $\tilde {\mathbf{x}}_i$ is the least squares solution (the best representative of $\mathbf{x}_i$ in the column space of $A$)
    
    the coordinate is

$$
\tilde \alpha = \underset{\alpha}{\arg \min} \left \| A\alpha - x_i \right \|_2^2= (A^TA)^{-1}A^T x_i \in \mathbb{R}^{k}
$$


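- if $A$ has orthonormal columns ($A^TA = I_k$), the solution is $\tilde \alpha = A^T x_i$

- for dictionary learning, we take $k \gg p$ (an overcomplete dictionary) and look for a sparse $\tilde \alpha$; $A^TA$ is then singular, so the closed form above no longer applies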
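The projection coordinates for a general basis and for an orthonormal basis can be compared numerically. This sketch assumes a random $A$ with $p \geq k$ (which has full column rank with probability 1) and obtains an orthonormal basis of the same column space from the QR decomposition.

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 6, 2
A = rng.normal(size=(p, k))     # general basis matrix, full column rank
x = rng.normal(size=p)          # one data point

# Normal equations: alpha = (A^T A)^{-1} A^T x (solve instead of forming the inverse)
alpha_normal = np.linalg.solve(A.T @ A, A.T @ x)

# Same coordinates from a least squares solver
alpha_lstsq, *_ = np.linalg.lstsq(A, x, rcond=None)

# Orthonormal basis of the same column space: coordinates reduce to U^T x
U, _ = np.linalg.qr(A)
alpha_orth = U.T @ x
x_tilde = U @ alpha_orth        # projected data, still in R^p
```

The coordinates differ between the two bases, but the projected point $\tilde x$ is the same because both matrices span the same column space.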

# column space and complement of column space

## column space of $U$: $\mathcal{K} \subset \mathbb{R}^p$

- column space of $U  \in \mathbb{R}^{p \times k}$: $\mathcal{K} \subset \mathbb{R}^p$ is a k-D subspace of $\mathbb{R}^p$

- column vectors $\mathbf{u}_i  \in \mathbb{R}^{p}$ of $U$ form basis vectors for space $\mathcal{K}$

$$
\text{basis} = \left\{\mathbf{u}_1, ..., \mathbf{u}_k\right\}
$$

- dimension of the column space is the number of vectors in the basis

$$
\dim(\mathcal{K}) = k
$$

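Note: although a vector $\mathbf{v}$ belongs to the column space $\mathcal{K}$, $\mathbf{v}$ has $p$ entries while $\mathcal{K}$ has dimension $k$

- any vector $\mathbf{v} \in \mathcal{K}$ (with $\mathbf{v} \in \mathbb{R}^{p}$) can be written as a linear combination of $\mathbf{u}_i$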

$$
\mathbf{v} = \sum_{i=1}^k \alpha_i \mathbf{u}_i = U \alpha
$$

where $\alpha \in \mathbb{R}^k$ is a unique vector, i.e., coefficients of basis vectors = coordinates of $\mathbf{v}$ in space $\mathcal{K}$

$$
\alpha =\begin{bmatrix}
\alpha_1\\ 
\vdots \\ 
\alpha_k
\end{bmatrix}= U^T\mathbf{v}
$$

## k = 2

- for k = 2, take as an example the orthogonal basis matrix $U \in \mathbb{R}^{p \times 2}$

    whose columns $\mathbf{u}_1$, $\mathbf{u}_2$ are standard basis vectors, $\mathbf{u}_1 = \mathbf{e}_1, \mathbf{u}_2 = \mathbf{e}_2 \in \mathbb{R}^p$ 
$$
U = \begin{bmatrix} 
\mathbf{u}_1 & \mathbf{u}_2
\end{bmatrix}
=
 \begin{bmatrix}
\mathbf{e}_1 & \mathbf{e}_2
\end{bmatrix}
=\begin{bmatrix}
1 & 0\\ 
0 & 1\\ 
0 & 0\\ 
\vdots  & \vdots\\ 
0 & 0
\end{bmatrix}
$$


- definition of the standard basis vector $\mathbf{e}_i \in \mathbb{R}^p$, 

$$[\mathbf{e}_i]_j = \left\{\begin{matrix}
1 & j=i \\
0 & j \neq i \\
\end{matrix}\right.
$$

- any vector $\mathbf{v} \in \mathcal{K}$ (with $\mathbf{v} \in \mathbb{R}^p$) can be written as a linear combination of $\mathbf{e}_1$ and $\mathbf{e}_2$

$$
\mathbf{v}  = \sum_{i=1}^2 \alpha_i \mathbf{e}_i = \alpha_1 \mathbf{e}_1 + \alpha_2 \mathbf{e}_2 =  U \alpha =\begin{bmatrix}
1 & 0\\ 
0 & 1\\ 
0 & 0\\ 
\vdots  & \vdots\\ 
0 & 0
\end{bmatrix} \begin{bmatrix}
\alpha_1\\ \alpha_2
\end{bmatrix}
= \begin{bmatrix}
\alpha_1\\  
\alpha_2\\
0 \\ 
\vdots \\
0 \\
\end{bmatrix}
$$

where $\alpha \in \mathbb{R}^2$ is a unique vector, i.e., coefficients of basis vectors = coordinates of $\mathbf{v} $ in space $\mathcal{K}$

$$
\alpha = U^T\mathbf{v}  = \begin{bmatrix}
\alpha_1\\  
\alpha_2\\
\end{bmatrix}
= \begin{bmatrix}
\left \langle \mathbf{e}_1,\mathbf{v}  \right \rangle\\  
\left \langle \mathbf{e}_2,\mathbf{v}  \right \rangle\\
\end{bmatrix}
= \begin{bmatrix}
 \mathbf{e}_1^T \mathbf{v} \\  
\mathbf{e}_2^T \mathbf{v} \\
\end{bmatrix}
$$

## orthonormal basis is not unique

- suppose we have an orthogonal basis matrix $U \in \mathbb{R}^{p \times k}$

$$
U^TU=I_k
$$

   take $G \in \mathbb{R}^{k \times k}$ as a rotation matrix or any other orthogonal matrix

$$
G^TG=GG^T=I_k
$$

let $\tilde U=UG$, then the columns $\tilde u_i$ of $\tilde U$ are also orthonormal vectors, since $\tilde U^T \tilde U = G^TU^TUG = G^TG = I_k$


in general $U \neq \tilde U$, but 

$$colspace(U) = colspace(\tilde U)$$

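- recall the $k = 2$ example with $u_1 = e_1$, $u_2 = e_2$,

    let $\tilde u_1 = \frac{e_1 + e_2}{\sqrt 2}$, $\tilde u_2 = \frac{e_1 - e_2}{\sqrt 2}$

    then $\tilde u_1, \tilde u_2$ are $u_1, u_2$ rotated by 45° in the $(e_1, e_2)$ plane, with the sign of the second vector flipped (still an orthogonal transformation)

    although $u_1 \neq \tilde u_1, u_2 \neq \tilde u_2$, we have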

$$Span\left\{ u_1, u_2 \right\} = Span\left\{ \tilde u_1, \tilde u_2 \right\}$$
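The example can be verified numerically. In this sketch `G` is the $2 \times 2$ orthogonal matrix that produces $\tilde u_1, \tilde u_2$ from $e_1, e_2$, and $p = 4$ is an arbitrary illustrative choice.

```python
import numpy as np

p = 4
U = np.eye(p)[:, :2]                          # columns u1 = e1, u2 = e2
G = np.array([[1.0,  1.0],
              [1.0, -1.0]]) / np.sqrt(2.0)    # orthogonal: G^T G = I_2
U_tilde = U @ G                               # columns (e1 + e2)/sqrt(2), (e1 - e2)/sqrt(2)

# Same column space: stacking both bases side by side does not raise the rank,
# and both bases give the same projection matrix.
stacked_rank = np.linalg.matrix_rank(np.hstack([U, U_tilde]))
P, P_tilde = U @ U.T, U_tilde @ U_tilde.T
```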

## orthogonal complement of $\mathcal{K}$: $\mathcal{K}^{\perp }$

- some information is lost when we project data $x_i \in \mathbb{R}^p$ onto the space $\mathcal{K} \subset \mathbb{R}^p$ ($\dim(\mathcal{K}) = k < p$); by examining the orthogonal complement of $\mathcal{K}$: $\mathcal{K}^{\perp}$, we can characterize the lost information

- take vector $\mathbf{w} \in \mathbb{R}^p$ with $U^T\mathbf{w}=\mathbf{0}_k \in \mathbb{R}^k$

    the zero vector means the projection of $\mathbf{w}$ onto the column space $\mathcal{K}$ of $U$ carries no information

    that is, $\mathbf{w} $ is orthogonal to every column of $U$, and hence to every linear combination of the columns,

    that is, $\mathbf{w} $ is orthogonal to any vector $\mathbf{v}$ that lives in $\mathcal{K}$ 

$$
\left \langle \mathbf{w} , \mathbf{v}  \right \rangle=\mathbf{w} ^T\mathbf{v} =0
$$

thus, $\mathbf{w} $ lives in the orthogonal complement of space $\mathcal{K}$: $\mathcal{K}^{\perp }$, i.e., the **null space (kernel)** of $U^T$ 

- for any data $\mathbf{x}  = \mathbf{w}+\mathbf{v}  \in  \mathbb{R}^p$, where $\mathbf{v}$ lives in $\mathcal{K}$ and $\mathbf{w}$ lives in $\mathcal{K}^{\perp}$,

$$
UU^T(\mathbf{w}+\mathbf{v} )=\mathbf{v} 
$$

- conversely, for $w = x - UU^Tx$, show $w \in \mathcal{K}^{\perp}$ by showing $U^Tw = \mathbf{0}$


    $U^Tw = U^T(x - UU^Tx)=U^Tx - U^TUU^Tx=U^Tx - IU^Tx=\mathbf{0}$

## nice property: energy of data can be decomposed into a part that lives in $\mathcal{K}$ and a part that lives in $\mathcal{K}^{\perp }$

- for data vector $x \in  \mathbb{R}^p$

$$
\begin{align}
\text{energy/variance of } x 
&= \left \| x \right \|_2^2 \quad (\text{squared } l_2 \text{ norm of } x) \\[1em]
&= \left \| v + w\right \| _2^2  \\[1em]
&= \left \| v\right \| _2^2  + \left \|  w\right \| _2^2  + 2\left \langle v,w \right \rangle \\[1em]
&= \left \| v\right \| _2^2  + \left \|  w\right \| _2^2 \quad (\left \langle v,w \right \rangle = 0 \text{ since } v \in \mathcal{K},\ w \in \mathcal{K}^{\perp}) \\[1em]
&= \left \| UU^Tx\right \| _2^2  + \left \|  (I-UU^T)x\right \| _2^2
\end{align}
$$

since $x = v + w = UU^Tx + w$, $w = x - UU^Tx = (I-UU^T)x$

- for vector $\mathbf{v} \in \mathcal{K}$

    energy/variance of $\mathbf{v}$ is the squared $l_2$ norm of $\mathbf{v}$, which equals the squared $l_2$ norm of $\alpha$
    
    since $\mathbf{v} = U \alpha$, $U^TU = I_k$, $\alpha=U^T\mathbf{v}$
    
    $$
    \left \| \mathbf{v} \right \|_2^2 = \left \| U \alpha \right \|_2^2 = (U\alpha )^T(U\alpha ) =\alpha ^TU^TU\alpha =\alpha ^T\alpha =\left \| \alpha  \right \|_2^2 =\left \| U^T\mathbf{v} \right \|_2^2 
    $$
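A numerical sanity check of the energy decomposition. This sketch assumes a random orthonormal basis (from a QR decomposition) and a random data vector; the seed and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
p, k = 5, 2
U, _ = np.linalg.qr(rng.normal(size=(p, k)))   # U^T U = I_k
x = rng.normal(size=p)

alpha = U.T @ x      # coordinates of the projection in K
v = U @ alpha        # part of x that lives in K
w = x - v            # part of x that lives in K-perp

energy_x = x @ x
energy_v = v @ v
energy_w = w @ w
```

The cross term $\langle v, w \rangle$ vanishes, so `energy_x` should equal `energy_v + energy_w`, and `energy_v` should equal `alpha @ alpha`.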


# orthogonal projection

## orthogonal projection matrix onto column space: $UU^T$

$UU^T \in \mathbb{R}^{p \times p}$ is an **orthogonal projection matrix** onto space $\mathcal{K}$

suppose we have data vector $\mathbf{x} \in \mathbb{R}^p$

we want to project p-D data $\mathbf{x}$ onto k-D space $\mathcal{K}$

$$
\mathcal{K} = \text{colspace}(U)
$$

where $U \in \mathbb{R}^{p \times k}$ is an orthogonal basis matrix

- The orthogonal projection of vector $\mathbf{x}$ onto space $\mathcal{K}$: $P_{\mathcal{K}}(\mathbf{x})$  is the vector $\mathbf{g} \in \mathcal{K}$ that minimizes the Euclidean distance between $\mathbf{x}$ and $\mathbf{g}$

$$
P_{\mathcal{K}}(\mathbf{x}) = \underset{\mathbf{g} \in \mathcal{K}}{\arg \min} \left \| \mathbf{x}-\mathbf{g} \right \|_2^2 
$$

where $\left \| \mathbf{x}-\mathbf{g} \right \|_2^2$ is the lost info, $\mathbf{x}-\mathbf{g}$ is error vector


- since $\mathbf{g} \in \mathcal{K}$, $\exists \alpha \in \mathbb{R}^k$ with $\mathbf{g} = U\alpha$ 

    then we change from **finding $\mathbf{g}$ to finding $\alpha$** that minimizes the Euclidean distance between $\mathbf{x}$ and $U\alpha$, with $P_{\mathcal{K}}(\mathbf{x}) = U\hat \alpha$ for the minimizer $\hat \alpha$

$$
P_{\mathcal{K}}(\mathbf{x}) = \underset{\alpha \in \mathbb{R}^k}{\arg \min} \left \| \mathbf{x}- U\alpha\right \|_2^2
$$

- optimal $\hat \alpha$ is just the least squares solution

$$
\hat \alpha = (U^TU)^{-1}U^T\mathbf{x} = I_k^{-1}U^T\mathbf{x} = U^T\mathbf{x}
$$

plug in $\hat \alpha$ to $P_{\mathcal{K}}(\mathbf{x})$:

$$
P_{\mathcal{K}}(\mathbf{x}) = U(U^T\mathbf{x}) = UU^T\mathbf{x} \in \mathbb{R}^p
$$

## orthogonal projection matrix onto complement space: $I_p - UU^T$

orthogonal projection matrix onto **orthogonal complement** of space $\mathcal{K}$: $\mathcal{K}^{\perp}$ is $I_p - UU^T$

- The orthogonal projection of vector $\mathbf{x}$ onto orthogonal complement space $\mathcal{K}^{\perp}$ is a vector $\in \mathbb{R}^p$

$$
P_{\mathcal{K}^{\perp}}(\mathbf{x})  = (I_p - UU^T)\mathbf{x}  \in \mathbb{R}^p
$$
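The two projection matrices can be sketched as follows. The comparison against random points of $\mathcal{K}$ is only an empirical illustration of the arg min property, not a proof; the seed and sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
p, k = 5, 2
U, _ = np.linalg.qr(rng.normal(size=(p, k)))   # orthonormal basis of K
x = rng.normal(size=p)

P = U @ U.T                # orthogonal projection matrix onto K
P_perp = np.eye(p) - P     # orthogonal projection matrix onto K-perp
g_star = P @ x             # P_K(x)
error = P_perp @ x         # error vector x - P_K(x)

# No other point U @ alpha of K should be closer to x than g_star.
best = np.linalg.norm(x - g_star)
others = [np.linalg.norm(x - U @ rng.normal(size=k)) for _ in range(100)]
```

Both matrices are symmetric and idempotent ($P^2 = P$), which is what makes them orthogonal projections.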

# Norm

## vector norm: $l_q$ norms

for a vector $\beta \in \mathbb{R}^p$

$$
\beta=\begin{bmatrix}
\beta_1\\ 
\beta_2\\ 
\vdots \\ 
\beta_j\\ 
\vdots \\
\beta_p\\ 
\end{bmatrix}\
$$

- $l_q$ norms are **convex functions** for $q \geq 1$ (for $0 < q < 1$ the formula below is neither convex nor a true norm)

$$
\left \| \beta \right \|_q=\left ( \sum_{j=1}^p \left | \beta_j \right |^q \right )^{\frac{1}{q}}
$$

where $\left | \beta_{j} \right |$ is absolute value of the jth entry of  $\beta$

- $l_q$ norm $q$-powered

$$
\left \| \beta \right \|_q^q=\sum_{j=1}^p \left | \beta_j \right |^q 
$$

- $l_2$ norm: $$\left \| \beta \right \|_2=\sqrt{\sum_{j=1}^p  \beta_j ^2}$$


- $l_2$ norm squared: $$\left \| \beta \right \|_2^2 = \sum_{j=1}^p \beta_j^2$$


- $l_1$ norm: $$\left \| \beta \right \|_1= \sum_{j=1}^p \left | \beta_j \right |$$


- $l_0$ "norm": not convex and not a true norm, counts the number of non-zero elements in a vector

$$\left \| \beta \right \|_0= \sum_{j=1}^p \mathbb{1}(\beta_j \neq 0)$$
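A small sketch of these definitions. `lq_norm` and `l0_count` are hypothetical helper names; the $q = 0.5$ case is included only to illustrate that the triangle inequality fails below $q = 1$.

```python
import numpy as np

def lq_norm(beta, q):
    """(sum_j |beta_j|^q)^(1/q); a true norm only for q >= 1."""
    return np.sum(np.abs(beta) ** q) ** (1.0 / q)

def l0_count(beta):
    """Number of non-zero entries of beta."""
    return int(np.count_nonzero(beta))

beta = np.array([3.0, 0.0, -4.0])
l2 = lq_norm(beta, 2)       # sqrt(9 + 16)
l1 = lq_norm(beta, 1)       # 3 + 4
l0 = l0_count(beta)         # two non-zero entries

# Triangle inequality check for q = 0.5 on two unit vectors
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
lhs = lq_norm(a + b, 0.5)                    # (1 + 1)^2 = 4
rhs = lq_norm(a, 0.5) + lq_norm(b, 0.5)      # 1 + 1 = 2
```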

## Matrix norm: Frobenius norm

- for a matrix $A \in \mathbb{R}^{m \times n}$

    Frobenius norm is the $l_2$ norm of the matrix flattened into a vector, i.e., the square root of the sum of squares of every entry in the matrix

$$
\left \| A \right \|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n A_{ij}^2} = \sqrt{trace(AA^T)} = \sqrt{\left \langle A,A \right \rangle}
$$

- property: multiplying a matrix $A$ by an orthonormal matrix on the left ($U$) or on the right ($V^T$) leaves the Frobenius norm of $A$ unchanged

$$
\left \| A \right \|_F = \left \| UA \right \|_F = \left \| AV^T \right \|_F
$$
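The equivalent expressions for the Frobenius norm can be compared directly; the matrix below is an arbitrary illustrative example.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

by_entries = np.sqrt(np.sum(A ** 2))        # sqrt of the sum of squared entries
by_trace = np.sqrt(np.trace(A @ A.T))       # sqrt(trace(A A^T))
by_flatten = np.linalg.norm(A.ravel(), 2)   # l2 norm of the flattened matrix
by_numpy = np.linalg.norm(A, "fro")         # NumPy's built-in Frobenius norm
```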

# rank of matrix

- rank of a matrix $X$ equals both its column rank and its row rank

    $\text{rank}(X)=\text{row rank}(X)=\text{column rank}(X)$
 
 
- rank of a matrix $X \in \mathbb{R}^{m \times n}$ is the maximum number of **linearly independent** columns (equivalently, rows) of the matrix

    $$
    \text{rank}(X)\leq \min(m,n)
    $$


- a matrix has full rank if

    $$
    \text{rank}(X) = \min(m,n)
    $$


- properties of rank

    - for matrix $X \in \mathbb{R}^{m \times n}$,
    
    $$\text{rank}(X)=rank(X^T)$$
    
    - for matrix $X \in \mathbb{R}^{m \times n}, Y \in \mathbb{R}^{n \times p}$,
    
    $$rank(XY)\leq \min(\text{rank}(X), rank(Y))$$
    
    - for matrix $X,Y \in \mathbb{R}^{m \times n}$,
    
    $$rank(X+Y)\leq \text{rank}(X)+ rank(Y)$$
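The rank properties can be illustrated on small hand-picked matrices (chosen here purely for illustration) using `numpy.linalg.matrix_rank`.

```python
import numpy as np

rank = np.linalg.matrix_rank

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])    # second row = 2 * first row, so rank 1
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])         # rank 2
Z = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])    # rank 1, same shape as X

r_X, r_Xt = rank(X), rank(X.T)     # rank(X) = rank(X^T)
r_XY = rank(X @ Y)                 # <= min(rank(X), rank(Y))
r_sum = rank(X + Z)                # <= rank(X) + rank(Z)
```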

# Trace of matrix

- Trace of a square matrix $A \in \mathbb{R}^{n \times n}$ is sum of diagonal entries

$$
\text{tr}(A) = \sum_{i=1}^n A_{ii}
$$

- properties

    - $\text{tr}(A) =\text{tr}(A^T)$, $A \in \mathbb{R}^{n \times n}$

    - $\text{tr}(aA) =a\text{tr}(A)$, $a \in \mathbb{R}$

    - $\text{tr}(A+B) =\text{tr}(A)+\text{tr}(B)$, $A, B \in \mathbb{R}^{n \times n}$

    - $\text{tr}(AB) =\text{tr}(BA)$, for $A \in \mathbb{R}^{n \times m}$, $B \in \mathbb{R}^{m \times n}$, so that both $AB$ and $BA$ are square matrices

    - $\text{tr}(ABC) =\text{tr}(BCA)=\text{tr}(CAB)$ (cyclic property), whenever all three products are defined and square
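A numerical check of the trace properties. The sketch uses random rectangular matrices whose shapes are chosen so that all three cyclic products are defined; the shapes and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 2))

t_abc = np.trace(A @ B @ C)    # trace of a 2x2 matrix
t_bca = np.trace(B @ C @ A)    # trace of a 3x3 matrix
t_cab = np.trace(C @ A @ B)    # trace of a 4x4 matrix

S = rng.normal(size=(3, 3))
T = rng.normal(size=(3, 3))
t_transpose = np.trace(S.T)        # equals tr(S)
t_sum = np.trace(S + T)            # equals tr(S) + tr(T)
t_scaled = np.trace(2.5 * S)       # equals 2.5 * tr(S)
```

The three cyclic products have different sizes, yet their traces agree.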


# gradient

## partial derivative

- definition of partial derivative: 

    define the standard basis vector $\mathbf{e}_i \in \mathbb{R}^d$ by $[\mathbf{e}_i]_{j}=\mathbb{1}(i=j)$, i.e., $\mathbf{e}_i = \begin{bmatrix}
0 \\
\vdots  \\
1 \\
\vdots \\
0 \\
\end{bmatrix}$ with the 1 in position $i$

    partial derivative of $f(\mathbf{x})$ w.r.t. $x_i$ evaluated at point $\mathbf{z}$ is

$$
\frac{\partial f(\mathbf{x})}{\partial x_i}\Big|_{\mathbf{x}=\mathbf{z}} = \lim_{\Delta h \to 0}\frac{f(\mathbf{z}+\Delta h\, \mathbf{e}_i)-f(\mathbf{z})}{\Delta h}
$$

## gradient

- gradient of a scalar-valued function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ w.r.t vector $\mathbf{\mathbf{x}} \in \mathbb{R}^d$

$$
\nabla_{\mathbf{x}} f=\begin{bmatrix}
\frac{\partial f}{\partial \mathbf{x}_1} \\
\vdots  \\
\frac{\partial f}{\partial \mathbf{x}_d} \\
\end{bmatrix}
$$

## Jacobian matrix

- Jacobian matrix $J \in \mathbb{R}^{m \times n}$ is the derivative of a vector-valued function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ w.r.t vector $\mathbf{x} \in \mathbb{R}^n$; its $k$th row is the transposed gradient of component $f_k$

$$
J = \begin{bmatrix}
\frac{\partial f}{\partial x_1} & \cdots  &  \frac{\partial f}{\partial x_n}\\
\end{bmatrix}
=  \begin{bmatrix}
\nabla^Tf_1 \\
\vdots  \\
\nabla^Tf_m \\
\end{bmatrix}
=\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots  &  \frac{\partial f_1}{\partial x_n}\\
\vdots  & \ddots  &  \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots  &  \frac{\partial f_m}{\partial x_n}\\
\end{bmatrix}
$$
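The limit definition of the partial derivative suggests a finite-difference approximation of the gradient and the Jacobian. This is a sketch only: `numerical_jacobian` is a hypothetical helper, and it uses central differences (a common, more accurate variant of the one-sided quotient in the definition) with an assumed step size `h`.

```python
import numpy as np

def numerical_jacobian(f, z, h=1e-6):
    """Approximate the Jacobian of f at z, one column per input coordinate."""
    z = np.asarray(z, dtype=float)
    m = np.atleast_1d(f(z)).size
    J = np.zeros((m, z.size))
    for i in range(z.size):
        e_i = np.zeros_like(z)
        e_i[i] = 1.0                       # standard basis vector e_i
        forward = np.atleast_1d(f(z + h * e_i))
        backward = np.atleast_1d(f(z - h * e_i))
        J[:, i] = (forward - backward) / (2.0 * h)
    return J

def f_vec(x):
    # f: R^2 -> R^2 with known Jacobian [[x1, x0], [cos(x0), 0]]
    return np.array([x[0] * x[1], np.sin(x[0])])

def g_scalar(x):
    # g: R^2 -> R with gradient 2x
    return x @ x

z = np.array([0.5, 2.0])
J = numerical_jacobian(f_vec, z)
grad = numerical_jacobian(g_scalar, z)[0]   # for a scalar function, the single row is the gradient
```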

## Hessian matrix

- for a vector-valued function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$, the 2nd order derivative w.r.t vector $\mathbf{x} \in \mathbb{R}^n$ is not a single matrix: each component $f_k$ has its own Hessian matrix $H(f_k) \in \mathbb{R}^{n \times n}$, and together they form a tensor in $\mathbb{R}^{m \times n \times n}$

    i.e., the Hessian matrix of component $f_k$ is the Jacobian matrix of its gradient: $H(f_k) = J(\nabla f_k)$

$$
H(f) = \left( H(f_1), \ldots, H(f_m) \right), \quad
\left[ H(f_k) \right]_{ij} = \frac{\partial^2 f_k}{\partial x_i \partial x_j}
$$

- Hessian matrix $H \in \mathbb{R}^{n \times n}$ is the 2nd order derivative of a scalar-valued function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ w.r.t vector $\mathbf{x} \in \mathbb{R}^n$; it is symmetric when the second partial derivatives are continuous


$$
H = J(\nabla f)
=
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots  &  \frac{\partial^2 f}{\partial x_1 \partial x_n}\\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots  &  \frac{\partial^2 f}{\partial x_2 \partial x_n}\\
\vdots  & \vdots  & \ddots  & \vdots  \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2}  & \cdots  &  \frac{\partial^2 f}{\partial x_n^2}\\
\end{bmatrix}
$$
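For a scalar-valued function the Hessian can be approximated by second-order central differences. The sketch below is illustrative: `numerical_hessian` is a hypothetical helper with an assumed step size, tested on the quadratic $f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$, whose Hessian is $A + A^T$.

```python
import numpy as np

def numerical_hessian(f, z, h=1e-4):
    """Approximate the n x n Hessian of a scalar function f at z."""
    z = np.asarray(z, dtype=float)
    n = z.size
    I = np.eye(n)                  # rows are the standard basis vectors
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (
                f(z + h * I[i] + h * I[j])
                - f(z + h * I[i] - h * I[j])
                - f(z - h * I[i] + h * I[j])
                + f(z - h * I[i] - h * I[j])
            ) / (4.0 * h * h)
    return H

A = np.array([[1.0, 2.0],
              [0.0, 3.0]])

def quadratic(x):
    return x @ A @ x               # f(x) = x^T A x

z = np.array([0.3, -0.7])
H = numerical_hessian(quadratic, z)
```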

# equivariant vs. invariant

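equivariant: a function is equivariant to a transformation if applying the transformation to the input causes the output to **undergo the same transformation**

invariant: a function is invariant to a transformation if applying the transformation to the input leaves the output **unchanged**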
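A concrete illustration, using permutation of the vector entries as the transformation (an example chosen for this sketch): the sum is permutation-invariant, while an elementwise function such as ReLU is permutation-equivariant.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)      # applied to every entry independently

x = np.array([3.0, -1.0, 2.0])
perm = np.array([2, 0, 1])         # a permutation of the indices

# Invariance: permuting the input does not change the output.
sum_original = np.sum(x)
sum_permuted = np.sum(x[perm])

# Equivariance: permuting the input permutes the output in the same way.
relu_then_perm = relu(x)[perm]
perm_then_relu = relu(x[perm])
```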