# Linear Algebra for Deep Learning

Sources:
1) Deep Learning - Ch2 Linear Algebra (Goodfellow et al.)

2) Deep Learning - Appendix A (Bishop and Bishop)

3) Essence of linear algebra - 3Blue1Brown - https://www.youtube.com/watch?v=kYB8IZa5AuE&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab&index=3

Linear algebra concerns several types of mathematical objects:

1) Scalars: Single numbers

2) Vectors: An ordered array of numbers (a matrix with a single column). In linear algebra, the entries of a vector are the scalar coordinates that weight the basis vectors defining some space.

3) Matrices: A 2D array of numbers

4) Tensors: An n-dimensional array of numbers

## Transpose

An important operation in linear algebra is the transpose. The transpose of a matrix is obtained by swapping its rows and columns. It can also be thought of as making a mirror image of the matrix across the main (or leading) diagonal. If $A$ is a matrix, the transpose of $A$ is denoted as $A^T$. It is defined as:

$(A^T)_{ij} = A_{ji}$

\begin{equation}
A = \begin{bmatrix}
A_1 & A_2 & A_3\\
A_4 & A_5 & A_6
\end{bmatrix}
\Rightarrow
A^T = \begin{bmatrix}
A_1 & A_4\\
A_2 & A_5\\
A_3 & A_6
\end{bmatrix}
\end{equation} 

*Note: the transpose of a column vector is a row vector, and the transpose of a scalar is the scalar itself.
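
As a minimal sketch of the transpose, assuming NumPy as the array library (the library choice and the example values are not from the sources above):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
A_T = A.T                        # shape (3, 2): rows and columns swapped

# (A^T)_{ij} = A_{ji}
assert A_T[2, 0] == A[0, 2]

v = np.array([[1], [2], [3]])    # column vector, shape (3, 1)
v_T = v.T                        # row vector, shape (1, 3)
```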

## Addition and Scalar Multiplication
Matrices can be added to each other, provided that they have the same dimensions. In this case the sum is obtained by adding the corresponding elements of the two matrices.

A matrix can be multiplied by a scalar by multiplying each element of the matrix by the scalar. A scalar can also be added to a matrix by adding the scalar to each element of the matrix.

In the context of deep learning, vectors or matrices can be added to matrices even if they have different dimensions through a process called broadcasting. This requires the vector or matrix of smaller dimension to be extended to match the dimensions of the larger matrix, given that the dimensions of the smaller one are compatible with the larger one.
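
A small sketch of these operations, assuming NumPy's broadcasting rules (the arrays are hypothetical example values):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])    # shape (2, 3)
b = np.array([10.0, 20.0, 30.0])   # shape (3,)

scaled = 2.0 * A     # scalar multiplication, element by element
shifted = A + 1.0    # the scalar is added to every element
C = A + b            # b is broadcast across both rows of A
```

Here `b` is compatible with `A` because its length matches the number of columns of `A`; a vector of length 2 would not broadcast against this shape without first being reshaped to `(2, 1)`.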

## Matrix Multiplication



The matrix multiplication of two matrices $A$ and $B$ is defined as:

\begin{equation}
C = AB
\end{equation}

where 

\begin{equation}
C_{ij} = \sum_k A_{i,k}B_{k,j}
\end{equation}

For this matrix product to be defined, the number of columns in $A$ must equal the number of rows in $B$, i.e. if $A$ is $m \times n$ and $B$ is $n \times q$, then $C$ has shape $m \times q$.

Note: there is also the element-wise or Hadamard product of two matrices, denoted by $A \odot B$, which is obtained by multiplying the corresponding elements of the two matrices. For example:

\begin{equation}
A = \begin{bmatrix}
A_1 & A_2\\
A_3 & A_4
\end{bmatrix}
B = \begin{bmatrix}
B_1 & B_2\\
B_3 & B_4
\end{bmatrix}
\Rightarrow
A \odot B = \begin{bmatrix}
A_1B_1 & A_2B_2\\
A_3B_3 & A_4B_4
\end{bmatrix}
\end{equation}
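
The difference between the matrix product and the Hadamard product can be sketched as follows, assuming NumPy (where `@` is the matrix product and `*` is element-wise); the values are illustrative only:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

matrix_product = A @ B   # C_ij = sum_k A_ik B_kj
hadamard = A * B         # element-wise product

# Shape rule: (m x n) @ (n x q) -> (m x q)
M = np.ones((2, 3))
N = np.ones((3, 4))
product_shape = (M @ N).shape
```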

Matrix multiplication is a linear transformation of space, i.e. a movement of the basis vectors.

(A transformation is linear if all lines remain lines and the origin remains in place. More formally, a transformation $L$ is linear if it satisfies the following properties:

1) $L(\textbf{v} + \textbf{w}) = L(\textbf{v}) + L(\textbf{w})$ - "Additivity"

2) $L(c \textbf{v}) = c \ L(\textbf{v})$ - "Scaling"

i.e. it keeps grid lines parallel and evenly spaced.)

The columns of a transformation matrix $A$ denote where the basis vectors $\hat{i}$ and $\hat{j}$ land after the transformation.

For example, a $90^\circ$ counterclockwise rotation of the basis is given by:

\begin{equation}
A = \begin{bmatrix}
0 & -1\\
1 & 0
\end{bmatrix}
\end{equation} 

Applying multiple transformations in one single 'action' is known as a "composition". It is the product of the individual matrices, read from right (the first transformation) to left (the final transformation).

In general, for matrix transformations $M_1$ and $M_2$, the order matters: $M_1 M_2 \neq M_2 M_1$. Applying $M_1$ first and then $M_2$ gives:

\begin{equation}
M_2 \times M_1 = \begin{bmatrix}
a & b\\
c & d
\end{bmatrix} \begin{bmatrix}
e & f\\
g & h
\end{bmatrix} = \begin{bmatrix}
ae + bg & af + bh\\
ce + dg & cf +dh
\end{bmatrix}
\end{equation}
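
A short sketch of composition, assuming NumPy; the rotation is the matrix given above, while the shear matrix is a hypothetical second transformation chosen for illustration:

```python
import numpy as np

rotation = np.array([[0, -1],
                     [1,  0]])   # 90 degree counterclockwise rotation
shear = np.array([[1, 1],
                  [0, 1]])       # hypothetical shear

# The first column of a matrix is where i_hat lands
i_hat = np.array([1, 0])
i_hat_rotated = rotation @ i_hat

# Read right to left: the rightmost matrix is applied first
rotate_then_shear = shear @ rotation
shear_then_rotate = rotation @ shear
```

The two compositions give different matrices, which is the non-commutativity noted above.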



## Cross Product

The cross product of two vectors $a$ and $b$ can be computed as the determinant of the matrix formed by concatenating the basis vectors, $\phi$, with $a$ and $b$:

\begin{equation}
a \times b = \det(\phi, a, b)
\end{equation}

For example in three dimensions:

\begin{equation}
a \times b = \begin{bmatrix}
a_1\\
a_2\\
a_3
\end{bmatrix} \times \begin{bmatrix}
b_1\\
b_2\\
b_3
\end{bmatrix} = 
det \left( \begin{bmatrix}
\hat{i} & \hat{j} & \hat{k}\\
a_1 & a_2 & a_3\\
b_1 & b_2 & b_3
\end{bmatrix} \right)
\end{equation}

*Note: the first row does not have to be the basis vectors. Replacing it with a general vector $n$ gives the scalar $n \cdot (a \times b)$, which is a linear function of $n$. This is the duality argument presented in https://www.youtube.com/watch?v=BaM7OCEm3G0&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab&index=11

\begin{equation}
a \times b = \begin{bmatrix}
\hat{i} \\ \hat{j} \\ \hat{k}\\
\end{bmatrix} \cdot \begin{bmatrix}
a_2b_3 - a_3b_2\\
a_3b_1 - a_1b_3\\
a_1b_2 - a_2b_1
\end{bmatrix}
\end{equation}

\begin{equation}
a \times b = ||a|| \ ||b|| \ \sin(\theta) \hat{n}
\end{equation}

where $\hat{n}$ is the unit vector perpendicular to both $a$ and $b$ (with its direction given by the right-hand rule) and $\theta$ is the angle between the two vectors.
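
The component formula and the geometric formula can be compared numerically. This is a sketch assuming NumPy's `np.cross`, with hypothetical example vectors:

```python
import numpy as np

a = np.array([1.0, 0.0, 0.0])
b = np.array([1.0, 1.0, 0.0])

c = np.cross(a, b)   # (a2b3 - a3b2, a3b1 - a1b3, a1b2 - a2b1)

# Angle between a and b, recovered from the dot product
norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
theta = np.arccos(a @ b / (norm_a * norm_b))

# ||a x b|| = ||a|| ||b|| sin(theta), and a x b is perpendicular to both inputs
magnitude_matches = np.isclose(np.linalg.norm(c), norm_a * norm_b * np.sin(theta))
is_perpendicular = np.isclose(c @ a, 0.0) and np.isclose(c @ b, 0.0)
```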

## Dot Product

The dot product of two vectors $a$ and $b$ of the same dimensionality is the matrix product $a^Tb$:

\begin{equation}
a \cdot b = \sum_{i =1}^{n} a_ib_i
\end{equation}

\begin{equation}
a \cdot b = ||a|| \ ||b|| \ \cos(\theta)
\end{equation}

where $\theta$ is the angle between the two vectors.

For example:

\begin{equation}
a \cdot b = \begin{bmatrix}
a_1\\
a_2
\end{bmatrix} \cdot \begin{bmatrix}
b_1\\
b_2 
\end{bmatrix} = a_1b_1 + a_2b_2
\end{equation}


This is essentially the projection of $a$ onto $b$, scaled by the length of $b$, giving a scalar value which lies on a number line (duality: any time there is a linear transformation from some space to the number line, that transformation is associated with a unique vector in that space, the dual vector).
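
A small numerical sketch of the two dot product formulas and the projection interpretation, assuming NumPy and hypothetical example vectors:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([2.0, 0.0])

dot = a @ b                                          # sum_i a_i b_i
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))

# Length of the projection of a onto the direction of b
projection_length = dot / np.linalg.norm(b)
```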

## Properties of matrix multiplication

Distributive: 

$A(B + C) = AB + AC$

Associative:

$ A(BC) = (AB)C$

Not commutative (in general):

$ AB \neq BA$

*Except for the dot product between two vectors, which is commutative: $x^Ty = y^Tx$

The transpose of a matrix product has the form:
$(AB)^T = B^TA^T$

(which can be verified by writing out the indices)
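
These properties can be spot-checked numerically. The sketch below assumes NumPy and uses random matrices, so it illustrates the identities rather than proving them:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
C = rng.normal(size=(3, 3))

distributive = np.allclose(A @ (B + C), A @ B + A @ C)
associative = np.allclose(A @ (B @ C), (A @ B) @ C)
commutative = np.allclose(A @ B, B @ A)            # False for generic matrices
transpose_rule = np.allclose((A @ B).T, B.T @ A.T)

# The dot product of two vectors is commutative
x, y = rng.normal(size=3), rng.normal(size=3)
dot_commutes = np.isclose(x @ y, y @ x)
```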

## Systems of linear equations in linear algebra

A system of linear equations can be written compactly using linear algebra notation:

\begin{equation}
Ax = b
\end{equation}

where $A_{i, :} x = \sum_{k = 1}^{n} A_{i, k}x_k = b_i$

## Identity and Inverse matrices

Linear algebra offers a convenient way of solving such linear equations using matrix inversion, which enables analytical solutions to be found for many values of $A$. 

Essential to understanding matrix inversion is the identity matrix, $I_n$ where $I_n \in \mathbb{R}^{n \times n}$ and by definition:

\begin{equation}
\forall x \in \mathbb{R}^{n}, I_n x = x
\end{equation}

The structure of the identity matrix is that all of the entries along the main diagonal are 1 and all of the other entries are 0, i.e.

$I_2 = \begin{bmatrix}
1 & 0\\
0 & 1
\end{bmatrix}$

The matrix inverse, denoted as $A^{-1}$ is defined such that:

\begin{equation}
A^{-1}A = I_n
\end{equation}

This allows the linear equations to be solved:

$Ax = b$

$A^{-1}Ax = A^{-1}b$

$I_nx = A^{-1}b$

$x = A^{-1}b$
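
As a sketch of this derivation in code, assuming NumPy and a hypothetical invertible $2 \times 2$ system:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x_inv = np.linalg.inv(A) @ b      # x = A^{-1} b, following the steps above
x_solve = np.linalg.solve(A, b)   # solves Ax = b without forming A^{-1} explicitly
```

Both give the same answer here; in numerical work `np.linalg.solve` is usually preferred over forming the inverse explicitly.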


For $A^{-1}$ to exist, $Ax = b$ must have exactly one solution for every value of $b$. It is also possible for the system of equations to have no solutions or infinitely many solutions for some values of $b$. It is not possible, however, to have more than one but fewer than infinitely many solutions for a particular $b$: if both $x$ and $y$ are solutions, then

$ z = \alpha x + (1 - \alpha)y $ is also a solution for any real $\alpha$.

To find how many solutions the equation has, think of the columns of $A$ as specifying different directions that can be travelled from the origin, then determine how many ways there are of reaching $b$. This means each element of $x$ specifies how far to travel in each of these directions to reach $b$, i.e. $x_i$ specifies how far to travel in the direction of column $i$:

$Ax = \sum_{i} x_i A_{:, i}$

In general, this operation is called a linear combination, which is defined formally as multiplying each vector in a set $\{ v^{(1)},...,v^{(n)} \} $ by a scalar coefficient and adding the results:

$ \sum_{i} c_{i} v^{(i)}$

The **span** of a set of vectors is the set of all points obtainable by a linear combination of the original vectors. Therefore determining whether there is a solution to the linear equation amounts to testing whether $b$ is in the span of the columns of $A$. This particular span is known as the column space or the range of $A$. For the system to have a solution for all values of $b \in \mathbb{R}^m$, the column space of $A$ must be all of $\mathbb{R}^m$. If any point in $\mathbb{R}^m$ is excluded from the column space of $A$, that point is a potential value of $b$ that has no solution. 

The requirement that the column space of $A$ be all of $\mathbb{R}^m$ means that $A$ must have at least $m$ columns. This condition is necessary but not sufficient, because some of the columns may be redundant (e.g. repeated columns). For example, consider a $3 \times 2$ matrix. The target $b$ is 3-D, but $x$ is only 2-D, so modifying the value of $x$ at best enables us to trace out a 2-D plane within $\mathbb{R}^3$. The equation has a solution if and only if $b$ lies on that plane.

The redundancy described above is formally known as linear dependence. A set of vectors is linearly independent if no vector in the set is a linear combination of the other vectors. If a new vector is added to a set that is a linear combination of the other vectors, that vector does not add any points to the set's span. This means that for the column space of the matrix to encompass all of $\mathbb{R}^m$, the matrix must contain at least one set of $m$ linearly independent columns in order for the linear equation to have a solution for every value of $b$. 

A set of vectors $\{ x_1, ..., x_n \}$ is said to be linearly independent if the relation:

\begin{equation}
\sum_n \alpha_n x_n = 0
\end{equation}

holds only if all $\alpha_n = 0$. This implies that none of the vectors can be expressed as a linear combination of the remainder. The rank of a matrix is the maximum number of linearly independent rows (or equivalently, the maximum number of linearly independent columns).

Additionally, for the matrix to have an inverse, it must be ensured that the linear equation has at most one solution for each value of $b$. This means that $A$ must have at most $m$ columns; otherwise, there is more than one way of parameterizing each solution. 

All together, this means that the matrix must be square (i.e. $m = n$) and that all of its columns must be linearly independent. 

A square matrix with linearly dependent columns is known as **singular**.

If $A$ is not square or is square but singular, solving the equations is possible but not through matrix inversion. 

Note: it is also possible to define an inverse matrix that is multiplied on the right:

\begin{equation}
 AA^{-1} = I
\end{equation}

For square matrices, the left and right inverse are equal.
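
The column space argument can be sketched numerically. This assumes NumPy's `np.linalg.matrix_rank`, and the matrices are hypothetical examples: $b$ lies in the column space of $A$ exactly when appending $b$ as an extra column does not increase the rank.

```python
import numpy as np

A_full = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
A_singular = np.array([[1.0, 2.0],
                       [2.0, 4.0]])   # second column = 2 * first column

rank_full = np.linalg.matrix_rank(A_full)
rank_singular = np.linalg.matrix_rank(A_singular)

def in_column_space(A, b):
    """True if b is a linear combination of the columns of A."""
    augmented = np.column_stack([A, b])
    return np.linalg.matrix_rank(augmented) == np.linalg.matrix_rank(A)

b_reachable = np.array([1.0, 2.0])     # a multiple of the first column
b_unreachable = np.array([1.0, 0.0])   # off the line spanned by the columns
```

With `A_singular`, `in_column_space` is true for `b_reachable` (infinitely many solutions) and false for `b_unreachable` (no solution), which matches the two failure modes described above.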


## Norms

The size of a vector is calculated using the **norm** (intuitively this is the distance from the origin to $x$). Formally, the $L^{p}$ norm is defined as:

\begin{equation}
||x||_p = (\sum_i |x_i|^{p})^{\frac{1}{p}}
\end{equation}

for $p \in \mathbb{R}$, $p \geq 1$

Rigorously, a norm is any function $f$ that satisfies:

1) $f(x) = 0$ $\Rightarrow$ $x = 0$
2) $f(x + y) \leq f(x) + f(y)$ (triangle inequality)
3) $\forall \alpha \in \mathbb{R}$, $f(\alpha x) = |\alpha|f(x)$

The $L^2$ norm is known as the Euclidean norm (i.e. the Euclidean distance from the origin to the point identified by $x$). Given its frequent use in machine learning, it is often denoted as $||x||$. It is also common to measure the size of a vector using the squared $L^2$ norm, which is $||x||_2^2 = x^Tx$.

In many ML applications it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. In these cases the $L^1$ norm is used: every time an element of $x$ moves away from 0 by $\epsilon$, the $L^1$ norm increases by $\epsilon$. 

The max norm, $L^\infty$, is the absolute value of the element with the largest magnitude in the vector:

\begin{equation}
||x||_\infty = \max_i |x_i|
\end{equation}

The **Frobenius norm** measures the size of a matrix (analogous to the $L^2$ norm of a vector):

\begin{equation}
||A||_F = \sqrt{\sum_{i,j} A^2_{i, j}}
\end{equation}

The dot product of two vectors can be rewritten in terms of norms:

\begin{equation}
x^Ty = ||x||_2||y||_2 \cos (\theta)
\end{equation}

where $\theta$ is the angle between $x$ and $y$
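
The norms above can be computed as follows. This is a sketch assuming NumPy's `np.linalg.norm` and its `ord` parameter, with hypothetical example values:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

l1 = np.linalg.norm(x, ord=1)            # sum of absolute values
l2 = np.linalg.norm(x)                   # Euclidean norm (the default)
l_inf = np.linalg.norm(x, ord=np.inf)    # largest absolute value
squared_l2 = x @ x                       # x^T x

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
frobenius = np.linalg.norm(A, ord="fro")
```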

## Special kinds of matrices and vectors

**Diagonal** matrices consist of nonzero entries only along the main diagonal. Formally, a matrix $D$ is diagonal if and only if $D_{i, j} = 0$ for all $i \neq j$. $diag(v)$ denotes a square diagonal matrix whose diagonal entries are given by the entries of the vector $v$. Multiplication by a diagonal matrix is computationally efficient (i.e. $diag(v)x = v \odot x$). Inversion of a square diagonal matrix is also efficient; the inverse exists only if every diagonal entry is nonzero, in which case $diag(v)^{-1} = diag([\frac{1}{v_1},...,\frac{1}{v_n}]^T)$. 

In many cases, a general ML algorithm may be derived in terms of arbitrary matrices, but a less expensive (and less descriptive) algorithm can be obtained by restricting some matrices to be diagonal. 

**Symmetric matrix**: $A = A^T$

**Unit vector**: $||x||_2 = 1$

**Orthogonal vectors**: Two vectors $x$ and $y$ are orthogonal to each other if $x^Ty = 0$. (If both vectors have nonzero norm, this means they are at $90^\circ$ to each other). 

In $\mathbb{R}^{n}$, at most $n$ vectors may be mutually orthogonal with nonzero norm.

**Orthonormal**: Vectors that are orthogonal and also have unit norm.

**Orthogonal matrix**: A square matrix whose rows are mutually **orthonormal** and whose columns are mutually **orthonormal**:

\begin{equation}
A^TA = AA^T = I
\end{equation}

which implies that $A^{-1} = A^T$.

Orthogonal matrices are of interest as their inverse is very cheap to compute. 
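
A brief sketch of the diagonal and orthogonal shortcuts, assuming NumPy; the vector `v` and the rotation angle are hypothetical example values:

```python
import numpy as np

v = np.array([2.0, 4.0, 5.0])
x = np.array([1.0, 1.0, 2.0])

# diag(v) x equals the element-wise product of v and x
diag_product_matches = np.allclose(np.diag(v) @ x, v * x)
# The inverse of diag(v) is diag(1 / v) when every entry of v is nonzero
diag_inverse_matches = np.allclose(np.linalg.inv(np.diag(v)), np.diag(1.0 / v))

theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a rotation is an orthogonal matrix

is_orthogonal = np.allclose(Q.T @ Q, np.eye(2)) and np.allclose(Q @ Q.T, np.eye(2))
inverse_is_transpose = np.allclose(np.linalg.inv(Q), Q.T)
```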

## Eigendecomposition


Decomposition of matrices can reveal information about their functional properties that is not obvious from the representation of the matrix as an array of elements, in a similar way that decomposing integers into their prime factors can help in their analysis.

Eigendecomposition: decompose a matrix into a set of eigenvalues and eigenvectors.

An eigenvector of a square matrix ($M \times M$) $A$ is a nonzero vector $v$ such that multiplication by $A$ alters only the scale of $v$:

(i.e. The vector remains on the same span (for example a line), but is scaled by a factor of $\lambda$, known as the eigenvalue)

\begin{equation}
Av = \lambda v
\end{equation}

where $\lambda$ is a scalar known as an eigenvalue corresponding to the eigenvector.

This can be viewed as a set of $M$ simultaneous homogeneous linear equations:


\begin{equation}
(A - \lambda I )\textbf{v} = 0
\end{equation}

A non-zero solution $v$ exists only if the determinant of the matrix $A - \lambda I$ is zero. This condition is known as the **characteristic equation**:
\begin{equation}
det(A - \lambda I ) = 0
\end{equation}

Because this equation is a polynomial of order $M$ (the dimension of the matrix) in $\lambda$, it must have $M$ solutions (though they need not all be distinct, and some may be complex). For a diagonalizable matrix (e.g. any real symmetric matrix), the rank of $A$ is equal to the number of non-zero eigenvalues. Note: an eigenvalue must be real for a real eigenvector to exist (e.g. a $90^\circ$ rotation in 2D has no real eigenvectors, because its eigenvalues are complex).

If $v$ is an eigenvector of $A$ then so is any rescaled vector $sv$ for $s \in \mathbb{R}$, $s \neq 0$. Moreover, $sv$ has the same eigenvalue, and for this reason only unit eigenvectors are usually considered. 

#### Eigenbasis: side note

For diagonal matrices, all of the basis vectors are eigenvectors and the eigenvalues are the diagonal elements of the matrix. 

Diagonal matrices are easier to compute with, as repeated multiplication only requires multiplying the easy-to-compute scale factors along the diagonal.

A change of basis to an eigenbasis (i.e. a basis of eigenvectors) guarantees that the resulting output matrix is diagonal. There are benefits in efficiency to changing to an eigenbasis when performing intensive calculations (https://www.youtube.com/watch?v=PFDu9oVAE-g&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab&index=14).

#### Facts about eigenvalues:

1) $\frac{1}{M} Tr(A) = \frac{1}{M} \sum_{i=1}^{M}\lambda_i$ (Mean of eigenvalues)

2) $det(A) = \prod_{i=1}^{M}\lambda_i$ (Product of eigenvalues)

3) If two numbers have a mean $m$ and product $p$, then the numbers are $m \pm \sqrt{m^2 - p}$. For a $2 \times 2$ matrix this gives a quick way to find the eigenvalues: $\lambda_1, \lambda_2 = m \pm \sqrt{m^2 - p}$, with $m = \frac{1}{2}Tr(A)$ and $p = det(A)$.
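
These facts can be checked on a small example. The sketch assumes NumPy's `np.linalg.eig` and a hypothetical $2 \times 2$ matrix whose eigenvalues are 2 and 5:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)   # eigenvectors are the columns

# Each column v satisfies A v = lambda v
satisfies_definition = all(
    np.allclose(A @ eigenvectors[:, i], eigenvalues[i] * eigenvectors[:, i])
    for i in range(2)
)

# Quick 2 x 2 formula from the mean and product of the eigenvalues
m = np.trace(A) / 2
p = np.linalg.det(A)
quick = np.array([m - np.sqrt(m**2 - p), m + np.sqrt(m**2 - p)])
```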


If matrix $A$ has $n$ linearly independent eigenvectors $\{ v^{(1)},...,v^{(n)} \}$ with corresponding eigenvalues $\{ \lambda^{(1)},...,\lambda^{(n)} \}$, then all of the eigenvectors can be concatenated to form a matrix, $V$ with one eigenvector per column: $V = [v^{(1)},...,v^{(n)}]$ and the eigenvalues can be concatenated to form a vector $\lambda = [\lambda^{(1)},...,\lambda^{(n)}]$. The **eigendecomposition** of $A$ is then given by:

\begin{equation}
A = V diag(\lambda) V^{-1}
\end{equation}

Not every matrix can be decomposed into eigenvalues and eigenvectors. In some cases, the decomposition exists but may involve complex rather than real numbers. 

However, every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues:

\begin{equation}
A = Q \Lambda Q^T
\end{equation}

where $Q$ is an orthogonal matrix composed of eigenvectors of $A$, and $\Lambda$ is a diagonal matrix. The eigenvalue $\Lambda_{i, i}$ is associated with the eigenvector in column $i$ of $Q$, denoted as $Q_{:, i}$. As $Q$ is an orthogonal matrix, $A$ can be thought of as scaling space by $\lambda_i$ in direction $v^{(i)}$.

While any real symmetric matrix $A$ is guaranteed to have an eigendecomposition, the eigendecomposition may not be unique. If any two or more eigenvectors share the same eigenvalue, then any set of orthogonal vectors lying in their span are also eigenvectors with that eigenvalue, and a $Q$ could be chosen using those eigenvectors instead. By convention, the entries of $\Lambda$ are sorted in descending order. Under this convention, the eigendecomposition is unique only if all the eigenvalues are unique.

The eigendecomposition reveals many useful facts about the matrix:

1) The matrix is singular if and only if any of the eigenvalues are zero.
2) The eigendecomposition of a real symmetric matrix can also be used to optimize quadratic expressions of the form $f(x) = x^TAx$ subject to $||x||_2 = 1$.
3) Whenever $x$ is equal to an eigenvector of $A$, $f$ takes on the value of the corresponding eigenvalue. 
4) The maximum value of $f$ within the constraint region is the maximum eigenvalue and its minimum value within the constraint region is the minimum eigenvalue. 
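
The symmetric case can be sketched as follows, assuming NumPy's `np.linalg.eigh` (which expects a symmetric input and returns eigenvalues in ascending order, the opposite of the descending convention mentioned above). The matrix and the random samples are hypothetical:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # real symmetric

eigenvalues, Q = np.linalg.eigh(A)   # columns of Q are orthonormal eigenvectors
reconstructed = Q @ np.diag(eigenvalues) @ Q.T

def f(x):
    return x @ A @ x                 # quadratic form x^T A x

# At a unit eigenvector, f takes the value of the corresponding eigenvalue
f_at_eigenvectors = np.array([f(Q[:, i]) for i in range(2)])

# On random unit vectors, f stays between the smallest and largest eigenvalue
rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 2))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
values = np.einsum("ij,jk,ik->i", samples, A, samples)
```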

**Positive semidefinite matrix**: A matrix whose eigenvalues are all positive or zero

Guarantees that $\forall x, x^TAx \geq 0$


**Positive definite matrix**: A matrix whose eigenvalues are all positive

Additionally guarantees that $x^TAx = 0 \Rightarrow x = 0$

**Negative definite**: A matrix whose eigenvalues are all negative

**Negative semidefinite**: A matrix whose eigenvalues are all negative or zero

## Singular value decomposition (SVD)

SVD provides another way to factorize a matrix, into **singular vectors** and **singular values**. It allows some of the same information to be revealed as in eigendecomposition but it is more generally applicable; every real matrix has a singular value decomposition but the same is not true of eigenvalue decomposition (i.e. if a matrix is not square, the eigendecomposition is not defined and SVD must be used instead).

In eigendecomposition, a matrix $A$ is analyzed to discover a matrix $V$ of eigenvectors and a vector of eigenvalues $\lambda$ that can be used to write $A$ as:

\begin{equation}
A = V diag(\lambda)V^{-1}
\end{equation}

The SVD is similar, except $A$ will be written as the product of three matrices:

\begin{equation}
A = UDV^T
\end{equation}

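If $A$ is an $m \times n$ matrix, then $U$ is defined to be an $m \times m$ matrix, $D$ is defined to be an $m \times n$ matrix, and $V$ to be an $n \times n$ matrix. Each of these matrices has a special structure: $U$ and $V$ are both orthogonal matrices, and $D$ is a diagonal matrix (not necessarily square).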

The elements along the diagonal of $D$ are known as the singular values of the matrix $A$. The columns of $U$ are known as the **left-singular vectors**. The columns of $V$ are known as the **right-singular vectors**.

The SVD of $A$ can be interpreted as the eigendecomposition of functions of $A$. The left-singular vectors of $A$ are the eigenvectors of $AA^T$. The right-singular vectors of $A$ are the eigenvectors of $A^TA$. The nonzero singular values of $A$ are the square roots of the eigenvalues of $A^TA$ and $AA^T$.
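
A sketch of the SVD of a non-square matrix, assuming NumPy's `np.linalg.svd` (which returns the singular values as a 1-D array and $V^T$ rather than $V$); the matrix is a hypothetical example:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                          # m = 3, n = 2

U, s, Vt = np.linalg.svd(A, full_matrices=True)    # U: 3 x 3, s: (2,), Vt: 2 x 2

# Build the m x n matrix D with the singular values on its diagonal
D = np.zeros(A.shape)
D[:len(s), :len(s)] = np.diag(s)

reconstructed = U @ D @ Vt

# Squared singular values are the eigenvalues of A^T A
eigenvalues = np.linalg.eigvalsh(A.T @ A)           # ascending order
```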

One of the most useful features of SVD is that it can be used to partially generalize matrix inversion to nonsquare matrices.

## Moore-Penrose pseudoinverse

Matrix inversion is not defined for matrices that are not square. However, suppose we want a left-inverse $B$ of a matrix $A$, so that the following linear equation can be solved:

\begin{equation}
Ax = y
\end{equation}

with 

\begin{equation}
x = By
\end{equation}

Depending on the structure of the problem, it may not be possible to design a unique mapping from $A$ to $B$. 

If $A$ is taller than it is wide, then it is possible for this equation to have no solution. If $A$ is wider than it is tall, then there could be multiple solutions. 

The Moore-Penrose pseudoinverse enables headway to be made in these cases. The pseudoinverse of $A$ is defined as a matrix:

\begin{equation}
A^{+} = \lim_{\alpha \rightarrow 0^{+}}(A^TA + \alpha I)^{-1}A^T
\end{equation}

Practical algorithms for computing the pseudoinverse are based on the formula:

\begin{equation}
A^{+} = VD^{+}U^{T}
\end{equation}

where $U$, $D$ and $V$ are the singular value decomposition of $A$, and the pseudoinverse $D^{+}$ of a diagonal matrix $D$ is obtained by taking the reciprocal of its nonzero elements then taking the transpose of the resulting matrix.

When $A$ has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution $x = A^{+}y$ with minimal Euclidean norm $||x||_2$ among all possible solutions.

When $A$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives the $x$ for which $Ax$ is as close as possible to $y$ in terms of Euclidean norm $||Ax - y||_2$.
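
Both cases can be sketched with NumPy's `np.linalg.pinv`; the matrices are hypothetical examples. The last lines rebuild the pseudoinverse from the SVD, using the reduced SVD (`full_matrices=False`) so that $D$ is square, which is an implementation shortcut rather than the exact form written above:

```python
import numpy as np

# Tall A (more rows than columns): least-squares solution
A_tall = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
y = np.array([1.0, 2.0, 4.0])
x_ls = np.linalg.pinv(A_tall) @ y

# Wide A (more columns than rows): minimum-norm solution
A_wide = np.array([[1.0, 1.0, 0.0]])
y_wide = np.array([2.0])
x_min = np.linalg.pinv(A_wide) @ y_wide

# A^+ = V D^+ U^T, valid here because all singular values are nonzero
U, s, Vt = np.linalg.svd(A_tall, full_matrices=False)
pinv_from_svd = Vt.T @ np.diag(1.0 / s) @ U.T
```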

## Trace operator

Trace operator: the sum of all the diagonal entries of a matrix

\begin{equation}
Tr(A) = \sum_i A_{i,i}
\end{equation}

The trace operator is useful for a variety of reasons. Some operations that are difficult to specify without resorting to summation notation can be specified using matrix products and the trace operator. 

For example, the trace operator provides an alternative way of writing the Frobenius norm of a matrix:

\begin{equation}
||A||_F = \sqrt{Tr(AA^T)}
\end{equation}

The trace operator is invariant to the transpose operator, which is very useful for manipulating expressions using identities:

$Tr(A) = Tr(A^T)$

By writing out the indices, it is possible to see that:

$Tr(AB) = Tr(BA)$

The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position which can be seen from applying the above formula multiple times to the product of the three matrices:

$Tr(ABC) = Tr(CAB) = Tr(BCA)$

or more generally this cyclic property of the trace operator is known as,

$Tr(\prod_{i=1}^{n}F^{(i)}) = Tr(F^{(n)}\prod_{i=1}^{n - 1}F^{(i)})$

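This invariance to cyclic permutation holds even if the resulting product has a different shape. For example, for $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times m}$, $Tr(AB) = Tr(BA)$ even though $AB \in \mathbb{R}^{m \times m}$ and $BA \in \mathbb{R}^{n \times n}$.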

*Note: A scalar is its own trace
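
A numerical spot check of the cyclic property with factors of different shapes, assuming NumPy and random example matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 2))

t_abc = np.trace(A @ B @ C)   # trace of a 2 x 2 product
t_cab = np.trace(C @ A @ B)   # trace of a 4 x 4 product
t_bca = np.trace(B @ C @ A)   # trace of a 3 x 3 product

# Frobenius norm written with the trace operator
frobenius_via_trace = np.sqrt(np.trace(A @ A.T))
```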

## Determinant

Determinant: a function that maps square matrices to real scalars. 

Intuitively when thinking about matrices as linear transformations, the magnitude of the determinant describes how the area spanned by the basis vectors changes under the transformation. (Determinant: The factor by which a linear transformation changes any area, or volume in higher dimensions.) Note: a negative sign in a determinant means that the transformation inverts the orientation of space.

The determinant is equal to the product of all the eigenvalues of the matrix. The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space (i.e. if the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all of its volume. If the determinant is 1, then the transformation preserves volume).

The determinant, $|A|$ of an $N \times N$ matrix A is defined by

\begin{equation}
|A| = \sum (\pm 1) A_{1,i_1}A_{2,i_2}...A_{N,i_N}
\end{equation}

in which the sum is taken over all products consisting of precisely one element from each row and one element from each column, with a coefficient $+1$ or $-1$ indicating whether the permutation $i_1 i_2 ... i_N$ is even or odd.

For example, the determinant of a $2 \times 2$ matrix is given by:

\begin{equation}
|A| = \begin{vmatrix}
a_{11} & a_{12} \\ 
a_{21} & a_{22}
\end{vmatrix} = a_{11}a_{22} - a_{12}a_{21}
\end{equation}

The determinant of a product of two matrices is given by:

\begin{equation}
|AB| = |A||B|
\end{equation}

The determinant of an inverse matrix is given by:

\begin{equation}
|A^{-1}| = \frac{1}{|A|}
\end{equation}

which can be shown by taking the determinant of $AA^{-1} = A^{-1}A = I$ and applying the multiplication rule.

If $A$ and $B$ are matrices of size $N \times M$, then:

\begin{equation}
|I_N + AB^T| = |I_M + A^TB|
\end{equation}

A useful special case is

\begin{equation}
|I_N + ab^T| = 1+ a^Tb
\end{equation}

where $a$ and $b$ are $N$-dimensional column vectors.
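
The determinant identities above can be spot-checked on small examples. This sketch assumes NumPy's `np.linalg.det`, and all matrices and vectors are hypothetical values:

```python
import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 4.0]])
B = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])

det_A = np.linalg.det(A)
det_B = np.linalg.det(B)
product_rule = np.isclose(np.linalg.det(A @ B), det_A * det_B)
inverse_rule = np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / det_A)
eigen_rule = np.isclose(np.prod(np.linalg.eigvals(A)).real, det_A)

# |I_N + P Q^T| = |I_M + P^T Q| for N x M matrices (here N = 3, M = 2)
P = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]])
Q = np.array([[1.0, 1.0], [0.0, 2.0], [1.0, 0.0]])
lhs = np.linalg.det(np.eye(3) + P @ Q.T)
rhs = np.linalg.det(np.eye(2) + P.T @ Q)

# Special case with vectors: |I_N + a b^T| = 1 + a^T b
a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, -1.0, 2.0])
rank_one = np.linalg.det(np.eye(3) + np.outer(a, b))
```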

A determinant of $|A| = 0$ means the transformation 'squashes' all of space into a lower dimension (e.g. in 2D, onto a line or a single point). 

## Basis vectors

*'a set B of vectors in a vector space V is called a basis if every element of V may be written in a unique way as a finite linear combination of elements of B. The coefficients of this linear combination are referred to as components or coordinates of the vector with respect to B. The elements of a basis are called basis vectors'*

(source: https://en.wikipedia.org/wiki/Basis_(linear_algebra))

**Matrix transformations can be thought of as a change of basis.**

For example, let the columns of a matrix $A$ be the basis vectors of a new coordinate system, written in the coordinates of the original one. For a vector with coordinates $x$ in the new basis, its coordinates in the original basis are given by $Ax$. 

For a vector with coordinates $x'$ in the original basis, its coordinates in the new basis are given by $A^{-1}x'$.

A transformation expressed in the new basis is given by $A^{-1}BA$, where $B$ is the transformation matrix in the original basis (i.e. translate from the new basis to the original basis, apply the transformation in the original basis, and then translate back to the new basis).
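
A sketch of a change of basis, assuming NumPy and assuming the convention that the columns of $A$ are the new basis vectors written in the original coordinates. The basis vectors and the rotation used here are hypothetical example values:

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [1.0,  1.0]])        # columns: new basis vectors in original coordinates
A_inv = np.linalg.inv(A)

x_new = np.array([-1.0, 2.0])      # coordinates in the new basis
x_original = A @ x_new             # the same vector in original coordinates
x_back = A_inv @ x_original        # translated back to the new basis

B = np.array([[0.0, -1.0],
              [1.0,  0.0]])        # 90 degree rotation in original coordinates
B_new = A_inv @ B @ A              # the same rotation expressed in the new basis

# Rotating in the new basis and then translating agrees with
# translating first and rotating in the original basis
consistent = np.allclose(A @ (B_new @ x_new), B @ x_original)
```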

## Matrix derivatives

In some cases, it is important to consider the derivative of a vector or matrix with respect to a scalar $x$. For example, the derivative of a vector $a$ with respect to $x$ is

\begin{equation}
\left(\frac{\partial a}{\partial x}\right)_i = \frac{\partial a_i}{\partial x}
\end{equation}

(This also applies to a general matrix)

Derivatives of a scalar with respect to vectors and matrices can also be defined, for instance

\begin{equation}
\left(\frac{\partial x}{\partial a}\right)_i = \frac{\partial x}{\partial a_i}
\end{equation}

and similarly for vector/vector derivatives,

\begin{equation}
\left(\frac{\partial a}{\partial b}\right)_{ij} = \frac{\partial a_i}{\partial b_j}
\end{equation}

The following expression, in which $x$ and $a$ are both vectors, can be seen from writing out the components:

\begin{equation}
\frac{\partial}{\partial x} (x^Ta) = \frac{\partial}{\partial x}(a^T x) = a
\end{equation}

Similarly, for matrices $A$ and $B$ that depend on a scalar $x$,
\begin{equation}
\frac{\partial}{\partial x} (AB) = \frac{\partial A}{\partial x} B + A \frac{\partial B}{\partial x}
\end{equation}

The derivative of a matrix inverse can be shown by differentiating the equation $A^{-1}A = I$ then right multiplying by $A^{-1}$, and is expressed as
\begin{equation}
\frac{\partial}{\partial x} (A^{-1}) = -A^{-1} \frac{\partial A}{\partial x} A^{-1}
\end{equation}

Furthermore, 

\begin{equation}
\frac{\partial}{\partial x} \ln|A| = Tr\left(A^{-1} \frac{\partial A}{\partial x}\right)
\end{equation}

If $x$ is chosen to be one of the elements of $A$ then

\begin{equation}
\frac{\partial}{\partial A_{i,j}} Tr(AB) = B_{j,i}
\end{equation}

which can be summarized more compactly as

\begin{equation}
\frac{\partial}{\partial A} Tr(AB) = B^T
\end{equation}

With the above notation, the following properties can be written:

\begin{equation}
\frac{\partial}{\partial A} Tr(A^TB) = B
\end{equation}

\begin{equation}
\frac{\partial}{\partial A} Tr(A) = I
\end{equation}


\begin{equation}
\frac{\partial}{\partial A} Tr(ABA^T) = A(B + B^T)
\end{equation}

which can be seen from writing out the matrix indices.

In addition, 
\begin{equation}
\frac{\partial}{\partial A} \ln |A| = (A^{-1})^T
\end{equation}
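
Some of these identities can be checked numerically with finite differences. The sketch below assumes NumPy; `numerical_gradient` is a hypothetical helper written for this check, and the matrices are example values (with $|A| > 0$ so that $\ln|A|$ is defined):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
B = np.array([[1.0, 2.0],
              [3.0, 4.0]])

def numerical_gradient(f, A, eps=1e-6):
    """Central finite differences of a scalar function f with respect to each A_ij."""
    grad = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            step = np.zeros_like(A)
            step[i, j] = eps
            grad[i, j] = (f(A + step) - f(A - step)) / (2 * eps)
    return grad

grad_trace = numerical_gradient(lambda M: np.trace(M @ B), A)            # compare with B^T
grad_quad = numerical_gradient(lambda M: np.trace(M @ B @ M.T), A)       # compare with A (B + B^T)
grad_logdet = numerical_gradient(lambda M: np.log(np.linalg.det(M)), A)  # compare with (A^{-1})^T
```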

## Principal component analysis (PCA)

PCA is a simple ML algorithm that can be derived exclusively with linear algebra.

Suppose there is a collection of $m$ points $\{ x^{(1)},..., x^{(m)} \}$ in $\mathbb{R}^{n}$. PCA is a lossy compression method for those points (i.e. a way of storing those points that requires less memory but may lose some precision.)

One way to encode these points is to represent a lower-dimensional version of them. For each point $x^{(i)} \in \mathbb{R}^{n}$, we find a corresponding code vector $c^{(i)} \in \mathbb{R}^{l}$. If $l < n$ then storing the code points will take less memory than storing the original data. In order to do this, an encoding function is required such that:

\begin{equation}
f(x) = c
\end{equation}

and also, a decoding function $g$ is required that reproduces the input given the code:

\begin{equation}
x \approx g(f(x))
\end{equation}

PCA is defined by the choice of the decoding function. To make the decoder very simple, matrix multiplication is used to map the code back into $\mathbb{R}^{n}$. Let $g(c) = Dc$, where $D \in \mathbb{R}^{n \times l}$ is the matrix defining the decoding.


Computing the optimal code for the decoder could be a difficult problem. In order to keep the encoding problem easy, PCA constrains the columns of $D$ to be orthogonal to each other. ($D$ is not technically an orthogonal matrix unless $l = n$).

Many solutions are possible to the above prescription, because the scale of $D_{:, i}$ can be increased if $c_i$ is decreased proportionally for all points. To give the problem a unique solution, the columns of $D$ are constrained to have unit norm.

In order to turn this idea into an algorithm, the first step is to figure out how to generate the optimal code point $c^*$ for each input point $x$. One way to do this is to minimize the distance between the input point $x$ and its reconstruction, $g(c^*)$. This distance can be measured using a norm; in this case the $L^2$ norm is used:

\begin{equation}
c^* = \arg\min_{c}||x - g(c)||_2
\end{equation}

The squared $L^2$ norm can be used instead, because both are minimized by the same value of $c$: the $L^2$ norm is non-negative and the squaring operation is monotonically increasing for non-negative arguments:

\begin{equation}
c^* = \arg\min_{c}||x - g(c)||_2^2
\end{equation}

The function being minimized simplifies to 

\begin{equation}
(x - g(c))^T(x-g(c))
\end{equation}

Expanding the expression,

\begin{equation}
x^Tx - x^Tg(c) - g(c)^Tx + g(c)^Tg(c)
\end{equation}

Combining the two middle terms (because the scalar $g(c)^Tx$ is equal to the transpose of itself),
\begin{equation}
x^Tx - 2x^Tg(c) + g(c)^Tg(c)
\end{equation}


Now the function being minimized can be rewritten to omit the first term as it does not depend on $c$:

\begin{equation}
c^* = argmin_c -2x^TDc + c^TD^TDc
\end{equation}


\begin{equation}
= argmin_c -2x^TDc + c^TI_lc
\end{equation}

(by the orthogonality and unit norm constraints on $D$)

so 

\begin{equation}
= argmin_c -2x^TDc + c^Tc
\end{equation}

This optimization problem can be solved using vector calculus. Using the identities $\nabla_c(x^TDc) = D^Tx$ and $\nabla_c(c^Tc) = 2c$, and setting the gradient to zero:

\begin{equation}
\nabla_c(-2x^TDc +c^Tc) = 0
\end{equation}

\begin{equation}
-2D^Tx + 2c = 0
\end{equation}

so 

\begin{equation}
c = D^Tx
\end{equation}
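
The closed-form result can be checked numerically. The following is a minimal sketch, assuming NumPy and a randomly generated decoder $D$ with orthonormal columns (the sizes, seed and variable names are illustrative choices, not part of the derivation). It compares $c = D^Tx$ with a generic least-squares solution of $\min_c ||x - Dc||_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 5, 2  # illustrative sizes: input dimension n, code dimension l

# Build a decoder D with orthonormal columns via a reduced QR factorization.
D, _ = np.linalg.qr(rng.normal(size=(n, l)))
x = rng.normal(size=n)

# Closed-form optimal code from the derivation: c = D^T x.
c_closed = D.T @ x

# Generic least-squares solution of min_c ||x - D c||_2, for comparison.
c_lstsq, *_ = np.linalg.lstsq(D, x, rcond=None)

print(np.allclose(D.T @ D, np.eye(l)))  # the constraint D^T D = I_l holds
print(np.allclose(c_closed, c_lstsq))   # both approaches give the same code
```

The agreement relies on $D^TD = I_l$; for a general $D$ the least-squares solution would instead involve $(D^TD)^{-1}D^Tx$.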



This makes the algorithm efficient: $x$ can be optimally encoded using just a matrix-vector operation:

\begin{equation}
f(x) = D^Tx
\end{equation}

Using a further matrix multiplication, we can define the PCA reconstruction operation:

\begin{equation}
r(x) = g(f(x)) = DD^Tx
\end{equation}
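
As a small illustration of the encode, decode and reconstruct steps, here is a sketch assuming NumPy and a random $D$ with orthonormal columns (the function names `encode`, `decode` and `reconstruct` are hypothetical helpers chosen for this example). It also checks two properties that follow from $D^TD = I_l$: reconstructing twice changes nothing, and the reconstruction error is orthogonal to the columns of $D$.

```python
import numpy as np

def encode(x, D):
    """f(x) = D^T x"""
    return D.T @ x

def decode(c, D):
    """g(c) = D c"""
    return D @ c

def reconstruct(x, D):
    """r(x) = g(f(x)) = D D^T x"""
    return decode(encode(x, D), D)

rng = np.random.default_rng(1)
n, l = 4, 2  # illustrative sizes
D, _ = np.linalg.qr(rng.normal(size=(n, l)))  # orthonormal columns
x = rng.normal(size=n)

r = reconstruct(x, D)
r_twice = reconstruct(r, D)
residual = x - r

print(np.allclose(r, r_twice))           # D D^T acts as a projection
print(np.allclose(D.T @ residual, 0.0))  # error is orthogonal to the columns of D
```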

The next step is to choose the encoding matrix $D$ using the $L^2$ distance between inputs and reconstructions. Since we will use the same matrix $D$ to decode all the points, we can no longer consider the points in isolation. Instead, the Frobenius norm of the matrix of errors computed over all dimensions and all points must be minimized:

\begin{equation}
D^* = argmin_D \sqrt{\sum_{i, j}(x_j^{(i)} - r(x^{(i)})_j)^2}
\end{equation}

(Subject to $D^TD = I_l$)

To derive the algorithm for finding $D^*$, first consider the case where $l=1$. In this case, $D$ is just a single vector, $d$. Using the expression for the reconstruction in the Frobenius norm calculation gives:

\begin{equation}
d^* = argmin_d \sum_i ||x^{(i)} - dd^Tx^{(i)}||_2^2
\end{equation}

(Subject to $||d||_2 = 1$)

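Since $d^Tx^{(i)}$ is a scalar, it can be moved to the left of $d$. Exploiting the fact that a scalar is its own transpose, it can then be written as $x^{(i)T}d$:

\begin{equation}
d^* = argmin_d \sum_i ||x^{(i)} - x^{(i)T}dd||_2^2
\end{equation}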

(Again, subject to $||d||_2 = 1$)

It is most helpful to rewrite the problem in terms of a single design matrix of examples, rather than as a sum over separate example vectors, which enables the use of more compact notation.

Let $X \in \mathbb{R}^{m \times n}$ be the matrix defined by stacking all of the vectors describing the points, such that $X_{i, :} = x^{(i)T}$. The problem can be rewritten as:

\begin{equation}
d^* = argmin_d ||X - Xdd^T||^2_F
\end{equation}

(Subject to $d^Td = 1$)
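
The equivalence between the per-example sum and the design-matrix form can be checked numerically. This is a minimal sketch assuming NumPy, with random data and a random unit vector $d$ (all sizes and names are illustrative). It also evaluates the same quantity through the trace, $Tr(E^TE)$ with $E = X - Xdd^T$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 3  # illustrative sizes: m examples, n features
X = rng.normal(size=(m, n))  # row i of X is x^(i) transposed

d = rng.normal(size=n)
d /= np.linalg.norm(d)  # enforce the constraint ||d||_2 = 1

# Sum over examples of ||x^(i) - d d^T x^(i)||_2^2.
per_example = sum(np.linalg.norm(x - d * (d @ x)) ** 2 for x in X)

# Design-matrix form ||X - X d d^T||_F^2.
E = X - X @ np.outer(d, d)
frobenius = np.linalg.norm(E, ord="fro") ** 2

# Trace form of the squared Frobenius norm.
trace_form = np.trace(E.T @ E)

print(np.isclose(per_example, frobenius))
print(np.isclose(frobenius, trace_form))
```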

Disregarding the constraint for the moment, the Frobenius norm portion can be written as:

\begin{equation}
argmin_d ||X - Xdd^T||^2_F
\end{equation}

\begin{equation}
= argmin_d \ Tr((X - Xdd^T)^T(X - Xdd^T))
\end{equation}

(Using the above expression for the Frobenius norm written in terms of the trace).


Expanding the above expression,

\begin{equation}
= argmin_d \ Tr(X^TX - X^TXdd^T - dd^TX^TX + dd^TX^TXdd^T)
\end{equation}

\begin{equation}
= argmin_d \ Tr(X^TX) - Tr(X^TXdd^T) - Tr(dd^TX^TX) + Tr(dd^TX^TXdd^T)
\end{equation}

\begin{equation}
= argmin_d \ - Tr(X^TXdd^T) - Tr(dd^TX^TX) + Tr(dd^TX^TXdd^T)
\end{equation}

(As $Tr(X^TX)$ does not depend on $d$ and so does not affect the argmin)

\begin{equation}
= argmin_d \ -2Tr(X^TXdd^T) + Tr(dd^TX^TXdd^T)
\end{equation}

(As the order of the matrices can be cycled inside the trace operator, so $Tr(dd^TX^TX) = Tr(X^TXdd^T)$. Using the same property again on the last term,)

\begin{equation}
= argmin_d \ -2Tr(X^TXdd^T) + Tr(X^TXdd^Tdd^T)
\end{equation}

Reintroducing the constraint $d^Td = 1$, the last term simplifies because $dd^Tdd^T = d(d^Td)d^T = dd^T$:

\begin{equation}
= argmin_d \ -2Tr(X^TXdd^T) + Tr(X^TXdd^T)
\end{equation}

so 

\begin{equation}
= argmin_d \ -Tr(X^TXdd^T)
\end{equation}

\begin{equation}
= argmax_d \ Tr(X^TXdd^T)
\end{equation}

\begin{equation}
d^* = argmax_d \ Tr(d^TX^TXd)
\end{equation}


This optimization problem may be solved by eigendecomposition. Specifically, the optimal $d$ is given by the eigenvector of $X^TX$ corresponding to the largest eigenvalue. This derivation is specific to the case of $l = 1$ and recovers only the first principal component. More generally, when a basis of principal components is desired, the matrix $D$ is given by the $l$ eigenvectors corresponding to the largest eigenvalues. This may be shown by induction.
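
The result above can be sketched in code. This is one possible illustration, assuming NumPy; the helper name `pca_decoder`, the random data and the sizes are hypothetical choices for this example. It builds $D$ from the eigenvectors of $X^TX$ with the largest eigenvalues, and checks that the first column scores at least as well on $d^TX^TXd$ as a large sample of random unit vectors.

```python
import numpy as np

def pca_decoder(X, l):
    """Return D (n x l): eigenvectors of X^T X for the l largest eigenvalues."""
    # eigh is appropriate because X^T X is symmetric; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    order = np.argsort(eigvals)[::-1][:l]  # indices of the l largest eigenvalues
    return eigvecs[:, order]

def objective(X, d):
    """The quantity being maximized for l = 1: d^T X^T X d."""
    return d @ X.T @ X @ d

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))  # illustrative data

D = pca_decoder(X, l=2)
codes = X @ D      # each row is f(x) = D^T x for one example
X_rec = codes @ D.T  # each row is r(x) = D D^T x

# Compare the first principal direction against random unit vectors.
d_star = D[:, 0]
random_dirs = rng.normal(size=(1000, 4))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
best_random = max(objective(X, d) for d in random_dirs)

print(np.allclose(D.T @ D, np.eye(2)))        # columns are orthonormal
print(objective(X, d_star) >= best_random)    # no random direction does better
```

In practice the same directions are often obtained from the singular value decomposition of $X$ rather than by forming $X^TX$ explicitly; that is a design choice outside the derivation given here.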


#### Add algorithm for PCA

## Cramer's Rule

"In linear algebra, Cramer's rule is an explicit formula for the solution of a system of linear equations with as many equations as unknowns, valid whenever the system has a unique solution" (https://en.wikipedia.org/wiki/Cramer%27s_rule)


# 
![Cramer's Rule](./figures/cramers_rule.png)


The general case of Cramer's rule applies to a system of $n$ linear equations in $n$ unknowns, expressed as:

\begin{equation}
\textbf{A} \textbf{x} = \textbf{b}
\end{equation}
where $\textbf{x}$ is a column vector.

The solution for the $i$th element of $\textbf{x}$ is given by:
\begin{equation}
x_i = \frac{det(\textbf{A}_i)}{det(\textbf{A})}
\end{equation}

$i \in \{1, ..., n \}$

where $\textbf{A}_i$ is the matrix formed by replacing the $i$th column of $\textbf{A}$ with the column vector $\textbf{b}$.
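
The formula translates directly into code. The following is a minimal sketch assuming NumPy; the function name `cramer_solve` and the example system are illustrative. It is intended to mirror the formula rather than to be an efficient solver, since computing $n + 1$ determinants is far more expensive than standard elimination methods.

```python
import numpy as np

def cramer_solve(A, b):
    """Solve A x = b with Cramer's rule (A square, det(A) != 0)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    det_A = np.linalg.det(A)
    if np.isclose(det_A, 0.0):
        raise ValueError("det(A) is numerically zero; no unique solution")
    x = np.empty(len(b))
    for i in range(len(b)):
        A_i = A.copy()
        A_i[:, i] = b  # replace the i-th column of A with b
        x[i] = np.linalg.det(A_i) / det_A
    return x

# Example system: 2*x1 + x2 = 3 and x1 + 3*x2 = 5.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = cramer_solve(A, b)
print(x)                                      # approximately [0.8, 1.4]
print(np.allclose(x, np.linalg.solve(A, b)))  # agrees with NumPy's solver
```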



## Useful notes:

https://www.youtube.com/watch?v=TgKwz5Ikpc8&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab&index=16

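~ The determinant and eigenvectors of a linear transformation don't care about (are invariant to) the choice of coordinate system: a change of basis changes the coordinates used to describe them, but not the underlying quantities.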

~ 'Functions are another type of vector': they can be added together and multiplied by scalars. A transformation $L$ acting on functions is linear if it obeys the formal definition of linearity (it preserves addition and scalar multiplication):
1) Additivity: $L(f + g) = L(f) + L(g)$
2) Scaling: $L(af) = aL(f)$

This is the same definition as for linear transformations of vectors.

For example, the derivative operator is linear, obeying both additivity and scaling. This allows us to treat differentiation as a linear transformation on a space of functions.

~ A linear transformation is completely defined by its action on the basis vectors of the space.

~ A space can be defined, for example, for all polynomials. The basis functions of this space are $1, x, x^2, x^3, ...$ and so the space is infinite dimensional. Each polynomial is described by a vector of coefficients that scale these basis functions.
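
To make the notes above concrete, here is a sketch (assuming NumPy) of differentiation written as a matrix acting on coefficient vectors in the basis $1, x, x^2, x^3$. Truncating the basis at degree 3 and the helper name `derivative_matrix` are illustrative assumptions; the full polynomial space would need an infinite matrix. Each column of the matrix records where the derivative sends one basis function, which is what it means for a linear transformation to be defined by its action on the basis.

```python
import numpy as np

def derivative_matrix(degree):
    """Matrix of d/dx on coefficient vectors in the basis 1, x, ..., x^degree."""
    size = degree + 1
    M = np.zeros((size, size))
    for k in range(1, size):
        M[k - 1, k] = k  # d/dx x^k = k x^(k-1)
    return M

M = derivative_matrix(3)

p = np.array([5.0, 3.0, 0.0, 2.0])  # 5 + 3x + 2x^3
q = np.array([1.0, 0.0, 4.0, 0.0])  # 1 + 4x^2

dp = M @ p
print(dp)  # [3. 0. 6. 0.], i.e. 3 + 6x^2

# Linearity: additivity and scaling hold for the matrix form.
print(np.allclose(M @ (p + q), M @ p + M @ q))
print(np.allclose(M @ (2.0 * p), 2.0 * (M @ p)))
```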

# 
![Linear Algebra vs Functions](./figures/linear_algebra_vs_functions.png)


![vector axioms](./figures/vector_axioms.png)

## "Abstractness is the price of generality"