# Linear Algebra
Summary: We have a matrix A and we get as much information as possible from it.

### Scalar, vector, matrices
Scalar: (a) one point in a line

Vector: (a) multiple points in a line, (b) one point in an n-dimensional space

Matrix: (a) multiple points in an n-dimensional space, (b) transformation from one vector space to another vector space

Tensor: (a) multiple space transformations, (b) {transformation from one matrix space to another matrix space?}

We can think of the product of two matrices (AB = C) entry by entry: $C_{ij}$ is the dot product between row $i$ of A and column $j$ of B.

Part of how we multiply matrices is arbitrary. We could also take row vectors and flip them. 
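The row-by-column view above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `matmul` and the two small matrices are assumptions made for the example, not something defined elsewhere in these notes.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    n_inner = len(B)
    assert all(len(row) == n_inner for row in A), "inner dimensions must match"
    n_cols = len(B[0])
    # Entry (i, j) is the dot product of row i of A with column j of B.
    return [
        [sum(A[i][k] * B[k][j] for k in range(n_inner)) for j in range(n_cols)]
        for i in range(len(A))
    ]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)
print(C)
```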

## Basics

$A^T$: reflect the entries across the main diagonal, so the lower triangular entries are exchanged with the upper ones: $(A^T)_{ij} = A_{ji}.$

$A \verb| symmetric| \iff A^T = A$

$A^{-1}: AA^{-1} = I$

$det(A)$: a way to encode information of A

$trace(A) = \sum_i A_{ii}$

## Subspaces
Subspace: a set of vectors that (a) contains the zero vector and (b) is closed under linear combinations: any linear combination of two vectors in the subspace remains in the subspace.

Say A is a matrix and x is a vector.

$Ax$: linear combination of the columns of A with coefficients given by x

$x^TA$: linear combination of the rows of A with coefficients given by x

Column space (C(A)): all the linear combinations of the columns of A, ie every vector of the form $Ax.$

Null space/kernel (N(A)): the vectors x whose entries are the coefficients of a linear combination of columns that gives the zero vector, ie $Ax = 0.$

Row space ($C(A^T)$): all the linear combinations of the rows of A.

Left null space/cokernel ($N(A^T)$): the vectors y whose entries are the coefficients of a linear combination of rows that gives the zero vector, ie $A^Ty = 0.$

Basis (of a space): a set of independent vectors that span the whole space

Span (of a set of vectors): all the vectors that can be expressed as a linear combination of the set of vectors.

Independence: a set of vectors is independent if every linear combination of the given set of vectors is non-zero (except for the linear combination where each vector has a coefficient of zero.)

For $A \in R^{m \times n}$ and $r = rank(A)$:

$rank(A) = dim(C(A)) = dim(C(A^T))$

$n - r = dim(N(A))$

$m - r = dim(N(A^T))$

### Example
Suppose we have a matrix A. Assume it's a 3x3 matrix and rank(A) = 2. We defined the rank as the dimension of the column space. Thus, for every vector x, Ax will lie in a 2D space. By the formula above, dim(N(A)) should equal 1. Let's see why. We can picture the transformation of a space from n dimensions to k < n dimensions in the specific case of n = 2 and k = 1. We start with every point in the plane, and we end with every point on a given line. Thus, it's as if we are splashing the plane into a line. Notice that this splash is given by a 2x2 matrix that has linearly dependent columns. This can be generalized to the case of n and k where we splash n - k dimensions.

If we compute the eigenvectors of the matrix A that splashes n - k dimensions, we will see that there are n - k eigenvalues with value zero (let's call them splashing dimensions.) Assume A is diagonalizable, so that an eigenbasis exists, and express every vector x in the eigenbasis. Now, every vector that has 0 in all entries but the splashing dimensions will be in the nullspace. As there are n - k splashing dimensions, dim(N(A)) = n - k.
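The dimension count can be checked on a concrete matrix. The sketch below is an assumption-laden illustration: the helper `rank` (Gaussian elimination with exact fractions) and the 3x3 matrix are made up for this example. The third column is the sum of the first two, so we expect rank 2 and a one-dimensional nullspace.

```python
from fractions import Fraction

def rank(M):
    """Rank of a matrix (list of rows) via Gaussian elimination with exact fractions."""
    rows = [[Fraction(v) for v in row] for row in M]
    n_rows, n_cols = len(rows), len(rows[0])
    r = 0  # number of pivots found so far
    for col in range(n_cols):
        # Find a row at or below position r with a non-zero entry in this column.
        pivot = next((i for i in range(r, n_rows) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Eliminate this column from every row below the pivot row.
        for i in range(r + 1, n_rows):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

# Hypothetical 3x3 example: the third column is the sum of the first two.
A = [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 2]]
r = rank(A)
n = len(A[0])
null_vector = [1, 1, -1]  # col1 + col2 - col3 = 0
A_times_null = [sum(a * x for a, x in zip(row, null_vector)) for row in A]
print(r, n - r, A_times_null)
```

Here `n - r` is 1, and `null_vector` is one vector spanning that one-dimensional nullspace.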

## Change of basis
### Example 
Say $x = [1, 1, 1]$ and its basis vectors are $e_1 = [0, 0, 1],\ e_2 = [0, 1, 1],\ e_3 = [1, 1, 1].$ Notice we can represent the basis vectors more compactly as
$$
A = 
\begin{bmatrix}
 0 & 0 & 1 \\
 0 & 1 & 1 \\
 1 & 1 & 1 \\
\end{bmatrix}
$$
Define $y = Ax.$ Since $Ax$ is the linear combination of the columns of $A$ (the basis vectors) with coefficients from $x,$ $y$ is a vector that represents the same point as $x,$ but instead of being expressed in the basis we described above, it's expressed in the standard basis. Thus, we can see $A$ as the transformation that takes coordinates from the above basis to the standard basis. Similarly, $x = A^{-1}y.$ Thus, $A^{-1}$ takes vectors from the standard basis to the basis described above. 
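A minimal sketch of this example in plain Python, assuming a small helper `matvec` (a name introduced here only for illustration). It computes $y = Ax$ and compares it with the explicit combination $x_1e_1 + x_2e_2 + x_3e_3.$

```python
def matvec(A, x):
    """Matrix-vector product for A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# Basis vectors from the example; they are the columns of A
# (A happens to be symmetric here, so they are also its rows).
e1, e2, e3 = [0, 0, 1], [0, 1, 1], [1, 1, 1]
A = [[0, 0, 1],
     [0, 1, 1],
     [1, 1, 1]]
x = [1, 1, 1]  # coordinates in the basis (e1, e2, e3)

y = matvec(A, x)
# Same point built explicitly as x1*e1 + x2*e2 + x3*e3.
combo = [x[0] * a + x[1] * b + x[2] * c for a, b, c in zip(e1, e2, e3)]
print(y, combo)
```

Both computations give the coordinates of the point in the standard basis, $[1, 2, 3].$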

## Projections
### Vector form
Sometimes we have a space $S_1$, and we want to find the vector $V_1$ in $S_1$ that is nearest to some other vector $V_2$ in a higher-dimensional space $S_2$. Thus, we project $V_2$ onto $S_1$ to get $V_1.$ 

We can express $S_1$ and $S_2$ in the same orthogonal basis (using some more basis vectors for $S_2.$) Then, we express $V_2$ in that basis, and drop the basis vectors that don't appear in $S_1.$

So, let's get a formula for the projection matrix.

![Screenshot%20from%202018-10-07%2010-24-40.png](attachment:Screenshot%20from%202018-10-07%2010-24-40.png)

We want to know the value of $p$ in terms of $a$ and $b.$ We know that $p = xa$ with $x \in R.$ 

We know the error $e = b - p$ should be orthogonal to $a$ for $p$ to be the closest point to $b$ on the line through $a.$ To understand this, say we have another point $p'$ on that line where $b - p'$ isn't perpendicular to $a.$ Now, $b - p$ is perpendicular to $a,$ so we can create a triangle that has base $p' - p$ and sides $b - p$ and $b - p'.$ Note that $b - p$ and $p' - p$ are the legs of a right triangle with hypotenuse $b - p',$ so by the Pythagorean theorem we know that $||b - p|| < ||b - p'||.$ Thus, for the error to be minimized, it should be orthogonal to $a.$


$$
\begin{align}
a^T(e) &= 0 \\
a^T(b - p) &= 0 \verb|   (1)| \\
a^T(b - xa) &= 0 \\
xa^Ta &= a^Tb \\
x &= \frac{a^Tb}{a^Ta} \verb|   (2)| \\
p = xa &= \frac{aa^Tb}{a^Ta} 
\end{align}
$$

In (1), I was tempted to do 

$$
\begin{align}
a^T(b - p) &= 0 \\
a^Tb &= a^Tp \verb|   (3)| \\
(a^T)^{-1}a^Tb &= p \verb| (wrong)| \\
b &= p \verb| (wrong)| \\
\end{align}
$$

But this makes no sense; $b$ isn't always the same as $p.$ The problem here is that $a^T$ is a row vector, not a square matrix, so $(a^T)^{-1}$ doesn't exist: when we project $b$ onto $a,$ we are losing information. It's that loss of information that makes the equality (3) valid even though $b \neq p.$ If we didn't lose that information, the equality would force $p$ and $b$ to be equal. Notice that dividing by $a^Ta$ (what we do in (2)) is valid because $a^Ta$ is a non-zero scalar.

So, the formula below tells us the vector we got when we project $b$ onto $a.$ 

$$p = xa = \frac{aa^Tb}{a^Ta} $$

Now, I wonder why it has so many $a$'s. We can rearrange the equation a little.

$$p = \frac{a^Tb}{a^Ta}a $$

We have a fraction multiplied by $a.$ The last $a$ makes sense: the projected vector $p$ has to be a scalar times $a.$ Why is $x = \frac{a^Tb}{a^Ta}$ the scalar we are looking for?

Let's see what happens if we fix $a$ and vary $b.$ 
* If $b = a,$ then $x = 1$
* If $b$ is perpendicular to $a,$ then $x = 0$
* If $b = -a,$ then $x = -1$
* If $b = ka,$ then $x = k$

Now, say $a = kc$ with $k \in R,\ k \neq 0.$ Let's vary $k$ and fix $b.$ In this case, $p$ remains the same regardless of the selected value of $k.$ And it makes sense: we want $a$ to establish only the direction of the line, not the magnitude. So, we can see the scalar $x = \frac{a^Tb}{a^Ta}$ as $a^Tb$ measuring how much of $b$ points along $a,$ and $a^Ta = ||a||_2^2$ being the normalizing factor that cancels the length of $a.$ Note that if $a$ is a unit vector, then $a^Ta = 1.$ Thus, for a unit vector the scalar simplifies to $x = a^Tb.$
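The formula $p = \frac{a^Tb}{a^Ta}a$ and the two observations above (the error is orthogonal to $a,$ and rescaling $a$ doesn't move $p$) can be checked with a short sketch. The function names and the sample vectors are assumptions chosen for the illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def project_onto_vector(b, a):
    """Return (x, p) where p = x * a is the projection of b onto the line through a."""
    x = dot(a, b) / dot(a, a)
    return x, [x * ai for ai in a]

a = [2.0, 0.0]
b = [3.0, 4.0]
x, p = project_onto_vector(b, a)
error = [bi - pi for bi, pi in zip(b, p)]

# Scaling a changes the scalar x but leaves the projection p untouched.
x_scaled, p_scaled = project_onto_vector(b, [4.0 * ai for ai in a])
print(x, p, dot(a, error), x_scaled, p_scaled)
```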

### Matrix form
There is one more level of difficulty. The vector $b$ we want to project is still a vector. But now we are interested in projecting this vector onto a subspace (eg a hyperplane.) We can define that subspace as the column space of a matrix. So, our vector $a$ now is a matrix $A.$ Before, the space of vectors $p$ was given by $p = ax.$ Now, the space of vectors $p$ is given by $p = Ax$ with $A \in R^{m \times n}$ and $x \in R^n.$ 

Let's do a similar derivation for $p$ in terms of $A$ and $b.$

$$
\begin{align}
A^T(b - p) &= 0 \\
A^T(b - Ax) &= 0 \\
A^TAx &= A^Tb \\
x &= (A^TA)^{-1}A^Tb \\
p = Ax &= A(A^TA)^{-1}A^Tb \\
\end{align}
$$

Let's focus on the term $x = (A^TA)^{-1}A^Tb.$ Previously, $x$ was a scalar. Now, $x$ is a vector. Let's see what happens if $x = A^Tb$ for every $b$
$$
x = A^Tb \\
\implies A^TA = I \\
\implies A^T = A^{-1} \verb|   (if A is square)| \\
$$
Thus, we could see a matrix with orthonormal columns as the natural extension of a unit vector. If $A$ has orthonormal columns, then we don't need any normalization factor; instead, the multiplication $A^Tb$ gives us all we need. 

We saw in the case when $A$ was the vector $a$ instead of being a matrix that rescaling $a$ didn't change the value of the projection. Let's see that the same happens here. Consider the case when $A = I_2$ and $b = [1, 1].$ As A is orthonormal, $p = AA^Tb = b = [1, 1].$ If we set $A = 3 I_2,$ we would like p to remain the same. 

$$
\begin{align}
p &= A(A^TA)^{-1}A^Tb \\
&= 3I_2(9I_2)^{-1}3I_2^T[1, 1] \\
&= 3I_2(9I_2)^{-1}3[1, 1] \\
&= 3I_2(1/3)[1, 1] \\
&= [1, 1] \\
\end{align}
$$

As we see, the $(A^TA)^{-1}$ term correctly behaves as a normalizer. 

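Remember that $A \in R^{m \times n}$. If $n < m,$ then the column space of $A$ has at most $n$ dimensions inside $R^m,$ so we are splashing some dimensions when projecting. That is, we are losing some information, and thus we have a projection. For the projection to work as it is written, we need $rank(A^TA) = n,$ ie the columns of $A$ must be linearly independent (which is impossible if $m < n.$) Otherwise, $A^TA$ won't be invertible. 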

It's interesting to note that if $A$ isn't orthonormal, $A^Tb$ could need to be normalized in some direction and not in the other.
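To make the matrix form concrete, here is a small sketch that projects a vector in $R^3$ onto the plane spanned by two columns. Everything specific in it is an assumption for the example: the helper names, the restriction to exactly two columns (so the normal equations $A^TAx = A^Tb$ are 2x2 and can be solved with Cramer's rule), and the sample $A$ and $b.$

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def project_onto_columns(A, b):
    """Project b onto the column space of A (list of rows, two independent columns).

    Solves the 2x2 normal equations A^T A x = A^T b with Cramer's rule.
    """
    col0, col1 = zip(*A)
    # Entries of the symmetric matrix A^T A and of the vector A^T b.
    g00, g01, g11 = dot(col0, col0), dot(col0, col1), dot(col1, col1)
    r0, r1 = dot(col0, b), dot(col1, b)
    det = g00 * g11 - g01 * g01  # zero exactly when the columns are dependent
    x = [(r0 * g11 - g01 * r1) / det, (g00 * r1 - g01 * r0) / det]
    p = [x[0] * row[0] + x[1] * row[1] for row in A]
    return x, p

A = [[1, 0],
     [1, 1],
     [1, 2]]
b = [6, 0, 0]
x, p = project_onto_columns(A, b)
error = [bi - pi for bi, pi in zip(b, p)]
# The error should be orthogonal to both columns of A, ie A^T (b - p) = 0.
At_error = [dot(col, error) for col in zip(*A)]
print(x, p, At_error)
```

For this choice of $A$ and $b,$ the coefficients are $x = [5, -3]$ and the projection is $p = [5, 2, -1].$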

## Eigenworld!
### Introduction
The eigenvalues and eigenvectors appear a lot in applications. The intuition behind them is that they allow us to decompose a matrix in a more natural form related to the values of the matrix. In particular, we want to decompose a matrix multiplication into three steps: (a) change basis (a rotation when the matrix is symmetric), (b) scale, (c) change back to the original basis. In this way, the matrix multiplication gets clearer for humans to understand and easier for machines to compute.[0]

Let's say we start with a vector $b$ expressed in the standard basis and a matrix $A$. Usually, $Ab \neq \lambda b$ with $\lambda \in R$ (ie $Ab$ won't be equal to just scaling $b.$) However, we are going to express $b$ in terms of other vectors where that property holds.

### Computing them
Let's find the non-zero vectors that satisfy

$$
Ax = \lambda x \\
Ax - \lambda x = 0 \\
(A - \lambda I)x = 0 \\
\implies x \verb| is in the nullspace of | (A - \lambda I) \\
\implies (A - \lambda I) \verb| isn't full rank| \\
\implies det(A - \lambda I) = 0
$$

All the values of $\lambda$ that satisfy the last equation above are called the eigenvalues. Each eigenvalue comes with corresponding eigenvectors, which are the non-zero vectors $x$ in the nullspace of $(A - \lambda I).$
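For a 2x2 matrix, $det(A - \lambda I) = 0$ is a quadratic in $\lambda,$ so the whole procedure fits in a short sketch. The function `eigen_2x2` is a hypothetical helper written for this note; it assumes real eigenvalues and picks one non-zero vector from the nullspace of $(A - \lambda I)$ for each eigenvalue.

```python
import math

def eigen_2x2(A):
    """Eigenpairs of a real 2x2 matrix (list of rows), assuming real eigenvalues."""
    (a, b), (c, d) = A
    if b == 0 and c == 0:
        # Diagonal matrix: the standard basis vectors are eigenvectors.
        return [(a, [1.0, 0.0]), (d, [0.0, 1.0])]
    trace, det = a + d, a * d - b * c
    # det(A - lambda I) = lambda^2 - trace * lambda + det = 0
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("complex eigenvalues are outside this sketch")
    root = math.sqrt(disc)
    pairs = []
    for lam in ((trace + root) / 2, (trace - root) / 2):
        # Pick a non-zero vector in the null space of (A - lam I).
        vec = [b, lam - a] if b != 0 else [lam - d, c]
        pairs.append((lam, vec))
    return pairs

A = [[2.0, 1.0], [1.0, 2.0]]
for lam, x in eigen_2x2(A):
    Ax = [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]
    print(lam, x, Ax)  # Ax should equal lam * x
```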

### Eigenbasis
Now, we want to make our matrix multiplication easier using the fact that multiplying $A$ by any eigenvector is the same as scaling. So, we want to express the vector $b$ in terms of the eigenvectors of $A.$ That is, we want to change the basis of $b$ from the standard basis to one where we have the eigenvectors as basis vectors. In this way, to get $Ab$ we first see how to express $b$ in terms of the eigenvectors, then we scale every eigenvector coordinate by the corresponding eigenvalue, and then we express the result in the standard basis again.

Formally,
$$
Q = \verb|eigenvectors stacked as columns| \\
\Lambda = diag(\lambda_1, \lambda_2, ..., \lambda_n) \\
Ab = Q\Lambda Q^{-1}b \\
$$
$Q^{-1}b$ performs the conversion from standard basis to eigenbasis. $\Lambda$ performs the scaling. $Q$ performs the conversion from eigenbasis to standard basis.

Notice from the last equality that we can express A as follows.
$$
A = Q\Lambda Q^{-1}
$$

For real symmetric matrices A, we can write it as $Q\Lambda Q^T.$ However, there could be multiple valid decompositions. In particular, if there are two eigenvectors with the same eigenvalue, then we can picture a plane being stretched (instead of a line in the case of only one eigenvector.) So, every two orthogonal vectors that lie on that plane are valid eigenvectors.

### Why we can decompose $A$ into $Q\Lambda Q^T$
Lemma: if a symmetric matrix $A$ has two eigenvectors with distinct eigenvalues, then they are orthogonal. 

Proof: say x and y are eigenvectors and b and c are their eigenvalues, with $b \neq c$
$$bx^Ty = (bx)^Ty = (Ax)^Ty = x^TA^Ty = x^TAy = x^T(cy) = cx^Ty$$
As $b \neq c$, $x^Ty$ must be zero.

Eigenspace: the space formed by all the eigenvectors that share an eigenvalue (unioned with the zero vector.) 

By the lemma above, eigenspaces are mutually orthogonal. Now, if we have more than one eigenvector spanning an eigenspace, then we can find an orthonormal basis for that eigenspace using Gram-Schmidt.

Notice that if we have a singular matrix, then at least one eigenvalue will be zero. That doesn't mean we will have fewer eigenvectors than in the full-rank matrix case. We will have the same number, but some eigenvectors will be spanning the null space. In other words, if we have a matrix with two dependent columns, that doesn't imply that we will have two dependent eigenvectors because the eigenvectors also have to span the nullspace. 

Thus, we know that we can get an eigenbasis where every eigenvector is orthogonal to every other eigenvector. The final step is to scale the eigenvectors to make them unit. Stacking all the eigenvectors in $Q$ gives us an orthonormal matrix. Hence, $Q^{-1} = Q^T.$ 
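As a small worked example of the decomposition (the matrix is an illustrative choice, not one used elsewhere in these notes), take a symmetric 2x2 matrix with eigenvalues 3 and 1. Its unit eigenvectors are orthogonal, so stacking them as columns gives an orthonormal $Q$ with $Q^{-1} = Q^T.$

$$
\begin{align}
A &= \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}, \quad
Q = \frac{1}{\sqrt 2}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \quad
\Lambda = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix} \\
Q\Lambda Q^T &= \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}\begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \\
&= \frac{1}{2}\begin{bmatrix} 3 & 1 \\ 3 & -1 \end{bmatrix}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \\
&= \frac{1}{2}\begin{bmatrix} 4 & 2 \\ 2 & 4 \end{bmatrix} = A
\end{align}
$$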

### Some properties of eigenbasis
Let's prove that a matrix is singular $\iff$ some eigenvalue is zero.
* If some eigenvalue is 0, then there is a non-zero vector $x$ with $Ax = 0x = 0.$ So the nullspace contains more than the zero vector. So the matrix is not full rank. So it's singular.
* If the matrix is singular, then there is a non-zero vector $x$ with $Ax = 0 = 0x.$ So $x$ is an eigenvector with eigenvalue zero.

By convention, we order the entries in $\Lambda$ in descending order

Say we want to optimize a quadratic expression like $f(x) = x^TAx$ subject to $||x||_2 = 1,$ with A symmetric. If x is a unit eigenvector of A, then $f(x) = x^T\lambda x = \lambda x^Tx = \lambda.$ If x isn't an eigenvector of A, then we can express it as a linear combination $x = \sum_i c_ix_i$ of orthonormal eigenvectors of A, because they span the whole space. Then, $f(x) = \sum_i c_i^2\lambda_i$ with $\sum_i c_i^2 = 1,$ ie the value of f is a weighted average of the eigenvalues of A. Thus, the maximum value of $f(x)$ is the first eigenvalue in $\Lambda,$ and the minimum value of $f(x)$ is the last eigenvalue in $\Lambda$ (if we consider the constraint.) 
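A quick numerical sketch of this claim, under assumptions made only for the example: the symmetric matrix below has eigenvalues 3 and 1, and we sample unit vectors $x = (\cos t, \sin t)$ around the circle. The sampled values of $f$ should stay between the smallest and the largest eigenvalue and reach both.

```python
import math

A = [[2.0, 1.0], [1.0, 2.0]]  # symmetric, eigenvalues 3 and 1

def quadratic_form(A, x):
    """f(x) = x^T A x for a 2x2 matrix A."""
    Ax = [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]
    return x[0] * Ax[0] + x[1] * Ax[1]

# Sample unit vectors x = (cos t, sin t) around the circle.
steps = 3600
values = [
    quadratic_form(A, (math.cos(2 * math.pi * k / steps),
                       math.sin(2 * math.pi * k / steps)))
    for k in range(steps)
]
print(min(values), max(values))
```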

Powers and inverses are easy in the eigenbasis (the inverse needs every $\lambda_i \neq 0$):

$$
A^n = Q\Lambda Q^{-1} \cdot Q\Lambda Q^{-1} ... \cdot Q\Lambda Q^{-1} = Q\Lambda^n Q^{-1} \\
A^{-1} = Q\Lambda^{-1} Q^{-1} = Q diag\bigg(\frac{1}{\lambda_1}, ..., \frac{1}{\lambda_n}\bigg) Q^{-1}
$$

We can express the result of a matrix A multiplied by a vector v as a linear combination of the eigenvectors of A. 

$$
\begin{align}
Av &= A\sum_i c_ix_i \\
&= \sum_i c_iAx_i \\
&= \sum_i c_i\lambda_ix_i \\
&= \sum_i d_ix_i \\
\end{align}
$$

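Let's see what happens if we do Av with v being a unit vector. We can express v in the eigenbasis of A. Below, c is the row vector of coefficients and X is the matrix whose rows are the eigenvectors $x_i,$ which we assume orthonormal (so $XX^T = I$)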
$$
\begin{align}
1 &= ||v||_2 = ||v||^2_2 = ||\sum_i c_ix_i||^2_2 \\
&= \sum_j\big(\sum_i c_ix_i\big)_j^2 \\
&= \sum_j\big(\sum_i c_ix_{i, j}\big)^2 \\
&= \sum_j\big(cx_{:, j}\big)^2 \\
&= sum(cX \odot cX) \\
&= (cX)(cX)^T \\
&= cXX^Tc^T \\
1 &= cc^T = ||c||_2^2\\
\end{align}
$$

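The same steps applied to $Av = \sum_i d_ix_i$ (with $d_i = c_i\lambda_i$) show that the norm of $Av$ is the norm of $d,$ which in general isn't 1.

$$
\begin{align}
||Av||^2_2 &= ||\sum_i d_ix_i||^2_2 \\
&= \sum_j\big(\sum_i d_ix_i\big)_j^2 \\
&= \sum_j\big(\sum_i d_ix_{i, j}\big)^2 \\
&= \sum_j\big(dx_{:, j}\big)^2 \\
&= sum(dX \odot dX) \\
&= (dX)(dX)^T \\
&= dXX^Td^T \\
&= dd^T = ||d||^2_2 = \sum_i c_i^2\lambda_i^2 \\
\end{align}
$$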

### Conclusion
We can think of a matrix multiplication as stretching the space along every eigenvector by its eigenvalue. Specifically, we go from the standard basis to the eigenbasis, where the transformation of A is just scaling the coordinates; then we scale the input vector, and then we return to the standard basis.

## Inverse
What are the requirements for a matrix to have an inverse?
* It can't map multiple vectors to the same vector. Thus, $Ax = b$ must have at most one solution for every vector b. So all columns of A have to be linearly independent. (Otherwise, there are infinitely many x s.t. $Ax = b$. For example, we could subtract some amount from one entry of x and compensate by updating the entries of $x$ whose columns are linearly dependent with the column of the entry we first selected.)
* It has to reach every output vector. Thus, $Ax = b$ must have at least one solution for every vector $b \in R^m.$ So A has to have m linearly independent columns.

We conclude that A has to be a full-rank square matrix to be invertible.

## Norm
Three properties:

1) $f(x) = 0 \iff x =\ $**0**

2) $f(x + y) \leq f(x) + f(y)$

3) $f(ax) = |a|f(x)$

### 2) Triangle inequality property
#### Negative norm
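Let's think about (2). If this doesn't hold, we can't make a triangle with sides f(x), f(y), and f(x + y). Suppose the opposite strict inequality, $f(u + v) > f(u) + f(v),$ held for every pair of vectors u and v. Then

$$
0 = f(0) = f(x - x) = f(x - y + y - x) > f(x - y) + f(y - x) = f(x - y) + f((-1)(x - y)) = 2f(x - y) \\
0 > 2f(x - y) \\
0 > f(-y) \verb|   (taking x = 0)| \\
0 > f(z) \verb|   (taking z = -y)|
$$

This implies that for every z, the norm is negative. In particular, we can choose z = 0, and we reach a contradiction, because we get f(0) < 0 but we know that f(0) = 0. Conversely, when (2) holds, the same chain with $\leq$ gives $0 = f(0) \leq f(x) + f(-x) = 2f(x),$ so a norm is never negative.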

#### Transitivity of closeness
$d(x, y) := f(x - y) = f(y - x).$
Say $d(x, y) < \epsilon$ and $d(y, z) < \epsilon.$ We'd like to bound $d(x, z).$ We know that

$$
d(x, z) = f(x - z) = f(x - y + y - z) \\
f(x - y) + f(y - z) = d(x, y) + d(y, z) < 2\epsilon 
$$ 

The triangle inequality allows us to say that 
$$
f(x - y + y - z) \leq f(x - y) + f(y - z) \\
d(x, z) < 2\epsilon
$$

It makes sense that if we have vectors a, b, and c, the distance between a and c is at most the sum of the distance between a and b and the distance between b and c.

#### Adding basis vectors
If we apply property (2) recursively, we arrive at the following equivalent property.
$$f\left(\sum_i v_i\right) \leq \sum_i f(v_i)$$

In Euclidean space, the shortest path to a point is always a straight line. The length of the straight line is represented by $f\left(\sum_i v_i\right).$ Any other way of reaching the point, going through the segments $v_i$ one after another, is represented by $\sum_i f(v_i).$ Thus, property (2) is equivalent to saying that straight lines are the shortest way to reach a point.

### 3) $f(ax) = |a|f(x)$
Here we require the norm to scale linearly if we scale the input linearly. This rules out, for instance, the candidate $f(x) = x^2,$ because $f(ax) = a^2x^2 \neq |a|x^2$ in general. 

#### Unit ball
[source: https://math.stackexchange.com/questions/957414/norm-of-the-one-dimension-real-space]
We can think of the set $B = \{x | ||x|| = 1\}$ as having all the information we need to answer every value for the norm. Say we are asked to compute the value of $f(x)$ for some non-zero x. And say it's equal to $y.$ By the third property, $\frac{1}{y}f(x) = f(\frac{x}{y}) = 1.$ Thus, $\frac{x}{y}$ is in $B,$ and scaling that point of $B$ by y gives us back x. Now, we don't know that y = f(x) a priori. So, to calculate $f(x) = y$ we scale the unit ball (the inputs) until the unit ball touches x. That means there's one vector u in the unit ball that, after being scaled by c, is equal to x. 

$$
cu = x \\
f(cu) = f(x) \\
f(cu) = |c|f(u) = |c| \\
f(x) = |c|
$$

Every norm can be defined by the input points that have norm 1. Now, we want to show that every norm for the one-dimensional case has the form $f(x) = k|x|.$ By the first property, we can't make $f(x) = 0$ for every x. By the third property, only two points x, y can have norm equal to 1 (and |x| = |y|; let's call that magnitude m.) Proof: assume otherwise. Then there are x, y with norm 1 and $|x| \neq |y|.$ In particular, $ax = y$ with $|a| \neq 1.$ Now, we said $f(x) = f(y) = 1.$ But $f(y) = f(ax) = |a|f(x) \neq f(x).$ That's a contradiction. Thus, every norm with $x \in R$ is defined by a single m, which is the magnitude of the two input points with norm equal to 1. Thus, there is a bijection between those norms and the functions written as $f(x) = \frac{1}{m}|x|.$ And there's another bijection between the last expression and $f(x) = p|x|$ with $p = \frac{1}{m} > 0.$ Notice that this expression works for obtaining the norm of not only the two input points with norm 1 (which works by definition), but also all the input points. This happens because it fulfills the scaling property $f(ax) = p|ax| = |a|p|x| = |a|f(x).$

#### More on the property
This property rules out $f(x) = -x$ (otherwise we could reach a contradiction: $x = f(-x) = f((-1)x) = |-1|f(x) = f(x) = -x.$) More generally, this rule forces the norm to fulfill $f(a) = f(-a).$ {Does that imply that we need the result of the norm to be either always positive or always negative?}

### $L_p$ norms
$$||x||_p = \left(\sum_i |x_i|^p\right)^{1/p}$$
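A minimal sketch of this definition, with a spot check of properties (2) and (3) on sample vectors. The function name `lp_norm` and the vectors are assumptions for the example; checking a few inputs illustrates the properties but of course doesn't prove them.

```python
def lp_norm(x, p):
    """L_p norm of a vector x (list of numbers) for p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = [3.0, -4.0]
y = [1.0, 2.0]
a = -2.0
x_plus_y = [xi + yi for xi, yi in zip(x, y)]
ax = [a * xi for xi in x]

for p in (1, 2, 3):
    # Property (2): triangle inequality; property (3): absolute homogeneity.
    assert lp_norm(x_plus_y, p) <= lp_norm(x, p) + lp_norm(y, p) + 1e-12
    assert abs(lp_norm(ax, p) - abs(a) * lp_norm(x, p)) < 1e-9
print(lp_norm(x, 1), lp_norm(x, 2))
```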

One desirable property is to make the derivative with respect to an entry of the vector depend only on that entry and not on the whole vector. Raising to the $1/p$ with $p \neq 1$ makes the derivatives of the entries dependent on the other entries. However, the $L_p$ norm raised to the $p$ will have this property. (Notice that raising the $L_1$ norm to the 1 is clearly leaving it as it is.) Because of this, we generally use the $L_1$ or squared $L_2$ norms.

Say we want to estimate a vector v with a vector u. We then want to minimize some norm of the difference between v and u. Say we use the $L_1$ norm. Thus, we want to minimize $g(v, u) = ||v - u||_1.$ If we take the derivative of g with respect to an entry (say $u_i$), we just get 1 if $v_i < u_i,$ -1 if $v_i > u_i,$ and 0 (as a subgradient, since g isn't differentiable there) if $v_i = u_i.$ This is useful, but we are losing information about how far $u_i$ is from $v_i.$ This doesn't happen with the $L_p$ norm where $p > 1.$
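The contrast between the two derivatives can be seen in a short sketch. The function names and sample vectors are made up for the illustration; the squared $L_2$ norm is used as the $p > 1$ case because its derivative with respect to $u_i$ is simply $2(u_i - v_i).$

```python
def l1_grad(v, u):
    """Derivative of ||v - u||_1 with respect to each u_i (0 where v_i == u_i)."""
    return [(u_i > v_i) - (u_i < v_i) for v_i, u_i in zip(v, u)]

def squared_l2_grad(v, u):
    """Derivative of ||v - u||_2^2 with respect to each u_i."""
    return [2 * (u_i - v_i) for v_i, u_i in zip(v, u)]

v = [1.0, 1.0, 1.0]
u = [1.5, 1.0, -9.0]
print(l1_grad(v, u), squared_l2_grad(v, u))
```

The $L_1$ derivative only reports the sign of each error, while the squared $L_2$ derivative also reports that the third entry is much further off than the first.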

Frobenius norm: we extend the $L_2$ norm to a matrix, $||A||_F = \sqrt{\sum_{i,j} a_{ij}^2}.$ We have to travel through two dimensions instead of one. {what else}

## Positive definiteness
A symmetric matrix is positive definite if all its eigenvalues are positive. This means we don't flip any dimension, we just stretch them.

Say we have the quadratic function $f(x) = x^TAx$ and A is symmetric and PSD. Now, $f''(x) = 2A,$ which is also PSD.

Theorem: if the Hessian of a function is PSD, then the function is convex.

Proof (one-dimensional case, where the Hessian is $f''$): Suppose the function isn't convex. Then, we will have some interval $[a, b]$ where there's a point above the line between $(a, f(a))$ and $(b, f(b)).$ Scale the input of the function to make that interval $[0, 1].$ Let's call $c$ a point where f is above the line. The average rate of change between 0 and c has to be greater than the slope of the line (the average rate of change between 0 and 1), because f(c) is above the line at c. By the mean value theorem, there's a point d between 0 and c s.t. f'(d) is equal to the average rate of change between 0 and c. So, f'(d) = avg rate of change between 0 and c > avg rate of change between 0 and 1 = slope of the line. Now, we know that f'(x) is non-decreasing because the Hessian is non-negative. Thus $f'(x) \geq f'(d)$ for every $x \geq d.$ Thus, the avg rate of change between c and 1 is greater than the slope of the line. And also, the line is below the point f(c). Thus, the line and f can't touch again at 1. That's a contradiction, because the line passes through $(1, f(1)).$ Thus, it's convex.

Another way of defining them is as follows.

$A \succ 0 \iff A $ is positive definite $\iff x^TAx$ is greater than zero for every non-zero vector $x$

$A \succeq 0 \iff A $ is positive semidefinite $\iff x^TAx$ is greater than or equal to zero for every non-zero vector $x$

$A \prec 0 \iff A $ is negative definite $\iff x^TAx$ is smaller than zero for every non-zero vector $x$

$A \preceq 0 \iff A $ is negative semidefinite $\iff x^TAx$ is smaller than or equal to zero for every non-zero vector $x$

## Special matrices
### Some other matrices
Conjugate transpose: we transpose the matrix and then take the complex conjugate of every element.

$A^H := \bar A^T$

Thus, $(A^H)_{ij} = \bar a_{ji}$

#### Motivation for this operation
We can represent a complex number $a + bi$ as the following 2x2 matrix 
$
\begin{bmatrix}
 a & -b \\
 b & a \\
\end{bmatrix}
.$
Then, we can represent every m by n matrix with complex entries with a 2m by 2n matrix with real entries. Taking the plain transpose of the 2m by 2n matrix corresponds to taking the conjugate transpose of the m by n matrix. 
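The single-number case of this correspondence can be checked directly. In the sketch below the helper names are assumptions for the example. It verifies that transposing the 2x2 block conjugates the complex number it represents, and (a standard fact added here, not stated above) that the representation also respects multiplication.

```python
def as_real_matrix(z):
    """Represent the complex number z = a + bi as [[a, -b], [b, a]]."""
    return [[z.real, -z.imag], [z.imag, z.real]]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

z, w = complex(1, 2), complex(3, -1)
# Transposing the 2x2 block conjugates the number it represents.
assert transpose(as_real_matrix(z)) == as_real_matrix(z.conjugate())
# The representation also respects multiplication.
assert matmul(as_real_matrix(z), as_real_matrix(w)) == as_real_matrix(z * w)
```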
{from here, I don't know how to justify that the conjugate transpose makes sense. What is it about the matrices with real entries that makes mirroring them useful for complex matrices? It doesn't seem common for these real matrices to arise, because the dimensions have to be even and a lot of entries have to be equal, or opposite.}

Hermitian matrices are the matrices where 
$A^H = A$

Skew-hermitian matrices are the ones that
$A^H = -A$




### Analogy
"Symmetric matrices often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments." I feel that there is something similar to PSD matrices. (Not sure:) the fact that a PSD matrix has advantages over other matrices is that it has a more limited source compared to the other matrices. {continue exploring}

$A^TA \iff a^2$

$A\ positive\ definite \iff a > 0$ (the same for negative and semis)

$A$ orthonormal $\iff$ **a** unit vector $\iff$ $a = 1$ 

{is there an analogue of the eigendecomposition in the scalars?}

Hermitian matrices: analogous to real numbers

Skew hermitian matrices (~antisymmetric): analogous to imaginary numbers

=> every matrix can be expressed as a sum of hermitian and skew hermitian matrices (as any complex number can be written as the sum of real and imaginary numbers.)

{https://math.stackexchange.com/questions/58890/similarity-between-special-matrices-and-special-complex-numbers}

## SVD
For any real matrix A, we can write it as $UDV^T,$ with U, V orthogonal (=>square) matrices, and D a diagonal matrix.

$(AA^T)^T = (A^T)^TA^T = AA^T.$ So $AA^T$ is symmetric and square. So it has an eigendecomposition with orthonormal eigenvectors.

$AA^T = UDV^TVD^TU^T = UD^2U^T$

So, applying $AA^T$ is equivalent to applying $A^T$ and then $A.$ Applying $A^T = VD^TU^T$ is equivalent to rotating by $U^T,$ stretching by $D$ and rotating by $V.$ Applying $A = UDV^T$ is equivalent to rotating by $V^T,$ stretching by $D$ and rotating by $U.$ Thus, as V is orthonormal (the rotations by $V$ and $V^T$ cancel), applying $AA^T$ is equivalent to rotating by $U^T,$ stretching by $D^2$ and rotating by $U.$ 

It follows from the SVD that every matrix can be expressed as $\sum \sigma_i u_i v_i^T.$ That is, every matrix can be decomposed as a sum of rank 1 matrices. $\sigma_i$ determines the relative participation of every matrix. 

{https://math.stackexchange.com/questions/2276374/strangs-proof-of-svd-and-intuition-behind-matrices-u-and-v}

## Pseudoinverse
Say we have a non-square matrix A and Ax = y. We want a left inverse B of A such that x = By. Note that if A is wider than it's tall, we could have multiple points in the domain pointing to the same point in the range. And if A is taller than it's wide, we could have points in the codomain that don't correspond to any point in the domain.

Lemma: the definition of inverse doesn't work for non-square matrices. 

Proof: let $A \in R^{m\times n}$ with $m \neq n.$ Then $rank(A) \leq min(m, n),$ and a product of matrices can't have a larger rank than any of its factors. Thus, either $AA^{-1}$ (m by m) or $A^{-1}A$ (n by n) isn't full-rank. Assume that the inverse works. Thus, both expressions have to be equal to the identity, which is full-rank. Contradiction.

So, the definition we are going to use is that of recovering the input vector after a transformation. That is, $A^+(Ax)$ should equal x.

We know that $A = U\Sigma V^T$. Thus, we want $A^+ = V\Sigma^+U^T.$

Define $r := rank(\Sigma).$ The best we can do with $\Sigma^+\Sigma$ is to set it equal to a diagonal matrix with ones in the first r diagonal entries. Thus, we define $\sigma^+_i = 1 / \sigma_i$ for the r non-zero singular values (and zero elsewhere.) If $m < n$, then we will have an incomplete identity matrix. If $m \geq n$ and A has full rank ($r = n$), then we will have the complete identity.

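Thus, $A^+A = V\Sigma^+U^TU\Sigma V^T = V\Sigma^+\Sigma V^T = VI_rV^T,$ where $I_r$ is the n by n diagonal matrix with ones in the first r diagonal entries. If $r = n,$ then $I_r = I$ and $A^+A = VV^T = I.$

$A^+Ax = VI_rV^Tx,$ which equals x when $r = n.$ Otherwise, it's the projection of x onto the row space of A.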

(Not sure at all:) the fact that the pseudoinverse computes the best solution if there are multiple solutions could be related to the fact that the singular values are sorted in descending order.

{why does it have such amazing properties? (the ones in dlb)}

Image: all the values that the function maps to.

Codomain: a set that includes the image

## Trace
Trace: an operation with useful identities. Thus, we tend to use it to make some manipulations easier

$$
1)\ tr(A) = tr(A^T) \\
2)\ tr(AB..YZ) = tr(ZAB..Y) \\
3)\ tr(a) = a
$$

Property (2) holds if the resulting matrix AB..YZ is square and the product ZAB..Y is defined.

### Proof for tr(AB..YZ) = tr(ZAB..Y)
$tr(AB..YZ) = tr(ZAB..Y) \iff tr(MN) = tr(NM)$ (take $M = AB..Y$ and $N = Z.$)

Lemma: $tr(AB) = sum(A \odot B^T)$ with $A \in R^{m\times n},$ $B \in R^{n\times m},$ and $sum$ being a function that reduces a matrix to the sum of its elements.

Proof: think of A as a stack of row vectors and B as a stack of column vectors.  
$$
A = \begin{bmatrix}
 \_\_ & A_{1,:} & \_\_ \\
 \_\_ & A_{2,:} & \_\_ \\
 \ & \vdots & \ \\
 \_\_ & A_{m,:} & \_\_ \\
\end{bmatrix}
B = \begin{bmatrix}
 | & | & & | \\
 B_{:,1} & B_{:,2} & ... & B_{:,m} \\
 | & | & & | \\
\end{bmatrix}
$$
{maydo: replace dots by lines}

$$
(AB)_{ii} = A_{i,:}B_{:,i} \\
\begin{align}
tr(AB) &= \sum_{i} (AB)_{ii} = \sum_i A_{i,:}B_{:,i} = \sum_i \sum_j a_{i,j}b_{j,i} \\
&= \sum_i \sum_j (A)_{i,j}(B^T)_{i,j} = \sum_i \sum_j (A \odot B^T)_{i,j} = sum(A \odot B^T)
\end{align}
$$ 

(An interesting conclusion from this lemma is that $tr(AA^T) = \sum_{ij} a^2_{ij}.$)

Now we want to prove that $tr(MN) = tr(NM).$
$$
\begin{align}
tr(MN) &= sum(M \odot N^T) \\
&= sum((M \odot N^T)^T) \\
&= sum(M^T \odot N) \\
&= sum(N \odot M^T) \\
&= tr(NM) 
\end{align}
$$
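Both the lemma and the cyclic property can be spot-checked on a small non-square pair. The helper names and the matrices below are assumptions for the illustration; note that $MN$ is 2x2 while $NM$ is 3x3, yet the traces agree.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def sum_hadamard_with_transpose(A, B):
    """sum(A ⊙ B^T), ie the sum over i, j of A[i][j] * B[j][i]."""
    return sum(A[i][j] * B[j][i] for i in range(len(A)) for j in range(len(A[0])))

M = [[1, 2, 3], [4, 5, 6]]        # 2 x 3
N = [[7, 8], [9, 10], [11, 12]]   # 3 x 2

tr_MN = trace(matmul(M, N))
tr_NM = trace(matmul(N, M))
print(tr_MN, tr_NM, sum_hadamard_with_transpose(M, N))
```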

## Determinant
$$det(A) = \prod_i eigenvalues(A)_i.$$ 
This is beautiful. The determinant of a matrix A tells us how much A stretches the space.
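As a quick check of the identity on a matrix chosen here for illustration:

$$
\begin{align}
A &= \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \\
det(A - \lambda I) &= (2 - \lambda)^2 - 1 = (\lambda - 3)(\lambda - 1) \implies \lambda_1 = 3,\ \lambda_2 = 1 \\
\lambda_1\lambda_2 &= 3 = 2 \cdot 2 - 1 \cdot 1 = det(A)
\end{align}
$$

So this $A$ maps the unit square to a parallelogram of area 3.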

Let's think about the hyperrectangle given by a matrix A as the space spanned by the vectors resulting from multiplying A with any vector v s.t. $||v||_\infty = 1$. Then, the hyperrectangle given by the identity matrix is just the space spanned by all the vectors with $L_\infty$ norm equal 1, which is the unit hypersquare. The hyperrectangle given by any square matrix contains the points that are $\sum_i \alpha_i\lambda_ix_i$ with $\lambda_i$ being the ith eigenvalue, $x_i$ being the ith eigenvector, and $\alpha_i \in [0, 1]$ being the coefficient for that eigenvector. We can prove this by taking the eigendecomposition of A, $Q\Lambda Q^T.$

Note that a rotation doesn't affect the hypervolume of the hyperrectangle. Say Q is a rotation matrix. Then,
$$
det(AQ) = det(A) \\
A = Q\Lambda Q^T \\ 
AQ = Q\Lambda \\ 
det(A) = det(Q\Lambda) \\
$$

Notice $Q\Lambda$ is a matrix with the (unit) eigenvectors scaled by the eigenvalues. Thus, with $||v||_\infty=1,$ $Q\Lambda v = \sum_i (Q\Lambda)_{:, i}v_i = \sum_i x_i\lambda_i\alpha_i$ (remember that we can see a matrix multiplying a vector as a linear combination of the columns of the matrix.)

Now, the hyperrectangle spanned by multiplying $Q\Lambda$ with vectors with unit $L_\infty$ norm is the same as the hyperrectangle spanned by A (up to a rotation.)

The whole idea of what's written above is that the determinant calculates how multiplying by A stretches the space. To calculate that, we think of A in terms of its eigenbasis. With the eigenbasis, it's easy to calculate how the space is stretched. We can measure how much the space is stretched by taking all the vectors that span the unit hypercube (which has hypervolume = 1) (which are the vectors with $L_\infty$ norm equal 1) and measuring the hypervolume of the hyperrectangle formed by all the $L_\infty$-unit vectors multiplied by A. (We may need to rotate the hypercube to get it aligned with the eigenbasis.)

(This will be the same throughout the space because we are using _linear_ transformations.)


Note that for any unit vector v, we can write $v = \sum_i c_ix_i$ with $x_i$ being the i-th eigenvector of A and $||c||_2 = 1.$ Then 
$$
Av = A\sum_i c_ix_i \\
Av = \sum_i c_iAx_i \\
Av = \sum_i c_i(\lambda_ix_i) \\
$$

We say that d expresses a _unit linear combination_ if it's a vector that has the coefficients for a linear combination and $||d||_2 = 1$.

Then, we can express the hyperrectangle given by any square matrix A as the space spanned by the unit linear combinations of the eigenvectors of A.  We can do this because for


identity matrix as a hypercube. Then, we can think about any square matrix A as a hyperrectangle with sides equal to its eigenvalues. Notice that the last sentence follows logically from the first sentence: if 

Let's define a hyperrectangle as the generalization of a rectangle for n dimensions. The hypervolume of the hyperrectangle is just the multiplication of its n sides. Say we have a matrix $A \in R^{n\times n}.$ We can express that matrix A as a hyperrectangle. To do this, we transform A to an orthogonal basis (eg we eigendecompose it.) Thus, each basis vector of A represents one side of the hyperrectangle. We can express A as a hyperrectangle because 


Notice that the identity represents the hypercube and thus the identity has a determinant of 1. 

Also, $det(A) = 0 \iff \exists i: eigenvalue(A)_i = 0 \iff$ the hypervolume ends up being 0.

The determinant is summarizing the information from the eigenvalues. It says how the stretch of the eigenvalues changes the space as a whole (and not only in one dimension.)
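A small numerical sketch of the claims above, assuming symmetric example matrices so that the eigenvalues are real. The matrices `A` and `S` and all variable names are illustrative choices, not something taken from these notes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumption: a symmetric positive definite A, so all eigenvalues are real and positive.
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)

# det(A) should equal the product of the eigenvalues of A.
det_direct = float(np.linalg.det(A))
det_from_eigenvalues = float(np.prod(np.linalg.eigvalsh(A)))

# The identity (the unit hypercube) has determinant 1.
det_identity = float(np.linalg.det(np.eye(3)))

# A symmetric rank-2 matrix in 3 dimensions: one eigenvalue is 0, so the
# hypervolume collapses and the determinant is 0 (up to rounding error).
u, w = rng.standard_normal(3), rng.standard_normal(3)
S = np.outer(u, u) + np.outer(w, w)
smallest_abs_eigenvalue = float(np.min(np.abs(np.linalg.eigvalsh(S))))
det_singular = float(np.linalg.det(S))
```

Comparisons should use a tolerance, since floating-point results for the singular matrix are only approximately zero.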

## PCA
{PCA from n to n dims <=> eigendecomposing?}

We can picture a vector as a list of scalar features.

Say we have a bunch of vectors in a high-dimensional space. Picture the following process: (a) rotate the space, (b) for each vector, remove the value in many dimensions, (c) unrotate the space.[2] We want to select the rotation that minimizes the sum of the squared distances between a vector before the process and that vector after the process. 

Take the same bunch of vectors in a high-dimensional space. We want to project the vectors into a low-dimensional space where we minimize the projection error.
 

Given a set of $m$ vectors $X = \{x_1, x_2, ..., x_m\}$ (with $\forall i: x_i \in R^n$) we want to find a set of smaller vectors $C = \{c_1, c_2, ..., c_m\}$ (with $\forall i: c_i \in R^l$ and $l < n$) and a matrix $D \in R^{n \times l}$ such that for every i, $||Dc_i - x_i||_2$ is as small as possible. Formally, we want 
$$D, c_1, c_2, ..., c_m = \arg\min_{D, c_1, c_2, ..., c_m} \sum_i||Dc_i - x_i||_2^2.$$

That is

Assume we know the value for $D$.  


The decoding function g maps from $R^l$ to $R^n.$ We assume the decoding function is a matrix $D \in R^{n \times l}.$ We also assume that the column vectors of $D$ are unit and orthogonal [1]. Then, $g(c) = Dc.$ 

As g(c) represents a recovered version of x, we want g(c) to be as near as possible to x.

We want to define the optimal c, $c^*$, in terms of x.

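$$
\begin{align}
c^* &= \arg\min_c ||g(c) - x||_2 \\
    &= \arg\min_c ||g(c) - x||_2^2 \\
\\
||g(c) - x||_2^2 &= (g(c) - x)^T(g(c) - x) \\
    &= g(c)^Tg(c) - g(c)^Tx - x^Tg(c) + x^Tx \\
    &= g(c)^Tg(c) - 2x^Tg(c) + x^Tx \\
    &= (Dc)^T(Dc) - 2x^TDc + x^Tx \\
    &= c^TD^TDc - 2x^TDc + x^Tx \\
    &= c^Tc - 2x^TDc + x^Tx \\
\\
c^* &= \arg\min_c \left(c^Tc - 2x^TDc\right) \\
\\
0   &= \nabla_c \left(c^Tc - 2x^TDc\right) \\
    &= 2c - 2D^Tx \\
c^* &= D^Tx
\end{align}
$$

(Squaring the norm does not change the minimizer because the norm is non-negative. The term $x^Tx$ is dropped in the argmin because it does not depend on c. We use $D^TD = I$ because the columns of D are orthonormal.)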
{how do we know the problem is convex?}
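One possible answer to the question above, sketched under the same assumption that the columns of D are orthonormal. The name $h$ for the objective is introduced here only for illustration. The objective is a quadratic in $c$ whose Hessian is constant:

$$
\begin{align}
h(c) &= c^Tc - 2x^TDc \\
\nabla_c h(c) &= 2c - 2D^Tx \\
\nabla_c^2 h(c) &= 2I
\end{align}
$$

Since $2I$ is positive definite, $h$ is strictly convex, so the stationary point $c = D^Tx$ is its unique global minimum.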

Now we know that if the decoding function is $g(c) = Dc$, then the optimal encoding function is $f(x) = D^Tx$. 
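The encoding result can be sanity-checked numerically. This is a minimal sketch, assuming a D with orthonormal columns built from a reduced QR factorization; the sizes, the seed, and the helper names `encode` and `decode` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, l = 5, 2

# Assumption: build D with orthonormal columns via a reduced QR factorization.
D, _ = np.linalg.qr(rng.standard_normal((n, l)))

x = rng.standard_normal(n)

def encode(x):
    """f(x) = D^T x"""
    return D.T @ x

def decode(c):
    """g(c) = D c"""
    return D @ c

c_star = encode(x)

# Generic least-squares solution of min_c ||Dc - x||_2, for comparison.
c_lstsq, *_ = np.linalg.lstsq(D, x, rcond=None)
matches_least_squares = bool(np.allclose(c_star, c_lstsq))

# Perturbing c_star in random directions should never reduce the error.
best_error = np.linalg.norm(decode(c_star) - x)
perturbed_errors = [
    np.linalg.norm(decode(c_star + 0.1 * rng.standard_normal(l)) - x)
    for _ in range(100)
]
no_perturbation_is_better = all(e >= best_error for e in perturbed_errors)
```

The comparison against `np.linalg.lstsq` only works this cleanly because $D^TD = I$; for a general D the least-squares solution would be $(D^TD)^{-1}D^Tx$ instead.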

How do we select D? We again want to minimize the distance $||g(c) - x||_2$, but we want to do that for every point in the dataset. We consider the case where l = 1 and thus D = d ($\iff$ D is a vector).

(Everything below has the constraint $d^Td = ||d||_2^2 = 1$. Here $r(x) = g(f(x)) = dd^Tx$ is the reconstruction of $x$, and $X$ is the matrix whose i-th row is $(x^i)^T$.)
$$
\begin{align}
\sum\limits_{ij} (x^i_j - g(c^i)_j)^2 &= \sum\limits_{ij} (x^i_j - r(x^i)_j)^2 \\
&= \sum\limits_{ij} (x^i_j - (dd^Tx^i)_j)^2 \\
&= \sum\limits_{i} ||x^i - dd^Tx^i||^2_2 \\
&= \sum\limits_{i} ||x^i - ((x^i)^Td)d||^2_2 \\
&= ||X - Xdd^T||^2_F \\
&= Tr((X - Xdd^T)^T(X - Xdd^T)) \\
&= Tr((X^T - dd^TX^T)(X - Xdd^T)) \\
&= Tr(X^TX - dd^TX^TX - X^TXdd^T + dd^TX^TXdd^T) \\
&= Tr(X^TX) - Tr(dd^TX^TX) - Tr(X^TXdd^T) + Tr(dd^TX^TXdd^T) \\
&= Tr(X^TX) - 2Tr(dd^TX^TX) + Tr(dd^TX^TXdd^T) \\
&= Tr(X^TX) - 2Tr(d^TX^TXd) + Tr(d^TX^TXdd^Td) \\
&= Tr(X^TX) - Tr(d^TX^TXd) \\
\\
\arg\min_d \left(Tr(X^TX) - Tr(d^TX^TXd)\right) &= \arg\min_d -Tr(d^TX^TXd) \\
&= \arg\max_d Tr(d^TX^TXd) \\
&= \arg\max_d d^TX^TXd
\end{align}
$$

(We use the cyclic property of the trace, $Tr(AB) = Tr(BA)$, and drop $Tr(X^TX)$ in the argmin because it does not depend on d.)

We know from the eigenworld section that the solution to this optimization problem is to take $d$ equal to the eigenvector of $X^TX$ with the largest eigenvalue. 

We know that the right singular vectors of the SVD of $X$ are the eigenvectors of $X^TX$, and the singular values of $X$ are the square roots of the eigenvalues of $X^TX$. Thus, from computing SVD(X), we can select the eigenvector of $X^TX$ with the largest eigenvalue: it is the right singular vector with the largest singular value. 
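A hedged sketch of the $l = 1$ case with NumPy. It assumes the rows of `X` are the data points and, like the derivation above, applies no mean-centering; the synthetic data and the helper `reconstruction_error` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumption: rows of X are the data points x^i, stretched along the first axis.
X = rng.standard_normal((200, 3)) * np.array([5.0, 1.0, 0.2])

# SVD: X = U diag(s) Vt, with singular values sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
d = Vt[0]  # right singular vector with the largest singular value

# Eigendecomposition of X^T X (eigh sorts eigenvalues in ascending order).
eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)
top_eigenvector = eigenvectors[:, -1]

# Eigenvectors are only defined up to sign, so compare |d . v| to 1.
same_direction = bool(np.isclose(abs(d @ top_eigenvector), 1.0))
singular_values_match = bool(np.allclose(s**2, eigenvalues[::-1]))

def reconstruction_error(direction):
    """||X - X d d^T||_F^2 for the normalized direction d."""
    direction = direction / np.linalg.norm(direction)
    return np.linalg.norm(X - np.outer(X @ direction, direction)) ** 2

# No random unit direction should reconstruct X better than d does.
d_error = reconstruction_error(d)
d_is_best = all(
    reconstruction_error(rng.standard_normal(3)) >= d_error for _ in range(100)
)
```

In practice PCA is usually applied to mean-centered data; centering is left out here only to stay consistent with the derivation in these notes.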

# Notes
[-1] Orthonormal matrices = orthogonal matrices.

[0] It's easier for the machine to compute given that we already decomposed the matrix into eigenvalues and eigenvectors. In that case, $Ax = A(\sum_i\alpha_iv_i) = \sum_iA\alpha_iv_i = \sum_i\lambda_i\alpha_iv_i.$ With $v_i$ the eigenvectors and $\lambda_i$ the eigenvalues. Thus, if we have x decomposed as a linear combination of the eigenvectors, then the matrix multiplication converts into a sum over n values. Consider that decomposing x as a lin. comb. of the eigenvectors costs us a matrix multiplication.

[1] We ask for a unit norm to avoid having more than one valid coding and decoding functions. Otherwise, we can divide D by two and multiply c by two and we recover the same x. We ask for a matrix D, and specifically a matrix D with orthogonal columns, to make the problem easier.

[2] By unrotate I mean to do the inverse to the rotation in (a).

# For other place
Several times I find myself thinking about complex improvements to simple things (like combining the $L_1$ and $L_2$ norm into one, or making a custom norm.) I think that we need to put a grain of salt in complex things. One reason is that they don't generalize as well as simple things. It's worth the effort trying to reduce a complex thing to a simpler one.

n-ary function: a function that takes n arguments (n is the arity of the function)
Algebraic structures: a set with a collection of finitary operations
Algebra over a field

Associative algebra: 

### Abelian group
We have a set A and an operation #.

We have the following five requirements for any a, b, and c in A
* Associativity: (a # b) # c = a # (b # c)
* Inverse: for each a there exists some d in A such that a # d = e, where e is the identity defined in the next item
* Identity/Unital: there exists one and only one e such that a # e = e # a = a
* Closure: a # b is in A
* Commutativity: a # b = b # a

(R, +) and (R \ {0}, \*) are abelian groups. (R, \*) is not, because 0 has no inverse. (Matrices, \*) isn't an abelian group either (we don't always have inverses or commutativity).
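The five requirements can be checked by brute force on a small finite set. This sketch uses the integers modulo 5 as a finite stand-in for R; that substitution and the function name `is_abelian_group` are assumptions made for illustration.

```python
from itertools import product

def is_abelian_group(elements, op):
    """Brute-force check of the five requirements on a finite set."""
    elements = list(elements)
    # Closure: a # b is in A
    if any(op(a, b) not in elements for a, b in product(elements, repeat=2)):
        return False
    # Associativity: (a # b) # c = a # (b # c)
    if any(op(op(a, b), c) != op(a, op(b, c))
           for a, b, c in product(elements, repeat=3)):
        return False
    # Commutativity: a # b = b # a
    if any(op(a, b) != op(b, a) for a, b in product(elements, repeat=2)):
        return False
    # Identity: exactly one e with a # e = e # a = a for every a
    identities = [e for e in elements
                  if all(op(a, e) == a and op(e, a) == a for a in elements)]
    if len(identities) != 1:
        return False
    e = identities[0]
    # Inverse: every a has some d with a # d = e
    return all(any(op(a, d) == e for d in elements) for a in elements)

add_mod_5 = is_abelian_group(range(5), lambda a, b: (a + b) % 5)
mul_mod_5_with_zero = is_abelian_group(range(5), lambda a, b: (a * b) % 5)
mul_mod_5_without_zero = is_abelian_group(range(1, 5), lambda a, b: (a * b) % 5)
```

Addition passes, multiplication fails while 0 is included (0 has no inverse), and multiplication passes once 0 is removed. The same thing happens with multiplication on R.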

### Monoid
Requirements
* Associativity
* Identity: it has to be unique

### Vector space
We have scalars from F and vectors from V. We have two operations # and .
Requirements
* (Vectors, #) is an abelian group
* Distributivity:
    * $a . (\textbf v \# \textbf u) = a . \textbf v \# a . \textbf u$
    * $(a + b) . \textbf v = a . \textbf v \# b . \textbf v$
* Identity: e . v = v
* Compatibility of scalar multiplication with field multiplication (mixed associativity): $a(b\textbf v) = (ab) \textbf v$

### Ring
A set with two operations, where the first operation corresponds to an abelian group, the second to a monoid, and the operations fulfill the following property
* Distributivity: a . (b # c) = (a . b) # (a . c) and (b # c) . a = (b . a) # (c . a)

### Field
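A set with two operations, where the first operation corresponds to an abelian group, and the second one to an abelian group on the elements other than the identity of the first operation (that identity has no inverse under the second operation). The second operation distributes over the first.

check bill gates post on diff types of intelligence when he was 18.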
Explore integrals.


oxford, meet one three times per week, research --> not sure we can't do results, but he affirms it's good for learning. 