# Mathematical Background for Deep Learning





## Linear Algebra (remember, remember Mathe 1...)


A **scalar** is a single number, e.g. an integer, real number... denoted by a non-bold italic letter $a, n, x$. 

A **vector** is an element of a vector space = a 1D-array of numbers, denoted by a bold letter, e.g. 
$$\mathbf{x}=\left(\begin{array}{c}
x_1\\
\vdots\\
x_n
\end{array}\right).$$ 



A **matrix** is a 2D-array of numbers corresponding to a linear map between vector spaces, and denoted by a bold capital letter, e.g. 
$$\mathbf{W}=\left(\begin{array}{cccc}
w_{11}&w_{12}&\ldots&w_{1n}\\
\vdots&\ddots&\ddots& \vdots\\
w_{m1}&w_{m2}&\ldots &w_{mn}
\end{array}\right).$$ 
It gives rise to the linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$ given by 
$$
\mathbf{W}\cdot \mathbf{x}=\left(\begin{array}{c}
w_{11}\cdot x_1+w_{12}\cdot x_2+\ldots+w_{1n}\cdot x_n\\
\vdots \\
w_{m1}\cdot x_1+w_{m2}\cdot x_2+\ldots +w_{mn}\cdot x_n
\end{array}\right)
$$
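As a minimal sketch of the linear map above, the following NumPy snippet multiplies a small matrix with a vector and compares the result with the explicit sums from the formula. The concrete values of `W` and `x` are made up for illustration.

```python
import numpy as np

# Hypothetical example values: a 2x3 matrix W maps R^3 to R^2.
W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
x = np.array([1.0, 0.0, -1.0])

# Matrix-vector product: entry i is sum_j W[i, j] * x[j].
y = W @ x

# The same result written out with explicit sums, as in the formula above.
m, n = W.shape
y_manual = np.array([sum(W[i, j] * x[j] for j in range(n)) for i in range(m)])

print(y)         # [-2. -2.]
print(y_manual)  # same values
```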

A **tensor** is a possibly higher-dimensional abstraction of matrices: it is an n-dimensional array of numbers, e.g. 
- zero-dimensional tensor = scalar
- one-dimensional tensor = vector
- two-dimensional tensor = matrix
- three-dimensional tensor = a cube of numbers...
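The list above can be illustrated with NumPy arrays, whose `ndim` attribute gives the number of dimensions. This is only a sketch; the shapes and values are arbitrary.

```python
import numpy as np

scalar = np.array(3.5)                # zero-dimensional tensor
vector = np.array([1.0, 2.0, 3.0])    # one-dimensional tensor
matrix = np.ones((2, 3))              # two-dimensional tensor
cube = np.zeros((2, 3, 4))            # three-dimensional tensor

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    print(name, "ndim =", t.ndim, "shape =", t.shape)
```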

### Matrices and systems of linear equations

An equation of the form 

$$\mathbf{W}\cdot \mathbf{x}=\mathbf{b}$$
is a system of linear equations (LGS)
$$\begin{array}{ccc}
w_{11}\cdot x_1+w_{12}\cdot x_2+\ldots+w_{1n}\cdot x_n&=&b_1\\
\vdots&& \vdots\\
w_{m1}\cdot x_1+w_{m2}\cdot x_2+\ldots +w_{mn}\cdot x_n&=&b_m
\end{array}.$$


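Such a system can have no solution, exactly one solution, or infinitely many solutions. If $\mathbf{W}$ is square and invertible (i.e. has full rank $n$), there is exactly one solution. If $\mathbf{W}$ has rank $<n$, there is either no solution or there are infinitely many solutions, depending on $\mathbf{b}$. The solutions can be computed with Gaussian elimination.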
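The three cases can be told apart numerically by comparing the rank of $\mathbf{W}$ with the rank of the augmented matrix $(\mathbf{W}\,|\,\mathbf{b})$. The following is a small sketch with made-up matrices; the helper `classify` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

# Hypothetical 2x2 system with an invertible matrix: exactly one solution.
W = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(W, b)           # solves W x = b
print("rank:", np.linalg.matrix_rank(W), "solution:", x)

# Singular matrix (second row = 2 * first row): rank 1 < 2.
W_sing = np.array([[1.0, 2.0],
                   [2.0, 4.0]])

def classify(W, b):
    """Compare rank(W) with the rank of the augmented matrix (W | b)."""
    rank_W = np.linalg.matrix_rank(W)
    rank_aug = np.linalg.matrix_rank(np.column_stack([W, b]))
    if rank_aug > rank_W:
        return "no solution"
    if rank_W == W.shape[1]:
        return "exactly one solution"
    return "infinitely many solutions"

print(classify(W, b))
print(classify(W_sing, np.array([1.0, 2.0])))   # consistent right-hand side
print(classify(W_sing, np.array([1.0, 3.0])))   # inconsistent right-hand side
```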

### Matrix multiplication

Remember that matrices $\mathbf{A}=(a_{ij})$, $\mathbf{B}=(b_{ij})$ can be multiplied as long as the number of columns of $A$ is the same as the number of rows of $B$: 
$\mathbf{A}\mathbf{B}=(\sum_{k}a_{ik}b_{kj})_{ij}$, e.g.
$$\left(\begin{array}{cc}
1&2\\
3&4\\
5&6
\end{array}\right)\cdot \left(\begin{array}{ccc}
1&1&-1\\
0&-1&1
\end{array}\right)=?$$

**Answer:**
$\left(\begin{array}{ccc}
1&-1&1\\
3&-1&1\\
5&-1&1
\end{array}\right)$

**Question:** The neutral element for multiplication of $n\times n$-matrices is ?

**Answer:** the **identity matrix**
$\mathbb{I}_n=\left(\begin{array}{cccc}
1&0&\ldots&0\\
0&1&\ldots &0\\
\vdots&\vdots&\ddots &\vdots\\
0&0&\ldots & 1
\end{array}\right)$
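The matrix product from the example and the role of the identity matrix can be checked with a few lines of NumPy. This is just a sanity check of the hand computation above, not a new method.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])          # shape (3, 2)
B = np.array([[1, 1, -1],
              [0, -1, 1]])      # shape (2, 3)

# The number of columns of A (2) matches the number of rows of B (2).
C = A @ B
print(C)

# The identity matrix is neutral for multiplication from both sides.
I3 = np.eye(3, dtype=int)
print(np.array_equal(I3 @ C, C), np.array_equal(C @ I3, C))
```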

### Invertible matrices

If the matrix $\mathbf{W}$ is square (i.e. $n\times n$) and has **full rank** (meaning the column vectors are linearly independent, i.e. they span a vector space of dimension $n$ $\Leftrightarrow$ none of the eigenvalues is 0), the corresponding system of linear equations
$$\mathbf{W}\mathbf{x}=\mathbf{b}$$ 
has exactly one solution $\mathbf{x}$. This solution can be recovered by multiplying $\mathbf{b}$ with the unique inverse matrix $\mathbf{W}^{-1}$ (i.e. $\mathbf{x}=\mathbf{W}^{-1}\mathbf{b}$), which satisfies 
$$\textbf{W}^{-1}\cdot \textbf{W}=\textbf{W}\cdot \textbf{W}^{-1}=\mathbb{I}_n.$$ 

Matrices with rank less than the number of rows (meaning at least one of the eigenvalues is 0) are called **singular**. They cannot be inverted. 

Note: the inverse matrix is good in theory, but not so much in practice. 

**Problems with inverting invertible matrices numerically (i.e. with a computer):**

1. Numerical instability: Computers store and calculate with floating-point numbers $\Rightarrow$ small rounding errors. These can lead to huge differences when inverting a matrix. Example: the $1\times 1$-matrix $(\epsilon_1)$, where $\epsilon_1$ is a very, very small number. 

Exact inverse: $\frac{1}{\epsilon_1}$, a very large number. 

If a floating-point rounding error occurs in the calculation and we end up with $2\epsilon_1$ instead of $\epsilon_1$, the computed inverse is $\frac{1}{2}\cdot\frac{1}{\epsilon_1}$. The error in the input is only $\epsilon_1$ (tiny), but the computed inverse is off by $\frac{1}{2\epsilon_1}$ (huge).




2. Memory overhead: 
$$\mathbf{A}=\left(\begin{array}{ccc}
1  &  0  &   2\\
-1 &  5  &   0\\
0  &  3  &  -9
 \end{array}\right)\Rightarrow \mathbf{A}^{-1}=\left(\begin{array}{ccc}
0.8824 &  -0.1176  &  0.1961\\
0.1765 &   0.1765  &  0.0392\\
0.0588 &   0.0588  &  -0.0980\\
\end{array}\right)$$
$$\Rightarrow \mathbf{A}\cdot \mathbf{A}^{-1}=\left(\begin{array}{ccc}
1.0000  &  0.0000   &-0.0000\\
0       &  1.0000   &  -0.0000\\
0       & -0.0000   &  1.0000 \end{array}\right)\neq I_n$$
Due to rounding errors, the computed product is not exactly the identity matrix: it has very small floating-point numbers in place of some of the 0's. In a sparse storage format, exact 0's do not necessarily have to be stored in memory, but these small entries do!

$\Rightarrow$ Never invert a matrix $\mathbf{A}$ numerically if you can help it! Rather, solve the linear equation $\mathbf{A}\mathbf{x}=\mathbf{b}$ directly, e.g. with an LU decomposition, or with a Cholesky decomposition if $\mathbf{A}$ is symmetric positive definite.
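A minimal sketch of this advice in NumPy, using the matrix $\mathbf{A}$ from above: `np.linalg.solve` solves the system directly (via an LU factorization), while `np.linalg.inv` forms the explicit inverse for comparison. The right-hand side `b` is a made-up example.

```python
import numpy as np

A = np.array([[1.0, 0.0, 2.0],
              [-1.0, 5.0, 0.0],
              [0.0, 3.0, -9.0]])
b = np.array([1.0, 2.0, 3.0])   # hypothetical right-hand side

# Preferred: solve A x = b directly, without forming the inverse.
x_solve = np.linalg.solve(A, b)

# For comparison only: explicit inverse followed by a multiplication.
A_inv = np.linalg.inv(A)
x_inv = A_inv @ b

# Largest deviation of A @ A_inv from the identity matrix; it may be
# exactly zero or of the order of machine precision, depending on rounding.
deviation = np.abs(A @ A_inv - np.eye(3)).max()

print("solve:    ", x_solve)
print("inverse:  ", x_inv)
print("deviation:", deviation)
```

For a single small system both routes agree to many digits; the difference in cost and accuracy matters more for large or badly conditioned matrices.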

### Transpose of a matrix

The transpose of a matrix $\mathbf{W}$ as above is the matrix you get by mirroring $\mathbf{W}$ across the main diagonal, so that rows become columns: 
$$\mathbf{W}^T=\left(\begin{array}{cccc}
w_{11}&w_{21}&\ldots&w_{m1}\\
\vdots&\ddots&\ddots& \vdots\\
w_{1n}&w_{2n}&\ldots &w_{mn}
\end{array}\right).$$

Example: 
$$\left(\begin{array}{cc}
1&2\\
3&4
\end{array}\right)^T=\left(\begin{array}{cc}
1&3\\
2&4
\end{array}\right)$$
Rule: $(AB)^T=B^TA^T$ (note the reversed order of the factors)
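The transpose of a product reverses the order of the factors, $(AB)^T=B^TA^T$. A quick numerical check with random rectangular matrices (shapes and seed chosen arbitrarily) also shows why the order has to flip: with these shapes, only $B^TA^T$ has matching inner dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, arbitrary choice
A = rng.normal(size=(3, 2))
B = rng.normal(size=(2, 4))

lhs = (A @ B).T                     # shape (4, 3)
rhs = B.T @ A.T                     # (4, 2) @ (2, 3) -> (4, 3)
print(np.allclose(lhs, rhs))

# A.T @ B.T would be (2, 3) @ (4, 2): the inner dimensions 3 and 4 do not match.
```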

### Special matrices and vectors: 

- unit vector: $||\mathbf{x}||=1$
- symmetric matrix: $\mathbf{W}=\mathbf{W}^T$
- orthogonal matrix: $$\mathbf{W}\cdot \mathbf{W}^T=\mathbf{W}^T\cdot \mathbf{W}=I_n\Rightarrow \mathbf{W}^{-1}=\mathbf{W}^T $$
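The three definitions in this list can be checked on small examples. The sketch below uses made-up values: a normalized vector, a symmetric matrix built as $\mathbf{M}^T\mathbf{M}$, and a 2D rotation matrix as a standard example of an orthogonal matrix.

```python
import numpy as np

# Unit vector: normalize an arbitrary nonzero vector.
v = np.array([3.0, 4.0])
u = v / np.linalg.norm(v)

# Symmetric matrix: M.T @ M is symmetric for any matrix M.
M = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])
S = M.T @ M

# Orthogonal matrix: a 2D rotation by an arbitrary angle theta.
theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.linalg.norm(u))                    # length of the unit vector
print(np.allclose(S, S.T))                  # symmetry
print(np.allclose(Q @ Q.T, np.eye(2)))      # orthogonality
print(np.allclose(np.linalg.inv(Q), Q.T))   # inverse equals transpose
```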

## Differentiation


### Differentiating real-valued functions in one variable

School/Mathe2: If $f(x)$ is a real-valued function in one real variable which is "smooth" enough, one can compute the **derivative** 
$f'(x)=\frac{d}{dx}f(x)$, which is the slope of the function at the point $x$. 

**Question:** $f(x)=x^2 + \sin x + x$. Compute the derivative!

**Answer:** $f'(x)=2x+\cos x +1$

**Rules for differentiating:** 
- Chain rule: $$(f\circ g)'(x)=f'(g(x))\cdot g'(x)$$
- Product rule: $$(f(x)\cdot g(x))'=f'(x)\cdot g(x)+ f(x)\cdot g'(x)$$
- Quotient rule: 
  $$\left(\frac{f(x)}{g(x)}\right)'=\frac{f'(x)\cdot g(x)- f(x)\cdot g'(x)}{(g(x))^2}$$

**Example:** $f(x)=x\cdot \log x + e^{2x}$. Compute the derivative using the chain rule and the product rule!


**Answer:** $f'(x)=1\cdot\log x + x\cdot \frac{1}{x} +2\cdot e^{2x}$

**Example:** $\sigma(x)=\frac{1}{1+e^{-x}}$ is the sigmoid function. Compute the derivative using the chain rule!


**Answer:** $\sigma'(x)=-\frac{1}{(1+e^{-x})^2}\cdot (-e^{-x})=\frac{1}{1+e^{-x}}\cdot \frac{e^{-x}}{1+e^{-x}}=\sigma(x)\cdot (1-\sigma(x))$, since $\frac{e^{-x}}{1+e^{-x}}=1-\frac{1}{1+e^{-x}}=1-\sigma(x)$. 
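The identity $\sigma'(x)=\sigma(x)\cdot(1-\sigma(x))$ can be checked numerically against a central finite difference. This is only a sanity check; the step size `h` and the test points are arbitrary choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # Closed form derived above: sigma'(x) = sigma(x) * (1 - sigma(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

def central_difference(f, x, h=1e-6):
    # Numerical approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in [-2.0, 0.0, 1.5]:
    print(x, sigmoid_prime(x), central_difference(sigmoid, x))
```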

### Real-Valued Functions in several variables

Now we consider "smooth" real-valued functions $f(x_1,\ldots, x_n)$ in $n$ variables (one-dimensional output, $n$-dimensional input). 

**Example:** Atmospheric pressure (Luftdruck) at a certain time is a function of longitude and latitude. What does the graph of this function (i.e. plotting atmospheric pressure versus longitude and latitude in 3D) look like?

**Answer:** It is a smooth surface in 3D with hills and valleys. And on this surface one can consider lines of constant pressure.  

<br></br><img src="./Other_Images/Image_Isobaren.png" alt="Isobaren" width="400" title = "Wikipedia"/>   <br></br>

For a function as above, one calls the sets of points with constant value $f(x_1,\ldots, x_n)=c$ the **level sets** of a function. 

**Example:** The prediction function of a NN with one output head is a (mostly) smooth real-valued function (whether entirely smooth or not depends on the activation functions) in the input and the weights of the NN.

**Most important example**: The loss function of a ML model (e.g. MSE) is a real-valued smooth function in the parameters.

### Partial derivatives and Gradient

If $f(x_1,\ldots, x_n)$ is a smooth real-valued function in $n$ variables as above, one can compute the **partial derivative** with respect to each of the variables $x_i$, which is denoted by 
$$\frac{\partial}{\partial x_i}f(x_1,\ldots, x_n)$$
This is EXACTLY the same as "normal" differentiation: we treat the other variables as constants, consider $f(x_1,\ldots, x_n)$ only as a function in $x_i$, and compute the derivative as above!

**Example:** $f(x,y,z)=x^2+y^2+z\cdot \sin x$. Compute the partial derivatives with respect to $x,y,z$!




**Answer:** $\frac{\partial}{\partial x}f(x,y,z)=2x+z\cdot \cos x, \frac{\partial}{\partial y}f(x,y,z)=2y, \frac{\partial}{\partial z}f(x,y,z)=\sin x$. 

**Example:** $f(x,y,z)=\text{Softmax}_x(x,y,z)=\frac{e^x}{e^x+e^y+e^z}$. Compute the partial derivatives with respect to $x,y,z$!

**Answer:** 
$$\frac{\partial}{\partial x}f(x,y,z)=\frac{e^x\cdot (e^x+e^y+e^z)- e^x\cdot e^x}{(e^x+e^y+e^z)^2}=\frac{e^x}{e^x+e^y+e^z}\cdot\left(1- \frac{e^x}{e^x+e^y+e^z}\right)=f(x,y,z)\cdot \left(1-f(x,y,z)\right),$$
$$\frac{\partial}{\partial y}f(x,y,z)=e^x\cdot (-\frac{1}{(e^x+e^y+e^z)^2})\cdot e^y=-\frac{e^xe^y}{(e^x+e^y+e^z)^2},$$
$$\frac{\partial}{\partial z}f(x,y,z)=-\frac{e^xe^z}{(e^x+e^y+e^z)^2}.$$ 
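The three softmax partial derivatives can be verified with central finite differences. The sketch below is a numerical check at one made-up point `p0`; the function names are hypothetical helpers for this illustration. As a side observation, the three partial derivatives sum to zero, because adding the same constant to $x$, $y$ and $z$ leaves the softmax unchanged.

```python
import numpy as np

def softmax_x(p):
    """First softmax component e^x / (e^x + e^y + e^z) for p = (x, y, z)."""
    e = np.exp(p)
    return e[0] / e.sum()

def analytic_partials(p):
    # Closed forms from the answer above.
    e = np.exp(p)
    s = e.sum()
    f = e[0] / s
    return np.array([f * (1.0 - f),
                     -e[0] * e[1] / s**2,
                     -e[0] * e[2] / s**2])

def numeric_partials(func, p, h=1e-6):
    # Central differences, one variable at a time.
    out = np.zeros(len(p))
    for i in range(len(p)):
        step = np.zeros(len(p))
        step[i] = h
        out[i] = (func(p + step) - func(p - step)) / (2.0 * h)
    return out

p0 = np.array([0.2, -1.0, 0.7])   # hypothetical test point
print(analytic_partials(p0))
print(numeric_partials(softmax_x, p0))
print(analytic_partials(p0).sum())
```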

The **gradient** $\nabla f$ of $f$ is the vector of all partial derivatives:
$$\nabla_{x_1,\ldots,x_n} f(x_1,\ldots, x_n) =\left(\begin{array}{c}
\frac{\partial}{\partial x_1}f\\
\frac{\partial}{\partial x_2}f\\
\vdots\\
\frac{\partial}{\partial x_n}f
\end{array}\right)$$
If it is clear for which variables we compute the partial derivatives, one can also drop the subscript in the gradient sign $\nabla$.  

**Theorem:** Wherever it is nonzero, $\nabla f(x_1,\ldots, x_n)$ ALWAYS points in the direction of the steepest ascent of $f$ at the point $(x_1,\ldots ,x_n)$ and $-\nabla f(x_1,\ldots, x_n)$ in the direction of the steepest descent. 

**Example:** Compute the gradient of the function $f(x,y,z)=x^2+y^2+z\cdot \sin x$.

**Answer:** 
$$\nabla f(x,y,z)=\left(\begin{array}{c}
2x+z\cdot \cos x\\
2y\\
\sin x
\end{array}\right).$$ 
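To illustrate the steepest-ascent property for this gradient: the rate of change of $f$ in a unit direction $\mathbf{u}$ is the dot product $\nabla f\cdot \mathbf{u}$, which is largest when $\mathbf{u}$ points along $\nabla f$. The sketch below samples random unit directions at a made-up point `p0` and compares their slopes with the slope along the gradient.

```python
import numpy as np

def grad_f(p):
    # Gradient of f(x, y, z) = x^2 + y^2 + z * sin(x), as computed above.
    x, y, z = p
    return np.array([2 * x + z * np.cos(x), 2 * y, np.sin(x)])

p0 = np.array([0.5, -1.0, 2.0])           # hypothetical point
g = grad_f(p0)
g_norm = np.linalg.norm(g)

# Sample random unit directions and compute the slope of f along each one.
rng = np.random.default_rng(0)
directions = rng.normal(size=(1000, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
slopes = directions @ g

# Unit vector pointing along the gradient.
best = g / g_norm
print("largest sampled slope:   ", slopes.max())
print("slope along the gradient:", best @ g)
print("norm of the gradient:    ", g_norm)
```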

### Differentiating real-valued higher-dimensional functions 

If $f(x)=(f_1(x),\ldots, f_d(x))$ is a vector of real differentiable functions in one variable (one-dimensional input, $d$-dimensional output), one can compute the derivative with respect to $x$ componentwise: 

$f'(x)=(f'_1 (x),\ldots, f'_d(x))$. 

### Differentiating higher-dimensional functions in several variables: The Jacobian

Now let $f(x_1,\ldots, x_n)=\left(f_1 (x_1,\ldots, x_n),\ldots, f_d(x_1,\ldots, x_n)\right)$ be a vector of real differentiable functions in $n$ variables ($n$-dimensional input, $d$-dimensional output). 

**Question:** What is the partial derivative of $f$ with respect to $x_i$?

**Answer:** an entire vector:
$\frac{\partial}{\partial x_i}f(x_1,\ldots, x_n)=\left(\frac{\partial f_1 (x)}{\partial x_i},\ldots, \frac{\partial f_d (x)}{\partial x_i}\right).$

**Question:** What would an equivalent of the gradient for real-valued functions (i.e. a tensor with all partial derivatives) look like in this case?


**Answer:** something two-dimensional, i.e. a matrix!

The **Jacobian matrix** is defined as the matrix which has the partial derivatives of the function $f$ with respect to $x_1,\ldots,x_n$ as columns (and, equivalently, the gradients of the functions $f_i$ as rows): 
$$Jf_{x_1,\ldots, x_n}(x_1,\ldots, x_n)=\left(\begin{array}{cccc}
\frac{\partial f_1 (x)}{\partial x_1} & \frac{\partial f_1 (x)}{\partial x_2}&\cdots & \frac{\partial f_1 (x)}{\partial x_n}\\
\vdots&\vdots&\ddots&\vdots\\
\frac{\partial f_d (x)}{\partial x_1} & \frac{\partial f_d (x)}{\partial x_2}&\cdots & \frac{\partial f_d (x)}{\partial x_n}
\end{array}\right)$$
If it is clear what the variables are (usually all of them), one can also drop the subscript. 

**Example:** 
$f(x,y,z)=\left(\begin{array}{c}
2x+ 3y + 4z\\
x -y -2z\\
y-z
\end{array}\right)$

Write $f$ in matrix notation and compute the Jacobian of $f$!

**Answer:** 
$f(x,y,z)=\left(\begin{array}{ccc}
2& 3& 4\\
1 &-1 &-2\\
0& 1&-1
\end{array} \right)\cdot \left(\begin{array}{c}
x\\
y\\
z
\end{array} \right)
$

$f_1=2x+ 3y + 4z, f_2=x -y -2z, f_3=y-z$

$
\Rightarrow Jf=\left(\begin{array}{ccc}
2& 3& 4\\
1 &-1 &-2\\
0& 1&-1
\end{array} \right)
$

So the Jacobian of a linear function (multiplication with a matrix) is the matrix itself!



**Example:** Compute the Jacobian of the following function: 
$f(x,y,z)=\left(\begin{array}{c}
x^2+\log y+z^2 +xy\\
y\cdot e^x 
\end{array}\right)$

**Answer:** 
$Jf=\left(\begin{array}{ccc}
2x+y&\frac{1}{y}+x&2z\\
ye^x& e^x& 0
\end{array}\right)$
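The Jacobian from this answer can be compared with a numerical Jacobian built column by column from central finite differences. This is a sketch for checking hand computations; the test point `p0` and the step size `h` are arbitrary choices, and `jacobian_numeric` is a hypothetical helper name.

```python
import numpy as np

def f(p):
    x, y, z = p
    return np.array([x**2 + np.log(y) + z**2 + x * y,
                     y * np.exp(x)])

def jacobian_analytic(p):
    # Rows are the gradients of f_1 and f_2, as in the answer above.
    x, y, z = p
    return np.array([[2 * x + y, 1.0 / y + x, 2 * z],
                     [y * np.exp(x), np.exp(x), 0.0]])

def jacobian_numeric(func, p, h=1e-6):
    p = np.asarray(p, dtype=float)
    columns = []
    for j in range(p.size):
        step = np.zeros_like(p)
        step[j] = h
        # Column j approximates the partial derivative of func w.r.t. variable j.
        columns.append((func(p + step) - func(p - step)) / (2.0 * h))
    return np.column_stack(columns)

p0 = np.array([1.0, 2.0, 0.5])   # hypothetical test point (y > 0 for the log)
print(jacobian_analytic(p0))
print(jacobian_numeric(f, p0))
```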

### Chain Rule for Differentiation

If $f\colon \mathbb{R}^n\to \mathbb{R}^m$ (input n-dimensional, output m-dimensional) and $g\colon \mathbb{R}^m\to \mathbb{R}^k$ (input m-dimensional, output k-dimensional) are two smooth functions, we can consider the composite $g\circ f\colon \mathbb{R}^n\to \mathbb{R}^k$. 

**Chain Rule for Differentiation:** $$J(g\circ f)(x_1,\ldots, x_n)=Jg(f(x_1,\ldots, x_n))\cdot Jf(x_1,\ldots, x_n)$$
If $k=1$, $g$ and $g\circ f$ are real-valued functions, so we can compute their gradients. For $k=1$ the Jacobian is the gradient written as a row vector, $Jg=(\nabla g)^T$, so as a special case of the above we get the chain rule: 
$$\nabla (g\circ f)(x_1,\ldots, x_n)^T=\nabla g(f(x_1,\ldots, x_n))^T\cdot Jf(x_1,\ldots, x_n)$$
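A small worked check of the chain rule for $k=1$, with made-up functions $f\colon\mathbb{R}^2\to\mathbb{R}^3$ and $g\colon\mathbb{R}^3\to\mathbb{R}$ (these are assumptions for illustration only). The gradient of the composite is computed once by multiplying the gradient of $g$, treated as a row vector, with the Jacobian of $f$, and once by differentiating the composite by hand.

```python
import numpy as np

def f(x):                      # f: R^2 -> R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def Jf(x):                     # Jacobian of f, shape (3, 2)
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 0.0],
                     [0.0, 2 * x[1]]])

def grad_g(u):                 # gradient of g(u) = u0 + u1 * u2
    return np.array([1.0, u[2], u[1]])

def grad_composite_direct(x):
    # (g o f)(x) = x0 * x1 + sin(x0) * x1^2, differentiated by hand.
    return np.array([x[1] + np.cos(x[0]) * x[1] ** 2,
                     x[0] + 2 * x[1] * np.sin(x[0])])

x0 = np.array([0.4, -1.3])     # hypothetical test point

# Chain rule: gradient of g at f(x0) (as a row vector) times the Jacobian of f at x0.
via_chain_rule = grad_g(f(x0)) @ Jf(x0)

print(via_chain_rule)
print(grad_composite_direct(x0))
```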

### Notational convention

Since it is difficult to always think about the dimensions, i.e. whether you deal with ordinary derivatives, the gradient, or the Jacobian, especially when dealing with long chains of functions, one writes 

$\frac{\partial f}{\partial \mathbf{x}}$ for all kinds of functions $f$ and numbers of parameters in the vector $\mathbf{x}$, i.e.
- if $f$ has 1-dim'l output and the input $\mathbf{x}=x$ is real, $\frac{\partial f}{\partial \mathbf{x}}=f'(x)$.
- if $f$ has 1-dim'l output and the input $\mathbf{x}=(x_1,\ldots,x_n)$ is n-dim'l, $\frac{\partial f}{\partial \mathbf{x}}=\nabla f(\mathbf{x})$
- if $f$ has m-dim'l output and the input $\mathbf{x}=(x_1,\ldots,x_n)$ is n-dim'l, $\frac{\partial f}{\partial \mathbf{x}}=Jf(\mathbf{x})$

## Computing powers of a matrix: vanishing or exploding entries

Recall how you can compute the powers of a diagonalizable matrix $A$ from Mathe 1: 
Compute the eigenvalues of $A$: $\lambda_1, \ldots, \lambda_P$ ($P$ = number of rows and columns of $A$). 
Compute the corresponding eigenvectors of $A$: $v_1,\ldots, v_P$ and let $B$ be the matrix with these eigenvectors as columns. If 
$D=\left(\begin{array}{cccc}
\lambda_1&0&\ldots&0\\
0&\lambda_2&\ldots&0\\
\vdots&\vdots&\ddots&\vdots\\
0&0&\ldots&\lambda_P
\end{array}\right)$ is the diagonal matrix with the eigenvalues on the diagonal, then one can show that 
$$A=B\cdot D\cdot B^{-1}\Rightarrow A^k=(B\cdot D\cdot B^{-1})\cdot \ldots \cdot (B\cdot D\cdot B^{-1})=B\cdot D^k\cdot B^{-1},$$
because the inner factors $B^{-1}\cdot B$ cancel, and $D^k$ is the diagonal matrix with $\lambda_1^k,\ldots,\lambda_P^k$ on the diagonal. 

$\Rightarrow$ If some eigenvalue has absolute value greater than 1, then the corresponding entry of $D^k$ becomes huge with growing $k$ (and therefore, in general, $A^k$ as well: it "explodes"). 

$\Rightarrow$ If all eigenvalues have absolute value less than 1, then the entries of $D^k$ become almost 0 with growing $k$ (and therefore $A^k$ as well: it "vanishes"). 
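The following sketch checks the formula $A^k=B\cdot D^k\cdot B^{-1}$ numerically and shows the vanishing and exploding behavior. The matrix `A` is a made-up example with eigenvalues 0.3 and 0.6; scaling it by 2.5 gives eigenvalues 0.75 and 1.5. The helper name `power_via_eig` is hypothetical, and the explicit inverse of `B` is used here only to mirror the formula.

```python
import numpy as np

def power_via_eig(A, k):
    """Compute A^k as B @ D^k @ B^{-1} (assumes A is diagonalizable)."""
    eigvals, B = np.linalg.eig(A)          # columns of B are eigenvectors
    D_k = np.diag(eigvals ** k)            # powers of the eigenvalues
    return B @ D_k @ np.linalg.inv(B)

A = np.array([[0.5, 0.2],
              [0.1, 0.4]])                 # eigenvalues 0.3 and 0.6
A_big = 2.5 * A                            # eigenvalues 0.75 and 1.5

print(np.sort(np.linalg.eigvals(A)))
print(np.allclose(power_via_eig(A, 5), np.linalg.matrix_power(A, 5)))

for k in (1, 10, 50):
    norm_small = np.linalg.norm(np.linalg.matrix_power(A, k))
    norm_big = np.linalg.norm(np.linalg.matrix_power(A_big, k))
    print(k, norm_small, norm_big)         # first shrinks, second grows
```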

**Application in RNNs:**
- Activations: To compute the activations after $n$ time steps, the hidden-to-hidden weight matrix $\mathbf{W}_{hh}$ gets multiplied $n$ times (with activation functions in between). $\Rightarrow$ if the absolute values of the eigenvalues of $\mathbf{W}_{hh}$ are not close to 1, the activations will tend to either explode or vanish. 
- Backpropagation Through Time (BPTT): For each optimization step in (S)GD, we need to compute the gradient backwards through all time steps, from the last to the first. 

For those who are interested in the mathematics: 
Chain rule for the loss $L_T$ after time step $T$ (with $z_{t+1}=h(z_t)$): 
$$\frac{\partial L_T}{\partial z_i}=\frac{\partial L_T}{\partial z_T}\cdot \frac{\partial h(z_{T-1})}{\partial z_{T-1}}\cdot \frac{\partial h(z_{T-2})}{\partial z_{T-2}}\cdot \frac{\partial h(z_{T-3})}{\partial z_{T-3}}\cdot \ldots \cdot \frac{\partial h(z_{i})}{\partial z_{i}}$$
Note: the function $h$ is the same at every time step! If we treat its Jacobians $\frac{\partial h(z_{i})}{\partial z_{i}}$ as being the same for all $i$ (this is exact if $h$ is linear, and otherwise an approximation) and write $Jh:=\frac{\partial h(z_{i})}{\partial z_{i}}$, we get: 

$$\frac{\partial L_T}{\partial z_i}=\frac{\partial L_T}{\partial z_T}\cdot (Jh)^{T-i}$$

$\Rightarrow \frac{\partial L_T}{\partial \theta}=\sum_{i=1}^T \frac{\partial L_T}{\partial z_i}\cdot \frac{\partial h(z_{i-1})}{\partial \theta}=\sum_{i=1}^T \frac{\partial L_T}{\partial z_T}\cdot (Jh)^{T-i}\cdot\frac{\partial h(z_{i-1})}{\partial \theta}$

This means that the farther you go back in time, the more often the same Jacobian matrix is multiplied with itself! 

$\Rightarrow$ if the absolute values of the eigenvalues of the Jacobian are not close to 1, the gradients will either explode or vanish. 

**Solution:** Stop going back in time after a fixed number of steps (truncated BPTT)! 