# Linear Algebra

---
### In this lesson you'll learn:

- about vectors and matrices and how to do simple calculations with them in Python.
- how to calculate the derivative of simple functions.
- how the chain rule works and why it is so useful for neural networks.
---


Today we will explain the essential mathematical principles for neural networks.

The first essential mathematical concept is the **vector**.

A vector represents a point in a space that is described by several values.
For example, a molecule can be described by several descriptors. 

A vector is represented as follows:

$$\begin{bmatrix}3 & 4 & 0.5\end{bmatrix}$$ 

This vector contains exactly three values. We can use vectors to describe individual data points. For example, we could store the data of a house in this vector. The first value indicates how many bathrooms the house has, the second how many bedrooms, and the third value indicates the age of the heating system in years.

You may have noticed that a vector closely resembles a 1-dimensional `array`, such as
`np.array([3,4,0.5])`. In fact, NumPy arrays behave like vectors: the mathematical rules that apply to vectors also apply to `arrays`.


For example, we can multiply a vector by a number: <br>


$$3\cdot\begin{bmatrix}3 \\ 4 \\ 0.5\end{bmatrix}= \begin{bmatrix}3\cdot 3 \\ 3 \cdot 4  \\ 3 \cdot 0.5 \end{bmatrix}= \begin{bmatrix}9 \\ 12 \\ 1.5\end{bmatrix} $$ 

<center> <i>For readability, we write the vector as a column. </i> </center>



In [None]:
import numpy as np
3 * np.array([3,4,0.5])

Adding or subtracting a single number works the same way in `numpy`: the number is applied to every element of the vector.
$$3+\begin{bmatrix}3 \\ 4 \\ 0.5\end{bmatrix}= \begin{bmatrix}3+3 \\ 3+4 \\ 3+0.5\end{bmatrix}= \begin{bmatrix}6 \\ 7 \\ 3.5\end{bmatrix} $$ 

In [None]:
3 + np.array([3,4,0.5])

We can add two vectors:
    
    
$$\begin{bmatrix}3 \\ 4 \\ 0.5\end{bmatrix} + \begin{bmatrix}0.3 \\ 3 \\ -0.2\end{bmatrix} = \begin{bmatrix}3 +0.3 \\ 4+3 \\ 0.5-0.2\end{bmatrix} =  \begin{bmatrix}3.3 \\ 7 \\ 0.3\end{bmatrix}$$

It is important that both vectors have the same length.

In [None]:
np.array([3,4,0.5]) + np.array([0.3,3,-0.2])

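To illustrate why the lengths must match, here is a minimal sketch of what `numpy` does when they do not. The names `a`, `b_short` and `lengths_match` are chosen only for this illustration.

```python
import numpy as np

a = np.array([3, 4, 0.5])
b_short = np.array([0.3, 3])   # only two values instead of three

try:
    a + b_short
    lengths_match = True
except ValueError as err:
    # numpy refuses to add vectors of length 3 and length 2
    lengths_match = False
    print("Cannot add the vectors:", err)
```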
Vectors become really interesting when we multiply several together.

The so-called scalar product (also known as the dot product) is especially important for us. It is calculated as follows:
$$\begin{bmatrix}3 \\ 4 \\ 0.5\end{bmatrix} \cdot \begin{bmatrix}0.3 \\ 3 \\ -0.2\end{bmatrix} = (3\cdot 0.3) + (4 \cdot 3 )+ (0.5\cdot -0.2) = 12.8  $$


Calculate the scalar product of the vectors by hand: 

$$\begin{bmatrix}8 \\ 0.25 \\ -1\end{bmatrix} \cdot \begin{bmatrix}0.1 \\ 12 \\ 8\end{bmatrix} = $$

<details>
<summary><strong>Solution:</strong></summary>

$$\begin{bmatrix}8 \\ 0.25 \\ -1\end{bmatrix} \cdot \begin{bmatrix}0.1 \\ 12 \\ 8\end{bmatrix} =(8\cdot 0.1) + (0.25 \cdot 12)+ (-1\cdot 8) = -4.2  $$
</details>
<br>


In `numpy` we use `np.dot()` to calculate the scalar product. 

In [None]:
np.dot(np.array([3,4,0.5]), np.array([0.3,3,-0.2]))

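As a small sketch of what `np.dot()` does for two vectors, the scalar product can also be written as an element-wise multiplication followed by a sum. The names `v`, `u`, `elementwise` and `by_hand` are chosen only for this illustration.

```python
import numpy as np

v = np.array([3, 4, 0.5])
u = np.array([0.3, 3, -0.2])

elementwise = v * u            # multiplies the values position by position
by_hand = elementwise.sum()    # adds the three products

# Both numbers should agree (up to floating-point rounding)
print(by_hand, np.dot(v, u))
```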
As you may have noticed, the scalar product has the same form as the prediction of a linear regression:

In [None]:
x    = np.array([3,4,0.5])
beta = np.array([0.3,3,-0.2])
np.dot(x,beta)

`x` is the input vector that contains the values of three variables, for example for a house that has 3 bathrooms and 4 bedrooms and was equipped with a new heating system half a year ago (`0.5`). The second vector contains the coefficients of the regression, i.e. $\beta_1, \beta_2, \beta_3$. Using the regression, we can then estimate the value of the house in units of €100,000. 

In fact, the scalar product leads to a simplification of the formula. Instead of writing:
$$\hat{y} = \beta_1x_1 +\beta_2x_2 +\beta_3x_3$$
we can also write the formula as follows.

$$\hat{y} = x\beta$$

Here we have to assume that $x$ and $\beta$ are vectors. 
Of course, the intercept $\beta_0$ (sometimes written $t$) is still missing, i.e. the point where the line crosses the y-axis. As explained above, single values can simply be added to vectors. 

So the complete formula is:

$$\hat{y} = x\beta+\beta_0$$

Can you write this formula using `numpy`? Calculate $\hat{y}$ for `x`. Use $\beta_0=-5$.

In [None]:
beta_0 =-5
y_hat = _____________________
y_hat

<details>
<summary><strong>Solution:</strong></summary>

```python
y_hat = np.dot(x,beta)+beta_0
    
```
</details>
<br>


Suppose we want to determine `y_hat` not just for one house but for several houses at the same time. We can do this with exactly the same formula. 

`X` now contains not only one vector, but several. As you have already learned, such data structures can be stored as a 2D array. In mathematics, a 2D array is comparable to a matrix. 

When we talk about matrices, we use capitalized variable names.

`X` is given below. You can see that `np.dot(X,beta) + beta_0` still gives the correct result, this time with one value for each of the 4 rows.

In [None]:
X = np.array([[3,4,0.5],
              [2,1,1.2],
              [4,2,0.12],
              [3,3,2]])

np.dot(X,beta) + beta_0

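To make this concrete, the following sketch compares the result for the whole matrix with the results obtained by processing one house (row) at a time. It reuses `X`, `beta` and `beta_0` from above; `all_at_once` and `one_by_one` are names chosen only for this illustration.

```python
import numpy as np

beta = np.array([0.3, 3, -0.2])
beta_0 = -5
X = np.array([[3, 4, 0.5],
              [2, 1, 1.2],
              [4, 2, 0.12],
              [3, 3, 2]])

# One call for all four houses
all_at_once = np.dot(X, beta) + beta_0

# The same regression, applied to each row separately
one_by_one = np.array([np.dot(row, beta) + beta_0 for row in X])

print(np.allclose(all_at_once, one_by_one))
```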
---
The notation with $\beta$s comes from traditional statistics. In machine learning, the coefficients are denoted by $w$, which stands for "weights". In addition, $\beta_0$, the y-axis intercept, is denoted by $b$ (bias).
Thus, the regression equation is:

$$\hat{y} = Xw+b$$

We will keep this notation from now on.

---

As you have already learned, the power of neural networks is that they perform more than one regression at a time.
That is, we have not just one set of regression coefficients, but several. How many?
That is up to you.

In [None]:
W =  np.array([beta,
              [6,0,-2],
              [1,0,3],
              [0,0,-1],
              [1,2,-1]])
b = np.array([beta_0,3,2,0.5,-2])

`W` now contains the weights for a total of five linear regressions. The first row still contains our `beta` coefficients from the first regression. Each additional row contains the coefficients/weights for another regression, so the number of rows tells us how many regressions we are running. 
`b` also contains five values and is therefore now a vector instead of a scalar. It contains the y-axis intercept for each regression.

In the context of neural networks, the number of regressions performed corresponds to the number of nodes in the hidden layer of the neural network.

If we now use these two matrices, the following happens:

In [None]:
np.dot(X,W)+b

An error message:
```shapes (4,3) and (5,3) not aligned: 3 (dim 1) != 5 (dim 0)```

In fact, we can conclude from the error message what the problem is. 
First, we are given the dimensions (number of rows and columns). 
`X` has `4` rows and `3` columns. `W` has `5` rows and `3` columns. 

Then follows: `3 (dim 1) != 5 (dim 0)`. This means that the number of columns of the first matrix (`3 (dim 1)`) is not equal (`!=`) to the number of rows of the second matrix (`5 (dim 0)`).   

**The number of columns in the first matrix must be equal to the number of rows in the second matrix.**

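This rule can be checked before multiplying. The following is a minimal sketch; `can_multiply` is a hypothetical helper written for this illustration (it is not part of `numpy`), and the arrays of zeros only stand in for matrices with the shapes discussed above.

```python
import numpy as np

def can_multiply(A, B):
    """Check the rule for 2D arrays: columns of A must equal rows of B."""
    return A.shape[1] == B.shape[0]

first = np.zeros((4, 3))         # same shape as X
second_bad = np.zeros((5, 3))    # same shape as W: 3 columns vs. 5 rows
second_ok = np.zeros((3, 5))     # 3 columns vs. 3 rows

print(can_multiply(first, second_bad))
print(can_multiply(first, second_ok))
```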
For example, if we flip the `W` matrix by mirroring it "across the diagonal" we get rows as columns and columns as rows. Then the number of columns of the first matrix and rows of the second matrix match.

Converting columns to rows and vice versa is called *transposing* a matrix; the result is the *transpose*.
`W.transpose()` performs this transformation. 

In [None]:
print(W, "\n")
print(W.transpose())

As you can see, the rows become columns. This also changes the dimensions of the matrix.

In [None]:
print(W.shape, "\n")
print(W.transpose().shape)

After transposing the matrix `W`, the multiplication of the two matrices should work, because now the number of columns of `X` equals the number of rows of `W.transpose()`:

In [None]:
np.dot(X,W.transpose())+b

It actually works. For example, look at the first column. These values are indeed the results of the first regression we computed: `np.dot(X, beta)+beta_0`.
In fact, each row contains the five regression results for one of the four houses.

But how can it be that the regression works even though we have flipped the matrix `W`?

This is because of how matrix multiplication is defined. The scalar product is not calculated between corresponding rows of the two matrices. It is calculated between each row of the first matrix and each column of the second matrix (row times column, which is why a row of the first matrix must be as long as a column of the second). After transposing, each column of `W.transpose()` holds the weights of exactly one regression. 

![Matthew Scroggs](https://www.mscroggs.co.uk/img/full/multiply_matrices.gif)
<center>Source: Matthew Scroggs - 2020 | www.mscroggs.co.uk/blog/73 |</center>

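The "row times column" definition can be sketched with two explicit loops. `matmul_by_hand` is a hypothetical helper written for this illustration, and the small matrices `A` and `B` are made-up values; in practice you would use the `numpy` functions directly.

```python
import numpy as np

def matmul_by_hand(A, B):
    """Multiply two 2D arrays using one scalar product per output entry."""
    n_rows, n_inner = A.shape
    n_inner_b, n_cols = B.shape
    if n_inner != n_inner_b:
        raise ValueError("columns of A must equal rows of B")
    out = np.zeros((n_rows, n_cols))
    for i in range(n_rows):          # go through the rows of the first matrix
        for j in range(n_cols):      # go through the columns of the second matrix
            out[i, j] = np.dot(A[i, :], B[:, j])
    return out

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # 3 rows, 2 columns
B = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]])     # 2 rows, 3 columns

print(np.allclose(matmul_by_hand(A, B), np.dot(A, B)))
```

The entry in row `i` and column `j` of the result is the scalar product of row `i` of the first matrix with column `j` of the second.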

This is almost all that is needed for the forward pass in a neural network.

---

Until now we have always used `np.dot()` for matrix multiplication. NumPy also provides a dedicated function, `np.matmul()`. For 2D arrays both return the same result, and the NumPy documentation recommends `np.matmul()` for matrix multiplication, so we will use this function from now on. 

In [None]:
np.matmul(X,W.transpose())+b

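As a quick sketch, the following compares `np.dot()`, `np.matmul()` and the `@` operator (Python's shorthand for matrix multiplication) on two small 2D arrays. `A_demo` and `B_demo` are made-up arrays used only for this illustration.

```python
import numpy as np

A_demo = np.arange(12.0).reshape(4, 3)   # 4 rows, 3 columns
B_demo = np.arange(15.0).reshape(3, 5)   # 3 rows, 5 columns

with_dot = np.dot(A_demo, B_demo)
with_matmul = np.matmul(A_demo, B_demo)
with_operator = A_demo @ B_demo

# For 2D arrays all three should give the same matrix
same = np.allclose(with_dot, with_matmul) and np.allclose(with_matmul, with_operator)
print(same)
```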
#  Derivatives

In order to understand how neural networks learn, one should know at least roughly what derivatives are and how to calculate them.

The derivative of a function describes the slope of the original function. 
Suppose there is a function $f(x)=x^2$. Then the corresponding derivative is $\frac{df}{dx}=2x$ (read: *derivative of f with respect to x*). 

In the picture, both $f(x)$ (*blue*) and its derivative $\frac{df}{dx}$ (*orange*) are drawn. <br>For example, for $x=-5$, $f(-5) = 25$. The slope at this point is $\frac{df}{dx}(-5)=2\cdot (-5)= -10$. That is, the slope of the function $f(x)=x^2$ is $-10$ when $x=-5$.

<img src="Img/lin_alg/ableitung_1edit.png"></img>

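The slope can also be approximated numerically by looking at how much $f$ changes over a very small step around $x$. The following is a minimal sketch; `numerical_slope` and the step size `h` are choices made for this illustration, not a method used elsewhere in this lesson.

```python
def f(x):
    return x ** 2

def numerical_slope(func, x, h=1e-6):
    """Approximate the slope of func at x from two nearby points."""
    return (func(x + h) - func(x - h)) / (2 * h)

slope = numerical_slope(f, -5.0)
print(slope)   # should be very close to -10
```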
There are some rules for differentiation. First, a simple rule with an example: 
$$f(x) = x^n \rightarrow \frac{df}{dx} = n \cdot x^{n-1}$$
$$f(x) = x^2 \rightarrow \frac{df}{dx} = 2 \cdot x^{2-1}=2x^1= 2x $$
        

Additive constants are always dropped when taking a derivative.

That is:
The derivative of $f(x)=x^2 + 5$ is just $2x$, since a constant only shifts the function up or down but does not affect its slope. 

Coefficients are handled differently:

$$f(x) = ax^n \rightarrow \frac{df}{dx} = (n \cdot a)\cdot x^{n-1}$$

Example:

$$f(x) = 4x^3 \rightarrow \frac{df}{dx} = 12x^2$$ 

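The rule for coefficients can be written as a tiny function. This is only a sketch; `power_rule` is a hypothetical helper for this illustration that returns the new coefficient and the new exponent.

```python
def power_rule(a, n):
    """Derivative of a * x**n, returned as (coefficient, exponent)."""
    return n * a, n - 1

print(power_rule(4, 3))   # (12, 2), i.e. 12x^2
```

A result of `(12, 2)` stands for $12x^2$, which matches the example above.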

**Try to find the derivative of the following functions (probably easier on paper):**

$$g(x)= 7x^5 - 3$$

$$h(x)= 0.5x^2 + 3x +12$$



<details>
<summary><strong>Solution:</strong></summary>

$$\frac{dg}{dx} = 35x^4 $$
$$\frac{dh}{dx} = x +3$$

</details>
<br>

# Chain Rule 

The most important rule for neural networks is the chain rule. It tells us how to calculate the derivatives of chained functions, i.e. general functions of the type:

$$f(x) = g(h(x))$$

The derivative of such a function is then:

$$\frac{df}{dx} = \frac{dg}{dh}\cdot \frac{dh}{dx}$$

The formula alone might be difficult to understand, but with an example it should be relatively easy.

$$\begin{align}f(x)&= (3x + 1)^2 \\g(h)&=h^2; \space\space\space\space\space\space h(x) = 3x+1\end{align}$$

$$\begin{align}
\frac{df}{dx} &= \frac{d}{dh} (h^2)\cdot \frac{d}{dx}h\\
&= 2 h\cdot \frac{d}{dx}(3x+1)\\
&= 2 h \cdot 3 \\
&= 6 \cdot (3x+1)
\end{align}$$

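The result of this example can be checked with a short sketch: the two partial steps are multiplied as in the chain rule and compared with a numerical approximation of the slope. The evaluation point `x0` and the step size `step` are arbitrary choices made for this illustration.

```python
def h(x):
    return 3 * x + 1

def g(h_value):
    return h_value ** 2

def f(x):
    return g(h(x))

def df_dx(x):
    dg_dh = 2 * h(x)   # derivative of h^2 with respect to h
    dh_dx = 3          # derivative of 3x + 1 with respect to x
    return dg_dh * dh_dx

x0 = 2.0
step = 1e-6
analytic = df_dx(x0)                                   # 6 * (3 * 2 + 1) = 42
numeric = (f(x0 + step) - f(x0 - step)) / (2 * step)   # slope from two nearby points
print(analytic, numeric)
```

Both values should agree up to a very small rounding error.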


Previously it was said that the derivative describes the slope of the original function. One can also interpret the derivative $\frac{df}{dx}$ as follows: *By how much does $f(x)$ change if I change $x$?* Here, of course, the amount of change depends on $x$ itself. In the $x^2$ example, small changes in $x$ have a greater impact for values around $x=5$ than for values around $x=1$. 

If we want to optimize the weights of a neural net, we also need to know how a change in the weights causes a change in the loss. 

Here again is a schematic of a neural network.


<img src="Img/lin_alg/ableitung_3edit.png"></img>

For the following example we consider only the last part in more detail. The calculation of $\hat{y}$ is done in two steps. First $Z_2$ is calculated, then a nonlinear function is applied to it, which gives us $\hat{y}$. 

<img src="Img/lin_alg/ableitung_4edit.png"></img>

**For simplicity, we consider only single values in this example.**

So $a_1$ is not a vector at this moment, but only a single value. The same is true for $w_2$ and $b_2$.

<img src="Img/lin_alg/ableitung_5.png"></img>


The question is: what influence do $w_2$ and $b_2$ have on the loss $J$? In other words, how does the loss change when we change $w_2$ or $b_2$?

Mathematically, the influence of $w_2$ is the derivative of $J$ with respect to $w_2$. 
We now use $\partial$ instead of $d$ since we are talking about a function with multiple parameters ($w_2$ and $b_2$).

$$\frac{\partial J}{\partial w_2}$$

However, there is no direct influence of $w_2$ on the loss. $w_2$ influences $z_2$ and $z_2$ has an effect on $\hat{y}$. And finally, $\hat{y}$ has an effect on the loss. So the functions to calculate $\hat{y}$ and $J$ respectively are *chained*.

The chain rule allows us to calculate $\frac{\partial J}{\partial w_2}$ in exactly this way.

First, we calculate the effect of $w_2$ on $z_2$:
$$\frac{\partial J}{\partial w_2} = \frac{\partial z_2}{\partial w_2}\cdots $$

Next, the effect of $z_2$ on $\hat{y}$:

$$\frac{\partial J}{\partial w_2} = \frac{\partial z_2}{\partial w_2}\frac{\partial \hat{y}}{\partial z_2} $$

Last, the effect of $\hat{y}$ on $J$:


$$\frac{\partial J}{\partial w_2} = \frac{\partial z_2}{\partial w_2}\frac{\partial \hat{y}}{\partial z_2}\frac{\partial J}{\partial \hat{y}} $$


The chain rule allows us to simply multiply these effects to get the desired derivative.
This chain can become arbitrarily long, which is why a network can also become arbitrarily large. 
As you may recall, there are also $w_1$ and $b_1$. Their effect on $J$ can be calculated in the same way with the chain rule; the "chain" just becomes longer.

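The multiplication of the three effects can be sketched in code. Note the assumptions: the lesson has not fixed concrete functions at this point, so this sketch uses a sigmoid as the non-linear function and a squared error as the loss purely as stand-ins, and all numbers are made up for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w_2, a_1, b_2, y):
    """Forward pass with the assumed stand-in functions."""
    z_2 = a_1 * w_2 + b_2
    y_hat = sigmoid(z_2)
    return (y_hat - y) ** 2

# Hypothetical single values, chosen only for this illustration
a_1, w_2, b_2, y = 0.5, 0.8, 0.1, 1.0

# Forward pass
z_2 = a_1 * w_2 + b_2
y_hat = sigmoid(z_2)

# The three links of the chain
dz2_dw2 = a_1                        # effect of w_2 on z_2
dyhat_dz2 = y_hat * (1.0 - y_hat)    # effect of z_2 on y_hat (sigmoid derivative)
dJ_dyhat = 2.0 * (y_hat - y)         # effect of y_hat on J (squared error)

chain_gradient = dz2_dw2 * dyhat_dz2 * dJ_dyhat

# Numerical check: change w_2 slightly and see how much the loss changes
step = 1e-6
numeric_gradient = (loss(w_2 + step, a_1, b_2, y)
                    - loss(w_2 - step, a_1, b_2, y)) / (2 * step)
print(chain_gradient, numeric_gradient)
```

If the chain rule was applied correctly, both numbers agree up to a small rounding error.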

## Example:

$$e_1 = 2x+3$$
$$e_2 = 0.5e_1^3$$

Try calculating $\frac{de_2}{dx}$.




<details>
<summary><strong>Solution:</strong></summary>

$$\frac{de_2}{dx}= \frac{de_1}{dx}\frac{de_2}{de_1} $$
$$\frac{de_2}{dx}= 2(1.5e_1^2) $$
    
Because we know that $e_1 = 2x+3$, we can write:
$$\frac{de_2}{dx}= 2(1.5(2x+3)^2) $$ 
$$\frac{de_2}{dx}= 2(1.5(4x^2+12x+9)) $$     
$$\frac{de_2}{dx}= 2(6x^2+18x+13.5) $$ 
$$\frac{de_2}{dx}= 12x^2+36x+27 $$   
</details>
<br>

# Practice Exercise

In this exercise you will calculate the gradient for $w$, as in a neural network. 
It is simplified, of course, and uses only a single value for $w$. The loss function in this example is deliberately simple and would not work in a real application. The same is true for the function that takes the place of the non-linear activation: a realistic activation function would be beyond the scope of this exercise.


Please try to solve this exercise to the best of your ability. As we have said many times, it is not important for us that you get the correct result, but that you have studied the subject. Some people find math easier than others, we are aware of that. 



Back to our "faux" neural network.
Let's assume that the last layer of our network works as follows:

$$z_2 = a_1w_2+b_2$$
$$\hat{y} = z_2^3-3$$
$$J = \hat{y}^2- y^2$$


Calculate $\frac{\partial J}{\partial w_2}$, the "influence" $w_2$ has on $J$ (Loss).
For this we give the values:
<center>
$ a_1 = 2 $ <br> $ b_2=1.4 $ <br>   $ w_2 =0.6 $  <br>  $ y=1 $ 
</center>


In [None]:
# Calculate first z_2, y_hat, and J. A simplified forward pass. 
weight =0.6

z_2 = ___*weight+___

y_hat = (z_2**__)-___

J = ____-____

You have performed the forward pass. Next comes the calculation of the gradient. To do this, we first calculate the individual derivatives separately.

$$\frac{\partial J}{\partial w_2} = \frac{\partial z_2}{\partial w_2}\frac{\partial \hat{y}}{\partial z_2}\frac{\partial J}{\partial \hat{y}} $$

First you calculate $\frac{\partial z_2}{\partial w_2}$ which we will call `dw_2`.

In [None]:
dw_2 = 

Next, you calculate $\frac{\partial \hat{y}}{\partial z_2}$, we'll call it `dz_2`.

In [None]:
dz_2 = 

Finally, you'll calculate $\frac{\partial J}{\partial \hat{y}}$, we'll call it `dy_hat`.

In [None]:
dy_hat = 

To calculate the gradient, you now only need to multiply these three together.

In [None]:
gradient = dw_2*dz_2*dy_hat
gradient

That's it! You have calculated the gradients.

**You don't have to submit the following task, but you can try your hand at it.**

If we put these derivatives in a `for` loop and, in each iteration, shift the weight a little in the direction opposite to the gradient, we can see that the loss slowly gets smaller: we are training the "neural network".

In [None]:
weight =0.6
for i in range(10):
    z_2 = ___*weight+___
    y_hat = (z_2**__)-___
    J = ____-____
    dw_2 = 
    dz_2 = 
    dy_hat = 
    gradient = dw_2*dz_2*dy_hat
    weight -=  0.0001* gradient # updating the weights
    print(J)