# Math Concepts for Machine Learning

---
### In this lesson you'll learn:

- about vectors and matrices and how to do simple calculations with them in Python.
- how to calculate the derivative of simple functions.
- the difference between a regression and a classification.
- how a linear regression functions and the meaning of its coefficients.
- about the *Mean Squared Error* and the loss function.
- what a logistic regression is and how it relates to linear regressions.
- what the Binary Cross Entropy Loss is.
- how linear algebra is applied in neural networks.
- how the chain rule works.
- why the chain rule is so useful for neural networks.

---

In [None]:
import numpy as np
import pandas as pd
%matplotlib inline
np.set_printoptions(suppress=True)

# Linear Algebra

Today we will explain the essential mathematical principles for neural networks.

## Vectors

The first essential mathematical concept is the **vector**.

A vector represents a point in a space that is described by several values.
For example, a molecule can be described by several descriptors. 

A vector is represented as follows:

$$\begin{bmatrix}3 & 4 & 0.5\end{bmatrix}$$

This vector contains exactly three values. We can use vectors to describe individual data points. For example, we could store the data of a house in this vector. The first value indicates how many bathrooms the house has, the second how many bedrooms, and the third value indicates the age of the heating system in years. This means that every individual data point can be represented as a vector in multidimensional space where each variable (column) represents its own dimension.

<table style="margin:auto">

 <tr>
    <th style="text-align:center">Bathrooms</th>
    <th style="text-align:center">Bedrooms</th>
    <th style="text-align:center">Age of the heating system </th>
  </tr>
  <tr>
    <td style="text-align:center">3</td>
    <td style="text-align:center">4</td>
    <td style="text-align:center">0.5</td>
  </tr>

</table>

<br>

<img src="Img/lin_alg/vector.png" width="600" style="display:block; margin:auto">

You may have noticed that a vector is very similar to a 1-dimensional `array`, for example `np.array([3,4,0.5])`. In fact, `np.arrays` behave like vectors: the mathematical rules that apply to vectors also apply to `arrays`.


For example, we can multiply a vector by a number: <br>


$$3\cdot\begin{bmatrix}3 \\ 4 \\ 0.5\end{bmatrix}= \begin{bmatrix}3\cdot 3 \\ 3 \cdot 4  \\ 3 \cdot 0.5 \end{bmatrix}= \begin{bmatrix}9 \\ 12 \\ 1.5\end{bmatrix} $$

<h5 style="text-align:center"><i>For clarity, we write the vector as a column.</i></h5>

In [None]:
3 * np.array([3, 4, 0.5])

The same applies to addition and subtraction:
$$3+\begin{bmatrix}3 \\ 4 \\ 0.5\end{bmatrix}= \begin{bmatrix}3+3 \\ 3+4 \\ 3+0.5\end{bmatrix}= \begin{bmatrix}6 \\ 7 \\ 3.5\end{bmatrix} $$

In [None]:
3 + np.array([3, 4, 0.5])

We can add two vectors:
    
    
$$\begin{bmatrix}3 \\ 4 \\ 0.5\end{bmatrix} + \begin{bmatrix}0.3 \\ 3 \\ -0.2\end{bmatrix} = \begin{bmatrix}3 +0.3 \\ 4+3 \\ 0.5-0.2\end{bmatrix} =  \begin{bmatrix}3.3 \\ 7 \\ 0.3\end{bmatrix}$$

It is important that both vectors have the same length.

In [None]:
np.array([3, 4, 0.5]) + np.array([0.3, 3, -0.2])
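
What happens when the lengths differ can be seen in a minimal sketch. The array `b_short` is a hypothetical example that is deliberately one value too short; it is not part of the lesson data.

```python
import numpy as np

a_demo = np.array([3, 4, 0.5])   # length 3
b_short = np.array([0.3, 3])     # length 2 (hypothetical, deliberately too short)

try:
    a_demo + b_short
    error_message = None
except ValueError as err:
    # numpy refuses to add arrays whose shapes do not fit together
    error_message = str(err)

print(error_message)
```

The addition raises a `ValueError`, because there is no third value in `b_short` that could be added to `0.5`.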

Vectors become really interesting when we multiply several together.

The so-called scalar product (also called the dot product) is especially important for us. It is calculated as follows:
$$\begin{bmatrix}3 \\ 4 \\ 0.5\end{bmatrix} \cdot \begin{bmatrix}0.3 \\ 3 \\ -0.2\end{bmatrix} = (3\cdot 0.3) + (4 \cdot 3 )+ (0.5\cdot -0.2) = 12.8  $$


Calculate the scalar product of the vectors by hand: 

$$\begin{bmatrix}8 \\ 0.25 \\ -1\end{bmatrix} \cdot \begin{bmatrix}0.1 \\ 12 \\ 8\end{bmatrix} = $$

<details>
<summary><strong>Solution:</strong></summary>

$$\begin{bmatrix}8 \\ 0.25 \\ -1\end{bmatrix} \cdot \begin{bmatrix}0.1 \\ 12 \\ 8\end{bmatrix} =(8\cdot 0.1) + (0.25 \cdot 12)+ (-1\cdot 8) = -4.2  $$
</details>
<br>


In `numpy` we use `np.dot()` to calculate the scalar product. 

In [None]:
np.dot(np.array([3, 4, 0.5]), np.array([0.3, 3, -0.2]))
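
To connect `np.dot()` with the calculation by hand, the following sketch splits the scalar product into its two steps: multiply the entries pairwise, then add up the products. The variable names are chosen only for this illustration.

```python
import numpy as np

a_demo = np.array([3, 4, 0.5])
b_demo = np.array([0.3, 3, -0.2])

elementwise = a_demo * b_demo    # multiplies the entries pairwise
by_hand = np.sum(elementwise)    # adds up the three products
with_dot = np.dot(a_demo, b_demo)

print(elementwise, by_hand, with_dot)
```

Both ways give the same value as the calculation above, up to floating point rounding.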

## Matrices

You probably already know the linear equation $ y = mx + t $ (or $ y = ax + b $).

Have a look at this example about the relationship between the height (German: Größe) and weight (German: Gewicht) of people. The relationship between the two variables $x$ and $y$ can be described via a linear regression.

<img src='Img/intro_stats/reg_3.png' alt="Drawing" width="500" style="display:block; margin:auto">

$$y = mx + t$$

- $x$ is the input variable, in our case the body height
- $y$ is the variable to be predicted (body weight)
- $m$ describes the slope of the straight line
- $t$ denotes the y-axis intercept, the value of $y$ at $x=0$

Write a function that calculates the weight using the straight line equation described above.

In [None]:
def reg(x, m, t):
    _________  # What is this function supposed to return?

<details>
<summary><b>Solution:</b></summary>
    
```python
def reg(x,m,t):
    return m*x+t
```
</details>    


The variable `x` contains the height in cm of 5 people. Calculate the weight for these five people using the function `reg`. Assume that the equation of the regression line is $ \hat{y} = 0.3x + 21 $.

In [None]:
x = [182, 167, 198, 132, 178]
y_hat = [reg(__, __, __) for ___ in _____ ]
y_hat

<details>
<summary><b>Solution:</b></summary>
    
```python
y_hat = [reg(height,0.3,21) for height in x ]
```
</details>    

However, we can expand this more common formula to include multiple inputs: $ x_1, x_2, x_3, \dots $ (a so-called **Multiple Regression**). That requires multiple coefficients, which we call $ \beta_1, \beta_2, \beta_3, \dots $, and we replace the y-intercept $t$ with $\beta_0$. We then call the predicted value $\hat{y}$ (`y_hat`) to denote that it is calculated from a model and to distinguish it from the true value $y$. This notation is the more common one when talking about a multiple regression.

This gives us the more general formula of:

$$ \hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \cdots $$

As you may have noticed, the scalar product is similar to a linear regression:

In [None]:
x    = np.array([3, 4, 0.5])
beta = np.array([0.3, 3, -0.2])
np.dot(x, beta)

Here, however, `x` is the input vector that contains the values of three variables, for example for a house that has 3 bathrooms and 4 bedrooms and was equipped with a new heating system half a year ago (`0.5`). The second vector contains the coefficients of the regression, i.e. $\beta_1, \beta_2, \beta_3$. Using the regression, we can then calculate the value of the house, expressed in units of 100,000 €. 

In fact, the scalar product leads to a simplification of the formula. Instead of writing the long version as shown above, we can write it as follows:

$$\hat{y} = x\beta$$

Here we have to assume that $x$ and $\beta$ are vectors. 
Of course the $\beta_0$ (the y-intercept) is still missing. As explained above, single values can simply be added to vectors. 

So the complete formula is:

$$\hat{y} = x\beta+\beta_0$$

Can you write this formula using `numpy`? Calculate $\hat{y}$ for `x`. Use $\beta_0=-5$ and $ x $ and $ \beta $ from the cell above.

In [None]:
beta_0 = -5
y_hat = _____________________
y_hat

<details>
<summary><strong>Solution:</strong></summary>

```python
y_hat = np.dot(x,beta)+beta_0
    
```
</details>
<br>


Assuming we want to determine `y_hat` not only for one house but for several houses at the same time, we can do this with exactly the same formula. 

`X` now contains not only one vector, but several. As you have already learned, such data structures can be stored as a 2D array. In mathematics, a 2D array is comparable to a matrix. You can imagine a matrix as being multiple vectors that have been glued together.

When we talk about matrices, we use capitalized variable names.

In the following `X` is given. You can see that `np.dot(X,beta) + beta_0` still gives the correct result. But this time for each of the 4 rows.

In [None]:
X = np.array([[3, 4, 0.5],
              [2, 1, 1.2],
              [4, 2, 0.12],
              [3, 3, 2]])

np.dot(X, beta) + beta_0

We can extend this knowledge to generalize even further. We can multiply two matrices together.

In [None]:
B = np.array([beta,
              [6, 0, -2],
              [1, 0, 3],
              [0, 0, -1],
              [1, 2, -1]])
b_0 = np.array([beta_0, 3, 2, 0.5, -2])

`B` now contains the coefficients for a total of five linear regressions. The first row still contains our `beta` coefficients from the first regression.  Each additional row contains new coefficients/weights for another regression. So by the number of rows we can see how many regressions we are running. 
Also `b_0` contains five values and is therefore now a vector instead of a scalar. For each regression it contains the y-axis intercept.

If we now use these two matrices, the following happens:

In [None]:
np.dot(X, B) + b_0

An error message:
```shapes (4,3) and (5,3) not aligned: 3 (dim 1) != 5 (dim 0)```

In fact, we can conclude from the error message what the problem is. 
First, we are given the dimensions (number of rows and columns). 
`X` has `4` rows and `3` columns. `B` has `5` rows and `3` columns. 

Then follows: `3 (dim 1) != 5 (dim 0)`. The number of columns of the first matrix (`3 (dim 1)`) is not equal (`!=`) to the number of rows of the second matrix (`5 (dim 0)`).   

**The number of columns in the first matrix must be equal to the number of rows in the second matrix. The dimensions must match.**

$$ A_{m \times p} B_{p \times n} = C_{m \times n} $$
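
The shape rule can be checked quickly with a small sketch. The matrices `A_demo` and `B_demo` are hypothetical and filled with ones, since only their shapes matter here.

```python
import numpy as np

A_demo = np.ones((4, 3))   # m = 4 rows, p = 3 columns
B_demo = np.ones((3, 5))   # p = 3 rows, n = 5 columns

C_demo = np.dot(A_demo, B_demo)
print(C_demo.shape)        # (m, n)
```

The inner dimension $p$ disappears, and the result has as many rows as the first matrix and as many columns as the second.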

For example, if we flip the `B` matrix by mirroring it "across the diagonal" we get rows as columns and columns as rows. Then the number of columns of the first matrix and rows of the second matrix match.

Converting columns to rows and vice versa is called the *transpose* of a matrix.
`B.transpose()` performs this transformation. 

In [None]:
print(B, "\n")
print(B.transpose())

As you can see, the rows become columns. This also changes the dimensions of the matrix.

In [None]:
print(B.shape, "\n")
print(B.transpose().shape)

With the transposition of the matrix `B` the multiplication of the two matrices should work, because now the number of columns/rows is identical:

In [None]:
np.dot(X, B.transpose()) + b_0

It actually works. For example, look at the first column. These values are indeed the results of the first regression we computed: `np.dot(X, beta)+beta_0`.
In fact, each row contains the five regression results for one of the four houses.

But how can it be that the regression works even though we have flipped the matrix `B`?

This is because of how the matrix multiplication has been defined. The scalar product is not calculated between the corresponding rows. The scalar product is calculated between the rows of the first matrix and the columns of the second matrix (row times column, i.e. dimensions of rows and columns must be equal). 

<img src="https://www.mscroggs.co.uk/img/full/multiply_matrices.gif" style="display:block; margin:auto">
<h5 style="text-align:center">Source: Matthew Scroggs - 2020 | www.mscroggs.co.uk/blog/73 |</h5>
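
The "row times column" rule can be verified for a single entry. The following sketch uses two small hypothetical matrices `P` and `Q` and recomputes one entry of their product as a scalar product.

```python
import numpy as np

P = np.array([[1, 2],
              [3, 4]])
Q = np.array([[5, 6],
              [7, 8]])

product = np.dot(P, Q)

# entry in row 0, column 1: row 0 of P times column 1 of Q
entry_by_hand = np.dot(P[0, :], Q[:, 1])

print(product)
print(entry_by_hand, product[0, 1])
```

Every other entry of `product` is obtained in the same way from the matching row of `P` and column of `Q`.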


This is almost all that is needed for the so-called forward pass in a neural network.

---
The notation with $\beta$ comes from traditional statistics. In machine learning, the coefficients are denoted by $w$, which stands for "weights". In addition, $\beta_0$, the y-axis intercept, is denoted by $b$ (bias).
Thus, the regression equation is:

$$Xw+b$$

We will keep this notation for our Machine Learning tasks.

As you have already learned, the power of neural networks is that they perform more than one regression at a time.
That is, we have not just one set of regression coefficients, but several. How many?
That is up to you.

In the context of neural networks, the number of regressions performed corresponds to the number of nodes in the hidden layer of the neural network.

Until now we have always used `np.dot()` for matrix multiplication. But there is also a dedicated function, `np.matmul()`. For 2D arrays both give the same result, and the `numpy` documentation recommends `np.matmul` for matrix multiplication, so we will use this function from now on. 

In [None]:
np.matmul(X, B.transpose()) + b_0
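
Python's `@` operator performs the same matrix multiplication as `np.matmul` for 2D arrays. The following sketch compares the three spellings on a small example; `X_demo`, `W_demo` and `b_demo` reuse a few of the values from above and are named differently so that they do not overwrite `X`, `B` and `b_0`.

```python
import numpy as np

X_demo = np.array([[3, 4, 0.5],
                   [2, 1, 1.2]])     # two houses
W_demo = np.array([[0.3, 3, -0.2],
                   [6, 0, -2]])      # one row of coefficients per regression
b_demo = np.array([-5, 3])           # one intercept per regression

with_dot = np.dot(X_demo, W_demo.transpose()) + b_demo
with_matmul = np.matmul(X_demo, W_demo.transpose()) + b_demo
with_operator = X_demo @ W_demo.T + b_demo   # .T is shorthand for .transpose()

print(np.allclose(with_dot, with_matmul), np.allclose(with_matmul, with_operator))
```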

# Linear Regression

Let's circle back to the simple linear regression we had at the start. We need a way not only to create linear regressions but also to evaluate their predictive power. Therefore, we need some way to calculate the difference between the true value $ y $ and the predicted value $ \hat{y} $. Then we can compare models with each other.

This difference between the actual and the predicted value ($y - \hat{y}$) is also called the residual. The symbol for the residual is usually the small epsilon ($\epsilon$), which is used to measure the magnitude of the error (**E**rror) of the prediction. 

<img src='Img/intro_stats/reg_4.png' alt="Drawing" width="500" style="display:block; margin:auto">

For example, to estimate how good a model is overall, we could simply sum the residuals.

In [None]:
x = [182, 167, 198, 132, 178]
y_hat = [reg(height,0.3,21) for height in x ]

y = np.array([78.2, 68.3, 81.0, 64.3, 70.1])
y_hat = np.array(y_hat)
residual = y - ___  # What do we subtract from y?

sum(residual)

As you can see, the value is very close to zero, which looks like a very small error. The problem, however, is that the residuals can be both positive and negative. That is, when you add them together, they cancel each other out, so even a line with large individual errors can produce a sum close to zero. To avoid this, we do not sum the residuals, but we sum the squares of the residuals. $$\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$ 
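
The cancelling effect is easy to see with a few made-up numbers. The residuals in this sketch are hypothetical and were chosen so that they cancel exactly.

```python
import numpy as np

residual_demo = np.array([4.0, -4.0, 3.0, -3.0])   # hypothetical residuals

plain_sum = np.sum(residual_demo)          # positive and negative errors cancel
squared_sum = np.sum(residual_demo ** 2)   # squaring removes the sign

print(plain_sum, squared_sum)
```

Although every single prediction is off by 3 or 4, the plain sum is `0.0`, while the sum of squares is `50.0`.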

However, the sum alone would lead to models with more data points, i.e., with a larger $n$, automatically having larger error sums. Therefore, we take the mean of the squares instead of the sum: $\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$. This value, called the *Mean Squared Error* (MSE), is useful to assess the quality of the predictions. If a model has a small MSE, one can conclude that the residuals must be small, i.e., the differences between the predicted and true values are small. 

As with the variance and standard deviation, there is also the root mean squared error (RMSE). As you can guess, it is obtained by taking the square root of the MSE. Write a function that calculates the RMSE. You can use `numpy`, i.e. you do not need a `for-loop`.

In [None]:
def RMSE(y, y_hat):
   MSE = np.sum(__________________) / len(_____)  # calculate the MSE here
   return ___________  # convert the MSE to the RMSE


RMSE(y, y_hat)

<details>
<summary><b>Solution:</b></summary>
    
```python
def RMSE(y,y_hat):
   MSE = np.sum((y-y_hat)**2)/len(y)
   return np.sqrt(MSE) 
```
</details>    

In machine learning or in the field of optimization in general, functions like the RMSE are also called loss functions. They measure how well a model fits the data. The loss calculated by these functions must be minimized. 

## Example

Up to now you have always been given the parameters `m` and `t`. In reality, you have to calculate them yourself. In the following example we deal with the prediction of the boiling point. For this we use a data set from the American *National Institute of Standards and Technology*. In the data set, the boiling temperatures for 72 simple alcohols are recorded. In addition, the molecular weight and the number of carbons are given. 
The data set is located in the folder `../data/boilingpoints/`.


In [None]:
data = pd.read_csv("https://uni-muenster.sciebo.de/s/qGVs59xsnWKKuIf/download").values
print("Dimensions of the data: ", data.shape)
data[:10, :]

The data set consists of 72 rows and three columns. Each row represents an alcohol and the three columns contain information for one of the three descriptors. The first column contains the boiling points, the second the molecular weight and the third column the number of carbons. 

Our goal is to predict the boiling point based on the molecular weight.
First, we store the first column (boiling points) in the variable `y` and the second column in the variable `x`.

In [None]:
y = data[:, 0]  # y the variable we want to predict (boiling points)
x = data[:, 1:2]  # we could also use data[:, 1], but that returns a 1D array

In [None]:
print(data[:5, 1])
print(data[:5, 1:2])

You can see that both variants select the same values, but the first variant reduces the column to a 1-dimensional array of shape `(72,)`, i.e. a vector of length 72. Some of the functions necessary for linear regression expect our variable `x` to be a 2-dimensional array. Therefore we select the column with `data[:,1:2]`, which keeps the 2D structure of the `array`.
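
If you already have a 1-dimensional array, `reshape(-1, 1)` is another way to get the 2D column form. The following sketch uses a small hypothetical array `table` as a stand-in for `data`.

```python
import numpy as np

table = np.array([[10.0, 1.0],
                  [20.0, 2.0],
                  [30.0, 3.0]])     # hypothetical stand-in for `data`

col_1d = table[:, 1]                   # shape (3,)
col_2d = table[:, 1:2]                 # shape (3, 1)
col_reshaped = col_1d.reshape(-1, 1)   # turns the vector into one column

print(col_1d.shape, col_2d.shape, col_reshaped.shape)
```

The `-1` tells `numpy` to work out the number of rows itself, so the same line works for any column length.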

We can also plot the data using the library `matplotlib`. With the function `plt.plot()` you can quickly create simple plots. Here you just have to specify what values belong on the x-axis (first position in the function), then specify what belongs on the y-axis (second position). Finally, you can specify whether the individual values should be plotted as a point `"o"` or connected with a line `"-"`.

In [None]:
import matplotlib.pyplot as plt
plt.plot(x, y, "o")

It can be clearly seen that as the weight increases, the boiling point of the alcohols also increases. 

In the next cell, we calculate the linear regression parameters that fit the data. 
For this we need the Python library `sklearn`, which provides many functions for statistical analysis and machine learning.

Regardless of which `sklearn` model you want to use, the general structure remains the same. 
First, the type of the model must be defined.
Using `model = LinearRegression()` tells Python to create a linear regression model.

Next, the model must be *fitted* to the data `(x,y)`. This is done with the `model.fit(x,y)` statement. This step leads to the calculation of the regression parameters.

We get the estimated parameters via `model.coef_[0]` for the slope (`m` or `beta`) and `model.intercept_` for the y-axis intercept (`t` or `beta_0`).


In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x, y)  # calculates the linear regression
m = model.coef_[0]  # we can get m and t from the model
t = model.intercept_

print(m, t)

Calculate `y_hat` with the parameters and then the RMSE. Since we are now using `np.arrays`, no `for-loop` is required.

In [None]:
y_hat = reg(data[:, 1], ___, ____)
RMSE(y, ____)

<details>
<summary><b>Solution:</b></summary>
    
```python
y_hat = reg(data[:,1], m , t)
RMSE(y, y_hat) 
```
</details>    

Can you find other values for "m" and "t" that result in a lower RMSE?  

In [None]:
y_hat = reg(data[:, 1], ____,  _____)
RMSE(y, y_hat)

In fact, this does not work. When we speak of a linear regression, we usually mean an *ordinary least squares* (OLS) regression. As the name implies, this regression minimizes the sum of the squared residuals, i.e. the squared error of the regression line. That is, the regression line is the optimal line that can be found for that data set. In other words, an OLS regression line minimizes the (R)MSE.
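
This can be illustrated on a tiny example. The sketch below uses hypothetical data (not the boiling point data set) and `np.polyfit` with degree 1 as a stand-in for `LinearRegression`, since both compute a least-squares line. The helper `rmse_demo` is defined only for this illustration.

```python
import numpy as np

# hypothetical data, not the boiling point data set
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def rmse_demo(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# np.polyfit with degree 1 returns the least-squares slope and intercept
m_fit, t_fit = np.polyfit(x_demo, y_demo, 1)
rmse_fit = rmse_demo(y_demo, m_fit * x_demo + t_fit)

# nudging either parameter away from the fitted value increases the RMSE
rmse_other_slope = rmse_demo(y_demo, (m_fit + 0.1) * x_demo + t_fit)
rmse_other_intercept = rmse_demo(y_demo, m_fit * x_demo + (t_fit - 0.5))

print(rmse_fit, rmse_other_slope, rmse_other_intercept)
```

Whichever direction the slope or the intercept is nudged in, the RMSE of the fitted line stays the smallest of the three.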

## Multiple Regression

As we've already seen, linear regression can also be performed with more than one $x$ variable. The formula expands to:

$$\hat{y}= \beta_0 +\beta_1x_1 +\beta_2x_2$$

The interpretation of these coefficients does not change.

We can use both the number of carbons and the weight to predict the boiling points.

For this to work, you must first select not only the second but also the third column of `data` in `x`:

In [None]:
x = data[:, 1: ___ ]  # Which columns do we need for x?
x

<details>
<summary><b>Solution:</b></summary>
    
```python
x = data[:,1:3]
```
</details>    

You can now have the regression coefficients estimated again with `LinearRegression`.

In [None]:
model_2 = LinearRegression()
model_2.fit(x, y)  # calculates the linear regression
print(model_2.coef_, model_2.intercept_)

As you can see, you now get a total of 3 parameters. The regression coefficient for the molecular weight is `-4.65` and the one for the number of carbons is `83.18`. `sklearn` also has a function `predict()`. With it we can automatically make predictions with the previously estimated parameters. In the following example, we use this function to calculate `y_hat` for the `x` values. 

In [None]:
y_hat = model_2.predict(x)
RMSE(y, y_hat)

By using another variable in the regression, we were able to almost halve the loss (RMSE). This means that the model with two input variables leads to significantly better predictions than the first model with only one input variable.

## Logistic Regression

There are also problems where exact values are not to be predicted. For example, we want to decide whether a patient needs to be admitted to the intensive care unit or not. Here we only have to decide between `YES` and `NO`. Mathematically, however, we would speak of `1` or `0`. When a data point can belong to one of two groups, we speak of a **binary classification**. 

Here we have an example of a basketball player who throws at the hoop from different distances. 
If he scores, this throw is rated as a `1`. If he does not, the throw is rated with a `0`.

In [None]:
throws = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0])
distance = np.array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14.,
                     15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29.])

It is possible to calculate a simple regression line, but it does not fit the data very well because of the binary variable $y$. One solution is logistic regression. Here, a sigmoid function "after" linear regression is used to transform the predicted values. 

<table style="margin:auto"><tr>
<td> <img src='Img/intro_stats/log1.png' alt="Drawing" style="width: 250px;"> </td>
<td> <img src='Img/intro_stats/log2.png' alt="Drawing" style="width: 250px;"> </td>
<td> <img src='Img/intro_stats/log3.png' alt="Drawing" style="width: 250px;"> </td>
</tr></table>
<br>


---

<h2 style="text-align:center">Sigmoid Function</h2>


The sigmoid function is a non-linear function. Mathematically, the sigmoid function is written like this:
$$sigmoid(z)= \frac{1}{1+e^{-z}}$$

To understand what it does exactly, you can take a look at the example.

<img src='Img/intro_stats/sigmoid.png' alt="Drawing" style="width: 550px; display:block; margin:auto">
<h5 style="text-align:center">x-axis: before applying the sigmoid function<br>y-axis: after applying the sigmoid function</h5>

On the x-axis are values between -6 and 6, **before** the sigmoid function is applied to these values. On the y-axis are the same values, but this time after applying the sigmoid function. 
All values are now between 0 and 1. Values that were very far from 0 before are now very close to `0` or `1`.
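
Plugging a few values into the formula makes this concrete (results rounded to four decimal places):

$$sigmoid(0) = \frac{1}{1+e^{0}} = \frac{1}{2} = 0.5$$
$$sigmoid(6) = \frac{1}{1+e^{-6}} \approx 0.9975$$
$$sigmoid(-6) = \frac{1}{1+e^{6}} \approx 0.0025$$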
    
The shape of this function fits much better to a binary classification.

To perform a logistic regression, we can build on what we have already learned.
We have the same situation: we want to make a prediction for `y` based on our inputs `x`.     

To do this, we simply substitute the values from the linear regression into the sigmoid function.
$$ z = \beta x + \beta_0 $$
$$\hat{y} = sigmoid(z) = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-(\beta x + \beta_0)}} $$

Now calculate `z` by applying `reg` to the `distance` values. Since you can now use `numpy`, you no longer need a `for-loop`.
For the example with the basketball player, the following parameters are given:
- `m` = -0.8
- `t` = 7

In [None]:
z = reg(____)  # it is no longer convention to call our input variable x
z

<details>
<summary><b>Solution:</b></summary>

```python
z = reg(distance,-0.8,7) 
```
</details>    

Next you need the sigmoid function. For this write a function in Python with `numpy`. $e^x$ can be written as `np.exp(x)` with `numpy`.

In [None]:
def sigmoid(value):
    return 1 / (___________)  # write the denominator of the sigmoid function here

<details>
<summary><b>Solution:</b></summary>
    
```python
def sigmoid(value):
    return 1/(1+np.exp(-value))
```
</details>    

In the last step, calculate `y_hat` using `z` and the `sigmoid` function. 

In [None]:
y_hat = sigmoid(_____)  # Which input do you need for the sigmoid function?

<details>
<summary><b>Solution:</b></summary>
    
```python
y_hat = sigmoid(z)
```
</details>    

As you can see, all values are now between `0` and `1`. Actually, we wanted values that are exactly `0` or `1`, not values in between. But the values of `y_hat` can be understood as a kind of probability. A predicted value of `0.99908895` means that, according to the model, the basketball player will score a basket 99.9% of the time. Conversely, a value of `0.00135852` means that, according to the model, there is only a 0.14% chance of scoring a basket.

The following figure shows the predicted values together with the observed throws. 

<img src='Img/intro_stats/log4.png' alt="Drawing" width="500px" style="display:block; margin:auto">

Normally, the probabilities are interpreted in such a way that the model predicts a `1`, i.e. a hit, from a value `>= 0.5` and a `0` (miss) below.

Thus, we can judge the accuracy of the model by the percentage of correctly classified throws. 
First, we round `y_hat`. This gives us only `0` and `1` as predictions.

In [None]:
pred = np.round(y_hat)
pred
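
One subtlety: `np.round` rounds values that lie exactly on `0.5` to the nearest even number, so `0.5` becomes `0` and not `1`. If the `>= 0.5` rule should hold exactly, the threshold can be written out explicitly. The probabilities in this sketch are hypothetical.

```python
import numpy as np

probs = np.array([0.2, 0.5, 0.51, 0.9])   # hypothetical predicted probabilities

by_threshold = (probs >= 0.5).astype(int)   # the rule stated above
by_rounding = np.round(probs)               # exact halves go to the nearest even number

print(by_threshold, by_rounding)
```

The two variants differ only for a probability of exactly `0.5`, which rarely occurs in practice.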

You can now compare whether `pred` matches the original `y` variable `throws`. 

In [None]:
pred == throws

Write a function to calculate the accuracy (percentage of correctly classified throws). Remember that `booleans`, i.e., `True` and `False`, are treated as `1` and `0` in calculations in Python.

In [None]:
def accuracy(y_true, y_pred):
    return np.sum(y_true == ___) / len(____)

accuracy(throws, pred)

<details>
<summary><b>Solution:</b></summary>
    
```python
def accuracy(y_true, y_pred):
    return np.sum(y_true==y_pred)/len(y_true)
```
</details> 
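
Because `True` counts as `1` and `False` as `0`, the accuracy is simply the mean of the comparison. The following sketch shows this with hypothetical labels and predictions.

```python
import numpy as np

y_true_demo = np.array([1, 0, 1, 1])   # hypothetical labels
y_pred_demo = np.array([1, 0, 0, 1])   # hypothetical predictions

matches = y_true_demo == y_pred_demo   # array of True/False
acc_sum = np.sum(matches) / len(y_true_demo)
acc_mean = np.mean(matches)            # True counts as 1, False as 0

print(matches, acc_sum, acc_mean)
```

Three of the four predictions match, so both variants return `0.75`.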

## Binary Cross Entropy Loss

An accuracy of 0.73 means that the model predicts the correct result 73% of the time. Similar to the RMSE, this is a metric to estimate how good our model is.

Often, however, not just one metric is used. The advantage of accuracy is that it is very easy to interpret. But some mathematical properties of accuracy make it unsuitable for certain machine learning methods. Therefore, at least two different metrics are usually used. 

The additional metric used in classification is the **Cross Entropy** Loss. In the case of a binary classification problem, it is usually referred to as **Binary Cross Entropy** (BCE) Loss. 

$$Loss =-\frac{1}{n}\sum_{i=1}^n[y_i\cdot log(\hat{y}_i) + (1-y_i)\cdot log(1-\hat{y}_i)]$$

The formula looks very complicated at a first glance, but it is relatively easy to understand with the help of examples.
Let's assume we want to calculate the loss for only one data point, e.g. for a single shot of the basketball player. Then $n = 1$ and the above formula simplifies:


$$Loss =-[y_i\cdot log(\hat{y}_i) + (1-y_i)\cdot log(1-\hat{y}_i)]$$


##### Assuming that the basketball player did not hit the shot, then $y_i=0$.

<img src='Img/intro_stats/bce_1.gif' alt="Drawing" width= "500px" style="display:block; margin:auto">

Resulting in:

$$\begin{align}
Loss&=-[0\cdot log(\hat{y}_i) + (1-0)\cdot log(1-\hat{y}_i)]\\
&=-log(1-\hat{y}_i)
\end{align}
$$


That is, the loss for this shot is the negative $log$ of the difference between 1 and $\hat{y}$ (the predicted probability).

You can try out what happens to the loss for different probabilities. Remember that the true value is $y_i=0$. A good model would therefore predict a low probability, for which we expect a small loss.

In [None]:
# put different probabilities into the formula below and see what happens to the loss

np.log(1 - 0.___)

First of all, you will notice that the logarithm itself is always negative, which is why the actual formula above contains a minus sign: it makes the loss positive again. 

You can see that for particularly high probabilities, the loss moves away from zero. For particularly small probabilities, the loss approaches zero. This means that the more "wrong" our model is, the greater the loss, and that is exactly what we want.

##### Assuming that our basketball player has hit the shot, then $y_i=1$
<img src='Img/intro_stats/bce_2.gif' alt="Drawing" width= "500px" style="display:block; margin:auto">

$$\begin{align}Loss &=-[1\cdot log(\hat{y}_i) + (1-1)\cdot log(1-\hat{y}_i)]\\
Loss &=-log(\hat{y}_i)\end{align}$$

This time, a different but still simple part of the formula remains.
Try this term with different probabilities as well. 
This time a probability close to 1 would be correct, which should result in a small loss.

In [None]:
-np.log(0.___)  # try different probabilities

Again, the loss increases as the probability moves away from the true value. 

The formula therefore only looks complex because it has to cover both a true value of `1` and of `0`.  The $log$ is used so that predictions further away from the true value have a disproportionately large effect on the loss. The previously ignored part of the formula, $\frac{1}{n}\sum_{i=1}^n$, only calculates the average over all data points in the data set. 

In the following, the formula for the BCE is defined using `numpy`.

In [None]:
def BCE(y_true, y_hat):
    return -np.mean(y_true*np.log(y_hat) + (1 - y_true)*np.log(1 - y_hat))

BCE(throws, y_hat)
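
One practical caveat: if a prediction is exactly `0` or `1`, `np.log` is evaluated at `0` and the result is no longer a usable number. A common safeguard, which is an addition and not part of the definition above, is to clip the predictions slightly away from `0` and `1`. The function name `bce_clipped` and the value of `eps` are assumptions made for this sketch.

```python
import numpy as np

def bce_clipped(y_true, y_hat, eps=1e-12):
    # eps is an assumed small constant; it keeps y_hat away from exactly 0 and 1
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

y_true_demo = np.array([1, 0, 1])
y_hat_demo = np.array([0.9, 0.1, 1.0])   # hypothetical; the last prediction is exactly 1

loss_demo = bce_clipped(y_true_demo, y_hat_demo)
print(loss_demo)
```

For predictions strictly between `0` and `1`, such as the sigmoid outputs in this lesson, the clipped version returns practically the same value as `BCE`.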

# Calculus

## Derivatives

In order to understand how neural networks learn, one should know at least roughly what derivatives are and how to calculate them.

The derivative of a function describes the slope of the original function. 
Suppose there is a function $f(x)=x^2$. Then the corresponding derivative is $\frac{df}{dx}=2x$ (read: *derivative of f with respect to x*). 

In the picture $f(x)$ (*blue*) as well as the derivative $\frac{df}{dx}$ (*orange*) are drawn. <br>For example, for $x=-5$, $f(-5) = 25$. The slope at this point is: $\frac{df(-5)}{dx}=2\cdot (-5)= -10$. That is, the slope of the function $f(x)=x^2$ is $-10$ when $x=-5$.

<img src="Img/lin_alg/ableitung_1edit.png" width="1000" style="display:block; margin:auto">
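
The slope can also be approximated numerically, which is a handy way to check a derivative. The sketch below estimates the slope of $f(x)=x^2$ at $x=-5$ from two nearby points; the helper `numerical_slope` and the step size `h` are assumptions made for this illustration.

```python
def f(x):
    return x ** 2

def numerical_slope(func, x, h=1e-6):
    # central difference: slope of the line through two nearby points
    return (func(x + h) - func(x - h)) / (2 * h)

slope_at_minus_5 = numerical_slope(f, -5.0)
print(slope_at_minus_5)   # close to 2 * (-5) = -10
```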

There are some rules about derivation. First, a simple rule with an example: 
        $$f(x) = x^n \rightarrow \frac{df}{dx} = n \cdot x^{n-1}$$
        $$f(x) = x^2 \rightarrow \frac{df}{dx} = 2 \cdot x^{2-1}=2x^1= 2x $$
        

Additive constants are always dropped in derivatives.

That is:
The derivative of $f(x)=x^2 + 5$ is only $2x$, since constants only shift the function, but do not affect its slope. 

Coefficients are handled differently:

$$f(x) = ax^n \rightarrow \frac{df}{dx} = (n \cdot a)\cdot x^{n-1}$$

Example:

$$f(x) = 4x^3 \rightarrow \frac{df}{dx} = 12x^2$$

**Try to find the derivative of the following functions (probably easier on paper):**

$$g(x)= 7x^5 - 3$$

$$h(x)= 0.5x^2 + 3x +12$$

<details>
<summary><strong>Solution:</strong></summary>

$$\frac{dg}{dx} = 35x^4 $$
$$\frac{dh}{dx} = x +3$$

</details>
<br>

### Chain Rule 

The most important rule for neural networks is the chain rule, which is used to calculate the derivatives of chained functions, i.e. general functions of the type: $$f(x) = g(h(x))$$ The derivative of such a function is then: $$\frac{df}{dx} = \frac{dg}{dh}\cdot \frac{dh}{dx}$$ The formula alone might be difficult to understand, but with an example it should be relatively easy.

$$\begin{align}f(x)&= (3x + 1)^2 \\g(h)&=h^2; \space\space\space\space\space\space h(x) = 3x+1\end{align}$$

$$\begin{align}
\frac{df}{dx} &= \frac{d}{dh} (h^2)\cdot \frac{d}{dx}h\\
&= 2 h\cdot \frac{d}{dx}(3x+1)\\
&= 2 h \cdot 3 \\
&= 6 \cdot (3x+1)
\end{align}$$
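
The result can be checked numerically at a single point. The sketch below multiplies the outer and inner derivative at the hypothetical point `x0 = 2` and compares the product with a numerical estimate of the slope; the step size `step` is an assumption as well.

```python
def h(x):
    return 3 * x + 1

def g(h_value):
    return h_value ** 2

def f(x):
    return g(h(x))

x0 = 2.0                       # hypothetical evaluation point
dg_dh = 2 * h(x0)              # outer derivative, evaluated at h(x0)
dh_dx = 3                      # inner derivative
chain_rule = dg_dh * dh_dx     # equals 6 * (3 * x0 + 1)

step = 1e-6
numerical = (f(x0 + step) - f(x0 - step)) / (2 * step)

print(chain_rule, numerical)
```

Both values agree (up to rounding) with $6 \cdot (3 \cdot 2 + 1) = 42$.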




### Example:

$$e_1 = 2x+3$$
$$e_2 = 0.5e_1^3$$

Try calculating $\frac{de_2}{dx}$.

<details>
<summary><strong>Solution:</strong></summary>

$$\frac{de_2}{dx}= \frac{de_1}{dx}\frac{de_2}{de_1} $$
$$\frac{de_2}{dx}= 2(1.5e_1^2) $$
    
Because we know that $e_1 = 2x+3$, we can write:
$$\frac{de_2}{dx}= 2(1.5(2x+3)^2) $$
$$\frac{de_2}{dx}= 2(1.5(4x^2+12x+9)) $$
$$\frac{de_2}{dx}= 2(6x^2+18x+13.5) $$
$$\frac{de_2}{dx}= 12x^2+36x+27 $$
</details>
<br>

## Chain Rule

If we want to optimize the weights of a neural net, we also need to know how a change in the weights causes a change in the loss. 

Here again is a schematic of a neural network.


<img src="Img/lin_alg/ableitung_3edit.png" style="display:block; margin:auto">

For the following example we consider only the last part in more detail. The calculation of $\hat{y}$ is done in two steps. First $z_2$ is calculated, then a nonlinear function is applied to it, which gives us $\hat{y}$. 

<img src="Img/lin_alg/ableitung_4edit.png" style="display:block; margin:auto">

**For simplicity, we consider only single values in this example.**

So for now $a_1$ is not a vector but only a single value; the same is true for $w_2$ and $b_2$.

<img src="Img/lin_alg/ableitung_5.png" style="display:block; margin:auto">


The question is: what influence do $w_2$ and $b_2$ have on the loss $J$? In other words, how does the loss change when we change $w_2$ or $b_2$?

Mathematically, we can call this the derivative of $J$ with respect to $w_2$. 
We now use $\partial$ instead of $d$ since we are talking about functions with multiple parameters ($w_2$ and $b_2$) - this is a so-called partial derivative. In the example below, we hold $b_2$ fixed and only look at what effect a small change in $w_2$ has on the loss $J$.

$$\frac{\partial J}{\partial w_2}$$

However, there is no direct influence of $w_2$ on the loss. $w_2$ influences $z_2$ and $z_2$ has an effect on $\hat{y}$. And finally, $\hat{y}$ has an effect on the loss. So the functions to calculate $\hat{y}$ and $J$ respectively are *chained*.

The chain rule allows us to calculate $\frac{\partial J}{\partial w_2}$ in exactly this way.

First, we calculate the effect of $w_2$ on $z_2$:
$$\frac{\partial J}{\partial w_2} = \frac{\partial z_2}{\partial w_2}\cdots $$

Next, the effect of $z_2$ on $\hat{y}$:

$$\frac{\partial J}{\partial w_2} = \frac{\partial z_2}{\partial w_2}\frac{\partial \hat{y}}{\partial z_2}\cdots $$

Last, the effect of $\hat{y}$ on $J$:


$$\frac{\partial J}{\partial w_2} = \frac{\partial z_2}{\partial w_2}\frac{\partial \hat{y}}{\partial z_2}\frac{\partial J}{\partial \hat{y}} $$


The chain rule allows us to simply multiply these effects to get the desired derivative.
The chain can be extended to any length, which is why the same approach also works for networks with many layers. 
As you may recall, there are also $w_1$ and $b_1$, and their effect on $J$ can be calculated as well. For this the chain rule works the same way, the "chain" only becomes longer.
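
To make the three-factor chain concrete, here is a minimal sketch under stated assumptions. The text above does not specify the nonlinear function or the loss, so this example assumes a sigmoid for the nonlinearity and a squared error for $J$; the input values are made up as well. It multiplies the three local derivatives and compares the product with a numerical estimate of $\frac{\partial J}{\partial w_2}$.

```python
import math


def sigmoid(z):
    # Assumed nonlinearity (not specified in the lesson text)
    return 1.0 / (1.0 + math.exp(-z))


def forward(w_2, b_2, a_1, y):
    z_2 = a_1 * w_2 + b_2       # linear step
    y_hat = sigmoid(z_2)        # nonlinear step
    J = (y_hat - y) ** 2        # assumed loss: squared error
    return z_2, y_hat, J


def grad_w2(w_2, b_2, a_1, y):
    z_2, y_hat, _ = forward(w_2, b_2, a_1, y)
    dz2_dw2 = a_1                        # effect of w_2 on z_2
    dyhat_dz2 = y_hat * (1.0 - y_hat)    # effect of z_2 on y_hat (sigmoid derivative)
    dJ_dyhat = 2.0 * (y_hat - y)         # effect of y_hat on J
    return dz2_dw2 * dyhat_dz2 * dJ_dyhat


# Hypothetical single values, chosen only for illustration
a_1, w_2, b_2, y = 1.5, -0.4, 0.2, 1.0

# Numerical check: nudge w_2 slightly while b_2 stays fixed
eps = 1e-6
J_plus = forward(w_2 + eps, b_2, a_1, y)[2]
J_minus = forward(w_2 - eps, b_2, a_1, y)[2]
numeric = (J_plus - J_minus) / (2 * eps)

analytic = grad_w2(w_2, b_2, a_1, y)
print(analytic, numeric)
```

The two printed values should agree closely, which shows that multiplying the local effects gives the overall effect of $w_2$ on $J$.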

## Practice Exercise

In this exercise you will calculate the gradient for $w$, just as in a neural network. 
It is simplified, of course, and only for a single value of $w$. In this example we use a simple loss function and a simple polynomial in place of a real activation function. Neither would work in a real application, but a realistic nonlinear function would be beyond the scope of this exercise.


Please try to solve this exercise to the best of your ability. As we have said many times, what matters to us is not that you get the correct result, but that you have engaged with the subject. We are aware that some people find math easier than others. 



Back to our "faux" neural network.
Let's assume that the last layer of our network works as follows:

$$z_2 = a_1w_2+b_2$$
$$\hat{y} = z_2^3-3$$
$$J = \hat{y}^2- y^2$$


Calculate $\frac{\partial J}{\partial w_2}$, the "influence" $w_2$ has on $J$ (Loss).
For this we give the values:

$$ a_1 = 2 $$
$$ b_2=1.4 $$
$$ w_2 =0.6 $$
$$ y=1 $$



In [None]:
# Calculate first z_2, y_hat, and J. A simplified forward pass.
weight = 0.6

z_2 = ___*weight + ___

y_hat = (z_2**__) - ___

J = ____-____

You have performed the forward pass. Next comes the calculation of the gradient. To do this, we first calculate each of the individual derivatives separately.

$$\frac{\partial J}{\partial w_2} = \frac{\partial z_2}{\partial w_2}\frac{\partial \hat{y}}{\partial z_2}\frac{\partial J}{\partial \hat{y}} $$

First you calculate $\frac{\partial z_2}{\partial w_2}$ which we will call `dw_2`.

In [None]:
dw_2 = 

Next, you calculate $\frac{\partial \hat{y}}{\partial z_2}$, which we'll call `dz_2`.

In [None]:
dz_2 = 

Finally, you'll calculate $\frac{\partial J}{\partial \hat{y}}$, which we'll call `dy_hat`.

In [None]:
dy_hat = 

To calculate the gradient, you now only need to multiply these three together.

In [None]:
gradient = dw_2 * dz_2 * dy_hat
gradient

That's it! You have calculated the gradients.

**You don't have to submit the following task, but you can try your hand at it.**

If we put these derivatives in a `for` loop and adjust the weight slightly against the gradient in each iteration, we can see that the loss slowly gets smaller: we are training the "neural network".

In [None]:
weight = 0.6
for i in range(10):
    z_2 = ___*weight + ___
    y_hat = (z_2**__) - ___
    J = ____-____
    dw_2 = 
    dz_2 = 
    dy_hat = 
    gradient = dw_2 * dz_2 * dy_hat
    weight -=  0.0001 * gradient  # updating the weights
    print(J)
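
If you would like to see the update rule in action before filling in the blanks, here is a small self-contained sketch on a different toy problem. The loss $J(w) = (w-3)^2$, the starting weight and the learning rate are all made up for illustration and are not part of the exercise. Each step moves the weight a little against the gradient, so the loss shrinks.

```python
def loss(w):
    # Hypothetical toy loss with its minimum at w = 3
    return (w - 3.0) ** 2


def loss_gradient(w):
    # Chain rule: outer derivative 2 * (w - 3) times inner derivative 1
    return 2.0 * (w - 3.0)


weight = 0.0          # arbitrary starting value
learning_rate = 0.1   # arbitrary step size
losses = []

for i in range(10):
    losses.append(loss(weight))
    weight -= learning_rate * loss_gradient(weight)  # step against the gradient
    print(f"step {i}: weight = {weight:.4f}, loss before update = {losses[-1]:.4f}")
```

The structure is the same as in the loop above: forward pass, gradient, then a small update of the weight.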