# Introduction to Statistics

Today we are going to look at some basics of statistics.
Statistics can help us to describe and explain data in a simple way. 

---

### In this lesson you'll learn:
* how to calculate the mean, variance, and standard deviation in Python.
* the difference between a regression and a classification.
* how a linear regression functions and the meaning of its coefficients.
* about the *Mean Squared Error* and the loss function.
* what a logistic regression is and how it relates to linear regressions.
* what the Binary Cross Entropy Loss is.
* about different metrics such as accuracy and the ROC-AUC.
---

In [None]:
import numpy as np
import pandas as pd
from random import shuffle
%matplotlib inline
np.set_printoptions(suppress=True)

For example, we can look at student grades (German: Noten) from a high school class:

In [None]:
grades = [1.64, 2.35, 1.88, 2.48, 2.16, 3.92, 2.16, 2. , 1.76, 2.82, 1.81,
          2.59, 3.03, 1.7 , 2.87, 3.21, 2.65, 1.97, 1.2, 1.67, 1.77, 1.98,
          3.4 , 1.31, 1.72, 2.05, 1.12, 1.56, 2.01, 2.1]

However, it is very difficult to get an overview using only the data.
It is easier to plot the grades.


<img src='Img/intro_stats/noten_1.png'></img>

Although you now have a better overview, it can be difficult to compare two classes.

<img src='Img/intro_stats/noten2.1.png'></img>

We can use *density plots* to show the distribution easily. Here the y-axis is used to represent the density. That is, the higher the curve is at a point, the more data points are at that point.

<img src='Img/intro_stats/noten_3.1.png'></img>
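
A density curve can be thought of as a smoothed histogram whose total area is 1. The following is only a rough sketch of that idea: it uses a plain histogram instead of the smoothed curve in the figure, and the choice of 6 bins is an arbitrary assumption.

```python
import numpy as np

# Same grades as in the cell above, repeated so the block runs on its own.
grades = [1.64, 2.35, 1.88, 2.48, 2.16, 3.92, 2.16, 2.0, 1.76, 2.82, 1.81,
          2.59, 3.03, 1.7, 2.87, 3.21, 2.65, 1.97, 1.2, 1.67, 1.77, 1.98,
          3.4, 1.31, 1.72, 2.05, 1.12, 1.56, 2.01, 2.1]

# density=True rescales the bar heights so that the total area equals 1.
density, edges = np.histogram(grades, bins=6, density=True)
bin_widths = np.diff(edges)
area = np.sum(density * bin_widths)

print("bar heights:", density)
print("total area: ", area)
```

The taller a bar, the more grades fall into that interval, which is the same reading as for the density curve.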

Often, a purely visual view is not sufficient to make clear decisions.
Metrics that describe the distribution of the data points (in this example the grades) are needed for this.

The best known is probably the mean value, or more precisely the arithmetic mean. It describes the average of a distribution of data points. 
And to calculate the arithmetic mean, the sum of all values is divided by the number of values.


$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$$

The mean is often denoted by $\bar{x}$.
Calculate the arithmetic mean in Python for the high school class. *Without using NumPy*.

In [None]:
mean_grades = _____________# formula for the mean
mean_grades

<details>
<summary><b>Solution:</b></summary>
    
```python 
mean_grades = sum(grades)/len(grades)
```
</details>

However, the mean is not sufficient to adequately describe a distribution of values. The two [normal distributions](https://www.statista.com/statistics-glossary/definition/346/normal_distribution/) in the following example have the same mean and yet are not identically distributed. 
<img src='Img/intro_stats/noten_3.png'></img>

We can see that the red distribution is much narrower than the black one. That is, the values of the red group are closer to their mean than those of the black group.

The width of a distribution is measured by the variance. The variance measures the average squared distance of the values from their mean. 
The variance ($s^2$) is calculated as follows:

$$s^2 = \frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2$$

Note that it is not the difference $(x_i-\bar{x})$ that is summed, but the square $(x_i -\bar{x})^2$ of the difference. Thus, larger distances have a larger effect on the variance. 
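
To see why the squares are needed, here is a small sketch with a made-up list of four values (`values` is a hypothetical example, not the grades). The plain differences cancel each other out, while the squared differences do not.

```python
# Hypothetical example values, chosen so that the mean is exactly 3.0.
values = [1.0, 2.0, 3.0, 6.0]
mean_value = sum(values) / len(values)

deviations = [v - mean_value for v in values]

sum_of_deviations = sum(deviations)               # positive and negative parts cancel
sum_of_squares = sum(d ** 2 for d in deviations)  # squares are never negative

print(deviations)         # [-2.0, -1.0, 0.0, 3.0]
print(sum_of_deviations)  # 0.0
print(sum_of_squares)     # 14.0
```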

Calculate the variance of the `grades`:

In [None]:
squares = 0
for x in grades:
    squares = squares +((_____-______)**___)
variance_grades = squares/len(______) 
variance_grades

<details>
<summary><b>Solution:</b></summary>

    
```python
squares = 0
for x in grades:
    squares = squares + ((x - mean_grades)**2)
variance_grades = squares/len(grades)
```
</details>    


<details>
<summary><b>Solution using list comprehension:</b></summary>

```python
sum([(x - mean_grades)**2 for x in grades])/(len(grades))
```    
</details>   

Instead of the variance, the standard deviation is often used as a measure for the *width* of a distribution. The standard deviation is obtained by taking the square root of the variance. This brings the measure of variance to the scale of the original distribution.

In [None]:
std_grades = __________ # Calculate the standard deviation
std_grades 

<details>
<summary><b>Solution:</b></summary>
    
```python
std_grades= variance_grades**(0.5)
```
</details>    

Of course all functions already exist in `numpy`: `np.mean()`, `np.std()`, `np.var()`

In [None]:
import numpy as np
print("mean: ", np.mean(grades))
print("variance: ", np.var(grades))
print("standard deviation: ", np.std(grades))
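
One detail worth knowing: by default `np.var()` and `np.std()` divide by $n$, exactly like the formula above. Many statistics textbooks and other tools divide by $n-1$ instead (the *sample* variance). The sketch below, using a made-up array, shows how the `ddof` argument switches between the two.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 6.0])  # hypothetical example values

var_n = np.var(values)                  # divides by n (ddof=0 is the default)
var_n_minus_1 = np.var(values, ddof=1)  # divides by n - 1

print("divide by n:    ", var_n)
print("divide by n - 1:", var_n_minus_1)
```

For large data sets the difference is small, but for a handful of values it can be noticeable.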


With the measure of variance/standard deviation and the mean we can already describe some distributions. Of course not all, e.g. with multimodal distributions one would need even more information. 

<img src='Img/intro_stats/noten_4.png'></img>

## Inferential Statistics 

However, we do not always want to describe data, but also to obtain information from these data. For example, we can use correlation to describe the relationship between height (German: Größe) and weight (German: Gewicht). The taller a person is, the heavier they tend to be. This model is not perfect, of course; body weight does not depend only on height. There are tall lightweight people and short heavy ones. But there is a basic tendency.  

<table><tr>
<td> <img src='Img/intro_stats/reg_1.png' alt="Drawing" style="width: 250px;"/> </td>
<td> <img src='Img/intro_stats/reg_2.png' alt="Drawing" style="width: 250px;"/> </td>
</tr></table>

<br>
<br>

**We can describe the relationship with a linear regression.**
You may still know the linear equation $y = mx+t$ (or $y = ax+b$) from school. 

<br>



- $x$ is the input variable, in our case the body height
- $y$ is the variable to be predicted (body weight)
- $m$ describes the slope of the straight line
- $t$ denotes the y-axis intercept, the value of $y$ at $x=0$

<img src='Img/intro_stats/reg_3.png' alt="Drawing" width="500"/>

Assuming that the equation of the regression line is $y=0.3x+21$, the predicted weight of a person with a height of 180 cm would be 75 kg 
($0.3\cdot180+21$). The value for $m$ ($0.3$) indicates how much $y$ increases when $x$ increases by 1.
So, according to the model, a person's body weight increases by 0.3 kg when height increases by 1 cm. 
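
The interpretation of the slope can be checked with a few lines of arithmetic. This is just a sketch with the example values $m=0.3$ and $t=21$ from the text; the variable names are chosen for illustration.

```python
m = 0.3  # slope: kg per cm
t = 21   # intercept: predicted weight at a height of 0 cm

weight_180 = m * 180 + t
weight_181 = m * 181 + t

print(f"predicted weight at 180 cm: {weight_180:.1f} kg")
print(f"predicted weight at 181 cm: {weight_181:.1f} kg")
print(f"difference for +1 cm:       {weight_181 - weight_180:.1f} kg")
```

The difference between the two predictions is the slope itself, no matter which height you start from.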

The value for $t$ indicates how much a person weighs who is 0 cm tall ($x=0$). In the case of height, it makes little sense to interpret the value for $t$. However, suppose we estimate the value of a house based on the size of the terrace. The value for $t$ gives the value of a house when the size of the terrace is $0$. Thus, the value of a house without a terrace is $t$.

Back to the previous example: 

Of course, not every 180 cm tall person weighs 75 kg. This is only the predicted value 
of our regression equation. To make this clear, we write $\hat{y}$ instead of $y$.
This makes the straight line equation $\hat{y}=mx+t$.

---

Write a function that calculates the weight using the straight line equation described above.



In [None]:
def reg(x,m,t):
    _________# What is this function supposed to return?

<details>
<summary><b>Solution:</b></summary>
    
```python
def reg(x,m,t):
    return m*x+t
```
</details>    


The variable `x` contains the height in cm of 5 people. Calculate the weight for these five people using the function `reg`. 

In [None]:
x = [182,167,198,132,178]
y_hat = [reg(__,__,__) for ___ in _____ ]
y_hat

<details>
<summary><b>Solution:</b></summary>
    
```python
y_hat = [reg(height,0.3,21) for height in x ]
```
</details>    

As already mentioned, the values are only an estimate of the weight and differ from the actual weight of the person. To assess how well our model can determine the weight, we also need the actual measured weight of the persons. These are given in `y`. For example, we can calculate the difference between `y_hat` and `y`. But for this we first have to convert the lists into `numpy` arrays:

In [None]:
y = np.array([78.2,68.3, 81.0,64.3, 70.1 ])
y_hat = np.array(y_hat)
residual = y - ___ # What do we subtract from y?
residual

This difference between the actual and the predicted value ($y - \hat{y}$) is also called the residual. The symbol for the residual is usually the small epsilon ($\epsilon$), which is used to measure the magnitude of the error (**E**rror) of the prediction. 

<img src='Img/intro_stats/reg_4.png' alt="Drawing" width="500"/>

For example, to estimate how good a model is overall, we could simply sum the residuals.

In [None]:
sum(residual)

As you can see, the value is very close to zero, which looks like a very small error. The problem, however, is that the residuals can be both positive and negative. That is, when you add them together, they cancel each other out, so the sum can be close to zero even when the individual errors are large. To avoid this, we do not sum the residuals, but, as with the variance, we sum the squares of the residuals: $$\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$ 

However, the sum alone would lead to models with more data points, i.e., with a larger $n$, automatically having larger error sums. Therefore, we take the mean of the squares instead of the sum: $\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$. This value, called the *Mean Squared Error* (MSE), is useful to assess the quality of the predictions. If a model has a small MSE, one can conclude that the residuals must be small, i.e., the differences between the predicted and true values are small. 
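
The three quantities can be compared on a tiny made-up example (the arrays `y_demo` and `y_hat_demo` are hypothetical and not the height/weight data). The plain sum hides the errors, while the sum of squares and the MSE do not.

```python
import numpy as np

y_demo = np.array([3.0, 5.0, 7.0])      # hypothetical true values
y_hat_demo = np.array([4.0, 5.0, 6.0])  # hypothetical predictions

residual_demo = y_demo - y_hat_demo

plain_sum = np.sum(residual_demo)             # errors cancel out
sum_of_squares = np.sum(residual_demo ** 2)   # errors add up
mse = sum_of_squares / len(y_demo)            # average squared error

print(residual_demo)   # [-1.  0.  1.]
print(plain_sum)       # 0.0
print(sum_of_squares)  # 2.0
print(mse)
```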

As with the variance and standard deviation, there is also the root mean squared error (RMSE). As you can guess, this is simply obtained by taking the square root of the MSE. Write a function that can calculate the RMSE. You can use `numpy`, i.e. you do not need a `for-loop`.

In [None]:
def RMSE(y,y_hat):
   MSE = np.sum(__________________) /len(_____) # calculate the MSE here 
   return ___________ # convert the MSE to the RMSE
RMSE(y, y_hat)    

<details>
<summary><b>Solution:</b></summary>
    
```python
def RMSE(y,y_hat):
   MSE = np.sum((y-y_hat)**2)/len(y)
   return np.sqrt(MSE) 
```
</details>    

In machine learning or in the field of optimization in general, functions like the RMSE are also called loss functions. They measure how well a model fits the data. The loss calculated by these functions must be minimized. 

## Example

Up to now you have always been given the parameters `m` and `t`. In reality, you have to calculate them yourself. In the following example we deal with the prediction of the boiling point. For this we use a data set from the American *National Institute of Standards and Technology*. In the data set, the boiling temperatures for 72 simple alcohols are recorded. In addition, the molecular weight and the number of carbons are given. 
The data set is located in the folder `../data/boilingpoints/`.


In [None]:
data = pd.read_csv("https://uni-muenster.sciebo.de/s/qGVs59xsnWKKuIf/download").values
print("Dimensions of the data: ",data.shape)
data[:10,:] 

The data set consists of 72 rows and three columns. Each row represents an alcohol and the three columns contain information for one of the three descriptors. The first column contains the boiling points, the second the molecular weight and the third column the number of carbons. 

Our goal is to predict the boiling point based on the molecular weight.
First, we store the first column (boiling points) in the variable `y` and the second column in the variable `x`.

In [None]:
y = data[:,0] # y the variable we want to predict (boiling points)
x = data[:,1:2] # we could also use data[:,1], but that behaves differently (see the next cell)

In [None]:
print(data[:5,1])
print(data[:5,1:2])

You can see that we select the same values in the example, but in the first variant we reduce the column to a 1-dimensional array of shape `(72,)`, i.e. a vector of length 72. Some of the functions necessary for linear regression expect our variable `x` to be in the form of a 2-dimensional array. Therefore we select the column with `data[:,1:2]`. Thus we keep the 2D structure of the `array`, which then has the shape `(72, 1)`.
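
The difference between the two ways of selecting a column is easiest to see in the `shape` attribute. The sketch below uses a small made-up table (`table` is a hypothetical stand-in for `data`).

```python
import numpy as np

table = np.arange(12).reshape(4, 3)  # hypothetical table: 4 rows, 3 columns

column_1d = table[:, 1]    # integer index: the column axis is dropped
column_2d = table[:, 1:2]  # slice: the column axis is kept

print(column_1d.shape)  # (4,)
print(column_2d.shape)  # (4, 1)

# A 1-D column can be turned into a 2-D one afterwards with reshape.
print(column_1d.reshape(-1, 1).shape)  # (4, 1)
```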

We can also plot the data using the library `matplotlib`. With the function `plt.plot()` you can quickly create simple plots. Here you just have to specify what values belong on the x-axis (first position in the function), then specify what belongs on the y-axis (second position). Finally, you can specify whether the individual values should be plotted as a point `"o"` or connected with a line `"-"`.

In [None]:
from matplotlib import pyplot as plt
plt.plot(x, y, "o")

It can be clearly seen that as the molecular weight increases, the boiling point of the alcohols also increases. 

In the next cell, we calculate the linear regression parameters that fit the data. 
For this we need the Python library `sklearn`, which provides many functions for statistical analysis and machine learning.

Regardless of which `sklearn` model you want to use, the general structure remains the same. 
First, the type of the model must be defined.
Using `model = LinearRegression()` tells Python to create a linear regression model.

Next, the model must be *fitted* to the data `(x,y)`. This is done with the `model.fit(x,y)` statement. This step leads to the calculation of the regression parameters.

We get the estimated parameters via `model.coef_[0]` for the slope (`m`) and `model.intercept_` for the y-axis intercept (`t`).


In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x,y) # calculates the linear regression
m = model.coef_[0] # we can get m and t from the model
t = model.intercept_

print(m,t)

Calculate `y_hat` with the parameters and then the RMSE. Since we are now using `np.arrays`, no `for-loop` is required.

In [None]:
y_hat = reg(data[:,1], ___ , ____)
RMSE(y, ____) 

<details>
<summary><b>Solution:</b></summary>
    
```python
y_hat = reg(data[:,1], m , t)
RMSE(y, y_hat) 
```
</details>    

Can you find other values for "m" and "t" that result in a lower RMSE?  

In [None]:
y_hat = reg(data[:,1], ____  ,  _____  )
RMSE(y, y_hat) 

In fact, this does not work. When we speak of a linear regression, we usually mean an *ordinary least squares* (OLS) regression. As the name implies, this regression minimizes the sum of the squared residuals, i.e. the squared error of the regression line. That is, the regression line is the optimal line that can be found for that data set. In other words, an OLS regression line minimizes the (R)MSE.
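
This can be illustrated numerically. The following sketch uses a small made-up data set and `np.polyfit` (instead of `sklearn`) to obtain the least-squares line, then nudges the slope and the intercept a little in each direction. The names `rmse`, `x_demo` and `y_demo` are assumptions made for this example.

```python
import numpy as np

def rmse(y, y_hat):
    # root of the mean squared residual
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Hypothetical data, roughly following a straight line
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# A degree-1 polynomial fit is a least-squares line: returns slope, intercept
m_ols, t_ols = np.polyfit(x_demo, y_demo, deg=1)
best = rmse(y_demo, m_ols * x_demo + t_ols)

# Change slope or intercept slightly and compare the resulting RMSE
nudged = []
for dm, dt in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.5), (0.0, -0.5)]:
    nudged.append(rmse(y_demo, (m_ols + dm) * x_demo + (t_ols + dt)))

print("RMSE of the least-squares line:", best)
print("RMSE after nudging m or t:     ", nudged)
```

Every nudged line has a larger RMSE than the least-squares line.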

## Multiple Regression

Linear regression can also be performed with more than one $x$ variable. The formula expands to:

$$\hat{y}= \beta_0 +\beta_1x_1 +\beta_2x_2$$

In general, the notation with $\beta$ is more common. Here $\beta_0$ corresponds to the intercept $t$, and $\beta_1$ and $\beta_2$ are the regression coefficients belonging to the inputs $x_1$ and $x_2$.

However, the interpretation of these coefficients does not change.
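
As a sketch of what happens behind the scenes, the coefficients of a multiple regression can also be estimated with plain `numpy`. The data below are made up so that the true coefficients are known ($\beta_0=1$, $\beta_1=2$, $\beta_2=3$); `np.linalg.lstsq` is used here as a stand-in for the fitting step, and the names `X_demo`, `y_demo` and `design` are assumptions for this example.

```python
import numpy as np

# Hypothetical inputs: 5 observations, 2 input variables
X_demo = np.array([[1.0, 2.0],
                   [2.0, 1.0],
                   [3.0, 4.0],
                   [4.0, 3.0],
                   [5.0, 5.0]])

# Targets generated without noise from y = 1 + 2*x1 + 3*x2
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1]

# A column of ones lets the least-squares solver estimate the intercept beta_0
design = np.column_stack([np.ones(len(X_demo)), X_demo])
betas, *_ = np.linalg.lstsq(design, y_demo, rcond=None)

print(betas)  # approximately [1. 2. 3.]
```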

We can use both the number of carbons and the weight to predict the boiling points.

For this to work, you must first select not only the second but also the third column of `data` in `x`:

In [None]:
x = data[:,1: ___ ] # Which columns do we need for x?
x

<details>
<summary><b>Solution:</b></summary>
    
```python
x = data[:,1:3]
```
</details>    

You can now have the regression coefficients estimated again with `LinearRegression`.

In [None]:
model_2 = LinearRegression()
model_2.fit(x,y) # calculates the linear regression
print(model_2.coef_, model_2.intercept_ ) 

As you can see, you now get a total of 3 parameters. The regression coefficient for the molecular weight is `-4.65` and for the number of carbons `83.18`. `sklearn` models also have a method `predict()`. With it we can automatically make predictions with the previously estimated parameters. In the following example, we use this method to calculate `y_hat` for the `x` values. 

In [None]:
y_hat = model_2.predict(x)
RMSE(y, y_hat) 

By using another variable in the regression, we were able to almost halve the loss (RMSE). This means that the model with two input variables leads to significantly better predictions than the first model with only one input variable.

# Logistic Regression

There are also problems where exact values are not to be predicted. For example, we want to decide whether a patient needs to be admitted to the intensive care unit or not. Here we only have to decide between `YES` and `NO`. Mathematically, however, we would speak of `1` or `0`. When a data point can belong to one of two groups, we speak of a **binary classification**. 

Here we have an example of a basketball player who throws at the hoop from different distances. 
If he scores, this throw is rated as a `1`. If he does not, the throw is rated with a `0`.

In [None]:
throws = np.array([1,1,1,1,1,1,0,1,0,1,1,0,0,1,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0])    
distance = np.array([0.,1.,2.,3.,4.,5.,6.,7.,8.,9.,10.,11.,12.,13.,14.,
                    15.,16.,17.,18.,19.,20.,21.,22.,23.,24.,25.,26.,27.,28.,29.])

It is possible to calculate a simple regression line, but it does not fit the data very well because of the binary variable $y$. One solution is logistic regression. Here, a sigmoid function "after" linear regression is used to transform the predicted values. 

<table><tr>
<td> <img src='Img/intro_stats/log1.png' alt="Drawing" style="width: 250px;"/> </td>
<td> <img src='Img/intro_stats/log2.png' alt="Drawing" style="width: 250px;"/> </td>
<td> <img src='Img/intro_stats/log3.png' alt="Drawing" style="width: 250px;"/> </td>
</tr></table>
<br>


---

<center>
<h2>Sigmoid Function</h2>
</center>

The sigmoid function is a non-linear function. Mathematically, the sigmoid function is written like this:
$$sigmoid(z)= \frac{1}{1+e^{-z}}$$

To understand what it does exactly, you can take a look at the example.

<center><img src='Img/intro_stats/sigmoid.png' alt="Drawing" style="width: 250px;"/> </center>
<h8><center>x-axis: before applying the sigmoid function<br>y-axis: after applying the sigmoid function</center></h8>

On the x-axis are values between -6 and 6, **before** the sigmoid function is applied to these values. On the y-axis are the same values, but this time after applying the sigmoid function. 
All values are now between 0 and 1. Values that were very far from 0 before are now very close to `0` or `1`.
    
The shape of this function fits much better to a binary classification.

To perform a logistic regression, we can build on what we have already learned.
We have the same situation: we want to make a prediction for `y` based on our inputs `x`.     

To do this, we simply substitute the values from the linear regression into the sigmoid function.
$$ z = mx+t $$
$$\hat{y} = sigmoid(z) = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-(mx+t)}} $$    

Now calculate `z` by applying `reg` to the `distance` values. Since you can now use `numpy`, you no longer need a `for-loop`.
For the example with the basketball player, the following parameters are given:
- `m` = -0.8
- `t` = 7

In [None]:
z = reg(____)  # it is no longer convention to call our input variable x
z

<details>
<summary><b>Solution:</b></summary>

```python
z = reg(distance,-0.8,7) 
```
</details>    

Next you need the sigmoid function. For this write a function in Python with `numpy`. $e^x$ can be written as `np.exp(x)` with `numpy`.

In [None]:
def sigmoid(value):
    return 1/(___________) # write the denominator of the sigmoid function here

<details>
<summary><b>Solution:</b></summary>
    
```python
def sigmoid(value):
    return 1/(1+np.exp(-value))
```
</details>    

In the last step, calculate `y_hat` using `z` and the `sigmoid` function. 

In [None]:
y_hat = sigmoid(_____)# Which input do you need for the sigmoid function?

<details>
<summary><b>Solution:</b></summary>
    
```python
y_hat = sigmoid(z)
```
</details>    

As you can see, all values are now between `0` and `1`. Actually, we wanted values that are exactly `0` or `1`, not values in between. But the values of `y_hat` can be understood as a kind of probability. A predicted value of `0.99908895` means that, according to the model, the basketball player will score a basket 99.9% of the time. Conversely, a value of `0.00135852` means that, according to the model, there is only a 0.14% chance of scoring a basket.
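
To make the reading as probabilities concrete, the sketch below recomputes the prediction for a few selected distances with the given parameters `m = -0.8` and `t = 7` and prints it as a percentage. The chosen distances are just examples.

```python
import numpy as np

def sigmoid(value):
    return 1 / (1 + np.exp(-value))

m, t = -0.8, 7  # parameters given for the basketball example

for d in [0.0, 8.0, 9.0, 17.0]:
    p = sigmoid(m * d + t)
    print(f"distance {d:4.1f}: predicted probability {p:.4f} = {100 * p:.2f} %")
```

Between a distance of 8 and 9 the predicted probability drops below 50%, which is where the model switches from predicting a hit to predicting a miss.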

The following figure shows the predicted probabilities together with the observed throws. 

<img src='Img/intro_stats/log4.png' alt="Zeichnung" width="500px"/> 

Normally, the probabilities are interpreted in such a way that the model predicts a `1`, i.e. a hit, from a value `>=0.5` and a `0` (miss) below.

Thus, we can judge the accuracy of the model by the percentage of correctly classified throws. 
First, we round `y_hat`. This gives us only `0` and `1` as predictions.

In [None]:
pred = np.round(y_hat)
pred
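
A small caveat about rounding: `np.round` rounds values that lie exactly halfway to the nearest even number, so a probability of exactly `0.5` becomes `0`, not `1`. In the basketball example no prediction is exactly `0.5`, so it makes no difference there. If you want the rule `>=0.5` to hold literally, an explicit comparison is one option, as in this sketch with made-up probabilities.

```python
import numpy as np

probs = np.array([0.2, 0.5, 0.7])  # hypothetical predicted probabilities

rounded = np.round(probs)                 # 0.5 is rounded to 0
thresholded = (probs >= 0.5).astype(int)  # 0.5 counts as a 1

print(rounded)      # [0. 0. 1.]
print(thresholded)  # [0 1 1]
```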

You can now compare whether `pred` matches the original `y` variable `throws`. 

In [None]:
pred==throws

Write a function to calculate the accuracy (percentage of correctly classified throws). Remember that `booleans`, i.e., `True` and `False`, are treated as `1` and `0` in calculations in Python.

In [None]:
def accuracy(y_true, y_pred):
    return np.sum(y_true==___) / len(____) 

<details>
<summary><b>Solution:</b></summary>
    
```python
def accuracy(y_true, y_pred):
    return np.sum(y_true==y_pred)/len(y_true)
```
</details> 

In [None]:
accuracy(throws, pred)

## Binary Cross Entropy Loss

An accuracy of 0.73 means that the model predicts the correct result 73% of the time. Similar to the RMSE, this is a metric to estimate how good our model is.

Often, however, not just one metric is used. The advantage of accuracy is that it is very easy to interpret. But some mathematical properties of accuracy make it unsuitable for certain machine learning methods. Therefore, at least two different metrics are usually used. 

The additional metric used in classification is the **Cross Entropy** Loss. In the case of a binary classification problem, it is usually referred to as **Binary Cross Entropy** (BCE) Loss. 

$$Loss =-\frac{1}{n}\sum_{i=1}^n[y_i\cdot log(\hat{y}_i) + (1-y_i)\cdot log(1-\hat{y}_i)]$$

The formula looks very complicated at first glance, but it is relatively easy to understand with the help of examples.
Let's assume we want to calculate the loss for only one data point, e.g. for a single shot of the basketball player. Then $n = 1$ and the above formula simplifies:


$$Loss =-[y_i\cdot log(\hat{y}_i) + (1-y_i)\cdot log(1-\hat{y}_i)]$$


##### Assuming that the basketball player did not hit the shot, then $y_i=0$.

<img src='Img/intro_stats/bce_1.gif' alt="Drawing" width= "500px" style="display:block; margin:auto"/> 

Resulting in:

$$\begin{align}
Loss&=-(0\cdot log(\hat{y}_i) + (1-0)\cdot log(1-\hat{y}_i))\\
&=-log(1-\hat{y}_i)
\end{align}
$$


That is, the loss for this shot is the negative $log$ of the difference between 1 and $\hat{y}_i$ (the predicted probability).

You can try out what happens to the loss for different probabilities. Remember that the true value is $y_i=0$. So a good model would predict a low probability, so a small loss is expected.

In [None]:
# put different probabilities into the formula below and see what happens to the loss

- np.log(1 - 0.___ ) 


First of all, you will notice that the loss is always positive. The logarithm of a value between 0 and 1 is always negative, which is why there is a minus sign in the formula above: it makes the loss positive again. 

You can see that for particularly high probabilities, the loss moves away from zero. For particularly small probabilities, the loss approaches zero. This means that the more "wrong" our model is, the greater the loss, and that is exactly what we want.

##### Assuming that our basketball player has hit the shot, then $y_i=1$
<img src='Img/intro_stats/bce_2.gif' alt="Drawing" width= "500px" style="display:block; margin:auto"/>

$$
\begin{align}
Loss &=-(1\cdot log(\hat{y}_i) + (1-1)\cdot log(1-\hat{y}_i))\\
Loss &=-log(\hat{y}_i)
\end{align}
$$

This time, a different but still simple part of the formula remains.
Try this term with different probabilities as well. 
This time a probability close to 1 would be correct, which should result in a small loss.

In [None]:
- np.log(0.___) # try different probabilities

Again, the loss increases as the probability moves away from the true value. 

The formula is therefore only as complex as it needs to be to cover both a true value of `1` and `0`.  The logarithm is used so that predictions further away from the true value have a disproportionately large effect on the loss. The previously ignored part of the formula, $\frac{1}{n}\sum_{i=1}^n$, only calculates the average over all data points in the data set. 

In the following, the formula for the BCE is defined using `numpy`.

In [None]:
def BCE(y_true, y_hat):
    return -np.mean(y_true*np.log(y_hat) +(1-y_true)* np.log(1-y_hat))

In [None]:
BCE(throws, y_hat)
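
One practical caveat: if a predicted probability is exactly `0` or `1`, the logarithm in the formula is evaluated at zero and the result is `inf` or `nan`. A common workaround is to clip the predictions slightly away from 0 and 1 first. The function `bce_clipped` and the value of `eps` below are assumptions made for this sketch, not part of the lesson's code.

```python
import numpy as np

def bce_clipped(y_true, y_hat, eps=1e-12):
    # keep the predictions inside [eps, 1 - eps] so that log() stays finite
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

y_true_demo = np.array([1, 0, 1])        # hypothetical labels
y_hat_demo = np.array([1.0, 0.0, 0.5])   # two "perfectly certain" predictions

loss = bce_clipped(y_true_demo, y_hat_demo)
print(loss)
```

The two certain and correct predictions contribute practically nothing, so the result is essentially the loss of the third data point divided by 3.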

## ROC-AUC 

Last, we present the ROC-AUC as an alternative to the accuracy. You may know the AUC from an HPLC or NMR. It denotes the *Area Under the Curve*. In this case, we are talking about the area under the ROC curve. 

Before we get into more detail about this ROC curve, let's clarify why we are using an alternative to accuracy in the first place. 

Let's say you are writing a program to distinguish between dogs and cats.
You have nine images of dogs and only one of a cat. 

<img src='Img/intro_stats/catvdogs.png' alt="Zeichnung" width="500px"/> 

In [None]:
y = np.array(["DOG", "CAT", "DOG", "DOG", "DOG", "DOG", "DOG", "DOG", "DOG", "DOG"])

There is a big difference between the number of cats and dogs in the dataset. 
Can you find a way to always achieve 90% prediction accuracy without ever seeing the images, even if they are randomly ordered?
The `shuffle` function arranges the elements in a random order each time it is called.

In [None]:
shuffle(y) # shuffle the elements of the array
y_pred = np.array([___,____,____,____,_____,_____,____,____,____,_____]) # write your solution here
accuracy(y, y_pred)

<details>
<summary><b>Solution:</b></summary>
    
```python
y_pred = np.array(["DOG", "DOG", "DOG", "DOG", "DOG", "DOG", "DOG", "DOG", "DOG", "DOG"]) 
```
</details> 

If you simply classify each image as a dog, you will always get an accuracy of 0.9. 
This means that a model that recognizes nothing in the image can achieve an accuracy of 0.9. 
So we can't really tell from the accuracy whether our model has learned something or just always recognizes `DOG`. 
The greater the sample size difference between the different classes (*class imbalance*), e.g. `dog` versus `cat`, the less valuable accuracy is as a metric. 

There are alternative metrics that are more suitable for classifications with *class imbalance*. One of them is the ROC-AUC.

ROC stands for Receiver Operating Characteristic, a curve that describes the relationship between the *true positive rate* and the *false positive rate*. The AUC is the area under the ROC curve.

<img src='Img/intro_stats/roc_auc.png' alt="Drawing" width= "300px"/> 

*What do true and false positive rate mean?*<br><br>
Suppose we coded dogs as `1` and cats as `0`. The True Positive Rate (TPR) would then reflect the percentage of correctly identified dog images.<br><br>
$$TPR = \frac{\textrm{Number of correctly classified dog images}}{\textrm{Number of all dog images}}$$

Assuming the model recognizes each image as a dog, what is the True Positive Rate?

In [None]:
TPR = ___/___ 
TPR

<details>
<summary><b>Solution:</b></summary>
    
```python
TPR = 9/9
```
</details> 

As you can imagine, the false positive rate (FPR) is very similar. This time we take the cats.

$$FPR = \frac{\textrm{Number of cats classified as dogs}}{\textrm{Number of all cat images}}$$

Assuming the model recognizes each image as a dog, what is the False Positive Rate?

In [None]:
FPR = ___/___
FPR

<details>
<summary><b>Solution:</b></summary>
    
```python
FPR = 1/1
```
</details> 
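
Both rates can also be computed directly from the labels instead of counting by hand. The sketch below assumes the coding dog = `1` and cat = `0` and a model that always predicts dog; the variable names are chosen for illustration.

```python
import numpy as np

y_true_demo = np.array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1])  # nine dogs, one cat
y_pred_demo = np.ones_like(y_true_demo)                 # model always says "dog"

true_positives = np.sum((y_pred_demo == 1) & (y_true_demo == 1))   # dogs called dog
false_positives = np.sum((y_pred_demo == 1) & (y_true_demo == 0))  # cats called dog

tpr = true_positives / np.sum(y_true_demo == 1)
fpr = false_positives / np.sum(y_true_demo == 0)

print("TPR:", tpr)  # 1.0
print("FPR:", fpr)  # 1.0
```

A TPR of 1 looks perfect on its own, but the FPR of 1 shows that the model gets every cat wrong.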

Now more formally, the ROC AUC provides information about the relationship between a model's performance on dogs and its performance on cats. The calculation of the ROC AUC is a bit more complicated than the calculation of FPR and TPR. 
But it is important to know these dependencies. 
A ROC AUC value is always between 0 and 1. A value of 1 means perfect classification, and a value of 0.5 means random classification. 
To calculate the ROC-AUC, we can use the `roc_auc_score` function from `sklearn`. 

In [None]:
from sklearn.metrics import roc_auc_score
y_true = np.array([1,0,1,1,1,1,1,1,1,1]) # we have recoded dogs and cats into 1 and 0 this time
y_pred = np.array([1,1,1,1,1,1,1,1,1,1])
roc_auc_score(y_true ,y_pred )

You can see that the ROC AUC value is only 0.5. The model is no better than a random decision.
However, in practice we work with predicted probabilities, i.e. values between 0 and 1, instead of just `0` and `1`. This can also be used to calculate the ROC-AUC score.

Try changing the probabilities for the cat (second position).
Remember that we classify an image as a dog above values of 0.5. 

In [None]:
y_true = np.array([1,0,1,1,1,1,1,1,1,1]) # we have recoded dogs and cats into 1 and 0 this time
y_hat = np.array([0.91,____,0.99,0.99,0.99,0.98,0.8,0.7,0.8,0.97])
roc_auc_score(y_true ,y_hat )
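
One way to build an intuition for the number: the ROC-AUC equals the fraction of (dog, cat) pairs in which the dog image receives the higher score, with ties counting as half. The function `pairwise_auc` below is a hypothetical helper written for this sketch with plain `numpy`; it is not how `sklearn` computes the score internally, and the cat probabilities used are made-up examples.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = scores[y_true == 1]  # scores of the dogs
    neg = scores[y_true == 0]  # scores of the cats
    # compare every dog score with every cat score
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

labels = np.array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1])

always_dog = np.ones(10)
cat_low = np.array([0.91, 0.10, 0.99, 0.99, 0.99, 0.98, 0.8, 0.7, 0.8, 0.97])
cat_high = np.array([0.91, 0.85, 0.99, 0.99, 0.99, 0.98, 0.8, 0.7, 0.8, 0.97])

print(pairwise_auc(labels, always_dog))  # 0.5, every pair is a tie
print(pairwise_auc(labels, cat_low))     # 1.0, the cat scores lowest
print(pairwise_auc(labels, cat_high))    # 6 of 9 dogs score above the cat
```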

# Practice Exercise

There are also logistic regressions with more than one `x` variable. 

The data used is the *Iris-dataset*. 
[Here](https://en.wikipedia.org/wiki/Iris_flower_data_set) you can find more information.
The aim is to distinguish between two types of iris flowers. *Iris setosa* (`0`) vs. *Iris versicolor* (`1`).

The model parameters have already been estimated. Three regression coefficients are given in the following cells.

Your task is to use these coefficients to see how well the model works. You determine the membership of five flowers (`x`). You can compare the estimates of the model with the true values in `y`. 

`beta_1` belongs to the variable in the first column of `x` and so on.

In [None]:
beta_1 = 3.0786959
beta_2 = -3.0220097
beta_0 = -7.306345489594484


x =  np.array([[5.1, 3.5],
               [5. , 3.6],
               [5.4, 3.4],
               [6.7, 3.1],
               [5.1, 2.5]])

y = np.array([0,0,0,1,1])

First calculate `z`:

In [None]:
z = beta_0 + ___*____ +_____*______

Convert `z` to probabilities with the `sigmoid` function:

In [None]:
y_hat = _____
y_hat

Calculate the accuracy:

In [None]:
y_pred = _____(y_hat)
accuracy(______,____)

The last thing you do is calculate the ROC-AUC:

In [None]:
# write your code to calculate the ROC-AUC