<a target="_blank" href="https://colab.research.google.com/github/JLDC/Data-Science-Fundamentals/blob/master/notebooks/101_iris-perceptron.ipynb">
    <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Open this notebook in Google Colab
</a>

___

# Perceptron Learning for Iris Data
___

In this notebook, we will implement the same learning algorithm for the **iris** dataset that you have implemented in Excel during the first session. It is important to note that the way we implement the algorithm in this notebook is neither efficient, nor is it the way you would truly implement a neural network in practice (indeed, a perceptron model is a special case of a neural network). This example is set up to be pedagogical, such that you better understand the mechanism underlying the perceptron.

### 🧑‍💻 <font color=green>**Your Task**</font>

Please go through the code below and solve the <font color=green>**Questions**</font> contained in the notebook (indicated with a green heading). You will also need the Excel implementation of the iris learning engine from our first class. You can find a clean Excel solution on Canvas under Files > Data. It may also help to look again at the instructions for the Excel task in the slides for the first class.

**It's probably most productive if you work in small groups.** If you get stuck, please let us know and we will drop by. We are there to support your learning experience. Sitting at your desk while stuck for a long time is not very productive.

Please note your answers to the questions on a (digital) piece of paper, or code/write them directly into the code/markdown cells below (depending on whether it is a thinking or a coding question). We will discuss the solutions in class. To start the discussion, we may randomly call on some of you to share your thoughts and solution ideas :-)


___
## Data pre-processing

In [None]:
# Import necessary packages
import numpy as np # Numerical computation package
import pandas as pd # Dataframe package
import matplotlib.pyplot as plt # Plotting package
np.random.seed(1) # Set the random seed for reproducibility

# Define the path where the data is stored
DATA_PATH = "https://raw.githubusercontent.com/JLDC/Data-Science-Fundamentals/master/data"

To simplify the task, we are going to use only two features, the **sepal width** and the **petal width**. Furthermore, we will reduce our dataset to only contain **setosa** and **versicolor** species, i.e., we will drop the **virginica** species.

In [None]:
# Read in the iris dataset
iris = pd.read_csv(f"{DATA_PATH}/iris.csv")
# Keep only sepal width, petal width, and species
iris.drop(columns=["sepal length (cm)", "petal length (cm)"], inplace=True)
# Drop all observations of the species virginica
iris = iris.loc[iris["species"] != "virginica"]
# Shuffle the dataset
iris = iris.sample(frac=1)
iris # Display the dataset

___
#### 🤔 <font color=green>**Question 1**</font>
Notice how, in the above code, we use `iris.sample(frac=1)` to shuffle our data. Why would we prefer the data to be randomly ordered? How would things turn out if we left out the reshuffling?
___

### From `pandas` to `numpy`

While `pandas` is very intuitive when it comes to handling tabular data, e.g., for data pre-processing and visualization, `numpy` really shines when it comes to numerical computing, and it is somewhat closer to the mathematical formulation. Because of this, we will transform our data to `numpy` arrays, which are useful to represent vectors and matrices.

We have a dataset of size $N=100$ with two features $\mathbf{x}_1$ and $\mathbf{x}_2$ (in our case sepal width and petal width), and a target $\mathbf{y}$ (in our case species) which we are trying to predict based on the information from the features. Both features $\mathbf{x}_1$ and $\mathbf{x}_2$ are numerical variables, so they are ready for mathematical and statistical calculations. However, we cannot say the same about our target variable species, which contains string values. Machine learning happens entirely in the language of numbers and mathematics, so we need to translate the information in species into numbers. One possible way to do so is to assign the value $-1$ to every **setosa** observation and the value $+1$ to every **versicolor** observation. This is somewhat arbitrary, as there are many other ways to encode this variable in numbers. They would all result in a learning process that is equally good. E.g., an alternative would be using target labels $0$ and $+1$. Both types of "translations" ($-1$ and $+1$, and $0$ and $+1$) are very common. For now, we use the first version since it allows us to talk naturally about "negatives" (setosa) and "positives" (versicolor).

In [None]:
# Create the matrix of features
X = np.array(iris[["sepal width (cm)", "petal width (cm)"]])
X.shape

We mentioned above how $\mathbf{x}_1$ and $\mathbf{x}_2$ are our two features. We can also use a matrix $\mathbf{X} = [\mathbf{x}_1 \, \mathbf{x}_2]$ to represent our features. This matrix is a simple $100 \times 2$ matrix, i.e., the first column is the sepal width and the second one is the petal width. Every row represents a different observation. The matrix with the feature values is often called a **design matrix**.

In [None]:
# Create the vector of labels / targets
y = np.where(iris["species"] == "versicolor", 1, -1)

# If you forgot what np.where() is good for, go back to notebook 01d_numpy and study the section on filtering

y.shape

For the labels, we don't need a matrix; a vector of length $N=100$ will do just fine.
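
As a small illustration of the encoding step, here is a sketch on a made-up array of species names (the array `species_toy` and the names `y_pm` and `y_01` are hypothetical and not part of the iris data). It shows `np.where` producing $-1$/$+1$ labels, and one possible way of converting them to the alternative $0$/$1$ encoding.

```python
import numpy as np

# Hypothetical toy labels, for illustration only
species_toy = np.array(["setosa", "versicolor", "versicolor", "setosa"])

# Encode versicolor as +1 and everything else as -1
y_pm = np.where(species_toy == "versicolor", 1, -1)

# Convert the -1/+1 encoding to the alternative 0/1 encoding
y_01 = (y_pm + 1) // 2

print(y_pm)  # [-1  1  1 -1]
print(y_01)  # [0 1 1 0]
```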

### The learning engine

Now we implement the learning engine (and *you* will find out which lines of code below are the core part of that engine!). We could first discuss the theory of how and why this learning engine works. But we intentionally choose an "experience-first" approach and discuss the theory later. So all the guidance you have for making sense of the below code is our Excel implementation of the learning machine from the first lecture. (You can find a clean Excel solution on Canvas under Files > Data. It may also help to look again at the instructions for the Excel task in the slides for the first class.)

Still, let's introduce a small piece of theory. In the first class with Excel, we called the items that are learned "weights". We now refine that language a bit, so it is in line with neural network terminology. If you go back to our Excel solution, you see that an important element in the construction of a learning machine was a score. We calculated this as
$$
score = w_0 + w_1 \; petal\_width + w_2 \; sepal\_width
$$

In line with neural-network terminology, we now denote $w_0$ as $b$ and call it a "bias". Some people also refer to it as a constant. In any case, it is still a parameter that is going to be learned and that plays a special role. More on this below. The change in naming also has the advantage that the code becomes a bit easier to understand. The formula for the score then becomes

$$
score = b + w_1 \; petal\_width + w_2 \; sepal\_width
$$


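As for the two weights $w_1$ and $w_2$, we collect them in a single vector $w$, as you will see in the code below. Note that the code follows the column order of `X`, so `w[0]` multiplies the sepal width and `w[1]` multiplies the petal width.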

In [None]:
# Initialize parameters for the learning process

eta = 0.01 # The learning rate, this is an example of what is called a HYPERPARAMETER
b = 0 # The bias
w = np.zeros(X.shape[1]) # The weights (one for each feature)

# Initialize lists for bookkeeping
bias_list = []
weights_list = []


# Iterate over each iris case

for i in range(X.shape[0]):

    # Extract the ith row of the features matrix
    x_i = X[i, :]
    # Extract the ith row of the label vector 
    # (because it is a vector and not a matrix, there is no second dimension!)
    y_i = y[i]
    
    # Compute the score
    score_i = b + w[0] * x_i[0] + w[1] * x_i[1]
    # We could have used matrix multiplication for this, but in the interest of transparency, we did not do so
    
    # Make a prediction based on the score
    # +1 (versicolor) if the score is positive or zero, -1 (setosa) if the score is negative
    pred_i = 1 if score_i >= 0 else -1  # Note this elegant shortcut of a conditional statement!
    
    # Bookkeeping of current weights and biases before we update them
    bias_list.append(b)
    weights_list.append(w.copy())  # See further below for an explanation on the use of w.copy. It's a bit tricky.
    
    # Update the weights and bias
    b += eta * (y_i - pred_i)
    w += eta * (y_i - pred_i) * x_i
    
# Get the output
print(f"The resulting bias is {b}, the resulting weights are {w}.")
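
The comment in the cell above mentions that the score could also have been computed with a matrix (dot) product. A minimal sketch with made-up numbers (the names and values of `b_toy`, `w_toy`, and `x_toy` are hypothetical) suggests that the two forms give the same result:

```python
import numpy as np

# Hypothetical bias, weights, and one observation, for illustration only
b_toy = 0.5
w_toy = np.array([0.2, -0.4])
x_toy = np.array([3.0, 1.5])

# Score written out term by term, as in the loop above
score_explicit = b_toy + w_toy[0] * x_toy[0] + w_toy[1] * x_toy[1]

# Same score written as a dot product of weights and features
score_dot = b_toy + w_toy @ x_toy

print(np.isclose(score_explicit, score_dot))  # True
```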

___
#### ➡️ ✏️<font color=green>**Question 2**</font>

1. The above piece of code contains a loop. So it's a bit hard to just "click around" to find out what this code does. Obviously, to better understand, it would be nice to see what the code does for the first value of `i`, the second value of `i` etc., until you have a good grasp of what is going on. Find a procedure that allows you to do so. You can be creative, everything is allowed, as long as you do not destroy the code. **Hint:** If you mark several lines of code and then use the keyboard shortcut `ctrl + /` (do not type the plus, it means press `ctrl` and press `/`), you can comment out these lines and they will not be executed if you run a code cell with `shift + enter`.
2. Does the above loop somehow also manifest in our Excel solution from the first class? How?
3. In the above code, what lines would you say are the core of our learning engine?
___


#### ⚠️ Lists and copies (you can skip this in a first reading)
Do you see how we used `w.copy()` instead of just appending `w` to our `weights_list`? This is because of a fairly complicated concept that really relates more to computer science than data science. In essence, Python lists are truly **arrays of pointers**. This surely doesn't mean much to you now, but try running the following cells to see why using a copy makes sense:

In [None]:
wlist = [] # Create an empty list, e.g., this would be weights_list above
wl = np.array([0, 0, 0]) # Create an array of weights (e.g., w above)
wlist.append(wl) # Append the weights to our main list
wlist # Display the value of wlist

In [None]:
wl[1] += 1 # Change the value of the middle element
wlist # Display the value of wlist again

But wait, didn't we actually store the array `[0, 0, 0]` and not `[0, 1, 0]`!? Let's do it again...

In [None]:
wlist.append(wl) # Append the new weights to our main list
wl[1] += 1 # Change the value of the middle element again
wlist # Display the value of wlist again

Uh oh... that's not good. We are *retroactively* affecting the weights we have already stored in our main list! What if we use a copy instead?

In [None]:
wlist.append(wl.copy()) # Append the weights COPY to our main list
wl[1] += 1 # Change the value of the middle element again
wlist # Display the value of wlist again

The third item of our list was not impacted by the increment. This is because it's not the weight array anymore but a copy thereof. Make sure you understand what happened in this small section, this is a dangerous pitfall of Python that surely happens to every programmer at least once (and probably many more times...).

Intuitively, you can think of it this way: Python keeps in mind that the elements of `wlist` are the arrays `wl`, and it assumes that we want to keep a link between the two. So `wlist` should remember that it contains `wl` and that if `wl` gets changed, so should `wlist`. In fact, Python does not keep the contents of `wlist` in memory as such. All it has in memory is the instruction that `wlist` is a list that contains `wl`, and to fill in whatever `wl` happens to be at the moment. If you want to break the link between `wlist` and `wl` (which, as a data scientist, you want most of the time), then you use `.copy()`. Does it make sense now?

But wait, if this is so with lists, why didn't we treat the bias `b` in the same way? Why use `weights_list.append(w.copy())` but **NOT** `bias_list.append(b.copy())`? What we just explained only applies if the object that is appended is itself a container that can be changed in place, e.g., a list or a numpy array. If it is a primitive object like `b` (a simple variable that stores a single number), then there is no need for `.copy()`. In fact, if you try, you get an error. That's nice, so you know you don't have to use it!


**Main takeaway**: Be careful when storing results from lists, arrays, dictionaries, etc. in other containers (typically lists or dictionaries). Changing the item ex-post will also change the value stored in your *larger* container. If you want to avoid this (which you want most of the time) make sure to use a copy!
<br />
<br />
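
One way to check whether two names refer to the same object is Python's `is` operator. The sketch below (the variable names simply mirror the cells above) suggests that appending an array stores a reference to that very array, whereas appending a copy stores an independent object.

```python
import numpy as np

wl = np.array([0, 0, 0])
wlist = [wl, wl.copy()]  # First entry: the array itself; second entry: a copy

print(wlist[0] is wl)  # True: the list entry and wl are the same object
print(wlist[1] is wl)  # False: the copy is an independent object

wl[1] += 1             # Modify the original array in place
print(wlist[0])        # [0 1 0] (changed together with wl)
print(wlist[1])        # [0 0 0] (unaffected)
```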

#### Calculating learning diagnostics

In [None]:
# Now we evaluate how the learning machine performs in its task...
# We do so using certain standard "performance indicators" that serve as learning diagnostics

# Create some empty lists as containers that are going to be filled
misclassifications = []
false_positives = []
false_negatives = []

# Iterate over the learning steps of the perceptron algorithm
for i in range(len(bias_list)):
    # Compute score over FULL dataset (notice the matrix multiplication!)
    score = bias_list[i] + X @ weights_list[i].T
    
    # Compute the prediction over the full dataset
    pred = np.where(score >= 0, 1, -1)
    
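    # Compute misclassifications, false positives, false negatives
    error = y - pred
    misclassifications.append(sum(error != 0))
    # False positive rate (in %): true label -1 but predicted +1, relative to all actual negatives
    false_positives.append(100 * sum(error < 0) / sum(y == -1))
    # False negative rate (in %): true label +1 but predicted -1, relative to all actual positives
    false_negatives.append(100 * sum(error > 0) / sum(y == 1))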

print(misclassifications)


___
#### ➡️ ✏️<font color=green>**Question 3**</font>

1. Explain in your own words what false positives, false negatives and misclassifications are.
2. Use the tricks you identified in Question 2 to inspect what is going on in the loop. What does `weights_list[i]` look like for $i = 5$? What is its precise dimension? What is the dimension of the multiplication `X @ weights_list[i].T`? What does the `.T` mean? Would you really need it?
3. What is the dimension of `error` for any $i\in \lbrace 0, 1, \ldots, 99 \rbrace$?
4. Does the above loop somehow also manifest in our Excel solution from the first class? How?



___
#### ➡️ ✏️<font color=green>**Question 4**</font>
Create a scatter plot of the data such that:
+ the sepal width is displayed on the x-axis
+ the petal width is displayed on the y-axis
+ setosa data points are colored in blue and versicolor data points are colored in green
+ there is a legend showing which color belongs to which iris species
+ the x- and y-axis are labeled

In [None]:
# Enter your code here


#### Solution

___
#### ➡️ ✏️<font color=green>**Question 5**</font>
In fact, the perceptron algorithm as used here separates the feature space linearly, i.e., it *draws a line* in the above plot. Using the final weights and bias obtained by our algorithm, can you characterize this line in a mathematical equation? *Hint:* Express the equation in the form $x_2 = a + m \cdot x_1 $. Do not use any coding but use paper and pencil (maybe a digital version of those).

In [None]:
# Print out the final optimal values obtained by the perceptron


___

___
#### ➡️ ✏️<font color=green>**Question 6**</font>

1. Using the line equation you have determined in the previous task, augment the plot you created in Question 4 by drawing the perceptron classification line.

    *Hint:* Think of your plot as an x-y-plane. You want to define an array of x values (horizontal variable) going from the minimum to maximum of the sepal width. Then, using the equation derived above, you want to map a y value (vertical variable) for each of those points, e.g.,

    ```python
    # Create a linearly spaced vector of x's from the minimum to the maximum sepal width
    x_values = np.linspace(X[:, 0].min(), X[:, 0].max())
    # Create the linear relationship derived above (replace a and m by the values you found!)
    y_values = a + m * x_values
    ```

    With `x_values` and `y_values` we refer generically to a variable on the horizontal and on the vertical axis in a 2D plane, respectively. (I.e. we do not mean features and target here.)


2. Why do we actually benefit from including the bias $b$ in the perceptron model? Try to answer this question based on your calculation in Question 5 and the plot you just created.

In [None]:
# Enter your code here


#### Solution

___
#### ➡️ ✏️<font color=green>**Question 7**</font>

Create a visualization of the learning process. This plot should contain the following:
+ the iteration numbers on the horizontal axis
+ a dashed line with the number of misclassifications in blue
+ a dash-dotted line with the false positive rates in green
+ a dotted line with the false negative rates in orange
+ add a grid
+ don't forget to label your axes and add a legend!

Discuss and interpret the results of this plot with your classmates. Are you surprised by what you see? Using your common sense, do you like the result?


#### Solution

___
#### ➡️ ✏️<font color=green>**Question 8**</font>
Where do you see the biggest advantage of using Python over Excel?

___
## The mathematics of perceptron learning

We will go through the mathematics of perceptron learning together based on the slides for this class. However, for those who like to dig a little deeper, we have written down the math in a somewhat more general form below.

The perceptron model actually represents a special case of a neural network, generally considered the simplest one. Understanding how the perceptron works will not be enough to understand all neural networks, but it is a necessary first step, so let's dive into it!

We have a dataset consisting of $N$ pairs of features and labels: $D = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \dots, (\mathbf{x}^{(N)}, y^{(N)})\}$, where $\mathbf{x}^{(i)} \in \mathbb{R}^p$ is a vector with dimension $p$ ($p$ is equal to the number of features, i.e., in our example $p=2$), and $y^{(i)} \in \{-1, +1\}$. (Note: when we do math, we prefer to start counting at 1, not at 0; we stick to that convention in the course.)

Using the *activation function*: $\begin{aligned}f({z}) = \begin{cases}1 \, &\text{if } {z} \geq 0 \\ -1 &\text{otherwise} \end{cases} \end{aligned}$ (the same `>= 0` threshold as in the code above), the perceptron's goal is to find a scalar bias $b$ and a vector $\mathbf{w} \in \mathbb{R}^p$, such that the objective $\sum_{i=1}^N |y^{(i)} - f(\mathbf{w} \cdot \mathbf{x}^{(i)} + b)|$ is minimized (the two vertical bars indicate the absolute value of the difference).

To minimize this quantity, the algorithm takes the following steps:
1. Initialize the bias and weights arbitrarily, e.g., $b = 0$ and $\mathbf{w} = [w_1 \, w_2]^\top = [0 \, 0]^\top$. Define a learning rate $\eta$.
2. For each example $i$ in the dataset $D$, do the following:
    1. Compute the output given the current weights and bias:  
    $\begin{aligned}\hat{y}^{(i)} &= f(\mathbf{w} \cdot \mathbf{x}^{(i)} + b) \\ &= f(w_1 \cdot x_1^{(i)} + w_2 \cdot x_2^{(i)} + b)\end{aligned}$
    2. Update the weights and bias ($\eta$ is the learning rate):  
    $b \leftarrow b + \eta \cdot \left(y^{(i)} - \hat{y}^{(i)}\right)$  
    $\mathbf{w} \leftarrow \mathbf{w} + \eta \cdot \left(y^{(i)} - \hat{y}^{(i)}\right) \cdot \mathbf{x}^{(i)}$
3. We may want to repeat the second step until the prediction error is *good enough*.
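
The steps above can be sketched as a small function. This is only an illustration under stated assumptions, not the implementation used in this course: the function name `train_perceptron`, its parameter `n_epochs` (the number of passes over the data, i.e., step 3), and the toy data are all hypothetical. The sketch uses the same update as the learning-engine cell earlier in this notebook (true label minus prediction) and the same `>= 0` threshold.

```python
import numpy as np

def train_perceptron(X, y, eta=0.01, n_epochs=10):
    """Hypothetical helper: repeat the per-example update for several passes."""
    # Step 1: initialize bias and weights
    b = 0.0
    w = np.zeros(X.shape[1])
    # Step 3: repeat step 2 for a fixed number of passes over the data
    for _ in range(n_epochs):
        # Step 2: go through the examples one by one
        for x_i, y_i in zip(X, y):
            # Step 2.1: prediction with the current parameters
            y_hat = 1 if b + w @ x_i >= 0 else -1
            # Step 2.2: update (nothing changes when the prediction is correct)
            b += eta * (y_i - y_hat)
            w += eta * (y_i - y_hat) * x_i
    return b, w

# Made-up, linearly separable toy data (NOT the iris data)
X_toy = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 4.5]])
y_toy = np.array([-1, -1, 1, 1])

b_toy, w_toy = train_perceptron(X_toy, y_toy, eta=0.5, n_epochs=10)
pred_toy = np.where(b_toy + X_toy @ w_toy >= 0, 1, -1)
print(pred_toy)  # [-1 -1  1  1]
```

On this toy data the predictions match the labels after a few passes. With data that is not linearly separable, the updates would never settle, which is why a stopping rule such as a fixed number of passes is a common choice.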