# Notebook 5: Artificial Neural Networks and Deep Learning

### Machine Learning Basic Module
Florian Walter, Tobias Jülg, Pierre Krack

Please obey the following implementation guidelines.

## General Information About Implementation Assignments
We will use Jupyter Notebooks for our implementation exercises. The task description is provided in the notebook, and the code is also run in the notebook. However, the implementation itself is done in additional files which are imported in the notebook. Please do not put any implementation that you want to be considered for correction in this notebook, but only in the Python files at the marked positions. The content of such a Python file could, for example, look like this:
```python
def f():
    ########################################################################
    # YOUR CODE
    # TODO: Implement this function
    ########################################################################
    pass
    ########################################################################
    # END OF YOUR CODE
    ########################################################################
```
To complete the notebook, remove the `pass` statement and write your solution only inside the `YOUR CODE` block. For a valid submission, no other lines in the file may be changed.

## General Information About Theory Assignments
This Jupyter Notebook also includes one or more theory assignments. The theory assignments have to be solved and submitted in a PDF file which has to be named **theory.pdf** and has to include the solutions of all assignments.

You can either typeset your solution in $\LaTeX$/Word or hand in a digitally written or scanned solution.
Please make sure to always submit a **PDF file**. If you decide to submit a handwritten solution, please make sure that it is legible for others. We will not consider solutions which we cannot read. Thus, we recommend typesetting your solution.

### Imports

In [None]:
%reload_ext autoreload
%autoreload 2
from dataclasses import dataclass
from enum import Enum
from typing import NamedTuple

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import torch
import random
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from micrograd.autograd import add, mul, backward, tanh, neuron
from micrograd.graph import topological_sort, draw_dot, DataNode, Graph, LABELS
from macrograd.dataset import SimpleDataset
from macrograd.layers import Linear, ReLULayer, Sigmoid, BinaryCrossEntropyLoss, CategoricalCrossEntropyLoss
from macrograd.optimizer import SGDOptimizer
from macrograd.trainer import BinaryTrainer, CategoricalTrainer
from macrograd.utils import seeding

seeding(42)

$$\newcommand{\diff}[1]{\frac{\text{d}}{\text{d}#1}}$$

# Autograd from Scratch

In this assignment you will implement the method upon which most of modern machine learning builds: automatic differentiation.
We will keep it simple and focus on the univariate case.
The multivariate case is similar, but more complex to implement—engineers from Meta did that job for you and called it PyTorch.

## Automatic differentiation and backpropagation

As the name implies, the goal of automatic differentiation (AD) is to compute derivatives algorithmically.

Multiple methods co-exist in the broader field of computer science: source code transformations, automatic differentiation with dual numbers, forward and reverse mode accumulation.

The machine learning community mostly uses the algorithm you will implement: a special case of reverse mode accumulation where the function being differentiated has a scalar (one-dimensional) output.

What is this function that always has a scalar output? Why is automatic differentiation different if we have a function where the output is multi-dimensional? Discuss with each other and/or ask us if you are unsure!

In machine learning, we are interested in how much our loss changes *for a specific input*.
Therefore, we only need to evaluate derivatives at specific inputs and never actually write down the full derivative function.

### Reverse and forward mode AD
Assume you have three functions $f$, $g$ and $h$ that are chained together such that $y = f(g(h(x)))$ and let $a$ and $b$ be intermediate results: $h(x) = a$ and $g(h(x)) = g(a) = b$.
Reverse mode AD (and therefore also backpropagation) finds the derivative of $y$ with respect to $x$ by first computing $$\diff{b} f(b)\text{ i.e. how much does $y$ change when $b$ changes, then}$$
$$\diff{a} f(g(a)) = \diff{b} f(b) \cdot \diff{a} g(a)\text{ i.e. how much does $y$ change when $a$ changes, and finally}$$
$$\diff{x} f(g(h(x))) = \diff{b} f(b) \cdot \diff{a} g(a) \cdot \diff{x} h(x) \text{ i.e. how much does $y$ change when $x$ changes.}$$
Forward mode AD goes the other way around.
It starts with $\diff{x} h(x)$ then computes $\diff{x} g(h(x))$ etc.
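
The two accumulation orders can be compared on a small numeric example. The following is a minimal sketch in plain Python; the concrete functions chosen for $h$, $g$ and $f$ and their hand-written derivatives are assumptions made purely for illustration.

```python
import math

# Hypothetical example functions for y = f(g(h(x))).
def h(x): return x * x          # a = h(x)
def g(a): return math.sin(a)    # b = g(a)
def f(b): return 3.0 * b        # y = f(b)

# Their local derivatives, written by hand.
def dh(x): return 2.0 * x
def dg(a): return math.cos(a)
def df(b): return 3.0

x = 0.5
a = h(x)
b = g(a)
y = f(b)

# Reverse mode: start at the output and walk back towards x.
dy_db = df(b)
dy_da = dy_db * dg(a)
dy_dx_reverse = dy_da * dh(x)

# Forward mode: start at the input and walk towards y.
da_dx = dh(x)
db_dx = dg(a) * da_dx
dy_dx_forward = df(b) * db_dx

print(dy_dx_reverse, dy_dx_forward)
```

Both orders multiply the same three local derivatives and therefore agree; they differ only in which intermediate quantities are kept along the way.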

Does it make sense why it is called *reverse* mode AD and *back*propagation?

## The compute graph

The very first step is to know what the functions $f_i$ are.
We achieve this by defining up front which functions are available and restricting the user to use those functions.
Whenever these functions are used, we build the "computational graph". Let us start with some definitions.

Each node in our graph will represent either a value with a gradient, e.g. the intermediate value $b = g(h(x))$ with the gradient $\diff{b} f(g(h(x)))$,
```python
@dataclass(eq=False)
class DataNode:
    data: float
    gradient: float = 0
```

... or an operation (i.e. a function, e.g. $f$).

```python
class Op(Enum):
    ADD = 0
    MUL = 1

@dataclass(eq=False)
class OpNode:
    op: Op
```

A graph is then, just as it is defined in every introductory computer science lecture, a tuple $G = (V, E)$ where $V$ is a set of nodes ($V$ stands for the mathematical term vertex) and $E$ is a set of edges (tuples $(u, v)$ with $u, v \in V$).
```python
Node = DataNode | OpNode # A node is either a value with a gradient or an operation
class Edge(NamedTuple):
    u: Node
    v: Node

class Graph(NamedTuple):
    V: set[Node]
    E: set[tuple[Node]]
```
> **Task 1** Open [`micrograd/autograd.py`](micrograd/autograd.py) and implement the functions `add` and `mul`. `add` takes as input a Graph $G=(V, E)$ and two nodes $x$ and $y$, and returns a tuple $(G^\prime, z)$, where $G^\prime$ is the updated graph and $z$ is the new resulting node with data equal to the sum of the data in $x$ and $y$.
In the new graph, the nodes should include a new `OpNode` and a new `DataNode`, the edges should include new edges from $x$ and $y$ to the new `OpNode` and from the new `OpNode` to the new result `DataNode`.
For inspiration, have a look at the implementation of the tanh activation function given in the source code.

In [None]:
def f(x, w1, w2):
    return mul(*add(Graph(set(), set()), DataNode(x), DataNode(w1)), DataNode(w2))
g, out = f(1, .5, .7)
draw_dot(g)

Try to fill in the gradients manually.
At each node $v$, the gradient should tell you how the output of the function changes if you change the value of $v$.
Start at the last node (value 1.05), continue with the middle nodes (values 0.7 and 1.5), and finish with the first two nodes (values 0.5 and 1.0).

For example, if you add $1$ to the node with data $0.7$, then the output of the function will be $1.5 \cdot 1.7 = 2.55$.
The change in the output is then $2.55 - 1.05 = 1.5$.
Be careful though: the technique "add one and check how much the output changes" only works with linear operations.

You can also verify your result by using the definition of the derivative:
$$
\frac{\text{d}}{\text{d}x} f(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}
$$
Using the same example from before:
$$
\begin{align}
\frac{\text{d}}{\text{d}x} 1.5x &= \lim_{h \to 0} \frac{1.5(x+h) - 1.5x}{h} \\
                                &= \lim_{h \to 0} \frac{1.5h}{h} \\
                                &= 1.5
\end{align}
$$

If you want a sufficiently accurate approximation to verify your results, use a small $h$, e.g. $h = 0.0001$, and compute $\frac{f(x+h) - f(x)}{h}$.
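
Such a numerical check can be written in a few lines. The sketch below uses plain Python floats instead of the graph API and assumes the forward difference $\frac{f(x+h) - f(x)}{h}$; the helper name `numerical_gradient` is hypothetical and not part of `micrograd`.

```python
def f(x, w1, w2):
    # Same function as the compute graph above: (x + w1) * w2
    return (x + w1) * w2

def numerical_gradient(func, args, index, h=1e-4):
    """Approximate the partial derivative of func w.r.t. args[index]."""
    shifted = list(args)
    shifted[index] += h
    return (func(*shifted) - func(*args)) / h

args = (1.0, 0.5, 0.7)
grads = [numerical_gradient(f, args, i) for i in range(3)]
print(grads)
```

Compare the printed values with the gradients you filled in by hand for the three input nodes.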

The key to understanding backpropagation is in the first two nodes (with data 1.0 and 0.5)!
Hopefully you have now understood how the backpropagation algorithm works.
If not, make sure to ask us!

>**Task 2** Use this insight to implement the functions `add_back` and `mul_back`, which take two input data nodes $x$ and $y$ and an output node `out` and set the gradients on $x$ and $y$.

In the example above, we gave you the order in which you should compute these nodes.
Multiple node orderings work for computing gradients, but not every ordering does.
An ordering called "topological ordering" fulfills the conditions necessary for the backpropagation algorithm to work.
Do a quick internet search on topological ordering.
Do you understand why this method of sorting nodes is appropriate?

We implemented the `topological_sort` function in [`micrograd/graph.py`](micrograd/graph.py) for you; you can have a look at it if you want.

We used the pseudocode on [Wikipedia](https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm) to implement Kahn's algorithm, which maps nicely to functional Python code.
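
For orientation, here is a minimal, imperative sketch of Kahn's algorithm on a toy graph. It is not the implementation from `micrograd/graph.py`: the function name `kahn_topological_sort` and the use of string labels instead of node objects are assumptions for illustration only.

```python
from collections import deque

def kahn_topological_sort(nodes, edges):
    """Return the nodes in a topological order using Kahn's algorithm.

    nodes: list of hashable nodes, edges: list of (u, v) pairs.
    """
    in_degree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        in_degree[v] += 1

    # Start with all nodes that have no incoming edge.
    ready = deque(n for n in nodes if in_degree[n] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in successors[u]:
            in_degree[v] -= 1
            if in_degree[v] == 0:
                ready.append(v)

    if len(order) != len(in_degree):
        raise ValueError("graph contains a cycle")
    return order

# Toy graph for (x + w1) * w2 with string labels instead of node objects.
nodes = ["x", "w1", "add", "s", "w2", "mul", "out"]
edges = [("x", "add"), ("w1", "add"), ("add", "s"),
         ("s", "mul"), ("w2", "mul"), ("mul", "out")]
order = kahn_topological_sort(nodes, edges)
print(order)
```

Every node appears after all of its predecessors, so iterating over the reversed order visits each node only after all nodes that depend on it.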

The final step is now quite simple.

> **Task 3** Implement the `backward` function. We implemented the skeleton for you. It iterates through the reversed topological order of the compute graph and identifies all triplets of (input nodes, operation nodes, output node). You just need to apply the correct backward function now.
>
>Hint:
>* You can use the argument unpacking operator `*` to have a nice one-liner, e.g.
>```python
>f(*range(3)) == f(0,1,2)
>```

In [None]:
backward(g)
draw_dot(g)

Here is a simple neuron using our autograd engine:

In [None]:
g, out = neuron(Graph(set(), set()), *(DataNode((i+1)/10) for i in range(5)))
backward(g)
draw_dot(g)

Now here is the same neuron with different input values. Why are the gradients all zero? Can you name this phenomenon?

In [None]:
g, out = neuron(Graph(set(), set()), *(DataNode(i+1) for i in range(5)))
backward(g)
draw_dot(g)

# PyTorch

In this exercise we will introduce `PyTorch`, one of the most commonly used frameworks for training deep neural networks as it provides a simple pythonic way of defining network models. Its main components include:
- **PyTorch Tensor** Data container similar to `np.array` with the difference that it can be moved to a GPU to execute operations there.
- **Autograd Engine** Automatic differentiation with a dynamic compute graph similar to the implementation shown in class but implemented in C++ for fast execution. Autograd can also be extended with new functions as we will see later on.
- **`torch.nn.Module`** Base class to define neural networks in a composition-based fashion, e.g. a layer can also be seen as a neural network and is, thus, also a `nn.Module`. Many layers such as the linear layer (`nn.Linear`) or loss functions such as the categorical cross entropy loss (`nn.CrossEntropyLoss`) are already implemented in `nn` as subclassed `nn.Module`s.
- **`torch.utils.data.Dataset`** Base class to define your data.
- **`torch.utils.data.DataLoader`** Loads the data from a dataset. Provides functionality to parallelize data loading. Shuffles and splits the data into batches.
- **`torchvision.transforms.v2`** Although in its own package, these functions can be used for preprocessing such as normalizing data samples.
- **`torch.optim`** Implementation of weight update algorithms also called optimizers.

We will reimplement parts of PyTorch in order to better understand what is going on under the hood when you are using it in future projects.

## Tensors and Autograd
You can think of PyTorch simply as a NumPy-like library which can execute calculations on the GPU and provides automatic gradient calculation with a few other utilities commonly used when training neural networks. It has its own container type, the tensor, but provides functions to convert from and to numpy arrays.

In [None]:
a = np.array([2.])
t = torch.from_numpy(a)
t

Currently this tensor is on the CPU, as NumPy only uses the CPU. If you have a GPU and the correct installation of PyTorch, you can move the tensor to your GPU and perform calculations there:

In [None]:
print(t.device)
# checking if a GPU is available
if torch.cuda.is_available():
    t = t.to("cuda:0")
    print(t.device)
else:
    print("No GPU available")

# moving back to CPU
t = t.cpu()
print(t.device)

By default, PyTorch does not track gradients for arbitrary tensors. You can tell PyTorch to track the gradient of a tensor by setting its `requires_grad` attribute to `True`.
To calculate the gradients after the forward pass you need to call the `backward()` method on the root node of the computational graph (typically you call `backward()` on the loss).
After the backward pass the gradients can be accessed with the `grad` attribute.
Before calling `backward()` the gradients are `None`.

In the example below we want to calculate the derivative of $f(x) = x^3$ at $x=2$, which should be $\frac{d f(x)}{dx} = 3x^2 \Rightarrow 3 \cdot 2^2 = 12$.

In [None]:
t.requires_grad = True
print(t.grad)

# perform calculation
y = t**3
# the printed tensor shows its grad_fn, i.e. the power operation that created it
print(y)

# gradient is still None
print(t.grad)

# backward pass
y.backward()

# gradient is now available
print(t.grad)

By default, PyTorch tracks the gradients of each tensor which has `requires_grad` set to `True`.
If you do not need gradients then you should deactivate it as the forward computation becomes more efficient.
A typical use case for this is the validation pass.
You can deactivate gradient tracking by using `with torch.no_grad()`.

In [None]:
a = np.array([2.])
t = torch.from_numpy(a)
t.requires_grad = True
with torch.no_grad():
    # no gradient tracking in here, even if requires_grad is True
    y = t**3

# backward pass would fail because no computational graph has been created
# y.backward()

We have seen in class that we can calculate the gradients of arbitrarily complex functions composed of simple base functions such as plus, minus, multiplication, etc. by using the computational graph with the chain rule. This is also how PyTorch computes the gradients: it overloads functions such as the power operation shown above and tracks the computational graph.

PyTorch already provides the implementation for many use cases. However, sometimes, you might have to implement your own function along with its gradient because you might want to use advanced mathematics which cannot be expressed by simple arithmetic or there exists a simple derivative which is much more efficient than using the computational graph. We will explore the second reason for an example later on in this exercise.

## Differentiable functions in PyTorch
Let's explore the API provided by PyTorch to define autograd functions.
To add a differentiable function to PyTorch, we need to inherit from `torch.autograd.Function` and implement static forward and backward functions.
An example for the sigmoid function is given below:

```python
class Sigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, logit: torch.Tensor):
        # for numeric stability we use the tanh formula here
        sigmoid = 1/2 * (1+torch.tanh(logit/2))
        ctx.save_for_backward(sigmoid)
        return sigmoid

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        (sigmoid,) = ctx.saved_tensors
        return grad_output * sigmoid * (1 - sigmoid)
```

The first argument of both functions is the autograd context object which is provided by PyTorch.
It can be used to store information about the forward pass to reuse it in the backward pass.
For example, the derivative of the sigmoid function can be expressed in terms of the sigmoid function itself:
$$\frac{d\sigma(x)}{dx} = \sigma(x)(1-\sigma(x))$$
We can therefore store the output of the sigmoid function with `ctx.save_for_backward` to reuse it in the backward pass with `ctx.saved_tensors`.
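
The identity can be checked numerically without PyTorch. The following is a small sketch using only the standard library; it reuses the tanh-based formula from the class above and compares the closed-form derivative against a central difference.

```python
import math

def sigmoid(x):
    # Same tanh-based formula as in the Sigmoid class above.
    return 0.5 * (1.0 + math.tanh(x / 2.0))

def sigmoid_grad(x):
    # Derivative expressed through the sigmoid output itself.
    s = sigmoid(x)
    return s * (1.0 - s)

x, h = 0.3, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
print(sigmoid_grad(x), numeric)
```

The two printed values should agree to many decimal places, which is why caching the forward output is enough for the backward pass.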

The other arguments in the forward pass are the function's arguments (in case of the sigmoid only one).

In the backward pass the second argument is the accumulated upstream gradient which we need to multiply with our own gradient according to the chain rule.
The backward function needs to output the gradient with respect to each parameter.
Since we only have one parameter in the example above, it only outputs one gradient.
If you are not interested in gradients for potential other parameters you can simply return `None` as the gradient value.
This would, for example, be the case for loss functions where the second argument holds the ground truth values.
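
As an illustration of returning `None` for an argument, here is a minimal sketch of a two-argument function. The class `SquaredError` is a hypothetical example and not part of the assignment; it computes the sum of squared errors and only provides a gradient for the prediction.

```python
import torch

class SquaredError(torch.autograd.Function):
    """Hypothetical example: sum of squared errors between prediction and target."""

    @staticmethod
    def forward(ctx, prediction: torch.Tensor, target: torch.Tensor):
        diff = prediction - target
        ctx.save_for_backward(diff)
        return (diff ** 2).sum()

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        (diff,) = ctx.saved_tensors
        # One return value per forward argument: a gradient for `prediction`,
        # None for `target` because we never need its gradient.
        return grad_output * 2 * diff, None

prediction = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
target = torch.tensor([0.0, 2.0, 5.0])
loss = SquaredError.apply(prediction, target)
loss.backward()
print(loss.item(), prediction.grad)
```

Note that custom functions are called through `.apply(...)` rather than by instantiating the class.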

# Your turn!
All the functions that we are going to implement in the following sections have to accept batches as input data, meaning that the shape of the input data is `(N, D)` where `N` is the batch size and `D` the data dimension. You are **not** allowed to use the equivalent PyTorch functions for the following implementations; implement them using base functions instead.

The first function we implement is the ReLU activation function (REctified Linear Unit), one of the most commonly used activation functions in deep learning due to its simple non-linearity.
> **Task 4** Calculate the derivative of the ReLU function given as
> $$\begin{aligned}&\text{ReLU}: \mathbb{R} \rightarrow \mathbb{R_0^+} \\ &\text{ReLU}(x) = \max(x, 0)\end{aligned}$$
> You can use the unit tests [`macrograd/test_layers.py`](macrograd/test_layers.py) to check whether your implementation returns the correct gradients.


$$\frac{\partial \text{ReLU}(x)}{\partial x} = ...$$




> **Task 5** Implement the `forward()` and `backward()` functions of the `ReLU` class in [`macrograd/layers.py`](layers.py).

In order to solve classification tasks with two classes, we usually combine two layers for the loss: the Sigmoid function (already implemented above), which creates a probability from the output neuron, and the binary cross entropy, which calculates the negative entropy between the output of the sigmoid $\hat{y}$ and the ground truth values $y$ that we would like to learn. We have already seen the entropy in the logistic regression assignment. It shows the log probability that $\hat{y}$ is from the probability distribution of $y$. We thus want to maximize it, i.e. minimize its negative.

> **Task 6** Calculate the partial derivative (w.r.t $\hat{y}$) of the negative binary cross entropy given as
> $$\begin{aligned}&\text{BCE}: (0, 1)\times (0, 1) \rightarrow \mathbb{R_0^+} \\ &\text{BCE}(\hat{y}, y) = -\left( y \log \hat{y} + (1-y) \log (1-\hat{y}) \right)\end{aligned}$$


$$\frac{\partial \text{BCE}(\hat{y}, y)}{\partial \hat{y}} = ...$$




> **Task 7** Implement the `forward()` and `backward()` functions of the `BCE` class in [`macrograd/layers.py`](layers.py). Note that this function takes two inputs: the ground truth labels $y$ and the output of the network $\hat{y}$. This means that the backward function also needs to return two gradients, $\frac{\partial \text{BCE}(\hat{y}, y)}{\partial \hat{y}}$ and $\frac{\partial \text{BCE}(\hat{y}, y)}{\partial y}$. However, the latter is irrelevant for backpropagation, and you can thus just return `None` for it. Also note that you are receiving a batch but want to output a single loss value. In this sheet we will use `mean` as our reduction method. Thus, the actual function becomes the following, which you also have to consider in your `backward` function:
> $$\begin{aligned}&\text{BCE}_\text{impl}: (0, 1)^{N\times 1}\times (0, 1)^{N\times 1} \rightarrow \mathbb{R_0^+} \\ &\text{BCE}_\text{impl}(\hat{y}, y) = -\frac{1}{N} \sum_{i=1}^{N}\left( y_i \log \hat{y}_i + (1-y_i) \log (1-\hat{y}_i) \right)\end{aligned}$$
> You can use the unit tests [`macrograd/test_layers.py`](macrograd/test_layers.py) to check whether your implementation returns the correct gradients.

$$\newcommand{\deriv}[2]{\frac{\partial #1}{\partial #2}}$$
Now we want to go multi-class, also called categorical. Thus, instead of one output neuron, we will now have $C$, where $C$ is the number of classes in our dataset. Like before, the output neurons can have arbitrary values, and thus we need to constrain them in order to get a valid probability distribution (all values between zero and one and all values sum up to one). To do so we use the softmax function, which is a generalization of the sigmoid function. We denote the function also with $\sigma$ and its $j$-th component is given as $\sigma(x)_j$.

$$\begin{aligned}&\sigma: \mathbb{R}^C \rightarrow (0, 1)^C \\ \text{Softmax}(x)_j =\ &\sigma(x)_j = \frac{\exp(x_j)}{\sum_{i=1}^{C} \exp(x_i)}\end{aligned}$$
which is the output of the $j$-th component.
Note that this function takes in a vector and outputs a vector; thus, its derivative is a Jacobian matrix. For $f:\mathbb{R}^n\rightarrow \mathbb{R}^m$ the Jacobian matrix is defined as
$$J_f = \begin{bmatrix}
\deriv{f_1}{x_1} & \dots & \deriv{f_1}{x_n}\\
\vdots & \ddots & \vdots\\
\deriv{f_m}{x_1} & \dots & \deriv{f_m}{x_n}\\
\end{bmatrix}$$
 
What does the Jacobian matrix of the softmax function look like?
Hint: The softmax is the generalization of the sigmoid function. Try to write the derivative of the softmax in terms of the softmax function itself.



Solution:

We look at two cases: the derivative of $\sigma(x)_j$ with respect to $x_j$ and the derivative with respect to $x_k$ where $j\neq k$.

Case 1:
$$
\begin{aligned}
\deriv{\sigma(x)_j}{x_j} &= \frac{(\sum_{i=1}^{C}\exp(x_i))\exp(x_j) - \exp(x_j)\exp(x_j)}{(\sum_{i=1}^{C}\exp(x_i))^2}\\
&= \underbrace{\frac{\exp(x_j)}{\sum_{i=1}^{C}\exp(x_i)}}_{\sigma(x)_j}\cdot \underbrace{\frac{\sum_{i=1}^{C}\exp(x_i) - \exp(x_j)}{\sum_{i=1}^{C}\exp(x_i)}}_{1-\sigma(x)_j}\\
&= \sigma(x)_j(1-\sigma(x)_j)
\end{aligned}
$$

Case 2:
$$
\begin{aligned}
\deriv{\sigma(x)_j}{x_k} &= \frac{-\exp(x_j)\exp(x_k)}{(\sum_{i=1}^{C}\exp(x_i))^2}\\
&= -\underbrace{\frac{\exp(x_j)}{\sum_{i=1}^{C}\exp(x_i)}}_{\sigma(x)_j} \cdot \underbrace{\frac{\exp(x_k)}{\sum_{i=1}^{C}\exp(x_i)}}_{\sigma(x)_k}\\
&= -\sigma(x)_j\sigma(x)_k
\end{aligned}
$$
We can then simplify this even further by combining both cases with the Kronecker delta $\delta_{ij}$:
$$\delta_{ij} = \begin{cases}1\quad i=j\\0\quad i\neq j\end{cases}$$
For $i\in[C]$
$$\deriv{\sigma(x)_j}{x_i} = (\delta_{ij}-\sigma(x)_j)\sigma(x)_i$$
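
The closed form can be verified numerically. Below is a minimal sketch in plain Python that builds the Jacobian from the formula and compares it with central differences; the helper names and the test vector are assumptions for illustration, and the softmax is written directly from its definition without any stabilization.

```python
import math

def softmax(x):
    # Direct implementation of the definition (fine for small inputs).
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(x):
    """J[j][i] = d softmax(x)_j / d x_i = softmax_j * (delta_ij - softmax_i)."""
    s = softmax(x)
    C = len(x)
    return [[s[j] * ((1.0 if i == j else 0.0) - s[i]) for i in range(C)]
            for j in range(C)]

def numerical_jacobian(x, h=1e-6):
    C = len(x)
    J = [[0.0] * C for _ in range(C)]
    for i in range(C):
        plus, minus = list(x), list(x)
        plus[i] += h
        minus[i] -= h
        sp, sm = softmax(plus), softmax(minus)
        for j in range(C):
            J[j][i] = (sp[j] - sm[j]) / (2 * h)  # central difference
    return J

x = [0.1, -0.4, 1.2]
analytic = softmax_jacobian(x)
numeric = numerical_jacobian(x)
max_error = max(abs(analytic[j][i] - numeric[j][i])
                for j in range(3) for i in range(3))
print(max_error)
```

Each row of the Jacobian sums to zero, which reflects that the softmax outputs always sum to one.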


Even though single layers, such as the softmax function above, have a Jacobian matrix as their derivative, we can actually avoid dealing with Jacobians in deep learning by the following trick: The final layer, our loss function, always maps the output vector to a single scalar value. So when calculating the gradient with respect to the input of a layer and including all following layers up to the loss function, the Jacobian matrix will always resolve into a gradient vector. We will see this effect in the following task where we calculate the gradient for a function which combines the softmax function and the categorical cross entropy loss.

The categorical cross entropy loss is a generalization of the binary cross entropy loss to arbitrarily many classes. It is defined as follows:
$$
\begin{aligned}&\text{CCE}: (0, 1)^C\times (0, 1)^C \rightarrow \mathbb{R_0^+} \\
&\text{CCE}(\hat{y}, y) = -\sum_{i=1}^{C}\left( y_i \log \hat{y}_i\right)
\end{aligned}
$$
> **Task 8** Calculate the derivative of the function that combines softmax and categorical cross entropy into one layer. Simplify as much as possible.
> $$\begin{aligned}&\text{CCE}(\text{Softmax}(x), y) = -\sum_{i=1}^{C}\left( y_i \log \sigma(x)_i\right)\end{aligned}$$
> Hint: Take a one-hot encoding for $y$ as given.



$$\frac{\partial }{\partial x_j}  -\sum_{k=1}^{C}\left( y_k \log \frac{\exp(x_k)}{\sum_{i=1}^{C} \exp(x_i)}\right)  = ...$$




Having done the derivation (and simplified as much as possible), you should see the second reason given above for why you would want to implement your own backward function: In this case the gradient is much simpler when we fuse the two layers and thus more efficient to compute than with standard backpropagation, which would calculate the gradient for each layer separately (and in this case would have to deal with the more complex gradient of the softmax function).

> **Task 9** Implement the `forward()` and `backward()` functions of the `CrossEntropy` class in [`macrograd/layers.py`](layers.py). This layer combines softmax and categorical cross entropy. You can use your derivation from above for the backward function.
> Use the log-sum-exp trick to avoid numerical instability. Think about how you can reformulate the joint formula to get a log-sum-exp expression. Why can the softmax be problematic in terms of numerical stability? Answer with a comment in the Python file.
> Like in BCE, the function receives two inputs $y$ and $\hat{y}$. Be reminded that you are receiving a batch but want to output a single loss value. Like before, we will use `mean` as our reduction method. Thus, the actual function becomes the following, which you also have to consider in the gradient of your `backward()` function:
> $$\begin{aligned}&\text{CrossEntropy}_\text{impl}: \mathbb{R}^{N\times C}\times (0, 1)^{N\times 1} \rightarrow \mathbb{R_0^+} \\ &\text{CrossEntropy}_\text{impl}(x, y) = -\frac{1}{N} \sum_{n=1}^{N}\sum_{k=1}^{C}\left( y_{n,k} \log \frac{\exp(x_{n,k})}{\sum_{i=1}^{C} \exp(x_{n,i})}\right)\end{aligned}$$
> You can use the unit tests [`macrograd/test_layers.py`](macrograd/test_layers.py) to check whether your implementation returns the correct gradients.

The only missing layer that we need to build a small neural network is a linear layer (also called fully connected or dense layer) which is essentially a Perceptron.
Since this is a layer with parameters, we are going to use PyTorch's autograd for the gradient calculation.
`torch.nn.Parameter` is a special type of tensor meant for model parameters which already has the `requires_grad` attribute set to `True`.

> **Task 10** Implement the `__init__()`, `reset_parameters` and `forward()` methods of the `Linear` class in [`macrograd/layers.py`](macrograd/layers.py).
Use `torch.nn.Parameter` to create parameters for weights and bias to implement the following linear function:
> $$\text{Linear}(x) = w^T x + b$$
> `reset_parameters` should initialize the parameters as follows: $b$ should be initialized to the constant zero and $w$ randomly from the standard normal distribution. (We will have a closer look at proper initialization in the next exercise.)
>
> Hint: Again you should be careful to handle batches in the correct way. *Broadcasting* should help you to avoid for-loops.

With all the things that we have implemented so far we can perform a forward pass, perform a backward pass using autograd and observe the parameter's gradients. The only thing left to implement is the algorithm that updates the weights given the gradient, also called the optimizer. We will simply use Stochastic Gradient Descent (SGD). Recall, that SGD uses a random subset of the dataset called the (mini) batch and updates the weights according to the negative gradient direction of the loss function with respect to the weights, multiplied by some learning rate $\alpha$:

$$w' = w - \alpha \nabla_w L(y, x)$$
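
To see the update rule in action without any framework, consider the following sketch. It fits a single scalar weight on hypothetical toy data with a hand-written gradient of the mean squared error; the data, the loss, the batch size and the learning rate are all assumptions chosen for illustration and are unrelated to the `SGDOptimizer` class.

```python
import random

random.seed(0)

# Hypothetical toy data: y = 3 * x, so the best weight is w = 3.
xs = [i / 10 for i in range(1, 21)]
data = [(x, 3.0 * x) for x in xs]

def batch_gradient(w, batch):
    """Gradient of the mean squared error 1/B * sum (w*x - y)^2 w.r.t. w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

w, alpha = 0.0, 0.1
for epoch in range(50):
    random.shuffle(data)                      # new random mini-batches per epoch
    for start in range(0, len(data), 4):
        batch = data[start:start + 4]
        w = w - alpha * batch_gradient(w, batch)  # w' = w - alpha * grad

print(w)
```

Each step only looks at four samples, yet the weight still moves towards the value that minimizes the loss on the whole dataset.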

> **Task 11** Implement the `__init__()`, `step()` and `zero_grad()` methods of the `SGDOptimizer` class in the [`macrograd/optimizer.py`](macrograd/optimizer.py) file.

So now we have all the ingredients to create and train a small neural network, even with several layers. What is still missing is the data. We use the same data as in our first exercise. You will find it in `data.csv`. In PyTorch, data is abstracted into `Dataset` classes, which essentially provide indexed access that returns the $i$-th data sample.
> **Task 12** Implement the `__init__()`, `__getitem__()` and `__len__()` functions of the `SimpleDataset` class in [`macrograd/dataset.py`](macrograd/dataset.py).

If you are done, let's load the data. The following cell should not output an error if your implementation is correct.

In [None]:
d_path = "macrograd/data.csv"
data = SimpleDataset(d_path)

# these checks should not fail if your implementation is correct
assert len(data) == 200
rnd_idx = random.randint(0, 199)
assert isinstance(data[rnd_idx], tuple)
assert isinstance(data[rnd_idx][0], torch.Tensor)
assert isinstance(data[rnd_idx][1], torch.Tensor)
assert data[rnd_idx][0].shape == torch.Size([2])
assert data[rnd_idx][1].shape == torch.Size([1])
assert data[rnd_idx][0].type() == 'torch.FloatTensor'
assert data[rnd_idx][1].type() == 'torch.FloatTensor'
assert {i[1].item() for i in data} == {0., 1.}

You should have learned about the problem of overfitting this week. The problem arises when the network has learned the training data by heart and thus performs really well on it, but on the other hand fails to generalize and thus performs rather badly on unseen data. To detect and mitigate this effect, we usually split the data into three categories:
- Training data: The data that we are training on.
- Validation data: During training we use this subset to check how well the model is performing on unseen data to avoid overfitting.
- Test data: Through us, the researchers, the model might also be optimized for the validation data during hyperparameter optimization, as we select choices where the model performs better on the validation data. Thus, we use the test data at the very end to create an unbiased statement of the model performance on unseen data.

It is very important that the splits still belong to the same feature distribution, as deep neural networks can only predict within this distribution (out-of-distribution prediction / extrapolation is not possible with neural networks). Thus, one usually gathers one dataset and splits it into these subsets. However, here comes the other part that one has to be careful about: Splitting is data dependent. The splits are not allowed to have any information about data from the other splits, i.e. they must be independent.
For example in time series data, one sequence should always be only in one split and not distributed among splits.
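
A split that keeps whole sequences together can be sketched as follows. The function `split_by_sequence` and the toy sequence ids are hypothetical and only illustrate the idea; they are not needed for the dataset used in this notebook.

```python
import random

def split_by_sequence(sequence_ids, val_fraction=0.2, seed=0):
    """Assign whole sequences to either the training or the validation split."""
    unique_ids = sorted(set(sequence_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_val = max(1, int(len(unique_ids) * val_fraction))
    val_ids = set(unique_ids[:n_val])
    train_idx = [i for i, s in enumerate(sequence_ids) if s not in val_ids]
    val_idx = [i for i, s in enumerate(sequence_ids) if s in val_ids]
    return train_idx, val_idx

# Hypothetical data: 10 sequences with 5 time steps each.
sequence_ids = [seq for seq in range(10) for _ in range(5)]
train_idx, val_idx = split_by_sequence(sequence_ids)
print(len(train_idx), len(val_idx))
```

Because the shuffle happens on sequence ids rather than on individual samples, no sequence contributes time steps to both splits.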

Since we do not have this issue here, we can simply split the data randomly with a split of 80/20 as we also won't use a test data set. `torch.utils.data.random_split` can be used to split a torch `Dataset` into subsets as shown in the cell below:

In [None]:
train, val = torch.utils.data.random_split(data, [int(len(data)*0.8), int(len(data)*0.2)])

With the current setting, we can only iterate over the dataset sample by sample. However, for training the neural network we would like to use mini-batches. Furthermore, these mini-batches should be uncorrelated to avoid that the neural network learns a correlation between the samples in the same batch. Thus, after each epoch we would like to create new random mini-batches.
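
Conceptually, reshuffled mini-batches amount to the following. This is a plain-Python sketch on sample indices only; the generator name `minibatches` is hypothetical.

```python
import random

def minibatches(n_samples, batch_size, rng):
    """Yield lists of sample indices, reshuffled on every call (i.e. every epoch)."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

rng = random.Random(0)
for epoch in range(2):
    batches = list(minibatches(n_samples=10, batch_size=4, rng=rng))
    print(epoch, batches)
```

Every sample appears exactly once per epoch, and the last batch is smaller when the dataset size is not a multiple of the batch size.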

PyTorch provides the `torch.utils.data.DataLoader` class which does exactly this and can optionally use multiprocessing (via its `num_workers` argument) to fetch the data samples:

In [None]:
train_data = DataLoader(train, batch_size=32, shuffle=True)
val_data = DataLoader(val, batch_size=32)

Now we would like to use all our previous code to actually train a neural network on the data.
Have a look at the `Trainer` class in [`macrograd/trainer.py`](macrograd/trainer.py). It is meant as an abstract class for training deep neural networks. Concrete implementations might differ, e.g. binary classification vs. categorical classification, and thus have to subclass it and implement the concrete training process by implementing the `train_epoch()` and `val_epoch()` functions. You do not have to change anything in the `Trainer` code. However, if you did not manage to do the optimizer task, you can use PyTorch's SGD optimizer in `configure_optimizer`.

> **Task 13** Implement the `train_epoch()` and `val_epoch()` methods of the `BinaryTrainer` class in the [`macrograd/trainer.py`](macrograd/trainer.py) file.
> - `train_epoch`: Loop once over the train dataset (epoch), perform forward and backward pass, do an optimizer step and don't forget to zero the gradients after you are done. Calculate the average loss and accuracy over the whole training set and save it in the corresponding lists that you find in the base `Trainer` class.
> - `val_epoch`: Loop once over the val dataset, perform only the forward pass without gradient tracking (!). Calculate the average loss and accuracy over the whole validation set and save it in the corresponding lists that you find in the base `Trainer` class.
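
One detail of "average over the whole set" is easy to miss: if the loss returns the mean over a mini-batch and the last batch is smaller than the others, simply averaging the per-batch values gives a slightly different number than the true mean over all samples. The sketch below assumes per-batch mean losses and uses made-up values; `epoch_average` is a hypothetical helper.

```python
def epoch_average(batch_means, batch_sizes):
    """Combine per-batch mean values into the mean over all samples."""
    total = sum(mean * size for mean, size in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)

# hypothetical per-batch mean losses; the last batch holds only 2 samples
batch_losses = [0.50, 0.40, 1.00]
batch_sizes = [4, 4, 2]

naive = sum(batch_losses) / len(batch_losses)        # treats all batches equally
weighted = epoch_average(batch_losses, batch_sizes)  # weights by batch size
print(naive, weighted)
```

The same weighting applies to the accuracy, or one can count the correct predictions and divide by the number of samples once at the end of the epoch.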


In [None]:
model = Linear(2, 1)
loss = BinaryCrossEntropyLoss()
optim = SGDOptimizer(model, lr=0.001)

# Use the following three lines if you did not manage to
# create the Linear layer, the BCE loss or the SGD optimizer:
# model = torch.nn.Linear(2, 1)
# loss = torch.nn.BCEWithLogitsLoss()
# optim = torch.optim.SGD(model.parameters(), lr=0.001)

trainer = BinaryTrainer(model, loss, (train_data, val_data), optim)

In [None]:
# let's train for 1000 epochs
trainer.fit(epochs=1000)

The following cell should show a decreasing loss if everything is implemented correctly. The training accuracy reaches 100%, whereas the validation accuracy stagnates at around 98%. As the cell after the next one shows, the reason is that the misclassified validation samples (which are not used for training) lie close to the decision border. This effect can be mitigated by using more training data, since training and validation data are then more likely to be distributed similarly.

In [None]:
# plot training and validation curves
def plot_results(train, val, title, y_axis):
    fig, ax = plt.subplots(1, 1)
    ax.plot(train, label="train")
    ax.plot(val, label="val")
    ax.set_title(title)
    ax.set_xlabel('epochs')
    ax.set_ylabel(y_axis)
    ax.legend()
    plt.show()

plot_results(trainer.train_losses, trainer.val_losses, "Losses", "loss")
plot_results(trainer.train_accs, trainer.val_accs, "Accuracies", "accuracy")


In [None]:
fig = plt.figure()
ax = fig.add_subplot(1,1,1)

# materialize the dataset once instead of iterating over it several times
d = [i for i in data]
colors = np.array(["tab:blue" if label.item() == 0 else "tab:orange" for _, label in d])
ax.scatter(np.array([features[0].item() for features, _ in d]),
           np.array([features[1].item() for features, _ in d]), color=colors)

x_min, x_max = ax.get_xlim()
y_min, y_max = ax.get_ylim()
step = 0.1

x, y = np.meshgrid(
        np.arange(x_min, x_max, step), np.arange(y_min, y_max, step)
    )
# flatten the parameters so that this works for both (2, 1) and (1, 2) weight layouts
w = model.weight.detach().numpy().reshape(-1)
b = model.bias.detach().numpy().reshape(-1)
z = w[0] * x + w[1] * y + b[0]
ax.contour(x, y, z, levels=[0], colors="black")
plt.show()
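
The black contour at level 0 in the plot above marks the points where the output of the linear layer is zero. Assuming the model outputs logits that are passed through a sigmoid, this is exactly where the predicted probability equals 0.5, i.e. the decision border. The sketch below solves the line equation for made-up parameters; the values of `w0`, `w1` and `b` are hypothetical and not taken from the trained model.

```python
# hypothetical parameters of a trained 2D linear classifier
w0, w1, b = 1.5, -2.0, 0.5

def boundary_y(x):
    """Solve w0 * x + w1 * y + b = 0 for y (requires w1 != 0)."""
    return -(w0 * x + b) / w1

points = [(x, boundary_y(x)) for x in (-1.0, 0.0, 1.0)]
# the logit is (numerically) zero for every point on the border
logits = [w0 * x + w1 * y + b for x, y in points]
print(points)
```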

## Multiclass
Now it is time to also test the multi-class case of our implementation (i.e. `CategoricalCrossEntropyLoss`). For that we will use a slightly more complex dataset called MNIST. MNIST consists of images of handwritten digits (numbers from zero to nine). The `torchvision` package already provides a `Dataset` implementation which we can use to download the data. In the following cell we also apply a transform to normalize the data.

In [None]:
# convert the images to tensors and normalize them with the MNIST mean and standard deviation
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
train = datasets.MNIST('data', train=True, download=True, transform=transform)
val = datasets.MNIST('data', train=False, download=True, transform=transform)
# let's look at the shape of a single sample
print(train[0][0].shape)
print(f"Ground truth: {train[0][1]}")
plt.imshow(train[0][0][0])


train_data = DataLoader(train, batch_size=32, shuffle=True)
val_data = DataLoader(val, batch_size=32)
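
To see what the normalization does to a single pixel, the following sketch reproduces the two transform steps by hand. It assumes 8-bit input images, for which `ToTensor` rescales intensities to the range [0, 1]; `normalize_pixel` is a hypothetical helper for illustration only.

```python
MEAN, STD = 0.1307, 0.3081  # the values passed to transforms.Normalize above

def normalize_pixel(raw):
    """Map a raw 8-bit intensity (0..255) to the normalized network input."""
    scaled = raw / 255.0          # rescaling as done by ToTensor for 8-bit images
    return (scaled - MEAN) / STD  # shift and scale as done by Normalize

print(round(normalize_pixel(0), 4))    # -0.4242 (black background)
print(round(normalize_pixel(255), 4))  # 2.8215 (brightest stroke)
```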


> **Task 14** Implement the `train_epoch()` and `val_epoch()` methods of the `CategoricalTrainer` class in the [`macrograd/trainer.py`](macrograd/trainer.py) file. Hint: Both are very similar to the binary case. Be careful to use one-hot encoding for the ground truth data.
> - `train_epoch`: Loop once over the train dataset (epoch), perform forward and backward pass, do an optimizer step and don't forget to zero the gradients after you are done. Calculate the average loss and accuracy over the whole training set and save it in the corresponding lists that you find in the base `Trainer` class.
> - `val_epoch`: Loop once over the val dataset, perform only the forward pass without gradient tracking (!). Calculate the average loss and accuracy over the whole validation set and save it in the corresponding lists that you find in the base `Trainer` class.
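
As a reminder of what one-hot encoding means, here is a small plain Python sketch. The helper `one_hot` and the example labels are hypothetical; in PyTorch, `torch.nn.functional.one_hot` provides the same conversion for integer tensors.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    return [[1.0 if c == label else 0.0 for c in range(num_classes)]
            for label in labels]

labels = [5, 0, 4]  # hypothetical digit labels
encoded = one_hot(labels, num_classes=10)

# the index of the largest entry recovers the original label
decoded = [row.index(max(row)) for row in encoded]
print(encoded[0])
print(decoded)  # [5, 0, 4]
```

Note that the fallback `torch.nn.CrossEntropyLoss` also accepts plain class indices as targets, so whether the conversion is needed depends on which loss you use.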

This time we use several layers in our neural network, an architecture also called a Multi-Layer Perceptron (MLP).
It is important to add non-linearities (in this case the ReLU function) in between to let the network
approximate non-linear functions; without them, stacked linear layers collapse into a single linear map. Check out how one could also implement the MLP module class below using [`nn.Sequential`](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html).
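
A possible `nn.Sequential` variant could look as follows. This is only a sketch: it uses PyTorch's built-in layers rather than the `Linear` and `ReLULayer` classes from this notebook, and the name `sequential_mlp` and the dummy input are made up for illustration.

```python
import torch

# same layer structure as the MLP class, written with PyTorch's built-in layers
sequential_mlp = torch.nn.Sequential(
    torch.nn.Flatten(),             # (N, 1, 28, 28) -> (N, 784)
    torch.nn.Linear(28 * 28, 100),  # hidden layer
    torch.nn.ReLU(),                # non-linearity between the linear layers
    torch.nn.Linear(100, 10),       # one output score per digit class
)

dummy_batch = torch.zeros(32, 1, 28, 28)  # stand-in for one MNIST mini-batch
print(sequential_mlp(dummy_batch).shape)  # torch.Size([32, 10])
```

`nn.Sequential` saves writing an explicit `forward()` method, at the cost of only supporting a strictly linear chain of modules.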

In [None]:
class MLP(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()

        self.linear1 = Linear(input_size, hidden_size)
        self.linear2 = Linear(hidden_size, output_size)
        self.relu = ReLULayer()

        # In case you did not manage to create the Linear
        # and ReLU layers use the following lines:
        # self.linear1 = torch.nn.Linear(input_size, hidden_size)
        # self.linear2 = torch.nn.Linear(hidden_size, output_size)
        # self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = x.view(x.shape[0], -1)
        x = self.linear1(x)
        x = self.relu(x)
        x = self.linear2(x)
        return x

model = MLP(28*28, 100, 10)
loss = CategoricalCrossEntropyLoss()
optim = SGDOptimizer(model, lr=0.01)

# In case you did not manage to create the CategoricalCrossEntropyLoss
# or SGD optimizer use the following lines:
# loss = torch.nn.CrossEntropyLoss()
# optim = torch.optim.SGD(model.parameters(), lr=0.01)

trainer = CategoricalTrainer(model, loss, (train_data, val_data), optim, 10)
# train the network for 5 epochs
trainer.fit(5)

In [None]:
plot_results(trainer.train_losses, trainer.val_losses, "Losses", "loss")
plot_results(trainer.train_accs, trainer.val_accs, "Accuracies", "accuracy")

The plots above should show that we can reach around 90% accuracy with a very simple neural network and only 5 epochs of training.