In [None]:
import torch

# Creating a neural network

Short description:
* A neural network consists of neurons
* The most common function a neuron computes is linear: $z = \mathbf{w}\mathbf{x} + b$, where $\mathbf{x} \in \mathbb{R}^{n \times 1}$ is the input to the neuron, $\mathbf{w} \in \mathbb{R}^{1 \times n}$ holds the trainable weights, and $b$ is a trainable scalar bias term
* To give a neuron more representational power (i.e. the ability to learn non-linear functions), each neuron is followed by a non-linear *activation function* $f$, e.g. $y = f(z) = \mathrm{sigmoid}(z)$.
* Neurons are arranged in layers (see image below) that can be stacked
* The math of a single neuron generalizes to a layer of $m$ neurons via $\mathbf{z} = W\mathbf{x} + \mathbf{b}$, with the weights of all neurons collected in a matrix $W \in \mathbb{R}^{m \times n}$, the biases of all neurons in a vector $\mathbf{b} \in \mathbb{R}^m$, and the input to the layer as a vector $\mathbf{x} \in \mathbb{R}^n$. Each row of $W$ holds the weights of one neuron, and each column holds the weights for one input connection. The output of the layer is a vector $\mathbf{z} \in \mathbb{R}^m$, so the layer is a transformation $\mathbb{R}^{n \times 1} \rightarrow \mathbb{R}^{m \times 1}$.
* The activation function is then applied element-wise to $\mathbf{z}$
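
To make the layer math concrete, here is a small worked example with hand-picked numbers. The values of `W`, `b`, and `x` are hypothetical and chosen only so the result is easy to check by hand; the sigmoid is written out as $1 / (1 + e^{-z})$.

```python
import torch

# Hypothetical numbers: a layer with n = 3 inputs and m = 2 neurons
W = torch.tensor([[1.0, 0.0, -1.0],   # weights of neuron 1 (row 1)
                  [0.5, 0.5, 0.5]])   # weights of neuron 2 (row 2)
b = torch.tensor([0.0, -1.0])         # one bias per neuron
x = torch.tensor([2.0, 4.0, 2.0])     # input vector

z = W @ x + b                # pre-activations, shape (2,)
y = 1 / (1 + torch.exp(-z))  # sigmoid, applied element-wise
print(z)  # tensor([0., 3.])
print(y)
```

Neuron 1 computes $1 \cdot 2 + 0 \cdot 4 - 1 \cdot 2 + 0 = 0$ and neuron 2 computes $0.5 \cdot (2 + 4 + 2) - 1 = 3$, which matches the printed `z`.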

<img src="https://miro.medium.com/max/1400/1*ZB6H4HuF58VcMOWbdpcRxQ.png" width=500/>\
source: https://miro.medium.com/max/1400/1*ZB6H4HuF58VcMOWbdpcRxQ.png

Let's define an example input (input layer), which is just a vector:

In [None]:
x = torch.ones(4) # 4-dim vector
print(x)

In PyTorch, a linear layer can be instantiated by giving the input size (n) and output size (m) of the layer:

In [None]:
layer = torch.nn.Linear(4, 5, bias=True) # 4 inputs, 5 outputs -> 5 neurons
# W
print("W =", layer.weight) # 5x4 tensor connecting all inputs to all outputs
# b
print("b =", layer.bias) # tensor of size 5, one bias for each output

Let's forward the input x through our layer:

In [None]:
z1 = layer(x)
print(z1)
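
As a sanity check, the layer output should agree with the formula $\mathbf{z} = W\mathbf{x} + \mathbf{b}$ written out by hand. The following sketch assumes a freshly created layer with a fixed seed (the seed is only there for reproducibility) and also shows that the same layer accepts a batch of inputs.

```python
import torch

torch.manual_seed(0)             # fixed seed so the random initial weights are reproducible
layer = torch.nn.Linear(4, 5)    # W has shape (5, 4), b has shape (5,)
x = torch.ones(4)

z_layer = layer(x)                          # forward pass through the layer
z_manual = layer.weight @ x + layer.bias    # the same computation written out
print(torch.allclose(z_layer, z_manual))

batch = torch.ones(8, 4)         # 8 input vectors stacked along the first dimension
print(layer(batch).shape)        # one 5-dim output per input vector
```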

We're now only missing an activation function:

In [None]:
y1 = torch.sigmoid(z1)
print(y1)

... and that's our first layer! 
We can create the other layers in the same fashion.
But to organize everything together into one model, it is useful to subclass from `torch.nn.Module`:

In [None]:
class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # create layers: input 4, output 5
        self.layer1 = torch.nn.Linear(4,5)
        # input 5, output 7
        self.layer2 = torch.nn.Linear(5,7)
        # input 7, output 3
        self.layer3 = torch.nn.Linear(7,3)
    
    def forward(self, x):
        # forward pass (with sigmoid activations); invoked when the model is called, i.e. model(x)
        y1 = torch.sigmoid(self.layer1(x)) # layer 1 + activation
        y2 = torch.sigmoid(self.layer2(y1)) # layer 2 + activation
        y3 = self.layer3(y2) # layer 3: no activation function; a sigmoid would restrict the outputs to the range (0, 1)
       
        return y3
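
Because the layers are assigned as attributes of a `torch.nn.Module`, their weights and biases are registered as parameters of the model. A minimal sketch to inspect them (the class is repeated here only so the snippet runs on its own):

```python
import torch

class MyModel(torch.nn.Module):  # same architecture as above
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(4, 5)
        self.layer2 = torch.nn.Linear(5, 7)
        self.layer3 = torch.nn.Linear(7, 3)

    def forward(self, x):
        y1 = torch.sigmoid(self.layer1(x))
        y2 = torch.sigmoid(self.layer2(y1))
        return self.layer3(y2)

model = MyModel()
for name, param in model.named_parameters():
    print(name, tuple(param.shape))   # e.g. layer1.weight (5, 4)

n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # (4*5 + 5) + (5*7 + 7) + (7*3 + 3) = 91
```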

We can tell PyTorch where tensors are stored and therefore where the computations run.\
For the large tensor operations typical of neural networks, a GPU is usually much faster than a CPU!

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
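
A short sketch of how device placement works for tensors, assuming the same `device` selection as above. The tensor names `a`, `b`, and `c` are only illustrative.

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

a = torch.ones(3)                  # created on the CPU by default
b = torch.ones(3, device=device)   # created directly on the chosen device
a = a.to(device)                   # for tensors, .to() returns the result, so reassign it
print(a.device, b.device)

c = a + b                          # works because both operands are on the same device
print(c)
```

On a machine without a GPU, everything simply stays on the CPU and the code runs unchanged.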

In [None]:
model = MyModel().to(device) # instantiate model and move to device
x = x.to(device) # all tensors involved in an operation must be on the same device
y = model.forward(x) # calls forward directly and bypasses any registered hooks; avoid this
y = model(x) # preferred: runs forward(x) via __call__, which also runs any registered hooks
print(y)
print(y.shape)
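
The difference between `model(x)` and `model.forward(x)` becomes visible once a hook is registered. The sketch below uses a single linear layer and a hypothetical hook `log_shape` that records output shapes in a list; it is meant as an illustration, not as something the model above needs.

```python
import torch

layer = torch.nn.Linear(4, 3)
calls = []  # hypothetical log, filled by the hook below

def log_shape(module, inputs, output):
    # forward hooks receive the module, the tuple of inputs, and the output
    calls.append(tuple(output.shape))

handle = layer.register_forward_hook(log_shape)
x = torch.ones(4)

layer.forward(x)    # bypasses __call__, so the hook does not run
print(len(calls))   # 0
layer(x)            # goes through __call__, so the hook runs
print(len(calls))   # 1
handle.remove()     # detach the hook again
```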

* Learning works by adjusting the weights w.r.t. an objective function
    * calculate the objective function based on calculated output vs. real labels from dataset
    * take gradients of objective function w.r.t. network weights and biases
    * use a gradient-based optimizer to update the existing weights using the gradients
* An example objective function is the quadratic loss:
    $L_2(y, \hat{y}) = (y - \hat{y})^2$
* Extended to $M$ samples, this becomes e.g. the mean squared error (MSE) loss: $\frac{1}{M}\sum_{i=1}^{M} (y_i - \hat{y}_i)^2$
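
The MSE formula can be checked against PyTorch's built-in loss with a few hand-picked numbers. The values of `y` below are hypothetical; `torch.nn.MSELoss` averages over all elements by default.

```python
import torch

y = torch.tensor([0.5, 2.0, -1.0])   # hypothetical network outputs
y_hat = torch.ones(3)                # dummy labels

mse_manual = ((y - y_hat) ** 2).mean()     # (0.25 + 1 + 4) / 3
mse_torch = torch.nn.MSELoss()(y, y_hat)
print(mse_manual)  # tensor(1.7500)
print(mse_torch)   # tensor(1.7500)
```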

In [None]:
loss_func = torch.nn.MSELoss() # PyTorch knows several loss/cost/objective functions
y_hat = torch.ones(3, device=device) # our dummy label (3-dim) to compare output to - note that we can also specify the device on tensor creation
loss = loss_func(y, y_hat)
print(loss)

Now we can do the backward pass, which calculates the gradients of the loss w.r.t. all weights and biases:

In [None]:
loss.backward()
print("dMSE/dW =", model.layer1.weight.grad) # print gradients for W1
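
For a single linear layer, the gradients computed by `backward()` can be verified with the chain rule. The sketch below is a simplified setting (one layer, no activation, fixed seed), not the three-layer model above: with $L = \frac{1}{3}\sum_k (z_k - t_k)^2$ we get $\partial L / \partial \mathbf{z} = \frac{2}{3}(\mathbf{z} - \mathbf{t})$ and $\partial L / \partial W = (\partial L / \partial \mathbf{z})\,\mathbf{x}^T$.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(4, 3)
x = torch.ones(4)
target = torch.ones(3)

z = layer(x)
loss = torch.nn.functional.mse_loss(z, target)
loss.backward()                           # fills layer.weight.grad and layer.bias.grad

# Chain rule by hand
dL_dz = 2 * (z.detach() - target) / 3     # gradient of the mean over 3 outputs
grad_W_manual = torch.outer(dL_dz, x)     # shape (3, 4), same as layer.weight
grad_b_manual = dL_dz                     # dz/db is the identity

print(torch.allclose(layer.weight.grad, grad_W_manual, atol=1e-6))
print(torch.allclose(layer.bias.grad, grad_b_manual, atol=1e-6))
```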

Updating the weights of the model is now a matter of choosing the step size (learning rate) and subtracting the scaled gradients from the respective weights.
PyTorch comes with several optimizers, e.g. stochastic gradient descent (SGD) or more complex ones such as Adam.

In [None]:
print("W_0 =", model.layer1.weight)
optim = torch.optim.SGD(params=model.parameters(), lr=0.3) # tell the optimizer which parameters to update and the step size (learning rate)
optim.step() # update the weights in place using the stored gradients
print("W_1 =", model.layer1.weight) # updated weights W1
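
For plain SGD (no momentum, no weight decay), one optimizer step should correspond to $W \leftarrow W - \text{lr} \cdot \partial L / \partial W$. A minimal sketch that checks this on a single linear layer, with a fixed seed for reproducibility:

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(4, 3)
lr = 0.3
optim = torch.optim.SGD(layer.parameters(), lr=lr)

loss = torch.nn.functional.mse_loss(layer(torch.ones(4)), torch.ones(3))
loss.backward()

W_before = layer.weight.detach().clone()       # copy, because step() updates in place
expected = W_before - lr * layer.weight.grad   # the update rule written out
optim.step()
print(torch.allclose(layer.weight, expected))
```

Note the `clone()`: without it, `W_before` would refer to the same storage as the parameter and change together with it.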

Now we can verify that our parameters are "more optimal" with respect to our loss function:

In [None]:
y = model(x)
loss = loss_func(y, y_hat)
print(loss)
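
Repeating forward pass, backward pass, and optimizer step gives a basic training loop. The sketch below makes a few assumptions that are not part of the cells above: it uses `torch.nn.Sequential` as a compact stand-in for `MyModel`, a smaller learning rate of 0.1, and 20 steps on the single dummy sample. Since `backward()` accumulates gradients, they are cleared with `zero_grad()` before each step.

```python
import torch

torch.manual_seed(0)
# Compact stand-in for MyModel: same layer sizes and activations
model = torch.nn.Sequential(
    torch.nn.Linear(4, 5), torch.nn.Sigmoid(),
    torch.nn.Linear(5, 7), torch.nn.Sigmoid(),
    torch.nn.Linear(7, 3),
)
x = torch.ones(4)
y_hat = torch.ones(3)   # dummy label, as above
loss_func = torch.nn.MSELoss()
optim = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(20):
    optim.zero_grad()                   # clear gradients left over from the previous step
    loss = loss_func(model(x), y_hat)   # forward pass + loss
    loss.backward()                     # backward pass
    optim.step()                        # weight update
    losses.append(loss.item())

print(losses[0], losses[-1])   # loss at the first and at the last step
```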