# Imports

In [1]:
import torch

# Creating a Neural Network

Short description:
* A neural network consists of neurons
* The most common function a neuron computes is linear: $z = \mathbf{w}*\mathbf{x} + b$, where $\mathbf{x} \in \mathbb{R}^{n \times 1}$ is the input to the neuron, $\mathbf{w} \in \mathbb{R}^{1 \times n}$ holds the trainable weights, and $b$ is a scalar trainable bias term
* To give a neuron more representational power (i.e. the ability to learn non-linear functions), each neuron is followed by a non-linear *activation function f*, e.g. $y = f(z) = \mathrm{sigmoid}(z)$
* Neurons are arranged in layers (see image below) that can be stacked
* The math of a single neuron generalizes to a layer of neurons via $\mathbf{z} = W * \mathbf{x} + \mathbf{b}$, with the weights of all neurons collected in a matrix $W \in \mathbb{R}^{m \times n}$, the biases of all neurons in a vector $\mathbf{b} \in \mathbb{R}^m$, and the input to the layer as a vector $\mathbf{x} \in \mathbb{R}^n$. Each row of $W$ holds the weights of one neuron, and each column holds the weights attached to one input connection. The output of the layer is a vector $\mathbf{z} \in \mathbb{R}^m$, so the layer performs the transformation $\mathbb{R}^{n \times 1} \rightarrow \mathbb{R}^{m \times 1}$.
* The activation function is then applied element-wise to $\mathbf{z}$
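
To make the layer formula concrete, here is a small hand-worked sketch. The values in `W`, `b`, and `x` are made up for illustration and are not part of the notebook's model.

```python
import torch

# Hypothetical layer with n = 4 inputs and m = 2 neurons (values chosen by hand).
W = torch.tensor([[1.0, 0.0, 2.0, 0.0],
                  [0.0, 1.0, -1.0, 3.0]])  # shape (m, n) = (2, 4), one row per neuron
b = torch.tensor([0.5, -0.5])              # shape (m,), one bias per neuron
x = torch.tensor([0.0, 0.0, 1.0, 0.0])     # shape (n,)

z = W @ x + b          # linear part: picks the third column of W, then adds b
y = torch.sigmoid(z)   # activation applied element-wise

print(z)        # values: 2.5 and -1.5
print(y.shape)  # one output per neuron
```

Because `x` is a one-hot vector, `W @ x` simply selects the third column of `W`, which makes the result easy to check by hand.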

<img src="https://miro.medium.com/max/1400/1*ZB6H4HuF58VcMOWbdpcRxQ.png" width=500/>\
source: https://miro.medium.com/max/1400/1*ZB6H4HuF58VcMOWbdpcRxQ.png

Let's define an example input (input layer), which is just a vector:

In [None]:
x = torch.tensor([0, 0, 1, 0], dtype=torch.float) # 4-dim float vector
print(x)

## Possible Layer Types in PyTorch

PyTorch provides a variety of layer types that can be used to build neural networks. Some of the most commonly used layers include:

1. **Linear Layer (`torch.nn.Linear`)**:
    - Applies a linear transformation to the incoming data: $y = xA^T + b$
    - Parameters: `in_features` (size of each input sample), `out_features` (size of each output sample), `bias` (if set to False, the layer will not learn an additive bias)

2. **Convolutional Layer (`torch.nn.Conv2d`)**:
    - Applies a 2D convolution over an input signal composed of several input planes
    - Parameters: `in_channels`, `out_channels`, `kernel_size`, `stride`, `padding`, `dilation`, `groups`, `bias`

3. **Recurrent Layers (`torch.nn.RNN`, `torch.nn.LSTM`, `torch.nn.GRU`)**:
    - `torch.nn.RNN`: Applies a simple recurrent neural network (RNN) to an input sequence
    - `torch.nn.LSTM`: Applies a Long Short-Term Memory (LSTM) network to an input sequence
    - `torch.nn.GRU`: Applies a Gated Recurrent Unit (GRU) network to an input sequence
    - Parameters: `input_size`, `hidden_size`, `num_layers`, `bias`, `batch_first`, `dropout`, `bidirectional`

4. **Dropout Layer (`torch.nn.Dropout`)**:
    - Randomly zeroes some of the elements of the input tensor with probability `p` using samples from a Bernoulli distribution
    - Parameters: `p` (probability of an element to be zeroed)

5. **Batch Normalization Layer (`torch.nn.BatchNorm1d`, `torch.nn.BatchNorm2d`)**:
    - Applies Batch Normalization: `BatchNorm1d` expects a 2D or 3D input, `BatchNorm2d` a 4D input
    - Parameters: `num_features`, `eps`, `momentum`, `affine`, `track_running_stats`

6. **Activation Functions**:
    - `torch.nn.ReLU`: Applies the rectified linear unit function element-wise
    - `torch.nn.Sigmoid`: Applies the sigmoid function element-wise
    - `torch.nn.Tanh`: Applies the hyperbolic tangent function element-wise
    - `torch.nn.Softmax`: Applies the softmax function along a chosen dimension (`dim`) of an n-dimensional input tensor, rescaling the elements so that they lie in the range [0, 1] and sum to 1 along that dimension

These layers can be combined to create complex neural network architectures.
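
As a rough illustration of combining layers, the sketch below chains a few of the types listed above with `torch.nn.Sequential`. The layer sizes and the dropout probability are arbitrary values chosen for this example.

```python
import torch

torch.manual_seed(0)  # only for reproducible random weights

# Hypothetical architecture: all sizes are illustration values.
net = torch.nn.Sequential(
    torch.nn.Linear(4, 8),     # 4 input features -> 8 hidden units
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.2),   # only active in training mode
    torch.nn.Linear(8, 3),     # 8 hidden units -> 3 class scores
    torch.nn.Softmax(dim=-1),  # normalize the scores along the last dimension
)
net.eval()  # disable dropout for a deterministic forward pass

batch = torch.rand(2, 4)   # 2 samples with 4 features each
probs = net(batch)
print(probs.shape)         # (2, 3): one probability vector per sample
print(probs.sum(dim=-1))   # each row sums to approximately 1
```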

## Creating a Single Layer

Let's create a linear layer of input size `n` and output size `m` (here `n = 4` and `m = 5`):

In [None]:
layer = torch.nn.Linear(4, 5, bias=True) # 4 inputs, 5 outputs -> 5 neurons
# W
print("W =", layer.weight) # 5x4 tensor connecting all inputs to all outputs
# b
print("b =", layer.bias) # tensor of size 5, one bias for each output

Let's forward the input x through our layer:

In [None]:
z1 = layer(x)
print(z1)

We're now only missing an activation function:

In [None]:
y1 = torch.sigmoid(z1)
print(y1)

... and that's our first layer!\
We can create the other layers in the same fashion.\
But to organize everything together into one model, it is useful to subclass from `torch.nn.Module`.

## `torch.nn.Module` Class

The `torch.nn.Module` class is the base class for all neural network modules in PyTorch. It provides a way to encapsulate parameters, layers, and methods for building and training neural networks. Here are some key features and functionalities of the `torch.nn.Module` class:

1. **Initialization (`__init__` method)**:
    - This method is used to define the layers and parameters of the model.
    - Layers are typically instances of other `torch.nn` modules (e.g., `torch.nn.Linear`, `torch.nn.Conv2d`).

2. **Forward Pass (`forward` method)**:
    - This method defines the computation performed at every call.
    - It takes an input tensor and returns an output tensor.
    - The `forward` method is called automatically when the module is called (e.g., `output = model(input)`).

3. **Parameters**:
    - Parameters are instances of `torch.nn.Parameter`, which are tensors that are considered as model parameters.
    - They are automatically registered as parameters when assigned as attributes of the module.

4. **Submodules**:
    - Modules can contain other modules (submodules), which are registered as submodules when assigned as attributes.
    - This allows for building complex models by composing simpler modules.

5. **Device Management**:
    - Modules can be moved to different devices (e.g., CPU, GPU) using the `to` method.
    - This ensures that all parameters and buffers are moved to the specified device.

6. **Training and Evaluation Modes**:
    - Modules can be set to training mode using `model.train()` and evaluation mode using `model.eval()`.
    - This affects certain layers like dropout and batch normalization, which behave differently during training and evaluation.

Example of a simple neural network using `torch.nn.Module`:

```python
class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(4, 5)
        self.layer2 = torch.nn.Linear(5, 7)
        self.layer3 = torch.nn.Linear(7, 3)
    
    def forward(self, x):
        y1 = torch.sigmoid(self.layer1(x))
        y2 = torch.sigmoid(self.layer2(y1))
        y3 = self.layer3(y2)
        return y3
```
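
The features listed above can be observed directly on a small module. The following is a minimal sketch; `TinyNet` and its layer sizes are hypothetical and only serve to show parameter registration and the effect of `eval()`.

```python
import torch

class TinyNet(torch.nn.Module):  # hypothetical example model
    def __init__(self):
        super().__init__()
        self.hidden = torch.nn.Linear(4, 5)
        self.drop = torch.nn.Dropout(p=0.5)
        self.out = torch.nn.Linear(5, 1)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return torch.sigmoid(self.out(self.drop(h)))

net = TinyNet()

# Parameters of submodules are registered when the submodules are assigned as attributes.
for name, param in net.named_parameters():
    print(name, tuple(param.shape))
n_params = sum(p.numel() for p in net.parameters())
print(n_params)  # 4*5 + 5 + 5*1 + 1 = 31

# In eval mode dropout is a no-op, so repeated forward passes agree.
net.eval()
x = torch.ones(4)
with torch.no_grad():
    same = torch.equal(net(x), net(x))
print(same)
```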

### Task

In this task, define a simple neural network with three linear layers and ReLU activations by subclassing `torch.nn.Module`.\
Since we want to perform binary classification, the output layer should have one output and a sigmoid activation function.\
You can use the following skeleton:

In [6]:
class SimpleNN(torch.nn.Module):
    def __init__(self, input_dim: int):
        super().__init__()
        # define the layers

        # first layer: input_dim inputs -> 5 neurons
        self.layer1 = ...
        # hidden layer with 3 neurons
        self.layer2 = ...
        # output layer with 1 neuron for binary classification
        self.layer3 = ...
    
    def forward(self, x):
        # define the forward pass
        ...
       
        return ...

We can tell PyTorch on which device tensors are stored, which also determines where the computations run.\
For larger models and batches, a GPU is usually much faster than a CPU!

In [7]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
# we know that the input has 4 dimensions; we could also infer this from the data, e.g., x.shape[0], x.size(dim=-1), or len(x) (but not x.dim(), which returns the number of axes, here 1)
model = SimpleNN(4).to(device) # instantiate model and move to device
x = x.to(device) # note that all tensors of an operation need to be on the same device
y = model.forward(x) # calling forward directly works but skips registered hooks, so avoid it
y = model(x) # preferred: runs forward(x) via __call__, including any registered hooks
print(y)
print(y.shape)
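
The difference between `model(x)` and `model.forward(x)` is that the former goes through `torch.nn.Module.__call__`, which also runs any registered hooks. A minimal sketch, using a single linear layer as a stand-in for a full model and a hypothetical hook that only records its calls:

```python
import torch

layer = torch.nn.Linear(4, 1)  # stand-in for a full model
calls = []

def count_call(module, inputs, output):
    # Forward hooks receive the module, its positional inputs, and its output.
    calls.append(tuple(output.shape))

handle = layer.register_forward_hook(count_call)

x = torch.ones(4)
layer(x)          # goes through __call__, so the hook fires
layer.forward(x)  # bypasses __call__, so the hook does not fire
handle.remove()   # clean up the hook again

print(len(calls))  # 1
```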

* Learning works by adjusting the weights w.r.t. an objective function:
    * calculate the objective function from the network's output and the true labels in the dataset
    * take gradients of the objective function w.r.t. the network weights and biases
    * use a gradient-based optimizer to update the weights using these gradients
* An example objective function is the quadratic loss:
    $L_2(y, \hat{y}) = (y - \hat{y})^2$
* For multiple samples, this extends e.g. to the Mean Squared Error loss: $\frac{1}{M}\sum_{i=1}^{M} (y_i - \hat{y}_i)^2$
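
As a quick check of the two formulas, the sketch below computes the quadratic loss per sample and its mean by hand and compares the result with PyTorch's built-in function. The labels and predictions are made-up numbers.

```python
import torch

# Made-up labels and predictions for M = 4 samples.
y = torch.tensor([1.0, 0.0, 1.0, 1.0])       # true labels
y_pred = torch.tensor([0.5, 0.5, 1.0, 0.0])  # model outputs

per_sample = (y - y_pred) ** 2  # quadratic loss per sample: 0.25, 0.25, 0.0, 1.0
mse_manual = per_sample.mean()  # average over the M samples: 1.5 / 4 = 0.375
mse_builtin = torch.nn.functional.mse_loss(y_pred, y)

print(mse_manual, mse_builtin)
```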

## PyTorch Loss Functions

Loss functions, also known as cost functions or objective functions, are used to evaluate how well a model's predictions match the target values. PyTorch provides several built-in loss functions that can be used for different types of tasks, such as regression, classification, and more. Here are some commonly used loss functions in PyTorch:

1. **Mean Squared Error Loss (`torch.nn.MSELoss`)**:
    - Measures the average squared difference between the predicted and target values.
    - Commonly used for regression tasks.
    - Formula: $L(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

2. **Binary Cross-Entropy Loss (`torch.nn.BCELoss`)**:
    - Measures the binary cross-entropy between the predicted and target values.
    - Commonly used for binary classification tasks.
    - Formula: $L(y, \hat{y}) = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]$

3. **Cross-Entropy Loss (`torch.nn.CrossEntropyLoss`)**:
    - Combines `LogSoftmax` and `NLLLoss` in one single class.
    - Commonly used for multi-class classification tasks.
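    - Formula for one sample with $C$ classes, a one-hot target $y$, and predicted class probabilities $\hat{y}$ (the softmax of the model's raw scores): $L(y, \hat{y}) = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$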
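
The formulas above can be checked against the built-in classes. This is a small sketch with made-up probabilities, raw scores, and labels. Note that `CrossEntropyLoss` takes raw scores (logits) and integer class indices rather than probabilities.

```python
import torch

# Binary cross-entropy written out by hand.
p = torch.tensor([0.9, 0.2, 0.6])  # predicted probabilities (made up)
t = torch.tensor([1.0, 0.0, 1.0])  # binary targets
bce_manual = -(t * torch.log(p) + (1 - t) * torch.log(1 - p)).mean()
bce_builtin = torch.nn.BCELoss()(p, t)

# Cross-entropy as LogSoftmax followed by NLLLoss.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])  # raw scores for 2 samples, 3 classes
labels = torch.tensor([0, 2])             # class indices
ce_builtin = torch.nn.CrossEntropyLoss()(logits, labels)
log_probs = torch.log_softmax(logits, dim=-1)
ce_manual = torch.nn.functional.nll_loss(log_probs, labels)

print(torch.allclose(bce_manual, bce_builtin))
print(torch.allclose(ce_manual, ce_builtin))
```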

In [None]:
# These loss functions can be easily used in PyTorch by instantiating the corresponding class and passing the predicted and target values to it.
# For our binary classification task, we use the binary cross-entropy loss, which is implemented as `torch.nn.BCELoss`.

loss_func = torch.nn.BCELoss() # PyTorch knows several loss/cost/objective functions
y_hat = torch.tensor([0.], device=device) # target label, assumed to be 0 (despite its name, y_hat holds the target here and y the prediction)
loss = loss_func(y, y_hat)
print(loss)

Now we can do the backward pass (calculating the gradients):

In [None]:
loss.backward()
print("dL/dW1 =", model.layer1.weight.grad) # gradients of the BCE loss w.r.t. W1
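
To see what `backward()` computes, the sketch below compares autograd's result with the analytic gradient for a single sigmoid neuron trained with binary cross-entropy, where $\partial L / \partial z = p - t$. All numbers are hand-picked for illustration, and this one-neuron setup is an assumption of the example, not the notebook's model.

```python
import torch

# Single logistic neuron with hand-picked values (illustrative only).
w = torch.tensor([0.5, -1.0], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)
x = torch.tensor([2.0, 1.0])
t = torch.tensor([1.0])  # target label

p = torch.sigmoid(w @ x + b)     # prediction, shape (1,)
loss = torch.nn.BCELoss()(p, t)
loss.backward()                  # fills w.grad and b.grad

# Analytic gradients for sigmoid + BCE: dL/dz = p - t, dL/dw = (p - t) * x.
grad_b = (p - t).detach()
grad_w = grad_b * x
print(torch.allclose(w.grad, grad_w, atol=1e-6))
print(torch.allclose(b.grad, grad_b, atol=1e-6))
```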

Updating the weights of the model is now a matter of choosing the step size (learning rate) with which the gradients are subtracted from the respective weights.\
PyTorch comes with several optimizers, e.g. stochastic gradient descent (SGD) or more complex ones such as Adam.

In [None]:
print("W_0 =", model.layer1.weight)
optim = torch.optim.SGD(params=model.parameters(), lr=0.3) # tell the optimizer what weights to update and how much
optim.step() # update weights by applying gradients to the weights
print("W_1 =", model.layer1.weight) # updated weights W1
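
For plain SGD without momentum or weight decay, the update is simply $W \leftarrow W - \text{lr} \cdot \nabla_W L$. The following sketch verifies this by hand, with a single linear layer as a hypothetical stand-in for the model:

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(4, 1)  # hypothetical stand-in model
x = torch.tensor([0.0, 0.0, 1.0, 0.0])
target = torch.tensor([0.0])
lr = 0.3

loss = torch.nn.BCELoss()(torch.sigmoid(layer(x)), target)
loss.backward()

# Expected result of one plain SGD step, computed by hand before stepping.
w_expected = layer.weight.detach() - lr * layer.weight.grad
b_expected = layer.bias.detach() - lr * layer.bias.grad

optim = torch.optim.SGD(layer.parameters(), lr=lr)
optim.step()

print(torch.allclose(layer.weight, w_expected))
print(torch.allclose(layer.bias, b_expected))
```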

Now we can verify that our parameters are "more optimal" with respect to our loss function:

In [None]:
y = model(x)
loss = loss_func(y, y_hat)
print(loss)
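
One detail that the cells above do not need, because they take only a single step: gradients accumulate across calls to `backward()`, so they are usually reset with `zero_grad()` before each new step. A minimal sketch, again with a single linear layer as a hypothetical stand-in for the model:

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(4, 1)  # hypothetical stand-in model
x = torch.tensor([0.0, 0.0, 1.0, 0.0])
target = torch.tensor([0.0])
loss_func = torch.nn.BCELoss()
optim = torch.optim.SGD(layer.parameters(), lr=0.3)

loss_func(torch.sigmoid(layer(x)), target).backward()
grad_once = layer.weight.grad.clone()

# A second backward pass without resetting adds to the stored gradient.
loss_func(torch.sigmoid(layer(x)), target).backward()
grad_twice = layer.weight.grad.clone()
print(torch.allclose(grad_twice, 2 * grad_once))

# Reset the gradients before the next iteration.
optim.zero_grad()
print(layer.weight.grad)  # None in recent PyTorch versions, zeros in older ones
```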

# Batching

So far, we have only used a single input.\
We want to perform mini-batch SGD and therefore need to forward multiple inputs through the neural network.\
For this, PyTorch supports a batch dimension in many objects and functions:

In [None]:
# forward a single input through the NN
print(model(x))
# add a batch dimension and forward the batch through the NN
x_batched = x.unsqueeze(0)
print(model(x_batched))
# this becomes important when training the model with mini-batches
X = torch.tensor([[0, 0, 1, 0], [0, 1, 1, 0], [1, 0, 1, 0]], dtype=torch.float).to(device) # 3 samples with 4 features
print(model(X))
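
Putting the pieces together, a training loop over a batch could look roughly like the sketch below. It treats the three samples as one mini-batch. The small `torch.nn.Sequential` model, the all-zero labels, the learning rate, and the number of steps are assumptions made for this example, and the model is deliberately not the `SimpleNN` from the task.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in model (not the task solution).
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(),
    torch.nn.Linear(8, 1), torch.nn.Sigmoid(),
)
X = torch.tensor([[0, 0, 1, 0], [0, 1, 1, 0], [1, 0, 1, 0]], dtype=torch.float)
targets = torch.zeros(3, 1)  # made-up labels, shaped like the model output

loss_func = torch.nn.BCELoss()
optim = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(50):
    optim.zero_grad()                    # clear gradients from the previous step
    loss = loss_func(model(X), targets)  # loss averaged over the batch
    loss.backward()                      # compute gradients
    optim.step()                         # one SGD update
    losses.append(loss.item())

print(losses[0], losses[-1])  # the loss after training should be lower
```

In practice the batches would come from a dataset, for example through `torch.utils.data.DataLoader`, rather than from one fixed tensor.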