# Fully Connected Networks

## About

- Powerful function approximators
- Universal approximators (assuming infinite width and/or depth)
- Essentially stacked linear regressions with multiple outputs
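
To make the last point concrete, here is a minimal sketch showing that a `torch.nn.Linear` layer computes `x @ W.T + b`, i.e. one linear regression per output. The shapes (`4` features, `3` outputs, `8` examples) are arbitrary choices made for illustration.

```python
import torch

torch.manual_seed(0)

layer = torch.nn.Linear(in_features=4, out_features=3)
x = torch.rand(8, 4)  # 8 examples, 4 features each

# Each row of `weight` holds the coefficients of one linear regression
manual = x @ layer.weight.T + layer.bias

print(layer.weight.shape)  # torch.Size([3, 4])
print(torch.allclose(layer(x), manual, atol=1e-6))
```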

## How should we use them?

Previously we used polynomial features for non-linear datasets, but this comes with downsides:
- what degree of polynomial should we use?
- maybe other functions would be better (they usually are)?
- if so, what are those functions?

![](./images/complex-fn.png)

We might come to the conclusion that:

> it would be best to learn those functions directly from data

... and that's what neural networks do.

## Perceptron and its limitations

> Perceptron is a neural network with a single layer and one output neuron (here: binary logistic regression)

It is able to learn __linearly separable data__, but it fails on data that is not linearly separable.

> __Let's see how it does on the XOR problem:__

In [1]:
# Data and targets for XOR

XOR = [
    ([0, 0], 0),
    ([0, 1], 1),
    ([1, 0], 1),
    ([1, 1], 0)
]

Let's see what a minimal neural network for XOR looks like:

![](images/xor_neural_net.png)

> __The code below trains a neural network on the XOR problem__

> __WE WILL CREATE A PROPER TRAINING LOOP FROM SCRATCH LATER, TREAT IT AS A BLACK BOX FOR NOW__

In [2]:
import torch


def xor_problem(model):

    inputs = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]]).float()
    targets = torch.tensor([0, 1, 1, 0]).float()

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10000):
        outputs = model(inputs).squeeze()

        loss = torch.nn.functional.binary_cross_entropy_with_logits(outputs, targets)
        loss.backward()

        optimizer.step()
        optimizer.zero_grad()

    print("Predicted logits:\n", model(inputs))
    print("Predicted labels:\n", (model(inputs) > 0).float())
    print("Targets:\n", targets)
    print("Weights:\n")

    for parameter in model.parameters():
        print(parameter.data)


xor_problem(torch.nn.Linear(2, 1))

Predicted logits:
 tensor([[ 0.0057],
        [ 0.0015],
        [-0.0037],
        [-0.0079]], grad_fn=<AddmmBackward>)
Predicted labels:
 tensor([[1.],
        [1.],
        [0.],
        [0.]])
Targets:
 tensor([0., 1., 1., 0.])
Weights:

tensor([[-0.0094, -0.0042]])
tensor([0.0057])


This simple model is unable to learn to classify the XOR data, __as it does not have enough depth__.

Let's see how shallow and deep neural networks compare:

![](./images/shallow-vs-deep.png)

Let's try adding another layer of size `2` in order to replicate the structure seen at the beginning, but before that:

### nn.Sequential

> __`torch.nn.Sequential` is a container-like layer which STACKS multiple layers together__

What essentially happens:
- We have our input data
- We pass it through the first layer (for now linear)
- The output of the first layer is passed to the next one
- This continues until we reach the last layer
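
The steps above can be sketched as a plain loop over the layers. This is only an illustration of the idea (the layer sizes are arbitrary assumptions), not a description of how `torch.nn.Sequential` is implemented internally:

```python
import torch

torch.manual_seed(0)

model = torch.nn.Sequential(torch.nn.Linear(2, 3), torch.nn.Linear(3, 1))
inputs = torch.rand(4, 2)

# Pass the data through each layer in order, feeding every output forward
outputs = inputs
for layer in model:
    outputs = layer(outputs)

print(torch.allclose(outputs, model(inputs), atol=1e-6))
```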

Below is a stack of two linear layers:

In [None]:
xor_problem(torch.nn.Sequential(torch.nn.Linear(2, 2), torch.nn.Linear(2, 1)))

A single linear layer transforms the input space in two steps:
- multiplication by the weights stretches the input space by a certain factor in some direction
- adding a constant (bias) shifts it

> If we add more linear layers, the whole transformation is still linear!

![](./images/factor-proof.png)
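
The claim above can also be checked numerically. The sketch below folds two stacked linear layers into a single one using `W = W2 @ W1` and `b = W2 @ b1 + b2`; the layer sizes and the name `merged` are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

first = torch.nn.Linear(2, 2)
second = torch.nn.Linear(2, 1)
stacked = torch.nn.Sequential(first, second)

# W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
merged = torch.nn.Linear(2, 1)
with torch.no_grad():
    merged.weight.copy_(second.weight @ first.weight)
    merged.bias.copy_(second.weight @ first.bias + second.bias)

inputs = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(torch.allclose(stacked(inputs), merged(inputs), atol=1e-6))
```

Since the stacked model is equivalent to one linear layer, it inherits the limitations of a single linear layer.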

## Activations

If we plot the XOR problem, we can see that __we cannot create a linear decision boundary which separates the data!__

![](images/xor_decision_boundary.png)

To overcome this, we need to apply a __non-linear transformation__ after the __hidden layer__:

![](./images/activation.png)

> Composition of non-linear functions makes the whole transformation non-linear

> __ACTIVATION FUNCTION ACTS ELEMENT-WISE ON OUR DATA!__

Okay, let's see how we do after applying `sigmoid`:

In [3]:
xor_problem(
    torch.nn.Sequential(
        torch.nn.Linear(2, 2), torch.nn.Sigmoid(), torch.nn.Linear(2, 1)
    )
)

Predicted logits:
 tensor([[ 0.0147],
        [ 0.0407],
        [-0.0475],
        [-0.0253]], grad_fn=<AddmmBackward>)
Predicted labels:
 tensor([[1.],
        [1.],
        [0.],
        [0.]])
Targets:
 tensor([0., 1., 1., 0.])
Weights:

tensor([[-0.4231,  0.2809],
        [ 0.5266, -0.2147]])
tensor([-0.0147,  0.3777])
tensor([[-0.0692, -0.5851]])
tensor([0.3962])
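
The predicted labels printed above still do not match the targets, so this particular run has not found a solution. Independently of how training goes, the sketch below shows that a `2-2-1` network with a non-linearity can represent XOR. The weights are picked by hand rather than learned, and `ReLU` is used instead of `sigmoid` only because it keeps the arithmetic simple; both are assumptions made for illustration.

```python
import torch

inputs = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

model = torch.nn.Sequential(
    torch.nn.Linear(2, 2), torch.nn.ReLU(), torch.nn.Linear(2, 1)
)

with torch.no_grad():
    # Hidden neurons: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1)
    model[0].weight.copy_(torch.tensor([[1.0, 1.0], [1.0, 1.0]]))
    model[0].bias.copy_(torch.tensor([0.0, -1.0]))
    # Output logit: h1 - 2 * h2 - 0.5
    model[2].weight.copy_(torch.tensor([[1.0, -2.0]]))
    model[2].bias.copy_(torch.tensor([-0.5]))

    logits = model(inputs).squeeze()

labels = (logits > 0).float()
print(logits)  # tensor([-0.5000,  0.5000,  0.5000, -0.5000])
print(labels)  # tensor([0., 1., 1., 0.])
```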


# Activation variations

There are multiple available activation functions, including (but not limited to):
- sigmoid
- tanh
- ReLU
- Leaky ReLU

![](./images/activ-fns.png)

> Newer activation functions were introduced in order to address the issues of the previously dominant ones

## Sigmoid

- The original neural network activation function
- Squashes input to the `(0, 1)` range (neuron on or off)

But this activation function has the following drawbacks:
- Not zero-centered
- __Oversaturation__ (most severe drawback)

### Not zero-centered

> Neural networks expect data to be zero-centered

If the inputs to a neuron are always positive (as is the case after a sigmoid), the gradients of all of that neuron's weights share the same sign: they are either all positive or all negative.

This leads to zig-zagging during training, especially for smaller batches.

> Larger batches mostly mitigate this issue, as the gradient will be averaged across many examples (some positive, some negative for different weights)
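
A small sketch of the same-sign effect for a single neuron and a single example: with strictly positive inputs, every weight gradient equals the upstream gradient times a positive number, so all of them share one sign. The squared-error loss and the sizes are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)

# Sigmoid outputs are strictly positive, like activations of a previous sigmoid layer
inputs = torch.sigmoid(torch.randn(1, 5))
neuron = torch.nn.Linear(5, 1)

loss = (neuron(inputs) - 1.0).pow(2).sum()
loss.backward()

# dL/dw_i = dL/dz * x_i and x_i > 0, so sign(dL/dw_i) = sign(dL/dz) for every i
signs = torch.sign(neuron.weight.grad)
print(signs)
```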

## Tanh

Hyperbolic tangent (`tanh`) is zero-centered, which solves this issue, but:

> Tanh also has oversaturation drawbacks

## Oversaturation

> When a neuron's activation saturates at the tails (large positive or negative inputs), its local gradient becomes close to zero

> During backpropagation this local gradient is multiplied by the local gradients of the other layers, so the product shrinks towards zero. This phenomenon is known as the __dying gradient__ (more commonly called the vanishing gradient) and is especially severe for deeper networks
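
To get a feeling for the numbers, here is a short sketch using the sigmoid derivative `s(x) * (1 - s(x))`. The chosen inputs and the depth of `10` layers are arbitrary assumptions, and the weights (which also enter the product) are ignored for simplicity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


# The local gradient peaks at 0.25 (for x = 0) and vanishes at the tails
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f} | local gradient = {sigmoid_grad(x):.6f}")

# Best case for 10 stacked sigmoids: every local gradient equals 0.25
best_case = sigmoid_grad(0.0) ** 10
print(f"Product of 10 local gradients (best case): {best_case:.2e}")
```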

## ReLU

`ReLU` was designed to combat the oversaturation and dying gradient problems.

> `ReLU` is given by `max(0, x)`

> Its gradient is always either `1` (for positive values) or `0` (for negative and zero values)

> A combination of piece-wise linear functions can approximate any non-linearity (especially with increasing depth)
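
The gradient values can be checked with autograd. This is a minimal sketch; the input values are arbitrary:

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0], requires_grad=True)

outputs = torch.relu(x)  # max(0, x), applied element-wise
outputs.sum().backward()

print(outputs)  # negative values are clamped to zero
print(x.grad)   # 1 where x > 0, otherwise 0
```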

### Advantages

- Reportedly much faster training (initially reported as a `6x` improvement)
- Cheaper to compute (thresholding values at zero)
- No oversaturation for positive values

### Dead Neurons

> __When a neuron outputs `0` FOR ALL EXAMPLES in our dataset__

- __Invariant to the input__ (no matter the input features, it returns `0`).
- Because of that, __it does not differentiate any part of our dataset__ and is therefore useless
- A learning rate that is too high may be a cause
- Large updates to the weights may "knock" a neuron into the negative regime, from which it won't recover
- __Even 50% of neurons might be dead in some architectures!__
- __Might be useful for pruning (removing unnecessary parts of the neural network)__

> __In practice it seems not to be too much of a problem (if the network is large it compensates with other neurons)__
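
A minimal sketch of a dead neuron, assuming a hand-set, very negative bias (`-100`) to force the situation; in real training it would arise from the weight updates described above. Note that the gradient of the dead neuron's weights is zero as well, which is why it does not recover.

```python
import torch

torch.manual_seed(0)

layer = torch.nn.Linear(3, 2)
with torch.no_grad():
    layer.bias[0] = -100.0  # push neuron 0 far into the negative regime

inputs = torch.rand(16, 3)
activations = torch.relu(layer(inputs))
activations.sum().backward()

print(activations[:, 0].abs().max())  # neuron 0 outputs 0 for every example
print(layer.weight.grad[0])           # zero gradient: no update will revive it
```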

## Leaky ReLU

> Leaky ReLU solves the dead neurons problem by using __a small, non-zero slope__ for negative values

$$
\text{LeakyReLU}_s(x) = \max(0, x) + s \cdot \min(0, x)
$$

> `s` is usually around `0.01`
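
The formula can be written directly with `torch.clamp` and compared against PyTorch's built-in `torch.nn.functional.leaky_relu`. The helper name `leaky_relu_manual` is made up for this sketch.

```python
import torch


def leaky_relu_manual(x, s=0.01):
    # max(0, x) + s * min(0, x)
    return torch.clamp(x, min=0) + s * torch.clamp(x, max=0)


x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0], requires_grad=True)

builtin = torch.nn.functional.leaky_relu(x, negative_slope=0.01)
print(torch.allclose(leaky_relu_manual(x), builtin, atol=1e-6))

builtin.sum().backward()
print(x.grad)  # the gradient for negative inputs is s instead of 0
```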

### Disadvantages

- Higher computational cost
- Solves a problem which is not a problem in many cases

### Advantages

- Neurons can recover from "dead" state and be useful for neural network

# Neural Network as a whole

Let's take a moment to see the neural network as a whole:

![](./images/nn.png)

- __Depth:__ how many layers are in a neural network
- __Width:__ how many neurons are in a certain layer (`out_features` in PyTorch)

> In general, we create a bottleneck with `nn.Linear` layers, starting with `N` features and finishing with `M` outputs

> __This is a rule of thumb, do not treat it as an exact science!__
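
As an illustration of depth, width and the bottleneck idea, here is a sketch of a small network going from `N = 10` features to `M = 5` outputs. The hidden sizes (`64`, `32`) are arbitrary assumptions, not a recommendation.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(10, 64),  # width 64
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32),  # width 32
    torch.nn.ReLU(),
    torch.nn.Linear(32, 5),  # 5 outputs
)

depth = sum(isinstance(layer, torch.nn.Linear) for layer in model)
parameters = sum(p.numel() for p in model.parameters())

print(f"Depth: {depth} linear layers | Parameters: {parameters}")
print(model(torch.rand(32, 10)).shape)  # torch.Size([32, 5])
```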

## Quick look at backpropagation

When `backward` is called, the gradient is calculated layer by layer, starting from the last layer, and passed back to the previous ones.

![](./images/backprop.png)
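
The sketch below follows the chain rule by hand for a tiny two-parameter network `y = w2 * relu(w1 * x)` with a squared loss, and compares the result with what `backward` computes. All values are arbitrary assumptions chosen to keep the arithmetic easy.

```python
import torch

x = torch.tensor(2.0)
w1 = torch.tensor(3.0, requires_grad=True)
w2 = torch.tensor(-1.0, requires_grad=True)

hidden = torch.relu(w1 * x)  # 6
y = w2 * hidden              # -6
loss = y ** 2                # 36
loss.backward()

# Chain rule, starting from the loss and moving backwards
dloss_dy = 2 * y.item()                     # -12
dloss_dw2 = dloss_dy * hidden.item()        # -12 * 6 = -72
dloss_dhidden = dloss_dy * w2.item()        # -12 * -1 = 12
dloss_dw1 = dloss_dhidden * 1.0 * x.item()  # relu'(6) = 1, so 12 * 2 = 24

print(w2.grad.item(), dloss_dw2)
print(w1.grad.item(), dloss_dw1)
```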

# Exercise

## First part

### Provided code

First, let's talk about the `print_gradient` function:
- Calculates the loss of the model's predictions for `inputs` w.r.t. `targets`
- __Runs backpropagation POPULATING GRADIENTS of parameters (weights and biases of linear model)__
- __Prints gradients of all parameters contained in the model__

> __Gradients of parameters are printed in reverse order (LAST LAYER'S GRADIENTS ARE PRINTED FIRST!)__

About the data:
- `inputs` - random matrix of shape `(examples, features)`
- `targets` - random __integer__ vector of shape `(examples,)` which assigns one of `5` classes (from `0` to `5` __exclusive__) to each example

### Your task

> __Analyze OVERSATURATION based on provided `inputs` and `targets`__

After your analysis you should answer the following questions:
- How does neural network width affect oversaturation?
- How do different choices of activation functions in the hidden layers affect oversaturation?
- How does depth influence oversaturation?

In [None]:
def print_gradient(model, inputs, targets):
    outputs = model(inputs)

    loss = torch.nn.functional.cross_entropy(outputs, targets)
    # Backpropagation populates `.grad` of every parameter
    loss.backward()

    # `enumerate` objects are not reversible, so materialize the pairs first
    for i, parameter in reversed(list(enumerate(model.parameters()))):
        print(f"\n\n--------- PARAMETER {i} GRADIENT ---------\n\n")
        print(parameter.grad)

In [None]:
inputs = torch.rand(32, 10)
targets = torch.randint(low=0, high=5, size=(32,))

In [None]:
# Your experiments

## Second part

> __In this part we will try to spot `dead` neurons in our network__

### Provided code

> __The `fit` function trains our model on the provided `inputs` and `targets` for the specified number of steps (`epochs`)__

> `lr` specifies how fast we will update our neural network (more about that in the next section)

The general procedure is the following:
- `fit` your model of choice and specify `lr`
- Run `dead.check(model, inputs)` to see how many of the neurons are dead for this data

> __Do you know how `DeadNeurons` object works?__
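
As a hint, `DeadNeurons` relies on PyTorch forward hooks: a callable registered with `register_forward_hook` is invoked as `hook(module, inputs, outputs)` after each forward pass of that module. A minimal sketch, where `save_output` and `captured` are names made up for illustration:

```python
import torch

captured = {}


def save_output(module, inputs, outputs):
    # Called automatically right after `module` finishes its forward pass
    captured["shape"] = tuple(outputs.shape)


relu = torch.nn.ReLU()
model = torch.nn.Sequential(torch.nn.Linear(4, 3), relu)

handle = relu.register_forward_hook(save_output)
model(torch.rand(8, 4))
handle.remove()  # later forward passes no longer trigger the hook

model(torch.rand(2, 4))
print(captured["shape"])  # (8, 3)
```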

### Your task

After your analysis you should answer the following questions:
- How does `lr` affect dead neurons?
- How does the size of the neural network (depth, width) affect the number of dead neurons?
- Does it make a difference whether the network is deeper or shallower in terms of dead neurons?
- How does data size affect percentage of dead neurons?

In [None]:
def fit(model, inputs, targets, epochs, lr):

    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        outputs = model(inputs)

        loss = torch.nn.functional.cross_entropy(outputs, targets)
        loss.backward()

        optimizer.step()
        optimizer.zero_grad()
        
class DeadNeurons:
    def __init__(self):
        self._counter = 0

    def __call__(self, module, inputs, outputs):
        # Sum activations over the batch dimension, one value per neuron
        neuron_activations = torch.sum(outputs, dim=0)
        total_neurons = neuron_activations.numel()
        # Convert the count to a plain number so the percentage prints cleanly
        zeros = total_neurons - torch.count_nonzero(neuron_activations).item()
        percentage = (zeros / total_neurons) * 100
        print(f"Layer: {self._counter} | Name: {module} | Dead: {percentage:.1f} %")
        self._counter += 1

    def check(self, module, *args, **kwargs):
        hooks = [
            submodule.register_forward_hook(self)
            for submodule in module.children()
            if not isinstance(submodule, torch.nn.Linear)
        ]

        module(*args, **kwargs)

        for hook in hooks:
            hook.remove()

In [None]:
dead = DeadNeurons()

In [None]:
# Your experiments

## Summary

- A linear layer maps `in_features` inputs to `out_features` outputs; followed by softmax, it works like multiclass logistic regression.
- Perceptron has __no hidden layers__.
- __Multilayer Perceptron (MLP)__ is a standard neural network with multiple layers.
- We need to use multiple layers interspersed with activations in order to achieve non-linear behaviour
- The main activation function is currently `ReLU` or `LeakyReLU`
- The main problem with `sigmoid` and `tanh` is saturation and dying gradient (though they are still used in some neural network blocks, e.g. recurrent ones)
- `ReLU` may suffer from the dying neurons phenomenon, which may impact the neural network
- Though it is not the most probable cause of poor network performance
- Remember to use wide layers (say `50`, `100` neurons), depending on the task (if the model does not learn, it might need more parameters)
- Sufficiently wide and/or deep neural networks can approximate any function.

# Challenges

## Assessment

- How does [Parametric ReLU](https://pytorch.org/docs/stable/generated/torch.nn.PReLU.html) (a.k.a. `PReLU`) work?
- What are the pros and cons of using this activation?
- What is neural network pruning? See the [PyTorch tutorial](https://pytorch.org/tutorials/intermediate/pruning_tutorial.html) about it

## Non-assessment
- Play around with [Tensorflow Neural Network playground](http://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.97988&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false)
- How does the Maxout activation function work? What are the upsides and downsides of using it?
- How does the SELU activation function work? What are the upsides and downsides of using it?
- Can you somehow show the impact of a non-zero-centered activation?