# Appreciating Activation Functions

Here, we show that leaving activation functions out of a deep network causes __linear collapse__: a network with $L$ layers becomes equivalent to a single linear layer, so it cannot learn the non-linear patterns in your data, no matter how many layers it has.

In [None]:
import torch

Suppose we have a 3-layer network with no biases. This means we have 3 weight matrices, $W^{[1]}$, $W^{[2]}$, $W^{[3]}$, applied in turn to an input $x$.

In [None]:
W1 = torch.tensor([
    [1.4, 0.6],
    [0.8, 0.6]
])

W2 = torch.tensor([
    [2.1, -0.5],
    [0.7, 1.9]
])

W3 = torch.tensor([
    [1.2, -2.2],
    [1.2, 1.3]
])

x = torch.tensor([
    [-1.2],
    [1.0]
])

In [None]:
# WITHOUT SiLU: every layer is only a matrix product

z1 = W1.T @ x    # layer 1
z2 = W2.T @ z1   # layer 2
out = W3.T @ z2  # layer 3 (output)

print(out)

tensor([[-2.0640],
        [ 4.5260]])


If we look specifically at the weights, they simply form a chain of matrix products. Because matrix multiplication is associative, we can multiply them together into a single $2 \times 2$ matrix and use that as the only weight matrix from the start.

> This is the same situation as having a single Linear/Dense/Fully-connected layer.
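
Written out, the collapse is just a regrouping of the same products. The derivation below follows the convention used in the code cells (each layer applies the transpose of its weight matrix), and $z^{[1]}$, $z^{[2]}$ stand for the intermediate values `z1` and `z2`:

$$
\begin{aligned}
z^{[1]} &= W^{[1]\top} x \\
z^{[2]} &= W^{[2]\top} z^{[1]} = W^{[2]\top} W^{[1]\top} x \\
\text{out} &= W^{[3]\top} z^{[2]} = \left( W^{[3]\top} W^{[2]\top} W^{[1]\top} \right) x
\end{aligned}
$$

The bracketed product is itself a single $2 \times 2$ matrix, so the three layers act on $x$ exactly as one layer would.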

In [None]:
"""
The original 3-layer network computes exactly the same function as
a 1-layer network with `W_init`.

This generalises to any DEEP network with `n` layers and no activation
functions: we can collapse it to a single weight matrix by matrix multiplication.
"""

W_init = W3.T @ W2.T @ W1.T
print(W_init)
print()

out2 = W_init @ x
print(out2)  # same result as the 3-layer network!

tensor([[ 4.5600,  3.4080],
        [-6.8200, -3.6580]])

tensor([[-2.0640],
        [ 4.5260]])
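
The same collapse can be checked for a deeper network. The sketch below is only an illustration: the depth `n_layers`, the width `dim`, the seed, and the random weights are all assumptions chosen for the demo, not values from the example above. It runs a forward pass layer by layer, then multiplies all the weights into one matrix and compares the two results.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only so the demo is repeatable

n_layers = 5  # hypothetical depth
dim = 4       # hypothetical layer width

# float64 keeps rounding error small when many matrices are chained
weights = [torch.randn(dim, dim, dtype=torch.float64) for _ in range(n_layers)]
x_demo = torch.randn(dim, 1, dtype=torch.float64)

# Layer-by-layer forward pass with no activation function
h = x_demo
for W in weights:
    h = W.T @ h

# Collapse every layer into one matrix: W_n.T @ ... @ W_2.T @ W_1.T
W_collapsed = torch.eye(dim, dtype=torch.float64)
for W in weights:
    W_collapsed = W.T @ W_collapsed

out_collapsed = W_collapsed @ x_demo

# The deep network and the single collapsed layer should agree
print(torch.allclose(h, out_collapsed))
```

The two results agree up to floating-point rounding, which is why the comparison uses `torch.allclose` rather than exact equality.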


In [None]:
# WITH SiLU: a non-linearity after each hidden layer

def silu(z):
    return z * torch.sigmoid(z)

z1 = silu(W1.T @ x)
z2 = silu(W2.T @ z1)
out = W3.T @ z2  # no activation on the output layer

print(out)

tensor([[-0.2369],
        [ 0.4730]])
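
The output now differs from the collapsed result because SiLU sits between the matrix products, so they can no longer be regrouped into one matrix. One way to see this, sketched below, is to test a property every linear map must have: doubling the input doubles the output. The helper names `linear_net` and `silu_net` are introduced here only for illustration; the weights and input are the ones defined earlier.

```python
import torch

W1 = torch.tensor([[1.4, 0.6], [0.8, 0.6]])
W2 = torch.tensor([[2.1, -0.5], [0.7, 1.9]])
W3 = torch.tensor([[1.2, -2.2], [1.2, 1.3]])
x = torch.tensor([[-1.2], [1.0]])

def silu(z):
    return z * torch.sigmoid(z)

def linear_net(v):
    # 3 layers, no activation functions
    return W3.T @ (W2.T @ (W1.T @ v))

def silu_net(v):
    # Same weights, with SiLU after the two hidden layers
    return W3.T @ silu(W2.T @ silu(W1.T @ v))

# A linear map f must satisfy f(2v) == 2 * f(v)
linear_ok = torch.allclose(linear_net(2 * x), 2 * linear_net(x))
silu_ok = torch.allclose(silu_net(2 * x), 2 * silu_net(x))

print(linear_ok)  # True: the plain network is a linear map
print(silu_ok)    # False: the SiLU network is not
```

Since the SiLU network fails this test, no single weight matrix can reproduce it for all inputs, which is exactly what prevents the collapse.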
