In [1]:
import torch

Deep learning deals with **tensors**. They are just a fancier way to call multi-dimensional array. Matrices and vectors are also tensors (rank-2 tensor and rank-1 tensor respectively)

To use **autograd** (automatic gradient) feature in PyTorch, we must perform operations on `torch.Tensor`

$$x = 5$$

$$y = x^2$$

$$\frac{dy}{dx} = 2x = 10$$

In [2]:
x = torch.tensor(5., requires_grad=True)        # create a tensor. set requires_grad=True to enable differentiation wrt to x
x

tensor(5., requires_grad=True)

In [3]:
y = x**2            # calculate y from x
y

tensor(25., grad_fn=<PowBackward0>)

In [4]:
y.backward()        # this will calculate dy/dx

In [5]:
x.grad              # the result is stored in x.grad i.e. 2 x 5 = 10

tensor(10.)

This becomes more powerful when we deal with a lot of numbers e.g. vectors and matrices

$$\vec{x} = [1 \space 2 \space 3]$$

$$y = \sum_{i=1}^{3}x_i^2 = (x_1^2 + x_2^2 + x_3^2) = 14$$

Gradient of $y$ with respect to $x$ (this is a vector)

$$\nabla y = \left[\frac{\partial y}{\partial x_1}\space \frac{\partial y}{\partial x_2} \space \frac{\partial y}{\partial x_3} \right] = [2 x_1 \space 2 x_2 \space 2 x_3] = [2 \space 4 \space 6]$$

Note: `.backward()` can only be called on a scalar (such as output of loss function)

In [6]:
x = torch.tensor([1,2,3], dtype=torch.float32, requires_grad=True)
y = (x**2).sum()
y

tensor(14., grad_fn=<SumBackward0>)

In [7]:
y.backward()
x.grad

tensor([2., 4., 6.])

Linear regression

Input $\vec{x}$

$$\vec{x} = [1 \space 2 \space 4]$$

Linear projection (matrix multiplication, $W$ is a matrix)

$$\vec{y} = W\vec{x}$$

Loss

$$L = MSE(\vec{y}, \vec{y_{true}}) = \frac{(y_1 - y_{true,1})^2 + \dots + (y_n - y_{true,n})^2}{n}$$


In [8]:
x = torch.tensor([1,2,4], dtype=torch.float32)          # don't set requires_grad=True here because this is input data  
W = torch.randn((4,3), requires_grad=True)              # weight, sampled from normal distribution N(0,1)
W

tensor([[-0.6930,  0.7489,  0.3976],
        [ 0.5207,  1.3825,  0.0206],
        [-0.5965,  1.6456, -0.9929],
        [-0.2841, -1.7786,  0.0020]], requires_grad=True)

In [9]:
y = torch.matmul(W, x)
y

tensor([ 2.3953,  3.3681, -1.2770, -3.8335], grad_fn=<MvBackward>)

In [10]:
y_true = torch.tensor([1,2,3,4])
y_true

tensor([1, 2, 3, 4])

In [11]:
loss = ((y - y_true)**2).mean()
loss

tensor(20.8689, grad_fn=<MeanBackward0>)

In [12]:
loss.backward()

In [13]:
W.grad

tensor([[  0.6976,   1.3953,   2.7905],
        [  0.6840,   1.3681,   2.7362],
        [ -2.1385,  -4.2770,  -8.5540],
        [ -3.9168,  -7.8335, -15.6671]])

Gradient descent ($\alpha$ is the learning rate)

$$W := W - \nabla L \cdot \alpha$$

In [14]:
W = W.detach() - W.grad * 0.03
W

tensor([[-0.7139,  0.7070,  0.3139],
        [ 0.5002,  1.3415, -0.0615],
        [-0.5324,  1.7739, -0.7363],
        [-0.1666, -1.5436,  0.4720]])

Loss is now smaller

In [15]:
y = torch.matmul(W, x)
loss = ((y - y_true)**2).mean()
loss

tensor(9.7922)

2-layer Neural network

Input $\vec{x}$

$$\vec{x} = [1 \space 2 \space 4]$$

Layer 1:

- Linear projection (matrix multiplication, $W_1$ is a matrix)

$$\vec{y_1} = W_1\vec{x}$$

- Non-linear activation function (ReLU)

$$\vec{a_1} = ReLU(\vec{y_1}) = ReLU(W_1\vec{x})$$ 

Layer 2: (output layer, don't apply activation)

- Linear projection

$$\vec{y} = \vec{y_2} = W_2\vec{a_1}$$

Loss

$$L = MSE(\vec{y}, \vec{y_{true}}) = \frac{(y_1 - y_{true,1})^2 + \dots + (y_n - y_{true,n})^2}{n}$$


In [16]:
x = torch.tensor([1,2,4], dtype=torch.float32)

# layer 1
W1 = torch.randn((4,3), requires_grad=True)     # initialize weight
y1 = torch.matmul(W1, x)                        # linear projection
a1 = torch.relu(y1)                             # apply relu

print("y1:", y1)
print("a1:", a1)

# layer 2
W2 = torch.rand((4,4), requires_grad=True)      # initialize weight
y2 = torch.matmul(W2, a1)                       # linear projection

print("y2:", y2)

y1: tensor([ 1.4714,  2.9440,  8.2453, -1.2481], grad_fn=<MvBackward>)
a1: tensor([1.4714, 2.9440, 8.2453, 0.0000], grad_fn=<ReluBackward0>)
y2: tensor([ 5.7500,  6.1875, 11.3977,  6.4486], grad_fn=<MvBackward>)


In [17]:
loss = ((y2 - y_true)**2).mean()
loss.backward()
loss

tensor(29.1538, grad_fn=<MeanBackward0>)

Gradient of Loss with respect to layer 1's weights

In [18]:
W1.grad

tensor([[ 7.9427, 15.8855, 31.7709],
        [ 6.1575, 12.3151, 24.6302],
        [ 6.3732, 12.7465, 25.4929],
        [ 0.0000,  0.0000,  0.0000]])

Gradient of Loss with respect to layer 2's weights

In [19]:
W2.grad

tensor([[ 3.4945,  6.9921, 19.5826,  0.0000],
        [ 3.0807,  6.1641, 17.2638,  0.0000],
        [ 6.1781, 12.3615, 34.6208,  0.0000],
        [ 1.8014,  3.6043, 10.0947,  0.0000]])

Gradient descent and see new loss

In [20]:
W1 = W1 - W1.grad * 0.03
W2 = W2 - W2.grad * 0.03

y1 = torch.matmul(W1, x)
a1 = torch.relu(y1)
W2 = torch.rand((4,4), requires_grad=True)
y2 = torch.matmul(W2, a1)

loss = ((y2 - y_true)**2).mean()
loss

tensor(4.7818, grad_fn=<MeanBackward0>)