**<font color="red">IMPORTANT</font>**: The notebook follows certain conventions when it comes to inserting your input in it:

* Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE". Other cells that are not supposed to be edited are set read-only to prevent you from accidentally editing them. 
* Feel free to overwrite `NotImplemented` and `raise NotImplementedError()` when providing your solution, otherwise leave it as it is.
* The comments starting with `#!` provide hints that need to be followed. For example, `#! A = NotImplemented` is a directive to define a variable `A`.

---

# 🔥 Torch Intro 🔥

## What is PyTorch?
PyTorch (or just torch) is a Python deep learning library that is widely adopted in both research and industry. Like most deep learning libraries, it supports GPU acceleration; its distinguishing features are dynamic computational graphs and deep Python integration, which make experimentation and code inspection easy. It provides high-level abstractions but also allows low-level access to its primitives. You can read more about PyTorch at the official [site](https://pytorch.org/).

## Installing PyTorch on your machine
Install PyTorch by following the guidelines [here](https://pytorch.org/get-started/locally/).

<div class="alert alert-block alert-info">
<b>Note:</b> If you have an NVIDIA GPU, installing a CUDA build of PyTorch will let you use it, significantly speeding up computation.
</div>

# This Tutorial
This tutorial will take you through PyTorch's main functionalities. It only aims to give you some insight into how to use PyTorch and is by no means a full tutorial on neural networks. Prior knowledge of neural networks and their inner workings (e.g. linear algebra, gradient-based optimization, back-propagation) will certainly prove useful. You are assumed to be largely fluent in Python.

# Table of Contents
1. [Tensors](#1)
    1. [Tensor Types](#1a)
    2. [Instantiating Tensors](#1b)
    3. [Basic Tensor Operations](#1c)
    4. [Exercises](#1d)
2. [Automatic Differentiation](#2)
    1. [Autograd](#2a)
    2. [Exercises](#2b)
3. [Neural Networks](#3)
    1. [Custom Neural Networks](#3a)
    2. [Loss Functions](#3b)
    3. [Optimizers](#3c)
    4. [Exercises](#3d)
4. [Putting Everything Together](#4)


For a more in-depth overview of PyTorch's capabilities, refer to the [official documentation](https://pytorch.org/docs/stable/index.html) (this link will prove handy for your assignments -- keep it close and use it often).

---

#### Getting started
Let's verify your torch installation is working by trying to import it and printing its version.

In [2]:
import torch
print(f"torch version = {torch.__version__}")

torch version = 1.10.1+cu113


<a id='1'></a>
## 1. Tensors

A [Tensor](https://pytorch.org/docs/stable/tensors.html) is the building block of any PyTorch program; it is the abstraction that stores n-dimensional arrays of numbers (i.e. tensors) and provides various functionalities for processing them.

<a id='1a'></a>
### A. Tensor Types

We will consider 16 types of Tensors, distinguished by their `dtype` (the kind of numbers stored within them) and their `device` (where they live: either GPU or CPU).

The different Tensor types and their corresponding classes are shown below:

| Description | dtype | CPU Tensor Class | GPU Tensor Class |
| --- | --- | --- | --- |
| Full precision float | `torch.float32` | `torch.FloatTensor` | `torch.cuda.FloatTensor`| 
| Half precision float | `torch.float16` | `torch.HalfTensor` | `torch.cuda.HalfTensor` |
| Double precision float | `torch.float64` | `torch.DoubleTensor` | `torch.cuda.DoubleTensor` |
| 8-bit unsigned integer | `torch.uint8` | `torch.ByteTensor` | `torch.cuda.ByteTensor` |
| 8-bit signed integer | `torch.int8` | `torch.CharTensor` | `torch.cuda.CharTensor` |
| 16-bit signed integer | `torch.int16` | `torch.ShortTensor` | `torch.cuda.ShortTensor` |
| 32-bit signed integer | `torch.int32` | `torch.IntTensor` | `torch.cuda.IntTensor` |
| 64-bit signed integer | `torch.int64` | `torch.LongTensor` | `torch.cuda.LongTensor` |

<div class="alert alert-block alert-warning">
<b>Warning:</b>
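Operations between Tensors on different devices are not permitted, and mixing dtypes leads either to implicit type promotion or, for some operations such as matrix multiplication, to an error (so make sure you are consistent).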
</div>
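As a quick check of how mixed dtypes behave, the sketch below adds a long tensor to a float tensor and then attempts a matrix multiplication across dtypes. The variable names are illustrative, and the exact behaviour may differ between PyTorch versions: in recent releases elementwise arithmetic promotes to a common dtype, while `torch.mm` is expected to raise an error.

```python
import torch

ints = torch.arange(3)    # dtype: torch.int64
floats = torch.ones(3)    # dtype: torch.float32

# Elementwise arithmetic promotes both operands to a common dtype.
mixed = ints + floats
print(mixed.dtype)

try:
    # Matrix multiplication expects both operands to share a dtype.
    torch.mm(torch.ones(2, 2), torch.ones(2, 2, dtype=torch.long))
except RuntimeError as err:
    print("mm failed:", err)
```

When in doubt, converting explicitly with `.to(dtype)` before an operation avoids surprises.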

We are mostly interested in full precision floats and long integers (on either device), so we can forget about the rest of them for now.

For the sake of convenience, we will now specify the device used by the rest of the tutorial. If you have the cuda version installed but would rather not use it, change the snippet below.

In [3]:
if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

print(f"Using {device}")

Using cuda


<a id='1b'></a>
### B. Instantiating Tensors
Tensors can be instantiated in a number of ways. When we want to construct a tensor of fixed dimensionality (shape) without caring about its initial values, we may simply call the appropriate class constructor with the desired shape as the argument. Note that the resulting values are uninitialized (whatever happened to be in memory), not random samples.

In [None]:
my_first_long_tensor = torch.LongTensor(5)  # a vector of 5 longs
print(my_first_long_tensor.shape)
print(my_first_long_tensor.dtype)
print(my_first_long_tensor)

In [None]:
my_first_float_tensor = torch.FloatTensor(5, 5)  # a 5 by 5 matrix of floats
print(my_first_float_tensor.shape)
print(my_first_float_tensor.dtype)
print(my_first_float_tensor)

There are some useful shorthands for constructing tensors with commonly used values. Let's use some of them. For more details, look up the corresponding keywords in the PyTorch documentation.

In [None]:
a = torch.zeros((2, 3, 4))  # a 2 by 3 by 4 tensor of zeros
b = torch.ones(42)  # a vector of 42 ones
c = torch.eye(3)  # a 3 by 3 identity matrix
d = torch.rand((32, 10, 300))  # a 32 by 10 by 300 tensor of random numbers from a uniform distribution on the interval [0,1)
e = torch.randint(low=0, high=10, size=(3, 3))  # a 3 by 3 matrix of random integers between 0 (incl.) and 10 (excl.)
f = torch.arange(10)  # a vector containing the numbers 0 to 9 in ascending order

We can explicitly set the `dtype` and `device` arguments to specify the tensor's type and device (most often these default to `torch.float` and `cpu`).

In [None]:
a_long = torch.zeros((2, 3, 4), dtype=torch.long, device=device)
f_float = torch.arange(10, dtype=torch.float, device=device)

We can always query a tensor's contents, shape, dtype and device.

In [None]:
for te in [a, a_long, b, c, d, e, f, f_float]:
    print(te.shape)
    print(te.dtype)
    print(te.device)
    # print(te)
    print()

We can create a tensor with the same size as an already existing one.

In [None]:
d = torch.rand((32, 10, 300))

like_d_rand = torch.rand_like(d) # a 32 by 10 by 300 tensor of random values
print(like_d_rand.shape)

like_d_ones = torch.ones_like(d) # a 32 by 10 by 300 tensor of ones 
print(like_d_ones.shape)

like_d_zeros = torch.zeros_like(d) # a 32 by 10 by 300 tensor of zeros
print(like_d_zeros.shape)

We can also specify the values of a tensor by passing a list (or a list of lists) of values during its construction.

In [None]:
a = torch.tensor([[1,2,3], [4,5,6], [7,8,9]], device=device)
# FYI, the following does the same for the continuous sequence of numbers
# a = torch.arange(1, 10, device=device).reshape(3, 3)
print(a)
print(a.shape)
print(a.dtype)

Notice that PyTorch automatically assumed that the tensor we specified should be of type long (because we only provided integers as the tensor's contents). We could of course avoid this by manually specifying the dtype. Alternatively, we can alter the dtype and/or device post-construction.

In [None]:
a = a.to(torch.float)
print(a.dtype)
a = a.to("cpu")  # or alternatively, a = a.cpu()
print(a.device)

Finally, a torch tensor can also be directly constructed from (or converted to) a numpy array. Converting to a numpy array only works for CPU tensors. Note that when printing torch tensors, the default precision of floats is 4 digits, while for numpy it is 8. You can change this default behaviour of torch with [set_printoptions](https://pytorch.org/docs/master/generated/torch.set_printoptions.html#torch.set_printoptions).

In [None]:
import numpy as np
a_np = np.random.random((2, 2))
a_torch = torch.tensor(a_np, device=device)
print(a_np)
print()
print(a_torch)
print()
a_np_2 = a_torch.cpu().numpy()
print(a_np_2)
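One detail worth knowing when moving between numpy and torch is whether the data is copied. The sketch below (with illustrative variable names) contrasts `torch.tensor`, which copies, with `torch.from_numpy`, which shares memory with the source array.

```python
import numpy as np
import torch

source = np.zeros(3)
shared = torch.from_numpy(source)  # shares memory with the numpy array
copied = torch.tensor(source)      # makes an independent copy

source[0] = 1.0                    # modify the numpy array in place
print(shared)                      # reflects the change
print(copied)                      # still all zeros
```

Sharing avoids a copy for large arrays, at the cost of in-place edits being visible on both sides.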

In [None]:
# deleting above-defined variables
del (
    my_first_long_tensor, my_first_float_tensor, a, b, c, d, e, f, te, a_long, 
    f_float, like_d_rand, like_d_ones, like_d_zeros, a_np, a_torch, a_np_2
)

<a id='1c'></a>
### C. Basic Tensor Operations
Tensors and their contents are not hidden by the framework -- they are immediately accessible to us and we can interact with them in many ways, while being able to inspect the results of our actions. Let's walk through some of the most common use cases.

#### Indexing and Slicing
Standard python indexing and slicing applies to torch tensors. Let's remember how that works -- first we will need a random matrix to experiment with.

In [4]:
a = torch.rand((5, 3), device=device)  
print(a)

tensor([[0.4293, 0.1903, 0.8374],
        [0.0252, 0.1757, 0.8026],
        [0.5624, 0.3368, 0.0963],
        [0.0417, 0.8957, 0.6450],
        [0.2909, 0.0028, 0.9862]], device='cuda:0')


<div class="alert alert-block alert-warning">
<b>Remember!</b>
Indexing starts from zero
</div>

Now let's try retrieving the 3rd item of the 1st row.

In [6]:
b = a[0][2]
print(b)

tensor(0.8374, device='cuda:0')


Note that even though it's just one element, it's still a tensor (with 0 dimensions) and not a float value. In order to get the value itself, we can call `item()`.

In [7]:
print(type(b) is float)
print(type(b.item()) is float)
b.item()

False
True


0.8374311327934265

What if we wanted the first three rows of the matrix instead?

In [None]:
c = a[:3]
print(c)

Or its third column?

In [None]:
c = a[:, 2]
print(c)

Or every second element of the first column, starting from the second?

In [None]:
c = a[1::2, 0]
print(c)

But now in reverse, starting from the last?!

In [None]:
c = a[-1::-2, 0]
print(c)

Well, perhaps not... see [the issue](https://github.com/pytorch/pytorch/issues/604) about negative stepping in tensor slicing. One can achieve the reverse order with [flip](https://pytorch.org/docs/stable/generated/torch.flip.html#torch.flip).
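A minimal sketch of the `flip` workaround, using a small matrix with known values (the names `m`, `first_col` and `rev` are illustrative): reverse the column first, then apply an ordinary positive step.

```python
import torch

m = torch.arange(15.0).reshape(5, 3)  # the first column holds 0, 3, 6, 9, 12
first_col = m[:, 0]

# Reverse the column, then take every second element of the reversed result.
rev = torch.flip(first_col, dims=(0,))[::2]
print(rev)
```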

We can also find the index of the maximum element of a tensor with `argmax`, or the top K elements (values and indices) with `topk`.

In [8]:
print(a.argmax(dim=1)) # returns the indices of the maximum values across the specified dimension
print()
print(a.topk(dim=1, k=2)) # returns both the indices and the values of the top K elements

tensor([2, 2, 0, 1, 2], device='cuda:0')

torch.return_types.topk(
values=tensor([[0.8374, 0.4293],
        [0.8026, 0.1757],
        [0.5624, 0.3368],
        [0.8957, 0.6450],
        [0.9862, 0.2909]], device='cuda:0'),
indices=tensor([[2, 0],
        [2, 1],
        [0, 1],
        [1, 2],
        [2, 0]], device='cuda:0'))


#### Value Assignment
We can use the exact same scheme to assign values to tensors.

In [9]:
# Set the top left item of the matrix to zero.
a[0,0] = 0
print(a)
print()
# Construct another random matrix of the same shape.
b = torch.rand_like(a, device=device)
print(b)
print()
# Set the second row of matrix a to be the third row of matrix b.
a[1] = b[2]
print(a)

tensor([[0.0000, 0.1903, 0.8374],
        [0.0252, 0.1757, 0.8026],
        [0.5624, 0.3368, 0.0963],
        [0.0417, 0.8957, 0.6450],
        [0.2909, 0.0028, 0.9862]], device='cuda:0')

tensor([[0.4600, 0.2284, 0.8060],
        [0.4876, 0.7980, 0.3841],
        [0.0650, 0.8973, 0.4593],
        [0.2824, 0.9043, 0.5481],
        [0.8349, 0.9966, 0.3432]], device='cuda:0')

tensor([[0.0000, 0.1903, 0.8374],
        [0.0650, 0.8973, 0.4593],
        [0.5624, 0.3368, 0.0963],
        [0.0417, 0.8957, 0.6450],
        [0.2909, 0.0028, 0.9862]], device='cuda:0')


#### Elementwise Arithmetic
Elementwise operations (most importantly comparison, addition, subtraction, multiplication and division) can be applied on tensors of compatible shapes (i.e. shapes that can be [broadcast](https://pytorch.org/docs/stable/notes/broadcasting.html)). 

Two tensors are compatible if, comparing their shapes dimension by dimension starting from the last (trailing) one, every pair of sizes satisfies one of the conditions below:
* the two sizes are the same 
* one of the two sizes is 1
* one of the two tensors has no dimension left (i.e. its shape is shorter)

Scalars (single values) are compatible with tensors of any shape. Let's see some examples.

First, some fresh tensors.

In [None]:
a = torch.zeros((2, 3, 4), device=device)  
b = torch.ones((2, 3, 4), device=device)  
c = torch.ones((3, 4), device=device)  
d = torch.ones((2, 3, 1), device=device)
e = torch.rand((4, 3, 2), device=device)

We can add a scalar to tensor $a$ 

In [None]:
a = a + 0.3
print(a.shape)

We can subtract $b$ from $a$ (matching shapes)

In [None]:
f = a - b
print(f.shape)

We can elementwise multiply $a$ with $c$ (dimensions of $c$ are the same as the last dimensions of $a$)

In [None]:
g =  a * c
print(g.shape)

We can elementwise divide $a$ by $d$ (the last dimension of $d$ is 1, the rest of the dimensions match)

In [None]:
h = a / d
print(h.shape)

We can elementwise raise $a$ to $d$ (the last dimension of $d$ is 1, the rest of the dimensions match)

In [None]:
i = a**d
print(i.shape)

And we can compare $a$ with any of $f$, $g$, $h$

In [None]:
j = a == f
print(j.shape)

... but torch complains when we try to do that with $e$ (the shapes are incompatible)

In [None]:
e == a
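The shape rules can also be checked without allocating any tensors. The sketch below uses `torch.broadcast_shapes` (available in reasonably recent PyTorch versions) on the shapes from the examples above; the variable names are illustrative.

```python
import torch

# Trailing dimensions match: (3, 4) is broadcast against (2, 3, 4).
shape_c = torch.broadcast_shapes((2, 3, 4), (3, 4))
# A dimension of size 1 is stretched to match.
shape_d = torch.broadcast_shapes((2, 3, 4), (2, 3, 1))
print(shape_c, shape_d)

try:
    # 4 vs 2 in the last dimension: neither equal nor 1, so this fails.
    torch.broadcast_shapes((4, 3, 2), (2, 3, 4))
except (RuntimeError, ValueError) as err:
    print("incompatible:", err)
```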

If in doubt about what any of the elementwise operators actually does, try it out on some tensors of your own making.

A convenient shortcut for a particular sequence of elementwise operations is [`torch.where`](https://pytorch.org/docs/stable/generated/torch.where.html#torch.where). Given two tensors $x$ and $y$ and a boolean tensor $condition$, it builds a new tensor by selecting each element from either $x$ (where $condition$ is true) or $y$ (where it is false).

In [None]:
x = torch.rand(2, 2)
y = torch.rand(2, 2)
print(x)
print(y)
condition = torch.tensor([[True, False], [False, True]])
z = torch.where(condition, x, y)
z
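To make the "sequence of elementwise operations" concrete, the sketch below reproduces `torch.where` by hand with a mask and its negation. The tensors and names are illustrative and use fixed values so the result is easy to verify.

```python
import torch

x_demo = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
y_demo = torch.tensor([[10.0, 20.0], [30.0, 40.0]])
mask = torch.tensor([[True, False], [False, True]])

selected = torch.where(mask, x_demo, y_demo)
# Equivalent elementwise formulation: keep x where the mask is True, y elsewhere.
manual = mask * x_demo + (~mask) * y_demo

print(selected)
print(torch.equal(selected, manual))
```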

In [None]:
del a, b, c, d, e, f, g, h, i, j, x, y, z, condition

#### Linear Algebra
Tensor algebra of course goes well beyond elementwise operations -- matrix multiplication is the bread and butter of machine learning, so we better get familiar with how torch does it.

As usual, we begin by instantiating our matrices. Matrix multiplication is defined between matrices A and B of shapes [M, N] and [N, O] respectively and yields a matrix C of shape [M, O]. The torch function that implements matrix multiplication is `torch.mm`.

In [None]:
A = torch.rand([5, 3], device=device)
B = torch.rand([3, 8], device=device)
C = torch.mm(A, B)  # or alternatively, C = A @ B
print(C.shape)

What if we had several matrices (i.e. a batch of matrices), each to be multiplied with a matching B? We can use `torch.bmm` for efficient batch matrix multiplication.

<div class="alert alert-block alert-info">
<b>Tip:</b> Rather than writing slow and ugly `for` loops, employ array programming to write your machine learning code. This will make it much more concise and dramatically more efficient.
</div>

In [None]:
bA = torch.rand([128, 5, 3], device=device)
bB = torch.rand([128, 3, 8], device=device)
bC = torch.bmm(bA, bB)  # bC = bA @ bB also works here!
print(bC.shape)
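To see what `torch.bmm` computes, the sketch below compares it against an explicit loop that multiplies each pair of matrices with `torch.mm`. The names and the small batch size are illustrative choices.

```python
import torch

batch_a = torch.rand(4, 5, 3)
batch_b = torch.rand(4, 3, 8)

# Loop version: one matrix product per batch entry, stacked back together.
looped = torch.stack([torch.mm(m_a, m_b) for m_a, m_b in zip(batch_a, batch_b)])
# Batched version: a single call.
batched = torch.bmm(batch_a, batch_b)

print(batched.shape)
print(torch.allclose(looped, batched, atol=1e-6))
```

Both give the same result; the batched call simply hands the whole computation to the library at once.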

<div class="alert alert-block alert-warning">
<b>Warning:</b>
Be careful not to confuse matrix multiplication `A@B` with the <a href="https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29">Hadamard product</a> `A*B`
</div>
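A small illustration of the difference, using the identity matrix (names are illustrative): multiplying by the identity with `@` returns the matrix unchanged, while `*` keeps only its diagonal.

```python
import torch

M = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
identity = torch.eye(2)

matmul_result = M @ identity    # matrix product: M itself
hadamard_result = M * identity  # elementwise product: off-diagonal entries vanish

print(matmul_result)
print(hadamard_result)
```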

#### Shape Manipulation
As we have seen, what we can do with tensors is largely dictated by their shapes. Adjusting a tensor's shape to allow for broadcasting or batching is therefore often necessary. The following functions should suffice for the bulk of shape manipulation tasks you might encounter.

Tensor transposition is the generalization of matrix transposition. Since there are now more than 2 dimensions, we additionally need to specify the transposed dimensions. Take for instance a tensor of shape [M, N, O]. Converting it to a tensor of shape [N, M, O] requires transposing the first and second dimensions.

In [None]:
A = torch.rand([128, 5, 3], device=device)
A = A.transpose(0, 1)
print(A.shape)

A generalization of transposing is permutation of the tensor's dimensions. In calling `permute`, we must provide the new order of all the dimensions.

In [None]:
A = torch.rand([128, 5, 3], device=device)
A = A.permute(1, 0, 2)
print(A.shape)
A = A.permute(2, 0, 1)
print(A.shape)

We may also choose to create a reshaped view of a tensor; for instance we may collapse two or more tensor dimensions into one...

In [None]:
A = torch.rand([128, 5, 3], device=device)
A_collapsed = A.view(A.shape[0]*A.shape[1], A.shape[-1])
print(A_collapsed.shape)

... or expand one dimension into two or more.

In [None]:
A_expanded = A_collapsed.view(128, 5, 3)
print(A_expanded.shape)

Let's convince ourselves that the back and forth between dimensions has left our tensor unaffected. First let's elementwise compare A with A_expanded.

In [None]:
comp = A == A_expanded
print(comp)

It seems to be okay! But what if there is a False somewhere in there? Python's built-in `all` can be applied directly to torch bools to help us here (torch also provides its own [`torch.all`](https://pytorch.org/docs/stable/generated/torch.all.html?highlight=all#torch.all)).

In [None]:
all([all(row) for matrix in comp for row in matrix])

Or we can simply use [torch.equal](https://pytorch.org/docs/stable/generated/torch.equal.html) for comparing tensors.

In [None]:
torch.equal(A, A_expanded)

Views are useful, but as the name suggests they only change our view of a tensor. Different views of a tensor have the same number of elements; the view is just changing in what order these are read.
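The sketch below illustrates that a view shares storage with the original tensor, and that `view` only works when the memory layout allows it. The names are illustrative; `reshape` is shown as the fallback that copies when a view is not possible.

```python
import torch

base = torch.zeros(2, 3)
flat = base.view(6)       # same storage, different shape
flat[0] = 1.0             # writing through the view...
print(base[0, 0].item())  # ...changes the original tensor

# A transposed tensor is no longer contiguous, so `view` cannot flatten it.
transposed = base.t()
try:
    transposed.view(6)
except RuntimeError as err:
    print("view failed:", err)

# `reshape` returns a view when possible and a copy otherwise.
flat_copy = transposed.reshape(6)
print(flat_copy.shape)
```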

For cases where we would like to repeat a tensor across one or more of its axes (actually creating a larger tensor), we can use the `repeat` function. Let's consider a tensor of shape [M, N] which we would like to turn into a tensor that repeats itself K times across the first dimension (i.e. a tensor of shape [K $\cdot$ M, N]).

In [10]:
A = torch.rand([5, 12], device=device)
A_repeat = A.repeat(3, 1)  # note that we are specifying the number of repeats per dimension
print(A_repeat.shape)

torch.Size([15, 12])


If you need to repeat a singleton dimension (i.e. a dimension of size 1), you can do that with the `expand` function which can be more efficient for large tensors since it does not allocate new memory.

In [11]:
A = torch.rand([1, 12], device=device)
A_exp = A.expand(15, 12)  # note that we are specifying the expected dimensions of the new tensor
print(A_exp.shape)
#
A = torch.rand([1, 12], device=device)
A_exp = A.expand(15, -1)  # -1 means not changing the size of that dimension
print(A_exp.shape)

torch.Size([15, 12])
torch.Size([15, 12])
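The memory claim can be inspected directly. In the sketch below (illustrative names, CPU tensors), the expanded tensor points at the same memory as the original and uses a stride of 0 along the expanded dimension, whereas `repeat` allocates a new tensor.

```python
import torch

row = torch.rand(1, 12)
expanded = row.expand(15, 12)  # no new memory: the single row is re-read 15 times
repeated = row.repeat(15, 1)   # new memory holding 15 copies of the row

print(expanded.stride())       # stride 0 along the first dimension
print(expanded.data_ptr() == row.data_ptr())
print(repeated.data_ptr() == row.data_ptr())
```

Because an expanded tensor aliases the same memory, it is best treated as read-only; call `.clone()` first if it needs to be modified in place.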


Another convenient pair of functions is `squeeze`-`unsqueeze`. 

`squeeze` removes all the dimensions of size 1, or a specific dimension of size 1 if it's specified in the `dim` parameter.

In [12]:
A = torch.rand([1, 12, 1], device=device)
print(A.shape)

A_squeezed = A.squeeze()
print(A_squeezed.shape)

A_squeezed_first = A.squeeze(dim=0)
print(A_squeezed_first.shape)

torch.Size([1, 12, 1])
torch.Size([12])
torch.Size([12, 1])


Conversely, `unsqueeze` inserts a new dimension of size 1 into a specified position.

In [None]:
A = torch.rand([2, 12], device=device)
print(A.shape)

A_unsqueezed = A.unsqueeze(1)
print(A_unsqueezed.shape)
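A common use of `unsqueeze` is to line up dimensions for broadcasting. As an illustrative sketch (names are not from the tutorial), the outer product of two vectors can be obtained by turning one into a column and the other into a row.

```python
import torch

u = torch.arange(3.0)  # shape [3]
v = torch.arange(4.0)  # shape [4]

# [3, 1] * [1, 4] broadcasts to [3, 4].
outer = u.unsqueeze(1) * v.unsqueeze(0)
print(outer.shape)
print(torch.equal(outer, torch.outer(u, v)))
```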

#### Combining Tensors
Sometimes we may want to construct a big tensor out of two small ones. There are a few ways to accomplish that, but the most reliable one is through `torch.cat` 🐈 (shorthand for concatenate).

Two tensors may be concatenated if they agree on all their dimensions, except for the concatenation dimension.

In [None]:
A = torch.rand([4, 2], device=device)
B = torch.rand([1, 2], device=device)
C = torch.cat((A, B), dim=0)
print(C.shape)
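One of the other ways to combine tensors is `torch.stack`. The sketch below (illustrative names) contrasts the two: `cat` joins along an existing dimension, while `stack` creates a new one and therefore requires all shapes to match exactly.

```python
import torch

left = torch.rand(4, 2)
right = torch.rand(4, 2)

joined = torch.cat((left, right), dim=0)     # existing dimension grows: [8, 2]
stacked = torch.stack((left, right), dim=0)  # new leading dimension: [2, 4, 2]

print(joined.shape)
print(stacked.shape)
```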

In [None]:
del (
    A, B, C, comp, A_repeat, A_exp, A_expanded, A_collapsed, A_squeezed, 
    A_squeezed_first, A_unsqueezed, bA, bB, bC
)

---

<a id='1d'></a>
### Exercises
It might be a good idea to take a short break here and recap on what we've seen before moving further. The mini-exercises below should help you test your grasp of this section.

Construct a tensor $A$ of shape [10, 10] containing random floats, and a tensor $B$ of the same shape where all its elements are equal to $\pi$.

In [None]:
#! A = NotImplemented
#! B = NotImplemented
# YOUR CODE HERE
raise NotImplementedError()

Compute $C = AB^T$, the matrix multiplication of $A$ with the transpose of $B$ and $D = A\cdot B$, their elementwise multiplication.

In [None]:
#! C = NotImplemented
#! D = NotImplemented
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print(C.shape)
print(C)
print(D.shape)
print(D)

Try comparing $C$ with $D$. Are they comparable? Are they equal? What is the dtype of their comparison?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Multiply $A$ by 4 to create $F$. Now set all elements of $F$ that are above $\pi$ to zero.


_Hint 1_: You can index a tensor with a boolean tensor of the same dimensionality

_Hint 2_: You can set multiple indexed elements to a single value at once

In [None]:
#! F = NotImplemented
#! F[NotImplemented] = 0
# YOUR CODE HERE
raise NotImplementedError()

The incomplete function below implements matrix multiplication with a for loop. Complete the function and call it with your $A$ and $B^T$ matrices as its arguments. The result should be the same as the matrix $C$ you computed before (with room for some numerical inaccuracy).

Note: You can use `torch.sum()` to compute the sum of a tensor (optionally specifying across which dimension)

<div class="alert alert-block alert-info">
<b>Note:</b> <a href="https://www.python.org/dev/peps/pep-0484/#rationale-and-goals">Type Hints</a> may be used in python function and variable declarations to give them a type signature. These type signatures are not strict (you can still bypass them), but they can help you organize your code. Type hints of incomplete functions given during assignments will inform you of what we expect your function to accept and return.
</div>

In [None]:
def my_mm(A: torch.FloatTensor, B: torch.FloatTensor) -> torch.FloatTensor:
    assert A.shape[1] == B.shape[0]
    assert (len(A.shape) == len(B.shape) == 2)
    C = torch.zeros((A.shape[0], B.shape[1]), device=device)
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            #! C[i,j] = NotImplemented
            # YOUR CODE HERE
            raise NotImplementedError()
    return C
    
E = my_mm(A, B.transpose(1, 0))

# Make sure the mean absolute difference between the two results is below 0.0001.
assert torch.sum(torch.abs(E - C))/(E.shape[0]*E.shape[1]) < 1e-4 

Create a 2-dimensional tensor $S$ that consists of an instance of $A$ followed by two instances of $B$ followed by an instance of $A$ across its first dimension. What is the shape of $S$?

In [None]:
#! S = NotImplemented
# YOUR CODE HERE
raise NotImplementedError()
print(S.shape)

Reshape $S$ into a tensor of shape [20, 2, 10]. 
Then transpose this into a tensor of shape [2, 10, 20].

In [None]:
#! S = NotImplemented
# YOUR CODE HERE
raise NotImplementedError()
print(S.shape)
#! S = NotImplemented
# YOUR CODE HERE
raise NotImplementedError()
print(S.shape)

In [None]:
del A, B, C, D, E, F, S

---

<a id='2'></a>
## 2. Automatic Differentiation and Neural Networks
We have so far seen some of torch's computational capabilities; its GPU acceleration and multitude of high-level functions make it suitable for array processing and vector arithmetic. But torch is more than a faster numpy; its key components, and where most of the magic happens, are its automatic differentiation mechanics and neural network libraries.

<a id='2a'></a>
### A. Autograd
Each torch tensor carries a flag around with it, `requires_grad`, which establishes whether that tensor requires gradient computation. 

By default, tensors do not require grad unless told to. Whenever a tensor that requires grad takes part in the construction of another tensor, the new tensor also requires grad.
By dynamically tracking dependencies in the evolving computation graph, and using this flag, torch determines which tensors need gradients and can therefore be updated by gradient descent, and which do not (naturally, only tensors for which gradients are computed will be updated).

Practically, by setting `requires_grad` to `False` we can _freeze_ (parts of) our functions, making them static.

Let's see this in action by modeling a simple linear transformation $f(x) = Ax$ from $x \in \mathbb{R}^5$ to $y \in \mathbb{R}^7$:

<div class="alert alert-block alert-info">
<b>Note:</b> Recall that a matrix of shape [M, N] is a linear map <em>from</em> $\mathbb{R}^N$ <em>to</em> $\mathbb{R}^M$.
</div>

In [16]:
A = torch.rand((7, 5), device=device)
print(A)

def f(x: torch.FloatTensor) -> torch.FloatTensor:
    return A@x 

tensor([[0.6919, 0.6459, 0.7517, 0.2551, 0.2136],
        [0.0316, 0.7695, 0.0466, 0.8732, 0.3765],
        [0.5317, 0.5456, 0.4557, 0.1532, 0.3846],
        [0.5271, 0.8347, 0.4891, 0.7046, 0.9799],
        [0.1948, 0.5878, 0.1504, 0.8382, 0.6391],
        [0.2774, 0.2241, 0.1369, 0.8511, 0.2390],
        [0.0507, 0.1079, 0.8044, 0.2574, 0.1701]], device='cuda:0')


Let's test our function for some $x$, and check whether the result requires grad.

In [18]:
x = torch.rand(5, device=device)

y = f(x)
print(y)
print(y.requires_grad)

tensor([1.7345, 1.6712, 1.4721, 2.6951, 1.9270, 1.3936, 0.8450],
       device='cuda:0')
False


It does not; what if the parameters of our function $f$ were trainable though?

In [19]:
A.requires_grad = True
y = f(x)
print(y.requires_grad)

True


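The output of our linear transformation now also requires grad: it is tracked in the computation graph, so gradients can flow back through it to the trainable $A$.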
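Requiring grad only marks a tensor for tracking; the gradients themselves are computed when `backward()` is called on a scalar result. The sketch below uses illustrative names and a plain sum in place of a real loss to show where the gradient ends up.

```python
import torch

A_demo = torch.rand(7, 5, requires_grad=True)
x_demo = torch.rand(5)  # plain data: does not require grad

# Reduce the output to a scalar so that backward() needs no arguments.
total = (A_demo @ x_demo).sum()
total.backward()

print(A_demo.grad.shape)
# d(sum_i sum_j A_ij x_j) / dA_ij = x_j, so every row of the gradient equals x.
print(torch.allclose(A_demo.grad, x_demo.expand(7, 5)))
print(x_demo.grad)  # no gradient is stored for tensors that do not require grad
```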


---

<a id='2b'></a>
### Exercises
Answer the next questions before you proceed.

Model an affine transformation $g(x) = Ax + \beta$ from $x \in \mathbb{R}^3$ to $y \in \mathbb{R}^{12}$ as the composition of two functions, $f_1(x) = Ax$, $f_2(x) = x + \beta$, such that $A$ requires grad but $\beta$ does not.

<div class="alert alert-block alert-info">
<b>Note:</b> You can use `requires_grad: bool` as an optional argument during tensor construction. Can you guess its default value?
</div>

In [None]:
#! A = NotImplemented
#! beta = NotImplemented
# YOUR CODE HERE
raise NotImplementedError()

def f_1(x: torch.FloatTensor) -> torch.FloatTensor:
    # YOUR CODE HERE
    raise NotImplementedError()
    return NotImplemented

def f_2(x: torch.FloatTensor) -> torch.FloatTensor:
    # YOUR CODE HERE
    raise NotImplementedError()
    return NotImplemented

x = torch.rand(3, device=device)

w = f_1(x)
y = f_2(w)

Try to figure out the answers on your own before verifying them with code.

Let's assume $x$ is a fixed data sample, therefore does not require grad (we don't usually want to fit our data, but the function applied on the data!)

* If $A$ requires grad but $\beta$ doesn't, does $w$ require grad? Does $y$?

* If $\beta$ requires grad but $A$ doesn't, does $w$ require grad? Does $y$?

What would that mean for $A$ and $\beta$ during gradient descent?

Verify your answers with code.

In [None]:
# check grads of w and y 
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# check grads of w and y for a scenario where beta requires grad and A doesn't
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

In [None]:
del A, beta, x, y, w, f, f_1, f_2

---

<a id='3'></a>

## 3. Neural Networks
We are very close to defining our first neural network in torch. Torch includes a powerful neural network library, `torch.nn`, which allows us to create our own custom network flows, use highly optimized off-the-shelf implementations of most standard kinds of networks and compose different networks together.

<a id='3a'></a>
### A. Custom Neural Networks
You have already implemented an affine transformation; a shallow feedforward network is simply such a transformation followed by a non-linearity. For the sake of familiarizing ourselves with custom torch networks, we will go through the process of defining such a network from scratch (you won't normally be doing this, but it's still beneficial to have an idea of what's happening at the low level before proceeding to the high level).

The building block for a torch network is the `torch.nn.Module` class; we need to define our networks as objects inheriting that class. If we do so, we only need to implement two functions: `__init__()` and `forward()`. The first is responsible for registering the internal variables of our network, while the second specifies the kind of computation it actually performs.

Let's see these in practice; we will define a shallow feedforward network implementing $f(x) = \sigma(Wx + \beta)$ from any input dimension to any output dimension, where $\sigma$ is the sigmoid activation.

In [None]:
class my_first_network(torch.nn.Module):
    def __init__(self, in_features: int, out_features: int, device: str) -> None:
        super(my_first_network, self).__init__()  # this is important! do not forget to call this
        self.device = device
        self.W = torch.nn.Parameter(torch.rand(out_features, in_features, device=self.device))
        self.beta = torch.nn.Parameter(torch.rand(out_features, device=self.device))
        
    def forward(self, x: torch.FloatTensor) -> torch.FloatTensor:
        return torch.sigmoid(self.W@x + self.beta)

<div class="alert alert-block alert-info">
<b>Note:</b> Notice the use of torch.nn.Parameter. Wrapping the tensors that are parametric to our network's function in torch.nn.Parameter is crucial as it informs torch that these tensors need to be stored, updated and shared between different function calls.
</div>
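The effect of the wrapper can be observed through `named_parameters()`. In the sketch below, `TinyModule` is a hypothetical class written only for this illustration: the wrapped tensor is registered with the module, while the plain tensor attribute is not.

```python
import torch

class TinyModule(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.W = torch.nn.Parameter(torch.rand(2, 3))  # registered as a parameter
        self.c = torch.rand(2)                         # plain attribute: not registered

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.W @ x + self.c

tiny = TinyModule()
registered = [name for name, _ in tiny.named_parameters()]
print(registered)
print(tiny.W.requires_grad, tiny.c.requires_grad)
```

Only registered parameters are returned by `parameters()`, which is what optimizers and `.to(device)` operate on.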

We can now test our first network class by instantiating an actual network and passing a random tensor through it.

In [None]:
f = my_first_network(in_features=3, out_features=12, device=device)
x = torch.rand(3, device=device)
y = f(x)
print(y)

<div class="alert alert-block alert-info">
<b>Note:</b> Notice that we can use `f(x)` instead of `f.forward(x)` -- the `__call__` method of the `torch.nn.Module` class invokes `forward` (and additionally runs any registered hooks), so here the two statements give the same result.
</div>

Since we have the abstraction for one layer, what's stopping us from instantiating a second network and composing the two into a deep network?

In [None]:
class my_first_deep_network(torch.nn.Module):
    def __init__(self, in_features: int, intermediate_features: int, out_features: int, device: str) -> None:
        super(my_first_deep_network, self).__init__()
        self.device = device
        self.n_1 = my_first_network(
            in_features=in_features, out_features=intermediate_features, device=self.device
        )
        self.n_2 = my_first_network(
            in_features=intermediate_features, out_features=out_features, device=self.device
        )
        
    def forward(self, x: torch.FloatTensor) -> torch.FloatTensor:
        return self.n_2(self.n_1(x))
    
g = my_first_deep_network(in_features=3, intermediate_features=12, out_features=5, device=device)
y = g(x)
print(y)

Very easy, right?

Except we just did it the hard way! We could have used torch's pre-made `torch.nn.Linear`, the existing abstraction for single feedforward layers, and `torch.nn.Sequential`, the abstraction for composing sequences of networks. `Sequential` is initialized with a series of neural modules (not functions!), which are applied in the order specified.

What would that have looked like?

In [None]:
h = torch.nn.Sequential(
    torch.nn.Linear(in_features=3, out_features=12),
    torch.nn.Sigmoid(), 
    torch.nn.Linear(in_features=12, out_features=5), 
    torch.nn.Sigmoid()  
).to(device) 

y = h(x)
print(y)

Even easier!

<div class="alert alert-block alert-info">
<b>Note:</b> When composing networks using the `Sequential` abstraction, you should make sure that each network's expected input shape matches the output shape of the previous network. The device conversion is applied recursively to each sub-module within a module, ensuring that all components of the network live happily on the same device. 🏠 
</div>

So, why ever define our own networks if torch can do it for us? The answer is that very often (and very soon!) you will need to write your own, potentially complex, computation flow, which won't necessarily be possible to rephrase as simple layer stacking.
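
One example of such a computation flow is a skip connection, where the input is reused after a layer has been applied. The sketch below is only an illustration; `SkipBlock`, `block`, `x_demo` and `y_demo` are hypothetical names that do not appear elsewhere in this notebook.

```python
import torch


class SkipBlock(torch.nn.Module):
    """Hypothetical example: y = x + sigmoid(W x + b), i.e. a skip connection."""

    def __init__(self, features: int) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(features, features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The input is used twice, which plain layer stacking cannot express.
        return x + torch.sigmoid(self.linear(x))


block = SkipBlock(features=4)
x_demo = torch.rand(2, 4)  # a batch of 2 vectors with 4 features each
y_demo = block(x_demo)
print(y_demo.shape)  # torch.Size([2, 4])
```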

In [None]:
del x, y, h, g, f, my_first_network, my_first_deep_network

<a id='3b'></a>
### B.  Loss Functions

We have seen how to construct parametric, trainable functions and networks. Requiring a gradient and having a gradient are two different things, however. To obtain the gradients of our trainable parameters we need a loss function. The loss function is an indicator of how far off the network's output (prediction) is from the actual truth. Applying the chain rule, we may differentiate the loss value w.r.t. the model's parameters, populating the tensors' gradients in the process. 

As with networks, torch provides implementations for the commonly used loss functions but also allows us to write our own (as another neural module!).

We will only experiment with an existing loss function, but first we need to construct some data to play with -- we can create a synthetic dataset of pairs $(x_i, y_i)$ where $x_i$ is a random number and $y_i = 3 \cdot x_i - 2$.

In [20]:
# a tensor of shape 100, 1, i.e. 100 data points each of dimensionality 1
x = torch.rand((100, 1), device=device) 
y = 3 * x - 2

Time to employ a linear network.

<div class="alert alert-block alert-warning">
<b>Warning:</b>
Don't mix up input/output dimensionality and number of data samples! 
</div>

In [21]:
f = torch.nn.Linear(in_features=1, out_features=1).to(device)
prediction = f(x)

Now to define our loss function:

In [22]:
loss_fn = torch.nn.MSELoss()

Note that each element of our batch (i.e. each of the 100 data samples) has its own squared error w.r.t. its corresponding target. The loss function averages these individual errors into a single scalar, the MSE (mean squared error).

In [23]:
loss = loss_fn(prediction, y)
print(loss)

tensor(0.5040, device='cuda:0', grad_fn=<MseLossBackward0>)
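
To make the averaging explicit, here is a small sketch on three hand-picked data points. The tensors `target` and `guess` are made up for the illustration; `reduction="none"` asks `MSELoss` to return the individual squared errors instead of their mean.

```python
import torch

target = torch.tensor([[1.0], [2.0], [4.0]])
guess = torch.tensor([[1.5], [2.0], [2.0]])

# One squared error per sample.
per_sample = torch.nn.MSELoss(reduction="none")(guess, target)
# The default is reduction="mean", which averages them into a scalar.
averaged = torch.nn.MSELoss()(guess, target)

print(per_sample.view(-1))  # tensor([0.2500, 0.0000, 4.0000])
print(averaged)             # tensor(1.4167)
```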


Now that we have computed the loss, we may use it to populate the parameters' gradients via a backward pass. This is automagically done by a simple call of `backward`. 🧙

In [24]:
loss.backward()
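
What `backward` does can be checked against the gradients derived by hand. The sketch below uses its own hypothetical names (`model`, `xs`, `ys`, `demo_loss`) and runs on the CPU so that it does not interfere with the variables above; it compares the populated `.grad` fields with the analytic derivatives of the mean squared error.

```python
import torch

torch.manual_seed(0)
xs = torch.rand(100, 1)
ys = 3 * xs - 2

model = torch.nn.Linear(in_features=1, out_features=1)
print(model.weight.grad)  # None: nothing has been back-propagated yet

pred = model(xs)
demo_loss = torch.nn.MSELoss()(pred, ys)
demo_loss.backward()

# Analytic gradients of mean((w * x + b - y)^2) w.r.t. w and b.
residual = (pred - ys).detach()
manual_w_grad = (2 * residual * xs).mean()
manual_b_grad = (2 * residual).mean()

print(torch.allclose(model.weight.grad.squeeze(), manual_w_grad, atol=1e-5))  # True
print(torch.allclose(model.bias.grad.squeeze(), manual_b_grad, atol=1e-5))    # True
```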

Torch provides a few different loss functions, each for a particular use case (the task and your output layer's activation function). Note that some loss functions already apply the network's output activation internally. Refer to the [documentation](https://pytorch.org/docs/stable/nn.html#loss-functions) for a detailed overview. In most cases, the cheatsheet below should contain the answer.

<div class="alert alert-block alert-info">
<b>Tip:</b> At a loss which loss to use? Use this cheatsheet!
</div>

| Task | Activation | Loss Function |
| --- | --- | --- |
| K-class Classification | - | CrossEntropyLoss |
| K-class Classification | LogSoftmax | NLLLoss |
| K-class, Multi-Label Classification | - | BCEWithLogitsLoss |
| K-class, Multi-Label Classification | Sigmoid | BCELoss |
| Continuous Regression | - | MSELoss |
| Probability Distribution Fitting | LogSoftmax / LogSigmoid | KLDivLoss |
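
The pairs of rows that share a task differ only in where the activation is applied. As a rough sketch on random scores (all tensor names here are made up), `CrossEntropyLoss` on raw scores matches `NLLLoss` on log-softmaxed scores, and `BCEWithLogitsLoss` on raw scores matches `BCELoss` on scores passed through a sigmoid, since `BCELoss` expects probabilities.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(4, 3)              # 4 samples, 3 classes, raw scores
class_ids = torch.tensor([0, 2, 1, 2])  # one class index per sample

ce = torch.nn.CrossEntropyLoss()(logits, class_ids)
nll = torch.nn.NLLLoss()(torch.nn.LogSoftmax(dim=1)(logits), class_ids)

# Multi-label targets: each sample may belong to several classes at once.
multi_hot = torch.tensor(
    [[1.0, 0.0, 1.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
)
bce_logits = torch.nn.BCEWithLogitsLoss()(logits, multi_hot)
bce = torch.nn.BCELoss()(torch.sigmoid(logits), multi_hot)

print(torch.allclose(ce, nll, atol=1e-6))          # True
print(torch.allclose(bce_logits, bce, atol=1e-6))  # True
```

The variants that take raw scores are generally the more numerically stable choice.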

<a id='3c'></a>

### C.  Optimizers
Our struggles are slowly coming to an end. We have made a trainable network, we have computed the loss given the true output, and we have used the loss to populate the parameter gradients. The final thing to do is to use these gradients in order to update the parameter values. 
This is managed by an `Optimizer`. Gradient-based optimizers are the norm for training neural networks; most of them are variants of stochastic gradient descent. Torch provides implementations of the classic optimizers. Regardless of which one of them is your favourite, the process always involves the same steps:
1. Initialize the optimizer by letting it know which parameters it is going to be responsible for
2. Iterate over your data, and:
    1. Compute the prediction (forward pass)
    2. Compute the loss
    3. Back-propagate
    4. Perform an optimization step
    5. Zero out the gradients (so that they don't accumulate over optimization steps)
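
To see why the gradient reset matters, here is a minimal sketch with a single hand-made parameter `w` and plain SGD (both are assumptions for the illustration, not part of the notebook's running example). Calling `backward` twice without a reset adds the two gradients together.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
sgd = torch.optim.SGD([w], lr=0.1)

loss_a = (w ** 2).sum()  # d(loss)/dw = 2w = 2
loss_a.backward()
print(w.grad)            # tensor([2.])

# Without zeroing, a second backward pass adds to the stored gradient.
loss_b = (w ** 2).sum()
loss_b.backward()
print(w.grad)            # tensor([4.])

sgd.zero_grad()          # reset, then do one clean step
loss_c = (w ** 2).sum()
loss_c.backward()
sgd.step()               # w <- w - lr * grad = 1 - 0.1 * 2
print(w.data)            # tensor([0.8000])
```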

Let's see them in action on our toy network and synthetic dataset.

In [None]:
opt = torch.optim.Adam(f.parameters())  # initialize optimizer
for t in range(5001):  # iterate
    prediction = f(x)  # predict 
    loss = loss_fn(prediction, y)  # compute loss
    loss.backward()  # backpropagate
    opt.step()  # optimize
    opt.zero_grad()  # reset gradients
    if t % 500 == 0:
        print("Iteration {} loss: {}".format(t, loss.item()))

Congratulations! You have trained your first torch network! 🎉

Remember that the network was trying to approximate $a$ and $b$ in $a \cdot x + b$, which we set to $3 \cdot x - 2$ when we created the dataset. You can now probe the network's inner parameters to see to what extent it figured out the truth.

In [None]:
print(f.weight)
print(f.bias)

---

<a id='3d'></a>

### D. Exercises

Before we move on to a somewhat more realistic problem, it might be a good idea to get further acquainted with the basics. 

Let's take a quick look at a few different non-linear activations first.

Use `arange` to construct a float tensor $x$ of values $0 \dots 1000$ in ascending order. Then elementwise subtract $500$ and divide by $100$ to get a tensor of values $-5 \dots 5$.

In [None]:
#! x = NotImplemented
# YOUR CODE HERE
raise NotImplementedError()

Construct the tensors $s = \sigma(x)$, $t = \tanh(x)$ and $r = \mathrm{ReLU}(x)$ (refer to the documentation if needed).

In [None]:
s = torch.sigmoid(x)
#! t = NotImplemented
#! r = NotImplemented
# YOUR CODE HERE
raise NotImplementedError()

Now we can convert these to numpy arrays and plot the results. 

In [None]:
from matplotlib import pyplot as plt
plt.plot(x.cpu().numpy(),s.cpu().numpy())
plt.plot(x.cpu().numpy(),t.cpu().numpy())
plt.plot(x.cpu().numpy(),r.cpu().numpy())
plt.ylim((-2, 5))
plt.legend(["sigmoid", "tanh", "relu"])
plt.show()

Let's practice some more by solving the infamous [XOR problem](https://en.wikipedia.org/wiki/Exclusive_or) with a small deep network. 

In [None]:
X = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], device=device, dtype=torch.float)
Y = torch.tensor([0, 1, 1, 0], device=device, dtype=torch.float)

Use `torch.nn.Sequential` to create a minimal deep network with 1 output dimension and 2 intermediate dimensions. Use ReLU as your intermediate layer activation. 

Picturing the problem as a classification over two classes, select an appropriate output activation and loss function (refer to the cheatsheet for aid).

In [None]:
f = torch.nn.Sequential(
    # YOUR CODE HERE
    raise NotImplementedError()
).to(device)

#! loss_fn = NotImplemented
# YOUR CODE HERE
raise NotImplementedError()

Instantiate an optimizer for your network.

In [None]:
#! opt = NotImplemented
# YOUR CODE HERE
raise NotImplementedError()

Perform 5000 iterations of training, printing the loss as you go.

In [None]:
for t in range(1, 5001):  # iterate
    #! P = NotImplemented  # predict 
    #! loss = NotImplemented  # compute loss
    #! NotImplemented  # backpropagate
    #! NotImplemented  # optimize
    #! NotImplemented  # reset gradients
    # YOUR CODE HERE
    raise NotImplementedError()
    if t % 500 == 0:
        print("Iteration {} loss: {}".format(t, loss.item()))

You might encounter some errors here (e.g., shape, loss function, tensor type, value not in a range, etc.). If you do, don't panic! Read the error message, understand what the problem is, search the web, read the documentation of the relevant loss functions if needed, and remember the shape manipulation operations covered in this notebook.

Is the loss improving? If not, try toying around with the intermediate layer's width, the optimizer and its learning rate until your network can solve the problem.

Check what the trained network now predicts on our input data. Is that what we would expect here?

In [None]:
f(X)

In [None]:
del f, loss_fn, opt, X, Y

<a id='4'></a>
## 4. Putting Everything Together
Time to hone our newly acquired torch skills! 

We will now put everything together and write an actual network on a real task. The code below is mostly complete, but some parts here and there are missing. You will be asked to fill those in, so pay attention!

First, construct a two-layer network that implements the function $f: \mathbb{R}^{300} \to \mathbb{R}$, such that:

$f(x) = W_2 \, \mathrm{ReLU}(W_1 x + \beta_1) + \beta_2$

where:
* $W_1 \in \mathbb{R}^{100, 300}$
* $W_2 \in \mathbb{R}^{1, 100}$ 
* $ \beta_1 \in \mathbb{R}^{100}$ 
* $\beta_2 \in \mathbb{R}^1$


using `torch.nn.Sequential` 

In [None]:
f = torch.nn.Sequential(
    # YOUR CODE HERE
    raise NotImplementedError()
).to(device)

Let's make some use of this network on real data.
We are going to open a data dump containing ~5500 baby names. Each name is associated with a label (either 0 for male, or 1 for female), and also a 300-dimensional vector. Representing words as dense vectors is standard practice in NLP; you will learn more about these vectors in your first assignment. For now, we will simply use them in an attempt to teach the network to distinguish between boy and girl babies given their names, while writing some useful code in the process. 👶 🍼

In [None]:
import pickle
with open("name_data.p", "rb") as fh:
    names, vectors, labels = pickle.load(fh)

What does our data look like? Run the snippet below a couple of times to get an impression.

In [None]:
print(np.random.permutation(list(zip(names, labels)))[0:20])

`vectors` is a list of numpy arrays, and `labels` is a list of integers. We will need to convert them to lists of FloatTensors. A full dataset often does not fit on the GPU all at once, so we will keep the data in RAM regardless of your currently used device and only move individual batches to the device during training.

In [None]:
# or alternatively, vectors = [torch.tensor(vector, dtype=torch.float) for vector in vectors]
vectors = list(map(lambda x: torch.tensor(x, dtype=torch.float), vectors))
labels = list(map(lambda x: torch.tensor(x, dtype=torch.float), labels))  

Training on your entire dataset is bad practice; we should split the data into a training set and a validation set. We could either do it manually, or let `sklearn` do it for us.

In [None]:
from sklearn.model_selection import train_test_split
names_train, names_val, X_train, X_val, Y_train, Y_val = train_test_split(names, vectors, labels, test_size=0.2)
assert len(X_train) == len(Y_train) == len(names_train)
assert len(X_val) == len(Y_val) == len(names_val)
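
For reference, a manual split could look roughly like the sketch below: shuffle the indices once, then cut them at the 80% mark. The list `items` is a hypothetical stand-in for the real data, and the same index lists would have to be applied to names, vectors and labels alike to keep them aligned.

```python
import torch

torch.manual_seed(0)
num_items = 10
items = [f"item_{i}" for i in range(num_items)]  # hypothetical stand-in data

shuffled = torch.randperm(num_items).tolist()  # random order of the indices
num_val = num_items // 5                       # 20% held out for validation
cut = num_items - num_val
train_idx, val_idx = shuffled[:cut], shuffled[cut:]

items_train = [items[i] for i in train_idx]
items_val = [items[i] for i in val_idx]
print(len(items_train), len(items_val))  # 8 2
```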

Now that we have split the data, we may convert them from a list of tensors into a big tensor. We could do that by using `view()` to expand the first dimension of each vector and then `cat()` to merge them, but an easier solution is `stack()`.

In [None]:
X_train = torch.stack(X_train)
X_val = torch.stack(X_val)
Y_train = torch.stack(Y_train)
Y_val = torch.stack(Y_val)
print(X_train.shape)
print(X_val.shape)
print(Y_train.shape)
print(Y_val.shape)
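
The two routes mentioned above give the same tensor. A small sketch, using random vectors in place of the real word vectors (`vecs` is a made-up name):

```python
import torch

vecs = [torch.rand(300) for _ in range(4)]  # hypothetical stand-ins for the word vectors

stacked = torch.stack(vecs)                              # shape (4, 300)
concatenated = torch.cat([v.view(1, -1) for v in vecs])  # same result, more typing

print(stacked.shape)                       # torch.Size([4, 300])
print(torch.equal(stacked, concatenated))  # True
```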

Now that we have our tensors in a sensible format, we may construct a Dataset (a storage unit for our data) and a DataLoader (a wrapper responsible for shuffling the data, iterating through it and converting it to batches).

In [None]:
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader

train_dataset = TensorDataset(X_train, Y_train)
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_dataset = TensorDataset(X_val, Y_val)
val_dataloader = DataLoader(val_dataset, batch_size=32, shuffle=False)  # no need to shuffle the validation data

Let's begin and immediately stop an iteration through the training dataloader to get an idea of what's going on.

In [None]:
for batch_x, batch_y in train_dataloader:
    print(batch_x.shape)
    print(batch_x.dtype)
    print(batch_y.shape)
    print(batch_y.dtype)
    break

Looking good; batch_x is 32 300-dimensional vectors, and batch_y is 32 values.
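
Note that not every batch has to be full. In the sketch below, a toy dataset of 70 samples (an arbitrary number chosen for the illustration) is split into batches of 32, which leaves a smaller final batch because `DataLoader` keeps incomplete batches by default.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy_x = torch.rand(70, 300)
toy_y = torch.zeros(70)
toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=32, shuffle=True)

batch_sizes = [bx.shape[0] for bx, _ in toy_loader]
print(len(toy_loader))  # 3
print(batch_sizes)      # [32, 32, 6]
```

Code that processes batches should therefore not hard-code the batch size.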

`train_batch()` is a function that takes a network, a batch of inputs, a batch of outputs, a loss function and an optimizer, runs the training routine on that batch and returns the loss value of that batch.

In [None]:
from typing import Callable, List

# Callable is typed as Callable[[i1, i2, ..], o]
# where i1, i2, .. are the input types and o is the output type.

def train_batch(
    network: torch.nn.Module,  # the network
    X_batch: torch.FloatTensor,  # the X batch
    Y_batch: torch.FloatTensor,   # the Y batch
    # a function from a FloatTensor (prediction) and a FloatTensor (Y) to a FloatTensor (the loss)
    loss_fn: Callable[[torch.FloatTensor, torch.FloatTensor], torch.FloatTensor],  
    # the optimizer
    optimizer: torch.optim.Optimizer
) -> float:
    # Set the training mode.
    network.train()
    # Train.
    prediction_batch = network(X_batch)  # forward pass
    batch_loss = loss_fn(prediction_batch.view(-1), Y_batch)  # loss calculation
    batch_loss.backward()  # back-propagation (gradient computation)
    optimizer.step()  # parameter update
    optimizer.zero_grad()  # gradient reset
    return batch_loss.item()

<div class="alert alert-block alert-info">
<b>Note:</b> Several network components (e.g. dropout units) may behave differently during training and validation. We use `.train()` to inform the network that we are in training time.
</div>
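
A dropout unit makes the difference between the two modes easy to see. This is a standalone sketch; `drop` and `ones` are hypothetical names and the network in this notebook does not need to contain dropout.

```python
import torch

drop = torch.nn.Dropout(p=0.5)
ones = torch.ones(1000)

drop.train()
train_out = drop(ones)  # entries are zeroed at random, the rest scaled by 1 / (1 - p)

drop.eval()
eval_out = drop(ones)   # dropout is the identity at evaluation time

print(sorted(train_out.unique().tolist()))  # [0.0, 2.0]
print(torch.equal(eval_out, ones))          # True
```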

<div class="alert alert-block alert-warning">
<b>Warning:</b>
The `batch_loss` tensor requires gradient (why?). It is important to return its contents with `.item()` rather than the tensor itself; otherwise every stored loss keeps its computation graph alive, and memory usage grows with each batch.
</div>
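
The difference between the loss tensor and its contents can be inspected directly. A minimal sketch on a throwaway network (`net` and `demo_batch_loss` are made-up names):

```python
import torch

net = torch.nn.Linear(2, 1)
out = net(torch.rand(8, 2))
demo_batch_loss = torch.nn.MSELoss()(out, torch.zeros(8, 1))

print(demo_batch_loss.requires_grad)        # True
print(demo_batch_loss.grad_fn is not None)  # True: the tensor still points at its graph

value = demo_batch_loss.item()  # plain Python float, no graph attached
print(type(value))              # <class 'float'>
```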

`train_epoch()` is a function that takes a network, the training dataloader, a loss function and an optimizer. It iterates through the dataloader and is responsible for calling `train_batch()`.

In [None]:
def train_epoch(
    network: torch.nn.Module, 
    dataloader: DataLoader,
    loss_fn: Callable[[torch.FloatTensor, torch.FloatTensor], torch.FloatTensor],
    optimizer: torch.optim.Optimizer, 
    device: str
) -> float:
    # Set the initial loss value.
    loss = 0.
    # Iterate over the batches in the dataloader.
    for i, (x_batch, y_batch) in enumerate(dataloader):
        x_batch = x_batch.to(device)  # convert back to your chosen device
        y_batch = y_batch.to(device)
        loss += train_batch(
            network=network, X_batch=x_batch, Y_batch=y_batch, loss_fn=loss_fn, optimizer=optimizer
        )
    loss /= (i+1) # divide the loss by the number of batches for consistency 
    return loss

Your turn; fill in the missing parts of `eval_batch()`, a function that takes a network, a batch of inputs, a batch of outputs and a loss function, and computes the loss of that batch.

In [None]:
def eval_batch(
    network: torch.nn.Module,  # the network
    X_batch: torch.FloatTensor,  # the X batch
    Y_batch: torch.FloatTensor,   # the Y batch
    loss_fn: Callable[[torch.FloatTensor, torch.FloatTensor], torch.FloatTensor]
) -> float:
    # Set the evaluation mode.
    network.eval()
    # Disable gradient tracking while evaluating.
    with torch.no_grad():
        #! NotImplemented
    #! return NotImplemented
    # YOUR CODE HERE
    raise NotImplementedError()

<div class="alert alert-block alert-info">
<b>Note:</b> Notice that we use `.eval()` to inform the network that we are in validation time. Notice also the `no_grad()` context; it tells torch not to track gradients for the operations inside it, which saves memory and computation.
</div>
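
The effect of the `no_grad()` context can be observed on any module. In this sketch (with the hypothetical names `lin` and `inp`), the same forward pass is run once with and once without gradient tracking; the values agree, but only the first result is attached to a computation graph.

```python
import torch

lin = torch.nn.Linear(3, 1)
inp = torch.rand(5, 3)

tracked = lin(inp)
with torch.no_grad():
    untracked = lin(inp)

print(tracked.requires_grad)                     # True
print(untracked.requires_grad)                   # False
print(torch.equal(tracked.detach(), untracked))  # True: same values either way
```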

`eval_epoch()` is basically the same as `train_epoch()`, aside from the lack of an optimizer. Fill it in.

In [None]:
def eval_epoch(
    network: torch.nn.Module, 
    dataloader: DataLoader,
    loss_fn: Callable[[torch.FloatTensor, torch.FloatTensor], torch.FloatTensor],
    device: str
) -> float:
    #! NotImplemented
    # YOUR CODE HERE
    raise NotImplementedError()

We can also make an auxiliary `infer_batch()` function; the forward pass gives us the final layer's output, but we might be more interested in the predicted class rather than its probability.

In [None]:
def infer_batch(
    network: torch.nn.Module, 
    batch_x: torch.FloatTensor, 
    device: str
) -> torch.LongTensor:
    # First apply the sigmoid activation 
    # (since it is implemented by the loss function rather than the network itself).
    sigm = torch.sigmoid(network(batch_x.to(device)))
    # Round the result.
    classes = torch.round(sigm)
    # Detach it from the computation graph (we no longer care about its gradients).
    classes = classes.detach()
    # Cast the result into a LongTensor and return.
    return classes.to(torch.long)
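
Rounding the sigmoid amounts to thresholding the raw network output at zero, since $\sigma(z) > 0.5$ exactly when $z > 0$. A small sketch on hand-picked scores (`demo_logits` is a made-up tensor, not network output):

```python
import torch

demo_logits = torch.tensor([-3.0, -0.1, 0.2, 4.0])

via_sigmoid = torch.round(torch.sigmoid(demo_logits)).to(torch.long)
via_threshold = (demo_logits > 0).to(torch.long)

print(via_sigmoid)    # tensor([0, 0, 1, 1])
print(via_threshold)  # tensor([0, 0, 1, 1])
```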

One last thing before we can finally train; we need a loss function and an optimizer. 

In [None]:
opt = torch.optim.Adam(f.parameters(), lr=1e-05)
loss_fn = torch.nn.BCEWithLogitsLoss(reduction="mean")

In [None]:
NUM_EPOCHS = 100

train_losses = []
val_losses = []

for t in range(NUM_EPOCHS):
    train_loss = train_epoch(f, train_dataloader, optimizer=opt, loss_fn=loss_fn, device=device)
    val_loss = eval_epoch(f, val_dataloader, loss_fn, device=device)
    
    print("Epoch {}".format(t))
    print(" Training Loss: {}".format(train_loss))
    print(" Validation Loss: {}".format(val_loss))
    
    train_losses.append(train_loss)
    val_losses.append(val_loss)

We may plot the losses to get an idea of what the learning curve looks like.

In [None]:
from matplotlib import pyplot as plt
plt.plot(train_losses)
plt.plot(val_losses)
plt.legend(["Training", "Validation"])
plt.show()

And let's bring this to an end by labeling all of our validation data and printing the results.

In [None]:
predictions = []

for x_batch, _ in val_dataloader:
    p_batch = infer_batch(f, x_batch, device).cpu().numpy().tolist()
    predictions.extend(p_batch)
    
from pprint import pprint
pprint(list(zip(names_val, predictions)))

Very convincing! What are these word vectors and how are they helping our network predict baby genders?  🤔

Do the first assignment and find out!

---