## Welcome to the first Foundation Lab (Day 2).
Today, we will start with the fundamentals of PyTorch and then move on to building our first neural network, a Multilayer Perceptron (MLP)!
This notebook is a short introduction to using PyTorch effectively; it will be the foundation for the more complex projects we build later.

For more specific functionality and the official documentation, you can visit the [PyTorch tutorials](https://docs.pytorch.org/tutorials/) website.


# Facilitators
- Amr Khalifa (Google Deep Mind, amrkhalifa@google.com)
- Gergo Ignacz (KAUST, gergo.ignacz@kaust.edu.sa)

## "Your turn!" tasks
After some modules, you will see "Your turn!" sections: small tasks that you can practice on. The answers are always provided in the following (hidden) cells.

# PyTorch

[PyTorch](https://pytorch.org/) is an open source machine learning framework. At its core, PyTorch provides a few key features:

- A multidimensional **Tensor** object, similar to [numpy](https://numpy.org/) arrays but with GPU acceleration.
- An optimized **autograd** engine for automatically computing derivatives
- A clean, modular API for building and deploying **deep learning models**

We will use PyTorch for all programming exercises in this 1.5-hour lab. This notebook will focus on the **Tensor API**, as it is the core of PyTorch.

You can find more information about PyTorch by following one of the [official tutorials](https://pytorch.org/tutorials/) or by [reading the documentation](https://docs.pytorch.org/docs/2.9/).

To use PyTorch, we first need to import the `torch` package.

We also check the version; the assignments in this course use PyTorch version 2.9.0, since this is the default version in Google Colab (at the time of MenaML).

In [None]:
import torch
print("PyTorch Version is :", torch.__version__)

## Tensor Basics

### What is a tensor?

PyTorch uses tensors to represent data; they are the fundamental building blocks in machine learning and mathematics.

**Simple (named) tensors**:
1. Rank 0: Scalar
2. Rank 1: Vector
3. Rank 2: Matrix
4. Rank 3: Tensor (rank-3 tensor)
5. Rank N (N > 3): rank-N tensor


<img src="https://github.com/ignaczgerg/training-open-resources/blob/main/images/scalar-vector-matrix-tensor.png?raw=true" alt="scalar-vector-matrix-tensor" width="350"/>

### Your turn!
Decide whether each of the following is a scalar, vector, matrix, or tensor.
The solution is in the next cell.
1. The real number 0.23.
2. Longitude and latitude: (40° 41′ 17.57″ N, 74° 02′ 21.59″ W).
3. A simple (plain) EXCEL sheet.
4. An RGB image of a cat.
5. A black and white image of an old person.

#### Solution

<details>
  <summary>Click to see the answers</summary>
  
1. Scalar
2. Vector
3. Matrix
4. Tensor
5. Matrix
  
</details>
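The answers above can also be checked in code. The sketch below builds a stand-in tensor for each item and reads off its rank with `.dim()`; the values and image sizes are made up for illustration and are not part of the exercise.

```python
import torch

# Stand-in data for each item; the numbers and sizes are illustrative assumptions.
examples = {
    "real number": torch.tensor(0.23),                     # rank 0
    "longitude/latitude": torch.tensor([40.69, -74.04]),   # rank 1
    "spreadsheet": torch.zeros(5, 3),                      # rank 2: rows x columns
    "RGB image": torch.zeros(32, 32, 3),                   # rank 3: height x width x channels
    "black and white image": torch.zeros(32, 32),          # rank 2: height x width
}

ranks = {name: t.dim() for name, t in examples.items()}
for name, rank in ranks.items():
    print(f"{name}: rank {rank}")
```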

### Creating and Accessing tensors

A `torch` **tensor** is a multidimensional grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the **rank** of the tensor; the **shape** of a tensor is a tuple of integers giving the size of the array along each dimension.

We can initialize a `torch` tensor from nested Python lists. We can access or mutate elements of a PyTorch tensor using square brackets.

Accessing an element from a PyTorch tensor returns a PyTorch scalar; we can convert this to a Python scalar using the `.item()` method:

In [None]:
# Create a rank 1 tensor from a Python list
a = torch.tensor([1, 2, 3])
print('Here is a:')
print(a)
print('type(a): ', type(a))
print('rank of a: ', a.dim())
print('a.shape: ', a.shape)

# Access elements using square brackets
print()
print('a[0]: ', a[0])
print('type(a[0]): ', type(a[0]))
print('type(a[0].item()): ', type(a[0].item()))

# Mutate elements using square brackets
a[1] = 10
print()
print('a after mutating:')
print(a)

The example above shows a one-dimensional tensor; we can similarly create tensors with two or more dimensions:

In [None]:
# Create a two-dimensional tensor
b = torch.tensor([[1, 2, 3],
                  [4, 5, 5]]
                 )
print('Here is b:')
print(b)
print('rank of b:', b.dim())
print('b.shape: ', b.shape)

# Access elements from a multidimensional tensor
print()
print('b[0, 1]:', b[0, 1])
print('b[1, 2]:', b[1, 2])

# Mutate elements of a multidimensional tensor
b[1, 1] = 100
print()
print('b after mutating:')
print(b)

### Tensor Factories (Creation Ops)

PyTorch provides many convenience methods for constructing tensors; this avoids the need to use Python lists. For example:

- [`torch.zeros(size)`](https://docs.pytorch.org/docs/stable/generated/torch.zeros.html): Creates a tensor of all zeros. Common usage: initializing bias vectors in neural network layers.
- [`torch.ones(size)`](https://docs.pytorch.org/docs/stable/generated/torch.ones.html): Creates a tensor of all ones. Common usage: initialization of all-pass masks or scaling factors (in normalization layers).
- [`torch.rand(size)`](https://docs.pytorch.org/docs/stable/generated/torch.rand.html): Creates a tensor with uniform random numbers. Common usage: generating noise vectors or initializing network weights.
- [`torch.full(size, fill_value)`](https://docs.pytorch.org/docs/stable/generated/torch.full.html): Creates tensor with `size` full of `fill_value`. Common usage: special initialization; or filling padding with a special value.
- [`torch.eye`](https://docs.pytorch.org/docs/stable/generated/torch.eye.html): Creates a 2-D tensor with ones on the diagonal and zeros elsewhere. Common usage: diagonal regularization (jitter) on matrices; identity initialization for RNNs.
- [`torch.zeros_like(input)`](https://docs.pytorch.org/docs/stable/generated/torch.zeros_like.html): Creates a tensor filled with the scalar value 0, with the same size as the tensor `input`. `torch.ones_like` works the same way, filling with ones.

You can find a full list of tensor creation operations [in the documentation](https://docs.pytorch.org/docs/stable/torch.html#tensor-creation-ops).
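Two of the factories listed above, `torch.full` and the `*_like` functions, can be sketched as follows. The fill value `-1.0` and the variable names are arbitrary choices for illustration.

```python
import torch

# torch.full: every element is set to the same fill value
padded = torch.full((2, 3), -1.0)

# *_like factories copy the shape and dtype of an existing tensor
template = torch.rand(2, 3)
zeros = torch.zeros_like(template)
ones = torch.ones_like(template)

print(padded)
print(zeros.shape, zeros.dtype)
print(ones.shape, ones.dtype)
```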

In [None]:
# Create a tensor of all zeros
a = torch.zeros(2, 3)
print('tensor of zeros:')
print(a)

# Create a tensor of all ones
b = torch.ones(1, 2)
print('\ntensor of ones:')
print(b)
print('dim:',b.dim())
print('shape:',b.shape)

# Create a 3x3 identity matrix
c = torch.eye(3)
print('\nidentity matrix:')
print(c)

# Tensor of random values
d = torch.rand(3, 3)
print('\nrandom tensor:')
print(d)

### Datatypes

In the examples above, you may have noticed that some of our tensors contained floating-point values, while others contained integer values.

PyTorch provides a [large set of numeric datatypes](https://docs.pytorch.org/docs/stable/tensors.html) that you can use to construct tensors. PyTorch tries to guess a datatype when you create a tensor; functions that construct tensors typically have a `dtype` argument that you can use to explicitly specify a datatype.

Each tensor has a `dtype` attribute that you can use to check its data type:

In [None]:
# Let torch choose the datatype
x0 = torch.tensor([1, 2])   # List of integers
x1 = torch.tensor([1., 2.]) # List of floats
x2 = torch.tensor([1., 2])  # Mixed list
print('dtype when torch chooses for us:')
print('List of integers:', x0.dtype)
print('List of floats:', x1.dtype)
print('Mixed list:', x2.dtype)

# Force a particular datatype
y0 = torch.tensor([1, 2], dtype=torch.float32)  # 32-bit float
y1 = torch.tensor([1, 2], dtype=torch.int32)    # 32-bit (signed) integer
y2 = torch.tensor([1, 2], dtype=torch.int64)    # 64-bit (signed) integer
print('\ndtype when we force a datatype:')
print('32-bit float: ', y0.dtype)
print('32-bit integer: ', y1.dtype)
print('64-bit integer: ', y2.dtype)

# Other creation ops also take a dtype argument
z0 = torch.ones(1, 2)  # Let torch choose for us
z1 = torch.ones(1, 2, dtype=torch.int16) # 16-bit (signed) integer
z2 = torch.ones(1, 2, dtype=torch.uint8) # 8-bit (unsigned) integer
# Note: many modern architectures use mixed precision: they use half precision
# (float16 or bfloat16, a.k.a. bf16) for matrix multiplication but store the
# weights in single precision (32-bit floating point).
# If you are interested you can read more about this here (torch.cuda.amp):
# https://docs.pytorch.org/tutorials/recipes/recipes/amp_recipe.html
print('\ntorch.ones with different dtypes')
print('default dtype:', z0.dtype)
print('16-bit integer:', z1.dtype)
print('8-bit unsigned integer:', z2.dtype)

We can **cast** a tensor to another datatype using the [`.to()`](https://docs.pytorch.org/docs/stable/generated/torch.Tensor.to.html) method; there are also convenience methods like [`.float()`](https://docs.pytorch.org/docs/stable/generated/torch.Tensor.float.html) and [`.long()`](https://docs.pytorch.org/docs/stable/generated/torch.Tensor.long.html) that cast to particular datatypes:


In [None]:
x0 = torch.eye(3, dtype=torch.int64)
x1 = x0.float()  # Cast to 32-bit float
x2 = x0.double() # Cast to 64-bit float
x3 = x0.to(torch.float32) # Alternate way to cast to 32-bit float
x4 = x0.to(torch.float64) # Alternate way to cast to 64-bit float
print('x0:', x0.dtype)
print('x1:', x1.dtype)
print('x2:', x2.dtype)
print('x3:', x3.dtype)
print('x4:', x4.dtype)

PyTorch provides several ways to create a tensor with the same datatype as another tensor:
- [`torch.zeros_like()`](https://pytorch.org/docs/stable/generated/torch.zeros_like.html) creates a new tensor filled with zeros that matches the shape and datatype of a given tensor
- The tensor instance method [`.to()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.to) can take a tensor as an argument, in which case it casts to the datatype of the argument

In [None]:
x0 = torch.eye(3, dtype=torch.float64)  # shape (3, 3), dtype torch.float64
x1 = torch.zeros_like(x0)               # shape (3, 3), dtype torch.float64
x2 = torch.zeros(4, 5, dtype=x0.dtype)  # shape (4, 5), dtype torch.float64
x3 = torch.ones(6, 7).to(x0)            # shape (6, 7), dtype torch.float64
print('x0 shape is %r, dtype is %r' % (x0.shape, x0.dtype))
print('x1 shape is %r, dtype is %r' % (x1.shape, x1.dtype))
print('x2 shape is %r, dtype is %r' % (x2.shape, x2.dtype))
print('x3 shape is %r, dtype is %r' % (x3.shape, x3.dtype))

Even though PyTorch provides a large number of numeric datatypes, the most commonly used datatypes are:

- `torch.float32`: Standard floating-point type; used to store learnable parameters, network activations, etc. Nearly all arithmetic is done using this type.
- `torch.int64`: Typically used to store indices
- `torch.bool`: The preferred type for storing boolean values (True/False).
- `torch.float16`: Used for mixed-precision arithmetic, usually on NVIDIA GPUs with [tensor cores](https://www.nvidia.com/en-us/data-center/tensorcore/).
- `torch.bfloat16`: Increasingly popular for mixed-precision training; has the same exponent range as float32 but with reduced mantissa precision. Supported by modern hardware (NVIDIA Ampere/Hopper GPUs, TPUs, Intel CPUs). Often preferred over float16 for training due to better numerical stability.



You won't need to worry about the half-precision datatypes (`torch.float16` and `torch.bfloat16`) in this course.
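If you are curious about the difference in range between `float16` and `bfloat16` mentioned above, `torch.finfo` reports the numeric limits of each floating-point type. This is a small sketch; the test value `1e5` is an arbitrary choice that happens to exceed the largest `float16` value.

```python
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} bits={info.bits:2d} max={info.max:.3e} eps={info.eps:.3e}")

# A value that fits in float32 and bfloat16 but exceeds float16's narrower range
big = torch.tensor(1e5)
as_half = big.to(torch.float16)   # overflows to inf
as_bf16 = big.to(torch.bfloat16)  # stays finite, but is rounded
print(as_half, as_bf16)
```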

### The bracket counting trick


Here is a trick. You can tell the number of dimensions of a PyTorch tensor by counting the opening square brackets (`[`) at the start of its printed output; you only need to count one side.
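The trick can be checked against `.dim()` with a few lines of code. The helper `count_leading_brackets` below is a hypothetical function written for this illustration (it is not part of PyTorch), and it assumes the default print format, which starts with `tensor(`.

```python
import torch

def count_leading_brackets(t):
    """Count the '[' characters that open the printed form of a tensor."""
    text = str(t)                    # e.g. "tensor([[0., 0.], [0., 0.]])"
    body = text[len("tensor("):]     # drop the "tensor(" prefix
    return len(body) - len(body.lstrip("["))

samples = (torch.tensor(7), torch.zeros(3), torch.zeros(2, 3), torch.zeros(2, 3, 4))
for t in samples:
    print("brackets:", count_leading_brackets(t), "dim:", t.dim())
```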

The image below shows the "bracket counting" trick to quickly get the dimension of a tensor from a visual output.

<img src="https://github.com/ignaczgerg/training-open-resources/blob/main/images/bracketing_counting.png?raw=true" alt="bracketing counting trick" width="750"/>



### Your turn!

In [None]:
# Create a random tensor with 64-bit floating point precision with a shape of (256, 256, 3)
tensor = None # TODO
# print("tensor:", tensor)

# Print out the dimension
print(None) # TODO

# print out the shape of tensor
print(None) # TODO

# create a new tensor filled with ones with the same size as tensor
tensor_like = None # TODO
print(tensor_like)

# create an identity matrix of shape (256, 256) and write it into the slice tensor[:, :, 1]
tensor_eye = None # TODO
# TODO: add masking here


# print("Modified slice (:, :, 1):\n", tensor[:, :, 1])

In [None]:
#@title Solution
# Create a random tensor with 64-bit floating point precision with a shape of (256, 256, 3)
tensor = torch.rand(size=(256, 256, 3), dtype=torch.float64)
print("tensor:", tensor)

# Print out the dimension
print(tensor.ndim)

# print out the shape of tensor
print(tensor.shape)

# create a new tensor filled with ones with the same size as tensor
tensor_like = torch.ones_like(tensor)
print(tensor_like)

# create an identity matrix of shape (256, 256) and write it into the slice tensor[:, :, 1]
tensor_eye = torch.eye(tensor.shape[0], dtype=tensor.dtype)
tensor[:, :, 1] = tensor_eye
print("Modified slice (:, :, 1):\n", tensor[:, :, 1])

## Tensor indexing

We have already seen how to get and set individual elements of PyTorch tensors. PyTorch also provides many other ways of indexing into tensors. Getting comfortable with these options makes it easy to read and modify different parts of a tensor.

### Slice indexing

Similar to Python lists and numpy arrays, PyTorch tensors can be **sliced** using the syntax `start:stop` or `start:stop:step`. The `stop` index is always non-inclusive: it is the first element not to be included in the slice.

Start and stop indices can be negative, in which case they count backward from the end of the tensor.

In [None]:
a = torch.tensor([0, 11, 22, 33, 44, 55, 66])
print(0, a)        # (0) Original tensor
print(1, a[2:5])   # (1) Elements from index 2 up to (not including) index 5
print(2, a[2:])    # (2) Elements from index 2 onward
print(3, a[:5])    # (3) Elements before index 5
print(4, a[:])     # (4) All elements
print(5, a[1:5:2]) # (5) Every second element between indices 1 and 5
print(6, a[:-1])   # (6) All but the last element
print(7, a[-4::2]) # (7) Every second element, starting from the fourth-last

For multidimensional tensors, you can provide a slice or integer for each dimension of the tensor in order to extract different types of subtensors:

In [None]:
# Create the following rank 2 tensor with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = torch.tensor([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print('Original tensor:')
print(a)
print('shape: ', a.shape)

# Get row 1, and all columns.
print('\nSingle row:')
print(a[1, :])
print(a[1])  # Gives the same result; we can omit : for trailing dimensions
print('shape: ', a[1].shape)

print('\nSingle column:')
print(a[:, 1])
print('shape: ', a[:, 1].shape)

# Get the first two rows and the last three columns
print('\nFirst two rows, last three columns:')
print(a[:2, -3:])
print('shape: ', a[:2, -3:].shape)

# Get every other row, and columns at index 1 and 2
print('\nEvery other row, middle columns:')
print(a[::2, 1:3])
print('shape: ', a[::2, 1:3].shape)

Slicing a tensor returns a **view** into the same data, so modifying it will also modify the original tensor. To avoid this, you can use the `clone()` method to make a copy of a tensor.

In [None]:
# Create a tensor, a slice, and a clone of a slice
a = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8]])
b = a[0, 1:]
c = a[0, 1:].clone()
print('Before mutating:')
print(a)
print(b)
print(c)

a[0, 1] = 20  # a[0, 1] and b[0] point to the same element
b[1] = 30     # b[1] and a[0, 2] point to the same element
c[2] = 40     # c is a clone, so it has its own data
print('\nAfter mutating:')
print(a)
print(b)
print(c)

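# b is a view of a, so they share one underlying storage; c is a clone with its own.
# (.untyped_storage() replaces the deprecated .storage() in PyTorch 2.x)
print(a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr())  # True
print(a.untyped_storage().data_ptr() == c.untyped_storage().data_ptr())  # False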

So far we have used slicing to **access** subtensors; we can also use slicing to **modify** subtensors by writing assignment expressions where the left-hand side is a slice expression, and the right-hand side is a constant or a tensor of the correct shape:

In [None]:
a = torch.zeros(2, 4, dtype=torch.int64)
a[:, :2] = 1
a[:, 2:] = torch.tensor([[2, 3], [4, 5]])
print(a)
# Create a causal mask for GPT-style models
# Prevents tokens from attending to future positions
# Example:
# batch_size, seq_len = 32, 128
# mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
# mask[:, :] = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
# Note: https://docs.pytorch.org/docs/stable/generated/torch.triu.html

### Integer tensor indexing (Fancy indexing)

When you index into a torch tensor using slicing, the resulting tensor view will always be a subarray of the original tensor. This is powerful, but can be restrictive.

We can also use **index arrays** to index tensors; this lets us construct new tensors with a lot more flexibility than using slices.

As an example, we can use index arrays to reorder the rows or columns of a tensor:

In [None]:
# Create the following rank 2 tensor with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print('Original tensor:')
print(a)

# Create a new tensor of shape (5, 4) by reordering rows from a:
# - First two rows same as the first row of a
# - Third row is the same as the last row of a
# - Fourth and fifth rows are the same as the second row from a
idx = [0, 0, 2, 1, 1]  # index arrays can be Python lists of integers
print('\nReordered rows:')
print(a[idx])

# Create a new tensor of shape (3, 4) by reversing the columns from a
idx = torch.tensor([3, 2, 1, 0])  # Index arrays can be int64 torch tensors
print('\nReordered columns:')
print(a[:, idx])

More generally, given index arrays `idx0` and `idx1` with `N` elements each, `a[idx0, idx1]` is equivalent to:

```
torch.tensor([
  a[idx0[0], idx1[0]],
  a[idx0[1], idx1[1]],
  ...,
  a[idx0[N - 1], idx1[N - 1]]
])
```

(A similar pattern extends to tensors with more than two dimensions)
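The equivalence above can be verified on a small example. The index values below are arbitrary and chosen only for illustration.

```python
import torch

a = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
idx0 = [0, 2, 1]  # row index of each element to pick
idx1 = [1, 0, 2]  # column index of each element to pick

# Integer tensor indexing picks a[0, 1], a[2, 0] and a[1, 2]
fancy = a[idx0, idx1]

# The same selection written out element by element
manual = torch.tensor([a[i, j].item() for i, j in zip(idx0, idx1)])

print(fancy)
print(torch.equal(fancy, manual))
```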

We can for example use this to get or set the diagonal of a tensor:

In [None]:
a = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print('Original tensor:')
print(a)

idx = [0, 1, 2]
print('\nGet the diagonal:')
print(a[idx, idx])

# Modify the diagonal
a[idx, idx] = torch.tensor([11, 22, 33])
print('\nAfter setting the diagonal:')
print(a)

### Boolean tensor indexing

Boolean tensor indexing lets you pick out arbitrary elements of a tensor according to a boolean mask. Frequently this type of indexing is used to select or modify the elements of a tensor that satisfy some condition.

In PyTorch, we use `torch.bool` for this.

In [None]:
a = torch.tensor([[1,2], [3, 4], [5, 6]])
print('Original tensor:')
print(a)

# Find the elements of a that are bigger than 3. The mask has the same shape as
# a, where each element of mask tells whether the corresponding element of a
# is greater than three.
mask = (a > 3)
print('\nMask tensor:')
print(mask)

# We can use the mask to construct a rank-1 tensor containing the elements of a
# that are selected by the mask
print('\nSelecting elements with the mask:')
print(a[mask])

# We can also use boolean masks to modify tensors; for example this sets all
# elements <= 3 to zero:
a[a <= 3] = 0
print('\nAfter modifying with a mask:')
print(a)

### Your turn!

In [None]:

# Create a 4x4 tensor with values from 1 to 16 (hint: use torch.arange(start, stop))
x = None # TODO
# print("Original Tensor:\n", x)

# Select the 2x2 square in the top-left corner (rows 0-1, cols 0-1)
# Expected: [[1, 2], [5, 6]]
top_left = None # TODO
# print("\nTop Left Slice:\n", top_left)

# Select only the 1st and 3rd rows (indices 0 and 2) completely
# Expected: [[1, 2, 3, 4], [9, 10, 11, 12]]
rows_0_and_2 = None # TODO
# print("\nRows 0 and 2:\n", rows_0_and_2)

# Select all values in the tensor that are greater than 10
# Expected: [11, 12, 13, 14, 15, 16]
greater_than_10 = None # TODO
# print("\nValues > 10:\n", greater_than_10)

# Set all values that are even numbers to 0 in the original tensor 'x'
# Hint: Use the modulo operator %
# TODO add your code here!

# print("\nTensor with evens set to 0:\n", x)

In [None]:
#@title Solution
# Create a 4x4 tensor with values from 1 to 16 (hint: use torch.arange(start, stop))
x = torch.arange(1, 17).reshape(4, 4)
print("Original Tensor:\n", x)

# Select the 2x2 square in the top-left corner (rows 0-1, cols 0-1)
# Expected: [[1, 2], [5, 6]]
top_left = x[:2, :2]
print("\nTop Left Slice:\n", top_left)

# Select only the 1st and 3rd rows (indices 0 and 2) completely
# Expected: [[1, 2, 3, 4], [9, 10, 11, 12]]
rows_0_and_2 = x[[0, 2]]
print("\nRows 0 and 2:\n", rows_0_and_2)

# Select all values in the tensor that are greater than 10
# Expected: [11, 12, 13, 14, 15, 16]
greater_than_10 = x[x > 10]
print("\nValues > 10:\n", greater_than_10)

# Set all values that are even numbers to 0 in the original tensor 'x'
# Hint: Use the modulo operator %
x[x % 2 == 0] = 0
print("\nTensor with evens set to 0:\n", x)

## Reshaping operations

### View

PyTorch provides many ways to manipulate the shapes of tensors. The simplest example is [`.view()`](https://docs.pytorch.org/docs/stable/generated/torch.Tensor.view.html): This returns a new tensor with the same number of elements as its input, but with a different shape.

We can use `.view()` to flatten matrices into vectors, and to convert rank-1 vectors into rank-2 row or column matrices:

In [None]:
x0 = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8]])
print('Original tensor:')
print(x0)
print('shape:', x0.shape)

# Flatten x0 into a rank 1 vector of shape (8,)
x1 = x0.view(8)
print('\nFlattened tensor:')
print(x1)
print('shape:', x1.shape)

# Convert x1 to a rank 2 "row vector" of shape (1, 8)
x2 = x1.view(1, 8)
print('\nRow vector:')
print(x2)
print('shape:', x2.shape)

# Convert x1 to a rank 2 "column vector" of shape (8, 1)
x3 = x1.view(8, 1)
print('\nColumn vector:')
print(x3)
print('shape:', x3.shape)

# Convert x1 to a rank 3 tensor of shape (2, 2, 2):
x4 = x1.view(2, 2, 2)
print('\nRank 3 tensor:')
print(x4)
print('shape:', x4.shape)

As a convenience, calls to `.view()` may include a single -1 argument; PyTorch infers the size of that dimension so that the output has the same number of elements as the input.

In [None]:
# We can reuse these functions for tensors of different shapes
def flatten(x):
  return x.view(-1)

def make_row_vec(x):
  return x.view(1, -1)

x0 = torch.tensor([[1, 2, 3], [4, 5, 6]])
x0_flat = flatten(x0)
x0_row = make_row_vec(x0)
print('x0:')
print(x0)
print('x0_flat:')
print(x0_flat)
print('x0_row:')
print(x0_row)

# The difference between x.view(-1) and x.flatten() is that the latter
# works on any tensor (copying the data if needed), while .view() only works
# on tensors with a compatible memory layout, such as contiguous tensors
# (more on them later).

As its name implies, a tensor returned by `.view()` shares the same data as the input, so changes to one will affect the other and vice-versa:

In [None]:
x = torch.tensor([[1, 2, 3], [4, 5, 6]])
x_flat = x.view(-1)
print('x before modifying:')
print(x)
print('x_flat before modifying:')
print(x_flat)

x[0, 0] = 10   # x[0, 0] and x_flat[0] point to the same data
x_flat[1] = 20 # x_flat[1] and x[0, 1] point to the same data

print('\nx after modifying:')
print(x)
print('x_flat after modifying:')
print(x_flat)

### Swapping axes

Another common reshape operation you might want to perform is transposing a matrix. You might be surprised if you try to transpose a matrix with `.view()`: The `view()` function takes elements in row-major order, so **you cannot transpose matrices with `.view()`**.

In general, you should only use `.view()` to add new dimensions to a tensor, or to collapse adjacent dimensions of a tensor.
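As a small sketch of these two safe uses of `.view()`, the example below collapses two adjacent dimensions and inserts a new dimension of size 1. The shapes are arbitrary and chosen only for illustration.

```python
import torch

x = torch.arange(24).view(2, 3, 4)

merged = x.view(6, 4)           # collapse adjacent dimensions 0 and 1
expanded = x.view(2, 3, 1, 4)   # insert a new dimension of size 1

print(merged.shape, expanded.shape)

# Both views keep the elements in the same row-major order as x
print(torch.equal(merged.flatten(), x.flatten()))
print(torch.equal(expanded.flatten(), x.flatten()))
```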

For other types of reshape operations, you usually need to use a function that can swap axes of a tensor. The simplest such function is `.t()`, specifically for transposing matrices (2D). It is available both as a [function in the `torch` module](https://docs.pytorch.org/docs/stable/generated/torch.t.html), and as a tensor instance method.


In [None]:
x = torch.tensor([[1, 2, 3], [4, 5, 6]])
print('Original matrix:')
print(x)
print('\nTransposing with view DOES NOT WORK!')
print(x.view(3, 2))
print('\nTransposed matrix:')
print(torch.t(x))
print(x.t())

For tensors with more than two dimensions, we can use the general transpose function: [`torch.transpose(input, dim0, dim1)`](https://docs.pytorch.org/docs/stable/generated/torch.transpose.html) to swap arbitrary dimensions, or the [`.permute`](https://docs.pytorch.org/docs/stable/generated/torch.permute.html) method to arbitrarily permute dimensions:

In [None]:
# Create a tensor of shape (2, 3, 4)
x0 = torch.tensor([
     [[1,  2,  3,  4],
      [5,  6,  7,  8],
      [9, 10, 11, 12]],
     [[13, 14, 15, 16],
      [17, 18, 19, 20],
      [21, 22, 23, 24]]])
print('Original tensor:')
print(x0)
print('shape:', x0.shape)

# Swap axes 1 and 2; shape is (2, 4, 3)
x1 = x0.transpose(1, 2)
print('\nSwap axes 1 and 2:')
print(x1)
print(x1.shape)

# Permute axes; the argument (1, 2, 0) means:
# - Make the old dimension 1 appear at dimension 0;
# - Make the old dimension 2 appear at dimension 1;
# - Make the old dimension 0 appear at dimension 2
# This results in a tensor of shape (3, 4, 2)
x2 = x0.permute(1, 2, 0)
print('\nPermute axes')
print(x2)
print('shape:', x2.shape)

### Contiguous tensors

Some combinations of reshaping operations will fail with cryptic errors. The exact reasons for this have to do with the way that tensors and views of tensors are implemented, and are beyond the scope of this assignment.

What you need to know is that you can typically overcome these sorts of errors either by calling [`.contiguous()`](https://docs.pytorch.org/docs/stable/generated/torch.Tensor.contiguous.html) before `.view()`, or by using [`.reshape()`](https://docs.pytorch.org/docs/stable/generated/torch.reshape.html) instead of `.view()`.
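For the curious, the memory layout behind these errors can be made visible with `.stride()` and `.is_contiguous()`. This is an optional sketch on a tiny example tensor and is not required for the lab: a transpose reuses the same storage with swapped strides, and `.contiguous()` copies the data into row-major order.

```python
import torch

x = torch.arange(6).view(2, 3)
xt = x.t()  # transpose: same storage, different strides

print(x.stride(), x.is_contiguous())    # (3, 1) True
print(xt.stride(), xt.is_contiguous())  # (1, 3) False

# .contiguous() copies the data into row-major order for the new shape
xc = xt.contiguous()
print(xc.stride(), xc.is_contiguous())  # (2, 1) True
```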

In [None]:
x0 = torch.randn(2, 3, 4)

try:
  # This sequence of reshape operations will crash
  x1 = x0.transpose(1, 2).view(8, 3)
except RuntimeError as e:
  print(type(e), e)

# We can solve the problem using either .contiguous() or .reshape()
x1 = x0.transpose(1, 2).contiguous().view(8, 3)
x2 = x0.transpose(1, 2).reshape(8, 3)
print('x1 shape: ', x1.shape)
print('x2 shape: ', x2.shape)

### Your turn!

In [None]:
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# You are working on an image classification task to predict what is
# in the image (let's say animals). You have already implemented some generic
# boilerplate classes and functions, but now it is time to make them work.
# Note: for the sake of simplicity, we use random tensors instead of real images.
# Hence the first class: RandomImageDataset
class RandomImageDataset(Dataset):
  def __init__(self, num_samples=100):
    self.num_samples = num_samples

  def __len__(self):
    # A Dataset used with a DataLoader always needs a __len__ method.
    return self.num_samples

  def __getitem__(self, idx):
    # Task 1
    # A Dataset also always needs a __getitem__ method for the DataLoader.
    # In a common scenario, images are loaded with OpenCV in
    # HWC format, where C is the channel (usually red/green/blue).
    # Create a random tensor of 256 by 256 with red/green/blue channels.
    image_hwc = None

    # Unfortunately, PyTorch layers expect images in CHW format.
    # Task: swap dimensions (2->0, 0->1, 1->2).
    # Hint: you can use .permute
    image_chw = None
    return image_chw

def preprocess_batch(images):
  # In nearly all scenarios, it is essential to normalize the features.
  # A common normalization technique is called Z-normalization (or standardization).
  # norm_data = (data - mean) / std
  # For simplicity, this task only subtracts the per-channel mean.

  # Task 2
  # Let's assume 'images' is a batch of 32 images, 3 channels (RGB) with 224x224 pixels
  # Mean values for R, G, and B channels are [0.485, 0.456, 0.406]
  # Hint: use .view() to assign the correct dimensions.
  means = None

  # PyTorch stretches 'means' to match the batch automatically
  normalized_images = None

  return normalized_images


class SimpleCNNClassifier(nn.Module):
  def __init__(self):
    super().__init__()
    self.num_classes = 10
    # For this task, we manually define the weights of the final layer
    self.weights = torch.randn(512, self.num_classes)

  def forward(self, x):
    # Task 3
    # The output of the last convolution layer is: 64 images with
    # 512 channels and 7x7 feature maps. We've given this to you:
    features = torch.randn(64, 512, 7, 7)

    # Flatten the features into the linear layer.
    # We keep the batch dim (64) and flatten the rest.
    flat_features = None # TODO
    # print(f"Shape of flat_features: {flat_features.shape}") # Should be [64, 25088]
    # For this specific task, we simulate a 32-batch vector of size 512
    features_for_matmul = torch.randn(32, 512) # this is given to you!


    # Task 4
    # Map these features through the linear layer so that the shapes are correct.
    # Hint: you need to do some sort of multiplication (but which one?)
    class_logits = None # TODO

    return class_logits


# Task 5
# During your analysis of the final model, you noticed that some
# images get really high probability scores (the model is confident).
# You want to see all of the instances where the prediction score is higher than
# 0.9 and discard the rest. Fill in the function below.
def analyze_predictions():
  top_probs = torch.rand(64)

  # Create a mask where confidence is greater than 0.9.
  high_confidence_mask = None # TODO

  # Select only the high-confidence scores
  reliable_preds = None # TODO

  print(f"Original batch size: {len(top_probs)}")
  print(f"Count of reliable preds: {len(reliable_preds)}")



# # ==========================================
# # Execution Loop. Un-comment the code below to test your
# # implementation above.

ds = RandomImageDataset()
dl = DataLoader(ds, batch_size=32, shuffle=True)

# # Task 1 (get a batch)
# batch_images_chw = next(iter(dl))
# print(f"Dataloader output shape: {batch_images_chw.shape}")
# assert batch_images_chw.shape == (32, 3, 256, 256), f"Task 1 Error: Expected (32, 3, 256, 256), got {batch_images_chw.shape}"

# #Task 2
# # We fake a resize to 224 for the normalization task
# batch_resized = torch.rand(32, 3, 224, 224)
# norm_batch = preprocess_batch(batch_resized)
# print(f"Normalized batch shape: {norm_batch.shape}")
# assert norm_batch.shape == (32, 3, 224, 224), f"Task 2 Error: Expected (32, 3, 224, 224), got {norm_batch.shape}"

# # Task 3 and 4 (run the model)
# model = SimpleCNNClassifier()
# output = model(batch_resized)
# print(f"Final logits shape (should be [32, 10]): {output.shape}")
# assert output.shape == (32, 10), f"Error: expected (32, 10), got {output.shape}"

# # Task 5 (analysis)
# analyze_predictions()

In [None]:
# @title Solution
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# You are working on an image classification task to predict what is
# in the image (let's say animals). You have already implemented some generic
# boilerplate classes and functions, but now it is time to make them work.
# Note: for the sake of simplicity, we use random tensors instead of real images.
# Hence the first class: RandomImageDataset
class RandomImageDataset(Dataset):
  def __init__(self, num_samples=100):
    self.num_samples = num_samples

  def __len__(self):
    # A Dataset used with a DataLoader always needs a __len__ method.
    return self.num_samples

  def __getitem__(self, idx):
    # Task 1
    # A Dataset also always needs a __getitem__ method for the DataLoader.
    # In a common scenario, images are loaded with OpenCV in
    # HWC format, where C is the channel (usually red/green/blue).
    # Create a random tensor of 256 by 256 with red/green/blue channels.
    image_hwc = torch.rand(256, 256, 3)

    # Unfortunately, PyTorch layers expect images in CHW format.
    # Task: swap dimensions (2->0, 0->1, 1->2).
    # Hint: you can use .permute
    image_chw = image_hwc.permute(2, 0, 1)
    return image_chw

def preprocess_batch(images):
  # In nearly all scenarios, it is essential to normalize the features.
  # A common normalization technique is called Z-normalization (or standardization).
  # norm_data = (data - mean) / std
  # For simplicity, this task only subtracts the per-channel mean.

  # Task 2
  # Let's assume 'images' is a batch of 32 images, 3 channels (RGB) with 224x224 pixels
  # Mean values for R, G, and B channels are [0.485, 0.456, 0.406]
  # Hint: use .view() to assign the correct dimensions.
  means = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)

  # PyTorch stretches 'means' to match the batch automatically
  normalized_images = images - means

  return normalized_images


class SimpleCNNClassifier(nn.Module):
  def __init__(self):
    super().__init__()
    self.num_classes = 10
    # We manually define the weights for the final layer task
    self.weights = torch.randn(512, self.num_classes)

  def forward(self, x):
    # Task 3
    # The output of the last convolution layer is: 64 images with
    # 512 channels and 7x7 feature maps. We've given this to you
    features = torch.randn(64, 512, 7, 7)

    # Flatten the features into the linear layer.
    # We keep the batch dim (64) and flatten the rest.
    flat_features = features.view(64, -1)
    # print(f"Shape of flat_features: {flat_features.shape}") # Should be [64, 25088]

    # For this specific task, we simulate a 32-batch vector of size 512
    features_for_matmul = torch.randn(32, 512) # this is given to you!

    # Task 4
    # How do we map these through the linear layer so that the shapes are correct?
    # Hint: you need to do some sort of multiplication (but which one?)
    class_logits = features_for_matmul @ self.weights

    return class_logits


# Task 5
# During your analysis of the final model, you noticed that some
# images get really high probability scores (the model is confident).
# You want to see all of those instances where the prediction score is higher than
# 0.9 and discard the rest. Fill in the function below.
def analyze_predictions():
  top_probs = torch.rand(64)

  # Create a mask where confidence is greater than 0.9.
  high_confidence_mask = (top_probs > 0.9)

  # Select only the high-confidence scores
  reliable_preds = top_probs[high_confidence_mask]

  print(f"Original batch size: {len(top_probs)}")
  print(f"Count of reliable preds: {len(reliable_preds)}")



# ==========================================
# Execution loop: tests the implementation above.

ds = RandomImageDataset()
dl = DataLoader(ds, batch_size=32, shuffle=True)

# Task 1 (get a batch)
batch_images_chw = next(iter(dl))
print(f"Dataloader output shape: {batch_images_chw.shape}")
assert batch_images_chw.shape == (32, 3, 256, 256), f"Task 1 Error: Expected (32, 3, 256, 256), got {batch_images_chw.shape}"

# Task 2
# We fake a resize to 224 for the normalization task
batch_resized = torch.rand(32, 3, 224, 224)
norm_batch = preprocess_batch(batch_resized)
print(f"Normalized batch shape: {norm_batch.shape}")
assert norm_batch.shape == (32, 3, 224, 224), f"Task 2 Error: Expected (32, 3, 224, 224), got {norm_batch.shape}"

# Task 3 and 4 (run the model)
model = SimpleCNNClassifier()
output = model(batch_resized)
print(f"Final logits shape (should be [32, 10]): {output.shape}")
assert output.shape == (32, 10), f"Error: expected (32, 10), got {output.shape}"

# Task 5 (analysis)
analyze_predictions()

### Your turn!

In [None]:
# create a 3D tensor of shape (2, 3, 4)
# think of this as for example 2 images, 3 color channels, 4 pixels width
x = None # TODO
# print("shape:", x.shape)

# Flatten the last two dimensions (3, 4) into a single dimension of size 12.
# The result should have shape (2, 12).
# Hint: Use .view()
flattened = None # TODO
# print("\nflattened shape:", flattened.shape)

# Swap the dimensions at indices 1 and 2 (the 2nd and 3rd dimensions).
# old: (2, 3, 4) -> new: (2, 4, 3)
# Hint: Use .transpose()
swapped = None # TODO
# print("\nSwapped shape:", swapped.shape)

# Check if the 'swapped' tensor is contiguous in memory.
# Hint: It should return False because transposing changes strides without moving data
is_contig = None # TODO
# print("\n contiguous?", is_contig)

# Now, let's force it!
# Create a new tensor from 'swapped' that is actually contiguous in memory.
# This allows us to use .view() on it later without errors.
cont_tensor = None # TODO
# print("\nIs cont_tensor contiguous?", cont_tensor.is_contiguous())

# You can view the answers in the cell down below!

In [None]:
# @title Solution
# create a 3D tensor of shape (2, 3, 4)
# think of this as for example 2 images, 3 color channels, 4 pixels width
x = torch.arange(24).reshape(2, 3, 4)
print("shape:", x.shape)

# Flatten the last two dimensions (3, 4) into a single dimension of size 12.
# The result should have shape (2, 12).
# Hint: Use .view()
flattened = x.view(2, 12)
print("\nflattened shape:", flattened.shape)

# Swap the dimensions at indices 1 and 2 (the 2nd and 3rd dimensions).
# old: (2, 3, 4) -> new: (2, 4, 3)
# Hint: Use .transpose()
swapped = x.transpose(1, 2)
print("\nSwapped shape:", swapped.shape)

# Check if the 'swapped' tensor is contiguous in memory.
# Hint: It should return False because transposing changes strides without moving data
is_contig = swapped.is_contiguous()
print("\n contiguous?", is_contig)

# Now, lets force it!
# Create a new tensor from 'swapped' that is actually contiguous in memory.
# This allows us to use .view() on it later without errors.
cont_tensor = swapped.contiguous()
print("\nIs fixed tensor contiguous?", cont_tensor.is_contiguous())


## Tensor operations

### Elementwise operations

Basic mathematical functions operate elementwise on tensors, and are available as operator overloads, as functions in the `torch` module, and as instance methods on tensor objects; all produce the same results:

In [None]:
x = torch.tensor([[1, 2, 3, 4]], dtype=torch.float32)
y = torch.tensor([[5, 6, 7, 8]], dtype=torch.float32)

# Elementwise sum; all give the same result
print('Elementwise sum:')
print(x + y)
print(torch.add(x, y))
print(x.add(y))

# Elementwise difference
print('\nElementwise difference:')
print(x - y)
print(torch.sub(x, y))
print(x.sub(y))

# Elementwise product
print('\nElementwise product:')
print(x * y)
print(torch.mul(x, y))
print(x.mul(y))

# Elementwise division
print('\nElementwise division')
print(x / y)
print(torch.div(x, y))
print(x.div(y))

# Elementwise power
print('\nElementwise power')
print(x ** y)
print(torch.pow(x, y))
print(x.pow(y))

PyTorch also provides many standard mathematical functions; these are available both as functions in the `torch` module and as instance methods on tensors. You can find a full list of all available mathematical functions [in the documentation](https://pytorch.org/docs/stable/torch.html#pointwise-ops); many functions in the `torch` module have corresponding instance methods [on tensor objects](https://pytorch.org/docs/stable/tensors.html). For example:

In [None]:
x = torch.tensor([[1, 2, 3, 4]], dtype=torch.float32)

print('Square root:')
print(torch.sqrt(x))
print(x.sqrt())

print('\nTrig functions:')
print(torch.sin(x))
print(x.sin())
print(torch.cos(x))
print(x.cos())

### Reduction operations

So far we've seen basic arithmetic operations on tensors that operate elementwise. We may sometimes want to perform operations that aggregate over part or all of a tensor, such as a summation; these are called **reduction** operations.

Like the elementwise operations above, most reduction operations are available both as functions in the `torch` module and as instance methods on `tensor` objects.

The simplest reduction operation is summation. We can use the `.sum()` function to reduce either an entire tensor, or to reduce along only one dimension of the tensor using the `dim` argument:

In [None]:
x = torch.tensor([[1, 2, 3],
                  [4, 5, 6]], dtype=torch.float32)
print('Original tensor:')
print(x)

print('\nSum over entire tensor:')
print(torch.sum(x))
print(x.sum())

# Summing over dim=0 collapses the rows, giving the sum of each column:
print('\nSum of each column:')
print(torch.sum(x, dim=0))
print(x.sum(dim=0))

# Summing over dim=1 collapses the columns, giving the sum of each row:
print('\nSum of each row:')
print(torch.sum(x, dim=1))
print(x.sum(dim=1))

Other useful reduction operations include [`mean`](https://pytorch.org/docs/stable/torch.html#torch.mean), [`min`](https://pytorch.org/docs/stable/torch.html#torch.min), and [`max`](https://pytorch.org/docs/stable/torch.html#torch.max). You can find a full list of all available reduction operations [in the documentation](https://pytorch.org/docs/stable/torch.html#reduction-ops).

Some reduction operations return more than one value; for example `min` returns both the minimum value over the specified dimension, as well as the index where the minimum value occurs:

In [None]:
x = torch.tensor([[2, 4, 3, 5], [3, 3, 5, 2]], dtype=torch.float32)
print('Original tensor:')
print(x, x.shape)

# Finding the overall minimum only returns a single value
print('\nOverall minimum: ', x.min())

# Compute the minimum along each column; we get both the value and location:
# The minimum of the first column is 2, and it appears at index 0;
# the minimum of the second column is 3 and it appears at index 1; etc
col_min_vals, col_min_idxs = x.min(dim=0)
print('\nMinimum along each column:')
print('values:', col_min_vals)
print('idxs:', col_min_idxs)

Reduction operations *reduce* the rank of tensors: the dimension over which you perform the reduction will be removed from the shape of the output. If you pass `keepdim=True` to a reduction operation, the specified dimension will not be removed; the output tensor will instead have size 1 in that dimension.
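As a small illustration of the `keepdim` behaviour described above, the following sketch compares the output shapes with and without it. The tensor `x` is just an example; the last step relies on broadcasting, which is covered in the Broadcasting section.

```python
import torch

x = torch.tensor([[1, 2, 3],
                  [4, 5, 6]], dtype=torch.float32)  # shape (2, 3)

# Without keepdim, the reduced dimension disappears
row_sums = x.sum(dim=1)
print(row_sums.shape)       # torch.Size([2])

# With keepdim=True, the reduced dimension is kept with size 1
row_sums_kept = x.sum(dim=1, keepdim=True)
print(row_sums_kept.shape)  # torch.Size([2, 1])

# The kept dimension makes it easy to divide each row by its own sum
normalized = x / row_sums_kept
print(normalized.sum(dim=1))  # each row now sums to (approximately) 1
```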

### Matrix operations

Note that unlike MATLAB, `*` is elementwise multiplication, not matrix multiplication. PyTorch provides a number of linear algebra functions that compute different types of vector and matrix products. The most commonly used are:

- [`torch.dot`](https://docs.pytorch.org/docs/stable/generated/torch.dot.html): Computes inner product of vectors
- [`torch.mm`](https://docs.pytorch.org/docs/stable/generated/torch.mm.html): Computes matrix-matrix products
- [`torch.mv`](https://docs.pytorch.org/docs/stable/generated/torch.mv.html): Computes matrix-vector products
- [`torch.addmm`](https://docs.pytorch.org/docs/stable/generated/torch.addmm.html) / [`torch.addmv`](https://docs.pytorch.org/docs/stable/generated/torch.addmv.html): Computes matrix-matrix and matrix-vector multiplications plus a bias
- [`torch.bmm`](https://docs.pytorch.org/docs/stable/generated/torch.bmm.html) / [`torch.baddbmm`](https://docs.pytorch.org/docs/stable/generated/torch.baddbmm.html): Batched versions of `torch.mm` and `torch.addmm`, respectively
- [`torch.matmul`](https://docs.pytorch.org/docs/stable/generated/torch.matmul.html): General matrix product that performs different operations depending on the rank of the inputs; this is similar to `np.matmul` in numpy.
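The batched functions in the list above are the ones you meet most often in deep learning code, so here is a minimal sketch of how `torch.bmm` relates to `torch.mm` and `torch.matmul`. The shapes and the random seed are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
A = torch.randn(4, 2, 3)  # a batch of 4 matrices of shape (2, 3)
B = torch.randn(4, 3, 5)  # a batch of 4 matrices of shape (3, 5)

# torch.bmm multiplies the i-th matrix of A with the i-th matrix of B
out_bmm = torch.bmm(A, B)
print(out_bmm.shape)  # torch.Size([4, 2, 5])

# torch.matmul (and the @ operator) handles the same batched case
out_matmul = A @ B
print(torch.allclose(out_bmm, out_matmul, atol=1e-6))  # True

# The same result computed with an explicit loop over the batch and torch.mm
out_loop = torch.stack([torch.mm(A[i], B[i]) for i in range(A.shape[0])])
print(torch.allclose(out_bmm, out_loop, atol=1e-6))    # True
```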

Here is an example of using `torch.dot` to compute inner products. Like the other mathematical operators we've seen, most linear algebra operators are available both as functions in the `torch` module and as instance methods of tensors:

In [None]:
v = torch.tensor([9,10], dtype=torch.float32)
w = torch.tensor([11, 12], dtype=torch.float32)

# Inner product of vectors
print('Dot products:')
print(torch.dot(v, w))
print(v.dot(w))

# dot only works for vectors -- it will give an error for tensors of rank > 1
x = torch.tensor([[1,2],[3,4]], dtype=torch.float32)
y = torch.tensor([[5,6],[7,8]], dtype=torch.float32)
try:
  print(x.dot(y))
except RuntimeError as e:
  print(e)

# Instead we use mm for matrix-matrix products:
print('\nMatrix-matrix product:')
print(torch.mm(x, y))
print(x.mm(y))

With all the different linear algebra operators that PyTorch provides, there is usually more than one way to compute something. For example to compute matrix-vector products we can use `torch.mv`; we can reshape the vector to have rank 2 and use `torch.mm`; or we can use `torch.matmul`. All give the same results, but the outputs might have different ranks:

In [None]:
print('Here is x (rank 2):')
print(x)
print('\nHere is v (rank 1):')
print(v)

# Matrix-vector multiply with torch.mv produces a rank-1 output
print('\nMatrix-vector product with torch.mv (rank 1 output)')
print(torch.mv(x, v))
print(x.mv(v))

# We can reshape the vector to have rank 2 and use torch.mm to perform
# matrix-vector products, but the result will have rank 2
print('\nMatrix-vector product with torch.mm (rank 2 output)')
print(torch.mm(x, v.view(2, 1)))
print(x.mm(v.view(2, 1)))

print('\nMatrix-vector product with torch.matmul (rank 1 output)')
print(torch.matmul(x, v))
print(x.matmul(v))

### Your turn!

In [None]:
# Two 2x2 tensors are given to you!
a = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0],
                  [7.0, 8.0]])

# add 'a' and 'b' together, then multiply the result by 2. Expected: [[12., 16.], [20., 24.]]
# hint: You can use standard operators + and *
elementwise_result = None
# print("Elementwise Result:\n", elementwise_result)

# Calculate the sum of all elements in tensor 'a'. Expected: 10.0
total_sum = None
# print("\nTotal Sum of 'a':", total_sum)

# calculate the mean of 'b' along the rows (dimension 1). Expected: [5.5, 7.5]
row_means = None
# print("\nRow Means of 'b':", row_means)

# perform a matrix multiplication between 'a' and 'b'. Note: This is different from elementwise multiplication!
# Hint: Use @ or torch.matmul()
matmul_result = None
# print("\nMatrix Multiplication (a @ b):\n", matmul_result)

# the answers are given in the cell below.

In [None]:
#@title Solution
# Two 2x2 tensors are given to you!
a = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0],
                  [7.0, 8.0]])

# add 'a' and 'b' together, then multiply the result by 2.
# expected: [[12., 16.], [20., 24.]]
# hint: You can use standard operators + and *
elementwise_result = (a + b) * 2
print("Elementwise Result:\n", elementwise_result)

# Calculate the sum of all elements in tensor 'a'.
# Expected: 10.0
total_sum = a.sum()
print("\nTotal Sum of 'a':", total_sum)

# calculate the mean of 'b' along the rows (dimension 1).
# Expected: [5.5, 7.5]
row_means = b.mean(dim=1)
print("\nRow Means of 'b':", row_means)

# perform a matrix multiplication between 'a' and 'b'.
# Note: This is different from elementwise multiplication!
# Hint: Use @ or torch.matmul()
matmul_result = a @ b
print("\nMatrix Multiplication (a @ b):\n", matmul_result)


## Broadcasting

Broadcasting is a powerful mechanism that allows PyTorch to work with tensors of different shapes when performing arithmetic operations. Frequently we have a smaller tensor and a larger tensor, and we want to use the smaller tensor multiple times to perform some operation on the larger tensor.

For example, suppose that we want to add a constant vector to each row of a tensor. We could do it like this:


In [None]:
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = torch.tensor([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = torch.tensor([1, 0, 1])
y = torch.zeros_like(x)   # Create a zero-filled matrix with the same shape as x

# Add the vector v to each row of the matrix x with an explicit loop
for i in range(x.shape[0]):
    y[i, :] = x[i, :] + v
print(y)

This works; however when the tensor x is very large, computing an explicit loop in Python could be slow. Note that adding the vector v to each row of the tensor x is equivalent to forming a tensor vv by stacking multiple copies of v vertically, then performing elementwise summation of x and vv. We could implement this approach like this:


In [None]:
vv = v.repeat((4, 1))  # Stack 4 copies of v on top of each other
print(vv)              # Prints "tensor([[1, 0, 1],
                       #                 [1, 0, 1],
                       #                 [1, 0, 1],
                       #                 [1, 0, 1]])"

In [None]:
y = x + vv  # Add x and vv elementwise
print(y)

PyTorch broadcasting allows us to perform this computation without actually creating multiple copies of v. Consider this version, using broadcasting:

In [None]:
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = torch.tensor([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = torch.tensor([1, 0, 1])
y = x + v  # Add v to each row of x using broadcasting
print(y)

The line `y = x + v` works even though `x` has shape (4, 3) and `v` has shape (3,) due to broadcasting; this line works as if `v` actually had shape (4, 3), where each row was a copy of `v`, and the sum was performed elementwise.

Broadcasting two tensors together follows these rules:

1.   If the tensors do not have the same rank, prepend the shape of the lower rank tensor with 1s until both shapes have the same length.
2.   The two tensors are said to be *compatible* in a dimension if they have the same size in the dimension, or if one of the tensors has size 1 in that dimension.
3.   The tensors can be broadcast together if they are compatible in all dimensions.
4.   After broadcasting, each tensor behaves as if it had shape equal to the elementwise maximum of shapes of the two input tensors.
5.   In any dimension where one tensor had size 1 and the other tensor had size greater than 1, the first tensor behaves as if it were copied along that dimension.

If this explanation does not make sense, try reading the explanation from the [documentation](https://pytorch.org/docs/stable/notes/broadcasting.html).

Not all functions support broadcasting. You can find out which functions do not support broadcasting in the official docs (e.g. [`torch.mm`](https://docs.pytorch.org/docs/stable/generated/torch.mm.html) does not support broadcasting, but [`torch.matmul`](https://docs.pytorch.org/docs/stable/generated/torch.matmul.html) does).
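The broadcasting rules above can be checked without doing any arithmetic by using `torch.broadcast_shapes`, which returns the shape two (or more) shapes would broadcast to. The following is a small sketch; the example shapes are arbitrary.

```python
import torch

# Rule 1: shape (3,) is treated as (1, 3) when combined with shape (4, 3)
print(torch.broadcast_shapes((4, 3), (3,)))       # torch.Size([4, 3])

# Rules 2-4: sizes must match or be 1; the result takes the larger size
print(torch.broadcast_shapes((5, 1, 3), (4, 1)))  # torch.Size([5, 4, 3])

# Incompatible shapes: 3 vs 4 in the last dimension, and neither is 1
x = torch.ones(4, 3)
v = torch.ones(4)
try:
    x + v
    failed = False
except RuntimeError as e:
    failed = True
    print("Cannot broadcast:", e)

# Adding an explicit size-1 dimension makes the shapes compatible
print((x + v.view(4, 1)).shape)                   # torch.Size([4, 3])
```

Checking shapes this way is handy when debugging shape errors, since it separates "do these shapes broadcast?" from the actual computation.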

Broadcasting can let us easily implement many different operations. For example we can compute an outer product of vectors:

In [None]:
# Compute outer product of vectors
v = torch.tensor([1, 2, 3])  # v has shape (3,)
w = torch.tensor([4, 5])     # w has shape (2,)
# To compute an outer product, we first reshape v to be a column
# vector of shape (3, 1); we can then broadcast it against w to yield
# an output of shape (3, 2), which is the outer product of v and w:
print(v.view(3, 1) * w)

### Your turn!

In [None]:
# You are working on a Transformer model (like GPT) for a
# text generation task. You have the basic architecture sketched out, but
# the tensor manipulations for the attention mechanism are missing.
# Note: We use random tensors to simulate embeddings and token IDs.

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        # These project input vectors into Query, Key, and Value matrices
        # Note: don't worry if this feels too much at first, there will
        # be a lecture on transformers during MenaML!
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        # Let's compute our Q, K, V values, which have the same shape as x.
        # x shape: [batch_size (32), seq_len (10), embed_dim (512)]
        batch_size, seq_len, embed_dim = x.shape

        # We pass x through the linear layers to get Q, K, V
        q = self.q_proj(x)  # Shape: [32, 10, 512]
        k = self.k_proj(x)
        v = self.v_proj(x)

        # Task 1
        # To compute multi-head attention, we need to split the embed_dim
        # into 'num_heads' pieces and move the 'num_heads' dimension
        # so it acts like a batch dimension.
        # Currently we have:
        # [batch, seq, embed] -> target: [batch, heads, seq, head_dim]

        # first, separate the heads (reshape)
        q = None
        k = None

        # second, swap the axes so 'heads' comes before 'seq_len'
        # Hint: use .permute() to swap dim 1 and 2
        q = None
        k = None

        # Calculate raw attention scores (scaled dot product)
        # Shape will become: [batch, heads, seq_len, seq_len]
        scores = None

        # Task 2
        # In a GPT-style model, we must prevent the model from "cheating" by
        # looking at future words. We need a "Causal Mask" (an upper triangular matrix).
        # Create a mask where the upper triangle (future) is True, and the rest False.
        # Hint: torch.triu() creates a triangular matrix.
        # We use diagonal=1 to keep the current word visible.

        # Create a ones matrix of shape (seq_len, seq_len)
        ones_mat = None
        future_mask = torch.triu(ones_mat, diagonal=1).to(x.device)

        # Apply the mask: fill "True" positions with negative infinity
        scores = None

        return scores


def filter_padded_tokens(input_ids, embeddings):
    # Task 3
    # Real text data has variable lengths, so we pad short sentences with 0.
    # We want to calculate the average embedding for each sentence,
    # BUT we must strictly exclude the padding tokens (ID 0) from the average.
    # input_ids shape: [32, 10] (Batch, Seq)
    # embeddings shape: [32, 10, 512] (Batch, Seq, Dim)

    # Create a Boolean mask where input_ids are NOT zero (valid tokens)
    # Hint: simple comparison operator
    valid_mask = None  # Shape: [32, 10]

    # We need to expand this mask to match the embedding dimensions [32, 10, 512]
    # to zero-out the invalid vectors.
    # unsqueeze(2) makes it [32, 10, 1] so it broadcasts correctly.
    valid_mask_expanded = None

    # Zero out embeddings that correspond to padding
    clean_embeddings = None

    # Calculate mean, but be careful to divide by the REAL length, not 10.
    # Sum along sequence dimension
    sum_embeddings = None

    # Count valid tokens per sentence
    real_lengths = None

    # Create the sentence embedding by dividing sum_embeddings by the
    # real length.
    # Avoid division by zero by using tensor.clamp(min=1)
    sentence_embeddings = sum_embeddings / real_lengths.clamp(min=1)

    return sentence_embeddings


# # Uncomment the code below to test your implementations
# # ==========================================
# # Execution Loop.
# # Simulate a batch of 32 sentences, max length 10, embedding size 512
# batch_size = 32
# seq_len = 10
# embed_dim = 512

# # Task 1 and 2 Check
# model = MultiHeadSelfAttention(embed_dim=embed_dim, num_heads=8)
# dummy_input = torch.randn(batch_size, seq_len, embed_dim)

# attn_scores = model(dummy_input)

# print(f"Attention Scores Shape: {attn_scores.shape}")
# # Expected: [32, 8, 10, 10] (Batch, Heads, Seq (Query), Seq (Key))
# assert attn_scores.shape == (32, 8, 10, 10), f"Task 1 Error: Expected (32, 8, 10, 10), got {attn_scores.shape}"

# # Verify Causal Mask (Task 2)
# # The top right corner of the first head's first batch item should be -inf
# top_right_val = attn_scores[0, 0, 0, 9].item()
# print(f"Top-right value (should be -inf): {top_right_val}")
# assert top_right_val == float('-inf'), "Task 2 Error: Future positions were not masked with -inf"

# # Task 3 Check
# # Create dummy IDs with some zeros (padding) at the end
# fake_ids = torch.randint(1, 1000, (32, 10))
# fake_ids[:, -3:] = 0  # Last 3 tokens are always pad for this test
# fake_embeddings = torch.randn(32, 10, 512)

# sent_embeds = filter_padded_tokens(fake_ids, fake_embeddings)

# print(f"Sentence Embeddings Shape: {sent_embeds.shape}")
# assert sent_embeds.shape == (32, 512), f"Task 3 Error: Expected (32, 512), got {sent_embeds.shape}"

In [None]:
#@title Solution
# You are working on a Transformer model (like GPT) for a
# text generation task. You have the basic architecture sketched out, but
# the tensor manipulations for the attention mechanism are missing.
# Note: We use random tensors to simulate embeddings and token IDs.

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        # These project input vectors into Query, Key, and Value matrices
        # Note: don't worry if this feels too much at first, there will
        # be a lecture on transformers during MenaML!
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        # Let's compute our Q, K, V values, which have the same shape as x.
        # x shape: [batch_size (32), seq_len (10), embed_dim (512)]
        batch_size, seq_len, embed_dim = x.shape

        # We pass x through the linear layers to get Q, K, V
        q = self.q_proj(x)  # Shape: [32, 10, 512]
        k = self.k_proj(x)
        v = self.v_proj(x)

        # Task 1
        # To compute multi-head attention, we need to split the embed_dim
        # into 'num_heads' pieces and move the 'num_heads' dimension
        # so it acts like a batch dimension.
        # Currently we have:
        # [batch, seq, embed] -> target: [batch, heads, seq, head_dim]

        # first, separate the heads (reshape)
        q = q.view(batch_size, seq_len, self.num_heads, self.head_dim)
        k = k.view(batch_size, seq_len, self.num_heads, self.head_dim)

        # second, swap the axes so 'heads' comes before 'seq_len'
        # Hint: use .permute() to swap dim 1 and 2
        q = q.permute(0, 2, 1, 3)
        k = k.permute(0, 2, 1, 3)

        # Calculate raw attention scores (scaled dot product)
        # Shape will become: [batch, heads, seq_len, seq_len]
        scores = (q @ k.transpose(-2, -1)) / (self.head_dim ** 0.5)

        # Task 2
        # In a GPT-style model, we must prevent the model from "cheating" by
        # looking at future words. We need a "Causal Mask" (an upper triangular matrix).
        # Create a mask where the upper triangle (future) is True, and the rest False.
        # Hint: torch.triu() creates a triangular matrix.
        # We use diagonal=1 to keep the current word visible.

        # Create a ones matrix of shape (seq_len, seq_len)
        ones_mat = torch.ones(seq_len, seq_len, dtype=torch.bool)
        future_mask = torch.triu(ones_mat, diagonal=1).to(x.device)

        # Apply the mask: fill "True" positions with negative infinity
        scores = scores.masked_fill(future_mask, float('-inf'))

        return scores


def filter_padded_tokens(input_ids, embeddings):
    # Task 3
    # Real text data has variable lengths, so we pad short sentences with 0.
    # We want to calculate the average embedding for each sentence,
    # BUT we must strictly exclude the padding tokens (ID 0) from the average.
    # input_ids shape: [32, 10] (Batch, Seq)
    # embeddings shape: [32, 10, 512] (Batch, Seq, Dim)

    # Create a Boolean mask where input_ids are NOT zero (valid tokens)
    # Hint: simple comparison operator
    valid_mask = (input_ids != 0)  # Shape: [32, 10]

    # We need to expand this mask to match the embedding dimensions [32, 10, 512]
    # to zero-out the invalid vectors.
    # unsqueeze(2) makes it [32, 10, 1] so it broadcasts correctly.
    valid_mask_expanded = valid_mask.unsqueeze(2)

    # Zero out embeddings that correspond to padding
    clean_embeddings = embeddings * valid_mask_expanded.float()

    # Calculate mean, but be careful to divide by the REAL length, not 10.
    # Sum along sequence dimension
    sum_embeddings = clean_embeddings.sum(dim=1)

    # Count valid tokens per sentence
    real_lengths = valid_mask.sum(dim=1, keepdim=True)

    # Avoid division by zero
    sentence_embeddings = sum_embeddings / real_lengths.clamp(min=1)

    return sentence_embeddings


# ==========================================
# Execution Loop.
# Simulate a batch of 32 sentences, max length 10, embedding size 512
batch_size = 32
seq_len = 10
embed_dim = 512

# Task 1 & 2 Check
model = MultiHeadSelfAttention(embed_dim=embed_dim, num_heads=8)
dummy_input = torch.randn(batch_size, seq_len, embed_dim)

attn_scores = model(dummy_input)

print(f"Attention Scores Shape: {attn_scores.shape}")
# Expected: [32, 8, 10, 10] (Batch, Heads, Seq (Query), Seq (Key))
assert attn_scores.shape == (32, 8, 10, 10), f"Task 1 Error: Expected (32, 8, 10, 10), got {attn_scores.shape}"

# Verify Causal Mask (Task 2)
# The top right corner of the first head's first batch item should be -inf
top_right_val = attn_scores[0, 0, 0, 9].item()
print(f"Top-right value (should be -inf): {top_right_val}")
assert top_right_val == float('-inf'), "Task 2 Error: Future positions were not masked with -inf"

# Task 3 Check
# Create dummy IDs with some zeros (padding) at the end
fake_ids = torch.randint(1, 1000, (32, 10))
fake_ids[:, -3:] = 0  # Last 3 tokens are always pad for this test
fake_embeddings = torch.randn(32, 10, 512)

sent_embeds = filter_padded_tokens(fake_ids, fake_embeddings)

print(f"Sentence Embeddings Shape: {sent_embeds.shape}")
assert sent_embeds.shape == (32, 512), f"Task 3 Error: Expected (32, 512), got {sent_embeds.shape}"


## Running on GPU

One of the most important features of PyTorch is that it can use graphics processing units (GPUs) to accelerate its tensor operations.

Tensors can be moved onto any device using the `.to()` method.

We can easily check whether PyTorch is configured to use GPUs:

In [None]:
import torch

if torch.cuda.is_available():
  print('PyTorch can use GPUs!')
else:
  print('PyTorch cannot use GPUs.')

You can enable GPUs in Colab via Runtime -> Change Runtime Type -> Hardware Accelerator -> GPU.

This may cause the Colab runtime to restart, in which case you will need to re-run the earlier cells (including `import torch`).

We have already seen that PyTorch tensors have a `dtype` attribute specifying their datatype. All PyTorch tensors also have a `device` attribute that specifies the device where the tensor is stored -- either CPU, or CUDA (for NVIDIA GPUs). A tensor on a CUDA device will automatically use that device to accelerate all of its operations.

Just as with datatypes, we can use the `.to()` method to change the device of a tensor. We can also use the convenience methods `.cuda()` and `.cpu()` to move tensors between CPU and GPU.
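Code that hard-codes `'cuda'` fails on a machine without a GPU. A common pattern, sketched here with example tensors, is to pick the device once and pass it around, so the same code runs on either CPU or GPU:

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
x_dev = x.to(device)                     # returns x itself if already on that device
y_dev = torch.ones(2, 2, device=device)  # created directly on the chosen device

# Both operands live on the same device, so this works on CPU and GPU
z = x_dev + y_dev
print('z device:', z.device)
print(z.cpu())
```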

In [None]:
# Construct a tensor on the CPU
x0 = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
print('x0 device:', x0.device)

# Move it to the GPU using .to()
x1 = x0.to('cuda')
print('x1 device:', x1.device)

# Move it to the GPU using .cuda()
x2 = x0.cuda()
print('x2 device:', x2.device)

# Move it back to the CPU using .to()
x3 = x1.to('cpu')
print('x3 device:', x3.device)

# Move it back to the CPU using .cpu()
x4 = x2.cpu()
print('x4 device:', x4.device)

# We can construct tensors directly on the GPU as well
y = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float64, device='cuda')
print('y device / dtype:', y.device, y.dtype)

# Calling x.to(y) where y is a tensor will return a copy of x with the same
# device and dtype as y
x5 = x0.to(y)
print('x5 device / dtype:', x5.device, x5.dtype)

Performing large tensor operations on a GPU can be **a lot faster** than running the equivalent operation on CPU.

Here we compare the speed of adding two tensors of shape (10000, 10000) on CPU and GPU:

(Note that GPU code may run asynchronously with CPU code, so when timing the speed of operations on the GPU it is important to use `torch.cuda.synchronize` to synchronize the CPU and GPU.)

In [None]:
import time

a_cpu = torch.randn(10000, 10000, dtype=torch.float32)
b_cpu = torch.randn(10000, 10000, dtype=torch.float32)

a_gpu = a_cpu.cuda()
b_gpu = b_cpu.cuda()
torch.cuda.synchronize()

t0 = time.time()
c_cpu = a_cpu + b_cpu
t1 = time.time()
c_gpu = a_gpu + b_gpu
torch.cuda.synchronize()
t2 = time.time()

# Check that they computed the same thing
diff = (c_gpu.cpu() - c_cpu).abs().max().item()
print('Max difference between c_gpu and c_cpu:', diff)

cpu_time = 1000.0 * (t1 - t0)
gpu_time = 1000.0 * (t2 - t1)
print('CPU time: %.2f ms' % cpu_time)
print('GPU time: %.2f ms' % gpu_time)
print('GPU speedup: %.2f x' % (cpu_time / gpu_time))

You should see that running the same computation on the GPU was more than 30 times faster than on the CPU! Due to the massive speedups that GPUs offer, we will use GPUs to accelerate much of our machine learning code starting in Assignment 2.
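
If you want to repeat this kind of measurement, a small timing helper can be handy. The sketch below is one possible way to write it: `time_op` is a hypothetical helper (not part of PyTorch) that does one warm-up run, synchronizes only when CUDA is used, and averages over a few repetitions.

```python
import time
import torch

def time_op(fn, device, repeats=5):
    """Hypothetical helper: average wall-clock time of fn() in milliseconds."""
    fn()  # warm-up run, excluded from the measurement
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for pending GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the GPU to finish before stopping the clock
    return 1000.0 * (time.perf_counter() - start) / repeats

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.randn(1000, 1000, device=device)
b = torch.randn(1000, 1000, device=device)

elapsed_ms = time_op(lambda: a + b, device)
print(f"{device.type} add: {elapsed_ms:.3f} ms")
```

The tensors here are smaller than in the cell above so that the sketch also runs quickly on a CPU-only machine.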



# A simple network: MLP

In theory, we could build our own neural network from tensors and tensor operations alone: specify all parameters (weights, filters, bias vectors, etc.), ask PyTorch to calculate the gradients, and then adjust the parameters (perform the learning). However, this quickly becomes unmanageable if we have a lot of parameters and a complicated network architecture.
Fortunately, all the heavy lifting has already been done, and we can use `torch.nn` with all its functionality.
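
To make the "by hand" approach concrete, here is a minimal sketch that fits a single linear neuron using only tensors and autograd. The toy data, the learning rate, and the number of steps are made-up values for illustration.

```python
import torch

torch.manual_seed(0)

# Toy regression data generated from a known linear rule
x = torch.randn(16, 2)
y = x @ torch.tensor([[2.0], [-1.0]]) + 0.5

# Parameters that we manage ourselves
w = torch.zeros(2, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1

losses = []
for _ in range(50):
    pred = x @ w + b
    loss = ((pred - y) ** 2).mean()  # mean squared error
    loss.backward()                  # autograd fills w.grad and b.grad
    with torch.no_grad():            # update the parameters without tracking the update
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()               # reset the gradients for the next step
        b.grad.zero_()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Even for two weights and one bias we already need to track, update, and reset every parameter manually, which is the bookkeeping that `torch.nn` and `torch.optim` take over.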

## XOR


XOR is a digital logic gate (or function) that outputs true when the number of true inputs is odd. The example became famous because a single neuron, i.e. a linear classifier, cannot learn this simple function. Hence, we will learn how to build a small neural network that can learn it. To make it a little more interesting, we move XOR into continuous space and add some Gaussian noise to the binary inputs. Our desired separation of an XOR dataset could look as follows:

<img src="https://github.com/ignaczgerg/training-open-resources/blob/9e037b339d9315a9164115a840128050a5f0af67/images/noisy_xor_2d_gaussian.png?raw=true" alt="xor visualization" width="350"/>
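
As a small illustration of why a linear model struggles here, the sketch below builds the four noise-free XOR points and computes the best least-squares linear fit. The least-squares fit is only a stand-in for "a linear model" (it is not how a single neuron would be trained), but it shows the same limitation.

```python
import torch

# The four noise-free XOR inputs and their labels
X = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = (X.sum(dim=1) == 1).float()  # 1 if exactly one input is 1

# Best least-squares linear fit y ~ X w + b (the column of ones models the bias b)
A = torch.cat([X, torch.ones(4, 1)], dim=1)
solution = torch.linalg.lstsq(A, y.unsqueeze(1)).solution
linear_preds = (A @ solution).squeeze(1)

print("labels:            ", y.tolist())
print("linear predictions:", linear_preds.tolist())
```

The linear fit predicts roughly 0.5 for all four points, so it carries no information about the class.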

## The model


`torch.nn` defines lots of useful classes such as linear layers, activation functions, loss functions etc.

In [None]:
# let's import torch.nn
import torch # we have already imported this; it is left here as a reminder
import torch.nn as nn

In addition to `torch.nn`, there is also `torch.nn.functional`. It contains the functions that are used in network layers. This is in contrast to `torch.nn`, which defines them as `nn.Module` classes (more on this below); `torch.nn` actually uses a lot of functionality from `torch.nn.functional`.

In [None]:
import torch.nn.functional as F
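
The difference between the two styles can be sketched as follows: the module version owns its parameters, while the functional version receives them as arguments. The layer sizes are arbitrary example values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(3, 2)

# Module style: the layer object stores its own weight and bias
linear = nn.Linear(2, 4)
relu = nn.ReLU()
out_module = relu(linear(x))

# Functional style: the same computation with the parameters passed explicitly
out_functional = F.relu(F.linear(x, linear.weight, linear.bias))

print("Same result:", torch.allclose(out_module, out_functional))
```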

### nn.Module

In [None]:
class MyModule(nn.Module):

    # we always need an __init__()
    def __init__(self, num_inputs, num_hidden, num_outputs):
        super().__init__()
        # some init for my module, we will define it later

    # and we always need a forward
    def forward(self, x):
        # function for performing the calculation of the module. We will define them later.
        pass

The `forward` function is where we define the computation. It is executed when we call the module object (`net = MyModule(...)`; `net(x)`). In the `__init__` function, we usually create all the parameters, using `nn.Parameter`, or define other modules that are used in the forward function.
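
A minimal sketch of both ideas: `Scale` is a hypothetical toy module (not part of PyTorch) with a single `nn.Parameter`, and calling the object runs its `forward`.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Hypothetical toy module: multiplies its input by a learnable scalar."""

    def __init__(self, init_value=2.0):
        super().__init__()
        # nn.Parameter registers the tensor as a trainable parameter of the module
        self.factor = nn.Parameter(torch.tensor(init_value))

    def forward(self, x):
        return self.factor * x

net = Scale()
x = torch.tensor([1.0, 2.0, 3.0])
out = net(x)  # calling the object runs forward(x)

print(out)
print([name for name, _ in net.named_parameters()])
```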

We can now make use of the pre-defined modules in the `torch.nn` package. We define a small neural network with one input layer, one hidden layer with a tanh activation function, and an output layer.

The x1 and x2 represent the input coordinates of the data points. The hidden layer with the `tanh` function comes next, followed by the output layer in yellow.

<img src="https://github.com/ignaczgerg/training-open-resources/blob/9e037b339d9315a9164115a840128050a5f0af67/images/mlp.png?raw=true" alt="scalar-vector-matrix-tensor" width="350"/>


Reminder, we need to initialize:
- a linear layer (nn.Linear)
- an activation function (nn.Tanh)
- lastly, another linear layer

The forward should accept the `x` variable and perform the calculations using the modules defined in the `__init__`.

In [None]:
# define the layer structure like the image above
class SimpleClassifier(nn.Module):

    def __init__(self, num_inputs, num_hidden, num_outputs):
        super().__init__()
        # Initialize the modules we need to build the network
        self.linear1 = nn.Linear(num_inputs, num_hidden)
        self.act_fn = nn.Tanh()
        self.linear2 = nn.Linear(num_hidden, num_outputs)

    def forward(self, x):
        # Perform the calculation of the model to determine the prediction
        x = self.linear1(x)
        x = self.act_fn(x)
        x = self.linear2(x)
        return x

We will use a tiny neural network with two input neurons and four hidden neurons (matching the four regions of the classification plane). As we perform binary classification, we will use a single output neuron. Note that we do not apply a sigmoid on the output (yet). This is because other functions, especially the loss, are more efficient and numerically more precise when calculated on the raw outputs (logits) instead of the sigmoid output.

In [None]:
# Initialize SimpleClassifier
size_of_hidden_neurons = 4  # four hidden neurons, as described above
model = SimpleClassifier(num_inputs=2, num_hidden=size_of_hidden_neurons, num_outputs=1)
# Printing a module shows all its submodules
print(model)

In [None]:
# inspect the parameters
for name, param in model.named_parameters():
    print(f"Parameter {name}, shape {param.shape}")
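
The shapes printed above also tell us how many trainable numbers the model has. The sketch below counts them for different hidden sizes; `make_mlp` is a hypothetical stand-in built with `nn.Sequential` that has the same layer shapes as `SimpleClassifier`.

```python
import torch.nn as nn

def count_parameters(module):
    """Total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())

def make_mlp(num_hidden):
    # Stand-in with the same layer shapes as SimpleClassifier (2 inputs, 1 output)
    return nn.Sequential(
        nn.Linear(2, num_hidden),
        nn.Tanh(),
        nn.Linear(num_hidden, 1),
    )

# (2 * h weights + h biases) + (h weights + 1 bias) = 4 * h + 1
counts = {h: count_parameters(make_mlp(h)) for h in (1, 2, 3, 4)}
print(counts)
```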

## The data

In [None]:
import torch.utils.data as data

The dataset class summarizes the basic functionality of a dataset in a natural way. To define a dataset in PyTorch, we simply specify two functions: `__getitem__`, and `__len__`. The get-item function has to return the n-th data point in the dataset, while the len function returns the size of the dataset. For the XOR dataset, we can define the dataset class as follows:

In [None]:
# This data generator is defined for you (now!)
class XORDataset(data.Dataset):
    def __init__(self, size, std=0.1):
        """
        Inputs:
            size - Number of data points we want to generate
            std - Standard deviation of the noise (see generate_continuous_xor function)
        """
        super().__init__()
        self.size = size
        self.std = std # we need this to add the small noise
        self.generate_continuous_xor()
        torch.manual_seed(1)  # note: this seeds the RNG for code that runs after the data above was generated

    def generate_continuous_xor(self):
        # Each data point in the XOR dataset has two variables, x and y, that can be either 0 or 1
        # The label is their XOR combination, i.e. 1 if only x or only y is 1 while the other is 0.
        # If x=y, the label is 0.
        data = torch.randint(low=0, high=2, size=(self.size, 2), dtype=torch.float32)
        label = (data.sum(dim=1) == 1).to(torch.long)
        # To make it slightly more challenging, we add a bit of gaussian noise to the data points.
        data += self.std * torch.randn(data.shape)

        self.data = data
        self.label = label

    def __len__(self):
        # Number of data point we have. Alternatively self.data.shape[0], or self.label.shape[0]
        return self.size

    def __getitem__(self, idx):
        # Return the idx-th data point of the dataset
        # If we have multiple things to return (data point and label), we can return them as tuple
        data_point = self.data[idx]
        data_label = self.label[idx]
        return data_point, data_label
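
When the data already lives in tensors, writing a custom class is not strictly necessary: `torch.utils.data.TensorDataset` offers the same `__len__` / `__getitem__` interface. A minimal sketch with the four noise-free XOR points (the variable names are illustrative):

```python
import torch
import torch.utils.data as data

inputs = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
labels = (inputs.sum(dim=1) == 1).long()

# TensorDataset pairs up the i-th rows of all tensors it is given
toy_dataset = data.TensorDataset(inputs, labels)

print("Size of dataset:", len(toy_dataset))
print("Data point 1:", toy_dataset[1])
```

A custom class, like `XORDataset` above, is still the better fit when the data has to be generated or loaded on demand.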

In [None]:
# let's inspect
dataset = XORDataset(size=200)
print("Size of dataset:", len(dataset))
print("Data point 0:", dataset[0])

In [None]:
# visualize the dataset
import matplotlib.pyplot as plt
def visualize_samples(data, label):
    if isinstance(data, torch.Tensor):
        data = data.cpu().numpy()
    if isinstance(label, torch.Tensor):
        label = label.cpu().numpy()
    data_0 = data[label == 0]
    data_1 = data[label == 1]

    plt.figure(figsize=(3,3))
    plt.scatter(data_0[:,0], data_0[:,1], edgecolor=None, label="Class 0", s=12)
    plt.scatter(data_1[:,0], data_1[:,1], edgecolor=None, label="Class 1", s=12)
    plt.ylabel(r"$x_2$")
    plt.xlabel(r"$x_1$")
    plt.legend()

visualize_samples(dataset.data, dataset.label)
plt.show()

### Dataloader

The class `torch.utils.data.DataLoader` represents a Python iterable over a dataset with support for automatic batching, multi-process data loading and many more features. The data loader communicates with the dataset using the function `__getitem__`, and stacks its outputs as tensors over the first dimension to form a batch. In contrast to the dataset class, we usually don’t have to define our own data loader class, but can just create an object of it with the dataset as input.

Let's add some arguments:

- `batch_size`: number of samples to stack per batch
- `shuffle`: if `True`, the data is returned in random order
- `num_workers`: number of parallel subprocesses to use for data loading. Default is 0. Large objects (high-resolution images) benefit from more workers.
- `pin_memory`: if `True`, the dataloader will copy tensors into CUDA pinned memory before returning them. This can save some time for large objects on the GPU. It is usually good practice for the training set, but not necessarily for the validation and test sets.
- `drop_last`: if `True`, the last batch is dropped in case it is smaller than the specified batch size. This occurs when the dataset size is not an integer multiple of the `batch_size`.
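
The effect of `batch_size` and `drop_last` can be checked on a tiny example. The sketch below uses a made-up dataset of 10 samples and a batch size of 4.

```python
import torch
import torch.utils.data as data

# 10 samples with one feature each
toy = data.TensorDataset(torch.arange(10).float().unsqueeze(1))

keep_last = data.DataLoader(toy, batch_size=4, shuffle=False, drop_last=False)
drop_last = data.DataLoader(toy, batch_size=4, shuffle=False, drop_last=True)

# Each batch is a list with one tensor, because the dataset holds one tensor
sizes_keep = [batch[0].shape[0] for batch in keep_last]
sizes_drop = [batch[0].shape[0] for batch in drop_last]

print("drop_last=False:", sizes_keep)
print("drop_last=True: ", sizes_drop)
```

With `drop_last=False` the loader yields batches of 4, 4 and 2 samples; with `drop_last=True` the incomplete batch of 2 is skipped.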


In [None]:
data_loader = data.DataLoader(dataset,
                              batch_size=8,
                              shuffle=True,
                              num_workers=0,
                              pin_memory=True,
                              drop_last=True
                              )

In [None]:
# next(iter(...)) fetches the first batch of the data loader
# If shuffle is True, this will return a different batch every time we run this cell
# For iterating over the whole dataset, we can simply use "for batch in data_loader: ..."

# task: get the "next" batch from the dataloader using Python's built-in `iter` and `next`
data_inputs, data_labels = next(iter(data_loader))

# The shape of the outputs are [batch_size, d_1,...,d_N] where d_1,...,d_N are the
# dimensions of the data point returned from the dataset class

# Inspect the shape of `data_inputs` and `data_labels`
print("Data inputs", data_inputs.shape, "\n", data_inputs)
print("Data labels", data_labels.shape, "\n", data_labels)

## Optimization

We will do the following steps:
1. Get a batch from the dataloader.
2. Obtain the predictions from the model for the batch using the weights that we have.
3. Calculate the loss based on the difference between predictions and the ground truth labels.
4. Backpropagation: calculate the gradients of the loss with respect to every parameter.
5. Update the parameters (weights) of the model in the direction opposite to the gradients, so that the loss decreases.
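
The five steps can be sketched for a single batch as follows. This is only an illustration under simple assumptions: a toy linear model, a hand-written batch instead of a dataloader, and a learning rate of 0.1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Linear(2, 1)
toy_loss = nn.BCEWithLogitsLoss()
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

# Step 1: a batch (written by hand here instead of coming from a dataloader)
inputs = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
labels = torch.tensor([0.0, 1.0, 1.0, 0.0])

before = toy_model.weight.detach().clone()

logits = toy_model(inputs).squeeze(dim=1)  # Step 2: predictions
loss = toy_loss(logits, labels)            # Step 3: loss
toy_optimizer.zero_grad()                  # clear gradients from earlier steps
loss.backward()                            # Step 4: backpropagation
toy_optimizer.step()                       # Step 5: parameter update

after = toy_model.weight.detach().clone()
print("loss:", loss.item())
print("weight change:", after - before)
```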

## Loss function

We will use the binary cross entropy loss (BCE):
$$
{L}_{B C E}=-\sum_i\left[y_i \log x_i+\left(1-y_i\right) \log \left(1-x_i\right)\right]
$$

where $y_i$ are the labels (ground truth) and $x_i$ are the predicted probabilities.

PyTorch has a lot of built-in loss functions, such as `nn.BCELoss` and `nn.BCEWithLogitsLoss` (which already includes the sigmoid).
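
The relation between the two loss modules can be checked on a few made-up numbers. Note that both modules average over the batch by default (`reduction='mean'`), whereas the formula above is written as a sum.

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5, -3.0])  # raw model outputs (example values)
targets = torch.tensor([1.0, 0.0, 0.0, 1.0])

# Sigmoid built into the loss
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Sigmoid applied by hand, followed by the plain BCE loss
probs = torch.sigmoid(logits)
loss_after_sigmoid = nn.BCELoss()(probs, targets)

# The BCE formula written out, averaged over the batch
manual = -(targets * probs.log() + (1 - targets) * (1 - probs).log()).mean()

print(loss_with_logits.item(), loss_after_sigmoid.item(), manual.item())
```

All three values agree up to floating point error; `nn.BCEWithLogitsLoss` is generally preferred because combining the sigmoid with the logarithm is numerically more stable.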


In [None]:
# Define the `loss_module` using nn.BCEWithLogitsLoss
loss_module = nn.BCEWithLogitsLoss()

For updating the parameters, PyTorch provides the package `torch.optim`, which implements the most popular optimizers. Stochastic Gradient Descent (`torch.optim.SGD`) updates parameters by multiplying the gradients with a small constant, called the learning rate, and subtracting the result from the parameters (hence minimizing the loss). Therefore, we slowly move in the direction that minimizes the loss. A good default value of the learning rate for a small network like ours is 0.1.

In [None]:
# Let's set up the optimizer. Training will take only a few seconds, even on a slow CPU.
# Input to the optimizer are the parameters of the model: model.parameters()

# define the optimizer using `torch.optim.SGD` with a learning rate (lr) of 0.1 and pass model.parameters() as an argument also!
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
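
The SGD update rule described above (parameter minus learning rate times gradient) can be verified by hand on a toy parameter. The values below are made up for illustration.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
sgd = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()  # the gradient of this loss is 2 * w
loss.backward()

# Update computed by hand: w - lr * grad
expected = w.detach() - 0.1 * w.grad

sgd.step()  # update performed by the optimizer

print("optimizer:", w.detach())
print("by hand:  ", expected)
```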

The optimizer provides two useful functions: `optimizer.step()`, and `optimizer.zero_grad()`. The step function updates the parameters based on the gradients as explained above. The function `optimizer.zero_grad()` sets the gradients of all parameters to zero. While this function seems less relevant at first, it is a crucial pre-step before performing backpropagation. If we call the backward function on the loss while the parameter gradients are non-zero from the previous batch, the new gradients would actually be added to the previous ones instead of overwriting them. This is done because a parameter might occur multiple times in a computation graph, and we need to sum the gradients in this case instead of replacing them. Hence, remember to call `optimizer.zero_grad()` before calculating the gradients of a batch.
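
The accumulation behaviour is easy to reproduce. In this small sketch (toy parameter, made-up values), calling backward twice without `zero_grad()` doubles the stored gradient:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

(w ** 2).sum().backward()
first = w.grad.clone()        # gradient is 2 * w

(w ** 2).sum().backward()     # second backward without clearing the gradients
accumulated = w.grad.clone()  # the new gradient was added to the old one

opt.zero_grad()               # clear the gradients
(w ** 2).sum().backward()
fresh = w.grad.clone()        # back to 2 * w

print("first:      ", first)
print("accumulated:", accumulated)
print("fresh:      ", fresh)
```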

## Training

In [None]:
# Initialize the training dataset with 2500 points!
train_dataset = XORDataset(size=2500)

# initialize the dataloader with this new train dataset, use a batch size of 128 with shuffle enabled.
train_data_loader = data.DataLoader(train_dataset, batch_size=128, shuffle=True)

# set the device to cuda if a GPU is available, otherwise to cpu (this has been done for you already)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

We set our model to training mode (`model.train()`) at the start of the training function:

In [None]:
def train_model(model, optimizer, data_loader, loss_module, num_epochs=50):
    # Set model to train mode
    model.train()

    # Training loop
    losses = []
    for epoch in range(num_epochs):
        for data_inputs, data_labels in data_loader:
            # Step 0: Move input data to device (only strictly necessary if we use GPU)
            data_inputs = data_inputs.to(device)
            data_labels = data_labels.to(device)

            # Step 1: Run the model on the input data
            preds = model(data_inputs)
            preds = preds.squeeze(dim=1) # Output is [Batch size, 1], but we want [Batch size]

            # Step 2: Calculate the loss
            loss = loss_module(preds, data_labels.float())

            ## Step 3: Perform backpropagation
            # Before calculating the gradients, we need to ensure that they are all zero.
            # The gradients would not be overwritten, but actually added to the existing ones.
            optimizer.zero_grad()
            # Perform backpropagation
            loss.backward()

            # Step 4: Update the parameters
            optimizer.step()

        # we append the loss of the last batch of this epoch to losses, so we can inspect it later.
        losses.append(loss.item())

        # print the loss at every fifth epoch
        if epoch % 5 == 0:
            print(f"Epoch: {epoch} | Loss: {loss.item():.5f}")
    return losses


And finally:

In [None]:
# call training
losses = train_model(model, optimizer, train_data_loader, loss_module, num_epochs=20)

In [None]:
# We can visualize how the loss decreases over the epochs
plt.scatter(range(len(losses)), losses)
plt.xlabel("Number of epochs")
plt.ylabel("Loss")
plt.show()

## Saving a model

In [None]:
# inspect
state_dict = model.state_dict()
print(state_dict)

In [None]:
# torch.save(object, filename). For the filename, any extension can be used
torch.save(state_dict, "xor_classifier.pt")

## Loading the model

In [None]:
# Load state dict from the disk (make sure it is the same name as above)
state_dict = torch.load("xor_classifier.pt")

# Create a new model and load the state
new_model = SimpleClassifier(num_inputs=2, num_hidden=size_of_hidden_neurons, num_outputs=1)
new_model.load_state_dict(state_dict)

# Verify that the parameters are the same
print("Original model\n", model.state_dict())
print("\nLoaded model\n", new_model.state_dict())
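
Comparing two printed state dicts by eye is error-prone. A programmatic check with `torch.equal` could look like the sketch below; it uses a toy `nn.Linear` model and an in-memory buffer instead of a file, purely to keep the example self-contained.

```python
import io
import torch
import torch.nn as nn

toy_model = nn.Linear(2, 1)

# Save the state dict into an in-memory buffer (a filename works the same way)
buffer = io.BytesIO()
torch.save(toy_model.state_dict(), buffer)
buffer.seek(0)

# Create a fresh model and load the saved state into it
restored = nn.Linear(2, 1)
restored.load_state_dict(torch.load(buffer))

original_state = toy_model.state_dict()
restored_state = restored.state_dict()
all_equal = all(
    torch.equal(original_state[key], restored_state[key]) for key in original_state
)
print("All parameters identical:", all_equal)
```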

## Evaluation

In [None]:
# create a test dataset

test_dataset = XORDataset(size=500)
# drop_last -> Don't drop the last batch although it is smaller than 128
test_data_loader = data.DataLoader(test_dataset, batch_size=128, shuffle=False, drop_last=False)


We will use accuracy as our metric; it is simple enough for now.
$$
acc:=\frac{T P+T N}{T P+T N+F P+F N}
$$
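
As a worked example of this formula, the sketch below counts the four cases for eight made-up predictions and compares the result with the shortcut "fraction of matching labels".

```python
import torch

pred_labels = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0])  # example predictions
true_labels = torch.tensor([1, 0, 0, 1, 0, 1, 1, 0])  # example ground truth

tp = ((pred_labels == 1) & (true_labels == 1)).sum().item()  # true positives
tn = ((pred_labels == 0) & (true_labels == 0)).sum().item()  # true negatives
fp = ((pred_labels == 1) & (true_labels == 0)).sum().item()  # false positives
fn = ((pred_labels == 0) & (true_labels == 1)).sum().item()  # false negatives

acc = (tp + tn) / (tp + tn + fp + fn)

# Shortcut: fraction of predictions that match the labels
shortcut = (pred_labels == true_labels).float().mean().item()

print(f"TP={tp} TN={tn} FP={fp} FN={fn} accuracy={acc:.2f}")
```

Here 6 of the 8 predictions are correct (3 true positives and 3 true negatives), giving an accuracy of 0.75.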

In [None]:
def eval_model(model, data_loader):
    model.eval() # Set model to eval mode
    true_preds, num_preds = 0., 0.

    with torch.no_grad(): # Deactivate gradients for the following code
        for data_inputs, data_labels in data_loader:

            # Determine prediction of model on dev set
            data_inputs, data_labels = data_inputs.to(device), data_labels.to(device)
            preds = model(data_inputs)
            preds = preds.squeeze(dim=1)
            preds = torch.sigmoid(preds) # Sigmoid to map predictions between 0 and 1
            pred_labels = (preds >= 0.5).long() # Binarize predictions to 0 and 1

            # Keep records of predictions for the accuracy metric (true_preds=TP+TN, num_preds=TP+TN+FP+FN)
            true_preds += (pred_labels == data_labels).sum()
            num_preds += data_labels.shape[0]

    acc = true_preds / num_preds
    print(f"Accuracy of the model: {100.0*acc:4.2f}%")


In [None]:
eval_model(model, test_data_loader)

## Your turn!
Complete the tasks below and try to explain your answers.


1. Restart the kernel and go back to the start of the "A simple network: MLP" part.
2. Set the hidden neuron size to 1, 2, 3, and 4, and run the training for each. Always re-create the model from scratch before re-training and evaluating; all other hyperparameters should remain the same. Explain your results.
3. Now, train a 2-neuron model with lr=0.01 for 200 epochs. Explain the difference between these new results and point 2.
4. Set the train dataset size to 2500 as before. Use a learning rate of 1e-10 and 200 epochs. Other hyperparameters and the test set size should remain the same as we set them originally (hidden size 4!). Explain the difference.
5. Set the number of epochs to 1 (use all other hyperparameters as originally set). Explain the results.

### Solution

<details>
  <summary>Click to see the answers</summary>

2. Results of the sweep:

`size_of_hidden_neurons`: 1,  accuracy: 77% (should be close to 75%)

`size_of_hidden_neurons`: 2,  accuracy: 99.8% (!)

`size_of_hidden_neurons`: 3,  accuracy: 100%

`size_of_hidden_neurons`: 4,  accuracy: 100%

Something seems wrong! From theory, we know that a simple MLP with 2 hidden neurons is perfectly capable of solving the XOR problem.

This demonstrates a classic optimization challenge: the architecture is capable, but the training process did not fully converge. With only 2 neurons, the model is at its theoretical limit and is very sensitive to initialization and hyperparameters (like lr and epochs), although the accuracy is still nearly 100%.

For the MLPs with 3 and 4 neurons, the network is over-parameterized. This extra capacity provides redundancy, creating more "paths" for the optimizer to descend, making it much easier to find a solution that works.


3. Difference of hyperparameters

LR=0.1, epochs=20

`size_of_hidden_neurons`: 2,  accuracy: 99.8%

LR=0.01, epochs=200

`size_of_hidden_neurons`: 2,  accuracy: 100%

Now, we see the explanation from point 2 in action: with a smaller learning rate and more epochs, the 2-neuron model reaches 100% accuracy.

4. Effect of learning rate:

`LR=1e-1, epochs=20`: ~100%

`LR=1e-10, epochs=200`: ~50%

The step size is so small that the effect of the learning is negligible. The learning rate (and hence the step size) should be large enough for the network to make meaningful steps towards the minimum.
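
A small arithmetic sketch of how tiny such a step is. It assumes a parameter value of 0.5 and a gradient of 1.0, both made-up numbers. In float32, the default dtype, a step of size 1e-10 is below the resolution of the number format, so the parameter does not change at all.

```python
import torch

param = torch.tensor(0.5)  # float32 by default
grad = torch.tensor(1.0)   # assumed gradient magnitude

tiny_step = param - 1e-10 * grad   # update with lr = 1e-10
normal_step = param - 1e-1 * grad  # update with lr = 1e-1

print("lr=1e-10 leaves the parameter unchanged:", tiny_step.item() == param.item())
print("lr=1e-1 gives:", normal_step.item())
```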

5. Only 1 epoch

The network makes only a single pass over the training data, so it performs only a few parameter updates and does not learn much. The accuracy is underwhelming (~60%).
  
</details>


# Advanced: visualizing the weight changes during training

In [None]:
import copy # to make deepcopies

def train_model(model, optimizer, data_loader, loss_module, num_epochs=50):
    # Set model to train mode
    model.train()

    # Training loop
    losses = []
    model_snapshots = []
    for epoch in range(num_epochs):
        for data_inputs, data_labels in data_loader:
            # Step 0: Move input data to device (only strictly necessary if we use GPU)
            data_inputs = data_inputs.to(device)
            data_labels = data_labels.to(device)

            # Step 1: Run the model on the input data
            preds = model(data_inputs)
            preds = preds.squeeze(dim=1) # Output is [Batch size, 1], but we want [Batch size]

            # Step 2: Calculate the loss
            loss = loss_module(preds, data_labels.float())

            ## Step 3: Perform backpropagation
            # Before calculating the gradients, we need to ensure that they are all zero.
            # The gradients would not be overwritten, but actually added to the existing ones.
            optimizer.zero_grad()
            # Perform backpropagation
            loss.backward()

            # Step 4: Update the parameters
            optimizer.step()

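        # we append the loss of the last batch of this epoch to losses, so we can inspect it later.
        # .item() stores a plain Python float instead of a tensor that keeps the graph alive
        losses.append(loss.item())

        # store a snapshot of the model weights at every fifth epoch
        if epoch % 5 == 0:
            # print(f"Epoch: {epoch} | Loss: {loss.item():.5f}")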
            snapshot = copy.deepcopy(model.state_dict())
            model_snapshots.append(snapshot)
    return losses, model_snapshots


In [None]:
# call training
_, snapshots = train_model(model, optimizer, train_data_loader, loss_module, num_epochs=1000)

In [None]:
from matplotlib.animation import FuncAnimation
import numpy as np

def visualize_training(model, model_snapshots, data_inputs, data_labels):
    fig, ax = plt.subplots(figsize=(6, 5))

    X = data_inputs.cpu().numpy()
    y = data_labels.cpu().numpy()

    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5

    # decision boundary grid
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05),
                         np.arange(y_min, y_max, 0.05))

    def update(frame_idx):
        ax.clear()
        model.load_state_dict(model_snapshots[frame_idx])
        model.eval()


        with torch.no_grad():
            grid_tensor = torch.from_numpy(np.c_[xx.ravel(), yy.ravel()]).float().to(device)
            Z = model(grid_tensor)
            Z = torch.sigmoid(Z).reshape(xx.shape).cpu().numpy()

        ax.contourf(xx, yy, Z, alpha=0.4, cmap="coolwarm")

        ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap="coolwarm")
        ax.set_title(f"Training Progress: Frame {frame_idx}")

    anim = FuncAnimation(fig, update, frames=len(model_snapshots), interval=200)

    anim.save('training_progress.gif', writer='pillow')
    print("Animation saved as training_progress.gif!")
    plt.show()

In [None]:
visualize_training(model, snapshots, dataset.data, dataset.label)