
In [1]:
!pip install d2l==1.0.0-alpha1.post0 --quiet

## 11.3 Attention Scoring Functions

### 11.3.2 Convenience Functions

#### Masked Softmax Operation

* `tensor.size(dim=None)`: Returns the shape of the tensor as a `torch.Size`. If `dim` is given, returns the size of that dimension as an `int`.

In [18]:
import torch
X = torch.tensor([[1, 2],
                  [3, 4]])

X.size(), X.size(dim=0), X.size(dim=1)

(torch.Size([2, 2]), 2, 2)

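* `torch.nn.functional.softmax(input, dim=None)`: Applies softmax along dimension `dim`, so that every slice along `dim` sums to 1. Pass `dim` explicitly; relying on the implicit choice is deprecated and raises a warning.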

In [3]:
X = torch.rand(2, 2)
Y = torch.nn.functional.softmax(X, dim=1)   # explicit dim: each row sums to 1
Y

tensor([[0.4667, 0.5333],
        [0.4341, 0.5659]])

In [4]:
Y = torch.nn.functional.softmax(X, dim=0)
Y

tensor([[0.4563, 0.4239],
        [0.5437, 0.5761]])

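* `torch.repeat_interleave(input, repeats, dim=None)`: Repeats each element of a tensor `repeats` times. Without `dim`, the input is flattened first and a flat tensor is returned.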

In [5]:
X = torch.tensor([[1, 2], 
                  [3, 4]])
Y = X.repeat_interleave(2)
Y

tensor([1, 1, 2, 2, 3, 3, 4, 4])

In [6]:
Y = X.repeat_interleave(2, dim=1)
Y

tensor([[1, 1, 2, 2],
        [3, 3, 4, 4]])

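* Indexing with `None` inserts a new axis of size 1 at that position, e.g. `valid_lens[:, None]` turns a tensor of shape `(n,)` into one of shape `(n, 1)`.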

In [7]:
X = torch.rand(2, 2, 4)              # Shape: 2 x 2 x 4
valid_lens = torch.tensor([2, 3])    # Shape: 2
X.shape, valid_lens.shape, valid_lens.dim()

(torch.Size([2, 2, 4]), torch.Size([2]), 1)

In [8]:
shape = X.shape
valid_lens = torch.repeat_interleave(valid_lens, shape[1])
valid_lens, shape

(tensor([2, 2, 3, 3]), torch.Size([2, 2, 4]))

In [9]:
# _sequence_mask function, step by step

X = X.reshape(-1, shape[-1])         # shape: 4 x 4
value = -1e6
print('Shape: ', shape)
print('X.shape: ', X.shape)
print('valid_lens.shape: ', valid_lens.shape)
print('valid_lens: ', valid_lens)

print('**************************')

maxlen = X.size(1)                    # maxlen = 4
print('Maximum length is: ', maxlen)

mask = torch.arange((maxlen), dtype=torch.float32,
                    device=X.device)
print('Mask (without None slicing):', mask)
print('Slicing with None on valid_len: ', valid_lens[:, None])
print('Slicing with None on mask: ', mask[None, :])

mask = torch.arange((maxlen), dtype=torch.float32,
                    device=X.device)[None, :] < valid_lens[:, None]
print('Mask: ', mask)

print('**************************')

X[~mask] = value
X


Shape:  torch.Size([2, 2, 4])
X.shape:  torch.Size([4, 4])
valid_lens.shape:  torch.Size([4])
valid_lens:  tensor([2, 2, 3, 3])
**************************
Maximum length is:  4
Mask (without None slicing): tensor([0., 1., 2., 3.])
Slicing with None on valid_len:  tensor([[2],
        [2],
        [3],
        [3]])
Slicing with None on mask:  tensor([[0., 1., 2., 3.]])
Mask:  tensor([[ True,  True, False, False],
        [ True,  True, False, False],
        [ True,  True,  True, False],
        [ True,  True,  True, False]])
**************************


tensor([[ 9.1626e-01,  8.5247e-01, -1.0000e+06, -1.0000e+06],
        [ 8.7970e-01,  4.7735e-01, -1.0000e+06, -1.0000e+06],
        [ 4.8545e-01,  2.6565e-01,  3.2469e-01, -1.0000e+06],
        [ 1.6048e-01,  9.4612e-01,  8.0980e-01, -1.0000e+06]])

In [10]:
torch.nn.functional.softmax(X.reshape(shape), dim=-1)

tensor([[[0.5159, 0.4841, 0.0000, 0.0000],
         [0.5993, 0.4007, 0.0000, 0.0000]],

        [[0.3768, 0.3024, 0.3208, 0.0000],
         [0.1958, 0.4295, 0.3747, 0.0000]]])
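
The steps above can be collected into a single function. The following is a minimal sketch that assumes a 3-D input and a 1-D `valid_lens` with one entry per batch element; the name `masked_softmax_sketch` is chosen here for illustration and is not a library function.

```python
import torch
import torch.nn.functional as F

def masked_softmax_sketch(X, valid_lens, value=-1e6):
    """Softmax over the last axis of X, ignoring positions >= valid_lens.

    X: (batch, num_queries, num_keys); valid_lens: 1-D tensor of shape (batch,).
    Assumption: only this 3-D / 1-D case is handled.
    """
    shape = X.shape
    # One valid length per row of the flattened (batch * num_queries, num_keys) matrix
    lens = torch.repeat_interleave(valid_lens, shape[1])
    flat = X.reshape(-1, shape[-1]).clone()   # clone so the caller's X is not modified
    # mask[i, j] is True when column j is a valid position for row i
    mask = torch.arange(shape[-1], device=X.device)[None, :] < lens[:, None]
    flat[~mask] = value                       # large negative, so exp() underflows to 0
    return F.softmax(flat.reshape(shape), dim=-1)

X = torch.rand(2, 2, 4)
weights = masked_softmax_sketch(X, torch.tensor([2, 3]))
print(weights)
```

Each row of `weights` still sums to 1, but the probability mass is spread only over the valid positions.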

In [11]:
Y = torch.tensor([[1, 2],
                  [3, 4]])
Mask = torch.tensor([True, False])
Y.shape, Mask.shape

(torch.Size([2, 2]), torch.Size([2]))

In [12]:
Y[Mask], Y[~Mask]     # Select the first row (Mask) and the second row (~Mask), respectively.

(tensor([[1, 2]]), tensor([[3, 4]]))

* `torch.bmm(input, mat2)`: Performs a batch matrix-matrix product of the matrices stored in `input` and `mat2`.
* `input` and `mat2` must be 3-D tensors, each containing the same number of matrices.
* If `input` is a ($b \times n \times m$) tensor and `mat2` is a ($b \times m \times p$) tensor, `out` will be a ($b \times n \times p$) tensor:
$$
\text{out}_i = \text{input}_i \mathbin{@} \text{mat2}_i
$$

In [13]:
Q = torch.ones((2, 3, 4))   # Queries (input) 2 x 3 x 4
K = torch.ones((2, 4, 6))   # Keys (mat2)    2 x 4 x 6
 
out = torch.bmm(Q, K) # 2 x 3 x 6
out.shape

torch.Size([2, 3, 6])
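
To make the per-batch formula concrete, the sketch below compares `torch.bmm` with an explicit loop that multiplies one pair of matrices at a time. Random inputs are used instead of `torch.ones` so the comparison is not trivial; the variable name `looped` is introduced here for illustration.

```python
import torch

Q = torch.rand(2, 3, 4)   # 2 matrices of shape 3 x 4
K = torch.rand(2, 4, 6)   # 2 matrices of shape 4 x 6

out = torch.bmm(Q, K)     # 2 x 3 x 6
# Same computation, one batch element at a time: out_i = Q_i @ K_i
looped = torch.stack([Q[i] @ K[i] for i in range(Q.shape[0])])

print(out.shape)
print(torch.allclose(out, looped, atol=1e-6))
```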

### 11.3.3 Scaled Dot-Product Attention

* `torch.transpose(input, dim0, dim1)`: Returns a transposed version of `input`, with dimensions `dim0` and `dim1` swapped.

In [14]:
X = torch.tensor([[1, 2],
                  [3, 4]])
Y1 = X.transpose(0, 1)
Y1

tensor([[1, 3],
        [2, 4]])
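
In scaled dot-product attention, `transpose` is what turns the keys into the right shape for `torch.bmm`. The following is a rough sketch under simplifying assumptions: it leaves out masking and dropout, and the function name `scaled_dot_product_sketch` and the tensor sizes are chosen here for illustration only.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_sketch(queries, keys, values):
    """queries: (b, n, d); keys: (b, m, d); values: (b, m, v) -> output (b, n, v)."""
    d = queries.shape[-1]
    # Swap the last two axes of keys: (b, m, d) -> (b, d, m)
    scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)   # (b, n, m)
    weights = F.softmax(scores, dim=-1)   # each query's weights over the m keys sum to 1
    return torch.bmm(weights, values), weights

queries = torch.rand(2, 1, 2)
keys = torch.rand(2, 10, 2)
values = torch.rand(2, 10, 4)
output, weights = scaled_dot_product_sketch(queries, keys, values)
print(output.shape, weights.shape)
```

Dividing by $\sqrt{d}$ keeps the magnitude of the scores roughly independent of the vector length `d`.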

* `torch.unsqueeze(input, dim)`: Returns a new tensor with a dimension of size one inserted at the specified position.

In [15]:
X = torch.tensor([1, 2])

Y1 = X.unsqueeze(0)
Y2 = X.unsqueeze(1)

X.shape, Y1.shape, Y2.shape

(torch.Size([2]), torch.Size([1, 2]), torch.Size([2, 1]))
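
One common reason to insert size-one axes is to let two tensors broadcast against each other. The sketch below, with hypothetical sizes, pairs every query with every key by summation; the names `queries`, `keys`, and `features` are assumptions for illustration.

```python
import torch

queries = torch.rand(2, 3, 8)   # (batch, num_queries, hidden), hypothetical sizes
keys = torch.rand(2, 5, 8)      # (batch, num_keys, hidden)

# (2, 3, 1, 8) + (2, 1, 5, 8) broadcasts to (2, 3, 5, 8):
# features[b, i, j] is queries[b, i] + keys[b, j]
features = queries.unsqueeze(2) + keys.unsqueeze(1)
print(features.shape)
```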

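* Class `torch.nn.Dropout(p=0.5, inplace=False)`: During training, randomly zeroes elements of the input tensor with probability `p`, using samples from a Bernoulli distribution. The surviving elements are scaled by $\frac{1}{1-p}$, so with `p=0.5` each nonzero output is twice the corresponding input.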

In [17]:
from torch import nn

net = nn.Dropout(0.5)
input = torch.randn(5, 5)
output = net(input)
output

tensor([[-0.0000, -0.4506,  2.2459, -0.0000,  0.9600],
        [ 1.7553, -0.8203, -0.7829, -0.0000, -0.0000],
        [-0.0000, -0.8351,  5.2506,  0.0000, -0.0000],
        [-0.0000,  0.0000, -2.5133, -1.8300,  0.0000],
        [-4.3499, -1.0717,  0.0490,  1.4745,  1.8120]])

* `model.eval()`: Switches the model to evaluation mode, which changes the behavior of layers that act differently during training and inference. For example, Dropout layers stop zeroing elements and BatchNorm layers use their running statistics instead of batch statistics.
* In addition, it is common practice during evaluation/validation to pair `model.eval()` with `torch.no_grad()`, which turns off gradient computation.
* Don't forget to switch back to training mode after the evaluation step (using `model.train()`).
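
The mode switch can be seen on a single Dropout layer. This is a small sketch, using an all-ones input so the effect of dropout is easy to read off; the variable names are chosen here for illustration.

```python
import torch
from torch import nn

net = nn.Dropout(0.5)
x = torch.ones(3, 4)

net.train()                # training mode: dropout is active
y_train = net(x)           # entries are either 0 or 1 / (1 - 0.5) = 2

net.eval()                 # evaluation mode: dropout acts as the identity
with torch.no_grad():      # no autograd graph is recorded inside this block
    y_eval = net(x)

net.train()                # switch back before resuming training
print(y_train)
print(y_eval)
```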

* `torch.squeeze(input, dim=None)`: Returns a tensor with all the dimensions of `input` of size 1 removed.
* For example, if `input` is of shape ($A \times 1 \times B \times C \times 1 \times D$), then the `out` tensor will be of shape ($A \times B \times C \times D$).
* When `dim` is given, the squeeze operation is done only in the given dimension. If `input` is of shape ($A \times 1 \times B$), `squeeze(input, 0)` leaves the tensor unchanged, but `squeeze(input, 1)` will squeeze the tensor to the shape ($A \times B$).

In [22]:
X = torch.ones(2, 1, 2, 1, 3)
print('Shape of X is: ', X.shape)

Y1 = X.squeeze()
print('Shape of Y1 is: ',Y1.shape)

Y2 = X.squeeze(dim=0)
print('Shape of Y2 is: ',Y2.shape)

Y3 = X.squeeze(dim=1)
print('Shape of Y3 is: ',Y3.shape)

Shape of X is:  torch.Size([2, 1, 2, 1, 3])
Shape of Y1 is:  torch.Size([2, 2, 3])
Shape of Y2 is:  torch.Size([2, 1, 2, 1, 3])
Shape of Y3 is:  torch.Size([2, 2, 1, 3])
