1. How does unsqueeze help us to solve certain broadcasting problems?

2. How can we use indexing to do the same operation as unsqueeze?

3. How do we show the actual contents of the memory used for a tensor?

4. When adding a vector of size 3 to a matrix of size 3×3, are the elements of the vector added
to each row or each column of the matrix? (Be sure to check your answer by running this
code in a notebook.)

5. Do broadcasting and expand_as result in increased memory use? Why or why not?

6. Implement matmul using Einstein summation.

7. What does a repeated index letter represent on the lefthand side of einsum?

8. What are the three rules of Einstein summation notation? Why?

9. What are the forward pass and backward pass of a neural network?

10. Why do we need to store some of the activations calculated for intermediate layers in the
forward pass?

11. What is the downside of having activations with a standard deviation too far away from 1?

12. How can weight initialization help avoid this problem?

# Answers

1. unsqueeze helps solve certain broadcasting problems by adding a new dimension to a tensor, which can be used to match the dimensions of other tensors in element-wise operations. For example, if you have a 1D tensor and want to perform element-wise operations with a 2D tensor, you can use unsqueeze to add a new dimension and make their shapes compatible for broadcasting.
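A minimal sketch of the case described above. The tensor names `m`, `c`, and `col` are chosen for illustration only. The vector holds one value per row of the matrix, so it cannot broadcast until it is given a trailing axis of size 1.

```python
import torch

m = torch.arange(6).reshape(2, 3)   # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
c = torch.tensor([10, 20])          # shape (2,): one value per row of m

# m + c would fail here: the trailing dimensions (3 and 2) do not match.
col = c.unsqueeze(1)                # shape (2, 1)
result = m + col                    # broadcasts to shape (2, 3)
print(result)                       # tensor([[10, 11, 12], [23, 24, 25]])
```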

2. Indexing with None inserts a new axis of size 1, exactly as unsqueeze does. For a 1D tensor a, a[:, None] is equivalent to a.unsqueeze(1), and a[None, :] (or simply a[None]) is equivalent to a.unsqueeze(0). If NumPy is imported as np, np.newaxis can be used in place of None, since it is an alias for it. No data is copied or replicated: the result is a view of the same memory with one extra axis.
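The equivalence can be checked directly. This is a small sketch with illustrative variable names. The `data_ptr()` comparison is one way to see that the indexed result starts at the same memory address as the original tensor.

```python
import torch

a = torch.tensor([1, 2, 3])

col_index = a[:, None]        # shape (3, 1)
col_unsq = a.unsqueeze(1)     # shape (3, 1)
row_index = a[None, :]        # shape (1, 3)
row_unsq = a.unsqueeze(0)     # shape (1, 3)

print(torch.equal(col_index, col_unsq))   # True
print(torch.equal(row_index, row_unsq))   # True

# Both forms are views: they start at the same memory address as a.
same_memory = col_index.data_ptr() == a.data_ptr()
print(same_memory)                        # True
```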

In [1]:
#3 To show the actual contents of the memory used for a tensor, you can use the .storage() method in PyTorch.

import torch

tensor = torch.tensor([1, 2, 3])
storage = tensor.storage()
print(storage)


 1
 2
 3
[torch.storage.TypedStorage(dtype=torch.int64, device=cpu) of size 3]




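4. When adding a vector of size 3 to a matrix of size 3×3, the vector is added to each row of the matrix. Broadcasting aligns trailing dimensions, so the vector is treated as a 1×3 row and repeated down the rows. Put another way, element j of the vector is added to every entry in column j. To add the vector to each column instead, give it shape 3×1 first (for example with unsqueeze(1)).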
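The question asks for the answer to be checked in a notebook. Adding the vector to a matrix of zeros makes the broadcast pattern easy to read. The names `m`, `c`, `by_row`, and `by_col` are illustrative.

```python
import torch

m = torch.zeros(3, 3)
c = torch.tensor([10., 20., 30.])

by_row = m + c            # c is treated as shape (1, 3)
print(by_row)
# tensor([[10., 20., 30.],
#         [10., 20., 30.],
#         [10., 20., 30.]])

by_col = m + c[:, None]   # c is reshaped to (3, 1) first
print(by_col)
# tensor([[10., 10., 10.],
#         [20., 20., 20.],
#         [30., 30., 30.]])
```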

5. Broadcasting and expand_as do not increase memory use for the tensor being expanded. expand_as returns a view of the original data in which the expanded dimension has a stride of 0, so the same elements are read repeatedly instead of being copied. Broadcasting in an arithmetic operation works the same way; only the result of the operation is allocated at full size.
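One way to see this is to inspect the stride and memory address of an expanded tensor. This is a sketch with illustrative names. A stride of 0 along the first axis means that moving to the next row does not move through memory at all.

```python
import torch

c = torch.tensor([10., 20., 30.])
m = torch.zeros(3, 3)

expanded = c.expand_as(m)
print(expanded.shape)      # torch.Size([3, 3])
print(expanded.stride())   # (0, 1): stepping to the next row moves 0 elements

# The expanded view starts at the same address as c; nothing was copied.
shares_memory = expanded.data_ptr() == c.data_ptr()
print(shares_memory)       # True
```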

In [2]:
#6 Implementing matmul using Einstein summation:

import torch

a = torch.tensor([[1, 2, 3]])
b = torch.tensor([[4], [5], [6]])

result = torch.einsum('ij,jk->ik', a, b)
print(result)


tensor([[32]])


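7. A repeated index letter on the left-hand side of einsum marks a dimension that the inputs share. If that letter does not appear on the right-hand side, the products are summed (contracted) along that dimension. For example, in 'ij,jk->ik' the letter 'j' appears in both inputs but not in the output, so the 'j' dimension of the first tensor is multiplied element by element with the 'j' dimension of the second and the products are summed.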
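The effect of a repeated letter depends on whether it also appears in the output. A short sketch with illustrative names compares three einsum strings on the same pair of matrices.

```python
import torch

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[5., 6.], [7., 8.]])

# j is repeated and absent from the output: summed over (matrix product).
matmul = torch.einsum('ij,jk->ik', a, b)
print(matmul)        # tensor([[19., 22.], [43., 50.]])

# i and j are repeated but kept in the output: no summation (elementwise product).
elementwise = torch.einsum('ij,ij->ij', a, b)
print(elementwise)   # tensor([[ 5., 12.], [21., 32.]])

# i and j are repeated and both dropped from the output: summed to a scalar.
total = torch.einsum('ij,ij->', a, b)
print(total)         # tensor(70.)
```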

8. The three rules of Einstein summation notation are:

Repeated indices imply summation over that index.
Indices that appear only once are free indices and are kept in the result.
The free indices specify the dimensions of the output tensor.

These rules make an einsum expression unambiguous: every index is either summed away or carried through to the output, so the operation and the shape of the result can be read directly from the expression.

9. The forward pass of a neural network involves feeding input data through the network's layers, computing intermediate activations, and ultimately producing an output. The backward pass, also known as backpropagation, is the process of calculating gradients of the loss function with respect to the network's parameters. It is essential for updating the network's weights during training.

10. Storing activations for intermediate layers in the forward pass is necessary for backpropagation. During backpropagation, the gradients are computed by propagating them backward through the network. The stored intermediate activations are needed to compute these gradients efficiently.
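A minimal sketch of why the stored values are needed, using a single linear layer followed by a ReLU. The layer sizes and variable names are arbitrary choices for illustration. The hand-written backward pass uses both the layer input `x` and the pre-activation `pre` saved during the forward pass, and the result is compared with what autograd computes.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 3)                       # input activations
w = torch.randn(3, 2, requires_grad=True)   # layer weights

# Forward pass: keep the intermediate values around.
pre = x @ w            # pre-activation
out = pre.relu()       # activation
loss = out.sum()

# Backward pass computed by autograd.
loss.backward()

# The same backward pass written by hand, using the chain rule.
with torch.no_grad():
    grad_out = torch.ones_like(out)            # d loss / d out
    grad_pre = grad_out * (pre > 0).float()    # ReLU gradient needs the stored pre
    grad_w = x.t() @ grad_pre                  # weight gradient needs the stored x

matches = torch.allclose(grad_w, w.grad)
print(matches)   # True
```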

11. The downside of having activations with a standard deviation too far away from 1 is that the scale compounds from layer to layer: after many layers the activations either shrink toward zero or grow until they overflow. The gradients computed during backpropagation suffer the same fate (vanishing or exploding gradients), so the weight updates become negligibly small or unstably large and training is slow or fails.
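The compounding effect can be reproduced with repeated matrix multiplications standing in for layers. This is a sketch under assumed sizes (50 layers of width 100, no nonlinearity), not a model of any particular network.

```python
import torch

torch.manual_seed(0)

# Weights with standard deviation 1: each layer multiplies the scale by roughly 10.
big = torch.randn(200, 100)
for _ in range(50):
    big = big @ torch.randn(100, 100)
print(torch.isfinite(big).all())    # the values overflow float32

# Weights with standard deviation 0.01: each layer divides the scale by roughly 10.
small = torch.randn(200, 100)
for _ in range(50):
    small = small @ (torch.randn(100, 100) * 0.01)
print(small.std())                  # the values collapse toward zero
```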

12. Weight initialization can help avoid this problem by scaling the random initial weights to suit the size of each layer, so that the output of a layer has roughly the same standard deviation as its input. For example, Xavier/Glorot initialization draws weights from a distribution whose standard deviation depends on the number of inputs and outputs of the layer. This keeps activations at an appropriate scale as they pass through the network and reduces the risk of vanishing or exploding gradients during training.
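A sketch of the effect, under assumed sizes (50 linear layers of width 100, no nonlinearity). Each weight matrix is filled with `torch.nn.init.xavier_normal_`, and the standard deviation of the activations is measured at the end. With a nonlinearity such as ReLU, a different scheme (for example Kaiming initialization) would usually be chosen; that case is not shown here.

```python
import torch

torch.manual_seed(0)

x = torch.randn(200, 100)
for _ in range(50):
    w = torch.empty(100, 100)
    torch.nn.init.xavier_normal_(w)   # std = sqrt(2 / (fan_in + fan_out)) = 0.1 here
    x = x @ w

final_std = x.std().item()
print(final_std)   # stays within an order of magnitude of 1
```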