How does unsqueeze help us to solve certain broadcasting problems?
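unsqueeze adds a new dimension of size 1 to a tensor at the position you specify. Broadcasting compares shapes starting from the trailing dimension, so inserting a size-1 axis in the right place lets two tensors be combined element-wise when their shapes would otherwise be incompatible.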

import torch
tensor1 = torch.tensor([1, 2, 3])
tensor2 = torch.tensor([10, 20])
result = tensor1.unsqueeze(1) + tensor2  # tensor1 becomes shape (3, 1); the result has shape (3, 2)
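
To see why the extra axis matters, here is a small sketch that first tries the addition without unsqueeze and then with it. The variable name `failed` is only illustrative.

```python
import torch

tensor1 = torch.tensor([1, 2, 3])   # shape (3,)
tensor2 = torch.tensor([10, 20])    # shape (2,)

# Shapes (3,) and (2,) cannot be broadcast: the trailing sizes 3 and 2 differ.
try:
    tensor1 + tensor2
    failed = False
except RuntimeError:
    failed = True

# Shapes (3, 1) and (2,) can: the size-1 axis stretches to 2, giving (3, 2).
result = tensor1.unsqueeze(1) + tensor2

print(failed)        # True
print(result.shape)  # torch.Size([3, 2])
print(result)        # rows are [11, 21], [12, 22], [13, 23]
```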


How can we use indexing to do the same operation as unsqueeze?
Indexing with None inserts a size-1 axis at that position, so tensor1[:, None] is equivalent to tensor1.unsqueeze(1):
    
import torch
tensor1 = torch.tensor([1, 2, 3])
tensor2 = torch.tensor([10, 20])
result = tensor1[:, None] + tensor2  # Using indexing to add a new axis
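
As a quick check of the equivalence, the following sketch compares the two spellings for both a column and a row. The names `col_a`, `col_b`, `row_a`, and `row_b` are illustrative.

```python
import torch

t = torch.tensor([1, 2, 3])

# Both spellings insert a new axis of size 1 at position 1 (a column).
col_a = t.unsqueeze(1)
col_b = t[:, None]

# Inserting the axis at position 0 instead gives a row.
row_a = t.unsqueeze(0)
row_b = t[None, :]

print(col_a.shape, col_b.shape)  # torch.Size([3, 1]) torch.Size([3, 1])
print(row_a.shape, row_b.shape)  # torch.Size([1, 3]) torch.Size([1, 3])
print(torch.equal(col_a, col_b), torch.equal(row_a, row_b))  # True True
```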


How do we show the actual contents of the memory used for a tensor?
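You can use the storage() method, which shows the underlying memory that holds a tensor's data, independent of the tensor's shape and strides. Converting with .numpy() and printing only shows the logical values, not how they are laid out in memory. Recent PyTorch releases may emit a deprecation warning for storage() and offer untyped_storage() as a byte-level alternative. For example:

import torch
tensor = torch.tensor([1, 2, 3])
expanded = tensor.expand(2, 3)  # a (2, 3) view onto the same three numbers
print(expanded)                 # logical contents: two rows of 1, 2, 3
print(expanded.storage())       # underlying memory: still just 1, 2, 3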


When adding a vector of size 3 to a matrix of size 3×3, are the elements of the vector added to each row or each column of the matrix?
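The vector is added to each row of the matrix. Broadcasting treats the size-3 vector as a 1×3 row and repeats it down the rows, so element j of the vector is added to every entry in column j. To add the vector to each column instead, give it shape 3×1 first (for example with unsqueeze(1)).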
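
A small sketch makes the direction visible. Adding to a matrix of zeros shows exactly where each vector element lands; the names `by_row` and `by_col` are illustrative.

```python
import torch

matrix = torch.zeros(3, 3)
vector = torch.tensor([10., 20., 30.])

# Default broadcasting: the vector acts as a (1, 3) row repeated for every row.
by_row = matrix + vector

# With an explicit trailing axis the vector acts as a (3, 1) column instead.
by_col = matrix + vector[:, None]

print(by_row)  # every row is [10., 20., 30.]
print(by_col)  # rows are [10., 10., 10.], [20., 20., 20.], [30., 30., 30.]
```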

Do broadcasting and expand_as result in increased memory use? Why or why not?
No. Broadcasting never materializes the expanded operand, and expand_as returns a view that shares the original tensor's memory: the expanded dimension is given a stride of 0, so the same elements are simply read again. Only the result of the operation is allocated as a new tensor.
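
One way to check this is to inspect the stride and the data pointer of an expanded tensor. This is a minimal sketch; the names `c`, `m`, and `expanded` are illustrative.

```python
import torch

c = torch.tensor([10., 20., 30.])
m = torch.zeros(3, 3)

expanded = c.expand_as(m)

print(expanded.shape)     # torch.Size([3, 3])
# A stride of 0 along dim 0 means every "row" rereads the same three numbers.
print(expanded.stride())  # (0, 1)
# The view starts at the same memory address as the original tensor.
print(expanded.data_ptr() == c.data_ptr())  # True

# Forcing a real copy is what actually allocates 9 elements.
copied = expanded.contiguous()
print(copied.data_ptr() == c.data_ptr())    # False
```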

Implement matmul using Einstein summation.
Here's an example of implementing matrix multiplication using Einstein summation:

import torch
A = torch.tensor([[1, 2], [3, 4]])
B = torch.tensor([[5, 6], [7, 8]])
result = torch.einsum('ij,jk->ik', A, B)
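
The same idea can be wrapped in a function and checked against PyTorch's built-in operator. This is a sketch; the function name `matmul` and the float inputs are choices made for illustration.

```python
import torch

def matmul(a, b):
    # 'ik,kj->ij': k appears in both inputs but not in the output,
    # so products are summed over k for each (i, j) pair.
    return torch.einsum('ik,kj->ij', a, b)

A = torch.tensor([[1., 2.], [3., 4.]])
B = torch.tensor([[5., 6.], [7., 8.]])

result = matmul(A, B)
print(result)                      # rows are [19., 22.] and [43., 50.]
print(torch.equal(result, A @ B))  # True
```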


What does a repeated index letter represent on the left-hand side of einsum?
A repeated index letter represents a summation over that index. In Einstein summation notation, an index that appears in more than one input term and is left out of the output is multiplied across the inputs and then summed.
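
The effect of the repeated index is easy to see by comparing two einsum strings that differ only in whether the index is kept in the output. This is a small sketch with illustrative names.

```python
import torch

a = torch.tensor([1., 2., 3.])
b = torch.tensor([4., 5., 6.])

# i is repeated and dropped from the output: multiply, then sum over i.
dot = torch.einsum('i,i->', a, b)

# i is repeated but kept in the output: multiply only, no summation.
elementwise = torch.einsum('i,i->i', a, b)

print(dot)          # tensor(32.)
print(elementwise)  # tensor([ 4., 10., 18.])
```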

What are the three rules of Einstein summation notation? Why?
The three rules of Einstein summation notation are:

1. Repeated indices are implicitly summed over.
2. Each index can appear at most twice in any term.
3. Each term must contain identical non-repeated indices.

These rules keep the notation unambiguous: they guarantee that every expression corresponds to a valid tensor operation, and they let complex expressions be written compactly.

What are the forward pass and backward pass of a neural network?
The forward pass is the process of propagating input data through a neural network to generate predictions or outputs. Each layer in the network computes weighted sums and applies activation functions sequentially. The forward pass calculates the predicted output of the network.

The backward pass, also known as backpropagation, is the process of computing gradients of the loss function with respect to the network's parameters. These gradients are used in optimization algorithms (e.g., gradient descent) to update the model's weights during training.
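
As a minimal sketch of both passes, the following uses a single linear layer with a mean-squared-error loss and lets autograd do the backward pass. The data, shapes, and learning rate are arbitrary assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 3)                      # 8 samples, 3 features
y = torch.randn(8, 1)                      # targets
w = torch.randn(3, 1, requires_grad=True)  # weights
b = torch.zeros(1, requires_grad=True)     # bias

# Forward pass: compute predictions and the loss.
pred = x @ w + b
loss = ((pred - y) ** 2).mean()

# Backward pass: compute d(loss)/dw and d(loss)/db.
loss.backward()

# One gradient descent step using those gradients.
lr = 0.01
with torch.no_grad():
    w -= lr * w.grad
    b -= lr * b.grad
    new_loss = ((x @ w + b - y) ** 2).mean()

print(w.grad.shape, b.grad.shape)  # torch.Size([3, 1]) torch.Size([1])
print(new_loss < loss)             # the step should reduce the loss
```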

Why do we need to store some of the activations calculated for intermediate layers in the forward pass?
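Intermediate activations are stored during the forward pass because the backward pass needs them to compute gradients. By the chain rule, the gradient of the loss with respect to a layer's weights depends on the input that layer received, and the gradient through an activation function depends on the value it was applied to. Recomputing these values would mean repeating the forward pass, so they are kept until backpropagation has used them.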
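
To make the dependency concrete, here is a sketch of a hand-written backward pass for a tiny two-layer network, checked against autograd. The network, the sum loss, and all names are assumptions made for illustration; the comments mark where a stored forward value is reused.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 5)
w1 = torch.randn(5, 3, requires_grad=True)
w2 = torch.randn(3, 1, requires_grad=True)

# Forward pass, keeping the intermediate values.
z1 = x @ w1            # hidden pre-activation
a1 = z1.clamp_min(0)   # ReLU activation
out = a1 @ w2
loss = out.sum()

# Manual backward pass via the chain rule.
with torch.no_grad():
    grad_out = torch.ones_like(out)       # d(loss)/d(out) for a sum
    grad_w2 = a1.t() @ grad_out           # needs the stored activation a1
    grad_a1 = grad_out @ w2.t()
    grad_z1 = grad_a1 * (z1 > 0).float()  # needs the stored pre-activation z1
    grad_w1 = x.t() @ grad_z1             # needs the stored input x

# Autograd arrives at the same gradients.
loss.backward()
print(torch.allclose(grad_w1, w1.grad), torch.allclose(grad_w2, w2.grad))
```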

What is the downside of having activations with a standard deviation too far away from 1?
Activations with a standard deviation too far from 1 can lead to vanishing or exploding values and gradients. Each layer rescales its input, so a deviation from 1 compounds with depth: a scale slightly below 1 shrinks toward zero, and a scale slightly above 1 grows without bound. Vanishing gradients make it hard for the network to learn, while exploding gradients make training unstable. Keeping activations close to a standard deviation of 1 helps stabilize and speed up training.
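
The compounding effect can be sketched by pushing random data through a stack of random linear layers with no activation function. The helper `final_std`, the layer width of 100, and the depth of 50 are assumptions chosen for illustration.

```python
import math
import torch

torch.manual_seed(0)

def final_std(weight_scale, n_layers=50):
    # Multiply by a fresh random 100x100 weight matrix at every layer.
    x = torch.randn(200, 100)
    for _ in range(n_layers):
        x = x @ (torch.randn(100, 100) * weight_scale)
    return x.std().item()

# Unscaled weights grow the activations about 10x per layer until they overflow.
exploded = final_std(1.0)
# Weights that are too small shrink the activations until they underflow to zero.
vanished = final_std(0.01)

print(math.isfinite(exploded))  # False
print(vanished)                 # 0.0
```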

How can weight initialization help avoid this problem?
Proper weight initialization, such as Xavier/Glorot initialization (commonly used with tanh or sigmoid) or He/Kaiming initialization (commonly used with ReLU), helps avoid this problem. These schemes choose the scale of the initial weights from the number of inputs (and, for Xavier, outputs) of each layer, so that activations start with a standard deviation close to 1. That keeps values and gradients in a reasonable range from the first training step, which mitigates gradient issues and accelerates convergence.
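
As a rough sketch of the idea, scaling random weights by 1/sqrt(n_in) keeps the activation scale stable through a deep stack of linear layers. This is a simplified Xavier-style scaling for layers with no activation function, not the exact formula PyTorch's initializers use; the width, depth, and names are illustrative assumptions.

```python
import math
import torch

torch.manual_seed(0)

n_in = 100
scale = 1 / math.sqrt(n_in)  # simplified Xavier-style factor: 0.1 for 100 inputs

x = torch.randn(200, n_in)
for _ in range(50):
    # Each output is a sum of n_in terms, so this scaling keeps its variance near 1.
    x = x @ (torch.randn(n_in, n_in) * scale)

final = x.std().item()
print(final)  # stays within a modest factor of 1 instead of overflowing or vanishing
```

For real models, torch.nn.init provides xavier_uniform_, xavier_normal_, kaiming_uniform_, and kaiming_normal_, which apply the corresponding published formulas to an existing weight tensor.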