In [None]:
1. How does unsqueeze help us to solve certain broadcasting problems?


In [None]:
The unsqueeze function inserts a new dimension of size 1 at a chosen position in a tensor's shape. Broadcasting aligns shapes starting from the trailing dimension and treats a size-1 dimension as compatible with any size, so placing the unit axis in the right position controls which axis of the other tensor the values line up with. For example, a tensor of shape (3,) broadcasts against a (3, 3) matrix as a row; after unsqueeze(1) it has shape (3, 1) and broadcasts as a column.
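A minimal sketch of a broadcasting problem that a unit axis solves. It uses NumPy's `np.expand_dims`, which plays the role of `unsqueeze` here (in PyTorch the equivalent call would be `row_sums.unsqueeze(1)`). The matrix `m` and the row-normalization task are illustrative assumptions, not part of the original question.

```python
import numpy as np

# Hypothetical data: a 2x3 matrix whose rows we want to divide by their own sums.
m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
row_sums = m.sum(axis=1)  # shape (2,)

# Shapes (2, 3) and (2,) do not broadcast: the trailing dimensions 3 and 2 differ.
try:
    m / row_sums
    failed = False
except ValueError:
    failed = True

# A trailing unit axis gives shape (2, 1), which broadcasts across the columns.
normalized = m / np.expand_dims(row_sums, axis=1)

print(failed)                   # True
print(normalized.sum(axis=1))   # each row now sums to 1
```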

In [None]:
2. How can we use indexing to do the same operation as unsqueeze?


In [1]:
import numpy as np

a = np.array([1, 2, 3])
b = a[:, None]  # Using indexing with None
c = a[np.newaxis, :]  # Using indexing with np.newaxis

print(b)
print(c)


[[1]
 [2]
 [3]]
[[1 2 3]]


In [None]:
3. How do we show the actual contents of the memory used for a tensor?


In [None]:
To show the actual contents of the memory used for a PyTorch tensor, call its .storage() method. It returns the underlying one-dimensional storage, which makes it clear that operations such as expand_as or a transpose change only the shape and strides, not the stored data. Converting with .numpy() or .tolist() shows the tensor's logical values instead, not its memory layout.
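NumPy arrays have no `.storage()` method, so the following is only a rough analog of the idea, assuming a transposed view as the example: `ravel(order="K")` reads the elements in the order they sit in memory, while a plain `ravel()` reads them in the view's logical row-major order.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)  # stored in memory as 0, 1, 2, 3, 4, 5
t = m.T                         # transposed view: new shape and strides, same memory

logical_order = t.ravel()           # row-major order of the view
memory_order = t.ravel(order="K")   # order in which the elements sit in memory

print(logical_order)            # [0 3 1 4 2 5]
print(memory_order)             # [0 1 2 3 4 5]
print(np.shares_memory(m, t))   # True: no data was copied by the transpose
```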

In [None]:
4. When adding a vector of size 3 to a matrix of size 3×3, are the elements of the vector added
to each row or each column of the matrix? (Be sure to check your answer by running this
code in a notebook.)


In [None]:
When adding a vector of size 3 to a matrix of size 3x3, the elements of the vector are added to each row of the matrix. Broadcasting aligns shapes starting from the trailing dimension, so the vector of shape (3,) is treated as shape (1, 3) and is repeated along the first dimension, once per row. The output below confirms this: the first row [4, 5, 6] becomes [5, 7, 9].

In [2]:
import numpy as np

vector = np.array([1, 2, 3])
matrix = np.array([[4, 5, 6], [7, 8, 9], [10, 11, 12]])

result = matrix + vector
print(result)


[[ 5  7  9]
 [ 8 10 12]
 [11 13 15]]
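For contrast, a short sketch of how the same vector could be added to each column instead. Indexing with `None` turns the vector into shape (3, 1), so it is repeated across the columns; the variable names mirror the cell above.

```python
import numpy as np

vector = np.array([1, 2, 3])
matrix = np.array([[4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Shape (3, 1): element i of the vector is added to every entry of row i,
# i.e. the vector is added to each column.
by_column = matrix + vector[:, None]
print(by_column)
# [[ 5  6  7]
#  [ 9 10 11]
#  [13 14 15]]
```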


In [None]:
5. Do broadcasting and expand_as result in increased memory use? Why or why not?


In [None]:
Broadcasting and expand_as do not increase memory use for the expanded tensor. No copy of the data is made; instead, the expanded dimension is given a stride of 0, so every index along that dimension points at the same stored elements. Only the result of the computation is allocated as a new, full-size tensor.
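The stride-0 mechanism can be observed in NumPy, where `np.broadcast_to` is used here as a stand-in for `expand_as`. This is a sketch under that assumption; the dtype is fixed to float64 so that each element occupies 8 bytes.

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0], dtype=np.float64)  # 3 elements, 8 bytes each
expanded = np.broadcast_to(v, (3, 3))            # read-only expanded view

print(expanded.shape)                  # (3, 3)
print(expanded.strides)                # (0, 8): moving down a row advances 0 bytes
print(np.shares_memory(v, expanded))   # True: the view reuses v's memory
```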

In [None]:
6. Implement matmul using Einstein summation.


In [3]:
import numpy as np

a = np.random.rand(3, 4)
b = np.random.rand(4, 5)

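# 'ij,jk->ik': j is repeated on the left and absent on the right, so it is summed over
c = np.einsum('ij,jk->ik', a, b)
assert np.allclose(c, a @ b)  # same result as the built-in matrix product

print(c)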


[[1.37964213 0.87073231 0.87967245 1.14610443 0.91575191]
 [1.43531051 1.46566456 0.87257348 1.53638773 1.20553217]
 [1.28770516 1.04972445 0.83758174 1.081176   1.20259086]]


In [None]:
7. What does a repeated index letter represent on the lefthand side of einsum?


In [None]:
A repeated index letter on the left-hand side of an einsum expression marks dimensions that are matched up and multiplied elementwise. If that letter does not appear on the right-hand side, the products are also summed over that index, which contracts the dimension out of the result.
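A small sketch of the difference, using arbitrary example arrays: keeping the repeated letter `j` in the output leaves the products unsummed, and dropping it sums them.

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)
b = np.arange(12.0).reshape(3, 4)

# j appears on the right: the products a[i, j] * b[j, k] are kept separately.
kept = np.einsum('ij,jk->ijk', a, b)    # shape (2, 3, 4)

# j is absent on the right: the same products are summed over j.
summed = np.einsum('ij,jk->ik', a, b)   # shape (2, 4)

print(kept.shape, summed.shape)
print(np.allclose(kept.sum(axis=1), summed))  # True
```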

In [None]:
8. What are the three rules of Einstein summation notation? Why?


In [None]:
The three rules of Einstein summation notation, as they are usually stated for einsum, are:

Repeated indices: Repeated indices on the left-hand side are implicitly summed over if they do not appear on the right-hand side.

At most twice: Each index can appear at most twice in any term on the left-hand side.

Free indices: The unrepeated indices on the left-hand side must appear on the right-hand side. They are not summed over and become the indices of the output tensor.

These rules make the notation unambiguous, so tensor operations can be written compactly without explicit summation signs.
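To make the rules concrete, here is a sketch that writes out `'ij,jk->ik'` as explicit loops. The helper name `matmul_loops` is hypothetical and exists only for this illustration: the free indices `i` and `k` index the output, and the repeated index `j` becomes the summation loop.

```python
import numpy as np

def matmul_loops(a, b):
    """Explicit-loop reading of 'ij,jk->ik' (illustrative helper)."""
    n_i, n_j = a.shape
    n_j2, n_k = b.shape
    assert n_j == n_j2, "the repeated index j must have the same size in both inputs"
    out = np.zeros((n_i, n_k))
    for i in range(n_i):          # free index: appears in the output
        for k in range(n_k):      # free index: appears in the output
            for j in range(n_j):  # repeated index: summed over
                out[i, k] += a[i, j] * b[j, k]
    return out

a = np.arange(6.0).reshape(2, 3)
b = np.arange(12.0).reshape(3, 4)
looped = matmul_loops(a, b)
print(np.allclose(looped, np.einsum('ij,jk->ik', a, b)))  # True
```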

In [None]:
9. What are the forward pass and backward pass of a neural network?


In [None]:
The forward pass of a neural network is the process of propagating input data through the network's layers, computing the intermediate activations, and producing an output prediction. In a typical fully connected network, each layer applies a linear transformation (a weighted sum plus a bias) followed by an activation function, and its output becomes the input of the next layer.

The backward pass, also known as backpropagation, is the process of calculating the gradients of the loss function with respect to the network's parameters. It propagates the gradient of the loss from the output layer back toward the input layer, applying the chain rule to compute the gradients layer by layer. These gradients are then used to update the parameters during training.
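A minimal NumPy sketch of both passes for a two-layer network with a ReLU and a mean squared error loss. The data, layer sizes, and variable names are all assumptions chosen for illustration; one analytic gradient is compared against a finite-difference estimate as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))   # hypothetical batch: 5 samples, 3 features
y = rng.normal(size=(5, 1))   # hypothetical targets
w1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
w2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def forward(x, w1, b1, w2, b2):
    z1 = x @ w1 + b1          # linear layer 1
    a1 = np.maximum(z1, 0)    # ReLU activation
    out = a1 @ w2 + b2        # linear layer 2
    return z1, a1, out

def mse(out, y):
    return ((out - y) ** 2).mean()

# Forward pass: keep z1 and a1, the backward pass needs them.
z1, a1, out = forward(x, w1, b1, w2, b2)
loss = mse(out, y)

# Backward pass: chain rule applied from the loss back to the first layer.
d_out = 2 * (out - y) / out.size
d_w2 = a1.T @ d_out
d_b2 = d_out.sum(axis=0)
d_a1 = d_out @ w2.T
d_z1 = d_a1 * (z1 > 0)        # ReLU passes gradient only where its input was positive
d_w1 = x.T @ d_z1
d_b1 = d_z1.sum(axis=0)

# Finite-difference check of a single weight gradient.
eps = 1e-6
w1_shift = w1.copy()
w1_shift[0, 0] += eps
numeric = (mse(forward(x, w1_shift, b1, w2, b2)[2], y) - loss) / eps
print(abs(numeric - d_w1[0, 0]) < 1e-4)
```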

In [None]:
10. Why do we need to store some of the activations calculated for intermediate layers in the
forward pass?


In [None]:
Intermediate activations from the forward pass are needed to compute gradients in the backward pass. By the chain rule, the gradient of the loss with respect to a layer's weights depends on the input that layer received, and the gradient through an activation function depends on the value it was applied to. If these values were not stored, they would have to be recomputed during the backward pass before the gradients could be calculated.
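A sketch of a linear layer that saves its input for later use. The class name `Linear` and its attributes are hypothetical and only illustrate the idea; the weight gradient is the stored input, transposed, times the incoming gradient.

```python
import numpy as np

class Linear:
    """Illustrative bias-free linear layer that caches its input."""

    def __init__(self, w):
        self.w = w

    def forward(self, x):
        self.inp = x                           # stored for the backward pass
        return x @ self.w

    def backward(self, grad_out):
        self.w_grad = self.inp.T @ grad_out    # needs the stored input
        return grad_out @ self.w.T             # gradient passed to the previous layer

layer = Linear(np.array([[1.0], [3.0]]))
out = layer.forward(np.array([[1.0, 2.0]]))    # 1*1 + 2*3 = 7
grad_in = layer.backward(np.array([[1.0]]))

print(out)           # [[7.]]
print(layer.w_grad)  # equals the stored input, transposed
print(grad_in)       # equals the weights, transposed
```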

In [None]:
11. What is the downside of having activations with a standard deviation too far away from 1?


In [None]:
The downside of having activations with a standard deviation too far away from 1 is that the deviation compounds from layer to layer. Each layer rescales its input, so after many layers the activations either shrink toward 0 or grow until they overflow to infinity or nan. This in turn leads to vanishing or exploding gradients during training. Vanishing gradients are so small that the earlier layers barely learn; exploding gradients are so large that updates become unstable and training fails to converge.
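The compounding effect can be sketched by pushing random data through a stack of purely linear layers. The layer count, width, and weight scales below are arbitrary assumptions, and `final_std` is a hypothetical helper written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_std(weight_scale, n_layers=50, width=100):
    """Standard deviation of the activations after n_layers linear layers."""
    x = rng.normal(size=(200, width))
    for _ in range(n_layers):
        w = rng.normal(size=(width, width)) * weight_scale
        x = x @ w
    return x.std()

too_big = final_std(1.0)     # each layer multiplies the scale by roughly 10
too_small = final_std(0.01)  # each layer multiplies the scale by roughly 0.1

print(too_big)    # astronomically large
print(too_small)  # vanishingly small
```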



In [None]:
12. How can weight initialization help avoid this problem?

In [None]:
Weight initialization can help avoid this problem by choosing the scale of the initial weights so that each layer's outputs have roughly the same standard deviation as its inputs. Techniques such as Xavier/Glorot initialization and He initialization derive that scale from the number of input (and, for Xavier/Glorot, output) units of a layer; for example, a scale of about 1/sqrt(n_in) suits a plain linear layer, and sqrt(2/n_in) is used with ReLU activations. With weights initialized this way, the activations stay in a reasonable range through the depth of the network, which improves the stability and convergence of training.
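A sketch of the effect for a stack of linear layers, assuming a weight scale of 1/sqrt(n_in); ReLU layers with He initialization are not shown. The helper `final_std` and the sizes are hypothetical choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_std(weight_scale, n_layers=50, width=100):
    """Standard deviation of the activations after n_layers linear layers."""
    x = rng.normal(size=(200, width))
    for _ in range(n_layers):
        w = rng.normal(size=(width, width)) * weight_scale
        x = x @ w
    return x.std()

width = 100
scaled = final_std(1 / np.sqrt(width), width=width)
print(scaled)  # stays within an order of magnitude of 1 after 50 layers
```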