# Assignment_12

#### Q1. How does unsqueeze help us to solve certain broadcasting problems?

**Ans.** 

The unsqueeze function is used in PyTorch to insert an extra dimension of size 1 into a tensor at a chosen position. This can help us solve certain broadcasting problems where the tensor shapes do not match.

Broadcasting is a technique in PyTorch that allows us to perform element-wise operations on tensors with different shapes. However, to do this, PyTorch requires that the shapes of the tensors are compatible. Shapes are compared dimension by dimension, starting from the last one. Two dimensions are compatible if they are equal, if one of them is 1, or if one of them is missing (a missing leading dimension is treated as size 1).

If we have two tensors that are not compatible for broadcasting, we can use unsqueeze to insert a dimension of size 1 into one of them so that the shapes line up. For example, consider the following tensors:

`A = torch.tensor([10, 20])`

`B = torch.tensor([[4, 5, 6],`

 `[7, 8, 9]])`

A has shape (2,) and B has shape (2, 3). Suppose we want to add A[0] to every element of the first row of B and A[1] to every element of the second row (that is, add A to every column of B). Writing A + B directly fails, because the last dimensions (2 and 3) do not match. We need to reshape A to be a 2D tensor of shape (2, 1), so that it has the same number of rows as B. We can do this using the unsqueeze function:

`A = A.unsqueeze(1)`

`C = A + B`

Now, A has shape (2, 1) and B has shape (2, 3), which are compatible for broadcasting. PyTorch behaves as if A were repeated along the new dimension to match the shape of B, without actually copying its elements, so C is [[14, 15, 16], [27, 28, 29]].

In general, unsqueeze is a useful function when we need to add an extra dimension to a tensor to make it compatible with another tensor for broadcasting.
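As a minimal sketch of this idea, the snippet below adds one value per row to a matrix, first without and then with unsqueeze. The names `mat`, `row_offsets`, and `col` are illustrative assumptions, not part of the original answer.

```python
import torch

mat = torch.tensor([[4, 5, 6],
                    [7, 8, 9]])        # shape (2, 3)
row_offsets = torch.tensor([10, 20])   # shape (2,), one value per row of mat

# Shapes (2,) and (2, 3) are not broadcast-compatible: the last dims are 2 and 3.
try:
    mat + row_offsets
    direct_add_failed = False
except RuntimeError:
    direct_add_failed = True

# unsqueeze(1) inserts a size-1 dimension at position 1: (2,) -> (2, 1).
col = row_offsets.unsqueeze(1)
result = mat + col                     # (2, 3) + (2, 1) broadcasts to (2, 3)

print(direct_add_failed)  # True
print(col.shape)          # torch.Size([2, 1])
print(result)
```

The size-1 dimension is what gives broadcasting something to stretch: each entry of `row_offsets` is reused across all three columns of its row.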

#### Q2. How can we use indexing to do the same operation as unsqueeze?

**Ans.** 
We can use indexing to achieve the same operation as unsqueeze by creating a new axis or dimension in the tensor. In PyTorch, we can use the None keyword to create a new axis.

For example, consider the following tensor:

`A = torch.tensor([1, 2, 3])`

To add a new axis to A using indexing, we can write:


`A[:, None]`

This creates a new axis at position 1, effectively reshaping A from a 1D tensor of shape (3,) to a 2D tensor of shape (3, 1). We can then use this reshaped tensor for broadcasting with another tensor, just as we did with unsqueeze.

Similarly, to add a new axis to a 2D tensor like B from the previous example:

`B = torch.tensor([[4, 5, 6],`
                  `[7, 8, 9]])`

We can use indexing as follows:


`B[:, :, None]`


This creates a new axis at position 2, effectively reshaping B from a 2D tensor of shape (2, 3) to a 3D tensor of shape (2, 3, 1). Again, we can use this reshaped tensor for broadcasting with another tensor.

In general, using indexing to add new axes or dimensions can be a useful alternative to unsqueeze for reshaping tensors and making them compatible for broadcasting.
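The equivalence can be checked directly. The following sketch compares None indexing with unsqueeze on the tensors from this answer; the variable names on the left-hand sides are chosen here for illustration only.

```python
import torch

A = torch.tensor([1, 2, 3])

col_by_index = A[:, None]          # new axis at position 1 -> shape (3, 1)
col_by_unsqueeze = A.unsqueeze(1)

row_by_index = A[None, :]          # new axis at position 0 -> shape (1, 3)
row_by_unsqueeze = A.unsqueeze(0)

B = torch.tensor([[4, 5, 6],
                  [7, 8, 9]])
# The ellipsis stands for "all existing dimensions", so this appends an axis
# at the end however many dimensions the tensor has.
last_axis = B[..., None]           # shape (2, 3, 1), same as B.unsqueeze(-1)

print(col_by_index.shape, row_by_index.shape, last_axis.shape)
```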

#### Q3. How do we show the actual contents of the memory used for a tensor?

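**Ans.** We can call the tensor's storage() method, which returns the underlying one-dimensional array that holds the tensor's data. The commonly used way to store such data is in a single array that is laid out as a single, contiguous block within memory; the shape and strides of the tensor only describe how to index into that block. More concretely, a contiguous 3x3x3 tensor would be stored simply as a single array of 27 values, one after the other.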
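A small sketch of what this looks like in practice, assuming a 3x3x3 tensor built with `torch.arange` (the tensor and variable names are illustrative). Recent PyTorch versions may print a deprecation warning for `storage()`, but the call still returns the flat buffer.

```python
import torch

t = torch.arange(27).reshape(3, 3, 3)

# storage() exposes the flat 1D buffer behind the tensor.
flat = t.storage().tolist()

# The strides say how many elements to skip in that buffer to move one step
# along each dimension: 9 per 3x3 slab, 3 per row, 1 per column.
strides = t.stride()

# A transposed view reuses the same buffer; only the strides change.
view = t.transpose(0, 2)
shares_memory = view.data_ptr() == t.data_ptr()

print(len(flat), strides, view.stride(), shares_memory)
```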

#### Q4. When adding a vector of size 3 to a matrix of size 3×3, are the elements of the vector added to each row or each column of the matrix? (Be sure to check your answer by running this code in a notebook.)


**Ans.** 
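When adding a vector of size 3 to a matrix of size 3x3, the elements of the vector are added to each row of the matrix. This is because broadcasting in PyTorch aligns shapes starting from the last dimension: the vector of shape (3,) is treated as a matrix of shape (1, 3), and that single row is repeated for every row of the matrix.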

For example, consider the following code:

`import torch`

`A = torch.ones((3, 3))`

`B = torch.tensor([1, 2, 3])`

`C = A + B`

`print(C)`

This will output the following tensor:

`tensor([[2., 3., 4.],`

`[2., 3., 4.],`

`[2., 3., 4.]])`

As we can see, the values of B have been added to each row of A. B has shape (3,), which is treated as (1, 3); its last dimension already matches the last dimension of A, and the leading dimension of size 1 is broadcast across the three rows of A, so the same vector is added to every row.

If we want to add the vector to each column of the matrix instead, we need to reshape the vector to be a column vector of shape (3, 1), so that it matches the matrix along the first dimension and is broadcast along the last one. We can do this using the unsqueeze method or indexing with None.
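To compare the two cases side by side, here is a short sketch that reuses A and B from above (with B as floats) and adds B first to each row and then to each column. The names `per_row` and `per_col` are illustrative.

```python
import torch

A = torch.ones((3, 3))
B = torch.tensor([1., 2., 3.])

# B is treated as shape (1, 3): the same vector is added to every row.
per_row = A + B

# B as shape (3, 1): B[i] is added to every entry of row i,
# so every column of the result is A's column plus B.
per_col = A + B[:, None]

print(per_row)
print(per_col)
```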

#### Q5. Do broadcasting and expand_as result in increased memory use? Why or why not?

**Ans.** 
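No. Neither broadcasting nor expand_as increases the memory used by the tensor being expanded. Take a vector c of size 3 and a 3x3 matrix m. When we compute c + m, the elements of c are expanded to make three rows that match m, making the operation possible, but PyTorch doesn't actually create three copies of c in memory. This is done by the expand_as method behind the scenes:

`c.expand_as(m)`

`tensor([[10., 20., 30.],`

`[10., 20., 30.],`

`[10., 20., 30.]])`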
        
If we look at the corresponding tensor, we can call its storage method (which shows the actual contents of the memory used for the tensor) to check there is no useless data stored:

`t = c.expand_as(m)`

`t.storage()`

`10.0`

`20.0`

`30.0`

`[torch.FloatStorage of size 3]`

Even though the tensor officially has nine elements, only three scalars are stored in memory. This is possible thanks to the clever trick of giving that dimension a stride of 0 (which means that when PyTorch looks for the next row by adding the stride, it doesn't move):

`t.stride(), t.shape`

`((0, 1), torch.Size([3, 3]))`
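The sketch below contrasts the expanded view with an explicit copy. The values of `m` are a placeholder assumption (only its shape matters), and `contiguous()` is used here just to force PyTorch to materialize all nine elements for comparison.

```python
import torch

c = torch.tensor([10., 20., 30.])
m = torch.ones(3, 3)   # placeholder values; only the shape matters here

t = c.expand_as(m)     # a view of c: shape (3, 3), stride (0, 1)

# The view points at the same memory as c, so no data was copied.
same_memory = t.data_ptr() == c.data_ptr()

# contiguous() forces a real copy with one stored value per element.
copied = t.contiguous()

print(t.stride(), copied.stride())                       # (0, 1) (3, 1)
print(t.storage().size(), copied.storage().size())       # 3 9
print(same_memory, copied.data_ptr() == c.data_ptr())    # True False
```

The result of an operation such as c + m is of course a new 3x3 tensor with its own memory; what broadcasting avoids is the intermediate expanded copy of c.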

#### Q6. Implement matmul using Einstein summation.

**Ans.** 
Besides the PyTorch operator @ and torch.matmul, we can implement matrix multiplication with Einstein summation (einsum). This is a compact representation for combining products and sums in a general way. We write an equation like this:

`ik,kj -> ij`

The left-hand side represents the operands' dimensions, separated by commas. Here we have two tensors that each have two dimensions (i,k and k,j). The right-hand side represents the result dimensions, so here we have a tensor with two dimensions i,j.

The rules of Einstein summation notation are as follows:

1. Repeated indices on the left side are implicitly summed over if they are not on the right side.
2. Each index can appear at most twice on the left side.
3. The unrepeated indices on the left side must appear on the right side.

So in our example, since k is repeated, we sum over that index. In the end the formula represents the matrix obtained when we put in (i,j) the sum of all the coefficients (i,k) in the first tensor multiplied by the coefficients (k,j) in the second tensor... which is the matrix product! Here is how we can code this in PyTorch:

`def matmul(a,b): return torch.einsum('ik,kj->ij', a, b)`

Einstein summation is a very practical way of expressing operations involving indexing and sums of products. Note that you can have just one member on the left-hand side. For instance, this:

`torch.einsum('ij->ji', a)`

returns the transpose of the matrix a. You can also have three or more members. This:

`torch.einsum('bi,ij,bj->b', a, b, c)`

will return a vector of size b where the k-th coordinate is the sum of a[k,i] b[i,j] c[k,j] over all i and j. This notation is particularly convenient when you have more dimensions because of batches. For example, if you have two batches of matrices and want to compute the matrix product per batch, you could do this:

`torch.einsum('bik,bkj->bij', a, b)`

Let's go back to our new matmul implementation using einsum and look at its speed:

`%timeit -n 20 t5 = matmul(m1,m2)`

`68.7 µs ± 4.06 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)`

As you can see, not only is it practical, but it's very fast. einsum is often the fastest way to do custom operations in PyTorch, without diving into C++ and CUDA.
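As a sanity check, the einsum expressions above can be compared with PyTorch's built-in operations. This is a sketch under assumptions: the matrix sizes and the names `m1`, `m2`, `batch_a`, `batch_b`, `x`, `w`, `y` are chosen here for illustration and are not the matrices timed above.

```python
import torch

def matmul(a, b):
    # 'ik,kj->ij': k is shared and absent from the output, so it is summed over.
    return torch.einsum('ik,kj->ij', a, b)

torch.manual_seed(0)
m1 = torch.randn(5, 4)
m2 = torch.randn(4, 3)

product = matmul(m1, m2)
transposed = torch.einsum('ij->ji', m1)

# Batched product: b is kept on the right, so each pair of matrices
# is multiplied independently.
batch_a = torch.randn(2, 5, 4)
batch_b = torch.randn(2, 4, 3)
batched = torch.einsum('bik,bkj->bij', batch_a, batch_b)

# Three operands: for each b, sum x[b,i] * w[i,j] * y[b,j] over i and j.
x = torch.randn(2, 4)
w = torch.randn(4, 3)
y = torch.randn(2, 3)
triple = torch.einsum('bi,ij,bj->b', x, w, y)

print(torch.allclose(product, m1 @ m2, atol=1e-5))
print(torch.allclose(batched, torch.bmm(batch_a, batch_b), atol=1e-5))
```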


#### Q7. What does a repeated index letter represent on the lefthand side of einsum?

**Ans.** 
A repeated index letter on the lefthand side marks a dimension that the operands share. The elements are multiplied along that dimension and, if the letter does not also appear on the righthand side, the products are summed over it. In the matmul equation from Q6:

`ik,kj -> ij`

k appears in both operands but not in the result, so for each (i,j) we sum a[i,k] * b[k,j] over every k, which is exactly the matrix product. If the repeated letter does appear on the righthand side, as b does in `bik,bkj->bij`, it is not summed: it is kept as a batch dimension of the result.
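One way to see what the repeated index does is to write the same computation with explicit loops. The function below is a deliberately slow, illustrative sketch (the name `matmul_loops` is an assumption made here), not how einsum is implemented.

```python
import torch

def matmul_loops(a, b):
    # Explicit version of 'ik,kj->ij'.
    n_i, n_k = a.shape
    n_k2, n_j = b.shape
    assert n_k == n_k2, "the repeated index k must have the same size in both operands"
    out = torch.zeros(n_i, n_j)
    for i in range(n_i):          # i appears on the right: kept
        for j in range(n_j):      # j appears on the right: kept
            for k in range(n_k):  # k is repeated and dropped: summed over
                out[i, j] += a[i, k] * b[k, j]
    return out

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[5., 6.], [7., 8.]])
looped = matmul_loops(a, b)

print(looped)  # tensor([[19., 22.], [43., 50.]])
```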


#### Q8. What are the three rules of Einstein summation notation? Why?

**Ans.** 
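1. Each index can appear at most twice in any term.
2. Repeated indices are implicitly summed over.
3. Each term must contain identical non-repeated indices.

These rules make the notation unambiguous without writing any summation signs. Because an index appears at most twice in a term, a repeated index always identifies exactly one pair of dimensions to multiply and sum over. Because summation is implied by repetition, the sum itself can be left out of the formula. Because every term must carry the same non-repeated (free) indices, both sides of an equation describe tensors of the same shape. In einsum terms, this corresponds to the rules listed in Q6: repeated indices on the left are summed over when they are absent from the right, and the unrepeated indices on the left must appear on the right.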
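A few small einsum calls illustrate the rules. The tensors and names below are illustrative assumptions. Note that torch.einsum is more permissive than the textbook rules: for example, an unrepeated index that is left out of the right side is simply summed over, as the last line shows.

```python
import torch

v = torch.tensor([1., 2., 3.])
w = torch.tensor([4., 5., 6.])
sq = torch.tensor([[1., 2.], [3., 4.]])

# i is repeated and absent on the right: summed over (dot product).
dot = torch.einsum('i,i->', v, w)          # 1*4 + 2*5 + 3*6 = 32

# No repeated index: nothing is summed, both free indices appear on the right.
outer = torch.einsum('i,j->ij', v, w)      # shape (3, 3)

# i is repeated but kept on the right: multiplied element-wise, not summed.
hadamard = torch.einsum('i,i->i', v, w)    # [4, 10, 18]

# Beyond the textbook rules: torch.einsum sums indices omitted from the right.
total = torch.einsum('ij->', sq)           # 1 + 2 + 3 + 4 = 10

print(dot, hadamard, total)
```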

#### Q9. What are the forward pass and backward pass of a neural network?

**Ans.** 
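The forward pass feeds the input through the network layer by layer to compute the output (and, during training, the loss). The backward pass then propagates the loss backward through the same layers, using the chain rule to compute the gradient of the loss with respect to each parameter. A forward pass and a backward pass together make up one "iteration". During one iteration, we usually pass a subset of the dataset, which is called a "mini-batch" or "batch". An "epoch" means passing the entire dataset through the network in batches.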
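A minimal sketch of one forward and backward pass, assuming a toy model with a single weight and a squared-error loss (all names and values are illustrative):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # a single trainable parameter
x = torch.tensor(3.0)                      # input
target = torch.tensor(10.0)

# Forward pass: compute the prediction and the loss from the input.
pred = w * x                   # 2 * 3 = 6
loss = (pred - target) ** 2    # (6 - 10)^2 = 16

# Backward pass: compute d(loss)/d(w) with the chain rule.
loss.backward()
# d(loss)/d(w) = 2 * (pred - target) * x = 2 * (-4) * 3 = -24

print(loss.item(), w.grad.item())  # 16.0 -24.0
```

An optimizer step would then use `w.grad` to update `w`, which completes one training iteration.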

#### Q10. Why do we need to store some of the activations calculated for intermediate layers in the forward pass?

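**Ans.** Forward propagation refers to the calculation and storage of intermediate values as the input data is fed forward through the network to generate an output. Each hidden layer accepts the data from the previous layer, processes it with its weights and activation function, and passes the result to the next layer or the output layer. We store some of these activations because the backward pass needs them: by the chain rule, the gradient of a layer's weights depends on the input that layer received, and the gradient of an activation function depends on the value it was applied to. Without the stored activations, we would have to recompute the forward pass to obtain the gradients.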
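The dependence on stored activations becomes visible when the backward pass is written by hand. The sketch below assumes a small two-layer network with a ReLU and a loss equal to the sum of the outputs; the sizes and names are illustrative, and autograd is used only to check the manual gradients.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 3)
w1 = torch.randn(3, 5, requires_grad=True)
w2 = torch.randn(5, 1, requires_grad=True)

# Forward pass, keeping the intermediate results.
l1 = x @ w1                 # pre-activation of the hidden layer
a1 = l1.clamp_min(0.)       # ReLU activation
out = a1 @ w2
loss = out.sum()

# Manual backward pass, reusing the values stored during the forward pass.
with torch.no_grad():
    grad_out = torch.ones_like(out)        # d(loss)/d(out) for loss = out.sum()
    grad_w2 = a1.t() @ grad_out            # needs the stored activation a1
    grad_a1 = grad_out @ w2.t()
    grad_l1 = grad_a1 * (l1 > 0).float()   # ReLU gradient needs the stored l1
    grad_w1 = x.t() @ grad_l1              # needs the layer input x

# Autograd computes the same gradients (it stores the activations for us).
loss.backward()
print(torch.allclose(grad_w1, w1.grad), torch.allclose(grad_w2, w2.grad))
```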

#### Q11. What is the downside of having activations with a standard deviation too far away from 1?

**Ans.** 
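Activations with a standard deviation far from 1 make a deep network numerically unstable. Each layer multiplies its input by a weight matrix, so if every layer scales the standard deviation up a little, the activations grow exponentially with depth until they overflow to infinity or nan; if every layer scales it down, they shrink toward zero. In both cases the gradients carry no useful signal and the network cannot train.

For reference, a normal distribution with a mean of 0 and a standard deviation of 1 is called a standard normal distribution, and areas of the normal distribution are often represented by tables of the standard normal distribution. Since the variance is the square of the standard deviation, a standard deviation above 1 is smaller than its variance, and a standard deviation below 1 is bigger than its variance. In a normal distribution, a score that is 1 standard deviation above the mean is equivalent to about the 84th percentile.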
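A rough experiment shows both failure modes. The helper `run_layers` is hypothetical and written for this illustration: it pushes random data through 50 layers that are plain matrix multiplications, with weights drawn from a normal distribution scaled by `weight_scale`.

```python
import torch

torch.manual_seed(0)

def run_layers(weight_scale, n_layers=50, width=100):
    # Hypothetical helper: repeated matrix multiplications stand in for a
    # deep network without activation functions.
    x = torch.randn(200, width)
    for _ in range(n_layers):
        x = x @ (torch.randn(width, width) * weight_scale)
    return x

exploded = run_layers(weight_scale=1.0)    # std grows roughly 10x per layer
vanished = run_layers(weight_scale=0.01)   # std shrinks roughly 10x per layer

print(torch.isfinite(exploded).all())  # the activations overflow float32
print(vanished.abs().max())            # the activations collapse toward zero
```

With 100 inputs per layer, unit-variance weights multiply the standard deviation by about sqrt(100) = 10 at every layer, which is why 50 layers are enough to leave the range of 32-bit floats.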


#### Q12. How can weight initialization help avoid this problem?

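**Ans.** The aim of weight initialization is to prevent layer activation outputs from exploding or vanishing during the course of a forward pass through a deep neural network. It does this by scaling the initial random weights so that each layer's outputs keep roughly the same standard deviation as its inputs. A common scheme scales the weights by 1/sqrt(n_in) for a layer with n_in inputs (the idea behind Xavier/Glorot initialization), and Kaiming (He) initialization uses sqrt(2/n_in) to compensate for ReLU setting about half of the activations to zero.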
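The effect of these scales can be checked with a small experiment. This is a sketch under assumptions: 50 layers of width 100, random normal weights, and layers modeled as plain matrix multiplications (optionally followed by ReLU). It is not a full training setup.

```python
import math
import torch

torch.manual_seed(0)
width, n_layers = 100, 50

# Linear layers only: scaling by 1/sqrt(n_in) keeps the standard deviation
# within the same order of magnitude across all layers.
x = torch.randn(200, width)
for _ in range(n_layers):
    x = x @ (torch.randn(width, width) / math.sqrt(width))
linear_std = x.std().item()

# With ReLU after each layer, sqrt(2 / n_in) (Kaiming-style) compensates for
# the activations that ReLU sets to zero.
y = torch.randn(200, width)
for _ in range(n_layers):
    y = (y @ (torch.randn(width, width) * math.sqrt(2 / width))).clamp_min(0.)
relu_std = y.std().item()

print(linear_std, relu_std)
```

The exact values depend on the random seed, but in both cases the activations stay finite and well away from zero, in contrast to unscaled or badly scaled weights.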