$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
$$

# Part 3: Summary Questions
<a id=part2></a>

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

**Notes**

- Clearly mark where your answer begins, e.g. write "**Answer:**" in the beginning of your cell.
- Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
- This notebook should be runnable from start to end without any errors.

### CNNs

1. Explain the meaning of the term "receptive field" in the context of CNNs.

**Answer**:

The receptive field is defined as the region of the input space that affects a particular unit of the CNN. This region can be measured not only in the network's input but also in the output of earlier units in the network. The receptive field is therefore calculated relative to both the input that we consider and the unit that we treat as the “receiver” of this input region.

2. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

**Answer**:

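* Add more convolutional layers (make the network deeper) - increases the receptive field size linearly, as each extra layer (with stride 1) increases the receptive field size by the kernel size minus one. Each layer combines features from a dense, contiguous neighborhood.

* Add pooling layers or higher-stride convolutions (sub-sampling) - increases the receptive field size multiplicatively, since the contribution of every later layer is multiplied by the strides before it. Spatial resolution is lost, so later layers combine features on a coarser grid.

* Dilated convolutions - the kernel samples its input with gaps, so the receptive field grows with the dilation rate without adding parameters. Sequentially placed dilated convolutions with growing dilation rates increase the RF exponentially, while combining only sparsely sampled features. Depth-wise convolutions, by contrast, do not directly increase the receptive field; their small compute and parameter footprint only makes it cheaper to add layers or use larger kernels.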
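The growth rates of stacking, striding and dilation can be compared with a small sketch. The helper `receptive_field` and the 4-layer configurations below are illustrative assumptions, not part of the course material; the recursion adds `dilation * (kernel - 1)` input positions per layer, scaled by the product of the preceding strides.

```python
def receptive_field(layers):
    """Receptive field of a stack of layers given as (kernel, stride, dilation) tuples."""
    r, jump = 1, 1  # jump = distance (in input pixels) between adjacent units
    for kernel, stride, dilation in layers:
        r += dilation * (kernel - 1) * jump
        jump *= stride
    return r

depth = 4
plain = [(3, 1, 1)] * depth                        # stacked 3x3 convolutions
strided = [(3, 2, 1)] * depth                      # 3x3 convolutions with stride 2
dilated = [(3, 1, 2 ** i) for i in range(depth)]   # dilation rates 1, 2, 4, 8

for name, layers in [("plain", plain), ("strided", strided), ("dilated", dilated)]:
    # Receptive field after each of the first 1..depth layers
    print(name, [receptive_field(layers[:n]) for n in range(1, depth + 1)])
```

With these settings the plain stack grows by 2 per layer, while the strided and dilated stacks roughly double their receptive field with every layer.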

3. Imagine a CNN with three convolutional layers, defined as follows:

In [None]:
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape

torch.Size([1, 32, 122, 122])

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

**Answer**:

To calculate the size of the receptive field of each pixel, 
we will use the formula $r_{0}=1+\sum_{l=1}^{L}\left(d_{l}\left(k_{l}-1\right) \prod_{i=1}^{l-1} s_{i}\right)$, where $r_0$ is the receptive field of an output pixel and $k_l$, $s_l$, $d_l$ are the kernel size, stride and dilation of layer $l$. The pooling layers count as layers with $k=s=2$ and $d=1$. 
We will implement the calculation below with
K = kernels, S = strides, D = dilations

In [None]:
num_layers = 5
K = [3, 2, 5, 2, 7]
S = [1, 2, 2, 2, 1]
D = [1, 1, 1, 1, 2]
r_0 = 1

for l in range(num_layers):
    # Product of the strides of all layers before layer l
    stride_prod = 1
    for i in range(l):
        stride_prod *= S[i]
    # A dilated kernel spans D[l] * (K[l] - 1) + 1 input positions
    r_0 += D[l] * (K[l] - 1) * stride_prod

print(f"The size of the receptive field of each pixel in the output tensor is {r_0}")

The size of the receptive field of each pixel in the output tensor is 112


4. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

  After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead: $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

  However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

**Answer**:

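The input $\vec{x}$ has a significant effect on the filters that are learned.
In the original network each layer has to learn the full mapping from $\vec{x}$ to $\vec{y}_l$. In the residual network the skip connection already passes $\vec{x}$ through, so $f_l$ only has to learn the residual $\vec{y}_l-\vec{x}$. Since the two networks train $f_l$ to approximate different functions, the learned filters $\vec{\theta}_l$ are different as well.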

### Dropout

1. Consider the following neural network:

In [None]:
import torch.nn as nn

p1, p2 = 0.1, 0.2
nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout(p=p1),
    nn.Dropout(p=p2),
)

Sequential(
  (0): Conv2d(3, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): Dropout(p=0.1, inplace=False)
  (3): Dropout(p=0.2, inplace=False)
)

If we want to replace the two consecutive dropout layers with a single one defined as follows:
```python
nn.Dropout(p=q)
```
what would the value of `q` need to be? Write an expression for `q` in terms of `p1` and `p2`.

**Answer**

Assuming the two layers drop units independently of each other, an activation is dropped (zeroed) by the first dropout layer with probability $p_1$. If it survives the first layer, it is dropped by the second with probability $p_2$, which contributes a probability of $(1-p_1)p_2$. 
Thus, the final expression is: $q = p_1 + (1-p_1)p_2 = 1-(1-p_1)(1-p_2)$.
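As a quick sanity check of this expression, the sketch below plugs in the values `p1 = 0.1` and `p2 = 0.2` from the snippet in the question. It also assumes that each dropout layer rescales the surviving activations by `1 / (1 - p)`, and checks that the combined scaling of the two layers equals the scaling of the single layer.

```python
p1, p2 = 0.1, 0.2

# A unit survives both layers only if it survives each one independently.
keep = (1 - p1) * (1 - p2)
q = 1 - keep

# Survivors are rescaled by 1 / (1 - p) in each layer.
scale_two_layers = (1 / (1 - p1)) * (1 / (1 - p2))
scale_one_layer = 1 / (1 - q)

print(f"q = {q:.2f}")
print(f"scales match: {abs(scale_two_layers - scale_one_layer) < 1e-12}")
```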

2. **True or false**: dropout must be placed only after the activation function.

**Answer**

False. Given a drop probability of p, the dropout layer sets every hidden unit to 0 with probability p, and it can be placed either before or after the activation function. The order can change the result, though. If dropout is performed before the activation, the activation can "re-activate" the dropped units: a hidden unit with value x that is dropped to 0 and then passed through a sigmoid ends up with the value sigmoid(0) = 0.5. In the other order, sigmoid(x) is passed to dropout and the final value is 0, which is the value we expect for a dropped unit. For such activations it is therefore preferable, but not required, to place dropout after the activation.  
For ReLU the order does not matter, since relu(0) = 0 and the positive rescaling commutes with ReLU: dropout(relu(x)) = relu(dropout(x)) for the same dropout mask.

3. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

**Answer**

Let $a = A(x)$ be the activation of a given unit. Without dropout, the value of the unit is simply $a$. With dropout, the unit is zeroed with probability $p$ and kept with probability $1-p$, so the expected value of its output $\tilde{a}$ is $\mathbb{E}\left[\tilde{a}\right] = (1-p)\,a + p\cdot 0 = (1-p)\,a$.

In order to get $\mathbb{E}\left[\tilde{a}\right] = a$ we need to scale the surviving activations by $1/(1-p)$. Then
$\mathbb{E}\left[\tilde{a}\right] = (1-p)\frac{a}{1-p} + p\cdot 0 = a$.
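The expectation argument can also be checked empirically. This is a minimal Monte Carlo sketch in plain Python; the drop probability `p = 0.3`, the activation value `a = 2.0` and the helper `dropout` are illustrative assumptions.

```python
import random

random.seed(0)
p = 0.3        # drop probability (illustrative value)
a = 2.0        # activation value A(x) (illustrative value)
n = 200_000    # number of simulated forward passes

def dropout(value, p, scale):
    """Zero the value with probability p; optionally rescale survivors by 1/(1-p)."""
    if random.random() < p:
        return 0.0
    return value / (1 - p) if scale else value

mean_unscaled = sum(dropout(a, p, scale=False) for _ in range(n)) / n
mean_scaled = sum(dropout(a, p, scale=True) for _ in range(n)) / n

print(f"unscaled mean: {mean_unscaled:.3f} (theory: {(1 - p) * a:.3f})")
print(f"scaled mean:   {mean_scaled:.3f} (theory: {a:.3f})")
```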

### Losses and Activation functions

1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?

**Answer**

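No. We should use binary cross-entropy instead: $-(y\log(p)+(1-y)\log(1-p))$, where $p$ is the predicted probability of the hotdog class.

With an L2 loss the penalty for a single sample is bounded: if the image is a dog ($y=0$) and the model returns 1 (hotdog), the L2 error is 1. Even if all $N$ samples are classified wrongly, the L2 norm of the error is only $\sqrt{N}$.
Numerically, for a dog image ($y=0$) a prediction of $p=0.9$ gives an L2 loss of $0.81$ and $p=0.99$ gives $0.98$, so the loss barely distinguishes a wrong prediction from a confidently wrong one. Cross-entropy gives $-\log(0.1)\approx 2.3$ and $-\log(0.01)\approx 4.6$ for the same predictions, so the size of the error in the predicted probabilities $(p, 1-p)$ is penalized, and not just the final class output.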
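The difference between the two losses is easy to tabulate. In this sketch the helper names `l2` and `bce` and the list of predicted probabilities are assumptions chosen for illustration; the ground truth is a dog (`y = 0`) and the model becomes increasingly confident that it is a hotdog.

```python
import math

def l2(y, p):
    """Squared error between the label and the predicted probability."""
    return (y - p) ** 2

def bce(y, p):
    """Binary cross-entropy for a single sample."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

y = 0  # ground truth: dog
for p in (0.6, 0.9, 0.99, 0.999):
    print(f"p={p}: L2={l2(y, p):.3f}  BCE={bce(y, p):.3f}")
```

The L2 loss saturates just below 1, while the cross-entropy keeps growing as the prediction gets more confidently wrong.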

2. After months of research into the origins of climate change, you observe the following result:

<center><img src="https://sparrowism.soc.srcf.net/home/piratesarecool4.gif" /></center>

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in `N` locations around the globe.
You define your model as follows:

In [None]:
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H), nn.Sigmoid(),
    ]*24,
    nn.Linear(in_features=H, out_features=1),
)

While training your model you notice that the loss reaches a plateau after only a few iterations.
It seems that your model is no longer training.
What is the most likely cause?

**Answer**

The most likely cause is vanishing gradients. The model has 26 linear layers with 25 sigmoid activations between them. The derivative of the sigmoid is at most 0.25, so the gradient shrinks every time it is propagated back through an activation and is effectively zero by the time it reaches the early layers. This prevents the weights from updating, and training stalls as a result. 
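To get a feeling for the scale, the sketch below computes the best-case product of the sigmoid derivatives along the backward path. It deliberately ignores the weight matrices, which also enter the product, so it is only a rough bound under that assumption; the helper names are hypothetical.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

num_sigmoids = 25                    # 1 + 24 sigmoid layers in `mlpirate`
max_grad = sigmoid_grad(0.0)         # the derivative peaks at z = 0
bound = max_grad ** num_sigmoids     # best-case product of activation derivatives

print(f"max sigmoid derivative: {max_grad}")
print(f"bound on the product over {num_sigmoids} activations: {bound:.2e}")
```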

3. Referring to question 2 above: A friend suggests that if you replace the `sigmoid` activations with `tanh`, it will solve your problem. Is he correct? Explain why or why not.

**Answer**

While the derivatives of tanh are larger than those of the sigmoid (a maximum of 1 instead of 0.25), this may not be enough to fix the problem. The larger derivative may help training, but tanh still saturates: its derivative is below 1 everywhere except at 0 and approaches 0 for large inputs. In a network this deep, the same vanishing gradient issue is therefore still likely. 


4. Regarding the ReLU activation, state whether the following sentences are **true or false** and explain:
  1. In a model using exclusively ReLU activations, there can be no vanishing gradients.
  1. The gradient of ReLU is linear with its input when the input is positive.
  1. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.

**Answer**

1. False: The derivative of the ReLU function is either 0 or 1. As long as we multiply by 1 during backpropagation, the activation itself does not shrink the gradient. However, wherever the input is negative the derivative is 0, which blocks the gradient completely, and the weight matrices can still shrink it, so gradients can still vanish.
2. False: For input x > 0, ReLU(x) = x, so the output is linear in the input, but the gradient is the constant 1 and does not depend on x. For x <= 0, ReLU(x) = 0, so the gradient is the constant 0. 
3. True: If every training example causes a certain neuron to have a negative pre-activation (which becomes 0 after ReLU is applied), then the neuron's weights will never be adjusted, since no matter which training example (or batch) is selected, the gradient through the neuron will be 0. Thus, the neuron can be "dead".

### Optimization

1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

**Answer**

- SGD: Computes the gradient for one sample at a time. If there are 100 samples in the dataset, the gradient and loss are calculated 100 times in a single epoch. One update per sample.
- Mini-batch SGD: Computes the gradient for one batch at a time, so that all samples are still covered in each epoch. If there are 100 samples in the dataset, split into 20 mini-batches of 5 samples each, the gradient is calculated 20 times in a single epoch. One update per batch.
- GD: Computes the gradient over all samples at once. One update per pass over the whole dataset. 
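The three variants differ only in how many samples feed each update, which a single training loop with a `batch_size` parameter can show. The toy regression data (`y = 3x`), the learning rate and the function `train_one_epoch` below are assumptions made for illustration.

```python
import random

random.seed(0)
# Toy data for y = 3x (hypothetical example): 100 samples
data = [(x, 3.0 * x) for x in [random.uniform(-1, 1) for _ in range(100)]]

def train_one_epoch(w, batch_size, lr=0.1):
    """One pass over the data; returns the new weight and the number of updates."""
    random.shuffle(data)
    updates = 0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradient of the mean squared error over the current batch only
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad
        updates += 1
    return w, updates

for name, bs in [("SGD", 1), ("mini-batch SGD", 5), ("GD", len(data))]:
    w, updates = train_one_epoch(0.0, bs)
    print(f"{name:15s} batch_size={bs:3d} updates/epoch={updates:3d} w={w:.3f}")
```

With 100 samples this performs 100, 20 and 1 updates per epoch respectively, matching the counts in the answer above.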


2. Regarding SGD and GD:
  1. Provide at least two reasons for why SGD is used more often in practice compared to GD.
  2. In what cases can GD not be used at all?

**Answer**

1. The reasons that SGD is used more often compared to GD are:
- Each update is faster to compute on one sample than on all samples at once.
GD needs fewer updates, but each update requires a pass over the whole dataset (one epoch). SGD takes many more update steps but usually needs fewer epochs, i.e. fewer passes over all the examples, so training is typically much faster overall.

- It is easier to escape a local minimum when training on one sample at a time, because the gradient estimates are noisy. With GD we may get stuck in a local minimum and, depending on the learning rate, never reach the global minimum. 

2. When we are working with very large datasets, it may not be possible to calculate gradients for all samples at once: there may not be enough memory or processing power to perform even a single update. To solve this we can use mini-batch SGD or SGD together with a dataloader. The dataloader only loads into memory the samples that are being processed at a given time, which saves memory.

3. You have trained a deep resnet to obtain SoTA results on ImageNet.
While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average.
Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM.
You're now considering to increase the mini-batch size from $B$ to $2B$.
Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

**Answer**

We would expect the number of iterations required to reach $l_0$ to decrease, although not necessarily by half. In each iteration the gradients are averaged over the samples in the mini-batch, so with $2B$ samples the gradient estimate has lower variance and each step follows the true gradient more closely.  

However, increasing the batch size does not necessarily lead to better results. The gain per iteration has diminishing returns, so the total number of samples processed may increase, and since the averaged gradients are less noisy the model may end up generalizing worse. 
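The effect of averaging on the gradient noise can be illustrated with a small simulation. The per-sample "gradients" here are synthetic (a true gradient of 1.0 plus unit-variance Gaussian noise), and the batch size `B = 32` is an arbitrary assumption; the point is only that doubling the batch size roughly halves the variance of the mini-batch estimate.

```python
import random
import statistics

random.seed(0)
# Synthetic per-sample gradients: true gradient 1.0 plus unit-variance noise
per_sample_grads = [random.gauss(1.0, 1.0) for _ in range(100_000)]

def minibatch_grad_variance(batch_size, num_batches=2_000):
    """Empirical variance of the mini-batch gradient estimate."""
    estimates = [
        statistics.fmean(random.sample(per_sample_grads, batch_size))
        for _ in range(num_batches)
    ]
    return statistics.variance(estimates)

B = 32
var_B = minibatch_grad_variance(B)
var_2B = minibatch_grad_variance(2 * B)
print(f"variance with B:  {var_B:.4f}")
print(f"variance with 2B: {var_2B:.4f}")
print(f"ratio: {var_B / var_2B:.2f}")
```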

4. For each of the following statements, state whether they're **true or false** and explain why.
  1. When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.
  1. Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.
  1. SGD is less likely to get stuck in local minima, compared to GD.
  1. Training  with SGD requires more memory than with GD.
  1. Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.
  1. Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.

**Answer**

1. True: In SGD an optimization step is performed for each sample in every epoch.
2. False: Gradients obtained with SGD have more variance (are noisier), since each one is based on only one sample. In GD the gradient at each update is averaged over all samples, so there is less fluctuation. SGD does often converge faster in practice, but because its updates are much cheaper, not because its gradients have less variance.
3. True: Since the gradients in SGD fluctuate more, we are less likely to get stuck in local minima during training. 
4. False: SGD requires less memory, since only one sample and its calculations need to be stored in memory at a given time.
5. False: GD is not guaranteed to converge to the global minimum; this depends on the loss function (convexity) and the learning rate. SGD is also not guaranteed to converge to a local minimum in general; for example, it can stall at a saddle point.
6. False: Plain gradient descent tends to oscillate across a narrow ravine, since the negative gradient points down one of the steep sides rather than along the ravine towards the optimum, and momentum helps by damping these oscillations and accelerating along the ravine. Newton's method, however, rescales the gradient by the inverse Hessian, which accounts for the curvature directly, so it does not need momentum and is expected to converge in fewer iterations.

5. **Bonus** (we didn't discuss this at class):  We can use bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
  **True or false**: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc).
  Provide a mathematical justification for your answer.

6. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$.
  Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
  1. Explain the concepts of "vanishing gradients", and "exploding gradients".
  2. How can each of these problems be caused by increased depth?
  3. Provide a numerical example demonstrating each.
  4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

**Answer**

1. Vanishing gradients occur when the derivatives get smaller and smaller as we go backward through the layers during backpropagation. 
This problem is typical of the sigmoid and tanh activation functions, whose derivatives lie between 0 and 0.25 and between 0 and 1, respectively. The weight updates become very small (exponentially small in the depth), so training takes much longer, and in the worst case the network stops training completely.

  Exploding gradients occur when the derivatives get larger and larger as we go backward through the layers during backpropagation. This situation is the exact opposite of vanishing gradients, and it is typically caused by large weights rather than by the activation function. 

2. By the chain rule, the gradient at an early layer is a product with one factor per layer that follows it. The deeper the network, the more small (or large) numbers are multiplied together, and the more likely we are to run into these problems. 


3. For example: if each layer contributes a derivative of 0.001, then in a 2-layer network backpropagation yields a value of $0.001^2$. 
With the same derivative in a 100-layer network, backpropagating through all layers yields $0.001^{100}$, a number so small that the gradient effectively vanishes.  
  In the opposite direction, if each layer contributes a derivative of 2, then in a 2-layer network the result after backpropagation is $2^2$, and we can update the weights with the value 4. In a 100-layer network, however, the final value of $2^{100}$ causes the gradient to explode: it is far too large to give a meaningful weight update.

4. If the model learns very slowly, or training stagnates at a very early stage after only a few iterations, the issue is likely vanishing gradients. If the loss fluctuates wildly or diverges, or becomes NaN during training, the issue is likely exploding gradients.
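The numerical example from item 3 can be reproduced directly. This sketch assumes, as in the example, that every layer contributes the same scalar derivative; the function `backprop_factor` is a hypothetical helper.

```python
def backprop_factor(per_layer_derivative, depth):
    """Product of identical per-layer derivatives accumulated over `depth` layers."""
    factor = 1.0
    for _ in range(depth):
        factor *= per_layer_derivative
    return factor

for depth in (2, 10, 100):
    small = backprop_factor(0.001, depth)   # vanishing case
    large = backprop_factor(2.0, depth)     # exploding case
    print(f"depth={depth:3d}  vanishing: {small:.1e}  exploding: {large:.1e}")
```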


### Backpropagation

1. You wish to train the following 2-layer MLP for a binary classification task:
  $$
  \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2
  $$
  You wish to minimize the in-sample loss function, defined as
  $$
  L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right)
  $$
  Where the pointwise loss is binary cross-entropy:
  $$
  \ell(y, \hat{y}) =  - y \log(\hat{y}) - (1-y) \log(1-\hat{y})
  $$
  
  Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.

**Answer**

Denote $\vec{z}^{(i)} = \mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1$, $\vec{h}^{(i)} = \varphi(\vec{z}^{(i)})$ and $\hat{y}^{(i)} = \mat{W}_2 \vec{h}^{(i)} + \vec{b}_2$. The derivative of the pointwise loss w.r.t. the prediction is

$$
\delta^{(i)} = \pderiv{\ell}{\hat{y}^{(i)}} = \frac{\hat{y}^{(i)} - y^{(i)}}{\hat{y}^{(i)}\left(1-\hat{y}^{(i)}\right)}
$$

Applying the chain rule, with $\odot$ denoting the elementwise product, and using the fact that the derivative of $\frac{\lambda}{2}\norm{\mat{W}}_F^2$ w.r.t. $\mat{W}$ is $\lambda\mat{W}$:

$$
\pderiv{L_{\mathcal{S}}}{\mat{W}_1} = \frac{1}{N}\sum_{i=1}^{N} \delta^{(i)} \left(\mat{W}_2^\top \odot \varphi'(\vec{z}^{(i)})\right) {\vec{x}^{(i)}}^\top + \lambda\mat{W}_1
$$

$$
\pderiv{L_{\mathcal{S}}}{\mat{W}_2} = \frac{1}{N}\sum_{i=1}^{N} \delta^{(i)}\, {\vec{h}^{(i)}}^\top + \lambda\mat{W}_2
$$

$$
\pderiv{L_{\mathcal{S}}}{\vec{b}_1} = \frac{1}{N}\sum_{i=1}^{N} \delta^{(i)} \left(\mat{W}_2^\top \odot \varphi'(\vec{z}^{(i)})\right)
$$

$$
\pderiv{L_{\mathcal{S}}}{\vec{b}_2} = \frac{1}{N}\sum_{i=1}^{N} \delta^{(i)}
$$

For the input, each sample $\vec{x}^{(i)}$ only appears in the $i$-th term of the sum:

$$
\pderiv{L_{\mathcal{S}}}{\vec{x}^{(i)}} = \frac{1}{N}\, \delta^{(i)}\, \mat{W}_1^\top \left(\mat{W}_2^\top \odot \varphi'(\vec{z}^{(i)})\right)
$$




2. The derivative of a function $f(\vec{x})$ at a point $\vec{x}_0$ is
  $$
  f'(\vec{x}_0)=\lim_{\Delta\vec{x}\to 0} \frac{f(\vec{x}_0+\Delta\vec{x})-f(\vec{x}_0)}{\Delta\vec{x}}
  $$
  
  1. Explain how this formula can be used in order to compute gradients of neural network parameters numerically, without automatic differentiation (AD).
  
  2. What are the drawbacks of this approach? List at least two drawbacks compared to AD.

**Answer**
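1. Instead of using the analytic expressions derived above and the chain rule, we can estimate each derivative empirically from the limit definition. We pick a small but finite $\Delta$, perturb a single parameter $\theta_j$ by $\Delta$ while keeping all the others fixed, and compute $\left(L(\vec{\theta}+\Delta\vec{e}_j)-L(\vec{\theta})\right)/\Delta$, where $\vec{e}_j$ is the $j$-th standard basis vector. Repeating this for every parameter gives the full gradient. 

2. Every partial derivative requires an additional evaluation of the loss (a full forward pass), so computing the gradient w.r.t. many parameters is slow and resource-intensive, whereas AD obtains all of them in a single backward pass. The result is also only an approximation: a large $\Delta$ causes truncation (discretization) error, while a small $\Delta$ causes round-off error due to cancellation.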
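The accuracy drawback can be seen on a function with a known derivative. The sketch below uses `sin` at `x0 = 1.0` purely as an illustrative test case (an assumption, not part of the question) and prints the forward-difference error for several step sizes.

```python
import math

def forward_diff(f, x, delta):
    """Forward-difference approximation of f'(x)."""
    return (f(x + delta) - f(x)) / delta

x0 = 1.0
exact = math.cos(x0)  # analytic derivative of sin at x0

for delta in (1e-1, 1e-4, 1e-8, 1e-12, 1e-15):
    approx = forward_diff(math.sin, x0, delta)
    print(f"delta={delta:.0e}  error={abs(approx - exact):.2e}")
```

The error first shrinks as `delta` decreases (less truncation error) and then grows again once round-off in the subtraction dominates, so the step size has to be tuned, which is not needed with AD.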

3. Given the following code snippet:
  1. Write a short snippet that calculates the gradient of `loss` w.r.t. `W` and `b` using the approach of numerical gradients from the previous question.
  2. Calculate the same derivatives with autograd.
  3. Show, by calling `torch.allclose()` that your numerical gradient is close to autograd's gradient.

In [3]:
from torch.autograd import grad
import torch

N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W, b = torch.rand(d, d, requires_grad=True, dtype=dtype), torch.rand(d, requires_grad=True, dtype=dtype)

def foo(W, b):
    return torch.mean(X @ W + b)

loss = foo(W, b)
print(f"loss={loss}")

# TODO: Calculate gradients numerically for W and b
grad_W = torch.zeros_like(W)
grad_b = torch.zeros_like(b)
delta = 2e-8

# Forward differences: autograd tracking is not needed here
with torch.no_grad():
    base = foo(W, b)

    for i in range(d):
        for j in range(d):
            W_new = W.clone()
            W_new[i, j] += delta  # perturb a single entry of W
            grad_W[i, j] = (foo(W_new, b) - base) / delta

    for i in range(d):
        b_new = b.clone()
        b_new[i] += delta  # perturb a single entry of b
        grad_b[i] = (foo(W, b_new) - base) / delta

# TODO: Compare with autograd using torch.allclose()
loss.backward()
autograd_W = W.grad
autograd_b = b.grad

assert torch.allclose(grad_W, autograd_W)
assert torch.allclose(grad_b, autograd_b)

loss=1.8899954468598215


### Sequence models

1. Regarding word embeddings:
  1. Explain this term and why it's used in the context of a language model.
  1. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

**Answer**

1. A word embedding is a representation of a word as a vector in a D-dimensional space. It is used to reduce the dimension of the input from a 1-hot encoding, whose size is the size of the vocabulary, to a smaller dimension D. The embedding layer converts each word into a fixed-length vector of a defined size. The resulting vector is dense, with real values instead of just 0’s and 1’s.
2. Yes, the model can be trained directly on the tokens, e.g. represented as 1-hot vectors, but this is not a practical approach, as it demands large storage space for the word vectors and reduces model efficiency. The 1-hot vectors are very sparse and, unlike learned embeddings, carry no notion of similarity between words, so most of the time the model would not yield results as good as with an embedding. 

2. Considering the following snippet, explain:
  1. What does `Y` contain? why this output shape?
  2. **Bonus**: How you would implement `nn.Embedding` yourself using only torch tensors. 

In [7]:
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"Y shape={Y.shape}")

Y shape=torch.Size([5, 6, 7, 8, 42000])


**Answer**

1. X is a tensor of 4 dimensions; it can be thought of as containing 5 tensors, each with a 3-D shape of (6x7x8), that contain integers between 0 and 41 (token indices). Y is the embedding of X, a tensor of 5 dimensions: each integer in X is replaced by its embedding vector of size 42000, which is what the last dimension represents. 

2. The following is an implementation of the embedding with only torch tensors

In [12]:
import torch
from torch.nn.functional import one_hot

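# 1-hot encode the indices; num_classes fixes the encoding size to the vocabulary size
new_X = one_hot(X, num_classes=42)
print(f'X encoded shape = {new_X.shape}')
# Embedding table as a plain tensor: one row of size 42000 per index
weight = torch.randn(42, 42000)
# Multiplying a 1-hot vector by the table selects the matching row
y = new_X.float() @ weight
print(f'y shape is ={y.shape}')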

X encoded shape = torch.Size([5, 6, 7, 8, 42])
y shape is =torch.Size([5, 6, 7, 8, 42000])


3. Regarding truncated backpropagation through time (TBPTT) with a sequence length of S: State whether the following sentences are **true or false**, and explain.
  1. TBPTT uses a modified version of the backpropagation algorithm.
  2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length S.
  3. TBPTT allows the model to learn relations between input that are at most S timesteps apart.

**Answer**

1. True: TBPTT is a modified version of the BPTT training algorithm for recurrent neural networks. The sequence is processed one timestep at a time, and periodically (every k1 timesteps) the BPTT update is performed back for a fixed number of timesteps (k2 timesteps). 
2. False: Limiting the sequence length is not enough. We also need to limit the number of timesteps that BPTT goes back, while still carrying the hidden state forward from one segment to the next.
3. False: The model can also learn relations between inputs that are more than S timesteps apart, since the hidden state is carried forward, and a model like LSTM has a memory gate for learning more distant relations.

### Attention

1. In tutorial 5 we learned how to use attention to perform alignment between a source and target sequence in machine translation.
  1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
  
  2. After learning that self-attention is gaining popularity thanks to the transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections, like in the tutorial..). What influence do you expect this will have on the learned hidden states?


**Answer**

1. Attention is proposed as a solution to a limitation of the encoder-decoder model, which encodes the input sequence into one fixed-length vector from which every output timestep is decoded. This is especially relevant when there are long-term dependencies: without attention, more of the hidden state must be devoted to remembering these dependencies and less to representing the meaning. The hidden state then has to serve two competing tasks in a single model, and the attention mechanism deals with this problem.
The attention mechanism uses a weighted sum of all of the encoder hidden states to flexibly focus the decoder on the most relevant parts of the input sequence, instead of decoding from a single summary of the whole sequence. 

2. The self-attention mechanism allows the inputs to interact with each other and find out which of them they should pay more attention to. The outputs are aggregates of these interactions, weighted by the attention scores. In our case we expect the learned hidden states to encode each source word in the context of the other words in its sequence, since the attention no longer depends on the decoder's state. 
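The weighted sum over encoder hidden states mentioned in item 1 can be sketched in a few lines. This is a simplified version that assumes plain dot-product scoring without learned projections, which may differ from the scoring function used in the tutorial; the hidden-state values are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Dot-product attention for a single query over lists of key/value vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Hypothetical encoder hidden states (used as keys and values) and one decoder state
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states, encoder_states)
print("attention weights:", [round(w, 3) for w in weights])
print("context vector:   ", [round(c, 3) for c in context])
```

For the self-attention variant in item 2, the query would be one of the encoder states themselves rather than `decoder_state`.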

### Unsupervised learning

1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and  a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term.
What would be the qualitative effect of this on:

  1. Images reconstructed by the model during training ($x\to z \to x'$)?
  1. Images generated by the model ($z \to x'$)?

**Answer**

1. Dropping the KL-divergence term will not harm the image reconstruction, because the reconstruction term is still minimized; the model effectively becomes a plain autoencoder. The KL-divergence only measures how far the latent distribution produced by the encoder is from the prior, and it is what makes the model learn a well-behaved latent space distribution.

2. Dropping the KL-divergence term will affect image generation, since images are generated by decoding samples from the latent space prior. Without the KL term nothing forces the encoder's latent distribution to match that prior, so sampled latent vectors may fall in regions the decoder has never seen, and the generated $x'$ will be of poor quality.

2. Regarding VAEs, state whether each of the following statements is **true or false**, and explain:
  1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
  2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
  3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.

**Answer**

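1. False: For a specific input image the encoder produces a normal distribution whose mean and covariance depend on that image, $\mathcal{N}(\vec{\mu}(\vec{x}),\vec{\Sigma}(\vec{x}))$. $\mathcal{N}(\vec{0},\vec{I})$ is the prior, which the KL-divergence term pulls these distributions towards.
2. False: The input of the decoder is sampled from this normal distribution, so the results will vary between runs according to its mean and std.
3. True, up to a sign: we maximize the lower bound (ELBO) on the log-likelihood, which is the same as minimizing an upper bound on the negative log-likelihood, and we hope the bound is tight. 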

3. Regarding GANs, state whether each of the following statements is **true or false**, and explain:
  1. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
  2. It's crucial to backpropagate into the generator when training the discriminator.
  3. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
  4. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.
  5. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.

**Answer**

1. False: We want both losses to be small, but there is a tradeoff, since minimizing one loss increases the other.
2. False: The generator's weights are not updated when training the discriminator. They are updated by backpropagation only when training the generator.
3. True: We take z, a random latent space vector sampled from $\mathcal{N}(\vec{0},\vec{I})$, and run it through the generator.
4. True: The discriminator will learn to distinguish real from generated images faster, so its feedback to the generator is informative.
5. False: If the discriminator is at 50% accuracy, it cannot tell real from generated images, so it no longer provides a useful training signal. Training the generator further may improve the generator's loss, but it will not necessarily improve the generated images. 

### Detection and Segmentation 

1. What is the difference between IoU and Dice score? What's the difference between IoU and mAP?
    Briefly explain when you would use each evaluation.

**Answer**

1. IoU is the area of overlap between the predicted segmentation and the ground truth, divided by the area of their union: $\text{IoU} = |A\cap B| / |A\cup B|$.
Dice coefficient is 2 * the area of overlap divided by the total number of pixels in both masks: $\text{Dice} = 2|A\cap B| / (|A|+|B|)$. Both measure the quality of a single predicted region (a segmentation mask or a bounding box).
The mean Average Precision (mAP) score evaluates a whole detector: it is calculated by taking the mean AP over all classes and/or over all IoU thresholds, depending on the detection challenge. IoU is used there as the threshold that decides whether a detection counts as correct.
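The two overlap measures are easy to compare on a concrete pair of masks. In the sketch below the masks are represented as sets of pixel coordinates, and the two 4x4 squares are a made-up example; the helpers `iou` and `dice` follow the definitions given above.

```python
def iou(pred, target):
    """Intersection over union of two sets of pixel coordinates."""
    union = pred | target
    return len(pred & target) / len(union) if union else 1.0

def dice(pred, target):
    """Dice score: twice the overlap divided by the total size of both sets."""
    total = len(pred) + len(target)
    return 2 * len(pred & target) / total if total else 1.0

# Two hypothetical 4x4 masks, offset by one pixel in each direction
pred = {(r, c) for r in range(0, 4) for c in range(0, 4)}
target = {(r, c) for r in range(1, 5) for c in range(1, 5)}

print(f"IoU  = {iou(pred, target):.3f}")
print(f"Dice = {dice(pred, target):.3f}")
```

The two scores are monotonically related through Dice = 2·IoU / (1 + IoU), so they rank predictions identically, with Dice always at least as large as IoU.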



2. Regarding YOLO and Mask R-CNN, which one is a one-stage detector? Describe the RPN outputs and the YOLO output; address how the network produces the output and the shapes of each output.

**Answer**

YOLO is the one-stage detector. The output of YOLO is a set of bounding boxes with class scores. To obtain it, YOLO takes an image and splits it into an SxS grid, where each grid cell predicts only one object. Classification and localization are applied to each grid cell: if the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts B bounding boxes with confidence scores for those boxes, as well as C class probabilities. Each bounding box consists of 5 predictions: bx, by, bw, bh, and confidence.
  Thus, the output will be an S × S × (B ∗ (1+4) + C) tensor. 


In Mask R-CNN the object proposals are produced by the RPN. The RPN uses a CNN to generate multiple Regions of Interest (RoIs) using a lightweight binary classifier, which it applies to anchor boxes placed over the image. The classifier returns object/no-object scores, and non-max suppression is applied to the anchors with high objectness scores. For each anchor box the RPN has two classification outputs, one for “object-like” and the other for “not object-like”, in the form of a probability distribution.
Then there are 4 additional outputs per anchor box (x,y,w,h). 
Thus, the output dimension at each location of the feature map will be: k(4 + 2)
Where k = number of anchor boxes per location, 2 are the object-like and not object-like outputs, and 4 are the box outputs. 
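The two output shapes can be written down as small helper functions. The function names and the concrete values of `S`, `B`, `C`, `H`, `W` and `k` below are illustrative assumptions; the RPN shapes assume the scores and box outputs are produced at every location of an `H x W` feature map.

```python
def yolo_output_shape(S, B, C):
    """YOLO output: B boxes of 5 values plus C class scores for each grid cell."""
    return (S, S, B * 5 + C)

def rpn_output_shapes(H, W, k):
    """RPN heads over an H x W feature map with k anchors per location."""
    objectness = (H, W, 2 * k)   # object / not-object score per anchor
    box_outputs = (H, W, 4 * k)  # (x, y, w, h) per anchor
    return objectness, box_outputs

# Illustrative values only
print(yolo_output_shape(S=7, B=2, C=20))
print(rpn_output_shapes(H=38, W=50, k=9))
```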

