$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
$$

# Part 3: Summary Questions
<a id=part2></a>

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

**Notes**

- Clearly mark where your answer begins, e.g. write "**Answer:**" in the beginning of your cell.
- Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
- This notebook should be runnable from start to end without any errors.

### CNNs

1. Explain the meaning of the term "receptive field" in the context of CNNs.

**Answer** The receptive field of a unit in a convolutional neural network (CNN) is the region of the input that affects that unit's value. The region can be measured relative to the network input or relative to the output of any earlier layer, so a receptive field is always defined for a pair: the unit that acts as the "receiver" and the tensor in which the region is measured. When the term is used without qualification, it usually refers to a unit in the final output of the network (e.g. the single output unit of a binary classifier) measured relative to the network input (e.g. the input image).

Link:
https://blog.christianperone.com/2017/11/the-effective-receptive-field-on-cnns/#:~:text=The%20receptive%20field%20in%20Convolutional,particular%20unit%20of%20the%20network.&text=The%20numbers%20inside%20the%20pixels,sliding%20step%20of%20the%20filter).

Equivalently, the receptive field of a unit is the spatial region containing all the units that provide input to it, directly or through intermediate layers.
Relative to the previous layer, the receptive field of a convolutional unit is determined by the layer's kernel size and dilation; relative to the network input, it also depends on the kernel sizes, strides and dilations of all preceding layers. The receptive field therefore indicates how much of the input data a unit within a layer can be exposed to (see image below).


2. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

**Answer**

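1. Change the stride: the stride sets how far the kernel moves between applications. A stride $s>1$ downsamples the feature map, so the receptive field of every subsequent layer grows $s$ times faster. Each output is still a learned weighted sum of a contiguous patch, but neighbouring outputs overlap less, so each input feature takes part in fewer calculations.

2. Use pooling: pooling groups together a patch of a feature map into one cell, e.g. max pooling keeps only the maximum value of the patch as a single value in the next feature map. Like striding it downsamples, but it combines the features with a fixed, non-learned function instead of learned weights.

3. Use dilation: the kernel is applied to non-neighbouring inputs, with gaps between the kernel taps. This mixes inputs from locations that are further apart in the previous feature map and enlarges the receptive field without downsampling and without adding parameters, at the cost of skipping the inputs in the gaps.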


3. Imagine a CNN with three convolutional layers, defined as follows:

In [1]:
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape

torch.Size([1, 32, 122, 122])

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

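**Answer**: 112, i.e. each output "pixel" sees a $112\times 112$ region of the input (13 is only the effective kernel size of the last, dilated convolution). Track the receptive field $r$ and the jump $j$ (the distance, in input pixels, between adjacent units of the current layer), starting from $r=1$, $j=1$. A layer with kernel size $k$, stride $s$ and dilation $d$ has effective kernel size $k_\text{eff}=d(k-1)+1$ and updates $r \leftarrow r+(k_\text{eff}-1)\,j$, then $j \leftarrow j\cdot s$:

- Conv, $k=3$: $r=1+2\cdot 1=3$, $j=1$.
- MaxPool, $k=2$, $s=2$: $r=3+1\cdot 1=4$, $j=2$.
- Conv, $k=5$, $s=2$: $r=4+4\cdot 2=12$, $j=4$.
- MaxPool, $k=2$, $s=2$: $r=12+1\cdot 4=16$, $j=8$.
- Conv, $k=7$, $d=2$ ($k_\text{eff}=13$): $r=16+12\cdot 8=112$.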
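
The receptive field of the network in question 3 can be computed layer by layer. The following is a minimal sketch, assuming the standard recurrence in which the receptive field grows by (effective kernel size - 1) times the accumulated stride; the helper name `receptive_field` and the `(kernel_size, stride, dilation)` tuples are illustrative and not part of the original notebook.

```python
def receptive_field(layers):
    """Receptive field of one output unit w.r.t. the network input.

    Each layer is a (kernel_size, stride, dilation) tuple. Padding does not
    change the size of the receptive field, so it is omitted.
    """
    rf, jump = 1, 1  # jump: distance in input pixels between adjacent units
    for kernel_size, stride, dilation in layers:
        effective_kernel = dilation * (kernel_size - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf


layers = [
    (3, 1, 1),  # Conv2d(kernel_size=3)
    (2, 2, 1),  # MaxPool2d(2)
    (5, 2, 1),  # Conv2d(kernel_size=5, stride=2)
    (2, 2, 1),  # MaxPool2d(2)
    (7, 1, 2),  # Conv2d(kernel_size=7, dilation=2)
]
rf = receptive_field(layers)
print(rf)  # 112
```

The ReLU layers are left out because element-wise operations do not change the receptive field.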

4. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

  After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead: $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

  However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

**Answer**: The two networks ask the filters to learn different functions. In the original network, $f_l$ must produce the whole desired mapping, $\vec{y}_l=h_l(\vec{x})$. In the residual network the identity is supplied by the shortcut, so $f_l$ only has to produce the residual, $f_l(\vec{x};\vec{\theta}_l)=h_l(\vec{x})-\vec{x}$. Even if both networks ended up computing exactly the same overall function, the filters $\vec{\theta}_l$ would have to differ: for example, a layer that should act as the identity needs an identity filter in the original network but an all-zero filter in the residual one.

The training dynamics differ as well. During backpropagation, the gradients have two pathways back towards the input while traversing a residual block. On the first pathway they pass through the weights of $f_l$, where they are multiplied by the layer's Jacobian and can become small or vanish by the time they reach the initial layers. On the second pathway, the shortcut connection (identity mapping), they do not encounter any weight layer, so their value is unchanged. The shortcut lets the gradients reach the initial layers directly and protects the network from the vanishing gradient problem, so the optimization follows a different trajectory and converges to a different solution.


### Dropout

1. Consider the following neural network:

In [None]:
import torch.nn as nn

p1, p2 = 0.1, 0.2
nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout(p=p1),
    nn.Dropout(p=p2),
)

Sequential(
  (0): Conv2d(3, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): Dropout(p=0.1, inplace=False)
  (3): Dropout(p=0.2, inplace=False)
)

If we want to replace the two consecutive dropout layers with a single one defined as follows:
```python
nn.Dropout(p=q)
```
what would the value of `q` need to be? Write an expression for `q` in terms of `p1` and `p2`.

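**Answer**: `q = 1 - (1 - p1) * (1 - p2)`. An activation survives both layers only if it is kept by each of them, which happens with probability `(1 - p1) * (1 - p2)`, so it is dropped with probability `q`. The scaling is consistent as well: the two layers scale by `1 / (1 - p1)` and `1 / (1 - p2)`, and the product of these factors is `1 / (1 - q)`.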
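
As a quick arithmetic check of the expression for `q`, the sketch below plugs in the values of `p1` and `p2` from the question and compares the scaling applied by the two layers with that of the single replacement layer. The variable names other than `p1`, `p2` and `q` are illustrative.

```python
p1, p2 = 0.1, 0.2

# A unit survives both layers only if each layer keeps it independently.
keep_probability = (1 - p1) * (1 - p2)
q = 1 - keep_probability

# Each layer rescales the survivors by 1 / (1 - p); the two factors multiply.
scale_two_layers = (1 / (1 - p1)) * (1 / (1 - p2))
scale_single_layer = 1 / (1 - q)

print(f"q = {q:.2f}")
print(f"scales: {scale_two_layers:.4f} vs {scale_single_layer:.4f}")
```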

2. **True or false**: dropout must be placed only after the activation function.

**Answer**: False. There is no such requirement, and there's some debate as to whether dropout should be placed before or after the activation function. A common rule of thumb is to place the dropout after the non-linear activation function, e.g. CONV->RELU->DROP with a low drop probability such as p=0.1 or 0.2 after each convolutional layer. For ReLU specifically the order makes no difference: the dropout mask $m$ and the scaling factor $1/(1-p)$ are non-negative, so $\mathrm{ReLU}(m\cdot x/(1-p)) = m\cdot\mathrm{ReLU}(x)/(1-p)$.
As a reminder of what the layer does: with p=0.5, every hidden unit (neuron) is set to 0 with a probability of 0.5. In other words, there's a 50% chance that the output of a given neuron will be forced to 0.

3. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

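**Answer**: Let $a$ be the value of an activation before dropout, and let $m\sim\mathrm{Bernoulli}(1-p)$ be its dropout mask, so that $m=0$ (the activation is dropped) with probability $p$ and $m=1$ (it is kept) with probability $1-p$.
Suppose the surviving activations are scaled by a constant $c$, so that the output is $\tilde{a}=c\,m\,a$. Then
$$
\mathbb{E}\left[\tilde{a}\right] = c\,a\,\mathbb{E}\left[m\right] = c\,a\left(p\cdot 0+(1-p)\cdot 1\right) = c\,(1-p)\,a .
$$
Requiring $\mathbb{E}\left[\tilde{a}\right]=a$ for every value of $a$ gives $c=1/(1-p)$.
Intuitively, only a fraction $1-p$ of the neurons remains active on average, so without scaling the expected input to the next layer would be $1-p$ times its original value. Scaling the activations by $1/(1-p)$ brings it back to its original expected value.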
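
The expectation argument can also be checked empirically. The following is a minimal Monte Carlo sketch using only the standard library; the helper `dropout`, the activation value, the drop probability and the seed are arbitrary choices made for illustration.

```python
import random


def dropout(a, p, rng):
    """Inverted dropout on a single activation: zero it with probability p,
    otherwise scale it by 1 / (1 - p)."""
    return 0.0 if rng.random() < p else a / (1 - p)


rng = random.Random(0)  # fixed seed, chosen arbitrarily
a, p, trials = 3.0, 0.3, 200_000

mean_scaled = sum(dropout(a, p, rng) for _ in range(trials)) / trials
print(f"activation: {a}, mean after dropout with scaling: {mean_scaled:.3f}")
```

With enough trials the empirical mean should be close to the original activation, whereas without the `1 / (1 - p)` factor it would approach `(1 - p) * a`.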

### Losses and Activation functions

1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?

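Answer: No. Classification seeks to predict a value from a finite set of categories, and the model outputs a probability $\hat{y}\in(0,1)$ that the image is a hotdog. The quadratic (L2) loss measures a distance between numbers and is suited to regression: on probabilities it is bounded by 1, so it barely distinguishes a wrong prediction from a confidently wrong one. For example, for a hotdog ($y=1$), predicting $\hat{y}=0.1$ gives an L2 loss of $0.81$ and predicting $\hat{y}=0.001$ gives $0.998$, almost the same penalty, whereas the cross-entropy loss $-\log\hat{y}$ gives $2.30$ and $6.91$ respectively. Combined with a sigmoid output, the L2 gradient also vanishes when the output saturates at the wrong value, so learning is slowest exactly when the model is most wrong. Instead we would use the binary cross-entropy loss on a sigmoid output.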
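
A small numerical comparison makes the dog/hotdog case concrete. This is a sketch using only the standard library; the function names `l2_loss` and `bce_loss` and the chosen predictions are illustrative.

```python
import math


def l2_loss(y, y_hat):
    return (y - y_hat) ** 2


def bce_loss(y, y_hat):
    # Binary cross-entropy for a single prediction y_hat in (0, 1).
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


y = 1  # true label: hotdog
for y_hat in (0.4, 0.1, 0.001):
    print(f"y_hat={y_hat}: L2={l2_loss(y, y_hat):.3f}, BCE={bce_loss(y, y_hat):.3f}")
```

The L2 loss can never exceed 1 for a probability output, while the cross-entropy keeps growing as the prediction becomes more confidently wrong.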

2. After months of research into the origins of climate change, you observe the following result:

<center><img src="https://sparrowism.soc.srcf.net/home/piratesarecool4.gif" /></center>

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in `N` locations around the globe.
You define your model as follows:

In [None]:
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H), nn.Sigmoid(),
    ]*24,
    nn.Linear(in_features=H, out_features=1),
)

While training your model you notice that the loss reaches a plateau after only a few iterations.
It seems that your model is no longer training.
What is the most likely cause?

Answer: The most likely cause is vanishing gradients. The model stacks 25 sigmoid activations, and by the chain rule the gradient of the loss w.r.t. the early layers is a product of one factor per layer. The sigmoid squashes a large input space into a small output range between 0 and 1, so a large change in its input causes only a small change in its output. Hence its derivative is small: at most 0.25, and close to 0 when it saturates. As more such layers are added, the gradient of the loss w.r.t. the early layers approaches zero, so these layers stop updating and the loss reaches a plateau.
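
The effect can be observed by comparing gradient norms at the two ends of a deep sigmoid MLP. The sketch below is an assumption-laden illustration rather than the model from the question: it builds 24 distinct hidden layers (the question's list multiplication repeats the same two module objects), uses random inputs, and uses a placeholder loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

N, H, depth = 42, 128, 24
layers = [nn.Linear(N, H), nn.Sigmoid()]
for _ in range(depth):
    layers += [nn.Linear(H, H), nn.Sigmoid()]
layers.append(nn.Linear(H, 1))
model = nn.Sequential(*layers)

x = torch.rand(16, N)          # random stand-in for pirate counts
loss = model(x).pow(2).mean()  # placeholder regression loss with target 0
loss.backward()

first_grad = model[0].weight.grad.norm().item()
last_grad = model[-1].weight.grad.norm().item()
print(f"first layer grad norm: {first_grad:.3e}")
print(f"last layer grad norm:  {last_grad:.3e}")
```

The gradient norm at the first layer is expected to be many orders of magnitude smaller than at the last layer, which is why the early layers effectively stop training.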

3. Referring to question 2 above: A friend suggests that if you replace the `sigmoid` activations with `tanh`, it will solve your problem. Is he correct? Explain why or why not.

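Answer: Not really; at best it mitigates the problem. The sigmoid and tanh activation functions are similar: the derivative of the sigmoid lies in $(0, 0.25]$ and that of tanh lies in $(0, 1]$, so tanh shrinks the gradient less per layer near zero. However, both saturate: when the inputs become very small or very large, the sigmoid saturates at 0 and 1 and tanh saturates at -1 and 1, and in both cases the derivative approaches 0. The gradient is still a product of factors smaller than 1 over 25 layers, so the weight updates remain small, the new weight values are very similar to the old ones, and the vanishing gradient problem persists.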


4. Regarding the ReLU activation, state whether the following sentences are **true or false** and explain:
  1. In a model using exclusively ReLU activations, there can be no vanishing gradients.
  1. The gradient of ReLU is linear with its input when the input is positive.
  1. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.

Answer: 1. False. For a positive input ReLU returns the input and for a negative input it returns 0, so its derivative is 1 for inputs larger than zero and 0 for negative inputs. Multiplying 1 by itself several times still gives 1, so unlike the sigmoid, the activation itself does not shrink the gradient along active paths. The gradient can still vanish, though: it is exactly 0 through every unit whose input is negative, and it is also multiplied by the weight matrices, which can shrink it when the weights are small.

2. False. The output of ReLU is linear in its input when the input is positive; its gradient is constant. The range of ReLU is $[0, \infty)$, and it is continuous but not differentiable at $x = 0$. The gradient of ReLU is 1 for $x > 0$ and 0 for $x < 0$.

3. True. A ReLU unit cannot learn from examples for which its activation is zero, because the gradient through it is zero. This sometimes occurs when the entire network is initialized with zeros and ReLU is placed on the hidden layers. Another cause is a large gradient flowing through a ReLU neuron: the update may leave it with a large negative weight and bias. When this happens the neuron always produces 0 during forward propagation, the gradient flowing through it is zero irrespective of the input, and its weights are never updated again. The neuron is then as good as dead, hence the name "dead" neuron.

### Optimization

1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

Answer: In regular gradient descent we compute the gradient over all of the samples in the training set before doing a single parameter update. In stochastic gradient descent, we use only one randomly chosen training sample for each parameter update.
Mini-batch stochastic gradient descent is a mix of both: we compute the gradient over a small random subset of the training samples before doing each parameter update.
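
The three variants differ only in how many samples contribute to each update. The sketch below illustrates this on a toy 1-D least-squares problem; the function `train_one_epoch`, the data and the hyperparameters are hypothetical and chosen only for illustration.

```python
import random


def train_one_epoch(data, w, lr, batch_size, rng):
    """One epoch on the 1-D least-squares loss mean((w * x - y) ** 2).

    Returns the new weight and the number of parameter updates performed.
    """
    data = data[:]  # copy so the caller's list is not shuffled
    rng.shuffle(data)
    updates = 0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad
        updates += 1
    return w, updates


rng = random.Random(0)
data = [(x / 10, 3.0 * x / 10) for x in range(1, 101)]  # toy data, y = 3x

_, gd_updates = train_one_epoch(data, 0.0, 0.005, len(data), rng)  # regular GD
_, sgd_updates = train_one_epoch(data, 0.0, 0.005, 1, rng)         # SGD
_, mb_updates = train_one_epoch(data, 0.0, 0.005, 10, rng)         # mini-batch SGD
print(gd_updates, sgd_updates, mb_updates)  # 1 100 10
```

With 100 samples, one epoch therefore means a single update for GD, 100 updates for SGD, and 10 updates for mini-batches of size 10.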


2. Regarding SGD and GD:
  1. Provide at least two reasons for why SGD is used more often in practice compared to GD.
  2. In what cases can GD not be used at all?

Answer: A. If the training set is very large, we will likely prefer to use SGD or mini-batch SGD, since each step of regular gradient descent requires a pass over the entire training set and may take very long, whereas SGD and mini-batch SGD improve much faster because they don't need to run through all samples before every update.

If there is a good chance of converging to a local minimum, we will prefer to use SGD since it has a better chance of escaping it. That is because regular GD always computes the gradient on the same samples and follows the exact gradient of one fixed loss surface, while SGD uses different samples before each parameter update, which makes its steps noisy.

B. GD cannot be used at all when the gradient over the full training set cannot be computed in a single step: when the training set is too large to fit in memory, or when the data arrives as a stream and the complete training set is never available. Separately, GD is not needed when a closed-form solution exists and is cheap to compute, e.g. the pseudo-inverse for a least-squares problem with a small training set.


3. You have trained a deep resnet to obtain SoTA results on ImageNet.
While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average.
Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM.
You're now considering to increase the mini-batch size from $B$ to $2B$.
Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

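Answer: I would expect the number of iterations to decrease.
Each mini-batch gradient is an average over the batch, so it is a noisy estimate of the full gradient. Doubling the batch size halves the variance of that estimate, so every update follows the true descent direction more closely, and a larger learning rate can usually be used.
Each iteration also processes twice as many samples, so the same number of iterations covers twice the data. The reduction is generally smaller than a factor of two, and each iteration costs more computation. Note that small batch sizes are noisy, which gives a regularization effect, so the larger batch may generalize slightly worse even though it reaches the training loss $l_0$ in fewer iterations.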


4. For each of the following statements, state whether they're **true or false** and explain why.
  1. When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.
  1. Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.
  1. SGD is less likely to get stuck in local minima, compared to GD.
  1. Training  with SGD requires more memory than with GD.
  1. Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.
  1. Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.

Answer: A. True. With SGD each update uses a single sample, and by definition one epoch is a pass in which every sample in the training set is used once, so we perform one optimization step per sample.

B. False. Gradients obtained with SGD have more variance, since each update is computed from a different sample rather than from the whole training set, so the convergence is noisier.

C. True. SGD uses a different sample in every update instead of the same full set of samples, so the updates are noisier and it is easier to escape a local minimum.

D. False. In regular GD every update needs the entire training set in RAM, while in SGD we only need one sample for each parameter update, after which we can release it and load the next one. SGD therefore requires less memory, not more.

E. False. Neither method is guaranteed to reach the global minimum of a non-convex loss. GD runs on the same samples in each update and converges to whichever stationary point its initialization leads to, often a local minimum and not the global one, while SGD is more likely to escape a local minimum since it runs on a different sample of the training set in each update.

F. False. Momentum does help in a ravine: it remembers the direction of the previous steps, so the oscillations across the ravine cancel out while progress along it accumulates, even if the current samples have a different gradient (how much depends on the momentum hyperparameter). Newton's method, however, uses second-order information: it rescales the gradient by the inverse Hessian, which removes the difference in curvature between directions, and on a quadratic ravine it reaches the minimum in a single step. It is therefore expected to converge in fewer iterations, although each iteration is far more expensive.
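
For statement F, a quadratic ravine makes the comparison concrete. The sketch below assumes the loss $f(x, y) = \frac{1}{2}(x^2 + 100y^2)$ and uses exact (non-stochastic) gradients, so it isolates the effect of momentum versus curvature information. The function names, step size, momentum value and starting point are illustrative choices.

```python
def grad(x, y):
    # Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2): a ravine that is
    # 100 times more curved along y than along x.
    return x, 100.0 * y


def newton_steps(x, y, tol=1e-6, max_steps=10_000):
    """Count Newton iterations until the distance to the minimum is below tol."""
    steps = 0
    while (x * x + y * y) ** 0.5 > tol and steps < max_steps:
        gx, gy = grad(x, y)
        x, y = x - gx / 1.0, y - gy / 100.0  # the Hessian is diag(1, 100)
        steps += 1
    return steps


def momentum_steps(x, y, lr=0.01, mu=0.9, tol=1e-6, max_steps=10_000):
    """Count gradient-descent-with-momentum iterations until within tol."""
    vx = vy = 0.0
    steps = 0
    while (x * x + y * y) ** 0.5 > tol and steps < max_steps:
        gx, gy = grad(x, y)
        vx, vy = mu * vx - lr * gx, mu * vy - lr * gy
        x, y = x + vx, y + vy
        steps += 1
    return steps


n_newton = newton_steps(10.0, 1.0)
n_momentum = momentum_steps(10.0, 1.0)
print(n_newton, n_momentum)
```

On this quadratic, Newton's method lands on the minimum after one step, while the momentum method needs many iterations. On non-quadratic losses Newton's method typically needs more than one step, and each step requires the Hessian.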


5. **Bonus** (we didn't discuss this at class):  We can use bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
  **True or false**: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc).
  Provide a mathematical justification for your answer.

6. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$.
  Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
  1. Explain the concepts of "vanishing gradients", and "exploding gradients".
  2. How can each of these problems be caused by increased depth?
  3. Provide a numerical example demonstrating each.
  4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

Answer: A. Vanishing gradients is a phenomenon in which the gradient keeps getting smaller as we go backward through the layers during backpropagation. This makes the weight updates of the early layers very small, and the neural network stops improving.
A typical cause is the sigmoid and tanh activation functions: their derivatives are at most 0.25 and 1 respectively, and close to 0 when they saturate, so every layer shrinks the gradient.

Exploding gradients occur when the gradient keeps getting larger as we go backward through the layers. This typically happens due to large weights: they make the gradient large as well, so the new weights differ a lot from the old ones, and the optimization starts jumping around and overshoots the minimum.

B. By the chain rule, the gradient w.r.t. an early layer is a product of one factor per layer after it. In the vanishing gradients problem these factors are smaller than 1, so the deeper the network, the more times the gradient shrinks, until it becomes negligible and the loss barely improves.

Similarly, in the exploding gradients problem the factors are larger than 1, so the deeper the network, the more the gradient grows, until the updates become so large that training no longer approaches a minimum.

C. Vanishing gradients: assume for example that the activation function is the sigmoid, whose derivative is between 0 and 0.25, and that the weights have magnitude less than 1.
Then across many hidden layers we multiply by less than 0.25 many times. After 20 layers the gradient is scaled by less than $0.25^{20}\approx 9\cdot 10^{-13}$, so the model almost stops learning.

Exploding gradients: assume we use the ReLU activation function, whose derivative is 0 for negative inputs and 1 for positive ones, and weights with values higher than 10.
Then across many hidden layers we multiply by more than 10 many times. After 20 layers the gradient along an active path is scaled by more than $10^{20}$, so training stops converging towards a minimum.

D. We can look at how the loss changes over the epochs. If it plateaus and barely improves, then we most likely have the vanishing gradients problem.
If the loss fluctuates wildly, diverges or becomes NaN/inf instead of converging, then we most likely have the exploding gradients problem.
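
The numerical examples in part C amount to repeated multiplication by a per-layer factor. A minimal sketch, assuming a depth of 20 layers and treating each layer as contributing a single scalar factor (a simplification of the real Jacobian products):

```python
depth = 20

# Vanishing: each sigmoid layer contributes a derivative of at most 0.25
# (weights of magnitude below 1 only make the product smaller).
vanishing_factor = 0.25 ** depth

# Exploding: the ReLU derivative is 1 on active units, so a weight of
# magnitude 10 per layer multiplies the gradient by 10 each time.
exploding_factor = 10.0 ** depth

print(f"vanishing bound after {depth} layers: {vanishing_factor:.2e}")
print(f"exploding factor after {depth} layers: {exploding_factor:.2e}")
```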


### Backpropagation

1. You wish to train the following 2-layer MLP for a binary classification task:
  $$
  \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2
  $$
  You wish to minimize the in-sample loss function, defined as
  $$
  L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right)
  $$
  Where the pointwise loss is binary cross-entropy:
  $$
  \ell(y, \hat{y}) =  - y \log(\hat{y}) - (1-y) \log(1-\hat{y})
  $$
  
  Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.

2. The derivative of a function $f(\vec{x})$ at a point $\vec{x}_0$ is
  $$
  f'(\vec{x}_0)=\lim_{\Delta\vec{x}\to 0} \frac{f(\vec{x}_0+\Delta\vec{x})-f(\vec{x}_0)}{\Delta\vec{x}}
  $$
  
  1. Explain how this formula can be used in order to compute gradients of neural network parameters numerically, without automatic differentiation (AD).
  
  2. What are the drawbacks of this approach? List at least two drawbacks compared to AD.

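Answer: 1. 
The derivative of a function f(x) measures the sensitivity of the function value (output value) to a change in its argument x (input value). In other words, the derivative tells us the direction in which f(x) is going.
The formula lets us estimate it without AD: treat the loss as a function of all the network parameters, and for each scalar parameter $\theta_i$ add a small perturbation $\epsilon$ to that parameter only, recompute the loss with a forward pass, and take $\left(L(\vec{\theta}+\epsilon\vec{e}_i)-L(\vec{\theta})\right)/\epsilon$ as the $i$-th component of the gradient.
The resulting gradient shows how each parameter needs to change (in the positive or negative direction) to decrease the loss. Numerical differentiation is also a powerful tool for checking the correctness of an analytic or AD-based implementation.

2. It is slow to compute: every scalar parameter requires at least one extra forward pass, whereas AD builds a computation graph and obtains all the gradients with a single forward and a single backward pass. It is also only an approximation that suffers from truncation and rounding errors: a large $\epsilon$ approximates the limit poorly, while a small $\epsilon$ amplifies floating-point rounding error. In addition, it lacks flexibility, e.g. computing the gradient of a gradient compounds these errors.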

3. Given the following code snippet:
  1. Write a short snippet that calculates the gradient of `loss` w.r.t. `W` and `b` using the approach of numerical gradients from the previous question.
  2. Calculate the same derivatives with autograd.
  3. Show, by calling `torch.allclose()` that your numerical gradient is close to autograd's gradient.

In [1]:
import torch

N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W, b = torch.rand(d, d, requires_grad=True, dtype=dtype), torch.rand(d, requires_grad=True, dtype=dtype)

def foo(W, b):
    return torch.mean(X @ W + b)

loss = foo(W, b)
print(f"{loss=}")

# TODO: Calculate gradients numerically for W and b
# grad_W =...
# grad_b =...

# TODO: Compare with autograd using torch.allclose()
# autograd_W = ...
# autograd_b = ...
# assert torch.allclose(grad_W, autograd_W)
# assert torch.allclose(grad_b, autograd_b)

SyntaxError: ignored
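
One possible way to complete the TODOs is sketched below. It is an illustrative solution, not necessarily the intended one: it uses central differences instead of the one-sided formula from the previous question, the helper `numerical_grad` and the step `eps` are assumptions, and the setup from the snippet is repeated (with an arbitrary seed) so that the block runs on its own.

```python
import torch

torch.manual_seed(0)  # arbitrary seed for repeatability

N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W = torch.rand(d, d, requires_grad=True, dtype=dtype)
b = torch.rand(d, requires_grad=True, dtype=dtype)


def foo(W, b):
    return torch.mean(X @ W + b)


def numerical_grad(f, t, eps=1e-6):
    """Central-difference estimate of df/dt, one element of t at a time."""
    t0 = t.detach().clone()
    grad = torch.zeros_like(t0)
    for i in range(t0.numel()):
        delta = torch.zeros_like(t0)
        delta.view(-1)[i] = eps  # perturb only the i-th element
        grad.view(-1)[i] = (f(t0 + delta) - f(t0 - delta)) / (2 * eps)
    return grad


# Numerical gradients: perturb one tensor while holding the other fixed.
grad_W = numerical_grad(lambda W_: foo(W_, b.detach()), W)
grad_b = numerical_grad(lambda b_: foo(W.detach(), b_), b)

# Autograd gradients.
loss = foo(W, b)
loss.backward()
autograd_W, autograd_b = W.grad, b.grad

assert torch.allclose(grad_W, autograd_W)
assert torch.allclose(grad_b, autograd_b)
```

Note that the numerical version needs two forward passes per element (60 in total here), while autograd needs one forward and one backward pass.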

### Sequence models

1. Regarding word embeddings:
  1. Explain this term and why it's used in the context of a language model.
  1. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

Answer: A. Word embeddings are representations of text in which words that have similar meanings have similar representations.
Each word is usually encoded as a dense vector, such that the distance between two vectors is small if the words they represent are similar, and large otherwise.
A great advantage of word embeddings over other ways of encoding text, such as one-hot vectors, is that for a large vocabulary we do not need sparse, vocabulary-sized vectors that consume more memory: a word no longer corresponds to a single entry of the vector, so we can use a lower-dimensional and more expressive representation of the text.
Language models can take advantage of these embeddings by modifying them to allow context-dependent embeddings, where the same word may have a different embedding in different sentences.

B. Yes. Since the vocabulary in the sentiment analysis model is so large, training it directly on sequences of tokens requires sparse (e.g. one-hot) vectors of high dimensionality, which may cause significant performance issues. In addition, such a representation carries no notion of similarity, whereas word embeddings allow words with similar meanings to have similar representations, which increases the accuracy of the model. So a model like the sentiment analysis example can be trained without an embedding, but with lower performance and accuracy.

2. Considering the following snippet, explain:
  1. What does `Y` contain? why this output shape?
  2. **Bonus**: How you would implement `nn.Embedding` yourself using only torch tensors. 

Answer: `X` is a tensor of shape (5, 6, 7, 8) filled with integers between 0 and 41 (the `high` bound of `torch.randint` is exclusive).
`embedding` is an embedding module containing 42 vectors of size 42000 (it embeds 42 tokens into vectors of dimension 42000).
The number of embeddings is 42 since there are 42 different possible integers in `X`, and we want to embed each of them into a different vector.

So `Y` is a tensor that holds, for each integer in `X`, the 42000-dimensional vector of that integer (two integers with the same value are mapped to the same vector).
Its shape is (5, 6, 7, 8, 42000) since each integer in the original tensor `X`, which is of shape (5, 6, 7, 8), is replaced by a 42000-dimensional vector.

In [None]:
import torch
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")

Y.shape=torch.Size([5, 6, 7, 8, 42000])
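
For the bonus part, an embedding layer can be viewed as a lookup table indexed by token id. The sketch below shows this with plain tensor indexing and compares the result with `nn.Embedding`. It assumes a much smaller `embedding_dim` than the snippet above, purely to keep the example light, and the names `Y_manual` and `Y_module` are illustrative.

```python
import torch
import torch.nn as nn

num_embeddings, embedding_dim = 42, 16  # smaller embedding_dim than the snippet

embedding = nn.Embedding(num_embeddings, embedding_dim)
X = torch.randint(low=0, high=num_embeddings, size=(5, 6, 7, 8))

# The embedding is a (num_embeddings, embedding_dim) matrix: row i is the
# vector of token i. Indexing with an integer tensor picks one row per entry.
weight = embedding.weight.detach()
Y_manual = weight[X]
Y_module = embedding(X)

print(Y_manual.shape)
print(torch.equal(Y_manual, Y_module.detach()))
```

In a trainable implementation the weight matrix would be a tensor with `requires_grad=True`; indexing is differentiable with respect to the selected rows.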


3. Regarding truncated backpropagation through time (TBPTT) with a sequence length of S: State whether the following sentences are **true or false**, and explain.
  1. TBPTT uses a modified version of the backpropagation algorithm.
  2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length S.
  3. TBPTT allows the model to learn relations between input that are at most S timesteps apart.

Answer: A. True: Regular backpropagation through time presents the whole sequence of input and output pairs to the network, calculates the error at each timestep and updates the weights. Truncated backpropagation is a modified version: it still processes the sequence one timestep at a time, but every S timesteps it runs backpropagation over the last S timesteps only and updates the weights. This makes training on long sequences more efficient, since we can treat them as a number of shorter sequences instead of having to go through the entire sequence for a single parameter update.

B. False: To implement TBPTT we don't shorten the sequence given to the model to length S. We limit to S both the interval between runs of BPTT and the number of timesteps each run backpropagates through. The full sequence is still processed, with the hidden state carried over from one chunk to the next.

C. True: The model only calculates errors every S timesteps, over the previous S timesteps, so the gradients do not take into account the timesteps that came before.
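
The mechanics of TBPTT can be sketched in a few lines. The example below is a hedged illustration, not the tutorial's implementation: the sizes, the `nn.RNN` model, the linear `head`, the random data and the optimizer settings are all hypothetical. It processes a sequence of length `T` in chunks of `S` steps, carries the hidden state between chunks, and detaches it so that gradients stop at the chunk boundary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed for repeatability

T, S = 20, 5                       # full sequence length, truncation length
batch, features, hidden = 2, 3, 4  # hypothetical sizes

rnn = nn.RNN(features, hidden, batch_first=True)
head = nn.Linear(hidden, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

x = torch.randn(batch, T, features)  # random stand-in data
y = torch.randn(batch, T, 1)

h = None  # initial hidden state (zeros)
num_updates = 0
for start in range(0, T, S):
    x_chunk = x[:, start:start + S]
    y_chunk = y[:, start:start + S]

    out, h = rnn(x_chunk, h)  # the forward pass continues from the carried state
    loss = nn.functional.mse_loss(head(out), y_chunk)

    optimizer.zero_grad()
    loss.backward()           # backpropagates through at most S timesteps
    optimizer.step()

    h = h.detach()            # keep the state values, cut the graph
    num_updates += 1

print(num_updates)  # 4
```

The hidden state values still flow forward across chunks, while the gradient computation is limited to the current chunk.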


### Attention

1. In tutorial 5 we learned how to use attention to perform alignment between a source and target sequence in machine translation.
  1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
  
  2. After learning that self-attention is gaining popularity thanks to the transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections, like in the tutorial). What influence do you expect this will have on the learned hidden states?


Answer: A. The attention mechanism creates a unique mapping between each time step of the decoder output and all of the encoder's hidden states.
This way we are able to use all of the hidden states of the input sequence during the decoding process, which allows the decoder to focus more on the hidden states that are important for the current output and less on unimportant ones.

The model without attention only passes the last hidden state of the encoder to the decoder, so the information of the complete input sequence has to be stored in a single vector.
With attention, each encoder hidden state only needs to represent its own position and local context, so the model can accurately process long input sequences by keeping all of the hidden states and not just the last one.

B. When using self-attention, each hidden state captures the dependencies between its word and every other word in the same sequence, so the relationships between words within the sequence are captured.
Since the queries no longer depend on the decoder's hidden states, the attention output can be computed once for the whole source sequence, so we expect better performance in terms of computation.
Since we keep the dependencies between each pair of words in the sequence, we will likely do better at understanding the dependencies between different parts of the sequence, e.g. the syntactic function of words in the sentence.
However, since the attention is no longer conditioned on the decoder state, the decoder can no longer choose which source positions to focus on at each output step, so the alignment between source and target may be worse, especially for long sequences with dependencies between words that are very far from each other.


### Unsupervised learning

1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and  a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term.
What would be the qualitative effect of this on:

  1. Images reconstructed by the model during training ($x\to z \to x'$)?
  1. Images generated by the model ($z \to x'$)?

**Answer**
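Without the KL-divergence term there is no regularization of the latent space: the encoder's distributions are not pulled towards the prior, so the model can shrink their variance and behave like a normal (deterministic) autoencoder.

A. The reconstructions will not be harmed. They will be as good as, or better than, with the KL term, since the model is optimized for reconstruction only.

B. The latent space will have a less consistent distribution, so latent vectors sampled from the prior will often fall in regions the decoder never saw during training, and the model will have a much harder time generating realistic images.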

2. Regarding VAEs, state whether each of the following statements is **true or false**, and explain:
  1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
  2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
  3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.

Answer: 
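A. False. It is true that the KL term tries to keep the latent distribution close to $\mathcal{N}(\vec{0},\vec{I})$, but the distribution generated for a specific input is $\mathcal{N}(\vec{\mu}(\vec{x}),\vec{\sigma}^2(\vec{x}))$, where the mean and variance are computed by the encoder from that input, using weights learned during training.

B. False. When sampling in the VAE we use the reparametrization trick, where we sample noise from an isotropic Gaussian and then shift and scale it by the encoder's mean and standard deviation. This process leads to different z values each time, and these z values give different reconstructions when passed to the decoder.

C. True. The exact log-likelihood is intractable, so we maximize the ELBO, a lower bound on it; equivalently, we minimize the negative ELBO, which is an upper bound on the real loss.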

2. Regarding GANs, state whether each of the following statements is **true or false**, and explain:
  1. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
  2. It's crucial to backpropagate into the generator when training the discriminator.
  3. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
  4. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.
  5. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.

**Answer**: 
A) True. When training is complete we want the generator to fool the discriminator. This means that the generator's loss is low and the discriminator's loss is high, because it can't distinguish between real images and fake ones. Note that this is only meaningful if the discriminator is itself well trained; a high loss from a weak discriminator says nothing about image quality.

B) False. When training the discriminator we do not train the generator. The generated images are treated as fixed inputs (e.g. by detaching them from the computation graph), so no gradients need to flow back into the generator.

C) True. When generating an image we sample random noise from the standard normal distribution and feed it to the generator.

D) True. A discriminator that has been trained a little gives the generator a more informative loss, which helps backpropagation. It is important to note that we should not train it too much, because then it can always detect fake images and the generator will not learn.

E) False. When this state is reached there is no reason to continue training the generator, because the discriminator is just making random guesses and its feedback carries no information. It would be equivalent to training the model on a coin flip.
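
To put a number on the "50% accuracy" state, here is a small sketch assuming the standard binary cross-entropy GAN loss for the discriminator, $-\log D(x) - \log(1 - D(G(z)))$. The function name is illustrative.

```python
import math

def discriminator_bce(d_real, d_fake):
    """Discriminator loss for one real and one fake sample."""
    return -math.log(d_real) - math.log(1.0 - d_fake)

# Discriminator that outputs 0.5 for everything (cannot tell real from fake).
loss_at_equilibrium = discriminator_bce(0.5, 0.5)
# Discriminator that separates real from fake almost perfectly.
loss_confident = discriminator_bce(0.99, 0.01)
print(loss_at_equilibrium, loss_confident)
```

Under this assumption the loss of a discriminator that outputs 0.5 everywhere is $2\ln 2$, which is high compared to a confident discriminator but not unbounded.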

### Detection and Segmentation 

1. What is the difference between IoU and Dice score? What's the difference between IoU and mAP?
    Briefly explain when you would use each evaluation metric.

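Answer: For a prediction $A$ and a ground truth $B$, $\text{IoU} = \frac{|A \cap B|}{|A \cup B|}$ and $\text{Dice} = \frac{2|A \cap B|}{|A| + |B|}$. For a single instance the two are monotonically related ($\text{Dice} = \frac{2\,\text{IoU}}{1 + \text{IoU}}$), so they always agree on which of two predictions is better. The main difference between them appears when taking the average score over a set of inferences.
The difference emerges when trying to decide how much worse classifier B is than classifier A in any given case.
The IoU score tends to penalize single instances of bad classification more than the Dice score, even when both agree that the instance is bad.
So the Dice score tends to measure something closer to average performance, while the IoU score measures something closer to worst-case performance.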
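
For reference, a minimal sketch of both metrics on binary segmentation masks, using the standard definitions (intersection over union, and twice the intersection over the sum of the two areas). The function name and the convention of returning 1.0 for two empty masks are assumptions for illustration.

```python
import numpy as np

def iou_and_dice(pred, target):
    """IoU and Dice score for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    iou = inter / union if union > 0 else 1.0      # assumed convention for empty masks
    dice = 2 * inter / total if total > 0 else 1.0
    return float(iou), float(dice)

pred = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0]])
target = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0]])
iou, dice = iou_and_dice(pred, target)
print(iou, dice)  # IoU = 2/6, Dice = 4/8
```

On the same prediction the Dice score (0.5) is higher than the IoU (about 0.33), which matches the observation that IoU is the harsher of the two on imperfect instances.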

For this reason, we will prefer to use IoU when we have images with a small number of objects and we want to make sure that we handle all of them, since a single bad instance lowers the score noticeably. We will prefer to use Dice in images with many objects, where we care less about missing a single object as long as the general accuracy is high.

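mAP is a bit different: it does not directly measure how tightly each object was localized. Instead, it uses IoU as a criterion for deciding how many of the detections are correct and how many are incorrect. We define an IoU threshold (e.g. 0.5); a detection whose IoU with a ground-truth box is higher than the threshold counts as correct, and otherwise it counts as incorrect. From these decisions we compute the average precision (the area under the precision-recall curve) for each class, and mAP is the mean over the classes.

So it makes sense to use mAP for object detection, when we want to know how many objects of each class were in a picture and roughly where, but not necessarily their exact location.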
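
The thresholding idea can be sketched as follows for a single class in a single image. This is a simplified, non-interpolated average precision with greedy matching, not the exact protocol of any particular benchmark; all names are illustrative, and boxes are assumed to be `(x1, y1, x2, y2)` tuples. mAP would be the mean of this value over all classes.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def average_precision(detections, gt_boxes, iou_thr=0.5):
    """detections: list of (score, box). Each GT box can be matched once."""
    if not gt_boxes:
        return 0.0
    matched, tp_flags = set(), []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Find the best still-unmatched ground-truth box for this detection.
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gt_boxes):
            if j not in matched and box_iou(box, gt) > best_iou:
                best_iou, best_j = box_iou(box, gt), j
        is_tp = best_j is not None and best_iou >= iou_thr
        if is_tp:
            matched.add(best_j)
        tp_flags.append(is_tp)
    # Sum precision at every rank where recall increases, then normalize.
    ap, tp = 0.0, 0
    for rank, is_tp in enumerate(tp_flags, start=1):
        if is_tp:
            tp += 1
            ap += tp / rank
    return ap / len(gt_boxes)

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (0, 0, 10, 10)),     # exact hit
        (0.8, (50, 50, 60, 60)),   # false positive
        (0.7, (20, 20, 30, 31))]   # slightly off, still above the threshold
ap = average_precision(dets, gt)
print(ap)
```

Note that the third detection counts as fully correct even though its box is not exact, which is the sense in which mAP does not reward precise localization beyond the threshold.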


2. Regarding YOLO and Mask R-CNN, which one is a one-stage detector? Describe the RPN outputs and the YOLO output; address how the network produces the output and the shape of each output.

Answer: YOLO is a one-stage detector, since it requires only a single forward pass through a neural network to detect objects.
Mask R-CNN is a two-stage detector: the first stage (the RPN) extracts candidate object regions, and the second stage classifies them, refines their boxes and predicts a mask for each.

RPN (region proposal network): the model outputs a set of proposals, each with an objectness score (the probability that it contains an object rather than background) and box coordinates. The RPN does not predict the label of the object; that is done by the second stage.
To generate these proposals, a small network is slid over the convolutional feature map that is the output of the last convolutional layer of the backbone.
The RPN has a classifier and a regressor: the classifier determines the probability that a proposal contains an object, and the regressor determines the coordinates of the proposal as offsets relative to an anchor.
We choose a number of anchors K (usually 9, from 3 scales and 3 aspect ratios), so there are K possible proposals at each location of the feature map.
During training, the anchors are assigned labels based on their IoU with the ground-truth boxes. For a feature map of width W and height H, the output consists of two tensors: objectness scores of shape [W, H, 2K] (or [W, H, K] when a sigmoid is used) and box offsets of shape [W, H, 4K].
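
As a rough sketch of the RPN head's output sizes, assuming the formulation of the original Faster R-CNN paper (2 objectness scores and 4 box offsets per anchor, computed at every location of the feature map). The function name, the height-first ordering, and the example feature-map size are illustrative assumptions; the batch dimension is omitted.

```python
def rpn_output_shapes(feat_h, feat_w, num_anchors=9):
    """Shapes of the two RPN outputs for a feat_h x feat_w feature map."""
    scores = (feat_h, feat_w, 2 * num_anchors)   # object / not-object per anchor
    offsets = (feat_h, feat_w, 4 * num_anchors)  # 4 box offsets per anchor
    total_anchors = feat_h * feat_w * num_anchors
    return scores, offsets, total_anchors

scores, offsets, total = rpn_output_shapes(40, 60)
print(scores, offsets, total)  # (40, 60, 18) (40, 60, 36) 21600
```

Only a small subset of these anchors survives score filtering and non-maximum suppression to become the proposals passed to the second stage.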

YOLO: first, the image is divided into a grid of S x S cells, and every grid cell is responsible for detecting objects whose center falls inside it.
A single forward pass of the network then uses bounding box regression to predict, for each cell, the locations of B boxes and the class probabilities.
During training, the IoU between a predicted box and the real box of the object serves as the target for the box confidence score. At inference, we eliminate unnecessary overlapping boxes with non-maximum suppression (which also relies on IoU), so that the final detection consists only of the best ones.
So, the final output of YOLO is a set of bounding boxes with class labels.
If the grid has S x S cells, each cell predicts B bounding boxes, each bounding box has 5 attributes (x, y, w, h and a box confidence score), and there are K possible labels,
then the shape of the output will be equal to (S, S, B * 5 + K).
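
The shape formula can be checked with a tiny sketch. The example values (S = 7, B = 2, K = 20) are the configuration of the original YOLO paper on PASCAL VOC; the function name is illustrative.

```python
def yolo_output_shape(grid_size, boxes_per_cell, num_classes):
    """(S, S, B * 5 + K): 5 values per box plus K class probabilities per cell."""
    return (grid_size, grid_size, boxes_per_cell * 5 + num_classes)

shape = yolo_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20)
print(shape)  # (7, 7, 30)
```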
