$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
$$

# Part 3: Summary Questions
<a id=part2></a>

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

**Notes**

- Clearly mark where your answer begins, e.g. write "**Answer:**" in the beginning of your cell.
- Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
- This notebook should be runnable from start to end without any errors.

### CNNs

1. Explain the meaning of the term "receptive field" in the context of CNNs.

**Answer:**
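The receptive field of a CNN feature is the region of the input (e.g. the image pixels) that can affect the value of that feature. 
For a single convolutional layer it equals the kernel size; for deeper features it grows with the kernel sizes, strides and dilations of all the preceding layers.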

2. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

**Answer**: 

- Change the network depth: as explained above, the kernel size of a layer determines how much the RF grows in the next layer. Each additional layer passes the image through another kernel, so the RF of the last layer grows roughly linearly with depth (and shrinks if layers are removed). Every input pixel inside the RF contributes to the output.

- Add dilation: dilating a kernel spreads its taps apart, so it covers a larger region with the same number of parameters. For example, a 3x3 kernel with a dilation of 2 has the same RF as a regular 5x5 kernel, but it only samples a sparse grid of the inputs inside that region.

- Add stride (or pooling): a stride larger than 1 downsamples the feature map, so each output position of the following layers corresponds to a larger step in the input. The RF growth contributed by every later layer is multiplied by the stride, which makes the RF grow much faster than with a stride of 1, at the cost of lower spatial resolution.

3. Imagine a CNN with three convolutional layers, defined as follows:

In [1]:
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape

torch.Size([1, 32, 122, 122])

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

**Answer**:

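We use the following closed-form formula for the receptive field, where $k_l$ is the (effective) kernel size of layer $l$ and $s_i$ is the stride of layer $i$:

$r_o = \sum_{l=1}^{L}\left(\left({k_l-1}\right)\prod_{i=1}^{l-1}s_i\right)+1$

The layers that affect the RF are conv ($k=3, s=1$), max-pool ($k=2, s=2$), conv ($k=5, s=2$), max-pool ($k=2, s=2$) and the dilated conv (effective $k=13, s=1$), so:

$r_o = (3-1)\cdot 1 + (2-1)\cdot 1 + (5-1)\cdot 2 + (2-1)\cdot 4 + (13-1)\cdot 8 + 1 = 112$

- The number of channels and the padding don't affect the RF.
- `nn.MaxPool2d(2)` uses a stride equal to its kernel size by default, so each pooling layer has a stride of 2.
- The dilation of the last layer gives it an effective kernel size of $2\cdot(7-1)+1 = 13$.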
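
The receptive-field formula can be checked with a short script. This is a minimal sketch: the `(kernel, stride, dilation)` tuples are transcribed by hand from the `nn.Sequential` definition in the question, assuming PyTorch's default that `nn.MaxPool2d(2)` uses a stride equal to its kernel size, and `receptive_field` is a hypothetical helper name.

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel, stride, dilation) layers."""
    rf, jump = 1, 1  # jump = product of the strides of all previous layers
    for kernel, stride, dilation in layers:
        effective_kernel = dilation * (kernel - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf

layers = [
    (3, 1, 1),  # Conv2d(kernel_size=3)
    (2, 2, 1),  # MaxPool2d(2), default stride == kernel size
    (5, 2, 1),  # Conv2d(kernel_size=5, stride=2)
    (2, 2, 1),  # MaxPool2d(2)
    (7, 1, 2),  # Conv2d(kernel_size=7, dilation=2)
]
print(receptive_field(layers))  # 112
```

ReLU layers are omitted from the list because they act element-wise and do not change the receptive field.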

4. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

  After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead: $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

  However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

**Answer** Residual networks were designed to address the problem of vanishing gradients in deep convolutional networks. As the depth of a CNN increases, the gradients that reach the early layers get smaller and smaller, up to the point where the network can barely learn at all. The skip connection propagates the input directly to the output of the conv layer, so during backpropagation the gradients can also flow through this shortcut. It also changes what each layer has to learn: in the residual network $f_l$ only needs to model the residual $\vec{y}_l-\vec{x}$ (the difference between the desired output and the input) rather than the full mapping, so the learned filters $\vec{\theta}_l$ solve a different problem and end up completely different.

### Dropout

1. Consider the following neural network:

In [2]:
import torch.nn as nn

p1, p2 = 0.1, 0.2
nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout(p=p1),
    nn.Dropout(p=p2),
)

Sequential(
  (0): Conv2d(3, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): Dropout(p=0.1, inplace=False)
  (3): Dropout(p=0.2, inplace=False)
)

If we want to replace the two consecutive dropout layers with a single one defined as follows:
```python
nn.Dropout(p=q)
```
what would the value of `q` need to be? Write an expression for `q` in terms of `p1` and `p2`.

**Answer:**
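The two layers drop each activation independently, so an activation survives both of them with probability $(1-p_1)(1-p_2)$. The combined dropout probability is therefore
$q = 1-(1-p_1)(1-p_2)$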
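
A quick Monte Carlo check of this expression, as a sketch in plain Python rather than with `nn.Dropout`: each activation passes through two simulated dropout layers, and the fraction dropped by either of them is compared with `q`. The sample count, the seed and the helper name `survives` are arbitrary choices made here.

```python
import random

random.seed(0)
p1, p2 = 0.1, 0.2
q = 1 - (1 - p1) * (1 - p2)  # claimed equivalent drop probability

def survives(p):
    """Return True if a simulated dropout layer with drop probability p keeps the unit."""
    return random.random() >= p

n = 200_000
dropped = sum(not (survives(p1) and survives(p2)) for _ in range(n))
print(f"q = {q:.2f}, empirical drop rate = {dropped / n:.3f}")
```

The same reasoning covers the scaling: the two layers scale the surviving activations by $1/(1-p_1)$ and $1/(1-p_2)$, whose product is $1/(1-q)$.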

2. **True or false**: dropout must be placed only after the activation function.

**Answer:**
False: dropout can be placed either before or after the activation function. Its purpose is to randomly zero a portion of the units in the layer, and it has no learned parameters. For ReLU the two orders give exactly the same result, since ReLU maps zero to zero and commutes with the positive scaling factor $1/(1-p)$. For activations with a non-zero output at zero (e.g. sigmoid) the two orders are not identical, but both are valid placements.

3. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

**Answer**: 

Let $x$ be the value of an activation before dropout and $\tilde{x}$ its value after dropout with drop probability $p$ (before scaling). The activation is zeroed with probability $p$ and kept with probability $1-p$, so

$\mathbb{E}[\tilde{x}] = p\cdot 0 + (1-p)\cdot x = (1-p)x$

After scaling the activations by $1/(1-p)$ the expectation becomes

$\mathbb{E}\left[\frac{\tilde{x}}{1-p}\right] = \frac{(1-p)x}{1-p} = x$

which is the same as the value of the activation in a regular network that doesn't use dropout.


### Losses and Activation functions

1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?

**Answer**:

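No. MSE is not the right choice for classification: it treats the label as a real-valued regression target rather than a class probability, and it penalizes confident mistakes only mildly. For example, for a hotdog ($y=1$) predicted with $\hat{y}=0.01$ the L2 loss is $(1-0.01)^2\approx 0.98$ and it can never exceed 1, while the binary cross-entropy is $-\log(0.01)\approx 4.6$ and keeps growing as the prediction gets more confidently wrong.

To get better and faster training we would use the binary cross-entropy loss for this problem.

** It is possible to use MSE as a loss in a classification problem, but it is not the best choice.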
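
To make the comparison concrete, here is a small sketch that evaluates both losses for a hotdog image ($y=1$) as the predicted probability gets more and more wrong. The prediction values are made up for illustration, and `l2_loss` and `bce_loss` are helper names chosen here.

```python
import math

def l2_loss(y, y_hat):
    return (y - y_hat) ** 2

def bce_loss(y, y_hat):
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

y = 1  # ground truth: hotdog
for y_hat in (0.5, 0.1, 0.01, 0.0001):
    print(f"y_hat={y_hat:g}  L2={l2_loss(y, y_hat):.3f}  BCE={bce_loss(y, y_hat):.3f}")
```

The L2 loss is bounded by 1 for predictions in $[0, 1]$, while the cross-entropy grows without bound as $\hat{y}\to 0$, so confidently wrong predictions are penalized much more strongly.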

2. After months of research into the origins of climate change, you observe the following result:

<center><img src="https://sparrowism.soc.srcf.net/home/piratesarecool4.gif" /></center>

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in `N` locations around the globe.
You define your model as follows:

In [3]:
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H), nn.Sigmoid(),
    ]*24,
    nn.Linear(in_features=H, out_features=1),
)

While training your model you notice that the loss reaches a plateau after only a few iterations.
It seems that your model is no longer training.
What is the most likely cause?

**Answer**: 
The most likely cause is vanishing gradients, which come from the sigmoid activations combined with the depth of the network.
The sigmoid squashes its input into the range (0, 1) and saturates for inputs of large magnitude, where its derivative is close to 0; even at its peak the derivative is only 0.25.
During backprop the gradient is multiplied by such a factor at each of the 25 sigmoid layers, so by the time it reaches the first layers it is practically zero.
The weights of those layers therefore barely change, training gets "stuck" and the loss reaches a plateau.
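
The effect of the sigmoid activations and the depth on the gradients can be measured directly. The sketch below rebuilds the model from the question and runs a single backward pass, then compares the gradient norms of the first and last linear layers. The random batch, the random targets and the MSE loss are made up for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, H = 42, 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[nn.Linear(in_features=H, out_features=H), nn.Sigmoid()] * 24,
    nn.Linear(in_features=H, out_features=1),
)

x = torch.rand(16, N)  # hypothetical batch of pirate populations
y = torch.rand(16, 1)  # hypothetical temperatures
loss = nn.functional.mse_loss(mlpirate(x), y)
loss.backward()

first = mlpirate[0].weight.grad.norm().item()   # gradient norm at the first layer
last = mlpirate[-1].weight.grad.norm().item()   # gradient norm at the last layer
print(f"first layer: {first:.3e}, last layer: {last:.3e}")
```

With the default initialization, the gradient that reaches the first layer should be many orders of magnitude smaller than the one at the last layer.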

3. Referring to question 2 above: A friend suggests that if you replace the `sigmoid` activations with `tanh`, it will solve your problem. Is he correct? Explain why or why not.

**Answer**:
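Only partially. Tanh outputs values between -1 and 1, so it is zero-centered, and its derivative peaks at 1 (compared to 0.25 for the sigmoid), so the gradient shrinks less at each layer and training may improve.
However, tanh still saturates for inputs of large magnitude, where its derivative approaches 0, so in a network this deep the gradients can still vanish. A non-saturating activation such as ReLU would be a more reliable fix.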

4. Regarding the ReLU activation, state whether the following sentences are **true or false** and explain:
  1. In a model using exclusively ReLU activations, there can be no vanishing gradients.
  1. The gradient of ReLU is linear with its input when the input is positive.
  1. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.

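**Answer**:

- Statement 1: False. ReLU does not saturate for positive inputs, which helps a lot, but its gradient is exactly zero for negative inputs and the backpropagated gradient is still multiplied by the weights of every layer, so gradients can still vanish.
- Statement 2: False. The output of ReLU is linear in its input when the input is positive; its gradient there is constant (equal to 1) and does not depend on the input.
- Statement 3: True. For a negative input ReLU outputs 0 and its gradient is 0. If a neuron's pre-activation is negative for all the inputs, it outputs a constant zero and its weights receive no gradient, so it stays "dead".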

### Optimization

1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

**Answer**:

SGD- During training the algorithm updates the weights after every single training example (samples are chosen randomly).

Mini-batch SGD- During training the algorithm updates the weights after every mini-batch, i.e. a small, randomly chosen subset of the training examples.

GD- During training the algorithm updates the weights only after going through all the training examples, using the gradient of the full-dataset loss.
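
The three variants differ only in the batch size used for each update. The following sketch runs one epoch of each on a made-up 1-D regression problem and counts the parameter updates; the toy data, the learning rate and the helper `run_epoch` are assumptions chosen for illustration.

```python
import random

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(64)]
data = [(x, 3.0 * x) for x in xs]  # toy 1-D regression, true weight is 3.0

def run_epoch(w, data, batch_size, lr=0.1):
    """One epoch of gradient descent on the mean squared error with the given batch size."""
    random.shuffle(data)
    updates = 0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad
        updates += 1
    return w, updates

for name, batch_size in [("SGD", 1), ("mini-batch SGD", 8), ("GD", len(data))]:
    w, updates = run_epoch(0.0, list(data), batch_size)
    print(f"{name}: {updates} updates per epoch, w after one epoch = {w:.3f}")
```

With 64 samples, SGD makes 64 updates per epoch, mini-batch SGD with a batch size of 8 makes 8, and GD makes a single update.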

2. Regarding SGD and GD:
  1. Provide at least two reasons for why SGD is used more often in practice compared to GD.
  2. In what cases can GD not be used at all?

**Answer**: 

1.
- GD computes the gradient over the entire dataset for every single update, which makes each step very slow; SGD makes many cheap updates in the same amount of time.
- The noise in SGD's gradient estimates acts as a form of regularization and helps the optimizer escape poor local minima and saddle points, so the chances to overfit are lower.
2. GD can't be used when the dataset is too large to fit in memory, or when the data arrives as a stream and the full dataset is never available at once.

3. You have trained a deep resnet to obtain SoTA results on ImageNet.
While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average.
Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM.
You're now considering increasing the mini-batch size from $B$ to $2B$.
Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

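**Answer**: We expect the number of iterations to decrease. Each iteration now averages the gradient over twice as many examples, so the gradient estimate is less noisy and each step makes more reliable progress (and typically allows a larger learning rate). In the best case the model converges to $l_0$ after about $n/2$ iterations; in practice the reduction is usually smaller, since the benefit of larger batches diminishes.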

4. For each of the following statements, state whether they're **true or false** and explain why.
  1. When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.
  1. Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.
  1. SGD is less likely to get stuck in local minima, compared to GD.
  1. Training  with SGD requires more memory than with GD.
  1. Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.
  1. Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.

**Answer**:

- Statement 1: True. As stated above, plain SGD updates the parameters after every training sample, so in every epoch there is one optimization step per sample.
- Statement 2: False. SGD gradients have a larger variance than GD gradients. SGD computes the gradient of a single randomly chosen training example each time, so it sees very different gradient values, while GD computes the mean gradient over all the training examples in each iteration.
- Statement 3: True. The noise in the per-sample updates gives the algorithm more chances to escape a local minimum, compared to GD which makes a single deterministic update per epoch.
- Statement 4: False. GD requires more memory, since each iteration has to process all the training examples, while SGD only needs one example at a time.
- Statement 5: False. On a non-convex loss surface neither method is guaranteed to reach the global minimum. GD is deterministic, so it converges to whichever local minimum (or saddle point) is near its starting point, and avoiding it would require restarting from different initial points. The randomness of SGD makes it more likely to escape poor local minima, but it does not guarantee a global minimum either.
- Statement 6: False. Momentum does help SGD in a ravine, since it accumulates the previous gradient directions and damps the oscillations across the ravine's steep walls. However, Newton's method uses the curvature (the Hessian) to rescale the step in each direction, so it handles a narrow ravine directly and needs far fewer iterations, although each iteration is more expensive.

5. **Bonus** (we didn't discuss this at class):  We can use bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
  **True or false**: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc).
  Provide a mathematical justification for your answer.

6. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$.
  Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
  1. Explain the concepts of "vanishing gradients", and "exploding gradients".
  2. How can each of these problems be caused by increased depth?
  3. Provide a numerical example demonstrating each.
  4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

**Answer**:
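1. "Vanishing gradients": the gradients become smaller and smaller as they are backpropagated towards the first layers, until those layers effectively stop learning. This happens especially when the network is deep.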
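
A rough numerical illustration of both effects, assuming for simplicity that every layer scales the backpropagated gradient by the same constant factor (the factors 0.9 and 1.1 and the depth of 100 are arbitrary choices):

```python
depth = 100  # number of layers the gradient is backpropagated through

for gain in (0.9, 1.0, 1.1):
    # By the chain rule, the gradient at the first layer is a product of per-layer factors.
    grad_scale = gain ** depth
    print(f"per-layer factor {gain}: gradient scaled by {grad_scale:.3e}")
```

A per-layer factor slightly below 1 shrinks the gradient towards zero (vanishing), while a factor slightly above 1 blows it up (exploding), and both effects get exponentially stronger with depth.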

### Backpropagation

1. You wish to train the following 2-layer MLP for a binary classification task:
  $$
  \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2
  $$
  You wish to minimize the in-sample loss function, defined as
  $$
  L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right)
  $$
  Where the pointwise loss is binary cross-entropy:
  $$
  \ell(y, \hat{y}) =  - y \log(\hat{y}) - (1-y) \log(1-\hat{y})
  $$
  
  Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.

2. The derivative of a function $f(\vec{x})$ at a point $\vec{x}_0$ is
  $$
  f'(\vec{x}_0)=\lim_{\Delta\vec{x}\to 0} \frac{f(\vec{x}_0+\Delta\vec{x})-f(\vec{x}_0)}{\Delta\vec{x}}
  $$
  
  1. Explain how this formula can be used in order to compute gradients of neural network parameters numerically, without automatic differentiation (AD).
  
  2. What are the drawbacks of this approach? List at least two drawbacks compared to AD.

**Answer**: 

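1. The formula can be used to approximate each partial derivative with a finite difference instead of computing it analytically with backpropagation: perturb a single parameter by a small $\Delta$, recompute the loss with a forward pass, and divide the change in the loss by $\Delta$. Repeating this for every parameter gives the full gradient.

2. This approach is only an approximation: a large $\Delta$ gives a truncation error, while a very small $\Delta$ suffers from floating-point round-off, whereas AD applies the chain rule and gives the exact derivatives up to machine precision.
It is also much slower to compute, since it needs at least one additional forward pass per parameter, while AD obtains all the gradients with a single backward pass.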

3. Given the following code snippet:
  1. Write a short snippet that calculates the gradient of `loss` w.r.t. `W` and `b` using the approach of numerical gradients from the previous question.
  2. Calculate the same derivatives with autograd.
  3. Show, by calling `torch.allclose()` that your numerical gradient is close to autograd's gradient.

In [4]:
import torch

N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W, b = torch.rand(d, d, requires_grad=True, dtype=dtype), torch.rand(d, requires_grad=True, dtype=dtype)

def foo(W, b):
    return torch.mean(X @ W + b)

loss = foo(W, b)
print(f"{loss=}")

# TODO: Calculate gradients numerically for W and b
# grad_W =...
# grad_b =...

# TODO: Compare with autograd using torch.allclose()
# autograd_W = ...
# autograd_b = ...
# assert torch.allclose(grad_W, autograd_W)
# assert torch.allclose(grad_b, autograd_b)

loss=tensor(1.6421, dtype=torch.float64, grad_fn=<MeanBackward0>)
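
One possible way to fill in the TODOs, offered as a sketch rather than a reference solution. It applies the forward-difference formula from the previous question to one entry at a time. The helper `numerical_grad`, the step `eps=1e-6` and the fixed seed are choices made here, not part of the original snippet.

```python
import torch

torch.manual_seed(0)  # seed added here only for reproducibility
N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W = torch.rand(d, d, requires_grad=True, dtype=dtype)
b = torch.rand(d, requires_grad=True, dtype=dtype)

def foo(W, b):
    return torch.mean(X @ W + b)

def numerical_grad(f, args, idx, eps=1e-6):
    """Forward-difference gradient of the scalar f(*args) w.r.t. args[idx]."""
    args = [a.detach().clone() for a in args]  # work on copies, outside autograd
    f0 = f(*args).item()
    target = args[idx]
    grad = torch.zeros_like(target)
    flat_target, flat_grad = target.view(-1), grad.view(-1)  # views share memory
    for i in range(flat_target.numel()):
        original = flat_target[i].item()
        flat_target[i] = original + eps              # perturb a single entry
        flat_grad[i] = (f(*args).item() - f0) / eps  # finite-difference slope
        flat_target[i] = original                    # restore it
    return grad

grad_W = numerical_grad(foo, (W, b), idx=0)
grad_b = numerical_grad(foo, (W, b), idx=1)

# The same derivatives with autograd.
loss = foo(W, b)
loss.backward()
autograd_W, autograd_b = W.grad, b.grad

assert torch.allclose(grad_W, autograd_W)
assert torch.allclose(grad_b, autograd_b)
```

Since `foo` is linear in `W` and `b`, the finite difference is exact up to floating-point round-off, so the default tolerances of `torch.allclose()` are enough here. For a non-linear function the tolerances or `eps` would likely need tuning.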


### Sequence models

1. Regarding word embeddings:
  1. Explain this term and why it's used in the context of a language model.
  
  **Answer:**
  A word embedding is a learned representation of text in which words with similar meanings have similar representations. It's a class of techniques where individual words are represented as real-valued vectors in a predefined vector space. Each word is mapped to one vector, and the vector values are learned in the same way as the weights of a neural network. There are different techniques for learning these representations, for example an embedding layer, Word2Vec and GloVe.
  A language model uses statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence.
  Language models are used in NLP applications in general, and particularly in ones that generate text as output. One type of language model is the neural language model, which makes use of neural networks and word embeddings.

  1. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? If no, why not?
  
  **Answer:**
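  Technically, yes: a language model can be trained without an embedding, e.g. by feeding the tokens as one-hot vectors. However, word embeddings capture the semantic and syntactic relations between words in a dense, low-dimensional representation. Without them every pair of tokens is equally dissimilar, so the trained model will probably give worse results.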

2. Considering the following snippet, explain:
  1. What does `Y` contain? why this output shape?
  
  **Answer**
  `Y` contains the embedding vector of each token in the n-dimensional tensor `X`. Each vector has a length of 42000 (`embedding_dim`).
  The output shape is the input shape with the embedding dimension appended as the last dimension: every integer token index in `X` is replaced by its vector.

  2. **Bonus**: How you would implement `nn.Embedding` yourself using only torch tensors. 
  
  **Answer**
  In the end, the `nn.Embedding` module is a lookup table that stores the token representations. I would hold a two-dimensional torch tensor of shape `(num_embeddings, embedding_dim)`, where row `i` is the representation of token `i`, and implement the forward pass by indexing this tensor with the tensor of token indices.

In [5]:
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")

Y.shape=torch.Size([5, 6, 7, 8, 42000])
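
The lookup-table idea from the bonus question can be sketched with plain tensor indexing. This is a minimal illustration rather than how `nn.Embedding` is implemented internally; `my_embedding` is a hypothetical helper, and a much smaller `embedding_dim` than in the snippet is used to keep the example light.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_embeddings, embedding_dim = 42, 16  # smaller embedding_dim than in the snippet

# The lookup table: one learnable row per token index.
weight = torch.randn(num_embeddings, embedding_dim, requires_grad=True)

def my_embedding(indices):
    """Replace every integer index with the corresponding row of the table."""
    return weight[indices]

X = torch.randint(low=0, high=num_embeddings, size=(5, 6, 7, 8))
Y = my_embedding(X)
print(Y.shape)  # torch.Size([5, 6, 7, 8, 16])

# Sanity check against nn.Embedding initialized with the same table.
reference = nn.Embedding.from_pretrained(weight.detach().clone())
same = torch.equal(Y, reference(X))
```

Indexing with an integer tensor keeps the shape of the index tensor and appends the row dimension, which is why the embedding dimension shows up as the last dimension of `Y`.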


3. Regarding truncated backpropagation through time (TBPTT) with a sequence length of S: State whether the following sentences are **true or false**, and explain.
  1. TBPTT uses a modified version of the backpropagation algorithm.
  
  **Answer**
  True. TBPTT is a modified version of the BPTT training algorithm (which is itself a modified version of the backpropagation algorithm) for recurrent neural networks, where the sequence is processed one timestep at a time and periodically a BPTT update is performed back for a fixed number of timesteps.

  2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length S.
  
  **Answer**
  False. Limiting the length of the input sequence is not what TBPTT does, and it would make the model lose the context of earlier timesteps. The algorithm processes the whole sequence one timestep at a time, and periodically performs a BPTT update that goes back a fixed number of timesteps.
  We need to define two key parameters for TBPTT:
  - k1: The number of forward-pass timesteps between updates. 
  - k2: The number of timesteps to which to apply BPTT.

  3. TBPTT allows the model to learn relations between input that are at most S timesteps apart.

  **Answer**
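  False. Only the gradients are truncated to S (k2) timesteps. The hidden state is still carried forward from one chunk of the sequence to the next, so it can hold information about inputs that are more than S timesteps in the past, and the model can still learn relations between different "time-batches".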
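
A minimal sketch of a TBPTT training loop with `k1 = k2 = S`, to show where the truncation happens. The model, the random data and the hyperparameters are all made up for illustration. The key step is detaching the hidden state between chunks: its value is kept, but gradients stop flowing further back.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S = 5  # truncation length
rnn = nn.RNN(input_size=3, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.rand(2, 20, 3)  # hypothetical batch: 2 sequences of 20 timesteps
y = torch.rand(2, 20, 1)  # hypothetical per-timestep targets

h = None  # initial hidden state (zeros)
num_updates = 0
for t in range(0, x.shape[1], S):
    out, h = rnn(x[:, t:t + S], h)  # forward S timesteps from the carried state
    loss = nn.functional.mse_loss(head(out), y[:, t:t + S])
    opt.zero_grad()
    loss.backward()                 # backprop through at most S timesteps
    opt.step()
    h = h.detach()                  # keep the state's value, cut the graph
    num_updates += 1

print(num_updates)  # 4 chunks of 5 timesteps
```

The full 20-step sequence is still fed to the model; only the backward pass is limited to S timesteps.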

### Attention

1. In tutorial 5 we learned how to use attention to perform alignment between a source and target sequence in machine translation.
  1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
  
  2. After learning that self-attention is gaining popularity thanks to the transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections, like in the tutorial..). What influence do you expect this will have on the learned hidden states?


**Answer**: 

In a regular encoder-decoder model, the encoder encodes the input sequence into one fixed-length context vector which is used by the decoder.
This limits the ability of the decoder to decode based on this fixed-length context vector, especially as the input sequence gets longer.
When adding an attention mechanism between the encoder and the decoder, each output of the decoder is based on its own context vector, which is a weighted combination of the encoder's output vectors, with weights determined by **the hidden state of the decoder at the previous step**. This is how the hidden states of the decoder are affected by the attention mechanism: they learn to act as queries that select the relevant source positions, while the encoder's hidden states no longer need to compress the whole sentence into the last state.
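
The computation of a single context vector can be sketched as follows. Scaled dot-product scoring without learned projections is an assumption made here for brevity; the tutorial's model may use a different scoring function. The tensors are random stand-ins for real hidden states.

```python
import torch

torch.manual_seed(0)
src_len, hidden = 6, 8
encoder_states = torch.randn(src_len, hidden)  # keys and values: one vector per source token
decoder_state = torch.randn(hidden)            # query: decoder hidden state of the previous step

scores = encoder_states @ decoder_state / hidden ** 0.5  # one alignment score per source token
weights = torch.softmax(scores, dim=0)                   # attention weights, sum to 1
context = weights @ encoder_states                       # context vector for this decoder step
print(weights.shape, context.shape)
```

Each decoder step repeats this with its own query, so every output gets a different weighting of the source tokens instead of a single shared context vector.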



### Unsupervised learning

1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and  a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term.
What would be the qualitative effect of this on:

  1. Images reconstructed by the model during training ($x\to z \to x'$)?
  
  **Answer**
  The VAE loss is comprised of the reconstruction term, which makes the encoding-decoding scheme accurate, and the KL-divergence term, which acts as a regularizer that keeps the posterior produced by the encoder close to the prior $\mathcal{N}(\vec{0},\vec{I})$.
  Without the KL-divergence term the model is trained only to reconstruct, so it behaves like a regular autoencoder: the encoder is free to place the latent codes anywhere in the latent space and to shrink their variance towards zero.
  Therefore, the images reconstructed by the model during training will probably look great, possibly even better than with the KL term.

  1. Images generated by the model ($z \to x'$)?

  **Answer**
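  Without the KL-divergence term nothing pushes the latent codes seen during training to match the prior $\mathcal{N}(\vec{0},\vec{I})$. A latent vector $z$ sampled from the prior is therefore likely to fall in a region of the latent space that the decoder never saw, so the generated images will be poor.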
  

2. Regarding VAEs, state whether each of the following statements is **true or false**, and explain:
  1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
  
  **Answer**
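  False. $\mathcal{N}(\vec{0},\vec{I})$ is the prior we chose for the latent space, and even that choice is not a must.
  For a specific input image $x$, the encoder produces the parameters of a posterior distribution over the latent space, $z \sim \mathcal{N}(\mu(x), \Sigma(x))$, which in general differs from the prior.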

  2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.

  **Answer**
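  False. The layers themselves are deterministic, so with fixed weights the encoder produces the same posterior parameters $\mu(x)$ and $\Sigma(x)$ every time. However, the latent vector $z$ is then sampled from this posterior, so each pass generally yields a different $z$ and therefore a slightly different reconstruction. We would get the same reconstruction only if we decoded the mean $\mu(x)$ instead of sampling.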
  
  3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.

  **Answer**
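  True, when phrased in terms of the loss. The intractable term is the negative log-likelihood of the data, $-\log p(x)$. What we actually minimize is the negative of the variational lower bound (ELBO), i.e. the reconstruction term plus the KL-divergence term, which is an upper bound on $-\log p(x)$. Equivalently, we maximize a lower bound on the data log-likelihood.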

2. Regarding GANs, state whether each of the following statements is **true or false**, and explain:
  1. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.

  **Answer**
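  False. The GAN objective consists of two parts, the expected log-likelihood for real data and for generated data, where the discriminator outputs the probability that a sample is real. 
  
  $ V(D,G) = \mathbb{E}_{x \sim p(x)} \left[ \log D(x) \right] + \mathbb{E}_{z\sim q(z)} \left[\log(1 - D\left(G(z)\right)) \right] $

  The generator can't directly affect the $\log(D(x))$ term, so for the generator, minimizing the objective is equivalent to minimizing $\log(1 - D(G(z)))$.
  The discriminator wants to correctly distinguish real and generated samples, so it maximizes $V(D,G)$, i.e. it minimizes its own classification loss. A discriminator with a high loss is simply a poor classifier, and fooling it says little about the quality of the images. Ideally the discriminator is as accurate as possible and the generator still manages to fool it, so that at equilibrium the discriminator can do no better than outputting $1/2$ for every sample.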
  
  2. It's crucial to backpropagate into the generator when training the discriminator.

  **Answer**
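  False. When training the discriminator we freeze the generator and backpropagate only through the discriminator. The generated images are treated as constant inputs, so no gradients need to flow into the generator.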

  3. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.

  **Answer**
  True. A GAN is an implicit model, meaning it does not try to estimate $P(X)$ explicitly like a VAE (not even $P(X|z)$). When we want to generate a new image we only need the generator, and we provide it with random noise as input. This random noise vector can be sampled from the described latent-space distribution, provided it is the same distribution that was used to sample the noise during training.

  4. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.

  **Answer**
  True. Although this has not been proven, the basic idea is sound. If the classifier can't tell the difference between real and generated data even for the initial random generator output, its feedback is arbitrary and the GAN training can't get started.

  5. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.

  **Answer**
  False. In that case the discriminator no longer contributes to the system: it is effectively guessing, so its feedback carries no useful information and the generator can't improve any further based on it.

### Detection and Segmentation 

1. What is the difference between IoU and Dice score? What's the difference between IoU and mAP?
    Briefly explain when you would use each evaluation metric.

  **Answer**
  IoU - Intersection over Union, is the ratio between the area of overlap and the area of union of the predicted and ground-truth masks. Meaning, what fraction of the region covered by either mask is covered by both.
  Dice score - is 2 * the area of overlap divided by the total number of pixels in both masks.
  The IoU and Dice are very similar and always positively correlated. Dice counts the true positives twice in both the numerator and the denominator, while IoU counts them once. The difference shows when taking the average score over a set of inferences. In general, the IoU metric tends to penalize single instances of bad classification more than the Dice score, even when both agree that the instance is bad. The IoU metric tends to have a "squaring" effect on the errors relative to the Dice score. So the Dice score measures something closer to the average performance, while the IoU score measures something closer to the worst-case performance. 

  mAP - Mean average precision. IoU scores a single prediction against its ground truth, while mAP summarizes a whole detector over a dataset: a detection counts as correct if its IoU with a ground-truth object passes a threshold, AP (average precision) is the area under the precision-recall curve for a class, and mAP is the mean of the AP over all the classes.

  We would use mAP to evaluate object detection over several classes, when we don't want to set specific weights on some classes.
  We would use IoU when we want to give more weight to larger errors, and Dice otherwise.
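
  A small sketch of the two overlap metrics, representing each mask as a set of pixel coordinates. The two masks are made up for illustration, and `iou` and `dice` are helper names chosen here.

```python
def iou(pred, target):
    """Intersection over union of two sets of pixel coordinates."""
    return len(pred & target) / len(pred | target)

def dice(pred, target):
    """Dice score: twice the overlap divided by the total size of both masks."""
    return 2 * len(pred & target) / (len(pred) + len(target))

pred = {(0, 0), (0, 1)}    # hypothetical predicted mask
target = {(0, 1), (1, 1)}  # hypothetical ground-truth mask

print(iou(pred, target), dice(pred, target))  # 0.333... and 0.5
```

  The two scores are tied by $\text{Dice} = 2\,\text{IoU}/(1+\text{IoU})$, so for any imperfect, partially overlapping prediction the Dice score is the higher of the two.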

2. Regarding YOLO and Mask R-CNN, which one is a one-stage detector? Describe the RPN outputs and the YOLO output. Address how the network produces the output and the shape of each output.

  **Answer**
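  A one-stage detector requires only a single pass through the neural network and predicts all the bounding boxes in one go. Of the two, YOLO is the one-stage detector; Mask R-CNN is a two-stage detector that first proposes regions and then classifies and refines them.
  RPN outputs - the Region Proposal Network takes an image of any size as input and outputs a set of rectangular object proposals, each with an objectness score. It slides a small network over the backbone's feature map, and for each of the $k$ anchor boxes at every location it outputs 2 objectness scores and 4 box-regression values, i.e. $2k$ and $4k$ values per location.
  YOLO output - YOLO has 24 convolutional layers followed by 2 fully connected layers. Some of the convolutional layers are alternated with 1 × 1 reduction layers that reduce the depth of the feature maps. The last convolutional layer outputs a tensor with shape (7, 7, 1024), which is then flattened. Using the 2 fully connected layers as a form of linear regression, the network outputs 7×7×30 parameters which are reshaped to (7, 7, 30): for each of the 7×7 grid cells, 2 bounding-box predictions of 5 values each (x, y, w, h, confidence) plus 20 class probabilities.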
