$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
$$

# Part 2: Summary Questions
<a id=part2></a>

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

**Notes**

- Clearly mark where your answer begins, e.g. write "**Answer:**" in the beginning of your cell.
- Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
- This notebook should be runnable from start to end without any errors.

### CNNs

1. Explain the meaning of the term "receptive field" in the context of CNNs.
**Answer:** The receptive field of a neuron in a CNN is the region of the network's input (e.g. the patch of pixels in the original image) that can affect that neuron's value.
The deeper the layer, the larger the receptive field, so we can think of a neuron with a bigger receptive field as calculating a 'higher level' feature.
For example, in a net that extracts features from a face image, a neuron with a small receptive field detects things like straight or curved lines, while a neuron with a large receptive field extracts features of the 'eye location', 'eye color' or 'hair color' sort.

2. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.
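**Answer:** We can control the rate at which the receptive field grows with the following mechanisms:
1. Stride - the convolution kernel is not applied at every adjacent position, but with a step size given by the stride parameter. The output is downsampled by the stride, so the receptive field of every following layer grows faster by that factor. Within each window, the input features are still combined with learned weights.
2. Dilation - the kernel's elements are applied `dilation` positions apart instead of to adjacent neurons, so a kernel of size $k$ with dilation $d$ covers $d(k-1)+1$ input positions. This enlarges the receptive field without adding parameters and without reducing the resolution, but the positions between the kernel's elements are skipped by that layer.
3. Pooling - a layer that applies a fixed, non-learned calculation (max/avg/min) over each window of the previous layer, usually with a stride equal to the kernel size. Like stride it downsamples, so the receptive field of the following layers grows by the pooling factor, but the input features are combined by a fixed function rather than by learned weights.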

3. Imagine a CNN with three convolutional layers, defined as follows:

In [1]:
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape

torch.Size([1, 32, 122, 122])

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?
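**Answer:** Using the receptive-field recurrence (assumed here to match the formula from the tutorials) $r_l = r_{l-1} + (k_l^{\text{eff}} - 1)\, j_{l-1}$ and $j_l = j_{l-1}\, s_l$, with $r_0 = j_0 = 1$, where $k_l^{\text{eff}} = d_l (k_l - 1) + 1$ is the effective kernel size for dilation $d_l$ and $s_l$ is the stride (padding and ReLU do not change the receptive field):

- `Conv2d(kernel_size=3)`: $r = 1 + 2\cdot 1 = 3$, $j = 1$
- `MaxPool2d(2)`: $r = 3 + 1\cdot 1 = 4$, $j = 2$
- `Conv2d(kernel_size=5, stride=2)`: $r = 4 + 4\cdot 2 = 12$, $j = 4$
- `MaxPool2d(2)`: $r = 12 + 1\cdot 4 = 16$, $j = 8$
- `Conv2d(kernel_size=7, dilation=2)`: $k^{\text{eff}} = 13$, $r = 16 + 12\cdot 8 = 112$, $j = 8$

So each "pixel" in the output tensor has a receptive field of $112\times 112$ input pixels.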
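
The receptive field of the network in this question can be checked with a few lines of code. This is a minimal sketch, assuming the standard recurrence in which each layer adds `(effective_kernel - 1) * jump` to the receptive field and multiplies the jump by its stride. `receptive_field` and the `(kernel_size, stride, dilation)` tuples are hypothetical helpers, not part of PyTorch.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation) tuples, ordered input to output."""
    r, j = 1, 1  # receptive field, and jump (distance between adjacent outputs in input pixels)
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1   # effective kernel size with dilation
        r += (k_eff - 1) * j
        j *= s
    return r

# The spatial layers of `cnn` (ReLU and padding do not change the receptive field).
cnn_layers = [
    (3, 1, 1),  # Conv2d(kernel_size=3)
    (2, 2, 1),  # MaxPool2d(2)
    (5, 2, 1),  # Conv2d(kernel_size=5, stride=2)
    (2, 2, 1),  # MaxPool2d(2)
    (7, 1, 2),  # Conv2d(kernel_size=7, dilation=2)
]
rf = receptive_field(cnn_layers)
print(rf)  # 112
```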

4. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

  After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead: $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

  However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

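**Answer:** This is explained by understanding what a residual block essentially means: the filters of the two networks are trained to represent different functions. In the original network, $f_l$ has to learn the full mapping from $\vec{x}$ to $\vec{y}_l$. In the residual network the input is added back to the filter result, so $f_l$ only has to learn the **differential** (residual) $\vec{y}_l-\vec{x}$, meaning that we are learning 'additive' features on top of the data. Since the target of each $f_l$ is different, the learned filters $\vec{\theta}_l$ are different as well.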
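
A tiny scalar sketch of this effect, under the assumption that the "layer" is a single weight `w` and that the target mapping is $y = 1.5x$ (both are illustrative choices, not from the course material). Fitting the plain layer $y = wx$ and the residual layer $y = wx + x$ by least squares gives different weights for the same overall function.

```python
xs = [0.5, 1.0, 2.0, 3.0]
ys = [1.5 * x for x in xs]  # assumed target mapping: y = 1.5 * x

def fit_plain(xs, ys):
    # Least-squares solution for y ≈ w * x.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def fit_residual(xs, ys):
    # Least-squares solution for y ≈ w * x + x, i.e. fit w to the residual y - x.
    return fit_plain(xs, [y - x for x, y in zip(xs, ys)])

w_plain = fit_plain(xs, ys)
w_res = fit_residual(xs, ys)
print(w_plain, w_res)  # about 1.5 for the plain layer, about 0.5 for the residual layer
```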

### Dropout

1. Consider the following neural network:

In [2]:
import torch.nn as nn

p1, p2 = 0.1, 0.2
nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout(p=p1),
    nn.Dropout(p=p2),
)

Sequential(
  (0): Conv2d(3, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): Dropout(p=0.1, inplace=False)
  (3): Dropout(p=0.2, inplace=False)
)

If we want to replace the two consecutive dropout layers with a single one defined as follows:
```python
nn.Dropout(p=q)
```
what would the value of `q` need to be? Write an expression for `q` in terms of `p1` and `p2`.

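**Answer:** Let $V_{in}$ be the number of input values, and $V_{out}$ the expected number of values remaining (not zeroed) after both dropout layers. The first layer drops a fraction $p_1$ of the values, and the second drops a fraction $p_2$ of what remains:

$V_{out} = V_{in} - p_1 V_{in} -(1-p_1)\, p_2 V_{in} = V_{in}\left(1-(p_1+p_2-p_1 p_2)\right)$

Therefore, **let $q = p_1 +p_2 -p_1 p_2$**, or equivalently $1-q=(1-p_1)(1-p_2)$. The scaling matches as well, since $\frac{1}{1-p_1}\cdot\frac{1}{1-p_2}=\frac{1}{1-q}$.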

2. **True or false**: dropout must be placed only after the activation function.
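**Answer:** False. Dropout can be placed before or after the activation function, and what changes is its effect. With ReLU the two orders are equivalent, since $\mathrm{ReLU}(0)=0$ and ReLU commutes with the positive scaling $1/(1-p)$. With tanh a dropped input still gives a zero output, since $\tanh(0)=0$, although the scaling no longer commutes with the activation. With an activation such as sigmoid, where $\sigma(0)=0.5$, applying dropout before the activation does not zero the output, so in that case it makes more sense to place the dropout after the activation.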

3. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.
**Answer:** 
Let $P_i\sim\mathrm{Bernoulli}(1-p)$ be a random variable representing the dropout mask ($P_i=0$ when the activation is dropped), independent of $x_i$, the activation on top of which the dropout is applied.
Let $a$ be a constant such that $\mathbb{E}[a\cdot P_i \cdot x_i] = \mathbb{E}[x_i]$. Then

$\mathbb{E}[a\cdot P_i \cdot x_i] = a\cdot \mathbb{E}[P_i]\cdot \mathbb{E}[x_i] = a \cdot (1-p) \cdot \mathbb{E}[x_i]=\mathbb{E}[x_i]\;\Rightarrow\; a=\frac{1}{1-p}$
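
The result can also be checked empirically. The following is a minimal Monte Carlo sketch in plain Python, assuming a constant activation `x = 2.0` and `p = 0.3` (both arbitrary illustrative values). `dropout` here is a hypothetical helper written for this example, not `torch.nn.Dropout`.

```python
import random

def dropout(xs, p, rng):
    """Inverted dropout: zero each value with probability p, scale survivors by 1/(1-p)."""
    scale = 1.0 / (1.0 - p)
    return [x * scale if rng.random() >= p else 0.0 for x in xs]

rng = random.Random(0)
p = 0.3
x = 2.0
n = 200_000
mean_out = sum(dropout([x] * n, p, rng)) / n
print(mean_out)  # close to x, since E[a * P * x] = x when a = 1/(1-p)
```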


### Losses and Activation functions

1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?

**Answer:** I would not train this with an L2 loss. With outputs in $[0,1]$, the L2 loss of a single sample is bounded by 1, so a confidently wrong prediction is penalized only slightly more than a mildly wrong one. In addition, the Euclidean distance treats the labels as points on a numeric scale rather than as class probabilities, which is not what we want for a classification task. Instead, I would rather use a negative log-likelihood loss, i.e. (binary) cross-entropy, which encodes the probability of a successful classification and grows without bound as the model becomes confidently wrong.

**for example**:
## TODO 
Let $b_{ij}$ be the probability of an image $i$ belonging to class $j$
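
One possible numerical example for this question, assuming the model outputs a probability $\hat{y}\in(0,1)$ for the class "hotdog" and that the true label is $y=1$. The specific values of `y_hat` are illustrative, and `l2_loss` / `bce_loss` are helper names introduced here.

```python
import math

def l2_loss(y, y_hat):
    return (y - y_hat) ** 2

def bce_loss(y, y_hat):
    # Binary cross-entropy for a single sample.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

y = 1.0  # true class: hotdog
for y_hat in (0.4, 0.1, 0.001):
    print(f"y_hat={y_hat}: L2={l2_loss(y, y_hat):.3f}, BCE={bce_loss(y, y_hat):.3f}")
```

As the prediction moves from mildly wrong to confidently wrong, the L2 loss saturates just below 1, while the cross-entropy keeps growing.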

2. After months of research into the origins of climate change, you observe the following result:

<center><img src="https://sparrowism.soc.srcf.net/home/piratesarecool4.gif" /></center>

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in `N` locations around the globe.
You define your model as follows:

In [3]:
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H), nn.Sigmoid(),
    ]*24,
    nn.Linear(in_features=H, out_features=1),
)

While training your model you notice that the loss reaches a plateau after only a few iterations.
It seems that your model is no longer training.
What is the most likely cause?

**Answer:** The most likely cause is vanishing gradients, due to the depth of the network combined with the sigmoid activations. During backpropagation, the gradient that reaches an early layer is a product of one factor per layer, and the derivative of the sigmoid is at most $0.25$. With 25 sigmoid layers this product becomes numerically tiny, so the weights of the early layers barely change and the net essentially stops changing while training.

3. Referring to question 2 above: A friend suggests that if you replace the `sigmoid` activations with `tanh`, it will solve your problem. Is he correct? Explain why or why not.

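**Answer:** This will probably not solve the issue. The sigmoid maps the whole real line to $(0,1)$, while tanh only stretches this range to $(-1,1)$. The derivative of tanh reaches 1 at zero (compared to $0.25$ for the sigmoid), which helps somewhat, but tanh still saturates: its derivative is close to zero for inputs of large magnitude, so the gradients can still vanish in a network this deep. One possible solution might be changing the activation function to ReLU, or to some version of it with a non-zero gradient when the input is negative, for example LeakyReLU or ELU.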
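
A rough back-of-the-envelope sketch of this comparison, assuming every activation sees the same pre-activation value `z` and ignoring the weight matrices entirely (a strong simplification made only for illustration). `d_sigmoid` and `d_tanh` are helper names introduced here.

```python
import math

def d_sigmoid(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

depth = 25  # number of activation layers in `mlpirate`
for z in (0.0, 2.0):
    # Product of `depth` identical activation derivatives (weights ignored).
    print(z, d_sigmoid(z) ** depth, d_tanh(z) ** depth)
```

At `z = 0` the tanh factor is 1 while the sigmoid factor is already tiny, but once the units move away from zero both products collapse.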

4. Regarding the ReLU activation, state whether the following sentences are **true or false** and explain:
  1. In a model using exclusively ReLU activations, there can be no vanishing gradients.
  1. The gradient of ReLU is linear with its input when the input is positive.
  1. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.
  
**Answer:** 
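1. False. ReLU has a gradient of 1 for positive inputs, which helps, but its gradient is 0 for negative inputs, and the backpropagated gradient is also multiplied by the weight matrices. So there could still be vanishing gradients with a deep enough network, something we have seen while training networks in the course (sadly).
2. False. For a positive input the gradient of ReLU is the constant 1, so it does not depend on the input at all. It is ReLU itself that is linear in its input there, not its gradient.
3. True. If the weights reach a region where the neuron's pre-activation is negative for all inputs, the activation will always be zero, the gradient will be zero, and the weights will be 'stuck' in that region. Therefore, the neuron will always output 0, and be dead.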

### Optimization

1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

**Answer:**

The 3 optimization methods differ in how much of the training data they use for each weight update:

Regular GD uses all samples in the training data for each update, meaning that the weights will only be updated after calculating the mean loss of all the samples.

Mini-batch SGD splits the data into mini-batches and uses a single batch for each weight update.

SGD uses a single sample for each update, meaning that the weights are updated for every sample. Each update is therefore very cheap, and there are many updates per epoch.

Naturally, for large datasets GD will be much slower, since every single update requires a pass over the entire dataset. On the other hand, single-sample SGD doesn't take advantage of parallelization (which GPUs are very good at), while mini-batches do.

It's worth noting that while SGD usually makes progress faster than GD, the convergence path is much noisier, since each sample can pull the weights in a slightly different direction. The situation improves when using mini-batch SGD, since each weight update is based on the mean loss over a small batch of samples.
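
The three variants can be written as one training loop that differs only in the batch size. This is a minimal sketch in plain Python, assuming a toy noiseless 1-D regression problem $y = 3x$, a fixed learning rate and 50 epochs (all illustrative choices). Because the data is noiseless, all three variants reach the same `w`; the example only shows how the number of updates per epoch differs.

```python
import random

def train(xs, ys, batch_size, lr=0.05, epochs=50, seed=0):
    """Fit y ≈ w * x by minimizing the mean squared error; returns the learned w."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(idx)
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w

xs = [i / 10 for i in range(1, 21)]   # 20 samples
ys = [3.0 * x for x in xs]            # noiseless data with true w = 3

w_gd = train(xs, ys, batch_size=len(xs))  # GD: one update per epoch
w_mb = train(xs, ys, batch_size=5)        # mini-batch SGD: 4 updates per epoch
w_sgd = train(xs, ys, batch_size=1)       # SGD: 20 updates per epoch
print(w_gd, w_mb, w_sgd)
```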

2. Regarding SGD and GD:
  1. Provide at least two reasons for why SGD is used more often in practice compared to GD.
  2. In what cases can GD not be used at all?

**Answer:**

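Reasons SGD is more common than GD:
1. SGD makes progress much faster than GD, since it updates the weights after every sample (or mini-batch) instead of once per pass over the whole dataset.
2. SGD is much less memory intensive, since only the current sample (or mini-batch) has to be held in memory.

As explained before, GD uses all the samples in the dataset for each update, thus it is memory intensive. With a large number of samples it can become impossible to hold all of them in memory at once. In that case, GD cannot be used at all and we'll have to use either SGD or mini-batch SGD.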

3. You have trained a deep resnet to obtain SoTA results on ImageNet.
While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average.
Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM.
You're now considering to increase the mini-batch size from $B$ to $2B$.
Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

**Answer:**

I would expect the number of iterations to decrease. Increasing the batch size will cause each iteration to take more time, but the gradient estimate will be more accurate (lower variance) since it is based on more samples. Thus, fewer iterations will be needed to converge to the same loss value $l_0$. It should be noted that since each iteration will take longer, we don't know if the total time will be more, less or the same with larger batches.

4. For each of the following statements, state whether they're **true or false** and explain why.
  1. When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.
  1. Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.
  1. SGD is less likely to get stuck in local minima, compared to GD.
  1. Training  with SGD requires more memory than with GD.
  1. Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.
  1. Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.

**Answer:**

1. True. Unlike regular GD, with SGD we calculate the gradient for each sample (and update the weights for each gradient).
2. False. Gradients with SGD have **more** variance, not less, since the samples can be very different and two different samples can have gradients in different directions. Even so, the average of many such gradients points in the direction of the full gradient, thus SGD will still converge. It often converges faster than GD in practice, because we can make good progress before even using all the data.
3. True. The bigger variance (explained above) along with the randomness of SGD will help escape local minima. While the gradient that uses the losses from all the samples might lead straight into a local minimum and stay there, SGD is more erratic and its variance can push us out of it.
4. False. As we explained above, GD might not even be possible to use due to memory limitations, while SGD will be possible. So SGD uses less memory.
5. False. Neither is guaranteed to converge to a global minimum on a non-convex loss. SGD is better at escaping local minima (where GD might get stuck), but there are no guarantees in life.
6. False. Momentum helps SGD in a ravine by damping the oscillations across it, but Newton's method uses the curvature (the inverse Hessian) to rescale the step in each direction, which is exactly what a narrow ravine requires. On a quadratic ravine, Newton's method reaches the minimum in a single step.
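
To illustrate the last statement, here is a minimal sketch on an assumed quadratic ravine $f(x, y) = \frac{1}{2}(x^2 + 100y^2)$. For simplicity it uses the exact gradient with heavy-ball momentum rather than a stochastic one, and the learning rate, momentum coefficient and tolerance are arbitrary illustrative values.

```python
# Quadratic ravine f(x, y) = 0.5 * (x**2 + 100 * y**2); minimum at (0, 0).
H = (1.0, 100.0)  # diagonal of the (constant) Hessian

def grad(p):
    return (H[0] * p[0], H[1] * p[1])

def momentum_steps(p, lr=0.015, beta=0.9, tol=1e-6, max_steps=10_000):
    """Gradient descent with heavy-ball momentum; returns the number of steps used."""
    v = (0.0, 0.0)
    for step in range(1, max_steps + 1):
        g = grad(p)
        v = (beta * v[0] - lr * g[0], beta * v[1] - lr * g[1])
        p = (p[0] + v[0], p[1] + v[1])
        if max(abs(p[0]), abs(p[1])) < tol:
            return step
    return max_steps

def newton_step(p):
    """One Newton step: p - H^{-1} grad(p), using the diagonal Hessian."""
    g = grad(p)
    return (p[0] - g[0] / H[0], p[1] - g[1] / H[1])

start = (1.0, 1.0)
print(newton_step(start))      # lands on the minimum in one step
print(momentum_steps(start))   # needs many steps
```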

5. In tutorial 5 we saw an example of bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
  **True or false**: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc).
  Provide a mathematical justification for your answer.

**Answer:**

False. As seen in tutorial 5, in order to properly back propagate through such a layer we only need the following.
Let $x$ be the input to the layer and $y$ the inner variable of the layer, and let the output of the layer be
$\hat{y}(x)=\arg\min_{y}{f(x,y)}$, where $\hat{y}$ is the optimal value of the inner problem.

At the optimum, $\nabla_y f(x,\hat{y})=0$. Differentiating this condition w.r.t. $x$ gives $\nabla_{yy}f(x,\hat{y})\cdot\nabla_x \hat{y} + \nabla_{yx}f(x,\hat{y})=0$, so in order to back propagate we only need $\nabla_x \hat{y}(x)=-\nabla_{yy}f(x,\hat{y})^{-1}\cdot\nabla_{yx}f(x,\hat{y})$. This expression depends only on the optimum $\hat{y}$ and not on how it was found: we can solve the inner optimization problem directly (e.g. in closed form), or via a descent based solution. Thus, the sentence is false.
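
A scalar sketch of this argument, assuming the toy inner problem $f(x, y) = \frac{1}{2}ay^2 - xy$ with $a = 4$ (chosen here only for illustration, it is not the problem from the tutorial). The inner problem is solved once in closed form and once with gradient descent, and the same implicit gradient formula is applied to both solutions.

```python
a = 4.0  # curvature of the assumed inner problem f(x, y) = 0.5 * a * y**2 - x * y

def solve_closed_form(x):
    return x / a  # from df/dy = a * y - x = 0

def solve_gradient_descent(x, lr=0.1, steps=200):
    y = 0.0
    for _ in range(steps):
        y -= lr * (a * y - x)  # step along -df/dy
    return y

def implicit_grad(x, y_hat):
    # d y_hat / dx = -(d^2 f / dy^2)^{-1} * (d^2 f / dy dx), evaluated at (x, y_hat).
    f_yy = a
    f_yx = -1.0
    return -f_yx / f_yy

x = 2.0
y_direct = solve_closed_form(x)
y_descent = solve_gradient_descent(x)
print(y_direct, y_descent, implicit_grad(x, y_direct), implicit_grad(x, y_descent))
```

For this particular $f$ the second derivatives are constants, so the gradient is $1/a$ regardless of the solver; in general they are evaluated at $\hat{y}$, which is the only thing the solver has to provide.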


6. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$.
  Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
  1. Explain the concepts of "vanishing gradients", and "exploding gradients".
  2. How can each of these problems be caused by increased depth?
  3. Provide a numerical example demonstrating each.
  4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

**Answer:**

1. Vanishing gradients is where, in deep networks, the gradients of the loss w.r.t. the parameters of the early layers diminish to 0 (or almost 0) as they are propagated backwards through the layers. Exploding gradients is the same, except that the gradients become very, very large. A good example for the concept is shown in the [Numeric Algorithms when showing a LU decomposition](https://panoptotech.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=c97eba2f-a243-4375-8b26-aa22009d14d8). In either case, the network becomes untrainable since the weights will either update by too much or barely at all.
2. For deep networks, during backpropagation, the gradient of the loss is calculated using the chain rule, so the gradient at an early layer is a product of one factor per later layer, and the number of factors grows with the depth.
    - When these factors are small (magnitude below 1), their product diminishes fast, and by the time we finish multiplying them we will have a near-0 value. This is the vanishing gradients.
    - When these factors are large (magnitude above 1), their product increases fast (explodes), and that is the exploding gradients.
3. Let the net be a simple net without activations, with a width of 1 and weights $a_0,\dots,a_k$, such that $y=x\cdot\prod_{i=0}^{k}{a_i}$. The derivative w.r.t. the first layer's weight $a_0$ will be $x\cdot\prod_{i=1}^{k}{a_i}$.

    If $|a_i|<1$, for example $a_i= \dfrac{1}{2}$, the gradient is $x\cdot 2^{-k}$, which approaches 0 when $k$ is sufficiently large.

    If $|a_i|>1$, for example $a_i=2$, the gradient is $x\cdot 2^{k}$, which explodes instead.

4. Assuming I encounter one of these: if the loss changes wildly from one iteration to the next without converging, or becomes `inf`/`NaN`, it's likely exploding gradients. If the accuracy and/or the loss don't change (or barely change) between epochs, then it's likely vanishing gradients.
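
The numerical example can be evaluated directly. This is a small sketch for the width-1 net $y = x\prod_{i=0}^{k} a_i$ with an assumed $k = 50$; `grad_wrt_first_weight` is a helper name introduced here.

```python
def grad_wrt_first_weight(x, weights):
    """d/da_0 of y = x * a_0 * a_1 * ... * a_k, i.e. x * a_1 * ... * a_k."""
    g = x
    for a in weights[1:]:
        g *= a
    return g

x, k = 1.0, 50
vanishing = grad_wrt_first_weight(x, [0.5] * (k + 1))  # all a_i = 1/2
exploding = grad_wrt_first_weight(x, [2.0] * (k + 1))  # all a_i = 2
print(vanishing, exploding)  # 2**-50 (about 8.9e-16) and 2**50 (about 1.1e15)
```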



### Backpropagation

1. You wish to train the following 2-layer MLP for a binary classification task:
  $$
  \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2
  $$
  You wish to minimize the in-sample loss function, which is defined as
  $$
  L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right)
  $$
  Where the pointwise loss is binary cross-entropy:
  $$
  \ell(y, \hat{y}) =  - y \log(\hat{y}) - (1-y) \log(1-\hat{y})
  $$
  
  Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$. 
  ## TODO
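
A sketch of one possible derivation, under the assumptions that $\hat{y}^{(i)}$ is a scalar in $(0,1)$ (so $\mat{W}_2$ is a row vector) and that $\varphi$ is applied elementwise with derivative $\varphi'$. The intermediate symbols $\vec{z}^{(i)}$, $\vec{h}^{(i)}$, $\delta^{(i)}$ and $\vec{g}^{(i)}$ are introduced here for convenience and are not part of the question.

$$
\begin{aligned}
\vec{z}^{(i)} &= \mat{W}_1 \vec{x}^{(i)} + \vec{b}_1, \qquad \vec{h}^{(i)} = \varphi(\vec{z}^{(i)}), \qquad \hat{y}^{(i)} = \mat{W}_2 \vec{h}^{(i)} + \vec{b}_2 \\
\delta^{(i)} &= \pderiv{\ell}{\hat{y}^{(i)}} = -\frac{y^{(i)}}{\hat{y}^{(i)}} + \frac{1-y^{(i)}}{1-\hat{y}^{(i)}}, \qquad \vec{g}^{(i)} = \left(\mattr{W}_2\, \delta^{(i)}\right) \odot \varphi'(\vec{z}^{(i)}) \\
\pderiv{L_{\mathcal{S}}}{\mat{W}_2} &= \frac{1}{N}\sum_{i=1}^{N} \delta^{(i)} {\vec{h}^{(i)}}^\top + \lambda \mat{W}_2, \qquad \pderiv{L_{\mathcal{S}}}{\vec{b}_2} = \frac{1}{N}\sum_{i=1}^{N} \delta^{(i)} \\
\pderiv{L_{\mathcal{S}}}{\mat{W}_1} &= \frac{1}{N}\sum_{i=1}^{N} \vec{g}^{(i)} {\vec{x}^{(i)}}^\top + \lambda \mat{W}_1, \qquad \pderiv{L_{\mathcal{S}}}{\vec{b}_1} = \frac{1}{N}\sum_{i=1}^{N} \vec{g}^{(i)} \\
\pderiv{L_{\mathcal{S}}}{\vec{x}^{(i)}} &= \frac{1}{N}\, \mattr{W}_1 \vec{g}^{(i)}
\end{aligned}
$$

The regularization term contributes $\lambda \mat{W}_1$ and $\lambda \mat{W}_2$ only, since it does not depend on the biases or on the input.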

2. The derivative of a function $f(\vec{x})$ at a point $\vec{x}_0$ is
  $$
  f'(\vec{x}_0)=\lim_{\Delta\vec{x}\to 0} \frac{f(\vec{x}_0+\Delta\vec{x})-f(\vec{x}_0)}{\Delta\vec{x}}
  $$
  
  1. Explain how this formula can be used in order to compute gradients of neural network parameters numerically, without automatic differentiation (AD).
  2. What are the drawbacks of this approach? List at least two drawbacks compared to AD.
  
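  **Answer:** This is the formal definition of the derivative as we know it. To use it numerically we replace the limit with a small but finite step $\Delta$: for each parameter $\theta_i$ we evaluate the loss at $\vec{\theta}$ and again with only $\theta_i$ perturbed by $\Delta$, and approximate $\pderiv{L}{\theta_i}\approx\frac{L(\vec{\theta}+\Delta\vec{e}_i)-L(\vec{\theta})}{\Delta}$, where $\vec{e}_i$ is the $i$-th standard basis vector.
  Drawbacks:
  1. Not numerically accurate or stable: a large $\Delta$ gives a poor approximation of the limit, while a very small $\Delta$ means subtracting two nearly equal numbers and dividing by a very small number, which amplifies floating-point round-off errors.
  2. Computationally heavy and slow: we need an extra forward pass for every single parameter, while AD obtains all the gradients with one forward and one backward pass.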

3. Given the following code snippet:
  1. Write a short snippet that calculates the gradient of `loss` w.r.t. `W` and `b` using the approach of numerical gradients from the previous question.
  2. Calculate the same derivatives with autograd.
  3. Show, by calling `torch.allclose()` that your numerical gradient is close to autograd's gradient.

In [4]:
import torch

N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W, b = torch.rand(d, d, requires_grad=True, dtype=dtype), torch.rand(d, requires_grad=True, dtype=dtype)

def foo(W, b):
    return torch.mean(X @ W + b)

loss = foo(W, b)
print(f"{loss=}")

# TODO: Calculate gradients numerically for W and b
# grad_W =...
# grad_b =...

# TODO: Compare with autograd using torch.allclose()
# autograd_W = ...
# autograd_b = ...
# assert torch.allclose(grad_W, autograd_W)
# assert torch.allclose(grad_b, autograd_b)

loss=tensor(1.8498, dtype=torch.float64, grad_fn=<MeanBackward0>)
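
One possible way to fill in the TODOs above, as a minimal sketch. `numerical_grad` is a hypothetical helper (not a PyTorch function) that perturbs one element at a time with a forward difference, and the step size `delta=1e-6` is an assumed value. The setup from the cell is repeated, with a fixed seed added, so that the block runs on its own.

```python
import torch

torch.manual_seed(0)
N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W = torch.rand(d, d, requires_grad=True, dtype=dtype)
b = torch.rand(d, requires_grad=True, dtype=dtype)

def foo(W, b):
    return torch.mean(X @ W + b)

def numerical_grad(f, t, delta=1e-6):
    """Forward-difference estimate of the gradient of the scalar f(t) w.r.t. t."""
    t0 = t.detach().clone()
    grad = torch.zeros_like(t0)
    base = f(t0)
    for idx in range(t0.numel()):
        t_pert = t0.clone()
        t_pert.view(-1)[idx] += delta          # perturb a single element
        grad.view(-1)[idx] = (f(t_pert) - base) / delta
    return grad

# Numerical gradients: one extra forward pass per element.
grad_W = numerical_grad(lambda w: foo(w, b.detach()), W)
grad_b = numerical_grad(lambda v: foo(W.detach(), v), b)

# Autograd gradients: one forward and one backward pass.
loss = foo(W, b)
loss.backward()
autograd_W, autograd_b = W.grad, b.grad

assert torch.allclose(grad_W, autograd_W)
assert torch.allclose(grad_b, autograd_b)
```

Since `foo` is linear in `W` and `b`, the forward difference has no truncation error here, and only round-off separates the two results.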


### Sequence models

1. Regarding word embeddings:
    1. Explain this term and why it's used in the context of a language model.
    1. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

**Answer:**
1. Word embedding
    1. A word embedding is a projection of words (tokens) into a latent vector space, where words with close semantic meaning are "close" to each other. When working with a language model, we care less about the specific word used and more about the semantic meaning. For example, we don't care if the sentence is "yes, I will drive you to work" or "positive, I will transport you to labor": we expect both sentences to give a similar result even though the words are different.
    2. Yes. The model can be trained on the "raw" sequence of token indices, but as a consequence it will likely have significantly worse accuracy, since the indices are arbitrary numbers that carry no information about meaning. The model will also not know how to handle new words, even if they have a close semantic meaning to other words it was trained on. Using an embedding also lets us expand the model's "vocabulary" by only adding new words to the embedding part of the model.

2. Considering the following snippet, explain:
    1. What does `Y` contain? why this output shape?
    2. How you would implement `nn.Embedding` yourself using only torch tensors. 

In [6]:
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")

Y.shape=torch.Size([5, 6, 7, 8, 42000])


**Answer:**

2. Snippet
    1. `Y` is a tensor containing the projection (aka mapping) of each token index in `X` to a vector of size `42000`. The first `4` dims are the same as the shape of `X`, since every index is embedded independently, and the `5`th dim is of size `42000`, where the `i`th number in that dim is the coordinate along the `i`th dim of the latent space.
    1. I will use a matrix of shape $n\times m$, where $n$ is the number of words and $m$ is the dimension of the latent space. To get the projection of a word, we multiply a one-hot vector ($1\times n$) by the matrix, which essentially extracts a specific row from the matrix. The same result is obtained by indexing the matrix directly with the token indices.
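
A minimal sketch of such an implementation using only tensors, compared against `nn.Embedding` initialized with the same matrix. The variable names are illustrative, and the embedding dimension is reduced to 16 (instead of 42000) only to keep the example light.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_embeddings, embedding_dim = 42, 16
weight = torch.randn(num_embeddings, embedding_dim)  # the n x m matrix
X = torch.randint(low=0, high=num_embeddings, size=(5, 6, 7, 8))

# Option 1: index the rows of the matrix directly.
Y_index = weight[X]

# Option 2: multiply one-hot vectors by the matrix (same result, much more work).
Y_onehot = F.one_hot(X, num_classes=num_embeddings).to(weight.dtype) @ weight

# Reference: nn.Embedding initialized with the same matrix.
Y_ref = nn.Embedding.from_pretrained(weight)(X)
print(Y_index.shape)  # torch.Size([5, 6, 7, 8, 16])
```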

3. Regarding truncated backpropagation through time (TBPTT) with a sequence length of S: State whether the following sentences are **true or false**, and explain.
    1. TBPTT uses a modified version of the backpropagation algorithm.
    2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length S.
    3. TBPTT allows the model to learn relations between input that are at most S timesteps apart.

**Answer:**

3. TBPTT
    1. **True**. TBPTT uses a modified version. In normal BPTT we calculate the gradient through the entire sequence (through time), while in the truncated version we only backpropagate through the last $S$ timesteps.
    2. **False**. Limiting the sequence length alone is not enough. We process the sequence in chunks of length $S$, pass the hidden state forward from one chunk to the next, and cut the gradient flow at the chunk boundary (e.g. by detaching the hidden state).
    3. **False**. As seen in HW3, even though we use TBPTT we have a "memory" of prior events, given to us by the hidden state that we do move "forward" between batches; we just don't backprop through it. Thus we might have a weaker memory of prior events, but relations between inputs that are more than $S$ timesteps apart can still be captured.
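
A minimal sketch of a TBPTT training loop, assuming a toy `nn.RNN` with a linear head and random data (the sizes, learning rate and loss are illustrative and not taken from HW3). The hidden state is carried forward between chunks of length `S`, and `detach()` stops the gradient at the chunk boundary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S = 4                                   # truncation length
rnn = nn.RNN(input_size=3, hidden_size=5, batch_first=True)
head = nn.Linear(5, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(2, 12, 3)               # (batch, time, features): toy data
y = torch.randn(2, 12, 1)

h = None                                # initial hidden state (zeros)
for t in range(0, x.shape[1], S):
    out, h = rnn(x[:, t:t + S], h)      # forward over one chunk of S timesteps
    loss = nn.functional.mse_loss(head(out), y[:, t:t + S])
    opt.zero_grad()
    loss.backward()                     # gradients flow only within this chunk
    opt.step()
    h = h.detach()                      # keep the state, cut the graph
```

Without the `detach()`, the next `backward()` would try to backpropagate into the previous chunk's graph, which is what the truncation is meant to avoid.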

### Attention

1. In tutorial 7 (part 2) we learned how to use attention to perform alignment between a source and target sequence in machine translation.
  1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
  
  2. After learning that self-attention is gaining popularity thanks to the shiny new transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections). What influence do you expect this will have on the learned hidden states?


**Answer:**

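1. The addition of the attention mechanism between the encoder and the decoder lets the decoder select, at each output step, the encoder hidden states that are relevant to it, which eliminates non-relevant information from the process.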
    - _With the Attention Mechanism_  - TODO:
    - _Without the Attention Mechanism_  - TODO:

### Unsupervised learning

1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and  a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term.
What would be the qualitative effect of this on:

    1. Images reconstructed by the model during training ($x\to z \to x'$)?
    1. Images generated by the model ($z \to x'$)?

**Answer:**
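1. VAE without KL-divergence. The KL-divergence is used as a sort of regularization on the encoder, pushing the latent distribution it outputs towards the standard normal prior.
    1. So without the regularization, the model is free to behave like a regular autoencoder and overfit the input data. So the $x'$ will be very similar to the original $x$.
    2. On the other hand, without the regularization nothing makes the latent codes of the training data follow $\mathcal{N}(\vec{0},\vec{I})$. A $z$ sampled from the prior will likely fall in a region that the decoder has never seen during training, so the generated $x'$ will generally be of poor quality, unless $z$ happens to be close to the code of a training image, in which case $x'$ will be (too) similar to the data the model was trained on.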

2. Regarding VAEs, state whether each of the following statements is **true or false**, and explain:
    1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
    2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
    3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.

**Answer:**

2. Regarding VAE (true/false):
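    1. **False**. For a specific input, the encoder outputs its own normal distribution, with a mean and a variance that depend on that input. It is only the prior over the latent space that is $\mathcal{N}(\vec{0},\vec{I})$, and the KL-divergence term pushes the encoder's distribution towards it so that we can sample from the prior when generating.
    2. **False**. The encoder outputs a distribution and $z$ is sampled from it, so feeding the same image multiple times generally gives a different $z$, and therefore a different reconstruction, each time. In the probabilistic formulation the decoder isn't deterministic either, but random with a normal distribution with a mean based on $z$ (see slide 21 in lecture 6).
    3. **True**. This is exactly what we do. The evidence $p(\vec{x})$ is intractable, so we cannot minimize $-\log p(\vec{x})$ directly. Instead we minimize its upper bound, in the hope that it will be tight enough to give good results.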

3. Regarding GANs, state whether each of the following statements is **true or false**, and explain:
    1. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
    2. It's crucial to backpropagate into the generator when training the discriminator.
    3. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
    4. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.
    5. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.

**Answer:**

3. Regarding GANs
    1. **False**. Ideally we would like the discriminator to be at an accuracy of 50%, such that it cannot tell the difference between real and fake pictures. A discriminator with a high loss is simply a bad discriminator, and fooling it doesn't mean that the generated images are good. At this equilibrium the generator's loss wouldn't be very low either, as it would still sometimes "catch hits" from the discriminator.
    2. **False**. The discriminator and the generator are separate models, each with its own loss. We have a label for every input of the discriminator (we know if it was generated by us or taken from the real images). When updating the weights of the discriminator we only need to backpropagate through the discriminator, so the generated images can be detached from the generator's graph.
    3. **True**. This is exactly what we do. We sample a point from the distribution $\mathcal{N}(\vec{0},\vec{I})$ and pass it through the generator to get an image.
    4. **False**. Training the discriminator alone for a few epochs can often cause it to become "too good", which leaves the generator with very little useful gradient. It is usually more beneficial to keep the discriminator weak at first, to allow the generator to become stronger, before giving it a strong discriminator.
    5. **False**. If the output images are plausible and the discriminator has a 50% accuracy, it means that the discriminator doesn't know how to differentiate between real and fake images. So its feedback to the generator will be useless at best, or even harmful to the generator (basically random adjustments).

### Graph Neural Networks

1. You have implemented a graph convolutional layer based on the following formula, for a graph with $N$ nodes:
$$
\mat{Y}=\varphi\left( \sum_{k=1}^{q} \mat{\Delta}^k \mat{X} \mat{\alpha}_k + \vec{b} \right).
$$
  1. Assuming $\mat{X}$ is the input feature matrix of shape $(N, M)$: what does $\mat{Y}$ contain in its rows?
  1. Unfortunately, due to a bug in your calculation of the Laplacian matrix, you accidentally zeroed the row and column $i=j=5$ (assume more than 5 nodes in the graph).
What would be the effect of this bug on the output of your layer, $\mat{Y}$?

2. We have discussed the notion of a Receptive Field in the context of a CNN. How would you define a similar concept in the context of a GCN (i.e. a model comprised of multiple graph convolutional layers)?

**Answer:**
2. 