$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
$$

# Part 4: Summary Questions
<a id=part4></a>

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

**Notes**

- Clearly mark where your answer begins, e.g. write "**Answer:**" in the beginning of your cell.
- Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
- This notebook should be runnable from start to end without any errors.

### CNNs

1. Explain the meaning of the term "receptive field" in the context of CNNs.

<font color = 'red'>

**Answer:**

In the context of CNNs, the receptive field of a feature (an element of some layer's output)
is the region of the input that can affect its value. The larger the receptive field,
the more context the feature sees, so the network can learn more complicated, larger-scale features.

2. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

<font color = 'red'>

**Answer:**

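1. Kernel size: a convolution with kernel size $k$ makes each output element depend on a $k\times k$ window
of its input, so each such layer grows the receptive field in proportion to $k-1$.
All the input features inside the window are combined densely, each with its own weight.

2. Stride: a larger stride leaves fewer overlapping pixels between adjacent windows and
downsamples the output, so the contribution of every following layer to the receptive field
is multiplied by the stride. If the stride is larger than the kernel size,
some input pixels are skipped and never reach the output.

3. Dilation: dilation acts like a stride between the pixels that the kernel
samples within its window. A kernel of size $k$ with dilation $d$ covers an
extent of $1 + d\cdot(k-1)$, so for $d>1$ the receptive field grows without adding parameters,
but only a sparse subset of the input features inside the window is combined.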
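
As a small sanity check of the dilation formula, the following sketch feeds a dilated convolution an input whose size equals the extent $1 + d\cdot(k-1)$, so exactly one output element should fit. The values of `k` and `d` are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

k, d = 3, 2                      # illustrative kernel size and dilation
extent = 1 + d * (k - 1)         # extent covered by the dilated kernel
conv = nn.Conv2d(1, 1, kernel_size=k, dilation=d)

# An input of exactly `extent` x `extent` pixels leaves room for a single output element.
out_shape = tuple(conv(torch.zeros(1, 1, extent, extent)).shape)
print(extent, out_shape)
```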


3. Imagine a CNN with three convolutional layers, defined as follows:

In [None]:
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

<font color = 'red'>

**Answer:**

A MaxPool2d layer affects the receptive field like a convolution layer with $kernel\_size=2$,
$stride=2$ and $dilation=1$. ReLU does not affect the receptive field size.

We can calculate the receptive field size of an output "pixel" after $n$ such layers
using the formula:

$r_0 = \sum_{i=1}^n(k_i -1)\cdot d_i \cdot \prod_{j=1}^{i-1}s_j+1$,

where $k_i$ = kernel size, $d_i$ = dilation and $s_j$ = stride of each layer (the empty product for $i=1$ equals 1).

Going over the five layers (Conv, MaxPool, Conv, MaxPool, Conv) in order:

$\Rightarrow r_0 = (3-1)\cdot 1 + (2-1)\cdot 1\cdot 1 + (5-1)\cdot 1\cdot (1\cdot 2) + (2-1)\cdot 1\cdot (1\cdot 2\cdot 2) + (7-1)\cdot 2\cdot (1\cdot 2\cdot 2\cdot 2) + 1 = 2 + 1 + 8 + 4 + 96 + 1 = 112$

$\Rightarrow r_0=112$

So the receptive field of each "pixel" in the output tensor is $112\times 112$ input pixels.
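
The receptive field formula can also be evaluated programmatically. The helper below is a minimal sketch (`receptive_field` is a hypothetical name, not part of the course code) that takes one `(kernel_size, stride, dilation)` tuple per layer, with the pooling layers treated as convolutions with kernel size 2 and stride 2.

```python
def receptive_field(layers):
    """layers: (kernel_size, stride, dilation) tuples, ordered from input to output."""
    r, jump = 1, 1              # jump = product of the strides of all previous layers
    for k, s, d in layers:
        r += (k - 1) * d * jump
        jump *= s
    return r

# Conv(k=3), MaxPool(2), Conv(k=5, s=2), MaxPool(2), Conv(k=7, d=2)
cnn_layers = [(3, 1, 1), (2, 2, 1), (5, 2, 1), (2, 2, 1), (7, 1, 2)]
rf = receptive_field(cnn_layers)
print(rf)  # 112
```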

4. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

  After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead: $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

  However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

<font color = 'red'>

**Answer:**

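In the original network, each layer $f_l$ learns the full mapping from its input to the desired output.
In the residual network, however, each layer $f_l$ only learns the delta (residual) between
$\vec{y}_l$ and $\vec{x}$, i.e. $f_l(\vec{x};\vec{\theta}_l)=\vec{y}_l-\vec{x}$.
Since the layers of the two networks are trained to represent different functions, the learned filters are different.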

### Dropout

1. **True or false**: dropout must be placed only after the activation function.

<font color = 'red'>

**Answer:**

False.

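Dropout layers are usually placed after activation layers, but this is not
mandatory. For example, with ReLU the order does not change the result, since ReLU maps zero to zero and commutes with scaling by a positive constant.
In some cases it could even be faster to place dropout before the activation
function, because the activation of the dropped elements does not need to be computed.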

2. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

<font color = 'red'>

**Answer:**

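Each element is dropped with probability $p$ and kept with probability $(1-p)$.
Without scaling, the output $f(x)$ for an activation $x$ is equal to $0$ with probability $p$ or equal to $x$ with probability $(1-p)$, so

$\mathbb{E}\left[f(x)\right] = p\cdot 0 + (1-p)\cdot x = (1-p)\cdot x$.

That is, the expectation with dropout is $(1-p)\cdot E$, where $E$ is the expectation with no dropout.

With scaling, the output is equal to $x/(1-p)$ with probability $(1-p)$, so

$\mathbb{E}\left[f(x)\right] = p\cdot 0 + (1-p)\cdot \frac{x}{1-p} = x$.

$\Rightarrow$ scaling the activations by $1/(1-p)$ maintains the value of each activation
unchanged in expectation, and no other constant scale factor does.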
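
The scaling can also be observed empirically. The sketch below applies `torch.nn.functional.dropout` in training mode to a vector of ones; the drop probability and vector size are arbitrary choices for illustration. The surviving elements should all equal $1/(1-p)$, and the mean should stay close to 1.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
p = 0.25                                  # illustrative drop probability
x = torch.ones(1_000_000)
y = F.dropout(x, p=p, training=True)      # zeroes elements and rescales the rest

kept = y[y != 0]                          # elements that were not dropped
print(kept.min().item(), kept.max().item())   # both equal 1 / (1 - p)
print(y.mean().item())                        # close to the original mean of 1
```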

### Losses and Activation functions

1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?

<font color = 'red'>

**Answer:**

L2 loss is not a good loss for this task; it is better suited to regression tasks.
For classification tasks, logistic loss or cross entropy are more suitable.

For example:

If the classifier confidently identifies a hotdog ($y=1$) as a dog ($\hat{y}=0$), then $\text{L2 loss}=(1-0)^2=1$.
This is the largest value the L2 loss can take here, so even the worst possible mistake gets only a bounded penalty.

However, $\text{cross entropy loss}=-(1\cdot \log(0) + 0\cdot \log(1)) \rightarrow \infty$.

Cross entropy penalizes confident mistakes much more heavily, so it is the better choice for this problem.
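
To make the comparison concrete, the sketch below evaluates both losses for a true hotdog ($y=1$) as the prediction moves toward the wrong class. The helper names `l2_loss` and `bce_loss` and the chosen prediction values are illustrative only.

```python
import math

def l2_loss(y, y_hat):
    return (y - y_hat) ** 2

def bce_loss(y, y_hat):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

y = 1.0  # true label: hotdog
for y_hat in (0.5, 0.1, 0.01, 0.0001):
    # L2 saturates below 1, while cross entropy keeps growing as y_hat -> 0
    print(f"y_hat={y_hat}: L2={l2_loss(y, y_hat):.4f}, BCE={bce_loss(y, y_hat):.4f}")
```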

2. After months of research into the origins of climate change, you observe the following result:

<center><img src="https://sparrowism.soc.srcf.net/home/piratesarecool4.gif" /></center>

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in `N` locations around the globe.
You define your model as follows:

In [None]:
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H),
        nn.Sigmoid(),
    ]*N,
    nn.Linear(in_features=H, out_features=1),
)

While training your model you notice that the loss reaches a plateau after only a few iterations.
It seems that your model is no longer training.
What is the most likely cause?

<font color = 'red'>

**Answer:**

The most likely cause is vanishing gradients.
The network is very deep and does not use batch normalization or skip connections.

The network uses only the sigmoid activation function. Its output is
bounded in $[0,1]$ and its derivative is at most $0.25$, so after the $N+1$ multiplications of the chain rule the gradients that reach the first layers are practically zero.
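
The effect can be measured directly by comparing gradient norms at the two ends of a deep sigmoid network. The sketch below is an assumption-laden variant of the model in the question: it builds a separate `nn.Linear` instance for every hidden layer (the `*[...]*N` construction in the question repeats the same two module objects), uses random inputs, and uses `float64` to avoid underflow.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, H = 42, 128
layers = [nn.Linear(N, H), nn.Sigmoid()]
for _ in range(N):
    layers += [nn.Linear(H, H), nn.Sigmoid()]   # a fresh layer instance each time
layers.append(nn.Linear(H, 1))
model = nn.Sequential(*layers).double()

# Backpropagate an arbitrary scalar (sum of outputs on a random batch).
model(torch.rand(16, N, dtype=torch.float64)).sum().backward()

first = model[0].weight.grad.norm().item()     # gradient norm at the first layer
last = model[-1].weight.grad.norm().item()     # gradient norm at the last layer
print(f"first layer: {first:.3e}, last layer: {last:.3e}")
```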

3. Referring to question 2 above: A friend suggests that if you replace the `sigmoid` activations with `tanh`, it will solve your problem. Is he correct? Explain why or why not.

<font color = 'red'>

**Answer:**

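Not entirely. The derivative of $\tanh$ is in the range $(0,1]$, which is larger than the sigmoid's derivative (at most $0.25$), so the problem will be less severe.
However, the derivative equals $1$ only at zero and approaches zero when the unit saturates, so in such a deep model the product of derivatives still shrinks and the model will still suffer from the same problem.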

4. Regarding the ReLU activation, state whether the following sentences are **true or false** and explain:
    1. In a model using exclusively ReLU activations, there can be no vanishing gradients.
    2. The gradient of ReLU is linear with its input when the input is positive.
    3. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.

<font color = 'red'>

**Answer:**

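1. False: ReLU has zero gradient for negative inputs, and the gradients are also multiplied by the weights of every layer,
so in a very deep network the gradients can still vanish.

2. False: The gradient of ReLU is constant for positive input.
Specifically, it is equal to $1$. It is the output, not the gradient, that is linear in the input.

3. True: For negative inputs ReLU outputs 0 and has zero gradient, so if a neuron's input is negative for all samples its weights are no longer updated.
Therefore, its activation can remain at a constant value of zero.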
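
The ReLU gradient values mentioned above can be checked with autograd. This is a minimal sketch with arbitrarily chosen input values.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.5, 3.0], requires_grad=True)
torch.relu(x).sum().backward()

# Zero gradient for negative inputs, constant gradient of 1 for positive inputs.
print(x.grad)  # tensor([0., 0., 1., 1.])
```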

### Optimization

1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

<font color = 'red'>

**Answer:**

GD: In each iteration the method uses all the data to
calculate the gradient.

SGD: In each iteration the method uses one data point, which was not yet used in
the same epoch, to calculate the gradient.
Each SGD iteration performs far fewer calculations than a GD iteration, so SGD is usually faster overall
even though it needs more iterations until convergence.

Mini-batch SGD: In each iteration the method uses a small batch of the data
to calculate the gradient.
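
The three variants differ only in which samples enter each gradient computation. The sketch below illustrates this on a small least-squares problem; the data, the batch size of 4 and the helper `grad` are hypothetical and chosen for illustration. Averaging the per-sample or per-batch gradients recovers the full GD gradient, which is why the stochastic variants are noisy but unbiased estimates of it.

```python
import torch

torch.manual_seed(0)
N, D = 8, 3
X = torch.randn(N, D, dtype=torch.float64)
y = torch.randn(N, dtype=torch.float64)
w = torch.zeros(D, dtype=torch.float64)

def grad(w, idx):
    """Gradient of the mean squared error over the samples selected by idx."""
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

gd_grad = grad(w, torch.arange(N))                                    # GD: all samples
sgd_grads = [grad(w, torch.tensor([i])) for i in range(N)]            # SGD: one sample each
mb_grads = [grad(w, torch.arange(i, i + 4)) for i in range(0, N, 4)]  # mini-batches of 4

print(gd_grad)
print(torch.stack(sgd_grads).mean(dim=0))
print(torch.stack(mb_grads).mean(dim=0))
```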

2. Regarding SGD and GD:
    1. Provide at least two reasons for why SGD is used more often in practice compared to GD.
    2. In what cases can GD not be used at all?

<font color = 'red'>

**Answer:**

1. Reason 1: SGD is usually faster than GD. As explained above,
each SGD iteration performs far fewer calculations,
even though more iterations are needed until convergence.

Reason 2: GD uses more memory than SGD because it needs the entire data set to compute each gradient.
SGD, on the other hand, loads only one sample in each iteration.

For these two reasons SGD is often preferred when working with large data sets.

Note: a third reason is that GD is more prone to overfitting the data set, while the noise in SGD's gradients can act as a regularizer.

2. As stated above, GD needs to load the entire data set into memory, so if the data set is larger than
the memory, GD cannot be used.

3. You have trained a deep resnet to obtain SoTA results on ImageNet.
While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average.
Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM.
You're now considering to increase the mini-batch size from $B$ to $2B$.
Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

<font color = 'red'>

**Answer:**

Because the batch size is bigger, the gradient the model calculates in each iteration is a more
precise (lower-variance) estimate of the true gradient, so each step makes more reliable progress.
We therefore expect the number of iterations required to converge to $l_0$ to decrease.
Note: because the batch size is bigger, each iteration will take longer to compute.

4. For each of the following statements, state whether they're **true or false** and explain why.
    1. When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.
    2. Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.
    3. SGD is less likely to get stuck in local minima, compared to GD.
    4. Training  with SGD requires more memory than with GD.
    5. Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.
    6. Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.

<font color = 'red'>

**Answer:**

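1. True for plain SGD: each step takes only one random sample, which has not yet been chosen in the epoch,
and calculates the gradient only from it, so every epoch performs one optimization step per sample. With mini-batch SGD there is one step per batch instead.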

2. False. It is the other way around:
gradients obtained with GD have less variance than those obtained with SGD, and GD needs fewer iterations to converge.
However, because every SGD iteration is much cheaper, SGD can still be faster than GD overall.

3. True. Because SGD calculates the gradient from a random sample, its gradients have high variance,
which helps in escaping local minima.

4. False, the opposite. GD loads the entire data set into memory, while SGD loads only one sample
in each iteration.

5. False. GD is not guaranteed to converge to the global minimum (unless the loss is convex).
With appropriate learning rates, both can at best be expected to converge to a local minimum.

6. False. Newton's method uses second-order derivatives in addition to the first-order ones,
so it accounts for the different curvature in each direction and can converge in fewer iterations than a method that uses only first derivatives.
A narrow ravine is exactly the case where this curvature information helps most,
even though SGD with momentum does handle ravines better than plain SGD.

5. In tutorial 5 we saw an example of bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
    1. **True or false**: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc).
  Provide a mathematical justification for your answer.

<font color = 'red'>

**Answer:**

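False. To train the network we only need the gradient of the inner problem's solution with respect to the layer's input, and this gradient does not depend on how the solution was found. If $\vec{y}^\ast(\vec{z})=\arg\min_{\vec{y}} f(\vec{y},\vec{z})$, then the optimality condition $\nabla_{\vec{y}} f(\vec{y}^\ast,\vec{z})=0$ holds at the solution. Differentiating it with respect to $\vec{z}$ gives (assuming the Hessian is invertible) $\frac{\partial \vec{y}^\ast}{\partial \vec{z}} = -\left(\nabla^2_{\vec{y}\vec{y}}f\right)^{-1}\nabla^2_{\vec{y}\vec{z}}f$. So the inner problem can be solved with any method, without using a descent method at all.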

6. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$.
  Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
    1. Explain the concepts of "vanishing gradients", and "exploding gradients".
    2. How can each of these problems be caused by increased depth?
    3. Provide a numerical example demonstrating each.
    4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

<font color = 'red'>

**Answer:**

1. Exploding gradients - when gradients are very large, the optimization steps become too big,
which can cause the network to overshoot local minima while the gradient
continues to grow.

Vanishing gradients - when gradients are very small (close to zero), they can become even smaller
as they are propagated back through the network by the chain rule.
The result is very slow learning in the early layers as the network gets deeper.

2. By the chain rule, the gradient at an early layer is a product with one factor per subsequent layer, so the number of factors grows with depth.
Vanishing gradients happen when these factors are smaller than 1, and exploding gradients are
the result of multiplying factors which are bigger than 1, in which case the product can
grow without bound.

3. If every layer contributes a factor of $0.5$, then in an $n$-layer network the gradient is scaled by $0.5^n$,
so even with a relatively small $n=30$ we get $0.5^{30} \approx 9.3\cdot 10^{-10}$, which is almost zero.
On the other hand, with a factor of $1.5$ and $n = 60$ we get $1.5^{60} \approx 3.7\cdot 10^{10}$, which is so big that it will cause
unstable behavior.

4. With exploding gradients, the loss changes rapidly and erratically and grows
without control (possibly becoming NaN). With vanishing gradients, on the other hand, the loss barely moves.
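
The two numerical examples are easy to reproduce. The snippet below is a small sketch that assumes, as in the example, that every layer contributes the same factor to the product.

```python
vanishing = 0.5 ** 30    # 30 layers, each scaling the gradient by 0.5
exploding = 1.5 ** 60    # 60 layers, each scaling the gradient by 1.5

print(f"{vanishing:.2e}")   # on the order of 1e-9
print(f"{exploding:.2e}")   # on the order of 1e10
```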

### Backpropagation

1. You wish to train the following 2-layer MLP for a binary classification task:
  $$
  \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2
  $$
  You wish to minimize the in-sample loss function, defined as
  $$
  L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right)
  $$
  Where the pointwise loss is binary cross-entropy:
  $$
  \ell(y, \hat{y}) =  - y \log(\hat{y}) - (1-y) \log(1-\hat{y})
  $$
  
  Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.

<font color = 'red'>

**Answer:**

Notation: for each sample, $z^{(i)} = W_1 x^{(i)} + b_1$ and $h^{(i)} = \varphi(z^{(i)})$, so that $\hat y^{(i)} = W_2 h^{(i)} + b_2$. Since the output is a scalar, $W_2$ is a row vector. $\odot$ denotes elementwise multiplication.

The derivative of the pointwise loss w.r.t. the model output is

$$
g^{(i)} = \frac{\partial \ell(y^{(i)},\hat y^{(i)})}{\partial \hat y^{(i)}} =
\frac{1-y^{(i)}}{1 - \hat y^{(i)}} - \frac{y^{(i)}}{\hat y^{(i)}}
\Rightarrow
\frac{\partial L_{\mathcal{S}}}{\partial \hat y^{(i)}} = \frac{1}{N} g^{(i)}
$$

Applying the chain rule through the second layer and then the first layer, and adding the derivative of the regularization term where relevant:

$$
\delta W_1 = \lambda W_1 + \frac{1}{N} \sum_{i=1}^N g^{(i)} \left(W_2^\top \odot \varphi'(z^{(i)})\right) {x^{(i)}}^\top
$$

$$
\delta W_2 = \lambda W_2 + \frac{1}{N} \sum_{i=1}^N g^{(i)} {h^{(i)}}^\top
$$

$$
\delta b_1 = \frac{1}{N} \sum_{i=1}^N g^{(i)} \left(W_2^\top \odot \varphi'(z^{(i)})\right)
$$

$$
\delta b_2 = \frac{1}{N} \sum_{i=1}^N g^{(i)}
$$

Only the $i$-th term of the loss depends on $x^{(i)}$, so there is no sum in the derivative w.r.t. the input:

$$
\delta x^{(i)} = \frac{1}{N} g^{(i)} W_1^\top \left(W_2^\top \odot \varphi'(z^{(i)})\right)
$$
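
One way to sanity-check analytic gradients for this MLP is to compare them with autograd. The sketch below is only an illustration under several assumptions not stated in the question: $\varphi=\tanh$, small random sizes, and a small positive $W_2$ with $b_2=0.5$ so that $\hat y$ stays inside $(0,1)$ and the logarithms are defined.

```python
import torch

torch.manual_seed(0)
N, D, H, lam = 5, 4, 3, 0.1          # illustrative sizes and regularization strength
dt = torch.float64

X = torch.randn(N, D, dtype=dt, requires_grad=True)
y = torch.randint(0, 2, (N,)).to(dt)
W1 = torch.randn(H, D, dtype=dt, requires_grad=True)
b1 = torch.randn(H, dtype=dt, requires_grad=True)
# Small positive W2 and b2 = 0.5 keep y_hat inside (0, 1).
W2 = (0.1 * torch.rand(1, H, dtype=dt)).requires_grad_()
b2 = torch.tensor([0.5], dtype=dt, requires_grad=True)

# Forward pass with phi = tanh, followed by autograd's backward pass.
Z = X @ W1.T + b1                          # (N, H)
Hid = torch.tanh(Z)                        # (N, H)
y_hat = (Hid @ W2.T + b2).squeeze(1)       # (N,)
bce = -(y * y_hat.log() + (1 - y) * (1 - y_hat).log())
loss = bce.mean() + lam / 2 * (W1.pow(2).sum() + W2.pow(2).sum())
loss.backward()

# Analytic gradients; g already includes the 1/N factor.
with torch.no_grad():
    g = ((1 - y) / (1 - y_hat) - y / y_hat) / N     # (N,)  dL/dy_hat
    delta = g[:, None] * W2 * (1 - Hid ** 2)        # (N, H) dL/dZ, tanh' = 1 - tanh^2
    dW2 = g[None, :] @ Hid + lam * W2               # (1, H)
    db2 = g.sum(dim=0, keepdim=True)                # (1,)
    dW1 = delta.T @ X + lam * W1                    # (H, D)
    db1 = delta.sum(dim=0)                          # (H,)
    dX = delta @ W1                                 # (N, D), row i is dL/dx_i

print(torch.allclose(dW1, W1.grad), torch.allclose(dW2, W2.grad))
```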

2. Given the following code snippet, implement the custom backward function `part4_affine_backward` in `hw4/answers.py` so that it passes the `assert`s.

In [None]:
from torch.autograd import Function

from hw4.answers import part4_affine_backward

N, d_in, d_out = 100, 11, 7
dtype = torch.float64
X = torch.rand(N, d_in, dtype=dtype)
W = torch.rand(d_out, d_in, requires_grad=True, dtype=dtype)
b = torch.rand(d_out, requires_grad=True, dtype=dtype)

def affine(X, W, b):
    return 0.5 * X @ W.T + b

class AffineLayerFunction(Function):
    @staticmethod
    def forward(ctx, X, W, b):
        result = affine(X, W, b)
        ctx.save_for_backward(X, W, b)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        return part4_affine_backward(ctx, grad_output)

l1 = torch.sum(AffineLayerFunction.apply(X, W, b))
l1.backward()
W_grad1 = W.grad
b_grad1 = b.grad

l2 = torch.sum(affine(X, W, b))
W.grad = b.grad = None
l2.backward()
W_grad2 = W.grad
b_grad2 = b.grad

assert torch.allclose(W_grad1, W_grad2)
assert torch.allclose(b_grad1, b_grad2)
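
For reference, here is one possible shape such a backward function could take. This is a self-contained sketch, not the actual contents of `hw4/answers.py`: the names `affine_backward_sketch` and `AffineSketch` are hypothetical, and the gradients follow from differentiating `0.5 * X @ W.T + b` with respect to each input.

```python
import torch
from torch.autograd import Function

def affine(X, W, b):
    return 0.5 * X @ W.T + b

def affine_backward_sketch(ctx, grad_output):
    X, W, b = ctx.saved_tensors
    grad_X = 0.5 * grad_output @ W        # (N, d_in)
    grad_W = 0.5 * grad_output.T @ X      # (d_out, d_in)
    grad_b = grad_output.sum(dim=0)       # (d_out,), b is broadcast over the batch
    return grad_X, grad_W, grad_b

class AffineSketch(Function):
    @staticmethod
    def forward(ctx, X, W, b):
        ctx.save_for_backward(X, W, b)
        return affine(X, W, b)

    @staticmethod
    def backward(ctx, grad_output):
        return affine_backward_sketch(ctx, grad_output)

# Compare against autograd on the plain function.
torch.manual_seed(0)
Xs = torch.rand(100, 11, dtype=torch.float64)
Ws = torch.rand(7, 11, requires_grad=True, dtype=torch.float64)
bs = torch.rand(7, requires_grad=True, dtype=torch.float64)

torch.sum(AffineSketch.apply(Xs, Ws, bs)).backward()
W_custom, b_custom = Ws.grad.clone(), bs.grad.clone()

Ws.grad = bs.grad = None
torch.sum(affine(Xs, Ws, bs)).backward()
print(torch.allclose(W_custom, Ws.grad), torch.allclose(b_custom, bs.grad))
```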

### Sequence models

1. Regarding word embeddings:
    1. Explain this term and why it's used in the context of a language model.
    2. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

<font color = 'red'>

**Answer:**

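1. A word embedding is a way to represent a word as a dense vector which captures its semantic properties.
With this representation we can treat a word as a point in a space and calculate the distance
between words, so that words with similar meaning are close to each other. A language model uses it to turn discrete tokens into a meaningful numeric input.

2. Yes, but it might not be the best idea: without an embedding each token is represented as a one-hot vector, so all the word representations are orthogonal to each
other and the model cannot use the distance between words.
Also, the input dimension then equals the vocabulary size, so the model can become humongous.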

2. Considering the following snippet, explain:
    1. What does `Y` contain? why this output shape?
    2. How you would implement `nn.Embedding` yourself using only torch tensors.

In [None]:
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")

<font color = 'red'>

**Answer:**

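1. `Y` contains the embedding vector of every token index in `X`.
Its shape is `(5, 6, 7, 8, 42000)`: the first 4 dimensions are the same as in `X`, and the last one is the embedding dimension, 42000.
This is because each of the $5\cdot 6\cdot 7\cdot 8 = 1680$ tokens in `X` (each one an index in the range 0 to 41) is mapped independently to its own 42000-dimensional vector.

2. We can implement it using a weight matrix of shape `(42, 42000)`, i.e. a linear layer without bias with a 42-dimensional input and a 42000-dimensional output.
Each token is represented as a vector of 42 zeros except for a single one at the token's index (one-hot), and multiplying it by the weight matrix selects the matching row. Equivalently, we can directly index the rows of the weight matrix with `X`.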
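
Both views of an embedding (row lookup and one-hot multiplication) can be written with plain tensors. The sketch below compares them to `nn.Embedding`; it uses an embedding dimension of 16 instead of 42000 purely to keep the example small.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_embeddings, embedding_dim = 42, 16    # 16 instead of 42000, for a small demo

X = torch.randint(low=0, high=num_embeddings, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings, embedding_dim)
weight = embedding.weight.detach()        # (42, 16) lookup table

Y_ref = embedding(X).detach()
Y_lookup = weight[X]                                               # index rows by token id
Y_onehot = F.one_hot(X, num_embeddings).to(weight.dtype) @ weight  # one-hot times table

print(Y_ref.shape)
```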

3. Regarding truncated backpropagation through time (TBPTT) with a sequence length of $S$: State whether the following sentences are **true or false**, and explain.
    1. TBPTT uses a modified version of the backpropagation algorithm.
    2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length $S$.
    3. TBPTT allows the model to learn relations between input that are at most $S$ timesteps apart.

<font color = 'red'>

**Answer:**

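1. True, in the sense that TBPTT only propagates derivatives through the last $S$ timesteps.
The backpropagation computation itself (the chain rule) is unchanged; it is just applied to a truncated computation graph.

2. False. Limiting the sequence length alone is not enough: the hidden state must also be carried over from one chunk to the next,
while the backpropagation of gradients is limited to the timesteps of the current chunk.

3. True with respect to gradients: TBPTT takes derivatives only through the last $S$ timesteps,
so the model is trained directly only on relations within these $S$ timesteps. Earlier inputs can still affect it, but only through the hidden state that is carried forward.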
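
A common way to implement TBPTT in PyTorch is to process the sequence in chunks of $S$ timesteps and detach the hidden state between chunks. The sketch below is illustrative only: the sizes, the random data and the placeholder loss are assumptions, not taken from the tutorials.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S, T, B, F_in, H = 4, 12, 2, 3, 5           # illustrative sizes; T is the full length
rnn = nn.RNN(F_in, H, batch_first=True)
opt = torch.optim.SGD(rnn.parameters(), lr=0.01)
seq = torch.randn(B, T, F_in)

h = None
for chunk in seq.split(S, dim=1):           # chunks of S timesteps
    out, h = rnn(chunk, h)                  # hidden state is carried across chunks
    loss = out.pow(2).mean()                # placeholder loss
    opt.zero_grad()
    loss.backward()                         # regular backprop, limited to this chunk
    opt.step()
    h = h.detach()                          # cut the graph: no gradient into earlier chunks
```

Without the `detach()` call, the second call to `backward()` would try to propagate through the previous chunk's graph, which has already been freed.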

### Attention

1. In tutorial 7 (part 2) we learned how to use attention to perform alignment between a source and target sequence in machine translation.
    1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
    2. After learning that self-attention is gaining popularity thanks to the shiny new transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections). What influence do you expect this will have on the learned hidden states?

<font color = 'red'>

**Answer:**

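1. Without attention, the decoder uses only the last hidden state of the encoder, so that single state has to summarize the entire source sequence.
With attention, the decoder searches all the encoder hidden states for the most suitable ones for generating the next word, so each encoder hidden state can focus on representing its own word in context,
and the decoder's hidden states get feedback through the attention about how well they match the relevant source words.

2. The attention will no longer depend on the hidden state of the decoder, so the context will be computed only from the words of the source sentence.
As a result, the hidden states will learn relations between words within the source sentence, giving similar representations to words with similar
meaning, instead of learning an alignment between the source and the target.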


### Unsupervised learning

1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and  a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term.
What would be the qualitative effect of this on:

    1. Images reconstructed by the model during training ($x\to z \to x'$)?
    2. Images generated by the model ($z \to x'$)?

<font color = 'red'>

**Answer:**

1. The reconstructions will be good: the model will overfit the input data, so the input and output will
be similar. Because the loss is comprised only of the reconstruction term, the model behaves like a regular autoencoder, without the
KL-divergence term that measures the difference between the latent distribution and the prior.

2. The opposite will happen here: the generated images will be poor and unrelated to the training data.
Without the KL-divergence term, nothing pushes the latent distribution produced by the encoder toward the prior that we sample $z$ from,
so the decoder never learns to decode samples from that specific distribution.

2. Regarding VAEs, state whether each of the following statements is **true or false**, and explain:
    1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
    2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
    3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.

<font color = 'red'>

**Answer:**

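1. False. For a specific input, the encoder produces the parameters of a Gaussian distribution, $\mathcal{N}(\mu(x), \Sigma(x))$, which is
not necessarily the standard normal distribution. The KL-divergence term only pushes it toward $\mathcal{N}(\vec{0},\vec{I})$.

2. False. We will get a different reconstruction each time, because the latent vector is sampled from
a probability distribution rather than computed deterministically.

3. True. The VAE loss we minimize is an upper bound on the intractable loss (the negative log-likelihood of the data), so by minimizing the upper bound of the
VAE loss term we also push the real loss down.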

3. Regarding GANs, state whether each of the following statements is **true or false**, and explain:
    1. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
    2. It's crucial to backpropagate into the generator when training the discriminator.
    3. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
    4. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.
    5. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.

<font color = 'red'>

**Answer:**

1. False. We want both of them to have a low loss, since they should improve each other.
If the discriminator's loss is high, its feedback is poor, which will probably cause the generator to have a higher loss
as well.

2. False. When training the discriminator, the generated images are used only as input data, so there is no need for gradients to flow into the generator.
The generator and the discriminator are updated in separate steps.

3. True. We can basically use any probability distribution we want as the latent prior. We only need the model to always use
this same distribution, so that it learns to minimize the distance between the real image data set distribution and
its own.

4. True. By doing so the discriminator becomes better than the generator, but not by too much.
Because of that the generator receives meaningful feedback from the discriminator and yields better results.

5. False. The generator and discriminator should work against each other; this way they both improve
and get better results. If only the generator is trained further, it will not know how to get better, as a
discriminator with 50% accuracy can't give the generator correct and accurate feedback.

### Graph Neural Networks

1. You have implemented a graph convolutional layer based on the following formula, for a graph with $N$ nodes:
$$
\mat{Y}=\varphi\left( \sum_{k=1}^{q} \mat{\Delta}^k \mat{X} \mat{\alpha}_k + \vec{b} \right).
$$
    1. Assuming $\mat{X}$ is the input feature matrix of shape $(N, M)$: what does $\mat{Y}$ contain in its rows?
    2. Unfortunately, due to a bug in your calculation of the Laplacian matrix, you accidentally zeroed the row and column $i=j=5$ (assume more than 5 nodes in the graph).
What would be the effect of this bug on the output of your layer, $\mat{Y}$?

<font color = 'red'>

**Answer:**

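1. Each row of $\mat{Y}$ contains the output (embedded) feature vector of one node, so $\mat{Y}$ has $N$ rows.

2. Row 5 of $\mat{\Delta}^k\mat{X}$ will be zero for every $k$, so row 5 of $\mat{Y}$ will be the constant $\varphi(\vec{b})$ regardless of the input: node 5 receives no information from any node.
Since column 5 is zero as well, no other node in the graph will receive information from the features of node 5.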
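
The effect of zeroing a row and column of $\mat{\Delta}$ can be checked numerically. The sketch below uses a random symmetric matrix as a stand-in for the Laplacian, ReLU as $\varphi$, $q=2$, and index 5 in 0-based indexing; all of these are assumptions made only for illustration.

```python
import torch

torch.manual_seed(0)
N, M, F_out, q, i = 8, 3, 4, 2, 5          # illustrative sizes; i is the zeroed node
dt = torch.float64

A = torch.rand(N, N, dtype=dt)
L = A + A.T                                # stand-in for the Laplacian (random, symmetric)
L[i, :] = 0                                # the bug: zero row i ...
L[:, i] = 0                                # ... and column i

alphas = [torch.randn(M, F_out, dtype=dt) for _ in range(q)]
b = torch.randn(F_out, dtype=dt)

def layer(X):
    # Y = phi(sum_k L^k X alpha_k + b), with phi = ReLU and k = 1..q
    acc = sum(torch.linalg.matrix_power(L, k + 1) @ X @ alphas[k] for k in range(q))
    return torch.relu(acc + b)

X = torch.randn(N, M, dtype=dt)
Y = layer(X)

X_changed = X.clone()
X_changed[i] += 10.0                       # change only node i's input features
Y_changed = layer(X_changed)

print(torch.allclose(Y[i], torch.relu(b)))   # row i is constant: phi(b)
print(torch.allclose(Y, Y_changed))          # node i's features affect no output row
```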

2. We have discussed the notion of a Receptive Field in the context of a CNN. How would you define a similar concept in the context of a GCN (i.e. a model comprised of multiple graph convolutional layers)?

<font color = 'red'>

**Answer:**

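The receptive field of a node in a GCN is the set of nodes whose input features can affect the node's output.
When each layer aggregates information from a node's immediate neighbors, after $L$ layers it contains the nodes that are no more than $L$ hops from the node.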