$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
$$

# Part 2: Summary Questions
<a id=part2></a>

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

**Notes**

- Clearly mark where your answer begins, e.g. write "**Answer:**" in the beginning of your cell.
- Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
- This notebook should be runnable from start to end without any errors.

### CNNs

1. Explain the meaning of the term "receptive field" in the context of CNNs.

**Answer:** The receptive field of a unit is the region of the input (the set of input pixels) that can influence the value of that unit, e.g. a single "pixel" in an output feature map.

2. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

**Answer:**
a. Kernel size - a larger kernel increases the receptive field, because more neighboring inputs influence each output. All the inputs inside the window are combined densely, at the cost of more parameters.

b. Stride - the inputs of a convolution in a hidden layer are the outputs of the convolutions in the previous layer. With a larger stride these outputs are centered further apart in the original input, so every later layer covers a larger region and the receptive field grows faster. Features are combined from a subsampled grid, so spatial resolution is lost.

c. Dilation - this enlarges the extent of the kernel without using all the values inside it, as we would when actually increasing the kernel size. For example, k=3 with d=2 has the same receptive field as k=5, but uses only 9 of the 25 inputs. Features are combined sparsely, with no extra parameters.

We saw that there is a closed-form formula for the receptive field that includes all of these parameters.

3. Imagine a CNN with three convolutional layers, defined as follows:

In [1]:
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape

torch.Size([1, 32, 122, 122])

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

**Answer:**
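Each layer with kernel size $k_i$, stride $s_i$ and dilation $d_i$ grows the receptive field by $d_i(k_i-1)$ steps of the current spacing between units, which is the product of all previous strides:

$$RF_{i} = RF_{i-1} + d_i\,(k_i - 1)\prod_{j<i} s_j, \qquad RF_0 = 1$$

Treating each max-pooling layer as $k=2$, $s=2$, the receptive field grows as $1 \to 3 \to 4 \to 12 \to 16 \to 112$, so each output "pixel" sees a $112\times 112$ region of the input.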
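
The receptive field of the network in this question can be checked with a short sketch. The helper `receptive_field` and the `(kernel_size, stride, dilation)` tuples are illustrative names, not part of the notebook. The sketch assumes that each layer grows the field by `dilation * (kernel_size - 1)` times the product of all earlier strides, and that each max-pooling layer behaves like a kernel of size 2 with stride 2.

```python
def receptive_field(layers):
    """Return the receptive field (along one spatial axis) of one output unit.

    `layers` is a list of (kernel_size, stride, dilation) tuples, ordered from
    input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for k, s, d in layers:
        rf += d * (k - 1) * jump
        jump *= s
    return rf


# (kernel_size, stride, dilation) for each conv / pooling layer of the cnn
layers = [
    (3, 1, 1),  # Conv2d(kernel_size=3)
    (2, 2, 1),  # MaxPool2d(2)
    (5, 2, 1),  # Conv2d(kernel_size=5, stride=2)
    (2, 2, 1),  # MaxPool2d(2)
    (7, 1, 2),  # Conv2d(kernel_size=7, dilation=2)
]
rf = receptive_field(layers)
print(rf)  # 112
```

Padding and the ReLU layers are left out because they do not change the extent of the receptive field.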

4. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

  After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead: $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

  However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

**Answer:**
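This happens because a residual block adds its input to the layer output, unlike a regular CNN layer. The filters therefore only need to learn the residual $\vec{y}_l-\vec{x}$ rather than the full mapping, so they represent a different function and look different.
In other words, the residual filters encode only the difference between the desired output and $\vec{x}$, which need not resemble the filters that produce the full output.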


### Dropout

1. Consider the following neural network:

In [2]:
import torch.nn as nn

p1, p2 = 0.1, 0.2
nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout(p=p1),
    nn.Dropout(p=p2),
)

Sequential(
  (0): Conv2d(3, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): Dropout(p=0.1, inplace=False)
  (3): Dropout(p=0.2, inplace=False)
)

If we want to replace the two consecutive dropout layers with a single one defined as follows:
```python
nn.Dropout(p=q)
```
what would the value of `q` need to be? Write an expression for `q` in terms of `p1` and `p2`.

**Answer:**
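An input survives only if it is not dropped by either layer, which happens with probability $(1-p_1)(1-p_2)$.
So `q = 1 - (1 - p1) * (1 - p2)`. The scaling also matches, since $\frac{1}{1-p_1}\cdot\frac{1}{1-p_2}=\frac{1}{1-q}$.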
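
As a rough empirical check, the sketch below pushes a large tensor of ones through two consecutive dropout layers and compares the observed drop rate with `q`. The tensor size and the seed are arbitrary choices made for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p1, p2 = 0.1, 0.2
q = 1 - (1 - p1) * (1 - p2)

two_layers = nn.Sequential(nn.Dropout(p=p1), nn.Dropout(p=p2))
two_layers.train()  # dropout is only active in training mode

x = torch.ones(1_000_000)
y = two_layers(x)

frac_dropped = (y == 0).float().mean().item()
kept_value = y.max().item()  # survivors are scaled by 1 / ((1 - p1) * (1 - p2))

print(f"q={q:.2f}, empirical drop rate={frac_dropped:.3f}")
print(f"kept value={kept_value:.4f}, 1/(1-q)={1 / (1 - q):.4f}")
```

The empirical drop rate should be close to 0.28, and every surviving element carries the same scale that a single `nn.Dropout(p=q)` would apply.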
    

2. **True or false**: dropout must be placed only after the activation function.

**Answer:**
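False. With ReLU the result is the same either way, because $\text{ReLU}(0)=0$ and scaling by the positive factor $1/(1-p)$ commutes with ReLU; placing dropout before the activation may save some unnecessary computations. For activations where $\varphi(0)\neq 0$ (e.g. sigmoid) the two placements give different outputs, but both are valid, so dropout does not have to come after the activation.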

3. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

**Answer:**
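Let $D = m\cdot x$ be the output of the dropout layer before scaling, where $m\sim\text{Bernoulli}(1-p)$ is the random mask: $m=0$ (the activation is dropped) with probability $p$ and $m=1$ with probability $1-p$. Then

$$\mathbb{E}\left[\frac{D}{1-p}\right]
	=\frac{1}{1-p}\left( p\cdot 0 + \left(1-p\right)\cdot x \right)
	=\frac{1-p}{1-p}\cdot x
	=x$$

Without the scaling we would get $\mathbb{E}\left[D\right]=(1-p)\,x\neq x$, so the factor $1/(1-p)$ is required.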

### Losses and Activation functions

1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?

**Answer:**
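No. We saw that L2 is used mainly in regression problems, where we want to measure how close the prediction is to a continuous label. In classification the model outputs a probability, and L2 penalizes confident mistakes only weakly: the loss is bounded by 1 no matter how wrong the prediction is.

For example, take a hotdog image (label 1) for which the model predicts $\hat{y}=0.01$. The L2 loss is $(1-0.01)^2\approx 0.98$, while the binary cross-entropy is $-\log(0.01)\approx 4.6$ and grows without bound as $\hat{y}\to 0$. Moreover, if we threshold the output first ($\hat{y}=0.6>0.5 \Rightarrow y_{pred}=1$), the L2 loss is 0 and tells us nothing about the uncertainty of the model. We would therefore use cross-entropy (equivalently, minimize the Kullback-Leibler divergence), which works directly on the predicted probabilities.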
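
To make the comparison concrete, the sketch below evaluates both losses for one confidently wrong prediction. The label 1 and the predicted probability 0.01 are values chosen for this illustration only.

```python
import torch
import torch.nn.functional as F

y_true = torch.tensor([1.0])   # hotdog
y_pred = torch.tensor([0.01])  # confident but wrong probability

l2 = F.mse_loss(y_pred, y_true).item()
bce = F.binary_cross_entropy(y_pred, y_true).item()

print(f"L2 loss:  {l2:.3f}")   # about 0.980
print(f"BCE loss: {bce:.3f}")  # about 4.605
```

The L2 loss can never exceed 1 for outputs in $[0, 1]$, while the cross-entropy keeps growing as the prediction approaches the wrong extreme.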

2. After months of research into the origins of climate change, you observe the following result:

<center><img src="https://sparrowism.soc.srcf.net/home/piratesarecool4.gif" /></center>

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in `N` locations around the globe.
You define your model as follows:

In [3]:
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H), nn.Sigmoid(),
    ]*24,
    nn.Linear(in_features=H, out_features=1),
)

While training your model you notice that the loss reaches a plateau after only a few iterations.
It seems that your model is no longer training.
What is the most likely cause?

**Answer:**
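The most likely cause is vanishing gradients: the network is very deep (25 sigmoid layers), and as we learned in the course, backpropagation multiplies many derivatives together (chain rule). The derivative of the sigmoid is at most 0.25, so the product shrinks towards zero and the early layers stop updating. (With large derivatives the same mechanism would cause exploding gradients instead.)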
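
One way to see the effect is to compare gradient norms in the first and last linear layers of a network of the same shape. This is a sketch under a few assumptions: the layers are built in a loop so that each one has its own weights, the input batch is random, and the mean of the output stands in for a real loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, H, depth = 42, 128, 24

# Same shape as mlpirate, but every hidden layer is a separate module
layers = [nn.Linear(N, H), nn.Sigmoid()]
for _ in range(depth):
    layers += [nn.Linear(H, H), nn.Sigmoid()]
layers.append(nn.Linear(H, 1))
model = nn.Sequential(*layers)

x = torch.rand(16, N)        # random stand-in for pirate counts
model(x).mean().backward()   # the mean output stands in for a loss

first_grad = model[0].weight.grad.norm().item()
last_grad = model[-1].weight.grad.norm().item()
print(f"first layer grad norm: {first_grad:.3e}")
print(f"last layer grad norm:  {last_grad:.3e}")
```

The gradient reaching the first layer is many orders of magnitude smaller than the one at the last layer, so the early layers barely move.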

3. Referring to question 2 above: A friend suggests that if you replace the `sigmoid` activations with `tanh`, it will solve your problem. Is he correct? Explain why or why not.

**Answer:**
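He is wrong. The derivative of tanh behaves similarly to that of the sigmoid: both tend to zero as the input goes to $\pm\infty$ and both are bounded (by 1 for tanh, by 0.25 for sigmoid). tanh attenuates the gradient less around zero, but once units saturate the gradient still vanishes, so in such a deep network we expect a similar result.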
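
The two derivatives can be compared directly with autograd. The evaluation points 0 and 5 are arbitrary choices, picked to show the maximum of each derivative and a saturated input.

```python
import torch

x = torch.tensor([0.0, 5.0], requires_grad=True)

# Summing lets us read the elementwise derivatives from a single gradient call
(sig_grad,) = torch.autograd.grad(torch.sigmoid(x).sum(), x)
(tanh_grad,) = torch.autograd.grad(torch.tanh(x).sum(), x)

print("sigmoid'(0), sigmoid'(5):", sig_grad.tolist())
print("tanh'(0),    tanh'(5):   ", tanh_grad.tolist())
```

At 0 the derivatives are 0.25 and 1 respectively, but at 5 both are already far below 0.01, which is the saturation that drives the vanishing gradient.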

4. Regarding the ReLU activation, state whether the following sentences are **true or false** and explain:
      1. In a model using exclusively ReLU activations, there can be no vanishing gradients.
      1. The gradient of ReLU is linear with its input when the input is positive.
      1. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.

**Answer:**

A - False. The derivative of ReLU is either 1 or 0, so an active unit passes the gradient through unchanged, which makes vanishing gradients much less likely than with sigmoid. However, the gradient is still multiplied by the weights of every layer, and every unit with a non-positive input contributes a zero, so gradients can still vanish.

B - False. For a positive input $\text{ReLU}(x)=x$, so the gradient is the constant 1; it is the output, not the gradient, that is linear in the input. From the chain rule, for positive $f(x)$:

$$\frac{\partial}{\partial x}\left(\text{ReLU}\left(f\left(x\right)\right)\right)	=\frac{\partial}{\partial x}f\left(x\right)$$

so, for example, with $f(x)=x^4$ the input to ReLU is positive and the derivative $4x^3$ is not linear in $x$.

C - True. If the input to a unit is smaller than or equal to 0 for all samples, its output is zero, and it stays zero because the derivative is also zero and no update happens. This is one of the reasons to use leaky ReLU.
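
The piecewise-constant gradient of ReLU is easy to confirm with autograd. The input values below are arbitrary and avoid 0, where the derivative is not defined mathematically.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.5, 3.0], requires_grad=True)
torch.relu(x).sum().backward()

print(x.grad)  # tensor([0., 0., 1., 1.])
```

Negative inputs receive no gradient at all, which is how a unit can become "dead", while positive inputs receive a gradient of exactly 1 regardless of their magnitude.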

### Optimization

1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

**Answer:**
In GD every update uses the gradient of the loss over the entire dataset. In SGD every update uses the gradient of a single randomly chosen sample, and in mini-batch SGD it uses the average gradient over a small random batch of samples.
GD gives the most accurate gradient direction (though not necessarily the best result), followed by mini-batch SGD and then SGD. The cost in computation and time per update follows the same order, with GD the most expensive.
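
The three variants differ only in which samples enter each gradient computation. The sketch below shows this on a toy linear regression problem; the data, the helper `grad_mse` and the batch size of 32 are assumptions made for the illustration.

```python
import torch

torch.manual_seed(0)
N, d = 256, 3
X = torch.rand(N, d)
y = X @ torch.tensor([1.0, -2.0, 0.5])


def grad_mse(w, idx):
    """Gradient of the mean squared error over the samples selected by idx."""
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)


w = torch.zeros(d)

g_gd = grad_mse(w, torch.arange(N))             # GD: the whole dataset
g_sgd = grad_mse(w, torch.randint(0, N, (1,)))  # SGD: one random sample
g_mb = grad_mse(w, torch.randperm(N)[:32])      # mini-batch: 32 random samples

lr = 0.1
w_next = w - lr * g_mb  # the update rule is the same in all three cases
print(g_gd, g_sgd, g_mb, sep="\n")
```

The single-sample and mini-batch gradients are noisy estimates of the full gradient, and the mini-batch estimate is usually the closer of the two.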

2. Regarding SGD and GD:
  1. Provide at least two reasons for why SGD is used more often in practice compared to GD.
  2. In what cases can GD not be used at all?

**Answer:**
1. 

    a. Computational power, memory and time: each SGD update is much cheaper.
    
    b. GD can get stuck in a local minimum, because it always follows the exact gradient over all the samples, whereas the noise in SGD lets it move in different directions and escape towards a better minimum.
    
2. 
    a. When the dataset is too large to process in a single update, since our resources will not be sufficient.
    
    b. Large derivatives (exploding gradients) make GD unstable, but this also affects SGD and does not by itself rule GD out.
    
    c. A non-convex loss makes GD more likely to get stuck, but again it can still be used.


3. You have trained a deep resnet to obtain SoTA results on ImageNet.
While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average.
Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM.
You're now considering to increase the mini-batch size from $B$ to $2B$.
Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

**Answer:**

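We expect the number of iterations to decrease: with a batch of size $2B$ the gradient estimate in every iteration has lower variance, so each update is more accurate and we reach $l_0$ in fewer steps. Each iteration processes twice as many samples, though, so the total computation does not necessarily decrease.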

4. For each of the following statements, state whether they're **true or false** and explain why.
  1. When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.
  1. Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.
  1. SGD is less likely to get stuck in local minima, compared to GD.
  1. Training  with SGD requires more memory than with GD.
  1. Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.
  1. Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.

**Answer:**

A. True. In one epoch we go over the whole dataset, and in SGD every step is an update according to a single random sample.

B. False. SGD gradients have high variance, because every update may go in a different direction (stochastic), whereas GD has zero variance (deterministic). GD also converges in fewer iterations (though not necessarily in less time).

C. True. As mentioned before, the noise in the SGD updates can help it escape local minima.

D. False. SGD uses only one sample per step while GD uses all of them, so GD needs more memory (over a whole epoch both methods go over all the samples).

E. False. SGD is definitely not guaranteed to converge to a local minimum, and it may even have a better chance of reaching the global minimum. Similarly, GD is not guaranteed to reach the global minimum: on a non-convex loss it can also converge to a local minimum.

F. False. Newton's method uses the curvature information to scale its step, so it will probably converge faster. In some cases SGD with momentum might converge faster, depending on the momentum and the sample order, but a more likely scenario is that SGD overshoots the narrow ravine (for a large enough momentum).
    
    

5. In tutorial 5 we saw an example of bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
  **True or false**: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc).
  Provide a mathematical justification for your answer.

**Answer:**

False. We don't have to use a descent-based algorithm to solve the inner problem: some optimization problems have a closed-form solution, as we saw in the course, e.g. linear least squares.
In addition, we saw that the backward pass can be computed in two steps: find δu and then find δz. Both are evaluated at the solution of the inner problem, regardless of how that solution was obtained.
In such cases we can solve the problem deterministically. Inverting a large matrix may take some time, but it is still possible.
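
As a small illustration, the sketch below embeds a ridge-regression problem as a layer and solves it in closed form, with no descent iterations, while gradients still flow to the outer parameters. The ridge problem, the names `A`, `theta` and `lam`, and the outer loss are all assumptions chosen for this example, not the setup from the tutorial.

```python
import torch

torch.manual_seed(0)
A = torch.rand(20, 4, dtype=torch.float64)
theta = torch.rand(20, dtype=torch.float64, requires_grad=True)  # outer parameters
lam = 0.1

# Inner problem: z* = argmin_z ||A z - theta||^2 + lam * ||z||^2,
# solved in closed form through the normal equations.
lhs = A.T @ A + lam * torch.eye(4, dtype=torch.float64)
z_star = torch.linalg.solve(lhs, A.T @ theta)

# The outer loss depends on the inner solution; autograd differentiates
# through the linear solve, so the outer parameters get a gradient.
outer_loss = (z_star ** 2).sum()
outer_loss.backward()

print(theta.grad.shape)  # torch.Size([20])
```

Since $z^*=M\theta$ with $M=(A^\top A+\lambda I)^{-1}A^\top$, the gradient of this outer loss is $2M^\top z^*$, which depends only on the inner solution and not on how it was computed.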



6. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$.
  Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
  1. Explain the concepts of "vanishing gradients", and "exploding gradients".
  2. How can each of these problems be caused by increased depth?
  3. Provide a numerical example demonstrating each.
  4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

**Answer:**

A + B. As mentioned before, both concepts result from the same cause: the chain rule. In a deep network backpropagation multiplies many derivatives together, so if many of them are small (magnitude below 1) the product shrinks exponentially with depth and we get "vanishing gradients", and if many are large (magnitude above 1) it grows exponentially and we get "exploding gradients".

C. The smallest normal float32 value is around $2^{-126}$, so in a network with 26 hidden layers where every layer's derivative is about $2^{-5}$ the product is $2^{-130}$, which is already below the normal range, and after a few more layers it underflows to exactly 0 (other examples can be built according to the architecture).
In the same manner, the largest float32 value is around $10^{38}$, so in a network of depth 20 with a derivative of 100 per layer the product is $10^{40}$ and we get an overflow.

D. We saw that with vanishing gradients the loss curve reaches a plateau (the network stops learning), while with exploding gradients the loss curve is unstable and may diverge.
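
Both numerical effects can be reproduced in float32 by multiplying per-layer derivatives one at a time, as the chain rule does. The depth of 31 for the vanishing case is chosen here so that the product falls below even the smallest subnormal float32 value (about $2^{-149}$); the per-layer derivatives are illustrative constants.

```python
import torch

small = torch.tensor(2.0 ** -5, dtype=torch.float32)
large = torch.tensor(100.0, dtype=torch.float32)

vanishing = torch.tensor(1.0, dtype=torch.float32)
for _ in range(31):  # 2^(-155) is below the smallest float32 subnormal
    vanishing = vanishing * small

exploding = torch.tensor(1.0, dtype=torch.float32)
for _ in range(20):  # 100^20 = 1e40 exceeds the float32 maximum (~3.4e38)
    exploding = exploding * large

print(vanishing.item(), exploding.item())  # 0.0 inf
```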

### Backpropagation

1. You wish to train the following 2-layer MLP for a binary classification task:
  $$
  \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2
  $$
  You wish to minimize the in-sample loss function, defined as
  $$
  L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right)
  $$
  Where the pointwise loss is binary cross-entropy:
  $$
  \ell(y, \hat{y}) =  - y \log(\hat{y}) - (1-y) \log(1-\hat{y})
  $$
  
  Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.

**Answer:**


<img src="imgs/back.PNG" height="200">

2. The derivative of a function $f(\vec{x})$ at a point $\vec{x}_0$ is
  $$
  f'(\vec{x}_0)=\lim_{\Delta\vec{x}\to 0} \frac{f(\vec{x}_0+\Delta\vec{x})-f(\vec{x}_0)}{\Delta\vec{x}}
  $$
  
  1. Explain how this formula can be used in order to compute gradients of neural network parameters numerically, without automatic differentiation (AD).
  
  2. What are the drawbacks of this approach? List at least two drawbacks compared to AD.

**Answer:**

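A. For every parameter $\theta_j$ we perturb only that parameter by a small $\Delta$ and approximate $\frac{\partial L}{\partial \theta_j}\approx\frac{L(\theta+\Delta e_j)-L(\theta)}{\Delta}$, which requires no backward pass. To choose $\Delta$ we can take a decreasing series of deltas $d_i > d_{i+1}$ until two consecutive estimates agree:

${f(x_0;d_i)\approx f(x_0;d_{i+1})}$, and then return ${f(x_0;d_{i+1})}$, where $f(x_0;d)$ denotes the finite-difference estimate at $x_0$ with step $d$.

B. 
    1. It is an approximation and not the exact derivative, and for a function with high curvature around ${x_0}$ we will need smaller deltas, which in turn suffer from floating-point round-off.
    2. It is much more expensive: every parameter needs at least one extra forward pass (and more when testing a series of deltas, serially or by binary search), whereas AD obtains all the gradients in a single backward pass.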
    
    

3. Given the following code snippet:
  1. Write a short snippet that calculates the gradient of `loss` w.r.t. `W` and `b` using the approach of numerical gradients from the previous question.
  2. Calculate the same derivatives with autograd.
  3. Show, by calling `torch.allclose()` that your numerical gradient is close to autograd's gradient.

In [16]:
import torch

N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W, b = torch.rand(d, d, requires_grad=True, dtype=dtype), torch.rand(d, requires_grad=True, dtype=dtype)

def foo(W, b):
    return torch.mean(X @ W + b)

loss = foo(W, b)
print(f"{loss=}")

# TODO: Calculate gradients numerically for W and b
grad_W = torch.zeros_like(W)
grad_b = torch.zeros_like(b)
# Detached copies, so the in-place perturbations are not tracked by autograd
Wt = W.detach().clone()
bt = b.detach().clone()
loss0 = loss.detach()

def calc_fx_w(di, dj, delta):
    # Forward-difference estimate of d(loss)/dW[di, dj] with step delta
    Wt[di, dj] += delta
    fi = (foo(Wt, bt) - loss0) / delta
    Wt[di, dj] -= delta
    return fi

def calc_fx_b(di, delta):
    # Forward-difference estimate of d(loss)/db[di] with step delta
    bt[di] += delta
    fi = (foo(Wt, bt) - loss0) / delta
    bt[di] -= delta
    return fi

delta = 1
diff = 10**-8

for di in range(d):
    for dj in range(d):
        # Halve the step until two consecutive estimates agree to within diff
        dl   = delta / 2
        fi   = calc_fx_w(di, dj, delta)
        fi_1 = calc_fx_w(di, dj, dl)
        while torch.abs(fi - fi_1) > diff:
            fi   = fi_1
            dl   = dl / 2
            fi_1 = calc_fx_w(di, dj, dl)
        grad_W[di, dj] = fi_1

    dl   = delta / 2
    fi   = calc_fx_b(di, delta)
    fi_1 = calc_fx_b(di, dl)
    while torch.abs(fi - fi_1) > diff:
        fi   = fi_1
        dl   = dl / 2
        fi_1 = calc_fx_b(di, dl)
    grad_b[di] = fi_1

# TODO: Compare with autograd using torch.allclose()
# A single call for both inputs, since the graph is freed after the first call
autograd_W, autograd_b = torch.autograd.grad(loss, (W, b))
assert torch.allclose(grad_W, autograd_W)
assert torch.allclose(grad_b, autograd_b)


loss=tensor(1.7363, dtype=torch.float64, grad_fn=<MeanBackward0>)


### Sequence models

1. Regarding word embeddings:
  1. Explain this term and why it's used in the context of a language model.
  1. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

**Answer:**

A. A word embedding is a mapping that takes a word (token) as input and outputs a dense vector of numbers, a kind of "word features". We want words with similar meanings to have similar representations, which lets the language model generalize across related words.

B. Yes, we can, but then we need some fixed representation for every word in the dataset, for example a binary code such as (1,1,0,1,0,0,...) for the word "abd", and train the model on it. If we encounter a new word at test time it might give poor results.
Also, the model will have a hard time identifying words with similar meanings, which can also hurt performance.

2. Considering the following snippet, explain:
  1. What does `Y` contain? why this output shape?
  2. How you would implement `nn.Embedding` yourself using only torch tensors. 

In [17]:
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")

Y.shape=torch.Size([5, 6, 7, 8, 42000])


**Answer:**

A. `Y` contains the embedding of `X`: every "word"/"letter" index (0-41) in `X` is replaced by its embedding vector of size 42,000, taken from a dictionary of size 42. This is also the reason for the shape: it is the shape of `X` with an extra last dimension of size 42,000.

B.something like:


In [30]:
import torch
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
# one_hot takes a LongTensor of index values with shape (*) and returns a tensor of shape (*, num_classes)
# that is zero everywhere except where the index of the last dimension matches the input value, where it is 1.
onehot = torch.nn.functional.one_hot(X, num_classes=42).float()  # nn.Linear needs a float input
# Without a bias, a linear layer applied to a one-hot vector just selects one learned vector per index
embedding = nn.Linear(42, 42000, bias=False)
Y = embedding(onehot)
print(f"{Y.shape=}")

Y.shape=torch.Size([5, 6, 7, 8, 42000])
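
An embedding layer can also be written with plain tensors as a row lookup into a weight matrix. The sketch below compares that lookup, and the equivalent one-hot matrix product, with `nn.Embedding` using the same weights. The embedding dimension is reduced to 16 here only to keep the example light.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_embeddings, embedding_dim = 42, 16  # small embedding_dim for the example

X = torch.randint(low=0, high=num_embeddings, size=(5, 6, 7, 8))
reference = nn.Embedding(num_embeddings, embedding_dim)
W = reference.weight.detach()  # lookup table, shape (num_embeddings, embedding_dim)

Y_index = W[X]  # direct row lookup with integer indexing
Y_onehot = F.one_hot(X, num_classes=num_embeddings).to(W.dtype) @ W

print(Y_index.shape)  # torch.Size([5, 6, 7, 8, 16])
```

Indexing avoids building the one-hot tensor at all, which is why a lookup is the cheaper of the two formulations.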


3. Regarding truncated backpropagation through time (TBPTT) with a sequence length of S: State whether the following sentences are **true or false**, and explain.
  1. TBPTT uses a modified version of the backpropagation algorithm.
  2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length S.
  3. TBPTT allows the model to learn relations between input that are at most S timesteps apart.

**Answer:**

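A. True. TBPTT is similar to BPTT, but the backward pass is limited to a window of width S: the gradient computation itself is ordinary backpropagation, applied to a truncated graph.

B. False. Limiting the sequence length is not enough: we also need to pass the hidden state from one window to the next and cut the gradient flow at the window boundary, so that the model keeps the information about the past without backpropagating through it.

C. False. At time t the gradients only go back S timesteps, but the hidden state h(t-S) still represents the earlier past, so the model can use information from inputs that are more than S timesteps apart.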
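
A minimal TBPTT training loop might look like the sketch below. The model sizes, the random data and the regression loss are assumptions made for the illustration; the relevant part is that the hidden state is carried across windows of length `S` and detached at each boundary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S, T, B, D, hidden = 5, 20, 4, 3, 8  # window length, sequence length, batch, features, state size

rnn = nn.RNN(input_size=D, hidden_size=hidden, batch_first=True)
head = nn.Linear(hidden, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.rand(B, T, D)
y = torch.rand(B, T, 1)

h = None  # None lets nn.RNN start from a zero hidden state
num_steps = 0
for t in range(0, T, S):
    x_chunk, y_chunk = x[:, t:t + S], y[:, t:t + S]
    out, h = rnn(x_chunk, h)  # the forward pass still sees the carried state
    loss = nn.functional.mse_loss(head(out), y_chunk)
    opt.zero_grad()
    loss.backward()           # ordinary backprop, but only within this window
    opt.step()
    h = h.detach()            # cut the graph so gradients stop at the boundary
    num_steps += 1

print(num_steps)  # 4
```

Without the `detach()` call the second `backward()` would try to traverse the graph of the previous window, which has already been freed.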

### Attention

1. In tutorial 7 (part 2) we learned how to use attention to perform alignment between a source and target sequence in machine translation.
  1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
  
  2. After learning that self-attention is gaining popularity thanks to the shiny new transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections). What influence do you expect this will have on the learned hidden states?


**Answer:**

A. The attention mechanism helps the decoder focus on different parts of the input sequence, using the intermediate information available in the encoder. In more detail (from the tutorial paper), attention uses the encoder outputs as key-value pairs (one per word) and the decoder hidden state as the query. The query is compared with every key (the word encoding), and the resulting similarity scores are used as weights over the values; we call the weighted sum that attention outputs the context.
These hidden states differ from those of the model without attention in that they are used as attention-weighted annotations when computing the next hidden state, so the encoder no longer has to compress the whole sentence into its final hidden state.

B. With self-attention the queries no longer come from the decoder, so the attention output no longer depends on the word currently being generated. Previously the decoder combined the context from the attention with its input word to generate the next hidden state. Now the encoder attends only to its own sequence, so it learns the meaning of the entire sentence instead of attending to a specific word, and the hidden states will be more oriented towards representing the sentence meaning.
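
The difference between the two setups is only in where the queries come from, as the sketch below shows. It assumes scaled dot-product attention and leaves out the learned projections for brevity; the tutorial model may use a different scoring function, and the tensor sizes are arbitrary.

```python
import math
import torch

torch.manual_seed(0)
B, T_src, T_tgt, D = 2, 7, 5, 16
enc_h = torch.rand(B, T_src, D)  # encoder hidden states: keys and values
dec_h = torch.rand(B, T_tgt, D)  # decoder hidden states: queries


def attention(q, k, v):
    """Scaled dot-product attention; returns (context, weights)."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights


# Encoder-decoder attention: one context vector per target position
ctx_cross, w_cross = attention(dec_h, enc_h, enc_h)
# Self-attention over the encoder states: one context vector per source position
ctx_self, w_self = attention(enc_h, enc_h, enc_h)

print(ctx_cross.shape, ctx_self.shape)
```

In the first call the weights align each target position with the source words; in the second they relate source words to each other and never see the decoder state.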

### Unsupervised learning

1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and  a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term.
What would be the qualitative effect of this on:

  1. Images reconstructed by the model during training ($x\to z \to x'$)?
  1. Images generated by the model ($z \to x'$)?

2. Regarding VAEs, state whether each of the following statements is **true or false**, and explain:
  1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
  2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
  3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.

3. Regarding GANs, state whether each of the following statements is **true or false**, and explain:
  1. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
  2. It's crucial to backpropagate into the generator when training the discriminator.
  3. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
  4. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.
  5. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.

### Graph Neural Networks

1. You have implemented a graph convolutional layer based on the following formula, for a graph with $N$ nodes:
$$
\mat{Y}=\varphi\left( \sum_{k=1}^{q} \mat{\Delta}^k \mat{X} \mat{\alpha}_k + \vec{b} \right).
$$
  1. Assuming $\mat{X}$ is the input feature matrix of shape $(N, M)$: what does $\mat{Y}$ contain in its rows?
  1. Unfortunately, due to a bug in your calculation of the Laplacian matrix, you accidentally zeroed the row and column $i=j=5$ (assume more than 5 nodes in the graph).
What would be the effect of this bug on the output of your layer, $\mat{Y}$?

2. We have discussed the notion of a Receptive Field in the context of a CNN. How would you define a similar concept in the context of a GCN (i.e. a model comprised of multiple graph convolutional layers)?