$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
$$

# Part 4: Summary Questions
<a id=part2></a>

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

**Notes**

- Clearly mark where your answer begins, e.g. write "**Answer:**" in the beginning of your cell.
- Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
- This notebook should be runnable from start to end without any errors.

### CNNs

1. Explain the meaning of the term "receptive field" in the context of CNNs.

<div class="alert alert-block alert-success">
<b><u>Answer 1:</b></u>

* The receptive field of a neuron is the region of the input space that the neuron 'looks at' or 'receives information from'. It is the portion of the input data (for example, an image) that a particular feature of the network can "see" and is influenced by.
* When a convolution is applied to the input data or to the output of a previous layer, each resulting feature depends on a specific region of the previous layer. This region is the feature's receptive field with respect to that layer, and its size is determined by the size of the convolutional filter.
* Each subsequent layer in a CNN has a larger receptive field in the original input, since it aggregates features that already cover a region of the input. This hierarchical structure enables CNNs to learn increasingly complex patterns and structures in the data.</div>

2. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

<div class="alert alert-block alert-success">
<b><u>Answer 2:</b></u>

* <b>Adjusting Kernel Size:</b> The size of the kernel used in each convolutional layer impacts the receptive field. A larger kernel size will result in a larger receptive field, allowing the network to capture more spatial information in each layer. However, a larger kernel also means more parameters and potentially a higher computational cost.
* <b>Stride</b>: Stride refers to the number of pixels that the convolutional filter moves at each step. A larger stride makes the receptive field of all subsequent layers grow faster, because neighbouring outputs correspond to input positions that are further apart. However, a larger stride might cause the network to miss some local features in the image.
* <b>Pooling:</b> Pooling is a downsampling operation that reduces the spatial dimensions of the feature maps. Max pooling and average pooling are two common types of pooling. Using a pooling layer will increase the receptive field without adding additional parameters to the model. However, pooling operations can cause some loss of information as they reduce the spatial dimensions of the feature maps.
</div>

3. Imagine a CNN with three convolutional layers, defined as follows:

In [1]:
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape

torch.Size([1, 32, 122, 122])

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

<div class="alert alert-block alert-success">
<b><u>Answer 3:</b></u>

##### The receptive field size after each layer is calculated using the formula from https://medium.com/mlreview/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks-e0f514068807

For each layer, $r_{out} = r_{in} + (k - 1) \cdot d \cdot j_{in}$ and $j_{out} = j_{in} \cdot s$, where $k$ is the kernel size, $d$ the dilation, $s$ the stride and $j$ the "jump" (the distance, in input pixels, between adjacent features). We start from $r = 1$, $j = 1$. Padding and ReLU do not change the receptive field size.

* Conv1 layer receptive field: 1 + (3 - 1) * 1 * 1 = 3.
Jump after this layer: 1 * 1 = 1.
* MaxPool1 layer receptive field: 3 + (2 - 1) * 1 * 1 = 4.
Jump after this layer: 1 * 2 = 2.
* Conv2 layer receptive field: 4 + (5 - 1) * 1 * 2 = 12.
Jump after this layer: 2 * 2 = 4.
* MaxPool2 layer receptive field: 12 + (2 - 1) * 1 * 4 = 16.
Jump after this layer: 4 * 2 = 8.
* Conv3 layer receptive field (dilation 2): 16 + (7 - 1) * 2 * 8 = 112.

##### The receptive field of each "pixel" in the output tensor is therefore 112.
##### This means that each output pixel is influenced by a 112x112 region in the input image.</div>
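
The layer-by-layer arithmetic above can be reproduced with a few lines of plain Python. This is a minimal sketch: the `layers` list and the `receptive_field` helper are illustrative names, and the tuples are transcribed by hand from the `cnn` definition in the question (padding and ReLU are left out because they do not change the receptive field size).

```python
# Each entry: (name, kernel_size, stride, dilation), in forward order.
# Values are copied from the `cnn` defined in the question.
layers = [
    ("conv1", 3, 1, 1),
    ("pool1", 2, 2, 1),
    ("conv2", 5, 2, 1),
    ("pool2", 2, 2, 1),
    ("conv3", 7, 1, 2),
]


def receptive_field(layers):
    """Return the receptive field (in input pixels) of one output element."""
    r, jump = 1, 1  # one input pixel sees itself; adjacent pixels are 1 apart
    for name, k, s, d in layers:
        r += (k - 1) * d * jump  # widen by the dilated kernel extent
        jump *= s                # stride spreads neighbouring outputs apart
        print(f"{name}: receptive field = {r}, jump = {jump}")
    return r


rf = receptive_field(layers)
print(f"final receptive field: {rf}x{rf}")
```

Swapping a stride, kernel size or dilation in the list shows directly how each of the three mechanisms changes the growth rate.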

4. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

  After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead: $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

  However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

<div class="alert alert-block alert-success">
<b><u>Answer 4:</b></u>

1. In a traditional CNN, each layer must learn the full transformation from its input to its output: if the desired mapping of the layer is, say, $h_l(\vec{x})$, then the filters $\vec{\theta}_l$ are trained so that $f_l(\vec{x};\vec{\theta}_l)\approx h_l(\vec{x})$.
2. In a Residual Network (ResNet), the layer output is $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, so the convolution only needs to learn the residual, $f_l(\vec{x};\vec{\theta}_l)\approx h_l(\vec{x})-\vec{x}$, i.e. the difference between the desired output and the input. The input $\vec{x}$ is added back via a shortcut (skip) connection, which makes identity mappings easy to represent (the filters simply go to zero), helps alleviate vanishing gradients and allows training of much deeper models.

Although the layers have the same form in both networks, they are trained towards different targets.
In the traditional CNN, the filters map the input directly to the output.
In the ResNet, the filters map the input to a correction that is added to the input.
For example, to pass its input through unchanged, a plain layer needs a delta (identity) kernel, whereas a residual layer needs an all-zero kernel. The learned filters therefore look different, since they accomplish different tasks.
</div>

### Dropout

1. Consider the following neural network:

In [2]:
import torch.nn as nn

p1, p2 = 0.1, 0.2
nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout(p=p1),
    nn.Dropout(p=p2),
)

Sequential(
  (0): Conv2d(3, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): Dropout(p=0.1, inplace=False)
  (3): Dropout(p=0.2, inplace=False)
)

If we want to replace the two consecutive dropout layers with a single one defined as follows:
```python
nn.Dropout(p=q)
```
what would the value of `q` need to be? Write an expression for `q` in terms of `p1` and `p2`.

<div class="alert alert-block alert-success">
<b><u>Answer 1:</b></u>

If we have two consecutive dropout layers with drop probabilities p1 and p2, an element is kept only if it survives both layers. Since the two layers draw their masks independently, we can treat the two drops as independent events.

For an input element:

It gets zeroed out with probability p1 in the first dropout layer.
If it survives (probability 1 - p1), it gets zeroed out with probability p2 in the second dropout layer.
So it survives both layers with probability (1 - p1) * (1 - p2), and the total probability q that it gets zeroed out by at least one of the layers is:

<b> q = 1 - (1 - p1) * (1 - p2) </b>

in our example: 1 - (1 - 0.1) * (1 - 0.2) = <b> 0.28 </b>

</div>
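
As a quick check of the expression for `q`, the following sketch also compares the rescaling factors. It assumes inverted dropout, where survivors are multiplied by `1/(1-p)` during training; the variable names other than `p1`, `p2` and `q` are illustrative.

```python
p1, p2 = 0.1, 0.2

keep = (1 - p1) * (1 - p2)  # an element must survive both layers
q = 1 - keep                # equivalent single drop probability

# Inverted dropout rescales survivors by 1/(1-p); the two scalings multiply.
scale_two_layers = (1 / (1 - p1)) * (1 / (1 - p2))
scale_one_layer = 1 / (1 - q)

print(f"q = {q:.2f}")
print(f"scale, two layers: {scale_two_layers:.4f}")
print(f"scale, one layer:  {scale_one_layer:.4f}")
```

The two scale factors agree, so a single layer with drop probability `q` matches the stacked pair in both the drop probability and the rescaling of the surviving elements.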

2. **True or false**: dropout must be placed only after the activation function.

<div class="alert alert-block alert-success">
<b><u>Answer 2:</b></u>

<b>False</b>
Dropout can be placed either before or after the activation function.
The position of dropout may affect performance, but there is no definitive rule that says it must be placed in a specific position.
Common placements are after a fully connected layer or after the activation function. For ReLU the two orders are even equivalent, since zeroing and positive scaling both commute with ReLU.
It does not have to be placed only after the activation function; its placement is an architectural choice that can be tuned to get the best results.
</div>

3. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

<div class="alert alert-block alert-success">
<b><u>Answer 3:</b></u>

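Assume we are at training time, with dropout and scaling applied. Denote:
* $x$ - the value of a single activation.
* $p$ - the probability that $x$ is dropped.
* $m \sim \mathrm{Bernoulli}(1-p)$ - the dropout mask, so that the output is $x_{train} = \frac{m}{1-p}\, x$.

The expected value of the output is:
$$ \mathbb{E}[x_{train}] = p \cdot 0 + (1 - p) \cdot \frac{x}{1 - p} = x $$

Note - the second term is the probability $1 - p$ that $x$ is kept, multiplied by the value $x$ scaled by $1/(1 - p)$.
Without the scaling, the expectation would be $(1-p)\,x \neq x$, so the factor $1/(1-p)$ is exactly what is needed to keep each activation unchanged in expectation.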

</div>
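
The expectation argument can also be checked empirically. The snippet below is a small Monte Carlo sketch; the values of `p`, `x` and `n_trials` are arbitrary choices for illustration, and `inverted_dropout` is a hypothetical helper rather than a PyTorch function.

```python
import random

random.seed(0)
p, x = 0.3, 2.0      # hypothetical drop probability and activation value
n_trials = 200_000


def inverted_dropout(x, p):
    """Zero x with probability p, otherwise scale it by 1/(1-p)."""
    return 0.0 if random.random() < p else x / (1 - p)


mean_scaled = sum(inverted_dropout(x, p) for _ in range(n_trials)) / n_trials
# The same masks without the 1/(1-p) factor give a mean smaller by (1-p).
mean_unscaled = mean_scaled * (1 - p)

print(f"mean with scaling:    {mean_scaled:.3f} (x = {x})")
print(f"mean without scaling: {mean_unscaled:.3f}")
```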

### Losses and Activation functions

1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?

<div class="alert alert-block alert-success">
<b><u>Answer 1:</b></u>
For a binary classification task, we typically output a probability from the model, representing the likelihood that the input image is a hotdog.
We can model this probability using a sigmoid as the activation function in the last layer of the model.
For a loss function, we want to penalize the model based on how far off the predicted probability is from the actual label.

L2 loss is more appropriate for regression tasks where the output is a continuous value, as it penalizes the model based on the squared difference between the predicted and actual values.
<b> We can use a loss function suitable for probabilities, like binary cross-entropy loss, which is a common choice for binary classification problems. </b>


Let's consider a case where the model makes a perfect prediction for one class but a very poor prediction for the other class:

* For a dog image (true label=0), the model predicts 0 (correctly sure it's a dog)
* For a hotdog image (true label=1), the model predicts 0.01 (almost certain it's a dog, which is incorrect)

L2 loss & Binary cross-entropy loss for these predictions:

L2 Loss - The L2 loss is calculated as the average of the squares of the differences between true and predicted values.
$$ L_2 = \frac{(0 - 0)^2 + (1 - 0.01)^2}{2} \approx 0.490 $$

Binary Cross-Entropy Loss is calculated as:
$$ \mathrm{CE}(y, \hat{y}) = - \left( y\log(\hat{y}) + (1-y)\log(1-\hat{y}) \right) $$
Note - $\log(0)$ is undefined; we use the convention $0 \cdot \log(0) = 0$, and in practice a small value is added to the prediction to avoid undefined logarithms.

* For the dog image (true label=0):
$$ \mathrm{CE} = - \left( 0\cdot\log(0) + (1-0)\cdot\log(1-0) \right) = 0  $$
* For the hotdog image (true label=1):
$$ \mathrm{CE} = - \left( 1\cdot\log(0.01) + (1-1)\cdot\log(1-0.01) \right) \approx 4.61 $$
Average cross-entropy loss = (0 + 4.61) / 2 ≈ 2.30

In this case where the model perfectly predicts one class but performs very poorly on the other, the binary cross-entropy loss is significantly higher than the L2 loss.
This property of binary cross-entropy loss makes it more suitable for training binary classification models than L2 loss, as it provides a stronger incentive for the model to correct its errors, <b> especially when it's highly confident in its incorrect predictions. </b>

</div>
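
The numbers in the example can be recomputed with a short script. This sketch also evaluates the gradient of each loss with respect to the pre-sigmoid logit for the confidently wrong sample, which is an addition to the answer: it assumes the prediction comes from a sigmoid output, and the helper names `l2` and `bce` are illustrative.

```python
import math


def l2(y, y_hat):
    return (y - y_hat) ** 2


def bce(y, y_hat, eps=1e-12):
    # Clamp the prediction so that log(0) is never evaluated
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


# (label, predicted probability) pairs from the example above
samples = [(0, 0.0), (1, 0.01)]

mean_l2 = sum(l2(y, p) for y, p in samples) / len(samples)
mean_bce = sum(bce(y, p) for y, p in samples) / len(samples)
print(f"mean L2  = {mean_l2:.4f}")
print(f"mean BCE = {mean_bce:.4f}")

# Gradients w.r.t. the logit z (y_hat = sigmoid(z)) for the hotdog sample
y, y_hat = 1, 0.01
grad_l2 = 2 * (y_hat - y) * y_hat * (1 - y_hat)  # chain rule through sigmoid
grad_bce = y_hat - y                             # sigmoid + BCE simplifies
print(f"dL2/dz  = {grad_l2:.4f}")
print(f"dBCE/dz = {grad_bce:.4f}")
```

Under this assumption the L2 gradient is damped by the saturated sigmoid exactly when the model is confidently wrong, while the cross-entropy gradient stays proportional to the error.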

2. After months of research into the origins of climate change, you observe the following result:

<center><img src="https://sparrowism.soc.srcf.net/home/piratesarecool4.gif" /></center>

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in `N` locations around the globe.
You define your model as follows:

In [3]:
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H), nn.Sigmoid(),
    ]*24,
    nn.Linear(in_features=H, out_features=1),
)

While training your model you notice that the loss reaches a plateau after only a few iterations.
It seems that your model is no longer training.
What is the most likely cause?

<div class="alert alert-block alert-success">
<b><u>Answer 2:</b></u>

The most likely cause is the vanishing gradients problem.
This problem is common in deep neural networks, especially those with many layers that use the <u>sigmoid</u> activation function, like this model.

The derivative of the sigmoid is at most 0.25 and approaches 0 as its input becomes more positive or more negative.
When a model uses sigmoid activations and also has many layers (in our case <u>25 sigmoids and 26 fully connected layers</u>), the backpropagated gradient is multiplied by one such small factor per layer, so it "vanishes" (becomes practically zero) by the time backpropagation reaches the earliest layers.
When the gradients are very small, the updates to the model's weights become very small, and the model effectively stops learning.
</div>

3. Referring to question 2 above: A friend suggests that if you replace the `sigmoid` activations with `tanh`, it will solve your problem. Is he correct? Explain why or why not.

<div class="alert alert-block alert-success">
<b><u>Answer 3: No, he isn't</u></b>

While replacing sigmoid activations with $\tanh$ might help somewhat (the derivative of $\tanh$ reaches 1 at zero, compared to at most 0.25 for sigmoid), it won't completely resolve the vanishing gradient problem.
This is because both sigmoid and $\tanh$ functions saturate and produce near-zero gradients for large positive or negative inputs, leading to insignificant weight updates during backpropagation. The main issue of vanishing gradients in deep networks is not effectively addressed by simply switching to $\tanh$. To truly alleviate this problem, we should consider using ReLU activations, normalization techniques, or architectures with skip connections such as ResNet, among other methods we have seen in this course.

</div>
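
To make the saturation argument concrete, the sketch below evaluates both derivatives at a few pre-activation values and multiplies them across the depth of `mlpirate`. The value `z = 2.0` used for every layer is a hypothetical choice, and weights are ignored, so this only illustrates the contribution of the activations.

```python
import math


def d_sigmoid(z):
    s = 1 / (1 + math.exp(-z))
    return s * (1 - s)


def d_tanh(z):
    return 1 - math.tanh(z) ** 2


for z in (0.0, 2.0, 5.0):
    print(f"z = {z}: sigmoid' = {d_sigmoid(z):.4f}, tanh' = {d_tanh(z):.4f}")

depth = 25  # number of activation layers in `mlpirate`
z = 2.0     # hypothetical pre-activation magnitude at every layer
prod_sigmoid = d_sigmoid(z) ** depth
prod_tanh = d_tanh(z) ** depth
print(f"product over {depth} layers: sigmoid {prod_sigmoid:.2e}, tanh {prod_tanh:.2e}")
```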

4. Regarding the ReLU activation, state whether the following sentences are **true or false** and explain:
    1. In a model using exclusively ReLU activations, there can be no vanishing gradients.
    1. The gradient of ReLU is linear with its input when the input is positive.
    1. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.

<div class="alert alert-block alert-success">
<b><u>Answer 4:</b></u>

1. <b> False </b>- While ReLU activations help mitigate the vanishing gradient problem, they do not entirely prevent it. The gradient of ReLU is 0 for negative inputs, so a neuron that outputs 0 passes no gradient backwards and stops updating (the "dying ReLU" issue). In addition, the backpropagated gradient is also multiplied by the weights of each layer, which can shrink it even when all ReLUs are active.
2. <b> False </b> - The gradient of ReLU is a constant 1 when the input is positive; it does not depend on the input value at all. It is the <i>output</i> of ReLU, not its gradient, that is linear in the input for positive inputs.
3. <b> True </b> - If the updated weights result in the neuron's input being negative for all samples in the training set, then the output of the neuron will be zero for all subsequent forward passes, regardless of the input. During backpropagation, since the gradient of ReLU for negative inputs is zero, the weights for this neuron are not updated. This neuron is then stuck, always outputting zero — it's effectively "dead."
</div>
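
A tiny scalar example illustrates both the constant gradient and the "dead" neuron. This is a sketch with a hypothetical one-input neuron `relu(w * x + b)`; the weight, bias and inputs are made up for illustration.

```python
def relu(z):
    return max(0.0, z)


def relu_grad(z):
    # Constant 1 for positive inputs, 0 otherwise
    return 1.0 if z > 0 else 0.0


# Hypothetical neuron y = relu(w * x + b) with a large negative bias
w, b = 0.5, -10.0
inputs = [0.1, 1.0, 3.0, 7.5]

outputs = [relu(w * x + b) for x in inputs]
# dy/dw = relu'(w*x + b) * x and dy/db = relu'(w*x + b)
grads_w = [relu_grad(w * x + b) * x for x in inputs]
grads_b = [relu_grad(w * x + b) for x in inputs]

print("outputs:", outputs)
print("dy/dw:  ", grads_w)
print("dy/db:  ", grads_b)
```

Every pre-activation is negative, so the outputs and all gradients are zero and no gradient step can move `w` or `b` for these inputs.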

### Optimization

1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

<div class="alert alert-block alert-success">
<b><u>Answer 1:</b></u>

* Regular GD - uses the entire dataset to calculate the gradient of the cost function for each iteration of the training algorithm, leading to stable but slow convergence and potentially high computational cost. For large datasets, we may run into memory issues on a GPU.
* Stochastic Gradient Descent - updates the model's weights using only a single data point at each iteration, resulting in cheap but noisy updates, which can help escape local minima but carry the risk of never settling exactly at a minimum. Since it uses only one training sample at a time, it doesn't leverage the full potential of GPUs, which are designed to perform parallel operations. Therefore, SGD might not offer a significant speedup when used on a GPU.
* Mini-batch SGD - divides the dataset into smaller batches and updates the weights after each batch, providing a balance between computational efficiency and convergence stability, often making it the method of choice in practice. Mini-batch SGD is often the most efficient method to use with a GPU. The size of the mini-batch can be adjusted to fit the GPU's memory, and the parallel computing capabilities of GPUs can be utilized to perform computations on the entire mini-batch simultaneously, which significantly speeds up the training process.

</div>
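
The three variants differ only in how many samples contribute to each update. The following sketch makes that explicit on a toy one-parameter regression problem; the dataset, learning rate and the `train` helper are illustrative assumptions and not part of the course code.

```python
import random

random.seed(0)
true_w = 3.0
xs = [random.uniform(-1, 1) for _ in range(64)]
ys = [true_w * x for x in xs]  # noise-free targets, y = 3x
data = list(zip(xs, ys))


def grad(w, batch):
    """Gradient of the mean squared error over `batch` w.r.t. w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def train(batch_size, epochs=200, lr=0.1):
    w, steps = 0.0, 0
    samples = data[:]
    for _ in range(epochs):
        random.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            w -= lr * grad(w, samples[start:start + batch_size])
            steps += 1
    return w, steps


w_gd, steps_gd = train(batch_size=len(data))  # GD: one step per epoch
w_sgd, steps_sgd = train(batch_size=1)        # SGD: one step per sample
w_mb, steps_mb = train(batch_size=16)         # mini-batch SGD

for name, w, steps in [("GD", w_gd, steps_gd), ("SGD", w_sgd, steps_sgd),
                       ("mini-batch", w_mb, steps_mb)]:
    print(f"{name:>10}: w = {w:.4f} after {steps} update steps")
```

For the same number of epochs, GD performs one update per epoch, SGD one per sample and mini-batch SGD one per batch.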

2.  Regarding SGD and GD:
  * Provide at least two reasons for why SGD is used more often in practice compared to GD.
  * In what cases can GD not be used at all?

<div class="alert alert-block alert-success">
<b><u>Answer 2:</b></u>

Reasons for SGD being used more often than GD:
* Computational Efficiency: SGD only uses a single example at each iteration, as opposed to using the entire dataset like GD. This makes SGD significantly more memory-efficient, allowing it to handle larger datasets that cannot fit into memory. Moreover, SGD provides faster convergence as updates are made after each training example, rather than waiting to loop through the entire dataset.
* Noisy Updates Can Be Beneficial: The noisy updates in SGD can actually be a benefit as it can help the model to jump out of local minima and possibly find a better (or even the global) minimum, whereas GD can get stuck in local minima because it uses the true gradient direction.
<b> Cases in which GD cannot be used at all </b> - As mentioned above, GD cannot be used when the dataset is too large to fit into memory, or when the data arrives as a stream and we need to learn from it online. In these cases, we can <u>only</u> use SGD or mini-batch SGD.
</div>

3. You have trained a deep resnet to obtain SoTA results on ImageNet.
While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average.
Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM.
You're now considering increasing the mini-batch size from $B$ to $2B$.
Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

<div class="alert alert-block alert-success">
<b><u>Answer 3:</b></u>

* When increasing the mini-batch size from B to 2B, we would expect the number of iterations required to converge to <u>decrease</u>, as each iteration now takes more data into account.
* The larger batches provide a more accurate estimate of the gradient, <u>reducing the variance of the gradient estimate</u> (less of what we call "drunk" GD) and potentially leading to faster convergence in terms of iterations.
* <b><u>However,</u></b> the number of iterations does not necessarily halve: doubling the batch size halves the variance of the gradient estimate, so its standard deviation shrinks only by a factor of $\sqrt{2}$, while each iteration now processes twice as much data.
* Additionally, while larger batches may reduce the number of iterations to convergence, they often lead to a model that <u>generalizes less effectively</u> on unseen data.
</div>
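
The effect of the batch size on the variance of the gradient estimate can be simulated directly. In this sketch the per-sample "gradients" are synthetic Gaussian numbers (an assumption made purely for illustration), and batches are drawn with replacement.

```python
import random
import statistics

random.seed(0)
# Hypothetical per-sample gradients: true gradient 1.0 plus sample noise
per_sample_grads = [random.gauss(1.0, 2.0) for _ in range(10_000)]


def batch_gradient_variance(batch_size, n_trials=4000):
    """Variance of the mini-batch mean gradient across random batches."""
    estimates = [
        statistics.fmean(random.choices(per_sample_grads, k=batch_size))
        for _ in range(n_trials)
    ]
    return statistics.variance(estimates)


B = 16
var_B = batch_gradient_variance(B)
var_2B = batch_gradient_variance(2 * B)
ratio = var_B / var_2B

print(f"variance with batch size {B}:  {var_B:.4f}")
print(f"variance with batch size {2 * B}: {var_2B:.4f}")
print(f"ratio: {ratio:.2f}")
```

The ratio comes out close to 2, i.e. the noise (standard deviation) of the estimate drops by roughly $\sqrt{2}$ when the batch size is doubled.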

4. For each of the following statements, state whether they're **true or false** and explain why.

When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.
<div class="alert alert-block alert-success">
<b>Answer 4.1: True.</b>
 In SGD, we update our model's parameters using the gradient of the loss with respect to a single data point at each step, thus we perform an optimization step for each sample in our dataset in every epoch.
</div>

Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.
<div class="alert alert-block alert-success">
<b>Answer 4.2: False</b>
Gradients obtained with SGD have more variance than those obtained with GD because they are calculated from a single data point, not the entire dataset. This additional noise can sometimes lead to slower convergence.
</div>

SGD is less likely to get stuck in local minima, compared to GD.
<div class="alert alert-block alert-success">
<b>Answer 4.3: True</b>
More noisy gradient updates can help the model to escape shallow local minima because the updates aren't always along the path of steepest descent, unlike GD
</div>

Training with SGD requires more memory than with GD.
<div class="alert alert-block alert-success">
<b>Answer 4.4: False.</b>
SGD only needs to store a single data point in memory at each step, whereas GD needs to store the entire dataset.
</div>

Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.
<div class="alert alert-block alert-success">
<b>Answer 4.5: False</b>
 Neither SGD nor GD is guaranteed to converge to the global minimum, both can get stuck in local minima. However, in practice, SGD can often escape shallow local minima due to its noisier updates.
 Converging to global minima is guaranteed only when the loss function is convex.
</div>

Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.
<div class="alert alert-block alert-success">
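<b>Answer 4.6: False (when convergence is measured in iterations)</b>
Newton's method rescales the gradient by the inverse Hessian, which compensates for the different curvature along each direction; on a quadratic ravine it reaches the minimum in a single step. Momentum does accelerate first-order methods in cases of high curvature or bad conditioning (e.g. narrow ravines), but it still needs many iterations there. The drawback of Newton's method is that computing and inverting the Hessian is resource-intensive, so each iteration is far more expensive, which is why it is rarely used for deep networks.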
</div>
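
As a sanity check on iteration counts only (per-iteration cost is not modelled), the sketch below compares Newton's method with gradient descent plus momentum on a hypothetical quadratic ravine. It uses exact gradients, a simplification of SGD, and the step size and momentum coefficient are arbitrary choices.

```python
# Ill-conditioned quadratic "ravine": f(x, y) = 0.5 * (x**2 + 100 * y**2)
H_DIAG = (1.0, 100.0)  # diagonal of the (constant) Hessian


def grad(p):
    return (H_DIAG[0] * p[0], H_DIAG[1] * p[1])


def dist(p):
    return (p[0] ** 2 + p[1] ** 2) ** 0.5


def newton_steps(p, tol=1e-6):
    steps = 0
    while dist(p) > tol:
        g = grad(p)
        # Newton update: p - H^{-1} g (H is diagonal here)
        p = (p[0] - g[0] / H_DIAG[0], p[1] - g[1] / H_DIAG[1])
        steps += 1
    return steps


def momentum_steps(p, lr=0.015, beta=0.9, tol=1e-6, max_steps=10_000):
    v, steps = (0.0, 0.0), 0
    while dist(p) > tol and steps < max_steps:
        g = grad(p)
        v = (beta * v[0] - lr * g[0], beta * v[1] - lr * g[1])
        p = (p[0] + v[0], p[1] + v[1])
        steps += 1
    return steps


start = (10.0, 1.0)
n_newton = newton_steps(start)
n_momentum = momentum_steps(start)
print(f"Newton: {n_newton} step(s), momentum: {n_momentum} steps")
```

On real networks the loss is not quadratic and the Hessian is far too large to invert, so this comparison says nothing about wall-clock time.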

5. **Bonus** (we didn't discuss this at class):  We can use bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
  **True or false**: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc).
  Provide a mathematical justification for your answer.

<div class="alert alert-block alert-success">
<b><u>Bonus Answer: False</b></u>


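The inner optimization problem doesn't necessarily have to be solved with a descent-based method. To train the network we only need the derivative of the inner problem's solution with respect to the layer's inputs; how that solution was obtained does not matter, as long as this derivative can be computed or approximated. A general bi-level optimization problem has the form: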

\begin{equation}
\min\limits_{x \in X, y \in Y}\;\; F(x,y)
\end{equation}

Subject to:

\begin{equation}
G_i(x,y) \leq 0, \quad \text{for } i \in \{ 1,2,\ldots ,I \}
\end{equation}

\begin{equation}
y \in  \arg \min \limits_{z \in Y} \left\{ f(x,z) : g_{j}(x,z) \leq 0, j \in \{ 1,2,\ldots,J \} \right\}
\end{equation}

Where:

\begin{equation}
F,f: \mathbb{R}^{n_x} \times \mathbb{R}^{n_y} \to \mathbb{R}
\end{equation}

\begin{equation}
G_i,g_j: \mathbb{R}^{n_x} \times \mathbb{R}^{n_y} \to \mathbb{R}
\end{equation}

\begin{equation}
X \subseteq \mathbb{R}^{n_x}
\end{equation}

\begin{equation}
Y \subseteq \mathbb{R}^{n_y}
\end{equation}

Sources for bi-level optimization:
* https://en.wikipedia.org/wiki/Bilevel_optimization
* https://ssabach.net.technion.ac.il/files/2017/03/SS2017.pdf
</div>
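
The mathematical justification can be sketched with implicit differentiation. This is a sketch under simplifying assumptions that are not part of the answer above: the inner problem is unconstrained, $f$ is twice continuously differentiable, and the Hessian $\nabla^2_{yy} f$ is invertible at the solution. Let the layer output be

$$
y^*(x) = \arg\min_{y} f(x, y).
$$

Whatever solver is used to find $y^*$, the solution satisfies the first-order optimality condition

$$
\nabla_y f\left(x, y^*(x)\right) = 0.
$$

Differentiating both sides with respect to $x$ gives

$$
\nabla^2_{yx} f + \nabla^2_{yy} f \, \frac{\partial y^*}{\partial x} = 0
\quad\Longrightarrow\quad
\frac{\partial y^*}{\partial x} = -\left(\nabla^2_{yy} f\right)^{-1} \nabla^2_{yx} f,
$$

with all derivatives evaluated at $(x, y^*(x))$. The gradient of the outer objective is then

$$
\frac{dF}{dx} = \frac{\partial F}{\partial x} + \left(\frac{\partial y^*}{\partial x}\right)^{\top} \frac{\partial F}{\partial y}.
$$

These expressions depend only on $f$, $F$ and the solution $y^*$ itself, not on the iterations that produced it, so under these assumptions any solver that returns $y^*$ can be used for the inner problem.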

6. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$.
  Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
    1. Explain the concepts of "vanishing gradients", and "exploding gradients".
    2. How can each of these problems be caused by increased depth?
    3. Provide a numerical example demonstrating each.
    4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

<div class="alert alert-block alert-success">

<b><u>Answer 6:</b></u>

1. The "vanishing gradients" problem occurs when the gradients of the loss function become very small as they are propagated backwards through the layers of a deep neural network during training. This makes the weights of the initial layers change very little, effectively preventing the network from learning. The "exploding gradients" problem is when the gradients become very large, causing the weights to change drastically during training, which can make the learning process unstable and the network's performance worse.

2. We can look at the depth of the network as the 'size' of the 'chain' in the famous chain rule. If the weights or the derivative of the activation function are small (less than 1), repeatedly multiplying them across many layers can cause the gradients to become vanishingly small. If the weights or the derivative of the activation function are large (greater than 1), repeatedly multiplying them across many layers can cause the gradients to become very large.

3. Vanishing gradients: Consider a deep network with 100 layers, all using the sigmoid activation function, where all the weights are initialized to 0.5. The derivative of the sigmoid function is at most 0.25, so each layer multiplies the gradient by at most $0.5 \cdot 0.25 = 0.125$. After 100 layers, the gradient would shrink to at most $(0.125)^{100} \approx 5 \cdot 10^{-91}$, which is practically zero. Exploding gradients: assume instead linear activations (derivative 1) and all weights initialized to 2. After 100 layers, the gradient would have grown by a factor of $2^{100} \approx 1.3 \cdot 10^{30}$.

4. If the network is suffering from the vanishing gradients problem, we will see the training loss decrease very slowly or get stuck. If the network is suffering from exploding gradients, the training process can become unstable, and the loss may become NaN or infinity due to numerical overflow. You may also see very large fluctuations in the loss over time.

</div>
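
The two numerical examples can be evaluated directly. This sketch treats every layer as a single scalar weight followed by an activation, which is a deliberate simplification, and the variable names are illustrative.

```python
import math

depth = 100


def d_sigmoid(z):
    s = 1 / (1 + math.exp(-z))
    return s * (1 - s)


# Vanishing: each layer contributes a factor w * sigmoid'(z), sigmoid' <= 0.25
w_small = 0.5
vanishing_bound = (w_small * d_sigmoid(0.0)) ** depth

# Exploding: identity activation (derivative 1) and weight 2 at every layer
w_large = 2.0
exploding = w_large ** depth

print(f"upper bound on the vanishing gradient factor: {vanishing_bound:.2e}")
print(f"exploding gradient factor: {exploding:.2e}")
```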

### Backpropagation

1. You wish to train the following 2-layer MLP for a binary classification task:
  $$
  \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2
  $$
  You wish to minimize the in-sample loss function, defined as
  $$
  L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right)
  $$
  Where the pointwise loss is binary cross-entropy:
  $$
  \ell(y, \hat{y}) =  - y \log(\hat{y}) - (1-y) \log(1-\hat{y})
  $$
  
  Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.

<div class="alert alert-block alert-success">

<b><u>Answer 1:</b></u>

Denote $\vec{z}^{(i)}=\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1$, $\vec{h}^{(i)}=\varphi(\vec{z}^{(i)})$ and $\delta^{(i)}=\pderiv{\ell}{\hat{y}^{(i)}}=-\frac{y^{(i)}}{\hat{y}^{(i)}} +\frac{1-y^{(i)}}{1-\hat{y}^{(i)}}$. Since $\hat{y}^{(i)}$ is a scalar, $\mat{W}_2$ is a row vector and $\vec{b}_2$ is a scalar. With $\odot$ denoting the elementwise product and $\varphi'$ applied elementwise:

\begin{align*}
\frac{\partial L_{\mathcal{S}}}{\partial \mat{W}_2} &= \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)}\left(\vec{h}^{(i)}\right)^\top + \lambda \mat{W}_2, \\
\frac{\partial L_{\mathcal{S}}}{\partial \mat{W}_1} &= \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)}\left(\mattr{W}_2\odot\varphi'(\vec{z}^{(i)})\right)\left(\vec{x}^{(i)}\right)^\top + \lambda \mat{W}_1, \\
\frac{\partial L_{\mathcal{S}}}{\partial \vec{b}_2} &= \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)}, \\
\frac{\partial L_{\mathcal{S}}}{\partial \vec{b}_1} &= \frac{1}{N}\sum_{i=1}^{N}\delta^{(i)}\left(\mattr{W}_2\odot\varphi'(\vec{z}^{(i)})\right), \\
\frac{\partial L_{\mathcal{S}}}{\partial \vec{x}^{(i)}} &= \frac{1}{N}\,\delta^{(i)}\,\mattr{W}_1\left(\mattr{W}_2\odot\varphi'(\vec{z}^{(i)})\right).
\end{align*}


</div>
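
The chain-rule structure of these derivatives can be checked numerically on a tiny instance. This is a sketch under several assumptions that are not in the question: a single sample, two inputs, two hidden units, $\varphi=\tanh$, no regularization ($\lambda = 0$), and hand-picked parameter values that keep $\hat{y}$ inside $(0, 1)$. Only the gradient with respect to `b1` is compared against finite differences.

```python
import math

# Hypothetical tiny instance: 2 inputs, 2 hidden units, one sample, phi = tanh
W1 = [[0.1, -0.2], [0.3, 0.4]]
b1 = [0.05, -0.1]
W2 = [0.5, -0.3]
b2 = 0.7
x, y = [1.0, 2.0], 1.0


def forward(W1, b1, W2, b2):
    z = [sum(W1[k][j] * x[j] for j in range(2)) + b1[k] for k in range(2)]
    h = [math.tanh(zk) for zk in z]
    y_hat = sum(W2[k] * h[k] for k in range(2)) + b2
    return z, h, y_hat


def loss(W1, b1, W2, b2):
    _, _, y_hat = forward(W1, b1, W2, b2)
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)


# Analytic gradients (regularizer omitted, i.e. lambda = 0)
z, h, y_hat = forward(W1, b1, W2, b2)
delta = -y / y_hat + (1 - y) / (1 - y_hat)                    # dl/dy_hat
grad_b2 = delta
grad_W2 = [delta * h[k] for k in range(2)]
grad_z = [delta * W2[k] * (1 - h[k] ** 2) for k in range(2)]  # tanh' = 1 - tanh^2
grad_b1 = grad_z
grad_W1 = [[grad_z[k] * x[j] for j in range(2)] for k in range(2)]


def numeric_b1(k, eps=1e-6):
    """Central finite difference of the loss w.r.t. b1[k]."""
    plus = [b + (eps if i == k else 0.0) for i, b in enumerate(b1)]
    minus = [b - (eps if i == k else 0.0) for i, b in enumerate(b1)]
    return (loss(W1, plus, W2, b2) - loss(W1, minus, W2, b2)) / (2 * eps)


max_err_b1 = max(abs(grad_b1[k] - numeric_b1(k)) for k in range(2))
print(f"max |analytic - numeric| for b1: {max_err_b1:.2e}")
```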

2. The derivative of a function $f(\vec{x})$ at a point $\vec{x}_0$ is
  $$
  f'(\vec{x}_0)=\lim_{\Delta\vec{x}\to 0} \frac{f(\vec{x}_0+\Delta\vec{x})-f(\vec{x}_0)}{\Delta\vec{x}}
  $$
  
    1. Explain how this formula can be used in order to compute gradients of neural network parameters numerically, without automatic differentiation (AD).
  
    2. What are the drawbacks of this approach? List at least two drawbacks compared to AD.

<div class="alert alert-block alert-success">

<b><u>Answer 2:</b></u>

1. The definition of the derivative as a limit of finite differences provides a basic way to approximate gradients numerically without automatic differentiation. We apply a small perturbation $\Delta\vec{x}$ to a single parameter, re-evaluate the loss, and divide the change in the loss by the size of the perturbation. Repeating this for every parameter gives an approximation of the gradient of the loss function with respect to all parameters.

2. <b> Inaccuracy & stability: </b> The choice of $\Delta\vec{x}$ greatly affects the accuracy of the computed gradient. If $\Delta\vec{x}$ is too large, the approximation becomes poor as it deviates from the true tangent line. If $\Delta\vec{x}$ is too small, floating-point round-off error dominates and the result becomes numerically unstable. <b> Complexity: </b>This method requires at least one additional evaluation of the loss (a full forward pass) for each individual parameter in the neural network. Given that modern neural networks can have millions (or even billions) of parameters, this method would require a prohibitive amount of computational resources and time. This contrasts sharply with automatic differentiation (backpropagation), which computes all gradients with a single forward and a single backward pass through the network.


</div>
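
The accuracy trade-off in the choice of step size can be seen on a function whose derivative is known exactly. This sketch uses $f(x)=e^x$ at $x_0 = 1$ as an illustrative test function; `forward_diff` is a hypothetical helper implementing the forward difference from the formula above.

```python
import math


def forward_diff(f, x0, h):
    """Forward finite-difference approximation of f'(x0) with step h."""
    return (f(x0 + h) - f(x0)) / h


x0 = 1.0
exact = math.exp(x0)  # d/dx exp(x) = exp(x)

errors = {
    h: abs(forward_diff(math.exp, x0, h) - exact)
    for h in (1e-1, 1e-8, 1e-15)
}
for h, err in errors.items():
    print(f"h = {h:.0e}: absolute error = {err:.2e}")
```

The large step suffers from truncation error and the tiny step from floating-point round-off, while the intermediate step is the most accurate of the three. For a network with $P$ parameters, the same procedure would need $P + 1$ loss evaluations per gradient.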

3. Given the following code snippet:
    1. Write a short snippet that calculates the gradient of `loss` w.r.t. `W` and `b` using the approach of numerical gradients from the previous question.
    2. Calculate the same derivatives with autograd.
    3. Show, by calling `torch.allclose()` that your numerical gradient is close to autograd's gradient.

In [4]:
import torch

N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W, b = torch.rand(d, d, requires_grad=True, dtype=dtype), torch.rand(d, requires_grad=True, dtype=dtype)

def foo(W, b):
    return torch.mean(X @ W + b)

loss = foo(W, b)
print(f"{loss=}")

# Calculate gradients numerically for W and b (forward differences)
delta_x = 1e-9
w_gradient, b_gradient = torch.zeros_like(W), torch.zeros_like(b)

with torch.no_grad():  # the numerical estimate needs no autograd graph
    base = foo(W, b)
    for i in range(d):
        # Perturb one entry of b at a time
        updated_b = b.clone()
        updated_b[i] += delta_x
        b_gradient[i] = (foo(W, updated_b) - base) / delta_x
        for j in range(d):
            # Perturb one entry of W at a time
            updated_w = W.clone()
            updated_w[i, j] += delta_x
            w_gradient[i, j] = (foo(updated_w, b) - base) / delta_x

# Compare with autograd using torch.allclose()
autograd_W, autograd_b = torch.autograd.grad(loss, (W, b))
print(f"max difference between gradients: {torch.max(torch.abs(w_gradient - autograd_W))}")
assert torch.allclose(w_gradient, autograd_W)
assert torch.allclose(b_gradient, autograd_b)
print("test passed.")

loss=tensor(1.9321, dtype=torch.float64, grad_fn=<MeanBackward0>)
max difference between gradients: 3.5640439145778746e-07
test passed.


### Sequence models

1. Regarding word embeddings:
    1. Explain this term and why it's used in the context of a language model.
    1. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

<div class="alert alert-block alert-success">

<b><u>Answer 1:</b></u>

1. Word embedding: A word embedding is a mapping from each token in the vocabulary to a dense, learned, real-valued vector. These vectors capture a wealth of semantic properties of the words, such as their meanings, their types, or even their likely neighbors. In a language model, representing words as vectors lets the model apply mathematical operations to them in order to find patterns and make predictions.

2. Yes, it is possible, but it's typically not effective. Without word embeddings, the model will struggle to capture semantic relationships between words, making it less accurate at sentiment analysis. Representing tokens directly (e.g. as one-hot vectors) also leads to high-dimensional inputs for large vocabularies, making the model computationally expensive and prone to overfitting. Further, it would not be able to meaningfully process words that it hasn't encountered during training, limiting its generalization capability. Therefore, despite being possible, training a sentiment analysis model without word embeddings can substantially diminish the model's performance and efficiency.

</div>

2. Considering the following snippet, explain:
    1. What does `Y` contain? why this output shape?
    2. **Bonus**: How would you implement `nn.Embedding` yourself using only torch tensors?

In [5]:
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")

Y.shape=torch.Size([5, 6, 7, 8, 42000])


<div class="alert alert-block alert-success">

<b><u>Answer 2:</b></u>
1. `Y` contains the embedded representation of each value in `X`. `X` is a tensor of shape (5, 6, 7, 8) with integer values drawn at random from 0 to 41 (inclusive). The embedding layer maps each of these values to a 42000-dimensional vector. Hence, `Y` has one additional trailing dimension of size 42000, giving the output shape (5, 6, 7, 8, 42000).
The `nn.Embedding` layer is a simple lookup table that stores one embedding vector of fixed size for each entry of a fixed-size dictionary. Each value in the input tensor is an index into this lookup table. The output is obtained by replacing each index in the input with the corresponding embedding vector. This is useful in many NLP tasks, where words are often represented by such embeddings.


2. BONUS - in the following code block:

</div>

In [6]:
import torch
Y_old = Y
num_embeddings = 42
embedding_dim = 42000
embedding_table = torch.randn(num_embeddings, embedding_dim)
X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))

def my_embedding(indices, embedding_table):
    # Integer (advanced) indexing looks up one row of the table per index
    return embedding_table[indices]

Y = my_embedding(X, embedding_table)
print(f"{Y.shape=}")
print(f"Are shapes equal? {Y.shape == Y_old.shape}")

Y.shape=torch.Size([5, 6, 7, 8, 42000])
Are shapes equal? True
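The shape comparison above does not check values, since the custom table is initialized independently of the layer. As a minimal sketch (small sizes chosen only for illustration, and variable names such as `emb` and `idx` are hypothetical), the following checks that indexing the layer's own `weight` matrix reproduces the output of `nn.Embedding`, and that the lookup is equivalent to multiplying one-hot vectors by the table:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
emb = nn.Embedding(num_embeddings=10, embedding_dim=4)  # small sizes, for illustration only
idx = torch.randint(low=0, high=10, size=(2, 3))

# 1) Lookup by indexing the layer's own weight matrix.
by_index = emb.weight[idx]

# 2) Equivalent formulation: one-hot encode the indices and multiply by the table.
one_hot = F.one_hot(idx, num_classes=10).to(emb.weight.dtype)
by_matmul = one_hot @ emb.weight

print(by_index.shape)                        # input shape plus a trailing embedding dimension
print(torch.equal(by_index, emb(idx)))       # indexing matches the layer
print(torch.allclose(by_matmul, emb(idx)))   # so does the one-hot product
```

The one-hot formulation makes explicit that an embedding is a linear layer applied to one-hot inputs; the indexing form simply avoids materializing those vectors.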


3. Regarding truncated backpropagation through time (TBPTT) with a sequence length of S: State whether the following sentences are **true or false**, and explain.

3.1. TBPTT uses a modified version of the backpropagation algorithm.
<div class="alert alert-block alert-success">
<b>Answer 3.1: True</b>
Truncated Backpropagation Through Time (TBPTT) is a variant of the standard backpropagation designed for efficiently training Recurrent Neural Networks (RNNs) on long sequences. It limits the backpropagation to a fixed number of steps to save computational resources.
</div>

3.2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length S.
<div class="alert alert-block alert-success">
<b>Answer 3.2: False</b>
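TBPTT is more than limiting sequence lengths: the long sequence is split into subsequences of length S, the hidden state at the end of one subsequence is carried over as the initial state of the next one in the forward pass, and the backward pass is stopped at the subsequence boundary. This management of hidden states across truncated subsequences is what still allows long-term dependencies to be captured.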
</div>

3.3. TBPTT allows the model to learn relations between input that are at most S timesteps apart.
<div class="alert alert-block alert-success">
<b>Answer 3.3: False</b>
The model can still learn longer-term dependencies indirectly. This learning happens as information is propagated through the hidden states across different subsequences, which are connected through the model's parameters. The hidden states are updated at each timestep, and the model can learn to capture long-term dependencies by adjusting the parameters to propagate information from and across these hidden states.
</div>
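The TBPTT answers above can be illustrated with a short training loop. This is a minimal sketch under assumed toy settings (an `nn.RNN` with a linear head, random data, and an MSE loss, none of which come from the course material): the hidden state is carried across chunks of length `S` in the forward pass and detached so that gradients stop at the chunk boundary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S = 5                                   # truncation length
rnn = nn.RNN(input_size=3, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(4, 20, 3)               # toy batch: 4 sequences, 20 timesteps
y = torch.randn(4, 20, 1)               # toy per-timestep targets

h = None                                # initial hidden state (zeros by default)
for t in range(0, x.shape[1], S):
    x_chunk, y_chunk = x[:, t:t + S], y[:, t:t + S]
    out, h = rnn(x_chunk, h)            # forward pass continues from the previous chunk's state
    loss = nn.functional.mse_loss(head(out), y_chunk)
    opt.zero_grad()
    loss.backward()                     # gradients flow through at most S timesteps
    opt.step()
    h = h.detach()                      # keep the state's value, cut its history
```

Feeding only sequences of length S with a fresh hidden state each time would drop the `h` hand-off, which is the difference the answer to 3.2 points at.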

### Attention

1. In tutorial 5 we learned how to use attention to perform alignment between a source and target sequence in machine translation.
    1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
    2. After learning that self-attention is gaining popularity thanks to the transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections, like in the tutorial..). What influence do you expect this will have on the learned hidden states?


<div class="alert alert-block alert-success">

<b><u>Answer 1:</b></u>

1. Without attention, the encoder's hidden states must encapsulate the entire sequence's information, making it less efficient for long sequences. With attention, the decoder utilizes all the encoder's hidden states at each timestep, effectively searching the whole sequence's latent space to select the best word. This process allows the encoder's hidden states to represent each word more effectively rather than compressing the entire sequence, and the attention mechanism provides a sort of feedback that helps select the most appropriate next word.

2. Using self-attention, where keys, queries, and values all originate from the encoder's hidden states, the model is expected to gain a more nuanced and detailed understanding of the relationships and dependencies within the source sequence itself. Self-attention allows each part of the sequence to "look" at every other part, thereby capturing the internal structure of the sequence. It enables the hidden states to focus more on the relationships within themselves, fostering a deeper understanding of both local and long-range dependencies in the data. As a result, the learned hidden states should be better at modeling the complexity of the input sequence. However, as mentioned before, care must be taken to avoid overfitting, especially in smaller datasets.
</div>
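The two configurations discussed above differ only in where the queries come from. The following is a minimal sketch assuming scaled dot-product attention without the learned projections (the tutorial's scoring function may differ, and `attention`, `enc_h` and `dec_h` are hypothetical names):

```python
import math
import torch

def attention(q, k, v):
    """Scaled dot-product attention; returns the output and the weights."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    weights = torch.softmax(scores, dim=-1)   # each query's weights sum to 1
    return weights @ v, weights

torch.manual_seed(0)
enc_h = torch.randn(1, 7, 16)   # encoder hidden states: 7 source tokens
dec_h = torch.randn(1, 5, 16)   # decoder hidden states: 5 target tokens

# Encoder-decoder attention: queries come from the decoder.
ctx_cross, w_cross = attention(dec_h, enc_h, enc_h)
# Self-attention: queries, keys and values all come from the encoder.
ctx_self, w_self = attention(enc_h, enc_h, enc_h)

print(w_cross.shape)   # one row of source weights per target step
print(w_self.shape)    # source tokens attend to each other only
```

In the self-attention variant the weights do not depend on the decoder state at all, so they describe relations within the source sequence rather than an alignment between source and target.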

### Unsupervised learning

1. As we have seen, a variational autoencoder's loss is composed of a reconstruction term and a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term.
What would be the qualitative effect of this on:
    1. Images reconstructed by the model during training ($x\to z \to x'$)?
    1. Images generated by the model ($z \to x'$)?

<div class="alert alert-block alert-success">

<b><u>Answer 1:</b></u>
The KL-divergence term measures the difference between the learned latent-variable distribution and a target distribution (in our case, the standard normal).

* Images reconstructed by the model during training ($x\to z \to x'$):
Without the KL-divergence term, a VAE essentially becomes a plain autoencoder. The reconstruction loss alone drives the model to encode the data in such a way that it can be decoded to reproduce the input as accurately as possible, so reconstructions of training images are expected to remain good. However, the model could overfit to the training data, as there is no term in the loss function to encourage it to learn a smooth, structured latent space. The KL term acts as a regularizer.

* Images generated by the model ($z \to x'$):
The KL-divergence term in VAEs is crucial for ensuring that the distribution of the encoded representations (latent variables) aligns with the target distribution. When we generate new images, we sample $z$ from this target distribution. If the KL-divergence term is omitted during training, the encodings of the training data need not resemble the standard normal, so sampled latents may fall in regions the decoder was never trained on. There is then no guarantee that the generated images are plausible, or that they are "new" images that differ from the training data.
</div>

2. Regarding VAEs, state whether each of the following statements is **true or false**, and explain:


2.1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
<div class="alert alert-block alert-success">
<b>Answer 2.1: False. </b>

For a specific input image, the encoder produces a specific distribution in the latent space that is not necessarily $\mathcal{N}(\vec{0},\vec{I})$. The KL-divergence term in the VAE loss function only encourages the distribution of the encoded representations (latent variables) to align with the target distribution (in our case, the standard normal). Therefore, the latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{\mu},\diag(\vec{\sigma}^2))$ for some $\vec{\mu},\vec{\sigma}$ that depend on the image.
</div>


2.2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
<div class="alert alert-block alert-success">
<b>Answer 2.2: False </b>
In a VAE, when the same image is fed multiple times, the mean and variance produced by the encoder remain the same, but the decoded outputs may vary. This is because the decoder uses a sample drawn from the Gaussian distribution. Due to the randomness in this sampling process, different reconstructions may be produced for the same input across multiple runs.

</div>
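The behaviour described in the answer to 2.2 can be reproduced with a toy encoder. This is a minimal sketch, assuming a hypothetical linear `encoder` that outputs a mean and a log-variance for a 3-dimensional latent space; it is not the model from the course.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Linear(12, 2 * 3)      # hypothetical encoder: mean and log-variance of a 3-d latent
x = torch.randn(1, 12)              # stand-in for one flattened input image

def encode_and_sample(x):
    mu, log_var = encoder(x).chunk(2, dim=-1)
    eps = torch.randn_like(mu)                  # fresh noise on every call
    z = mu + torch.exp(0.5 * log_var) * eps     # reparameterization trick
    return mu, log_var, z

mu1, log_var1, z1 = encode_and_sample(x)
mu2, log_var2, z2 = encode_and_sample(x)

print(torch.equal(mu1, mu2), torch.equal(log_var1, log_var2))  # same distribution parameters
print(torch.equal(z1, z2))                                     # different latent samples
```

Since the decoder receives `z` rather than `mu`, two passes over the same image generally decode to different reconstructions.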

2.3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.

<div class="alert alert-block alert-success">
<b>Answer 2.3: False (we maximize a lower bound rather than minimize an upper bound).</b>

The true objective of VAEs is the maximization of the data likelihood, which is intractable due to the complexity of the calculation. We instead maximize the evidence lower bound (ELBO), which is a lower bound on the log-likelihood of the data.
</div>

3. Regarding GANs, state whether each of the following statements is **true or false**, and explain:


3.1 Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
<div class="alert alert-block alert-success">
<b>Answer 3.1: False (depending on how "ideally" is defined it can also be true, see the explanation)</b> - The generator and discriminator are in a two-player game where the generator aims to generate data that the discriminator cannot distinguish from the real data, while the discriminator aims to correctly <u>distinguish between real and generated data</u>.
This does not mean we want the discriminator's loss to be <u>maximized</u>.
In fact, we want an equilibrium where the generator is generating "realistic" data (low generator loss), and the discriminator is getting about 50% accuracy, meaning it's essentially guessing whether the data is real or generated because both are very similar (which does not <u>necessarily</u> correspond to high loss). So, it's <u>not true</u> that we want to maximize the discriminator's loss. If the discriminator's loss is too high, it might suggest that the discriminator is not performing well, which is not the goal.
</div>

3.2. It's crucial to backpropagate into the generator when training the discriminator.
<div class="alert alert-block alert-success">
<b>Answer 3.2:  False </b>- When we train the discriminator, we typically freeze the generator's weights, meaning we do not backpropagate into the generator. The generator's weights are updated separately, in an attempt to generate data that can fool the discriminator. So, while backpropagation into the generator is crucial in the generator's training phase, it is not required during the training of the discriminator.
</div>
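One common way to avoid backpropagating into the generator during the discriminator update is to detach the generated batch. The following is a minimal sketch with hypothetical toy networks `G` and `D` (single linear layers) and random data standing in for real samples:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Linear(4, 6)                      # hypothetical toy generator: latent (4) -> data (6)
D = nn.Linear(6, 1)                      # hypothetical toy discriminator: data (6) -> logit
opt_D = torch.optim.SGD(D.parameters(), lr=0.1)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(8, 6)                 # stand-in for a batch of real samples
fake = G(torch.randn(8, 4)).detach()     # detach: no graph back into the generator

loss_D = bce(D(real), torch.ones(8, 1)) + bce(D(fake), torch.zeros(8, 1))
opt_D.zero_grad()
loss_D.backward()
opt_D.step()

# The generator received no gradients from the discriminator update.
print(all(p.grad is None for p in G.parameters()))
```

Detaching also saves the cost of the backward pass through the generator, which would be wasted here because `opt_D` only updates the discriminator's parameters.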

3.3 To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
<div class="alert alert-block alert-success">
<b>Answer 3.3: True</b>

In GANs, the generator network is trained to transform latent space vectors (sampled from a distribution, commonly a standard Gaussian distribution $\mathcal{N}(\vec{0},\vec{I})$) into synthetic data that resemble the real data.

When we want to generate new images (or other types of data), we can sample a new latent vector from the same distribution and feed it to the trained generator. Because the generator has learned to map the latent space to the data space during training, it can use this new latent vector to generate a unique piece of synthetic data.

Thus, by sampling different latent vectors, we can generate a variety of images, effectively navigating the learned data distribution.

</div>

3.4. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.
<div class="alert alert-block alert-success">
<b>Answer 3.4: True</b>
Pre-training the discriminator can be beneficial because a reasonably competent discriminator provides more meaningful gradients to the generator at the start of training, as opposed to outputs that are essentially random.
This pre-training step might help the generator to start learning more effectively from the very beginning, speeding up the overall training process.
</div>

3.5. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.
<div class="alert alert-block alert-success">
<b>Answer 3.5: False </b>
If the generator is creating images that the discriminator can't reliably tell apart from real ones (50% accuracy), we have reached a good state of equilibrium for the GAN we trained. At that point the discriminator's output carries no useful signal about how to make the images more realistic, so training the generator further against it is not expected to improve the generated images.
</div>

### Detection and Segmentation 

1. What is the difference between IoU and Dice score? What's the difference between IoU and mAP?
    Briefly explain when you would use each evaluation metric.

<div class="alert alert-block alert-success">
<b><u>Answer 1:</b></u>
$$ \mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}$$
$$ \mathrm{Dice} = \frac{2|A \cap B|}{|A| + |B|}$$

IoU and Dice Score are metrics used to measure the overlap between the predicted and actual areas in segmentation tasks. While both range from 0 to 1, Dice Score is typically more forgiving, giving higher values for the same overlap compared to IoU.

mAP (mean Average Precision), on the other hand, is used in object detection tasks. It gives a summary of model performance across multiple thresholds and classes.

The difference: IoU and Dice score tell us how well a predicted region (a segmentation mask or a bounding box) overlaps the ground-truth region. mAP tells us how good our model is at both localizing objects accurately and classifying them correctly (detection).

When evaluating a model: if we're performing segmentation, IoU and Dice score are more appropriate. If we are performing object detection, mAP is more appropriate. mAP is often used in combination with IoU, where an IoU threshold decides whether a predicted box counts as a correct detection.
</div>
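The two overlap metrics can be compared on a small worked example. This is a minimal sketch in which regions are represented as sets of pixel coordinates; `iou` and `dice` are hypothetical helper names, and non-empty regions are assumed.

```python
def iou(a: set, b: set) -> float:
    """Intersection over union of two pixel sets."""
    return len(a & b) / len(a | b)

def dice(a: set, b: set) -> float:
    """Dice score of two pixel sets."""
    return 2 * len(a & b) / (len(a) + len(b))

# Two 4x4 squares on a pixel grid, offset by two columns.
pred = {(r, c) for r in range(4) for c in range(0, 4)}
target = {(r, c) for r in range(4) for c in range(2, 6)}

j, d = iou(pred, target), dice(pred, target)
print(j, d)                               # IoU = 8/24 = 1/3, Dice = 16/32 = 1/2
print(abs(d - 2 * j / (1 + j)) < 1e-12)   # Dice = 2 * IoU / (1 + IoU)
```

The identity in the last line shows why Dice is never smaller than IoU for the same pair of regions, and why the two metrics always rank predictions in the same order.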

2. Regarding YOLO and Mask R-CNN, which one is a one-stage detector? Describe the RPN outputs and the YOLO output; address how each network produces its output and the shape of each output.

<div class="alert alert-block alert-success">
<b><u>Answer 2:</b></u>

YOLO is a <u>one-stage detector</u>. The main difference between one-stage and two-stage detectors is that two-stage detectors first propose regions where objects might be present (region proposals), and then classify these regions, whereas one-stage detectors perform these tasks in a single pass.

* YOLO (You Only Look Once) splits the input image into a grid and predicts multiple bounding boxes and associated class probabilities for each grid cell.
* The RPN (region proposal network) slides over the backbone's feature map to detect potential object locations, outputting an objectness score and four bounding-box coordinates for each anchor.


<b> Denote: </b>
* S: the number of grid cells along one side of the image for YOLO
* B: the number of bounding box predictions per grid cell in YOLO
* C: the number of classes
* W and H: the width and height of the feature map for RPN in Mask-RCNN
* A: the number of anchors per location in RPN

$$ \text{YOLO}: [S, S, B, (5+C)] $$
$$ \text{RPN (Mask R-CNN)}: [W, H, A, (4+1)] $$

</div>