# Theoretical Questions

* This is the theoretical part of the final project. It includes theoretical questions from various topics covered in the course.
* There are 7 questions among which you need to choose 6, according to the following key:
    + Question 1 is **mandatory**.
    + Choose **one question** from questions 2-3.
    + Question 4 is **mandatory**.
    + Questions 5-6 are **mandatory**.
    + Question 7 is **mandatory**.
* Question 1 is worth 15 points, whereas each of the other questions is worth 7 points.
* All in all, the maximal grade for this part is 15+7*5=50 points.
* **You should answer the questions on your own. We will check for plagiarism.**
* If you need to add external images (such as graphs) to this notebook, please put them inside the 'imgs' folder. DO NOT put a reference to an external link.
* Good luck!

## Part 1: General understanding of the course material

### Question 1

1.  Relate the number of parameters in a neural network to the over-fitting phenomenon (*).
    Relate this to the design of convolutional neural networks, and explain why CNNs are a plausible choice for a hypothesis class for visual classification tasks.

    (*) In the context of classical under-fitting/over-fitting in machine learning models.

#### Answer:
More parameters in a neural network mean that the learned network can represent more complex functions and fit the training data more closely. <br>
A more complex model is more prone to over-fitting, since it can "memorize" relations that are specific to the training set instead of relations that generalize. <br>
In a convolutional neural network, the number of parameters depends on the number of learnable layers (convolutional and fully connected; activation and pooling layers add none), <br>
the number of channels in each layer and the kernel size. Stride and padding change only the feature-map sizes, and through them the size of any following fully connected layer. Increasing the parameter count in any of these ways may lead to over-fitting. <br>
On the other hand, a CNN reduces the number of parameters (compared to a fully connected layer, for example) by sharing weights: each filter has only <br>
kernel_size*kernel_size weights per input channel, instead of one weight per pixel of the image. In addition, a CNN applies the same kernel <br>
across the entire image, so it detects a pattern regardless of its position in the image. This matches the structure of visual data, which is why CNNs are a plausible hypothesis class for visual classification.
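The parameter saving from weight sharing can be checked directly. The following is a minimal sketch, assuming a hypothetical 1x28x28 input that is mapped to 8 feature maps of the same spatial size; the layer sizes are chosen only for illustration.

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

# Hypothetical setting: 1x28x28 input, 8 output feature maps of size 28x28.
fc = nn.Linear(1 * 28 * 28, 8 * 28 * 28)            # dense map between the flattened tensors
conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)    # same output shape, shared 3x3 kernels

print("fully connected:", count_params(fc))    # 784 * 6272 weights + 6272 biases
print("convolution:    ", count_params(conv))  # 8 * 1 * 3 * 3 weights + 8 biases
```

Both layers produce an output of the same shape, but the convolution does so with a few dozen parameters instead of several million.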


2. Consider the linear classifier model with hand-crafted features:
    $$f_{w,b}(x) = w^T \psi(x) + b$$
    where $x \in \mathbb{R}^2$, $\psi$ is a non-learnable feature extractor and assume that the classification is done by $sign(f_{w,b}(x))$. Let $\psi$ be the following feature extractor $\psi(x)=x^TQx$ where $Q \in \mathbb{R}^{2 \times 2}$ is a non-learnable positive definite matrix. Describe a distribution of the data which the model is able to approximate, but the simple linear model fails to approximate (hint: first, try to describe the decision boundary of the above classifier).

#### Answer:
The decision boundary is the set where $$f_{w,b}(x) = 0$$ because the classification is done by $sign(f_{w,b}(x))$.
Since $\psi(x)$ is a scalar, $w$ is a scalar as well, so:  $$ f_{w,b}(x) = w \cdot \psi(x) + b = w \cdot x^TQx + b = 0 \iff x^TQx = -\frac{b}{w}$$ <br>
This is a second-degree equation in $x$. Because $Q$ is positive definite, for $-b/w > 0$ the decision boundary is an ellipse centered at the origin, not a straight line.
Thus, the model is able to approximate a distribution in which one class is concentrated inside such an ellipse around the origin and the other class surrounds it from the outside.
The simple linear model $w^Tx + b$ fails on such data, since no straight line separates the inside of an ellipse from its outside.
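A small numerical experiment illustrates the difference between the two models. This is a sketch under assumptions: the matrix `Q`, the sampling range and the least-squares fit of the linear model are illustrative choices, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([[2.0, 0.5], [0.5, 1.0]])        # hypothetical positive definite matrix
X = rng.uniform(-1.5, 1.5, size=(2000, 2))    # points in the plane
psi = np.einsum("ni,ij,nj->n", X, Q, X)       # psi(x) = x^T Q x, one scalar per point
y = np.where(psi < 1.0, 1.0, -1.0)            # class +1 inside the ellipse x^T Q x = 1

# Quadratic-feature model with w = -1, b = 1, i.e. f(x) = 1 - x^T Q x.
quad_acc = np.mean(np.sign(-1.0 * psi + 1.0) == y)

# Plain linear model w^T x + b, fitted by least squares on the raw inputs.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
lin_acc = np.mean(np.sign(A @ coef) == y)

print(f"quadratic feature: {quad_acc:.3f}, linear: {lin_acc:.3f}")
```

The quadratic-feature model reproduces the labels exactly, while the linear fit can do little better than predicting the majority class.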


3. Assume that we would like to train a Neural Network for classifying images into $C$ classes. Assume that the architecture can be stored in the memory as a computational graph with $N$ nodes where the output is the logits (namely, before applying softmax) for the current batch ($f_w: B \times Ch \times H \times W \rightarrow B \times C$). Assume that the computational graph operates on *tensor* values.
    * Implement the CE loss assuming that the labels $y$ are hard labels given in a LongTensor (as usual). **Use Torch's log_softmax and index_select functions** and implement it with as few operations as possible.

#### Answer:
$$ \mathrm{CE}(x, y) = -\sum_{i=1}^{B} \sum_{c=1}^{C} \mathbb{1}[y_i = c] \cdot \log \mathrm{softmax}(f_w(x))_{i,c} = -\sum_{i=1}^{B} \log \mathrm{softmax}(f_w(x))_{i,y_i} $$ <br>

In [None]:
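import torch
from torch import index_select
from torch.nn.functional import log_softmax

# Input:  model, x, y (LongTensor of shape (B,) holding the class indices).
# Output: the summed CE loss on the current batch.
def ce_loss(model, x, y):
    logits = model(x)                                    # (B, C)
    y_probs = log_softmax(input=logits, dim=1)           # (B, C) log-probabilities
    # index_select picks whole columns: entry (i, j) of the result is
    # y_probs[i, y[j]], so the wanted terms y_probs[i, y[i]] lie on the diagonal.
    # This keeps the number of operations small at the cost of a (B, B) tensor.
    selected = index_select(input=y_probs, dim=1, index=y)
    return -torch.trace(selected)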


* Using the model's function as a black box, draw the computational graph (treating both log_softmax and index_select as atomic operations). How many nodes are there in the computational graph?

#### Answer:
There will be 7 nodes:
<center><img src="imgs/ce_loss.png" /></center>


* Now, instead of using hard labels, assume that the labels represent some probability distribution over the $C$ classes. How would the gradient computation be affected? Analyze the growth in the computational graph, memory and computation.

#### Answer:
Now the computational graph would look like this:
<center><img src="imgs/ce_loss2.png" /></center>

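This means we must add another step to the computation: an element-wise product between y and y_log_prob, followed by a sum. Since y is no longer one-hot, every class probability contributes to the loss (with one-hot labels only the log-probability of the correct class was needed).
In the backward pass the gradient now flows to all $B \times C$ log-probabilities instead of only the $B$ selected ones, and y must be stored as a dense $B \times C$ float tensor rather than as $B$ integers. Both memory and computation therefore grow, by a factor of roughly $C$ for this last step; the rest of the graph (the model and log_softmax) is unchanged.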
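The soft-label loss can be written in a few lines. The following is a minimal sketch; the function name `soft_ce_loss` and the toy shapes are illustrative assumptions. Feeding it one-hot rows recovers the hard-label loss, which serves as a sanity check.

```python
import torch
from torch.nn.functional import log_softmax, one_hot

def soft_ce_loss(logits: torch.Tensor, y_soft: torch.Tensor) -> torch.Tensor:
    """Summed cross-entropy for labels given as a (B, C) matrix of probabilities."""
    log_probs = log_softmax(logits, dim=1)    # (B, C)
    return -(y_soft * log_probs).sum()        # uses all B * C entries

torch.manual_seed(0)
logits = torch.randn(4, 3)                          # toy batch: B = 4, C = 3
y_hard = torch.tensor([0, 2, 1, 2])
y_soft = one_hot(y_hard, num_classes=3).float()     # one-hot rows are a special case
loss = soft_ce_loss(logits, y_soft)
print(loss)
```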


* Apply the same analysis in the case that we would like to double the batch size. How should we change the learning rate of the optimizer?

#### Answer:
If we double the batch size, the computational graph stays the same size (the same number of nodes), but every node operates on tensors that are twice as large (e.g. a sum over 16 values becomes a sum over 32 values). The memory needed for activations and the computation per iteration therefore roughly double, while the memory for the parameters is unchanged. The gradient estimate becomes more accurate, since it is computed from more samples. Finally, because the loss above sums over the batch rather than averaging, the gradient becomes roughly twice as large, so the learning rate should be roughly halved in order to keep the step size as before. (If the loss were averaged over the batch instead, the gradient scale would not grow with the batch size, and a common heuristic is then to increase the learning rate together with the batch size.)
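The effect of the reduction on the gradient scale can be observed on a toy model. This is a sketch under assumptions: the linear `model` is a hypothetical stand-in for the network, and the batch is doubled by repeating the same samples so that the ratios come out exact.

```python
import torch
from torch.nn.functional import cross_entropy

torch.manual_seed(0)
model = torch.nn.Linear(5, 3)      # hypothetical stand-in for the network
x = torch.randn(8, 5)
y = torch.randint(0, 3, (8,))

def grad_norm(inputs, labels, reduction):
    """Norm of the weight gradient of the CE loss for one batch."""
    model.zero_grad()
    cross_entropy(model(inputs), labels, reduction=reduction).backward()
    return model.weight.grad.norm().item()

# Doubled batch (the same samples twice, so the comparison is exact).
x2, y2 = torch.cat([x, x]), torch.cat([y, y])
ratio_sum = grad_norm(x2, y2, "sum") / grad_norm(x, y, "sum")
ratio_mean = grad_norm(x2, y2, "mean") / grad_norm(x, y, "mean")
print(ratio_sum, ratio_mean)
```

With a summed loss the gradient norm doubles, whereas with an averaged loss it stays the same, which is why the learning-rate adjustment depends on the reduction.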

## Part 2: Optimization & Automatic Differentiation

### Question 2: resolving gradient conflicts in multi-task learning

Assume that you want to train a model to perform two tasks: task 1 and task 2.
For each such task $i$ you have an already implemented function *loss\_i = forward_and_compute_loss_i(model,inputs)* such that given the model and the inputs it computes the loss w.r.t task $i$ (assume that the computational graph is properly constructed). We would like to train our model using SGD to succeed in both tasks as follows: in each training iteration (batch) -
* Let $g_i$ be the gradient w.r.t the $i$-th task.
* If $g_1 \cdot g_2 < 0$:
    + Pick a task $i$ at random.
    + Apply GD w.r.t only that task.
* Otherwise:
    + Apply GD w.r.t both tasks (namely $\mathcal{L}_1 + \mathcal{L}_2$).

Note that in the above formulation the gradient is thought of as a concatenation of the gradients w.r.t. all of the model's parameters, and $g_1 \cdot g_2$ stands for a dot product.

What parts should be modified to implement the above? Is it the optimizer, the training loop or both? Implement the above algorithm in a code cell/s below.
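One possible way to realize this scheme is to change only the training loop and leave the optimizer untouched: the loop computes both gradients, decides which one to use, and writes it into `.grad` before calling the optimizer. The following is a minimal sketch under that assumption; `loss_fn_1` and `loss_fn_2` stand for the given `forward_and_compute_loss_i` functions, and the helper names are hypothetical.

```python
import random
import torch

def flat_grad(loss, params):
    """Gradient of `loss` w.r.t. all parameters, concatenated into one vector."""
    grads = torch.autograd.grad(loss, params, allow_unused=True)
    return torch.cat([
        (g if g is not None else torch.zeros_like(p)).reshape(-1)
        for g, p in zip(grads, params)
    ])

def train_step(model, inputs, optimizer, loss_fn_1, loss_fn_2):
    """One optimizer step that follows a single random task when the gradients conflict."""
    params = [p for p in model.parameters() if p.requires_grad]
    g1 = flat_grad(loss_fn_1(model, inputs), params)
    g2 = flat_grad(loss_fn_2(model, inputs), params)

    if torch.dot(g1, g2) < 0:
        g = random.choice([g1, g2])   # conflict: use the gradient of one task only
    else:
        g = g1 + g2                   # no conflict: gradient of L_1 + L_2

    # Write the chosen gradient into .grad and let the optimizer apply it.
    optimizer.zero_grad()
    offset = 0
    for p in params:
        n = p.numel()
        p.grad = g[offset:offset + n].view_as(p).clone()
        offset += n
    optimizer.step()
```

In a training loop this would be called once per batch, e.g. `train_step(model, inputs, optimizer, forward_and_compute_loss_1, forward_and_compute_loss_2)`.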

### Question 3: manual automatic differentiation

Consider the following two-input two-output function:
$$ f(x,y) = (x^2\sin(xy+\frac{\pi}{2}), x^2\ln(1+xy)) $$
* Draw a computational graph for the above function. Assume that the unary atomic units are squaring, taking square root, $\exp,\ln$, basic trigonometric functions and the binary atomic units are addition and multiplication. You would have to use constant nodes.
* Calculate manually the forward pass.
* Calculate manually the derivative of all outputs w.r.t. all inputs using forward-mode AD.
* Calculate manually the derivative of all outputs w.r.t. all inputs using backward-mode AD.
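A manual calculation can be cross-checked against automatic differentiation. The following is a sketch under assumptions: the evaluation point is arbitrary (any point with $1+xy>0$ works), and `manual_jacobian` contains partial derivatives derived by hand with the product and chain rules, using $\sin(xy+\frac{\pi}{2})=\cos(xy)$.

```python
import math
import torch
from torch.autograd.functional import jacobian

def f(v: torch.Tensor) -> torch.Tensor:
    """f(x, y) from the question, with v = (x, y)."""
    x, y = v[0], v[1]
    return torch.stack([x**2 * torch.sin(x * y + math.pi / 2),
                        x**2 * torch.log(1 + x * y)])

def manual_jacobian(v: torch.Tensor) -> torch.Tensor:
    """Hand-derived partial derivatives; rows are outputs, columns are (x, y)."""
    x, y = v[0], v[1]
    s = x * y
    row_1 = torch.stack([2 * x * torch.cos(s) - x**2 * y * torch.sin(s),
                         -x**3 * torch.sin(s)])
    row_2 = torch.stack([2 * x * torch.log(1 + s) + x**2 * y / (1 + s),
                         x**3 / (1 + s)])
    return torch.stack([row_1, row_2])

v = torch.tensor([1.5, 0.5], dtype=torch.float64)   # arbitrary point with 1 + xy > 0
print(jacobian(f, v))          # computed by autograd
print(manual_jacobian(v))      # computed from the hand-derived formulas
```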

## Part 3: Sequential Models

### Question 4: RNNs vs Transformers in the real life

In each one of the following scenarios decide whether to use an RNN-based model or a transformer-based model. Justify your choice.
1. You are running a start-up in the area of automatic summarization of academic papers. The inference of the model is done on the server side, and it is very important for it to be fast.
2. You need to design a mobile application that gathers a small amount of data from a few apps every second and then uses a NN to possibly generate an alert given the information in the current second and the information from the past minute.
3. You have a prediction task over fixed length sequences on which you know the following properties:
    + In each sequence there are only a few tokens that the model should attend to.
    + Most of the information needed for generating a reliable prediction is located at the beginning of the sequence.
    + There is no restriction on the computational resources.

#### Answer:

1:

In this scenario the model summarizes academic papers on the server side, and inference speed is the main requirement. A Transformer-based model is therefore the better choice. A Transformer processes all tokens of the input in parallel, whereas an RNN must read a long paper one token at a time, so encoding the paper is much faster on server hardware such as GPUs. Transformer architectures such as BERT and GPT have also shown strong performance on natural language understanding and generation tasks, and self-attention connects distant parts of a paper directly, which helps the quality of the generated summaries.

2:

For a mobile application that collects a small amount of data from a few apps every second and must decide in real time whether to raise an alert, an RNN-based model is the better choice. An RNN handles streaming data naturally: at each second it consumes only the new input and updates a fixed-size hidden state that summarizes the past, so the cost per step in both computation and memory is constant and small, which suits a mobile device. A Transformer would have to keep the whole past minute in memory and attend over it again every second. The sequence is also short (about 60 steps), so the difficulty RNNs have with very long-range dependencies matters little here.

3:

In this scenario the prediction task involves fixed-length sequences in which only a few tokens matter, most of the essential information is at the beginning of each sequence, and there are no computational constraints. Here a Transformer-based model is the better choice. Attention lets the model focus directly on the few relevant tokens, wherever they are located, and the path from the first token to the output has constant length. An RNN, in contrast, would have to carry the information from the beginning of the sequence through every subsequent step, where it can fade (the vanishing gradient problem). The main drawback of the Transformer, its high computation and memory cost, is not an issue here since the computational resources are not restricted.
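For scenario 2, the streaming argument can be made concrete. The following is a minimal sketch with assumed names and sizes (`n_features`, `hidden_size`, `alert_head` and the random inputs are all hypothetical): a recurrent cell consumes one reading per second and keeps only a fixed-size hidden state between steps.

```python
import torch

torch.manual_seed(0)
n_features, hidden_size = 8, 16                  # hypothetical sizes
cell = torch.nn.GRUCell(n_features, hidden_size)
alert_head = torch.nn.Linear(hidden_size, 1)

h = torch.zeros(1, hidden_size)                  # running summary of the past
with torch.no_grad():
    for second in range(60):                     # one reading per second
        x_t = torch.randn(1, n_features)         # stand-in for the gathered data
        h = cell(x_t, h)                         # constant work and memory per step
        p_alert = torch.sigmoid(alert_head(h))   # alert probability for this second

print(h.shape, p_alert.item())
```

Note that the hidden state summarizes everything seen so far rather than exactly the last minute; restricting the context to a window would need an explicit reset or a learned forgetting behaviour.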

## Part 4: Generative modeling

### Question 5: VAEs and GANs

Suggest a method for combining VAEs and GANs. Focus on the different components of the model and how to train them jointly (the objectives). Which drawbacks of these models may the combined model overcome? Which not?

#### Answer:
Here is a suggested method for combining VAEs and GANs:

**Model Components:**

* Encoder (Q-network in VAE terminology):
    + Input: Data sample.
    + Output: Mean and variance of the latent space distribution.
* Decoder (P-network in VAE terminology, or the Generator in GAN terminology):
    + Input: Sample from the latent space.
    + Output: Data sample.
* Discriminator (from the GAN):
    + Input: Real or generated data sample.
    + Output: Probability that the input is a real sample.

**Joint Training:**

* VAE Objective. It can be split into two terms, which together ensure that the VAE produces meaningful latent representations and can reconstruct data samples from them:
    + Reconstruction loss (often the Mean Squared Error between the original and the reconstructed image).
    + KL divergence between the encoder's output distribution and a prior (usually a standard normal distribution).
* GAN Objective:
    + For the Generator (Decoder): Minimize the difference between the Discriminator's output for the generated samples and an array of ones (trying to fool the Discriminator).
    + For the Discriminator: Minimize the difference between its output for real samples and an array of ones, and between its output for generated samples and an array of zeros.
* Combined Training: During training, the VAE's objectives and the GAN's objectives are combined. One common approach is to alternate between updating the VAE's weights (both encoder and decoder) and the GAN's weights (both generator and discriminator).

**Benefits:**

* Stable Training: VAEs generally have more stable training than GANs due to the explicit likelihood-based objective.
* Better Reconstructions: A VAE covers the data distribution well but its outputs tend to be blurry, while a GAN produces sharp samples but can miss certain modes. The combined model might produce sharper and more accurate reconstructions than a standalone VAE.
* Regularized Latent Space: The VAE component ensures that the latent space has good continuity and coverage.

**Drawbacks:**

* Increased Complexity: Combining the architectures increases the model's complexity, which can make it harder to train and tune.
* Mode Dropping: While the combination might alleviate some of the mode dropping seen in GANs, it might not completely eliminate it.
* Training Stability: Even though the VAE component can add some stability, GANs are notorious for training instability, and some of those challenges may persist.

In summary, while VAE-GANs combine the benefits of both architectures, they also combine some of their challenges. However, the strengths of one can sometimes help mitigate the weaknesses of the other. Experimentation and careful tuning are crucial when working with such combined models.

-----------------------------------------

When combining VAEs and GANs, the alternating training scheme updates the parameters based on the VAE loss and the GAN loss in turn. A more structured formulation follows.

Define the VAE Losses:
For the VAE, the objective typically includes both the reconstruction loss and the KL-divergence loss.
$$ \mathcal{L}_{recon} = \| x - \mathrm{Decoder}(z) \|^2 $$
$$ \mathcal{L}_{KL} = -\frac{1}{2} \sum_{i=1}^{D} \left( 1 + \log\left(\sigma_i^2\right) - \mu_i^2 - \sigma_i^2 \right) $$

Combined VAE Loss:
$$ \mathcal{L}_{VAE} = \mathcal{L}_{recon} + \lambda_{KL} \mathcal{L}_{KL} $$

Define the GAN Losses:
For the Discriminator:
$$ \mathcal{L}_{D\text{-}real} = -\log(\mathrm{Discriminator}(x)) $$
$$ \mathcal{L}_{D\text{-}fake} = -\log(1 - \mathrm{Discriminator}(\mathrm{Decoder}(z))) $$

Combined Discriminator Loss:
$$ \mathcal{L}_{D} = \mathcal{L}_{D\text{-}real} + \mathcal{L}_{D\text{-}fake} $$

For the Generator (Decoder in VAE context):
$$ \mathcal{L}_{G} = -\log(\mathrm{Discriminator}(\mathrm{Decoder}(z))) $$

Alternating Training Procedure:

while not converged:

1. Sample a batch of data $x$ from the training dataset.
2. Forward pass through the VAE:
    + a. Compute $z$, $\mu$ and $\sigma$ using the Encoder.
    + b. Compute $\mathcal{L}_{recon}$, $\mathcal{L}_{KL}$ and $\mathcal{L}_{VAE}$.
    + c. Backpropagate $\mathcal{L}_{VAE}$ and update the Encoder and Decoder parameters.
3. For GAN training:
    + a. Sample random latent vectors and generate fake data using the Decoder.
    + b. Compute $\mathcal{L}_{D\text{-}real}$, $\mathcal{L}_{D\text{-}fake}$ and $\mathcal{L}_{D}$.
    + c. Backpropagate $\mathcal{L}_{D}$ and update the Discriminator parameters.
    + d. Compute $\mathcal{L}_{G}$.
    + e. Backpropagate $\mathcal{L}_{G}$ and update the Decoder (Generator) parameters.

In this approach, during each iteration of training, you're first updating the Encoder and Decoder using the VAE objective, and then updating the Discriminator and Decoder using the GAN objective.
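The alternating scheme can be sketched in PyTorch as follows. This is an illustration under assumptions, not a tuned implementation: the single-layer networks, the sizes `x_dim` and `z_dim`, the weight `lambda_kl` and the Adam optimizers are all hypothetical choices, and the encoder is assumed to output the mean and the log-variance.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, z_dim, lambda_kl = 20, 4, 1.0      # hypothetical sizes and KL weight
encoder = nn.Linear(x_dim, 2 * z_dim)     # outputs [mu, log sigma^2]
decoder = nn.Linear(z_dim, x_dim)         # doubles as the GAN generator
discriminator = nn.Sequential(nn.Linear(x_dim, 1), nn.Sigmoid())

opt_vae = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(decoder.parameters(), lr=1e-3)

def train_step(x):
    """One iteration: VAE update, then discriminator update, then generator update."""
    ones = torch.ones(x.shape[0], 1)
    zeros = torch.zeros(x.shape[0], 1)

    # VAE update of the encoder and the decoder.
    mu, log_var = encoder(x).chunk(2, dim=1)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterization trick
    l_recon = F.mse_loss(decoder(z), x, reduction="sum")
    l_kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    l_vae = l_recon + lambda_kl * l_kl
    opt_vae.zero_grad()
    l_vae.backward()
    opt_vae.step()

    # Discriminator update on real data versus generated data.
    fake = decoder(torch.randn(x.shape[0], z_dim))
    l_d_real = F.binary_cross_entropy(discriminator(x), ones, reduction="sum")
    l_d_fake = F.binary_cross_entropy(discriminator(fake.detach()), zeros, reduction="sum")
    l_d = l_d_real + l_d_fake
    opt_d.zero_grad()
    l_d.backward()
    opt_d.step()

    # Generator (decoder) update: try to make the discriminator output ones.
    l_g = F.binary_cross_entropy(discriminator(fake), ones, reduction="sum")
    opt_g.zero_grad()
    l_g.backward()
    opt_g.step()
    return l_vae.item(), l_d.item(), l_g.item()

print(train_step(torch.randn(16, x_dim)))
```

The binary cross-entropy against ones and zeros is the summed form of the $-\log(\cdot)$ losses above; `fake.detach()` keeps the discriminator update from changing the decoder.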


### Question 6: Diffusion Models

Show that $q(x_{t-1}|x_t,x_0)$ is tractable and is given by $\mathcal{N}(x_{t-1};\tilde{\mu}(x_t,x_0),\tilde{\beta_t}I)$ where the terms for $\tilde{\mu}(x_t,x_0)$ and $\tilde{\beta_t}$ are given in the last tutorial. Do so by explicitly computing the PDF.
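
A sketch of the computation, assuming the standard DDPM notation (an assumption here, since the tutorial's definitions are not reproduced in this notebook): $q(x_t|x_{t-1}) = \mathcal{N}(x_t;\sqrt{\alpha_t}x_{t-1},\beta_t I)$ with $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, so that $q(x_t|x_0) = \mathcal{N}(x_t;\sqrt{\bar{\alpha}_t}x_0,(1-\bar{\alpha}_t)I)$. By Bayes' rule and the Markov property of the forward process:

$$ q(x_{t-1}|x_t,x_0) = \frac{q(x_t|x_{t-1})\,q(x_{t-1}|x_0)}{q(x_t|x_0)} \propto \exp\left(-\frac{1}{2}\left[\frac{\|x_t-\sqrt{\alpha_t}x_{t-1}\|^2}{\beta_t} + \frac{\|x_{t-1}-\sqrt{\bar{\alpha}_{t-1}}x_0\|^2}{1-\bar{\alpha}_{t-1}}\right]\right) $$

Collecting the terms that depend on $x_{t-1}$ (everything else is absorbed into a constant $C(x_t,x_0)$):

$$ \propto \exp\left(-\frac{1}{2}\left[\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar{\alpha}_{t-1}}\right)\|x_{t-1}\|^2 - 2\left(\frac{\sqrt{\alpha_t}}{\beta_t}x_t+\frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}x_0\right)^T x_{t-1} + C(x_t,x_0)\right]\right) $$

The exponent is quadratic in $x_{t-1}$, so the distribution is Gaussian with an isotropic covariance. Using $\alpha_t(1-\bar{\alpha}_{t-1})+\beta_t = 1-\bar{\alpha}_t$:

$$ \tilde{\beta}_t = \left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar{\alpha}_{t-1}}\right)^{-1} = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t $$

$$ \tilde{\mu}(x_t,x_0) = \tilde{\beta}_t\left(\frac{\sqrt{\alpha_t}}{\beta_t}x_t+\frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}x_0\right) = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 $$

All quantities are known in closed form given $x_t$ and $x_0$, which is what makes the posterior tractable.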

## Part 5: Training Methods

### Question 7: Batch Normalization and Dropout

For both BatchNorm and Dropout analyze the following:
1. How are they used during the training phase (both in the forward pass and in the backward pass)?
2. How do they behave differently in the inference phase? How can these operation modes be distinguished in code?
3. Assume you would like to perform multi-GPU training (*) to train your model. What should be done in order for BatchNorm and Dropout to work properly? Assume that each process holds its own copy of the model and that the processes can share information with each other.

(*): In multi-GPU training each GPU is associated with its own process that holds an independent copy of the model. In each training iteration a (large) batch is split among these processes (GPUs), which compute the gradients of the loss w.r.t. the relevant split of the data. Afterwards, the gradients from all processes are shared and averaged, so that GD takes into account the correct gradient and the model copies stay synchronized. Note that the processes are blocked between training iterations.

#### Answer:

Let's break down how Batch Normalization (BatchNorm) and Dropout work during the training phase, how they behave differently during inference, and what considerations are necessary for multi-GPU training.

**Batch Normalization (BatchNorm):**

1. **Training Phase (Forward Pass):**
   - Batch statistics (mean and variance) are computed for each feature channel over the current mini-batch.
   - BatchNorm normalizes the activations by subtracting this mean and dividing by the standard deviation (a small epsilon is added to the variance for numerical stability).
   - BatchNorm then scales and shifts the normalized values using learnable parameters (gamma and beta).
   - It also maintains moving averages of the mean and variance for each channel, to be used during inference.
   - The result is passed on to the rest of the network (typically to an activation function).

2. **Training Phase (Backward Pass):**
   - During backpropagation, gradients are computed with respect to the layer's input, gamma, and beta. Since the batch mean and variance are themselves functions of the input, the gradient with respect to the input also flows through them.
   - The gradients of gamma and beta are used to update these parameters during optimization. The moving averages are not learned by gradient descent.

3. **Inference Phase:**
   - In the inference phase, BatchNorm uses the previously calculated moving averages of mean and variance to normalize the activations.
   - Gamma and beta, which were learned during training, are applied to scale and shift the normalized values.
   - There is no need for computing batch statistics during inference.
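
The two modes of BatchNorm can be reproduced by hand and compared with `torch.nn.BatchNorm1d`. This is a minimal sketch on random data; the layer's default settings (epsilon of 1e-5, momentum of 0.1, running mean starting at 0 and running variance at 1) are relied upon in the manual formulas.

```python
import torch

torch.manual_seed(0)
x = torch.randn(32, 4)                       # (batch, features)
bn = torch.nn.BatchNorm1d(4)                 # default eps=1e-5, momentum=0.1

# Training mode: normalize with the statistics of the current batch.
bn.train()
out_train = bn(x)
mean, var = x.mean(dim=0), x.var(dim=0, unbiased=False)
manual_train = (x - mean) / torch.sqrt(var + bn.eps) * bn.weight + bn.bias

# Moving averages after one step (the variance update uses the unbiased estimate).
manual_running_mean = 0.9 * torch.zeros(4) + 0.1 * mean
manual_running_var = 0.9 * torch.ones(4) + 0.1 * x.var(dim=0, unbiased=True)

# Inference mode: normalize with the stored moving averages instead.
bn.eval()
out_eval = bn(x)
manual_eval = (x - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps) * bn.weight + bn.bias

print(torch.allclose(out_train, manual_train, atol=1e-5),
      torch.allclose(out_eval, manual_eval, atol=1e-5))
```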

**Dropout:**

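1. **Training Phase (Forward Pass):**
   - During the forward pass in training, Dropout randomly sets a fraction of the activations to zero, each one independently. This helps prevent overfitting.
   - The probability of dropping out a neuron is a hyperparameter called the dropout rate (p).
   - In the common "inverted dropout" implementation (the one used by PyTorch), the surviving activations are scaled by 1/(1-p), so that the expected value of each activation is unchanged.

2. **Training Phase (Backward Pass):**
   - During backpropagation, the same mask that was sampled in the forward pass is applied to the gradients: dropped-out neurons receive a zero gradient, and the gradients of the surviving neurons are scaled by the same factor as in the forward pass. The mask must therefore be kept from the forward pass.

3. **Inference Phase:**
   - In the inference phase, dropout is turned off. All neurons are active.
   - With inverted dropout no further scaling is needed at inference. In the original formulation, which does not scale during training, the activations (or equivalently the outgoing weights) are multiplied at inference by the keep probability 1-p, not by the dropout rate.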
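
The forward scaling, the masked gradient and the inference behaviour can be observed with `torch.nn.functional.dropout`. This is a small sketch; the all-ones input is chosen only so that the mask can be read directly from the output.

```python
import torch
from torch.nn.functional import dropout

torch.manual_seed(0)
p = 0.5                                       # dropout rate
x = torch.ones(1000, requires_grad=True)

# Training: each entry is zeroed with probability p, the rest are scaled by 1 / (1 - p).
out = dropout(x, p=p, training=True)
out.sum().backward()

print(out.detach().unique())                  # the values 0 and 1 / (1 - p)
print(torch.equal(x.grad, out.detach()))      # the gradient reuses the same scaled mask

# Inference: dropout is the identity.
out_eval = dropout(x, p=p, training=False)
print(torch.equal(out_eval, x))
```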

**Multi-GPU Training:**

In multi-GPU training, where each GPU has an independent copy of the model, several considerations are needed for BatchNorm and Dropout to work properly:

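1. **Batch Normalization:**
   - By default, each GPU computes the batch statistics only over its own split of the batch, so the normalization uses a smaller and noisier batch than intended.
   - To normalize over the full batch, the processes should exchange their per-channel sums and sample counts during the forward pass (an all-reduce), so that every GPU normalizes with the same global mean and variance. Simply averaging the per-GPU variances is not enough, since it ignores the differences between the per-GPU means. The corresponding gradient terms must be exchanged in the backward pass as well. In PyTorch this is what `torch.nn.SyncBatchNorm` does.
   - The gamma and beta parameters need no special handling: their gradients are averaged across processes like those of any other parameter.
   - The moving averages are not updated by gradients. With synchronized statistics they are identical in all processes; otherwise they should be taken from one process (or averaged) before evaluation or saving.

2. **Dropout:**
   - Dropout needs no synchronization. Each process handles different samples, the masks are sampled independently per sample anyway, and the resulting gradients are averaged as usual.
   - The processes should preferably not share the same random state: if all of them use the same seed, all GPUs draw identical masks in every iteration, which reduces the randomness of the regularization. Using a different seed for each process avoids this.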
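
In PyTorch, the BatchNorm part can be handled by converting the layers to `torch.nn.SyncBatchNorm`. The following is a sketch under assumptions: the small model and the helper `seed_for_rank` are hypothetical, and the snippet only performs the conversion; an actual synchronized forward pass requires an initialized process group and GPUs.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Dropout(0.5))

# Replace every BatchNorm layer by a version that all-reduces the batch statistics.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(sync_model)

def seed_for_rank(base_seed: int, rank: int) -> None:
    """Give each process its own RNG stream so that dropout masks differ per GPU."""
    torch.manual_seed(base_seed + rank)

# In a real job, `rank` would come from torch.distributed.get_rank() after
# init_process_group, and sync_model would be wrapped in DistributedDataParallel.
seed_for_rank(base_seed=1234, rank=0)
```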

To distinguish between training and inference modes in code, most deep learning frameworks provide a flag or mode setting. For example, in PyTorch, you can use `model.train()` to set the model in training mode and `model.eval()` to set it in evaluation (inference) mode. This affects the behavior of BatchNorm and Dropout layers as described above.
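
A short sketch of the mode switch on a toy model (the layer sizes are arbitrary): in evaluation mode repeated forward passes give identical results, while in training mode Dropout resamples its mask on every call.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4), nn.Dropout(p=0.5))
x = torch.randn(8, 4)

model.train()                       # sets .training = True on every submodule
train_a, train_b = model(x), model(x)

model.eval()                        # sets .training = False on every submodule
with torch.no_grad():
    eval_a, eval_b = model(x), model(x)

print("training flag:", model.training)
print("eval outputs identical:", torch.equal(eval_a, eval_b))
print("train outputs identical:", torch.equal(train_a, train_b))
```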